Text Mining the Listserve Emails

The Listserve is an email list where people sign up for a chance to send an email out to the entire list to discuss whatever they want. Currently the number of people enrolled is about 20,000 and there has been one email per day since 16 April 2012. I thought that since this project has been running for about a year, it would be a nice opportunity to learn a little more about text mining.

In this first part I will discuss how I fetched all those emails and parsed them and in a second blog post I will talk about what I found.

The first issue was how to get the emails off the server and after trying a few solutions I finally ended up using the Python imaplib which is a Python library for connecting with an IMAP4 email server which is used by all the major providers such as Yahoo and Google. After connecting I used the Python email library which helped facilitate selecting certain parts of the email. I relied heavily on the function email.message_from_string() to fetch email attributes such as Message-ID or Sender. I took all these emails and dumped them into a SQLite database to later parse with R.

I use R almost daily for work so it was nice to tackle this part of the project with tools I knew pretty well. I used sapply() and strsplit() mostly to parse out parts of various email attributes and then used the tm package to handle all of the text processing. The tm package makes it easier to get all the emails into a term document matrix which is much easier to work with a large corpus of text such as this. I used an English dictonary with the tm package to remove stop words and for stemming (reducing the word to its base form). There have been two emails so far in Portuguese but the rest are all in English.

Initially I thought I could track all the emails by date but this proved to be a difficult task due to the nuances of email and when they actually got sent off the server. Instead I ended up using the Message-ID for making sure that I did not duplicate emails in the analysis.

I put up all the source code on a github repo.

r,, tm