I spent some time on Friday and Monday writing a script to analyze the Enron Email Dataset. I’m working on a new type of message list view for Thunderbird (a whole new layout, actually), and for the message view I wanted to get an idea of typical message size and content.
It turns out that decent email data is relatively hard to come by. Because of privacy concerns, it’s nearly impossible to get access to a company’s email where you can see the full exchange between a number of different people. Luckily, the Enron dataset has become publicly available for exactly this kind of research into email problems.
The Enron dataset is broken down into directories for many of the people involved, with sub-directories of their emails.
The script I wrote reads in the email files in a directory and analyzes each message body for its content. Then it spits out the numbers, with medians and averages computed.
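For anyone who wants to try something similar, here is a minimal sketch of that kind of script in Python. It is not the script I actually ran. The function names are made up for illustration, and it assumes each file under the directory holds one raw email message.

```python
import email
import statistics
from email import policy
from pathlib import Path


def body_word_count(raw_bytes):
    """Count whitespace-separated words in the plain-text body of one message."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:  # no plain-text part to count
        return 0
    payload = part.get_payload(decode=True) or b""
    # Decode leniently so an odd character never aborts the whole run.
    return len(payload.decode("utf-8", errors="replace").split())


def mailbox_word_counts(folder):
    """Return one word count per file found anywhere under `folder`."""
    return [body_word_count(p.read_bytes())
            for p in sorted(Path(folder).rglob("*")) if p.is_file()]


def summarize(counts):
    """Return the mean and median of a list of word counts, or None if empty."""
    if not counts:
        return None
    return {"messages": len(counts),
            "mean": statistics.mean(counts),
            "median": statistics.median(counts)}
```

Calling `summarize(mailbox_word_counts("some/person/directory"))` (the path is a placeholder) would give the numbers for one person. Note that this counts everything in the body, including quoted replies and forwarded text, so the totals are probably on the generous side.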
If you’ve seen Mail Trends, you know that Mihai Parparita analyzed the Enron emails for time, size, threading, and people comparisons. If you download the code you can run it against your own email and will likely see some amazing results (someone should pull this into Thunderbird!).
However, the information I was looking for was not available in the Mail Trends analysis. Mail Trends analyzes only email headers to create relationship statistics between emails. And while it does report the size of messages in KB, I was looking for the size of messages in terms of the number of words.
You had me at Hello?
I’ve had this hypothesis, or assumption, that within the first two sentences of an email I can tell what it’s going to be about without reading the rest. Please try this out on your own! Read the first two sentences of any email and take a second to think about whether you can at least prioritize the response the message requires.
Combine this assumption with my other assumption: that it’s more important for me to process my mail than it is for me to actually read the entirety of any message. I know people are probably thinking, “you should read the whole message,” but in all honesty more than half the messages I get aren’t important to me at all, so reading them would just waste time. This second part of my hypothesis stems from ideas like Inbox Zero and GTD, where processing all those “things” is the most important part of being productive.
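If you’d rather test the two-sentence idea on a pile of messages than by hand, a rough sketch in Python could look like this. The function name is hypothetical, and the sentence splitting is deliberately naive: it assumes a sentence ends at a period, exclamation mark, or question mark followed by whitespace, so abbreviations like “e.g.” will trip it up.

```python
import re


def first_sentences(body, limit=2):
    """Return the first `limit` sentences of a message body as one string."""
    # Collapse line breaks and repeated spaces so wrapped lines read as one paragraph.
    text = " ".join(body.split())
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:limit])
```

For example, `first_sentences("Hi Bob. Can we meet\nat 3pm? I have notes. Thanks!")` returns `"Hi Bob. Can we meet at 3pm?"`, which is about what I’d want to see before deciding what to do with the message.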
45 is Median Number of Words Per Message
Analyzing all those emails gave me a bit of a statistics problem. The plain average turned out to be something like 120 words per message. That high number came from a few outliers of 500+ word messages skewing the result toward the high end, when it should really reflect the low end, where more of the messages sat. So on average the median number of words per email message was 45. That’s the average of all the medians… rounded. I probably should have just included the standard deviation and called it quits.
I didn’t analyze the kinds of words or their length, which would be something else that’s pretty interesting to know. A next step could be to simply analyze the number of characters per message; that could give interesting hints on how to display the message in its entirety.
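To see why a handful of long messages drags the average up while the median stays put, here is a small Python illustration. The word counts are invented for the example and are not taken from the Enron data. The list of medians assumes one median per group of messages, which is my guess at how “the average of all the medians” would be laid out.

```python
import statistics

# Invented word counts for one set of messages, not Enron results.
word_counts = [12, 30, 38, 45, 52, 60, 650]

mean_words = statistics.mean(word_counts)      # pulled upward by the 650-word message
median_words = statistics.median(word_counts)  # the middle value, unaffected by it

# Hypothetical medians from several groups, averaged and then rounded.
group_medians = [40, 44, 51]
average_of_medians = round(statistics.mean(group_medians))

print(f"mean: {mean_words:.1f}, median: {median_words}")
print(f"average of medians: {average_of_medians}")
```

With these made-up numbers the single long message pushes the mean to roughly 127 words while the median stays at 45, which is the same kind of gap I saw in the real results.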
Back to the Message List View
Here’s a rough breakdown of what Gmail gives me when I look at any given message. It’s just enough to understand who the message is from and what it’s probably about.
With the [x] checkbox and the actions menu, it’s possible that I could process this mail and move on. However, I usually end up opening every message to make sure there’s nothing else I should see. I’m not sure if that’s because I really need to read the rest of the message or what.
So my question continues to be this: Given a little bit more of the message itself, or a little bit more of the context of the message… is there a better way for me to process my emails? I have some mockups and ideas on how I think it could be done, but they need more refining. Will post soon.