I spent some time on Friday and Monday writing a script to analyze the Enron Email Dataset. I’m working on a new type of message list view for Thunderbird (well, a whole new layout actually), and for the message view I wanted to get an idea of message size and content.
Email Data
It turns out that decent email data is relatively hard to come by. Because of privacy concerns, it’s nearly impossible to get access to a company’s email where you can see the full exchange between a number of different people. Luckily the Enron dataset has become publicly available for exactly this kind of research into email problems.
The Enron dataset is broken down into directories for many of the people involved, with sub-directories for their emails.
- maildir
  - taylor-m
    - all_documents
    - archive
    - australia_trading
    - boat
    - brazil_trading
    - …
  - mclaughlin-e
    - all_documents
    - calendar
    - contacts
    - deleted_items
    - discussion_threads
    - …
  - …
  - taylor-m
The script I wrote reads in the email files in a directory and analyzes each message body for its content. Then it spits out the numbers, with medians and averages computed.
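The original script isn't reproduced here, but a minimal sketch of the same idea in Python might look like the following. It assumes each file under the maildir tree is a single RFC 822 style message, and it counts words by splitting the plain-text body on whitespace. The function names `body_word_count`, `word_counts`, and `summarize` are made up for illustration.

```python
import email
import os
import statistics
from email import policy


def body_word_count(path):
    """Count whitespace-separated words in the plain-text body of one message file."""
    with open(path, "rb") as fh:
        msg = email.message_from_binary_file(fh, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        # No plain-text body (e.g. HTML only): treat as zero words in this sketch.
        return 0
    try:
        text = part.get_content()
    except (LookupError, UnicodeError):
        # Unknown or broken charset: fall back to a lossy decode of the raw payload.
        text = (part.get_payload(decode=True) or b"").decode("latin-1", "replace")
    return len(text.split())


def word_counts(root):
    """Walk every file under root and return a list of body word counts."""
    counts = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            counts.append(body_word_count(os.path.join(dirpath, name)))
    return counts


def summarize(counts):
    """Return the mean and median of a non-empty list of word counts."""
    return {
        "messages": len(counts),
        "mean": statistics.mean(counts),
        "median": statistics.median(counts),
    }
```

Calling `summarize(word_counts("maildir"))` would give one overall mean and median. Forwarded text and signatures are counted as part of the body here, so a stricter definition of "words" would change the numbers.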
Mail Trends
If you’ve seen Mail Trends, you know that Mihai Parparita analyzed the Enron emails for time, size, threading, and people comparisons. If you download the code you can run it against your own email and will likely see some amazing results (someone should pull this into Thunderbird!).
However, the information I was looking for was not available in the Mail Trends analysis. Mail Trends analyzes only email headers to create relationship statistics between emails. And while it does report message size in KB, I was looking for message size in number of words.
You had me at Hello?
I’ve had this hypothesis, or assumption, that within the first 2 sentences of an email I can tell what it’s going to be about without reading the rest. Please try this out on your own! Read the first two sentences of any email and take a second to consider whether you can at least prioritize the response the message requires.
Combine this assumption with my other assumption: that it’s more important for me to process my mail than it is to actually read the entirety of any message. I know people are probably thinking, “you should read the whole message”; but in all honesty more than half the messages I get aren’t important to me at all, so reading them would just waste time. This second part of my hypothesis stems from ideas like Inbox Zero and GTD, where processing all those “things” is the most important part of being productive.
45 is the Median Number of Words Per Message
Analyzing all those emails turned up a bit of a statistics problem. The plain average came out to something like 120 words per message. That high number came from a few outliers of 500+ words that skewed the result toward the high end, when it should really reflect the low end, where most of the messages fall. So I used the median instead: averaged across the medians, the number of words per email message was 45. That’s the average of all the medians… rounded. Probably should have just included the standard deviation and called it quits.
I didn’t analyze the kinds of words or their length, which would also be pretty interesting to know. A next step could be to simply analyze the number of characters per message, which could give interesting hints on how to display the message in its entirety.
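To see how a few long messages pull the mean up while the median stays put, here is a small Python sketch. The word counts are invented for illustration, not taken from the Enron data. Grouping the medians by folder is also an assumption, since the post doesn't say which groups the medians were taken over.

```python
import statistics

# Hypothetical word counts per folder (made-up numbers, one 800-word outlier).
counts_by_folder = {
    "inbox": [12, 30, 45, 50, 800],
    "sent": [20, 40, 60],
}


def overall_mean(groups):
    """Mean over every message, ignoring the grouping."""
    return statistics.mean(n for counts in groups.values() for n in counts)


def overall_median(groups):
    """Median over every message, ignoring the grouping."""
    return statistics.median(n for counts in groups.values() for n in counts)


def mean_of_medians(groups):
    """Take the median of each group, then average those medians."""
    return statistics.mean(statistics.median(counts) for counts in groups.values())


print("mean:", overall_mean(counts_by_folder))
print("median:", overall_median(counts_by_folder))
print("mean of medians:", mean_of_medians(counts_by_folder))
```

With these numbers the single 800-word message drags the mean far above what a typical message looks like, while both median-based figures stay close to the bulk of the data.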
Back to the Message List View
Here’s a rough breakdown of what GMail gives me when I look at any given message. It’s just enough to understand who this message is from and what it’s probably about.
It’s possible that, with the [x] checkbox and the actions menu, I could process this mail and move on. However, I usually end up opening every message to make sure there’s nothing else I should see. I’m not sure if that’s because I really need to read the rest of the message or what.
So my question continues to be this: Given a little bit more of the message itself, or a little bit more of the context of the message… is there a better way for me to process my emails? I have some mockups and ideas on how I think it could be done, but they need more refining. Will post soon.

Bryan,
Is it possible that the fact we can glean so much meaning from the first 2 sentences of an email is a learned behavior?
We’ve all probably had bosses or other people who said “Your message was too long, I didn’t bother to read it.” Most likely this is because Outlook (by far the most widespread corporate email client) and most other email clients, at this point, have “Summary View” or something similar which shows the sender, subject, and the first 2-4 lines of the message. If people want to get their message noticed, they would squeeze 1-2 summary sentences into the space that those two lines would accommodate.
I’m not criticizing, by any means. I think that you’re dead-on. This is also why I think that the summary and 3-pane views in Outlook/Entourage/etc. provide this little bit of email for context.
Ken: It’s completely possible that you’re dead on here and the machine is making our world. Hopefully I’ll be able to document all the different clients so we can see clearly how they are representing mails like this.
Getting the message across quickly and succinctly has always been a hallmark of good writing (‘tell them what you’re going to tell them, tell them, and then tell them what you told them.’) I agree that the tools do reinforce that message, but they aren’t the cause.
FWIW, I do prefer GMail to my work inbox precisely because GMail shows the first few words and my work e-mail client doesn’t.
IIRC from statistics, you can discard any data that is more than two standard deviations from the mean since that should account for 95% of the data. Since your data set is pretty large, you should be able to drop those 500+ word emails and get a more meaningful average.
Yes, absolutely — GMail and Outlook have it right. I can trash messages much faster with just a few lines of context.
I would love it if Thunderbird did the same thing.