Analyzing an EMAIL Archive and visualizing the data using the
D3 JavaScript library

Here is a copy of the Sakai Developer Mailing list from 2006-2014.

You should install the SQLite browser to view and modify the databases from:

http://sqlitebrowser.org/

The base URL is hard-coded in gmane.py and points at the Sakai
developer list; to spider a different list, change that base URL.
Make sure to delete the content.sqlite file if you switch the base
URL.  The gmane.py file operates as a spider in that it runs slowly,
retrieving one mail message per second so as to avoid getting
throttled.  It stores all of its data in a database and can be
interrupted and re-started as often as needed.  It may take many
hours to pull all the data down, so you may need to restart several
times.
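
To make the pattern concrete, here is a minimal sketch of such a polite,
restartable spider loop.  The URL, table layout, and batch size are
placeholder assumptions, not gmane.py's actual internals:

    import sqlite3
    import time
    import urllib.request

    # Placeholder list URL; the real base URL is hard-coded in gmane.py.
    BASE_URL = 'http://example.com/list/'

    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()
    cur.execute('''CREATE TABLE IF NOT EXISTS Messages
                   (id INTEGER UNIQUE, body TEXT)''')

    # Scan from message id 1 for the first id not yet retrieved.
    start = 1
    while cur.execute('SELECT id FROM Messages WHERE id = ?',
                      (start,)).fetchone() is not None:
        start = start + 1

    for msgid in range(start, start + 10):       # pull a small batch
        text = urllib.request.urlopen(BASE_URL + str(msgid)).read().decode()
        cur.execute('INSERT OR IGNORE INTO Messages (id, body) VALUES (?, ?)',
                    (msgid, text))
        conn.commit()  # commit per message so an interrupted run loses nothing
        time.sleep(1)  # one message per second to avoid getting throttled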

To give you a head-start, I have put up 600MB of pre-spidered Sakai 
email here:

If you download and unzip this, you can "catch up with the
latest" by running gmane.py.

Navigate to the folder where you extracted the content.sqlite file.

Here is a run of gmane.py getting the last five messages of the
sakai developer list:

Mac: python gmane.py

How many messages:10
9443 2005-12-09T13:32:29+00:00 re: lms/vle rants/comments
3586 2005-12-09T13:32:31-06:00 re: sakaiportallogin and presense
10600 2005-12-09T13:42:24+00:00 re: lms/vle rants/comments

The program scans content.sqlite from 1 up to the first message number not
already spidered and starts spidering at that message.  It continues spidering
until it has spidered the desired number of messages or it reaches a page
that does not appear to be a properly formatted message.

Sometimes a message is missing.  Perhaps administrators can delete messages
or perhaps they get lost - I don't know.  If your spider stops, and it seems it has hit
a missing message, go into the SQLite Manager and add a row with the missing id - leave
all the other fields blank - and then restart gmane.py.  This will unstick the
spidering process and allow it to continue.  These empty messages will be ignored in the next
phase of the process.
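
If you would rather not open the SQLite Manager, the same fix is a couple of
lines of Python (a sketch, assuming the Messages table from the spider sketch
above):

    import sqlite3

    conn = sqlite3.connect('content.sqlite')
    cur = conn.cursor()

    # 4124 is a made-up example id; use whichever id your spider is stuck on.
    # All the other fields stay empty, so later phases will ignore this row.
    cur.execute('INSERT OR IGNORE INTO Messages (id) VALUES (?)', (4124,))
    conn.commit()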

One nice thing is that once you have spidered all of the messages and have them in 
content.sqlite, you can run gmane.py again to get new messages as they get sent to the
list.  The gmane.py program will quickly scan to the end of the already-spidered pages and check
if there are new messages and then quickly retrieve those messages and add them 
to content.sqlite.

The content.sqlite data is pretty raw, with an inefficient data model, and not compressed.
This is intentional as it allows you to look at content.sqlite to debug the process.
It would be a bad idea to run any queries against this database as they would be
quite slow.

The second process is running the program gmodel.py.  The gmodel.py program reads the rough/raw
data from content.sqlite and produces a cleaned-up and well-modeled version of the
data in the file index.sqlite.  The file index.sqlite will be much smaller (often 10X
smaller) than content.sqlite because it also compresses the header and body text.

Each time gmodel.py runs, it completely wipes out and re-builds index.sqlite, allowing
you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the 
data cleaning process.
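
The wipe-and-rebuild pattern looks roughly like this (a sketch; the real
gmodel.py schema, cleaning steps, and compression are more involved):

    import sqlite3
    import zlib

    src = sqlite3.connect('content.sqlite')
    dst = sqlite3.connect('index.sqlite')

    # Completely wipe and re-create the output table on every run.
    dst.execute('DROP TABLE IF EXISTS Messages')
    dst.execute('CREATE TABLE Messages (id INTEGER UNIQUE, body BLOB)')

    for msgid, body in src.execute('SELECT id, body FROM Messages'):
        if body is None:
            continue                  # skip the empty placeholder rows
        # A cleaned-up row would be written here; zlib stands in for
        # whatever compression gmodel.py actually applies.
        dst.execute('INSERT INTO Messages (id, body) VALUES (?, ?)',
                    (msgid, zlib.compress(body.encode())))
    dst.commit()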

Running gmodel.py works as follows:

Mac: python gmodel.py

Loaded allsenders 1588 and mapping 28 dns mapping 1
1 2005-12-08T23:34:30-06:00
251 2005-12-22T10:03:20-08:00
501 2006-01-12T11:17:34-05:00
751 2006-01-24T11:13:28-08:00

The program does a number of data cleaning steps:

Domain names are truncated to two levels for .com, .org, .edu, and .net;
other domain names are truncated to three levels.  So si.umich.edu becomes
umich.edu and caret.cam.ac.uk becomes cam.ac.uk.  Also mail addresses are
forced to lower case, and some of the @gmane.org addresses like the following

arwhyte-63aXycvo3TyHXe+LvDLADg@public.gmane.org

are converted to the real address whenever there is a matching real email
address elsewhere in the message corpus.
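
The truncation rule is short enough to sketch in full (gmodel.py's real
logic may handle more edge cases):

    def truncate_domain(domain):
        """Keep two levels for .com/.org/.edu/.net, three for all others."""
        pieces = domain.lower().split('.')
        if pieces[-1] in ('com', 'org', 'edu', 'net'):
            return '.'.join(pieces[-2:])
        return '.'.join(pieces[-3:])

    print(truncate_domain('si.umich.edu'))      # umich.edu
    print(truncate_domain('caret.cam.ac.uk'))   # cam.ac.uk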

If you look in the content.sqlite database there are two tables that allow
you to map both domain names and individual email addresses that change over 
the lifetime of the email list.  For example, Steve Githens used the following
email addresses over the life of the Sakai developer list:

s-githens@northwestern.edu
sgithens@cam.ac.uk
swgithen@mtu.edu

We can add two entries to the Mapping table

s-githens@northwestern.edu -> swgithen@mtu.edu
sgithens@cam.ac.uk -> swgithen@mtu.edu

And so all the mail messages will be collected under one sender even if 
they used several email addresses over the lifetime of the mailing list.

You can also make similar entries in the DNSMapping table if there are multiple
DNS names you want mapped to a single DNS.  In the Sakai data I add the following
mapping:

iupui.edu -> indiana.edu

So all the folks from the various Indiana University campuses are tracked together.
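
Applying the two tables amounts to a dictionary lookup for the sender and
another for the domain.  A sketch, assuming each table holds old and new
text columns:

    import sqlite3

    conn = sqlite3.connect('content.sqlite')

    # Addresses and domains without an entry simply map to themselves.
    mapping = dict(conn.execute('SELECT old, new FROM Mapping'))
    dnsmapping = dict(conn.execute('SELECT old, new FROM DNSMapping'))

    def map_sender(email):
        email = email.lower()
        email = mapping.get(email, email)    # unify the sender's addresses
        domain = email.split('@')[-1]
        return email, dnsmapping.get(domain, domain)

    # With the Mapping entries above, map_sender('sgithens@cam.ac.uk')
    # returns ('swgithen@mtu.edu', 'mtu.edu')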

You can re-run gmodel.py over and over as you look at the data, and add mappings
to make the data cleaner and cleaner.   When you are done, you will have a nicely
indexed version of the email in index.sqlite.   This is the file to use to do data
analysis.   With this file, data analysis will be really quick.

The first, simplest data analysis is to do a "who does the most" and "which
organization does the most"?  This is done using gbasic.py:

Mac: python gbasic.py

How many to dump? 5
Loaded messages= 51330 subjects= 25033 senders= 1584

Top 5 Email list participants
steve.swinsburg@gmail.com 2657
azeckoski@unicon.net 1742
ieb@tfd.co.uk 1591
csev@umich.edu 1304
david.horwitz@uct.ac.za 1184

Top 5 Email list organizations
gmail.com 7339
umich.edu 6243
uct.ac.za 2451
indiana.edu 2258
unicon.net 2055
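
The tallies behind this output reduce to a pair of counters over the sender
addresses.  A sketch, assuming index.sqlite exposes one email column per
message (gbasic.py's real schema is more normalized):

    import sqlite3
    from collections import Counter

    conn = sqlite3.connect('index.sqlite')

    senders = Counter()
    orgs = Counter()
    for (email,) in conn.execute('SELECT email FROM Messages'):
        senders[email] += 1
        orgs[email.split('@')[-1]] += 1

    print('Top 5 Email list participants')
    for email, count in senders.most_common(5):
        print(email, count)

    print('Top 5 Email list organizations')
    for org, count in orgs.most_common(5):
        print(org, count)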

You can look at the data in index.sqlite and if you find a problem, you
can update the Mapping table and DNSMapping table in content.sqlite and
re-run gmodel.py.

There is a simple visualization of the word frequency in the subject lines
in the file gword.py:

Mac: python gword.py

Range of counts: 33229 129
Output written to gword.js

This produces the file gword.js which you can visualize using the file gword.htm.
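
A sketch of how a gword.js like this can be produced from the subject lines;
the variable name, word filter, and raw-count sizes here are assumptions (the
real gword.py also scales counts into a font-size range):

    import re
    import sqlite3
    from collections import Counter

    conn = sqlite3.connect('index.sqlite')

    counts = Counter()
    for (subject,) in conn.execute('SELECT subject FROM Messages'):
        for word in re.findall('[a-z]{4,}', (subject or '').lower()):
            counts[word] += 1

    # Emit the top words as a JavaScript array for the D3 page to read.
    with open('gword.js', 'w') as out:
        entries = ["{text: '%s', size: %d}" % (word, count)
                   for word, count in counts.most_common(100)]
        out.write('gword = [\n' + ',\n'.join(entries) + '\n];\n')
    print('Output written to gword.js')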

A second visualization is in gline.py.  It visualizes email participation by
organizations over time.

Mac: python gline.py

Loaded messages= 51330 subjects= 25033 senders= 1584
Top 10 Oranizations
['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']
Output written to gline.js

Its output is written to gline.js which is visualized using gline.htm.
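
At its core gline.py just buckets messages per organization per month before
writing gline.js.  A sketch of that tally (email and sent_at columns assumed
as above):

    import sqlite3
    from collections import Counter

    conn = sqlite3.connect('index.sqlite')

    # Count messages per (organization, year-month); gline.js is essentially
    # this table reshaped into the arrays the D3 page expects.
    buckets = Counter()
    for email, sent_at in conn.execute('SELECT email, sent_at FROM Messages'):
        org = email.split('@')[-1]
        month = sent_at[:7]    # '2005-12' from '2005-12-08T23:34:30-06:00'
        buckets[(org, month)] += 1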

Some URLs for visualization ideas:

As always - comments welcome.

-- Dr. Chuck
Sun Sep 29 00:11:01 EDT 2013