
Emails

.nlp.email     Emails
  getGraph     Get the graph of who emailed who
  loadEmails   Convert an mbox file to a table of parsed metadata
  parseMail    Extract meta information from an email

Email is one of the most important document formats for analysis in natural-language processing, particularly for surveillance and spam detection. The following functions provide a basis for handling email-format data.

The examples below use emails stored in an MBOX file. This collection of emails can be found in the data folder of the mlnotebooks.

MBOX is a common format for storing email messages on disk. All the messages in a mailbox are stored in a single, long text file as a sequence of concatenated email messages, each beginning with a line that starts with "From " followed by the sender and date.
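As a rough illustration of that layout, the sketch below splits a list of lines into messages at every line that begins with "From ". The sample lines and the names mbox, starts and msgs are made up for the example, and this is not necessarily how .nlp.email.loadEmails does it; in particular, real MBOX files can escape body lines that start with "From ", which this sketch ignores.

```q
/ toy mailbox: two messages, one line of text per list item (made-up data)
mbox:("From a@x.org Mon Jan 1";"Subject: one";"";"Hello";"From b@y.org Tue Jan 2";"Subject: two";"";"Hi")

/ indices of the lines that open a message
starts:where mbox like "From *"

/ cut the list of lines at those indices: one list of lines per message
msgs:starts cut mbox

count msgs        / number of messages found
first each msgs   / the opening line of each message
```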

.nlp.email.getGraph

Get the graph of who emailed who, including the number of times they emailed

.nlp.email.getGraph emails

Where emails is a table containing parsed metadata and content of an MBOX file (as returned by .nlp.email.loadEmails), returns a table of sender-recipient pairs and the number of emails sent between each pair.

q)emails:.nlp.email.loadEmails["/home/kx/nlp/datasets/tdwg-lit.mbox"]
q).nlp.email.getGraph[emails]
sender                           to                               volume
------------------------------------------------------------------------
Donald.Hobern@csiro.au           tdwg-img@lists.tdwg.org          1
Donald.Hobern@csiro.au           tdwg@lists.tdwg.org              1
Donald.Hobern@csiro.au           vchavan@gbif.org                 1
RichardsK@landcareresearch.co.nz tdwg-img@lists.tdwg.org          1
Robert.Morris@cs.umb.edu         Tdwg-tag@lists.tdwg.org          1
Robert.Morris@cs.umb.edu         tdwg-img@lists.tdwg.org          1
mdoering@gbif.org                lee@blatantfabrications.com      1
mdoering@gbif.org                tdwg-img@lists.tdwg.org          1
morris.bob@gmail.com             tdwg-img@lists.tdwg.org          1
ram@cs.umb.edu                   RichardsK@landcareresearch.co.nz 1
ram@cs.umb.edu                   tdwg-img@lists.tdwg.org          2
...
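
A table of this shape can be approximated with plain qSQL. The following is a minimal sketch on a toy table, assuming each email carries one sender and a list of recipients; the table mail and its contents are invented for illustration, and the library's own implementation may differ.

```q
/ toy input: one row per email, recipients held as a list (made-up data)
mail:([]sender:`a`a`b;to:(`x`y;enlist`x;enlist`z))

/ one row per sender-recipient pair, then count the emails for each pair
graph:0!select volume:count i by sender,to from ungroup mail

/ total emails sent by each sender, derived from the pair counts
totals:select total:sum volume by sender from graph
```

Here ungroup expands each email into one row per recipient, so an email sent to two addresses contributes to two pairs.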

.nlp.email.loadEmails

Convert an MBOX file to a table of parsed metadata

.nlp.email.loadEmails filepath

Where filepath is a string containing the path to the MBOX file, returns a table containing parsed metadata and content of the MBOX file.

column      type                            content
--------------------------------------------------------------------------
sender      string                          name and address of sender
to          string                          name and address of receiver/s
date        timestamp                       date
subject     string                          subject
text        string                          original text
contentType string                          content type
payload     string or list of dictionaries  payload

q)email:.nlp.email.loadEmails["/home/kx/nlp/datasets/tdwg-lit.mbox"]
q)cols email
`sender`to`date`subject`contentType`payload`text
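
Once loaded, the result can be queried like any other table. The sketch below uses a small hand-built table in place of the real data, with only the sender and subject columns and with both held as strings as in the schema above; the rows are invented for illustration.

```q
/ stand-in for the loaded table, restricted to two string columns (made-up data)
email:([]sender:("a@x.org";"b@y.org";"a@x.org");subject:("Image standards";"Meeting";"Re: Image standards"))

/ emails whose subject mentions "Image" (like is case-sensitive)
hits:select sender,subject from email where subject like "*Image*"

/ number of emails per sender, casting the strings to symbols for grouping
bySender:select n:count i by sender:`$sender from email
```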

.nlp.email.parseMail

Extract meta information from an email

.nlp.email.parseMail filepath

Where filepath is a string of the path to the MBOX file, returns a dictionary containing meta information from the email.

q)emailString:"/home/kx/nlp/datasets/tdwg-lit.mbox"
q)table:.nlp.email.parseMail emailString
q)cols table
`sender`to`date`subject`contentType`payload
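
To give a feel for what extracting meta information involves, here is a minimal sketch that pulls the header fields out of a single raw message held as a string. It assumes single-line "Name: value" headers separated from the body by a blank line; the sample message and the names raw, hdr, pos, headers and body are hypothetical, and .nlp.email.parseMail may work quite differently.

```q
/ made-up raw message: three headers, a blank line, then the body
raw:"From: a@x.org\nTo: b@y.org\nSubject: Hi\n\nBody text"

lines:"\n" vs raw
n:(count each lines)?0            / index of the first blank line, which ends the headers
hdr:n#lines                       / header lines only
pos:hdr?\:":"                     / position of the first colon in each header line

/ header names as symbols, values as strings (skipping the colon and the space after it)
headers:(`$pos#'hdr)!(pos+2)_'hdr

/ everything after the blank line is the body
body:"\n" sv (n+1)_lines
```

Folded headers that continue on the next line, encoded words and multipart payloads are all left out of this sketch.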