Emails

Email is one of the most important document formats for analysis in natural-language processing, particularly for surveillance and spam detection. The following functions form a basis for handling email-format data.

.nlp.loadEmails

An MBOX file as a table of parsed metadata

Syntax: .nlp.loadEmails x

Where x is the filepath of an MBOX file as a string, returns a table with the following columns.

column      type                           content
-------------------------------------------------------------------------
sender      string                         Name and address of sender
to          string                         Name and address of receiver/s
date        timestamp                      Date
subject     string                         Subject
text        string                         Original text
contentType string                         Content type
payload     string or list of dictionaries Payload

MBOX is one of the most common formats for storing email messages on disk. All the messages in a mailbox are stored in a single text file as a sequence of concatenated messages, each beginning with a line that starts with "From ".

q)emails:.nlp.loadEmails["/home/kx/nlp/datasets/tdwg.mbox"]
q)cols emails
`sender`to`date`subject`contentType`payload`text
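
To make the MBOX layout concrete, the following is a minimal sketch in Python using the standard-library mailbox module. It is not the q library's implementation; the sample messages, addresses, and the helper name read_subjects are hypothetical and exist only for illustration. It writes a two-message mailbox to a temporary file and reads the messages back, relying on the "From " line at the start of each message as the separator.

```python
import mailbox
import os
import tempfile

# Hypothetical two-message mailbox. Each message begins with a "From " line
# (the separator), followed by ordinary headers, a blank line, and the body.
SAMPLE = (
    "From alice@example.com Mon Jan  1 10:00:00 2024\n"
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>\n"
    "Subject: First\n"
    "\n"
    "Hello Bob.\n"
    "\n"
    "From bob@example.com Mon Jan  1 11:00:00 2024\n"
    "From: Bob <bob@example.com>\n"
    "To: Alice <alice@example.com>\n"
    "Subject: Second\n"
    "\n"
    "Hello Alice.\n"
)


def read_subjects(mbox_text):
    """Write mbox_text to a temporary file and return each message's subject."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.mbox")
        with open(path, "w", newline="\n") as f:
            f.write(mbox_text)
        box = mailbox.mbox(path)
        try:
            # Iterating the mailbox yields one message object per "From " block.
            return [msg["Subject"] for msg in box]
        finally:
            box.close()


subjects = read_subjects(SAMPLE)
print(subjects)  # ['First', 'Second']
```

The single file holds both messages; nothing other than the "From " lines marks where one ends and the next begins.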

.nlp.email.getGraph

Graph of who emailed whom, with the number of times they emailed

Syntax: .nlp.email.getGraph x

Where x is a table (as returned by .nlp.email.i.parseMbox), returns a table of sender-recipient pairs with the number of emails sent for each pair.

q).nlp.email.getGraph[emails]

sender                           to                               volume
------------------------------------------------------------------------
Donald.Hobern@csiro.au           tdwg-img@lists.tdwg.org          1
Donald.Hobern@csiro.au           tdwg@lists.tdwg.org              1
Donald.Hobern@csiro.au           vchavan@gbif.org                 1
RichardsK@landcareresearch.co.nz tdwg-img@lists.tdwg.org          1
Robert.Morris@cs.umb.edu         Tdwg-tag@lists.tdwg.org          1
Robert.Morris@cs.umb.edu         tdwg-img@lists.tdwg.org          1
mdoering@gbif.org                lee@blatantfabrications.com      1
mdoering@gbif.org                tdwg-img@lists.tdwg.org          1
morris.bob@gmail.com             tdwg-img@lists.tdwg.org          1
ram@cs.umb.edu                   RichardsK@landcareresearch.co.nz 1
ram@cs.umb.edu                   tdwg-img@lists.tdwg.org          2
ricardo@tdwg.org                 a.rissone@nhm.ac.uk              3
ricardo@tdwg.org                 leebel@netspace.net.au           3
ricardo@tdwg.org                 tdwg-img@lists.tdwg.org          3
ricardo@tdwg.org                 tdwg-lit@lists.tdwg.org          3
ricardo@tdwg.org                 tdwg-obs@lists.tdwg.org          3
ricardo@tdwg.org                 tdwg-process@lists.tdwg.org      3
ricardo@tdwg.org                 tdwg-tag@lists.tdwg.org          3
ricardo@tdwg.org                 tdwg-tapir@lists.tdwg.org        3
roger@tdwg.org                   Tdwg-img@lists.tdwg.org          1
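
The volume column is a count per sender-recipient pair. As a rough illustration of that idea only, the Python sketch below counts pairs in a small hypothetical list of messages. The message layout, the addresses, and the function name email_graph are assumptions made for this example and do not reflect how the q function is implemented.

```python
from collections import Counter

# Hypothetical parsed messages: one sender and a list of recipients each.
messages = [
    {"sender": "ann@example.com", "to": ["list@example.org", "bob@example.com"]},
    {"sender": "ann@example.com", "to": ["list@example.org"]},
    {"sender": "bob@example.com", "to": ["list@example.org"]},
]


def email_graph(messages):
    """Return sorted (sender, recipient, volume) rows, one per distinct pair."""
    counts = Counter(
        (m["sender"], recipient) for m in messages for recipient in m["to"]
    )
    return sorted((s, r, n) for (s, r), n in counts.items())


graph = email_graph(messages)
for row in graph:
    print(row)
```

A message with several recipients contributes one pair per recipient, which is why a single sender can appear on many rows.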

.nlp.email.i.parseMail

Parses an email in string format

Syntax: .nlp.email.i.parseMail x

Where x is an email as a string, returns a dictionary of its headers and content.

q)table:.nlp.email.i.parseMail emailString
q)cols table
`sender`to`date`subject`contentType`payload
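
For readers who want to see what parsing a single message into these fields can look like, here is a minimal sketch using Python's standard email package. It is an illustration under stated assumptions, not the q function itself: the raw message, the addresses, and the function name parse_mail are hypothetical, and the key names are simply borrowed from the column names shown above.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw message: headers, a blank line, then the body.
RAW = (
    "From: Alice Example <alice@example.com>\n"
    "To: Bob <bob@example.com>, Carol <carol@example.com>\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "Subject: Greetings\n"
    "Content-Type: text/plain\n"
    "\n"
    "Hello both.\n"
)


def parse_mail(raw):
    """Return a dictionary of selected headers and the payload of one email."""
    msg = message_from_string(raw)
    return {
        # getaddresses yields (name, address) tuples for each mailbox listed.
        "sender": getaddresses(msg.get_all("From", [])),
        "to": getaddresses(msg.get_all("To", [])),
        "date": parsedate_to_datetime(msg["Date"]),
        "subject": msg["Subject"],
        "contentType": msg.get_content_type(),
        "payload": msg.get_payload(),
    }


parsed = parse_mail(RAW)
print(sorted(parsed))
```

For a multipart message, get_payload returns a list of sub-messages rather than a string, so a payload can hold either a string or a list.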