Utility functions

The NLP library contains functions useful for in-depth document analysis. They extract elements of the text that can be fed to NLP algorithms or used directly in your analysis.

.nlp.findTimes

All the times in a document

Syntax: .nlp.findTimes x

Where x is a string, returns a general list in which each item holds:

  1. time
  2. text of the time (string)
  3. start index (long)
  4. index immediately after the last character of the time (long)

q).nlp.findTimes "I went to work at 9:00am and had a coffee at 10:20"
09:00:00.000 "9:00am" 18 24
10:20:00.000 "10:20"  45 50
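
The start index and the index after the end describe a half-open range, so the matched text can be recovered from the original string with plain q indexing. A minimal sketch using only built-in q and the indices from the example above:

```q
q)txt:"I went to work at 9:00am and had a coffee at 10:20"
q)txt 18+til 24-18    / characters 18..23
"9:00am"
q)txt 45+til 50-45    / characters 45..49
"10:20"
```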

.nlp.findDates

All the dates in a document

Syntax: .nlp.findDates x

Where x is a string, returns a general list in which each item holds:

  1. start date of the range
  2. end date of the range
  3. text of the range
  4. start index of the date (long)
  5. index immediately after the last character of the date (long)

q).nlp.findDates "I am going on holidays on the 12/04/2018 to New York and come back on the 18.04.2018"
2018.04.12 2018.04.12 "12/04/2018" 30 40
2018.04.18 2018.04.18 "18.04.2018" 74 84
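
Since each item of the result is itself a list, a whole column can be picked out by eliding the first index. A small sketch, assuming the result has the shape shown above; d is a hypothetical name, rebuilt here as a literal so that the snippet does not depend on the library:

```q
q)d:((2018.04.12;2018.04.12;"12/04/2018";30;40);(2018.04.18;2018.04.18;"18.04.2018";74;84))
q)d[;0]     / start dates
2018.04.12 2018.04.18
q)d[;2]     / matched text
"12/04/2018"
"18.04.2018"
```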

.nlp.findRegex

Find regular expressions within a string of text

Syntax: .nlp.findRegex[text;expr]

Where

  • text is the string of text to search
  • expr is the expression type, or list of expression types, to be searched for within the text (symbols)

returns a dictionary that maps each expression type to its matches. Each match holds the matched text, its start index, and the index immediately after its last character.

The expression types that can be searched for are as follows:

`specialChars                 `year
`money                        `yearfull
`phoneNumber                  `am
`emailAddress                 `pm
`url                          `time12
`zipCode                      `time24
`postalCode                   `time
`postalOrZipCode              `yearmonthList
`dtsep (date separator)       `yearmonthdayList
`day                          `yearmonth
`month                        `yearmonthday

q)txt:"You can call the number 123 456 7890 or email us on name@email.com in book an 
   appoinment for January,February and March for £30.00"
q).nlp.findRegex[txt;`phoneNumber`emailAddress`yearmonthList`money]
phoneNumber  | ,(" 123 456 7890";23;36)
emailAddress | ,("name@email.com";52;66)
yearmonthList| (("January";97;104);("February";105;113);("March";118;123);("30";129;131);("00";13..
money        | ,("\302\24330.00";128;134)
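
Because the result is a dictionary keyed by expression type, the matches for one type can be looked up by symbol. A sketch with a literal dictionary standing in for the result above; r is a hypothetical name and holds only two of the keys, copied from the example:

```q
q)r:`phoneNumber`emailAddress!(enlist(" 123 456 7890";23;36);enlist("name@email.com";52;66))
q)first r`emailAddress    / first (here, only) match: text, start, index after end
"name@email.com"
52
66
```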

.nlp.getSentences

A document partitioned into sentences.

Syntax: .nlp.getSentences x

Where x is a dictionary or a table of document records or subcorpus, returns a list of strings.

/ find the sentences in the first chapter of Moby Dick
q).nlp.getSentences corpus[0]
"CHAPTER 1\n\n  Loomings\n\n\n\nCall me Ishmael."
" Some years ago--never mind how long precisely-- having little or no money in my purse, and noth..
" It is a way I have of driving off the spleen and regulating the circulation."
"Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in ..
" This is my substitute for pistol and ball."
"With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship."
" There is nothing surprising in this."
"If they but knew it, almost all men in their degree, some time or other, cherish very nearly the..
"\n\nThere now is your insular city of the Manhattoes, belted round by wharves as Indian isles by..
"Right and left, the streets take you waterward."
" Its extreme downtown is the battery, where that noble mole is washed by waves, and cooled by br..
"Look at the crowds of water-gazers there."
"\n\nCircumambulate the city of a dreamy Sabbath afternoon."
..

.nlp.loadTextFromDir

All the files in a directory, imported recursively

Syntax: .nlp.loadTextFromDir x

Where x is the directory’s filepath as a string, returns a table of filenames, paths and texts.

q).nlp.loadTextFromDir["./datasets/maildir/skilling-j"]

fileName path                                           text                 ..
-----------------------------------------------------------------------------..
1.       :./datasets/maildir/skilling-j/_sent_mail/1.   "Message-ID: <1461010..
10.      :./datasets/maildir/skilling-j/_sent_mail/10.  "Message-ID: <1371054..
100.     :./datasets/maildir/skilling-j/_sent_mail/100. "Message-ID: <47397.1..
101.     :./datasets/maildir/skilling-j/_sent_mail/101. "Message-ID: <2486283..

Remove characters

Remove characters from a string of text

.nlp.rmv_custom

Remove the parts of a string of text that contain certain characters or expressions

Syntax: .nlp.rmv_custom[text;char]

Where

  • text is a string of text
  • char is a list of characters or expressions to be removed from the text

returns the string of text without anything that contains the defined characters.

q)rmv_list:("*\n*";"*?!*";"*,";"*&*";"*[0-9]*")
q)(jeffemails`text)100
"Re:\n\nHow much to you have?!  SRS\n\n\n\n\nKevin Hannon @ ENRON COMMUNICATIONS on 04/20/2001 08..
q).nlp.rmv_custom[(jeffemails`text)100;rmv_list]
"much to you  SRS\n\n\n\n\nKevin Hannon ENRON COMMUNICATIONS on  \n\n\nOK Sherri how much do you ..
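
The general idea of dropping every word that matches a pattern can be approximated in plain q with like. This is only an illustrative sketch, not the library's implementation: it assumes words are separated by single spaces, and the names pats and w are hypothetical. The library's own tokenisation may differ, for example in how it treats newlines.

```q
q)pats:("*,";"*[0-9]*")                  / patterns in the style of rmv_list
q)w:" " vs "How much, in total, is 30 apples"
q)" " sv w where not any w like/:pats    / drop every word matching any pattern
"How in is apples"
```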

.nlp.rmv_master

Remove certain individual characters from a string of text and replace them

Syntax: .nlp.rmv_master[text;char;n]

Where

  • text is a string of text
  • char is the string of characters to be removed
  • n is the character or string that replaces each removed character

returns the string of text with the characters removed and replaced.

q).nlp.rmv_master[(jeffemails`text)100;",.:?!/@'\n";"??"]
"Re????????How much to you have????  SRS??????????Kevin Hannon ?? ENRON COMMUNICATIONS on 04??20?..
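
For comparison, character-by-character replacement can be sketched in plain q. The function rep below is hypothetical and is not the library's implementation; it assumes every character of text found in chars should be swapped for the string n.

```q
q)rep:{[text;chars;n] raze {[c;n;x] $[x in c;n;enlist x]}[chars;n] each text}
q)rep["Re: how much?";":?";"??"]
"Re?? how much??"
```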

.nlp.ascii

Remove any non-ASCII characters from a string of text

Syntax: .nlp.ascii[text]

Where text is a string of text, returns the string of text with all non-ASCII characters removed.

q).nlp.ascii["This is ä senteñcê"]
"This is  sentec"
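
The same effect can be approximated in plain q by keeping only the bytes below 0x80. This is a sketch under the assumption that the text is UTF-8, where every byte of a non-ASCII character is 0x80 or above; keepAscii is a hypothetical name, not part of the library.

```q
q)keepAscii:{x where 0x80>"x"$x}    / keep bytes below 0x80
q)keepAscii "caf\303\251 au lait"   / \303\251 is the UTF-8 encoding of é
"caf au lait"
```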