Feature vectors¶
.nlp   Feature-vector functions

Feature vectors for documents
  keywordsContinuous   Relevance scores for tokens in a text
  TFIDF                TF-IDF scores for terms in each document of a corpus

Feature vectors for words
  biGram               Probability of a word appearing next in a sequence
  extractPhrases       Tokens that contain the term where each consecutive word has an above-average co-occurrence
  findRelatedTerms     Find related terms and their significance to a word
  nGram                Probability of n tokens appearing together
After applying data-processing procedures, you can treat pieces of text as feature vectors.
You can generate a dictionary of descriptive terms and their associated weights. These dictionaries are called feature vectors, and they are useful because they give a uniform representation that can describe words, sentences, paragraphs, documents, collections of documents, clusters, concepts and queries.
In the examples below, the parsedTab variable is the result from the .nlp.newParser example in the data-preprocessing section.
Feature vectors for documents¶
The value associated with each term in a feature vector measures how significant that term is as a descriptor of the entity. For documents, this can be calculated by comparing the frequency of words in that document with the frequency of words in the rest of the corpus.
Sorting the terms in a feature vector by their significance gives the keywords that most distinguish a document from the corpus, forming a terse summary of it. For example, the most significant term in the feature vector for one of the chapters of Moby Dick is whale.
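At its simplest, a feature vector is just a dictionary mapping terms to weights, so sorting it by value ranks the keywords. A minimal sketch (the fv weights here are invented for illustration):

q)fv:`whale`boat`sea`the!0.12 0.04 0.03 0.001  / hypothetical feature vector
q)desc fv                                      / keywords in order of significance
whale| 0.12
boat | 0.04
sea  | 0.03
the  | 0.001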
TF-IDF is an algorithm that weighs a term’s frequency (TF) and its inverse document frequency (IDF). Each word or term has its respective TF and IDF score. The product of the TF and IDF scores of a term is called the TF-IDF weight of that term.
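A common formulation (the library's exact weighting may differ in detail) scores term t in document d of an N-document corpus as

tfidf(t,d) = tf(t,d) × log(N ÷ df(t))

where tf(t,d) is the frequency of t in d and df(t) is the number of documents containing t; the logarithm damps the weight of terms that appear in many documents.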
Feature vectors for words¶
The feature vector for a word can be calculated as a collection of how well other words predict the given keyword. The weight given to these words is a function of how much higher the actual co-occurrence rate is from the expected co-occurrence rate the terms would have if they were randomly distributed.
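Schematically, and as a sketch of the idea rather than the library's exact formula, the weight of a context word x in the feature vector of a term t grows with the ratio

observed co-occurrence of x with t ÷ co-occurrence expected if x and t were independent

so words that appear near t far more often than chance would predict receive the largest weights.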
.nlp.biGram¶
Probability of a word appearing next in a sequence of words
.nlp.biGram parsedTab
Where parsedTab is a table of parsed documents (as returned by .nlp.newParser), returns a dictionary containing the probability that the second word of a pair follows the first.
q).nlp.biGram parsedTab
chapter loomings | 0.005780347
loomings ishmael | 1
ishmael years | 0.05
years ago | 0.1770833
ago mind | 0.03030303
mind long | 0.02597403
long precisely--| 0.003003003
precisely-- little | 1
little money | 0.004016064
money purse | 0.07692308
purse particular | 0.1428571
The parsedTab argument must contain the columns tokens and isStop.
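If the values are conditional probabilities P(next|current), as the description and the loomings ishmael entry above suggest, then the probabilities for each leading word should sum to 1. A quick check of that reading (output omitted):

q)bg:.nlp.biGram parsedTab
q)sum each value[bg] group first each key bg  / expect ~1f for every leading word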
.nlp.extractPhrases¶
Tokens that contain the term where each consecutive word has an above-average co-occurrence with the term
.nlp.extractPhrases[parsedTab;term]
Where

- parsedTab is a table of parsed documents (as returned by .nlp.newParser)
- term is the term to extract phrases around, as a symbol

returns a dictionary with phrases as the keys and their relevance as the values.
Search for the phrases that contain captain and see which occurs most often: captain ahab appears most frequently in the book, 50 times.
q).nlp.extractPhrases[parsedTab;`captain]
`captain`ahab | 50
`captain`peleg | 25
`captain`bildad | 10
`stranger`captain | 6
`captain`sleet | 5
`sea`captain | 3
...
The parsedTab argument must contain the column tokens.
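A short usage sketch based on the result above: sorting the dictionary by value with desc and taking the first key picks out the dominant phrase.

q)phrases:.nlp.extractPhrases[parsedTab;`captain]
q)first key desc phrases  / the most frequent phrase
`captain`ahab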
.nlp.findRelatedTerms¶
Related terms and their significance to a word
.nlp.findRelatedTerms[parsedTab;term]
Where

- parsedTab is a table of parsed documents (as returned by .nlp.newParser)
- term is a symbol, the token for which to find related terms

returns a dictionary of the related tokens and their relevances.
q).nlp.findRelatedTerms[parsedTab;`captain]
peleg | 1.665086
bildad | 1.336501
ahab | 1.236744
ship | 1.154238
cabin | 0.9816231
Phrases can be found by looking for runs of words with an above-average significance to the query term.
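A minimal sketch of that idea, using the result above (output omitted, as it depends on the full dictionary rather than the five entries shown):

q)rel:.nlp.findRelatedTerms[parsedTab;`captain]
q)key[rel] where value[rel]>avg value rel  / tokens with above-average significance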
The parsedTab argument must contain the columns tokens, isStop, and sentIndices.
.nlp.keywordsContinuous¶
Relevance scores for tokens in a text
.nlp.keywordsContinuous parsedTab
Where parsedTab is a table of parsed documents (as returned by .nlp.newParser), returns a dictionary of keywords and their significance.
Treating all of Moby Dick as a single document, the most significant keywords are Ahab, Bildad, Peleg (the three captains on the boat) and whale.
q)5#keywords:.nlp.keywordsContinuous parsedTab
ahab | 64.24125
peleg | 52.37642
bildad | 46.86506
whale | 42.41664
stubb | 37.82133
For an input which is conceptually a single document, such as a book, this will give better results than using TF-IDF.
The parsedTab argument must contain the columns tokens and isStop.
.nlp.nGram¶
Probability of n tokens appearing together in a text
.nlp.nGram[parsedTab;n]
Where

- parsedTab is a table of parsed documents (as returned by .nlp.newParser)
- n is the number of words to occur together

returns a dictionary containing the probability of n tokens appearing together in a text.
q).nlp.nGram[parsedTab;3]
chapter loomings ishmael | 1
loomings ishmael years | 1
ishmael years ago | 1
years ago mind | 0.05882353
years ago poor | 0.05882353
years ago nathan | 0.05882353
years ago know | 0.05882353
years ago plan | 0.05882353
years ago scoresby | 0.05882353
years ago commodore | 0.05882353
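Reading the output: 0.05882353 = 1÷17, consistent with the values being conditional probabilities as in .nlp.biGram, with the bigram years ago having 17 observed continuations of equal frequency. This is an interpretation inferred from the output rather than stated by the library.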
The parsedTab argument must contain the columns tokens and isStop.
.nlp.TFIDF¶
TF-IDF scores for terms in each document of a corpus
.nlp.TFIDF parsedTab
Where parsedTab is a table of parsed documents (as returned by .nlp.newParser), returns, for each document, a dictionary of tokens and their relevance.
Extract a specific document and find the most significant words in that document:
q)5#desc .nlp.TFIDF[parsedTab]100
whales | 0.02578393
straits| 0.02454199
herd | 0.01933972
java | 0.01729666
sunda | 0.01675828
The parsedTab argument must contain the columns tokens and isStop.
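Since the result is one dictionary per document, the top terms of every document can be taken in a single pass. A short usage sketch (output omitted):

q)tfidf:.nlp.TFIDF parsedTab
q)3#'desc each tfidf  / top 3 scoring tokens for each document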