# `.nlp` Comparison functions

function                    | purpose
----------------------------|-----------------------------------------------------
`.nlp.compareCorpora`       | Compare corpora
`.nlp.compareDocs`          | Compare two documents
`.nlp.compareDocToCentroid` | Find outliers and representative documents
`.nlp.compareDocToCorpus`   | Compare document to corpus
`.nlp.cosineSimilarity`     | Compare two feature vectors
`.nlp.explainSimilarity`    | How much each term contributes to cosine similarity
`.nlp.jaroWinkler`          | Compare two strings
Once the data-preprocessing steps have been applied, feature vectors, corpora and documents can be compared.

In the examples below, the `parsedTab` variable is the output of the `.nlp.newParser` example defined in the data-preprocessing section.
## `.nlp.compareCorpora`

_Affinity between terms in two corpora_

Syntax: `.nlp.compareCorpora[parsedTab1;parsedTab2]`

Where `parsedTab1` and `parsedTab2` are tables of parsed documents (as returned by `.nlp.newParser`), returns a dictionary of terms and their affinity for each corpus.
Below we compare the chapters in the novel that contain the term whale with the remaining chapters.
```q
// Separate text containing the term "whale"
q)whaleText:parsedTab i:where (parsedTab[`text] like "*whale*")
q)remaining:parsedTab til[count parsedTab]except i
q)show compare:.nlp.compareCorpora[whaleText;remaining]
`whale`whales`sperm`fish`boat`white`boats`great`oil`far`..
`night`queequeg`bed`man`aye`sleeping`ahab`morning`sat`th..
q)5#first compare
whale | 26.16359
whales| 12.40908
sperm | 10.20464
fish  | 7.951354
boat  | 7.824179
q)5#last compare
night   | 23.62646
queequeg| 19.54203
bed     | 15.60707
man     | 14.87776
aye     | 13.4208
```
A quick way to compare corpora is to find words that are common across the whole dataset but have a strong affinity for only one corpus. This can be used to find the key words that differentiate one corpus from another.

Each `parsedTab` argument must be a table of parsed documents containing the columns produced by `.nlp.newParser`.
## `.nlp.compareDocs`

_Cosine similarity of two documents_

Syntax: `.nlp.compareDocs[keywords1;keywords2]`

Where `keywords1` and `keywords2` are dictionaries of keywords and their significance scores in a document (as returned by `.nlp.newParser`), returns the cosine similarity of the two documents.

This function calculates the similarity of two documents: it finds the keywords present in both documents and computes the cosine similarity of their scores.
```q
q)show 5#keywords1:first parsedTab`keywords
chapter | 0.001145475
loomings| 0.006885035
ishmael | 0.00849496
years   | 0.002474785
ago     | 0.006219469
q)show 5#keywords2:last parsedTab`keywords
chapter | 0.0125
whaling | 0.02481605
ribs    | 0.04515925
trucks  | 0.05780426
occasion| 0.04635063
q).nlp.compareDocs[keywords1;keywords2]
0.0362958
```
## `.nlp.compareDocToCentroid`

_Find outliers and representative documents_

Syntax: `.nlp.compareDocToCentroid[centroid;keywords]`

Where

- `centroid` is the sum of all the keywords' significance scores, as a dictionary
- `keywords` is a dictionary of keywords and their significance scores in a document (as returned by `.nlp.newParser`)

returns the cosine similarity of the document and the centroid as a float.

Below, all the chapters containing the term _whale_ are extracted and their centroid calculated. The chapters furthest from the centroid are identified.
```q
q)whaleText:parsedTab i:where (parsedTab[`text] like "*whale*")
q)centroid:sum whaleText`keywords
q)show compare:.nlp.compareDocToCentroid[centroid]each whaleText`keywords
0.3849759 0.3286244 0.3994688 0.3833975 0.2..
q)5#whaleText iasc compare
text                                                              ..
------------------------------------------------------------------..
"CHAPTER 6\n\nThe Street\n\n\nIf I had been astonished at first ca..
"CHAPTER 88\n\nSchools and Schoolmasters\n\n\nThe previous chapter..
"CHAPTER 107\n\nThe Carpenter\n\n\nSeat thyself sultanically among..
"CHAPTER 95\n\nThe Cassock\n\n\nHad you stepped on board the Pequo..
"CHAPTER 15\n\nChowder\n\n\nIt was quite late in the evening when ..
```
## `.nlp.compareDocToCorpus`

_Cosine similarity between a document and other documents in a corpus_

Syntax: `.nlp.compareDocToCorpus[keywords;idx]`

Where

- `keywords` is a list of dictionaries of keywords and their significance scores in a corpus (as returned by `.nlp.newParser`)
- `idx` is the index of the entry in `keywords` to compare with the rest of the corpus

returns, as a list of floats, the document's similarity to the rest of the corpus.

Comparing the first chapter with the rest of the book:
```q
q).nlp.compareDocToCorpus[parsedTab`keywords;0]
0.078517 0.1048744 0.06266384 0.07095197 0.08974005..
```
## `.nlp.cosineSimilarity`

_Compare two feature vectors_

Syntax: `.nlp.cosineSimilarity[keywords1;keywords2]`

Where `keywords1` and `keywords2` are dictionaries of keywords and their significance scores (as returned by `.nlp.newParser`), returns the cosine similarity of the two feature vectors.
```q
q)show 5#keywords1:first parsedTab`keywords
chapter | 0.001145475
loomings| 0.006885035
ishmael | 0.00849496
years   | 0.002474785
ago     | 0.006219469
q)show 5#keywords2:last parsedTab`keywords
chapter | 0.0125
whaling | 0.02481605
ribs    | 0.04515925
trucks  | 0.05780426
occasion| 0.04635063
q).nlp.cosineSimilarity[keywords1;keywords2]
98.17588
```
By extracting the keywords in a corpus and calculating their associated significance scores, pieces of text can be treated as feature vectors for further analysis of their content.

A vector can be thought of either as

- the co-ordinates of a point
- a line segment from the origin to a point

The second view is the useful one here: any two vectors have an angle between them, and that angle corresponds to their similarity, as measured by cosine similarity.

The cosine similarity of two vectors is their dot product divided by the product of their magnitudes. It is a standard metric for comparing documents.
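To make the calculation concrete, here is a short Python sketch (an illustration of the formula only, not the library's implementation) of cosine similarity over sparse keyword dictionaries like those in the `keywords` column:

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse feature vectors held as dicts."""
    # Dot product over the terms the two vectors share
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    # Product of the two vector magnitudes
    mag = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    return dot / mag if mag else 0.0

doc1 = {"whale": 0.8, "sea": 0.6}
doc2 = {"whale": 0.8, "ship": 0.6}
print(round(cosine_similarity(doc1, doc2), 4))  # → 0.64
```

Terms appearing in only one document contribute nothing to the dot product, so two documents with no shared keywords score 0.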
## `.nlp.explainSimilarity`

_How much each term contributes to the cosine similarity_

Syntax: `.nlp.explainSimilarity[keywords1;keywords2]`

Where `keywords1` and `keywords2` are dictionaries of keywords and their significance scores (as returned by `.nlp.newParser`), returns a dictionary showing how much of the similarity score each token is responsible for.
```q
q)5#.nlp.explainSimilarity . parsedTab[`keywords]0 100
whale| 0.1864428
time | 0.06867081
sea  | 0.02967095
ship | 0.02693201
long | 0.02690912
```
For any pair of documents or centroids, the list of features can be sorted by how much they contribute to the similarity.
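The idea behind the breakdown can be sketched in Python: each shared term contributes its product of weights to the normalized dot product, so the contributions sum to the cosine similarity. This is an illustration only; the library's exact weighting may differ.

```python
import math

def explain_similarity(a: dict, b: dict) -> dict:
    """Split a cosine-similarity score into per-term contributions."""
    mag = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    # Each shared term's share of the normalized dot product
    shared = {t: a[t] * b[t] / mag for t in a if t in b}
    # Largest contributors first, as in the q example above
    return dict(sorted(shared.items(), key=lambda kv: -kv[1]))

doc1 = {"whale": 0.8, "sea": 0.6}
doc2 = {"whale": 0.8, "sea": 0.3, "ship": 0.5}
print(explain_similarity(doc1, doc2))
```

Summing the returned values recovers the overall cosine similarity of the two documents.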
## `.nlp.jaroWinkler`

_Calculate the Jaro-Winkler distance of two strings, scored between 0 and 1_

Syntax: `.nlp.jaroWinkler[str1;str2]`

Where `str1` and `str2` are strings, returns the Jaro-Winkler score of the two: a number between 0 and 1, where 1 indicates the strings are identical and 0 completely dissimilar.
The centroid of a collection of documents is the average of their feature vectors. As such, documents close to the centroid are representative, while those far away are the outliers. Given a collection of documents, finding outliers can be a quick way to find interesting documents, those that have been mis-clustered, or those not relevant to the collection.
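That recipe can be sketched in Python; the function and variable names here are illustrative, not part of the library:

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    mag = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    return dot / mag if mag else 0.0

def centroid(docs: list) -> dict:
    """Sum the keyword weights across all documents, term by term."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return total

docs = [
    {"whale": 0.9, "sea": 0.5},   # on-topic
    {"whale": 0.8, "ship": 0.4},  # on-topic
    {"chowder": 1.0},             # outlier
]
c = centroid(docs)
scores = [cosine(c, d) for d in docs]
# The lowest-scoring document is the furthest from the centroid
print(min(range(len(docs)), key=scores.__getitem__))  # → 2
```

Sorting the corpus by these scores, as `iasc compare` does in the q example above, surfaces the outliers first.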