# `.nlp` Comparison functions

function                    | purpose
----------------------------|-----------------------------------------------------
`.nlp.compareCorpora`       | Compare corpora
`.nlp.compareDocs`          | Compare two documents
`.nlp.compareDocToCentroid` | Find outliers and representative documents
`.nlp.compareDocToCorpus`   | Compare document to corpus
`.nlp.cosineSimilarity`     | Compare two feature vectors
`.nlp.explainSimilarity`    | How much each term contributes to cosine similarity
`.nlp.jaroWinkler`          | Compare two strings
Once the data-preprocessing steps have been applied, feature vectors, corpora and documents can be compared.

In the examples below, the `parsedTab` variable is the output of the `.nlp.newParser` example defined in the data-preprocessing section.
## `.nlp.compareCorpora`

_Affinity between terms in two corpora_

Syntax: `.nlp.compareCorpora[parsedTab1;parsedTab2]`

Where `parsedTab1` and `parsedTab2` are tables of parsed documents (as returned by `.nlp.newParser`), returns a dictionary of terms and their affinity for each corpus.
Below we compare the chapters in the novel that contain the term whale with the remaining chapters.
```q
// Separate text containing the term "whale"
q)whaleText:parsedTab i:where (parsedTab[`text] like "*whale*")
q)remaining:parsedTab til[count parsedTab]except i
q)show compare:.nlp.compareCorpora[whaleText;remaining]
`whale`whales`sperm`fish`boat`white`boats`great`oil`far`..
`night`queequeg`bed`man`aye`sleeping`ahab`morning`sat`th..
q)5#first compare
whale | 26.16359
whales| 12.40908
sperm | 10.20464
fish  | 7.951354
boat  | 7.824179
q)5#last compare
night   | 23.62646
queequeg| 19.54203
bed     | 15.60707
man     | 14.87776
aye     | 13.4208
```
A quick way to compare corpora is to find words that are common across the whole dataset but have a strong affinity for only one corpus. This can be used to find the key words that differentiate one corpus from another.

Each `parsedTab` argument must be a table of parsed documents containing the columns produced by `.nlp.newParser`.
## `.nlp.compareDocs`

_Cosine similarity of two documents_

Syntax: `.nlp.compareDocs[keywords1;keywords2]`

Where `keywords1` and `keywords2` are dictionaries of keywords and their significance scores in a document (as returned by `.nlp.newParser`), returns the cosine similarity of the two documents.

This function calculates the similarity of two documents: it finds the keywords present in both documents and computes the cosine similarity of their scores.
```q
q)show 5#keywords1:first parsedTab`keywords
chapter | 0.001145475
loomings| 0.006885035
ishmael | 0.00849496
years   | 0.002474785
ago     | 0.006219469
q)show 5#keywords2:last parsedTab`keywords
chapter | 0.0125
whaling | 0.02481605
ribs    | 0.04515925
trucks  | 0.05780426
occasion| 0.04635063
q).nlp.compareDocs[keywords1;keywords2]
0.0362958
```
## `.nlp.compareDocToCentroid`

_Find outliers and representative documents_

Syntax: `.nlp.compareDocToCentroid[centroid;keywords]`

Where

- `centroid` is the sum of all the keywords' significance scores, as a dictionary
- `keywords` is a dictionary of keywords and their significance scores in a document (as returned by `.nlp.newParser`)

returns the cosine similarity of the document and the centroid as a float.

Below, all the chapters containing the term _whale_ are extracted and their centroid calculated. The chapters furthest from the centroid are identified.
```q
q)whaleText:parsedTab i:where (parsedTab[`text] like "*whale*")
q)centroid:sum whaleText`keywords
q)show compare:.nlp.compareDocToCentroid[centroid]each whaleText`keywords
0.3849759 0.3286244 0.3994688 0.3833975 0.2..
q)5#whaleText iasc compare
text                                                              ..
------------------------------------------------------------------..
"CHAPTER 6\n\nThe Street\n\n\nIf I had been astonished at first ca..
"CHAPTER 88\n\nSchools and Schoolmasters\n\n\nThe previous chapter..
"CHAPTER 107\n\nThe Carpenter\n\n\nSeat thyself sultanically among..
"CHAPTER 95\n\nThe Cassock\n\n\nHad you stepped on board the Pequo..
"CHAPTER 15\n\nChowder\n\n\nIt was quite late in the evening when ..
```
## `.nlp.compareDocToCorpus`

_Cosine similarity between a document and other documents in a corpus_

Syntax: `.nlp.compareDocToCorpus[keywords;idx]`

Where

- `keywords` is a list of dictionaries of keywords and their significance scores in a corpus (as returned by `.nlp.newParser`)
- `idx` is the index of the entry in `keywords` to compare with the rest of the corpus

returns, as a list of floats, the document's similarity to the rest of the corpus.

Comparing the first chapter with the rest of the book:
```q
q).nlp.compareDocToCorpus[parsedTab`keywords;0]
0.078517 0.1048744 0.06266384 0.07095197 0.08974005..
```
## `.nlp.cosineSimilarity`

_Compare two feature vectors_

Syntax: `.nlp.cosineSimilarity[keywords1;keywords2]`

Where `keywords1` and `keywords2` are dictionaries of keywords and their significance scores (as returned by `.nlp.newParser`), returns the cosine similarity of the two feature vectors.
```q
q)show 5#keywords1:first parsedTab`keywords
chapter | 0.001145475
loomings| 0.006885035
ishmael | 0.00849496
years   | 0.002474785
ago     | 0.006219469
q)show 5#keywords2:last parsedTab`keywords
chapter | 0.0125
whaling | 0.02481605
ribs    | 0.04515925
trucks  | 0.05780426
occasion| 0.04635063
q).nlp.cosineSimilarity[keywords1;keywords2]
98.17588
```
By extracting the keywords in a corpus and calculating their associated significance scores, pieces of text can be treated as feature vectors for further analysis of their content.

A vector can be thought of either as

- the co-ordinates of a point
- a line segment from the origin to a point

The second view is the useful one here: any two vectors have an angle between them, and that angle corresponds to their similarity, as measured by cosine similarity.

The cosine similarity of two vectors is their dot product divided by the product of their magnitudes. It is a standard metric for comparing documents.
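To make the calculation concrete, here is a short Python sketch (an illustration of the formula only, not the library's implementation) of cosine similarity over sparse keyword dictionaries like those in the `keywords` column:

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse feature vectors held as dicts."""
    # Dot product over the terms the two vectors share
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    # Product of the two vector magnitudes
    mag = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    return dot / mag if mag else 0.0

doc1 = {"whale": 0.8, "sea": 0.6}
doc2 = {"whale": 0.8, "ship": 0.6}
print(round(cosine_similarity(doc1, doc2), 4))  # → 0.64
```

Terms appearing in only one document contribute nothing to the dot product, so two documents with no shared keywords score 0.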
## `.nlp.explainSimilarity`

_How much each term contributes to the cosine similarity_

Syntax: `.nlp.explainSimilarity[keywords1;keywords2]`

Where `keywords1` and `keywords2` are dictionaries of keywords and their significance scores (as returned by `.nlp.newParser`), returns a dictionary showing how much of the similarity score each token is responsible for.
```q
q)5#.nlp.explainSimilarity . parsedTab[`keywords]0 100
whale| 0.1864428
time | 0.06867081
sea  | 0.02967095
ship | 0.02693201
long | 0.02690912
```
For any pair of documents or centroids, the list of features can be sorted by how much they contribute to the similarity.
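The idea behind the breakdown can be sketched in Python: each shared term contributes its product of weights to the normalized dot product, so the contributions sum to the cosine similarity. This is an illustration only; the library's exact weighting may differ.

```python
import math

def explain_similarity(a: dict, b: dict) -> dict:
    """Split a cosine-similarity score into per-term contributions."""
    mag = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    # Each shared term's share of the normalized dot product
    shared = {t: a[t] * b[t] / mag for t in a if t in b}
    # Largest contributors first, as in the q example above
    return dict(sorted(shared.items(), key=lambda kv: -kv[1]))

doc1 = {"whale": 0.8, "sea": 0.6}
doc2 = {"whale": 0.8, "sea": 0.3, "ship": 0.5}
print(explain_similarity(doc1, doc2))
```

Summing the returned values recovers the overall cosine similarity of the two documents.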
## `.nlp.jaroWinkler`

_Calculate the Jaro-Winkler distance of two strings, scored between 0 and 1_

Syntax: `.nlp.jaroWinkler[str1;str2]`

Where `str1` and `str2` are strings, returns the Jaro-Winkler score of the two: a number between 0 and 1, where 1 indicates the strings are identical and 0 completely dissimilar.
The centroid of a collection of documents is the average of their feature vectors. As such, documents close to the centroid are representative, while those far away are the outliers. Given a collection of documents, finding outliers can be a quick way to find interesting documents, those that have been mis-clustered, or those not relevant to the collection.
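That recipe can be sketched in Python; the function and variable names here are illustrative, not part of the library:

```python
import math

def cosine(a: dict, b: dict) -> float:
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    mag = math.sqrt(sum(w * w for w in a.values())) * \
          math.sqrt(sum(w * w for w in b.values()))
    return dot / mag if mag else 0.0

def centroid(docs: list) -> dict:
    """Sum the keyword weights across all documents, term by term."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return total

docs = [
    {"whale": 0.9, "sea": 0.5},   # on-topic
    {"whale": 0.8, "ship": 0.4},  # on-topic
    {"chowder": 1.0},             # outlier
]
c = centroid(docs)
scores = [cosine(c, d) for d in docs]
# The lowest-scoring document is the furthest from the centroid
print(min(range(len(docs)), key=scores.__getitem__))  # → 2
```

Sorting the corpus by these scores, as `iasc compare` does in the q example above, surfaces the outliers first.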