
Ask HN: Algorithms for text fingerprinting? - vixsomnis
I remember reading an article a year or so ago about (the NSA) identifying users based on how they write: vocabulary, spelling mistakes, grammar, dialect, and so on.<p>This is interesting to me because it is extremely difficult to change the vocabulary I use in writing and speaking. Being able to estimate the amount of similarity between two pieces of text would be useful.<p>The closest I can think of right now would be the proprietary algorithms used to check for plagiarism (for schools and universities, for instance).<p>Are there any publicly available algorithms for this? Where can I go to learn more? (Academic journals?) Am I just DDGing the wrong search terms?
======
j42
Figured I'd chime in here since I developed an algorithm recently that could
be applied to this problem with some basic ML.

Basically the first step would be shingling the text (choosing a sampling
domain) and generating a MinHash struct (computationally cheap) which can then
be used to find the "similarity" between sets, or, the "Jaccard Index."

If you're clever about this, you can use HyperLogLogs to encode these MinHash
structs gaining a great deal of speed with a marginal error rate, all while
allowing for arbitrary N-levels of intersection.

If you're looking to build a model to analyze two (or N) text bodies for
stylometric similarities, I'd approach the problem in two steps:

1) Minimize the relevant input text.

\- Use a bernoulli/categorical distribution to weight words according to
uniqueness--NLP and sentiment extraction techniques may also help

\- Design a markov process to represent more complex phrasing patterns for the
text as a whole

\- Filter by a variable threshold to minimize the resulting set of
shingles/bins/"interesting nodes" into a computationally-manageable #

2) Use an efficient MinHash intersection to compute a similarity vector (0-1)
for the two texts.

I think given the prevalence of training data (I mean, what's more ubiquitous
than the written word...) you could probably tune this to a reasonable
accuracy and efficient complexity.

Just a 5m thought exercise, but if anyone else has ideas I'd be curious as
well :)

~~~
patientfrog
Could you describe in a little more detail what you mean by this sentence:

> "... use HyperLogLogs to encode these MinHash structs gaining a great deal
> of speed with a marginal error rate, all while allowing for arbitrary
> N-levels of intersection"

Thanks

~~~
j42
Sure thing.

Let's say we have two sets, and we're trying to find out how similar they are:

    
    
        setOne = ['the','brown','fox','jumped','a','log']
        setTwo = ['the','quick','brown','log','jumped','over','the','fox']
    

You could use an array intersection when its small, but if you want to do this
efficiently at scale you need to take advantage of probabilistic data
structures.

Let's say you created a HyperLogLog for each set:

    
    
        setOne = (bitfield representing setOne)
        setTwo = (bitfield representing setTwo)
    

Now HyperLogLogs are cool, because you can merge them together w/o losing
anything. You can't retrieve the data, but you can efficiently check if a
value exists inside. You can have a false positive, but never a false
negative.

You might first try simple combinatorics (a la, similarity ~= |A or B| / |A ∪
B|) however this can get hairy depending upon the representation of the
HyperLogLog (sparse/dense) and its respective cardinality.

Eventually you realize you can't accept an exponentially-compounding error
rate, but still need the raw efficiency, and thus you sacrifice by doubling
your storage cost.

Now instead of just the initial sets, you'd also build a parallel MinHash
struct:

    
    
        setOne_minhashes = [<int>, <int>, <int>]
        setTwo_minhashes = [<int>, <int>, <int>]
    
        setOne = (bitfield representing setOne)
        setTwo = (bitfield representing setTwo)
        setOneMinHash = (bitfield representation of setOne minhashes)
        setTwoMinHash = (bitfield representation of setTwo minhashes)
    

I won't go into excessive detail about the minhash algorithm itself, but
essentially it provides a way to sample a large set of values by
selecting/retaining the smallest output hashes. What that means is, you can
then intersect the minhash bitfields as many times over as you like, and
extract a predictably accurate similarity index in linear time with definable
confidence bounds.

~~~
vixsomnis
I don't have a background in ML (I'm still somewhat early in my CS Bachelor's
degree, and I'm not focusing on ML anyway), but this is very interesting,
regardless that I had to look up nearly every term you used.

I was initially thinking a directed weighted graph might work well here, but
I'm assuming that would scale terribly relative to something like this.

------
moyix
The relevant search term is "stylometry". One particular paper I remember is
from Dawn Song's group at Berkeley a couple years back:

[http://www.cs.berkeley.edu/~dawnsong/papers/2012%20On%20the%...](http://www.cs.berkeley.edu/~dawnsong/papers/2012%20On%20the%20Feasibility%20of%20Internet-
Scale%20Author%20Identification.pdf)

There's a lot of public work on the topic, but it looks like right now the
best place to look is still in academic papers (I don't know of any open
source libraries, for example).

~~~
Osmium
I'm not sure whether it's appropriate to mention this or not, but I was just
looking up the authors of that paper to see what they're doing now. Very
sadly, I found out that one of the authors, Emil Stefanov, died last year at
the age of 26.

[http://www.rememberingemil.org](http://www.rememberingemil.org)

~~~
functional_test
Thank you for posting this (it was quite a surprise to see his name here).

I was a close friend of Emil for a long time. Before he passed, I had gotten
busy with work and hadn't spoken to him in a while. I saw some of our mutual
friends the weekend before it happened and had planned to call him. Really
wish I had made that call sooner.

------
thatcat
Jstylo might be what you're looking for.
[https://github.com/psal/jstylo](https://github.com/psal/jstylo)

The same group also has created a text obfuscation tool called anonymouth that
helps you obfuscate your word choices, but it has still yet to be released.
[https://psal.cs.drexel.edu/index.php/JStylo-
Anonymouth](https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth)

------
benten10
I may be late to the party, but finally my time to shine! As has been
mentioned earlier, the field gou are looking for is called stylometry, and has
almost a century of history behind it, and also the field if my thesis. After
looking at what everyone has been saying, I just felt like copying and pasting
my 20 page literature review here, but I'd recommend you look at narayanan at
all (2012) paper (internet scale authorship attribution). The algorithm it
uses is not particularly complex and would take you a week, tops, to implement
if you put in a few hours a day, and that's including doing all the related
research and catching up with the linear algebra involved if you need to.

------
gnur
I've started a simular project myself recently, I check on various parameters
(reading level score, words per sentence, syllables per word, sentences per
paragraph, average word length, average syllable count) and calculate the
distance between 2 texts / authors using a simple euclidean distance.

I started out with the code provided on
[https://github.com/mac389/ToxTweet/blob/master/textanalyzer....](https://github.com/mac389/ToxTweet/blob/master/textanalyzer.py)
I use it in a private project, but the results are promising!

------
pdpd
I wonder if this is how the authors of Truecrypt where identified. I remember
reading something similar about this in regards to coding style.

I am sure the Truecrypt authors contributed to more than one project.

------
MalcolmDiggs
When I was in college we turned in papers via "Turnitin" which checked for
plagiarism and uniqueness etc.

There's an interesting research paper about their algorithms here:
[https://www.cs.auckland.ac.nz/courses/compsci725s2c/archive/...](https://www.cs.auckland.ac.nz/courses/compsci725s2c/archive/termpapers/jrotzky.pdf)

And if you search for "Turnitin Plagiarism Algorithm" I'm sure you'll find a
few more resources.

~~~
apendleton
Plagiarism detection is a somewhat different problem, in that it's looking for
specific common text rather than just stylistic choices. Usually it's just
looking for high percentages of overlapping ngrams between a test document and
documents in a corpus, but two different documents written by the same person
wouldn't test positive.

------
nodelessness
This one time JK Rowling was found out writing by a pseudonym[1] using this
program:

[http://evllabs.com/jgaap/w/index.php/Main_Page](http://evllabs.com/jgaap/w/index.php/Main_Page)

[1] [http://blogs.wsj.com/speakeasy/2013/07/16/the-science-
that-u...](http://blogs.wsj.com/speakeasy/2013/07/16/the-science-that-
uncovered-j-k-rowlings-literary-hocus-pocus/)

~~~
lucb1e
If I remember correctly, someone actually talked about it and they claimed
using this program to cover for the person who leaked. But I may be wrong.

Edit:
[https://en.wikipedia.org/wiki/The_Cuckoo%27s_Calling#Authors...](https://en.wikipedia.org/wiki/The_Cuckoo%27s_Calling#Authorship)

> However, it was later reported that Rowling's authorship was leaked to a
> Times reporter via Twitter by the friend of the wife of a lawyer at Russells
> Solicitors, who had worked for Rowling. The firm has since apologised[29]
> and made a "substantial charitable donation" to the Soldiers' Charity as a
> result of legal action brought by Rowling.[30]

------
MasterScrat
JGAAP is pretty awesome, it has both Java API and GUI:
[https://github.com/evllabs/JGAAP](https://github.com/evllabs/JGAAP)

JStylo that was already mentioned is based on JGAAP. You have some more here:
[http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_ar...](http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_are_out_there.3F)

------
gull
Check this out:
[http://www.secretlifeofpronouns.com/exercises.php](http://www.secretlifeofpronouns.com/exercises.php)

~~~
vixsomnis
I did the bottle exercise. Very interesting.

I see the term I was looking for was LSM (language style matching). That
should help me do more research.

Thanks for the link.

~~~
gull
Here's a project idea you are free to steal: write a browser plugin that
analyzes HN comments and shows up something on the screen. Like "honest",
"deceptive", "gratuitously negative".

Brownie points for HN admins if they add text analysis as feedback when we
press submit.

I'm tired of signing up with a brand new HN account every few months to cover
my tracks after embarrassing myself with less than noble posts.

~~~
vixsomnis
I'm less interested in analyzing the meaning of the text than the privacy
implications for identification. The way I see it, textual fingerprinting is
similar to facial recognition in that it's next to impossible to avoid in
public spaces (walking into a subway, using Facebook's app, posting on HN,
etc.).

Maybe a style obfuscator. It'd be easier than trying to recognize emotional
content.

------
amazing_jose
The results could be horrible, but can imagine a simple technique for hiding
all those clues. Just send the text to google translate, translate it to an
intermediate language and the back to the original one. I can warranty an
excellent t rinse and clean. Change the intermediate language and you will
change the features of the final text. Of course, you risk horrible semantic
changes in the final text ;)

UPDATE: fix typos.

~~~
vixsomnis
Took only a minute to try:

English -> Filipino (Tagalog) -> Chinese-simplified (Mandarin?) -> English

    
    
        I remember reading an article about a year ago (NSA) to identify the
        user, based on how they are written, vocabulary, spelling errors,
        grammar, language, and so on.
    
        It is interesting to me, because it is difficult to change the written
        and spoken word in use. It can be estimated that there are between two
        characters similar amount of help.
    
        Recently I can think of now is to check plagiarism (used in schools and
        universities, for example) is proprietary algorithm.
    
        Are there any public this algorithm? I can find out more information?
        (Academic journals?) I just DDGing wrong search terms? 
    

I have to say, this is a great idea. There was some information lost in
transit, but most of my thoughts came through (albeit broken). It's probably
worse since I used multiple intermediaries and Mandarin doesn't map onto
English (or vice-versa) in grammar or vocabulary.

edit: A site for this exists.
[http://ackuna.com/badtranslator](http://ackuna.com/badtranslator)

~~~
tokenizerrr
Of course you're sending all this information to Google now. Are there any
offline translators that are advanced enough to be used for something like
this? I imagine most just naively map words 1:1 which wouldn't do much good
here.

~~~
giancarlostoro
I was wondering the same. Oddly enough, on your phone you can use Google
Translate offline, they let you download language packs. Not sure if they ever
"phone home".

------
CSDude
There is a public domain fingerprinting tool, that is used in MOSS (Measure Of
Software Similarity). You can get the ideas from there.

[http://theory.stanford.edu/~aiken/publications/papers/sigmod...](http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf)

------
bolomega10000
A simple one is based on analysing stop words. I guess you could do vector
similarity of stop word relative frequency. You could try additional features
such as word bigrams and trigrams and contain stop words. In other words,
things like, "all the words the author uses that commonly surround 'of'" to
select on stop word containing common phrases.

There is something about the stop word use pattern that makes them harder to
forge.

I've never tried this and I don't know much more about it than that, so I
strongly suggest you also find papers that treat authorship attribution by
stop words.

~~~
pimlottc
What's a "stop word"? Last word in a sentence?

~~~
apendleton
No, they're usually function words that are in most documents and, at least
looking at a document as a bag of words, consequently aren't very analytically
useful. Think "the," "of," "and," etc.

------
fauxfauxpas
somewhat related - excerpt from Cryptonomicon - The percussionist stands up.
"Every radio operator has a distinctive style of keying—we call it his fist.
With a bit of practice, our Y Service people can recognize different German
operators by their fists—we can tell when one of them has been transferred to
a different unit, for example."

and this article by Schneier - Identifying People By Their Writing Style -
[https://www.schneier.com/blog/archives/2011/08/identifying_p...](https://www.schneier.com/blog/archives/2011/08/identifying_peo_2.html)

------
bane
Here's a quick one:

1) tokenize each text into a different bag(set) of words.

2) Compute the Jaccard index[1] using the two sets.

Here's another

1) tokenize each text into a multi-bag(set) of words, keeping track of token
frequency

2) keeping the token frequency, order the sets into lists

3) map the lists of words onto an n-dimensional space (where n is say...all of
the words into the two documents) as vectors

4) compute the cosine similarity [2]

Here's another:

1) tokenize the texts into two bags of words

2) compute the set difference going both ways.

3) does either difference contain discriminator tokens that rule it out as
being from that person

4 (optional)): extend to 2-3-n-grams

Here's another (a variant of the one above):

1) compute 1-2-3-n-grams from one of the texts

2) insert the n-grams into a set

3) compute the same for the second document and test for set membership

4) compute the number of total n-grams from your second document

5) compute (non-in-set/total-n-grams) * 100 to yield a "uniqueness" measure

6) determine if the second document is "unique" enough

And another:

1) assuming you have a sample corpus from a writer and want to know if a new
text belongs in that corpus

2) follow the method above but for step #1 and 2 do it with the entire
reference corpus

And another:

1) produce an ontology of discriminator terms and categories unique to the
writer

2) use an (named entity recognition) NER tool of some kind to find those terms
in each document

3) use the set of found terms as an alternative to a bag of words for the
Jaccard or Vector models above

You may need to play with stopword list removal, tokenization schemes and
n-gram windows (for example, omitting 1-grams might focus the analysis on
phrase usage vs. vocabulary usage)

1 -
[https://en.wikipedia.org/wiki/Jaccard_index](https://en.wikipedia.org/wiki/Jaccard_index)

2 -
[https://en.wikipedia.org/wiki/Cosine_similarity](https://en.wikipedia.org/wiki/Cosine_similarity)

~~~
ai_maker
This is an implementation in Python for the second option, but instead of word
frequencies it uses word presence (a binary vector in the end):

[https://github.com/atrilla/vsmpy](https://github.com/atrilla/vsmpy)

------
espe
afaik, stylo
([https://sites.google.com/site/computationalstylistics/stylo](https://sites.google.com/site/computationalstylistics/stylo))
is the academic go-to solution. it is even sporting a gui

------
wodenokoto
You want to look for 'author attribution' as your keyword.

There are 2 main ways for assessing author attribution. One is through
stylistic markers, where you look for a set of predefined features. The is
average length per paragraph, or the number of times 'whenever' is used. This
is highly language dependant.

The other way is through character n-gram analysis. You chose for which N you
want to harvest N-grams and your author profile is the frequency of top 2000
n-grams and you compare this profile with a documents top 2000 n-grams and the
profile with the shortest distance is your match.

Robert Layton has a tutorial and some code on N-gram attribution on Github:

* [https://github.com/robertlayton/authorship_tutorials](https://github.com/robertlayton/authorship_tutorials)

* [https://github.com/robertlayton/author-detection](https://github.com/robertlayton/author-detection)

And here's a list of papers I've reviewed while doing a similar project.

[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni.
Gender, genre, and

writing style in formal written texts.

23(3):321–346, 2003.

[2] John F Burrows. ‘an ocean where each kind...’: Statistical analysis and
some major determinants

of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.

[3] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and
Sokratis Katsikas. Source

code author identification based on n-gram author profiles. In Artificial
Intelligence Applica- tions and Innovations, pages 508–515. Springer, 2006.

[4] Sheena Gardner and Hilary Nesi. A classification of genre families in
university student writing.

Applied linguistics, 34(1):25–52, 2013.

[6] John Houvardas and Efstathios Stamatatos. N-gram feature selection for
authorship identifica- tion. In Artificial Intelligence: Methodology, Systems,
and Applications, pages 77–86. Springer,

2006.

[7] Patrick Juola. Authorship attribution. Foundations and Trends in
information Retrieval,

1(3):233–334, 2006.

[8] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based
author profiles

for authorship attribution. In Proceedings of the conference pacific
association for computational

linguistics, PACLING, volume 3, pages 255–264, 2003.

[9] Maarten Lambers and Cor J Veenman. Forensic authorship attribution using
compression dis- tances to prototypes. In Computational Forensics, pages
13–24. Springer, 2009.

[11] Fiona J Tweedie and R Harald Baayen. How variable may a constant be?
measures of lexical

richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.

[12] Cor J Veenman and Zhenshi Li. Authorship verification with compression
features.

[13] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework for
authorship identifi-

cation of online messages: Writing-style features and classification
techniques. Journal of the

American Society for Information Science and Technology, 57(3):378–393, 2006.

