Hacker News new | past | comments | ask | show | jobs | submit login
Ask HN: Algorithms for text fingerprinting?
92 points by vixsomnis on June 15, 2015 | hide | past | favorite | 41 comments
I remember reading an article a year or so ago about (the NSA) identifying users based on how they write: vocabulary, spelling mistakes, grammar, dialect, and so on.

This is interesting to me because it is extremely difficult to change the vocabulary I use in writing and speaking. Being able to estimate the amount of similarity between two pieces of text would be useful.

The closest I can think of right now would be the proprietary algorithms used to check for plagiarism (for schools and universities, for instance).

Are there any publicly available algorithms for this? Where can I go to learn more? (Academic journals?) Am I just DDGing the wrong search terms?




Figured I'd chime in here since I developed an algorithm recently that could be applied to this problem with some basic ML.

Basically the first step would be shingling the text (choosing a sampling domain) and generating a MinHash struct (computationally cheap) which can then be used to find the "similarity" between sets, or, the "Jaccard Index."

If you're clever about this, you can use HyperLogLogs to encode these MinHash structs gaining a great deal of speed with a marginal error rate, all while allowing for arbitrary N-levels of intersection.

If you're looking to build a model to analyze two (or N) text bodies for stylometric similarities, I'd approach the problem in two steps:

1) Minimize the relevant input text.

- Use a bernoulli/categorical distribution to weight words according to uniqueness--NLP and sentiment extraction techniques may also help

- Design a markov process to represent more complex phrasing patterns for the text as a whole

- Filter by a variable threshold to minimize the resulting set of shingles/bins/"interesting nodes" into a computationally-manageable #

2) Use an efficient MinHash intersection to compute a similarity vector (0-1) for the two texts.

I think given the prevalence of training data (I mean, what's more ubiquitous than the written word...) you could probably tune this to a reasonable accuracy and efficient complexity.

Just a 5m thought exercise, but if anyone else has ideas I'd be curious as well :)


Could you describe in a little more detail what you mean by this sentence:

> "... use HyperLogLogs to encode these MinHash structs gaining a great deal of speed with a marginal error rate, all while allowing for arbitrary N-levels of intersection"

Thanks


Sure thing.

Let's say we have two sets, and we're trying to find out how similar they are:

    setOne = ['the','brown','fox','jumped','a','log']
    setTwo = ['the','quick','brown','log','jumped','over','the','fox']
You could use an array intersection when its small, but if you want to do this efficiently at scale you need to take advantage of probabilistic data structures.

Let's say you created a HyperLogLog for each set:

    setOne = (bitfield representing setOne)
    setTwo = (bitfield representing setTwo)
Now HyperLogLogs are cool, because you can merge them together w/o losing anything. You can't retrieve the data, but you can efficiently check if a value exists inside. You can have a false positive, but never a false negative.

You might first try simple combinatorics (a la, similarity ~= |A or B| / |A ∪ B|) however this can get hairy depending upon the representation of the HyperLogLog (sparse/dense) and its respective cardinality.

Eventually you realize you can't accept an exponentially-compounding error rate, but still need the raw efficiency, and thus you sacrifice by doubling your storage cost.

Now instead of just the initial sets, you'd also build a parallel MinHash struct:

    setOne_minhashes = [<int>, <int>, <int>]
    setTwo_minhashes = [<int>, <int>, <int>]

    setOne = (bitfield representing setOne)
    setTwo = (bitfield representing setTwo)
    setOneMinHash = (bitfield representation of setOne minhashes)
    setTwoMinHash = (bitfield representation of setTwo minhashes)
I won't go into excessive detail about the minhash algorithm itself, but essentially it provides a way to sample a large set of values by selecting/retaining the smallest output hashes. What that means is, you can then intersect the minhash bitfields as many times over as you like, and extract a predictably accurate similarity index in linear time with definable confidence bounds.


I don't have a background in ML (I'm still somewhat early in my CS Bachelor's degree, and I'm not focusing on ML anyway), but this is very interesting, regardless that I had to look up nearly every term you used.

I was initially thinking a directed weighted graph might work well here, but I'm assuming that would scale terribly relative to something like this.


The relevant search term is "stylometry". One particular paper I remember is from Dawn Song's group at Berkeley a couple years back:

http://www.cs.berkeley.edu/~dawnsong/papers/2012%20On%20the%...

There's a lot of public work on the topic, but it looks like right now the best place to look is still in academic papers (I don't know of any open source libraries, for example).


I'm not sure whether it's appropriate to mention this or not, but I was just looking up the authors of that paper to see what they're doing now. Very sadly, I found out that one of the authors, Emil Stefanov, died last year at the age of 26.

http://www.rememberingemil.org


Thank you for posting this (it was quite a surprise to see his name here).

I was a close friend of Emil for a long time. Before he passed, I had gotten busy with work and hadn't spoken to him in a while. I saw some of our mutual friends the weekend before it happened and had planned to call him. Really wish I had made that call sooner.


JGAAP [1] may be useful (AGPLv3 license):

"JGAAP is intended to tackle two different problems, firstly to allow people unfamiliar with machine learning and quantitative analysis the ability to use cutting edge techniques on their text based stylometry / textometry problems, and secondly to act as a framework for testing and comparing the effectiveness of different analytic techniques' performance on text analysis quickly and easily."

[1]: http://evllabs.com/jgaap/w/

Looks like there are some other recommendations at http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_ar...


That paper is exactly the sort I was looking for. Thanks.

I'm surprised there isn't any open source effort in this area yet (I couldn't find any either). It's just as important as TOR and other anonymity services, since it affects not just passively consuming information, but actively creating it.


If you're interested in defeating stylometry, you should check out Sadia Afroz's PhD thesis as well:

https://www.eecs.berkeley.edu/~sa499/thesis.pdf



Jstylo might be what you're looking for. https://github.com/psal/jstylo

The same group also has created a text obfuscation tool called anonymouth that helps you obfuscate your word choices, but it has still yet to be released. https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth


I may be late to the party, but finally my time to shine! As has been mentioned earlier, the field gou are looking for is called stylometry, and has almost a century of history behind it, and also the field if my thesis. After looking at what everyone has been saying, I just felt like copying and pasting my 20 page literature review here, but I'd recommend you look at narayanan at all (2012) paper (internet scale authorship attribution). The algorithm it uses is not particularly complex and would take you a week, tops, to implement if you put in a few hours a day, and that's including doing all the related research and catching up with the linear algebra involved if you need to.


I've started a simular project myself recently, I check on various parameters (reading level score, words per sentence, syllables per word, sentences per paragraph, average word length, average syllable count) and calculate the distance between 2 texts / authors using a simple euclidean distance.

I started out with the code provided on https://github.com/mac389/ToxTweet/blob/master/textanalyzer.... I use it in a private project, but the results are promising!


I wonder if this is how the authors of Truecrypt where identified. I remember reading something similar about this in regards to coding style.

I am sure the Truecrypt authors contributed to more than one project.


When I was in college we turned in papers via "Turnitin" which checked for plagiarism and uniqueness etc.

There's an interesting research paper about their algorithms here: https://www.cs.auckland.ac.nz/courses/compsci725s2c/archive/...

And if you search for "Turnitin Plagiarism Algorithm" I'm sure you'll find a few more resources.


Plagiarism detection is a somewhat different problem, in that it's looking for specific common text rather than just stylistic choices. Usually it's just looking for high percentages of overlapping ngrams between a test document and documents in a corpus, but two different documents written by the same person wouldn't test positive.


This one time JK Rowling was found out writing by a pseudonym[1] using this program:

http://evllabs.com/jgaap/w/index.php/Main_Page

[1] http://blogs.wsj.com/speakeasy/2013/07/16/the-science-that-u...


If I remember correctly, someone actually talked about it and they claimed using this program to cover for the person who leaked. But I may be wrong.

Edit: https://en.wikipedia.org/wiki/The_Cuckoo%27s_Calling#Authors...

> However, it was later reported that Rowling's authorship was leaked to a Times reporter via Twitter by the friend of the wife of a lawyer at Russells Solicitors, who had worked for Rowling. The firm has since apologised[29] and made a "substantial charitable donation" to the Soldiers' Charity as a result of legal action brought by Rowling.[30]


JGAAP is pretty awesome, it has both Java API and GUI: https://github.com/evllabs/JGAAP

JStylo that was already mentioned is based on JGAAP. You have some more here: http://evllabs.com/jgaap/w/index.php/FAQ#What_other_tools_ar...



I did the bottle exercise. Very interesting.

I see the term I was looking for was LSM (language style matching). That should help me do more research.

Thanks for the link.


Here's a project idea you are free to steal: write a browser plugin that analyzes HN comments and shows up something on the screen. Like "honest", "deceptive", "gratuitously negative".

Brownie points for HN admins if they add text analysis as feedback when we press submit.

I'm tired of signing up with a brand new HN account every few months to cover my tracks after embarrassing myself with less than noble posts.


I'm less interested in analyzing the meaning of the text than the privacy implications for identification. The way I see it, textual fingerprinting is similar to facial recognition in that it's next to impossible to avoid in public spaces (walking into a subway, using Facebook's app, posting on HN, etc.).

Maybe a style obfuscator. It'd be easier than trying to recognize emotional content.


The results could be horrible, but can imagine a simple technique for hiding all those clues. Just send the text to google translate, translate it to an intermediate language and the back to the original one. I can warranty an excellent t rinse and clean. Change the intermediate language and you will change the features of the final text. Of course, you risk horrible semantic changes in the final text ;)

UPDATE: fix typos.


There's a great paper that investigated this technique: https://www.eecs.berkeley.edu/~sa499/papers/adversarial_styl...

From the conclusions: "Translation with widely available machine translation services does not appear to be a viable mode of circumvention. Our evaluation did not demonstrate sufficient anonymization and the translated document has, at best, questionable grammar and quality."


Has there been any other work on anti-fingerprinting? Seems like just taking the fingerprinting algorithms and applying transforms that randomize the features they look for would go a long way.


Took only a minute to try:

English -> Filipino (Tagalog) -> Chinese-simplified (Mandarin?) -> English

    I remember reading an article about a year ago (NSA) to identify the
    user, based on how they are written, vocabulary, spelling errors,
    grammar, language, and so on.

    It is interesting to me, because it is difficult to change the written
    and spoken word in use. It can be estimated that there are between two
    characters similar amount of help.

    Recently I can think of now is to check plagiarism (used in schools and
    universities, for example) is proprietary algorithm.

    Are there any public this algorithm? I can find out more information?
    (Academic journals?) I just DDGing wrong search terms? 
I have to say, this is a great idea. There was some information lost in transit, but most of my thoughts came through (albeit broken). It's probably worse since I used multiple intermediaries and Mandarin doesn't map onto English (or vice-versa) in grammar or vocabulary.

edit: A site for this exists. http://ackuna.com/badtranslator


Google Translate almost holds it together...until the very end...

Original:

  In the beginning God created the heaven and the earth. And the earth was without form, and void; and darkness was upon the face of the deep. And the Spirit of God moved upon the face of the waters.
[a half dozen languages later...]

Result:

  God created heaven and earth in the beginning. And the earth was formless and empty, and darkness was on the face of the abyss, man. Answer the Spirit of God on the water surface.


Of course you're sending all this information to Google now. Are there any offline translators that are advanced enough to be used for something like this? I imagine most just naively map words 1:1 which wouldn't do much good here.


I was wondering the same. Oddly enough, on your phone you can use Google Translate offline, they let you download language packs. Not sure if they ever "phone home".


Google translate to Tagalog is generally not great.

If you chain through European languages you barely lose anything. Doing English->Dutch->German->English is a good set to use.


There is a public domain fingerprinting tool, that is used in MOSS (Measure Of Software Similarity). You can get the ideas from there.

http://theory.stanford.edu/~aiken/publications/papers/sigmod...


A simple one is based on analysing stop words. I guess you could do vector similarity of stop word relative frequency. You could try additional features such as word bigrams and trigrams and contain stop words. In other words, things like, "all the words the author uses that commonly surround 'of'" to select on stop word containing common phrases.

There is something about the stop word use pattern that makes them harder to forge.

I've never tried this and I don't know much more about it than that, so I strongly suggest you also find papers that treat authorship attribution by stop words.


What's a "stop word"? Last word in a sentence?


No, they're usually function words that are in most documents and, at least looking at a document as a bag of words, consequently aren't very analytically useful. Think "the," "of," "and," etc.


somewhat related - excerpt from Cryptonomicon - The percussionist stands up. "Every radio operator has a distinctive style of keying—we call it his fist. With a bit of practice, our Y Service people can recognize different German operators by their fists—we can tell when one of them has been transferred to a different unit, for example."

and this article by Schneier - Identifying People By Their Writing Style - https://www.schneier.com/blog/archives/2011/08/identifying_p...


Here's a quick one:

1) tokenize each text into a different bag(set) of words.

2) Compute the Jaccard index[1] using the two sets.

Here's another

1) tokenize each text into a multi-bag(set) of words, keeping track of token frequency

2) keeping the token frequency, order the sets into lists

3) map the lists of words onto an n-dimensional space (where n is say...all of the words into the two documents) as vectors

4) compute the cosine similarity [2]

Here's another:

1) tokenize the texts into two bags of words

2) compute the set difference going both ways.

3) does either difference contain discriminator tokens that rule it out as being from that person

4 (optional)): extend to 2-3-n-grams

Here's another (a variant of the one above):

1) compute 1-2-3-n-grams from one of the texts

2) insert the n-grams into a set

3) compute the same for the second document and test for set membership

4) compute the number of total n-grams from your second document

5) compute (non-in-set/total-n-grams) * 100 to yield a "uniqueness" measure

6) determine if the second document is "unique" enough

And another:

1) assuming you have a sample corpus from a writer and want to know if a new text belongs in that corpus

2) follow the method above but for step #1 and 2 do it with the entire reference corpus

And another:

1) produce an ontology of discriminator terms and categories unique to the writer

2) use an (named entity recognition) NER tool of some kind to find those terms in each document

3) use the set of found terms as an alternative to a bag of words for the Jaccard or Vector models above

You may need to play with stopword list removal, tokenization schemes and n-gram windows (for example, omitting 1-grams might focus the analysis on phrase usage vs. vocabulary usage)

1 - https://en.wikipedia.org/wiki/Jaccard_index

2 - https://en.wikipedia.org/wiki/Cosine_similarity


This is an implementation in Python for the second option, but instead of word frequencies it uses word presence (a binary vector in the end):

https://github.com/atrilla/vsmpy


afaik, stylo (https://sites.google.com/site/computationalstylistics/stylo) is the academic go-to solution. it is even sporting a gui


You want to look for 'author attribution' as your keyword.

There are 2 main ways for assessing author attribution. One is through stylistic markers, where you look for a set of predefined features. The is average length per paragraph, or the number of times 'whenever' is used. This is highly language dependant.

The other way is through character n-gram analysis. You chose for which N you want to harvest N-grams and your author profile is the frequency of top 2000 n-grams and you compare this profile with a documents top 2000 n-grams and the profile with the shortest distance is your match.

Robert Layton has a tutorial and some code on N-gram attribution on Github:

* https://github.com/robertlayton/authorship_tutorials

* https://github.com/robertlayton/author-detection

And here's a list of papers I've reviewed while doing a similar project.

[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Gender, genre, and

writing style in formal written texts.

23(3):321–346, 2003.

[2] John F Burrows. ‘an ocean where each kind...’: Statistical analysis and some major determinants

of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.

[3] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas. Source

code author identification based on n-gram author profiles. In Artificial Intelligence Applica- tions and Innovations, pages 508–515. Springer, 2006.

[4] Sheena Gardner and Hilary Nesi. A classification of genre families in university student writing.

Applied linguistics, 34(1):25–52, 2013.

[6] John Houvardas and Efstathios Stamatatos. N-gram feature selection for authorship identifica- tion. In Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer,

2006.

[7] Patrick Juola. Authorship attribution. Foundations and Trends in information Retrieval,

1(3):233–334, 2006.

[8] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles

for authorship attribution. In Proceedings of the conference pacific association for computational

linguistics, PACLING, volume 3, pages 255–264, 2003.

[9] Maarten Lambers and Cor J Veenman. Forensic authorship attribution using compression dis- tances to prototypes. In Computational Forensics, pages 13–24. Springer, 2009.

[11] Fiona J Tweedie and R Harald Baayen. How variable may a constant be? measures of lexical

richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.

[12] Cor J Veenman and Zhenshi Li. Authorship verification with compression features.

[13] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework for authorship identifi-

cation of online messages: Writing-style features and classification techniques. Journal of the

American Society for Information Science and Technology, 57(3):378–393, 2006.




Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: