
A simple one is based on analysing stop words. I guess you could do vector similarity of stop word relative frequency. You could try additional features such as word bigrams and trigrams that contain stop words. In other words, things like "all the words the author uses that commonly surround 'of'", to select on common phrases containing stop words.
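A rough sketch of that idea in Python, in case it helps. The stop word list here is a tiny made-up sample, and the function names (stop_word_profile, cosine_similarity, stop_word_bigrams) are just illustrative, not from any library. A real attempt would want a much longer list and far more text per author.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists have hundreds of entries).
STOP_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it", "is", "was"]
STOP_SET = set(STOP_WORDS)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stop_word_profile(text):
    """Relative frequency of each stop word, in the fixed order of STOP_WORDS."""
    words = tokenize(text)
    counts = Counter(w for w in words if w in STOP_SET)
    total = len(words) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in STOP_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def stop_word_bigrams(text):
    """Count adjacent word pairs in which at least one word is a stop word."""
    words = tokenize(text)
    return Counter(
        (a, b) for a, b in zip(words, words[1:]) if a in STOP_SET or b in STOP_SET
    )
```

You'd compute stop_word_profile for a text of known authorship and for the disputed text, then compare them with cosine_similarity. The bigram counts could be turned into a second feature vector the same way.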

There is something about an author's pattern of stop word use that makes it harder to forge.

I've never tried this and I don't know much more about it than that, so I strongly suggest you also find papers that treat authorship attribution by stop words.



What's a "stop word"? Last word in a sentence?


No, they're usually function words that are in most documents and, at least looking at a document as a bag of words, consequently aren't very analytically useful. Think "the," "of," "and," etc.



