You want to look for 'authorship attribution' as your keyword.

There are two main ways of attributing authorship. One is through stylistic markers, where you look for a set of predefined features, such as the average paragraph length or the number of times 'whenever' is used. This is highly language dependent.
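To make the stylistic-marker idea concrete, here is a minimal Python sketch of the two example features mentioned above. The function name, the blank-line paragraph rule and the crude word regex are my own assumptions for illustration; real systems use much larger, hand-picked feature sets.

```python
import re


def stylistic_markers(text, marker_words=("whenever",)):
    """Return a small dict of hand-picked style features for one text."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a "word" is a run of ASCII letters or apostrophes (English only).
    words = re.findall(r"[a-z']+", text.lower())
    features = {
        "avg_words_per_paragraph": len(words) / len(paragraphs) if paragraphs else 0.0,
    }
    # Raw counts of each predefined marker word.
    for word in marker_words:
        features[f"count_{word}"] = words.count(word)
    return features
```

The hard-coded word list and the English-only tokenizer are exactly why this approach depends so heavily on the language.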

The other way is through character n-gram analysis. You choose the N for which you want to harvest n-grams, and an author's profile is the frequency of their top 2000 n-grams. You then compare each author profile with the document's top 2000 n-grams, and the profile with the shortest distance is your match.
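A rough Python sketch of that n-gram procedure, under some assumptions: the function names are made up, N defaults to 3, and the distance is a relative-difference measure in the spirit of the profile dissimilarity in [8]. The description above doesn't fix a particular distance, so treat that choice as one option among several.

```python
from collections import Counter


def ngram_profile(text, n=3, top=2000):
    """Map each of the `top` most common character n-grams to its relative frequency."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.most_common(top)}


def profile_distance(p, q):
    """Sum of squared relative differences over the union of both profiles (assumed measure)."""
    distance = 0.0
    for gram in set(p) | set(q):
        a, b = p.get(gram, 0.0), q.get(gram, 0.0)
        # a + b > 0 because the n-gram occurs in at least one profile.
        distance += ((a - b) / ((a + b) / 2)) ** 2
    return distance


def attribute(document, author_texts, n=3, top=2000):
    """Return the author whose profile is closest to the document's profile."""
    doc_profile = ngram_profile(document, n, top)
    return min(
        author_texts,
        key=lambda author: profile_distance(
            ngram_profile(author_texts[author], n, top), doc_profile
        ),
    )
```

Here `author_texts` is a dict from author name to all known text by that author, so `attribute(doc, author_texts)` returns the best-matching name. In practice you would try several values of N and profile sizes on held-out texts.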

Robert Layton has a tutorial and some code on n-gram attribution on GitHub:

* https://github.com/robertlayton/authorship_tutorials

* https://github.com/robertlayton/author-detection

And here's a list of papers I've reviewed while doing a similar project.

[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Gender, genre, and writing style in formal written texts. 23(3):321–346, 2003.

[2] John F Burrows. 'An ocean where each kind...': Statistical analysis and some major determinants of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.

[3] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas. Source code author identification based on n-gram author profiles. In Artificial Intelligence Applications and Innovations, pages 508–515. Springer, 2006.

[4] Sheena Gardner and Hilary Nesi. A classification of genre families in university student writing. Applied Linguistics, 34(1):25–52, 2013.

[6] John Houvardas and Efstathios Stamatatos. N-gram feature selection for authorship identification. In Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer, 2006.

[7] Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233–334, 2006.

[8] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics, PACLING, volume 3, pages 255–264, 2003.

[9] Maarten Lambers and Cor J Veenman. Forensic authorship attribution using compression distances to prototypes. In Computational Forensics, pages 13–24. Springer, 2009.

[11] Fiona J Tweedie and R Harald Baayen. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.

[12] Cor J Veenman and Zhenshi Li. Authorship verification with compression features.

[13] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393, 2006.



