The keyword you want to search for is 'authorship attribution' (sometimes written 'author attribution').
There are two main ways of assessing authorship attribution. One is through stylistic markers, where you look for a set of predefined features, for example the average paragraph length or the number of times 'whenever' is used. This approach is highly language dependent.
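To make the stylistic-marker idea concrete, here is a minimal sketch that computes the two example features mentioned above. The function name `stylistic_markers`, the blank-line paragraph split, and the per-1000-words normalisation are my own assumptions for illustration, not part of any particular toolkit.

```python
import re

def stylistic_markers(text, marker_words=("whenever",)):
    """Return a small dict of hand-picked style features for one text."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Very rough English tokenisation; this is the language-dependent part.
    words = re.findall(r"[a-z']+", text.lower())
    features = {
        "avg_words_per_paragraph": len(words) / len(paragraphs) if paragraphs else 0.0,
    }
    for w in marker_words:
        # Rate per 1000 words, so texts of different lengths stay comparable.
        features[f"rate_{w}"] = 1000 * words.count(w) / len(words) if words else 0.0
    return features

sample = "Whenever it rains, I read.\n\nI write whenever I can. Whenever."
features = stylistic_markers(sample)
print(features)
```

In practice you would compute a vector like this for every known author and for the disputed text, then feed the vectors to whatever classifier or distance measure you prefer.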
The other way is through character n-gram analysis. You choose the N for which you want to harvest n-grams, and each author's profile is the frequency of their top 2000 n-grams. You then compare every author profile with the top 2000 n-grams of the document in question, and the author whose profile has the shortest distance is your match.
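The n-gram approach can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, `n=3` is an arbitrary default, and the distance is a sum of squared relative frequency differences, modelled on the profile dissimilarity described by Kešelj et al. [8]. Other distance measures would work with the same structure.

```python
from collections import Counter

def ngram_profile(text, n=3, top=2000):
    """Map the `top` most frequent character n-grams to relative frequencies."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.most_common(top)}

def profile_distance(p, q):
    """Sum of squared relative differences over the union of both profiles."""
    dist = 0.0
    for g in set(p) | set(q):
        fp, fq = p.get(g, 0.0), q.get(g, 0.0)
        # fp + fq > 0 because g occurs in at least one of the profiles.
        dist += (2 * (fp - fq) / (fp + fq)) ** 2
    return dist

def attribute(document, author_texts, n=3, top=2000):
    """Return the closest author and the distance to every author profile."""
    doc_profile = ngram_profile(document, n, top)
    distances = {
        author: profile_distance(ngram_profile(text, n, top), doc_profile)
        for author, text in author_texts.items()
    }
    return min(distances, key=distances.get), distances

# Toy data, far too small for real use; it only shows the mechanics.
author_texts = {
    "alice": "the cat sat on the mat and the cat ran",
    "bob": "zzz qqq xxx zzz qqq xxx jjj",
}
best, distances = attribute("the cat sat on the mat", author_texts)
print(best, distances)
```

With real data you would concatenate all known writing of an author into one training text, and you would try several values of N, since the best choice depends on the language and the amount of text.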
Robert Layton has a tutorial and some code on n-gram attribution on GitHub:
* https://github.com/robertlayton/authorship_tutorials
* https://github.com/robertlayton/author-detection
And here's a list of papers I've reviewed while doing a similar project.
[1] Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni. Gender, genre, and
writing style in formal written texts. Text, 23(3):321–346, 2003.
[2] John F Burrows. ‘An ocean where each kind...’: Statistical analysis and some major determinants
of literary style. Computers and the Humanities, 23(4-5):309–321, 1989.
[3] Georgia Frantzeskou, Efstathios Stamatatos, Stefanos Gritzalis, and Sokratis Katsikas. Source
code author identification based on n-gram author profiles. In Artificial Intelligence Applications and Innovations, pages 508–515. Springer, 2006.
[4] Sheena Gardner and Hilary Nesi. A classification of genre families in university student writing.
Applied Linguistics, 34(1):25–52, 2013.
[6] John Houvardas and Efstathios Stamatatos. N-gram feature selection for authorship identification.
In Artificial Intelligence: Methodology, Systems, and Applications, pages 77–86. Springer, 2006.
[7] Patrick Juola. Authorship attribution. Foundations and Trends in Information Retrieval,
1(3):233–334, 2006.
[8] Vlado Kešelj, Fuchun Peng, Nick Cercone, and Calvin Thomas. N-gram-based author profiles
for authorship attribution. In Proceedings of the Conference Pacific Association for Computational
Linguistics, PACLING, volume 3, pages 255–264, 2003.
[9] Maarten Lambers and Cor J Veenman. Forensic authorship attribution using compression distances to prototypes. In Computational Forensics, pages 13–24. Springer, 2009.
[11] Fiona J Tweedie and R Harald Baayen. How variable may a constant be? Measures of lexical
richness in perspective. Computers and the Humanities, 32(5):323–352, 1998.
[12] Cor J Veenman and Zhenshi Li. Authorship verification with compression features.
[13] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework for authorship identification
of online messages: Writing-style features and classification techniques. Journal of the
American Society for Information Science and Technology, 57(3):378–393, 2006.