I remember reading an article a year or so ago about the NSA identifying users based on how they write: vocabulary, spelling mistakes, grammar, dialect, and so on.
This is interesting to me because it is extremely difficult to change the vocabulary I use in writing and speaking. Being able to estimate the amount of similarity between two pieces of text would be useful.
The closest thing I can think of right now would be the proprietary algorithms used to check for plagiarism (at schools and universities, for instance).
Are there any publicly available algorithms for this? Where can I go to learn more? (Academic journals?) Am I just DDGing the wrong search terms?
Basically, the first step would be shingling the text (choosing a sampling domain) and generating a MinHash signature (computationally cheap), which can then be used to estimate the similarity between the shingle sets, i.e. their Jaccard index.
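A minimal sketch of that step in Python, assuming word-level 3-shingles and salted MD5 hashes standing in for the random permutation family (the shingle size, number of hash functions, and example sentences are all arbitrary choices here):

```python
import hashlib

def shingles(text, k=3):
    """Break text into overlapping k-word shingles (the 'sampling domain')."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(shingle_set, num_hashes=128):
    """Keep the minimum of each salted hash; approximates num_hashes random permutations."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard index of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

sig1 = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
sig2 = minhash_signature(shingles("the quick brown fox leaps over the lazy dog"))
print(estimate_jaccard(sig1, sig2))  # should land near the true Jaccard index (0.4 here)
```

The fraction of matching slots converges on the true Jaccard index as the number of hash functions grows.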
If you're clever about this, you can use HyperLogLogs to encode these MinHash signatures, gaining a great deal of speed at the cost of a marginal error rate, all while allowing for arbitrary N-way intersections.
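One way to read that idea (my interpretation, not necessarily what was intended): sketch each shingle set with a HyperLogLog, merge sketches by taking register-wise maxima to get unions, and back out intersections by inclusion-exclusion. A rough from-scratch sketch with simplified bias correction; the register count p=12 and the toy sets are arbitrary:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: estimates set cardinality from register maxima."""

    def __init__(self, p=12):
        self.p = p              # 2**p registers; more registers -> lower error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = int(hashlib.sha1(item.encode("utf-8")).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x & (self.m - 1)                      # low p bits choose a register
        w = x >> self.p                             # remaining 64 - p bits
        rank = (64 - self.p) - w.bit_length() + 1   # leading-zero run length + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:           # small-range (linear counting) correction
            est = self.m * math.log(self.m / zeros)
        return est

    def union(self, other):
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

# Intersection by inclusion-exclusion: |A ∩ B| ≈ |A| + |B| - |A ∪ B|
set_a = {f"shingle-{i}" for i in range(0, 800)}
set_b = {f"shingle-{i}" for i in range(400, 1200)}
a, b = HyperLogLog(), HyperLogLog()
for s in set_a: a.add(s)
for s in set_b: b.add(s)
union_count = a.union(b).count()
intersection = a.count() + b.count() - union_count
print(intersection / union_count)   # true Jaccard here is 400/1200 ≈ 0.33
```

Worth noting that inclusion-exclusion on HLL counts tends to get noisy when the overlap is small, which is why a direct MinHash comparison is usually the better tool when the Jaccard index itself is the target.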
If you're looking to build a model to analyze two (or N) text bodies for stylometric similarities, I'd approach the problem in two steps:
1) Minimize the relevant input text.
- Use a Bernoulli/categorical distribution to weight words according to uniqueness; NLP and sentiment-extraction techniques may also help
- Design a Markov process to represent more complex phrasing patterns for the text as a whole
- Filter by a variable threshold to reduce the resulting set of shingles/bins/"interesting nodes" to a computationally manageable number (see the sketch after this list)
2) Use an efficient MinHash comparison to compute a similarity score (0-1) for each pair of texts.
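For step 1, a toy version of the uniqueness weighting and threshold filtering might look something like this (the IDF-style scoring and the 0.5 cutoff are just stand-ins for whatever distribution and threshold you actually settle on; the Markov-process idea isn't covered here):

```python
import math
from collections import Counter

def uniqueness_weights(documents):
    """IDF-style weight: words that appear in few documents score higher."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc.lower().split()))
    n = len(documents)
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def interesting_words(doc, weights, threshold=0.5):
    """Keep only words whose uniqueness weight clears the threshold,
    shrinking the input to a computationally manageable set."""
    return {w for w in doc.lower().split() if weights.get(w, 0.0) > threshold}

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog sat together",
]
weights = uniqueness_weights(corpus)
print(interesting_words(corpus[0], weights))   # {'mat'} with these toy documents
```

The surviving words (or shingles built from them) then feed straight into the MinHash comparison in step 2.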
I think, given the prevalence of training data (I mean, what's more ubiquitous than the written word...), you could probably tune this to reasonable accuracy at a manageable computational cost.
Just a 5m thought exercise, but if anyone else has ideas I'd be curious as well :)