But I do notice this in the JSAN tutorial pdf:
"Building a Better Corpus with Amazon Mechanical Turk"
"• Only 45 of 101 of submissions are usable!"
"• 45 Accepted Submissions."
"• This corpus is large, diverse, and unique."
Forty-five 7000-word submissions is large and diverse?
I don't know why this branch of NLP/machine learning has been fairly okay with small corpora. When I did my last project, we used 100 authors, which was a lot compared to most of the literature at the time.
On the flip side, the only authorship attribution (AA) papers I've seen that use a large corpus are projects that scrape blogs, and I think it is safe to say that the traditional stylometric features are not optimized for that kind of language.