

Anonymouth: Authorship anonymization framework - programmernews3
https://psal.cs.drexel.edu/index.php/JStylo-Anonymouth

======
mcguire
I've recently been playing with stylometry for a project completely unrelated
to authorship attribution, and this is very interesting. Thanks!

But I do notice this in the JSAN tutorial pdf[1]:

"_Building a Better Corpus with Amazon Mechanical Turk_

_• Only 45 of 101 of submissions are usable!_

_• 45 Accepted Submissions._"

[...]

"_• This corpus is large, diverse, and unique._"

(Page 14.)

Are 45 submissions of 7000 words each really large and diverse?

[1]
[http://events.ccc.de/congress/2011/Fahrplan/attachments/2019...](http://events.ccc.de/congress/2011/Fahrplan/attachments/2019_28C3-authorship.pdf)

~~~
wodenokoto
Yes and no. I've seen published authorship attribution (AA) papers that used
13 submissions, so compared to that it is big.

I don't know why this branch of NLP/machine learning has been fairly okay with
small corpora. When I did my last project, we used 100 authors, which was a lot
compared to most of the literature at the time.

On the flip side, the only AA papers I've seen that use a large corpus are
projects that scrape blogs, and I think it is safe to say that the traditional
stylometric features are not optimized for that kind of language.
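
For anyone wondering what "traditional stylometric features" roughly means here, below is a minimal sketch in Python. It is not JStylo's or Anonymouth's actual feature set: the ten-word `FUNCTION_WORDS` list and the `stylometric_features` helper are made up for illustration, and real feature sets are far larger. It just shows the kind of surface statistics (function-word frequencies, average word length, type-token ratio) that tend to assume reasonably clean, edited prose.

```python
import re
from collections import Counter

# Hypothetical, tiny function-word list for illustration only;
# real stylometric feature sets use hundreds of such words.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def stylometric_features(text):
    """Return a dict of simple surface-level style features for one text."""
    # Lowercase and keep only alphabetic tokens (apostrophes allowed).
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")

    counts = Counter(words)
    total = len(words)

    # Relative frequency of each function word.
    features = {f"fw_{w}": counts[w] / total for w in FUNCTION_WORDS}
    # Mean token length in characters.
    features["avg_word_len"] = sum(len(w) for w in words) / total
    # Vocabulary richness: distinct tokens divided by all tokens.
    features["type_token_ratio"] = len(counts) / total
    return features


if __name__ == "__main__":
    sample = "The cat sat on the mat and the dog sat too."
    for name, value in stylometric_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Blog text full of slang, emoticons, and nonstandard spelling would skew exactly these kinds of counts, which is the concern above.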

------
randomname2
Any examples of this in action?

