Anonymouth: Authorship anonymization framework (drexel.edu)
34 points by programmernews3 on June 17, 2015 | 3 comments

I've recently been playing with stylometry for a different project (completely unrelated to authorship attribution), and this is very interesting. Thanks!

But I do notice this in the JSAN tutorial pdf[1]:

"Building a Better Corpus with Amazon Mechanical Turk

"• Only 45 of 101 of submissions are usable!

"• 45 Accepted Submissions."


"• This corpus is large, diverse, and unique."

(Page 14.)

Are 45 7000-word submissions really large and diverse?

[1] http://events.ccc.de/congress/2011/Fahrplan/attachments/2019...

Yes and no. I've seen published authorship attribution (AA) papers that used 13 submissions, so compared to that it is big.

I don't know why this branch of NLP/machine learning has been fairly okay with small corpora. When I did my last project we used 100 authors, which was a lot compared to most of the literature at the time.

On the flip side, the only AA papers I've seen that use a large corpus are projects that scrape blogs, and I think it is safe to say that the traditional stylometric features are not optimized for that kind of language.

Any examples of this in action?
