Hacker News
Anonymouth: Authorship anonymization framework (drexel.edu)
34 points by programmernews3 on June 17, 2015 | 3 comments



I've recently been playing with stylometry for a different project (completely unrelated to authorship attribution), and this is very interesting. Thanks!

But I did notice this in the JSAN tutorial PDF [1]:

"Building a Better Corpus with Amazon Mechanical Turk

"• Only 45 of 101 of submissions are usable!

"• 45 Accepted Submissions."

[...]

"• This corpus is large, diverse, and unique."

(Page 14.)

Are 45 submissions of 7000 words each really "large and diverse"?

[1] http://events.ccc.de/congress/2011/Fahrplan/attachments/2019...


Yes and no. I've seen published authorship attribution (AA) papers that used 13 submissions, so compared to that it is big.

I don't know why this branch of NLP/machine learning has been fairly okay with small corpora. When I did my last project, we used 100 authors, which was a lot compared to most of the literature at the time.

On the flip side, the only AA papers I've seen that use a large corpus are projects that scrape blogs, and I think it is safe to say that the traditional stylometric features are not optimized for that kind of language.
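
For anyone unfamiliar with the term, "traditional stylometric features" usually means things like function-word frequencies, average word and sentence length, and vocabulary richness. Here is a minimal Python sketch of the idea. The tiny function-word list and the helper name `stylometric_features` are assumptions made up for illustration; they are not taken from Anonymouth, JStylo, or any particular paper.

```python
import re
from collections import Counter

# Assumption: a tiny hand-picked function-word list, for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is", "was")


def stylometric_features(text):
    """Return a dict of simple style features for one document (hypothetical helper)."""
    # Lowercased word tokens; apostrophes are kept so contractions stay whole.
    words = re.findall(r"[a-z']+", text.lower())
    # Naive sentence split on runs of terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    features = {
        "avg_word_length": sum(len(w) for w in words) / total,
        "avg_sentence_length": total / max(len(sentences), 1),
        "type_token_ratio": len(counts) / total,
    }
    # Relative frequency of each function word.
    for fw in FUNCTION_WORDS:
        features["freq_" + fw] = counts[fw] / total
    return features


sample = "The cat sat on the mat. It was the best of mats!"
features = stylometric_features(sample)
```

Features like these lean on fairly standard spelling and punctuation, which may be part of why they carry over poorly to informal blog text.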


Any examples of this in action?



