Author identification by machine learning - ethics?
1 point by Zak on Feb 9, 2010 | 7 comments
I'm working on a personal project involving automated text classification for a variety of purposes, including identifying the author of a sample of text given samples of training text from a very large number of authors.

I'm far from the first to use machine learning algorithms to identify the author of a text, but I think I have something a little better than most of the research projects I've read about and open-source tools I've tested. Initial results show significantly greater accuracy and a couple orders of magnitude more speed in situations involving thousands of possible authors.

I can imagine potentially good uses for this sort of tech, ranging from keeping banned users out of an online community to identifying the author of a ransom note, death threat or the like. I can also imagine evil uses, such as identifying political dissidents to persecute.

I'm not sure how I feel about releasing such a thing into the world (as open-source or as a product), knowing that it will be used for both good and evil. Any comments?



Cryptography can both keep criminals out of jail and (literally) save the lives of human rights activists. Nuclear power can be used both to produce plutonium for weapons and to generate electricity free of CO2 emissions. Insulin can both keep diabetics alive and be used (highly dangerously) as an anabolic drug.

At some point, progress requires that you shrug your shoulders and say "I don't know how the good and bad uses will weigh against each other, but I'm going to go ahead anyway".


I, personally, think it's fine to release and use. It only becomes a problem when you're able to determine too much information.

A tool that lets you identify a work as belonging to a specific author with relative certainty would be enormously useful for matching potentially fake texts with their authors. There are lots of less technical ways to identify political dissidents ^_^.

But, if you were able to classify people by race, education, sexual orientation, or even psychological profile, then I could see that becoming a powder keg. Thanks to the internet, there are a LOT of people who value anonymous self-expression. Destroying that could prevent a lot of excellent works from being created.


> But, if you were able to classify people by race, education, sexual orientation, or even psychological profile, then I could see that becoming a powder keg.

It's a text classification system; it can be trained on just about anything. I'm not sure how accurate it would be for other purposes. Things like that are already out there, e.g. http://genderanalyzer.com/
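
To illustrate the "can be trained on just about anything" point, here is a minimal sketch of a generic text classifier. It is not the system discussed in this thread, whose internals aren't described here; it swaps in a plain multinomial naive Bayes over character trigrams with add-one smoothing. The class name, function name, labels, and sample strings are all made up for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of the lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesTextClassifier:
    """Hypothetical toy classifier: multinomial naive Bayes on character trigrams."""

    def __init__(self):
        self.feature_counts = defaultdict(Counter)  # label -> trigram counts
        self.doc_counts = Counter()                 # label -> number of training samples
        self.vocabulary = set()                     # every trigram seen in training

    def train(self, label, text):
        """Add one training sample. The label can be an author or any other category."""
        grams = char_ngrams(text)
        self.feature_counts[label].update(grams)
        self.doc_counts[label] += 1
        self.vocabulary.update(grams)

    def classify(self, text):
        """Return the label with the highest smoothed log-probability (None if untrained)."""
        grams = char_ngrams(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.feature_counts.items():
            total = sum(counts.values())
            # Log prior from the share of training samples carrying this label.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                # Add-one (Laplace) smoothing keeps unseen trigrams from zeroing the score.
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    clf = NaiveBayesTextClassifier()
    # Made-up training samples; real use would need far more text per label.
    clf.train("alice", "the quick brown fox jumps over the lazy dog")
    clf.train("bob", "zzz qqq xxx zzz qqq xxx")
    print(clf.classify("the lazy fox"))
```

Nothing in the code knows what a label means: pass author names and it does author attribution, pass any other category and it tries to learn that instead. How accurate it is depends entirely on the training data.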


I suppose it's more about accuracy when it comes to those touchy topics, e.g. being condemned as a political dissident just because you're an outlier in the code's results >_>.


A bulls**t detector would be awesome :). <-- not sarcastic.


So, what is the answer, William Shakespeare or Francis Bacon?


My accuracy rate is worse than CRM114's when there are only a very small number of categories. I don't think general accuracy will get much above 90% when dealing with thousands of authors; this will always be a tool to aid human investigations, not replace them.



