Author identification by machine learning - ethics?

cperciva · on Feb 9, 2010

Cryptography can both keep criminals out of jail and (literally) save the lives of human rights activists. Nuclear power can be used both to produce plutonium for weapons and to produce CO2-emission-free electricity. Insulin can both keep diabetics alive and be used (highly dangerously) as an anabolic drug.

At some point, progress requires that you shrug your shoulders and say "I don't know how the good and bad uses will weigh against each other, but I'm going to go ahead anyway".

thegoleffect · on Feb 9, 2010

I, personally, think it's fine to release and use. It only becomes a problem when it gets to the point where you're able to determine too much information.

A tool that lets you identify a work as belonging to a specific author with relative certainty would be enormously useful for matching potentially fake texts to their authors. There are lots of less technical ways to identify political dissidents ^_^.

But, if you were able to classify people by race, education, sexual-orientation, or even psychological profile, then I could see that becoming a powder-keg. Thanks to the internet, there are a LOT of people who value anonymous self-expression. Destroying that could prevent a lot of excellent works from being created.

Zak · on Feb 9, 2010

> But, if you were able to classify people by race, education, sexual-orientation, or even psychological profile, then I could see that becoming a powder-keg.

It's a text classification system; it can be trained on just about anything. I'm not sure how accurate it would be for other purposes. Things like that are already out there, e.g. http://genderanalyzer.com/
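To make the "trained on just about anything" point concrete, here is a minimal sketch in Python. It is not the code behind the tool discussed here; naive Bayes is swapped in as a stand-in because the thread doesn't say which algorithm the tool uses, and the class name `NaiveBayes`, its methods, and the toy training strings are all made up for illustration. The same classifier is trained once on author labels and once on topic labels:

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokens; a real system would use richer features.
    return text.lower().split()


class NaiveBayes:
    """Hypothetical toy classifier; labels can be any hashable value."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> documents seen
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus log likelihood with add-one (Laplace) smoothing.
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                score += math.log(
                    (counts[word] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Same code, two unrelated label sets (made-up training data).
by_author = NaiveBayes()
by_author.train("thou art more lovely and more temperate", "author_a")
by_author.train("the quick brown fox jumps over the lazy dog", "author_b")

by_topic = NaiveBayes()
by_topic.train("the compiler emitted a linker error", "programming")
by_topic.train("simmer the sauce and add salt", "cooking")

print(by_author.classify("thou art lovely"))
print(by_topic.classify("add salt to the sauce"))
```

Nothing in the classifier knows what the labels mean, which is why accuracy on any given label set has to be measured rather than assumed.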

thegoleffect · on Feb 9, 2010

I suppose it's more about accuracy when it comes to those touchy topics, like being condemned as a political dissident for being an outlier in the code's results >_>.

thegoleffect · on Feb 11, 2010

A bullshit detector would be awesome :). <-- not sarcastic.

raymondh · on Feb 9, 2010

So, what is the answer, William Shakespeare or Francis Bacon?

Zak · on Feb 9, 2010

My accuracy rate is worse than CRM114 with a very small number of categories. I don't think general accuracy will get much above 90% when dealing with thousands of authors; this will always be a tool to aid human investigations, not replace them.