

Author identification by machine learning - ethics? - Zak

I'm working on a personal project involving automated text classification for a variety of purposes, including identifying the author of a sample of text given samples of training text from a very large number of authors.

I'm far from the first to use machine learning algorithms to identify the author of a text, but I think I have something a little better than most of the research projects I've read about and open-source tools I've tested. Initial results show significantly greater accuracy and a couple of orders of magnitude more speed in situations involving thousands of possible authors.

I can imagine potentially good uses for this sort of tech, ranging from keeping banned users out of an online community to identifying the author of a ransom note, death threat or the like. I can also imagine evil uses, such as identifying political dissidents to persecute.

I'm not sure how I feel about releasing such a thing into the world (as open-source or as a product), knowing that it will be used for both good and evil. Any comments?
======
cperciva
Cryptography can both keep criminals out of jail and (literally) save the
lives of human rights activists. Nuclear power can both be used to produce
plutonium for use in weapons and produce CO2-emission-free electricity.
Insulin can both keep diabetics alive and be used (highly dangerously) as an
anabolic drug.

At some point, progress requires that you shrug your shoulders and say "I
don't know how the good and bad uses will weigh against each other, but I'm
going to go ahead anyway".

------
thegoleffect
I, personally, think it's fine to release and use. It only becomes a problem
when it gets to the point where you're able to determine too much
information.

A tool that lets you identify a work as belonging to a specific author with
relative certainty would be enormously useful for matching potentially fake
texts to their real authors. There are lots of less technical ways to identify
political dissidents ^_^.

But if you were able to classify people by race, education, sexual
orientation, or even psychological profile, then I could see that becoming a
powder keg. Thanks to the internet, there are a LOT of people who value
anonymous self-expression. Destroying that could prevent a lot of excellent
works from being created.

~~~
Zak
_But, if you were able to classify people by race, education, sexual
orientation, or even psychological profile, then I could see that becoming a
powder-keg._

It's a text classification system; it can be trained on just about anything.
I'm not sure how accurate it would be for other purposes. Things like that are
already out there, e.g. <http://genderanalyzer.com/>
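
To make "trained on just about anything" concrete, here is a minimal sketch of a generic text classifier. It is not the system described in this thread: it assumes a plain multinomial naive Bayes model over character trigrams with add-one smoothing, and the names `char_ngrams` and `NaiveBayesTextClassifier` are hypothetical, chosen only for illustration. The labels can be authors, genders, or any other category you have training text for.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (lowercased, whitespace collapsed)."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NaiveBayesTextClassifier:
    """Illustrative multinomial naive Bayes over character n-grams (an assumption,
    not the method used by the original poster)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.totals = Counter()             # label -> total n-grams seen
        self.docs = Counter()               # label -> number of training samples
        self.vocab = set()                  # all n-grams seen in training

    def train(self, label, text):
        grams = char_ngrams(text, self.n)
        self.counts[label].update(grams)
        self.totals[label] += len(grams)
        self.docs[label] += 1
        self.vocab.update(grams)

    def classify(self, text):
        """Return the label with the highest log-probability, or None if untrained."""
        grams = char_ngrams(text, self.n)
        n_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, float("-inf")
        for label in self.docs:
            # Log prior from the share of training samples carrying this label.
            score = math.log(self.docs[label] / n_docs)
            for gram in grams:
                # Add-one smoothing so unseen n-grams do not zero out the score.
                score += math.log(
                    (self.counts[label][gram] + 1) / (self.totals[label] + vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is the same whatever the labels mean: call `train(label, text)` once per training sample, then `classify(text)` on an unknown sample. With thousands of labels, the loop over every label in `classify` is the obvious bottleneck in this sketch.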

~~~
thegoleffect
I suppose that it's more about accuracy when it comes to those touchy topics,
like being condemned as a political dissident for being an outlier in the
results of the code >_>.

------
raymondh
So, what is the answer, William Shakespeare or Francis Bacon?

~~~
Zak
My accuracy rate is worse than CRM114 with a very small number of categories.
I don't think general accuracy will get much above 90% when dealing with
thousands of authors; this will always be a tool to aid human investigations,
not replace them.

