
“Should this even be released?” Deep learning tool that may be used for doxxing - Zuider
https://github.com/mlpoll/machinematch/issues/1
======
Animats
There are papers on how to do this, and many approaches work. Traditional
statistical methods [1], support vector machines [2][3] (software available at
[4]), and random forest algorithms [5] have been shown to work, more or less.
This isn't a new idea. All this new code does is let us compare how deep
learning does on the problem.

[1] [https://www.aclweb.org/anthology/E/E99/E99-1021.pdf](https://www.aclweb.org/anthology/E/E99/E99-1021.pdf)
[2] [http://ceur-ws.org/Vol-1391/126-CR.pdf](http://ceur-ws.org/Vol-1391/126-CR.pdf)
[3] [https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS13/paper/viewFile/5917/6043](https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS13/paper/viewFile/5917/6043)
[4] [http://www.cs.waikato.ac.nz/ml/weka/](http://www.cs.waikato.ac.nz/ml/weka/)
[5] [http://ntv.ifmo.ru/en/article/15185/kompyuternaya_kriminalistika:_identifikaciya_avtora_internet-tekstov.htm](http://ntv.ifmo.ru/en/article/15185/kompyuternaya_kriminalistika:_identifikaciya_avtora_internet-tekstov.htm)
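For flavor, here is a minimal sketch of the function-word approach those traditional methods build on. The word list, texts, and author names are illustrative stand-ins, not data from the cited papers; real systems use far richer features and proper classifiers.

```python
# Sketch: classic stylometry via function-word frequencies.
# Word list and texts below are toy placeholders.
from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "was", "he"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def distance(p, q):
    """Manhattan distance between two stylometric profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

known = {
    "author_A": "the cat sat on the mat and it was happy because the sun was warm",
    "author_B": "to whom it may concern a brief note of thanks for the gift and the card",
}

def attribute(anonymous_text):
    """Guess which known author's profile is closest to the anonymous text."""
    anon = profile(anonymous_text)
    return min(known, key=lambda a: distance(profile(known[a]), anon))

print(attribute("the cat and the dog ran in the sun"))  # closest profile wins
```

The deep learning version replaces the hand-picked feature list with learned representations, but the attribution framing is the same.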

------
CM30
Yeah, it should be released.

I mean sure, there are various 'immoral' uses for it (like doxxing), but there
are also many good ones. Such as:

1. Working out who wrote a bunch of anonymous reviews on Amazon or other such sites, which could be used to stop fake reviews. You actually mention this usage in your article.

2. Being able to identify troublemakers in a community (such as a forum or a social networking site). I'm sure a lot of administrators would love to know whether that suspicious-looking new guy is an alias of a banned troll from a few weeks back (posting through a proxy server).

3. Literary analysis, like working out who wrote various anonymous works of fiction. Or, as someone said below, determining which parts of Shakespeare's plays were actually written by Shakespeare.

4. Crime solving. If it works anywhere near as well as you say, it could theoretically help unmask the Zodiac Killer, or perhaps even Jack the Ripper (if any of those letters were real).

All the uses above would be a net positive for humanity, and would be great
possible uses for a deep learning tool like this.

Don't let the worries about its usage by 'bad' people overshadow the good you
can do by releasing it.

~~~
a_small_island
>"2. Being able to identify troublemakers in a community (such as a forum or a social networking site). I'm sure a lot of administrators would love to know whether that suspicious-looking new guy is an alias of a banned troll from a few weeks back (posting through a proxy server)."

Orwellian.

~~~
CM30
So, what do you do when a troll persistently attacks your community website?

As in, flames everyone to a crisp, posts as much porn as possible, tries to incite a civil war between a few members who might not be on good terms with each other or the staff, and registers hundreds of accounts, some of which stay semi-dormant until they strike?

Because that can happen very easily online, especially if you draw the ire of someone with a lot of free time and very few morals. Or if your site ends up at war with a troll site or gets raided by 4chan.

Do you avoid the hassle now, or wait until the situation blows up and half the
site is now in the middle of it?

~~~
DeadHitchhiker
Ban behavior, not people. You can convert non-productive people into productive ones; most of them only want to be noticed or accepted, and past a certain point there is only so much you can do to block anyone anyway. Stylometric analysis would just be one more simple hurdle for a persistent person to cross.

The idea that this tool would be useful for community management is terrible.

~~~
ubernostrum
_You can convert non-productive people to productive people, most of them only
want to be noticed or accepted_

I'm a moderator of multiple online spaces.

A few months ago, in one of them, a user got too heated and started flinging
insults at someone else. As was standard policy for the place where it was
occurring, I issued the user a ban of a few days (enforced cooling-off) and
pointed to our guidelines on how to behave.

This user then proceeded, over a period of months, to continually harass me,
send me increasingly graphic threats, and try to track me down in real life.

Pray tell, how exactly would _you_ go about "converting" such a person to be
productive? I come to you since you are apparently quite the expert on it, or
else you wouldn't be giving out advice to just "convert" people.

------
tomcam
If the tool is as accurate as the author implies, I think it would be great to release it for literary analysis. For example, it might help determine which fragments of Shakespeare's plays were written by Shakespeare and which, if any, were remembered incorrectly by the actors who reconstructed them after his death.

~~~
mc32
Yeah, but conversely (or rather, additionally), it could be used to unmask anonymous authors who wish to remain anonymous for political reasons: whistleblowers, dissidents, etc.

------
sparkie
There already exists a counter-tool to help guard against this kind of privacy invasion:

[https://github.com/psal/anonymouth](https://github.com/psal/anonymouth)

~~~
jaytaylor
I just tried to get Anonymouth working, but unfortunately, even after fixing the invalid code issues/errors preventing compilation, it crashes after you fill out about six screens' worth of questions about where various types of text content are located:

    
    
        >>>>>>>>>>>>>>>>>>>>>>>   LOGGING STACK TRACE   <<<<<<<<<<<<<<<<<<<<<<<<<
        java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
        java.util.ArrayList.rangeCheck(Unknown Source)
        java.util.ArrayList.get(Unknown Source)
        edu.drexel.psal.anonymouth.utils.FunctionWords.run(FunctionWords.java:60)
        edu.drexel.psal.anonymouth.engine.DocumentProcessor.processDocuments(DocumentProcessor.java:140)
        edu.drexel.psal.anonymouth.engine.DocumentProcessor.access$000(DocumentProcessor.java:39)
        edu.drexel.psal.anonymouth.engine.DocumentProcessor$1.doInBackground(DocumentProcessor.java:70)
        edu.drexel.psal.anonymouth.engine.DocumentProcessor$1.doInBackground(DocumentProcessor.java:67)
        javax.swing.SwingWorker$1.call(Unknown Source)
        java.util.concurrent.FutureTask.run(Unknown Source)
        javax.swing.SwingWorker.run(Unknown Source)
        java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        java.lang.Thread.run(Unknown Source)
        
        ( ErrorHandler ) - Fatal error encountered, will exit now...
    

It's too bad the flow isn't more along the lines of "Give me some docs from
one author, and the other document you want to test. Okay! Here is your
result!"

~~~
sparkie
From this trace it looks like the file
jsan_resources/koppel_function_words.txt is missing/empty/invalid.

~~~
jaytaylor
Thanks for the info. That file appears to be a clean and intact list of English words, except it is under the path:

    
    
        anonymouth/src/edu/drexel/psal/resources/koppel_function_words.txt
    

And down the rabbit hole we go; after copying the file to src/jsan_resources we now get a new error:

    
    
        >>>>>>>>>>>>>>>>>>>>>>>   LOGGING STACK TRACE   <<<<<<<<<<<<<<<<<<<<<<<<<
        java.lang.NullPointerException
        edu.drexel.psal.anonymouth.engine.InstanceConstructor.getAttributes(InstanceConstructor.java:182)
        edu.drexel.psal.anonymouth.engine.InstanceConstructor.runInstanceBuilder(InstanceConstructor.java:121)
        ...
    
        >>>>>>>>>>>>>>>>>>>>>>>   LOGGING STACK TRACE   <<<<<<<<<<<<<<<<<<<<<<<<<
        java.lang.NullPointerException
        edu.drexel.psal.anonymouth.utils.TaggedDocument.makeAndTagSentences(TaggedDocument.java:380)
        edu.drexel.psal.anonymouth.engine.DocumentProcessor.processDocuments(DocumentProcessor.java:184)
        ...
    

:)

------
deutronium
I'm curious how you can know the tool is 95% accurate if it's being tested on real-world data, such as posts from Reddit. I can only assume it was tested on a synthetic dataset, perhaps.

Also I'm wondering how many unique users are present in the dataset, along
with the volume of content for each user.

~~~
gwern
You can easily turn any dataset with labeled authors into a de-anonymization
dataset: split each author's writings in half and give them different IDs. Now
you know the true answer for every pairwise combination.
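A short sketch of that construction (the corpus here is a placeholder):

```python
# Split each labeled author's texts in half, assign the halves fresh IDs,
# and the "same author?" answer for every pair of IDs is known by construction.
corpus = {
    "alice": ["text a1", "text a2", "text a3", "text a4"],
    "bob":   ["text b1", "text b2", "text b3", "text b4"],
}

pseudonyms = {}  # pseudonym -> (true author, texts)
for author, texts in corpus.items():
    half = len(texts) // 2
    pseudonyms[author + "_1"] = (author, texts[:half])
    pseudonyms[author + "_2"] = (author, texts[half:])

# Ground truth for every pairwise combination of pseudonyms:
ids = sorted(pseudonyms)
truth = {(x, y): pseudonyms[x][0] == pseudonyms[y][0]
         for i, x in enumerate(ids) for y in ids[i + 1:]}

print(truth[("alice_1", "alice_2")])  # True: same underlying author
print(truth[("alice_1", "bob_2")])    # False
```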

~~~
firebones
The problem with that approach is that it trains the classifier to solve a different problem than the one claimed. It assumes there is no difference between how people write when they are identifiable and when they are not (since it would only train on identifiable samples that are anonymized after the fact). Further, it would require solving the problem of identifying authors across different media -- which would be a huge achievement on its own.

A great test set for anyone trying to do this: look at the Scott Adams sock-puppet controversy on MetaFilter [1] and see if you can train something on his public writing to match the "PlannedChaos" commenter's posts and Adams' own tweets. It is probably the closest you could get to a "pure" training set, in the sense that presumably Adams didn't think he'd get caught. (And if he did, and therefore altered his stylometrics, then it's an even better challenge.)

[1] [http://www.adweek.com/galleycat/scott-adams-caught-defending-himself-anonymously-on-metafilter/28973](http://www.adweek.com/galleycat/scott-adams-caught-defending-himself-anonymously-on-metafilter/28973)

------
wrs
If there's a 5% false-positive rate, then if I give it an unidentified text and the posts from 1,000,000 identified Redditors, it's going to give me 50,000 possible authors? That doesn't seem either useful _or_ troublesome...

~~~
tzs
That in and of itself would be no more than marginally useful or troublesome, but usually it won't be used by itself. Usually it will be combined, via Bayesian reasoning or intuition, with other evidence you already have.

For example, suppose someone is revealing on Reddit details about some business dealing of yours that should only have been known by people under NDAs. If you intersect the set of 50,000 Redditors returned by the deanonymizing tool with the set of people under your NDA, and that intersection is not empty, then the leaker is probably one of the people in the intersection.
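A sketch of that update with assumed numbers matching the thread's figures (95% sensitivity, 5% false-positive rate, and a hypothetical pool of 20 NDA signers):

```python
# Sketch of the Bayesian step: 20 NDA signers, one of whom leaked.
sensitivity = 0.95          # assumed P(tool flags author | they wrote the posts)
false_positive_rate = 0.05  # assumed P(tool flags author | they did not)
nda_size = 20

prior = 1 / nda_size        # before the tool: each signer equally suspect

# Expected flags among the 20 signers: the leaker, plus false positives.
true_flags = sensitivity * 1
false_flags = false_positive_rate * (nda_size - 1)

# Bayes' rule: P(a flagged signer is the leaker)
posterior = true_flags / (true_flags + false_flags)
print(f"{prior:.2f} -> {posterior:.2f}")  # suspicion jumps from 0.05 to 0.50
```

The tool alone is nearly useless at web scale, but restricted to a small prior pool it multiplies suspicion tenfold.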

------
steveeq1
Satoshi could probably be revealed using these techniques. It's a double-edged
sword.

------
modeless
If it works, prove it by unmasking Satoshi Nakamoto.

~~~
desdiv
This is exactly the type of thing I'm afraid of when I read the link. Being doxxed due to something I did is one thing, but being doxxed due to something I _didn't_ do is something else altogether.

Can you imagine how much it would suck if you woke up the next morning and the
entire internet is convinced that you're Satoshi Nakamoto or a pedophile due
to a false positive from this program? There is no due process and no chance
of appeal; your social life is simply over at that point. All because of a 5%
chance of a false positive.

~~~
colejohnson66
That's already what happens with false rape/pedophile accusations due to
America's love of "trial by media". One false rape accusation and your photo
is all over the local media. Your life is ruined.

~~~
dredmorbius
I've addressed this in part in my own reply to the parent. But the larger point is that with 1) pervasive information and 2) very cheap analysis or assertion, you're hugely increasing the potential for this type of abuse.

The limit is in attention paid -- the public has a limited capacity to absorb
information, and there are a few hundred, perhaps a thousand or so "top
celebrities" at any one time.

And some of those can attain a highly significant level of immunity to criticism. Ronald Reagan's presidency was the most scandal-prone in recent memory, and yet his moniker was "the Teflon president". William Jefferson Clinton took far more flak for far less, and Barack Obama takes the hit for complete fabrications. Meantime, a major-party presidential candidate advocates overt violence against protesters, among various other views ... and is only embraced all the more strongly by his supporters.

The dynamics of this are odd.

------
jstanley
I wonder whether the code actually exists, or if the whole thing is just
intended to provoke discussion.

------
niftich
This is valuable code and should be released, but the product's name and one-line "about" description already betray its intended purpose (as the author envisions it), providing an additional vector for criticism.

In cases like this, I feel erring on the side of being less explicit tends to
help. Leave just enough out to let everyone read between the lines, and put
the pieces together.

Don't say it's a "Machine learning algorithm to connect anonymous accounts to
real names", say it's a 'speech pattern analyzer', or say it 'allows
comparison of speech patterns for likelihood of same author'.

~~~
SturgeonsLaw
Good point, but personally I think it's a breath of fresh air that the author has not only elected to forgo the doublespeak in describing this tool, but has also sparked a discussion on the ethics of using it.

------
kyriakos
It should be released. It's the equivalent of stopping medical research because
it could be used for biological weapons.

The people who need this for evil purposes will develop it whether it's
released as open source or not.

------
jaykru
This should be released, if only so that we can learn how to evade it. Perhaps
it could be broken by only writing anonymously in a different language from
that of your public life, for example.

------
yuhong
Doxxing does make me feel bad. Fixing the problems that lead to, for example, throwaway Ask HN posts would be a better long-term solution, though it may be easier said than done. (Yes, I mean doing the Ask HN posts non-anonymously instead.)

------
Zuider
It would be interesting to create a tool with the opposite functionality which
would take writing in one's own idiosyncratic style as input, and output
writing in a more generic style. Better still if it improved composition and
presentation.

------
Fiahil
I wonder if this could work when confronted with "hivemind" Twitter accounts shared by lots of different people. I don't see how it could pull enough signal from the noise to let someone unmask any identities.

------
tiredofnoobs
This entire post, and the debate surrounding it, is frankly stupid. What does 95% accuracy even mean? Consider face recognition: even though there is a good gold standard for matching faces (human judgment, since humans are good at recognizing faces), determining the accuracy of face recognition algorithms is still challenging (e.g. the MegaFace challenge). When it comes to a piece of text written by an author, it's even more difficult. There are several practical problems too, such as how you distinguish quotes and copy-pasted paragraphs from the rest of the text.

This sounds like a beginner who created a dataset with a flawed metric and is now going around claiming 95% accuracy using "deep" learning. And equally clueless commenters are hyping it up.

Why stop at claiming 95%? Hell, even I could create a "dataset" and a "deep learning" algorithm and get 99.9%.

I am not discounting that there are legitimate stylometric analysis methods that have been peer reviewed. But please, let's not hype "deep learning for doxxing". It just sullies the real progress being made in deep learning.

~~~
tzs
> This entire post, and the debate surrounding it, is frankly stupid. What does
> 95% accuracy even mean?

Generally, in supervised machine learning, a claim of X% accuracy means that when the system was tested on a large dataset for which the correct result is known, and which was not part of the training or validation datasets, it classified X% of that dataset correctly.

Typically you gather a big dataset of labeled data and then split it randomly
into training, validation, and test sets. A 50/25/25 split is common. If the
learning approach you are using does not need a validation set, then 70/30
training/test is common.

How reliable such an accuracy estimate is depends on how well your dataset
matches the characteristics of the datasets people will be using your trained
system on. His 95% accuracy report is probably reasonably reliable when his
software is used on anonymous posts on the forums where he gathered his
datasets. It would probably be less reliable looking at anonymous posts on,
say, a bagpipe maker's forum.
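A sketch of that split (the labeled data here is a stand-in):

```python
# Shuffle once, then carve the labeled data into 50% train / 25% validation / 25% test.
import random

labeled = [(f"text {i}", f"author {i % 10}") for i in range(1000)]  # stand-in data

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(labeled)

n = len(labeled)
train = labeled[: n // 2]
validation = labeled[n // 2 : 3 * n // 4]
test = labeled[3 * n // 4 :]

print(len(train), len(validation), len(test))  # 500 250 250
```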

~~~
tiredofnoobs
Huh... of course I know the definition of accuracy and how it's calculated.

However, even in supervised learning, accuracy is only useful in limited cases such as multi-class classification. For a whole range of problems, including the one being discussed, it's a poor and in some cases biased metric. For example, consider a heavily unbalanced problem with 99% positive labels: by predicting the majority label for all instances, it's possible to get 99% accuracy. There are several better metrics: false accept rates, precision-recall curves, etc.
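A tiny illustration of that failure mode, with a made-up 99%-positive label set:

```python
# On a 99%-positive label set, a classifier that ignores its input and always
# says "positive" scores 99% accuracy while having zero recall on the minority class.
labels = [1] * 990 + [0] * 10  # heavily unbalanced ground truth
predictions = [1] * 1000       # "predict the majority label" baseline

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

true_neg = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))
minority_recall = true_neg / labels.count(0)

print(accuracy)         # 0.99 -- looks impressive
print(minority_recall)  # 0.0  -- useless on the class that matters
```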

Without knowing how the dataset was collected, whether the username leaked into the dataset, etc., it's impossible to evaluate such outlandish claims.

The whole moral and ethical debate is a non sequitur, and it harms legitimate deep learning research.

~~~
rspeer
I understand this concern. It's like the old 20 Newsgroups data set for
testing classification, where supposedly you're distinguishing the topics of
conversation between comp.graphics, sci.electronics, talk.politics.misc, and
so on...

...but what the most effective classifiers do is memorize the names and
signature blocks of people who posted in each newsgroup.
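A toy illustration of that leakage (the posts are made up; this is also why scikit-learn's 20 Newsgroups loader offers a remove=('headers', 'footers', 'quotes') option):

```python
# A "classifier" that just memorizes signature blocks will ace a held-out
# split of the same posters without learning anything about the topics.
posts = [
    ("comp.graphics", "How do I render a sphere?\n-- \nJoe Bloggs, ACME Corp"),
    ("sci.electronics", "Which resistor do I need?\n-- \nJane Roe, EE dept"),
]

signatures = {}  # signature line -> newsgroup, "learned" from training posts
for group, text in posts:
    sig = text.split("\n-- \n")[-1]
    signatures[sig] = group

def classify(text):
    """Return the memorized group for a known signature, else None."""
    sig = text.split("\n-- \n")[-1]
    return signatures.get(sig)

print(classify("Anyone tried raytracing?\n-- \nJoe Bloggs, ACME Corp"))
# topic ignored; the signature alone decides the answer
```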

------
diyseguy
People tend to think and write in the same familiar memes and patterns -- I doubt very much that it could discern individuals from text samples alone.

------
vwbiiwgrvi
That this exists means that other similar tools exist which we don't know about (since any good idea is discovered simultaneously by multiple unconnected people around the world), so release it.

------
aminok
This will lead to political dissidents being outed.

------
dreamfactory2
So is this how Roko's Basilisk will find you?

------
AckSyn
Yes

------
Giorgi
Yes, at least for demo purposes.

