
Researchers Unmask Anonymous Twitter Accounts With 97% Accuracy Using ML - Jerry2
https://www.wired.co.uk/article/twitter-metadata-user-privacy
======
SOLAR_FIELDS
To be clear - since it isn't really clear from the title - this isn't about
being able to find out who you are specifically if you want to remain
anonymous. If I have a single anonymous Twitter account, this method
isn't about finding out that I am John Doe and I live at 123 Main Street.
Rather it's a method to associate a "throwaway" anonymous Twitter account with
my "main" account if I want to make some tweets from a "throwaway" that are
anonymous. The end result is the same, of course, if your "main" Twitter
account contains personal identifying information.

~~~
lotu
It appeared to do even less than that, because they don’t have a ground truth
of throwaway and main account associations. It sounds like they were just
able to re-associate an account with its original name after removing the
name by looking at metadata. This is thoroughly unimpressive.

~~~
taurine
Glad I am not the only one. This seems like a task of memory, not author
identification. How could this be used at test time?

10,000 users is meaningless on a social network with millions of accounts.

What about the static features (like account creation dates)? Aren't those
overfitting with cross-validation? Wouldn't learning curves be required when
classifying on unseen future data (the raison d'être of ML)?

~~~
Bartweiss
> _What about the static features (like account creation dates)? Aren't those
> overfitting with cross-validation?_

Yes, absolutely. The paper admits that using only account creation time with
KNN was enough to get 98% accuracy all on its own. The authors then broke
creation time into hour and minute to increase difficulty (slightly), and
introduced heavy data fuzzing to see how far they could go and still get
results.

But in practice, that's not an ML problem. It's just asking "how little data
is needed to perform recall on this dataset?", and finding that it's
relatively easy to associate Twitter accounts with, um, themselves.
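
To make that concrete, here's a toy sketch in Python (my own, not the paper's
pipeline) of why a near-unique static field like account creation time turns
"re-identification" into a simple lookup:

```python
import random

random.seed(0)

N_USERS = 1_000
WINDOW = 10_000_000  # seconds in a ~4-month collection window

# One random creation timestamp per user; in a window this wide,
# collisions between 1,000 users are rare.
creation_time = {u: random.randrange(WINDOW) for u in range(N_USERS)}

def identify(act):
    """1-NN on a single feature: return the user whose stored
    creation time is closest to the one attached to the tweet."""
    return min(creation_time, key=lambda u: abs(creation_time[u] - act))

# Every "anonymised" tweet still carries its account's creation time,
# so looking the account up again is essentially perfect.
correct = sum(identify(creation_time[u]) == u for u in range(N_USERS))
print(correct / N_USERS)  # ~1.0, barring timestamp collisions
```

No learning happens here; the feature is already (nearly) a primary key.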

I think this could have been made interesting by not just fuzzing some data,
but actively stripping everything account-level instead of tweet-level; if
metadata like posting time was enough to tie tweets together, that could have
interesting consequences for identifying sockpuppets or even deanonymizing the
human users. But as far as I can tell, including account-level features makes
this a non-story.

------
JamesMcMinn
This is a terrible title and should be changed. The story has nothing to do
with unmasking anonymous Twitter accounts, rather it is about using metadata
to identify posts by the same user from a dataset which has had specific user
IDs removed.

~~~
scabarott
The entire piece could do with a serious re-edit. It left me more confused
than anything. Terribly written.

------
chatmasta
I wonder how execs at Twitter feel when they read research like this from
outside Twitter. It’s common knowledge that Twitter has a huge sockpuppet
problem, and it’s an open secret that the “problem” is actually beneficial to
the price of TWTR.

Ignoring that conflict of interest is easy, if the problem remains in the
shadows and nobody has a solution to it. But there are dozens of research
papers from the past year that could all be applied to wiping out the
sockpuppet population on Twitter. Yet Twitter does nothing. How long can they
sustain this strategy when research increasingly demonstrates Twitter is
neglecting real options to address the problem?

~~~
throwawaymath
Why is the sockpuppet problem good for Twitter’s stock price?

~~~
chatmasta
Well, since you asked, here's some "throwaway math" for you. ;)

a) Although Twitter tries to estimate the number of "bot accounts" in its QE
reports, it has no accountability in calculating the percentages. Therefore
they can use it as a tool to manipulate DAU. If real DAU goes down, just
change the bot numbers to make it look like DAU is up.

b) Sockpuppets generate activity from real humans. The whole point is that
they push agendas, sow discord, etc. Sockpuppets wouldn't work without victims
interacting with them. Therefore more sockpuppets = more human activity = more
ad revenue & higher metrics for stock price

c) My speculation is that the sockpuppet problem is extremely understated, and
an accurate estimate would amount to a higher percentage of Twitter's reported
userbase than previously admitted. Therefore, accurately reporting the
percentage would require Twitter to negatively adjust their real user numbers,
which drive the stock price.

The central problem is that for every 1% of users Twitter defines as
sockpuppets, it must lose 1% of "real" users. Therefore once Twitter makes a
single estimate of bot percentages (which they have done), increasing that
percentage is dangerous for their stock price, because it means decreasing the
percentage of real users, a key metric behind the stock price.

------
nindalf
It would be much better if journalists stopped reporting "accuracy" and
instead mentioned the precision and recall[1] of the classifiers they report
on. There's a fundamental trade-off between the two, and it's hard to tell
where the trade-off has been made unless both are reported.

To elaborate, it's trivial to make a classifier with 0 false positives... but
also many false negatives (high precision, low recall). Similarly, we can have
0 false negatives but many false positives (low precision, high recall).
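
A tiny worked example of that trade-off (the counts are made up purely for
illustration):

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP); recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# 100 true matches hiding in a big pool of candidate pairs.
# A timid classifier flags only 10 pairs, all of them correct...
print(precision_recall(tp=10, fp=0, fn=90))     # (1.0, 0.1): high precision, low recall

# ...while a trigger-happy one flags 10,000 pairs, catching all 100.
print(precision_recall(tp=100, fp=9900, fn=0))  # (0.01, 1.0): low precision, high recall
```

Both classifiers are useless, but each looks great if you only report its
favourable number.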

[1] -
[https://en.wikipedia.org/wiki/Precision_and_recall](https://en.wikipedia.org/wiki/Precision_and_recall)

~~~
yorwba
This isn't a binary classifier, they're trying to determine the user who
posted a given set of tweets among a certain set of Twitter users known to
contain that user. The terms "positive" and "negative" don't apply to that
kind of non-binary problem, and neither do precision and recall. Accuracy is a
completely appropriate measure of performance in this case.

Precision and recall _can_ be useful for multi-class classification problems,
but only if there's a "background" class for negative results. E.g. if you're
evaluating a cancer detector, then you might care about its precision when
detecting _any_ cancer vs. no cancer, but then the classifier has been reduced
to a binary decision. It's also sometimes useful to focus on a certain class
(e.g. skin cancer) to see whether the classifier has difficulties with that
specific one, but again that's treating the multi-class classifier as a binary
classifier.
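
A small sketch of the distinction, with hypothetical labels:

```python
# Toy multi-class identification: which user produced each tweet bundle?
# Labels are user ids; there is no natural "negative" class.
y_true = ["alice", "bob", "carol", "alice", "bob", "carol"]
y_pred = ["alice", "bob", "bob",   "alice", "carol", "carol"]

# Accuracy is well defined: the fraction of bundles assigned correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 4/6

# Precision/recall only appear once you single out one class ("bob")
# and lump everything else together as "negative":
tp = sum(t == "bob" and p == "bob" for t, p in zip(y_true, y_pred))
fp = sum(t != "bob" and p == "bob" for t, p in zip(y_true, y_pred))
fn = sum(t == "bob" and p != "bob" for t, p in zip(y_true, y_pred))
print(tp / (tp + fp), tp / (tp + fn))  # precision and recall for "bob" only
```
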

------
zodPod
I always kind of wondered if something like this (the way the title describes
it not the way it actually went) wasn't theoretically possible using
similarities in writing styles. Like I have several different accounts on some
social media sites and make posts and comments on each one for different
purposes. But I suspect that, if they were each analyzed based on how things
were worded and word frequency and a few other things, they could at least be
assumed to be linked together.

~~~
ahartman00
Theoretically is the key word here :) Here is an article from yesterday:
[https://news.ycombinator.com/item?id=17486376](https://news.ycombinator.com/item?id=17486376)

If you look at the charts, their 3 techniques couldn't accurately disambiguate
between two authors. If you like, whatever, use a lot of superfluous,
unnecessary words, or whatever, you might get clustered into, like, the group
of people who do that or whatever. But unless you have a really unique way of
writing, I wouldn't worry too much.

IMO, after reading a bit on NLP, I think sensationalized headlines that the
research doesn't live up to are far too common. I suppose this might be
science in general:
[http://phdcomics.com/comics/archive_print.php?comicid=1174](http://phdcomics.com/comics/archive_print.php?comicid=1174)

------
dvfjsdhgfv
Can anyone tell what exactly the 144 pieces of metadata Twitter has about us
are? I assume most of these are available to Twitter only, correct?

~~~
yorwba
From the paper:

 _We define metadata as the information available pertaining to a Twitter
post. This is information that describes the context on which the post was
shared. Apart from the 140 character message, each tweet contains about 144
fields of metadata. Each of these fields provides additional information
about: the account from which it was posted; the post (e.g. time, number of
views); other tweets contained within the message; various entities (e.g.
hashtags, URLs, etc); and the information of any users directly mentioned in
it._

...

 _For data collection, we used the Twitter Streaming Public API (Twitter, Inc.
2018). Our population is a random sample of the tweets posted between October
2015 and January 2016 (inclusive). During this period we collected
approximately 151,215,987 tweets corresponding 11,668,319 users. However, for
the results presented here we considered only users for which we collected
more than 200 tweets. Our final dataset contains tweets generated by 5,412,693
users._

------
madenine
The paper referenced in the article uses KNN on metadata features to re-
identify users from their tweets. They found that account creation time (ACT)
alone could get them 97% accuracy.

All they did was exchange username for ACT as a unique identifier, and then
found that it's only ~97% unique.

------
willgdjones
Wouldn't "account creation" date be shared between test and train data, and
so essentially constitute a train/test set leak?

E.g. a user in the training set has metadata about account creation.

Any test set case would only need to look at the account creation date to
identify the user.

~~~
jasallen
Yes, this is silly. And while they address it, and admit that using that field
alone basically results in perfect classification, they don't do the logical
thing and give this whole exercise up as pointless. Instead they just break
the Account Create Time up into individual features: "Account Creation Hour",
"Account Creation Minute". Seriously?

The reality is, the inclusion of that field in the metadata means that
identifying a user from metadata is trivial and not an interesting case for
ML. In order to publish, they "degraded" the data until it was just
interesting enough to be headline worthy. Insulting.
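
A minimal sketch of the leak (toy data, hypothetical timestamps): because the
field is constant per account, a "model" that simply memorises it during
training gets the test set for free:

```python
# (user, account_creation_time) pairs for individual tweets; the
# creation time is identical across all of a user's tweets.
tweets = [
    ("u1", 1_444_000_123), ("u1", 1_444_000_123),
    ("u2", 1_446_500_456), ("u2", 1_446_500_456),
    ("u3", 1_449_900_789), ("u3", 1_449_900_789),
]
train, test = tweets[::2], tweets[1::2]  # one tweet per user in each split

lookup = {act: user for user, act in train}      # "training": memorise
predictions = [lookup[act] for _, act in test]   # "inference": look up
accuracy = sum(p == u for p, (u, _) in zip(predictions, test)) / len(test)
print(accuracy)  # 1.0 -- the feature is a user id in disguise
```
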

------
djtriptych
grammar nerdery: anonymous means "without a name". So, there's no such thing
as an anonymous twitter account.

~~~
Bartweiss
Horribly, the title is correct - for this ill-written article.

The researchers didn't find the users behind Twitter accounts, they stripped
the usernames off the accounts and then reassociated them via usage patterns.
So they did "unmask anonymous accounts" (because they anonymized the
accounts). It's just a completely different result than the "find a real
person" outcome the article implies.

