Researchers Unmask Anonymous Twitter Accounts With 97% Accuracy Using ML (wired.co.uk)
122 points by Jerry2 37 days ago | 29 comments

To be clear - since it isn't really clear from the title - this isn't exactly about being able to find out who you are specifically if you want to remain anonymous. If I have a single Twitter account, and it's anonymous, this method isn't about finding out that I am John Doe and I live at 123 Main Street. Rather, it's a method to associate a "throwaway" anonymous Twitter account with my "main" account if I want to make some tweets from a "throwaway" that are anonymous. The end result is the same, of course, if your "main" Twitter account contains personally identifying information.

It appeared to do even less than that, because they don’t have a ground truth of throwaway and main account associations. It sounds like they were just able to re-associate an account with its original name, after removing the name, by looking at metadata. This is thoroughly unimpressive.

Glad I am not the only one. This seems like a task of memory, not author identification. How could this be used at test time?

10,000 users is meaningless on a social network with millions of accounts.

What about the static features (like account creation dates)? Aren't those overfitting with cross-validation? Wouldn't learning curves be required when classifying unseen future data (the raison d'être of ML)?

> What about the static features (like account creation dates)? Aren't those overfitting with cross-validation?

Yes, absolutely. The paper admits that using only account creation time with KNN was enough to get 98% accuracy all on its own. The authors then broke creation time into hour and minute to increase difficulty (slightly), and introduced heavy data fuzzing to see how far they could go and still get results.

But in practice, that's not an ML problem. It's just asking "how little data is needed to perform recall on this dataset?", and finding that it's relatively easy to associate Twitter accounts with, um, themselves.
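To make the "recall, not ML" point concrete, here is a minimal sketch. It is not the paper's code: the timestamps are synthetic and `nearest_neighbor_predict` is a hypothetical helper. It only shows that a 1-nearest-neighbour lookup on a per-account constant field scores perfectly when the same field appears on both the stored and the queried side.

```python
import random

def nearest_neighbor_predict(known, query_time):
    """Return the user id whose stored creation time is closest to query_time."""
    return min(known, key=lambda uid: abs(known[uid] - query_time))

random.seed(0)
# Assumption: 1,000 synthetic users, each with a distinct creation timestamp.
creation_time = dict(enumerate(random.sample(range(10**8), 1000)))

# Every "unseen" tweet still carries its account's creation time, so the
# nearest neighbour is always the account itself.
correct = sum(
    nearest_neighbor_predict(creation_time, t) == uid
    for uid, t in creation_time.items()
)
accuracy = correct / len(creation_time)
print(accuracy)
```

Nothing is learned here; the field is simply acting as a second username.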

I think this could have been made interesting by not just fuzzing some data, but actively stripping everything account-level instead of tweet-level; if metadata like posting time was enough to tie tweets together, that could have interesting consequences for identifying sockpuppets or even deanonymizing the human users. But as far as I can tell, including account-level features makes this a non-story.

And there I thought it might be possible to discover all the puppet accounts celebrities use to rouse interest in themselves and create “engagement”... sigh

If the reassociation returns multiple accounts, then it should get both the main and the throwaway. Unless it is overfitting to the main somehow.

The basic result is horribly overfitted, because "account creation minute" and "account creation hour" are two of the parameters. (They split those up because "account creation time" was effectively a unique feature all on its own, and they wanted a 'harder' problem.)
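A rough sketch of what that split does to uniqueness, using synthetic timestamps rather than the paper's data. The 10,000 figure is the sample size mentioned elsewhere in this thread, and `unique_fraction` is a hypothetical helper. There are only 1,440 (hour, minute) pairs, so at most 1,440 of 10,000 accounts can be unique on that pair alone.

```python
import random
from collections import Counter

random.seed(1)
N_USERS = 10_000
# Assumption: distinct creation times (in seconds) spread over about ten years.
timestamps = random.sample(range(10 * 365 * 24 * 3600), N_USERS)

def unique_fraction(keys):
    """Fraction of items whose key is shared with no other item."""
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Full timestamp: every account is unique by construction.
full = unique_fraction(timestamps)
# Hour-of-day and minute-of-hour only: many accounts now collide.
hour_minute = unique_fraction(
    [((t // 3600) % 24, (t // 60) % 60) for t in timestamps]
)
print(full, hour_minute)
```

So the split does make the field much less identifying by itself, but it is still a static per-account value that is identical in the training and test rows.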

This is basically an exercise in overfitting - they learned to recognize account data using that exact same data, and called it deanonymization. The data fuzzing was a little interesting, but it's still overfitting from top to bottom.

I just read through it quickly on my way to work, so I did not get to that part. I was thinking more like feeding tweet content, language usage (words used, sentence length, etc.) and IP data into a generative kind of network similar to what is used to fix images with missing content. Then the missing content can be the username, and it might be able to at least find very similar users.

Time to run this on HN posts to see which incendiary/troll throwaway accounts belong to which real users.

This is a terrible title and should be changed. The story has nothing to do with unmasking anonymous Twitter accounts, rather it is about using metadata to identify posts by the same user from a dataset which has had specific user IDs removed.

The entire piece could do with a serious re-edit. It left me more confused than anything. Terribly written.

I wonder how execs at Twitter feel when they read research like this from outside Twitter. It’s common knowledge that Twitter has a huge sockpuppet problem, and it’s an open secret that the “problem” is actually beneficial to the price of TWTR.

Ignoring that conflict of interest is easy, if the problem remains in the shadows and nobody has a solution to it. But there are dozens of research papers from the past year that could all be applied to wiping out the sockpuppet population on Twitter. Yet Twitter does nothing. How long can they sustain this strategy when research increasingly demonstrates Twitter is neglecting real options to address the problem?

You may not be aware, but Twitter's share price took a rather large hit yesterday specifically because they banned 10s of millions of bot accounts.

They have no issues letting the public know how many accounts they are removing.

Having said that, their developer ToS are quite clear that they don't want researchers or developers to:

> aggregate Twitter user metrics such as total number of active users, accounts, total number of Periscope Broadcast views, user engagements or account engagements.

> sockpuppet problem

Ah, I long for the good old days when sockpuppets weren't a "problem" but a "feature", and a useful one at that.

Why is the sockpuppet problem good for Twitter’s stock price?

Well, since you asked, here's some "throwaway math" for you. ;)

a) Although Twitter tries to estimate the number of "bot accounts" in its QE reports, it has no accountability in calculating the percentages. Therefore they can use it as a tool to manipulate DAU. If real DAU goes down, just change the bot numbers to make it look like DAU is up.

b) Sockpuppets generate activity from real humans. The whole point is that they push agendas, sow discord, etc. Sockpuppets wouldn't work without victims interacting with them. Therefore more sockpuppets = more human activity = more ad revenue & higher metrics for stock price

c) My speculation is that the sockpuppet problem is extremely understated, and an accurate estimate would amount to a higher percentage of Twitter's reported userbase than previously admitted. Therefore, accurately reporting the percentage would require Twitter to negatively adjust their real user numbers, which drive the stock price.

The central problem is that for every 1% of users Twitter defines as sockpuppets, it must lose 1% of "real" users. Therefore once Twitter makes a single estimate of bot percentages (which they have done), increasing that percentage is dangerous for their stock price because it means decreasing the percentage of real users, a key metric in the stock price.

It would be much better if journalists would stop reporting "accuracy" and instead mention the precision and recall[1] of the classifiers they report on. There's a fundamental trade-off between the two, and it's hard to tell where the trade-off has been made unless both are reported.

To elaborate, it's trivial to make a classifier with 0 false positives... but also many false negatives (high precision, low recall). Similarly, we can have 0 false negatives but many false positives (low precision, high recall).

[1] - https://en.wikipedia.org/wiki/Precision_and_recall
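A small illustration of the two extremes described above, on made-up binary labels (10 positives out of 100). `precision_recall` is a hypothetical helper written for this example, not a library function.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for boolean labels; NaN when a ratio is undefined."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

y_true = [True] * 10 + [False] * 90

# Flags only one sure case: no false positives, but most positives are missed.
cautious = [True] + [False] * 99
# Flags everything: no false negatives, but most flags are wrong.
always = [True] * 100

print(precision_recall(y_true, cautious))
print(precision_recall(y_true, always))
```

The cautious classifier is also 91% "accurate" on this data while missing nine of the ten positives, which is why a lone accuracy figure says little in the binary case.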

This isn't a binary classifier, they're trying to determine the user who posted a given set of tweets among a certain set of Twitter users known to contain that user. The terms "positive" and "negative" don't apply to that kind of non-binary problem, and neither do precision and recall. Accuracy is a completely appropriate measure of performance in this case.

Precision and recall can be useful for multi-class classification problems, but only if there's a "background" class for negative results. E.g. if you're evaluating a cancer detector, then you might care about its precision when detecting any cancer vs. no cancer, but then the classifier has been reduced to a binary decision. It's also sometimes useful to focus on a certain class (e.g. skin cancer) to see whether the classifier has difficulties with that specific one, but again that's treating the multi-class classifier as a binary classifier.
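As a sketch of that last point, with invented labels for three users: overall accuracy is a single number for the multi-class task, while precision and recall only appear once one class is singled out as "positive". `one_vs_rest` is a hypothetical helper.

```python
def one_vs_rest(y_true, y_pred, target):
    """Precision and recall with `target` as positive and every other class as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Made-up example: six sets of tweets from three users, one misattributed.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)
print(one_vs_rest(y_true, y_pred, "a"))
print(one_vs_rest(y_true, y_pred, "b"))
```

Each `one_vs_rest` call answers a binary question ("is this user a or not?"), which is exactly the reduction described above.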

I always kind of wondered if something like this (the way the title describes it not the way it actually went) wasn't theoretically possible using similarities in writing styles. Like I have several different accounts on some social media sites and make posts and comments on each one for different purposes. But I suspect that, if they were each analyzed based on how things were worded and word frequency and a few other things, they could at least be assumed to be linked together.
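For what it's worth, the crudest version of that idea is a word-frequency comparison. The sketch below uses invented sentences and plain cosine similarity over word counts. It is nowhere near real stylometry and is only meant to show the shape of the approach; `word_profile` and `cosine_similarity` are hypothetical helpers.

```python
import math
from collections import Counter

def word_profile(text):
    """Lower-cased word counts for one account's combined posts."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Invented text standing in for three accounts' posts.
main = word_profile("honestly i reckon the results are overfit and honestly a bit silly")
alt = word_profile("honestly i reckon this title is silly and a bit misleading")
other = word_profile("the paper reports accuracy on ten thousand sampled accounts")

print(cosine_similarity(main, alt), cosine_similarity(main, other))
```

On real accounts, topic overlap and common filler words would swamp a signal this simple, so a high score would at best suggest a pair worth a closer look.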

Theoretically is the key word here :) Here is an article from yesterday: https://news.ycombinator.com/item?id=17486376

If you look at the charts, their 3 techniques couldn't accurately disambiguate between two authors. If you like, whatever, use a lot of superfluous, unnecessary words, or whatever, you might get clustered into, like, the group of people who do that or whatever. But unless you have a really unique way of writing, I wouldn't worry too much.

IMO, after reading a bit on NLP, I think sensationalized headlines that the research doesn't live up to are far too common. I suppose this might be science in general: http://phdcomics.com/comics/archive_print.php?comicid=1174

Exactly what I thought they did.

Can anyone tell what exactly the 144 pieces of metadata Twitter has about us are? I assume most of these are available to Twitter only, correct?

From the paper:

We define metadata as the information available pertaining to a Twitter post. This is information that describes the context on which the post was shared. Apart from the 140 character message, each tweet contains about 144 fields of metadata. Each of these fields provides additional information about: the account from which it was posted; the post (e.g. time, number of views); other tweets contained within the message; various entities (e.g. hashtags, URLs, etc); and the information of any users directly mentioned in it.


For data collection, we used the Twitter Streaming Public API (Twitter, Inc. 2018). Our population is a random sample of the tweets posted between October 2015 and January 2016 (inclusive). During this period we collected approximately 151,215,987 tweets corresponding to 11,668,319 users. However, for the results presented here we considered only users for which we collected more than 200 tweets. Our final dataset contains tweets generated by 5,412,693 users.

The paper referenced in the article uses KNN on metadata features to re-identify users from their tweets. They found that account creation time (ACT) alone could get them 97% accuracy.

All they did was exchange username for ACT as a unique identifier, and then found that it's only ~97% unique.

Wouldn't "account creation" date be shared between test and train data and so would essentially constitute a train/test set leak?

E.g. a user in the training set has metadata about account creation.

Any test set case would only need to look at the account creation date to identify the user.
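One way to check for this kind of leak before training anything is to look for features that behave like a stand-in for the user id: constant within each user and never shared between users. This is a minimal sketch with hypothetical field names (`account_created`, `post_hour`) and toy rows, not the paper's schema.

```python
from collections import defaultdict

def identifier_like_features(rows):
    """rows: iterable of (user_id, {feature: value}).
    Return features that are constant per user and unique to each user."""
    values_per_user = defaultdict(lambda: defaultdict(set))
    users_per_value = defaultdict(lambda: defaultdict(set))
    for user, feats in rows:
        for name, value in feats.items():
            values_per_user[name][user].add(value)
            users_per_value[name][value].add(user)
    leaky = []
    for name in values_per_user:
        constant_per_user = all(len(v) == 1 for v in values_per_user[name].values())
        unique_to_user = all(len(u) == 1 for u in users_per_value[name].values())
        if constant_per_user and unique_to_user:
            leaky.append(name)
    return sorted(leaky)

# Toy rows: two tweets each from two users.
rows = [
    ("u1", {"account_created": 1001, "post_hour": 9}),
    ("u1", {"account_created": 1001, "post_hour": 22}),
    ("u2", {"account_created": 2002, "post_hour": 9}),
    ("u2", {"account_created": 2002, "post_hour": 13}),
]
leaky = identifier_like_features(rows)
print(leaky)
```

Any feature this flags will let a model match test rows to training rows by lookup, whatever split is used.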

Yes, this is silly. And while they address it, and admit that using that field alone basically results in perfect classification, they don't do the logical thing and give this whole exercise up as pointless. Instead they just break the Account Create Time up into individual features: "Account Creation Hour", "Account Creation Minute". Seriously?

The reality is, the inclusion of that field in the metadata means that identifying a user from metadata is trivial and no interesting case for ML. In order to publish, they "degraded" the data until it was just interesting enough to be headline worthy. Insulting.

They do address this fact in the paper, and in fact note that a simple KNN approach using only the account creation time gives 99.98% accuracy.

grammar nerdery: anonymous means "without a name". So, there's no such thing as an anonymous twitter account.

Horribly, the title is correct - for this ill-written article.

The researchers didn't find the users behind Twitter accounts, they stripped the usernames off the accounts and then reassociated them via usage patterns. So they did "unmask anonymous accounts" (because they anonymized the accounts). It's just a completely different result than the "find a real person" outcome the article implies.
