A sample of 10,000 users is meaningless on a social network with millions of accounts.
What about the static features (like account creation dates)? Aren't those just overfitting, even with cross-validation? Wouldn't learning curves be required when classifying on unseen future data (the raison d'être of ML)?
Yes, absolutely. The paper admits that using only account creation time with KNN was enough to get 98% accuracy all on its own. The authors then broke creation time into hour and minute to increase difficulty (slightly), and introduced heavy data fuzzing to see how far they could go and still get results.
But in practice, that's not an ML problem. It's just asking "how little data is needed to perform recall on this dataset?", and finding that it's relatively easy to associate Twitter accounts with, um, themselves.
I think this could have been made interesting by not just fuzzing some data, but by stripping out everything account-level and keeping only tweet-level features; if metadata like posting time was enough to tie tweets together, that could have interesting consequences for identifying sockpuppets or even deanonymizing the human users. But as far as I can tell, including account-level features makes this a non-story.
This is basically an exercise in overfitting - they learned to recognize account data using that exact same data, and called it deanonymization. The data fuzzing was a little interesting, but it's still overfitting from top to bottom.
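To make that concrete, here is a toy sketch (synthetic data, scikit-learn, not the paper's actual pipeline) of why a near-unique creation-time feature plus 1-NN produces near-perfect cross-validated "accuracy" without learning anything general:

```python
# Toy sketch of the criticism above: with a (near-unique) account creation
# timestamp as a feature, a 1-NN "classifier" just memorizes one key per account.
# Synthetic data only; this is not the paper's actual pipeline.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_accounts, tweets_per_account = 1000, 20

# One creation timestamp (seconds since epoch) per account, repeated on every tweet.
creation = rng.integers(0, 10**9, size=n_accounts).reshape(-1, 1).astype(float)
X = np.repeat(creation, tweets_per_account, axis=0)
y = np.repeat(np.arange(n_accounts), tweets_per_account)

# Cross-validation looks spectacular, but the model has only learned to look up
# an account by its own creation time: recall of the dataset, not generalization.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```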
Ignoring that conflict of interest is easy, if the problem remains in the shadows and nobody has a solution to it. But there are dozens of research papers from the past year that could all be applied to wiping out the sockpuppet population on Twitter. Yet Twitter does nothing. How long can they sustain this strategy when research increasingly demonstrates Twitter is neglecting real options to address the problem?
They have no issues letting the public know how many accounts they are removing.
Having said that, their developer ToS are quite clear that they don't want researchers or developers to:
> aggregate Twitter user metrics such as total number of active users, accounts, total number of Periscope Broadcast views, user engagements or account engagements.
Ah, I long for the good old days when sockpuppets weren't a "problem" but a "feature", and a useful one at that.
a) Although Twitter tries to estimate the number of "bot accounts" in its quarterly earnings reports, it has no accountability in how it calculates those percentages. Therefore it can use them as a tool to manipulate DAU: if real DAU goes down, just change the bot numbers to make it look like DAU is up.
b) Sockpuppets generate activity from real humans. The whole point is that they push agendas, sow discord, etc. Sockpuppets wouldn't work without victims interacting with them. Therefore more sockpuppets = more human activity = more ad revenue & higher metrics for stock price
c) My speculation is that the sockpuppet problem is extremely understated, and an accurate estimate would amount to a higher percentage of Twitter's reported userbase than previously admitted. Therefore, accurately reporting the percentage would require Twitter to negatively adjust their real user numbers, which drive the stock price.
The central problem is that for every 1% of users Twitter defines as sockpuppets, it must lose 1% of "real" users. Therefore once Twitter makes a single estimate of the bot percentage (which it has done), increasing that percentage is dangerous for its stock price, because it means decreasing the percentage of real users, a key driver of the stock price.
To elaborate, it's trivial to make a classifier with 0 false positives... but also many false negatives (high precision, low recall). Similarly, we can have 0 false negatives but many false positives (low precision, high recall).
 - https://en.wikipedia.org/wiki/Precision_and_recall
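A toy example of that trade-off, with made-up numbers:

```python
# Toy illustration of the point above: precision and recall can each be
# gamed independently. Numbers are invented for illustration.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 4 positives, 6 negatives

# Conservative classifier: only flags the one case it's sure of.
conservative = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
# Liberal classifier: flags everything.
liberal =      [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

for name, y_pred in [("conservative", conservative), ("liberal", liberal)]:
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
# conservative: precision=1.00 recall=0.25  (0 false positives, many false negatives)
# liberal:      precision=0.40 recall=1.00  (0 false negatives, many false positives)
```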
Precision and recall can be useful for multi-class classification problems, but only if there's a "background" class for negative results. E.g. if you're evaluating a cancer detector, you might care about its precision when detecting any cancer vs. no cancer, but then the classifier has been reduced to a binary decision. It's also sometimes useful to focus on a certain class (e.g. skin cancer) to see whether the classifier has difficulties with that specific one, but again that's treating the multi-class classifier as a binary one.
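A sketch of what that looks like in practice, with invented labels and data: per-class (one-vs-rest) scores from scikit-learn, then the collapse to a binary "any cancer vs. none" decision:

```python
# Treating a multi-class classifier as binary per class, as described above.
# Labels and predictions are invented for illustration.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["none", "skin", "lung", "skin", "none", "lung", "skin", "none"]
y_pred = ["none", "skin", "skin", "skin", "none", "lung", "none", "none"]

# Per-class scores: each class is scored one-vs-rest, i.e. as a binary problem.
p, r, f, support = precision_recall_fscore_support(
    y_true, y_pred, labels=["skin", "lung", "none"], zero_division=0)
for label, pi, ri in zip(["skin", "lung", "none"], p, r):
    print(f"{label}: precision={pi:.2f} recall={ri:.2f}")

# Collapsing to "any cancer vs. none" reduces it to a true binary decision.
to_binary = lambda ys: ["cancer" if y != "none" else "none" for y in ys]
p2, r2, _, _ = precision_recall_fscore_support(
    to_binary(y_true), to_binary(y_pred), labels=["cancer"], zero_division=0)
print(f"any cancer: precision={p2[0]:.2f} recall={r2[0]:.2f}")
```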
If you look at the charts, their 3 techniques couldn't accurately disambiguate between two authors. If you like, whatever, use a lot of superfluous, unnecessary words, or whatever, you might get clustered into, like, the group of people who do that or whatever. But unless you have a really unique way of writing, I wouldn't worry too much.
IMO, after reading a bit on NLP, I think sensationalized headlines that the research doesn't live up to are far too common. I suppose this might be science in general: http://phdcomics.com/comics/archive_print.php?comicid=1174
We define metadata as the information available pertaining to a Twitter post. This is information that describes the context in which the post was shared. Apart from the 140 character message, each tweet contains about 144 fields of metadata. Each of these fields provides additional information about: the account from which it was posted; the post (e.g. time, number of views); other tweets contained within the message; various entities (e.g. hashtags, URLs, etc.); and the information of any users directly mentioned in it.
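For anyone unfamiliar with the tweet object, here is a rough sketch of what pulling a few of those fields out looks like; the field names are standard v1.1 tweet JSON, but the selection is mine, not the paper's feature set:

```python
# Flattening a (v1.1-era) tweet object into a handful of metadata values.
# Field names are standard API names; the chosen subset is illustrative only.
from typing import Any, Dict

def flatten_tweet(tweet: Dict[str, Any]) -> Dict[str, Any]:
    """Pull a few account-level and tweet-level fields out of a tweet dict."""
    user = tweet.get("user", {})
    return {
        # Account-level metadata (the contentious part in the discussion above).
        "account_created_at": user.get("created_at"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "statuses_count": user.get("statuses_count"),
        # Tweet-level metadata.
        "tweet_created_at": tweet.get("created_at"),
        "retweet_count": tweet.get("retweet_count"),
        "favorite_count": tweet.get("favorite_count"),
        "num_hashtags": len(tweet.get("entities", {}).get("hashtags", [])),
        "num_urls": len(tweet.get("entities", {}).get("urls", [])),
        "num_mentions": len(tweet.get("entities", {}).get("user_mentions", [])),
    }
```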
For data collection, we used the Twitter Streaming Public API (Twitter, Inc. 2018). Our population is a random sample of the tweets posted between October 2015 and January 2016 (inclusive). During this period we collected approximately 151,215,987 tweets corresponding to 11,668,319 users. However, for the results presented here we considered only users for which we collected more than 200 tweets. Our final dataset contains tweets generated by 5,412,693 users.
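The "more than 200 tweets" filter is the kind of step you can sketch in a few lines; `tweets` is assumed to be a list of tweet dicts from the streaming API, and the helper name is mine:

```python
# Keep only users with more than 200 collected tweets, as the paper describes.
# Illustrative only; not the authors' code.
from collections import Counter

def filter_active_users(tweets, min_tweets=200):
    counts = Counter(t["user"]["id"] for t in tweets)
    keep = {uid for uid, n in counts.items() if n > min_tweets}
    return [t for t in tweets if t["user"]["id"] in keep]
```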
All they did was exchange the username for the ACT (account creation time) as a unique identifier, and then found that it's only ~97% unique.
E.g. a user in the training set has metadata about account creation. Any test-set case would only need to look at the account creation date to identify the user.
The reality is, the inclusion of that field in the metadata means that identifying a user from metadata is trivial and not an interesting case for ML. In order to publish, they "degraded" the data until it was just interesting enough to be headline-worthy. Insulting.
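A deliberately dumb sketch of that point: once account creation time rides along with every tweet, "identification" reduces to a key lookup, no model needed (field names follow the v1.1 tweet JSON; the helper functions are mine):

```python
# "Deanonymization" as a dictionary lookup on account creation time.
# Purely illustrative; not the paper's method.
def build_index(training_tweets):
    # Map each account's creation timestamp to its (supposedly hidden) identity.
    return {t["user"]["created_at"]: t["user"]["id"] for t in training_tweets}

def identify(index, anonymous_tweet):
    # "Classify" a new tweet by looking up its account creation timestamp.
    return index.get(anonymous_tweet["user"]["created_at"])
```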
The researchers didn't find the users behind Twitter accounts, they stripped the usernames off the accounts and then reassociated them via usage patterns. So they did "unmask anonymous accounts" (because they anonymized the accounts). It's just a completely different result than the "find a real person" outcome the article implies.