Some Fresh Twitter Stats (as of July 2012, Dataset Included) (diegobasch.com)
54 points by ryannielsen on July 31, 2012 | 17 comments

There's a flaw in this bit, which leads to the 530 million account estimate:

"The highest Twitter user id when I started the experiment was around 637M (found by trial and error). I figured there would be gaps in user ids mostly because of massive deletions of spammer accounts, and a quick sample estimated the gaps to be on the order of 20%. So I generated 1.25M unique user ids in the range 0-637M, and tried to fetch the profile details for them.


After fetching the 12,500 batches I was left with 1,039,556 Twitter profiles. This means that there must exist approximately 530 million Twitter accounts: 83% of 637M."
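For reference, the arithmetic behind the quoted estimate can be reproduced with a short sketch. The function names are my own, and the batch size of 100 is only inferred from 1.25M ids being fetched in 12,500 batches; none of this is the article author's actual code.

```python
import random

def sample_ids(max_id, n, seed=None):
    """Draw n distinct candidate ids uniformly from [0, max_id)."""
    rng = random.Random(seed)
    return rng.sample(range(max_id), n)

def batches(ids, size=100):
    """Split ids into lookup batches (size 100 is inferred from 1.25M / 12,500)."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def estimate_accounts(max_id, sampled, found):
    """Scale the hit rate observed in the sample up to the whole id range."""
    return max_id * found / sampled

# Figures quoted from the article.
MAX_ID = 637_000_000
SAMPLED = 1_250_000
FOUND = 1_039_556

hit_rate = FOUND / SAMPLED
estimate = estimate_accounts(MAX_ID, SAMPLED, FOUND)
print(f"hit rate {hit_rate:.1%}, estimate {estimate / 1e6:.0f}M accounts")
```

The estimate is only as good as the assumption that MAX_ID really is the top of the id range.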

The problem is that Twitter account IDs used to be sequential - every integer would correspond to an account, unless that account had been deleted. Then in 2011 Twitter introduced the Snowflake update https://dev.twitter.com/docs/twitter-ids-json-and-snowflake which changed the way IDs were generated (for scaling reasons - it's much better to have separate machines able to deal out IDs rather than rely on a single point of failure).

This means that if you create a pool of random IDs between 1 and 637,000,000, you'll find that the IDs below a certain number (the highest ID at the time Snowflake kicked in) almost all correspond to an account, whereas the IDs above that number have a much higher rate of misses.

Good catch. I just checked the data. The effect of Snowflake is not too large: the error it introduces in the estimate is perhaps 10M to 20M accounts. That's almost in the noise considering Twitter's daily signup rate. It may even offset the fact that Twitter wouldn't return accounts created in the past ten days through the API.

Can you explain what it means to "check the data"? The point here is that, for the last year, Twitter has NOT been generating sequential IDs. Just look at the source code[1]. They save a few low bits for the worker ID, data center, and worker-specific sequence. The highest bits are the timestamp bits.

Maybe you can correct me if my assumptions based on the code are wrong, but my guess is you'll magically find that Twitter's highest ID grows linearly with time. Then you're just randomly sampling within this maximum timestamp. At this point you're assuming that Twitter sees a constant rate of signups wrt time, which I highly doubt.

[1] https://github.com/twitter/snowflake/blob/master/src/main/sc...
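To make the layout concrete, here is a minimal sketch of packing and unpacking a Snowflake-style ID. The bit widths (12 sequence, 5 worker, 5 data center, timestamp on top) and the epoch constant are written from my recollection of the public Snowflake source, so treat them as assumptions and check the repository before relying on them.

```python
# Assumed constants, recalled from the public Snowflake source (verify before use).
TWEPOCH_MS = 1288834974657   # custom epoch, in milliseconds
SEQUENCE_BITS = 12
WORKER_BITS = 5
DATACENTER_BITS = 5

WORKER_SHIFT = SEQUENCE_BITS
DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_BITS + DATACENTER_BITS

def encode_snowflake(timestamp_ms, datacenter, worker, sequence):
    """Pack the four fields into one integer, timestamp in the high bits."""
    return (((timestamp_ms - TWEPOCH_MS) << TIMESTAMP_SHIFT)
            | (datacenter << DATACENTER_SHIFT)
            | (worker << WORKER_SHIFT)
            | sequence)

def decode_snowflake(snowflake_id):
    """Split an ID back into its timestamp, data center, worker and sequence."""
    return {
        "timestamp_ms": (snowflake_id >> TIMESTAMP_SHIFT) + TWEPOCH_MS,
        "datacenter": (snowflake_id >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        "worker": (snowflake_id >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1),
        "sequence": snowflake_id & ((1 << SEQUENCE_BITS) - 1),
    }

example = encode_snowflake(1343692800000, datacenter=1, worker=7, sequence=2)
print(decode_snowflake(example))
```

Because the timestamp sits in the high bits, IDs generated this way grow with wall-clock time rather than with the number of objects created.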

What I did was to measure the difference in yield between uniformly generated ids that would correspond to the time after Snowflake (when ids were at 380M) and the ones before. It's true that the yield is less. It went down from about 86% to 82%.
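A rough sketch of that per-range comparison, using only the figures mentioned in this thread: the split at 380M, the 637M ceiling, and yields of about 86% and 82%. Those yields are rounded, so the total is indicative only and will not line up exactly with the 530M headline figure; the function name is my own.

```python
def stratified_estimate(strata):
    """Sum (range width * hit rate) over id ranges measured separately."""
    return sum((high - low) * hit_rate for low, high, hit_rate in strata)

# (low id, high id, approximate yield) as quoted in the thread.
strata = [
    (0, 380_000_000, 0.86),            # ids issued before the change
    (380_000_000, 637_000_000, 0.82),  # ids issued after it
]

total = stratified_estimate(strata)
print(f"about {total / 1e6:.1f}M accounts")
```

If the true maximum id is higher than 637M, a third, sparser stratum would have to be added, and its yield is unknown here.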

A separate problem is that my estimate of the highest id at the time of the experiment likely fell short. Since then I've encountered higher ids that are pretty sparse, but I don't know how many there are.

Do you have any ideas as to how to generate a better sequence of random ids that tracks Twitter ids after Snowflake? I'd like to redo this experiment in a while.

As far as I know, Snowflake is used to generate tweet IDs. Do you have any reason to believe that it's being used to generate user IDs?

As evidence against this, note that the ID density currently being produced for tweets is very, very low: when I collect from the sprinkler, I rarely get a tweet ID where the sequence number portion is much higher than 2 or 3. That is, almost all of the 12 bits of sequence number are 0, almost all the time. So the ID density for tweets is well under 1%. (And the sampling method Twitter is using to produce the sprinkler doesn't have an effect here, as far as I can work out.)

If Snowflake were being used to generate user IDs, I'd expect even lower ID densities. This doesn't jibe at all with 86% (or 82%) or anywhere close to it.
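The density argument is simple arithmetic. As a sketch: with 12 sequence bits there are 4096 slots per worker per timestamp tick, so if the largest sequence number ever observed is around 3, at most 4 of those slots are in use. This is only an upper bound on density, since it ignores timestamp ticks and workers that issue no IDs at all.

```python
SEQUENCE_BITS = 12  # as stated in the comment above

def sequence_density(max_sequence_seen, sequence_bits=SEQUENCE_BITS):
    """Upper bound on the fraction of per-tick sequence slots actually used."""
    used = max_sequence_seen + 1  # sequence numbers start at 0
    return used / (1 << sequence_bits)

print(f"{sequence_density(3):.2%} of sequence slots used, at most")
```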

I would be really interested to see some analysis on the latest ~50m accounts because I've noticed an insane amount of bot accounts recently. They all follow the same pattern: Never tweet (or tweet once), follow ~2000 people and are being followed by ~3 and have a bio + avatar. They're used by the Twitter follower selling companies and there's an absolute metric fucktonne of these accounts, some examples of such accounts:

http://twitter.com/Yahairayqlcd
http://twitter.com/Kenyetta992
http://twitter.com/Jade_482
http://twitter.com/Mozella_nxi
http://twitter.com/Alane508

It's becoming a really big problem: every day I come across accounts that have bought followers* (a few belonging to HN users) and it's... disappointing. Inevitable, but disappointing. I started putting together a website for checking whether someone had bought their followers, but I haven't found the desire to finish it yet. There are so many who do it, purchasing anywhere from 4,000 to 80,000 followers, which can be done for ~$300.

I've been reporting the people doing this to Twitter but nothing has been done about it, ultimately it looks good for Twitter metrics so I doubt they care that there are people with 80k fake followers, that's an extra 80k users for their investors to salivate over.

As a side note, it's fun to pick out a random account (like those listed above) and see who they're following, there are some people you'd not expect to be purchasing followers that are. An alternate theory is that Twitter is responsible (after all, how can they not detect these obvious bots?) but I can't see why... well I can, but I don't think they would.

*I'm probably alone in this but people that buy followers really irk me because I like numbers to be accurate. The site I started making had a directory of people that bought followers, I'm wondering if I should finish it up, I figure it would be interesting for people to see that a large number of "media personalities" are just buying up their followers.

Quick plug for a co-worker who presented some of our findings on this topic last week at Security BSides in Las Vegas. http://www.youtube.com/watch?v=6evQ8fU49Zg&list=UU4PBNDL... The economics of it are fascinating, and honestly I don't know what people think they're going to gain from it, as we haven't seen much correlation between buying followers and an actual increase in the number of real followers. Everyone from wannabe musicians to politicians is doing it, though.

It's becoming a really big problem, every day I come across accounts that have bought followers

I am followed by users that match the pattern you're describing, but I assure you I have never bought them. It's no coincidence that the accounts are of attractive young women: they want reciprocal follows. I am almost always followed shortly after tweeting, so I would venture to guess that they are plugged into the Twitter API's "sample" feed, and sometimes I am unlucky enough to be included in that sample.

I don't see any followers on your Twitter account that match this pattern. I have a small number of followers (about 1,100) and I've never had an account like the type I listed (avatar, bio, ~3 followers, ~2000 following) follow me.

The Twitter account I use for the website I work for has 10,000 followers, most of whom followed through our website (we have a "follow us" button on a high-traffic page). They are mainly Twitter "lurkers", and I can't find a single one that matches the purchased-follower pattern I mentioned. You're welcome to check for yourself: http://twitter.com/i/#!/redstonewire/followers

On top of that, for all the accounts I suspect of purchasing followers, I've checked when they gained their followers: it's always within a few seconds that they've gone from ~1,000 to 5,000, or from 1,000 to 60,000. Here's an example of one account I was monitoring, where you can see that the jump from 2,000 to 4,000 took place all at once: http://i.imgur.com/jUMNj.jpg and here's another that went from 0 to 60k: http://i.imgur.com/KwJZj.png

I don't mean these sort of accounts in isolation: 1 follower matching the pattern out of 100 is fine, but when a user has 10,000 followers and 9,900 match that pattern don't you think that's an indication of follower purchases?
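To make the pattern checkable, here is a minimal sketch of the heuristic described in this thread (never tweets or tweets once, follows ~2000, followed by ~3, has a bio and an avatar). The Profile type and every threshold below are hypothetical choices of mine, not a tested classifier, and nothing here calls the Twitter API.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    tweets: int
    following: int
    followers: int
    has_bio: bool
    has_avatar: bool

def looks_purchased(p):
    """True if a profile matches the described pattern (thresholds are guesses)."""
    return (p.tweets <= 1
            and 1500 <= p.following <= 2500   # "follows ~2000 people"
            and p.followers <= 10             # "followed by ~3"
            and p.has_bio and p.has_avatar)

def suspicious_fraction(followers):
    """Share of an account's followers that match the pattern."""
    if not followers:
        return 0.0
    return sum(looks_purchased(p) for p in followers) / len(followers)

# Made-up follower list: 99 pattern matches and one ordinary user.
fake = Profile(tweets=0, following=2000, followers=3, has_bio=True, has_avatar=True)
real = Profile(tweets=540, following=310, followers=1100, has_bio=True, has_avatar=True)
print(f"{suspicious_fraction([fake] * 99 + [real]):.0%} of followers match")
```

A single match means little; it is the fraction across a follower list that carries the signal.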

Another type I didn't mention is legitimate, active users who have authorised applications that then use this auth to generate follows on demand. These are interesting but not relevant here, because they're not accounts created just to be followers: they're legit accounts being used.

Same here. But as soon as they follow me (well, when I check) I mark them as spam and Twitter deletes them from my "Followers" list. I think I'd have 3 times as many followers if I left them there.

I never see these accounts when using the Twitter website, but always when I'm using Tweetbot.

I'm guessing there's a layer in between that clears it up for Twitter's own tools, and the third-party tools are left to clean up the stream themselves.

(On the other hand, Tweetbot has no in-stream ads, which would mean both of these things are happening in the same "post-processing" layer.)

By way of comparison, I ran the numbers back in 2009 using a similar sampling technique:


At the time, I estimated roughly 1,200,000 active and connected users on Twitter. This author currently estimates around 80M active users (as of mid-2012), or roughly a 67x increase over three years.

Note that both samples considered "active" to be someone who posts, which is quite a bit stricter than the (reasonable) definition of consumption that the industry has been stabilizing on.

Neither of us knew how to account for spam/fake accounts, which must represent some non-trivial part of the ecosystem (at least judging from the followers my own dormant account continues to attract: http://twitter.com/#!/dewitt/followers).

I found it interesting, though in hindsight not surprising at all, that the average length of username is also going up over the years.

I just downloaded your data set and have one particular observation: are you retrieving all the tweets of every user? Twitter allows you to retrieve up to 3200 tweets of a user (if public) via pagination. You could download them to understand how "active" each user is, for a much better analysis.

That would require several million API calls. This sample was taken with only 12500 calls.

Ah, no, it should be quite easy to achieve. There are two types of calls to the Twitter API (IP-based and OAuth). OAuth gives you 350 requests per hour. It's quite easy, and off-the-shelf packages are available. So, in summary, you could use 10 Twitter accounts and make OAuth requests from each of them by polling.

1 hour = 3600 seconds. Assuming you use 3500 seconds to make API calls (the remaining 100 seconds going to disk operations), 10 accounts should be more than sufficient.


Your math is off. You could do 3500 requests per hour with 10 accounts. You'd still need thousands of hours (i.e. a few months) because we are talking about millions of API calls.
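The throughput arithmetic can be sketched as follows. The 350 requests per hour and 10 accounts come from the comments above; the call totals are placeholders of mine for "millions of API calls", since the exact number isn't given.

```python
def hours_needed(total_calls, accounts=10, calls_per_hour=350):
    """Wall-clock hours to make total_calls at the combined hourly limit."""
    return total_calls / (accounts * calls_per_hour)

# Placeholder totals standing in for "millions of API calls".
for total_calls in (3_000_000, 5_000_000, 10_000_000):
    hours = hours_needed(total_calls)
    print(f"{total_calls:>10,} calls: {hours:,.0f} hours (~{hours / 24:.0f} days)")
```

Adding accounts shortens this linearly, so the real limit is how many accounts you are willing to run.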

Only 1 graph? Is Chrome broken for me? I thought we were going to get more visual representations like the Twitter "follow" graph with 33 billion edges!
