"The highest Twitter user id when I started the experiment was around 637M (found by trial and error). I figured there would be gaps in user ids mostly because of massive deletions of spammer accounts, and a quick sample estimated the gaps to be on the order of 20%. So I generated 1.25M unique user ids in the range 0-637M, and tried to fetch the profile details for them.
After fetching the 12,500 batches I was left with 1,039,556 Twitter profiles. This means that there must exist approximately 530 million Twitter accounts: 83% of 637M."
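For reference, the arithmetic behind the quoted estimate can be reproduced in a few lines. This is only a sketch of the numbers given in the quote. The confidence margin is an addition that assumes the ids were drawn uniformly at random and treats each lookup as an independent trial; it covers sampling error only, not the systematic issues discussed in the replies.

```python
import math

# Figures taken from the quoted post
sampled_ids = 1_250_000      # unique ids generated
batches = 12_500             # lookup batches fetched
profiles_found = 1_039_556   # profiles actually returned
max_id = 637_000_000         # highest id found by trial and error

ids_per_batch = sampled_ids // batches       # ids requested per batch
hit_rate = profiles_found / sampled_ids      # fraction of sampled ids that exist
estimate = hit_rate * max_id                 # scaled up to the whole id range

# Assumption: binomial standard error for a simple random sample
std_err = math.sqrt(hit_rate * (1 - hit_rate) / sampled_ids)
margin = 1.96 * std_err * max_id             # approximate 95% margin, in accounts

print(f"ids per batch: {ids_per_batch}")
print(f"hit rate: {hit_rate:.2%}")
print(f"estimated accounts: {estimate:,.0f} +/- {margin:,.0f}")
```

With a sample this large the sampling margin is well under one million accounts, so any serious error in the estimate would have to come from the choice of id range, not from the sample size.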
The problem is that Twitter account IDs used to be sequential - every integer would correspond to an account, unless that account had been deleted. Then in 2011 Twitter introduced the Snowflake update https://dev.twitter.com/docs/twitter-ids-json-and-snowflake which changed the way IDs were generated (for scaling reasons - it's much better to have separate machines able to deal out IDs rather than rely on a single point of failure).
This means that if you were to create a pool of random IDs between 1 and 637,000,000 you'll find that the IDs below a certain number (the highest ID at the time snowflake kicked in) almost all correspond to an account, whereas the IDs above that number have a much higher number of misses.
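One way to check this against the sampled data would be to split the id range into buckets and compare hit rates per bucket. The sketch below is hypothetical: `hit_rate_by_bucket` and `stratified_estimate` are made-up helper names, and the sample data is synthetic, not taken from the experiment.

```python
from collections import defaultdict

def hit_rate_by_bucket(samples, bucket_width):
    """Group (user_id, found) pairs into fixed-width id buckets.

    Returns {bucket_start: (hits, total)} ordered by bucket start.
    """
    counts = defaultdict(lambda: [0, 0])
    for user_id, found in samples:
        start = (user_id // bucket_width) * bucket_width
        counts[start][1] += 1          # one more id sampled in this bucket
        if found:
            counts[start][0] += 1      # the id resolved to a profile
    return {start: tuple(c) for start, c in sorted(counts.items())}

def stratified_estimate(buckets, bucket_width):
    """Sum of (bucket width x bucket hit rate); unsampled buckets contribute nothing."""
    return sum(bucket_width * hits / total for hits, total in buckets.values())

# Synthetic example: a dense low range and a sparse high range
samples = [
    (5, True), (15, True), (25, True), (35, False),
    (105, True), (115, False), (125, False), (135, False),
]
buckets = hit_rate_by_bucket(samples, bucket_width=100)
for start, (hits, total) in buckets.items():
    print(f"ids {start}-{start + 99}: {hits}/{total} found")
print("estimated accounts:", stratified_estimate(buckets, 100))
```

If the hit rate drops sharply at some bucket boundary, that boundary is a candidate for the point where id allocation changed.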
Maybe you can correct me if my assumptions based on the code are wrong, but my guess is you'll magically find that Twitter's highest ID grows linearly with time. Then you're just randomly sampling within this maximum timestamp. At this point you're assuming that Twitter sees a constant rate of signups wrt time, which I highly doubt.
A separate problem is that my estimate of the highest id at the time of the experiment likely fell short. Since then I've encountered higher ids that are pretty sparse, but I don't know how many there are.
Do you have any ideas as to how to generate a better sequence of random ids that tracks Twitter ids after Snowflake? I'd like to redo this experiment in a while.
As evidence against this, note that the ID density currently being produced for tweets is very, very low: when I collect from the sprinkler, I rarely get a tweet ID where the sequence number portion is much higher than 2 or 3. That is, almost all of the 12 bits of sequence number are 0, almost all the time. So the ID density for tweets is well under 1%. (And the sampling method Twitter is using to produce the sprinkler doesn't have an effect here, as far as I can work out.)
If Snowflake were being used to generate user IDs, I'd expect even lower ID densities. This doesn't jibe at all with 86% (or 82%) or anywhere close to it.
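To make the sequence-number argument concrete, here is a minimal sketch of pulling apart a Snowflake id. It assumes the published layout (41 bits of millisecond timestamp, 10 bits of machine id, 12 bits of sequence) and the Twitter epoch of 1288834974657 ms. `decode_snowflake` is a hypothetical helper name, and the id in the example is constructed for illustration, not a real tweet id.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake custom epoch, 2010-11-04 UTC

def decode_snowflake(snowflake_id):
    """Split a Snowflake id into its timestamp, machine id and sequence parts."""
    timestamp_ms = (snowflake_id >> 22) + TWITTER_EPOCH_MS
    machine_id = (snowflake_id >> 12) & 0x3FF   # 10 bits
    sequence = snowflake_id & 0xFFF             # 12 bits, resets every millisecond
    return {
        "timestamp_ms": timestamp_ms,
        "created_at": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "machine_id": machine_id,
        "sequence": sequence,
    }

# Constructed example id: 123456 ms after the epoch, machine 37, sequence 2
example_id = (123456 << 22) | (37 << 12) | 2
parts = decode_snowflake(example_id)
print(parts)

# If observed sequence numbers never exceed 3, at most 4 of the 4096
# possible ids per machine per millisecond are in use.
max_density = 4 / 4096
print(f"upper bound on id density: {max_density:.2%}")
```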
It's becoming a really big problem. Every day I come across accounts that have bought followers* (a few belonging to HN users) and it's... disappointing. Inevitable, but disappointing. I started putting together a website for checking whether someone had bought their followers but haven't found the desire to finish it yet. So many people do it, purchasing anywhere from 4,000 to 80,000 followers, which can be done for ~$300.
I've been reporting the people doing this to Twitter, but nothing has been done about it. Ultimately it looks good for Twitter's metrics, so I doubt they care that there are people with 80k fake followers; that's an extra 80k users for their investors to salivate over.
As a side note, it's fun to pick out a random account (like those listed above) and see who they're following, there are some people you'd not expect to be purchasing followers that are. An alternate theory is that Twitter is responsible (after all, how can they not detect these obvious bots?) but I can't see why... well I can, but I don't think they would.
*I'm probably alone in this but people that buy followers really irk me because I like numbers to be accurate. The site I started making had a directory of people that bought followers, I'm wondering if I should finish it up, I figure it would be interesting for people to see that a large number of "media personalities" are just buying up their followers.
I am followed by users that match the pattern you're describing, but I assure you I have never bought them. It's no coincidence that the accounts are of attractive young women: they want reciprocal follows. I am almost always followed shortly after tweeting, so I would venture to guess that they are plugged into the Twitter API's "sample" feed, and sometimes I am unlucky enough to be included in that sample.
The Twitter account I use for the website I work for has 10,000 followers, gained mainly through our website (we have the "follow us" button on a high-traffic page). They are mostly Twitter "lurkers", and I can't find a single one that matches the purchased-followers pattern I mentioned. You're welcome to check yourself: http://twitter.com/i/#!/redstonewire/followers
On top of that, I've checked when the accounts I suspect of purchasing followers gained them: it's always within a few seconds that they've gone from ~1,000 to 5,000, or 1,000 to 60,000. Here's an example of one account I was monitoring; you can see the jump from 2,000 to 4,000 took place at the same time: http://i.imgur.com/jUMNj.jpg and here's another that went from 0 to 60k: http://i.imgur.com/KwJZj.png
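If you log follower counts over time, spotting this kind of jump is easy to automate. The following is a rough sketch only: `find_follower_jumps` is a hypothetical helper, the thresholds are arbitrary assumptions, and the observations are made-up numbers, not data from the accounts linked above.

```python
def find_follower_jumps(observations, min_gain=1000, max_seconds=60):
    """Flag consecutive observations where followers rose sharply in a short time.

    observations: list of (unix_seconds, follower_count), sorted by time.
    Returns a list of (start_time, end_time, gain) tuples.
    """
    jumps = []
    for (t0, c0), (t1, c1) in zip(observations, observations[1:]):
        gain = c1 - c0
        if t1 - t0 <= max_seconds and gain >= min_gain:
            jumps.append((t0, t1, gain))
    return jumps

# Made-up polling log: one sample every 30 seconds
log = [(0, 2000), (30, 2003), (60, 4000), (90, 4002)]
print(find_follower_jumps(log))
```

A legitimate account can also gain followers quickly, for example after a mention by a popular user, so a flagged jump is a reason to look closer and not proof of a purchase.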
I don't mean these sorts of accounts in isolation: 1 follower matching the pattern out of 100 is fine, but when a user has 10,000 followers and 9,900 match that pattern, don't you think that's an indication of follower purchases?
Another type I didn't mention is legitimate, active users who have authorised applications that then use that authorisation to generate follows on demand. These are interesting but not relevant here, because they're not accounts created just to be followers; they're legit accounts being used.
I'm guessing there's a layer in between that clears it up for Twitter's own tools, and the third-party tools are left to clean up the stream themselves.
(On the other hand, Tweetbot has no in-stream ads, which means both of these things are happening in the same "post-processing" layer.)
At the time, I estimated roughly 1,200,000 active and connected users on Twitter. This author currently estimates around 80M active users (as of mid-2012), or a roughly 67x increase over three years.
Note that both samples considered "active" to be someone who posts, which is quite a bit stricter than the (reasonable) definition of consumption that the industry has been stabilizing on.
Neither of us knew how to account for spam/fake accounts, which must represent some non-trivial part of the ecosystem (at least judging from the followers my own dormant account continues to attract: http://twitter.com/#!/dewitt/followers).
I found it interesting, though in hindsight not surprising at all, that the average length of username is also going up over the years.
1 hour = 3600 seconds
Assuming you use 3500 seconds to make API calls (the remaining 100 seconds are used for disk operations), 10 accounts should be more than sufficient.