

Some Fresh Twitter Stats (as of July 2012, Dataset Included) - ryannielsen
http://diegobasch.com/some-fresh-twitter-stats-as-of-july-2012

======
citricsquid
I would be really interested to see some analysis on the latest ~50m accounts
because I've noticed an insane amount of bot accounts recently. They all
follow the same pattern: Never tweet (or tweet once), follow ~2000 people and
are being followed by ~3 and have a bio + avatar. They're used by the Twitter
follower selling companies and there's an absolute metric fucktonne of these
accounts, some examples of such accounts:

<http://twitter.com/Yahairayqlcd> <http://twitter.com/Kenyetta992>
<http://twitter.com/Jade_482> <http://twitter.com/Mozella_nxi>
<http://twitter.com/Alane508>

It's becoming a really big problem: every day I come across accounts that have
bought followers* (a few belonging to HN users) and it's... disappointing.
Inevitable, but disappointing. I started putting together a website for
checking whether someone had bought their followers, but haven't found the
desire to finish it yet. There are so many that do it, purchasing anywhere
from 4,000 to 80,000 followers, which can be done for ~$300.

I've been reporting the people doing this to Twitter but nothing has been done
about it, ultimately it looks good for Twitter metrics so I doubt they care
that there are people with 80k fake followers, that's an extra 80k users for
their investors to salivate over.

As a side note, it's fun to pick out a random account (like those listed
above) and see who they're following, there are some people you'd not expect
to be purchasing followers that are. An alternate theory is that Twitter is
responsible (after all, how can they not detect these obvious bots?) but I
can't see why... well I can, but I don't think they would.

*I'm probably alone in this but people that buy followers really irk me because I like numbers to be accurate. The site I started making had a directory of people that bought followers, I'm wondering if I should finish it up, I figure it would be interesting for people to see that a large number of "media personalities" are just buying up their followers.

~~~
untog
_It's becoming a really big problem, every day I come across accounts that
have bought followers_

I am followed by users that match the pattern you're describing, but I assure
you I have never bought them. It's no coincidence that the accounts are of
attractive young women: they want reciprocal follows. I am almost always
followed shortly after tweeting, so I would venture to guess that they are
plugged into the Twitter API's "sample" feed, and sometimes I am unlucky
enough to be included in that sample.

~~~
citricsquid
I don't see any followers on your Twitter account that match this pattern. I
have a small number of followers (about 1,100) and I've never had an account
like the type I listed (avatar, bio, ~3 followers, ~2000 following) follow me.

The Twitter account I use for the website I work for has 10,000 followers,
gained mainly through our website (we have the "follow us" button on a highly
trafficked page). They are mainly Twitter "lurkers" and I can't find a single
one that matches the purchased-follower pattern I mentioned; you're welcome to
check yourself:
<http://twitter.com/i/#!/redstonewire/followers>

On top of that, for all the accounts I suspect of purchasing followers, I've
checked when they gained their followers: it's always within a few seconds
that they've gone from ~1,000 to 5,000, or 1,000 to 60,000. Here's an example
of one account I was monitoring; you can see the jump from 2,000 to 4,000 took
place at the same time: <http://i.imgur.com/jUMNj.jpg> and here's another that
went from 0 to 60k: <http://i.imgur.com/KwJZj.png>

I don't mean these sort of accounts in isolation: 1 follower matching the
pattern out of 100 is fine, but when a user has 10,000 followers and 9,900
match that pattern don't you think that's an indication of follower purchases?
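
The pattern described above (never or barely tweets, follows ~2000, followed by ~3, has a bio and avatar) can be written down as a rough filter. This is only a sketch: the `Profile` record, its field names, and the tolerance bands around "~2000" and "~3" are assumptions for illustration, not Twitter API fields or anyone's actual detector.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Profile:
    # Hypothetical minimal profile record; field names are illustrative only.
    tweets: int
    following: int
    followers: int
    has_bio: bool
    has_avatar: bool


def looks_purchased(p: Profile) -> bool:
    """Heuristic from the comment: 0-1 tweets, ~2000 following, ~3 followers, bio + avatar."""
    return (
        p.tweets <= 1
        and 1500 <= p.following <= 2500  # "~2000"; the band is an assumption
        and p.followers <= 10            # "~3"; the cutoff is an assumption
        and p.has_bio
        and p.has_avatar
    )


def suspicious_fraction(followers: Sequence[Profile]) -> float:
    """Share of an account's followers that match the pattern."""
    if not followers:
        return 0.0
    return sum(looks_purchased(p) for p in followers) / len(followers)
```

A single match means little; the signal is the fraction, e.g. 9,900 matches out of 10,000 followers.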

Another type I didn't mention is legitimate active users who have authorised
applications that then use that authorisation to generate follows on demand.
These are interesting but not relevant here, because they aren't accounts
created just to be followers, but legit accounts being used.

------
simonw
There's a flaw in this bit, which leads to the 530 million account estimate:

"The highest Twitter user id when I started the experiment was around 637M
(found by trial and error). I figured there would be gaps in user ids mostly
because of massive deletions of spammer accounts, and a quick sample estimated
the gaps to be on the order of 20%. So I generated 1.25M unique user ids in
the range 0-637M, and tried to fetch the profile details for them.

[...]

After fetching the 12,500 batches I was left with 1,039,556 Twitter profiles.
This means that there must exist approximately 530 million Twitter accounts:
83% of 637M."

The problem is that Twitter account IDs used to be sequential - every integer
would correspond to an account, unless that account had been deleted. Then in
2011 Twitter introduced the Snowflake update
<https://dev.twitter.com/docs/twitter-ids-json-and-snowflake> which changed
the way IDs were generated (for scaling reasons - it's much better to have
separate machines able to deal out IDs rather than rely on a single point of
failure).

This means that if you were to create a pool of random IDs between 1 and
637,000,000 you'll find that the IDs below a certain number (the highest ID at
the time snowflake kicked in) almost all correspond to an account, whereas the
IDs above that number have a much higher number of misses.

~~~
diego
Good catch. I just checked the data. The effect of Snowflake is not too high,
the error in the estimate because of this is perhaps 10M to 20M accounts. It's
almost in the noise considering Twitter's daily signup rate. Also, it may even
offset the fact that Twitter wouldn't return accounts created in the past ten
days through the API.

~~~
joshma
Can you explain what it means to "check the data"? The point here is that, for
the last year, Twitter has NOT been generating sequential IDs. Just look at
the source code[1]. They save a few low bits for the worker ID, data center,
and worker-specific sequence. The highest bits are the timestamp bits.

Maybe you can correct me if my assumptions based on the code are wrong, but my
guess is you'll magically find that Twitter's highest ID grows linearly with
time. Then you're just randomly sampling within this maximum timestamp. At
this point you're assuming that Twitter sees a constant rate of signups wrt
time, which I highly doubt.
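
To make the bit layout concrete, here is a minimal sketch of composing and splitting a Snowflake-style ID. The constants (5 worker bits, 5 datacenter bits, 12 sequence bits, and the custom epoch) are recalled from the open-source IdWorker and should be checked against the linked file; `make_id` and `split_id` are hypothetical helper names. Whether user IDs of that period actually follow this layout is the assumption being debated here.

```python
# Constants as recalled from the open-source Snowflake IdWorker (verify before relying on them).
TWEPOCH_MS = 1288834974657   # custom epoch, milliseconds since the Unix epoch
WORKER_BITS = 5
DATACENTER_BITS = 5
SEQUENCE_BITS = 12

WORKER_SHIFT = SEQUENCE_BITS
DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS
TIMESTAMP_SHIFT = SEQUENCE_BITS + WORKER_BITS + DATACENTER_BITS


def make_id(timestamp_ms: int, datacenter: int, worker: int, sequence: int) -> int:
    """Pack the fields: timestamp in the high bits, then datacenter, worker, sequence."""
    return (
        ((timestamp_ms - TWEPOCH_MS) << TIMESTAMP_SHIFT)
        | (datacenter << DATACENTER_SHIFT)
        | (worker << WORKER_SHIFT)
        | sequence
    )


def split_id(snowflake_id: int) -> dict:
    """Unpack an ID produced by make_id back into its fields."""
    return {
        "timestamp_ms": (snowflake_id >> TIMESTAMP_SHIFT) + TWEPOCH_MS,
        "datacenter": (snowflake_id >> DATACENTER_SHIFT) & ((1 << DATACENTER_BITS) - 1),
        "worker": (snowflake_id >> WORKER_SHIFT) & ((1 << WORKER_BITS) - 1),
        "sequence": snowflake_id & ((1 << SEQUENCE_BITS) - 1),
    }
```

Because the timestamp occupies the high bits, IDs of this shape grow with wall-clock time rather than with the number of accounts created, which is the concern raised above.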

[1]
[https://github.com/twitter/snowflake/blob/master/src/main/sc...](https://github.com/twitter/snowflake/blob/master/src/main/scala/com/twitter/service/snowflake/IdWorker.scala)

~~~
diego
What I did was to measure the difference in yield between uniformly generated
ids that would correspond to the time after Snowflake (when ids were at 380M)
and the ones before. It's true that the yield is less. It went down from about
86% to 82%.
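
As a rough illustration of the arithmetic, the sketch below computes both the pooled estimate quoted from the post and a version that weights the two ID ranges by their own yields. The inputs are the figures mentioned in this thread (637M highest id, 380M at the Snowflake switch, "about" 86% and 82%); the function names are made up for the example, and since the yields are rounded the stratified figure is only indicative.

```python
MAX_ID = 637_000_000           # highest id assumed in the experiment
SNOWFLAKE_START = 380_000_000  # id level when Snowflake was introduced, per the comment
SAMPLED = 1_250_000            # ids requested
FOUND = 1_039_556              # profiles returned


def pooled_estimate(found: int, sampled: int, max_id: int) -> float:
    """Overall hit rate times the size of the id range."""
    return found / sampled * max_id


def stratified_estimate(yield_before: float, yield_after: float,
                        boundary: int, max_id: int) -> float:
    """Apply each range's own hit rate to that range's width and add them up."""
    return yield_before * boundary + yield_after * (max_id - boundary)


pooled = pooled_estimate(FOUND, SAMPLED, MAX_ID)                       # about 529.8M
stratified = stratified_estimate(0.86, 0.82, SNOWFLAKE_START, MAX_ID)  # about 537.5M
```

With these rounded inputs the two estimates differ by several million accounts. Neither corrects for an underestimated highest id, which is the separate problem noted in the next paragraph.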

A separate problem is that my estimate of the highest id at the time of the
experiment likely fell short. Since then I've encountered higher ids that are
pretty sparse, but I don't know how many there are.

Do you have any ideas as to how to generate a better sequence of random ids
that tracks Twitter ids after Snowflake? I'd like to redo this experiment in a
while.

------
dewitt
By way of comparison, I ran the numbers back in 2009 using a similar sampling
technique:

<http://blog.unto.net/sampling-twitter.html>

At the time, I estimated roughly 1,200,000 active and connected users on
Twitter. This author currently estimates around 80M active users (as of
mid-2012), or roughly a 67x increase over three years.

Note that both samples considered "active" to be someone who posts, which is
quite a bit stricter than the (reasonable) definition of consumption that the
industry has been stabilizing on.

Neither of us knew how to account for spam/fake accounts, which must represent
some non-trivial part of the ecosystem (at least judging from the followers my
own dormant account continues to attract:
<http://twitter.com/#!/dewitt/followers>).

I found it interesting, though in hindsight not surprising at all, that the
average length of username is also going up over the years.

------
denzil_correa
I just downloaded your data set and have one question: are you retrieving all
the tweets of every user? Twitter allows you to retrieve up to 3200 tweets of
a user (if public) via pagination. You can download them to understand how
"active" they are for a much better analysis.

~~~
diego
That would require several million API calls. This sample was taken with only
12500 calls.

~~~
denzil_correa
Ah, no - it should be quite easy to achieve. There are two types of API calls
for the Twitter API (IP-based and OAuth). OAuth gives you 350 requests per
hour. It's quite easy and off-the-shelf packages are available. So, in
summary, you could use 10 Twitter accounts and make OAuth requests from each
of these accounts by polling.

1 hour = 3600 seconds. Assuming you use 3500 seconds to make API calls (the
remaining 100 seconds are used for performing disk operations), 10 accounts
should be more than sufficient.

:-)

~~~
diego
Your math is off. You could do 3500 requests per hour with 10 accounts. You'd
still need thousands of hours (i.e. a few months) because we are talking about
_millions_ of API calls.
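
The back-of-the-envelope calculation is short enough to write out. The 350 requests per hour per OAuth account and the 10 accounts come from the comments above; the total number of calls is a hypothetical input, since the thread only says "millions".

```python
def hours_needed(total_calls: int, accounts: int, per_account_per_hour: int = 350) -> float:
    """Wall-clock hours to issue total_calls when every account runs at its hourly limit."""
    return total_calls / (accounts * per_account_per_hour)


# Hypothetical workload: 3.5 million calls spread across 10 accounts.
hours = hours_needed(3_500_000, accounts=10)
days = hours / 24
```

With that assumed workload the result is 1000 hours, a bit under six weeks of continuous polling; larger call counts scale linearly.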

------
carleverett
Only 1 graph? Is Chrome broken for me? I thought we were going to get more
visual representations like the Twitter "follow" graph with 33 billion edges!

