
Reconstructing Twitter's Firehose - minimaxir
https://docs.google.com/document/d/1xVrPoNutyqTdQ04DXBEZW4ZW4A5RAQW2he7qIpTmG-M/edit
======
andrewguenther
This is a really neat approach! A friend of mine actually did his thesis
working with Netflix to analyze Twitter and determine if Netflix was down[1].
He took a much more brute-force approach, taking advantage of the fact that
his search space was limited: he re-assembled a "firehose" for a given search
term by hitting the search API from multiple machines and then
reconciling the streams and removing duplicates. This approach definitely
wouldn't scale to the entirety of the firehose, but I think that, in general,
interest in Twitter data revolves around more specific queries.

[1]
[https://users.soe.ucsc.edu/~eaugusti/pages/papers/docs/SPOON...](https://users.soe.ucsc.edu/~eaugusti/pages/papers/docs/SPOONS:%20Netflix%20Outage%20Detection%20Using%20Microtext%20Classification%20-%202013.pdf)

------
Reedx
> _Twitter’s firehose is a complete stream of all tweets made on their
> platform and is only available to a few businesses at an extraordinary
> price._

Anyone happen to know what that price is?

~~~
jaytaylor
Hundreds of thousands per month. At least that was the price in 2012.

~~~
dev_dull
Not only that, but you need to _prove_ you can handle the traffic.

~~~
jaytaylor
Not in my experience. Why would gnip / twitter care if you can handle it, as
long as you pay and abide by the terms of the contract? The burden to fetch
the data within the specified window is on the client, it's not a push system.

For context: Back in the day, I worked at Klout. We had to pay a crippling
monthly sum just to get access to the @mentions stream.

~~~
chrismeller
At one point I also read that you had to prove you could handle the traffic.
Whether that was a technical requirement, or just a way to make sure you knew
what you were doing so they wouldn't end up holding your hand all the time,
I'm not sure.

They don't specifically say you have to verify anything, but they do dance
around being able to handle the volume in a couple of different ways in their
streaming guides for us average mortals [1]. It mostly seems to come down to
the fact that they are, at least to some degree, buffering on their end to
make sure you don't miss anything if there's suddenly a spike that's too large
for you to handle or if your connection degrades for a while, etc.

It also wouldn't surprise me if it were originally simply a limitation of
their internal systems and they couldn't buffer or allow any kind of replay
because they were literally just writing everything off to a stream as it came
in and didn't have any mechanism for going back and reading from the actual
database after the fact.

1: [https://developer.twitter.com/en/docs/tweets/filter-
realtime...](https://developer.twitter.com/en/docs/tweets/filter-
realtime/guides/disconnections-explained)

------
stuck_in_matrix
I am the author of this document. If anyone has any questions, I'd be happy to
answer them!

~~~
the_arun
The timestamp is generated per server. The system time can differ across
nodes in the cluster (even with NTP) by nanoseconds. So isn't the
identification of the first tweet an approximation if there are multiple
tweets containing the same word "earthquake" at the same time at the ns
level? But I get the point.

~~~
talaketu
The Tweet object from the Twitter API has a "creation_time" with 1 s
resolution, whereas the snowflake creation time has 1 ms resolution. No doubt
these could disagree, but if that happened then maybe both authors could get a
prize?
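
For reference, the millisecond-resolution creation time can be pulled out of a tweet id directly; a minimal Python sketch, using Twitter's published snowflake epoch of 1288834974657 ms:

```python
# Extract the millisecond-resolution creation time from a snowflake tweet id.
# The top 41 bits hold milliseconds since Twitter's snowflake epoch.
TWITTER_EPOCH_MS = 1288834974657  # 2010-11-04 01:42:54.657 UTC

def snowflake_ms(tweet_id: int) -> int:
    """Creation time in Unix milliseconds, as encoded in the id itself."""
    return (tweet_id >> 22) + TWITTER_EPOCH_MS
```

This is the value you'd compare against the 1 s resolution timestamp field returned by the API.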

------
talaketu
Wonder how it fits with developer terms.

e.g. 5d

    
    
      Do not use, access or analyze the Twitter API to monitor or measure the
      availability, performance, functionality, usage statistics or results of
      Twitter Services or for any other benchmarking or competitive purposes,
      including without limitation, monitoring or measuring:
    
          the responsiveness of Twitter Services; or
          aggregate Twitter user metrics such as total number of active users,
          accounts, total number of Periscope Broadcast views, user engagements
          or account engagements.
    

[1] [https://developer.twitter.com/en/developer-
terms/agreement-a...](https://developer.twitter.com/en/developer-
terms/agreement-and-policy.html)

~~~
thinkloop
I reckon not very neatly.

------
identity_zero
This is proper "hacker news" right here

------
socketnaut
The statistical claims in the article make the assumption that tweets are
being sampled uniformly at random, which is most likely false.

The fact that 3 machines handle 20% of tweets suggests that tweets are not in
fact assigned to machines in a uniformly random manner. I would guess that
there is a geographic bias as to which machines handle which tweets.

~~~
stuck_in_matrix
When I did the analysis, I was puzzled why certain machines handle a higher
percentage of tweets compared to others -- so you are most likely correct that
there may be some geographic consideration to the distribution.

I'm rewriting the code to include a prescan of the time range to determine
which server ids are in play at the time and which server ids are most active.

Figuring out how to deconstruct Snowflake was challenging and there is still a
lot of analysis left to do.

~~~
rokob
> Figuring out how to deconstruct Snowflake was challenging and there is still
> a lot of analysis left to do.

Why don't you just read the code?

~~~
detaro
How do you read the code of an implementation detail of Twitter's servers?
There's no guarantee that the example code they released years ago still
matches what they use.

------
HelloFellowDevs
Discord also utilizes Twitter's Snowflake algorithm for the ridiculous number
of messages that are sent in chats.[1]

[1][https://discordapp.com/developers/docs/reference/](https://discordapp.com/developers/docs/reference/)

~~~
andreareina
Link doesn't work for me, I just get a partially-loaded page with an
eternally-spinning spinner (FF65 on OSX).

Gotta go to
[https://discordapp.com/developers/docs](https://discordapp.com/developers/docs)
and then click on the Reference link in the sidebar.

~~~
mscdex
If you remove the trailing forward slash the link works fine.

------
aboutruby
That's quite amazing!

One thing is that the data center ids are in the tweet ids, so it could be
used to get a rough location of Twitter users.

~~~
aboutruby
I see there is interest in this observation so I used a little ruby
(.to_s(2)[-22..-17].to_i(2)) to get the datacenter id.

Then ran it on a few Twitter accounts:
[https://pastebin.com/w8Dnj5kM](https://pastebin.com/w8Dnj5kM)

It does work, and it's going to be hard to patch.

edit: I realized you don't get just one location but the whole location
history of a Twitter user. It also allows locating Twitter's data centers,
since that doesn't seem to be public information.
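
For anyone who doesn't read Ruby, a Python sketch of the same bit extraction, assuming the commonly documented snowflake layout (5-bit datacenter id at bits 17-21, 5-bit server id at bits 12-16, 12-bit sequence at the bottom):

```python
def snowflake_fields(tweet_id: int) -> dict:
    # Low 22 bits below the 41-bit timestamp:
    #   bits 17-21: datacenter id, bits 12-16: server id, bits 0-11: sequence
    return {
        "datacenter": (tweet_id >> 17) & 0x1F,
        "server": (tweet_id >> 12) & 0x1F,
        "sequence": tweet_id & 0xFFF,
    }
```

Bit shifts and masks avoid the round-trip through a binary string that the Ruby one-liner does.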

~~~
stuck_in_matrix
This is really interesting. When I did the original analysis on datacenter /
server ids, I didn't think about correlation with user accounts. Nice
observation!

~~~
aboutruby
I was curious and just did that :)
[https://gist.github.com/localhostdotdev/48ed13972c3e5391a47f...](https://gist.github.com/localhostdotdev/48ed13972c3e5391a47f8e3dd7b9e0dd)
(small sample of ~1500 localized tweets)

------
edent
That's excellent stuff. I had no idea that's how snowflake IDs were
constructed.

Now it makes me wonder how hard it would be to make a tweet which linked to
itself...?

~~~
pomber
It's not that hard:
[https://twitter.com/mauritscorneIis](https://twitter.com/mauritscorneIis)

Code is here: [https://github.com/pomber/escher-
bot](https://github.com/pomber/escher-bot)

It's a pity that the Twitter UI doesn't embed the recursive tweet.

------
dandare
>Twitter does sell premium data services including their much coveted
“firehose” stream. Twitter’s firehose is a complete stream of all tweets made
on their platform and is only available to a few businesses at an
extraordinary price.

As a developer, I have no understanding of how data is monetized. How much is
user data worth? What kind of user data is worth more than other kinds? What
is the data used for (why would anyone pay an "extraordinary price" for
tweets)? Who pays for user data -- is there some public market that I am not
aware of?

~~~
throwfaraway113
Firehose access is prohibitively expensive. Last I heard, the cost was based
on something like "30% of your company's annual revenue" (this is a
second-hand rumour I heard when I asked).

Services like gnip have everything from the firehose (ever), and it can be
retrieved by paying a monthly fee (hundreds or thousands of dollars for 500k
tweets).

------
ohadron
To store such a firehose stream of data, you will need approximately 0.3gb of
storage per one second of data.

This is if you only collect username, timestamp and tweet, excluding any
additional metadata such as data center, likes and retweets, not to mention
images and videos.

Full calculation here: [https://docs.google.com/spreadsheets/d/1BIAguT9Qvy0GK-
dalpQf...](https://docs.google.com/spreadsheets/d/1BIAguT9Qvy0GK-
dalpQfkbzqtCv0pvNuxU3zbMiR-iQ/edit?usp=sharing)

~~~
keithwinstein
I believe your calculations are too high by a factor of 1000... 53
bytes/(typical tweet) * 6000 typical tweets/second is 318 KB (or 0.3 MB) per
second, not 318 MB.
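
The corrected arithmetic, spelled out:

```python
# Back-of-the-envelope storage estimate for username + timestamp + text only.
bytes_per_tweet = 53         # typical tweet size used in the spreadsheet
tweets_per_second = 6000     # rough firehose average
bytes_per_second = bytes_per_tweet * tweets_per_second  # 318,000 B/s ~ 0.3 MB/s
bytes_per_day = bytes_per_second * 86_400               # ~27.5 GB/day
```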

~~~
ohadron
Yikes, you are totally right. Fixed

------
cwkoss
Do the server and datacenter ID's effectively leak location data?

~~~
doomjunky
1) The leaked location is just the datacenter locality, e.g. west-coast USA
or northern Europe.

2) There is no guarantee that traffic IS NOT routed from another region to
handle traffic peaks.

------
cdoxsey
If you need access to Twitter data you should sign up for a developer account
and contact them. It's expensive but not prohibitively so for a company that
needs Twitter data to do business.

For academic research Twitter offers products which dramatically reduce the
amount of data you need to consume via the full search API or historical
powertrack.

~~~
m-p-3
I'm wondering if the Wayback Machine has access to it.

~~~
toomuchtodo
I would be _extremely interested_ in getting the Twitter corpus ingested into
the Internet Archive.

To my knowledge, the Archive does not have access to the firehose.

~~~
siculars
The Hadoop archive they recently moved to Google is 300PB. Good luck.

------
andy_ppp
This is really interesting: an unforeseen insight from ids that can be
enumerated, even partially. I suppose encrypting and decrypting them would be
too costly, so are there any other ideas for keeping the properties of
Snowflake at scale without this sort of attack being possible?

~~~
pas
If you wanted to "encrypt" them, you could increase the key space and add a
random salt to each one.

------
kickinthedoor
(Using figures cited in the article) Assuming that most of the time only 20
machines are responsible for ID generation, and ~50% of tweets use the first
available sequence number for a given millisecond (so half the time machines
only process one new tweet per millisecond), can we estimate Twitter's new
tweet QPS average to be 20,000?

Edit: actually this is an upper bound, because it assumes that every machine
processes a tweet every millisecond, which may not be the case.
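
The estimate as arithmetic, with all figures taken from the article's observations:

```python
machines = 20          # ID-generating machines active at a typical moment
ms_per_second = 1000
# If a machine mints at most one id per millisecond most of the time,
# throughput is bounded by one tweet per machine per millisecond:
upper_bound_tweets_per_second = machines * ms_per_second  # 20,000
```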

~~~
stuck_in_matrix
It's variable based on what's going on at the time but I've seen upwards of 7k
tweets a second for the sections of the timeline that I've ingested using this
technique.

Someone suggested trying it when the New Year starts in Japan. Apparently
there are tens of thousands of tweets per second then.

------
ed312
If Twitter is already storing the tweet timestamp, is there a reason to
generate sequential IDs? It seems like they could move to a UUID scheme and
protect their "firehose" from reverse-engineering.

~~~
RhodesianHunter
They're snowflake IDs, not _exactly_ sequential.

    timestamp bits, shifted all the way left
    data center id, shifted left
    server id, shifted left
    increment within this millisecond

Since the left bits are the timestamp, they are date/time sortable and a much
more useful index/primary key/whatever than a UUID.
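
A sketch of that layout in Python, assuming the commonly documented field widths (41-bit timestamp, 5-bit datacenter, 5-bit server, 12-bit sequence):

```python
TWITTER_EPOCH_MS = 1288834974657  # Twitter's snowflake epoch (2010-11-04 UTC)

def make_snowflake(ts_ms: int, datacenter: int, server: int, seq: int) -> int:
    # Timestamp occupies the high bits, so ids sort by creation time.
    return (((ts_ms - TWITTER_EPOCH_MS) & ((1 << 41) - 1)) << 22
            | (datacenter & 0x1F) << 17
            | (server & 0x1F) << 12
            | (seq & 0xFFF))
```

Because the timestamp sits in the high bits, any id minted in a later millisecond compares greater than every id from an earlier one, regardless of the other fields.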

~~~
thinkloop
You can prefix the uuid with a timestamp for the same effect without leakage.
There are chronological uuid projects: [https://github.com/uucid-
project/spec](https://github.com/uucid-project/spec)
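
A minimal sketch of the idea (not any particular spec; similar in spirit to ULID/UUIDv7): put a millisecond timestamp in the high bits and fill the rest with random bits, so values sort by time without leaking server identity.

```python
import os
import time
import uuid

def time_prefixed_uuid() -> uuid.UUID:
    """Time-sortable 128-bit id: 48-bit ms timestamp + 80 random bits."""
    ts = int(time.time() * 1000) & ((1 << 48) - 1)  # millisecond timestamp
    rand = int.from_bytes(os.urandom(10), "big")    # 80 random bits
    return uuid.UUID(int=(ts << 80) | rand)
```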

~~~
RhodesianHunter
You _can_, but why would you? You're advocating slapping lipstick on a pig
that wasn't engineered for this use case vs. using something that was
designed for it.

UInt64s are going to be more efficient all around.

~~~
thinkloop
You wouldn't, just showing that it's an easy problem to solve.

------
tbodt
Twitter will just randomize the machine IDs and datacenter IDs

~~~
_wmd
Doing this for historical data would break their API. They'd also need a more
complex scheme than snowflake currently provides

------
ninetax
Interesting! Great work. I wonder why they didn't just use uuid and have a
timestamp attribute to sort them in the UI?

------
thoughtstheseus
How can Twitter be a free speech platform if you cant receive the speech for
free?

~~~
oliveshell
Hence the distinction between “free as in speech” and “free as in beer”.

~~~
daveguy
Just referring to the speech part... Is the speech free as in speech if it
isn't free as in beer? If there is a gateway to prevent it from being freely
distributed? With GNU the free as in speech is also free as in beer -- if all
you want is the speech (aka code). If you want services around the code it
will cost when companies are trying to make a business on it. Correct me if
I'm wrong (and I probably am), but doesn't every additional dollar reduce free
as in speech because of reduced access?

~~~
pas
GNU/FSF supports selling code for money. Twitter does the same with tweets.

You can download your own tweets for free, or even someone else's tweets
freely; it's only doing it for everyone's tweets that costs money.

So, with GNU you _usually_ get both, but that's not necessary. And the
distinction is usually seen as meaningless: if there were a very useful GPL
open-source program that was not free to download and use, a simple
consortium of people could get together, buy one copy, and then redistribute
it for free. (Though they would need to do this for every future version.) --
which is similar to what the author proposes with regard to forming a group
of ~1,700 users to brute-force the firehose.

------
anigbrowl
This looks to considerably lower the pain of mapping retweet networks, thanks!

------
nspkr
>Twitter’s statuses lookup API endpoint allows for a total of 1,200 API calls
every 15 minutes. Each call allows the user to pass 100 ids for a total of
120,000 id requests every 15 minutes using both APP auth [...]

Use the secret consumer keys from Twitter to bypass these limits:
[https://gist.github.com/shobotch/5160017](https://gist.github.com/shobotch/5160017)
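
For scale, the quoted limits work out as follows (simple arithmetic on the figures in the article, without any extra keys):

```python
calls_per_window = 1200   # statuses lookup calls per 15-minute window
ids_per_call = 100
ids_per_window = calls_per_window * ids_per_call  # 120,000 ids / 15 min
windows_per_day = 4 * 24
ids_per_day = ids_per_window * windows_per_day    # 11,520,000 ids / day
```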

~~~
throwawaymath
I can't imagine these will last long now that they've been posted in a comment
on a front page HN story.

~~~
gberger
These are hardcoded in the apps. If they disable a key, they would need to
release a new version of the app (ok), and all users would need to update
(infeasible).

~~~
mrits
Wouldn't they just enforce the limit with those keys as well?

~~~
metildaa
Sounds like a solid reason not to build around a platform that can
arbitrarily kill your project or business. ActivityPub seems to be gaining
traction, without the problem of being beholden to a single entity.

~~~
jjeaff
A platform that can arbitrarily take down your business like a mobile app on
iPhone or Android?

------
rhcom2
Interesting technical write-up, but a little naive about its uses, as it is
probably against the Twitter Developer Agreement to use the API this way,
since it circumvents the rate limits.

~~~
tpetry
You can't circumvent the rate limit; that is the idea of a rate limit. The
article states this throughout.

The genius idea is simply shrinking down the search space with probabilistic
assumptions. I never thought about using the snowflake ids to get historical
data; I am impressed.

------
mttpgn
If sequence ID means what I think it means... then if I reply to a twitter
thread that already has 4095 replies... there could be the possibility of a
tweet ID collision, possibly causing another tweet made simultaneously to mine
to be unretrievable...

After years of waiting fruitlessly for Twitter to implement my feature
request, it may finally be possible to delete someone else's tweet.

~~~
marcinzm
The id of a tweet includes the timestamp. The sequence is just an incremental
number per server per millisecond. As far as I can see it's not based on
retweets.

~~~
mttpgn
Thanks for the quick explainer on what the sequence is. Reassuring to know
it's not the ordered number of tweets in a thread.

