Reconstructing Twitter's Firehose (docs.google.com)
388 points by minimaxir 24 days ago | 102 comments



This is a really neat approach! A friend of mine actually did his thesis working with Netflix to analyze Twitter and determine whether Netflix was down[1]. He took a much more brute-force approach: since his search space was limited to a given search term, he re-assembled a "firehose" for that term by hitting the search API from multiple machines and then reconciling the streams and removing duplicates. This approach definitely wouldn't scale to the entirety of the firehose, but I think that, in general, interest in Twitter data revolves around more specific queries.

[1] https://users.soe.ucsc.edu/~eaugusti/pages/papers/docs/SPOON...


> Twitter’s firehose is a complete stream of all tweets made on their platform and is only available to a few businesses at an extraordinary price.

Anyone happen to know what that price is?


Hundreds of thousands per month. At least that was the price in 2012.


Not only that, but you need to prove you can handle the traffic.


Not in my experience. Why would gnip / twitter care if you can handle it, as long as you pay and abide by the terms of the contract? The burden to fetch the data within the specified window is on the client, it's not a push system.

For context: Back in the day, I worked at Klout. We had to pay a crippling monthly sum just to get access to the @mentions stream.


At one point I also read that you had to prove you could handle the traffic. Whether that was a technical requirement, or just a way to make sure you knew what you were doing so they wouldn't end up holding your hand all the time, I'm not sure.

They don't specifically say you have to verify anything, but they do dance around being able to handle the volume in a couple of different ways in their streaming guides for us average mortals [1]. It mostly seems to come down to the fact that they are, at least to some degree, buffering on their end to make sure you don't miss anything if there's suddenly a spike that's too large for you to handle or if your connection degrades for a while, etc.

It also wouldn't surprise me if it were originally simply a limitation of their internal systems and they couldn't buffer or allow any kind of replay because they were literally just writing everything off to a stream as it came in and didn't have any mechanism for going back and reading from the actual database after the fact.

1: https://developer.twitter.com/en/docs/tweets/filter-realtime...


This isn't true.


I am the author of this document. If anyone has any questions, I'd be happy to answer them!


This assumes that sequence IDs are handed out sequentially, and further that gaps in the sequence indicate deleted tweets.

You could reduce the sequence ID space by exploiting the fact that it is overall less likely that a tweet is deleted than not, and that a run of two CONSECUTIVE deleted tweets is even less likely.

E.g. if you find sequence ID 0 but don't find 1, 2, 3, then you can probably skip 4, 5, 6, 7. I don't know the probability that a tweet is deleted, but I am sure someone could calculate it to determine the 99% threshold.
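A minimal sketch of that skip heuristic in Python, assuming a hypothetical per-tweet deletion probability P_DELETE (the real value would have to be measured) and an id_exists() lookup standing in for a batched statuses/lookup call:

    P_DELETE = 0.05      # assumed chance that an issued ID belongs to a deleted tweet
    CONFIDENCE = 0.99    # stop scanning once we're this sure the slot is exhausted
    MAX_SEQUENCE = 4096  # 12-bit sequence field

    def scan_slot(id_exists):
        # Yield the sequence numbers that resolve for one (timestamp, server) slot,
        # stopping early once a run of misses is better explained by "no more tweets
        # this millisecond" than by consecutive deletions.
        misses = 0
        for seq in range(MAX_SEQUENCE):
            if id_exists(seq):
                misses = 0
                yield seq
            else:
                misses += 1
                # k consecutive missing IDs require k consecutive deletions,
                # which happens with probability P_DELETE ** k.
                if P_DELETE ** misses < 1 - CONFIDENCE:
                    break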


The timestamp is generated per server. The system time could differ across nodes in the cluster (even with NTP) by nanoseconds. So isn't identifying the first tweet only an approximation if there are multiple tweets containing the same word "earthquake" at the same time at the ns level? But I get the point.


The tweet object from the Twitter API has a "created_at" field with 1s resolution, whereas the snowflake creation time has 1ms resolution. No doubt these could disagree, but if that happened then maybe both authors could get a prize?


I wonder how this fits with the developer terms.

e.g. 5d:

  Do not use, access or analyze the Twitter API to monitor or measure the availability, performance, functionality, usage statistics or results of Twitter Services or for any other benchmarking or competitive purposes, including without limitation, monitoring or measuring:

      the responsiveness of Twitter Services; or
      aggregate Twitter user metrics such as total number of active users, accounts, total number of Periscope Broadcast views, user engagements or account engagements.
[1] https://developer.twitter.com/en/developer-terms/agreement-a...


I reckon not very neatly.


This is proper "hacker news" right here


The statistical claims in the article make the assumption that tweets are being sampled uniformly at random, which is most likely false.

The fact that 3 machines handle 20% of tweets suggests that tweets are not in fact assigned to machines in a uniformly random manner. I would guess that there is a geographic bias as to which machines handle which tweets.


When I did the analysis, I was puzzled why certain machines handle a higher percentage of tweets compared to others -- so you are most likely correct that there may be some geographic consideration to the distribution.

I'm rewriting the code to include a prescan of the time range to determine which server ids are in play at the time and which server ids are most active.

Figuring out how to deconstruct Snowflake was challenging and there is still a lot of analysis left to do.


> Figuring out how to deconstruct Snowflake was challenging and there is still a lot of analysis left to do.

Why don't you just read the code?


How do you read the code of an implementation detail of Twitter's servers? There's no guarantee that the example code they released years ago still matches what they use.


Discord also utilizes Twitter's Snowflake algorithm for the ridiculous number of messages that are sent in chats. [1]

[1] https://discordapp.com/developers/docs/reference/


Link doesn't work for me, I just get a partially-loaded page with an eternally-spinning spinner (FF65 on OSX).

Gotta go to https://discordapp.com/developers/docs and then click on the Reference link in the sidebar.


If you remove the trailing forward slash the link works fine.


Not only messages - EVERYTHING in Discord is identified by a snowflake - messages, users, channels, guilds (servers), bots/applications, ...


That's quite amazing!

One thing is that the data center ids are in the tweet ids, so they could be used to get a rough location of Twitter users.


I see there is interest in this observation so I used a little ruby (.to_s(2)[-22..-17].to_i(2)) to get the datacenter id.

Then ran it on a few Twitter accounts: https://pastebin.com/w8Dnj5kM

It does work, and it's going to be hard to patch.

edit: I realized you don't get just one location but the whole location history of a Twitter user. It also locates Twitter's data centers, which doesn't seem to be public information.


This is really interesting. When I did the original analysis on datacenter / server ids, I didn't think about correlation with user accounts. Nice observation!


I was curious and just did that :) https://gist.github.com/localhostdotdev/48ed13972c3e5391a47f... (small sample of ~1500 localized tweets)


Should be pretty simple to check the tweet geo or location mentions per server id to see if they imply a correlation with the geographic area of the server. Then you're just one hop from knowing where (or where not) other tweeters are.
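A minimal sketch of that check, assuming tweets is a list of hydrated tweet dicts from statuses/lookup; the bit offset for the 5-bit datacenter field follows the usual Snowflake layout, and only the minority of geotagged tweets carry a "place" field:

    from collections import Counter, defaultdict

    def countries_by_datacenter(tweets):
        # Tally tweet country codes per datacenter id to eyeball any correlation.
        tally = defaultdict(Counter)
        for t in tweets:
            datacenter = (int(t["id_str"]) >> 17) & 0x1F  # 5-bit datacenter field
            place = t.get("place")
            if place:
                tally[datacenter][place.get("country_code", "??")] += 1
        return tally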


They could "fix" it by periodically rotating the DC and server IDs.

I wish it were not so easily fixable, because this would break the key space reduction trick ;). Unfortunately such a solution is feasible and would come with the side effect of drastically increasing the required scanning space.


What are you guys even smoking? The datacenter ID segment is 5 bits long. You have an extra server ID bit in there, making the datacenter results look more significant than they are.


That's excellent stuff. I had no idea that's how snowflake IDs were constructed.

Now it makes me wonder how hard it would be to make a tweet that links to itself...?


It's not that hard: https://twitter.com/mauritscorneIis

Code is here: https://github.com/pomber/escher-bot

It's a pity that the Twitter UI doesn't embed the recursive tweet.


It has been done but I can't find a link to a modern example, only those before Snowflake IDs.


>Twitter does sell premium data services including their much coveted “firehose” stream. Twitter’s firehose is a complete stream of all tweets made on their platform and is only available to a few businesses at an extraordinary price.

As a developer, I have no understanding of how data is monetized. How much is user data worth? What kind of user data is worth more than others? What is the data used for (why would anyone pay an "extraordinary price" for tweets)? Who pays for user data? Is there some public market that I am not aware of?


Firehose access is prohibitively expensive. Last I heard, the cost was based on something like "30% of your company's annual revenue" (this is a second-hand rumour I heard when I asked).

Services like Gnip have everything from the firehose (ever), and it can be retrieved by paying a monthly fee (hundreds or thousands of dollars for 500k tweets).


I’d guess adtech, market research and investment companies would be interested.


To store such a firehose stream of data, you would need approximately 0.3 GB of storage per second of data.

This is if you only collect the username, timestamp and tweet text, excluding any additional metadata such as data center, likes and retweets, not to mention images and videos.

Full calculation here: https://docs.google.com/spreadsheets/d/1BIAguT9Qvy0GK-dalpQf...


I believe your calculations are too high by a factor of 1000... 53 bytes/(typical tweet) * 6000 typical tweets/second is 318 KB (or 0.3 MB) per second, not 318 MB.


Yikes, you are totally right. Fixed


A couple of years ago the tweet limit was bumped to 280 characters, so I assume the amount is double what you calculated.


Good point! Fixing my calculation. So it's between 0.3gb (typical tweet is ~30 chars) and 6.8gb per second


Do the server and datacenter ID's effectively leak location data?


1) The leaked location is just the datacenter locality, e.g. west coast USA or northern Europe.

2) There is no guarantee that traffic IS NOT routed from another region to handle traffic peaks.




If you need access to Twitter data you should sign up for a developer account and contact them. It's expensive but not prohibitively so for a company that needs Twitter data to do business.

For academic research Twitter offers products which dramatically reduce the amount of data you need to consume via the full search API or historical powertrack.


I'm wondering if the Wayback Machine has access to it.


I would be extremely interested in getting the Twitter corpus ingested into the Internet Archive.

To my knowledge, the Archive does not have access to the firehose.


At one point the Library of Congress did, but they a) didn't share it with the general public, and b) stopped doing it.

https://www.npr.org/sections/thetwo-way/2017/12/26/573609499...


The Hadoop archive they recently moved to Google is 300PB. Good luck.


This is really interesting: an unforeseen insight from IDs that can be enumerated, even partially. I suppose encrypting and decrypting them is too costly, so are there any other ideas for keeping the properties of Snowflake at scale without this sort of attack being possible?


If you wanted to "encrypt" them, you could increase the key space and add a random salt to each one.


(Using figures cited in the article) Assuming that most of the time only 20 machines are responsible for ID generation, and that ~50% of tweets use the first available sequence number for a given millisecond (so half the time machines only process one new tweet per millisecond), can we estimate Twitter's average new-tweet QPS to be 20,000?

Edit: actually this is an upper bound for the 50% case, because it assumes that every machine processes a tweet every millisecond, which may not be the case.
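The back-of-the-envelope arithmetic behind that figure, using the assumed numbers above rather than anything measured:

    machines = 20        # assumed number of active ID-generating machines
    ms_per_second = 1000
    tweets_per_ms = 1    # at most one tweet per machine per millisecond (the 50% case)
    print(machines * ms_per_second * tweets_per_ms)  # 20000 tweets/second upper bound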


It's variable based on what's going on at the time but I've seen upwards of 7k tweets a second for the sections of the timeline that I've ingested using this technique.

Someone suggested trying it when the New Year starts in Japan. Apparently there are tens of thousands of tweets per second then.


That is more or less accurate as an upper bound. You can find public talks from Twitter engineers that specifically cite their tweets per second at 3k-7k, and that was years ago.


If Twitter is already storing the tweet timestamp, is there a reason to generate sequential IDs? It seems like they could move to a UUID scheme and protect their "firehose" from reverse-engineering.


They're snowflake IDs, not exactly sequential.

  Timestamp bits, shifted all the way left
  Data center id, shifted left
  Server id, shifted left
  Increment this millisecond

Since the left bits are the timestamp, they are date/time sortable and a much more useful index/primary key/whatever than a UUID.
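For reference, a minimal Python sketch of that layout; the bit widths (41/5/5/12) and the custom epoch are the commonly documented Snowflake values, so treat them as assumptions rather than something verified against Twitter's current servers:

    from datetime import datetime, timezone

    TWITTER_EPOCH_MS = 1288834974657  # Twitter's custom Snowflake epoch (2010-11-04)

    def decode_snowflake(tweet_id):
        # Split a tweet ID into the fields described above.
        return {
            "timestamp": datetime.fromtimestamp(
                ((tweet_id >> 22) + TWITTER_EPOCH_MS) / 1000, tz=timezone.utc),
            "datacenter_id": (tweet_id >> 17) & 0x1F,  # 5 bits
            "server_id": (tweet_id >> 12) & 0x1F,      # 5 bits
            "sequence": tweet_id & 0xFFF,              # 12 bits
        }

    # Usage: decode_snowflake(int(tweet["id_str"]))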


You can prefix the uuid with a timestamp for the same effect without leakage. There are chronological uuid projects: https://github.com/uucid-project/spec
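Something along these lines would do it (a hypothetical sketch, not the scheme from the linked spec): pack a millisecond timestamp into the high bits and fill the rest with random bits, so the IDs stay time-sortable without leaking datacenter or server identity:

    import secrets, time, uuid

    def chrono_uuid():
        # 48-bit millisecond timestamp in the high bits, 80 random bits below.
        ms = int(time.time() * 1000) & ((1 << 48) - 1)
        return uuid.UUID(int=(ms << 80) | secrets.randbits(80))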


You can, but why would you? You're advocating slapping lipstick on a pig not really engineered for this use case vs. using something that was designed for it.

UInt64s are going to be all-around more efficient.


You wouldn't, just showing that it's an easy problem to solve.


In some databases having a sequential key is valuable (e.g. Cassandra at least used to be like this)


Twitter will just randomize the machine IDs and datacenter IDs


Doing this for historical data would break their API. They'd also need a more complex scheme than snowflake currently provides


Interesting! Great work. I wonder why they didn't just use UUIDs and a timestamp attribute to sort them in the UI?


How can Twitter be a free speech platform if you can't receive the speech for free?


Who said it’s a free speech platform? You have the right to free speech but not the right to a free platform.


Hence the distinction between “free as in speech” and “free as in beer”.


Just referring to the speech part... Is the speech free as in speech if it isn't free as in beer? If there is a gateway preventing it from being freely distributed? With GNU, the free-as-in-speech is also free as in beer, if all you want is the speech (aka the code). If you want services around the code, it will cost money, because companies are trying to build a business on it. Correct me if I'm wrong (and I probably am), but doesn't every additional dollar reduce the free-as-in-speech because of reduced access?


GNU/FSF supports selling code for money. Twitter does the same with tweets.

You can download your own tweets for free, and even someone else's tweets freely; it's only doing it for everyone else that costs money.

So, with GNU you usually get both, but that's not necessary. And it's usually seen as completely meaningless: if there were a very useful GPL open-source program that was not free to download and use, then a simple consortium of people could get together, buy one copy, and redistribute it for free (though they would need to do this for all future versions). This is similar to what the author proposes with regard to forming a group of ~1,700 users to brute-force the firehose.


Free speech is about what you can say, not about who gets to hear it. If the newspaper doesn't print your speech, it doesn't mean they are limiting it. Free speech is about what is allowed to exit your mouth, how it travels is a different beast.


This looks to considerably lower the pain of mapping retweet networks, thanks!


>Twitter’s statuses lookup API endpoint allows for a total of 1,200 API calls every 15 minutes. Each call allows the user to pass 100 ids for a total of 120,000 id requests every 15 minutes using both APP auth [...]

Use the secret consumer keys from Twitter to bypass these limits: https://gist.github.com/shobotch/5160017


I can't imagine these will last long now that they've been posted in a comment on a front page HN story.


These are hardcoded in the apps. If they disable a key, they would need to release a new version of the app (ok), and all users would need to update (infeasible).


Wouldn't they just enforce the limit with those keys as well?


They explicitly want to cripple third-party apps to push users to use their (awful) official clients, and their way of doing so is to enforce unreasonably low limits.

If they were to do the same with their official clients they'd become unusable.


Sounds like a solid reason not to build around a platform that can arbitrarily kill your project or business. ActivityPub seems to be gaining traction, without the beholden-to-a-single-entity issue.


A platform that can arbitrarily take down your business like a mobile app on iPhone or Android?


The Twitter for Android key and secret are at least 8 years old

https://twitter.com/kevinriggle/status/23932444186


I've been using them for years now.


You can do the same with the Reddit API: change the user-agent header and then you have unlimited API calls.


Just be careful about how much you change it, since you could get banned. You might be able to get away with changing the app ID or version, but even that's risky.

> NEVER lie about your user-agent. This includes spoofing popular browsers and spoofing other bots. We will ban liars with extreme prejudice.

https://github.com/reddit-archive/reddit/wiki/API


How would that work? I couldn't find much on the web.


Same problem as with Twitter's mobile app: they can't block people who are not logged in to the mobile app, so they don't rate limit them.


How come they can block people on the web again?


Probably by doing a mitm attack on the mobile app and using the headers used there.


Seems like a good way to get your IP blacklisted


I worked with a company in the past that abused the twitter api to an unimaginable level from a single IP address.

That was a few years ago, but at the time I'm pretty sure their blacklisting was somewhere between nonexistent and pathetic.


It’s sad that you think this somehow reflects badly on Twitter. I appreciate a company that will opt to be conservative rather than ban-hammer innocent people on accident just to stop a single idiot.


I don’t think it reflects badly on Twitter; it was simply a remark on the state of their blacklisting at the time. I’m sure they could’ve done a lot better if they chose to. Sorry I didn’t express myself better.


> It’s sad that you think this somehow reflects badly on Twitter. I appreciate a company that will opt to be conservative rather than ban-hammer innocent people on accident just to stop a single idiot.

Off-topic but reminds me of ~2004 where we had a particularly pervasive cheater in our dedicated game server and the end decision of the admin was to just ban the entire IP range of said cheaters' ISP. Not particularly conservative - very effective.


Using these keys?


No. Different strategy.


Seems like a good way to get an angry e-mail from your ISP, or Twitter's lawyer.


[flagged]


Hackernews..


Can we use them to get the streaming API back? I think the apps still use something like that.


[flagged]


It probably is! When your bottleneck is a remote API limit, CPU usage might not be worth optimizing.


Python is perfectly fine for these types of tasks. I ingest all of Reddit in real time (https://pushshift.io) and also ingest Gab.com and several others (Stack Overflow, etc.), and at most one or two CPU cores are at 10-15%.

Also, when I provide code examples, I try to use a language that most programmers will have some exposure to, and Python is generally high on that list.


Interesting technical write-up, but a little naive about its uses: it is probably against the Twitter developer agreement to use the API this way, as it circumvents the rate limits.


You can't circumvent the rate limit; that is the whole idea of a rate limit. The article states this throughout.

The genius idea is simply shrinking the search space with probabilistic assumptions. I never thought about using the snowflake IDs to get historical data; I am impressed.


If sequence ID means what I think it means... then if I reply to a Twitter thread that already has 4095 replies... there could be the possibility of a tweet ID collision, possibly causing another tweet made simultaneously with mine to be unretrievable...

After years of waiting fruitlessly for Twitter to implement my feature request, it may finally be possible to delete someone else's tweet.


The ID of a tweet includes the timestamp. The sequence is just an incremental number per server per millisecond. As far as I can see, it's not based on replies.


Thanks for the quick explainer on what the sequence is. Reassuring to know it's not the ordered number of tweets in a thread.



