Anyone happen to know what that price is?
For context: Back in the day, I worked at Klout. We had to pay a crippling monthly sum just to get access to the @mentions stream.
They don't specifically say you have to verify anything, but they do dance around being able to handle the volume in a couple of different ways in their streaming guides for us average mortals. It mostly seems to come down to the fact that they are, at least to some degree, buffering on their end to make sure you don't miss anything if there's suddenly a spike too large for you to handle, or if your connection degrades for a while, etc.
It also wouldn't surprise me if it were originally just a limitation of their internal systems: they couldn't buffer or allow any kind of replay because they were literally writing everything off to a stream as it came in and had no mechanism for going back and reading from the actual database after the fact.
You could reduce the sequence-ID space by exploiting the fact that it is overall less likely that a tweet is deleted than not, and a run of two CONSECUTIVE deleted tweets is even less likely.
E.g. if you find sequence ID 0 but don't find 1, 2, 3, then you can probably skip 4, 5, 6, 7. I don't know the probabilities of deleted tweets, but I am sure someone could calculate them to determine the 99% threshold.
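In Python, the skipping heuristic could look something like this. This is only a sketch of the idea: `tweet_exists` is a placeholder for whatever existence check you use (e.g. requesting the status and treating a 404 as "missing"), the standard public Snowflake bit layout is assumed, and the miss_run / skip_ahead values are purely illustrative.

```python
TWITTER_EPOCH_MS = 1288834974657  # Twitter's Snowflake epoch

def make_snowflake(ts_ms, datacenter_id, server_id, sequence):
    """Assemble an ID: 41-bit timestamp | 5-bit datacenter | 5-bit server | 12-bit sequence."""
    return (((ts_ms - TWITTER_EPOCH_MS) << 22)
            | (datacenter_id << 17)
            | (server_id << 12)
            | sequence)

def scan_millisecond(ts_ms, datacenter_id, server_id, tweet_exists,
                     miss_run=3, skip_ahead=4):
    """Probe sequence numbers 0..4095 for one slot, skipping ahead after a run of misses."""
    found, seq, misses = [], 0, 0
    while seq < 4096:
        tid = make_snowflake(ts_ms, datacenter_id, server_id, seq)
        if tweet_exists(tid):
            found.append(tid)
            misses = 0
            seq += 1
        elif misses + 1 < miss_run:
            misses += 1
            seq += 1
        else:
            # e.g. 1, 2, 3 all missing -> assume 4, 5, 6, 7 are missing too
            seq += skip_ahead + 1
            misses = 0
    return found
```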
Do not use, access or analyze the Twitter API to monitor or measure the availability, performance, functionality, usage statistics or results of Twitter Services or for any other benchmarking or competitive purposes, including without limitation, monitoring or measuring:
the responsiveness of Twitter Services; or
aggregate Twitter user metrics such as total number of active users, accounts, total number of Periscope Broadcast views, user engagements or account engagements.
The fact that 3 machines handle 20% of tweets suggests that tweets are not in fact assigned to machines in a uniformly random manner. I would guess that there is a geographic bias as to which machines handle which tweets.
I'm rewriting the code to include a prescan of the time range to determine which server ids are in play at the time and which server ids are most active.
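For what it's worth, the prescan can be as simple as decoding a sample of IDs you already know are real (e.g. from the sample stream or search) and counting the (datacenter, server) pairs. A rough sketch, assuming the standard Snowflake layout:

```python
from collections import Counter

TWITTER_EPOCH_MS = 1288834974657

def decode_snowflake(tweet_id):
    """Split an ID into (timestamp_ms, datacenter_id, server_id, sequence)."""
    return ((tweet_id >> 22) + TWITTER_EPOCH_MS,
            (tweet_id >> 17) & 0x1F,
            (tweet_id >> 12) & 0x1F,
            tweet_id & 0xFFF)

def prescan(sample_tweet_ids):
    """Rank (datacenter, server) pairs by frequency in the sample, so the
    full scan can be limited to, or ordered by, the busiest workers."""
    counts = Counter()
    for tid in sample_tweet_ids:
        _, dc, srv, _ = decode_snowflake(tid)
        counts[(dc, srv)] += 1
    return counts.most_common()
```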
Figuring out how to deconstruct Snowflake was challenging and there is still a lot of analysis left to do.
Why don't you just read the code?
Gotta go to https://discordapp.com/developers/docs and then click on the Reference link in the sidebar.
One thing is that the data center ids are in the tweet ids, so it could be used to get a rough location of Twitter users.
Then ran it on a few Twitter accounts: https://pastebin.com/w8Dnj5kM
It does work, and it's going to be hard to patch.
edit: I realized you don't only get one location but the whole location history of a Twitter user. It could also be used to locate Twitter's data centers, since that doesn't seem to be public information.
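To illustrate the idea (my own sketch, not the code in the pastebin): the data-center bits sit at a fixed offset, so a per-user timeline is just a matter of decoding each of their tweet IDs. The region mapping below is a made-up placeholder; the real datacenter-to-location table isn't public and would have to be inferred, e.g. by correlating with users of known location.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657

# Placeholder mapping; the actual datacenter id -> location table is unknown.
DC_REGION = {10: "DC-A (unknown)", 11: "DC-B (unknown)"}

def location_history(tweet_ids):
    """Return a (UTC time, datacenter) timeline for one user's tweet IDs."""
    history = []
    for tid in sorted(tweet_ids):
        ts_ms = (tid >> 22) + TWITTER_EPOCH_MS
        dc = (tid >> 17) & 0x1F
        when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
        history.append((when, DC_REGION.get(dc, f"datacenter {dc}")))
    return history
```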
I wish it were not so easily fixable, because a fix will break the key-space reduction trick ;). Unfortunately such a fix is feasible, and it would come with the side effect of drastically increasing the required scanning space.
Now it makes me wonder how hard it would be to make a tweet which linked to itself...?
Code is here: https://github.com/pomber/escher-bot
It's a pity that the Twitter UI doesn't embed the recursive tweet.
As a developer, I have no understanding of how data is monetized. How much is user data worth? What kind of user data is worth more than other kinds? What is the data used for (why would anyone pay an "extraordinary price" for tweets)? Who pays for user data? Is there some public market that I am not aware of?
Services like Gnip have everything from the firehose (ever), and it can be retrieved by paying a monthly fee (hundreds or thousands of dollars for 500k tweets).
This is if you only collect username, timestamp and tweet, excluding any additional metadata such as data center, likes and retweets, not to mention images and videos.
Full calculation here:
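For a very rough sense of scale, here is my own back-of-envelope with assumed per-field sizes (not the linked calculation):

```python
# Assumed sizes: ~15 bytes username, 8 bytes timestamp, ~140 bytes tweet text.
bytes_per_tweet = 15 + 8 + 140                  # ~163 bytes per record
tweets_per_day = 500_000_000                    # commonly quoted ballpark
gb_per_day = bytes_per_tweet * tweets_per_day / 1e9
print(f"~{gb_per_day:.0f} GB/day, ~{gb_per_day * 365 / 1000:.0f} TB/year")
# -> roughly 80 GB/day, ~30 TB/year, before likes/retweets/media
```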
2) There is no guarantee that traffic IS NOT routed from another region to handle traffic peaks.
For academic research, Twitter offers products, such as the full search API or Historical PowerTrack, which dramatically reduce the amount of data you need to consume.
To my knowledge, the Archive does not have access to the firehose.
Edit: actually this is an upper bound for the 50% because it assumes that every machine processes a tweet every millisecond, which may not be the case
Someone suggested trying it when the New Year starts in Japan. Apparently there are tens of thousands of tweets per second then.
Timestamp bits, shifted all the way left
Data center id shifted left
Server id shifted left
Increment this millisecond
Since the left bits are the timestamp, they are date/time sortable and a much more useful index/primary key/whatever than a UUID.
UInt64s are going to be more efficient all around.
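Here's a small sketch using the usual public Snowflake constants (not Twitter's actual code) that shows the layout and the time-sortability concretely:

```python
TWITTER_EPOCH_MS = 1288834974657

def snowflake(ts_ms, datacenter_id, server_id, sequence):
    """41-bit timestamp | 5-bit datacenter | 5-bit server | 12-bit sequence."""
    return (((ts_ms - TWITTER_EPOCH_MS) << 22)
            | (datacenter_id << 17)
            | (server_id << 12)
            | sequence)

earlier = snowflake(1_500_000_000_000, 10, 3, 0)
later   = snowflake(1_500_000_000_001, 2, 1, 4095)
assert earlier < later  # a later millisecond always wins, regardless of the low 22 bits
```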
You can download your own tweets for free, and even someone else's tweets for free; it's only doing it for everyone that costs money.
So with GNU you usually get both, but that's not necessary. It's also usually seen as completely meaningless: if there were a very useful piece of GPL open-source software that was not free to download and use, a simple consortium of people could get together, buy one copy, and redistribute it for free (though they would need to do this for every future version). That is similar to what the author proposes with regard to forming a group of ~1,700 users to brute-force the firehose.
Use the secret consumer keys from Twitter to bypass these limits: https://gist.github.com/shobotch/5160017
If they were to do the same with their official clients they'd become unusable.
> NEVER lie about your user-agent. This includes spoofing popular browsers and spoofing other bots. We will ban liars with extreme prejudice.
That was a few years ago, but at the time I'm pretty sure their blacklisting was somewhere between nonexistent and pathetic.
Off-topic, but this reminds me of ~2004, when we had a particularly pervasive cheater on our dedicated game server and the admin's final decision was to just ban the entire IP range of said cheater's ISP. Not particularly conservative, but very effective.
Also, when I provide code examples, I try to use a language that most programmers will have some exposure to, and Python is generally high on that list.
The genius idea is simply shrinking down the search space with probabilistic assumptions. I'd never thought about using the Snowflake IDs to get historical data; I am impressed.
After years of waiting fruitlessly for Twitter to implement my feature request, it may finally be possible to delete someone else's tweet.