* Library of Congress: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archi...
* Twitter Data grants: https://blog.twitter.com/2014/introducing-twitter-data-grant...
I'll admit I haven't applied for access through either one, but neither have I seen any papers cite access through those venues, and I read quite a few NLP + Twitter papers.
"Transfer of Data to the Library
In December, 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library.
Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files - "current" tweets - on an ongoing basis.
In February 2011, transfer of "current" tweets was initiated and began with tweets from December 2010.
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.
As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies."
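Running the numbers in that quote gives a rough per-tweet size (a back-of-the-envelope sketch in Python; the per-tweet figures are derived here, not stated in the Library's post):

    TB = 10**12  # assuming decimal terabytes

    # 2006-2010 archive: ~21 billion tweets, 20 TB uncompressed
    print(20 * TB / 21e9)          # ~952 bytes per tweet, metadata included

    # Full archive: ~170 billion tweets, 133.2 TB for two compressed copies
    print(133.2 * TB / 2 / 170e9)  # ~392 bytes per tweet, compressed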
I find the quantities hilarious.
But since they haven't been able to provide access yet, I'm pessimistic about their prospects of doing so any time soon.
Can we do something to help them?
I've been thinking that maybe GPU-accelerated databases like MapD could mitigate the cost issue for them, but I'm pretty sure that doesn't go all the way to solving the problem...
EDIT: That's 11.5k tweets/sec. How do you get eleven thousand people to tweet every second?
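As a quick check, 11.5k/sec is about what a billion tweets a day works out to (a Python sketch; the daily figure is implied by the rate, not quoted anywhere above):

    per_sec = 11500
    per_day = per_sec * 86400   # seconds in a day
    print(per_day)              # 993,600,000 -- roughly a billion tweets a day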
Also, tweets are limited to 140 characters, not bytes - Chinese tweets typically take about 200-250 bytes, for example.
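The character/byte distinction is easy to demonstrate (a minimal Python sketch; the sample string here is made up):

    # CJK characters take 3 bytes each in UTF-8
    s = u"推特的推文限制是一百四十个字符"
    print(len(s))                  # 15 characters
    print(len(s.encode("utf-8")))  # 45 bytes -- 3 bytes per character
    # So a full 140-character Chinese tweet can run to ~420 bytes of text alone.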
RX bytes:18053724505080 (16.4 TiB) TX bytes:2686623557042 (2.4 TiB)
It works out at around 70GB/day, so I'd actually think that the full firehose would use considerably more data than 400GB/day (likely closer to 800GB).
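The arithmetic behind that, as I read it (a sketch; the collection period and the 10% sample assumption are my own guesses, not stated above):

    rx = 18053724505080          # RX bytes from the counter above
    print(rx / 2.0**40)          # ~16.4 TiB, matching the ifconfig output
    print(rx / 70e9)             # ~258 days of collection at ~70 GB/day
    # If that stream is a ~10% sample, the full firehose would be on the
    # order of 10x: ~700 GB/day or more -- well above 400 GB/day.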
Well, they produce 100 GB of data _per second_.
Welcome to the real big data world.
Edit: Never mind, this is what they are doing
"We would like to develop some kind of ‘google' brain where we can zoom in and out, see it from different perspectives and understand how brain structure and function is related."
Thousands of fake accounts are tweeting out nonsense all the time. As another example, there are multiple accounts that tweet items from HN (and presumably lots of other RSS feeds).
I took a random tweet on my timeline, and the JSON representation, with added metadata, weighs 8740 bytes.
With this figure, you would "only" need 500 tweets/sec to get 400GB a day.
There's a lot of people on Twitter, and a lot of bots, so that doesn't seem unreasonable to me.
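For reference, the arithmetic (a quick Python check using the 8,740-byte figure from above):

    bytes_per_day = 8740 * 500 * 86400   # bytes/tweet * tweets/sec * sec/day
    print(bytes_per_day / 1e9)           # ~377.6 GB/day -- roughly the 400 GB figure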
"Although a majority of tweets are public, if scientists want to freely search the lot, they do it through Twitter's application programming interface, which currently scours only 1 percent of the archive. But that is about to change: in February the company announced that it will make all its tweets, dating back to 2006, freely available to researchers."
On a more serious note, there is one area of research for which Twitter data is very valuable: how information and disinformation is created and spread during major news events: wars, catastrophes, uprisings, school shootings, etc.
My gut feeling is that today a lot of quality journalism happens outside of traditional journalistic organisations. The downside is that a lot of wild speculation and rumour spreads as well, but it would be valuable to see how good this modern "crowd" journalism is. A skilled research group can use Twitter data and the Internet Archive to track down the original sources of information pretty well.
You never know unless you try.