Twitter to Release All Tweets to Scientists (scientificamerican.com)
153 points by digital55 on May 27, 2014 | 53 comments

I've heard that before.

* Library of Congress: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archi...

* Twitter Data grants: https://blog.twitter.com/2014/introducing-twitter-data-grant...

I'll admit, I haven't applied for access through either one, but neither have I seen any papers cite access through those venues—and I read quite a few NLP + Twitter papers.

This article is just talking about the Twitter Data Grants, for which 6 universities were selected as winners [0]. You won't see papers from these grants yet because, well, the winners were only announced about 40 days ago!

[0] https://blog.twitter.com/2014/twitter-datagrants-selections

from http://www.loc.gov/today/pr/2013/files/twitter_report_2013ja...

"Transfer of Data to the Library

In December, 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library. Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files - "current" tweets - on an ongoing basis.

In February 2011, transfer of "current" tweets was initiated and began with tweets from December 2010.

On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies."

I find the quantities hilarious. But since they haven't managed to provide access yet, I'm pessimistic about their prospects of doing so any time soon.

Can we do something to help them?

I've been thinking maybe GPU-accelerated databases like MapD could mitigate the cost issue for them, but I'm pretty sure that doesn't go all the way to solving the problem...

Twitter started granting datasets some time back (applications are now closed) [0] on the merits of a short proposal. Very few groups eventually got access to the data [1]. I hope they increase the number of grants in the future.

[0] https://blog.twitter.com/2014/introducing-twitter-data-grant...

[1] https://blog.twitter.com/2014/twitter-datagrants-selections

Yes, I am pretty sure this article is just rehashing the Twitter grants (I believe there were only 6 to 8 awards), rather than announcing full open data for all researchers (which makes the title misleading).

This is exciting to me; does anyone know how Twitter will go about this? Will there be a public dataset available for download? A research contract through the recently acquired Gnip? Or just firehose access for future streams?

Considering there's at least 400GB of data generated per day, I don't think it'll be readily available for the public as a download.

Jeez, 400 GB of text per day? How the hell?

EDIT: That's 11.5k tweets/sec. How do you get eleven thousand people to tweet every second?

Most of that data is not the content of the tweet itself, but the metadata associated with it. When I last checked, we were storing about a kilobyte of data for every tweet.

Also, tweets are limited to 140 characters, not bytes - Chinese tweets typically take about 200-250 bytes, for example.

Yeah, I figured 300 bytes per tweet, to be generous, but didn't realize it would take 1 KB of metadata. Thanks for that detail.
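The back-of-the-envelope arithmetic in this subthread can be checked with a short sketch. It assumes decimal units (1 GB = 10^9 bytes) and a constant rate over the whole day. The function name `tweets_per_second` is made up for illustration, and the per-tweet sizes are the figures quoted above, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tweets_per_second(bytes_per_day: float, bytes_per_tweet: float) -> float:
    """Average tweet rate implied by a daily volume and a per-tweet size."""
    return bytes_per_day / SECONDS_PER_DAY / bytes_per_tweet


if __name__ == "__main__":
    daily_volume = 400e9  # 400 GB/day as quoted in the thread (decimal GB assumed)
    # Bytes per tweet: the 300-byte guess and the ~1 KB figure from the thread.
    for size in (300, 1000):
        rate = tweets_per_second(daily_volume, size)
        print(f"{size} B/tweet -> {rate:,.0f} tweets/sec")
```

Under these assumptions, roughly 1 KB per tweet puts 400 GB/day at a bit under 5,000 tweets per second, so the implied rate depends heavily on how much metadata is counted.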

I have access to the Twitter gardenhose (which is equal to slightly less than 10% of the full volume). These are the RX and TX statistics from the machine that I've been using to gather data for a few months now:

RX bytes:18053724505080 (16.4 TiB) TX bytes:2686623557042 (2.4 TiB)

It works out at around 70GB/day, so I'd actually think that the full firehose would use considerably more data than 400GB/day (likely closer to 800GB).
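A rough sketch of that extrapolation, assuming the sample is a uniform fraction of the full stream and that the RX counter is mostly stream traffic (both are assumptions, and the helper names are hypothetical):

```python
TIB = 2 ** 40  # bytes in one tebibyte


def bytes_to_tib(n_bytes: int) -> float:
    """Convert a raw byte counter to TiB."""
    return n_bytes / TIB


def estimate_full_volume(sample_gb_per_day: float, sample_fraction: float) -> float:
    """Scale a sampled stream's daily volume up to the full stream."""
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return sample_gb_per_day / sample_fraction


if __name__ == "__main__":
    rx_bytes = 18053724505080  # RX counter quoted above
    print(f"received: {bytes_to_tib(rx_bytes):.1f} TiB")
    # 0.09 is an assumed stand-in for "slightly less than 10%".
    for fraction in (0.10, 0.09):
        full = estimate_full_volume(70, fraction)
        print(f"{fraction:.0%} sample -> {full:.0f} GB/day for the full stream")
```

The estimate is only as good as the sampling fraction: a sample of exactly 10% gives 700 GB/day, and anything below that pushes the figure toward 800 GB/day.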

Have you ever heard about the 'Human Brain Project'?

Well, they produce 100 GB of data _per second_.

Welcome to the real big data world.

Woah, 100 GB per second???!! What exactly do they do?

Edit: Never mind, this is what they are doing

"We would like to develop some kind of ‘google' brain where we can zoom in and out, see it from different perspectives and understand how brain structure and function is related."

Pick a topic you are familiar with. Open up a twitter search for it. Wait a while. See the inevitable storm of tweet-spam that sort of looks like social sharing.

Thousands of fake accounts are tweeting out nonsense all the time. Another example: there are multiple accounts that tweet items from HN (and presumably lots of other RSS feeds).

Depends how much metadata there is with every tweet.

I took a random tweet on my timeline, and the JSON representation, with added metadata, weighs 8740 bytes.

With this figure, you would "only" need 500 tweets/sec to get 400GB a day.

There's a lot of people on Twitter, and a lot of bots, so that doesn't seem unreasonable to me.

Maybe it includes images.

It does not include images. A tweet object does include a bunch of metadata though: https://dev.twitter.com/docs/platform-objects/tweets

That's a hell of a lot of data! First they'll have to find some alternative to zip for this much data!

Doesn't Twitter make a fair bit of money from selling access to various slices of their data? I'd be surprised if they released it all to the general public. I imagine scientists would have to be under some sort of NDA.

The Twitter terms of service prohibit sharing the data in the Tweets. Researchers are allowed to share Tweet IDs and User IDs, which can be used to identify a Tweet. Currently, Twitter collections are shared using this method -- I recently released a dataset of 120 million Tweet IDs which cover a sample of a month's worth of data, and numerous other researchers have used these IDs to crawl Twitter and obtain the same dataset as I used in my experiments.
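Sharing IDs instead of tweets means anyone reproducing the dataset has to re-fetch each tweet by ID. Here is a minimal sketch of the batching side of that, assuming a lookup endpoint that accepts up to 100 IDs per request. The batch size is an assumption, and `fetch_batch` is a hypothetical caller-supplied function, not a real client call.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

BATCH_SIZE = 100  # assumed maximum number of IDs per lookup request


def batches(ids: Iterable[int], size: int = BATCH_SIZE) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` IDs."""
    it = iter(ids)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def requests_needed(n_ids: int, size: int = BATCH_SIZE) -> int:
    """Number of lookup requests needed for n_ids (ceiling division)."""
    return -(-n_ids // size)


def hydrate(ids: Iterable[int],
            fetch_batch: Callable[[List[int]], Iterable[dict]]) -> Iterator[dict]:
    """Call the caller-supplied fetch_batch on each batch and yield its results."""
    for chunk in batches(ids):
        yield from fetch_batch(chunk)
```

Under that assumption, 120 million IDs come to 1.2 million requests. Tweets that have since been deleted or protected will generally not come back, so a re-crawled copy may be smaller than the original.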

This would likely make a great natural language data set for compression algorithms.

@JoshTriplett tweets r alrdy #compressed. hth

Upvoted for much needed humour on HN.

True, but doesn't Twitter already provide an API for access to a fraction of the firehose? Surely that would be enough data. If Twitter doesn't have a good API, Reddit allows full access to all comments through their API (although Reddit has orders of magnitude less data).

Twitter's API is too limited for historical data (you'll hit the rate limits quickly for any meaningful volume). Reddit's rate limits, however, let you process a million comments every day.

It appears as if the data is only available to those scientists who apply for the data grant and win it. Furthermore, applications for the grant have been closed since midway through March. Yeah, I'm not surprised Twitter isn't making its historical data public. That would literally end Gnip, which is a revenue source for Twitter not based on advertising to users.

How about just loosening the API rate limits, or making a better token request process with resource allocation? E.g., I'd like 1500 requests per 15-minute window (as opposed to 15, for some things) for 72 hours. I guess this could be limited to those with an academic email address if they insist.
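To put numbers on that request, here is a small sketch of how many calls a fixed per-window limit allows over a period. It assumes back-to-back 15-minute windows with the full quota used in each; `requests_allowed` is an illustrative name only.

```python
WINDOW_MINUTES = 15


def requests_allowed(per_window: int, hours: float) -> int:
    """Total requests allowed over `hours` under a fixed per-window limit."""
    windows = int(hours * 60 // WINDOW_MINUTES)  # whole windows only
    return per_window * windows


if __name__ == "__main__":
    for limit in (15, 1500):
        total = requests_allowed(limit, 72)
        print(f"{limit} per window over 72 h -> {total:,} requests")
```

Over 72 hours that is 288 windows, so the proposed limit would allow 432,000 requests against 4,320 at 15 per window.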

One thing I've wondered: is it possible to follow "everyone" on Twitter? If not, what type of cap does Twitter enforce on the number of accounts you're allowed to follow? I realize it'd be difficult to know which new accounts to add as people join, but how far could you push a roll-your-own stream of the Twitter firehose?

Hypothetically: I'm sure you could do it algorithmically; if your program sees a retweet from someone who is not on your following list, you then follow them. You might miss a few, but you would get most everyone.
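A toy sketch of that discovery loop, run over an in-memory list of events rather than a real timeline. The `(author, retweeted_author)` event shape is invented for illustration, and follow caps and rate limits are ignored.

```python
from typing import Iterable, Optional, Set, Tuple

# Each event is (author, retweeted_author); retweeted_author is None
# for an original tweet. This shape is hypothetical.
Event = Tuple[str, Optional[str]]


def expand_following(seed: Iterable[str], events: Iterable[Event]) -> Set[str]:
    """Grow a following set by adding every account retweeted by someone we follow."""
    following = set(seed)
    for author, retweeted_author in events:
        # We only ever see tweets from accounts we already follow.
        if author not in following:
            continue
        if retweeted_author is not None:
            following.add(retweeted_author)
    return following


if __name__ == "__main__":
    events = [("a", "b"), ("b", "c"), ("x", "y"), ("a", None)]
    print(sorted(expand_following({"a"}, events)))
```

In this example `x` and `y` are never discovered, because nobody reachable from the seed retweets them. That is the "you might miss a few" caveat: coverage depends on how well connected the retweet graph is.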

A release for journalists would be nice too...

I can't wait to see someone legitimately design a better sewerage system by using twitter's geolocation.

Their geo-data is utter crap. The vast majority of it is based on 'profile location', which means that there are almost a million people tweeting from the exact center of Atlanta. It's a crowded spot, must be a Starbucks there or something.

Just find those mass locations and remove them from the data set.

You end up with about 0.01% of tweets having locations after that. It's basically just iPhone users.
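The "remove the mass locations" step could look something like the sketch below. It assumes each tweet is a dict with a hypothetical `coords` field holding a `(lat, lon)` tuple or `None`, and that a point shared by more than `max_per_point` tweets is a profile-location default rather than a real position. The threshold is arbitrary.

```python
from collections import Counter
from typing import Dict, List


def drop_mass_locations(tweets: List[Dict], max_per_point: int = 1000) -> List[Dict]:
    """Keep only tweets whose exact (lat, lon) pair is not shared by too many tweets.

    Tweets without coordinates are dropped as well.
    """
    counts = Counter(t["coords"] for t in tweets if t.get("coords") is not None)
    return [
        t for t in tweets
        if t.get("coords") is not None and counts[t["coords"]] <= max_per_point
    ]


if __name__ == "__main__":
    city_center = (33.749, -84.388)  # illustrative "center of a city" point
    sample = [{"id": i, "coords": city_center} for i in range(5)]
    sample.append({"id": 5, "coords": (40.0, -75.0)})
    sample.append({"id": 6, "coords": None})
    print(drop_mass_locations(sample, max_per_point=3))
```

Exact-match counting only catches points that are identical to the last decimal place. A real cleanup would probably need to bin nearby coordinates too, and as noted above, very little data may be left afterwards.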

It is all available to the public to begin with anyway. I don't see the dilemma here.

Not all tweets are public.

Surely they aren't coughing up private tweets?

That was my interpretation of "all", but reading back it seems there's another difference:

"Although a majority of tweets are public, if scientists want to freely search the lot, they do it through Twitter's application programming interface, which currently scours only 1 percent of the archive. But that is about to change: in February the company announced that it will make all its tweets, dating back to 2006, freely available to researchers."

I think all the data should be available.

Can anyone find a primary source?

There is a link to apply for a "Data Grant" here: https://engineering.twitter.com/research. Unfortunately, it looks like submissions are closed.

So much for the "protected" tweet illusion.

Who cares? Isn't it already available?

Yes it is, but not in Excel. (Written so a non-tech person could understand.)

Article date is Jun 1, 2014. Is the author from the future?

There is no magic discovery about the nature of man hidden away in that data. Nothing your average stand-up comedian hasn't already written a bit about.

That was a good one.

On a more serious note, there is one area of research that Twitter data is very valuable for: how information and disinformation are created and spread during major news events: wars, catastrophes, uprisings, school shootings etc.

My gut feeling is that today a lot of quality journalism happens outside of the traditional journalistic organisations. The downside is that also a lot of wild speculation and rumours are spread, but it would be valuable to see how good this modern "crowd" journalism is. A skilled research group can use Twitter data and Internet Archive to track down the original sources of information pretty well.

And momentum stock trades. I've always felt an adept programmer who knew how to data mine and worked for Twitter could make a fortune in the stock market.

That's a little cynical. I like to think we can predict big historical social events (regime changes, major protests, climate change?) based on Twitter's data.

You never know unless you try.

Are you familiar with Atrocity Watch? If not, you should look them up - that's exactly what they do.

I am not. Checking them out now; looks pretty cool.

Not to be skeptical, but I'm pretty sure one of these scientists may happen to work for the NSA.
