

Twitter to Release All Tweets to Scientists  - digital55
http://www.scientificamerican.com/article/twitter-to-release-all-tweets-to-scientists-a-trove-of-billions-of-tweets-will-be-a-research-boon-and-an-ethical-dilemma/

======
chbrown
I've heard that before.

* Library of Congress: [http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archi...](http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archive-at-the-library-of-congress/)

* Twitter Data grants: [https://blog.twitter.com/2014/introducing-twitter-data-grant...](https://blog.twitter.com/2014/introducing-twitter-data-grants)

I'll admit, I haven't applied for access through either one, but neither have
I seen any papers cite access through those venues—and I read quite a few NLP
+ Twitter papers.

~~~
denzil_correa
This article is just talking about Twitter Data Grants, for which 6
universities were selected as winners [0]. You won't see papers from these
grants yet because, well, the winners were announced only about 40 days ago!

[0] [https://blog.twitter.com/2014/twitter-datagrants-selections](https://blog.twitter.com/2014/twitter-datagrants-selections)

------
stokedmartin
Twitter started granting datasets some time back (the program is now closed)
[0] on the merits of a short proposal. Very few groups eventually got access
to the data [1]. I hope they increase the number of grants in the future.

[0] [https://blog.twitter.com/2014/introducing-twitter-data-grant...](https://blog.twitter.com/2014/introducing-twitter-data-grants)

[1] [https://blog.twitter.com/2014/twitter-datagrants-selections](https://blog.twitter.com/2014/twitter-datagrants-selections)

~~~
alexleavitt
Yes, I am pretty sure this article is just rehashing the Twitter grants (I
believe there were only 6 to 8 awards), rather than announcing full open data
for any researcher (which makes the title misleading).

------
apetresc
This is exciting to me; does anyone know how Twitter will go about this? Will
there be a public dataset available for download? A research contract through
the recently-acquired GNIP? Or just firehose access for future streams?

~~~
beejiu
Considering there's at least 400GB of data generated per day, I don't think
it'll be readily available for the public as a download.

~~~
StavrosK
Jeez, 400 GB of _text_ per day? How the hell?

EDIT: That's 11.5k tweets/sec. How do you get eleven thousand people to tweet
every second?

~~~
nevinera
Most of that data is not the content of the tweet itself, but the metadata
associated with it. When I last checked, we were storing about a kilobyte of
data for every tweet.

Also, tweets are limited to 140 _characters_, not _bytes_ - Chinese tweets
typically take about 200-250 bytes, for example.
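
To illustrate the characters-versus-bytes point, here is a minimal Python sketch. It assumes the text is stored as UTF-8 (the thread does not say which encoding Twitter uses), and both sample strings are made up.

```python
# Assumption: tweets are stored as UTF-8; the thread does not confirm this.
def utf8_size(text: str) -> int:
    """Return the number of bytes needed to store `text` as UTF-8."""
    return len(text.encode("utf-8"))


ascii_tweet = "a" * 140   # 140 ASCII characters, 1 byte each in UTF-8
cjk_tweet = "推" * 140    # 140 CJK characters, 3 bytes each in UTF-8

print(len(ascii_tweet), utf8_size(ascii_tweet))  # 140 characters, 140 bytes
print(len(cjk_tweet), utf8_size(cjk_tweet))      # 140 characters, 420 bytes
```

Under the same assumption, a Chinese tweet of roughly 70 to 80 characters would land in the 200-250 byte range mentioned above.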

~~~
StavrosK
Yeah, I figured 300 bytes per Tweet, to be generous, but didn't realize it
would take 1 kb of metadata. Thanks for that detail.
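
For anyone who wants to redo the back-of-envelope arithmetic in this subthread, a rough Python sketch follows. It assumes "400 GB" means decimal gigabytes and that the volume is spread evenly over the day; the per-tweet sizes are just the figures floated above, not measured values.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def tweets_per_second(bytes_per_day: float, bytes_per_tweet: float) -> float:
    """Average tweet rate implied by a daily data volume and a per-tweet size."""
    return bytes_per_day / bytes_per_tweet / SECONDS_PER_DAY


# Assumption: 400 GB/day read as 400 * 10^9 bytes.
DAILY_BYTES = 400e9

# 300 = the "generous" text-only guess, 1000 = about a kilobyte with metadata.
for size in (300, 400, 1000):
    rate = tweets_per_second(DAILY_BYTES, size)
    print(f"{size:>5} bytes/tweet -> {rate:,.0f} tweets/sec")
```

The 11.5k tweets/sec figure corresponds to roughly 400 bytes per tweet; at about a kilobyte per tweet the same volume implies closer to 4,600 tweets/sec.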

------
JoshTriplett
This would likely make a great natural language data set for compression
algorithms.

~~~
hyperbovine
@JoshTriplett tweets r alrdy #compressed. hth

~~~
loceng
Upvoted for much needed humour on HN.

------
Smulv
It appears as if the data is only available to those scientists who apply for
the data grant and win it. Furthermore, applications for the grant have been
closed since midway through March. Yeah, I'm not surprised Twitter isn't making
its historical data public. That would literally end Gnip, which is a revenue
source for Twitter not based on advertising to users.

------
jebus989
How about just loosening the API rate limits, or making a better token request
process with resource allocation, e.g. "I'd like 1500 requests per 15 min
window (as opposed to 15, for some things) for 72 hours"? I guess this could be
limited to those with an academic email address if they insist.

------
uptown
One thing I've wondered. Is it possible to follow "everyone" on Twitter? If
not, what type of cap does Twitter enforce on the number of accounts you're
allowed to follow? I realize it'd be difficult to know which new accounts to
add as people join, but how far could you push a roll-your-own stream of the
Twitter firehose?

~~~
freehunter
Hypothetically: I'm sure you could do it algorithmically; if your program sees
a retweet from someone who is not on your following list, you then follow
them. You might miss a few, but you would get most everyone.
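
A rough Python sketch of that idea, purely as an illustration: it uses no Twitter API, the `(retweeter, original_author)` event format and all account names are hypothetical, and it ignores whatever follow cap Twitter enforces.

```python
from typing import Iterable, Set, Tuple


def expand_following(seed: Set[str], retweets: Iterable[Tuple[str, str]]) -> Set[str]:
    """Grow a following set from a stream of (retweeter, original_author) pairs.

    A retweet is only visible if we already follow the retweeter. When the
    original author is not on the list yet, we add them.
    """
    following = set(seed)
    for retweeter, author in retweets:
        if retweeter in following and author not in following:
            following.add(author)
    return following


# Hypothetical event stream, processed in order.
events = [("alice", "bob"), ("bob", "carol"), ("dave", "erin"), ("carol", "alice")]
print(sorted(expand_following({"alice"}, events)))  # ['alice', 'bob', 'carol']
```

Accounts that are never retweeted by anyone already on the list (dave and erin here) are never discovered, which is the "you might miss a few" part.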

------
theg2
A release for journalists would be nice too...

------
NamTaf
I can't wait to see someone legitimately design a better sewerage system by
using Twitter's geolocation.

~~~
nevinera
Their geo-data is utter crap. The vast majority of it is based on 'profile
location', which means that there are almost a million people tweeting from
the _exact center_ of Atlanta. It's a crowded spot, must be a Starbucks there
or something.

~~~
Nicholas_C
Just find those mass locations and remove them from the data set.

~~~
nevinera
You end up with about 0.01% of tweets having locations after that. It's
basically just iPhone users.
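
A minimal Python sketch of the "remove the mass locations" step, to show why so little survives it. The coordinates, the `max_count` threshold, and the tiny sample are all hypothetical; this is not how Twitter's data is actually structured.

```python
from collections import Counter
from typing import List, Optional, Tuple

Coord = Tuple[float, float]


def drop_mass_locations(coords: List[Optional[Coord]], max_count: int) -> List[Optional[Coord]]:
    """Blank out any coordinate pair shared by more than `max_count` tweets.

    Identical coordinates on many tweets suggest a geocoded profile location
    (e.g. a city center) rather than a real per-tweet position.
    """
    counts = Counter(c for c in coords if c is not None)
    return [c if c is not None and counts[c] <= max_count else None for c in coords]


# Hypothetical sample: five tweets pinned to one city-center point,
# two with distinct coordinates, one with no location at all.
coords = [(33.749, -84.388)] * 5 + [(40.7128, -74.0060), None, (51.5074, -0.1278)]
cleaned = drop_mass_locations(coords, max_count=2)
kept = sum(c is not None for c in cleaned)
print(f"{kept} of {len(coords)} tweets keep a location")  # 2 of 8
```

On real data the threshold would need tuning, and the surviving fraction would be far smaller than in this toy sample.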

------
izzydata
It is all available to the public to begin with anyway. I don't see the
dilemma here.

~~~
namenotrequired
Not all tweets are public.

~~~
jonknee
Surely they aren't coughing up private tweets?

~~~
namenotrequired
That was my interpretation of "all", but reading back it seems there's another
difference:

"Although a majority of tweets are public, if scientists want to freely search
the lot, they do it through Twitter's application programming interface, which
currently scours only 1 percent of the archive. But that is about to change:
in February the company announced that it will make all its tweets, dating
back to 2006, freely available to researchers."

------
unclesaamm
Can anyone find a primary source?

~~~
mike415
There is a link to apply for a "Data Grant" here:
[https://engineering.twitter.com/research](https://engineering.twitter.com/research).
Unfortunately, it looks like submissions are closed.

------
_RPM
So much for the "protected" tweet illusion.

------
of
Who cares? Isn't it already available?

~~~
callesgg
Yes, it is, but not in Excel. (Written so a non-tech person could
understand.)

------
extesy
Article date is Jun 1, 2014. Is the author from the future?

------
flycaliguy
There is no magic discovery about the nature of man hidden away in that data.
Nothing your average stand-up comedian hasn't already written a bit about.

~~~
dirtyaura
That was a good one.

On a more serious note, there is one area of research that Twitter data is
very valuable for: how information and disinformation are created and spread
during major news events: wars, catastrophes, uprisings, school shootings, etc.

My gut feeling is that today a lot of quality journalism happens outside of
the traditional journalistic organisations. The downside is that a lot of
wild speculation and rumours are spread too, but it would be valuable to see
how good this modern "crowd" journalism is. A skilled research group can use
Twitter data and the Internet Archive to track down the original sources of
information pretty well.

~~~
marincounty
And momentum stock trades. I've always felt an adept programmer who knew how to
data mine and worked for Twitter could make a fortune in the stock market.

------
scalene
Not to be skeptical, but I'm pretty sure one of these scientists may happen to
work for the NSA.

