
Free SQL dump with 200 million tweets from 13 million users - _hfqa
About the data:

- DB size: 543 million rows
- Data size: 173GB (uncompressed)
- Stored in MySQL
- 200+ million tweets from 13+ million users
- Collected in 1 week
- Operation costs: 100+ dollars
- Rackspace Cloud - 1 CentOS 8GB RAM server
- Java, memcache, MySQL and Perl for core processing
- JS, PHP for analytics & visualization

* Download the data at this URL:
http://www.archive.org/details/2011-06-calufa-twitter-sql
======
sethish
Twitter changed their ToS to explicitly disallow distributing twitter dumps
like this: <http://chronicle.com/blogs/profhacker/the-end-of-twapperkeeper-and-what-to-do-about-it/31582>

I was part of the Web Ecology Project (and 140kit.com), both of which gave
large Twitter datasets to researchers.

~~~
tibbon
Additionally, this data probably isn't as useful as many might think. We found
that collecting random tweets isn't that useful for most research overall,
partly because all of the streaming APIs omit tweets. Even the 'full' firehose
seems to omit some tweets, so this can't be considered a complete set, nor
verified as a truly random sample.

~~~
calufa
-- I disagree.

- You can cluster users based on tweet data, link relationships and/or even
user-to-user relationships.

- Understand how retweets work and how fast they propagate.

- Sentiment analysis based on a specific keyword.

- Trend analysis.

There are any number of ways this dataset can be helpful. You have 200MM
tweets. Enough for a quick experiment using real data.

* It's true that it's "random" data. Just un-random it!
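As a rough illustration of the keyword-sentiment idea, here is a minimal sketch; the lexicon and sample tweets below are made up, and real text would come out of the MySQL dump:

```python
import re
from collections import Counter

# Tiny made-up lexicon and tweets -- real rows would come from the dump.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"terrible", "awful", "bad"}

tweets = [
    "I love the new phone, great battery",
    "this phone is terrible, awful screen",
    "phone arrived today",
]

def keyword_sentiment(tweets, keyword):
    """Count positive/negative lexicon hits in tweets mentioning a keyword."""
    counts = Counter(pos=0, neg=0)
    for text in tweets:
        words = set(re.findall(r"[a-z]+", text.lower()))
        if keyword in words:
            counts["pos"] += len(words & POSITIVE)
            counts["neg"] += len(words & NEGATIVE)
    return counts

print(keyword_sentiment(tweets, "phone"))  # pos=2, neg=2 for this sample
```

Obviously a real lexicon and tokenizer would be needed, but the shape of the experiment is this small.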

~~~
tibbon
User-to-user relationships aren't that great with incomplete data of the
tweets, but also of the social graph. Pulling a large social graph from
Twitter is nearly impossible and getting deltas on anything more than a few
hundred people is equally impossible.

Propagation of retweets really needs a near-complete dataset of those
tweets/retweets. A streaming sample of the dataset really isn't great for this.

Sentiment analysis can be done to determine the overall feeling on a topic,
but the result would feel really incomplete on this dataset. Again, pulling
the stream for the term or keyword you're looking to sample is much better.
Most sentiment analysis on Twitter is pretty flawed anyway.

Trend analysis works OK on this dataset, but measuring the true magnitude of
an event (like Osama bin Laden being killed) would be hard, since you don't
know what portion of the tweets you've actually got.

I worked with Sethish on the Web Ecology Project. I wouldn't call your dataset
useless, but it really would be more useful generally to have a question, then
pull the best possible data that will help you answer that question. Otherwise
there's going to be a lot more unknowns that make it a weaker piece of
research.

~~~
calufa
Your points are valid.

I want to clarify that this dump is for learning purposes, due to the lack of
"open data".

If people can play with real data from real people and get "real" inputs, that
can encourage curious programmers to join the data-mining party. I know there
are other dumps out there, no problem with that. This is just another dump; it
may help people come up with ideas without the need of coding a multi-threaded
scraper.

------
jdvolz
Calufa, next time you're in Vegas, send me a message and we'll get a beer.
Thank you. You just made something I'm doing vastly more awesome.

------
StavrosK
Torrent here, when done: <http://burnbit.com/torrent/170493/twitter_sql_bz2>

------
calufa
import to mysql:

bunzip2 < my_database.sql.bz2 | mysql -h localhost -u root -p my_database

------
aonic
Thanks! I'm more interested in the scraper... is it open source? If yes, where
can we download it? If not, can you write about your experience building it?

~~~
calufa
I will blog about how I did it in a few days...

~~~
jason_slack
Where do you Blog so I can add to my RSS?

~~~
calufa
I don't have a blog, sorry. I will open one soon...

Feel free to follow me <http://twitter.com/calufa>.

------
JeeyoungKim
Hey guys, what would be the sanest way to work with this dataset? At 173GB,
it's probably hard to load it all on a single machine.

------
ck2
Hmm, how many days back does it go?

Twitter search still only goes back 10 days in 2011, so how deep is this data?

~~~
calufa
To be honest, I have no idea. It crawled 13MM users, and some accounts can be
very old, with very old tweets... You can look at the CD_data table, find the
tweet HTML code, and parse the timestamp.
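A sketch of that timestamp parsing; the `data-time` attribute below is an assumption about how the cached HTML embeds the epoch time, so inspect a few CD_data rows first and adjust the regex:

```python
import re
from datetime import datetime, timezone

# "data-time" is an assumed attribute name -- check real CD_data rows.
TIME_RE = re.compile(r'data-time="(\d+)"')

def tweet_timestamp(html):
    """Return the tweet's UTC datetime, or None if no epoch field is found."""
    m = TIME_RE.search(html)
    if not m:
        return None
    return datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)

sample = '<span class="_timestamp" data-time="1308000000">13 Jun 11</span>'
print(tweet_timestamp(sample))  # 2011-06-13 21:20:00+00:00
```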

~~~
ck2
Apparently Twitter now has 100+ Million tweets per DAY.

So you caught about 2 days' worth, but spread randomly in time.

------
laprise
Neat! Here are some tips for creating a kick-ass graph visualization:
<http://www.martinlaprise.info/2010/02/15/visualize-your-own-twitter-graph-part-2/>

------
nametoremember
Damn, I just saw this. I would have liked to use it. How can Twitter make you
take it down when it is all public information anyway?

------
calufa
AN EMAIL FROM TWITTER KILLED THE DATASET --- :S

~~~
user24
Can you give more detail? The link is still up... What did they say?

edit: reply via twitter: "they asked me to remove the dump due TOS"
(<http://twitter.com/#!/calufa/status/78556903772393474>)

which I guess is what I expected.

But are scrapers subject to TOS?

------
JeeyoungKim
Does anybody want to share the MD5 hash of the file? I'm trying to decompress
it, and I keep getting an error.

~~~
JeeyoungKim
Wait, the torrent link has it. I do have the same MD5 hash, and yet it keeps
crashing whenever I try to uncompress this thing... wtf is going on.

~~~
justadude
Did you figure out how to get this working? I tried 7-Zip as well as WinRAR,
and both errored out.

------
juiceandjuice
Wow, I just downloaded that whole archive in a minute.

~~~
calufa
bz2 compression ;) --- 1147480:1 compression ratio

~~~
joelthelion
Just shows how much real information is in tweets: not much :)

------
8maki
Oh, it's an awesome dump. Are these mainly from the US?

------
chrisjsmith
All that is meaningless chatter between people and information about bathroom
habits. Perhaps if we pooled that distributed effort into something
constructive, the world would be a better place.

~~~
PostOnce
<http://twitter.com/#!/id_aa_carmack>

"Msbuild seems to limit to 100 files on a cl command line, which introduces
noticeable sync losses when parallel building on 24 threads."

It's not all meaningless, you just choose to follow meaningless users.

