Free SQL dump with 200 million tweets from 13 million users
98 points by calufa 2209 days ago | 36 comments
About the data:

- DB Size: 543 million rows

- Data Size: 173GB (uncompressed)

- Stored in mysql

- 200+ Million tweets from 13+ Million users

- Collected in 1 week

- Operation costs: 100+ dollars

- Rackspace Cloud - 1 CentOS 8GB Ram server

- Java, memcache, mysql and perl for core processing

- js, php for analytics & visualization

* Download the data at this url http://www.archive.org/details/2011-06-calufa-twitter-sql

Twitter changed their ToS to explicitly disallow distributing twitter dumps like this: http://chronicle.com/blogs/profhacker/the-end-of-twapperkeep...

I was a part of the webecology project (and 140kit.com), both of which gave large twitter datasets to researchers.

-- Oops, I forgot to scrape the TOS

I have 500GB of tweets from the random sample stream, in raw JSON form, going back to August 2010. Aside from some very broad measures it hasn't been all that useful to us. I never considered distributing it until now, but the TOS clause forbidding redistribution rankles me. Are we really all okay with contributing content under these terms?

Additionally, this data probably isn't as useful as many might think. We found that collecting random tweets isn't that useful for most research, partly because every one of the streaming APIs omits tweets. Even the 'full' firehose seems to omit some tweets, so it can't be considered a complete set, nor verified as a completely random set.

-- I disagree.

- You can cluster users based on tweet data, link relationships, and/or even user-to-user relationships.

- Understand how retweets work and how fast they propagate.

- Sentiment analysis based on a specific keyword.

- Trend analysis.

There are any number of ways this dataset can be helpful. You have 200MM tweets. Enough for a quick experiment using real data.

* It's true that it is "random" data. Just unrandom it!

User-to-user relationships aren't that great when both the tweets and the social graph are incomplete. Pulling a large social graph from Twitter is nearly impossible, and getting deltas on anything more than a few hundred people is equally impossible.

Propagation of retweets really needs a near-complete dataset of those tweets/retweets. A streaming sample of the dataset really isn't great for this.

Sentiment analysis can be done to determine the overall feeling on a topic, but I'd feel really incomplete doing it on this dataset. Again, pulling the stream for the term or keyword you're looking to sample is much better. Most sentiment analysis on Twitter is pretty flawed anyway.

Trend analysis works OK on this dataset, but measuring the true magnitude of an event (like Osama being killed) would be hard, since you don't know what portion of the tweets you've actually got.

I worked with Sethish on the Web Ecology Project. I wouldn't call your dataset useless, but it really would be more useful generally to have a question, then pull the best possible data that will help you answer that question. Otherwise there's going to be a lot more unknowns that make it a weaker piece of research.

Your points are valid.

I want to clarify that this dump is for learning purposes, due to the lack of "open data".

If people can play with real data from real people and get "real" inputs, that can encourage curious programmers to join the data-mining party. I know there are other dumps out there, np with that. This is just another dump; it may help people come up with ideas without the need to code a multi-threaded scraper.

This is a gold mine for a budding programmer. Anyone interested in learning MapReduce frameworks or messing with sentiment analysis/classification should get this data.

With that said...the data is now unavailable?

This is ridiculous. Are ToS of this kind really enforceable? (What about outside the US?)

(Anyway, I guess that in practice, if a torrent with a dump of tweets appears somewhere, it's pretty hard to find out who did it. Yes, Twitter could do some clever watermarking of the API results or correlate the dump contents with the server logs, but it would probably be a lot of work.)

Calufa, next time you're in Vegas, send me a message and we'll get a beer. Thank you. You just made something I'm doing vastly more awesome.

import to mysql:

bunzip2 < my_database.sql.bz2 | mysql -h localhost -u root -p my_database

Thanks! More interested in the scraper.. is it open-source? If yes, where can we download it? If not, can you write about your experience in building it?

Writing a Twitter scraper is pretty trivial and you can find several good examples on Github. I'd put mine online, but the commands I was using in 2009/2010 have largely been changed or deprecated, and my code wouldn't run.

In either case, as Sethish said, distributing dumps like this is against the new ToS.

I will blog about how I did it in a few days...

Where do you blog, so I can add it to my RSS?

I don't have a blog, sorry. I will open one soon...

Feel free to follow me http://twitter.com/calufa.

If you are interested in crawling FB, check this out http://www.zubha-labs.com/oauth-trick-for-facebook-desktop-a...

Hey guys, what would be the most sane way to work with this dataset? At 173GB, it's probably hard to load on a single machine.
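
One possible approach, offered as a rough sketch rather than a tested recipe for this particular dump: the file doesn't have to be loaded whole. Assuming it is a bz2-compressed SQL text dump (as the import command earlier in the thread suggests), it can be streamed line by line so memory use stays flat. The helper names `iter_dump_lines` and `count_insert_statements` are made up for illustration, and the `INSERT INTO` prefix is an assumption about how the dump was written.

```python
import bz2
from typing import Iterator


def iter_dump_lines(path: str) -> Iterator[str]:
    """Yield decoded lines from a bz2-compressed text dump, one at a time."""
    # bz2.open in text mode decompresses incrementally, so the whole
    # file is never held in memory at once.
    with bz2.open(path, mode="rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            yield line.rstrip("\n")


def count_insert_statements(path: str) -> int:
    """Count lines starting with INSERT INTO (assumed dump format)."""
    return sum(
        1 for line in iter_dump_lines(path) if line.startswith("INSERT INTO")
    )
```

A quick probe like `count_insert_statements("my_database.sql.bz2")` gives a feel for the dump before committing to a full MySQL import. If the dump uses very long multi-row INSERT lines, individual lines can still be large, so this is only a starting point.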

Hmm, how many days back does it go?

Twitter search still only goes back 10 days in 2011, so how deep is this data?

To be honest I have no idea. It crawled 13MM users, some accounts can be very old with very old tweets... You can look at the CD_data table and look for the tweet html code and parse the timestamp.

Apparently Twitter now has 100+ Million tweets per DAY.

So you caught about 2 days worth but randomly in time.

@ck2 --- correct.

Neat! Here are some tips for creating a kick-ass graph visualization: http://www.martinlaprise.info/2010/02/15/visualize-your-own-...

Damn, I just saw this. I would have liked to use it. How can Twitter make you take it down when it is all public information anyway?


Can you give more detail? The link is still up... What did they say?

edit reply via twitter: "they asked me to remove the dump due TOS" (http://twitter.com/#!/calufa/status/78556903772393474)

which I guess is what I expected.

But are scrapers subject to TOS?

Does anybody want to share the MD5 hash of the file? I'm trying to decompress this file, and I keep getting an error.

Wait, the torrent link has it. I do have the same MD5 hash, and yet it keeps crashing whenever I try to uncompress this shit... wtf is going on.

Did you figure out how to get this working? I tried 7-zip as well as winrar and both errored out
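
For anyone hitting the same wall, here is a minimal sketch of the two checks discussed above: hashing the download in chunks so a large file never has to fit in memory, and test-decompressing the bz2 stream to see whether it reads cleanly end to end. The function names are hypothetical, and this only diagnoses the problem; it won't repair a broken archive.

```python
import bz2
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bz2_decompresses_cleanly(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the whole bz2 stream can be read without errors."""
    try:
        with bz2.open(path, "rb") as fh:
            # Discard the output; we only care whether decoding succeeds.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, ValueError):
        # Invalid data raises OSError; a truncated stream raises EOFError.
        return False
    return True
```

If the MD5 matches the published one but `bz2_decompresses_cleanly` returns False, the uploaded file itself is probably damaged rather than the download.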

Wow, I just downloaded that whole archive in a minute.

bz2 compression ;) --- 1147480:1 compression ratio

Just shows how much real information is in tweets : not much :)

Oh, it's an awesome dump. Are these mainly from the US?

All that is meaningless chatter between people and information about bathroom habits. Perhaps if we pooled that distributed effort into something constructive, the world would be a better place.


"Msbuild seems to limit to 100 files on a cl command line, which introduces noticeable sync losses when parallel building on 24 threads."

It's not all meaningless, you just choose to follow meaningless users.
