

Tweets 2011 Corpus - yarapavan
http://trec.nist.gov/data/tweets/

======
rspeer
Isn't this completely useless?

You can download about 10 million tweets a day just by watching Twitter's
basic streaming API. You don't have to sign anything, and you probably get
them faster because you just keep one HTTP connection open.
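
Something like this is all it takes (a sketch, assuming the 2011-era
sample endpoint and the basic auth it accepted back then; credentials
are placeholders):

    # Minimal sketch of consuming Twitter's sample ("spritzer") stream.
    # Endpoint and basic auth are the 2011-era API; both have since changed.
    import json
    import requests

    STREAM_URL = "https://stream.twitter.com/1/statuses/sample.json"

    resp = requests.get(STREAM_URL, auth=("user", "password"), stream=True)
    for line in resp.iter_lines():
        if not line:
            continue  # skip keep-alive newlines
        tweet = json.loads(line)
        if "text" in tweet:  # skip delete notices and other stream events
            print(tweet["id"], tweet["text"])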

EDIT: Okay, I can see why it might be marginally useful -- to ensure that
people doing similar research get the same set of tweets. But, as others
have pointed out, even that breaks down if people actually comply with
delete requests, since they won't all apply the deletions at the same time.

~~~
Tichy
Ah, so the tweets provided here are just a sample, just as the basic stream
provides? At first I thought it would be all Tweets from 2011, which would be
cool...

------
chime
> Twitter provided identifiers for approximately 16 million tweets sampled
> between January 23rd and February 8th, 2011.

> Note that it can take several days to download your copy of the Tweets2011.

Why does Twitter make it so cumbersome to download tweets, especially for
non-commercial purposes? Twitter data can be REALLY useful to researchers.
Here's what I'm doing with it:
<http://ktype.net/wiki/research:articles:progress_20110209>

I'd love to re-run my parsing algo and build a newer, better n-gram list,
but this seems like a lot of effort to download what could have been a
simple torrent. I understand the need to enforce delete-tweet requests,
but that can still be accomplished regardless of how the data is
downloaded.
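
For example, honoring deletions against a local copy is just a filter
(a sketch; the file names, formats, and the idea that Twitter would
publish a plain list of deleted IDs are all assumptions):

    # Sketch: scrub a local tweet corpus against a list of deleted tweet
    # IDs. The point is that compliance is independent of how the corpus
    # was originally delivered.
    import json

    with open("deleted_ids.txt") as f:
        deleted = {line.strip() for line in f if line.strip()}

    with open("tweets.jsonl") as src, open("tweets.clean.jsonl", "w") as dst:
        for line in src:
            tweet = json.loads(line)
            if str(tweet["id"]) not in deleted:
                dst.write(line)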

------
prosa
It's good to see that in this day and age the government has still mastered
the tried and true "print, sign, scan, email, then download via FTP" approach
to file downloads.

~~~
matt4711
As someone who downloaded the corpus, I can tell you that's not how it
works. What you download from NIST is a Java Twitter HTML crawler and a
list of tweet IDs; the tweets themselves you have to fetch directly from
Twitter.
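
In essence the crawler does something like this (a Python sketch of the
idea; the real tool is the Java crawler, and the exact ID-file format
shown here is an assumption):

    # Sketch of what the NIST crawler does: fetch each tweet's HTML page.
    # Assumes an ID file of "tweet_id username" lines, which is roughly
    # what ships; the real tool adds retry and rate-limit handling.
    import time
    import requests

    with open("tweet_ids.txt") as f:
        for line in f:
            tweet_id, user = line.split()[:2]
            url = f"http://twitter.com/{user}/status/{tweet_id}"
            resp = requests.get(url)
            if resp.status_code == 200:
                # The real crawler parses the tweet text out of the HTML.
                print(tweet_id, len(resp.text), "bytes")
            else:
                print(tweet_id, "gone:", resp.status_code)
            time.sleep(1)  # be polite to twitter.com; this is why the
                           # full crawl takes days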

It took me a week or more to download the complete 16 million tweets.

Another problem with the corpus is that (at the time I downloaded the
tweets) around 2% of the tweets in it were no longer available from
Twitter because users had deleted their accounts. The longer you wait,
the more tweets are going to be unavailable.

~~~
Someone
It is worse. _"in particular you agree ... to delete tweets that are marked
deleted in the future"_

Because of that, I do not see how anybody can even think about using this
dataset for research.

If you keep all data, you are in breach of the license. If you do not, you are
guaranteeing that you cannot reproduce your results in the future.

Also, there is the practical side. I would guess it takes a week to check
for deleted tweets, too, so how are you going to comply with that clause?
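
Mechanically, complying would mean something like a recurring pass over
every stored tweet (a sketch only; the URL pattern, file name, and record
fields here are assumptions):

    # Sketch of the recurring compliance pass the license seems to
    # require: re-check every stored tweet and drop the ones Twitter now
    # reports gone. Treating any non-200 status as "deleted" is a guess.
    import json
    import requests

    kept = []
    with open("tweets.clean.jsonl") as f:
        for line in f:
            tweet = json.loads(line)
            # Assumes each stored record keeps a flat screen-name field.
            url = f"http://twitter.com/{tweet['user']}/status/{tweet['id']}"
            if requests.get(url).status_code == 200:
                kept.append(line)

    with open("tweets.clean.jsonl", "w") as f:
        f.writelines(kept)

That is effectively a full re-crawl, so it costs as much time as the
original download every time you run it.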

------
utunga
Pretty excited to see this. The research community needs a standardized
corpus.

However, it's not as exciting as I had hoped from the heading.

What would be awesome - for us anyway - is if it were all/most tweets from
a sampling of _users_ (not just random tweets across all users), because
then we could do the kind of analysis we're trying to do at hashmapd.com.
Now _that_ would be awesome.
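
Something like this, i.e. sample users first and then pull each one's
timeline (a sketch against the unauthenticated v1 REST API of the day;
endpoint and parameters are from memory, and the user list is a
placeholder):

    # Sketch: build a per-user corpus by sampling users and fetching
    # each user's recent timeline. The 2011-era v1 REST API allowed
    # unauthenticated reads of public timelines; it is long gone now.
    import requests

    URL = "http://api.twitter.com/1/statuses/user_timeline.json"
    users = ["user_a", "user_b"]  # however you sample them

    for user in users:
        resp = requests.get(URL, params={"screen_name": user, "count": 200})
        for tweet in resp.json():
            print(user, tweet["id"], tweet["text"])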

If it's not that, why not just consume the spritzer for a few days? You'd
get the same kind of data. Well, I guess the objective is to create a
standardized dataset, which is neat, but not what _we_ need.

