
The Library of Congress Acquires Entire Twitter Archive - fogus
http://www.facebook.com/notes/the-library-of-congress/how-tweet-it-is-library-acquires-entire-twitter-archive/110775778955250
======
pyre
I wonder if the LoC keeps dumps from Wikipedia (in relation to the other
frontpage story). I would think that the full 5.6TB Wikipedia dump would
probably contain more useful information on average than 20TB of tweets.

[ref: <http://news.ycombinator.com/item?id=1265138>]

------
lecha
This headline reads like one for an Onion news story

~~~
chaosmachine
I thought for sure this was a late April Fools' post.

~~~
xelipe
Yeah, I thought the same thing... then the fear that my tweets to trending
hashtags like #threewordsforyou or #worstlies are in the Library of Congress
in perpetuity.

------
protomyth
Hopefully they'll convert the short urls to long and make some arrangements to
download pictures from twitphoto and their ilk.

~~~
paulgb
If not, this would be a cool project for a third-party distributed project
like SETI@home or folding@home.

~~~
chronomex
I'm a member of ArchiveTeam (<http://www.archiveteam.org/>). One of our
current projects is to archive all the URL shorteners we can find. We're going
to make the full dataset available for public download.

~~~
alanh
Can we report shorteners? For example, I host a private-use one at
<http://ajh.us> that's powered by my project <http://lessnmore.net>

~~~
Kadin
While they may well have a way for you to submit archives, I think the greater
question to ask is "why?"

URL shorteners are bad enough on their own, but arguably a necessary evil --
encouraging everyone to set up their own is in no way a good thing.

I can't think of a _worse_ service for an average user (or anyone without a
datacenter, or better yet multiple datacenters) to run. By all means run your
own email server -- if it craps out, at least you'll just lose _your_ email.
But if you run a URL shortening service and send out or post a lot of URLs
using it, and then you go bankrupt, die, or just get bored and decide to stop
running it ... now everyone who happens to have one of the links is S.O.L.

It strikes me as the kind of service that's relatively easy to run, but unless
someone is prepared to run it _forever_ , they probably shouldn't.

~~~
alanh
I am using it primarily to shorten links back to <http://alanhogan.com>, and
since I run both domains, I intend to keep both of them running as long as
possible -- and when one goes down, the other becomes pretty much useless.

But yes, I agree that it is important to intend to run these services forever.

------
astine
I work for the LoC. I think I'd have heard of this.

Edit: It is my considered opinion that a lot of what the LoC does is
excessively ambitious. The project I'm currently working on is called the Born
Digital project, which is an attempt to allow major broadcasters like PBS and
CNN to submit broadcast material to the Library in real-time. The tools to do
this, do not currently exist.

~~~
jonknee
> The tools to do this, do not currently exist.

Considering that all the major networks beam their broadcasts all over the
world in real time I don't see what the challenge is other than storage space.
Startups like Justin.tv have figured out how to not only capture but stream
thousands of simultaneous live streams, so the tools certainly exist.

~~~
astine
The primary difficulty is that the Library is attempting more than just live
capture of video streams. They are going for full archival quality. They're
not simply saving streams to tape but running extensive QC processes on
everything and transcoding the material into multiple formats.
Technically, the technology exists, but the Library does not have the hardware
to run the software. (Unless someone knows of some efficient, off-the-shelf
content-validation program that we don't know of...)

Other problems include, yes, storage. The projected aggregation rate will
outstrip not only the Library's current physical facilities but all planned
new facilities by 2013 (at which point we will be looking at 250PB of
storage). The rate of growth will continue to increase from there. Unless tape
technology improves at an exponential rate, this seems unavoidable.

~~~
jonknee
Are you storing everything or just their unique content? Commercials and
whatnot make up a large percentage (30%) and then re-runs (unknown percentage, but
huge). And why multiple formats? Unless you're distributing the content it
seems more efficient to transcode it when you need it.

~~~
astine
I really don't know. I don't make those decisions.

------
joshwa
Is anyone concerned about copyright issues? I'd have to go back and consult
the Twitter TOS...

edit:

"You agree that this license includes the right for Twitter to make such
Content available to other companies, organizations or individuals who partner
with Twitter for the syndication, broadcast, distribution or publication of
such Content on other media and services, subject to our terms and conditions
for such Content use.

Such additional uses by Twitter, or other companies, organizations or
individuals who partner with Twitter, may be made with no compensation paid to
you with respect to the Content that you submit, post, transmit or otherwise
make available through the Services.

We may modify or adapt your Content in order to transmit, display or
distribute it over computer networks and in various media and/or make changes
to your Content as are necessary to conform and adapt that Content to any
requirements or limitations of any networks, devices, services or media."

~~~
whyenot
What copyright issues? The US Copyright Office is part of the Library of
Congress. One of the LOC's responsibilities is to maintain a collection of all
copyrighted works.

 _The Library serves as a legal repository for copyright protection and
copyright registration, and as the base for the United States Copyright
Office. Regardless of whether they register their copyright, all publishers
are required to submit two complete copies of their published works to the
Library if requested—this requirement is known as mandatory deposit._
<http://en.wikipedia.org/wiki/Library_of_Congress>

~~~
jamesbritt
Even for non-USA citizens?

~~~
whyenot
IANAL, but I don't think your citizenship is an issue. What matters is where a
work is published, and in Twitter's case that would appear to be the United
States.

------
michael_nielsen
This has been confirmed on the Library of Congress blog:
<http://www.loc.gov/tweet/how-tweet-it-is.html>

It is not a fake or a late April Fools' joke, despite the fact that the original
post is from Facebook.

------
cobralibre
I wonder how the content and quality of discourse on Twitter would have
differed if its users had known all along that their posts would be archived
for posterity by the LoC. This has been like an _Ender's Game_ for
hyperbanality!

~~~
arthur_debert
It's very likely that most people using twitter have no idea what the LoC is.

------
wallflower
It will be interesting if they extend their CQL (Contextual Query Language)
implementation to support complex searches (e.g. who tweeted first about X)

<http://www.loc.gov/standards/sru/index.html>

------
iamdave
Great, now future generations can see how obsessed we are with Justin Bieber.

------
ja27
How interesting is it that they announced that on Facebook?

------
jrockway
One more government database...

~~~
tptacek
First they want to track all of our _books_... now _this_? Next it'll be the
newspapers. Just you wait.

~~~
jrockway
Books were intended for wide distribution, but not all tweets are.

I don't care that this was done, and it's nice that this snapshot of 2008-2010
will be available for future researchers. But I think this sort of dataset is
more useful for nefarious purposes (government investigations) than a bunch of
books or newspapers, and it would be too expensive to build for just one
investigation.

"Build it and they will come." It's built.

------
axod
I don't see why this particularly matters. You could likely store the entire
twitter archive on a couple of memory sticks, so it doesn't seem to warrant
some big library to "house" it.

~~~
tibbon
No. It's fairly big data, actually. Around 55M tweets per day currently, although
it wasn't always that big. Maybe that doesn't sound that big, but it's
at least a few terabytes of data.

I think this is big for a few reasons. One, if it's in the LoC then it is
(probably) available to the public. Previously it wasn't. That's pretty big.

Also, with the LoC having a public archive of it, we can keep better track of
politicians who post things online, always a good thing since occasionally
they shift their positions or outright lie.

As a data analyst, I'm really excited.

~~~
Gormo
55 million tweets/day * 140 characters/tweet ~= 14.5 GB/day, presuming a
16-bit character encoding.

That's only about 20 TB for four years' worth of data, or about $1500 worth of
storage, before compression.
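
For anyone who wants to check the arithmetic, here is a rough sketch of that
back-of-the-envelope estimate. It assumes every tweet uses the full 140
characters at 2 bytes each, a constant 55M tweets/day over four 365-day years,
and no metadata or compression. The function and constant names are made up
for illustration.

```python
# Back-of-the-envelope size of a tweet archive (text only, no metadata).
# All inputs are the assumptions stated in the comment above, not measured data.

TWEETS_PER_DAY = 55_000_000   # current rate quoted upthread
CHARS_PER_TWEET = 140         # worst case: every tweet is full length
BYTES_PER_CHAR = 2            # 16-bit character encoding
DAYS = 4 * 365                # four years, ignoring leap days


def archive_size_bytes(tweets_per_day, chars_per_tweet, bytes_per_char, days):
    """Return the raw text size in bytes for the given number of days."""
    return tweets_per_day * chars_per_tweet * bytes_per_char * days


daily = archive_size_bytes(TWEETS_PER_DAY, CHARS_PER_TWEET, BYTES_PER_CHAR, 1)
total = archive_size_bytes(TWEETS_PER_DAY, CHARS_PER_TWEET, BYTES_PER_CHAR, DAYS)

# Report both decimal (GB/TB) and binary (GiB/TiB) units, since the two
# differ by several percent at this scale.
print(f"per day:    {daily / 1e9:.1f} GB ({daily / 2**30:.1f} GiB)")
print(f"four years: {total / 1e12:.1f} TB ({total / 2**40:.1f} TiB)")
```

The daily figure lands within about 1 GB of the one quoted above, depending on
whether you count in decimal or binary units, and the four-year total stays in
the neighborhood of 20 TB either way. Since the rate was lower in earlier
years, this is an upper bound on the text alone.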

That doesn't seem like it would be a significant challenge for the LOC. I'm
more interested in understanding why they would want to use that amount of
space to archive Twitter, as opposed to anything else.

~~~
FlemishBeeCycle
Combination of low-hanging fruit plus inflated idea of twitter's usefulness
(much like other people largely unfamiliar with today's technology culture)?

