
Update on the Twitter Archive at the Library of Congress - dizzystar
https://blogs.loc.gov/loc/2017/12/update-on-the-twitter-archive-at-the-library-of-congress-2/
======
olivermarks
I'm so old I can remember when you could see _all_ the live tweets flying by
last decade.

Great quote in the (too long) nyer article: “Talk, talk, talk: the utter and
heartbreaking stupidity of words,” William Faulkner 1927.

I'm sure the NSA, GCHQ, the big platform companies etc are archiving all the
tweets forever, it's a shame the public don't have access to curated content
taxpayers have paid to keep in those repositories.

The LoC can't possibly archive the entire internet without massive investment.

------
testplzignore
I've wondered how much hardware it takes to store and search over every Tweet.
I figure the total number of Tweets is in the low trillions. I know Twitter
does this themselves and presumably
[http://support.gnip.com/apis/search_full_archive_api/](http://support.gnip.com/apis/search_full_archive_api/)
uses the same backend servers. Does Twitter have any public info on how much
hardware is dedicated to this?

~~~
strictnein
It's sometimes mentioned in their blog posts:

[https://blog.twitter.com/engineering/en_us/topics/infrastruc...](https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html)

Not a lot of specifics, but there's some interesting tidbits:

> Hadoop: We have multiple clusters storing over 500 PB divided in four groups
> (real time, processing, data warehouse and cold storage). Our biggest
> cluster is over 10k nodes. We run 150k applications and launch 130M
> containers per day.

------
killjoywashere
> the cost of properly arranging and organizing vast amounts of information is
> frequently underestimated.

There are a lot of jobs to be had here. But not in tweets. Think cancer
research, molecular biology, taxonomy, behavioral and experimental psychology.
The data

~~~
JadeNB
> There are a lot of jobs to be had here. But not in tweets. Think cancer
> research, molecular biology, taxonomy, behavioral and experimental
> psychology. The data

Did your comment get truncated?

~~~
jstarfish
He stopped at the 160-character limit to make a point.

~~~
JadeNB
Thank you; it totally went over my head.

------
Spivak
It seems like a good decision. I don't think there's much merit in archiving
the entire firehose.

~~~
sp332
It's impossible to know ahead of time which tweets or trends are going to be
interesting.

~~~
QAPereo
True, but if you fill the records with so much noise that finding the
interesting ones amounts to concentrating gold from seawater, you’re no longer
doing the job an archivist should.

~~~
allenz
If you have the firehose, it is easy to derive the selective archive by
filtering on the list of noteworthy accounts. Ignoring monetary costs,
archiving the firehose is strictly better than selective archival. It enables
future researchers to examine trends on a societal scale.
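
A minimal sketch of that derivation in Python, assuming each tweet is a dict
with hypothetical `user` and `text` fields (the real firehose payload is much
richer) and that the list of noteworthy accounts is already known:

```python
from typing import Iterable, Iterator


def selective_archive(firehose: Iterable[dict], noteworthy: set[str]) -> Iterator[dict]:
    """Yield only the tweets whose author is on the noteworthy list."""
    for tweet in firehose:
        # "user" is an assumed field name, used here for illustration only.
        if tweet["user"] in noteworthy:
            yield tweet


# Made-up sample data standing in for the firehose.
sample_firehose = [
    {"user": "POTUS", "text": "official statement"},
    {"user": "random_user", "text": "what I had for lunch"},
    {"user": "SenatorExample", "text": "floor vote today"},
]

kept = list(selective_archive(sample_firehose, {"POTUS", "SenatorExample"}))
kept_users = [tweet["user"] for tweet in kept]
print(kept_users)
```

The reverse is not possible: a selective archive cannot be expanded back into
the firehose once the other tweets were never stored.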

~~~
WorldMaker
Or find things that seemed insignificant at the time but became significant
later with hindsight.

~~~
optimuspaul
not to mention the fact that twitter is basically millions of monkeys banging
away at keyboards, eventually they will produce some great works.

~~~
QAPereo
For values of “eventually” which include timescales on the order of proton
decay...

------
Uehreka
Given that tweets are only 280 characters and there are perhaps a couple
hundred accounts that the LoC archives, I fail to understand why the premier
archiving project in the country can't keep a slowly growing fistful of
gigabytes on a hard drive somewhere.

~~~
falcolas
From someone who worked with the Twitter firehose briefly: A handful of gigs
per day. More like hundreds of terabytes of information total, headed into
petabyte territory. First, there's a lot of tweets. Also, the message itself
is not all the data available with a tweet. Toss in any semantic processing or
tagging and each message grows quickly.

The company I worked with spent a non-trivial amount of money storing
historical tweets. I'd even go so far as to say that was the majority of the
IaaS costs - even more than the compute required to process them in real time.

~~~
Uehreka
OK fair enough, I didn't account for metadata. I'm gonna try some back of the
envelope math, tell me if I go off the rails anywhere:

First, how many accounts are we talking about here? Between POTUS, VP, their
spokespeople, cabinet members and official agency accounts, I'm going to
assume the executive branch has about 50 accounts that ought to be archived.
For congress, I think it's reasonable to archive each member's account and
their spokesperson: 200 accounts for the Senate, 870 accounts for the House. In the
interest of Fermi estimation (and because I'm not sure if every one of these
people has an account) let's call it 1000 accounts.

I'm going to go with a mean of 10 tweets per day (again in the name of Fermi
estimation).

With 280 characters + metadata, I'm going to round up to 1KB per tweet.

1000 accounts * 10 tweets/day * 1KB/tweet = ~10MB/day = ~3.65GB/year = ~40GB
for the current lifetime of Twitter
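
The same estimate as a small Python sketch, so the inputs are easy to change.
Every constant is one of the rough guesses above; `YEARS = 11` is an added
assumption (Twitter launched in 2006, and this thread is from late 2017), and
1KB is taken as 1000 bytes:

```python
# Fermi estimate for a selective archive. All inputs are rough assumptions.
ACCOUNTS = 1000          # officials plus spokespeople, rounded
TWEETS_PER_DAY = 10      # assumed mean per account
BYTES_PER_TWEET = 1000   # 280 characters plus metadata, rounded up to 1KB
YEARS = 11               # assumed lifetime of Twitter so far


def archive_bytes(accounts: int, tweets_per_day: int, bytes_per_tweet: int, days: int) -> int:
    """Total bytes stored for the given number of days."""
    return accounts * tweets_per_day * bytes_per_tweet * days


per_day = archive_bytes(ACCOUNTS, TWEETS_PER_DAY, BYTES_PER_TWEET, 1)
per_year = archive_bytes(ACCOUNTS, TWEETS_PER_DAY, BYTES_PER_TWEET, 365)
lifetime = per_year * YEARS

print(f"{per_day / 1e6:.2f} MB/day")
print(f"{per_year / 1e9:.2f} GB/year")
print(f"{lifetime / 1e9:.2f} GB over {YEARS} years")
```

Bumping `BYTES_PER_TWEET` by 10x to allow for heavier metadata still leaves
the total in the hundreds of gigabytes.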

If you're drinking from the firehose to archive tweets for a huge userbase
(and feed them into models or perform semantic analysis) I could see this
getting expensive. If you're just trying to archive tweets from a
certain group of users and keep them on a disk (with a tarball or zip file
released quarterly) it feels a bit more doable.

That said, this little thought exercise has gotten me thinking a lot more
about what I expect of the LoC and the National Archives. I'd be happy with a
"cold storage" record of the tweets being preserved for posterity, but they
may see their mission as making the tweets into a tagged searchable
collection. I also think there are arguments to be had about how many people's
accounts really need to be archived (perhaps states could handle archiving
their own reps).

In the end I guess I'm cool with the LoC scaling back collection as long as
they're transparent about how they're doing it.

~~~
rajivm
They were doing all accounts previously, not just government officials. Their
new selection is likely to still include government officials.

------
miheermunjal
I feel I'm not the only one thinking that "select few" to archive will exclude
a certain public figure. The benefit to the "full dump" is that it includes
everything, without any bias or removals.

~~~
floren
Are you implying that the Library of Congress will exclude communications from
the President of the United States? That seems like a stretch.

------
teh_klev
The article title is misleading. They haven't "quit twitter"; they're just no
longer archiving every public tweet, only a select few that are likely to be
historically interesting or noteworthy (e.g. Trump's twitter utterances).

Perhaps the title should be editorialised to something like:

"Library of Congress announce a change in collections practice for Twitter"

edit:

@dang - how about changing the link to:

[https://blogs.loc.gov/loc/2017/12/update-on-the-twitter-arch...](https://blogs.loc.gov/loc/2017/12/update-on-the-twitter-archive-at-the-library-of-congress-2/)

or

[https://blogs.loc.gov/loc/files/2017/12/2017dec_twitter_whit...](https://blogs.loc.gov/loc/files/2017/12/2017dec_twitter_white-paper.pdf)

~~~
tw1010
Misleading, maybe, but I don't think many were confused about the actual
meaning of the headline (I certainly wasn't).

~~~
fermienrico
So you actually read it as "The library of congress quits archiving Twitter"?
Hmm...

~~~
tw1010
Yes. But I had heard of the news story about them starting to archive twitter.
If I hadn't, maybe I would be more confused. Why would the library of congress
quit twitter, I thought, and why would it even be a worthy enough event to
warrant a news article? If they didn't find value in it, they would just stop
posting. Why would a neutral government agency take something like a political
stance against twitter? If I was wrong, yes, it definitely was newsworthy.
But Occam's razor, and the law of clickbait, led me to the conclusion that
most likely, it means that they've stopped archiving tweets, and not that they
have deleted their account.

