
The Backing Up of the Internet Archive Continues - bane
http://ascii.textfiles.com/archives/4636
======
textfiles
Jason Scott here. Just wanted to address the questions that always come up
when this project gets some attention. (Also: Come volunteer to be a client!
The more the merrier.)

* We are only backing up public-facing data. (Roughly 12 PB.)
* We are only backing up curated sets of data. (So less than that.)
* We are stepping carefully to learn more about the whole process as we go, documenting, etc.
* The hope is this will produce some real-world lessons and code that other sites can use.
* This project uses non-Internet Archive infrastructure, and is not an Internet Archive project.

It's going well, and the more people who join up, the better. Oh, and support
the Internet Archive with a donation - it's a meaningful non-profit making a
real difference in the world.
[http://archive.org/donate](http://archive.org/donate)

~~~
nrao123
Thanks for doing this. Hopefully, this will help reduce the problem of the Web
of Alexandria that Bret Victor talked about:

 _60% of my fav links from 10 yrs ago are 404. I wonder if Library of Congress
expects 60% of their collection to go up in smoke every decade.

---

For someone who's thinking about a library in every desk, going on the web
today might feel like visiting the Library of Alexandria. Things didn't work
out so well with the Library of Alexandria.

It's interesting that life itself chose Bush's approach. Every cell of every
organism has a full copy of the genome. That works pretty well -- DNA gets
damaged, cells die, organisms die, the genome lives on. It's been working
pretty well for about 4 billion years.

We, as a species, are currently putting together a universal repository of
knowledge and ideas, unprecedented in scope and scale. Which information-
handling technology should we model it on? The one that's worked for 4 billion
years and is responsible for our existence? Or the one that's led to the
greatest intellectual tragedies in history? _

[https://twitter.com/worrydream/status/478087637031325697](https://twitter.com/worrydream/status/478087637031325697)

[http://worrydream.com/TheWebOfAlexandria/](http://worrydream.com/TheWebOfAlexandria/)

~~~
ISL
Every cell has a full copy of the current operating plan, not the entire
history of all preceding operating plans. Storing the entire commit history of
our DNA would be much more space intensive.

~~~
ddlatham
How much more?

~~~
ISL
Well, poking around the web, it looks like the average bacterial time-between-
generations is ~0.1-10 hr, and there are ~0.1-100 mutations/generation. So, if
life began 4 billion years ago, I get between 10^11 and 10^16 mutations that
need tracking.

The article states that there are 10^9 bases in the human genome.

~~~
Arelius
While that's a pretty significant range, it seems that we could store the
entire commit-log in the same amount of DNA that 100 cells normally store..
10,000,000 cells on the high-end. Which is still only a few milligrams of
cells. Impratical of course, but interesting.
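
For anyone who wants to check the arithmetic, here's a minimal Python sketch using the same rough figures (the generation times, mutation rates, and bases-per-genome value are the thread's assumptions, not measured data):

```python
# Back-of-the-envelope check of the figures above; all inputs are the
# thread's rough assumptions, not measured values.

HOURS_PER_YEAR = 24 * 365.25
life_span_hours = 4e9 * HOURS_PER_YEAR   # ~4 billion years of life
BASES_PER_GENOME = 1e9                   # order-of-magnitude figure from above

# (generation time in hours, mutations per generation)
scenarios = {"low": (10.0, 0.1), "high": (0.1, 100.0)}

for label, (gen_hours, mut_per_gen) in scenarios.items():
    generations = life_span_hours / gen_hours
    mutations = generations * mut_per_gen
    # How many genomes' (i.e. cells') worth of DNA the commit log would need.
    cells = mutations / BASES_PER_GENOME
    print(f"{label}: ~{mutations:.0e} mutations, ~{cells:.0e} cells' worth of DNA")
```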

------
sneak
This isn't a lot of data in the general scheme of things.

Why doesn't AWS or Google offer to pop it in Glacier/S3 or GCS for free, for
the PR? They both have a huge multiple of that much unused disk - it would
cost them effectively nothing beyond inbound bandwidth.

~~~
tracker1
It kind of is a lot of data... trying to archive internet websites whole and
keeping multiple snapshots is pretty close to what Google and other search
engines do, except search engines also need to generate distributed indexes as
well as some versioning information.

Not to mention that, historically speaking, you cannot trust Google to keep
this information preserved, or public. Look at what happened to any number of
other tools Google once offered. TBH, I would like to see funding via a grant
from the Library of Congress towards archive.org.

I'm curious how many copies of a piece of data are needed for it to be "safe"
in something as flexible and unknown as end-user/volunteer storage. It's one
thing for compute jobs that can be re-queued in a day or two if abandoned...
it's another when every copy of a record happens to walk away. To say nothing
of the communications protocol... this goes way beyond most Bigtable
implementations.
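
One toy way to frame the "how many copies" question (purely illustrative Python; the per-copy dropout probability and loss target are made-up numbers, and the real project's placement logic is surely more involved than independent failures):

```python
# Toy model: if each volunteer-held copy of a shard independently disappears
# within a year with probability p, how many copies keep the chance of losing
# *all* of them below a target? (p and the target are illustrative only.)

import math

def copies_needed(p_dropout: float, p_loss_target: float) -> int:
    """Smallest n such that p_dropout ** n <= p_loss_target."""
    return math.ceil(math.log(p_loss_target) / math.log(p_dropout))

for p in (0.5, 0.3, 0.1):
    print(f"p={p}: {copies_needed(p, 1e-6)} copies for <1e-6 loss probability")
```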

~~~
manigandham
On a purely numerical comparison, the article mentions 27 TB, which really
isn't much in terms of size, especially compared to what some of the companies
using AWS produce daily.

EDIT: according to comments below, it looks like it's about 20+ petabytes,
which is actually a fairly large amount.

~~~
vidarh
The article mentions 27 TB _so far_, at a point where they appear to still be
focused on making their tools better. 27 TB is a tiny proportion of the IA.

~~~
philh
Cite:
[https://en.wikipedia.org/wiki/Internet_Archive](https://en.wikipedia.org/wiki/Internet_Archive)

> As of October 2012, its collection topped 10 petabytes.

I'd be curious what it's grown to since, but Google didn't immediately tell
me.

~~~
Mithrandir
About 21-22 petabytes:
[http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK](http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK)

------
userbinator
One of the comments there brought up an interesting point:

 _Is this pure archival, meaning a one-time download and no uploads?_

Imagine a distributed, peer-to-peer style Internet Archive. That would be
awesome. It reminds me of a decade ago, when the rapid rise of P2P file
sharing (mostly torrents) made it possible to find literally _anything_ that
existed and that someone felt neighbourly enough to share: multimedia,
software, books, anything available in digital form. It was probably the
closest we've ever been to a "global library"; too bad anti-piracy groups,
commercial interests, and a security focus mostly killed it off and replaced
it with locked-down content silos...

~~~
icebraining
Since it uses git-annex, it actually can fetch files in a P2P fashion, using
BitTorrent: [http://git-annex.branchable.com/special_remotes/bittorrent/](http://git-annex.branchable.com/special_remotes/bittorrent/)
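
For the curious, the workflow is roughly this (a minimal sketch driving git-annex from Python; the repository path and magnet link are placeholders, and the bittorrent special remote needs aria2 installed; see the page above for authoritative usage):

```python
# Minimal sketch: have git-annex fetch content over BitTorrent via its
# built-in "bittorrent" special remote (requires git-annex plus aria2).
# The repository path and magnet link are placeholders, not project URLs.

import subprocess

REPO = "/path/to/annex"              # an existing git-annex repository
MAGNET = "magnet:?xt=urn:btih:..."   # placeholder torrent/magnet link

def annex(*args: str) -> None:
    subprocess.run(["git", "annex", *args], cwd=REPO, check=True)

# addurl registers the URL and downloads its content through the remote.
annex("addurl", MAGNET)
```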

------
IanCal
I had a crack at running this, and it was quite interesting. I'm certainly up
for giving some of my spare space to such a worthy cause.

I stopped, however, mostly due to the speed. I have an 80 Mb/s connection but
couldn't pull down more than ~0.5-1 Mb/s. At that rate, even filling 500 GB
would take about two months. Perhaps this was due to the size of the files (I
was downloading music).
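
For reference, the arithmetic behind that estimate, worked through with the throughput figures above (a quick sketch, nothing more):

```python
# Time to fill 500 GB at ~0.5-1 Mb/s (the rough throughput mentioned above).
TARGET_GB = 500
for mbit_per_s in (0.5, 1.0):
    gb_per_day = mbit_per_s / 8 / 1000 * 86400   # Mb/s -> MB/s -> GB/day
    print(f"{mbit_per_s} Mb/s: ~{TARGET_GB / gb_per_day:.0f} days")
```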

For anyone looking to try it out, please do, and see what you get. One thing
to be aware of is that it asked for the amount of space to leave _free_, not
how much to use, and it got the size of my disk wrong by about a factor of 10,
so I had to be a bit cautious about what I selected.

~~~
Filligree
> One thing to be aware of is that it asked for the amount of space to leave
> free, not how much to use, and it got the size of my disk wrong by about a
> factor of 10, so I had to be a bit cautious about what I selected.

Well, that's a deal-breaker for me.

I can easily dedicate, say, a terabyte or two to the project, but I can't do
"Free space minus this number". I need predictability.

~~~
IanCal
Perhaps I worded that poorly; it's asking for the amount of space not to use.
As in, on a 3TB drive, leave me 2TB for my own purposes and it'll then use a
max of 1TB. It's not based on how utilised the drive currently is.
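
In code terms, the semantics would be something like this (illustrative only; not the project's actual implementation):

```python
# Illustrative only: how a "leave this much free" setting translates into a
# usage cap, independent of how full the drive currently is.
def usage_cap_gb(disk_size_gb: float, leave_free_gb: float) -> float:
    return max(0.0, disk_size_gb - leave_free_gb)

print(usage_cap_gb(3000, 2000))   # 3 TB drive, leave 2 TB alone -> use up to 1 TB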

My problem was that it thought the drive was 16TB when in reality it was 2TB.
That's a minor usability problem though, one I've been meaning to go and
submit a PR for.

Edit - As I understand it anyway.

~~~
joeyh
I'd be interested in that bug report!

------
unicornporn
Perhaps approaching this with LOCKSS (Lots Of Copies Keep Stuff Safe)[1] or a
LOCKSS-like system would be a good idea.

[1] [http://www.lockss.org](http://www.lockss.org)

------
jerf
"We’ve intentionally and _unintentionally_ punched clients in the gut"

Oh come now, you can't drop that in a blog post without elaboration. :) Share
your pain!

------
s0me0ne
The Internet Archive will still archive your site even if you use their
recommended robots.txt rules to tell them not to. I had several domains where
I set up that robots.txt immediately, as one of the first things on the site.
While I owned the sites and ran them with the robots.txt in place, Archive.org
would not show any archive of them. However, once the sites went down or
changed domains, their archives appeared on archive.org. So all the robots.txt
does is block you from seeing what is on the site right now, not forever;
you'd have to own the domain indefinitely to keep it from being archived by
them.

------
rtpg
What exactly is the objective here? Just making sure there are multiple copies
of the Internet Archive?

Nifty stuff in any case

~~~
textfiles
There are multiple objectives here - education, research, practice, and
awareness. As time goes on, it's obvious that distributed sharing of data is
the only real vaccine against the inherent problems of the modern Internet,
and learning more about the different ways this might be done benefits
everyone.

Plus it's nice to have some of this data scattered about the world.

~~~
mdaniel
In the sphere of research & practice, are there any long-term plans for
distributed processing applied to the data, akin to the Common Crawl in AWS
allowing one to run map-reduce jobs against it?

There is a long history of sandboxing the JVM, which means in theory it should
be safe to bring the code to the data without running the risk of having your
local machine p0wned.

------
andyjohnson0
I'm curious about how archive.org feels about this. Presumably they are paying
for the outgoing bandwidth.

~~~
zcore
This is being done by Jason Scott and the rest of the team that runs
archive.org. Presumably they're ok with using their own bandwidth!

~~~
textfiles
It's being overseen by Jason Scott (me), but I am not utilizing other
employees of the Internet Archive - they're quite busy enough as it is.

