
Internetarchive.bak - edward
http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK
======
colindean
I could see this as a Kickstarter project of sorts.

Sell a device - a Raspberry Pi with an SD card preloaded with "plug and go"
software - that people can buy for $50. Sell alongside it an inexpensive SATA
and/or PATA enclosure into which people can put their old drives. They
basically plug in their old hard drives and forget about it, perhaps writing
off the (at most) ~$25/mo in electricity cost as a donation†.

When they plug it in, the RPi gets an IP and announces its availability via
DNS-SD (Bonjour/Avahi/etc.). The user downloads a tool or just knows to visit
`myarchiver.local` to pull up the control panel. The control panel lets the
user set up IP options, time, etc. as well as control bandwidth usage,
scheduling, etc. They can also see the amount of space used and set up
notifications of some kind (reboot for updates, disk space, etc.).
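
A minimal sketch of the DNS-SD announcement side, assuming the third-party
python-zeroconf library; the service name, address, and TXT properties here
are made up for illustration:

```python
import socket
from zeroconf import ServiceInfo, Zeroconf  # pip install zeroconf

# Advertise the control panel over mDNS/DNS-SD so it shows up to
# Bonjour/Avahi browsers and resolves as myarchiver.local.
info = ServiceInfo(
    "_http._tcp.local.",
    "myarchiver._http._tcp.local.",
    addresses=[socket.inet_aton("192.168.1.50")],  # the Pi's LAN address
    port=80,                                       # control-panel web UI
    properties={"role": "ia-bak-shard"},           # illustrative metadata
    server="myarchiver.local.",
)

zc = Zeroconf()
zc.register_service(info)  # announce until the process exits
try:
    input("Announcing myarchiver.local; press Enter to stop...\n")
finally:
    zc.unregister_service(info)
    zc.close()
```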

This little device just sits in a corner and is effectively a multi-TB shard
of this 20 PB and growing cloud storage system preserving the Internet. Box it
up with some sexy industrial design and some attractive packaging, and it
becomes a conversation piece, just like your Apple Time Capsule, Amazon Echo,
Butterfly Labs Jalapeño, or other newfangled life device, except it's actually
storing more than porn, cat pictures, or magic internet money.

† I don't know how feasible it is, but go with it.

~~~
ris
The Archive Team have something called the Archive Team Warrior
([http://archiveteam.org/index.php?title=ArchiveTeam_Warrior](http://archiveteam.org/index.php?title=ArchiveTeam_Warrior))
which seems quite similar to this except it's a virtual rather than physical
appliance. Worth a look.

(edit: although looking further this is more for distributed crawling, while
the results will be uploaded to their archive)

~~~
sp332
The warriors download items assigned by a central tracker, then upload each
item to a holding area. The warriors don't keep items around when they're
finished. Also, the current tracker keeps everything in RAM, and can't keep up
with more than ~2 million items at once. It would take a major rewrite just to
keep track of all the pieces.

P.S. Please run a warrior instance. It comes in a small VM image that you can
run in the background all the time.

------
kardos
These storage pods ([http://www.45drives.com/products/direct-wired-
standard.php](http://www.45drives.com/products/direct-wired-standard.php))
hold 180TB if you populate them with 4TB drives. 20 PB / 180 TB ~= 111. With
some redundancy, you could round up to 200 pods.

Now you host these in various datacentres around the world. Maybe one rack per
data centre so approx 20 datacentres. This will be manageable, sort of like
managing a small cluster. You'll have plenty of cheap bandwidth and excellent
reliability/availability. Datacentres may be able to donate space or bandwidth
to the project for the positive publicity, etc.

Relying on USB drives hanging off of consumers' laptops connected over wifi is
going to have dismal reliability/availability, poor bandwidth, etc., so to
compensate you'll need a colossal amount of redundancy (5-10x? more?) for it
to even function.

From an end user's point of view, I think my donation of $100 toward the above
storage pod approach would go a lot further than me buying a $100 USB drive
and managing a low quality bandwidth capped contribution.

------
zrail
The work that the Internet Archive does is amazing and vital. If you've ever
used the Wayback Machine[1] and thought "huh, that's neat" you should consider
donating to their cause[2]. I just set up a donation through my payroll
provider (ZenPayroll) that gives them a small chunk of every paycheck.

[1]: [https://archive.org/web/](https://archive.org/web/)

[2]: [https://archive.org/donate/](https://archive.org/donate/)

~~~
josu
I've seen the Wayback Machine used in court cases; I was pretty amazed the
first time I saw it cited that way.

~~~
rancur
They should monetize this as a resource, since there's money available
there.

~~~
sp332
How would you monetize a resource that's more valuable when it's accessible to
everyone?

------
Animats
There's already a backup of the Internet Archive, in Alexandria, Egypt.[1] It
hasn't been updated since 2007, though. The Archive originally wanted to have
duplicates in several locations, but Egypt was the only one that came online.
Here's YCombinator's original web site from 2005, from the copy of the
Internet Archive at the Bibliotheca Alexandrina.[2]

A system, OpenArchive, was set up to replicate the Archive, but nobody else
signed up.

[1]
[http://www.bibalex.org/internetarchive/ia_en.aspx](http://www.bibalex.org/internetarchive/ia_en.aspx)
[2]
[http://web.archive.bibalex.org/web/20050401073947/http://yco...](http://web.archive.bibalex.org/web/20050401073947/http://ycombinator.com/)

~~~
mmastrac
Thanks for this link. I've been digging for my old homepage data from '97 or
so that started returning blanks on archive.org. I found a viable copy here!

~~~
walterbell
If Alexandria was a backup of IA, shouldn't everything in Alexandria already
be available at archive.org? Otherwise, that implies some pages were deleted
or are no longer visible at archive.org.

~~~
gwern
IA hard drives fail, and as far as I know, they don't use anything fancy like
FEC or distributed filesystems to heal from hard drives dying.

~~~
lucb1e
They don't keep data twice? So the archive is literally just slowly rotting
away, all the time?

Okay, this just convinced me that Internetarchive.bak is a good idea. I'm
going to participate.

~~~
hbromley
Internet Archive engineer here. We do maintain two copies of everything in our
main data collection. Disks fail at an average rate of perhaps 5-6 per day,
and the content does get restored from the other copy.

(That said, originally the Wayback crawl content was stored separately from
paired storage, but it was merged into paired storage a few years ago.)

We would prefer greater redundancy, and may shift to a more sophisticated
arrangement at some point, but as a non-profit with limited resources, we're
constantly managing tradeoffs between what we should do and what we can afford
to do.

~~~
walterbell
Thanks for the reassuring clarification. Two copies is much better than one
copy :) It would be useful to know the maximum time window for restoration
from the backup copy.

~~~
hbromley
Depending on what time of day it fails, the bad disk is generally replaced
within 2-18 hours. Copying the data onto the new disk, from the remaining half
of the original pair, typically takes another day or so.

But the material found on the failed disk usually remains available throughout
the replacement process because both original copies are on live servers. Read
requests are ordinarily divided more or less evenly between the two. If one
copy dies, requests are simply all directed to the other copy until the lost
one is replaced.

There is no ability to "write" (make changes) to any affected items during the
replacement process, but they usually remain readable throughout.
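
A toy illustration (not the Internet Archive's actual code) of the read path
described above, where `server.read` is a hypothetical call: reads normally
spread across the two copies, and if one copy's disk has died, the request
falls through to the survivor:

```python
def read_item(item: str, copies: list) -> bytes:
    """copies: the two live servers holding this item's paired copies."""
    last_err = None
    for server in copies:             # try each paired copy in turn
        try:
            return server.read(item)  # hypothetical storage-server call
        except IOError as err:        # this copy's disk died; use the other
            last_err = err
    raise IOError(f"both copies of {item!r} unavailable") from last_err
```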

------
_prometheus
Hey everyone! We designed IPFS ([http://ipfs.io](http://ipfs.io)) specifically
for this kind of purpose.

archiveteam: it would be great to coordinate with you and see where we can
help your effort! closure/joeyh pointed me to his proposal for you -- I'll
write one up as well so you can check out what is relevant for you. Also, I
recommend looking into proofs-of-storage/retrievability.

Also, if archivists (the Archive.org team, or archiveteam -- they're
different) see this:

-----------

Same note in rendered markdown, served over IPFS ;) at:

[http://gateway.ipfs.io/ipfs/QmSrCRJmzE4zE1nAfWPbzVfanKQNBhp7...](http://gateway.ipfs.io/ipfs/QmSrCRJmzE4zE1nAfWPbzVfanKQNBhp7ZWmMnEdbiLvYNh/mdown#/ipfs/QmWagGRBkAwAQBjy3nzQbUPvZjDTCrtkuJz7NgNJYjbJ24/hello-
archive.md)

-----------

Dear Archivists, First off, thank you very much for the hard work you do. We
all owe you big time.

I'm the author of IPFS -- I designed IPFS with the archive in mind.[1]

Our tech is very close to ready. You can read about the details here:
[http://static.benet.ai/t/ipfs.pdf](http://static.benet.ai/t/ipfs.pdf), or
watch the old talk here:
[https://www.youtube.com/watch?v=Fa4pckodM9g](https://www.youtube.com/watch?v=Fa4pckodM9g).
(I will be doing another, updated tech dive into the protocol + details soon.)

You can loosely think of ipfs as git + bittorrent + dht + web.

I've been trying to get in touch with you about this -- I've been to a Friday
lunch (Virgil Griffith brought me months ago) and recently reached out to
Brewster. I think you'll find that IPFS will very neatly plug into your
architecture, and does a ton of heavy lifting for versioning and replicating
all the data you have. Moreover, it allows people around the world to help
replicate the archive.

It's not perfect yet -- keep in mind there was no code a few months ago -- but
today we're at the point of streaming video reliably and with no noticeable
lag -- which is enough performance to start replicating the archive.

We're at a point where figuring out your exact constraints -- as they would
look with IPFS -- would help us build what you need. We care deeply about
this, so we want to help.

Cheers,

-Juan

[1] see also the end of
[https://www.youtube.com/watch?v=skMTdSEaCtA](https://www.youtube.com/watch?v=skMTdSEaCtA)

~~~
zimbatm
I tried ipfs; it's impressive. The only challenge is to rewrite paths to be
relative instead of absolute, as ipfs has its own "/ipfs/$hash" prefix.
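
A rough sketch of that rewrite, under the assumption that a regex over
href/src attributes is good enough for simple pages (a real tool should use
an HTML parser):

```python
import re

def relativize(html: str, depth: int) -> str:
    """Rewrite href="/x" and src="/x" relative to a page `depth` levels deep,
    so the tree works under an /ipfs/$hash/ prefix."""
    prefix = "../" * depth or "./"
    return re.sub(
        r'(href|src)="/([^/"][^"]*)"',  # skips protocol-relative "//..." URLs
        lambda m: f'{m.group(1)}="{prefix}{m.group(2)}"',
        html,
    )

print(relativize('<a href="/about.html">about</a>', depth=1))
# -> <a href="../about.html">about</a>
```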

------
seanp2k2
This sounds a lot like torrents, and I would think that it'd be less work to
solve the problem of newer torrent snapshots (or to put all the new content
into new "chunks" and thus new torrents) than to re-invent the entire thing.

~~~
sp332
The tracker would have to manage millions of torrents, and each client would
have to figure out which subset of torrents it wanted to download. And there
needs to be a mechanism for the server to force a client to recompute hashes
from time to time. So I don't see how it's very torrent-like.

~~~
e12e
If one uses one torrent per chunk (500 GB), that's just 42K torrents. As
mentioned here[1]:

"Update: Here’s a copy of 17 million torrents from Bitsnoop.com, pretty much
the same format but nicely categorized. It’s only 535 MB."

So that's 17M [magnet]s, which means the archive could grow by some orders of
magnitude from its current need of 42K [magnet]s, and still doling out a
subset to clients seems quite possible to manage.

No need for trackers anymore; [magnet]s work fine, and are regularly used to
distribute ~1GB torrents (e.g. HD TV episodes). Whatever one might think of
distributing unlicensed copies of media, they show that distributing vast
amounts of data is quite easy with technology that is readily available and
well tested.

[1] [https://torrentfreak.com/download-a-copy-of-the-pirate-
bay-i...](https://torrentfreak.com/download-a-copy-of-the-pirate-bay-its-
only-90-mb-120209/)

~~~
TheLoneWolfling
What we need now is a torrent client that can prioritize. In other words:
here's this absurd number of torrent chunks, here's the space I have, now what
are the best chunks to archive for the health of the swarm?

~~~
renata
Most torrent clients already prefer rare chunks, assuming the user doesn't
turn on "Prefer chunks in streaming order" or whatever similar option the
client has.

~~~
TheLoneWolfling
I don't know of a torrent client that will let me say "download at least
these files of the torrent, plus the 100 MB of anything else that's most
likely to help the swarm".

Downloading the rarest chunks first is different from downloading only rare
chunks - although it's good that some of the infrastructure is already in
place.
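
A sketch of the selection policy being asked for here, with made-up replica
counts: given how many peers already hold each chunk, greedily pin the
least-replicated chunks that fit in a space budget:

```python
def pick_chunks(replicas: dict, sizes: dict, budget: int) -> list:
    """Choose the rarest chunks (fewest known copies) fitting in `budget` bytes."""
    chosen, used = [], 0
    for chunk in sorted(replicas, key=replicas.get):  # rarest first
        if used + sizes[chunk] <= budget:
            chosen.append(chunk)
            used += sizes[chunk]
    return chosen

replicas = {"a": 12, "b": 2, "c": 1, "d": 7}       # peers holding each chunk
sizes = {c: 500 * 10**6 for c in replicas}         # 500 MB chunks
print(pick_chunks(replicas, sizes, budget=10**9))  # -> ['c', 'b']
```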

------
bokchoi
Joey's design proposal to use git-annex:

[http://git-annex.branchable.com/design/iabackup/](http://git-
annex.branchable.com/design/iabackup/)

------
willejs
I am wondering why they don't ask large cloud providers to donate storage. If
Amazon, Microsoft, Google, Rackspace, Joyent, etc. all gave them 10 PB worth
of storage, they could just build a facade over those providers' APIs, store
all the data there, and keep one copy of the data plus a database of files.
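
A minimal sketch of such a facade, with stand-in provider adapters rather
than real SDK calls: one put/get interface over several donated back ends,
plus a catalog (the "database of files") recording where each key lives:

```python
import zlib
from abc import ABC, abstractmethod

class BlobStore(ABC):
    """Interface each donated back end (S3, GCS, Azure, ...) gets wrapped in."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(BlobStore):
    """Stand-in adapter; a real one would call the provider's SDK."""
    def __init__(self):
        self._blobs = {}
    def put(self, key, data):
        self._blobs[key] = data
    def get(self, key):
        return self._blobs[key]

class ArchiveFacade:
    def __init__(self, providers):
        self.providers = providers
        self.catalog = {}  # key -> index of the provider holding it

    def put(self, key: str, data: bytes) -> None:
        i = zlib.crc32(key.encode()) % len(self.providers)  # naive placement
        self.providers[i].put(key, data)
        self.catalog[key] = i

    def get(self, key: str) -> bytes:
        return self.providers[self.catalog[key]].get(key)

facade = ArchiveFacade([InMemoryStore(), InMemoryStore()])
facade.put("item/0001.warc.gz", b"...")
assert facade.get("item/0001.warc.gz") == b"..."
```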

A distributed or p2p model is a great idea, but it seems very difficult to
achieve.

~~~
jedberg
You know, Amazon alone could provide this. The retail cost of 20PB of Glacier
storage is $222,100.96/mo. Assuming they have a margin of 50%, that's $110K a
month in actual cost to them, so basically their "donation" would be
$1.2MM/year, which is about 0.001% of their revenue ($90B). Even if their
margin is 0, that's still only about 0.003% of their revenue.

------
revelation
Their concerns are seemingly exactly why we invented hash functions, which
have managed to keep BitTorrent swarms largely free of bad actors even when
there is considerable monetary interest in their existence.

~~~
sp332
Plain hashes are vulnerable to all sorts of attacks. Here's a paper on proof
of retrievability
[https://cseweb.ucsd.edu/~hovav/dist/verstore.pdf](https://cseweb.ucsd.edu/~hovav/dist/verstore.pdf)
and one on non-outsourceable storage
[http://cs.umd.edu/~amiller/nonoutsourceable.pdf](http://cs.umd.edu/~amiller/nonoutsourceable.pdf)

~~~
quoiquoi
Can you give the tl;dr? What are the possible attacks?

~~~
sp332
Just for example, you could download data, compute the hash, then delete the
data. You could pretend to have stored lots of data this way. If someone asks
for the hash to prove that you have the data, just give them the hashes you
computed earlier. That's what the first paper prevents.

Another one is to pretend that you have lots of chunks, and when someone
challenges you to produce one, you quickly download it from another source and
present it as if you had it all along. That's what the second paper prevents.

------
jff
At a first pass, I'd try to implement such a thing with a Distributed Hash
Table-like situation. Generate chunks in some fashion (either one chunk per
file, or glob files together up to N megabytes), take a hash of the chunk, and
then let people fetch it. All the Internet Archive needs to do is then keep
redundant copies of the hashes.
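
A rough sketch of the chunk-and-hash step, with an illustrative chunk size;
the archive would then only need to retain the resulting manifest, not extra
copies of the data:

```python
import hashlib

CHUNK_SIZE = 500 * 1024 * 1024  # illustrative: ~500 MB per chunk

def chunk_hashes(path: str):
    """Yield (index, sha256-hex) for each fixed-size chunk of a file."""
    with open(path, "rb") as f:
        i = 0
        while block := f.read(CHUNK_SIZE):
            yield i, hashlib.sha256(block).hexdigest()
            i += 1

# manifest = dict(chunk_hashes("/data/shard-0001.tar"))  # hypothetical path
```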

------
pain
Make it compatible with [http://Archive.today](http://Archive.today) and
[http://Webcitation.org](http://Webcitation.org) please?

------
r-u-serious
You can donate here:
[https://archive.org/donate/](https://archive.org/donate/)

------
myfonj
The idea of shared/distributed backup leads me to a vision of web browsers
actively archiving and sharing public data. No more Slashdot effect: the more
clients hit some public resource, the more caches are filled and seeded in
case the original resource goes down.

~~~
IgorPartola
Latency will kill it. I don't foresee symmetric connections for residential
networks any time soon, so when you hit up Wikipedia.org, you will be
downloading it from your neighbor's 128 Kbps uplink. Discovering that this
neighbor has a copy of the article you want is even more difficult. On top of
that, what if you are signed into Wikipedia? Are you now sending your session
ID to your neighbor? And what about HTTPS? Can your neighbor now send you
resources as Wikipedia.org?

The idea of doing HTTP over torrent comes up periodically, and nobody yet has
answers to these questions. Besides, online access is not the problem: there
is plenty of bandwidth, and CDNs already solve the data locality problem.

~~~
walterbell
TWC's cheapest plan has 1 Mbps upload; the next tier is 5 Mbps upload:
[https://www.timewarnercable.com/en/plans-
packages/internet/i...](https://www.timewarnercable.com/en/plans-
packages/internet/internet-service-plans.html)

~~~
IgorPartola
I think you are making my point. How does 1 Mbps or 5 Mbps compare to say, a 1
Gbps connection you can have with a server colocated or rented from a hosting
company? Moreover, what's the benefit?

~~~
walterbell
1Mbs is eight times faster than 128Kbps, your original characterization of
residential uplink speed. Comparing 1Gbps and 1Mbps is more about bandwidth
than latency, since the physical distance between server nodes will on average
be larger than the physical distance between local neighbors.

One benefit is ownership decentralization of the serving endpoint, which helps
in some scenarios (more difficult to censor) and hurts in other scenarios
(more difficult to sue).

~~~
dragonwriter
> 1Mbs is eight times faster than 128Kbps

Actually, 8 times _as fast_ , or 7 times _faster_. Unless 128Kbps is 1 times
faster than 128Kbps.

~~~
walterbell
Thanks :)

------
subleq
It seems like Tahoe-LAFS would be a great way to let anonymous people donate
storage while maintaining data integrity.

------
edward
There is now a channel, #internetarchive.bak, on EFNet, for discussion about
the implementations and testing.

------
infruset
This reminds me of the concept of Permacoin
([https://www.cs.umd.edu/~elaine/docs/permacoin.pdf](https://www.cs.umd.edu/~elaine/docs/permacoin.pdf)).
Maybe they should give it a thought? The idea is to create a financial
incentive to store chunks of a huge dataset by using proof-of-retrievability
algorithms to mine a cryptocurrency.

------
notatoad
I would like to be a part of this - I have a file server that runs 24/7
already, and a spare terabyte (or more) of hard drive space. However, I do not
have a spare terabyte of bandwidth to use downloading a chunk of the internet
archive.

Hopefully whatever backup plan they come up with includes a way for me to mail
them a spare hard drive and get a drive full of internet archives in return.

~~~
shabble
Are you limited by a total bandwidth cap, or by transfer rate? There's no
reason a suitably intelligent resumption system couldn't allow that TB to be
downloaded over the course of several months if necessary. You'd be a
lower-priority peer, but still contributing.

The manpower and logistic hurdles of shipping disks around make it seem
unlikely that would ever be an option, but perhaps something more local and
sneaker-net could work, where they partner with nearby universities or
libraries with decent connections that allow people to bring their
machines/disk arrays in for that specific use.

------
frik
There is an official mirror of the Wayback Archive.org located in the
Bibliotheca Alexandrina:
[http://en.wikipedia.org/wiki/Bibliotheca_Alexandrina](http://en.wikipedia.org/wiki/Bibliotheca_Alexandrina)

I dreamed last night that an _internet dotCom giant_ donated money / hosted a
mirror of Archive.org. It would definitely be a novel step.

------
slagfart
The challenges have been solved already. The article talks about splitting it
into 42k 500GB chunks. BitTorrent Sync will happily deal with 500GB, and open
trackers are available that can report the count of users in a swarm.

To complete it, just build a web front-end that hands out the least-populous
BTSync key from the pool. People can then paste it into their client and
contribute.

~~~
Mithrandir
BT Sync was considered for a related Archive Team project [1], but was ruled
out as it is (currently) proprietary.

1: [http://www.archiveteam.org/index.php?title=Valhalla#Non-
opti...](http://www.archiveteam.org/index.php?title=Valhalla#Non-options)

~~~
slagfart
OK - but is that the only objection? Would an open version satisfy all the
requirements? Syncthing.net, or even plain old BitTorrent, should do
(especially if it's one static file per 500GB).

------
makmanalp
This is a neat idea. I have some random thoughts about doing this in a
distributed fashion:

...

In the wiki there's a mention of how the GeoCities torrent was 900GB and very
few people seeded it as a result. There should be a way and an incentive for
clients to seed parts of files without having the whole thing. As long as
there are enough copies of each chunk on the network, it's fine - we don't
care whether any one client has the whole thing.

...

Another cool thing is thinking about health metrics - you don't just care
about how many people have a file, you care about recoverability too. This
seems a bit similar to AWS Glacier - you have to take into account how long
it's going to take to retrieve data. You could have stats on how often a
client responded and completed a transfer request, what its track record is,
and assign a probability and timeframe for correctly retrieving a file.

...

One thing that comes to mind is that whatever the solution ends up being, it
should allow partial recovery. The example that comes to mind is something
like a zip file or encrypted image that is completely ruined when a chunk is
missing. So maybe it makes sense to have the smallest unit of information be
large enough to still be intelligible on its own?

At first I was thinking of a model where chunks get randomly distributed and
replicated, but then that made me wonder whether that's a bad idea.

...

And then what of malicious clients? I guess it's not hard to have a hash-based
system and then just verify the contents of a file upon receiving it. But can
I effectively destroy a file by controlling all the clients that are supposed
to be keeping it up? How could you guard against that? Could you send a
challenge to a client to send you the hash of a random byte sequence in a file
to prove that they do indeed have it? What guarantees that they will send it
even if they do?

...

And then what about search / indexing?! P2P search is an awesome problem
domain with so many problems that my mind is buzzing right now. Do you
just index predetermined metadata? Do you allow live searches that
propagate through the system? Old systems like Kademlia had issues with
spammers falsifying search results to make you download viruses and such -
how do you guard against that? Searching by hash is safe, but by name / data
is not. Etc etc.

I wish this was my job! :)

~~~
darkmighty
This seems to have a lot of overlap with people coming up with distributed
storage systems, perhaps Maidsafe [1] has solved some of those problems? (I'm
not very familiar with it)

[1]
[http://en.wikipedia.org/wiki/MaidSafe](http://en.wikipedia.org/wiki/MaidSafe)

------
phacops
They should really look at using permacoin to store the archive.

Permacoin is a cryptocurrency that uses proof of storage instead of Bitcoin's
proof of work:

[https://www.cs.umd.edu/~elaine/docs/permacoin.pdf](https://www.cs.umd.edu/~elaine/docs/permacoin.pdf)

~~~
slagfart
You've got a trusted central authority here (the Archive itself), so why deal
with some coin guff if you don't have to?

~~~
phacops
Because it provides incentive for people to help store the archive, and it
solves the problem in their article about defending against bad actors.

~~~
sfeng
But all you need to defend against bad actors is for archive.org to hang on
to a SHA-256 or SHA-512 hash for every chunk. Much simpler than a distributed
blockchain.

~~~
phacops
It’s a little more complicated than that. I could temporarily take a chunk,
compute the hash, then throw it away and report back that I am happily storing
the data, even though I’m not.

Anyhow, read the permacoin paper. It’s pretty cool, and it needs a large
petabyte data seed to secure the network. Seems like a win-win to me.

~~~
roganartu
The centralisation helps again here, though.

The central server has access to the entire file, and hence can compute the
hash of any arbitrary chunk. Challenges/verifications don't have to happen all
that often ([1] indicates they are looking at once a month), so creating a
unique challenge for each user shouldn't be too compute-intensive.

For each known user:

- The central server chooses a random chunk of each file it wishes to verify
the user still has. This could be any length from the size of the hash
function's output up to the whole file, and could start at any offset.

- The client is asked to provide the hash of that chunk, computed from the
file stored locally.

Precomputing hashes for all possible chunk permutations would take up
substantially more space than simply storing the file in the first place. A
bad actor would need to store the hash for every possible chunk length
starting from every possible start location in the file, which is on the
order of O(n^2) where n is the stored file size (500GB in this case). For
reference, that would be about one third of the entire 20PB archive for a
single 500GB chunk if using a 256-bit hash function.

[1] [http://git-annex.branchable.com/design/iabackup/](http://git-
annex.branchable.com/design/iabackup/)
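
A minimal sketch of that challenge/response, with illustrative names: the
server, which holds the full file, picks a random (offset, length) window and
asks the client to hash exactly that span of its stored copy:

```python
import hashlib
import secrets

def make_challenge(file_size: int, min_len: int = 32):
    """Pick a random window; unpredictable, so hashes can't be precomputed."""
    offset = secrets.randbelow(file_size - min_len)
    length = min_len + secrets.randbelow(file_size - offset - min_len + 1)
    return offset, length

def respond(path: str, offset: int, length: int) -> str:
    """Hash exactly the challenged span of the locally stored chunk."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(length)).hexdigest()

# Server side: compute the expected digest from its own copy and compare, e.g.
#   offset, length = make_challenge(file_size)
#   ok = respond("client_chunk.bin", offset, length) == expected_digest
```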

~~~
darkmighty
It's a good thing that in this case they also don't have to worry too much
about bad actors, since it's a fundamentally altruistic endeavor. Worst-case
scenarios:

1) Someone keeps downloading the full archive and throwing it away;

2) Someone wants a file erased, so they hold it with the intention of denying
access at some point in the future.

--

1) Bandwidth has costs on both sides; just balancing upload among receivers
would probably suffice for this not to be a problem.

2) Assigning large random chunks to downloaders should prevent this "chosen-
block attack"; add in some global redundancy for good measure and that's
probably enough (although I still wouldn't trust this 100% as a _primary_
storage, only as insurance storage).

------
dmd
It would cost about US$2.5 million/year to back it up using Amazon Glacier.

~~~
LeoPanthera
With that much data, it would be cheaper to build your own Glacier-like
service.

~~~
lunixbochs
About $1m to build 21 PB of live storage (not tape), and $50k/year to power
it (not factoring in maintenance, rack space, and cooling).

A service offering live storage at this (petabyte) scale [1] costs
22,020,096 GB * 3¢/GB * 12 mo = $7,927,234.56 a year. You could build six
fully-mirrored copies for about that much, though managing 360 4U machines
(30x 48U full-height racks' worth) is a business on its own (see Backblaze's
photos), and you need to deal with drive replacement after a point.

[1]
[http://www.rsync.net/products/petabyte.html](http://www.rsync.net/products/petabyte.html)
(If you actually sign up for this, be sure to send me that _$24,000_ referral
bonus o_o)

 _Math:_

-- _Machine Cost_ --

$300 per 8TB drive * 45 drives = $13,500 (drive cost) + $3,387.28 (pod cost)
= $16,887.28 (unit cost); / 360 TB = $46.91 per TB

$46.91/TB * 21,504 TB (21 PB * 1024) = $1,008,752.64 for 21 PB (no
redundancy)

-- _Power Cost_ --

Let's say the motherboards/controllers draw around 100W continuous and each
drive pulls 11W (from my post elsewhere in the comments). This is 2,688
drives, which fit into 60x 45-drive 4U BackBlaze storage pods.

2,688 * 11W = 29,568 W (for the drives) + 6,000 W (for the computers) = 35.568
kW to run a non-redundant 21 PB array.

You'll need at least two network switches and moderate cooling infrastructure
as well. The heat from a continuous 35.5 kW draw takes nontrivial effort to
disperse, but let's assume that's somehow free for the sake of simplicity
here.

35.568 kW * 24 * 365 = 311.57568 MWh (in a year)

I'm sure at this point you might be able to work out a bulk deal of some kind,
but at $0.16/kWh you're talking a pretty low cost of $49,852.11 in power every
year.

----

Building arrays of nonredundant live drives would be significantly cheaper
over the long term than Amazon Glacier (which is almost certainly cold
storage, not live) at this scale.
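
The arithmetic above, re-run as a small cost model; every figure is the one
quoted in this comment, not independently sourced:

```python
# Build cost, per the numbers above.
DRIVE_COST, DRIVES_PER_POD, DRIVE_TB, POD_COST = 300, 45, 8, 3387.28

pod_total = DRIVE_COST * DRIVES_PER_POD + POD_COST  # $16,887.28 per pod
per_tb = pod_total / (DRIVES_PER_POD * DRIVE_TB)    # $46.91 per TB

tb_needed = 21 * 1024                               # 21 PB in TB
build_cost = per_tb * tb_needed                     # ~$1.01M, no redundancy

# Power cost: 11 W per drive, ~100 W per pod, $0.16/kWh.
drives = tb_needed // DRIVE_TB                      # 2,688 drives
pods = -(-drives // DRIVES_PER_POD)                 # 60 pods (ceiling division)
power_kw = (drives * 11 + pods * 100) / 1000        # 35.568 kW continuous
power_cost = power_kw * 24 * 365 * 0.16             # ~$49,852 per year

print(f"build: ${build_cost:,.2f}  power: ${power_cost:,.2f}/yr")
```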

~~~
ghaff
>Amazon Glacier (which is tape-based!)

Citation? Not saying you're wrong but I've seen supposedly "in the know"
commentary in favor of both the tape-based and some sort of cold disk-based
storage. I've been curious and also a bit surprised that the true answer has
never really hit publicly although I'm sure it's well-known within some
communities.

~~~
lunixbochs
Oh, you're right. I haven't thought about this for a while, and I remember my
previous assumption about tape storage being challenged. I'll replace that
part with "cold storage".

After refreshing my memory: Amazon outright denied they're using tape backup,
some third parties pointed out that cold spinning disks wouldn't be very
economical, and a few people theorized it's actually BDXL.

~~~
ghaff
Cold storage is probably a safe bet. I've seen the assertion that it's tape
from someone who I wouldn't have thought would get it wrong, but I'd now bet
against it, whether or not Amazon's denial seals the deal.

Robin Harris has argued in favor of BDXL but he's speculated about a number of
things. [http://storagemojo.com/2014/04/25/amazons-glacier-secret-
bdx...](http://storagemojo.com/2014/04/25/amazons-glacier-secret-bdxl/)

My money would be on the simplest: cold disk. Some support here:
[http://mjtsai.com/blog/2014/04/27/how-is-amazon-glacier-
impl...](http://mjtsai.com/blog/2014/04/27/how-is-amazon-glacier-implemented/)
but it's all speculative and anecdotal. I don't have a problem with the
thought that Amazon is getting out ahead of the drive economics, especially
given that I'm not sure how widely used Glacier is today.

I do find it interesting that this has been apparently kept under wraps from
anyone who would share more widely.

~~~
res0nat0r
Someone who worked at Amazon around the time Glacier came out commented a
while back that it is some type of extremely low-power hard drive that backs
this system, which makes sense given the 1-4 hour delay in retrieval time.
The racks are idle and only come online once every 4 hours or so to batch
retrieve/store their data, then power down. This small power usage is
probably what lets them profit on this very slim margin.

------
ryan-allen
I'd gladly host 3TB of data; I hope this becomes a thing.

------
lazylizard
store them on a gazillion 360yunpan+weiyun+baiduyun (or justcloud) accounts!

------
moe
I never understood this desire to "back up" the internet.

It's an impossible endeavour and will only become more incomplete as the net
grows. And 99.999% of the data "backed up" will never be looked at.

What's the point?

~~~
ghuntley
Well, I personally spent the better part of tonight recovering a bunch of old
websites of mine from the IA using wget.

IA is one of the most important initiatives/projects on the internet. Links
are not meant to change, but they do, and when that happens knowledge &
history erode.

Remember the 2004 Indian Ocean earthquake and tsunami?

It was the first major event that gave mainstream credibility to the
blogosphere; before this disaster, consumers turned to news publications for
information. For the first time ever, news publications did not have the
information and were desperate for content, any content. TV crews flew into
the affected areas and started purchasing suitcases/video/still cameras sight
unseen off shell-shocked tourists who were too traumatized to remember if
they had captured anything.

It was also the last major disaster that occurred before the social
revolution - YouTube, Flickr, Facebook, smartphones, and mainstream adoption
of broadband. Back then, the main ways for the general public to share
information were email (which had limits of ~10 MB) and, if they were
technical or knew someone technical, the 100 MB of free hosting provided by
their ISP.

Now imagine you're a student, academic researcher, or production assistant,
and you have been tasked with researching what transpired. As of right now,
the majority of content from 2004/the disaster is no longer available on the
internet and is only accessible via the IA.

(nb: I was the guy behind
[http://web.archive.org/web/20050711081524/http://www.waveofd...](http://web.archive.org/web/20050711081524/http://www.waveofdestruction.org/),
which was the first emergency crisis live/micro blog. What Google now has a
team of volunteers doing during a crisis, I did myself with a single Pentium
4 CPU, 4GB of RAM, a single IDE hard drive, a ramdrive, a 100Mbit uplink, and
a week without sleep. WOD will be relaunching within the next couple of weeks
and was one of the sites recovered from the IA.)

~~~
moe
_knowledge & history erodes_

There surely is history worth preserving. It's just this "compulsive
hoarding"-approach that seems ultimately futile to me.

The internet itself inherently preserves any knowledge that at least one
person cares enough about to keep on a webserver. Pretty much everything of
even minor relevance should be referenced by Wikipedia by now.

IMHO the internet itself is the "Internet Archive".

 _Remember the 2004 Indian Ocean earthquake and tsunami?_

To be honest, no. But Wikipedia seems to have that event pretty well covered:
[http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_an...](http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_and_tsunami)

~~~
shabble
> _To be honest, no. But Wikipedia seems to have that event pretty well
> covered_

And in a lot of cases, the actual citations or references used by Wikipedia
are dead links, or have some sort of 'last retrieved at' annotation, or end up
at parked/squatted domains and the original content is no longer there to
validate the claims it supposedly made.

Offhand I don't know if Wikipedia has a policy on references to Internet
Archive / wayback-machine links to things, but IMO being able to point to a
snapshot of a reference at the time it was made, as well as the current
version if it still exists, is a desirable feature and well worth striving
for.

Internet pagerot/linkrot has caused me grief with bookmarks for things I
didn't realise I needed again until years later when they were gone.

~~~
Mithrandir
> Offhand I don't know if Wikipedia has a policy on references to Internet
> Archive / wayback-machine links to things

They do have a policy.[1] Additionally, IA has been specifically crawling
links on Wikipedia to preserve the citations.[2][3]

1:
[https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine](https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine)

2: [https://blog.archive.org/2013/10/25/fixing-broken-
links/](https://blog.archive.org/2013/10/25/fixing-broken-links/)

3:
[https://archive.org/details/NO404-WKP/v2](https://archive.org/details/NO404-WKP/v2)

------
rawnlq
Shouldn't there already be a lot of existing tech for doing crowd backup/sync?
Quick googling yields at least a few companies which allow you to donate
unused hard drive space in exchange for benefits:

Symform: [http://www.symform.com/](http://www.symform.com/)

> Earn free cloud storage when you allocate unused space on your device for
> your Symform account. Users that contribute get 1GB free for every 2GB
> contributed. Contribution can always be fine tuned for your personal needs.

Wuala: [http://www.techrepublic.com/blog/windows-and-
office/review-w...](http://www.techrepublic.com/blog/windows-and-
office/review-wuala-cloud-storage/)

> Extra space is available: Unlike the majority of online file storage
> services like Dropbox and Box.com, which offer paid plans only as a means to
> expand storage, Wuala offers you the option to gain additional online
> storage in exchange for some of your unused local disk space to commit to
> the network.

~~~
vertex-four
> Shouldn't there already be a lot of existing tech for doing crowd
> backup/sync?

There's bits and pieces, but essentially nothing that's open-source and
community-supported.

~~~
rawnlq
Just pointing out that P2P backup is a well-understood problem (in industry
and academia), and they should probably partner with people who do this stuff
seriously, or at least base their design on existing solutions. For example,
their implementation page right now is trying to do this with git-annex:
[http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/g...](http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/git-
annex_implementation), which doesn't feel quite right...

~~~
db48x
One of the more interesting design requirements here is that the system should
work even when the distributed content is _not_ encrypted. If I join this
backup effort, and a meteor takes out all of SF (and thus the Internet
Archive), and their off-site backups fail, and the aftermath is preventing me
from accessing the internet, then that 1TB of data that I had downloaded is
all still usable to me. It's not some giant encrypted blob, it's a bunch of
zip files (or similar) containing images, sound, video, web pages, and other
types of files that I can, in principle, already use. (Maybe I can't do
anything much with that Apple II disk image, but I can at least identify it.)

git-annex and BitTorrent would both work with that requirement; most of the
commercial, proprietary solutions cannot, because their clients wouldn't use
them if their data wasn't private.

