Internetarchive.bak (archiveteam.org)
290 points by edward on Mar 4, 2015 | 110 comments

I could see this as a Kickstarter project of sorts.

Sell a device - a Raspberry Pi with an SD card preloaded with "plug and go" software - that people can buy for $50. Sell alongside it an inexpensive SATA and/or PATA enclosure into which people can put their old drives. They basically plug in their old hard drives and forget about it, perhaps writing off the ~$25/mo (tops) in electricity cost as a donation†.

When they plug it in, the RPi gets an IP and announces its availability via DNS-SD (Bonjour/Avahi/etc.). The user downloads a tool or just knows to visit `myarchiver.local` to pull up the control panel. The control panel lets the user set up IP options, time, etc. as well as control bandwidth usage, scheduling, etc. They can also see the amount of space used and set up notifications of some kind (reboot for updates, disk space, etc.).

This little device just sits in a corner and is effectively a multi-TB shard of this 20 PB and growing cloud storage system preserving the Internet. Box it up with some sexy industrial design and some attractive packaging, and it becomes a conversation piece, just like your Apple Time Capsule, Amazon Echo, Butterfly Labs Jalapeño, or other newfangled life device, except it's actually storing more than porn, cat pictures, or magic internet money.

† I don't know how feasible it is, but go with it.

The Archive Team have something called the Archive Team Warrior (http://archiveteam.org/index.php?title=ArchiveTeam_Warrior) which seems quite similar to this except it's a virtual rather than physical appliance. Worth a look.

(edit: although looking further this is more for distributed crawling, while the results will be uploaded to their archive)

The warriors download items assigned by a central tracker, then upload each item to a holding area. The warriors don't keep items around when they're finished. Also, the current tracker keeps everything in RAM, and can't keep up with more than ~2 million items at once. It would take a major rewrite just to keep track of all the pieces.

P.S. Please run a warrior instance. It comes in a small VM image that you can run in the background all the time.

You're actually grossly overestimating the electricity for most of the US. Raspberry Pi and similar run for <$10/year

> plug in their old hard drives

I honestly have no idea how inefficient 'old' spinning disk hard drives might be, but I wouldn't be surprised if an older drive consumes more electricity than a modern, power-sipping ARM-based RPi. Maybe GP took that into consideration?

11W is about the highest active draw I've seen on a drive datasheet. As far as I know, the power usage hasn't really changed in a while because it's mostly attributed to the 5400-7200RPM motor. The control board is very specialized and draws effectively no power in comparison.

The Raspberry Pi can run on a 5V / 500mA bus, which is a 2.5W max load. This gives us a theoretical 13.5W max constant draw (which is likely much higher than reality, as we're talking about rated power instead of measured).

13.5W * 24h * 365 = 118260Wh = 118.260kWh max in a year

At 16 cents per kWh (a lowish power tier for PG&E in Silicon Valley), this comes out to $18.92 a year.
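The arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the same back-of-the-envelope estimate: the function name and parameters are made up for illustration, and the wattages and the $0.16/kWh rate are the figures quoted in this comment, not measurements.

```python
def annual_power_cost(watts, dollars_per_kwh, hours_per_year=24 * 365):
    """Return (kWh per year, dollars per year) for a constant load."""
    kwh = watts * hours_per_year / 1000
    return kwh, kwh * dollars_per_kwh

# 11 W drive + 2.5 W Raspberry Pi: rated maximums, not observed draw
kwh, dollars = annual_power_cost(11 + 2.5, 0.16)
print(f"{kwh:.2f} kWh/year, ${dollars:.2f}/year")
```

Swapping in a measured wattage or a local electricity rate gives a more realistic figure.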

These numbers assume a 100% efficient power supply running the unit, but are also talking about theoretical maximum power draw and not observed power. This is also a fraction of the average power draw of any of these you might use: electric oven/range, microwave, fridge, incandescent lights, electric dryer, electric water heater.

As far as observed power: I've seen a fairly efficient Intel motherboard (CPU TDP of 17W) with 8 modern spinning disks drawing around 75W (measured at the wall using a Kill-A-Watt with all drives active) so I'd guess the real-world number is <8W per active drive.

There's really no reason for the hard drive to be powered most of the time, for this use-case. Download the backup (couple of days to a week for 500GB?) -- spin down the drive. It only needs to spin up if anyone wants to read the data. Now if we move away from the idea of "backup" to just: let's make a set of torrents, and try for many copies of each chunk/DHT -- then that's a different story. It'd be interesting to see how efficient one could make such a "massively distributed content dispersion network". Imagine: suddenly the www is distributed, not centralized again!

"Active idle" on a drive is still rated at around 5W. Maybe they don't spin down completely in the idle state? BackBlaze says it's better for the longevity of a drive to leave it running than to power it on and off regularly.

> it's better for the longevity of a drive to leave it running than to power it on and off regularly.

Sure, for the general use-case. But if this was "just" for backup, powering up would probably be quite rare. If it was for content distribution (eg: torrent) with some expected semi-frequent access, leaving the drive on would probably be best.

With a USB disk (or a disk in a USB enclosure) it should be possible to power it down completely. The main problem then would be powering it up again -- I'm not sure if there's an API/interface for powering up a USB device that has been powered down, other than unplugging it and plugging it in again.

Compared to the power requirements of an RPi, the power for several spindle drives would be pretty significant.. except you have to figure they won't be spun up most of the time.

One option to fund this would be to team up with some storage provider like Bittorrent Sync. They would sponsor the device, I would provide hard disk, network and electricity. Sponsor would get 20-30% of my disk space for their purposes and the rest would go for Internet Archive.

A bit similar to what Fon was doing. http://en.wikipedia.org/wiki/FON

These storage pods (http://www.45drives.com/products/direct-wired-standard.php) hold 180TB if you populate them with 4TB drives. 20 PB / 180 TB ~= 111. With some redundancy, you could round up to 200 pods.
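As a rough check on the pod count, here is a small sketch. The function name and the `copies` parameter are hypothetical; the 20 PB and 180 TB figures come from the comment, and decimal units (1 PB = 1000 TB) are assumed because that is what the division above uses.

```python
import math

def pods_needed(archive_tb, pod_tb, copies=1):
    """Whole pods needed to hold `copies` full copies of the archive."""
    return math.ceil(archive_tb * copies / pod_tb)

ARCHIVE_TB = 20 * 1000   # 20 PB in decimal TB (assumption)
POD_TB = 180             # per-pod capacity quoted in the comment

print(pods_needed(ARCHIVE_TB, POD_TB))            # a single copy
print(pods_needed(ARCHIVE_TB, POD_TB, copies=2))  # two full copies
print(200 * POD_TB / ARCHIVE_TB)                  # copies that 200 pods would hold
```

Under these assumptions 200 pods is a little less than two full copies, which matches the "some redundancy" wording.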

Now you host these in various datacentres around the world. Maybe one rack per data centre so approx 20 datacentres. This will be manageable, sort of like managing a small cluster. You'll have plenty of cheap bandwidth and excellent reliability/availability. Datacentres may be able to donate space or bandwidth to the project for the positive publicity, etc.

Relying on USB drives hanging off of consumers' laptops connected over wifi is going to have dismal reliability/availability, poor bandwidth, etc, so to compensate you'll need a colossal amount of redundancy (5-10x? more?) for it to even function.

From an end user's point of view, I think my donation of $100 toward the above storage pod approach would go a lot further than me buying a $100 USB drive and managing a low quality bandwidth capped contribution.

The work that the Internet Archive does is amazing and vital. If you've ever used the Wayback Machine[1] and thought "huh, that's neat" you should consider donating to their cause[2]. I just set up a donation through my payroll provider (ZenPayroll) that gives them a small chunk of every paycheck.

[1]: https://archive.org/web/

[2]: https://archive.org/donate/

Jason Scott, an archivist at archive.org (and other things) was recently a guest on The Web Ahead: http://thewebahead.net/97.

I've seen the Wayback Machine used in court cases. I was pretty amazed the first time I saw it mentioned in one.

And the Internet Archive has a lot more projects than just the Wayback Machine. http://blog.archive.org/2015/02/12/whats-new-with-v2/

I am disappointed by the v2 website frontpage (beta). It's too overloaded. Hopefully web.archive.org will stay unchanged. I want a Google-like style page, not a Yahoo style overcrowded page that is very dynamic and dog slow.

I like the mission and the current site of archive.org a lot!

They should monetize this as a resource, since there's money available there.

How would you monetize a resource that's more valuable when it's accessible to everyone?

No, they shouldn't.

There's already a backup of the Internet Archive, in Alexandria, Egypt.[1] It hasn't been updated since 2007, though. The Archive originally wanted to have duplicates in several locations, but Egypt was the only one that came online. Here's YCombinator's original web site from 2005, from the copy of the Internet Archive at the Bibliotheca Alexandrina.[2]

A system, OpenArchive, was set up to replicate the Archive, but nobody else signed up.

[1] http://www.bibalex.org/internetarchive/ia_en.aspx

[2] http://web.archive.bibalex.org/web/20050401073947/http://yco...

Thanks for this link. I've been digging for my old homepage data from '97 or so that started returning blanks on archive.org. I found a viable copy here!

If Alexandria was a backup of IA, shouldn't everything in Alexandria already be available at archive.org? Otherwise, that implies some pages were deleted or are no longer visible at archive.org.

That's unfortunately the case. I've got pages that resolve as blank from archive.org and properly from Alexandria. I suspect there might be some data problems on the main archive.org server.

They aren't deleted -- they show up in the timeline correctly -- but they come back as entirely blank when I load them (no toolbar, no nothing).

IA hard drives fail, and as far as I know, they don't use anything fancy like FEC or distributed filesystems to heal from hard drives dying.

They don't keep data twice? So the archive is just slowly rotting away literally, all the time?

Okay, this just convinced me that Internetarchive.bak is a good idea. I'm going to participate.

Internet Archive engineer here. We do maintain two copies of everything in our main data collection. Disks fail at an average rate of perhaps 5-6 per day, and the content does get restored from the other copy.

(That said, originally the Wayback crawl content was stored separately from paired storage, but it was merged into paired storage a few years ago.)

We would prefer greater redundancy, and may shift to a more sophisticated arrangement at some point, but as a non-profit with limited resources, we're constantly managing tradeoffs between what we should do and what we can afford to do.

Thanks for the reassuring clarification. Two copies is much better than one copy :) It would be useful to know the maximum time window for restoral from the backup copy.

Depending on what time of day it fails, the bad disk is generally replaced within 2-18 hours. Copying the data onto the new disk, from the remaining half of the original pair, typically takes another day or so.

But the material found on the failed disk usually remains available throughout the replacement process because both original copies are on live servers. Read requests are ordinarily divided more or less evenly between the two. If one copy dies, requests are simply all directed to the other copy until the lost one is replaced.

There is no ability to "write" (make changes) to any affected items during the replacement process, but they usually remain readable throughout.
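The behaviour described here (reads split across two live copies, failover to the survivor, no writes during replacement) can be sketched as below. This is purely illustrative: the class and method names are invented, and nothing here reflects the Internet Archive's actual code.

```python
import random

class MirroredPair:
    """Two replicas of the same item; reads go to any replica that is up."""

    def __init__(self, replica_a, replica_b):
        self.healthy = {replica_a: True, replica_b: True}

    def mark_failed(self, replica):
        self.healthy[replica] = False

    def mark_restored(self, replica):
        self.healthy[replica] = True

    def pick_for_read(self, rng=random):
        live = [r for r, ok in self.healthy.items() if ok]
        if not live:
            raise RuntimeError("both copies are unavailable")
        # Random choice gives a roughly even split while both are up.
        return rng.choice(live)

    def writable(self):
        # Per the comment, items are read-only while one half is replaced.
        return all(self.healthy.values())

pair = MirroredPair("disk-a", "disk-b")
pair.mark_failed("disk-a")
print(pair.pick_for_read(), pair.writable())
```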

I looked into this a few years ago when I noticed some of my IA links were erroring out, and yes, as far as I could tell, there is no redundancy in the IA's archives (except, obviously, at the snapshot level; if you lose a machine with some snapshots of cnn.com, there will probably be snapshots of cnn.com at other time periods on other machines).

It's unfortunate, but can they really afford the extra space redundancy/FEC would necessarily require? Or the nontrivial software engineering to set up such redundancy? (Presumably their current servers are well-understood, well-debugged, and complex, while redundancy would bring in an unknowable number of new bugs.)

Are there other online storage services (especially ones using the term 'archive') that lose data? This is hard to fathom.

Why would version 1 of IA have been developed without a prime directive of not losing data? This is assumed by anyone working in the design of storage systems.

Does the data eventually get restored from backup, when a hard drive fails?

Hey everyone! We designed IPFS (http://ipfs.io) specifically for this kind of purpose.

archiveteam: would be great to coordinate with you and see where we can help your effort! closure/joeyh pointed me to his proposal for you-- i'll write one up as well so you can check out what is relevant for you. Also, recommend looking into proofs-of-storage/retrievability.

Also, if archivists (Archive.org team, or archiveteam (they're different)) see this:


Same note in rendered markdown, served over IPFS ;) at:



Dear Archivists, First off, thank you very much for the hard work you do. We all owe you big time.

I'm the author of IPFS -- i designed IPFS with the archive in mind.[1]

Our tech is very close to ready. you can read about the details here: http://static.benet.ai/t/ipfs.pdf, or watch the old talk here: https://www.youtube.com/watch?v=Fa4pckodM9g. (I will be doing another, updated tech dive into the protocol + details soon)

You can loosely think of ipfs as git + bittorrent + dht + web.

I've been trying to get in touch with you about this-- i've been to a friday lunch (virgil griffith brought me months ago) and recently reached out to Brewster. I think you'll find that ipfs will very neatly plug into your architecture, and does a ton of heavy lifting for versioning and replicating all the data you have. Moreover, it allows people around the world to help replicate the archive.

It's not perfect yet -- keep in mind there was no code a few months ago -- but today we're at a point of streaming video reliably and with no noticeable lag-- which is enough perf for starting to replicate the archive.

We're at a point where figuring out your exact constraints-- as they would look with ipfs-- would help us build what you need. We care deeply about this, so we want to help.



[1] see also the end of https://www.youtube.com/watch?v=skMTdSEaCtA

I tried ipfs, it's impressive. The only challenge is to re-write paths to be relative instead of absolute, as ipfs has its own "/ipfs/$hash" prefix.

This is ludicrously ambitious, love it. Keep it up!

This sounds a lot like torrents, and I would think that it'd be less work to solve the problem of newer snapshots of torrents (or put all the new content into new "chunks" and thus new torrents) vs. re-inventing the entire thing.

This is already solved with BEP 39 [1]. It wouldn't be difficult for the Internet Archive to implement, either.

[1]: http://www.bittorrent.org/beps/bep_0039.html

The tracker would have to manage millions of torrents, and each client would have to figure out which subset of torrents it wanted to download. And there needs to be a mechanism for the server to force a client to recompute hashes from time to time. So I don't see how it's very torrent-like.

If one uses one torrent per chunk (500 GB), that's just 42K torrents -- as mentioned here[1]:

"Update: Here’s a copy of 17 million torrents from Bitsnoop.com, pretty much the same format but nicely categorized. It’s only 535 MB."

So, that's 17M [magnet]s, which means the archive could grow by some orders of magnitude from its current need of 42K [magnet]s, and doling out a subset to clients would still seem quite possible to manage.

No need for trackers any more, [magnet]s work fine, and are regularly used to distribute ~1GB torrents (eg: hd tv episodes). Whatever one might think of distributing unlicensed copies of media -- they show that distributing vast amounts of data is quite easy with technology that is readily available and well tested.

[1] https://torrentfreak.com/download-a-copy-of-the-pirate-bay-i...

What we need now is a torrent client that can prioritize. In other words: here's this absurd number of torrent chunks, here's the space I have, now what are the best chunks to archive for the health of the swarm?

Most torrent clients already prefer rare chunks, assuming the user doesn't turn on "Prefer chunks in streaming order" or whatever similar option the client has.

I don't know of a torrent client that will allow me to say "download at least these files of the torrent, and the 100MB of anything else that's the most likely to help the swarm".

Downloading the chunks that are the rarest first is different than downloading only rare chunks - although it's good that some of the infrastructure is already in place.

As they say themselves (http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/t...), that would be 42k * 500GB. Can you imagine a torrent with a single chunk of 500GB?

e12e said one torrent per 500GB chunk, not one chunk per 500GB.

Torrents would work perfectly well, and are in fact one of the proposed solutions. The problems you mention are real though.

One idea is to use a custom BitTorrent client which the user can install (a bit like the ArchiveTeam Warrior VM). The user would tell it how much disk space they want to donate, and the client would join swarms over time to fill up that donated space. The user could even join specific swarms representing content that they enjoy and particularly want to save, but wouldn't have to go to the hassle of picking every torrent manually. Also, IA already provides a torrent for every single item in the archive with IA as a web seed, so this could be pretty simple to do from their perspective.
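A minimal sketch of how such a client might decide what to fetch: take the user's pinned items first, then the least-seeded items, until the donated space is used up. Everything here is hypothetical (the `Item` fields, the greedy policy, the sample catalog); a real client would get seeder counts from the swarm or a tracker.

```python
from dataclasses import dataclass

@dataclass
class Item:
    name: str
    size_gb: float
    seeders: int          # peers already holding this item
    pinned: bool = False  # user explicitly wants to save this one

def choose_items(items, donated_gb):
    """Greedy pick: pinned items first, then rarest, while space remains."""
    chosen, free = [], donated_gb
    order = sorted(items, key=lambda i: (not i.pinned, i.seeders, i.size_gb))
    for item in order:
        if item.size_gb <= free:
            chosen.append(item.name)
            free -= item.size_gb
    return chosen

catalog = [
    Item("geocities", 900, seeders=3),
    Item("old-software", 200, seeders=1),
    Item("favorite-zine", 50, seeders=40, pinned=True),
    Item("popular-video", 300, seeders=500),
]
print(choose_items(catalog, donated_gb=600))
```

Greedy selection is not optimal for packing, but it is simple and biases toward the items the swarm is most likely to lose.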

Joey's design proposal to use git-annex:


I am wondering why they don't ask large cloud providers to donate storage. If Amazon, Microsoft, Google, Rackspace, Joyent, etc all gave them 10PB worth of storage, they could just make a facade to their APIs and store all the data on there, and just keep one copy of the data and a database of files?

A distributed or p2p model is a great idea, but it seems very difficult to achieve.

You know, Amazon alone could provide this. The retail cost of 20PB of Glacier storage is $222,100.96/mo. Assuming they have a margin of 50%, that's about $110K a month in actual cost to them, so basically their "donation" would be about $1.3MM/year, which is roughly 0.0015% of their revenue ($90B). Even if their margin is 0, that's still only about 0.003% of their revenue.

Their concerns are seemingly exactly why we have invented hash functions. They have managed to keep BitTorrent (swarms certainly) free of bad actors even when there is considerable monetary interest in their existence.

Yes. I'm reminded of a project that I didn't really "get" when I first came across it, the Least Authority File System: https://en.wikipedia.org/wiki/Tahoe-LAFS - it seems these guys have been building for exactly this kind of application!

Plain hashes are vulnerable to all sorts of attacks. Here's a paper on proof of retrievability https://cseweb.ucsd.edu/~hovav/dist/verstore.pdf and one on non-outsourceable storage http://cs.umd.edu/~amiller/nonoutsourceable.pdf

can you give the tl;dr? What are possible attacks?

Just for example, you could download data, compute the hash, then delete the data. You could pretend to have stored lots of data this way. If someone asks for the hash to prove that you have the data, just give them the hashes you computed earlier. That's what the first paper prevents.

Another one is to pretend that you have lots of chunks, and when someone challenges you to produce one, you quickly download it from another source and present it as if you had it all along. That's what the second paper prevents.

I think the easier way to go about this is have a pub/sub system, where notifications are created to consumers of the torrent link when the Archive generates the metadata for new archived items, and those consumers then join the swarm for that item.

Can torrent clients properly handle being part of hundreds of thousands of item swarms at once?

At a first pass, I'd try to implement such a thing with a Distributed Hash Table-like situation. Generate chunks in some fashion (either one chunk per file, or glob files together up to N megabytes), take a hash of the chunk, and then let people fetch it. All the Internet Archive needs to do is then keep redundant copies of the hashes.
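A rough sketch of the "chunk, hash, keep the hashes" idea, assuming fixed-size chunks and SHA-256 (both choices are assumptions; the comment does not name a hash or a chunking scheme). The function names are made up for illustration.

```python
import hashlib

def build_manifest(data: bytes, chunk_size: int):
    """Split data into fixed-size chunks; return [(offset, sha256 hex)]."""
    return [
        (offset, hashlib.sha256(data[offset:offset + chunk_size]).hexdigest())
        for offset in range(0, len(data), chunk_size)
    ]

def verify_chunk(manifest, offset, chunk: bytes) -> bool:
    """Check a chunk fetched from an untrusted peer against the manifest."""
    expected = dict(manifest).get(offset)
    return expected == hashlib.sha256(chunk).hexdigest()

data = b"abcdefghij"
manifest = build_manifest(data, chunk_size=4)
print(verify_chunk(manifest, 4, b"efgh"))   # chunk matches the manifest
print(verify_chunk(manifest, 4, b"efgX"))   # tampered chunk is rejected
```

The manifest is tiny compared to the data, so the Archive could keep many redundant copies of it cheaply. Note that this only detects bad data on download; it does not prove a peer is still storing anything.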

Make it compatible with http://Archive.today and http://Webcitation.org please?

You can donate here: https://archive.org/donate/

The idea of shared/distributed backup leads me to a vision of web browsers actively archiving and sharing public data. No more Slashdot effect: the more clients hit some public resource, the more caches are filled and seeded in case the original resource goes down.

Latency will kill it. I don't foresee symmetric connections for residential networks any time soon, so when you hit up Wikipedia.org, you will be downloading it from your neighbor's 128 Kbps uplink. Discovering that this neighbor has a copy of the article you want is even more difficult. On top of that, what if you are signed into Wikipedia? Now you are sending your session ID to your neighbor? And what about HTTPS? Can your neighbor now send you resources as Wikipedia.com?

The idea of doing HTTP over torrent comes up periodically and nobody yet has answers to these questions. Besides, online access is not the problem: there is plenty of bandwidth and CDNs already solve the data locality problem.

> I don't foresee symmetric connections for residential networks any time soon

All FTTH customers in the Netherlands have this, and it's just a matter of time before everyone has FTTH.

Then again, it's probably also a matter of time before abuse of symmetric connections becomes large enough that ISPs start disabling it.

I see your point though, we're nowhere near there yet, especially on a global scale.

TWC's cheapest plan has 1 Mbps upload, the next tier is 5 Mbps upload, https://www.timewarnercable.com/en/plans-packages/internet/i...

I think you are making my point. How does 1 Mbps or 5 Mbps compare to say, a 1 Gbps connection you can have with a server colocated or rented from a hosting company? Moreover, what's the benefit?

1Mbps is eight times faster than 128Kbps, your original characterization of residential uplink speed. Comparing 1Gbps and 1Mbps is more about bandwidth than latency, since the physical distance between server nodes will on average be larger than the physical distance between local neighbors.

One benefit is ownership decentralization of the serving endpoint, which helps in some scenarios (more difficult to censor) and hurts in other scenarios (more difficult to sue).

In either case, would you consider hosting something like Wikipedia on a 1 Mbps connection? Besides, you will be sharing that connection with everything else your neighbor does: both CPU and link-wise. "Oops, Bob's getting on Skype again. Better go make some tea, while this page loads. Too bad the original server this came from has a 1 Gbps connection and crazy redundancy. I sure am happy I can get this page from Bob's 6 year old router running at 200 MHz."

What problem is this solution supposed to solve? Bandwidth for the type of content you can actually mirror like this is not a problem and hasn't been for a decade.

> 1Mbps is eight times faster than 128Kbps

Actually, 8 times as fast, or 7 times faster. Unless 128Kbps is 1 times faster than 128Kbps.

Thanks :)

A similar concept: content-centric networking.

It seems like TAHOE-LAFS would be a great way to let anonymous people donate storage while maintaining data integrity.

There is now a channel, #internetarchive.bak, on EFNet, for discussion about the implementations and testing.

This reminds me of the concept of Permacoin (https://www.cs.umd.edu/~elaine/docs/permacoin.pdf). Maybe they should give it a thought? The idea is to create a financial incentive to store chunks of a huge dataset by using proof-of-retrievability algorithms to mine a cryptocurrency.

I would like to be a part of this - I have a file server that runs 24/7 already, and a spare terabyte (or more) of hard drive space. However, I do not have a spare terabyte of bandwidth to use downloading a chunk of the internet archive.

Hopefully whatever backup plan they come up with includes a way for me to mail them a spare hard drive and get a drive full of internet archives in return.

Are you limited by total BW cap, or transfer rate? There's no reason a suitably intelligent resumption system couldn't allow that TB to be downloaded over the course of several months if necessary. You'd be a lower priority peer, but still contributing.

The manpower and logistic hurdles of shipping disks around make it seem unlikely it would ever be an option, but perhaps something more local sneaker-net where they partner with nearby universities or libraries with decent connections who allow people to bring their machines/disk arrays in for that specific use.

That sounds wildly impractical manpower-wise.

There is an official mirror of the Wayback Archive.org located in the Bibliotheca Alexandrina: http://en.wikipedia.org/wiki/Bibliotheca_Alexandrina

I dreamed last night that an internet dot-com giant donated money to, or hosted a mirror of, Archive.org. It would definitely be a novel step.

The challenges have been solved already. The article talks about splitting it into 42k 500GB chunks. BitTorrent Sync will happily deal with 500GB, and open trackers are available that can report the counts of users in a swarm.

To complete it, just build a web front-end that provides the least-populous BTSync key from the pool. People can then paste it into their client and contribute.

BT Sync was considered for a related Archive Team project [1], but was ruled out as it is (currently) proprietary.

1: http://www.archiveteam.org/index.php?title=Valhalla#Non-opti...

Ok - but that's the only protest? An open version would satisfy all the requirements? Syncthing.net, or even plain old Bittorrent should do (especially if it's one static file per 500GB).

This is a neat idea. I have some random thoughts about doing this in a distributed fashion:


In the wiki there's a mention of how the geocities torrent was 900GB and very few people seeded it as a result. There should be a way and an incentive for clients to seed parts of files without having the whole thing. As long as there are enough copies of each chunk on the network, it's fine - we don't care that any one client has the whole thing.


Another cool thing is thinking about health metrics - you don't just care about how many people have a file, you care about recoverability too. This seems a bit similar to AWS Glacier - you have to take into account how long it's going to take to retrieve data. You could have stats on how often a client responded and completed a transfer request, what its track record is, and assign a probability and timeframe for correctly retrieving a file.
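One simple way to turn per-client track records into a recoverability number, sketched under a strong assumption: holders of a chunk fail independently of one another. The function name is hypothetical, and the success rates would come from whatever transfer statistics the system records.

```python
import math

def retrieval_probability(success_rates):
    """P(at least one holder delivers), assuming independent holders."""
    p_all_fail = math.prod(1.0 - p for p in success_rates)
    return 1.0 - p_all_fail

# Three holders that each complete half of their transfer requests
print(retrieval_probability([0.5, 0.5, 0.5]))
# One very reliable holder plus two flaky ones
print(retrieval_probability([0.99, 0.2, 0.2]))
```

Independence is optimistic in practice (clients on the same ISP or in the same region tend to fail together), so a number like this is best read as an upper bound.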


One thing that comes to mind is that whatever the solution ends up being, it should be such that partial recovery should be possible. The example that comes to mind is stuff like a zip file or encrypted image that is completely ruined when a chunk is missing. So maybe it makes sense to have the smallest unit of information be large enough to be still intelligible on its own?

At first I was thinking of a model where chunks get randomly distributed and replicated, but then that made me wonder whether that's a bad idea.


And then what of malicious clients? I guess it's not hard to have a hash-based system and then just verify the contents of a file upon receiving it. But can I effectively destroy a file by controlling all the clients that are supposed to be keeping it up? How could you guard against that? Could you send a challenge to a client to send you the hash of a random byte sequence in a file to prove that they do indeed have it? What guarantees that they will send it even if they do?


And then what about search / indexing?! P2P search is an awesome problem domain that has so many problems that my mind is buzzing right now. Do you just index predetermined metadata? Do you allow for live searches that propagate through the system? Old systems like kademlia had issues with spammers falsifying search results to make you download viruses and stuff - how to guard against that? Searching by hash is safe but by name / data is not. Etc etc.

I wish this was my job! :)

This seems to have a lot of overlap with people coming up with distributed storage systems, perhaps Maidsafe [1] has solved some of those problems? (I'm not very familiar with it)

[1] http://en.wikipedia.org/wiki/MaidSafe

They should really look at using permacoin to store the archive.

Permacoin is a cryptocurrency that uses proof of storage instead of Bitcoin’s proof of work:


You've got a trusted central authority here (the Archive itself), so why deal with some coin guff if you don't have to?

Because it provides incentive for people to help store the archive, and it solves the problem in their article about defending against bad actors.

But all you need to defend against bad actors is for archive.org to hang on to a SHA256 or 512 for every chunk. Much simpler than a distributed blockchain.

It’s a little more complicated than that. I could temporarily take a chunk, compute the hash, then throw it away and report back that I am happily storing the data, even though I’m not.

Anyhow, read the permacoin paper. It’s pretty cool, and it needs a large petabyte data seed to secure the network. Seems like a win-win to me.

The centralisation helps again here too though.

The central server has access to the entire file, and hence can compute the hash of any arbitrary chunk. Challenges/verifications don't have to happen all that often ([1] indicates they are looking at once a month) so creating a unique challenge for each user shouldn't be too compute intensive.

For each known user:

- Central server chooses random chunk of each file it wishes to verify the user still has. This could be any length from as long as the hash function to the whole file size and could be offset any number of bits.

- Client is asked to provide the hash of the chunk using the file stored locally.
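The two steps above can be sketched as follows. Assumptions are labeled: SHA-256 as the hash, byte-aligned ranges (the comment allows bit offsets), and in-memory `bytes` standing in for files on disk. Function names are invented for illustration.

```python
import hashlib
import secrets

def make_challenge(file_size: int):
    """Server side: pick a random non-empty byte range inside the file."""
    offset = secrets.randbelow(file_size)
    length = 1 + secrets.randbelow(file_size - offset)
    return offset, length

def answer_challenge(stored: bytes, offset: int, length: int) -> str:
    """Client side: hash the requested range of the locally stored copy."""
    return hashlib.sha256(stored[offset:offset + length]).hexdigest()

def verify(original: bytes, offset: int, length: int, answer: str) -> bool:
    """Server side: recompute the hash from its own copy and compare."""
    return answer == answer_challenge(original, offset, length)

original = bytes(range(256)) * 4           # stand-in for a stored chunk
offset, length = make_challenge(len(original))
answer = answer_challenge(original, offset, length)
print(verify(original, offset, length, answer))
```

This addresses the "hash it once, then delete it" attack only. It does nothing about a client that fetches the range from someone else on demand, which is the case the non-outsourceable storage paper mentioned elsewhere in the thread is about.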

Precomputing hashes for all possible chunk ranges would take up substantially more space than simply storing the file in the first place. A bad actor would need to store the hash for every possible chunk length starting from every possible start location in the file, which is on the order of O(n^2) hashes, where n is the stored file size (500GB in this case). For reference, even with byte-aligned ranges that is roughly 10^23 hashes for a single 500GB chunk, which at 256 bits each is vastly more than the entire 20PB archive.

[1] http://git-annex.branchable.com/design/iabackup/

It's a good thing in this case they also don't have to worry too much about bad actors, since it's a fundamentally altruistic endeavor. Worst case scenarios:

1) Someone keeps downloading the full archive and throwing away;

2) Someone wants a file erased, keeping it with the intention of denying access in some future.


1) Bandwidth has costs on both sides; just balancing upload among receivers would probably suffice for this not to be a problem.

2) Assigning large random chunks to downloaders should prevent this "chosen-block attack"; add in some global redundancy for good measure and that's probably enough (although I still wouldn't trust this 100% as a primary storage, only as an insurance storage).

The added bonus of periodically validating random chunks is that it forms a sort of data-scrubbing function, allowing the detection of non-malicious failures due to bad sectors or whatever as well.

"I could temporarily take a chunk, compute the hash, then throw it away and report back that I am happily storing the data, even though I’m not."

This is not a new problem, and most of the obvious solutions work just fine.

It would cost about US$2.5 million/year to back it up using Amazon Glacier.
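For what it's worth, the figure is easy to reproduce. The sketch below assumes a price of $0.01 per GB-month and binary units (1 PB = 1024 * 1024 GB); both are assumptions, not numbers stated in this comment, and retrieval or request fees are ignored.

```python
GB_PER_PB = 1024 * 1024
PRICE_PER_GB_MONTH = 0.01   # assumed list price, storage only

def glacier_cost_per_year(petabytes, price=PRICE_PER_GB_MONTH):
    """Yearly storage-only cost in dollars for the given capacity."""
    return petabytes * GB_PER_PB * price * 12

print(f"${glacier_cost_per_year(20):,.2f} per year")
```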

With that much data, it would be cheaper to build your own Glacier-like service.

About $1m to build 21 PB of live storage (not tape), and $50k/year to power it (not factoring maintenance, rack space cost, and cooling)

A service offering live storage at this (Petabyte) scale [1] - costs 22,020,096 GB * 3¢/GB * 12mo = $7,927,234.56 a year. You could build six fully-mirrored copies for about that much, though management of 360 4U machines (30x 48U full-height racks worth) is a business on its own (see BackBlaze photos), and you need to deal with drive replacement after a point.
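The hosted-storage figure above can be reproduced in a few lines. This is only a sanity check of the arithmetic as stated (21 PB in binary units, 3 cents per GB per month); the variable names are mine, and working in integer cents is just a choice to avoid floating-point rounding.

```python
GB_PER_PB = 1024 * 1024        # binary units, matching the 22,020,096 GB figure
ARCHIVE_PB = 21
CENTS_PER_GB_MONTH = 3         # the 3 cents/GB/month rate quoted above

archive_gb = ARCHIVE_PB * GB_PER_PB
yearly_cents = archive_gb * CENTS_PER_GB_MONTH * 12

print(f"{archive_gb:,} GB costs ${yearly_cents / 100:,.2f} per year")
```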

[1] http://www.rsync.net/products/petabyte.html (If you actually sign up for this, be sure to send me that $24,000 referral bonus o_o)


-- Machine Cost --

$300 * 45 * 8TB drives = $13,500 (drive cost) + $3,387.28 (pod cost) == $16,887.28 (unit cost) / 360TB = $46.91 per TB

$46.91/TB * (21 PB * 1024 == 21,504 TB) == $1,008,752.64 / 21 PB (for no redundancy)
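For anyone who wants to plug in their own drive prices, here is the same machine-cost arithmetic as a small script. All figures are the ones quoted above; the only thing added is the whole-pod count, since you cannot buy a fraction of a pod.

```python
import math

DRIVE_PRICE = 300.00       # USD per 8 TB drive (figure from the comment)
DRIVES_PER_POD = 45
DRIVE_TB = 8
POD_PRICE = 3387.28        # USD per pod chassis (figure from the comment)
ARCHIVE_TB = 21 * 1024     # 21 PB expressed in binary TB

pod_cost = DRIVE_PRICE * DRIVES_PER_POD + POD_PRICE   # cost of one filled pod
pod_tb = DRIVES_PER_POD * DRIVE_TB                    # raw capacity of one pod
cost_per_tb = round(pod_cost / pod_tb, 2)             # rounded to cents, as above
total = cost_per_tb * ARCHIVE_TB                      # non-redundant total

whole_pods = math.ceil(ARCHIVE_TB / pod_tb)           # pods you would actually buy

print(f"${cost_per_tb}/TB, ${total:,.2f} total, {whole_pods} pods")
```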

-- Power Cost --

Let's say the motherboard/controllers draw around 100W continuous and each drive pulls 11W (from my post elsewhere in the comments). This is 2,688 drives, which fit into 60x 45-drive 4U BackBlaze storage pods.

2,688 * 11W = 29,568 W (for the drives) + 6,000 W (for the computers) = 35.568 kW to run a non-redundant 21 PB array.

You'll need at least two network switches and moderate cooling infrastructure as well. The heat from drawing 35.5 kW continuously takes nontrivial effort to disperse, but let's assume that's somehow free for the sake of simplicity here.

35.568 kW * 24 * 365 = 311.57568 MWh (in a year)

I'm sure at this point you might be able to work out a bulk deal of some kind, but at $0.16/kWh you're talking a pretty low cost of $49,852.11 in power every year.
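The power estimate is easy to redo with a different wattage or electricity rate. A sketch using the numbers above (11 W per drive, 100 W per pod, $0.16/kWh), ignoring cooling and switches just as the estimate does:

```python
DRIVES = 21 * 1024 // 8    # 21 PB of 8 TB drives
PODS = 60
DRIVE_W = 11               # watts per drive (assumption from the comment)
POD_W = 100                # watts per motherboard/controllers (assumption from the comment)
PRICE_PER_KWH = 0.16       # USD

total_w = DRIVES * DRIVE_W + PODS * POD_W      # continuous draw in watts
kwh_per_year = total_w * 24 * 365 / 1000       # energy used over one year
yearly_cost = kwh_per_year * PRICE_PER_KWH

print(f"{total_w / 1000} kW, {kwh_per_year / 1000:.2f} MWh/year, ${yearly_cost:,.2f}/year")
```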


Building arrays of nonredundant live drives would be significantly cheaper over the long term than Amazon Glacier (which is almost certainly cold storage, not live) at this scale.

Right but Glacier _will_ provide redundancy. If you introduce even the most basic redundancy to your math (multiply everything by 2), it pretty nearly closes the gap with Glacier's costs. You also forgot to add in the costs of paying at least one human (probably a few) to set all this up and maintain it forever.
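To make the "multiply everything by 2" comparison concrete, here is a rough sketch using only the figures already in this thread: about $1,008,752.64 to build and $49,852.11/year to power one copy, against roughly $2.5 million/year for Glacier. Staff, rack space, cooling and drive replacement are left out, which is exactly the gap being pointed out, so treat the DIY side as a lower bound. The function names are made up.

```python
BUILD_PER_COPY = 1_008_752.64    # USD, one non-redundant 21 PB copy
POWER_PER_COPY = 49_852.11       # USD per year, one copy
GLACIER_PER_YEAR = 2_500_000.00  # USD per year, estimate quoted earlier


def diy_cost(copies: int, years: int) -> float:
    """Cumulative hardware plus power cost; excludes staff and replacements."""
    return copies * (BUILD_PER_COPY + POWER_PER_COPY * years)


def glacier_cost(years: int) -> float:
    """Cumulative Glacier storage cost at a flat yearly rate."""
    return GLACIER_PER_YEAR * years


for copies in (1, 2, 3):
    for years in (1, 3):
        print(copies, years, f"{diy_cost(copies, years):,.2f}",
              f"{glacier_cost(years):,.2f}")
```

With two copies the first-year totals land within a few hundred thousand dollars of each other, so whether DIY wins depends on the salaries and replacement drives that this sketch leaves out.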

The real question is, why do they need this archive to be spinning? What's wrong with tape?

>Amazon Glacier (which is tape-based!)

Citation? Not saying you're wrong but I've seen supposedly "in the know" commentary in favor of both the tape-based and some sort of cold disk-based storage. I've been curious and also a bit surprised that the true answer has never really hit publicly although I'm sure it's well-known within some communities.

Oh, you're right. I haven't thought about this for a while, and I remember my previous assumption about tape storage being challenged. I'll replace that part with "cold storage".

After refreshing my memory: Amazon outright denied they're using tape backup, some third parties pointed out cold spinning disks wouldn't be very economical, and a few people theorized it's actually BDXL.

Cold storage is probably safe. I've seen the assertion that it's tape from someone who I wouldn't have thought would get it wrong but I'd now bet against it whether or not Amazon's denial seals the deal.

Robin Harris has argued in favor of BDXL but he's speculated about a number of things. http://storagemojo.com/2014/04/25/amazons-glacier-secret-bdx...

My money would be on the simplest: cold disk. Some support here: http://mjtsai.com/blog/2014/04/27/how-is-amazon-glacier-impl... but it's all speculative and anecdotal. I don't have a problem with the thought that Amazon is getting out ahead of the drive economics, especially given that I'm not sure how widely used Glacier is today.

I do find it interesting that this has been apparently kept under wraps from anyone who would share more widely.

Someone who worked at Amazon at the time Glacier came out commented a while back that some type of extremely low-power hard drive backs this system. That makes sense given the 1-4 hour delay in retrieval time. The racks are idle and only come online once every 4 hours or so, batch retrieve/store their data, then power down. This small power usage is probably what lets them profit on this very slim margin.

For a more accurate comparison to Amazon, multiply by at least 3 for redundancy, and don't forget to factor in the cost of the real estate, replacement equipment, employees to run it all, etc. You would be hard pressed to beat Amazon's prices, which are falling over time.

You're absolutely right. I was aiming for a baseline/ballpark here.

At this scale, redundancy via 3x mirror is pretty expensive. You could likely get away with something like chunks of 12 drives in raidz3, especially if you're planning on having at least a few facilities like this. $1m setup + land + a few salaries upkeep sounds like the kind of thing you could convince a few governments to do.
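A rough drive count for the two layouts, as a sketch: raidz3 keeps three drives' worth of parity per vdev, so a 12-drive vdev has 9 drives of usable space. This ignores ZFS metadata and padding overhead and assumes the same 8 TB drives and 21 PB target as above; the function names are made up.

```python
import math

DATA_DRIVES = 21 * 1024 // 8    # drives' worth of raw data: 21 PB on 8 TB drives


def raidz3_drives(data_drives: int, vdev_width: int = 12, parity: int = 3) -> int:
    """Total drives when data is spread over raidz3 vdevs of vdev_width drives."""
    data_per_vdev = vdev_width - parity
    vdevs = math.ceil(data_drives / data_per_vdev)
    return vdevs * vdev_width


def mirror_drives(data_drives: int, copies: int = 3) -> int:
    """Total drives when every drive is mirrored `copies` times."""
    return data_drives * copies


print("raidz3 (12-wide):", raidz3_drives(DATA_DRIVES))
print("3x mirror:       ", mirror_drives(DATA_DRIVES))
```

That is roughly a third more drives than the non-redundant array for raidz3, against three times as many for the mirror.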

However, Glacier won't do at all for the live Internet Archive, so it's a question of whether they want this to have live replication or not (if the whole backup is offline, who does your integrity checks?)

I'm actually curious how much money/storage space it would require to print the entire Internet Archive on whatever the current equivalent of microfiche is (anything cheap and more permanent than magnetic or optical storage).

I'd gladly host 3TB of data; I hope this becomes a thing.

store them on a gazillion 360yunpan+weiyun+baiduyun (or justcloud) accounts!

I never understood this desire to "back up" the internet.

It's an impossible endeavour and will only become more incomplete as the net grows. And 99.999% of the data "backed up" will never be looked at.

What's the point?

Well, I personally spent the better half of tonight recovering a bunch of old websites of mine from the IA using wget.

IA is one of the most important initiatives/projects on the internet. Links are not meant to change, but they do, and when it happens knowledge & history erodes.

Remember the 2004 Indian Ocean earthquake and tsunami?

It was the first major event that provided mainstream credibility to the blogosphere. Before this disaster, consumers turned to news publications for information. For the first time ever, news publications did not have the info and were desperate for any content at all. TV crews flew into the affected areas and started purchasing suitcases/video/still cameras sight unseen off shell-shocked tourists who were too traumatized to remember if they had captured anything.

It was also the last major disaster that occurred before the social revolution: YouTube, Flickr, Facebook, smartphones and mainstream adoption of broadband. Back then the main way for the general public to share information was via email (which had limits of 10MB) and, if they were technical or knew someone technical, on the 100MB of free hosting provided by their ISP.

Now imagine you're a student, academic researcher, or production assistant and you have been tasked with researching what transpired. As of right now, the majority of content from 2004 and the disaster is no longer available on the internet and is only accessible via the IA.

(nb: I was the guy behind http://web.archive.org/web/20050711081524/http://www.waveofd... which was the first emergency crisis live/micro blog. For what Google now has a team of volunteers doing during a crisis I did it myself with a single pentium 4 cpu, 4GB of ram, a single IDE hard drive, a ramdrive, a 100mbit uplink and a week without sleep. WOD will be relaunching within the next couple of weeks and was one of the sites recovered from the IA.)

knowledge & history erodes

There surely is history worth preserving. It's just this "compulsive hoarding"-approach that seems ultimately futile to me.

The internet itself inherently preserves the knowledge that at least one person cares enough about to put it on a webserver. Pretty much everything of even minor relevance should be referenced by Wikipedia by now.

IMHO the internet itself is the "Internet Archive".

Remember the 2004 Indian Ocean earthquake and tsunami?

To be honest, no. But Wikipedia seems to have that event pretty well covered; http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_an...

> To be honest, no. But Wikipedia seems to have that event pretty well covered

And in a lot of cases, the actual citations or references used by Wikipedia are dead links, or have some sort of 'last retrieved at' annotation, or end up at parked/squatted domains and the original content is no longer there to validate the claims it supposedly made.

Offhand I don't know if Wikipedia has a policy on references to Internet Archive / wayback-machine links to things, but imo, being able to point to a snapshot of a reference at the time it was made, as well as the current version if it still exists, is a desirable feature, and well worth striving for.

Internet pagerot/linkrot has caused me grief with bookmarks for things I didn't realise I needed again until years later when they were gone.

> Offhand I don't know if Wikipedia has a policy on references to Internet Archive / wayback-machine links to things

They do have a policy.[1] Additionally, IA has been specifically crawling links on Wikipedia to preserve the citations.[2][3]

1: https://en.wikipedia.org/wiki/Help:Using_the_Wayback_Machine

2: https://blog.archive.org/2013/10/25/fixing-broken-links/

3: https://archive.org/details/NO404-WKP/v2

There is some balance between "compulsive hoarding" and preservation. I actually agree that it's an ultimately futile effort--and not very fruitful--to obsess over preserving every bit of text, audio, and video that's ever been posted to the Internet. On the other hand, preserving the content of Usenet or even Geocities does strike me as worth some disk space and won't happen organically for the most part. And the sort of people who have the mindset to do that preservation probably inherently have a bit of the compulsive hoarder about them. On balance, I think it's a good thing.

Shouldn't there already be a lot of existing tech for doing crowd backup/sync? Quick googling yields at least a few companies which allow you to donate unused hard drive space in exchange for benefits:

Symform: http://www.symform.com/

> Earn free cloud storage when you allocate unused space on your device for your Symform account. Users that contribute get 1GB free for every 2GB contributed. Contribution can always be fine tuned for your personal needs.

Wuala: http://www.techrepublic.com/blog/windows-and-office/review-w...

> Extra space is available: Unlike the majority of online file storage services like Dropbox and Box.com, which offer paid plans only as a means to expand storage, Wuala offers you the option to gain additional online storage in exchange for some of your unused local disk space to commit to the network.

> Shouldn't there already be a lot of existing tech for doing crowd backup/sync?

There's bits and pieces, but essentially nothing that's open-source and community-supported.

Just pointing out that P2P backup is a well understood problem (in industry and academia) and they should probably partner with people who do this stuff seriously or at least base their design on existing solutions. For example their implementation page right now is trying to do this with git-annex: http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/g... which doesn't feel quite right...

One of the more interesting design requirements here is that the system should work even when the distributed content is _not_ encrypted. If I join this backup effort, and a meteor takes out all of SF (and thus the Internet Archive), and their off-site backups fail, and the aftermath is preventing me from accessing the internet, then that 1TB of data that I had downloaded is all still usable to me. It's not some giant encrypted blob, it's a bunch of zip files (or similar) containing images, sound, video, web pages, and other types of files that I can, in principle, already use. (Maybe I can't do anything much with that Apple II disk image, but I can at least identify it.)

Git annex and BitTorrent would both work with that requirement, and most of the commercial, proprietary solutions cannot; their clients wouldn't use them if that data wasn't private.
