Sell a device - a Raspberry Pi with an SD card preloaded with "plug and go" software - that people can buy for $50. Sell alongside it an inexpensive SATA and/or PATA enclosure into which people can put their old drives. They basically plug in their old hard drives and forget about it, perhaps writing off the ~$25/mo (tops) in electricity cost as a donation†.
When they plug it in, the RPi gets an IP and announces its availability via DNS-SD (Bonjour/Avahi/etc.). The user downloads a tool or just knows to visit `myarchiver.local` to pull up the control panel. The control panel lets the user set up IP options, time, etc. as well as control bandwidth usage, scheduling, etc. They can also see the amount of space used and set up notifications of some kind (reboot for updates, disk space, etc.).
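For the curious, here's roughly what that announcement could look like - a minimal sketch using the python-zeroconf library, where the service name, address, and port are all made-up placeholders:

```python
# Minimal sketch of the DNS-SD announcement (assumes python-zeroconf).
# Name, address, and port are illustrative placeholders.
import socket
from zeroconf import Zeroconf, ServiceInfo

info = ServiceInfo(
    type_="_http._tcp.local.",
    name="myarchiver._http._tcp.local.",
    addresses=[socket.inet_aton("192.168.1.50")],  # the Pi's LAN address
    port=80,                                       # control panel lives here
    server="myarchiver.local.",                    # what users type in
)

zc = Zeroconf()
zc.register_service(info)  # announce until the process exits
try:
    input("Announced as http://myarchiver.local/ - press Enter to stop\n")
finally:
    zc.unregister_service(info)
    zc.close()
```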
This little device just sits in a corner and is effectively a multi-TB shard of this 20 PB and growing cloud storage system preserving the Internet. Box it up with some sexy industrial design and some attractive packaging, and it becomes a conversation piece, just like your Apple Time Capsule, Amazon Echo, Butterfly Labs Jalapeño, or other newfangled life device, except it's actually storing more than porn, cat pictures, or magic internet money.
† I don't know how feasible it is, but go with it.
(edit: although looking further this is more for distributed crawling, while the results will be uploaded to their archive)
P.S. Please run a warrior instance. It comes in a small VM image that you can run in the background all the time.
I honestly have no idea how inefficient 'old' spinning disk hard drives might be, but I wouldn't be surprised if an older drive consumes more electricity than a modern, power-sipping ARM-based RPi. Maybe GP took that into consideration?
The Raspberry Pi can run on a 5V / 500mA bus, which is a 2.5W max load. Add the ~11W rated draw for a spinning disk and this gives us a theoretical 13.5W max constant draw (which is likely much higher than reality, as we're talking about rated power instead of measured).
13.5W * 24h * 365 = 118260Wh = 118.260kWh max in a year
At 16 cents per kWh (a lowish power tier for PG&E in Silicon Valley), this comes out to $18.92 a year.
These numbers assume a 100% efficient power supply running the unit, and they also use theoretical maximum power draw rather than observed power. It's also a fraction of the average power draw of any of these you might use: electric oven/range, microwave, fridge, incandescent lights, electric dryer, electric water heater.
As far as observed power: I've seen a fairly efficient Intel motherboard (CPU TDP of 17W) with 8 modern spinning disks drawing around 75W (measured at the wall using a Kill-A-Watt with all drives active) so I'd guess the real-world number is <8W per active drive.
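To make the arithmetic above easy to replay with your own electricity rate, a tiny sketch (the 13.5W and $0.16/kWh figures are just the ones from this thread):

```python
# Replay of the arithmetic above: rated watts -> dollars per year.
def yearly_cost(watts, dollars_per_kwh=0.16):
    kwh_per_year = watts * 24 * 365 / 1000   # W -> kWh over a year
    return kwh_per_year * dollars_per_kwh

print(yearly_cost(13.5))  # rated Pi + drive: ~18.92
print(yearly_cost(8.0))   # real-world per-drive guess: ~11.21
```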
Sure, for the general use-case. But if this was "just" for backup, powering up would probably be quite rare. If it was for content distribution (eg: torrent) with some expected semi-frequent access, leaving the drive on would probably be best.
With a USB disk (or a disk in a USB enclosure) it should be possible to power it down completely. The main problem then would be powering it up again -- I'm not sure if there's an API/interface for powering up a USB device that has been powered down, other than unplugging it and plugging it back in?
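On Linux the closest thing I know of is the kernel's runtime power management exposed through sysfs - no guarantee the enclosure's bridge chip cooperates, and the device path below is a made-up example:

```python
# Sketch of Linux USB runtime-PM via sysfs (needs root). The device id is
# hypothetical; look under /sys/bus/usb/devices/ for yours. "auto" lets
# the kernel suspend the device when idle, "on" forces it back to full
# power. Whether the disk actually spins down depends on the USB-SATA bridge.
DEV = "/sys/bus/usb/devices/1-1"  # hypothetical device id

def set_power(mode):
    with open(DEV + "/power/control", "w") as f:
        f.write(mode)

set_power("auto")  # allow idle suspend
set_power("on")    # wake it back up without replugging
```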
A bit similar to what Fon was doing.
Now you host these in various datacentres around the world. Maybe one rack per data centre so approx 20 datacentres. This will be manageable, sort of like managing a small cluster. You'll have plenty of cheap bandwidth and excellent reliability/availability. Datacentres may be able to donate space or bandwidth to the project for the positive publicity, etc.
Relying on USB drives hanging off of consumers' laptops connected over wifi is going to have dismal reliability/availability, poor bandwidth, etc., so to compensate you'll need a colossal amount of redundancy (5-10x? more?) for it to even function.
From an end user's point of view, I think my donation of $100 toward the above storage pod approach would go a lot further than me buying a $100 USB drive and managing a low quality bandwidth capped contribution.
I like the mission and the current site of archive.org a lot!
A system, OpenArchive, was set up to replicate the Archive, but nobody else signed up.
They aren't deleted -- they show up in the timeline correctly -- but they come back as entirely blank when I load them (no toolbar, no nothing).
Okay, this just convinced me that Internetarchive.bak is a good idea. I'm going to participate.
(That said, originally the Wayback crawl content was stored separately from paired storage, but it was merged into paired storage a few years ago.)
We would prefer greater redundancy, and may shift to a more sophisticated arrangement at some point, but as a non-profit with limited resources, we're constantly managing tradeoffs between what we should do and what we can afford to do.
But the material found on the failed disk usually remains available throughout the replacement process because both original copies are on live servers. Read requests are ordinarily divided more or less evenly between the two. If one copy dies, requests are simply all directed to the other copy until the lost one is replaced.
There is no ability to "write" (make changes) to any affected items during the replacement process, but they usually remain readable throughout.
It's unfortunate, but can they really afford the extra space redundancy/FEC would necessarily require? Or the nontrivial software engineering to set up such redundancy? (Presumably their current servers are well-understood and well-debugged, however complex, while new redundancy machinery would bring in an unknowable number of new bugs.)
Why would version 1 of IA have been developed without a prime directive of not losing data? This is assumed by anyone working in the design of storage systems.
archiveteam: would be great to coordinate with you and see where we can help your effort! closure/joeyh pointed me to his proposal for you -- I'll write one up as well so you can check out what is relevant for you. Also, I recommend looking into proofs-of-storage/retrievability.
Also, if archivists (Archive.org team, or archiveteam (they're different)) see this:
Same note in rendered markdown, served over IPFS ;) at:
First off, thank you very much for the hard work you do. We all owe you big time.
I'm the author of IPFS -- I designed IPFS with the archive in mind.
Our tech is very close to ready. You can read about the details here: http://static.benet.ai/t/ipfs.pdf, or watch the old talk here: https://www.youtube.com/watch?v=Fa4pckodM9g. (I will be doing another, updated tech dive into the protocol + details soon.)
You can loosely think of IPFS as git + BitTorrent + DHT + web.
I've been trying to get in touch with you about this -- I've been to a Friday lunch (Virgil Griffith brought me months ago) and recently reached out to Brewster. I think you'll find that IPFS will very neatly plug into your architecture, and does a ton of heavy lifting for versioning and replicating all the data you have. Moreover, it allows people around the world to help replicate the archive.
It's not perfect yet -- keep in mind there was no code a few months ago -- but today we're at a point of streaming video reliably and with no noticeable lag, which is enough perf to start replicating the archive.
We're at a point where figuring out your exact constraints -- as they would look with IPFS -- would help us build what you need. We care deeply about this, so we want to help.
See also the end of https://www.youtube.com/watch?v=skMTdSEaCtA
"Update: Here’s a copy of 17 million torrents from Bitsnoop.com, pretty much the same format but nicely categorized. It’s only 535 MB."
So, that's 17M [magnet]s, which means the archive could grow by some orders of magnitude from its current need of 42K [magnet]s, and doling out a subset to clients still seems quite manageable.
No need for trackers anymore; [magnet]s work fine, and are regularly used to distribute ~1GB torrents (eg: HD TV episodes). Whatever one might think of distributing unlicensed copies of media -- they show that distributing vast amounts of data is quite easy with technology that is readily available and well tested.
Downloading the rarest chunks first is different from downloading only rare chunks - although it's good that some of the infrastructure is already in place.
One idea is to use a custom BitTorrent client which the user can install (a bit like the ArchiveTeam Warrior VM). The user would tell it how much disk space they want to donate, and the client would join swarms over time to fill up that donated space. The user could even join specific swarms representing content that they enjoy and particularly want to save, but wouldn't have to go to the hassle of picking every torrent manually. Also, IA already provides a torrent for every single item in the archive with IA as a web seed, so this could be pretty simple to do from their perspective.
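The selection logic for such a client could be pretty simple - a sketch, where the catalog of (name, size, seeders) tuples is a hypothetical stand-in for whatever IA's per-item torrents would actually report, and a real client would drive libtorrent or similar underneath:

```python
# Sketch of the allocation logic: fill the donated space with the
# least-seeded items first, so the space goes where it helps most.
def pick_swarms(items, donated_bytes):
    chosen, used = [], 0
    # rarest (fewest seeders) first
    for name, size, seeders in sorted(items, key=lambda it: it[2]):
        if used + size <= donated_bytes:
            chosen.append(name)
            used += size
    return chosen, used

catalog = [("item-a", 700 * 2**20, 3),   # made-up items
           ("item-b", 4 * 2**30, 1),
           ("item-c", 2 * 2**30, 12)]
names, used = pick_swarms(catalog, donated_bytes=5 * 2**30)
print(names, round(used / 2**30, 2), "GiB used")  # item-b, item-a first
```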
A distributed or p2p model is a great idea, but it seems very difficult to achieve.
Another one is to pretend that you have lots of chunks, and when someone challenges you to produce one, you quickly download it from another source and present it as if you had it all along. That's what the second paper prevents.
Can torrent clients properly handle being part of hundreds of thousands of item swarms at once?
The idea of doing HTTP over torrent comes up periodically, and nobody yet has answers to these questions. Besides, online access is not the problem: there is plenty of bandwidth, and CDNs already solve the data locality problem.
All FTTH customers in the Netherlands have this, and it's just a matter of time before everyone has FTTH.
Then again, it's probably also a matter of time before abuse of symmetric connections becomes large enough that ISPs start disabling it.
I see your point though, we're nowhere near there yet, especially on a global scale.
One benefit is ownership decentralization of the serving endpoint, which helps in some scenarios (more difficult to censor) and hurts in other scenarios (more difficult to sue).
What problem is this solution supposed to solve? Bandwidth for the type of content you can actually mirror like this isn't a problem and hasn't been for a decade.
Actually, 8 times as fast, or 7 times faster. Unless 128Kbps is 1 times faster than 128Kbps.
Hopefully whatever backup plan they come up with includes a way for me to mail them a spare hard drive and get a drive full of internet archives in return.
The manpower and logistical hurdles of shipping disks around make it seem unlikely that would ever be an option. But perhaps something more local could work: a sneakernet where they partner with nearby universities or libraries with decent connections, which let people bring their machines/disk arrays in for that specific use.
I dreamed last night that an internet dot-com giant donated money to / hosted a mirror of Archive.org. It would definitely be a novel step.
To complete it, just build a web front-end that provides the least-populous BTSync key from the pool. People can then paste it into their client and contribute.
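A sketch of what that front-end might amount to - the pool of keys and replica counts here is entirely made up, and a real version would update counts from client check-ins:

```python
# Minimal sketch: hand out the least-replicated BTSync key from a pool.
from http.server import BaseHTTPRequestHandler, HTTPServer

pool = {"KEYAAAA": 12, "KEYBBBB": 3, "KEYCCCC": 7}  # key -> replica count

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = min(pool, key=pool.get)   # least-populous shard
        pool[key] += 1                  # assume the visitor takes it
        self.send_response(200)
        self.end_headers()
        self.wfile.write(key.encode())

HTTPServer(("", 8080), Handler).serve_forever()
```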
In the wiki there's a mention of how the GeoCities torrent was 900GB and, as a result, very few people seeded it. There should be a way and an incentive for clients to seed parts of files without having the whole thing. As long as there are enough copies of each chunk on the network, it's fine - we don't care whether any one client has the whole thing.
Another cool thing is thinking about health metrics - you don't just care about how many people have a file, you care about recoverability too. This seems a bit similar to AWS Glacier - you have to take into account how long it's going to take to retrieve data. You could have stats on how often a client responded and completed a transfer request, what its track record is, and assign a probability and timeframe for correctly retrieving a file.
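A sketch of one such health metric, under the (big) simplifying assumption that clients fail independently - all the reliability numbers are invented:

```python
# Recoverability sketch: a chunk is lost only if *every* holder fails to
# produce it, and a file is recoverable only if every chunk survives.
from math import prod

def chunk_survival(holder_reliabilities):
    # P(at least one holder responds) = 1 - P(all fail)
    return 1 - prod(1 - r for r in holder_reliabilities)

def file_recoverability(chunks):
    # chunks: one holder-reliability list per chunk of the file
    return prod(chunk_survival(h) for h in chunks)

chunks = [[0.9, 0.6, 0.7],   # chunk held by three clients
          [0.5, 0.5]]        # chunk held by two flakier clients
print(file_recoverability(chunks))  # 0.988 * 0.75 = ~0.741
```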
One thing that comes to mind is that whatever the solution ends up being, it should allow for partial recovery. The example that comes to mind is something like a zip file or encrypted image that is completely ruined when a chunk is missing. So maybe it makes sense for the smallest unit of information to be large enough to still be intelligible on its own?
At first I was thinking of a model where chunks get randomly distributed and replicated, but then that made me wonder whether that's a bad idea.
And then what of malicious clients? I guess it's not hard to have a hash-based system and then just verify the contents of a file upon receiving it. But can I effectively destroy a file by controlling all the clients that are supposed to be keeping it up? How could you guard against that? Could you send a challenge to a client to send you the hash of a random byte sequence in a file to prove that they do indeed have it? What guarantees that they will send it even if they do?
And then what about search / indexing?! P2P search is an awesome problem domain that has so many problems that my mind is buzzing right now. Do you just index predetermined metadata? Do you allow for live searches that propagate through the system? Old systems like Kademlia had issues with spammers falsifying search results to make you download viruses and stuff - how do you guard against that? Searching by hash is safe, but by name / data is not. Etc etc.
I wish this was my job! :)
Permacoin is a cryptocurrency that uses proof of storage instead of bitcoin's proof of work:
Anyhow, read the permacoin paper. It’s pretty cool, and it needs a large petabyte data seed to secure the network. Seems like a win-win to me.
The central server has access to the entire file, and hence can compute the hash of any arbitrary chunk. Challenges/verifications don't have to happen all that often (the paper indicates they are looking at once a month) so creating a unique challenge for each user shouldn't be too compute intensive.
For each known user:
- Central server chooses a random chunk of each file it wishes to verify the user still has. This could be any length from the hash function's output size up to the whole file, and could start at any bit offset.
- Client is asked to provide the hash of the chunk using the file stored locally.
Precomputing hashes for all possible chunk permutations would take up substantially more space than simply storing the file in the first place. A bad actor would need to store the hash for every possible chunk length starting from every possible start location in the file, which is on the order of O(n^2) where n is the stored file size (500GB in this case). For reference, that would be about one third of the entire 20PB archive for a single 500GB chunk if using a 256-bit hash function.
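For concreteness, a sketch of that challenge/response - the file name is hypothetical, and in reality the two chunk_hash calls would of course run on different machines:

```python
# Sketch of the scheme above: the server picks a random (offset, length)
# window, the client hashes that window of its local copy, and the server
# checks the answer against its own copy of the file.
import hashlib, os, random

def chunk_hash(path, offset, length):
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(length)).hexdigest()

def make_challenge(file_size, min_len=4096):
    offset = random.randrange(file_size - min_len)
    length = random.randrange(min_len, file_size - offset + 1)
    return offset, length

size = os.path.getsize("item.warc.gz")            # hypothetical archived item
offset, length = make_challenge(size)
expected = chunk_hash("item.warc.gz", offset, length)   # server side
# ...send (offset, length) to the client, get `answer` back...
answer = chunk_hash("item.warc.gz", offset, length)     # client side
assert answer == expected
```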
1) Someone keeps downloading the full archive and throwing it away;
2) Someone wants a file erased, so they hold it with the intention of denying access at some point in the future.
1) Bandwidth has costs on both sides; just balancing upload among receivers would probably suffice for this not to be a problem.
2) Assigning large random chunks to downloaders should prevent this "chosen-block attack"; add in some global redundancy for good measure and that's probably enough (although I still wouldn't trust this 100% as a primary storage, only as an insurance storage).
This is not a new problem, and most of the obvious solutions work just fine.
A service offering live storage at this (petabyte) scale costs 22,020,096 GB * 3¢/GB/mo * 12 mo = $7,927,234.56 a year. You could build six fully-mirrored copies for about that much, though management of 360 4U machines (30x 48U full-height racks' worth) is a business on its own (see the BackBlaze photos), and you need to deal with drive replacement after a point.
http://www.rsync.net/products/petabyte.html (If you actually sign up for this, be sure to send me that $24,000 referral bonus o_o)
-- Machine Cost --
$300 * 45 drives (8TB each) = $13,500 (drive cost)
+ $3,387.28 (pod cost) == $16,887.28 (unit cost)
/ 360TB = $46.91 per TB
$46.91/TB * (21 PB * 1024 == 21,504 TB) == $1,008,752.64 / 21 PB (for no redundancy)
-- Power Cost --
Let's say the motherboard/controllers draw around 100W continuous and each drive pulls 11W (from my post elsewhere in the comments). This is 2,688 drives, which fit into 60x 45-drive 4U BackBlaze storage pods.
2,688 * 11W = 29,568 W (for the drives) + 6,000 W (for the computers) = 35.568 kW to run a non-redundant 21 PB array.
You'll need at least two network switches and moderate cooling infrastructure as well. The heat from a continuous 35.5 kW draw takes nontrivial effort to disperse, but let's assume that's somehow free for the sake of simplicity here.
35.568 kW * 24 * 365 = 311.57568 MWh (in a year)
I'm sure at this point you might be able to work out a bulk deal of some kind, but at $0.16/kWh you're talking a pretty low cost of $49,852.11 in power every year.
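Pulling the whole back-of-the-envelope together in one runnable place (all figures are the assumptions from this thread; note that rounding up to 60 whole pods lands slightly above the per-TB number computed earlier):

```python
# The 21 PB pod math above, using this thread's assumed numbers:
# 45-drive BackBlaze-style pods, $300 8TB drives, 11W/drive, 100W/pod,
# $3,387.28 per pod chassis, $0.16/kWh.
drives = (21 * 1024) // 8              # 21,504 TB / 8 TB = 2,688 drives
pods = -(-drives // 45)                # ceiling division -> 60 pods
hardware = drives * 300 + pods * 3387.28
watts = drives * 11 + pods * 100       # 29,568 W + 6,000 W = 35,568 W
power_per_year = watts / 1000 * 24 * 365 * 0.16

print(f"{drives} drives in {pods} pods")
print(f"hardware: ${hardware:,.2f}")   # ~$1,009,636.80 with 60 whole pods
print(f"power: {watts/1000} kW, ${power_per_year:,.2f}/yr")  # ~$49,852.11
```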
Building arrays of nonredundant live drives would be significantly cheaper over the long term than Amazon Glacier (which is almost certainly cold storage, not live) at this scale.
The real question is, why do they need this archive to be spinning? What's wrong with tape?
Citation? Not saying you're wrong but I've seen supposedly "in the know" commentary in favor of both the tape-based and some sort of cold disk-based storage. I've been curious and also a bit surprised that the true answer has never really hit publicly although I'm sure it's well-known within some communities.
After refreshing my memory: Amazon outright denied they're using tape backup, some third parties pointed out cold spinning disks wouldn't be very economical, and a few people theorized it's actually BDXL.
Robin Harris has argued in favor of BDXL but he's speculated about a number of things. http://storagemojo.com/2014/04/25/amazons-glacier-secret-bdx...
My money would be on the simplest: cold disk. Some support here: http://mjtsai.com/blog/2014/04/27/how-is-amazon-glacier-impl... but it's all speculative and anecdotal. I don't have a problem with the thought that Amazon is getting out ahead of the drive economics, especially given that I'm not sure how widely used Glacier is today.
I do find it interesting that this has been apparently kept under wraps from anyone who would share more widely.
At this scale, redundancy via 3x mirroring is pretty expensive. You could likely get away with something like chunks of 12 drives in raidz3 (9 data + 3 parity, so ~1.33x raw overhead instead of 3x), especially if you're planning on having at least a few facilities like this. $1m setup + land + a few salaries of upkeep sounds like the kind of thing you could convince a few governments to do.
However, Glacier won't do at all for the live Internet Archive, so it's a question of whether they want this to have live replication or not (if the whole backup is offline, who does your integrity checks?)
I'm actually curious how much money / storage space it would require to print the entire Internet Archive on whatever the current equivalent of microfiche is (anything cheap and more permanent than magnetic or optical storage).
It's an impossible endeavour and will only become more incomplete as the net grows. And 99.999% of the data "backed up" will never be looked at.
What's the point?
IA is one of the most important initiatives/projects on the internet. Links are not meant to change, but they do, and when that happens, knowledge & history erode.
Remember the 2004 Indian Ocean earthquake and tsunami?
It was the first major event that provided mainstream credibility to the blogosphere; before this disaster, consumers turned to news publications for information. For the first time ever, news publications did not have the info and were desperate for content, any content. TV crews flew into the affected areas and started purchasing suitcases/video/still cameras sight unseen off shell-shocked tourists who were too traumatized to remember if they had captured anything.
It was also the last major disaster that occurred before the social revolution - YouTube, Flickr, Facebook, smartphones, and mainstream adoption of broadband. Back then the main way for the general public to share information was via email (which had limits of 10MB) or, if they were technical or knew someone technical, on the 100MB of free hosting provided by their ISP.
Now imagine you're a student, academic researcher, or production assistant, and you have been tasked with researching what transpired. As of right now, the majority of content from 2004/the disaster is no longer available on the internet and is only accessible via the IA.
(nb: I was the guy behind http://web.archive.org/web/20050711081524/http://www.waveofd... which was the first emergency crisis live/micro blog. What Google now has a team of volunteers doing during a crisis, I did myself with a single Pentium 4 CPU, 4GB of RAM, a single IDE hard drive, a ramdrive, a 100Mbit uplink, and a week without sleep. WOD will be relaunching within the next couple of weeks and was one of the sites recovered from the IA.)
There surely is history worth preserving. It's just this "compulsive hoarding"-approach that seems ultimately futile to me.
The internet itself inherently preserves the knowledge that at least one person cares enough about to put it on a webserver. Pretty much everything of even minor relevance should be referenced by Wikipedia by now.
IMHO the internet itself is the "Internet Archive".
To be honest, no. But Wikipedia seems to have that event pretty well covered; http://en.wikipedia.org/wiki/2004_Indian_Ocean_earthquake_an...
And in a lot of cases, the actual citations or references used by Wikipedia are dead links, or have some sort of 'last retrieved at' annotation, or end up at parked/squatted domains and the original content is no longer there to validate the claims it supposedly made.
Offhand I don't know if Wikipedia has a policy on references to Internet Archive / wayback-machine links to things, but imo, being able to point to a snapshot of a reference at the time it was made, as well as the current version if it still exists, is a desirable feature, and well worth striving for.
Internet pagerot/linkrot has caused me grief with bookmarks for things I didn't realise I needed again until years later when they were gone.
They do have a policy. Additionally, IA has been specifically crawling links on Wikipedia to preserve the citations.
> Earn free cloud storage when you allocate unused space on your device for your Symform account. Users that contribute get 1GB free for every 2GB contributed. Contribution can always be fine tuned for your personal needs.
> Extra space is available: Unlike the majority of online file storage services like Dropbox and Box.com, which offer paid plans only as a means to expand storage, Wuala offers you the option to gain additional online storage in exchange for some of your unused local disk space to commit to the network.
There's bits and pieces, but essentially nothing that's open-source and community-supported.
Git annex and BitTorrent would both work with that requirement; most of the commercial, proprietary solutions could not, since their clients wouldn't use them if that data wasn't private.