Thank you for helping us increase our bandwidth (archive.org)
677 points by edward 19 days ago | 199 comments



archive.org feels like an irreplaceable treasure; the Wayback Machine alone is a time capsule of our digital history.

I donate to them monthly and know a lot of other people do as well, so I don't worry much about their financial stability. I'm more worried about external pressures taking content down. I hope the data is backed up six ways to Sunday, and that somewhere there's a plan to make it all accessible if the Internet Archive can't continue to play the role it does.


I've thought for a long time about why I absolutely agree with this sentiment, and the best I can come up with is that the Internet Archive feels to me like the embodiment of the old internet, not the marketing-driven, data-stealing-and-selling, VC-backed cyberpunk-dystopia internet that we have everywhere else.

It's the same sort of feeling as why I enjoy this site, Wikipedia, and so on. As a bonus, the IA is also a bit of an organizational mess, but it rewards the adventurer with rich treasure.

(I was recently on a Korean history kick and came across not just one but several entirely different first-hand books written by visitors in the late 19th and early 20th centuries, scanned in, freely accessible/downloadable in a variety of formats, and with an excellent on-web reader. These books are so out of print that I checked with three local counties for copies and none of them even have references to any of them in their catalogs -- treasure!)


I'm very worried about its backups and what happens when the next Big One hits SF.

As far as I've ever been able to determine from talking to anyone at IA (e.g., Kahle, Scott), they don't really have any sort of backups that could actually be restored from in a disaster situation.


There is http://iabak.archiveteam.org, but it’s not exactly large.


The IA is about 50 PB.

IABAK stores 100 TB, or 0.2% of it.


You can get an 8TB HDD for $150 right now. That's 6250 drives. That's about $1MM in drives, which doesn't sound that cost-prohibitive. Obviously that's not the whole cost, since you need to pay for bandwidth, replication, and other infrastructure like the host nodes, but it sounds like something that could even be hosted by a number of volunteers on r/datahoarder or r/homelab.

I also remember reading about Sia on HN, which is a dapp that pays hosts to store data and distributes it. Looking at the going rates on Sia ($1.45/TB/mo), that's about $870k/yr. That's ~10% of the IA budget (only about $10MM/yr, which sounds very efficient!), but it shows that the order of magnitude is not that crazy.
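
For anyone who wants to poke at the arithmetic, here's a back-of-the-envelope sketch (the 8 TB / $150 drives, the $1.45/TB/mo Sia rate, and the 50 PB total are just the assumptions above, not quotes):

    # Rough arithmetic only; all inputs are the assumptions above, not quotes.
    ARCHIVE_TB = 50_000             # ~50 PB
    DRIVE_TB, DRIVE_PRICE = 8, 150  # 8 TB HDD at $150
    SIA_PER_TB_MONTH = 1.45

    drives = ARCHIVE_TB // DRIVE_TB                     # 6250 drives
    drive_cost = drives * DRIVE_PRICE                   # $937,500 (~$1MM)
    sia_per_year = ARCHIVE_TB * SIA_PER_TB_MONTH * 12   # ~$870,000/yr

    print(drives, drive_cost, round(sia_per_year))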


I think at this scale we're no longer talking about buying drives, but securing a steady supply stream of them. So we're talking not $1MM, but a somewhat safe $1MM per year for it to be even worth considering.


BackBlaze B2 is $5/TB/month

Azure Archive is $2/TB/month ($1.68 if reserved)

AWS Glacier Deep Archive is $1/TB/month

GCP Cloud Storage Archive is $1.20/TB/month

Of course, there can be I/O and network charges, and different levels of redundancy (but possibly bulk discounts)... but the bare storage costs for 50 PB would be roughly $600k - $3MM per year.
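
To make that range explicit, a quick sketch using the list prices above (bare storage only, ignoring request, egress, and redundancy charges):

    # Bare storage cost for 50 PB at the per-TB/month list prices above.
    ARCHIVE_TB = 50_000
    price_per_tb_month = {
        "Backblaze B2": 5.00,
        "Azure Archive": 2.00,            # $1.68 if reserved
        "AWS Glacier Deep Archive": 1.00,
        "GCP Cloud Storage Archive": 1.20,
    }

    for provider, price in price_per_tb_month.items():
        print(f"{provider}: ${ARCHIVE_TB * price * 12:,.0f}/year")
    # Backblaze B2: $3,000,000/year ... AWS Glacier Deep Archive: $600,000/year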


The cloud business is a small fraction of Amazon's revenue but a large part of their profits. It's extremely profitable for them. That's why there is such a large discrepancy between (non-bulk) HDD prices and (non-bulk) per-month archival costs.


Aren't the costs of getting that data out of the backup much larger than the cost of keeping it in the first place, to the point that when you actually need to restore a large backup, it turns out it was better to have been managing it yourself? That's the impression I got from various HN comments on the topic over the years.


Low cost to insert, low cost to keep it there, high cost to retrieve is exactly the combination you want when looking at disaster backup solutions, since you don't intend to retrieve the data frequently. Buy some earthquake insurance (I know, easier said than done) and only pay for 1/20 of the retrieval cost.


AWS has Snowball and Snowmobile, though I've only used the former, to reduce data transfer costs. I don't remember what other savings are in there, e.g. whether there is a price reduction if you use it with Glacier or not.


Isn't that inbound only? Getting the data out again is also required.


You can export data via Snowball as well.

https://docs.aws.amazon.com/snowball/latest/ug/create-export...


The cost is only there if you transfer out of AWS. Something like Glacier will have a retrieval time on the order of hours or days.


Glacier Deep Archive does charge for retrievals, at $0.02 per GB plus an additional $0.01 per 1,000 such requests (both of which are $0.00 for Standard S3). PUT, LIST, and DELETE are $0.05 per 1,000 requests, 10x the Standard S3 rates.

https://aws.amazon.com/s3/pricing/


The majority of cloud storage costs are bandwidth, so ignoring this makes the analysis meaningless.


Would it remain feasible as they scale up? Their content is growing faster and faster so the number of drives would have to rise every single month, probably by several dozen even.


> That's about $1MM

Is that a "million million", e.g. 10^12?


Some industries use M for thousand, so MM is a thousand thousand, i.e. a million. For example, advertising uses CPM (cost per mille), the cost per 1,000 impressions.

In this case, 6250 * $150 = $937,500, almost a million.


It definitely stalled out. It needs a Windows version to get real traction IMO.


If I'm reading that correctly, it would only cost a bit over 500 bucks a month to host that whole archive on BackBlaze B2.

Furthermore it would not be so hard to translate Archive.org items to IPFS objects, if there were an effort to pin a significant number of them to storage and network.


Since the numbers are not 100% clear...

50 petabytes * 0.2% = 100 terabytes

Backblaze B2 costs $0.005 per GB per month

50 petabytes = 50,000,000 gigabytes

50,000,000 * $0.005 = $250,000

—————

Meaning, based on my numbers, it would cost $250,000 USD a month to host 50 petabytes of data on Backblaze.


That's a fraction of the AWS bills for many startups arguably doing absolutely nothing


They have VC money to burn. Archive.org doesn't.

Also, those backups would be (relatively) cheap to keep, but not necessarily to restore.


I would guess restore wouldn't be a problem. AWS or whoever would do it for free given it is a non-profit (in case of a disaster only, of course).


The effort is the issue here. There was this comment back when IA.BAK was in the design phase: https://news.ycombinator.com/item?id=9148576 And then this is all there was to show for it: https://www.archiveteam.org/index.php?title=INTERNETARCHIVE....


Yeah, I noticed that being a problem. My biggest problems with IPFS are hardly mentioned in their updates, and it's hard to tell if they have any interest.

I had constant issues with objects simply never being found (hours and many requests), despite being pinned in several places; and the daemon sucked resources away from the system at an alarming rate back when I was trying it properly.

With all that being said, maybe now is the time to look at it properly; the budgets are there. Maybe Juan (_prometheus) can find somebody to at least PoC this important application.


You're reading it correctly, but IABAK backs up 0.2% of the Internet Archive.


Do they have distributed backup?

Surely a lot of people would be happy to donate unused space on their drives (I know I would) - especially since backup wouldn’t use much bandwidth...


I wonder what the cost would be to build out a secondary backup site on a different continent.


Whatever happened to their copy in Canada?

https://blog.archive.org/2016/12/03/faqs-about-the-internet-...


Without going too much into it (not my place), it exists.


Thanks, I'm glad to hear it! It sounds like IA is aware of the need to have multiple approaches (and places) to back up as much of the data as possible.


> archive.org feels like an irreplaceable treasure

Yes, I save some dead sites that I think are critically important resources to tarballs that I store separately in case the Wayback Machine disappears.


While that may work for your personal use case, there is a significant issue around trust if you were to share these tarballs with anyone else. I wouldn't trust you to give me originals in the same way I trust the Wayback Machine. So even if there are backups lying around here and there, their utility is decreased a lot.


By your philosophy, just get off the internet. At some point you gotta trust HN, your keyboard driver, CPU firmware, motherboard, ISP, Google or whatever service you're using. I am not even mentioning your mailman and your amazon delivery driver.

Your stance is pretty extreme - please don't discourage people from making backups.


I don’t think their stance is as extreme as you make it seem. Insert their username into the same list you just provided:

im3w1l, HN, your keyboard driver, CPU, firmware, motherboard, ISP, Google

One of these things is not like the others.


That's useful for personal use. I think that the IPFS is designed for this purpose as well, but meant to be accessible to anyone.


Well the idea is I can provide it publicly (not sure how... don't know how they get the copyright to do it) myself in the future if needed.


They have lots of copyrighted content, such as, it seems, virtually every commercial retro video game ever made.

I personally think it’s great but surely companies aren’t too pleased? How does archive.org avoid being sued into oblivion?


As far as I know, they limit public access to basically any material with a complaint or request, but keep copies. They may do more, but that seems to be the default response.

Which seems smart enough: minimize litigation costs while not losing content permanently via court orders, DMCAs, or similar threats.

But it's frustrating for the Wayback Machine, where sites like Snopes and some newspapers have opted out of having their history published (after being accused of ghost edits to articles).

I don't know what the right answer is, but even if they didn't display the page, it would be nice to see the diffs (à la Wikipedia), or at least show if and when changes to a page were made. Maybe that's beyond their mission scope, though.


> sites like Snopes and some newspapers have opted out of having their history published (after being accused of ghost edits to articles)

Is there any way to interpret this behavior as anything other than straight up admitting to being dishonest? Especially in context of accusations that could be easily disproved using the Wayback Machine... unless there is something to those accusations, that is.


There are exemptions to DMCA for certain content.

> Computer programs and video games distributed in formats that have become obsolete and which require the original media or hardware as a condition of access. A format shall be considered obsolete if the machine or system necessary to render perceptible a work stored in that format is no longer manufactured or is no longer reasonably available in the commercial marketplace.

https://www.copyright.gov/1201/2003/index.html


That doesn't seem to stop Nintendo from shutting down ROM sites left and right. For example, emuparadise has complied and taken down all ROMs: https://www.emuparadise.me/emuparadise-changing.php

But here is a complete NES ROM set, just sitting there, on archive.org: https://archive.org/details/NESrompack

I just find it strange. Nintendo is vehement about their IP. For archive.org to get an exception doesn't make sense to me.


(IANAL) Circumventing DRM is usually illegal under DMCA. That quote allows circumventing obsolete DRM. NES cartridge ROM data is still covered by copyright though.

I don't know how archive.org gets away with that.


You raise good points. I wonder what a copyright lawyer would have to say.


It's a non-profit library and they don't publicly provide a way to download most of the copyrighted content, so rights holders aren't too concerned. I imagine most people, even copyright lawyers, personally support the free archiving of everything made in the current age as long as it doesn't detract from current business operations and the copyrights are respected for their duration.


It's very easy to download the copyrighted content. For example, here is a full NES ROM set: https://archive.org/details/NESrompack


It’s the only organization I actually donate to. Partially because it was so easy to set up a monthly donation ;)


They could set up a torrent that people could download parts of; I think it's an ideal system for a distributed backup.

But it probably needs tweaking for that purpose: for one, you need to ensure that the data is evenly distributed, and second, you're dealing with data that is appended to regularly.

But I think it can be done.


It exists! Archive Team had a project; you can see its status here: http://iabak.archiveteam.org/

Here's an overview: https://www.archiveteam.org/index.php?title=INTERNETARCHIVE....

Sadly, it seems the project has fallen into disrepair.


There are only two sites I donate to: Wikipedia and archive.org. I honestly don't know what I'd do without them. I have been able to find websites from 20 years ago on archive.org; it's an absolute treasure.


They should team up with Backblaze...


60 Gbit/s of continuous traffic is a lot. If I'm reading the graphs right, Wikipedia "only" has 13.4 Gbit/s of outbound traffic [1]. That's of course still well below the single-digit Tbit/s traffic of large internet exchanges [2][3], but still unexpectedly large.

[1]: adding up the outbound numbers for each datacenter: 1.888 + 8.003 + 0.810 + 1.958 + 0.807 Gbit/s: https://grafana.wikimedia.org/d/000000605/datacenter-global-...

[2]: https://en.wikipedia.org/wiki/List_of_Internet_exchange_poin...

[3]: Not 100% correct, as thanks to Corona, ix.br has now passed 10 Tbit/s peaks. https://ix.br/noticia/releases/ix-br-reaches-mark-of-10-tb-s...


60 Gbps is about 1/3 of the traffic served by a single Netflix CDN node.


It's harder when you don't make money, and everyone's not downloading the same things.


I understand money is tight for a non-profit, but why are bandwidth costs a gating factor?

Both your upstreams, Cogent and Hurricane Electric, offer 100G ports at fiveish grand per month in carrier neutral DCs. Given that your budget is in the millions, an outlay of this magnitude doesn't seem wildly out of the question.

If you can explain what the problems are in getting more bandwidth, I'd be more than happy to see what I can do to help.

Please let me know if you'd prefer to discuss the matter over email.


Happy to chat in much greater detail over email if you like (in my profile) -- transit specifically is not generally our limiting cost (though of course in the context we operate in, every penny counts -- e.g. we run our own DCs, ambiently cooled, single-feed grid power with minimal backing, etc). We generally run very close to the edge, capacity-wise, in order to get the most out of what we have -- in this case, demand moved very quickly and we had to work a hardware deployment to catch up.


I'm sure I'm not the only one who'd be interested in reading more of this. Is there a reason your conversation can't proceed on HackerNews?


Or at least update us on the conclusion :)


You can follow the thread on the NANOG list:

https://mailman.nanog.org/pipermail/nanog/2020-May/107720.ht...


Email sent.


You are comparing it to one of the largest bandwidth-using services worldwide. Of course archive.org will lose out.

I should have added: 60 Gbit/s is a lot for a donation-run service like archive.org. I'm not sure if there are donation-run services on the internet with more traffic.


> single Netflix CDN node.

Looks like they max out at 100 Gbps, or as low as 40 Gbps, depending on appliance model and link aggregation configuration. No argument either way, just thought it was cool info.

https://openconnect.zendesk.com/hc/en-us/articles/3600345383...


That's out of date. Here's a presentation from last year where they talk about >190 Gbps: https://people.freebsd.org/~gallatin/talks/euro2019.pdf


The video accompanying this slide deck is also a great watch: https://www.youtube.com/watch?v=8NSzkYSX5nY


I'd say the "bytes" being transferred are very different.

Netflix has a (relatively) small number of large content pieces, and a CDN node would hold the hottest items.

IA has 430 billion individual pieces of data, on average much smaller than a piece of Netflix content.

So each "byte" transferred from IA is "more work" to produce, and less likely to come from some kind of cache.


Does archive.org use a CDN, or is it all served from one location?


Source?


Wikipedia is mostly text with some images. The Internet Archive offers everything from websites and text to movies and games, which obviously are much larger and don't compress as nicely either.


I wonder what the comparison between served data is for IA vs Wikipedia - anecdotally, I feel like most of the data for Wikipedia is text based (plus some number of images), whereas IA includes (as mentioned in the article) audio, video, etc.


They increased the bandwidth, and traffic immediately filled the new capacity. While this could just be a faster website attracting more users (or users who were previously giving up due to slow loads), it could also indicate that they have a lot of automated traffic that will consume whatever resources are available and could be throttled without endangering their mission.

I hope they'll look into it, find a way to identify that traffic, and then (ideally) classify it accordingly and continue to serve it only after all human traffic has been served (i.e. if they run into a bandwidth crunch due to bots, only the bots suffer, and they don't need to upgrade as quickly).


It could also be that they are limited by capacity. For me the Wayback Machine is always slow, although my bandwidth otherwise hasn't been a bottleneck. Perhaps the increased capacity is getting utilized by both increased usage and higher performance of their services.


We don't save logs of who hits us, but we watch who hits us. It's not bots. But good thinking.


How do you know? Not doubting you, just curious as to how you figure out if a request is from a bot or not.


Well, obviously, a sneaky bot is a sneaky sneak and acts like a person. But conversely, people act in a way that could be like a bot if anyone casually looked at them - mass downloads, no theme or meaning to content. In general, however, it's all pretty much people. Millions of people a day.


The Wayback Machine has been essential through COVID, as many governments just publish the numbers and data "of the day", and the only way to compare with the day before is to look at the IA.


Governments aren't the only ones overwriting old information with new either. The BBC has developed an annoying habit of completely overwriting old articles about Covid-19 with new, semi-related ones, meaning that the only way to see what they were saying earlier in the pandemic is through sites like the Internet Archive.

There's also at least one outright correction that only seems to exist on the Internet Archive now: https://web.archive.org/web/20200428214306/https://www.bbc.c... "Correction 25 April 2020: An earlier version of this article incorrectly said that France had conducted just under 140,000 tests a day by 21 April. The figure of just under 140,000 refers instead to the number of tests it had carried out weekly."


Here's another one: "some of our old tweets were wrong so we just quietly deleted them lol":

https://twitter.com/voxdotcom/status/1242537366620966912


Twitter ought to have some kind of strikethrough feature. Allow users to mark that they no longer stand behind a tweet without completely deleting it.


What would be the incentive to use that feature over just deleting the tweet?


There wouldn't be; therefore tweets shouldn't be deletable after a set period of time.


Given that a British "journalist" lost her house to legal fees after libelling someone and refusing to take the tweet down, I think there are legitimate reasons to want to take down libellous, abusive, dangerous, or wrong tweets without having to sue Twitter.

("Journalist" is in quotes because, while that was her job title, she specialized in abuse, libel, smears, and calls to murder, and I'm glad she lost the libel case)


I'm not sure I like this for most users who don't have any kind of following or well developed sense of what is and is not appropriate to post.

I do like this a lot for high profile accounts. It seems fair that if you're verified and have a very large number of followers, you are operating at a privileged level of impact and should lose your right to delete any tweets.


Integrity or the appearance thereof -- showing that you're not trying to cover up your mistakes.


This is necessary on Twitter. It's easy for fake or incorrect news to spread, you can't edit it and updates/self-replies are often hidden.


Sure, but they should also own up to their mistake. They flat out said "It won't be a deadly epidemic".


They made a tweet to explicitly say they deleted it because it's wrong. How is that not "owning up"?


Properly owning up would include keeping the mistake up, so that people reading the apology would actually know what they're owning up to. Without the original Tweet, they might be apologizing for mistyping "2+2=4" as "2+2=5", for all you know.



Do you have any examples of government sites that are doing this?

I have a side-hobby of setting up scrapers which pull scraped data into a git repository, precisely for this kind of thing. I'd be happy to set a few up.

Some of my posts about this technique (which I call "git scraping"): https://simonwillison.net/tags/gitscraping/
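
A minimal sketch of the pattern, in case it's unclear (hypothetical URL and filename; assumes the script runs from cron inside an existing git repository):

    #!/usr/bin/env python3
    # Minimal "git scraping" sketch: fetch a page, write it to a tracked file,
    # and commit only when the content actually changed, so git history
    # becomes the archive. URL and filename are placeholders.
    import subprocess
    import urllib.request

    URL = "https://example.gov/daily-numbers"   # hypothetical source page
    OUT = "daily-numbers.html"

    with urllib.request.urlopen(URL) as resp:
        data = resp.read()
    with open(OUT, "wb") as f:
        f.write(data)

    subprocess.run(["git", "add", OUT], check=True)
    # Commit only if the staged file actually differs from the last commit.
    if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
        subprocess.run(["git", "commit", "-m", "Update " + OUT], check=True)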


Not sure if you’re asking for:

(1) US coronavirus data in general,

(2) examples of sources that do not log prior days' data,

(3) sources that “edit” prior data without noting the edits, or

(4) something else.

If (1) this page in the table under the column “sources” links to where the data came from:

https://www.worldometers.info/coronavirus/country/us/


2 and 3.


The Government of Ontario has been doing it. They now have some APIs that link to the number of cases per day, but for the first month of the pandemic they were literally overwriting the page every day, and if you look on GitHub there were tons of researchers who came up with their own parsers in order to save this data before it was deleted.

I've also been running bash scripts every day to save an archive of some of those pages https://github.com/jeromegv/covid_data


For the US at least, the COVID Tracking Project records multiple screenshots per day of the official websites where those numbers are published.

https://covidtracking.com/


Wikipedia is a good source too, at least for the countries I've been following.


Crazy idea: why doesn't ICANN hand over the running of .org to archive.org? It's a bit outside their expertise, but they do have know-how on running tech at scale, and the extra revenue could be used to fund a very good cause.

Also, I'd trust these guys not to mishandle .org

(yeah, I know it's not going to happen)


> Also, I'd trust these guys not to mishandle .org

As much as I like the idea of running a TLD, if anyone gives me a TLD, I'm gonna put an MX record on it and be jonah@org. That's a promise.

EDIT: To be fair, I would probably be overwhelmed with the usual sense of responsibility and not do this. But, the temptation


Would be a nightmare with all the crappy email validators out there...


I recall a similar story involving amateur radio. By ITU convention, amateur callsigns are [prefix][digit][suffix], like W1AW.

So one night, a new ham is operating his friend's fancier station, which has the HF equipment to talk clear around the world. And they make contact with another ham identifying himself as JY1. No suffix. Well that's weird, so they look it up, the JY prefix is Jordan, okay. So the guy on the mic just asks, "why doesn't your callsign have any letters after the digit?" and the reply comes back with a chuckle, "Oh, because I am the King."

That was the late King Hussein, about whom much has been written. I don't know if his call ever caused trouble with software validators, but I'd certainly believe it.


I know of a callsign causing trouble with most log programs - RAEM, which Russian amateurs dust off every now and then to commemorate polar pioneer Ernst Krenkel.

Took a while to get ARRL Logbook of the World credit for that one.


I wish that Archive.org would start collecting funds to finance an external backup site. I strongly feel that archive.org is an irreplaceable treasure and that making sure that it would resist natural disasters should be a priority...


That doesn't require a special fund; just donate to them. They'd make backups if they had enough donations to cover it.


I actually already donate about 250 USD a year to them... but would much prefer if there were a special fund I could donate to that would make that happen in the future.


I donate to the Internet Archive (recurring). I just hope they stick to their core mission, and don't get sidetracked like Wikipedia has. They offer a valuable service and I believe will help historians in the future understand this point in the internet's life.

PS - Don't forget to pick an Amazon Smile charity and use the Smile.Amazon sub-domain. Donated almost $20 to the EFF that way.


> They offer a valuable service and I believe will help historians in the future understand this point in the internet's life.

It's also essential for non-historians. A huge percentage of the info people need just to do their jobs or pursue their hobbies is only found on people's personal webpages, and those presumably all go away once the person dies.


How has Wikipedia got sidetracked?


Most of Wikimedia's budget goes towards non-Wikipedia-related expenses: https://www.washingtonpost.com/news/the-intersect/wp/2015/12...


I believe the author is referring to the phenomenon of Wikipedia losing core contributors due to poor management. https://en.wikipedia.org/wiki/Wikipedia:Why_is_Wikipedia_los...


They already kind of have, no?

They went from archiving the web (simple websites), itself a huge feat, to archiving a bunch of videos from justin.tv as long as they have 10 views?

There are more and more cases of websites shutting down, and the IA feeling the need to archive everything.

Do we really need it?


Why don't they put their data dumps in an SQLite database indexed for full-text search? https://sqlite.org/fts5.html

Then put the file in a torrent. Let the users seed it.

Users can use sqltorrent (https://github.com/bittorrent/sqltorrent) to query the DB without downloading the entire torrent - essentially it knows to download only the pieces of the torrent needed to satisfy the query.

Every time a new dump is published by internet archive, the peers can change to the new torrent and reuse the pieces they already have - since SQLite is indexed in an optimal way to reduce file changes (and hence piece changes) when the data is updated.

I talk a bit about it here: https://medium.com/@lmatteis/torrentnet-bd4f6dab15e4

It would save the Internet Archive lots of bandwidth and hassle.
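
A rough illustration of the SQLite side of that idea (the schema and identifiers here are made up; sqltorrent's job would just be fetching the pieces of the .sqlite file that a query touches):

    import sqlite3

    # Hypothetical index: searchable text plus a pointer (item_id) to the
    # torrent/item that holds the actual content.
    conn = sqlite3.connect("ia_index.sqlite")
    conn.execute("""
        CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(
            url, title, body, item_id UNINDEXED
        )
    """)
    conn.execute(
        "INSERT INTO pages (url, title, body, item_id) VALUES (?, ?, ?, ?)",
        ("http://example.com", "Example", "an archived page about examples", "item-0001"),
    )
    conn.commit()

    # A reader going through sqltorrent would run a query like this; only the
    # index pages the query touches need to be downloaded.
    for url, title, item_id in conn.execute(
        "SELECT url, title, item_id FROM pages WHERE pages MATCH ? LIMIT 10",
        ("examples",),
    ):
        print(url, title, item_id)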


A 50PB sqlite database? I don’t think any filesystem would be happy with that.


Indeed. But nothing stops them from storing torrents of torrents. The search index would be just that, the text index, which would point to another torrent storing the actual content. Assets would point to yet other torrents.

It would be interesting to learn how it's currently partitioned. I would mimic the same partitioning system but use torrents instead, so users can help with hosting. And use sqltorrent to serve queries efficiently.


Seems that you're being downvoted at the moment because it looks like you're making fun of archive.org, but that article is actually rather interesting.

How does this compare to something like IPFS?


I clicked this article to look for ways to contribute... bandwidth. But there is no torrent/IPFS-like method to help, just cash. I'd be happy to run a service that would help Wikipedia, archive.org, DDG (any service I believe in) decentralize. I always keep my Ubuntu torrents open for launch days and sometimes beyond. I have a server running and I don't need my 50/50 fiber all the time.

If I were a better software dev, I'd try to make a daemon I could feed a list of websites I want to support with my cache, one that can be updated (like the DNS system?). Like a Folding@home for bandwidth.


> just cash

This is almost always the best and most efficient way you can help a cause. You donating infrastructure means donating them extra work - they'll need to integrate and keep track of it, as well as manage a relationship with you (in particular, a risk of problems with you or your service). Meanwhile, with extra cash, they can buy what they need and what integrates well, or pay the most effective specialists they need.

It's a universal principle. Cash is the best gift, if you're gifting to help. That's why e.g. the Red Cross frowns at people donating stuff - it creates a huge logistical problem for them as well as depriving them of the opportunity to boost the markets in the disaster-struck area. Or why it's better for you to donate money to your local homeless shelter rather than volunteer to work in it - if you work your job for extra hours instead and donate that money, they'll hire workers better skilled for that task.


But I have bandwidth to share, not cash. I'm already at the lowest subscription level with my ISP. I understand the argument, but if there was a general technical solution this would be nice, right?


That's true. Does such a solution exist, though?

I had some hopes for Filecoin (i.e. a P2P system with monetary incentives attached to backing up other people's data, from one of the very few groups in the crypto space that don't look like scammers to me), but I haven't heard anything about it in a while.


I used to participate in such a service from Lacie (or was it Lacie?), I shared 1 TB, got 1 TB in return "cloud" storage (I think multiplied by the time you were online actually), which was in fact, distributed (after local encryption) over many other online PCs. I ran it on my server but one was not even obligated to keep their PC on. It worked "meh" (it was all java based with a meh UI) and they discontinued it.


If you have bandwidth to share but not cash, use the bandwidth to promote sending cash to the archive. :)


Hey, we are working with Filecoin.ai on a pilot program where their new miners in Filecoin Discover will host some of our open data sets, like Prelinger Films and .Gov web pages.

It will be an additional backup to our copies. We're hosting a DWeb Meetup to share the latest in Filecoin and Storj (two decentralized storage providers we're experimenting with).

It's Wed (tomorrow) 5/13 at 10 am Pacific, 5 PM UTC https://www.eventbrite.com/e/dweb-meet-up-virtual-decentrali...


Sorry, "we" being the Internet Archive.


Oh cool! Will look into this!


Yeah, I am in the same boat - I have boatloads of unutilized bandwidth and storage space that I would love to contribute to a worthy cause like archive.org.

Ideally, I would like to be able to point at some RSS feed of torrents of relevant archive.org collections (whatever the curators believe popular enough to benefit from peer-to-peer distribution). I could then set a torrent client to download and seed everything on this feed.


A suggestion around donations: see if your employer will match yours; that way you can double the contribution: https://help.archive.org/hc/en-us/articles/360002059712-How-...


Archive.org works surprisingly well as a general purpose web proxy. Just prefix the URL, e.g., http://example.com, with https://web.archive.org/save/, e.g., https://web.archive.org/save/http://example.com

The aesthetic intrusiveness of the archive.org header and footer are minimal since I use a text-only browser that has no Javascript engine.

Sometimes I get "This url is not available on the live web or can not be archived." However this happens for only a surprisingly small minority of websites.

Rarely, I find that /save is unsuccessful, in which case I can still find past copies using something like:

   # list all snapshots as "timestamp original-url" lines via the CDX API
   curl -o 1.txt "https://web.archive.org/cdx/search/cdx?url=http://www.example.net&fl=timestamp,original" ;
   # keep only those lines and rewrite each into a full Wayback replay URL
   sed -i '/^[12][0-9]* h/!d;/^[12][0-9]* h/{s/^/http:\/\/web.archive.org\/web\//;s/ /\//;s/\r//;}' 1.txt
The limitation with past copies versus /save is that archive.org will not usually crawl past page one on websites with many successive pages, e.g., http://example.com/?page=2, http://example.com/?page=3, etc.
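
A rough Python equivalent of that shell pipeline, in case the sed is hard to follow (same public CDX endpoint, example.net as a placeholder):

    import urllib.parse
    import urllib.request

    # List Wayback snapshots of a URL via the CDX API and build replay URLs.
    def wayback_copies(url):
        query = urllib.parse.urlencode({"url": url, "fl": "timestamp,original"})
        cdx = "https://web.archive.org/cdx/search/cdx?" + query
        with urllib.request.urlopen(cdx) as resp:
            lines = resp.read().decode().splitlines()
        # Each line is "<timestamp> <original-url>".
        return [
            "https://web.archive.org/web/{}/{}".format(*line.split(" ", 1))
            for line in lines
            if line.strip()
        ]

    for snapshot in wayback_copies("http://www.example.net"):
        print(snapshot)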

Has anyone ever considered mirroring archive.org, or parts of it, to other geographic locations?

Could this be done? Why or why not?


I use a custom browser keyword search to find existing archived pages before saving one, personally. E.g., ar <URL> for:

  https://wayback.archive.org/web/*/%S
I'd imagine it would be useful for IA to implement some message for scenarios where a page has already been saved within a certain timespan, providing both a link to the already-saved version and an offer to save it again. This would mitigate the mass saving of an identical page that can occur when some popular link is accidentally shared with the /save/ URL instead of the static URL, or when it's a popular page that people want to archive.

Archive.is displays such a message (to the effect of, 'this page was archived <date>, if it looks outdated click save') and also redirects to the most recent copy.


That's a good point. I mainly use it for browsing websites that change daily, as well as ones with many successive pages, e.g., 1, 2, 3, etc., that IA does not automatically crawl. Thus, there are not many existing copies, if any.

HAProxy changes the Host header and modifies the URL. I can either use the text-only browser's http-proxy option or direct the request to the web.archive.org backend by adding a custom HTTP header to the request. If I am not mistaken, the so-called "modern" browsers do not have a built-in capability to add headers.


I'm really surprised they don't use a CDN more for this. Does anyone know the reason why it isn't served by something like Cloudflare?


Please don't make the whole internet basically Cloudflare. They've banned my VPN endpoint (a Hetzner server), and as a result a huge chunk of websites already don't work for me despite my having done nothing wrong. I've heard reports of Tor users being restricted as well.


It's more likely that website owners are blocking the ASNs of hosting providers since those are often used for content scraping and exploit type attacks [since they're cheap]. There are public and private lists of hosting provider ASNs you can use to block them all if you generally only want visitors with a business/residential IP.


It is usually nothing specific to your VPN setup or IP. It is mainly either for (D)DoS protection or content-licensing reasons (VPNs are an easy way to bypass geo-locked content). Many CDNs and services will only allow access from non-commercial IP ranges.


Okay, any one of the other CDNs then. It seems like the bandwidth costs and equipment are prohibitively expensive for them. There must be a reason they don't use one, which is what I want to know.


Any of the other CDNs are absurdly prohibitively expensive, pay-per-GB, etc.

And would still have to download from their source.


So many of the VPS companies have bots galore doing credit card attacks on smaller sites that it’s very easy and convenient to simply block those VPS providers.


I don't foresee use of a CDN saving them much or any money. Cache hit ratio would probably be relatively low and the amount of storage required at the edge to cover an appreciable portion of traffic would be very expensive. E.g. their Akamai bill would be in the five figures per month if not six figures, depending on what kind of volume discount they could arrange, just based on outbound bandwidth. They're not the cheapest but also not the most expensive.

Serving up that large of a media library at mid scale just isn't really a great use case for a CDN, those that have to due to transit costs becoming truly enormous (e.g. Netflix) make an enormous investment in hardware at the edge that probably isn't affordable to IA (or necessary at this point).


Sibling comments are exactly correct -- we're a library. We don't track or record the activity of our patrons (see e.g. https://archive.org/services/docs/api/views.html#footnote-wh... ), and since most CDNs cannot offer the same guarantee, we can't use them.


At the bandwidth levels they are using they would need to use Cloudflare Enterprise, and in my experience that is way more expensive than other CDN providers.

Also, since archive.org has so much content, the caching ratio is going to be very bad and kill CDN efficiency while still requiring lots of direct bandwidth. Cheap direct bandwidth in their case looks best (which is what they seem to be doing).


Another aspect is that they are actually using their bandwidth to the max and have a cap on usage spending. No CDN I know of allows capping bandwidth usage.

The way they use bandwidth means the most efficient GB-per-dollar cost, although at the cost of poor performance.


Cloudflare does not have bandwidth limits.


They serve everything in-house and are their own AS; this allows them to provide stronger guarantees of reader privacy than sites that use a CDN.


> They serve everything in-house and are their own AS

Forgive me, but what is an AS?


Autonomous System. https://en.wikipedia.org/wiki/Autonomous_system_(Internet)

An AS is an entity significant enough to be reasoned about at the BGP level.


Thanks!


The geographic distribution benefits for archive.org seem minimal to me - for the actual content high latency isn't a big deal.


Geographic location affects far more than just latency. Even though you may have a 100 Mbps link on both client and server, only 10 Mbps might make it across the Atlantic Ocean. Connections between countries/ISPs are not unlimited in capacity!


It's really beneficial for those outside of the US.


It could be philosophical


That's what I'm thinking.

If they wanted to use Cloudflare, I'm pretty sure they'd jump at the opportunity.


The article doesn't mention why archive.org demand went up so much after COVID. Does anyone know?


I don't know if this explains all of it, but they launched a National Emergency Library, where they pooled the DRM'd versions of ebooks from lots of libraries and removed one little bit of the DRM restrictions. http://blog.archive.org/2020/03/24/announcing-a-national-eme...


I'm not sure, but this article suggests people might just be online more and reading more?

http://blog.archive.org/2020/05/11/what-it-means-to-be-a-lib...


Not just reading; there's a general increase in internet use. For example, they have a lot of old archived games, and many official COVID health pages only track current numbers, so the Wayback Machine is the only way to track first/second-derivative data points, etc.


Mixed feelings here... I love the IA but I feel people should not use them as a streaming service. Their mission is preservation, and access is obviously part of that, but I would feel bad hammering them every day for general TV watching purposes.


I assumed that their bandwidth needs were due to their web crawler rather than serving cached content. I guess I’m basing that assumption on my own usage patterns.


Maybe my sense of scale is warped but I’m amazed at how little BW the archive uses. Since this data appears to be averaged over a 1 minute window, I wonder what their p99 usage is. The article says they bought a 2nd router. Were they only using a single router before (no redundancy or protected BW)? I realize the Archive probably has a small budget but one can buy a 3.2T switch relatively cheaply these days.


What's the endgame for archive.org? Are they expecting for there to be a breakthrough in storage technology?

I read somewhere that data creation is exceeding storage solutions' pace. Is this true?

What about a mesh of some kind where every person who installs an application hosts bits and pieces of random data and serves it to whoever asks for it?


What if there was a feature to load the pages without images? I use IA a lot but I rarely need the media stuff, at least initially. Speaking for myself I’d be happy to get the page without images and then click to optionally get the images. (Lazy lazy loading)


Inline images (even across all Wayback Machine usage) are likely a tiny factor, compared to the rich-media downloads (audio, video) dominating this usage.


Hey, for those who would like to help the Internet Archive with their excess storage, we are ginning up a project with Filecoin.ai to store some of our open data collections like Prelinger films, .Gov data, and some audio and texts.

You can get an update on decentralized storage at our DWeb Meetup tomorrow (5/13 at 10 AM pacific, 5 PM UTC)

https://www.eventbrite.com/e/dweb-meet-up-virtual-decentrali...


Not trying to brag or virtue signal here but this is one cause I really feel great supporting. This is potentially our archive of human history and collective knowledge.


Sometimes I wonder if people keep track of internet history. For example, we only have writing about Rome and the Roman Republic because of Cicero; otherwise we wouldn't be able to comprehend it. I think the modern-day Herodotus is this website right here (assuming it will survive centuries, hopefully).


Kinda off-topic, but I noticed that it seems the owners of the sites can ask Archive.org to remove existing archives. I've lost a few I "created" myself this way.

Not to say it's not reasonable, but how do we work around this? Are there any alternative services that are more "resilient" in this regard?


For the video and audio content, could they take advantage of something like WebTorrent?


Every item has a torrent file, and can be retrieved with a BitTorrent client.

https://help.archive.org/hc/en-us/articles/360004715251-Arch...


Brilliant!


We need an open source competitor of archive.org using IPFS technology


Or an IPFS transfer/mirror mechanism built into archive.org.


I'm curious about the outgoing bandwidth, for crawling :) !


What a refreshingly simple and wholesome announcement.


What's the easiest way to seed the most popular content ala torrent?


What if every item on archive.org also had a webtorrent? I know that WebTorrent leaks your private IP, but if an item is requested a lot, then it would at least save them bandwidth and be shared amongst the people requesting the content.



I would love to donate bandwidth/storage. I haven't a clue how. I wish there was software I could throw in my server, set how much storage and bandwidth I can donate and run 24/7.


Every item on archive.org has a bit-torrent file, I believe. This doesn't solve the problem of working out which items are the most popular, but if you can figure out that then you would be able to manage storage and bandwidth. I suppose the real question is: how many other people use bit-torrent to download from archive.org, as opposed to direct downloads. I (sadly) suspect it's a very small fraction.

Semi-unrelated, but if you're looking for ways to help and have a spare server, Archive Team [1] is always looking for additional capacity. Although Archive Team != archive.org, they do grabs of at-risk content which (almost always) get uploaded to archive.org. [Disclaimer: I help out with various Archive Team projects, the most recent of which was the backup of Yahoo Groups.]

[1] https://www.archiveteam.org/


This should be the future of internet media as a whole. You set your limits, billions of people do the same, and suddenly there's no more single point of failure. Someday.


Donated.


I have revoked my donations until they close their pirate library. Sorry but let's call a spade a spade. They admit:

> multiple readers can access a digital book simultaneously

with the only caveat being that it is borrowed for two weeks. But let's face it, most of the value of a book comes from its first reading -- and what stops you from "borrowing" it again, anyway?

It's been a gigantic disappointment for me to see them do this and not back down, but instead try to placate the authors with weasel words and an opt-out.


I’m sorry to hear this. I will make an additional donation on your behalf. I can appreciate how authors feel, but they must also realize copyright laws are asymmetrical (tilted heavily towards copyright owners), and in these times many have no access to their local public library. They are essentially robbed of access to the book materials their tax dollars (which pay authors) have paid for during this time.

It’s been a gigantic disappointment to see authors respond negatively to this effort (considering this once-in-a-century event), and I will never buy a book again from one of these authors.


I also find it amusing that software developers no longer recognize how we are on the same side in this fight. Imagine if we were still selling boxed software and they decided that in light of COVID-19 they'd just hand it out with a two-week limit. It's an imperfect comparison, but the current situation is not a reason to just discard the law. It might not be the best law; the place to fight that is in Congress, not on the backs of authors, most of whom are not in the best financial state, to say the least.


You assume authors are losing revenue from this effort. It is likely this revenue would never have been realized regardless of the Archive’s efforts. A piece of content copied doesn’t mean someone would’ve paid for it.

As an aside, many SaaS products have given away their product for free due to COVID and widespread forced WFH.

https://www.entrepreneur.com/article/347840


"As an aside, many SaaS products have given away their product for free due to COVID and widespread forced WFH."

I'm sure a lot of authors would have contributed their work to the effort, if they'd been asked. But they weren't asked. It's difficult to imagine how you'd similarly force SaaS companies to give away their products for free during the pandemic -- lucky for them -- but if you found a way to do it technically, how do you think they'd react?


Honest question: how do you propose a non profit online library ask every author for written permission, while also considering the legal interests of their publishers?


Sounds difficult, but not a justification for breaking copyright laws, or interfering with people's livelihoods without asking.


Any author who notices the Archive's temporary 'National Emergency Library' affecting their 'livelihood' can opt out quite easily.


The HN consensus is always that only opt-in is acceptable when it comes to online privacy or anything else that affects them personally. Apparently opt-out is fine for everybody else.

You put 'livelihood' in scare quotes, but I don't understand what's deserving of mockery. Writers make their living by selling their writing.


I doubt many authors would have been able to do so, given that they've signed contracts with their publishers that likely prevent them from giving their work to anyone else.


I mean, imagine they did.

So a huge group of people get to try your software for free. There's a system to try and cut them off at two weeks but it might fail. This is basically free marketing. All sorts of people who may never have tried your software try it. Nearly all of them don't buy it, but some do. It doesn't cost you any money and you're able to help people in need.

There's evidence that, in general, piracy lowers sales, but the specifics are complex and uncertain[1]. Piracy can have positive impacts on the fortunes of the original artist. I think that a crisis like this is exactly the time to try new and experimental ways of making things available.

[1] https://marketly.com/does-piracy-impact-sales/


> I think that a crisis like this is exactly the time to try new and experimental ways of making things available.

Yeah, a crisis when everyone's stress level is already much higher is the best time to crank the stress of authors even higher by experimenting with their livelihoods without consulting them!


There is a huge difference between a book and code that I write. A book can be read once and the reader benefits. Useful software is usually useful for far more than one use. Free trials and demos are how we sell more software. A book? It's hard to have a free trial.

That said, I really have a hard time assailing libraries. Ebooks are sold to libraries at 2-5x the price I can buy the same ebook for as a consumer. Print books are sold at similar prices to consumers and libraries.


> A book? It's hard to have a free trial.

Publish one chapter of the book in an anthology, on the web, etc. I've purchased several books after reading one chapter in this manner.


> Imagine if we were still selling boxed software and they decided that in light of COVID-19 they just hand it out with a two week limit.

I think that is a fair comparison. As a software developer, I would JUMP at the opportunity to do that. Partly because it would be an opportunity to help sustain the world through this time of crisis at no real cost to myself (perhaps some opportunity cost). And partly because it would serve as free advertising.


That's not quite what happened here -- your software is given away whether or not you want to jump on that opportunity, because they don't care what you want. If you happen to notice what's going on, you can ask them to stop. And you don't get any goodwill from giving it away, because it was the Internet Archive and not you that made the gesture.

Honestly they would have gotten huge buy-in from authors if they'd bothered to ask. What bothers me more is the intellectual dishonesty of their FAQ's.


> Imagine if we were still selling boxed software and they decided that in light of COVID-19 they just hand it out with a two week limit.

You mean like shareware?


You mean entrepreneurs, not software developers. I can't think of anything about software development that makes it inherently commercial.


Culture wants to be free.


Writing wants to be uncompensated?


I guess it is more that information wants to be free? Protecting the price of information is harder when making copies is very cheap.



