
Thank you for helping us increase our bandwidth - edward
https://blog.archive.org/2020/05/11/thank-you-for-helping-us-increase-our-bandwidth/
======
mostlysimilar
archive.org feels like an irreplaceable treasure; the Wayback Machine alone is
a time capsule of our digital history.

I donate to them monthly and know a lot of other people do as well, so I don't
worry much about their financial stability. I'm more worried about external
pressures taking content down. I hope the data is backed up six ways to
Sunday, and that somewhere there's a plan to make it all accessible if the
Internet Archive can't continue to play the role it does.

~~~
dmd
I'm very worried about its backups and what happens when the next Big One hits
SF.

As far as I've ever been able to determine from talking to anyone at IA (e.g.,
Kahle, Scott), they don't really have any sort of backups that could actually
be restored from in a disaster situation.

~~~
britmob
There is [http://iabak.archiveteam.org](http://iabak.archiveteam.org), but
it’s not exactly large.

~~~
microcolonel
If I'm reading that correctly, it would only cost a bit over 500 bucks a month
to host that whole archive on BackBlaze B2.

Furthermore, it would not be so hard to translate Archive.org items into IPFS
objects, if there were an effort to pin a significant number of them to
storage and network.
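
As a rough illustration of the pinning half of that idea, here is a minimal
sketch assuming a local Kubo (go-ipfs) daemon on its default API port; the
item and file names are hypothetical placeholders:

    import requests

    ITEM = "example-item"        # hypothetical archive.org item identifier
    FILE = "example-file.pdf"    # hypothetical file within that item
    IPFS_API = "http://127.0.0.1:5001/api/v0"

    # fetch one file from the item's standard download URL
    data = requests.get(f"https://archive.org/download/{ITEM}/{FILE}").content

    # add it to the local IPFS node; adding pins the object by default
    resp = requests.post(f"{IPFS_API}/add", files={"file": (FILE, data)})
    print("pinned as", resp.json()["Hash"])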

~~~
Endlessly
Since the numbers are not 100% clear...

50 petabytes * 0.2% = 100 terabytes (presumably the iabak subset, which at
BackBlaze's $0.005/GB/month works out to the ~$500/month figure above)

50 petabytes / 1 gigabyte = 50,000,000 GB

50,000,000 GB * $0.005/GB/month = $250,000/month

—————

Meaning, based on my numbers, it is $250,000 USD a month to host all 50
petabytes of data on BackBlaze.
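
The same arithmetic as a quick sanity check (a sketch in Python; the price and
sizes are the assumptions stated above):

    PRICE_PER_GB_MONTH = 0.005    # BackBlaze B2 storage price, $/GB/month
    ARCHIVE_GB = 50_000_000       # 50 petabytes expressed in gigabytes

    full_archive = ARCHIVE_GB * PRICE_PER_GB_MONTH           # $250,000/month
    iabak_subset = ARCHIVE_GB * 0.002 * PRICE_PER_GB_MONTH   # 0.2% ~ 100 TB -> $500/month

    print(f"full archive: ${full_archive:,.0f}/mo; iabak subset: ${iabak_subset:,.0f}/mo")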

~~~
ramraj07
That's a fraction of the AWS bills of many startups arguably doing absolutely
nothing.

~~~
TeMPOraL
They have VC money to burn. Archive.org doesn't.

Also, those backups would be (relatively) cheap to keep, but not necessarily
to restore.

~~~
jfkebwjsbx
I would guess restore wouldn't be a problem. AWS or whoever would do it for
free given it is a non-profit (in case of a disaster only, of course).

------
est31
60 Gbit/s of _continuous_ traffic is a lot. If I'm reading the graphs right,
Wikipedia "only" has 13.4 Gbit/s of outbound traffic [1]. That is of course
still well below the single-digit Tbit/s traffic of large internet exchanges
[2] [3], but it's still unexpectedly large.

[1]: adding up the outbound numbers for each datacenter (1.888 + 8.003 +
1.958 Gbit/s plus 810 + 807 Mbit/s):
[https://grafana.wikimedia.org/d/000000605/datacenter-global-...](https://grafana.wikimedia.org/d/000000605/datacenter-global-overview?orgId=1)

[2]:
[https://en.wikipedia.org/wiki/List_of_Internet_exchange_poin...](https://en.wikipedia.org/wiki/List_of_Internet_exchange_points_by_size)

[3]: Not 100% correct, as thanks to Corona ix.br has now passed 10 Tbit/s
peaks: [https://ix.br/noticia/releases/ix-br-reaches-mark-of-10-tb-s...](https://ix.br/noticia/releases/ix-br-reaches-mark-of-10-tb-s-of-peak-internet-traffic/)

~~~
cperciva
60 Gbps is about 1/3 of the traffic served by a single Netflix CDN node.

~~~
jonah-archive
It's harder when you don't make money, and everyone's not downloading the same
things.

~~~
bits_n_bytes
I understand money is tight for a non-profit, but why are bandwidth costs a
gating factor?

Both your upstreams, Cogent and Hurricane Electric, offer 100G ports at
five-ish grand per month in carrier-neutral DCs. Given that your budget is in
the millions, an outlay of this magnitude doesn't seem wildly out of the
question.

If you can explain what the problems are in getting more bandwidth, I'd be
more than happy to see what I can do to help.

Please let me know if you'd prefer to discuss the matter over email.

~~~
jonah-archive
Happy to chat in much greater detail over email if you like (address is in my
profile) -- transit specifically is not generally our limiting cost (though of
course in the context we operate in, every penny counts -- e.g. we run our own
DCs, ambiently cooled, with single-feed grid power and minimal backup, etc.).
We generally run very close to the edge, capacity-wise, in order to get the
most out of what we have -- in this case, demand moved very quickly and we had
to work a hardware deployment to catch up.

~~~
MaxBarraclough
I'm sure I'm not the only one who'd be interested in reading more of this. Is
there a reason your conversation can't proceed on HackerNews?

~~~
Aeolun
Or at least update us on the conclusion :)

~~~
bits_n_bytes
You can follow the thread on the NANOG list:

[https://mailman.nanog.org/pipermail/nanog/2020-May/107720.ht...](https://mailman.nanog.org/pipermail/nanog/2020-May/107720.html)

------
tgsovlerkhgsel
They increased the bandwidth, and traffic immediately filled the new capacity.
While this could just be a faster web site attracting more users (or users who
were previously giving up due to slow loads), it could also indicate that they
have a lot of automated traffic that will consume whatever resources are
available and could be throttled without endangering their mission.

I hope they'll look into it, find a way to identify that traffic, and then
(ideally) classify it accordingly and continue to serve it only after all
human traffic has been served (i.e., if they run into a bandwidth crunch due
to bots, only the bots suffer, and they don't need to upgrade as quickly).

~~~
textfiles
We don't save logs of who hits us, but we watch who hits us. It's not bots.
But good thinking.

~~~
twicetwice
How do you know? Not doubting you, just curious as to how you figure out if a
request is from a bot or not.

~~~
textfiles
Well, obviously, a sneaky bot is a sneaky sneak and acts like a person. But
conversely, people act in a way that could be like a bot if anyone casually
looked at them - mass downloads, no theme or meaning to content. In general,
however, it's all pretty much people. Millions of people a day.

------
jeromegv
The Wayback Machine has been essential through COVID, as many governments just
publish the numbers and data "of the day," and the only way to compare to the
day before is to look at IA.

~~~
simonw
Do you have any examples of government sites that are doing this?

I have a side-hobby of setting up scrapers which pull scraped data into a git
repository, precisely for this kind of thing. I'd be happy to set a few up.

Some of my posts about this technique (which I call "git scraping"):
[https://simonwillison.net/tags/gitscraping/](https://simonwillison.net/tags/gitscraping/)
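
The core of the technique fits in a few lines; here is a minimal sketch (the
data URL and file name are placeholders), assuming it runs inside a git
repository on a schedule such as cron or GitHub Actions:

    import subprocess
    import requests

    URL = "https://example.gov/daily-numbers.json"   # placeholder data source
    OUT = "data.json"

    # fetch the latest copy of the data and write it over the tracked file
    with open(OUT, "w") as f:
        f.write(requests.get(URL).text)

    # stage the file; commit only if its content actually changed,
    # so git history becomes a diffable archive of every revision
    subprocess.run(["git", "add", OUT], check=True)
    changed = subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0
    if changed:
        subprocess.run(["git", "commit", "-m", "Latest data"], check=True)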

~~~
Endlessly
Not sure if you’re asking for:

(1) US coronavirus data in general,

(2) examples of sources that do not keep prior days’ data,

(3) sources that “edit” prior data without noting the edits, or

(4) something else.

If (1): this page links to where the data came from, in the table under the
“Sources” column:

[https://www.worldometers.info/coronavirus/country/us/](https://www.worldometers.info/coronavirus/country/us/)

~~~
simonw
2 and 3.

------
lkramer
Crazy idea: why doesn't ICANN hand over the running of .org to archive.org?
It's a bit outside their expertise, but they do have know-how on running tech
at scale, and the extra revenue could be used to fund a very good cause.

Also, I'd trust these guys not to mishandle .org

(yeah, I know it's not going to happen)

~~~
jonah-archive
> Also, I'd trust these guys not to mishandle .org

As much as I like the idea of running a TLD, if anyone gives me a TLD, I'm
gonna put an MX record on it and be jonah@org. That's a promise.

EDIT: To be fair, I would probably be overwhelmed with the usual sense of
responsibility and not do this. But, _the temptation_

~~~
Jleagle
Would be a nightmare with all the crappy email validators out there...

~~~
myself248
I recall a similar story involving amateur radio. By ITU convention, amateur
callsigns are [prefix][digit][suffix], like W1AW.

So one night, a new ham is operating his friend's fancier station, which has
the HF equipment to talk clear around the world. And they make contact with
another ham identifying himself as JY1. No suffix. Well that's weird, so they
look it up, the JY prefix is Jordan, okay. So the guy on the mic just asks,
"why doesn't your callsign have any letters after the digit?" and the reply
comes back with a chuckle, "Oh, because I am the King."

That was the late King Hussein, about whom much has been written. I don't know
if his call ever caused trouble with software validators, but I'd certainly
believe it.

~~~
lb1lf
I know of a callsign causing trouble with most log programs - RAEM, which
Russian amateurs dust off every now and then to commemorate polar pioneer
Ernst Krenkel.

Took a while to get ARRL Logbook of the World credit for that one.

------
sersi
I wish that Archive.org would start collecting funds to finance an external
backup site. I strongly feel that archive.org is an irreplaceable treasure and
that making sure it can survive natural disasters should be a priority...

~~~
driverdan
That doesn't require a special fund; just donate to them. They'd make backups
if they had enough donations to cover it.

~~~
sersi
I actually already donate about 250 USD a year to them... but I would much
prefer it if there were a special fund I could donate to that would make that
happen in the future.

------
Someone1234
I donate to the Internet Archive (recurring). I just hope they stick to their
core mission and don't get sidetracked like Wikipedia has. They offer a
valuable service that I believe will help historians in the future understand
this point in the internet's life.

PS - Don't forget to pick an Amazon Smile charity and use the smile.amazon.com
subdomain. I've donated almost $20 to the EFF that way.

~~~
notRobot
How has Wikipedia got sidetracked?

~~~
gruez
Most of Wikimedia's budget goes towards non-Wikipedia-related expenses:
[https://www.washingtonpost.com/news/the-
intersect/wp/2015/12...](https://www.washingtonpost.com/news/the-
intersect/wp/2015/12/02/wikipedia-has-a-ton-of-money-so-why-is-it-begging-you-
to-donate-yours/)

------
sktrdie
Why don't they put their data dumps in an SQLite database indexed for
full-text search ([https://sqlite.org/fts5.html](https://sqlite.org/fts5.html))?

Then put the file in a torrent. Let the users seed it.

Users can use sqltorrent
([https://github.com/bittorrent/sqltorrent](https://github.com/bittorrent/sqltorrent))
to query the db without downloading the entire torrent - essentially it knows
to download only the pieces of the torrent needed to satisfy the query.

Every time a new dump is published by the Internet Archive, the peers can
switch to the new torrent and reuse the pieces they already have - since the
SQLite file is indexed in a way that minimizes file changes (and hence piece
changes) when the data is updated.

I talk a bit about it here: [https://medium.com/@lmatteis/torrentnet-
bd4f6dab15e4](https://medium.com/@lmatteis/torrentnet-bd4f6dab15e4)

It would save the Internet Archive lots of bandwidth and hassle.
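
For illustration, the FTS5 side of this is simple; a toy sketch with made-up
rows (a real dump's schema would differ, and your SQLite build must include
FTS5):

    import sqlite3

    con = sqlite3.connect("archive-index.db")

    # an FTS5 virtual table: every column is full-text indexed
    con.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")
    con.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            ("http://example.com/a", "Example A", "archived text about bandwidth"),
            ("http://example.com/b", "Example B", "more text about storage costs"),
        ],
    )
    con.commit()

    # MATCH queries go through the index, touching only the relevant pages of
    # the database file (the pieces sqltorrent would fetch on demand)
    for url, title in con.execute(
            "SELECT url, title FROM pages WHERE pages MATCH 'bandwidth'"):
        print(url, title)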

~~~
Aeolun
A 50PB SQLite database? I don’t think any filesystem would be happy with that.

~~~
sktrdie
Indeed. But nothing stops them from storing torrents of torrents. The search
index would be just that, the text index, which would point to another torrent
storing the actual content. Assets would point to yet other torrents.

It would be interesting to learn how the data is currently partitioned. I
would mimic the same partitioning system but use torrents instead, so users
can help with hosting, and use sqltorrent to serve queries efficiently.

------
teekert
I clicked this article to look for ways to contribute... bandwidth. But there
is no torrent/IPFS-like method to help, just cash. I'd be happy to run a
service that would help Wikipedia, archive.org, ddg (any service I believe in)
decentralize. I always keep my Ubuntu torrents open on launch days and some
time beyond. I have a server running, and I don't need my 50/50 fiber all the
time.

If I were a better software dev, I'd try to make a daemon I could feed a list
of websites I want to support with my cache, one that could be updated (like
the DNS system?). Like a Folding@home for bandwidth.

~~~
TeMPOraL
> _just cash_

This is almost always the best and most efficient way you can help a cause.
Donating infrastructure means donating extra work - they'll need to integrate
it and keep track of it, as well as manage a relationship with you (and in
particular, the risk of problems with you or your service). Meanwhile, with
extra cash, they can buy what they need and what integrates well, or pay the
most effective specialists they need.

It's a universal principle: cash is the best gift, if you're giving to help.
That's why e.g. the Red Cross frowns at people donating _stuff_ - it creates a
huge logistical problem for them as well as depriving them of the opportunity
to boost the markets in the disaster-struck area. And it's why it's better to
donate money to your local homeless shelter rather than volunteer to work in
it - if you work your job for extra hours instead and donate that money,
they'll hire workers better suited to the task.

~~~
teekert
But I have bandwidth to share, not cash. I'm already at the lowest
subscription level with my ISP. I understand the argument, but if there were a
general technical solution, that would be nice, right?

~~~
TeMPOraL
That's true. Does such a solution exist, though?

I had some hopes for Filecoin (i.e. a P2P system with monetary incentives
attached to backing up other people's data, from one of the very few groups in
the crypto space that don't look like scammers to me), but I haven't heard
anything about it in a while.

~~~
teekert
I used to participate in such a service from Lacie (or was it Lacie?). I
shared 1 TB and got 1 TB of "cloud" storage in return (multiplied by the
fraction of time you were online, I think), which was in fact distributed
(after local encryption) over many other online PCs. I ran it on my server,
but one was not even obligated to keep their PC on. It worked "meh" (it was
all Java-based with a meh UI) and they discontinued it.

------
suyash
A suggestion around donations: see if your employer will match yours; that way
you can double the contribution: [https://help.archive.org/hc/en-
us/articles/360002059712-How-...](https://help.archive.org/hc/en-
us/articles/360002059712-How-can-I-get-my-company-to-match-my-donation-)

------
x3blah
Archive.org works surprisingly well as a general-purpose web proxy. Just
prefix the URL, e.g., [http://example.com](http://example.com), with
[https://web.archive.org/save/](https://web.archive.org/save/), e.g.,
[https://web.archive.org/save/http://example.com](https://web.archive.org/save/http://example.com)

The aesthetic intrusiveness of the archive.org header and footer is minimal
since I use a text-only browser that has no JavaScript engine.

Sometimes I get "This url is not available on the live web or can not be
archived." However, this happens for only a surprisingly small minority of
websites.

Rarely, I find that /save is unsuccessful, in which case I can still find past
copies using something like

    # list every snapshot (timestamp + original URL) in the CDX index for the site
    curl -o 1.txt "https://web.archive.org/cdx/search/cdx?url=http://www.example.net&fl=timestamp,original"
    # keep only "timestamp http..." lines and rewrite each into a direct Wayback URL
    sed -i '/^[12][0-9]* h/!d;/^[12][0-9]* h/{s/^/http:\/\/web.archive.org\/web\//;s/ /\//;s/\r//;}' 1.txt

The limitation with past copies versus /save is that archive.org will not
usually crawl past page one on websites with many successive pages, e.g.,
[http://example.com/?page=2](http://example.com/?page=2),
[http://example.com/?page=3](http://example.com/?page=3), etc.

Has anyone ever considered mirroring archive.org, or parts of it, to other
geographic locations? Could this be done? Why or why not?

~~~
Springtime
I use a custom browser keyword search to find existing archived pages before
saving one, personally. E.g., _ar <URL>_ for:

    https://wayback.archive.org/web/*/%S

I'd imagine it would be useful for IA to implement some message for scenarios
where a page has already been saved within a certain timespan, providing both
a link to the already-saved version and the option to save again. This would
mitigate the mass saving of an identical page that can occur when some popular
link is accidentally shared with the _/save/_ URL instead of the static URL,
or when it's a popular page that many people want to archive.

Archive.is displays such a message (to the effect of, 'this page was archived
<date>, if it looks outdated click save') and also redirects to the most
recent copy.
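
Client-side, you can already approximate the check with the Wayback
availability API; a small sketch (the URL is a placeholder):

    import requests

    url = "http://example.com"   # placeholder page to check

    # ask the Wayback availability API for the closest existing snapshot
    info = requests.get("https://archive.org/wayback/available",
                        params={"url": url}).json()
    closest = info.get("archived_snapshots", {}).get("closest")

    if closest and closest.get("available"):
        print("already archived:", closest["timestamp"], closest["url"])
    else:
        print("no snapshot yet; save via", "https://web.archive.org/save/" + url)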

~~~
x3blah
That's a good point. I mainly use it for browsing websites that change daily,
as well as ones with many successive pages, e.g., 1, 2, 3, etc., that IA does
not automatically crawl. Thus, there are not many existing copies, if any.

HAProxy changes the Host header and modifies the URL. I can either use the
text-only browser's http-proxy option or direct the request to the
web.archive.org backend by adding a custom HTTP header to the request. If I am
not mistaken, the so-called "modern" browsers do not have a built-in
capability to add headers.

------
SteveNuts
I'm really surprised they don't make more use of a CDN for this. Does anyone
know why it isn't served by something like Cloudflare?

~~~
adtac
Please don't make the whole internet basically Cloudflare. They've banned my
VPN endpoint (a Hetzner server), and as a result a huge chunk of websites
already don't work for me despite my having done nothing wrong. I've heard
reports of Tor users being restricted as well.

~~~
SteveNuts
Okay, any one of the other CDNs then. It seems like the bandwidth costs and
equipment are prohibitively expensive for them. There must be a reason they
don't use one, which is what I want to know.

~~~
dylz
All of the other CDNs are absurdly, prohibitively expensive, pay-per-GB, etc.

And they would still have to download from the origin.

------
ufo
The article doesn't mention why archive.org's demand went up so much after
COVID. Does anyone know?

~~~
sp332
I don't know if this explains all of it, but they launched a National
Emergency Library, where they pooled the DRM'd versions of ebooks from lots of
libraries and removed one little bit of the DRM restrictions.
[http://blog.archive.org/2020/03/24/announcing-a-national-
eme...](http://blog.archive.org/2020/03/24/announcing-a-national-emergency-
library-to-provide-digitized-books-to-students-and-the-public/)

------
jl6
Mixed feelings here... I love the IA but I feel people should not use them as
a streaming service. Their mission is preservation, and access is obviously
part of that, but I would feel bad hammering them every day for general TV
watching purposes.

~~~
sradman
I assumed that their bandwidth needs were due to their web crawler rather than
serving cached content. I guess I’m basing that assumption on my own usage
patterns.

------
mmmBacon
Maybe my sense of scale is warped, but I'm amazed at how little BW the archive
uses. Since this data appears to be averaged over a 1-minute window, I wonder
what their p99 usage is. The article says they bought a 2nd router. Were they
only using a single router before (no redundancy or protected BW)? I realize
the Archive probably has a small budget, but one can buy a 3.2T switch
relatively cheaply these days.

------
sergiotapia
What's the endgame for archive.org? Are they expecting there to be a
breakthrough in storage technology?

I read somewhere that data creation is outpacing storage technology. Is this
true?

What about a mesh of some kind where every person who installs an application
hosts bits and pieces of random data and serves them to whoever asks?

------
social_quotient
What if there were a feature to load pages without images? I use IA a lot, but
I rarely need the media stuff, at least initially. Speaking for myself, I'd be
happy to get the page without images and then click to optionally get the
images. (Lazy lazy loading.)

~~~
gojomo
Inline images (even across all Wayback Machine usage) are likely a tiny
factor, compared to the rich-media downloads (audio, video) dominating this
usage.

------
whanamura
Hey, for those who would like to help the Internet Archive with their excess
storage: we are ginning up a project with Filecoin.ai to store some of our
open data collections, like Prelinger films, .gov data, and some audio and
texts.

You can get an update on decentralized storage at our DWeb Meetup tomorrow
(5/13 at 10 AM Pacific, 5 PM UTC):

[https://www.eventbrite.com/e/dweb-meet-up-virtual-
decentrali...](https://www.eventbrite.com/e/dweb-meet-up-virtual-
decentralized-storage-comes-of-age-tickets-104783963656)

------
S_A_P
Not trying to brag or virtue-signal here, but this is one cause I really feel
great about supporting. This is potentially our archive of human history and
collective knowledge.

------
lihaciudaniel
Sometimes I wonder if people keep track of internet history. For example, we
have writing about Rome and the Roman Republic only because of Cicero;
otherwise we wouldn't be able to comprehend it. I think the modern-day
Herodotus is this website right here (assuming it survives the centuries,
hopefully).

------
fireattack
Kinda off-topic, but I noticed that it seems the owners of sites can ask
Archive.org to remove existing archives. I've lost a few I "created" myself
this way.

Not to say it's unreasonable, but how do we work around this? Are there any
alternative services that are more "resilient" in this regard?

------
simonmales
For the video and audio content, could they take advantage of something like
WebTorrent?

~~~
toomuchtodo
Every item has a torrent file, and can be retrieved with a BitTorrent client.

[https://help.archive.org/hc/en-
us/articles/360004715251-Arch...](https://help.archive.org/hc/en-
us/articles/360004715251-Archive-BitTorrents)
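
If I recall correctly, the torrent lives alongside each item's files under a
predictable name, so fetching one looks roughly like this (the identifier is a
hypothetical placeholder):

    import requests

    identifier = "example-item"   # hypothetical archive.org item identifier

    # each item's torrent sits alongside its files as <identifier>_archive.torrent
    torrent_url = f"https://archive.org/download/{identifier}/{identifier}_archive.torrent"

    with open(f"{identifier}.torrent", "wb") as f:
        f.write(requests.get(torrent_url).content)
    print("saved torrent from", torrent_url)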

~~~
simonmales
Brilliant!

------
oknoorap
We need an open-source competitor to archive.org using IPFS technology.

~~~
iofiiiiiiiii
Or an IPFS transfer/mirror mechanism built into archive.org.

------
PudgePacket
I'm curious about the outgoing bandwidth, for crawling :) !

------
VectorLock
What a refreshingly simple and wholesome announcement.

------
afpx
What's the easiest way to seed the most popular content, à la torrents?

~~~
__s
ipfs: [https://betanews.com/2018/08/09/decentralized-archive-
org](https://betanews.com/2018/08/09/decentralized-archive-org)

~~~
ghastmaster
I would love to donate bandwidth/storage, but I haven't a clue how. I wish
there were software I could throw on my server, set how much storage and
bandwidth I can donate, and run 24/7.

~~~
betamaxthetape
Every item on archive.org has a BitTorrent file, I believe. This doesn't solve
the problem of working out which items are the most popular, but if you could
figure that out, then you would be able to manage storage and bandwidth. I
suppose the real question is: how many other people use BitTorrent to download
from archive.org, as opposed to direct downloads? I (sadly) suspect it's a
very small fraction.

Semi-unrelated, but if you're looking for ways to help and have a spare
server, Archive Team [1] is always looking for additional capacity. Although
Archive Team != archive.org, they do grabs of at-risk content which (almost
always) get uploaded to archive.org. [Disclaimer: I help out with various
Archive Team projects, the most recent of which was the backup of Yahoo
Groups.]

[1] [https://www.archiveteam.org/](https://www.archiveteam.org/)

------
wglane
Donated.

------
chx
I have revoked my donations until they close their pirate library. Sorry, but
let's call a spade a spade. They admit:

> multiple readers can access a digital book simultaneously

with the only caveat being that it is borrowed for two weeks. But let's face
it, most of the value of a book comes from its first reading -- and what stops
you from "borrowing" it again, anyway?

It's been a gigantic disappointment for me to see them do this and, rather
than back down, try to placate the authors with weasel words and an opt-out.

~~~
bzb3
Culture wants to be free.

~~~
therealcamino
Writing wants to be uncompensated?

~~~
manquer
I guess it is more "information wants to be free"? Protecting the price of
information is harder when making copies is very cheap.

