
Cloudflare and the Wayback Machine, joining forces for a more reliable Web - jgrahamc
http://blog.archive.org/2020/09/17/internet-archive-partners-with-cloudflare-to-help-make-the-web-more-useful-and-reliable/
======
james412
Was worried about CF getting their claws dug into archive.org, but on reading,
this is a decidedly non-evil deal; actually, it sounds wonderful. Still, I
worry that there might be some unseen long-term interest in the archive.

Never forget Dejanews

~~~
neop1x
Not long ago, CF was blocking access from Tor, and they still block access
from my web crawler sometimes. I don't like CF, as they act as police or
gatekeeper to the origin website, deciding whom to penalize and whom not to,
while pretending to be speeding up websites and protecting them from 'threats'.

~~~
gogopuppygogo
I was one of the first 100 people to use Cloudflare when it launched.

I'm paying them today to speed up a couple of websites while protecting them.

They rock at making big things possible for very small companies.

~~~
trevorsstar
Hey, me too! Do you have the first-users t-shirt?

------
buildbuildbuild
This should be made very clear to Cloudflare users, ideally a warning next to
the Always Online checkbox.

"Always Online" now can mean "Archive Forever" - even when a site is
pre-launch.

~~~
JackC
Yeah, I definitely expect this to bite some people, if I'm understanding
correctly. A plausible scenario (among many) would be: soft launch a site,
show it to some early stakeholders, have Wayback archive everything via Always
Online, fix embarrassing screwups or oversharing in soft-launched version,
publicize site more broadly, everyone in the world can rewind to version zero,
regrets. I don't think the existing warnings really make clear that a soft
launch is now a forever launch.

~~~
jgrahamc
The solution to this is... robots.txt. Otherwise your site might turn up in
Google etc. Since it's archive.org that's doing the crawling and they respect
robots.txt it won't get archived.
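
For anyone who wants to sanity-check their robots.txt before a soft launch, here is a rough sketch of the kind of check a crawler might do. It is deliberately simplified (prefix matching only, no wildcards, no Allow rules), the function name `isDisallowed` is made up for illustration, and `ia_archiver` is only assumed here to be the user-agent token the Internet Archive looks for. It says nothing about whether a given crawler actually honors the file.

```javascript
// Simplified robots.txt check: prefix matching only, no wildcards or Allow rules.
// Assumption: "ia_archiver" is the user-agent token used for the Internet Archive.
function isDisallowed(robotsTxt, userAgent, path) {
  const agent = userAgent.toLowerCase();
  let groupAgents = [];    // user-agents named by the group currently being read
  let inRules = false;     // true once the current group has started listing rules
  let sawOwnGroup = false; // true if any group names this agent explicitly
  const ownRules = [];     // Disallow prefixes from groups naming this agent
  const starRules = [];    // Disallow prefixes from "*" groups

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // A User-agent line after rules starts a new group.
      if (inRules) { groupAgents = []; inRules = false; }
      groupAgents.push(value.toLowerCase());
      if (value.toLowerCase() === agent) sawOwnGroup = true;
      continue;
    }
    inRules = true;
    // An empty Disallow means "nothing is disallowed".
    if (field !== "disallow" || value === "") continue;
    if (groupAgents.includes(agent)) ownRules.push(value);
    if (groupAgents.includes("*")) starRules.push(value);
  }

  // A group that names the agent takes precedence over the "*" group.
  const rules = sawOwnGroup ? ownRules : starRules;
  return rules.some((prefix) => path.startsWith(prefix));
}

// Example file for a soft launch: keep the archive crawler out entirely.
const softLaunchRobots = [
  "User-agent: ia_archiver",
  "Disallow: /",
  "",
  "User-agent: *",
  "Disallow: /admin/",
].join("\n");
```

With that file, `isDisallowed(softLaunchRobots, "ia_archiver", "/index.html")` is true, while other crawlers are only kept out of `/admin/`.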

~~~
symfoniq
Archive.org does not respect robots.txt IIRC. I’ve run into this problem
before with them. Ironically, I ended up blocking Internet Archive’s ASN using
Cloudflare.

EDIT: Internet Archive started ignoring robots.txt in 2017:
[https://www.digitaltrends.com/computing/internet-archive-
rob...](https://www.digitaltrends.com/computing/internet-archive-robots-txt/)

~~~
kalleboo
They only started ignoring robots.txt on US government websites (as that
article also says)

~~~
symfoniq
That is not what the article says.

It says Internet Archive had _already_ started ignoring robots.txt on US
government websites.

Now (since 2017) they ignore it on _all_ websites.

------
superkuh
Normally I'd be upset about Cloudflare getting involved in anything good and
pure like archive.org but this relationship, just suggesting new URLs to
archive, seems harmless enough.

~~~
cj
> I'd be upset about Cloudflare getting involved in anything good and pure

Why?

At least by FAANG standards, Cloudflare stands out to me as one of the good
guys.

~~~
therealmarv
Just a reminder: Cloudflare with its standard settings is breaking the internet
for second and third world countries with its captchas on websites. This is
discrimination in my opinion. As long as you are in a first world country
you will never notice.

~~~
peteretep
I do almost all of my internet access from a 2nd/3rd world country and hadn’t
noticed?

~~~
andreareina
Depends on the country/ISP probably. From the Philippines, Firefox with
uBlockOrigin and PrivacyBadger hit captcha walls all the time; all that
stopped as soon as I moved to Singapore.

------
dualboot
Cloudflare is really neat unless you find yourself mysteriously blacklisted by
them as a user.

Then suddenly the web is a much smaller place.

~~~
pabs3
You can use archive.org to bypass the Cloudflare blocklist, especially
considering the save page feature.

------
varbhat
Good News.

I also recommend using Internet archive addon in browser. Clicking on it would
archive the website. That way, you can archive pages you visit.

~~~
surround
Or use this bookmarklet:

    javascript:window.location="https://web.archive.org/save/"+location.href
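
If you'd rather build the same link from a script, a small sketch along these lines should work. The helper name `waybackSaveUrl` is made up for illustration; the only thing taken from the bookmarklet is the `https://web.archive.org/save/` prefix. The guard is there because the bookmarklet would happily try to save `about:` or `file:` pages.

```javascript
// Hypothetical helper: build the same "save" link the bookmarklet navigates to.
function waybackSaveUrl(pageUrl) {
  const parsed = new URL(pageUrl); // throws a TypeError on malformed input
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    throw new Error("Only http(s) pages can be submitted: " + pageUrl);
  }
  // Same concatenation as the bookmarklet: prefix + full page URL.
  return "https://web.archive.org/save/" + parsed.href;
}
```

For example, `waybackSaveUrl("https://example.com/a?b=1")` gives `https://web.archive.org/save/https://example.com/a?b=1`.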

~~~
sp332
I use
[https://web.archive.org/web/submit?url=%s](https://web.archive.org/web/submit?url=%s)
and set the keyword to "rez". That way I can type "rez example.com" and it
will send me to the archived version.

~~~
jwilk
Why "rez"?

~~~
null0pointer
Short for "resurrect" maybe?

~~~
sp332
Yup!

------
SimeVidas
Just how big are Internet Archive’s servers? I can’t fathom how they’re able
to store so much of the web in so many versions.

~~~
013
The Wayback Machine uses 9.6 PetaBytes. Total storage is 50 PetaBytes.

[https://archive.org/web/petabox.php](https://archive.org/web/petabox.php)

------
nikisweeting
Next step, have CloudFlare start mirroring IA on their own servers so we have
some redundancy in case IA ever goes bankrupt.

Ideally it would be a non-profit that does it, but as a last resort CF is one
of the few companies I'd trust to do it right and do it transparently.

------
borrame
Cloudflare is not vpn friendly.

I'm a privacy concerned vpn user and in my daily browsing I have to deal
dozens of times a day with cloudflare captchas or in some cases with
cloudflare total blocking.

~~~
keepingscore
Is using this Chrome/Firefox add-on an option for your use case?

[https://support.cloudflare.com/hc/en-
us/articles/11500199265...](https://support.cloudflare.com/hc/en-
us/articles/115001992652-Using-Privacy-Pass-with-Cloudflare)

------
booleanbetrayal
This is actually a really good symbiotic relationship that should foster the
archival of a ton more content. Hoping to see this toggle enabled by default
at some point.

------
stubish
Interesting to find that, when I checked to see if I was using the feature, I
had already agreed to the supplemental terms saying my information will be
shared with IA.

(For others who need to opt out, [https://support.cloudflare.com/hc/en-
us/articles/200168436-U...](https://support.cloudflare.com/hc/en-
us/articles/200168436-Understanding-Cloudflare-Always-Online) describes how to
disable "Always Online". There doesn't seem to be a way to turn off just the
information sharing.)

------
resynth1943
This clearly isn't to create some utopic 'more reliable Web'. In fact,
Cloudflare severely undermines that, by pushing their centralised view of what
the internet should be.

I was hopeful, but after reading this:

> “The Internet Archive’s Wayback Machine has an impressive infrastructure
> that can archive the web at scale,” said Matthew Prince, co-founder and CEO
> of Cloudflare. “By working together, we can take another step toward making
> the Internet more resilient by stopping server issues for our customers and
> in turn from interrupting businesses and users online.”

It's plain to see that this is a money-making venture for Cloudflare. While I
do like the added functionality, I personally can't see how this 'improves'
the Wayback Machine. It's just going to place _more_ load on it.

------
luckylion
I don't like the idea of "we're tacking this onto an existing service lots of
people have enabled". CF bit me recently by suddenly taking away proxied dns
wildcards from free zones, as it's now a premium feature (breaking the
security promise in the process by changing the wild card entry to non-
proxied). I don't like surprises and opt-out changes in critical
infrastructure.

It's one thing to use CF's Always Online service - you're a customer, you know you
can remove your data from it. It's another to get the Internet Archive
involved, who may or may not remove your data, and may or may not honor
robots.txt.

~~~
stubish
Sending the details seems to be tied to clicking the 'Update' button in the
Cloudflare UI, which states that by clicking it you agree. So they might not
be sending your PII to a third party until they get your permission. Hopefully
any automated updates are not violating customers' wishes. Yes, it is annoying
that the features have been tied together for people who choose to have as
little interaction with IA as possible.

------
britmob
Wow, this is awesome to see. I hope this doesn’t put a lot of load on the IA,
though..

~~~
M4v3R
I would assume that when a site goes offline Cloudflare fetches a snapshot
from IA only once and then serves this copy to all further visitors, unless
I'm missing something?

Here's a more detailed description of the service from Cloudflare support
pages: [https://support.cloudflare.com/hc/en-
us/articles/200168436](https://support.cloudflare.com/hc/en-
us/articles/200168436)

------
zmix
Whatever fills web.archive.org is good!

Though it would be nice if someone invented technology that can erase all
the 404 pages and redirects that get archived as well once the page goes
offline. Maybe a job for AI?

~~~
feralimal
No.

Keep historical revisionism out of archive.org.

------
throwawaysea
Has Cloudflare clarified their stance on which content they allow and
disallow? They’ve wavered in the past and given how their service is basically
critical infrastructure for the internet, I really want them to commit to free
speech, avoid deplatforming, and avoid exceeding legal minimums.

------
josefresco
Side rant: Sure would be nice if the Wayback Machine showed actual snapshots
of web pages, instead of "hybrid" snapshots where they combine old with new
(maybe it's a setting?). I recently horked a website, and thought to check the
Wayback Machine. Curiously, an edit I made that day was showing on snapshots
dating back several years. Until I discovered how the WBM worked, I was
pulling my hair out.

~~~
judge2020
This is probably due to XHRs. The IA loads all JS, so if a website hard-codes
the URL or does other complicated XHR stuff, the IA might not be equipped to
save the responses for those, if it saves them at all.

------
feralimal
Perhaps we are getting a sugared pill? Perhaps CF are genuinely being useful
here, but in order to gain trust to act nefariously in future?

I don't feel comfortable with their ability to switch off parts of the
internet, nor in this case, that they have their hands near what is preserved
for posterity.

As they say: "Cloudflare has become core infrastructure for the Web, and we
are glad we can be helpful in making a more reliable web for everyone." They
are indeed powerful.

I'm concerned that they are becoming gatekeepers to information, under the
guise of providing a better internet service. They are able to operate at a
level deeper than the odious restrictions youtube, facebook et al enforce on
free speech.

~~~
feralimal
I'm being downvoted - but we have seen major 'book burnings' on youtube, etc
where billions of comments and videos have been purged. These are private
platforms and can do what they like, so in a way that's acceptable as it is
within their terms of service.

CF is a level deeper than that. This is a company that can effectively shut
down the internet for companies and individuals. And now they are involved
with archive.org? Should we be concerned about online historical revisionism
as that relationship matures?

I feel uncomfortable that CF seems to be positioning itself as a guardian to
all information - not at an application level, but at an architectural level.

Cloudflare is shaping up to be a key tool that an authoritarian government
requires. And I'm concerned about it!

------
josefrichter
Could someone please automatically activate this for all content linked from
HN? It happens all the time that many of the first page links are down due to
traffic spike.

~~~
raybb
From what I can tell, all links submitted are automatically archived.

------
j45
Encouraging to see this kind of partnership in this day and age. May it never
forget why it started and only improve on it.

------
lgats
Great, until your admin panel is archived...

~~~
scrollaway
Cloudflare just gives more discovery, it doesn't give IA access to anything
that was previously more secure...

"COVID tests: great, until you find a positive result"

~~~
lgats
> As new URLs are added to sites that use that service they are submitted for
> archiving to the Wayback Machine.

Yes, this would prevent most order-confirmation pages or otherwise private
must-be-logged-in pages from being archived, but it will expose presumed-
private URLs that are thought to be unique (tracking numbers, files uploaded
with unique names, unique/private image urls that are otherwise publicly
accessible)

If you've hardened your systems against enumeration attacks, this could partly
bypass those efforts.

~~~
jlgaddis
What I'm hearing is "don't rely on security by obscurity", which I
wholeheartedly agree with.

------
amelius
Perhaps they can start archiving YouTube.

------
jamescampbell
My heart shuddered when I read the headline. I can’t be alone in the fear.

------
vaccinator
If only we could get the NSA to publish their archive of public data.

~~~
est31
There is hope. With the help of then-senator Al Gore, the CIA made photographs
it had taken of the polar regions, while searching for Soviet nuclear
installations, available to researchers. They became valuable for climate
research later on.

------
mcdevilkiller
Strange that the blog doesn't have https redirection.

------
nomercy400
Who owns Cloudflare? And what are they valued at?

I mean, these deals make them look cool and altruistic, but what happens when
BigCompany offers them enough money to sell?

------
tiffanyh
Is this basically Archive.org becoming a customer of Cloudflare CDN to reduce
load off their servers?

~~~
toomuchtodo
It's Archive.org being provided URL telemetry for archiving public sites they
have not yet found through traditional means (crawling or users submitting
requests through the Wayback Save page) by a Cloudflare product.

The next step would be for Cloudflare to point to Archive.org Wayback links
when an origin isn't available (similar to browser extensions that point to
Archive.org when sites 404 or are down, but in Cloudflare's core).

Cool stuff. Thanks Cloudflare folks.

~~~
GekkePrutser
I really doubt their customers would want that. Usually when a page is 404
it's because the company in question wants to forget about it :)

~~~
jedberg
You would return the archived page for a 5xx error, not a 4xx error.
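
Roughly, as a sketch: the function names below are made up, and the `https://web.archive.org/web/` prefix is my assumption for "latest snapshot of this URL", not something the announcement specifies.

```javascript
// Hypothetical rule: only server-side failures (5xx) qualify for a fallback.
function shouldServeArchive(status) {
  return Number.isInteger(status) && status >= 500 && status <= 599;
}

// Returns an archive link to serve instead, or null to pass the response through.
// Assumption: prefixing the page URL with /web/ points at its latest snapshot.
function fallbackTarget(originStatus, pageUrl) {
  if (!shouldServeArchive(originStatus)) return null;
  return "https://web.archive.org/web/" + pageUrl;
}
```

So a 503 from the origin would map to an archived copy, while a 404 returns `null` and the "gone on purpose" page is left alone.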

~~~
GekkePrutser
Ah I see. But this is precisely a usecase for cloudflare's own caching
service.

It wouldn't be fair to use archive.org's community-sponsored resources for
propping up businesses which are too cheap to pay for proper IT :)

~~~
pronoiac
While it's not explicitly mentioned, I think Cloudflare is providing financial
support to the Internet Archive.

~~~
FlyMoreRockets
One would hope so. Considering the timing relative to the IA's potentially
very expensive legal battle, I fully expect this to be the case. Still,
considering CF's anti-privacy/anti-Tor stance this is a deal with the devil.
Guess I should give money directly to the IA. Considering how much value they
provide, I'll do this immediately after updating this post.

