
Why I link to Wayback Machine instead of original web content - puggo
https://hawaiigentech.com/post/commentary/why-i-link-to-waybackmachine-instead/
======
bartread
I'm not sure I'm a fan of this because it just turns WayBackMachine into
another content silo. It's called the world wide web for a reason, and this
isn't helping.

I can see it for corporate sites where they change content, remove pages, and
break links without a moment's consideration.

But for my personal site, for example, I'd much rather you link to me directly
rather than content in WayBackMachine. Apart from anything else linking to
WayBackMachine only drives traffic to WayBackMachine, not my site. Similarly,
when I link to other content, I want to show its creators the same courtesy by
linking directly to their content rather than WayBackMachine.

What I'd like to see, and I don't know if it exists yet (a quick search
suggests perhaps not), is some build task that will check all links and
replace those that are broken with links to WayBackMachine, or (perhaps
better) generate a report of broken links and let me update them manually, in
case a site or two happen to be down when my build runs.

I think it would probably need to treat redirects like broken links given the
prevalence of corporate sites where content is simply removed and redirected
to the homepage, or geo-locked and redirected to the homepage in other locales
(I'm looking at you and your international warranty, and access to tutorials,
Fender. Grr.).

I also probably wouldn't run it on every build because it would take a while,
but once a week or once a month would probably do it.
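A rough sketch of that build task in Python, standard library only (the helper names here are made up, not an existing tool); per the suggestion above, it treats redirects as broken and proposes a Wayback Machine fallback for each bad link:

```python
"""Periodic link audit: flag broken or redirected links and suggest a
Wayback Machine fallback for each. Minimal sketch; a real tool would add
retries, concurrency, and politeness delays."""
import urllib.error
import urllib.request


def wayback_fallback(url):
    # The Wayback Machine serves its latest capture of a URL at
    # web.archive.org/web/<original-url>.
    return "https://web.archive.org/web/" + url


def classify(status, redirected):
    # Per the comment above, redirects count as broken too, since removed
    # corporate pages are often redirected to the homepage.
    return "broken" if redirected or status >= 400 else "ok"


def check(url):
    """Probe one link without following redirects."""
    class NoRedirect(urllib.request.HTTPRedirectHandler):
        def redirect_request(self, *args, **kwargs):
            return None  # surface 3xx responses as HTTPError instead

    opener = urllib.request.build_opener(NoRedirect())
    try:
        with opener.open(url, timeout=10) as resp:
            return classify(resp.status, redirected=False)
    except urllib.error.HTTPError as e:
        return classify(e.code, redirected=300 <= e.code < 400)
    except urllib.error.URLError:
        return classify(599, redirected=False)  # DNS failure, timeout, etc.


if __name__ == "__main__":
    for url in ["https://example.com/some/post"]:
        if check(url) == "broken":
            print(url, "->", wayback_fallback(url))
```

Run weekly, the printed pairs would be exactly the manual-review report described above.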

~~~
silicon2401
> But for my personal site, for example, I'd much rather you link to me
> directly rather than content in WayBackMachine.

That would make sense if users were archiving your site for your benefit, but
they're probably not. If I were to archive your site, it's because I want my
own bookmarks/backups/etc to be more reliable than just a link, not because
I'm looking out to preserve your website. Otherwise, I'm just gambling that
you won't one day change your content, design, etc on a whim.

Hence I'm in a similar boat as the blog author. If there's a webpage I really
like, I download and archive it myself. If it's not worth going through that
process, I use the wayback machine. If it's not worth that, then I just keep a
bookmark.

~~~
3pt14159
The issue is that if this becomes widespread then we're going to get into
copyright claims against the wayback machine. When I write content it is mine.
I don't even let Facebook crawlers index it because I don't want it appearing
on their platform. I'm happy to have wayback machine archive it, but that's
with the understanding that it is a backup, not an authoritative or primary
source.

Ideally, links would be able to handle 404s and fall back, like we can do with
images and srcset in HTML. That way, if my content goes away, we have a backup.
I can still write updates to a blog piece or add translations that people send
in and everyone benefits from the dynamic nature of content, while still being
able to either fall back or verify content as of the time it was published via the
wayback machine.
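One way to picture that srcset-style fallback is as a resolver that tries candidate URLs in order; a toy Python sketch (the function name is hypothetical, not an existing API):

```python
def first_live(candidates, probe):
    """Return the first candidate URL whose probe reports a 2xx status,
    else None. `probe` is any callable mapping url -> HTTP status code,
    so a real client would pass an actual HTTP request function here."""
    for url in candidates:
        try:
            if 200 <= probe(url) < 300:
                return url
        except OSError:
            continue  # DNS failure, timeout, etc.: try the next candidate
    return None


# A dead original falls through to its Wayback backup:
statuses = {
    "https://example.com/post": 404,
    "https://web.archive.org/web/2020/https://example.com/post": 200,
}
first_live(statuses, statuses.__getitem__)
# -> "https://web.archive.org/web/2020/https://example.com/post"
```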

~~~
headmelted
But it’s also not guaranteed to be consistent. What if you don’t delete the
content but just change it? (I.e. what if your opinions change or you’re
pressured to edit information by a third party?).

~~~
3pt14159
I addressed this.

> I can still write updates to a blog piece or add translations that people
> send in and everyone benefits from the dynamic nature of content, while
> still being able to either fall back or verify content as of the time it was
> published via the wayback machine.

Updates are usually good. Sometimes you need to verify what was said though,
and for that wayback machine works. I agree it would be nice if there was a
technical way to support both, but for the average web request it's better to
link to the source.

------
markjgraham
We suggest/encourage people link to original URLs but ALSO (as opposed to
instead of) provide Wayback Machine URLs so that if/when the original URLs go
bad (link rot) the archive URL is available, or to give people a way to
compare the content associated with a given URL over time (content drift).

BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia
sites, in near-real-time... so that we are able to fix them if/when they
break. We have rescued more than 10 million so far from more than 30 Wikipedia
sites. We are now working to have Wayback Machine URLs added IN ADDITION to
Live Web links when any new outlinks are added... so that those references are
"born archived" and inherently persistent.

Note, I manage the Wayback Machine team at the Internet Archive. We appreciate
all your support, advice, suggestions and requests.

~~~
jhallenworld
It's interesting to think about how HTML could be modified to fix the issue.
Initial thought: along with HREF, provide AREF, a list of archive links. The
browser could automatically try a backup if the main one fails. The user
should be able to right-click the link to select a specific backup. Another
idea is to allow the web-page author to provide a rewrite rule to
automatically generate wayback machine (or whatever) links from the original.
This seems less error prone and browsers could provide a default that authors
could override.

Anyway, the fix should work even with plain HTML. I'm sure there are a bunch
of corner cases and security issues involved.

Well as mentioned by others, there is a browser extension. It's interesting to
read the issues people have with it:

[https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...](https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new/reviews/)

~~~
javajosh
So this is a little indirect, but it does avoid the case where the Wayback
machine goes down (or is subverted): include a HASHREF which is a hash of the
state of the content when linked. Then you could find the resource using the
content-addressable system of your choice. (Including, it must be said, the
wayback machine itself).
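A toy Python version of that HASHREF idea (the name is the commenter's hypothetical, not real HTML): record a hash of the content at link time, and any later copy, fetched from any host, can be verified against it.

```python
import hashlib


def hashref(content: bytes) -> str:
    """Content address: a stable digest of the bytes being linked to."""
    return "sha256-" + hashlib.sha256(content).hexdigest()


def verify(content: bytes, ref: str) -> bool:
    """True if a fetched copy still matches the hash taken at link time."""
    return hashref(content) == ref


page = b"<html><body>the post as originally published</body></html>"
ref = hashref(page)       # stored alongside the href at link time
assert verify(page, ref)  # an archived copy checks out
assert not verify(page + b"<!-- edited -->", ref)  # tampering is visible
```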

~~~
ponker
I've found that web pages have so much dynamic content these days that even
something that feels relatively static generates two different hashes on
almost every pageload.

~~~
javajosh
Indeed. I don't think you could or should hash the DOM, not least because it
is, in general, the structured output of a program. Ideally you
could hash the source. This might be a huge problem for single page
applications, except you can always pre-render a SPA at any given URL, which
solves the problem. (This is done all the time - the most elegant way is to
run e.g. React on the server to pre-render, but you can also use another
templating system in an arbitrary language, although you end up doing all
features maybe not twice, but about 1.5x).

------
bherb
Here, I fixed your link:
[https://web.archive.org/web/20200908090515/https://hawaiigen...](https://web.archive.org/web/20200908090515/https://hawaiigentech.com/post/commentary/why-i-link-to-waybackmachine-instead/)

~~~
shemnon42
Came here for this. Have my upvote.

------
outsomnia
This is a bad idea...

In the worst case, one might write a cool article and get just two hits: one
from someone noticing it exists, and the other from the archive service. After
that it might go viral, but the author may have given up by then.

The author is losing out on inbound links, so Google thinks their site is
irrelevant and gives it a bad PageRank.

All you need to do is get archive.org to take a copy at the time; you can
always adjust your link to point to that if the original is dead.

~~~
ethanwillis
Google shouldn't be the center of the Web. They could also easily determine
where the archive link is pointing to and not penalize. But I guess making
sure we align with Google's incentives is more important than just using the
Web.

~~~
bartread
> Google shouldn't be the center of the Web.

I agree, but are you suggesting it's going to be better if WayBackMachine is?

~~~
ethanwillis
That's a strawman because I never said they should be. There's room for better
alternatives.

We as a community need to think bigger rather than resigning ourselves to our
fate.

~~~
bartread
It's not a strawman because (a) I agreed with you, (b) context, and (c) I
asked a question based on what you seemed to be implying in that context: a
question to which you still haven't provided an answer.

Let me put it another way: what specifically are you suggesting as an
alternative?

~~~
ethanwillis
If I had to pick a solution from what's available right now technology wise
I'd pick something that links based on content hashes. And then pulls the
content from decentralized hosting.

I don't think I like IPFS as an organization, but tech wise it's probably what
I'd go with.

------
CaptArmchair
So, this is the problem of persistence: URLs that keep referencing the
original content, regardless of where it is hosted, in an authoritative way.

It's an okay idea to link to WB, because (a) it's de facto assumed to be
authoritative by the wider global community and (b) as an archive it provides
a promise that its URLs will keep pointing to the archived content come what
may.

Though, such promises are just that: promises. Over a long period of time, no
one can truly guarantee the persistence of a relationship between a URI and
the resource it references. That's not something technology itself solves.

The "original" URI still does carry the most authority, as that's the domain
on which the content was first published. Moreover, the author can explicitly
point to the original URI as the "canonical" URI in the HTML head of the
document.

Besides, when you link to the WB machine, what do you link to? A specific
archived version? Or the overview page with many different archived versions?
Which of those versions is currently endorsed by the original publisher, and
which are deprecated? How do you know this?

Part of ensuring persistence is the responsibility of original publisher.
That's where solutions such as URL resolving come into play. In the academic
world, DOI or handle.net are trying to solve this problem. Protocols such as
ORE or Memento further try to cater to this issue. It's a rabbit hole, really,
when you start to think about this.

~~~
kapep
> Moreover, when you link to the WB machine, what do you link to? A specific
> archived version? Or the overview page with many different archived
> versions? Which of those versions is currently endorsed by the original
> publisher, and which are deprecated? How do you know this?

WB also supports linking to the very latest version. If the archive is updated
frequently enough I would say it is reasonable to link to that if you use WB
just as a mirror. In some cases, though, I've seen error pages archived after
the original page was moved or removed, but that is probably just a technical
issue caused by some website misconfiguration or bad error handling.

------
ffpip
You can create a bookmark in Firefox to save a link quickly.

Bookmark Location-
[https://web.archive.org/save/%s](https://web.archive.org/save/%s)

Keyword - save

So searching 'save
[https://news.ycombinator.com/item?id=24406193](https://news.ycombinator.com/item?id=24406193)'
archives this post.

You can use any Keyword instead of 'save'.

You can also search with
[https://web.archive.org/*/%s](https://web.archive.org/*/%s)

~~~
bad_user
Does that `save` keyword work?

The problem is that %s gets escaped, so Firefox generates this URL, which
seems to be invalid:

[https://web.archive.org/save/https%3A%2F%2Fnews.ycombinator....](https://web.archive.org/save/https%3A%2F%2Fnews.ycombinator.com%2Fitem%3Fid%3D24406193)

~~~
aendruk
Uppercase %S for unescaped, e.g.:

[https://web.archive.org/web/*/%S](https://web.archive.org/web/*/%S)

~~~
ffpip
TIL. Thanks for the info.

------
kibibu
Can we update this link to point to the archive version?

~~~
drummer
Brilliant

------
imhoguy
This is building yet another silo and point of failure. We can't push the
entire Internet's traffic through the WayBackMachine, as its resources are
limited.

Most preservation efforts are like that, and in the end funding or business
priorities (see Google Groups) become a serious problem.

I think we need something like the web itself: distributed, and dead simple to
participate in and contribute preservation space to.

Look, there are torrents that have been available for 17 years [0]. Sure, some
uninteresting ones are long gone, but there is always a small chance somebody
still has the file and will someday come online with it.

I know about IPFS/Dat/SSB, but that stuff, like Bitcoin, is still too complex
for a layman contributor with a plainly altruistic motivation. It should be
like SETI@home: fire and forget. Eventually integrated with a browser, to
cache content you star/bookmark and share it when the original is offline.

[0] [https://torrentfreak.com/worlds-oldest-torrent-still-alive-a...](https://torrentfreak.com/worlds-oldest-torrent-still-alive-after-15-years-180929/)

------
mountainb
Link rot has convinced me that the web is not good for its ostensible purpose.
I used to roll my eyes reading how academic researchers and librarians would
discourage using webpages as resources. Many years later, it's obvious that
the web is pretty bad for anything that isn't ephemeral.

~~~
ImaCake
>I used to roll my eyes reading how academic researchers and librarians would
discourage using webpages as resources.

While this is true in general, I am amused that this is _not_ true for citing
wikipedia. Wikipedia can be trusted to remain online for many more years to
come. And it has a built-in wayback machine in the form of Revision History.

~~~
mountainb
Try following the references on big Wiki pages and you will see why Wikipedia
pages are nightmarish for any kind of research. This is important when you are
trying to drill down to the sources of various claims. Many major pages
relating to significant events and concepts are riddled with rotted links.

The page can be completely correct and accurate, but if you cannot trace the
references then it cannot be verified and you cannot make the claims in a new
work as a result. The whole point of references is to make it so that the
claims can be independently verified. Even when there isn't a link rot problem
you will often find junk references that cannot be verified.

Wikipedia isn't a bad starting point and sometimes you can find good
references. But it is not anywhere close to reliable: just trace the
references in the next 20 Wiki articles you read and your faith will be
shaken.

------
codetrotter
By that reasoning, shouldn’t you be using WayBack Machine links when
posting your own content to HN, instead of posting direct links?

------
cornedor
But how certain is the future of WayBackMachine? When disaster strikes, all
your links are dead. On the other hand, the original URL can still be read
from a Wayback link, so the original reference is not completely gone.

~~~
INTPenis
Yeah, my thoughts were more of the way Waybackmachine is funded.

I don't feel comfortable sending a bunch of web traffic to them for no reason
other than it being convenient. The wayback machine is a web archival project,
not your personal content proxy to make sure your links don't go stale.

They need our help both in funding and in action, one simple action is not to
abuse their service.

~~~
sanitycheck
Precisely my first thoughts, too. It's an archive, not a free CDN.

I hope the author of this piece considers donating and promoting donation to
their readers: [https://archive.org/donate/](https://archive.org/donate/)

------
romwell
Good idea, but why not both (i.e. link to a webpage, _and_ to the Archive)?

Linking to Archive only makes Archive a single point of failure.

~~~
sseneca
Yes, this makes the most sense in my opinion:

Check out [this link](https://...) ([archived](https://...))

This can also help in the event of a "hug of death"

~~~
roberto
This is what I do on my blog, with some additional metadata:

    
    
        <p>
          <a 
            data-archive-date="2020-09-01T22:11:02.287871+00:00"
            data-archive-url="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs"
            href="https://reubenwu.com/projects/25/aeroglyphs"
          >
            Aeroglyphs
          </a>
          <span class="archive">
            [<a href="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs">archived</a>]
          </span>
          is an ongoing series of photos of nature with superimposed geometrical shapes drawn by drones.
        </p>

------
NateEag
I understand where the author is coming from, but I think the best approach is
to write your content with direct links to the canonical versions of articles.

Have a link checking process you run regularly against your site, using some
of the standard tools I've mentioned elsewhere in this thread:

[https://www.npmjs.com/package/broken-link-checker-local](https://www.npmjs.com/package/broken-link-checker-local)

[https://linkchecker.github.io/linkchecker/](https://linkchecker.github.io/linkchecker/)

When you run the link check (which should be regularly, perhaps at least
weekly), also run a process that harvests the non-local links from your site
and 1) adds any new links' content to your own local, unpublished archive of
external content, and 2) submits those new links to archive.org.

This keeps canonical URLs canonical, makes sure content you've linked to is
backed up on archive.org so a reasonably trustworthy source is available
should the canonical one die out, and gives you your own backup in case
archive.org and the original both vanish.

I don't currently do this with my own sites, but now I'm questioning why not.
I already have the regular link checks, and the second half seems pretty
straightforward to add (for static sites, anyway).
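That second half could look roughly like this in Python (a sketch under assumptions: the local archive layout and function names are made up; `https://web.archive.org/save/<url>` is the Wayback Machine's public capture endpoint):

```python
"""After the link check: keep a private copy of each new external link
and ask the Wayback Machine to capture it too."""
import pathlib
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_request_url(link):
    """URL that asks the Wayback Machine to capture `link` now."""
    return SAVE_ENDPOINT + link


def archive_locally(link, dest_dir="link-archive"):
    """Fetch `link` and store the raw response bytes in a local archive."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(exist_ok=True)
    name = link.replace("://", "_").replace("/", "_")
    with urllib.request.urlopen(link, timeout=30) as resp:
        (dest / name).write_bytes(resp.read())


if __name__ == "__main__":
    for link in ["https://example.com/interesting-article"]:
        archive_locally(link)  # 1) local, unpublished archive
        urllib.request.urlopen(save_request_url(link), timeout=60)  # 2) archive.org
```

In practice you would rate-limit the save requests; archive.org throttles heavy clients.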

------
susam
I think the fundamental problem here is that URLs locate resources. We find
the desired content by finding its location given by an address. Now what
server or content lives on that address may change from time to time or may
even disappear. This leads to broken links.

The problem with linking to the Wayback Machine is that we are still writing
location-based URLs, just ones pointing at Wayback Machine servers. What
guarantee is there that those archive.org links will not break in the future?

It would have been nice if the web were designed to be content-addressable.
That is, the identifier or string we use to access a content addresses the
content directly, not a location where the content lives. There is good effort
going on in this area in the InterPlanetary File System (IPFS) project but I
don't think the mainstream content providers on the Internet are going to move
to IPFS anytime soon.

------
yreg
I'm all for Archive.org. However, using it in this way — setting up a mirror
of some content and purposefully diverting traffic to said mirror — is
copyright infringement (freebooting), as it competes with the original source.

------
j1elo
This is a bad idea for the reasons that other commenters have already stated.
If WayBackMachine falls, all links fall with it. Indeed, the "Web" would stop
being one if all links point into the same service.

For docs and other texts, I just link to the original site and add an
(Archive) suffix, e.g. the "Sources" section in [https://doc-kurento.readthedocs.io/en/latest/knowledge/nat.h...](https://doc-kurento.readthedocs.io/en/latest/knowledge/nat.html#nat-types-and-nat-traversal)

That is a simple and effective solution; yes, it is a bit more cumbersome, but
it does not bother me.

------
asdfman123
> So in Feb 14 2019 your users would have seen the content you intended.
> However in Sep 07 2020, your users are being asked to support independent
> Journalism instead.

Can you believe it? Yesterday, I tried to walk out of the grocery store with a
head of lettuce for free, and they instead were more interested in making me
pay money to support the grocery and agricultural business!

~~~
monktastic1
Right. I thought it was pretty bad form for him to call this "spam," as though
they're the ones wronging _him._

------
koboll
This seems like a problem that would be better solved by something like:

1. Browsers build in a system whereby, if a link appears dead, they first
check against the Wayback Machine to see if a backup exists.

2. If it does, they go there instead.

3. In return for this service, and to offset costs associated with increased
traffic, they jointly agree to financially support the Internet Archive in
perpetuity.

------
aldo712
Here's a WayBackMachine Link to this article. :)

[https://web.archive.org/web/20200908090515/https://hawaiigen...](https://web.archive.org/web/20200908090515/https://hawaiigentech.com/post/commentary/why-i-link-to-waybackmachine-instead/)

------
dltj
Take a look at _Robustify Your Links_.[1] It is an API and a snippet of
JavaScript that saves your target HREF in one of the web archiving services
and adds a decorator to the link display that offers the user the option to
view the web archive.

[1]
[https://robustlinks.mementoweb.org/about/](https://robustlinks.mementoweb.org/about/)

------
wolco
No one has touched on this, but the experience of viewing pages through the
waybackmachine is awful.

Media often isn't saved, so pages look broken. Iframes, and the iframe-breaker
scripts on original sites, can kill any navigation.

The waybackmachine is okay for research, but a poor replacement for a
permalink.

~~~
ethagnawl
> Media many times will not be saved so pages look broken.

In my experience, this has gotten much, much better in the last few years. I
haven't explored enough to know if this is part of the archival process or
not, but I've noticed on a few occasions that assets will suddenly appear some
time after archiving a page. For instance, when I first archived this page
([https://web.archive.org/web/20180928051336/https://www.intel...](https://web.archive.org/web/20180928051336/https://www.intelfuturequiz.com/)),
none of the stylesheets, scripts, fonts or images were present. However, after
some amount of time (days/weeks) they suddenly appeared and I was able to use
the site as it originally appeared.

------
shortformblog
This man’s entire argument is completely terrible for two reasons:

1) The example he uses is The Epoch Times, a questionable source even on the
best of days.

2) What he refers to as “spam” is a paywall. He is literally taking away from
business opportunities for this outlet that produced a piece of content he
wants to draw attention to, but he does not want to otherwise support.

He’s a taker. And while the Wayback Machine is very useful for sharing
archived information, that’s not what this guy is doing. He’s trying to
undermine the business model of the outlets he’s reading.

The Epoch Times is one thing—it’s an outlet that is essentially propaganda—but
when he does this to a local newspaper or an actual independent media outlet,
what happens?

~~~
Ensorceled
> 2) What he refers to as “spam” is a paywall. He is literally taking away
> from business opportunities for this outlet that produced a piece of content
> he wants to draw attention to, but he does not want to otherwise support.

For the destination site, this is all of the downsides of AMP with none of the
upsides.

------
celsoazevedo
Is there any WordPress plugin that adds a link to the WayBack Machine next to
the original link? I would use something like that.

~~~
aargh_aargh
Look at the format of the wayback machine URL. It's trivial to generate.

Where a WP plugin would add value is by saving to the archive whenever WP
publishes a new or edited article.
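For reference, the format the parent means is just a capture timestamp (`YYYYMMDDhhmmss`) spliced in front of the original URL; a one-function Python sketch:

```python
def wayback_link(url, timestamp="20200908090515"):
    # https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
    return f"https://web.archive.org/web/{timestamp}/{url}"


wayback_link("https://example.com/post")
# -> "https://web.archive.org/web/20200908090515/https://example.com/post"
```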

------
wila
The idea of being able to access the content once the original URL is gone is
good. However, this also means that any updates made to the original page are
no longer seen.

Not all updates are about "begging for money" like the example in the article.

------
nikisweeting
Or link to your own archive of the content with ArchiveBox!

That way we're not all completely reliant on a central system. (ArchiveBox
submits your links to Archive.org in addition to saving them locally).

[https://github.com/pirate/ArchiveBox](https://github.com/pirate/ArchiveBox)

Also many other tools that can do this too:

[https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...](https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives)

------
krapp
Apropos of nothing, but I added the ability to archive links in Anarki a few
months back[0]. If dang or someone wants to take it for HN, they're welcome
to. Excuse the crappy quality of my code and PR format, though.

It might be useful as a backup if the original site starts getting hugged to
death.

[0][https://github.com/arclanguage/anarki/pull/179](https://github.com/arclanguage/anarki/pull/179)

------
fornowiamhere
> _Now it’s spam from a site suffering financial need._ Well, yeah!

Of course, linking to WBM is not the main reason why a site might be in this
situation, but it piles up.

------
hownottowrite
Awesome. Hey, mods... Can you change the link on this post to
[http://web.archive.org/web/20200908090515/https://hawaiigent...](http://web.archive.org/web/20200908090515/https://hawaiigentech.com/post/commentary/why-i-link-to-waybackmachine-instead/)

------
stratigos
I link to WayBackMachine because, as a freelancer, I've built a great many
greenfield applications for startups, which only existed for about 6-8 months
before hitting their burn rate. If I linked to their original domains, my
portfolio would be a list of 404s.

------
rmoriz
I once discovered an information leak at the German public broadcasting
organization ARD, which exposed real mobile numbers on their CI/CD page where
they showed the business card designs (lol).

All records of this page on Archive.org were deleted after a couple of days; a
Twitter account posting the details with a screenshot and link was reported,
and my account was temporarily suspended.

I assume it must be very easy to remove inconvenient content from archive.org.

(in German) [https://blog.rolandmoriz.de/2019/04/25/sind-die-leute-von-de...](https://blog.rolandmoriz.de/2019/04/25/sind-die-leute-von-der-ard-so-doof/)

------
runxel
While I certainly wouldn't do this with every page, nor every time, I've
gotten so anxious about link rot lately that I reflexively save any good
content I come across to the Waybackmachine.

The use of the bookmarklet makes this really convenient.

------
AnonHP
WayBackMachine is slow (slower than many bloated websites). So it’s not a good
enough experience for the person clicking on that link.

Secondly, I personally don’t like the fact that WayBackMachine doesn’t provide
an easy way to get content removed and to stop indexing and caching content
(the only way I know is to email them, with delayed responses or responses
that don’t help). It’s far easier to get content de-indexed in the major
search engines. I know that the team running it has its reasons to archive
anything and everything (as) permanently (as possible), but it doesn’t serve
everybody’s needs.

------
euske
This is both a good and a scary idea. For the good part, I'm frustrated enough
that some unscrupulous websites (even some news outlets) secretly alter their
content without mentioning the change; I want a mechanism that holds the
publisher responsible. At the same time, this is scary because we're basically
using one private organization as a single arbitrator. (I know it's a
nonprofit, but they're probably not as public as a government entity.) Maybe
it's good for the time being, but we should be aware that this solution is far
from perfect.

~~~
anaganisk
Public "or" a government entity.

------
cpcallen
This seems like a risky strategy, what with the pending lawsuit against
archive.org over their National Emergency Library: I am fully expecting that
web.archive.org will go away permanently within a few years.

------
rkagerer
I link to the original, but archive it in both WayBackMachine and Archive.is.

------
uniqueid
Yeah, that's another problem with the design of the web, and kind of a
significant one! Somewhat pointless to link to external documents when half of
them won't be around next year.

------
woko
As others mentioned, it is a good habit to request that the page be archived.
You don't have to link to the archive, but you would have the option to if the
page were to disappear in the future.

I wish I had done this 15 years ago for a small project/website. Nowadays, my
website is there, with all of its content, but most of the awesome references
which I had linked to are unavailable. I wrote "most", but it is close to all
of them.

------
luord
While I generally disagree because I'd rather my site was the one getting the
hits—and I would rather give the same courtesy to other authors—this does give
me the idea of checking for (or creating, if none exists) an archive link for
whatever I reference, and including that archive link in the metadata of every
link I include.

Users will find the archive link if they really want to, and it will make it
easier for me to replace broken links in the future.

------
8bitsrule
Gotta completely agree ... for anything you need to be stable and available.

I've been building lists of -reference- URLs for over a decade ... and the
ones aimed at Archive.org (are slower to load, but) are much more reliable.

Saved Wayback URLs contain the original site URL. It's really easy to check it
to see if the site has deteriorated (usually it has). If it's gotten better
... it's easy to update your saved WB link.

------
jakeogh
If it's not distributed, it is going to disappear.

The waybackmachine is backed by WARC files. They're perhaps the only thing on
archive.org that can't be downloaded... well, except the original MPG files of
the 9/11 news footage.

[https://news.ycombinator.com/item?id=20623177](https://news.ycombinator.com/item?id=20623177)

------
samatman
This is such a fundamental problem that I'd like to be able to solve it at the
HTML level.

An anchor type which allows several URLs, to be tried in order, would go a
long way. Then we could add automatic archiving and backup links to a CMS.

It isn't real content-centric networking, which is a pity, but it's achievable
with what we have.

------
ffpip
The wayback machine helps me on a daily basis. So many old links are dead.

The other day, I noticed that even old links from the front page of Google and
Youtube are dead now. Internet Archive still has them. These were links on the
front page of YT. I was very disappointed that even Google has dead links.

------
spqr233
I made a chrome extension called Capsule that works perfectly for this use
case. With just a click, you can create a publicly shareable link that
preserves the webpage exactly as you see it in your browser.

[https://capsule.click](https://capsule.click)

~~~
nikisweeting
Does it use SingleFile under-the-hood? What storage format does this use, is
it portable? e.g. WARC/memento/zim/etc?

------
ashishb
I wrote a link checker[1] to detect outbound links and mark dead ones, so that
I can replace them manually with archive.org links.

1 - [https://github.com/ashishb/outbound-link-checker](https://github.com/ashishb/outbound-link-checker)

------
nullandvoid
I experienced this just the other day.

I was browsing an old HN post from 2018, with lots of what seemed like useful
links to the poster's blog.

Upon visiting it, I found the site had been rebranded and the blog entries had
disappeared.

Waybackmachine saved me in this case, but a link to it originally would have
saved me a few clicks.

------
m-p-3
I still link to the original URL because the author deserves the ad revenue
and traffic, but I archive a copy to the Wayback Machine just in case the
website can't handle the load, so there is an alternative way of getting the
content.
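Triggering such a capture can be scripted: fetching `https://web.archive.org/save/<url>` asks the Wayback Machine to take a snapshot. A hedged sketch (the endpoint's response details may change over time; the function names are mine):

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_request_url(url):
    """URL that triggers a Wayback Machine capture when fetched."""
    return SAVE_ENDPOINT + url

def save_snapshot(url, timeout=60):
    """Ask the Wayback Machine to capture `url`.

    Returns the Content-Location header, which the service has used to
    point at the newly created snapshot, or None if it is absent.
    """
    req = urllib.request.Request(save_request_url(url),
                                 headers={"User-Agent": "archiver/0.1"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.headers.get("Content-Location")
```

A publish hook could call this for every outbound link in a new post, so the snapshot exists from day one even though readers are sent to the original.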

------
Cthulhu_
If it's to actually reference a third party source, it's probably better to
make a self-hosted copy of the page. You can print it to a PDF file for
example. I don't believe archive.org is eternal, or that its pages will remain
the same.

------
PhilosAccnting
Thank you! I've only been using the labor-intensive trust-issues version of
this: paraphrasing things in my own words and linking to THAT.

I think I've been curating about 200 essays so far like that. You're now
making me rethink my flow.

------
tannhaeuser
The proper way is for a site to expose a canonical link to an article via a
meta-link (rel=canonical) if necessary, and then have a browser plugin
automatically try archive.org with a URL generated from the canonical one if
the original is down.
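The plugin side of this can be sketched in a few lines of stdlib Python (class and function names are mine): extract the rel=canonical href from the page and derive a Wayback Machine URL from it.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical" and self.canonical is None:
            self.canonical = a.get("href")

def wayback_from_canonical(html):
    """Wayback Machine URL derived from the page's canonical link, or None."""
    finder = CanonicalFinder()
    finder.feed(html)
    if finder.canonical:
        return "https://web.archive.org/web/" + finder.canonical
    return None
```

Keying the archive lookup on the canonical URL rather than the address bar avoids missing snapshots because of tracking parameters or mirror domains.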

------
ponker
What would be even cooler is if there was an easy way to turn your own server
into a Wayback machine, so that when your server rendered a webpage, it would
use the original link if available, or its own cached version if not.

------
EllieEffingMae
I maintain a fork of a program that does exactly this! You can check it out
here:

[https://github.com/Lifesgood123/prevent-link-
rot](https://github.com/Lifesgood123/prevent-link-rot)

------
michaelanckaert
In the past I would fall back to the WBM when something was no longer online.
Recently, though, I've been bookmarking interesting content very rigorously and
just rely on the archival feature of my bookmarking software.

------
drummer
For anything important you can't beat a good save-to-PDF feature in the
browser. You can then upload the PDF and link to that instead. Someone should
make a WordPress plugin to do this automatically.

------
ique
Just another reason to have content-addressable storage everywhere: then at
least if the content changed you'll know it changed, and if you can't get the
original content anymore, the change is probably malicious.
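The core of content addressing: name content by a cryptographic hash of its bytes, so any change to the bytes changes the address. A minimal sketch:

```python
import hashlib

def content_address(data: bytes) -> str:
    """Address content by the hex SHA-256 digest of its bytes."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, address: str) -> bool:
    """True only while the bytes still match the address they were fetched by."""
    return content_address(data) == address
```

A link that carries the hash alongside the URL lets a reader detect silent edits even when the original host holds the only remaining copy.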

------
axelfreeman
You could link to the original web URL and also make a print version of the web
content as a PDF. That's how I archive howtos and write-ups of interesting
content: print view, then create a PDF version.

------
hgo
Maybe the solution isn't technical and we should look at other fields that
have relied on referencing credible sources for a long time? I can think of
research, news and perhaps law.

------
not2b
It's probably better to link to both. If a site corrects a story, your readers
will want to see the correction, but if the page disappears, it's good to have
the backup.

------
andy_ppp
It would be good to create a distributed, consensus version (to help stop
edits) of the content rather than have a single point of failure...

------
scruffyherder
So it can be deleted too?

Or so there is no engagement at the source?

------
LostJourneyman
There's some subtle irony in that the linked site is not in fact a
WayBackMachine link, but instead a direct link to the site.

------
arnoooooo
On the same topic, I wish I could link with highlights in the page. Having a
spec for highlights in URLs would be neat.

~~~
basscomm
Chrome 80 supports this:
[https://www.chromestatus.com/feature/4733392803332096](https://www.chromestatus.com/feature/4733392803332096)
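The Chrome feature linked above is the Text Fragments proposal: appending `#:~:text=<encoded snippet>` to a URL asks a supporting browser to scroll to and highlight that snippet. A small helper (the function name is mine):

```python
from urllib.parse import quote

def text_fragment_link(url: str, snippet: str) -> str:
    """Build a link that highlights `snippet` in browsers that
    support the text-fragment syntax."""
    return f"{url}#:~:text={quote(snippet)}"
```

Browsers without support simply ignore everything after `#`, so the link degrades gracefully to a plain URL.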

------
spurgu
I think a good solution might be to host the archive version yourself
(archive.org is slow, and always using it centralizes everything there).

Let's say you write an article on your site, [https://yoursite.com/my-
article](https://yoursite.com/my-article), and from it you want to link to an
article [https://example.com/some-article](https://example.com/some-article)

You then create a mirror of [https://example.com/some-
article](https://example.com/some-article) to be served from your site at
[https://yoursite.com/mirror/2019-09-08/some-
article](https://yoursite.com/mirror/2019-09-08/some-article) (put /mirror/ in
robots.txt and set to noindex (or maybe even better to put a rel="canonical"
towards the original article?)) and on the top of this mirrored page you add a
header bar thingy containing a link to the original article, as well as one to
archive.org if you so want.

tl;dr instead of linking to [https://example.com/some-
article](https://example.com/some-article) you link to
[https://yoursite.com/mirror/2019-09-08/some-
article](https://yoursite.com/mirror/2019-09-08/some-article) (which has links
to the original)
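The noindex/canonical setup described above might look like this on the mirrored page (using the hypothetical URLs from the comment):

```html
<!-- in the <head> of https://yoursite.com/mirror/2019-09-08/some-article -->
<meta name="robots" content="noindex">
<link rel="canonical" href="https://example.com/some-article">
```

With `Disallow: /mirror/` in robots.txt as a belt-and-braces measure, search engines should neither index the mirror nor treat it as duplicate content competing with the original.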

------
zoid_
I find that web archive pages always appear broken; perhaps a lot of JS or
CSS is not properly archived?

------
CassSunscreen
Everyone should be doing this, in my opinion; articles get pulled all the time.
------
sebastianconcpt
Clever way to make the reference immutable.

Some blockchain will end up taking care of this.

------
ImAlreadyTracer
Is there a Chrome extension that utilises the Wayback Machine?

------
LoSboccacc
Has the Wayback Machine stopped retroactively applying robots.txt?

If not, links to it are one misconfiguration or one parked domain away from
being wiped.

------
eruci
WBM is like a content snapshot. You can't go back in time and change anything.
That's why it is better than linking to the original.

------
Andrew_nenakhov
Hmm. Is there a place for a service that makes a permanent copy of content,
available at the original URL at the time of posting?

------
prgmatic
I stopped reading after the part where they describe the paywall-gated version
of the journalism website as "Now it's spam from a site suffering financial
need."

That website spends money creating content for commercial viability, it
doesn’t have to bow to you and make sure you can consume it for free, and the
Wayback Machine isn’t a tool for you to bypass premium content.

------
TheSpiceIsLife
This behaviour should be reported to the WayBackMachine as abuse.

------
dirtnugget
He is actually showcasing a very nice technique for getting around paywalls:
turning off JS. Often that alone is enough. I believe the archives also
disable JS when grabbing the content.

~~~
rchaud
That is changing. I've noticed over the past couple of years that sites that
could be accessed with JS turned off are now showing a "Please enable
Javascript to continue" (Quora) or just hiding the content entirely (Business
Insider).

I'm sure there are other examples as well.

~~~
dirtnugget
Not surprised. When paywalls started becoming a thing most of them could be
circumvented simply by removing a DOM element and some CSS classes. Nowadays
this is basically not possible anywhere anymore.

------
icemelt8
Just FYI, archive.org is banned in a few countries, including the UAE, where I
cannot open any links from there.

~~~
dirtnugget
Huh, I wonder if they are also blocking mirrors. Also, in countries with
restrictions on internet access you probably want to make using Tor a general
habit.

------
s9w
In practice however, archive.org did censor content based on political
preference.

~~~
encom
Sounds plausible, but I sure would like a citation for that claim.

~~~
dependenttypes
They exclude Snopes and I think Salon from archiving.

~~~
Hitton
Do you have any source on that? Sites can request archive.org to stop
archiving them and to delete what is currently archived. They can do it for
any reason; concealing changes of article contents might be one of them.

~~~
dependenttypes
[https://web.archive.org/web/*/snopes.com](https://web.archive.org/web/*/snopes.com)

> Sorry.

> This URL has been excluded from the Wayback Machine.

They also do not exclude the archive.org bot in
[https://www.snopes.com/robots.txt](https://www.snopes.com/robots.txt)

~~~
Hitton
That only shows that it's excluded, not for what reason. In 2017 the Internet
Archive announced it would start ignoring robots.txt. When I tried to archive
a random Facebook page (one disallowed by robots.txt), it archived it happily.
AFAIK the current way to exclude your site requires contacting
info@archive.org and proving that the site is yours.

------
k1m
I think this is a good idea, especially because the WayBackMachine uses good
content security policies to block some of the intrusive JS that ad-dependent
sites like to push on people. So you're not only protecting against future 404
scenarios, but also protecting your visitors' privacy from unscrupulous ad-
tech, which seems to be everywhere now.

The example provided in the article, showing how a site looked cleaner before,
could simply be the content security policies at the WayBackMachine preventing
the clutter from getting loaded, rather than any specific changes on the site,
although I haven't checked that particular site.

