Why I link to Wayback Machine instead of original web content (hawaiigentech.com)
578 points by puggo 10 months ago | 248 comments



I'm not sure I'm a fan of this because it just turns WayBackMachine into another content silo. It's called the world wide web for a reason, and this isn't helping.

I can see it for corporate sites where they change content, remove pages, and break links without a moment's consideration.

But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine. Apart from anything else linking to WayBackMachine only drives traffic to WayBackMachine, not my site. Similarly, when I link to other content, I want to show its creators the same courtesy by linking directly to their content rather than WayBackMachine.

What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine, or (perhaps better) generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.

I think it would probably need to treat redirects like broken links given the prevalence of corporate sites where content is simply removed and redirected to the homepage, or geo-locked and redirected to the homepage in other locales (I'm looking at you and your international warranty, and access to tutorials, Fender. Grr.).

I also probably wouldn't run it on every build because it would take a while, but once a week or once a month would probably do it.
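A minimal sketch of such a build task, assuming Node 18+ for the global `fetch`. All function names here are invented for illustration, and it follows the comment's design: redirects count as broken, and the output is a report for manual review rather than an in-place rewrite.

```javascript
// Hypothetical sketch of the link-audit build task described above.
const WAYBACK_PREFIX = "https://web.archive.org/web/";

// Build a Wayback Machine fallback URL ("*" means "latest capture").
function waybackUrl(url, timestamp = "*") {
  return `${WAYBACK_PREFIX}${timestamp}/${url}`;
}

// Classify a response: hard failures AND redirects both count as broken,
// since corporate sites often redirect removed pages to the homepage.
function classify(status, redirected) {
  if (status >= 400) return "broken";
  if (redirected) return "redirected";
  return "ok";
}

// Produce a report rather than rewriting links in place, so a site that
// happens to be down during the build doesn't get silently replaced.
async function auditLinks(urls) {
  const report = [];
  for (const url of urls) {
    try {
      const res = await fetch(url, { method: "HEAD" });
      const verdict = classify(res.status, res.redirected);
      if (verdict !== "ok") report.push({ url, verdict, fallback: waybackUrl(url) });
    } catch {
      // Network-level failure (DNS, timeout): also propose a fallback.
      report.push({ url, verdict: "unreachable", fallback: waybackUrl(url) });
    }
  }
  return report;
}
```

Run on a weekly or monthly schedule rather than per build, as suggested above, since HEAD-requesting every outbound link is slow.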


> But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine.

That would make sense if users were archiving your site for your benefit, but they're probably not. If I were to archive your site, it's because I want my own bookmarks/backups/etc to be more reliable than just a link, not because I'm looking out to preserve your website. Otherwise, I'm just gambling that you won't one day change your content, design, etc on a whim.

Hence I'm in a similar boat as the blog author. If there's a webpage I really like, I download and archive it myself. If it's not worth going through that process, I use the wayback machine. If it's not worth that, then I just keep a bookmark.


The issue is that if this becomes widespread then we're going to get into copyright claims against the wayback machine. When I write content it is mine. I don't even let Facebook crawlers index it because I don't want it appearing on their platform. I'm happy to have wayback machine archive it, but that's with the understanding that it is a backup, not an authoritative or primary source.

Ideally, links would be able to handle 404s and fall back, like we can do with images and srcset in HTML. That way if my content goes away we have a backup. I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify content at the time it was published via the Wayback Machine.
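No srcset-style fallback exists for links today, but the idea can be sketched client-side. The `data-archive` attribute is invented for this sketch, and the probe is injectable so the decision logic works without a network:

```javascript
// Sketch of a srcset-like fallback for links. "data-archive" is an
// invented attribute, not a standard:
//   <a href="https://example.com/post"
//      data-archive="https://web.archive.org/web/2020/https://example.com/post">...</a>

// Default probe: ask the server for a status code with a HEAD request.
async function headStatus(url) {
  const res = await fetch(url, { method: "HEAD" });
  return res.status;
}

// Follow the original if it answers with a non-error status;
// otherwise fall back to the archived copy.
async function resolveLink(href, archiveHref, probe = headStatus) {
  if (!archiveHref) return href;
  try {
    const status = await probe(href);
    return status < 400 ? href : archiveHref;
  } catch {
    return archiveHref; // original unreachable: use the archive
  }
}
```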


There already have been copyright claims against The Wayback Machine. They've been responding to it by allowing site owners to use robots.txt to remove their content.


I politely claim that your view is unrealistic (for published content). You may legally own it, but the instant you make content available to a party other than yourself, you lose any guarantee that you control it. Like I said in my earlier comment, if I find your site and like it, it gets downloaded and saved into my archive. Somebody else could trivially copy and paste or screenshot it to facebook.

I feel similarly to you: I want to own and control what I create. However I'm also realistic about the consequences of publishing it, so I don't publish anything I create beyond personally showing stuff to people who are close to me, and preferably from my own equipment directly. Unless you're doing the same, you don't actually control your content.

This may seem like a neurotic approach, but if you actually care about your content, it's not. It's not difficult to find cases of content being stolen and reused without the creator knowing; e.g. https://www.youtube.com/watch?v=w7ZQoN6UrEw


But it’s also not guaranteed to be consistent. What if you don’t delete the content but just change it? (I.e. what if your opinions change or you’re pressured to edit information by a third party?).


I addressed this.

> I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify content at the time it was published via the Wayback Machine.

Updates are usually good. Sometimes you need to verify what was said though, and for that wayback machine works. I agree it would be nice if there was a technical way to support both, but for the average web request it's better to link to the source.


Perhaps the wayback machine can help fix that by telling users to visit the authoritative site and demanding a confirmation clickthrough before showing the archived content.


> Perhaps the wayback machine can help fix that by telling users to visit the authoritative site and demanding a confirmation clickthrough before showing the archived content.

I'm trying to figure out if you're being ironic or serious.

People on here (rightly) spend a lot of time complaining about how user experience on the web is becoming terrible due to ads, pop-ups, pop-unders, endless cookie banners, consent forms, and miscellaneous GDPR nonsense, all of which get in the way of whatever it is you're trying to read or watch, and all of it on top of the more run-of-the-mill UX snafus with which people casually litter their sites.

Your idea boils down to adding another layer of consent clicking to the mess, to implement a semi-manual redirect through the WayBackMachine for every link clicked. That's ridiculous.

I have to believe you're being ironic because nobody could seriously think this is a good idea.


Agreed. Cut the clutter and keep it simple, like the HN website does.


It's a deep problem with the web as we know it.

Say I want to make a "scrapbook" to support a research project of some kind. Really I want to make a "pyramid": a general overview that is at most a few pages at the top, then some documents that are more detailed, but with the original reference material incorporated and linked to what it supports.

In 2020 much of that reference material will come from the web, and you are left either doing the "webby" thing (linking), which is doomed to fall victim to broken links, or archiving the content, which is OK for personal use but will not be OK with the content owners if you make it public. You could say the public web is also becoming a cesspool/crime scene, where even reputable web sites are suspected of pervasive click fraud and the line between marketing and harassment gets harder to see every day.


Is it a deep problem? You can download content you want to keep. There are many services like evernote and pocket that can help you with it.


It is, because it ultimately comes down to owner's control of how their content is being used.

For example, a modern news site will want the ability to define which text is "authoritative", and make modifications to it on the fly, including unpublishing it. As a reader OTOH, I want a permanent, immutable copy of everything said site ever publishes, so that silent edits and unpublishing is not possible. These two perspectives are in conflict, and that conflict repeats itself throughout the entire web.


Some consumers will want the latest and greatest content. To please everyone (other than the owner) you'd need to look at the content across time, versions, alternate world views,... Thus "deep".

My central use case is that I might 'scrape' content from sources such as

https://en.wikipedia.org/wiki/List_of_U.S._states_and_territ...

and have the process be "repeatable" in the sense that:

1. The system archives the original inputs and the process to create refined data outputs

2. If the inputs change the system should normally be able to download updated versions of the inputs, apply the process and produce good outputs

3. If something goes wrong there are sufficient diagnostics and tests that would show invariants are broken, or that the system can't tell how many fingers you are holding up

4. and in that case you can revert to "known good" inputs

I am thinking of data products here, but even if the 'product' is a paper, presentation, or report that involves human judgements there should be a structured process to propagate changes.


> If it's not worth that, then I just keep a bookmark.

I've made a habit of saving every page I bookmark to the WayBackMachine. To my mind, this is the best of both worlds: you'll see any edits, additions, etc. to the source material, and if something you remember has been changed or gone missing, you have a static reference. I just wish there was a simple way to diff the two.

I keep meaning to write browser extensions to do both of these things on my behalf ...


I can understand posting a link, plus an archival link just in case the original content is lost. But linking to an archival site only is IMO somewhat rude.


> What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine

Addendum: First, that same tool should – at the time of creating your web site / blog post / … – ask WayBackMachine to capture those links in the first place. That would actually be a very neat feature, as it would guarantee that you could always roll back the linked websites to exactly the time you linked to them on your page.
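A sketch of that capture-at-publish-time step, using the Wayback Machine's public Save Page Now endpoint (a plain GET to https://web.archive.org/save/&lt;url&gt;, also mentioned elsewhere in this thread). The delay between requests is an arbitrary politeness value, not a documented rate limit:

```javascript
// Construct the Save Page Now request target for a URL.
function saveUrl(url) {
  return `https://web.archive.org/save/${url}`;
}

// At publish time, ask the archive to capture every outbound link,
// spaced out so we don't hammer the service.
async function captureAll(urls, delayMs = 5000) {
  for (const url of urls) {
    await fetch(saveUrl(url)); // trigger the capture
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```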


I don't care enough to look into it, but I think Gwern has something like this set up on gwern.net.


Doesn't Wikipedia do something like this? If not, the WBM/Archive.org does something like it on Wikipedia's behalf.


Gwern.net has a pretty sophisticated system for this https://www.gwern.net/Archiving-URLs


Would be nice if there's an automatic way to have a link revert to the Wayback Machine once the original link stops working. I can't think of an easy way to do that, though.


Brave browser has this built in: if you end up at a dead link, the address bar offers to take you to the Wayback Machine.

http://blog.archive.org/2020/02/25/brave-browser-and-the-way...


This was first implemented in Firefox, as an experiment, and is now an extension:

https://addons.mozilla.org/ro/firefox/addon/wayback-machine_...


I used this extension for a while but had to stop due to frequent false positives. YMMV


There exists a manual extension called Resurrect Pages for Firefox 57+, with Google Cache, archive.is, Wayback Machine, and WebCite.


I just use a bookmarklet

    javascript:void(window.open('https://web.archive.org/web/*/'+location.href.replace(/\/$/,%20'')));
(which is only slightly less convenient than what others have already pointed out — the FF extension and Brave built-in feature).


Another nice solution is to create a "search engine" for https://web.archive.org/web/*/%s. You can then just add the keyword before the URL (for example, I type `<Ctrl-l><Left>w <Enter>`). Search engines like this are supported by Chrome and Firefox.


I would love for there to be a site that redirected eg. better.site/ https://www.youtube.com/watch?v=jzwMjOl8Iyo to https://invidious.site/watch?v=jzwMjOl8Iyo so I could easily open YouTube links with Invidious, and the same for Twitter→Nitter, Instagram→bibliogram, Google Maps → OSM, etc without having to manually remove the beginning of the URL. I’d presume someone on HN has the skill to do this similarly to https://news.ycombinator.com/item?id=24344127


You can make a "search engine" or bookmarklet that is a javascript/data URL that does whatever URL mangling you need. (Other than some minor escaping issues).

Something like the following should work. You can add more logic to support all of the sites with the same script, or make one per site.

    javascript:document.location="%s".replace(/^https:\/\/www\.youtube\.com/, "https://invidious.site")
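To "support all of the sites with the same script", the single replace above generalizes to a table of rules. The non-YouTube mirror hostnames below are `.example` placeholders, not real instances; substitute instances you trust:

```javascript
// Table-driven version of the single-site replace above.
// Instance hostnames (other than the one quoted in the thread)
// are illustrative placeholders.
const REWRITES = [
  [/^https:\/\/(www\.)?youtube\.com/, "https://invidious.site"],
  [/^https:\/\/(www\.)?twitter\.com/, "https://nitter.example"],
  [/^https:\/\/(www\.)?instagram\.com/, "https://bibliogram.example"],
];

// Apply the first matching rule; leave unrecognized URLs untouched.
function rewrite(url) {
  for (const [pattern, target] of REWRITES) {
    if (pattern.test(url)) return url.replace(pattern, target);
  }
  return url;
}
```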


Wikipedia just does "$some-link-here (Archived $archived-version-link)", and it works pretty well, IMO.


For me that is the real solution: you know that the archived link is the one consulted by the author, while the normal link shows the content as it stands now (or its evolution).


Agreed, and it shouldn't be too much of a burden to use since the author was quite clear about it being for reference materials. The idea isn't all that different from referring to specific print editions.


IIRC Wikipedia has some logic for this. When you add a reference, it automatically makes sure the page is backed up, and if not it triggers a Wayback copy; it then scans for dead links in references, and if one is found it replaces the link with Wayback.


Either a browser extension, or an 'active' system where your site checks the health of the pages it links to.



Their browser extension does exactly that...


The International Internet Preservation Consortium is attempting a technological solution that gives you the best of both worlds in a flexible way, and is meant to be extended to support multiple archival preservation content providers.

https://robustlinks.mementoweb.org/about/

(although nothing else like the IA Wayback machine exists presently, and I'm not sure what would make someone else try to 'compete' when IA is doing so well, which is a problem, but refusing to use the IA doesn't solve it!)


Or: snapshot a WARC archive of the site locally, then start serving it only in case the original goes down. For extra street cred, seed it to IPFS. (A.k.a. one of too many projects on my To Build One Day list.)


ArchiveBox is built for exactly this use-case :)

https://github.com/pirate/ArchiveBox


I use linkchecker for this on my personal sites:

https://linkchecker.github.io/linkchecker/

There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories, making it particularly easy to use with static websites before deploying them.

https://www.npmjs.com/package/broken-link-checker-local


> There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories

linkchecker can do this as well, if you provide it a directory path instead of a url.


Ah, thanks! I was not aware of that feature.


I made a browser extension which replaces links in articles and stackoverflow answers with archive.org links on the date of their publication (and date of answers for stackoverflow questions): https://github.com/alexyorke/archiveorg_link_restorer


> generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.

SEO tools like Ahrefs do this already. Although, the price might be a bit too steep if you only want that functionality. But there are probably cheaper alternatives as well.


Yeah, at some point the Wayback Machine needs to be on a WebTorrent/IPFS type of thing where it is immutable.


I was surprised when digital.com got purged

Then further dismayed that the utzoo Usenet archives were purged.

Archive sites are still subject to being censored and deleted.



Is there any active project pursuing this idea?


The largest active project doing this (to my knowledge) is the Inter-Planetary Wayback Machine:

https://github.com/oduwsdl/ipwb

There have been many other attempts though, including internetarchive.bak on IPFS, which ended up failing because it was too much data.

http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i...

http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-...


https://github.com/exp0nge/wayback

Here's an extension to archive pages on Skynet, which is similar to IPFS but uses financial compensation to ensure availability and reliability.

I don't know if the author intends to continue developing this idea or if it was a one-off for a hackathon.


FileCoin is the incentivization layer for IPFS, both built by Protocol Labs.


I'm hoping someone here on Hacker News will pick it up and apply for the next round at Y Combinator. A non-profit would be better than a for-profit in this case. Blockchain-ish tech would be perfect for this. If in a few years no one does, then I'll do it.


> generate a report of broken links

I actually made a little script that does just this. It’s pretty dinky but works a charm on a couple of sites I run.

https://github.com/finnito/link-checker


Not to forget that while I might go to an article written ten years ago, the Wayback archive won't show me a related article that you published two years ago updating the information or correcting a mistake.


And when you die, who will be maintaining your personal site? What happens when the domain gets bought by a link scammer?

Maybe your pages should each contain a link to the original, so it's just a single click if someone wants to get to your original site from the wayback backup.


Wayback machine converts all links on a page to wayback links so you can navigate a dead site normally.


Well that's a bummer. Any way to defeat it?


If you're viewing a capture of a site, there's always a banner at the top of the page showing the original URL and when the page was captured, along with controls to view other snapshots. I do wish the banner had a "open actual site" button but it's pretty easy to copy the URL from the text box and paste it into your browser's location bar.


I spent hours getting all the stupid redirects working from different hosts, domains and platforms.

People still use RSS to either steal my stuff or discuss it off-site (as if commenting to the author is so scary!), often in a way that leaves me totally unaware it is happening. So many times people ask questions of the author on a site like this, or bring up good points or something worth going further on, that I would otherwise miss.

It’s a shame ping backs were hijacked but the siloing sucks too.

Sometimes I forget for months at a time to check other sites, not every post generates 5000+ hits in an hour.


What if your personal site is, like so many others these days, on shared IP hosting like Cloudflare, AWS, Fastly, Azure, etc.?

In the case of Cloudflare, for example, we as users are not reaching the target site, we are just accessing a CDN. The nice thing about archive.org is that it does not require SNI. (Cloudflare's TLS 1.3 and ESNI work quite well AFAICT, but they are the only CDN who has it working.)

I think there should be more archive.org's. We need more CDNs for users as opposed to CDNs for website owners.


The "target site" is the URL from the author's domain, and Cloudflare is the domain's designated CDN. The user is reaching the server that the webmaster wants reachable.

That's how the web works.

> The nice thing about archive.org is that it does not require SNI

I fail to see how that's even a thing to consider.


If the user follows an Internet Archive URL (or Google cache URL or Bing cache URL or ...), does she still reach "the server the webmaster wants reachable"?

SNI, or more specifically sending domain names in plaintext over the wire when using HTTPS, matters to the IETF because they have gone through the trouble of encrypting the server certificate in TLS 1.3 and eventually they will be encrypting SNI. If you truly know "how the web works", then you should be able to figure out why they think domain names in plaintext are an issue.


We suggest/encourage people link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs, so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift).

BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent.

Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.


It's interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF: a list of archive links. The browser could automatically try a backup if the main one fails. The user should be able to right-click the link to select a specific backup. Another idea is to allow the web-page author to provide a rewrite rule to automatically generate Wayback Machine (or whatever) links from the original. This seems less error-prone, and browsers could provide a default that authors could override.

Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved..

Well as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it:

https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...


So this is a little indirect, but it does avoid the case where the Wayback machine goes down (or is subverted): include a HASHREF which is a hash of the state of the content when linked. Then you could find the resource using the content-addressable system of your choice. (Including, it must be said, the wayback machine itself).


I've found that web pages have so much dynamic content these days that even something that feels relatively static generates two different hashes almost on every pageload.


Indeed. I don't think you could or should hash the DOM - not least of which because it is, in general, the structured output of a program. Ideally you could hash the source. This might be a huge problem for single page applications, except you can always pre-render a SPA at any given URL, which solves the problem. (This is done all the time - the most elegant way is to run e.g. React on the server to pre-render, but you can also use another templating system in an arbitrary language, although you end up doing all features maybe not twice, but about 1.5x).


> (Including, it must be said, the wayback machine itself).

Citation needed? Eg something like http://web.archive.org/cdx/search/cdx?url=http://haskell.cs.... produces lines of the form:

  edu,yale,cs,haskell)/wp-content/uploads/2011/01/haskell-report-1.2.pdf 20170628055823 http://haskell.cs.yale.edu/wp-content/uploads/2011/01/haskell-report-1.2.pdf warc/revisit - WVI3426JEX42SRMSYNK74V2B7IEIYHAS 563
But there seems to be no documented way to turn WVI3426JEX42SRMSYNK74V2B7IEIYHAS (which I presume to be the hash) into an actual file. (Though http://web.archive.org/web/$DATEim_/$URL works fine, so it hasn't been a problem in practice.)
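A minimal parser for the CDX output shown above (fields are space-separated in the order seen: urlkey, timestamp, original, mimetype, statuscode, digest, length), plus the `im_` retrieval-URL form mentioned above:

```javascript
// Split one CDX line into named fields.
function parseCdxLine(line) {
  const [urlkey, timestamp, original, mimetype, statuscode, digest, length] =
    line.trim().split(/\s+/);
  return { urlkey, timestamp, original, mimetype, statuscode, digest, length };
}

// Build the web/<timestamp>im_/<url> form for fetching the raw capture.
function retrievalUrl({ timestamp, original }) {
  return `http://web.archive.org/web/${timestamp}im_/${original}`;
}
```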


> Citation needed

Oh, sorry, I don't think the WM supports this today. I only meant that it could support it "trivially" (I put that in quotes since I don't know how WM is implemented. But in theory it would be easy to hash all their content and add an endpoint that maps from hashes to URLs).

My point was that you could add an addressing system that is both independent of the Wayback Machine, but which you could still (theoretically) use with it. But you'd have to add the facility to the WM.


Ah, that's disappointing, but oh well.


This is literally where my brain was going and I was glad to see someone went in the same direction. Given the <img> tag’s addition of srcset in recent years, there is precedent for doing something more with href.


Yup, I've been using the extension for probably about a year now and get the same issues they do. It really isn't that bad, most of the time backing out of the message once or twice does the trick, but it's funny because most of the time I get that message when going to the IA web uploader.


This is so much better than INSTEAD.

Not only because it leaves some control to the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow get the diff between them. IMHO, this is especially relevant when the purpose is reference.


I mostly agree... however, given how many "news" sites are now going back and completely changing articles (headlines, content) without any history, I think it's a mixed bag.

Link rot isn't the only reason why one would want an archive link instead of original. Not that I'd want to overwhelm the internet archive's resources.


I love the feature that lets you easily add a page to the archive: https://web.archive.org/save/https://example.com

Replace https://example.com in the URL above with the page you want to save. I try to respect the cost of archiving by not saving the same page too often.


Thanks so much for running this site. As a small start-up, we often manually request a snapshot of our privacy policy/terms of service/other important announcements whenever we make changes to them (if we don't manually request them, the re-crawl generally doesn't happen, since I guess those pages are very rarely visited, even though they're linked from the main site). It's helped us in a thorny situation where someone tried to claim "it wasn't there when I signed up".

It might be an interesting use-case for you to check out, i.e. keeping an eye on those rarely used legal sublinks for smaller companies.


Kudos for doing what you do.


I always wonder about the rise in hosting costs in the wake of people linking to the Wayback Machine on popular sites.

How do you think about it?



Came here for this. Have my upvote.


This is a bad idea...

In the worst case one might write a cool article and get two hits, one noticing it exists, and the other from the archive service. After that it might go viral, but the author may have given up by then.

The author is losing out on inbound links so google thinks their site is irrelevant and gives it a bad pagerank.

All you need to do is get archive.org to take a copy at the time, you can always adjust your link to point to that if the original is dead.


There's no reason that PageRank couldn't be adapted to take Wayback Machine URLs into account: if there is a link pointing at https://web.archive.org/web/*/https://news.ycombinator.com/, Google could easily register that as a link to both resources, one to web.archive, the other to the site.

There is also no reason why that has to become a slippery slope, if anyone is going to ask "but where do you stop!!"
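Recovering the original URL from a Wayback URL, so a crawler could credit both resources, is a one-regex job. A sketch (the pattern assumes the common `web/<timestamp-or-*>/<url>` shape):

```javascript
// Match https://web.archive.org/web/<timestamp or *>/<original url>.
const WAYBACK_RE = /^https?:\/\/web\.archive\.org\/web\/[^/]+\/(https?:\/\/.+)$/;

// Return the embedded original URL, or null if this isn't a Wayback link.
function originalUrl(url) {
  const m = url.match(WAYBACK_RE);
  return m ? m[1] : null;
}
```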


After all, they did change their search to accommodate AMP. Changing it to take the Wayback Machine into account is a) peanuts and b) actually better for the web.


There's a business idea in there somewhere.

Some kind of CDN-edge-archive hybrid.


“CDN-Whether-You-Want-It-Or-Not”


Foreverspin meets Cloudflare


I was thinking more there was a business motto in there somewhere - like "Don't be evil, be actively Good!" or something catchy like that.


Google shouldn't be the center of the Web. They could also easily determine where the archive link is pointing to and not penalize. But I guess making sure we align with Google's incentives is more important than just using the Web.


> Google shouldn't be the center of the Web.

I agree, but are you suggesting it's going to be better if WayBackMachine is?


That's a strawman because I never said they should be. There's room for better alternatives.

We as a community need to think bigger rather than resigning ourselves to our fate.


It's not a strawman because (a) I agreed with you, (b) context, and (c) I asked a question based on what you seemed to be implying in that context: a question to which you still haven't provided an answer.

Let me put it another way: what specifically are you suggesting as an alternative?


If I had to pick a solution from what's available right now technology wise I'd pick something that links based on content hashes. And then pulls the content from decentralized hosting.

I don't think I like IPFS as an organization, but tech wise it's probably what I'd go with.


Yes. At least Archive.org isn't an evil mega corporation destroying the internet. Yet.


We'll see what their new owners do after the lawsuit.


> But I guess making sure we align with Google's incentives is more important than just using the Web.

It's not about Google's incentives. It's about directing the traffic where it should go. Google is just the means to do so.

Build an alternative, I'm sure nobody wants Google to be the number one way of finding content, it's just that they are, so pretending they're not and doing something that will hurt your ability to have your content found isn't productive.


Every search engine uses the number of backlinks as one of the key factors in influencing search rank; it's a fundamental KPI when it comes to understanding whether a link is credible.

What is true for Google in this regard is also true of Bing, DDG and Yandex.


I totally agree.

I guess the answer is "don't mess with your old site", but that's also impractical.

And I'm sorry, but if it's my site, then it's my site. I reserve the right to mess about with it endlessly. Including taking down a post for whatever reason I like.

I'm sorry if that conflicts with someone else's need for everything to stay the same but it's my site.

Also, if you're linking to my article, and I decide to remove said article, then surely that's my right? It's my article. Your right to not have a dead link doesn't supercede my right to withdraw a previous publication, surely?


You can go down this road, but it looks like you're advocating for each party to simply do whatever he wants. In which case the viewing party will continue to value archiving.


I certainly don't know about legal rights, but I think the ethical thing is to make sure that any writings published as freely accessible should remain so forever. What would people think if an author went into every library in the world to yank out one of their books they no longer want to be seen?

I do think the author is wrong to immediately post links to archived versions of sources. At the least, he could link to both the original and archived.


Publishing on your own website is more akin to putting up a signboard on your front lawn than writing a book for publication.

People are free to view it and take pictures for their own records, but I could still take it down and put something else up.


Why is that the most ethical thing to do?

As a motivating example, I wrote some stuff on my MySpace page as a teenager that I'm very glad is no longer available. They were published as "freely accessible" and indeed, I wanted people to see it. But when I read it back 15 years later, I was more than a little embarrassed about it, and I deleted it - despite it also having comments from my friends at the time, or being referenced in their pages.

No great value was contained in those works.


In another 15 or 20 years you might want to see the content again. I hope you saved it somewhere.


I might, but I might want to see a whole lot of other things from my past, too.

Forgetting is part of living, y'know?


I'm not sure I agree. I know that journalism (as a discipline) considers this ethical. I kinda get that this is part of the newspaper industry as a public service - that withdrawing publication of something, or changing it without alerting the reader to the change, alters the historical record.

But no-one has a problem with other creative industries withdrawing their publications. Film-makers are forever deciding that movies are no longer available, for purely commercial reasons. Why is writing different? Why is pulling your books from a library unethical but pulling your movie from distribution is OK?

I think we either need to extend this to all creative activity, or reconsider it for writing.


> But no-one has a problem with other creative industries withdrawing their publications

I wouldn't say no one has a problem with this. It does happen, but it certainly doesn't make everyone happy. I for one would like for all released media to be available, or at least not actively removed from access.


This has a very easy answer for me: It's not ethical for film makers to decide that movies are no longer available.

Copyright was created to encourage publication of information, not to squirrel it away. Copyright should be considered the exception to the standard: public domain.


Why not?

Is it unacceptable for an artist to throw her art away after it has finished its museum tour? Should a parent hang on to every drawing their child has ever made?

If you are a software developer - is all of the code you've ever written still accessible online, for free? (To the legal extent that you are able, of course.)

Have you written a blog before, or did you have a MySpace? Have you taken care to make sure your creative work has been preserved in perpetuity, regardless of how you feel about the artistic value of displaying your teen emotions?

Consider why you feel it is unethical for the author or persons responsible for the work to ever stop selling it.


> Is it unacceptable for an artist to throw her art away after it has finished its museum tour? Should a parent hang on to every drawing their child has ever made?

This boils down to the public domain, IMO. We have made a long practice of rescuing art from private caches and trash bins to make them publicly available after the artists' passing (the copyright expiring); regardless of their views on what should happen with those works.

> Consider why you feel it is unethical for the author or persons responsible for the work to ever stop selling it.

Selling something and then pulling it down is fundamentally an attempt to create scarcity for something that would otherwise be freely available. It's a marketing technique that capitalizes on our fear of missing out to make a sale.

Again, the right to even sell writings was enshrined in law as an exception to the norm of it immediately being part of the public domain, in an effort to encourage more writing.


Not sure we need any more encouragement on that front ;)

Sure, after I'm dead, you can do with my stuff whatever you like.

But while I'm alive.... it's my stuff and I can do with it what I like. Including tearing it up because I hate it now and don't want anyone to look at it.


It seems like this is conflating two issues: 1) The right of others to copy your work. 2) The obligation of the author to make the work available.

This thread stems from the second; about whether a site owner is justified in deleting or rearranging old pages/information on their website.


One can also do it the way Wikipedia's references sections do, linking to both the original and the memento in the archive (once the bot notices the original is gone).

Additional benefit: Some edits are good (addendums, typo corrections etc.)


archive.org sends the HTTP header

  Link: <https://example.com>; rel="original"
This can be used by search engines to adjust their ranking algorithms.
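That header is also easy to consume programmatically. A minimal sketch in Python (only handling the simple single-entry case; real Link headers can carry several comma-separated entries with extra parameters):

```python
import re
from typing import Optional

def original_from_link_header(value: str) -> Optional[str]:
    """Extract the URL marked rel="original" from an HTTP Link header value."""
    # Matches entries of the form: <https://example.com>; rel="original"
    match = re.search(r'<([^>]+)>\s*;\s*rel="original"', value)
    return match.group(1) if match else None
```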


Even worse is when people use RSS to wholesale copy your site and its updates, and again the traffic and, more importantly, the engagement disappear.

It’s very demotivating


So, this is the problem of persistence: of URLs always referencing the original content, regardless of where it is hosted, in an authoritative way.

It's an okay idea to link to WB, because (a) it's de facto assumed to be authoritative by the wider global community and (b) as an archive it provides a promise that its URLs will keep pointing to the archived content come what may.

Though, such promises are just that: promises. Over a long period of time, no one can truly guarantee the persistence of a relationship between a URI and the resource it references. That's not something technology itself solves.

The "original" URI still does carry the most authority, as that's the domain on which the content was first published. Moreover, the author can explicitly point to the original URI as the "canonical" URI in the HTML head of the document.

Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?

Part of ensuring persistence is the responsibility of original publisher. That's where solutions such as URL resolving come into play. In the academic world, DOI or handle.net are trying to solve this problem. Protocols such as ORE or Memento further try to cater to this issue. It's a rabbit hole, really, when you start to think about this.


> Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?

WB also supports linking to the very latest version. If the archive is updated frequently enough, I would say it is reasonable to link to that if you use WB just as a mirror. In some cases I've seen error pages being archived after the original page has been moved or removed, though; that is probably just a technical issue caused by some website misconfiguration or bad error handling.


Signed HTTP Exchanges could be a neat solution here.


You can create a bookmark in Firefox to save a link quickly.

Bookmark Location- https://web.archive.org/save/%s

Keyword - save

So searching 'save https://news.ycombinator.com/item?id=24406193' archives this post.

You can use any Keyword instead of 'save'.

You can also search with https://web.archive.org/*/%s


Does that `save` keyword work?

The problem is %s gets escaped, so Firefox generates this URL, which seems to be invalid:

https://web.archive.org/save/https%3A%2F%2Fnews.ycombinator....


Uppercase %S for unescaped, e.g.:

https://web.archive.org/web/*/%S


TIL. Thanks for the info.


Ah, nice, thanks!


web.archive.org automatically converts the https%3A%2F things to https:// for me. I noticed it many times.

If you are still facing problems, go to https://web.archive.org . In the bottom right 'Save page now' field, right click and select 'add keyword for search'. Choose your desired keyword.


>web.archive.org automatically converts the https%3A%2F

Did you try the link provided by the one you replied to?

Because it says "HTTP 400" here, so apparently it doesn't convert well, at least not on my end.


Yeah. I used to face the same problem. The links would get converted to a different format. But it got fixed, and I didn't change anything. It is getting automatically converted every time now.


Nice. I forgot how you can do that.

I just use the extension myself:

https://addons.mozilla.org/en-US/firefox/addon/wayback-machi...


Yeah. That requires access to all sites. I wasn't comfortable adding another addon with that permission.

The permission is there for a simple reason and should be off by default: it is so you can right-click a link on any page and select 'archive' from the menu. Small function, but it requires access to all sites.


The source is available if you want to know what's going on with those permissions: https://github.com/internetarchive/wayback-machine-chrome


Thanks. I already knew that. I'm familiar with the dev's extensions. Clear Browsing Data and Captcha Buster are very useful.


One issue I have with this extension is that it randomly pops up the 'this site appears to be offline' message (which overrides the entire page) even when the site actually works (I hit the back button and it appears). I've had it installed for some time now, and so far I get almost daily false positives; only once has it actually worked as intended.

Also, there doesn't seem to be a way to open a URL directly from the extension, which seems a weird omission, so I end up going to the archive site anyway since I very often want to find old long-lost sites.


It pops up when an HTTP 404 status code or similar is returned. So these false alarms are likely due to the specific sites being configured in a wacky way.

(Don't get me wrong, it is still very annoying for the user regardless what the cause is.)


Does it pop up for any 404 error? If so, it might be some script or font or whatever resource the site itself is using that would otherwise fail silently. If not... then there has to be some other bug/issue, because I get it for many different sites that shouldn't have it.


Nope, only for the "main" page (for lack of a better word), and when there is an archive for it.


Can we update this link to point to the archive version?


Brilliant


This is building yet another silo and point of failure. We can't pass the entire Internet's traffic through the WayBackMachine; its resources are limited.

Most preservation solutions are like that, and in the end funding or business priorities (Google Groups) become a serious problem.

I think we need something web-like: distributed, and dead easy to participate in and contribute preservation space to.

Look, there are torrents that have been available for 17 years [0]. Sure, some uninteresting ones are long gone, but there is always a little chance somebody still has the file and someday comes online with it.

I know about IPFS/Dat/SSB, but still, that stuff, like Bitcoin, is too complex for a layman contributor with a plain altruistic motivation. It should be like SETI@home: fire and forget. Eventually integrated with a browser to cache content you star/bookmark and share it when it is offline.

[0] https://torrentfreak.com/worlds-oldest-torrent-still-alive-a...


Link rot has convinced me that the web is not good for its ostensible purpose. I used to roll my eyes reading how academic researchers and librarians would discourage using webpages as resources. Many years later, it's obvious that the web is pretty bad for anything that isn't ephemeral.


We have deposit libraries in the U.K., such as the British Library and Oxford University's Bodleian. When you publish a book in the U.K., you are supposed to offer a copy to these institutions.

If we had legal deposit web archiving institutions, then academics, and others, could create an archive snapshot of some resource and then reference the URI to that (either with or without the original URI), so as to ensure permanence.


You can also copyright databases in the US as well as web content. It is similar here. However, many people do not actually register their copyrights to their web content. There is also a lot of misinformation about automatic copyright in the US as it relates to web content. While there is 'automatic copyright,' due to recent court decisions, you must have a registered (paid for, deposited with the copyright office) copyright in order to file a lawsuit for copyright infringement.

I agree that this is what should be done more often for durable content. This is another reason why social media sites are bad: the users technically have automatic copyright to their contributions, but they often also agree to grant licenses to the content website for things like their photos. The US Copyright Office is releasing a new application format intended for blog posts and other short form content that will hopefully be more straightforward to use.


>I used to roll my eyes reading how academic researchers and librarians would discourage using webpages as resources.

While this is true in general, I am amused that this is not true for citing wikipedia. Wikipedia can be trusted to remain online for many more years to come. And it has a built-in wayback machine in the form of Revision History.


Try following the references on big Wiki pages and you will see why Wikipedia pages are nightmarish for any kind of research. This is important when you are trying to drill down to the sources of various claims. Many major pages relating to significant events and concepts are riddled with rotted links.

The page can be completely correct and accurate, but if you cannot trace the references then it cannot be verified and you cannot make the claims in a new work as a result. The whole point of references is to make it so that the claims can be independently verified. Even when there isn't a link rot problem you will often find junk references that cannot be verified.

Wikipedia isn't a bad starting point and sometimes you can find good references. But it is not anywhere close to reliable: just trace the references in the next 20 Wiki articles you read and your faith will be shaken.


Usually a reference indicates that an author believes something to be true, but won't explicitly state their reasons. It isn't just a statement of where information comes from, but a justification for trusting that information. If the reference is from a reputable source, then it indicates that this belief is justified. If an author believes something to be true because they read it on wikipedia, then that belief probably isn't justified, because the reliability of wikipedia content is mixed.

Good quality information on wikipedia often refers back to published sources, and at the very least an author should check that source and refer to it, rather than wikipedia itself.


After someone published an authoritative FTP listing, many people panicked because there were out-of-date and insecure versions; so rather than patch, they all went dark.

Anyone doing research just got screwed.

So many papers have code links pointing to places that don't exist anymore.


By that reasoning, shouldn't you be using WayBack Machine links when posting your own content to HN, instead of posting direct links?


But how certain is the future of the WayBackMachine? When disaster strikes, all your links are dead. On the other hand, the original link can still be read from the URL, so the original reference is not completely gone.


Yeah, my thoughts were more of the way Waybackmachine is funded.

I don't feel comfortable sending a bunch of web traffic to them for no reason other than it being convenient. The wayback machine is a web archival project, not your personal content proxy to make sure your links don't go stale.

They need our help both in funding and in action, one simple action is not to abuse their service.


Precisely my first thoughts, too. It's an archive, not a free CDN.

I hope the author of this piece considers donating and promoting donation to their readers: https://archive.org/donate/


INTERNETARCHIVE.BAK:

The INTERNETARCHIVE.BAK project (also known as IA.BAK or IABAK) is a combined experiment and research project to back up the Internet Archive's data stores, utilizing zero infrastructure of the Archive itself (save for bandwidth used in download) and, along the way, gain real-world knowledge of what issues and considerations are involved with such a project. Started in April 2015, the project already has dozens of contributors and partners, and has resulted in a fairly robust environment backing up terabytes of the Archive in multiple locations around the world.

https://www.archiveteam.org/index.php?title=INTERNETARCHIVE....

Snapshots from 2002 and 2006 are preserved in Alexandria, Egypt. I hope there's good fire suppression.

https://www.bibalex.org/isis/frontend/archive/archive_web.as...


I wish there were a way to get a low-res copy of their entire archive: only text; no images, binaries, or PDFs (other than PDFs converted to text, which they seem to do). As it stands, the archive is so huge that the barrier to mirroring is high.


Agreed.

When scoping out the size of Google+, one of ArchiveTeam's recent projects, it emerged that the typical size of a post was roughly 120 bytes, but total page weight a minimum of 1 MB, for a 1% payload to throw-weight ratio. This seems typical of much the modern Web. And that excludes external assets: images, JS, CSS, etc.

If just the source text and sufficient metadata were preserved, all of G+ would be startlingly small -- on the order of 100 GB I believe. Yes, posts could be longer (I wrote some large ones), and images (associated with about 30% of posts by my estimate) blew things up a lot. But the scary thing is actually how little content there really was. And while G+ certainly had a "ghost town" image (which I somewhat helped define), it wasn't tiny --- there were plausibly 100 - 300 million users with substantial activity.

But IA's WBM has a goal and policy of preserving the Web as it manifests, which means one hell of a lot of cruft and bloat. As you note, increasingly a liability.


The external assets for a page could be archived separately though, right? I would think that the static G+ assets (JS, CSS, images, etc.) could be archived once, and then all the remaining data would be much closer to the 120 B of real content. Is there a technical reason that's not the case?


In theory.

In practice, this would likely involve recreating at least some of the presentation side of numerous changing (some constantly) Web apps. Which is a substantial programming overhead.

WARC is dumb as rocks, from a redundancy standpoint, but also atomically complete, independent (all WARCs are entirely self-contained), and reliable. When dealing with billions of individual websites, these are useful attributes.

It's a matter of trade-offs.


The WayBackMachine alternative archive.is has an option to download a zip archive of the HTML with images and CSS (but no JS); this way you can preserve and host a copy of the original webpage on your own website.


Or just wget -rk...

Mirroring a website isn't so hard that you need a service to do it for you. Your browser even has such a function; try ctrl-s.


The "SingleFile" plugin is a better version of Ctrl+S. It saves each page as a single HTML file and even includes images as octet streams in the file so they aren't missed.


I would be careful in mirroring a site. It's very likely to violate copyright or similar laws, depending on where you are. I think archive.org is considered fair use, but if you put a mirror on a personal or even business page it might be different. For example, Google News in the EU is very limited in what content they may take from other web pages.



Doesn't the link to the WayBackMachine contain the original link?


Good idea, but why not both (i.e. link to the webpage, and to the Archive)?

Linking to Archive only makes Archive a single point of failure.


Yes, this makes the most sense in my opinion:

Check out [this link](https://...) ([archived](https://...))

This can also help in the event of a "hug of death"


This is what I do on my blog, with some additional metadata:

    <p>
      <a 
        data-archive-date="2020-09-01T22:11:02.287871+00:00"
        data-archive-url="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs"
        href="https://reubenwu.com/projects/25/aeroglyphs"
      >
        Aeroglyphs
      </a>
      <span class="archive">
        [<a href="https://web.archive.org/web/20200901221101/https://reubenwu.com/projects/25/aeroglyphs">archived</a>]
      </span>
      is an ongoing series of photos of nature with superimposed geometrical shapes drawn by drones.
    </p>


By the way the archive works, isn't the link just the actual link with https://web.archive.org/web/*/ prepended? I guess linking to both is especially important for people not knowing about the existence of archive.org, and a small convenience for everyone. But the link seems to be reversible in either direction.
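That reversibility can be sketched in a few lines, assuming the common `https://web.archive.org/web/<timestamp>/<url>` shape (real WBM URLs have a few more variants, e.g. suffixes on the timestamp segment):

```python
WBM_PREFIX = "https://web.archive.org/web/"

def to_wayback(url: str, timestamp: str = "*") -> str:
    # "*" yields the overview/calendar page of snapshots;
    # a 14-digit timestamp pins a specific capture.
    return f"{WBM_PREFIX}{timestamp}/{url}"

def from_wayback(wb_url: str) -> str:
    # Drop the prefix and the timestamp segment to recover the original URL.
    return wb_url[len(WBM_PREFIX):].split("/", 1)[1]
```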


The WBM link includes the canonical source clearly within the URL.


Yeah, and the non-technical users will surely understand that what they need to do when the link doesn't work is:

1. Recognize that it's an Archive.org URL

2. Understand that the link references an archived page whose URL is "clearly" referenced as a parameter

3. Edit the URL (especially pleasant on a cell phone) correctly and try loading that

If you expect the user to be able to go through all this trouble if the Archive is down, you can also expect them to look up the page on the Archive if the link does not load.

But better yet, one shouldn't expect either.


I wonder if the anchor tag should be altered to support this?

Alternatively, this is a good thing for a user agent to handle natively, or through a plugin.


Agreed. I usually link to both the original and then archive.org in parentheses.


I understand where the author is coming from, but I think the best approach is to write your content with direct links to the canonical versions of articles.

Have a link checking process you run regularly against your site, using some of the standard tools I've mentioned elsewhere in this thread:

https://www.npmjs.com/package/broken-link-checker-local

https://linkchecker.github.io/linkchecker/

When you run the link check (which should be regularly, perhaps at least weekly), also run a process that harvests the non-local links from your site and 1) adds any new links' content to your own local, unpublished archive of external content, and 2) submits those new links to archive.org.

This keeps canonical URLs canonical, makes sure content you've linked to is backed up on archive.org so a reasonably trustworthy source is available should the canonical one die out, and gives you your own backup in case archive.org and the original both vanish.

I don't currently do this with my own sites, but now I'm questioning why not. I already have the regular link checks, and the second half seems pretty straightforward to add (for static sites, anyway).
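A rough sketch of that harvest-and-submit step, assuming a static site and using the Wayback Machine's public save endpoint (`https://web.archive.org/save/<url>`, which is rate-limited, so a real version should throttle between requests); the regex-based link extraction is deliberately crude:

```python
import re
import urllib.request

def harvest_external_links(html: str, own_host: str) -> list[str]:
    """Pull non-local http(s) links out of a page's HTML."""
    links = re.findall(r'href="(https?://[^"]+)"', html)
    return [u for u in links if own_host not in u]

def submit_to_archive(url: str) -> None:
    """Ask the Wayback Machine to capture the page (fire and forget)."""
    urllib.request.urlopen("https://web.archive.org/save/" + url, timeout=30)
```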


I think the fundamental problem here is that URLs locate resources. We find the desired content by finding its location given by an address. Now what server or content lives on that address may change from time to time or may even disappear. This leads to broken links.

The problem with linking to the Wayback Machine is that we are still writing archive.org URLs pointing at Wayback Machine servers. What guarantee is there that those archive.org links will not break in the future?

It would have been nice if the web were designed to be content-addressable. That is, the identifier or string we use to access a content addresses the content directly, not a location where the content lives. There is good effort going on in this area in the InterPlanetary File System (IPFS) project but I don't think the mainstream content providers on the Internet are going to move to IPFS anytime soon.


I'm all for Archive.org. However, using it in this way — setting up a mirror of some content and purposefully diverting traffic to said mirror — is copyright infringement (freebooting), as it competes with the original source.


This is a bad idea for the reasons that other commenters have already stated. If WayBackMachine falls, all links would fall. Actually the "Web" would stop being one, if all links are all within the same service.

For docs and other texts, I just link to the original site and add an (Archive) suffix, e.g. the "Sources" section in https://doc-kurento.readthedocs.io/en/latest/knowledge/nat.h...

That is a simple and effective solution, yes it is a bit more cumbersome, but it does not bother me.


> So in Feb 14 2019 your users would have seen the content you intended. However in Sep 07 2020, your users are being asked to support independent Journalism instead.

Can you believe it? Yesterday, I tried to walk out of the grocery store with a head of lettuce for free, and they instead were more interested in making me pay money to support the grocery and agricultural business!


Right. I thought it was pretty bad form for him to call this "spam," as though they're the ones wronging him.


This seems like a problem that would be better solved by something like:

1. Browsers build in a system whereby if a link appears dead, they first check against the Wayback Machine to see if a backup exists.

2. If it does, they go there instead.

3. In return for this service, and to offset costs associated with increased traffic, they jointly agree to financially support the Internet Archive in perpetuity.


Here's a WayBackMachine Link to this article. :)

https://web.archive.org/web/20200908090515/https://hawaiigen...


Take a look at _Robustify Your Links_.[1] It is an API and a snippet of JavaScript that saves your target HREF in one of the web archiving services and adds a decorator to the link display that offers the option to the user to view the web archive.

[1] https://robustlinks.mementoweb.org/about/


No one has touched on this, but the experience of viewing through the WayBackMachine is awful.

Media many times will not be saved so pages look broken. The iframe and the iframe breakers on original sites can kill any navigating.

The WayBackMachine is okay for research but a poor replacement for a permalink.


> Media many times will not be saved so pages look broken.

In my experience, this has gotten much, much better in the last few years. I haven't explored enough to know if this is part of the archival process or not, but I've noticed on a few occasions that assets will suddenly appear some time after archiving a page. For instance, when I first archived this page (https://web.archive.org/web/20180928051336/https://www.intel...), none of the stylesheets, scripts, fonts or images were present. However, after some amount of time (days/weeks) they suddenly appeared and I was able to use the site as it originally appeared.


This man’s entire argument is completely terrible for two reasons:

1) The example he uses is The Epoch Times, a questionable source even on the best of days.

2) What he refers to as “spam” is a paywall. He is literally taking away from business opportunities for this outlet that produced a piece of content he wants to draw attention to, but he does not want to otherwise support.

He’s a taker. And while the Wayback Machine is very useful for sharing archived information, that’s not what this guy is doing. He’s trying to undermine the business model of the outlets he’s reading.

The Epoch Times is one thing—it’s an outlet that is essentially propaganda—but when he does this to a local newspaper or an actual independent media outlet, what happens?


> 2) What he refers to as “spam” is a paywall. He is literally taking away from business opportunities for this outlet that produced a piece of content he wants to draw attention to, but he does not want to otherwise support.

For the destination site, this is all of the downsides of AMP with none of the upsides.


For reference: https://en.wikipedia.org/wiki/Epoch_Times

They're hyper right wing Qanon/antivax spreaders associated with the Falun Gong movement.


Is there any WordPress plugin that adds a link to the WayBack Machine next to the original link? I would use something like that.


Look at the format of the wayback machine URL. It's trivial to generate.

Where a WP plugin would add value is by saving to the archive whenever WP publishes a new or edited article.



The idea of being able to access the content once the original URL is gone is good. However, this also means that any updates made to the original page are no longer seen.

Not all updates are about "begging for money" as the example in the article.


Or link to your own archive of the content with ArchiveBox!

That way we're not all completely reliant on a central system. (ArchiveBox submits your links to Archive.org in addition to saving them locally).

https://github.com/pirate/ArchiveBox

Also many other tools that can do this too:

https://github.com/pirate/ArchiveBox/wiki/Web-Archiving-Comm...


Apropos of nothing but I added the ability to archive links in Anarki a few months back[0]. If dang or someone wants to take it for HN they're welcome to. Excuse the crappy quality of my code and pr format, though.

It might be useful as a backup if the original site starts getting hugged to death.

[0]https://github.com/arclanguage/anarki/pull/179


> Now it’s spam from a site suffering financial need. Well, yeah!

Of course, linking to WBM is not the main reason why a site might be in this situation but it piles up.


Awesome. Hey, mods... Can you change the link on this post to http://web.archive.org/web/20200908090515/https://hawaiigent...


I link to the WayBackMachine because, as a freelancer, I've built a great many greenfield applications for startups, which only existed for about 6-8 months before hitting their burn rate. If I linked to their original domains, my portfolio would be a list of 404s.


I once discovered an information leak of German public broadcasting organization ARD which leaked real mobile numbers on their CI/CD page where they showed the business card designs (lol).

All records of this page on Archive.org were deleted after a couple of days; a Twitter account posting the details with a screenshot and link was reported, and my account was temporarily suspended.

I assume it must be very easy to remove inconvenient content from archive.org.

(in German) https://blog.rolandmoriz.de/2019/04/25/sind-die-leute-von-de...


While I certainly wouldn't do this with every page, and also not every time, I've gotten so anxious about link rot lately that I save any good content I come across to the WayBackMachine out of reflex.

The use of the bookmarklet makes this really convenient.


The WayBackMachine is slow (slower than many bloated websites), so it's not a good enough experience for the person clicking that link.

Secondly, I personally don’t like the fact that WayBackMachine doesn’t provide an easy way to get content removed and to stop indexing and caching content (the only way I know is to email them, with delayed responses or responses that don’t help). It’s far easier to get content de-indexed in the major search engines. I know that the team running it have some reasons to archive anything and everything (as) permanently (as possible), but it doesn’t serve everybody’s needs.


This is both a good and a scary idea. For the good part: I'm frustrated enough that some unscrupulous websites (even some news outlets) secretly alter their content without mentioning the change. I want a mechanism that holds the publisher responsible. At the same time, this is scary because we're basically using one private organization as a single arbitrator. (I know it's a nonprofit, but they're probably not as public as a government entity.) Maybe it's good for the time being, but we should be aware that this solution is far from perfect.


Public "or" a government entity.


This seems like a risky strategy, what with the pending lawsuit against archive.org over their National Emergency Library: I am fully expecting that web.archive.org will go away permanently within a few years.


I link to the original, but archive it in both WayBackMachine and Archive.is.


Yeah, that's another problem with the design of the web, and kind of a significant one! Somewhat pointless to link to external documents when half of them won't be around next year.


As others mentioned, it is a good habit to request the page to be archived. You don't have to link to the archive, but you would have the option to if the page were to disappear in the future.

I wish I had done this 15 years ago for a small project/website. Nowadays, my website is there, with all of its content, but most of the awesome references which I had linked to are unavailable. I wrote "most", but it is close to all of them.


While I generally disagree because I'd rather my site was the one getting the hits—and I would rather give the same courtesy to other authors—this does give me the idea of checking (or creating if none exists) an archive link of whatever I reference, and include that archive link in the metadata of every link I include.

Users will find the archive link if they really want to, and it will make it easier for me to replace broken links in the future.


Gotta completely agree ... for anything you need to be stable and available.

I've been building lists of -reference- URLs for over a decade ... and the ones aimed at Archive.org (are slower to load, but) are much more reliable.

Saved Wayback URLs contain the original site URL. It's really easy to check it to see if the site has deteriorated (usually it has). If it's gotten better ... it's easy to update your saved WB link.
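Since the original URL is embedded right in the saved Wayback URL, extracting it for a health check is a one-liner; a sketch:

```javascript
// A saved Wayback URL embeds the original, e.g.
// https://web.archive.org/web/20190908123456/https://example.com/page
// This pulls the original URL back out (null if it's not a Wayback URL).
function originalOf(waybackUrl) {
  const m = waybackUrl.match(/^https?:\/\/web\.archive\.org\/web\/[^/]+\/(.+)$/);
  return m ? m[1] : null;
}
```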


If it's not distributed, it is going to disappear.

The waybackmachine is backed by WARC files. It's perhaps the only thing on archive.org that can't be downloaded... well, except the original mpg files of the 9/11 news footage.

https://news.ycombinator.com/item?id=20623177


This is such a fundamental problem that I'd like to be able to solve it at the HTML level.

An anchor type which allows several URLs, to be tried in order, would go a long way. Then we could add automatic archiving and backup links to a CMS.

It isn't real content-centric networking, which is a pity, but it's achievable with what we have.
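Something close to this is achievable today with a hypothetical `data-fallbacks` attribute (not part of any HTML spec) and a small script that tries candidates in order; a sketch:

```javascript
// Hypothetical markup:
//   <a href="https://example.com/post"
//      data-fallbacks="https://web.archive.org/web/2019/https://example.com/post">
// candidateUrls returns the ordered list of URLs to try.
function candidateUrls(href, fallbacksAttr) {
  const fallbacks = (fallbacksAttr || "").split(/\s+/).filter(Boolean);
  return [href, ...fallbacks];
}

// Resolve to the first candidate that answers OK;
// if none respond, fall back to the primary link.
async function firstAlive(urls) {
  for (const url of urls) {
    try {
      const res = await fetch(url, { method: "HEAD" });
      if (res.ok) return url;
    } catch (_) {
      // network error: try the next candidate
    }
  }
  return urls[0];
}
```

A CMS could then fill `data-fallbacks` automatically at publish time with a freshly requested archive snapshot.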


The wayback machine helps me on a daily basis. So many old links are dead.

The other day, I noticed that even old links from the front page of Google and Youtube are dead now. Internet Archive still has them. These were links on the front page of YT. Was very disappointed that even Google has dead links.


I wrote a link checker[1] to detect outbound links and mark dead ones, so that I can replace them manually with archive.org links.

1 - https://github.com/ashishb/outbound-link-checker
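The core of such a checker is small. A hedged sketch (the `web.archive.org/web/<timestamp>/<url>` form is the standard Wayback URL shape; treating 4xx/5xx and network errors as "dead" is a heuristic, not the linked tool's actual logic):

```javascript
// Rewrite a dead link to its Wayback Machine form. A partial timestamp
// like "2019" asks for the snapshot closest to that date.
function toWayback(url, timestamp = "2019") {
  return `https://web.archive.org/web/${timestamp}/${url}`;
}

// A link is considered dead on a 4xx/5xx status or a network failure.
async function isDead(url) {
  try {
    const res = await fetch(url, { method: "HEAD", redirect: "follow" });
    return res.status >= 400;
  } catch (_) {
    return true;
  }
}

// Replace every dead outbound link in a list with its archived form.
async function fixLinks(urls) {
  const out = [];
  for (const url of urls) {
    out.push((await isDead(url)) ? toWayback(url) : url);
  }
  return out;
}
```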


I made a Chrome extension called Capsule that works perfectly for this use case. With just a click, you can create a publicly shareable link that preserves the webpage exactly as you see it in your browser.

https://capsule.click


Does it use SingleFile under-the-hood? What storage format does this use, is it portable? e.g. WARC/memento/zim/etc?


I experienced this just the other day.

I was browsing an old HN post from 2018, with lots of what seemed like useful links to their blog.

Upon visiting, the site had been rebranded and the blog entries had disappeared.

Waybackmachine saved me in this case, but a link to it originally would have saved me a few clicks.


If it's to actually reference a third party source, it's probably better to make a self-hosted copy of the page. You can print it to a PDF file for example. I don't believe archive.org is eternal, or that its pages will remain the same.


I still link to the original URL because the author deserves the ad revenue and traffic, but I archive a copy to the Wayback Machine just in case the website can't handle the load, so there is an alternative way of getting the content.


The proper way is for a site to expose a canonical link to an article via a meta link (rel=canonical) if necessary, and then have a browser plugin automatically try archive.org with a URL generated from the canonical one if the site is down.
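A sketch of that plugin's core logic, assuming the common single-line form of the canonical tag (a real plugin would read the DOM rather than regex over HTML):

```javascript
// Extract the canonical URL from an HTML document, if declared.
// This regex sketch only handles the usual
// <link rel="canonical" href="..."> attribute order.
function canonicalOf(html) {
  const m = html.match(/<link[^>]+rel=["']canonical["'][^>]*href=["']([^"']+)["']/i);
  return m ? m[1] : null;
}

// Wayback fallback for a canonical URL. "2" is a partial timestamp,
// which the Wayback Machine resolves to the nearest snapshot.
function archiveFallback(canonicalUrl) {
  return "https://web.archive.org/web/2/" + canonicalUrl;
}
```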


Thank you! I've only been using the labor-intensive trust-issues version of this: paraphrasing things in my own words and linking to THAT.

I think I've been curating about 200 essays so far like that. You're now making me rethink my flow.


I maintain a fork of a program that does exactly this! You can check it out here:

https://github.com/Lifesgood123/prevent-link-rot


What would be even cooler is if there was an easy way to turn your own server into a Wayback machine, so that when your server rendered a webpage, it would use the original link if available, or its own cached version if not.


In the past I would fall back to WBM when something is no longer online. Though recently I've been bookmarking interesting content very rigorously and just rely on the archival feature of my bookmarking software.


Just another reason to have content-addressable storage everywhere; then at least if the content changed you'll know it changed, and if you can't get the original content anymore then the change is probably malicious.


For anything important you can't beat a good save-to-PDF feature in the browser. You can then upload the PDF and link to that instead. Someone should make a WordPress plugin to do this automatically.


You could link to the original web URL and also keep a print version of the web content as a PDF. That's how I archive howtos and write-ups of interesting content: print view, then create a PDF version.


Maybe the solution isn't technical and we should look at other fields that have relied on referencing credible sources for a long time? I can think of research, news and perhaps law.


It's probably better to link to both. If a site corrects a story, your readers will want to see the correction, but if the page disappears, it's good to have the backup.


It would be good to create a distributed, consensus version (to help stop edits) of the content rather than have a single point of failure...


So it can be deleted too?

Or so there is no engagement at the source?


There's some subtle irony in that the linked site is not in fact a WayBackMachine link, but instead a direct link to the site.


On the same topic, I wish I could link to highlights in the page. Having a spec for highlights in URLs would be neat.



I think a good solution might be to host the archive version yourself (archive.org is slow, and always using it centralizes everything there).

Let's say you write an article on your site, https://yoursite.com/my-article, and from it you want to link to an article https://example.com/some-article

You then create a mirror of https://example.com/some-article to be served from your site at https://yoursite.com/mirror/2019-09-08/some-article (put /mirror/ in robots.txt and set it to noindex, or maybe even better, add a rel="canonical" pointing to the original article), and at the top of this mirrored page you add a header bar containing a link to the original article, as well as one to archive.org if you want.

tl;dr instead of linking to https://example.com/some-article you link to https://yoursite.com/mirror/2019-09-08/some-article (which has links to the original)
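The mirror-path scheme above is easy to generate mechanically; a sketch (the /mirror/ prefix and date format are just this comment's convention):

```javascript
// Turn an external article URL into a dated mirror path on your own
// site, e.g. https://example.com/some-article archived on 2019-09-08
// becomes /mirror/2019-09-08/some-article.
function mirrorPath(externalUrl, dateStr) {
  const { pathname } = new URL(externalUrl);
  return `/mirror/${dateStr}${pathname}`;
}
```

A static-site build could run this over every outbound link, snapshot the page into that path, and emit the header bar linking back to the original.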


I find that web archive pages always appear broken; perhaps a lot of JS or CSS is not properly archived?


Everyone should be doing this, in my opinion; articles get pulled all the time.


Clever way to make the reference immutable.

Some blockchain will end up taking care of this.


Is there a chrome app that utilises waybackmachine?


Has waybackmachine stopped retroactively applying robots.txt?

If not, links to it are one misconfiguration or one parked domain away from being wiped.


WBM is like a content snapshot. You can't go back in time and change anything. That's why it is better than linking to the original.


Hmm. is there a place for a service that makes a permanent copy of content, available at the original url at the time of posting?


I stopped reading after the part where they describe the paywall gated version of the journalism website as “Now it’s spam from a site suffering financial need.”

That website spends money creating content for commercial viability, it doesn’t have to bow to you and make sure you can consume it for free, and the Wayback Machine isn’t a tool for you to bypass premium content.


This behaviour should be reported to the WayBackMachine as abuse.


He is actually showcasing a very nice technique to get around paywalls: turn off JS. Often enough that’s enough to get around the paywall. I believe the archives also disable JS when grabbing the content.


That is changing. I've noticed over the past couple of years that sites that could be accessed with JS turned off are now showing a "Please enable Javascript to continue" (Quora) or just hiding the content entirely (Business Insider).

I'm sure there are other examples as well.


Not surprised. When paywalls started becoming a thing most of them could be circumvented simply by removing a DOM element and some CSS classes. Nowadays this is basically not possible anywhere anymore.


I used to be able to archive paywalled websites and view the archive, which would get around the paywall, but they seem to have gotten wise to it.

Now the archive comes up with the paywall message in it

Still works on some sites by just simply archiving the page


Just FYI, archive.org is banned in a few countries, including the UAE, where I cannot open any links from there.


Huh, I wonder if they are also blocking mirrors. Also, in countries with restrictions on internet access, you probably want to make using Tor a general habit.


In practice however, archive.org did censor content based on political preference.


Sounds plausible, but I sure would like a citation for that claim.


They exclude Snopes and I think Salon from archiving.


Do you have any source on that? Sites can request archive.org to stop archiving them and to delete what is currently archived. They can do it for any reason; concealing changes of article contents might be one of them.


https://web.archive.org/web/*/snopes.com

> Sorry.

> This URL has been excluded from the Wayback Machine.

They also do not exclude the archive.org bot in https://www.snopes.com/robots.txt


That only shows that it's excluded, not why. In 2017 Internet Archive announced it would start ignoring robots.txt. When I tried to archive a random Facebook page (which was disallowed in robots.txt), it archived it happily. Afaik the current way to exclude your site is to contact info@archive.org and prove the site is yours.


I do have two links in my "clownworld" link list, but ironically they're both in subreddits that have since been banned and are therefore not available anymore.


post them regardless


I think this a good idea, but especially because the WayBackMachine uses good content security policies to prevent some of the intrusive JS ad-dependent sites like to push on people. So you're not only protecting from future 404 scenarios, but also protecting your visitors' privacy from unscrupulous ad-tech which seems to be everywhere now.

The example provided in the article, showing how a site looked cleaner before, could simply be the content security policies at the WayBackMachine preventing the clutter from getting loaded, rather than any specific changes on the site - although I haven't checked that particular site.



