Ask HN: Best way to avoid link rot?
12 points by loughnane on April 6, 2021 | 11 comments
I want to create links to pages that are likely to function in, say, 40 years. Is my best bet to use (or create) an archive.org snapshot?



I’ve been working on a SaaS tool that combats link rot for writers.

The service periodically checks the original sources and redirects broken links to the archive.org snapshot automatically when the sources fail. Any redirection can be customized or fall back to an author’s website.
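
Conceptually, the check-and-fall-back step looks something like the sketch below (a minimal illustration using the public Wayback Machine availability API, not the service's actual implementation; the function names are made up):

    // Minimal sketch: is the original still alive? If not, find the closest
    // archive.org snapshot to 301 to. Helper names are hypothetical.
    interface WaybackResponse {
      archived_snapshots?: {
        closest?: { available: boolean; url: string; timestamp: string };
      };
    }

    async function isAlive(url: string): Promise<boolean> {
      try {
        const res = await fetch(url, { method: "HEAD", redirect: "follow" });
        return res.ok; // 2xx after following redirects
      } catch {
        return false; // DNS failure, timeout, TLS error, etc.
      }
    }

    async function resolveRedirectTarget(url: string): Promise<string | null> {
      if (await isAlive(url)) return url;
      const api = `https://archive.org/wayback/available?url=${encodeURIComponent(url)}`;
      const data = (await (await fetch(api)).json()) as WaybackResponse;
      const snap = data.archived_snapshots?.closest;
      return snap?.available ? snap.url : null; // null -> author's fallback page
    }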

The usual problem is that anything that is supposed to prevent link rot itself is prone to rot.

Which is why my SaaS solution has a self-offboarding function: writers can use custom domains for their links and export a fully functional nginx- or Apache-based redirection config file at any time. I also have a contingency plan for the next decade, which includes setting up such a low-cost redirection server myself.

I needed this service for my own books. So I just built it. Dogfooding is what I recommend to bootstrappers anyway, so I am my first customer.

It’s highly reliable (it’s a 301 forwarding system after all) and very cheap to host.

You can find this at https://permanent.link


I tried to do this when I blogged with MovableType and it’s a fool’s errand. You end up needing to cache all of the links somewhere and revalidate them periodically. But it's not enough to check for a 200 or 304 because the site could do a redesign and move the content around, the domain could expire, an RTBF demand could be served, etc. So you need to validate the content in some way to ensure you're still linking to what you more or less intended to in the past.
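
A content check along these lines can be as simple as recording a fingerprint (say, the page title or a distinctive sentence) when you first make the link, then verifying it still appears on revalidation. A rough sketch, with a made-up record shape:

    // Re-fetch a link and check it still contains the fingerprint captured
    // when the link was first made; a 200 from a link farm or a redesigned
    // site will usually fail this. The LinkRecord shape is hypothetical.
    interface LinkRecord {
      url: string;
      fingerprint: string; // e.g. the page title captured at link time
    }

    async function stillIntact(link: LinkRecord): Promise<boolean> {
      try {
        const res = await fetch(link.url, { redirect: "follow" });
        if (!res.ok) return false; // hard failure
        const body = await res.text();
        return body.includes(link.fingerprint);
      } catch {
        return false;
      }
    }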

So, yeah, then you end up linking to the archive.org page (and if it's not there you could submit it to archive.org for archiving). For a time archive.org delisted pages if the current robots.txt blocked access (even if the page was archived a decade ago); I can't verify whether this is still the case.

A quick, rough scan of ~400 links from my decaying blog shows about 40% either do not resolve at all or return content from a link farm (another 30-40% appear to redirect to https versions of the site, but I didn't check further to see if the content was what was intended).


The WordPress plugin “Broken Link Checker” has really helped me reduce link rot.

It provides a list of broken links and a list of redirected links. I first ‘fixed’ the redirects to ensure I wasn’t redirecting to content that had been replaced or turned into a link farm. For fixing the broken links, I found many were available via archive.org, and swapping them out was a two-click breeze. For the non-archived pages, I often found the new URL for the same resource by Googling the link.

My only request would be a feature to archive all outgoing links.

https://wordpress.org/plugins/broken-link-checker/
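
For what it's worth, the “archive all outgoing links” part could probably be scripted against the Wayback Machine's public Save Page Now endpoint; a rough sketch (not part of the plugin):

    // Submit each outgoing URL to the Wayback Machine's Save Page Now
    // endpoint (https://web.archive.org/save/<url>), pausing between
    // requests to stay polite. Error handling kept minimal on purpose.
    async function archiveOutgoingLinks(urls: string[]): Promise<void> {
      for (const url of urls) {
        const res = await fetch(`https://web.archive.org/save/${url}`);
        console.log(res.ok ? `archived: ${url}` : `failed (${res.status}): ${url}`);
        await new Promise((resolve) => setTimeout(resolve, 5000));
      }
    }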


It would be cool if there were a JavaScript library that replaced broken links with the Internet Archive version of the page, if it exists.

I guess it’s more complicated than just that, because you’d have to take into account the date the page was originally linked in order to get the closest match, but it should be doable.

There might very well be reasons this isn’t a good idea; it’s just something that occurred to me as potentially useful, since I have a lot of old links and no doubt many of them are broken now.
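
The date matching is roughly what the Wayback Machine availability API's timestamp parameter does; a small sketch (the function name is made up):

    // Ask the availability API for the snapshot closest to the date the
    // link was originally made (timestamp format: YYYYMMDD or YYYYMMDDhhmmss).
    async function closestSnapshot(url: string, linkedOn: string): Promise<string | null> {
      const api =
        `https://archive.org/wayback/available?url=${encodeURIComponent(url)}` +
        `&timestamp=${linkedOn}`;
      const data = await (await fetch(api)).json();
      const snap = data?.archived_snapshots?.closest;
      return snap?.available ? snap.url : null;
    }

    // e.g. closestSnapshot("http://example.com/post", "20120406")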


This would not work for most links due to cross-origin restrictions.

You would also end up sending the validation request on every client page view.


I was thinking that a service worker could be used:

https://bitsofco.de/web-workers-vs-service-workers-vs-workle...

If the URL returns a 404, check the Internet Archive; if it’s there, return that, otherwise return a 404.

If CORS is an issue, just proxy the requests via the server that is serving the website.
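
Something like this, as a sketch of the idea (TypeScript compiled with the webworker lib; the /archive-proxy endpoint is a made-up server-side helper that queries the Internet Archive on the page's behalf to sidestep CORS):

    // sw.ts - intercept fetches made by the page; on a 404 (or network
    // error), fall back to a same-origin proxy that looks up an archived copy.
    const sw = self as unknown as ServiceWorkerGlobalScope;

    sw.addEventListener("fetch", (event: FetchEvent) => {
      event.respondWith(
        (async () => {
          try {
            const res = await fetch(event.request);
            if (res.status !== 404) return res;
          } catch {
            // network error: fall through to the archive fallback
          }
          const fallback = `/archive-proxy?url=${encodeURIComponent(event.request.url)}`;
          return fetch(fallback);
        })()
      );
    });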


Check out Filecoin[0] or even just IPFS[1]:

> Filecoin (⨎) is an open-source, public cryptocurrency and digital payment system intended to be a blockchain-based cooperative digital storage and data retrieval method. It is made by Protocol Labs and builds on top of InterPlanetary File System, allowing users to rent unused hard drive space. A blockchain mechanism is used to register the deals. It is a decentralized storage system that aims to “store humanity’s most important information.” Filecoin is an open protocol and backed by a blockchain that records commitments made by the network’s participants, with transactions made using FIL, the blockchain’s native currency. The blockchain is based on both proof-of-replication and proof-of-spacetime.

[0] https://en.wikipedia.org/wiki/Filecoin

[1] https://en.wikipedia.org/wiki/InterPlanetary_File_System


Interesting. I've seen Filecoin and IPFS around but haven't looked into them. Will definitely look more into it.


IME over a decade, the more obscure the source (and, to a degree, the content), the more likely the link will fail over time.

The links of big institutions used to change a lot, much less so now. Some, like WIRED, actually go to the trouble of making old content links resolve to their newer addresses.

If the content would still have value in 40 years, it will probably survive. (See the NYTimes archive.) But links themselves have little value (unless they're widely published on paper!). There may be a few big institutions thinking that far out.


My strategy would be diversification: link + archive.org + archive.is + local copy + github/gitlab copy.

Worrying about link rot, some years ago I wrote a little local webserver that took a link via a bookmarklet and saved the page in multiple formats:

    - html with wget -archive | tar.gz
    - pdf with wkhtmltopdf
    - txt with links
    - png with firefox --screenshot


Your best bet is saving those pages as PDFs and backing those up.



