Link rot and redirect madness (donmelton.com)
93 points by ca98am79 on Dec 10, 2015 | 53 comments

For anyone interested, the Internet Archive has a small but tight open source community called Archive Labs <archivelab.org> (all experimental services)

We're working on several initiatives, like bringing OpenAnnotations to the Wayback Machine, reducing link rot (see: Vinay Goel & http://vinay.blog.archive.org/2014/12/17/9/), universalizing access to open-access publications, and supporting efforts to distribute and decentralize the web.

Send me an email if you're interested in checking out our slack channel and meeting fellow contributors. Also, if you have a public-good or open-access project you would like help with or resources for, we'd love to hear about it.


Here's a startup idea that I'd been holding close to my heart, but now that my teammate's working for a BigCorp, earning beyond her dreams, it's time for it to go public. Steal this idea!

Link rot is only going to get worse from here. It's real, and it's awful.

It's not even just random websites. The NYT will post a link to a popular YouTube video, two days later the video gets pulled, and one week after publication the article's links are already stale.

As Wiki gets more 'reputable', newspapers will post links to it. Wiki being what it is, the reference to the page will go stale.

So here's an idea for a service: hash all the outgoing/inside pages on a website. If they change, either 1) give users options to review, 2) update the link, or 3) delete the reference. If the original website 404s, change the link to the archive page. If a video link is outdated, provide tools to search for similar videos to link, from inside the dashboard. For Wikipedia, automatically link to a specific revision in a page's history, not the general page. This is different from the idea posted below because it integrates with existing publishing systems, so it'd be more B2B, and one would start cash flow right away.
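A minimal sketch of the decision logic in that idea, assuming the service already has an HTTP status and a content hash for each outgoing link. The function names and action labels are hypothetical and not taken from any existing product; the Wikipedia permalink relies on the `oldid` query parameter of `index.php`.

```python
from urllib.parse import urlencode


def wikipedia_permalink(title: str, revision_id: int, lang: str = "en") -> str:
    """Build a link to one fixed revision of a Wikipedia page."""
    query = urlencode({"title": title, "oldid": revision_id})
    return f"https://{lang}.wikipedia.org/w/index.php?{query}"


def choose_action(status: int, old_hash: str, new_hash: str) -> str:
    """Pick what the dashboard should do with one outgoing link.

    The returned labels are illustrative placeholders.
    """
    if status in (404, 410):
        return "link_to_archive"  # original is gone: point at an archived copy
    if old_hash != new_hash:
        return "ask_editor"       # content changed: review, update, or delete
    return "keep"
```

A publishing system could call `choose_action` on a schedule and only surface the links that do not come back as "keep".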

For links to social media, use a 'photo snippet' tool that checks whether the link is still valid, and falls back to an image version of the link if it is dead.

I'm certain people would pay GOOD MONEY for this service. I know I would if I were running a publication.

Give me some spare change if it succeeds. : )

You can't simply hash the whole page, because for most sites the hash will change constantly due to any varying content (e.g. rotating ads, new comments, random messages, A/B testing...). I wrote up a proposal for using a CSS rule to select only the "relevant" portion of a page to hash.[1]

[1] https://bentrask.com/?q=hash://sha256/2c9e53858b5312564a2b8f...
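A rough sketch of the "hash only the relevant portion" idea, using nothing but the Python standard library. As a simplification it selects the region by element `id` rather than by a full CSS rule, so it is not the linked proposal itself; `region_hash` and the sample markup are made up for illustration. It also assumes reasonably well-formed HTML (unclosed `<p>` tags would throw off the nesting count).

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _RegionText(HTMLParser):
    """Collect the text found inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__(convert_charrefs=True)
        self.element_id = element_id
        self.depth = 0      # 0 means we are outside the selected region
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def region_hash(html, element_id):
    """SHA-256 of the whitespace-normalized text inside one element."""
    parser = _RegionText(element_id)
    parser.feed(html)
    parser.close()
    text = " ".join("".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

Two fetches of the same article with different ads or comment counts then produce the same hash, as long as the selected region itself is unchanged.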

The ultimate version would be a complete capture of the Internet at large, with hour-to-hour snapshots and history, perhaps using deltas between versions.
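As a toy illustration of storing deltas between snapshots, here is a sketch using `difflib` from the Python standard library. The function name `snapshot_delta` is hypothetical, and a real archive would need a far more compact format than line-based text diffs.

```python
import difflib


def snapshot_delta(old_text, new_text):
    """Return a unified diff (list of lines) between two page snapshots."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    ))
```

An unchanged page yields an empty list, so only pages that actually changed in a given hour would cost any storage.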

Git Internet!

I've personally started saving sites I want to keep using the print-to-PDF feature. Bookmarks aren't enough when you really care to save the data.

I use the Firefox version of Zotero[1] to archive links. It works quite well, and is more searchable than PDFs.

[1]: https://www.zotero.org/

You can save a lot of space by using a "reader-view" version of the webpage.

Some friends of mine built https://preserve.io to automate that. It's basically bookmarks-plus-print-to-pdf-as-a-service.

You've just described IPFS: https://ipfs.io/

Thanks for the link! That's a pretty interesting project I was previously unaware of.

You're welcome!

I hate to say it, but you've just described what a good SEO does. The good SEOs (and SEO firms) will analyze and take care of your internal links, your internal link structure, and make sure the site isn't linking out to outdated/old links.

If you want to do this yourself, there are several crawlers out there that you can use to gather the data.

Many publications have been reluctant to link out to sites over the years because of this very thing--link rot.

Good SEO still doesn't fix the root of the problem--lost content. If you write an article about a video, and that video gets pulled, there goes your article.

> As Wiki gets more 'reputable', newspapers will post links to it. Wiki being what it is, the reference to the page will go stale.

Unless the page was non-notable and got deleted, there'll be a redirect.

I found myself so enraged at what Wikipedia privileged editors considered non-notable that I haven't had the energy to contribute in the past few years. I'm not talking about my 2nd cousin's garage band: see the talk page on reStructuredText (which managed to survive).

I don't understand why you were enraged. The editor who spoke with you made it very clear what was needed to prove notability (reliable secondary sources) and politely explained things to you while linking to official policy. I'd hardly call them privileged. It'd be another story if they said something like "I don't want this article on my encyclopedia"; he even recommended using the IBM DeveloperWorks link. The tag that was on the article is merely a signal to editors that the article needs better sourcing. To be deleted, it'd need to be nominated, where it would need to be proven that the topic is non-notable, so the system actually works in favor of keeping articles that are actually notable. It might seem like the rules are arbitrary or wielded by wikilawyers but if you read them they're not that bad. I don't agree with all of them, but "when in Rome..."

I was the one that found the IBM link. The notability skeptic liked it, but not enough to think that was sufficient for notability. After I gave up, someone with more influence came along and bigfooted the issue. I have no visibility into why one person's opinion is more important than another's - I assume it's because they've done something like devoted 10,000 hours to editing - but I've got no time for environments that value fanaticism over reasoned argument.

And the Tool Support section - the most valuable part of the page, and which I vehemently contend constitutes essential secondary source material - remains deleted. It took me 15 minutes of digging through the page history to find it, which I'd never do if I didn't already know it was there.

The great challenge is proving something is notable. If contested it can be an uphill struggle, because journalists don't talk about obscure things much.

That's the whole point of Wikipedia. Otherwise it would be like PR Newswire.

In my vision of the perfect blogging platform, it periodically and automatically scrapes all your old links and looks for either redirections or wild size changes in the content. It also auto-archives linked content for you (which used to be easier, but can still at least be tried), and could offer these archives to either the blog owner or the post reader. But one of the many challenges of implementing that is that it is hard to make it work at scale, because of the resulting legal issues.
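The two warning signs mentioned here (a redirect, or a wild change in content size) can be sketched as a small pure function. The name `rot_signals` and the 50% threshold are assumptions chosen for illustration; the fetching itself is left out.

```python
def rot_signals(requested_url, final_url, old_size, new_size, max_change=0.5):
    """Return a list of reasons a previously saved link looks suspicious.

    old_size and new_size are content lengths in bytes from the previous
    and the current crawl; max_change is the tolerated relative change.
    """
    signals = []
    if final_url != requested_url:
        signals.append("redirected")
    if old_size > 0 and abs(new_size - old_size) / old_size > max_change:
        signals.append("size_changed")
    return signals
```

An empty list means the link looks the same as last time; anything else would be queued for a human to look at.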

Or the links are turned into archive.org links automatically at publish time? With any pages not yet captured being captured at that point?
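A small sketch of what rewriting links at publish time could look like, assuming the Wayback Machine's usual URL layout (`/web/<14-digit timestamp>/<url>` for a snapshot near that time, `/save/<url>` to request a capture). The helper names are hypothetical, and nothing here actually contacts archive.org.

```python
from datetime import datetime

WAYBACK = "https://web.archive.org"


def wayback_url(url, published):
    """Link to the snapshot closest to the post's publish time."""
    stamp = published.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK}/web/{stamp}/{url}"


def save_page_now_url(url):
    """URL that asks the Wayback Machine to capture a page."""
    return f"{WAYBACK}/save/{url}"
```

A publishing hook could request `save_page_now_url(link)` for each outgoing link, then store `wayback_url(link, publish_time)` next to the original.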

EDIT: This is a hack until the content addressable web arrives.

I've recently started to consider tools like wkhtmltopdf [1] to easily dump a website I'd like to link to. Then I'd include both the original link and the pdf version on the blog page, in order to keep the linked structure and the archived version at the same spot.

I haven't done it yet, but would be interested in others' experiences. It would be interesting to have a crawler that does it automatically for each new post.

[1] http://wkhtmltopdf.org/
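A sketch of how a per-post crawler might drive wkhtmltopdf, assuming the binary is installed and on the PATH and is invoked in its basic `wkhtmltopdf <url> <output.pdf>` form. The `archive` directory, the hashed file names, and both function names are choices made for this example only.

```python
import hashlib
import subprocess
from pathlib import Path


def pdf_command(url, out_dir="archive"):
    """Build the wkhtmltopdf command line for one linked page."""
    # A hash of the URL gives a stable, filesystem-safe file name.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16] + ".pdf"
    return ["wkhtmltopdf", url, str(Path(out_dir) / name)]


def archive_to_pdf(url, out_dir="archive"):
    """Render the page to PDF and return the path of the saved file."""
    cmd = pdf_command(url, out_dir)
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(cmd, check=True)  # raises if wkhtmltopdf fails
    return Path(cmd[-1])
```

Running `archive_to_pdf` over every link in a new post would give the blog a local PDF to show next to each original link.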

This would be problematic legally, even when what you're linking to offers a permissive license of some sort.

I tend to keep local PDFs of everything that seems crucial. But linking to them seems ethically questionable. But maybe as a last resort, with an explanatory comment.

Yes, this would be awesome. Even just auto-detecting broken links and redirecting them to archive.org versions instead seems like it would be an easy win that would sidestep the legal problems of archiving them yourself.
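A hedged sketch of the "swap broken links for archive.org versions" step, assuming the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) and a JSON reply with an `archived_snapshots.closest` object. That response shape is an assumption worth checking against the Internet Archive's own documentation; the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def availability_query(url):
    """Build the availability lookup URL for one page."""
    return "https://archive.org/wayback/available?" + urlencode({"url": url})


def closest_snapshot(payload):
    """Extract the snapshot URL from a parsed reply, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def archived_fallback(url, timeout=10):
    """Ask archive.org for a snapshot to use in place of a dead link."""
    with urlopen(availability_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Because only a link is swapped and no copy is stored locally, this matches the point above about sidestepping the problems of archiving pages yourself.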

Yes, archive.org is an amazing resource. But I'm wondering about the coverage, especially for dynamic and deep content. Maybe those are mainly issues on older snapshots.

Also, copyright issues aside, the prospect of a recursively archived Web is a little mind-boggling. But storage is always getting cheaper, so hey.

Seems that would be better built as a stand-alone spider than a bolt-in to a single blogging platform.

It looks like there are some WordPress plugins that let you check and at least fix the links, but I don't see anything that takes the extra step of archiving.


I keep all my bookmarks in a Delicious account and I was shocked that around 40% of the links I saved since 2013 are now gone. I managed to find a few archived pages on archive.org, but many of the pages I "saved" are gone forever and I can never refer to them again.

I now save every single page I find important or interesting on my local HDD and then move it to a backup disk later on.

I was having the same problem with my bookmarks disappearing from the internet. Searching for a solution I eventually made a browser extension to send bookmarks to an online archiving service (archive.is). I then added an option to save archives locally in mhtml format. I have around 370 bookmarks archived that way and it takes 240MB of space.

The addon can be found here: https://github.com/rahiel/archiveror.

Saving archives locally is great, but I wouldn't count on using archive.is for long-term reliable storage. Seeing as how it must cost its operator(s) more than a few ducats to operate and stores copies of copyrighted material on centralized server(s) without being able to hide behind the "lol we're just a library" shield that Internet Archive can, I wouldn't be surprised to see it suddenly disappear one day.

Very nice! Thank you! Now I'll just have to be worried about what happens when archive.is goes down :)

Been looking for something like this for a while, thanks!

Note that you can "save page now" on the Internet Archive -- via web, API, or bookmarklet -- that would increase that % quite a lot.

There could even be an integration into Delicious or WordPress. Wikipedia is working on a bot for external links in their content.

Your comment reminded me that I have a Delicious account, which I haven't used for years. I logged in and just for fun went straight to my oldest link (saved in September 2006), and it still works!


But here's one saved the same day that doesn't:


Why do you prefer that to pinboard? Thanks!

Does pinboard save the content?

the $25/year plan does, yes

I recently went through a large portion of my old blog. It's astounding how many excellent resources from just a few years ago have completely vanished due to link rot; in a few cases I was unable to find archives even on archive.org.

It's really sad, and more than a little disheartening.

Hosting the content or having your own copy of something is the only real way to be certain. I had the same experience and I've started saving all my work. Thanks for the idea about the blog though, I need to go back through mine and export everything.

I can add to the sample size here and have found similar results. I wrote an SEO test piece a while ago which inadvertently became quite popular[0] (and did rank well for the targeted terms).

In updating it recently (couple months ago) I found many of the links simply 404'd, not even any redirect at all. Wikipedia was pretty bad for this too as the author found.

I took to hosting all the files I linked to myself to at least be sure they'd stay around but it is a bit of a losing battle with regard to linking to normal webpages.

I might look to link to an archive.org page instead in the future if the link rot keeps happening. Perhaps that would be a more stable option.

[0] http://josharcher.uk/blog/why-margaret-thatcher-is-hated/

Wikipedia handles redirects mostly fine. Mostly.

The one big problem with Wikipedia redirects is those to sections: if a page is moved or disappears, the software notices, but it doesn't if a section changes its name. Thus, frequently the anchor in an intra-wiki link is broken.
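A minimal sketch of how a broken section anchor could be detected, assuming you already have the HTML of the target page. It simply checks whether the link's fragment matches any `id` (or legacy `<a name>`) in the page; `anchor_exists` is a hypothetical helper, not how MediaWiki or any Wikipedia bot does it.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class _AnchorCollector(HTMLParser):
    """Gather every value a #fragment could legally point at."""

    def __init__(self):
        super().__init__()
        self.anchors = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (name == "id" or (tag == "a" and name == "name")):
                self.anchors.add(value)


def anchor_exists(link, target_html):
    """True if the link has no fragment or the fragment exists in the page."""
    fragment = unquote(urlsplit(link).fragment)
    if not fragment:
        return True
    collector = _AnchorCollector()
    collector.feed(target_html)
    return fragment in collector.anchors
```

A renamed section shows up as a link whose page still loads but for which `anchor_exists` returns False.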

FWIW broken anchors are tracked here: https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Bro.... Maybe you can help alleviate the problem.

Awesome, I didn't know about that page. I've gone and fixed a few of the top ones.

To some degree, this is why some sites use services like archive.is to link to third party sites, because the archiving service will keep a cached version of the page even if the original source goes down. Maybe this could be the solution for certain cases? Okay, it won't be super popular with all site owners (they'd rather traffic came to their sites directly so their ads were viewed), but it's certainly more stable.

This is worse, as centralizing creates a single point of failure over which you have no control. Specifically, archive.is has repeatedly taken a stance against going open source and offering a self-hosting solution, while being privately funded with no option for you to take part in it.

Archive.org, or maybe something like wallabag, or even a screenshot; but anyway, to work around link rot one is better off with a locally cached copy.

I always thought blogs should link to immutable content addressed mirrors of sites. Perhaps it could be a paid service.

Wrote a script to help manage linkrot on my blog: https://github.com/kaihendry/linkrot

A number of issues remain: people who squat on pages, and accounting for temporary failures.
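One common way to account for temporary failures is to flag a link only after several consecutive failed checks. The sketch below illustrates that idea under made-up names and a made-up threshold; it is not a description of how the linked script works.

```python
def is_probably_dead(check_history, threshold=3):
    """Decide whether a link is dead from its recent check results.

    check_history is a list of booleans, oldest first, where True means
    the check succeeded. The link is flagged only when the most recent
    `threshold` checks all failed.
    """
    recent = check_history[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A single outage during one crawl then never triggers a rewrite; squatted pages still need a separate content check, since they answer with a healthy status.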

I very rarely update my website. (Most of it predates modern CMSes, anyway.)

One of the things that I do when I put up a page, though, is make PDFs of everything that I link to. This way, if the link dries up, or is changed, there's still something to fall back to.

> I’ll soon upgrade to HTTPS here.

Pardon my ignorance and probably silly question, but what is the point of making data transfers to/from a blog hosted on its own domain more secure? If I get it right, it will just hide which particular articles a reader visits (not the fact that they visit this particular domain), and it will prevent caching along the wire. I suppose there must be some real benefit right in front of me (see "everyone else seems to have performed the same upgrade"), but I cannot see it. Any clues?

For the user, the benefit is privacy and MITM protection. Hopefully, browser chrome will eventually mark non-HTTPS sites explicitly as unsafe (like a red bar at the URL).

Ok, valid point.

One possible approach to link rot: each web page embeds within it (or in associated resources) a copy of every page linked to. Then when a link is clicked on it's always possible to show the page as intended.

The biggest issue with this approach would probably be legal, as you'd find yourself redistributing copyrighted works. Could a fair use argument be made since it would be furthering public discourse?

This is why I don't bookmark things anymore; I save them directly to Evernote.
