
> I gave up on dealing with link rot years ago. If I come across an old post with non-functioning links, I may just find a new resource, link to The Wayback Machine, or (if I'm getting some spammer trying to get me to fix a broken link by linking to their black-hat-SEO-laden spamfarm) remove the link outright. I don't think it's worth the time to fix old links, given the average lifespan of a website is 2½ years and trying to automate the detection of link rot is a fool's errand (a page that goes 404 or does not respond is easy - now handle the case where it's a new company running the site and all old links still go to a page, but not the page that you linked to). I'm also beginning to think it's not worth linking at all, but old habits die hard.

After about nine years of writing, I've concluded something similar: my existing reactive approach (https://www.gwern.net/Archiving-URLs) is not going to scale either with expanding content or over time. Fixing individual links is OK if you have only a few or aren't going to be around too long, but as you approach tens of thousands of links over decades, the dead links build up.

So the solution I am going to implement soon is taking a tool like ArchiveBox or SinglePage and hosting my own copies of (most) external links, so they will be cached shortly after linking and can't break. The bandwidth and space will be somewhat expensive, but it'll save me and my readers a ton in the long run.
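Concretely, the first step would be something like the sketch below (a rough illustration, not my actual pipeline; the "_site" directory and output filename are placeholders): walk the generated pages, collect every external href, and dump them into a newline-delimited file for the archiver to ingest.

    from html.parser import HTMLParser
    from pathlib import Path
    from urllib.parse import urlparse

    class LinkCollector(HTMLParser):
        """Collect http(s) hrefs from <a> tags."""
        def __init__(self):
            super().__init__()
            self.links = set()

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href") or ""
                if urlparse(href).scheme in ("http", "https"):
                    self.links.add(href)

    def collect_external_links(site_dir):
        links = set()
        for page in Path(site_dir).rglob("*.html"):
            parser = LinkCollector()
            parser.feed(page.read_text(errors="ignore"))
            links |= parser.links
        return links

    if __name__ == "__main__":
        # "_site" is a placeholder for wherever the built pages live.
        urls = collect_external_links("_site")
        Path("urls.txt").write_text("\n".join(sorted(urls)) + "\n")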




I stopped blogging after a couple years in because I started running out of new things to talk about. I started back up about 5 years later and repeated the same experience. I've seen other bloggers and web comics struggle with the same issue.

The mistake I vowed to correct if I started a third time was this feeling that if I'd already written a couple pages on a topic I should be done. People change. Tech changes. I shouldn't feel guilty 'retreading' something I said a couple years ago. I have new information.

Which is to say, rather than forever going back and updating old entries, it might be more productive to revisit the material you still find you have the strongest feelings about. Talk about what has changed, and what hasn't.


> So the solution I am going to implement soon is taking a tool like ArchiveBox or SinglePage and hosting my own copies of (most) external links, so they will be cached shortly after linking and can't break. The bandwidth and space will be somewhat expensive, but it'll save me and my readers a ton in the long run.

Wouldn't a more ideal solution be archiving via a variety of external (or internal, I guess) sources the first time a link appears on your site, and then after a year automatically switching all links to archived versions? This would kill link rot in its tracks while preserving a lot of the value of links for the people you're linking to, and would cost less in bandwidth costs given the curve of access on old content.


> Wouldn't a more ideal solution be archiving via a variety of external (or internal, I guess) sources the first time a link appears on your site, and then after a year automatically switching all links to archived versions?

Yes, by 'shortly' I meant something like 90 days. In my experience, most pages won't change too much 90 days after I add them (I'm thinking particularly of social media-like things), but it's also rare for something to die that quickly. 365+ days, however, would be perilously long. My main concern there is balancing between delaying the snapshot so long that the link dies first (thus recreating the manual linkrot-repair problem I'm trying to avoid) and snapshotting so eagerly that I archive a version which is not finished and would mislead a reader.

(I also went through all my domains and created a whitelist of domains that my experience suggests are trustworthy or where local mirrors wouldn't be relevant. For example, obviously you don't really need to worry about Arxiv links breaking, or about English Wikipedia pages disappearing - barring the occasional deletionist rampage.)
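To make the policy concrete, the gatekeeping logic is roughly this (a minimal sketch with made-up names; the whitelist entries are just the two examples above):

    from datetime import datetime, timedelta
    from urllib.parse import urlparse

    ARCHIVE_DELAY = timedelta(days=90)
    # Domains assumed stable enough that local mirrors add little value.
    WHITELIST = {"arxiv.org", "en.wikipedia.org"}

    def due_for_archiving(url, first_seen, now=None):
        """True once a link has aged ~90 days and is not whitelisted."""
        now = now or datetime.now()
        domain = urlparse(url).netloc.lower()
        if any(domain == d or domain.endswith("." + d) for d in WHITELIST):
            return False
        return now - first_seen >= ARCHIVE_DELAY

    # A link added 120 days ago on a non-whitelisted domain is due:
    print(due_for_archiving("https://example.com/post",
                            datetime.now() - timedelta(days=120)))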

> This would kill link rot in its tracks while preserving a lot of the value of links for the people you're linking to, and would cost less in bandwidth costs given the curve of access on old content.

My traffic patterns are different from a blog, so it wouldn't.


Always glad to hear your perspective on this. I am working on my own archival system for href.cool - I've already lost thewoodcutter.com and humdrum.life. (Been going for one year.)

My approach right now is to verify all of the links weekly by comparing last week's title tag on each page to this week's. I've had to tweak this a bit - for PDF links or for pages that have dynamically generated titles - I can opt to use the 'content-length' header or a meta tag instead. (Of course, the old title can be spoofed, so I'm going to improve my algorithm over time.)
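In rough pseudo-Python, the weekly check amounts to something like this (a sketch of the idea only, not the actual href.cool code; the byte limit and fingerprint format are arbitrary):

    import re
    import urllib.request

    def page_fingerprint(url):
        """<title> for HTML pages; Content-Length as a fallback for PDFs etc."""
        req = urllib.request.Request(url, headers={"User-Agent": "link-checker"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            if "html" in resp.headers.get("Content-Type", ""):
                body = resp.read(65536).decode("utf-8", errors="replace")
                m = re.search(r"<title[^>]*>(.*?)</title>", body, re.S | re.I)
                if m:
                    return "title:" + m.group(1).strip()
            return "length:" + (resp.headers.get("Content-Length") or "unknown")

    def link_changed(url, last_week_fingerprint):
        return page_fingerprint(url) != last_week_fingerprint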

I wish I had a way of seeding sites with you. I imagine we have some crossover on interests and would also love to contribute bandwidth - as would some of your other readers I'm sure.


It turns out I also have ~20 years of blog content. There are a bunch of gaps in the middle for various reasons of data loss (some of which were probably more useful than others), but beyond that I've tried to maintain my internal links as much as possible. At this point I have redirects in place to support link structures going back across two blog migrations (from a Drupal install to a custom Django blog to a much less custom Jekyll site). That's about where I feel my responsibility for link rot ends. I've done the best I can so that if for some reason you find a /node/somerandomnumber link somewhere, you still get redirected to whatever blog post it leads to, even though I haven't used Drupal in over a decade. I've not been perfect in maintaining such redirects (there were some complex CMS sections I once had in Drupal that I never bothered to migrate or that became entirely new things, plus the aforementioned losses of data prior to that Drupal blog and whatever I'd managed to migrate into it from the other Drupal blog and the crazier custom blogs that preceded it, some of them from before 'blog' was even a word), but I've tried my best. The onus the web places on site admins is that we all collectively try our best. It's not my responsibility to worry about external link rot from my blog posts. I'll still lament it, as sometimes there are great losses, but I'm not the one that broke that link contract.
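The redirect layer itself is conceptually just a lookup table; a toy illustration (the paths here are invented, not my real mapping):

    # Map legacy paths from each past platform to current permalinks.
    LEGACY_REDIRECTS = {
        "/node/123": "/2008/04/some-old-drupal-post/",         # Drupal era
        "/blog/2013/05/django-post": "/2013/05/django-post/",  # Django era
    }

    def resolve(path):
        """Return (status, location) for an incoming request path."""
        if path in LEGACY_REDIRECTS:
            return 301, LEGACY_REDIRECTS[path]
        return 200, path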

When I get a spammer asking me to fix a broken link I sometimes cry about whatever the web lost that they are bringing to my attention, but generally my only response is to suggest to the spammer that they file a Pull Request to my blog so I can more properly code review the suggested change. (Unsurprisingly, no spammer has actually bothered to take me up on that offer. It's unlikely I'd actually merge such a change in, but I'd love to see one try.)


It may not be your 'responsibility', whatever that means, and it would be nice if there was less link rot, but the fact remains: you provided those links because you thought they were relevant & useful for the reader; many of them are going to break; are you going to fix them, or not?

You have decided you do not care enough about your writings or your readers to invest the effort to fix them. That is your decision, and I don't know enough about you, your writings, or your readers to criticize it.

I have decided differently.


I was not criticizing your decision, simply trying to offer a differing viewpoint. I briefly considered doing something similar to the path you are traveling down, but realized that I was happier taking a different path.

I provided those links because I thought they were relevant and useful to the reader at the time. I can't fight time and I can't fight entropy for all of the web or even just my own tiny corner. I salute you for trying. I set my border at the end of domain names that I control, because I know I have responsibilities there and am also able, in good conscience, to end them there; otherwise I'd feel so much guilt for how the web has shifted and changed in > 20 years of posting webpages to it.

My blog captures moments in time, and just as I don't go back and fix rotten opinions that haven't aged well in some of them, I generally don't go back and fix broken links. I would hope that anyone exploring my past archives would give past me the benefit of the doubt and contemplate such archives from the context in which they were written and the very different person that wrote them and sometimes the very different web that they were posted to.


I've been meaning to set up some kind of basic crawl & archive system forever. Ideally I'd like to output something replayable and analyzable like a WARC or HAR, but also spit out a PDF, which I think Chrome headless should be able to do. Right now I just print to file if I want to "save" something. But your write-up is very thorough; that is basically the situation.

There's so much good unique content on YouTube too, which is almost impossible to properly archive due to size; my subscriptions alone would probably be over 1TB.

I'd like to write a basic tool that would take a PDF, e.g. of a book, and output a directory of PDFs of snapshots of all web links in the book, to create basically a full reference snapshot that could be stored alongside the book. Not sure if it would work well and result in a reasonable size.
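A first cut could be quite small; something like the sketch below (assuming the third-party pypdf package and a local Chromium binary, with placeholder filenames): pull the URI link annotations out of the book's PDF, then render each URL to its own PDF with headless Chrome's --print-to-pdf.

    import subprocess
    from pathlib import Path
    from pypdf import PdfReader

    def extract_links(pdf_path):
        """Collect unique URI annotations (clickable links) from a PDF."""
        urls = []
        for page in PdfReader(pdf_path).pages:
            for annot in page.get("/Annots") or []:
                uri = annot.get_object().get("/A", {}).get("/URI")
                if uri and uri not in urls:
                    urls.append(uri)
        return urls

    def snapshot(urls, out_dir="refs"):
        Path(out_dir).mkdir(exist_ok=True)
        for i, url in enumerate(urls):
            out = Path(out_dir) / f"{i:04d}.pdf"
            subprocess.run(["chromium", "--headless", f"--print-to-pdf={out}", url],
                           check=False, timeout=120)

    if __name__ == "__main__":
        snapshot(extract_links("book.pdf"))  # "book.pdf" is a placeholder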


> Ideally I'd like to output something replayable and analyzable like a WARC or HAR, but also spit out a PDF, which I think chrome headless should be able to do.

ArchiveBox does WARCs and PDFs, and does embedded media; it's easy to use: you can point it at a newline-delimited text file of URLs and it'll process them.
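For example, feeding it a URL list from a script (a minimal sketch, assuming ArchiveBox is installed and the archive directory has already been set up with archivebox init):

    import subprocess
    from pathlib import Path

    # Pipe a newline-delimited URL file to `archivebox add` via stdin.
    subprocess.run(["archivebox", "add"],
                   input=Path("urls.txt").read_text(),
                   text=True, check=True)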

I'm not sure how it handles YouTube - whether it shells out to something like youtube-dl or not... But really, 1TB is not all that much. You can get 8TB internal HDDs for like $200 now.



