After about nine years of writing, I've concluded something similar: my existing reactive approach (https://www.gwern.net/Archiving-URLs) is not going to scale either with expanding content or over time. Fixing individual links is OK if you have only a few or aren't going to be around too long, but as you approach tens of thousands of links over decades, the dead links build up.
So the solution I am going to implement soon is taking a tool like ArchiveBox or SinglePage and hosting my own copies of (most) external links, so they will be cached shortly after linking and can't break. The bandwidth and space will be somewhat expensive, but it'll save me and my readers a ton in the long run.
The mistake I vowed to correct if I started a third time was the feeling that if I'd already written a couple of pages on a topic, I should be done. People change. Tech changes. I shouldn't feel guilty about 'retreading' something I said a couple of years ago; I have new information.
Which is to say, rather than forever going back and updating old entries, it might be more productive to revisit the material you still find you have the strongest feelings about. Talk about what has changed, and what hasn't.
Wouldn't a more ideal solution be to archive a link, via a variety of external (or internal, I guess) sources, the first time it appears on your site, and then after a year automatically switch all links to the archived versions? This would kill link rot in its tracks while preserving much of the value of the links for the people you're linking to, and would cost less in bandwidth given the access curve on old content.
Yes, by 'shortly' I meant something like 90 days. In my experience, most pages won't change much 90 days after I add them (I'm thinking particularly of social-media-like things), but it's also rare for something to die that quickly; 365+ days, however, would be perilously long. My main concern is balancing between delaying the snapshot so long that the link dies first (recreating the manual linkrot-repair problem I'm trying to avoid) and snapshotting so eagerly that I archive a version which is not finished and would mislead a reader.
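The age-gating part of that is trivial to mechanize; a minimal sketch (the function name and the 90-day constant are mine, not from any particular tool):

```python
import datetime

# Snapshot delay: long enough for a page to settle,
# short enough that few links die before we capture them.
SNAPSHOT_DELAY = datetime.timedelta(days=90)

def due_for_snapshot(added: datetime.date, today: datetime.date,
                     delay: datetime.timedelta = SNAPSHOT_DELAY) -> bool:
    """True once a link has aged past the snapshot delay."""
    return today - added >= delay
```

A nightly cron job could then walk the link database, snapshot whatever is due, and record the resulting archive URL for the later link-rewriting pass.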
(I also went through all my domains and created a whitelist of domains that my experience suggests are trustworthy or where local mirrors wouldn't be relevant. For example, obviously you don't really need to worry about arXiv links breaking, or about English Wikipedia pages disappearing - barring the occasional deletionist rampage.)
> This would kill link rot in its tracks while preserving much of the value of the links for the people you're linking to, and would cost less in bandwidth given the access curve on old content.
My traffic patterns are different from a blog, so it wouldn't.
My approach right now is to verify all of the links weekly by comparing last week's <title> tag on each page to this week's. I've had to tweak this a bit: for PDF links, or for pages with dynamically generated titles, I can opt to use the 'Content-Length' header or a meta tag instead. (Of course, the old title can be spoofed, so I'm going to improve my algorithm over time.)
I wish I had a way of seeding sites with you. I imagine we have some crossover on interests and would also love to contribute bandwidth - as would some of your other readers I'm sure.
When I get a spammer asking me to fix a broken link I sometimes cry about whatever the web lost that they are bringing to my attention, but generally my only response is to suggest to the spammer that they file a Pull Request to my blog so I can more properly code review the suggested change. (Unsurprisingly, no spammer has actually bothered to take me up on that offer. It's unlikely I'd actually merge such a change in, but I'd love to see one try.)
You have decided you do not care enough about your writings or your readers to invest the effort to fix them. That is your decision, and I don't know enough about you, your writings, or your readers to criticize it.
I have decided differently.
I provided those links because I thought they were relevant and useful to the reader at the time. I can't fight time and I can't fight entropy for the whole web, or even just my own tiny corner of it; I salute you for trying. I set my border at the edge of the domain names I control, because I know I have responsibilities there and can, in good conscience, end them there; otherwise I'd feel so much guilt for how the web has shifted and changed in 20+ years of posting webpages to it.
My blog captures moments in time, and just as I don't go back and fix rotten opinions that haven't aged well in some of them, I generally don't go back and fix broken links. I would hope that anyone exploring my past archives would give past me the benefit of the doubt and contemplate such archives from the context in which they were written and the very different person that wrote them and sometimes the very different web that they were posted to.
There's so much good, unique content on YouTube too, which is almost impossible to archive properly due to size; my subscriptions alone would probably come to over 1TB.
I'd like to write a basic tool that would take a PDF - e.g. of a book - and output a directory of PDF snapshots of all the web links in it: basically a full reference snapshot that could be stored alongside the book. I'm not sure if it would work well or result in a reasonable size.
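The link-harvesting half is straightforward once you have the PDF's text (e.g. via poppler's pdftotext); a rough sketch of pulling out the URLs to feed into an archiver (regex and function name are my own, and a real tool would need smarter URL boundary handling):

```python
import re

# Match http(s) URLs, stopping at whitespace and common delimiters.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_urls(text: str) -> list[str]:
    """Collect unique http(s) URLs from extracted PDF text, in order."""
    seen: set[str] = set()
    urls: list[str] = []
    for u in URL_RE.findall(text):
        u = u.rstrip(".,;")  # strip trailing prose punctuation
        if u not in seen:
            seen.add(u)
            urls.append(u)
    return urls
```

Each harvested URL could then be rendered to its own PDF snapshot by something like a headless browser, one file per link, alongside the source book.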
ArchiveBox does WARCs and PDFs, and does embedded media; it's easy to use, you can point it at a newline-delimited textfile of URLs and it'll process it.
I'm not sure how it handles YouTube - whether it shells out to something like youtube-dl or not... But really, 1TB is not all that much. You can get 8TB internal HDDs for like $200 now.