Link rot and content drift are endemic to the web (theatlantic.com)
321 points by timmytokyo 33 days ago | hide | past | favorite | 211 comments



By the way, the technical side of this is very interesting. If you look at the tools mentioned (the wayback machine, but also perma.cc and other archival solutions), almost all of them rely on a single semi-modern tech stack that produces WARCs (web archives, ISO 28500:2017 https://iipc.github.io/warc-specifications/specifications/wa...).

The main crawler still seems to be heritrix3 (https://github.com/internetarchive/heritrix3), but there's a great little ecosystem with tools such as webrecorder and warcprox.
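If you want to poke at the format itself, the webrecorder project's warcio library makes WARCs easy to inspect; a minimal sketch, assuming a local capture named example.warc.gz:

    # list every archived HTTP response in a WARC, with status line and target URL
    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                uri = record.rec_headers.get_header('WARC-Target-URI')
                status = record.http_headers.statusline if record.http_headers else ''
                print(status, uri)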

Still, I've read through the code of these tools and feel that they are failing in the face of the modern web with single-page apps, mobile phone apps and walled gardens. Even newer iterations with browser automation are getting increasingly throttled and blocked and excluded from walled gardens.

Perhaps the time has come for a coordinated, decentralized but omnipresent approach to archival.


WARC can record and replay single-page apps, but it struggles with knowing where a "page" begins and ends.

There was a time when I was furious with the web going to hell and I investigated the possibility of "web without browsers" that started with making a WARC capture of a page and putting pages through extensive filtering and classification before the user sees anything.

With interactive capturing you can push a button to indicate that a page is done "loading" but with automated capturing you can't really know that the page is done or that you got a good capture. That ended the project right there.
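For what it's worth, the usual workaround in today's browser-automation tools is a heuristic rather than an answer: navigate, then wait until the network has been quiet for some window and call that "done". A rough sketch with Playwright (the URL and timeout are placeholders); it's a guess, which is exactly the problem described above:

    # heuristic capture: treat ~500ms of network silence as "the page is done"
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # "networkidle" means no network connections for 500ms -- a hope, not a guarantee
        page.goto("https://example.com/", wait_until="networkidle", timeout=30_000)
        html = page.content()  # snapshot of the DOM as rendered at this moment
        browser.close()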


I, too, was fascinated by a "web without browsers" (or with other kinds of browsers, really) until i stepped into the community and realized that community was a proto-nazi cesspit full of misogynistic attitudes.

Maybe now that times have passed, people have died who posthumously admitted their preferences for white supremacy (and heavily bitcoin-supported that), and whole projects have been renamed, there can be a more inclusive community built around browserless web? For those who haven't followed, i'm referring to Woob (previously Weboob). I'd be interested in other people's feedback about that community lately, the ideas are great!


> I, too, was fascinated by a "web without browsers" (or with other kinds of browsers, really) until i stepped into the community and realized that community was a proto-nazi cesspit full of misogynistic attitudes.

Could you expand on that? It seems a bit out of (anti-)left field, so to speak.


I had this idea too. Wrote some code to scrape data from my school's badly designed website and it significantly improved my quality of life. Really made me think. What if we had a huge library of scrapers for every single website out there? We could build custom clients and have full control over everything. If people can maintain absurdly huge adblocking databases, surely something like this would also be possible.
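As a toy illustration of what a single entry in such a library of scrapers might look like (the URL and CSS selector here are made up):

    # minimal per-site scraper: fetch the page, keep only the data, skip the bad UI
    import requests
    from bs4 import BeautifulSoup

    def school_announcements(url="https://school.example.edu/announcements"):  # hypothetical URL
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # ".news-item h2" is a made-up selector; every site needs its own recipe
        return [h.get_text(strip=True) for h in soup.select(".news-item h2")]

    for title in school_announcements():
        print(title)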

Nice to know about Weboob. No idea what the community was like but it's nice to know I'm not insane for thinking about stuff like this.


I'm completely out of the loop on something like this, but could you in theory apply some kind of ML to identify the end of pages to assist with good page captures?


Probably. Certainly the more you spend on it, the better you could do.

At the time I was most bothered by the slow load times of web pages and blaming this phenomenon:

https://www.sjsu.edu/faculty/watkins/samplemax4.htm

particularly the fact that if you take the max of N random variables, the expected value gets worse as N increases -- that is, the page isn't done loading until the slowest HTTP request completes.
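A quick simulation of that effect, assuming (say) exponentially distributed request times, where the expected maximum of N draws grows roughly like log N:

    # expected page "load time" = E[max of N request times]; watch it grow with N
    import numpy as np

    rng = np.random.default_rng(0)
    for n in (1, 5, 20, 100):
        samples = rng.exponential(scale=1.0, size=(100_000, n))
        print(n, round(samples.max(axis=1).mean(), 2))  # ~ ln(n) + 0.577 for exponentials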

So I saw the "knowing when the page is done" problem as being particularly core, and it would be if the goal was to "win the race" against a conventional web browser.

If you were (say) preloading all the links submitted to hacker news you might be able to tolerate the system taking 5 minutes to process an incoming page. (See archive.is)

Today I've noticed that sites like Wired are giving up on complaining about my anti-track and ad-blocker and they just load the page partially which would drive me crazy if I was serious about debugging.


> increasingly throttled and blocked and excluded from walled gardens

I keep thinking back to Jacob Applebaum's stance of "facebook and the other walled gardens are the real dark web."


Honestly, it would be a better use of surplus resources than crypto mining.

If only there were a way to algorithmically tie a proof of work for a new cryptocurrency to archival of the internet in a way that wouldn't be easily gamed (by people archiving easy-to-access content or highly redundant archival of trivia).



I think “right to be forgotten” is important and I’m generally against everlasting social media posts, but for copyrighted works, we really need a centralized Library of Congress that acts to archive these. In order for that to happen there needs to be an equivalent “publishing” mechanism for the web - where the user says - I created something and I want it to be archived. This would cover things that exist behind a paywall or are only delivered as newsletters.


WARC is genuine genius, and a very real and valuable contribution, in large part from the Internet Archive.


It's an Atlantic article so it's long, and several of the comments here show that people aren't actually reading the whole thing... but I did and it's worth the time. It's not only about links being dead, it's about the lack of transparency & audit when content is changed via takedown requests, it's about dead links showing up in decades-old supreme court decisions, it's about private industry's lack of incentive for wanting to improve any of these issues, etc.


Having read the whole thing, it seems partially good, partially misguided, and partially terrible.

The overall bent is a hand-wringing about link rot, which I thought we mostly got over a decade ago. The Internet is fundamentally ephemeral. If you see something you like, save it so you can repost it later. If you rely on someone else to keep it up indefinitely, you're being foolish.

Around the edges of that main discussion, the Atlantic also touches on censorship in all the wrong ways, re-iterating the too-common view that censorship is good as long as the good guys do it. They at least argue that this censorship should be transparent and censored works still accessible in some way, but they seem to not understand the nature of what they're talking about.

Censored works aren't censored to protect the public. They're censored to protect the rich and powerful. That's why "right to be forgotten" really exists. That's why Google and Youtube and Twitter and Facebook quash anything that goes against the accepted narrative in any given field. They aren't protecting the public from dangerous misinformation. No one gives a shit about the public. They're protecting the financial and political interests of some very powerful people.

Given this, talk of a "poison cabinet" only illustrates ignorance of the issue. The Memory Hole cannot be divorced from the censorship process, it's a core part of it. If people can still find the information in some form, it's not censored enough to make the people it threatened happy.

And this leads to the final point, which is that the real reason the web is "rotting" isn't link rot, it's censorship on the part of tech monopolies, due to their joined-at-the-hip relationship with every large corporation and industry imaginable due to advertising and other deals. The fact that links die doesn't matter much: you can just repost the material. The fact that links ARE ACTIVELY KILLED to suppress their information is a much more serious problem, and one that doesn't have an easy solution besides full breakup of the tech monopolies.


I'm not 100% convinced by your assertion that censorship only favors the rich and powerful. It can and often does, but it can also help people without power or society at large. For instance, taking down a dox for a niche YouTuber is clearly not helping a powerful person, but it's still arguably censorship.

The misinformation area is somewhat stickier, but here's a decent example: if somebody decided to hurt you by spreading rumors (let's say that you watch CP) and spends time and money to get that rumor to the top of search results and forum threads, what's the right course of action? How good are you going to feel about using speech to counteract it when the result is a Google search giving your denial in spot 1 and the accusation in spot 2?

We all have to grapple with power and the ability to abuse it, but I don't think it's effective to say power is fundamentally wrong. The conversation is more nuanced than that and has to be viewed as systems with checks on power, which means specific design-thinking.


> if somebody decided to hurt you by spreading rumors

We have libel laws to address that.

GP's point is that Google, Facebook, et al. are preemptively censoring non-mainstream content just to protect themselves. They don't really care about the public.


Have you ever had to deal with the court system? It is VERY slow, expensive as hell, and utterly frustrating.


> just to protect themselves

To protect themselves from the public. Whether it's because consumers might take their business (and their data) elsewhere in disgust at what a particular platform is turning into, or because democratically elected lawmakers could start imposing sanctions or new regulations.

Companies are always looking out for themselves, that's a given. But that doesn't mean their actions are completely divorced from public opinion.


And libel laws do not stop it from being the #1 result on Google, or random people re-posting it. Libel laws were meant to deal with traditional media.


>The overall bent is a hand-wringing about link rot, which I thought we mostly got over a decade ago. The Internet is fundamentally ephemeral. If you see something you like, save it so you can repost it later. If you rely on someone else to keep it up indefinitely, you're being foolish.

Did we? Should we?

Acknowledging the current state of affairs doesn't require accepting its flawed nature.

Imagine a world where *gasp* the BBC, NYTimes, etc. kept all versions of their articles available and online. Where the "pretty URL" shows the most recent version of a page, provides a permalink, and provides permalinks to all previous versions of a page.

I don't expect most sites to do this, but since someone else mentioned the BBC, I am targeting journalism as an example.


Isn't the BBC typically pretty good at maintaining old URLs? An example: a BBC article on the 9/11 attacks, published on that day: http://news.bbc.co.uk/1/hi/world/americas/1537469.stm


Maybe the reason it feels impossible to stem the tide of link rot is that it takes tremendous energy to constantly increase the entropy of a system. And that energy has opportunity cost that no one really wants to talk about.

The article has the unstated assumption that eternal preservation of all writing ever is a net benefit. I think it's worth having a discussion on that point.


I disagree that the unstated assumption is eternal preservation of all writing. IMO the author is clearly focusing on official or semi-official published information, not necessarily what you and I write on places like HN.


And there's the irony that's pointed out in the article as well - official documentation on government websites may not last past an elected official's term, yet blithe comments on a social media site that are later regretted may last forever.


> Maybe the reason it feels impossible to stem the tide of link rot is that it takes tremendous energy to constantly increase the entropy of a system.

I suppose you meant "decrease"


No, they mean increase. Decreased entropy would be cheaper to store and maintain.


Yes, thanks. Somehow that's always been backwards in my brain. Maybe I'll remember it this time.


> audit when content is changed via takedown requests

Not just that, but even due to politics, someone wanting their history "changed", or even worse reasons.

There were multiple Reddit arguments recently where someone posts a news story about something, discussion starts, two hours later the story is changed (without any footnote or old version available), and arguments start because "there is nothing in the article that says <what you said>", with no way for the original poster to prove that it was there, because no one even thought they'd need a screenshot.


> People tend to overlook the decay of the modern web, when in fact these numbers are extraordinary—they represent a comprehensive breakdown in the chain of custody for facts.

This is a particularly good quote to sum up the article. The internet is not a repository of facts, it is a repository of facts, spam, junk, and other things. Moreover, it is not the only repository of these.

Link rot happens. Content is subject to the will of the publisher to spend the time and/or money to continue to host it.

Depending on links to work eternally is a mistake. The problem is not the link rot, it is the bad assumption.


It's depressing.

I know somebody who started a business that was successful for a while and then failed. Spammers got control of the domain and now it is full of ads for a dangerous diet drug.

What makes my blood boil is that it impugns the integrity of the founder who is a decent person who has nothing to do with that scam.


This actually happened to me:

https://www.voo.st/

We shut down the business and abandoned the domain. Someone registered the domain, created a similar-looking website by hand (recycling a lot of the text and images), and added spam. It even has my old company address.

This is a .st domain, about $35/yr. The web design work cost something too. More than I would have expected the link juice from a single website to be worth.

What we really need is some sort of DNS record or meta content we can add that tells search engines "this domain is being abandoned, destroy all link juice".


The traffic you get from people following links is measurable (and real!) The traffic you get from "link juice" is imagined.

The original PageRank paper assumed that PageRank approximated the distribution of views on web pages assuming that people followed links at random.
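For reference, that random-surfer model is just a power iteration over the link graph; a toy sketch with a made-up four-page web:

    # toy PageRank: repeatedly redistribute rank along outlinks (random-surfer model)
    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}  # made-up link graph, page -> outlinks
    n, d = len(links), 0.85                         # d is the usual damping factor
    rank = np.full(n, 1.0 / n)
    for _ in range(100):
        new = np.full(n, (1.0 - d) / n)             # chance the surfer jumps to a random page
        for page, outs in links.items():
            for target in outs:
                new[target] += d * rank[page] / len(outs)
        rank = new
    print(rank)  # approximate stationary distribution of the random surfer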

If Google wanted to know what people are viewing today, they don't need to collect a link graph and do matrix math. They can measure it directly with Chrome, Google Analytics and data exhaust from the advertising platform.


Nodes in keywordspace don’t die with the businesses that created them. The popularity of domains and links and words and phrases are permanently altered by the existence of the business. It’s a digital footprint like how a real-world business leaves a physical footprint. Some footprints are harmless - just a memory of activity that once happened. Other footprints cause lasting harm, like contaminated soil.

Abandoned formerly-popular domains create a kind of long-tail info-environmental impact, just like an abandoned warehouse can become a real-world hazard.

Maybe we need a digital superfund process.


Domains, if not outright sold by the owner, should die then.


And never be possible to register ever again? I feel like most easy-to-type domains would have been permanently expended in the early days of the web.


You wish there was some way you could make the links go away...


Or ICANN can create policies about domain squatting like GP described.


Or the FDA could put alternative-health scammers in jail.

Back in the 1950s they put Wilhelm Reich in jail, where he died. L. Ron Hubbard got the hint and left the country and when no country was safe he went to sea.

Today people like Dr. Oz run alt-health scams continuously and nobody seems to go to jail or even get a fine.


FDA is American, ICANN operates worldwide


Exactly. The Atlantic author seems to be laboring under the misguided assumption that the web is somehow the same sort of thing as a library of books. Even libraries often have some degree of garbage information in them, and represent a survival story: the vast majority of books ever written are no longer in print, or even discoverable anywhere.

Good stuff should be preserved, but it's not the Internet's job to somehow magically do it. It's OUR job, and the nature of digital information (DRM not withstanding) makes this easier than ever.


Somehow I'm replying to you twice, but this time I agree and wanted to note that libraries are curated spaces as well.


This is why when I see something I like on a website, I might bookmark it, but I'm also saving it locally.


I am increasingly worried about the valuable content on YouTube. There are so many old live concerts, useful how-to videos and other cultural treasures amidst all the junk. I suspect that one day, they will make their ads unblockable by embedding them in the video files. I sure hope that some people are downloading the valuable stuff and stashing it away to load onto YouTube's successor.

<and please skip the tired argument that I should just pay a subscription fee to avoid their ad crap - we are already paying them with our data. My data is worth far more to me than the value they provide for it. Plus - I won't give money to a company who forced this Faustian bargain on me.>


It’s really up to you who cares about something to archive it. I managed to find a torrent of early days video games from my region that has almost been lost to time. Luckily I found a discord and could coax someone to hop on to seed it. If I had waited 10 more years they might have been gone for good.

Everyone’s assuming that data now stays on the internet forever because it’s so massive. It’s usually one or two people who keep the flame alive


I set up a server specifically for downloading Youtube videos from all my playlists on a daily basis just for this reason. At some point I got fed up at seeing all the missing videos on my playlists (and not even knowing what was removed)


What do you use for automation? My first instinct is cron jobs running youtube-dl scripts against a set of playlist URLs, but I'd be interested to hear more.


Yep, you basically just described the setup, nothing super fancy :)

Cron job which runs a batch script at midnight which feeds youtube-dl some playlists URLS's which it downloads to a HDD. I also have nextcloud running which has access to the directory the videos are saved to, so I can easily share them if I want to.
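For anyone replicating this, youtube-dl also has a Python API, so the nightly job can be a single script; a rough sketch with placeholder playlist URL and paths, using the download archive so already-fetched videos are skipped on each run:

    # nightly playlist mirror using youtube-dl's Python API (yt-dlp is a drop-in replacement)
    import youtube_dl

    PLAYLISTS = ["https://www.youtube.com/playlist?list=PLxxxxxxxx"]  # placeholder playlist ID

    opts = {
        "outtmpl": "/mnt/archive/%(playlist)s/%(title)s-%(id)s.%(ext)s",  # placeholder path
        "download_archive": "/mnt/archive/downloaded.txt",  # remembers IDs so reruns skip them
        "ignoreerrors": True,  # a deleted video shouldn't abort the rest of the playlist
    }
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download(PLAYLISTS)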


> that I should just pay a subscription fee to avoid their ad crap

Fuck no. Do not do this. Use youtube-dl[0], and maintain local copies of anything useful you can find.

0: http://youtube-dl.org/


I’m happy to give YouTube my money to provide a real alternative to ad-supported models. Plus the vast majority of videos I watch I’d never want to see again.


You're not really providing YouTube an alternative to the ad-supported model - you're only proving that lots of obnoxious ads can convert people into paying customers.


Yes, and how dare a business desire paying customers.


My point is that converting is a signal that promotes more ads, not fewer.


> we are already paying them with our data


How many how-to videos, memes, concerts, etc., are really that important?

In fifty years how many people will care? How many people should care because it would mean ignoring the huge volume of newer stuff? A hundred years? Two hundred?

I haven't even read or seen many of the existing cultural artifacts we have from past decades and centuries, what would I do if orders of magnitudes more of them had been preserved?

In fact, I'm incredibly grateful that I grew up before all the random shit that I threw out there as part of my youth was subjected to obsessive cataloging and archival efforts.


If you grew up before the internet, surely you remember all the DIY manuals that lived in everyone's home. Everyone had that same big hardbound book of how to do basic home repairs. Many people had sewing books, electrical books, Chilton manuals for their cars. Cookbooks, too. We have valued how-to information for decades, and that interest and need far pre-dates the internet.

So while I get what you are saying that much of the pop culture videos do not hold long-term value (which is also questionable considering how many of us older folk still have collections of vinyl)... there absolutely is valuable content out there that deserves preservation.


Sure, and nobody is going to the library or other archives to check out those old how-to books, they're using contemporary sources on Youtube instead.

And the same will be true of old vs contemporary sources in fifty years.

I don't expect my own collection of books - which includes some that are valuable to me primarily for nostalgia - to have much value past the death of myself and the rest of my generation. It might temporarily have a lot of monetary value near my death - when other copies have already been lost - but to someone born fifty years after me? What use would pulp fiction from the 80s be to many of them?

My parents and uncles are in a bit of disbelief of how little even I care about the Beatles already, after all.


Well, not caring for the Beatles is like not caring for Mozart. Bad taste is always an option in a free society.


Even people who care about Mozart's music largely have no idea how much other stuff from the time period they might have enjoyed that was lost, and that hypothetical loss is not ruining their life at all.

We don't live dramatically longer or have dramatically larger memories than our ancestors, so things necessarily have to get lost and replaced by the new things that have been created since then.


There will probably always be niche sources for specific "how to" information. YouTube is currently a low-effort way to make that generally available in video format, but it's certainly not the only viable solution, and the information you can get from a specific enthusiast site or forum is often better and more detailed.


>I haven't even read or seen many of the existing cultural artifacts we have from past decades and centuries, what would I do if orders of magnitudes more of them had been preserved?

You might have a better understanding of the culture that produced them. You might appreciate a work of art that would otherwise not exist. We have graffiti from Pompeii, we know Ea-nasir sold cheap copper in ancient Ur 3700-odd years ago, but we've lost countless works of literature, music and film, some by the greatest masters of their age. What artifacts of culture survive the scouring sands of time is often a matter of happenstance, rather than quality.

Chances are almost everything our species has produced culturally, scientifically and artistically - the whole corpus of our knowledge output over the last century - is going to vanish within a generation or two anyway, simply because the digital foundation into which we've transferred so much of it is brittle and ephemeral. If we want to leave anything behind for future generations at all besides climate change, pollution and nuclear waste, we should save as much as possible rather than only what we consider to be relevant.


Minor nitpick, it was ~1750 BC, so that would be around 3700 years ago.


Oops, fair enough.


These subjective questions are pointless to try and answer.

The idea is that a future person could freely deep-dive through a rich, well-indexed history of media about whatever specifically interests them.

I wish people would stop trying to assess the value of a given piece of media and just tag and archive the stuff.

For instance, high quality footage of live music from 100 years ago would be very interesting to some.


The tagging is even harder than the storage. There might well be high quality footage of live music from not much less than 100 years ago sitting on film reels in a shed somewhere - my university's library had a whole basement of pre-1850 books that they just hadn't had time to catalogue yet.


Each individual meme or video might not be important, but then I don't think future historians are going to spend much time studying individual artifacts in detail in the same way that current historians do. We live in the age of big data and I think future historians will be focused on aggregating and automatically analysing that data. They probably won't be reading your comment or mine but they might analyse large sets of HN comments with a view to drawing conclusions about how particular demographics act, think and feel today. And if large swathes of those comments are lost the conclusions will be skewed, particularly if the loss is not random.

Our society is characterised by the constant generation and exchange of massive amounts of information. It's one of the things that sets us apart from previous generations. Preserving only a small subset of that data that we deem worthy or important will not allow future generations to fully understand today's society.


You don't know what's important until much later. That's what makes archiving difficult.

Also, important to whom?


> I am increasingly worried about the valuable content on YouTube.

Download it? This continues to be not difficult for YouTube.

> I suspect that one day, they will make their ads unblockable by embedding them in the video files.

That's fine. I honestly wish they would, because most of the hangs I experience in YouTube happen when the stream changes to an ad, and then changes back. If it was embedded in the video then the stream wouldn't be interrupted.

If I hate the ads that much, I can edit them out after the video is downloaded.


Good news, you can block ads embedded in youtube videos: https://addons.mozilla.org/fr/firefox/addon/sponsorblock/

It skips sponsors (even from the youtuber itself), jingles, intro, etc.

It's awesome.


What I hate about youtube the most is their deleting of videos AND metadata... I make a "watch later" playlist, with stuff that i must watch, two weeks go by, and there is a "[Deleted video]" in my playlist... not even a title left behind (so i'll be able to find it elsewhere).


The 2014 Vulture interview with David Milch, where he reads from an unreleased Boss Tweed script, may be gone for good. https://twitter.com/mattzollerseitz/status/14096229692828753...


They won’t stop collecting your data even if you pay them



you're worried about....the ads? Having ads around doesn't make the content any less valuable. We're talking about the content itself still being available, who cares if there's some ads keeping the system up if all those live concerts and how-to vids are preserved forever


The ads make content less valuable. They distract and mislead and manipulate emotions, especially when the youtuber sponsors something during the video itself. The content isn’t abstract and isolated; the content and the ads come as a bundle. The digital, procedurally generated product placement that's on the way will make this 10x worse.


Who is going to watch live concert footage with ads jammed into it? For those of us who grew up in the age of TV, advertising was clearly a slippery slope where they constantly increased the ad content until it was beyond unbearable (they even deleted parts of shows to make room for more ads!) YouTube will likely do the same when they decide to force all of us to watch unblockable and unskipable ads to access content that they didn't even create or curate. All they have done is provide a network effect monopoly that hoovered up the majority of content.


Commercials are also important media that must be archived. When I watch old TV shows, the old TV commercials that reflect those days are also interesting to see.


I download everything I like. Storage is cheap now.

Plus smaller sites will start disappearing because of regulatory capture. It will not be possible to run a forum or similar site in a few years.


We say that a lot, but it's not that cheap if you're using redundancy and backups. It's cheap if you don't care too much about the data.


Cataloging and indexing and searchability are also not cheap if you are doing all of that on your own time.


> "and please skip the tired argument that I should just pay a subscription fee to avoid their ad crap - we are already paying them with our data. My data is worth far more to me than the value they provide for it. Plus - I won't give money to a company who forced this Faustian bargain on me."

Not only this, but many of the largest companies these days would never remove advertising even from a paid service. It's like cable TV. They wanna charge you and advertise at you for more money.


Paying signals that you have disposable income, making advertising to you more profitable.


I actually implemented a rule for my website: anytime I write anything and cite a link, I always also include the Internet Archive URL as well, just in case. If it's not been archived yet, I submit it to be.

as an example:

"You don't have to trust me on this one, here's an article with [a bunch of data] | [*Archive link in case of link rot]"

from: https://kolemcrae.com/notebook/virtue.html

It's not perfect, but it helps reduce some of the issue.

Other than that, solutions are incredibly hard to come by - you need institutions to preserve URLs through tech changes and the like, when they have very little incentive to do so. E.g. making sure they implement a redirect from HTTP to HTTPS sounds simple enough, but not everyone did it. Same if they switch CMSs and the like.
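For anyone wanting to automate that rule, the Wayback Machine exposes a simple availability API and a save endpoint; a rough sketch:

    # return an existing snapshot URL, or ask the Wayback Machine to capture the page
    import requests

    def wayback_link(url):
        resp = requests.get("https://archive.org/wayback/available",
                            params={"url": url}, timeout=30)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if closest:
            return closest["url"]  # already archived
        # trigger a fresh capture; the snapshot shows up once the crawl finishes
        requests.get("https://web.archive.org/save/" + url, timeout=120)
        return None

    print(wayback_link("https://example.com/some-article"))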


> a rule for my website

Note that you should also have a rule to save the link content locally, to avoid single-point-of-failure problems in the unlikely-but-catastrophic case that archive.org itself goes down. (Cf the attempts to attack them over their National Emergency Library programme last year.)


It's not a technology problem, it's an incentive problem.

Had the web somehow been centralized (I have no idea what that would even look like), content still would not be archived, it would be constantly changed, and subject to censorship. Just like in a decentralized web, perhaps even more so.

Archiving costs lots of money (and costs keep growing if you only add and never take away), can be highly challenging (in the case of web apps or complex dependencies), whilst providing zero immediate reward for the organization carrying this heavy load. Not only is there no incentive, many couldn't even afford to if they wanted to.

And it gets worse still. Digital archiving means paying forever. Imagine paying for 100 years of electricity, hardware replacements, migrations. The entity (business, person) is long gone before then.

As a ridiculous example of this: Facebook has several very large idle content data centers. Mega-scale buildings full of servers storing photos that Facebook users haven't accessed in years, and likely never will again. Yet should a user ever request one, they expect the photo to still be there.

That's why I believe the problem should be addressed with more pragmatism. Focus on things of unquestionable long term value, and think of a good solution for this smaller scope.


Knowing that it decays is what prompts us to try and save the bits worth saving.

I don't sit in the camp that everything digital must be preserved and that it's a disaster if it isn't. I try not to fight entropy in its many manifestations. It's a shame when content disappears but I think it's also healthy to just accept it. We tend to only frame information disappearing in a negative light because we can always imagine a scenario where that information could have been valuable to someone, and that is a valid concern; I just don't think it's helpful to view it as the internet going into some downward rotting spiral and therefore every single 0 and 1 must be preserved.

The major problems of the internet seem to be almost entirely cultural currently.


Ironically, deleting stuff on the web is technically REDUCING its entropy.


I feel like it's been quite a while since I saw anyone talking about IPFS, but I used to hear about it not infrequently. I don't know if this is because it was too nerdy, or too misaligned with the incentives of most organizations (it's scary for some to not be able to unpublish), or because it comes with some privacy sacrifices, or perhaps it just matters less when hardly any page is actually static any longer.

But, for guarding against "published", supposedly static material disappearing, or changing silently, and for removing a short list of organizations from being responsible for preserving content, IPFS or something like it seems well-suited. Anyone who cares to preserve something can. Any change is noticeable.


Yep. I find it odd that the article doesn’t even mention IPFS. Brewster Kahle is a big proponent of IPFS, so that would have been a perfect opportunity to bring it up.

IPFS has its flaws, but content addressing and immutability are powerful.
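The core idea fits in a few lines: name content by a hash of its bytes, so the link can be served by anyone who has the bytes, and silently changing the content necessarily changes the link. A toy sketch using a plain SHA-256 digest (not IPFS's actual multihash/CID encoding):

    # toy content addressing: the "address" is derived from the bytes themselves
    import hashlib

    def content_address(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    page_v1 = b"<html>original article</html>"
    page_v2 = b"<html>silently edited article</html>"
    print(content_address(page_v1))  # anyone holding these exact bytes can serve this address
    print(content_address(page_v2))  # the edit yields a different address, so drift is visible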


Libraries (like Internet Archive for one) seem like the logical 'permalink' curators of information collections. I'm thinking, for example, of a large regional history site created by a 501(c)(3) corporation. [https://www.historylink.org/]

Such efforts deserve a guaranteed permanent home with permanent funding. Another example: hundreds of 'Old-time-radio' programs and early TV shows have only emerged and survived because of enthusiasts. If they disappeared from Youtube ....

It'd be good to set up a universal 'permalink' library system (UPLS) like the DOI system and its 'persistent interoperable identifiers'. This could (and, at least in part, must) be publicly-funded. Anyone who wishes could apply for a unique ID. Then, subject to a set of specs from some participating library ("yes, we'll host that"), they could package that content with metadata. Backups must be ensured. If eventually the content needs to move (or can't find a home), the ID (and metadata) remains.


I think about used book stores, libraries, the library of congress, and just shelves in people's homes. All of that allows books to survive, when they're printed enough, and people actually want them.

Lots of books, magazines, newspapers do not survive because nobody wants to read them. We don't keep a super-archive of every piece of paper ever published, because we acknowledge that it's OK for things to die. Important things we keep, useless things we throw away. It's irrational to hoard garbage.

So really, all we need is to proactively decide what we do want to save, and save it as soon as it's published. Maybe also provide a "digital shelf" for people to keep their own copies, and a standard way to search for and distribute that information, like NNTP + Gnutella for e-books (along with some features to avoid being sued a la DMCA). The rest should be allowed to rot like fallen logs in the woods.


This is why there are sites like the Wayback Machine. I suggest people keep donating to them as well if they want to preserve internet history.


Buddhists chuckle at the notion of permanence and go back to constructing sand mandalas


Buddhists have preserved most of the Tripitaka for 2500 years, and for the first 500 years it was memorized and transmitted orally from generation to generation. Buddhist monks today spend significant amounts of their time memorizing parts of it. Printed, it's about 12000 pages; it's been translated into many languages, but not all of it has been translated into English yet. Thanissaro Bhikkhu has been working on it for 20 years, publishing his translations under CC-BY, and may finish the job before he dies. Aside from its value to devotees, the Tripitaka is one of our best historical sources about everyday life in South Asia 2500 years ago.

The invention of wood block printing 1300 years ago in the Tang was apparently specifically motivated by the desire to preserve and reproduce Buddhist sutras; the oldest surviving documents printed with movable type, from 900 years ago, are also Buddhist texts.

Of course the Tripitaka is not permanent; it will be lost some day. But you seem to be implicitly claiming that Buddhists do not apply effort to preserving information and in particular textual records, because they know that ultimately they will be lost. In fact, the truth is quite the opposite, and believing your implicit claim would require almost complete ignorance of Buddhism, printing technology, and South Asian classical studies.


Accepting the impermanence of all things is an important lesson, but it is not an excuse for nihilism, because even impermanent things can have value, however finite, and the impossibility of true permanence doesn’t have to distract from realising finite value while it lasts.


I know your comment is probably tongue-in-cheek but I must say (without disagreeing, regardless) that perhaps modern day digital infrastructure can transcend the considerations early buddhists may have had for the natural world and then-contemporary human-built structures.


I used to be a data hoarder but I learned to let it go. I save the important stuff, just like the Buddhist monks do. How much of the internet is really worth saving? What will Geocities mean to anyone 50 years from now? The internet is a dynamic process evolving in real time.

No man ever steps in the same river twice, for it's not the same river and he's not the same man

Heraclitus


I am a fan of that quote and I now personally eschew clutter. However I'm referring to the examples in the article such as the supreme court justice referencing links that no longer existed. Paraphrasing the article, >75% of links from the 90s are defunct. Sure Geocities may not have value to many, but an astonishing number of links in court rulings and law documents are leading to dead ends. I can see how this could lead to shaky ground upon which it would be more difficult to defend certain internet freedoms.


Google's recent invention of text fragment links should help this:

https://en.wikipedia.org/wiki/Filler_text#:~:text=%22Now%20i....

# signifies an anchor

:~:text= signifies a text fragment

%22Now%20is%20the%20time%20for,21%20(1918). says show me the text between "Now is the time for" and "21 (1918)."


Buddhists fundamentally reject clinging to the idea of permanence, seeing it as a source of inevitable misery when your wishes go unfulfilled.


I think technology will be invented to keep track of everything and their associations even if the link becomes dangling. It's an obvious problem to work on.

You only lose what you cling to

Gautama Buddha


the problem is that doing so would require storing a copy of the entire subset of the internet that you choose to persist, which requires you to either choose a small subset, or pay huge storage fees!


I just bought a 2 terabyte drive for $55, which is mind boggling to an old salt like me. The need to store enormous amounts of real time data will keep driving storage advances. Text and image data may turn out to be a trivial percentage of the overall storage needs in the longer term.


Indeed yes, however much we may dislike it, change is the only constant. Of course that doesn't mean we shouldn't bother archiving, but there is no need to fret over saving every byte out there on the web.

Entropy is king. Eventually all information loses its coherency.


It's amazing how little some trusted institutions care about this. For example, the BBC has been bragging about how many people rely on their coverage of the pandemic, but have an obnoxious habit of repeatedly overwriting old articles with new ones on similar topics and not keeping the old versions available. The history of a once-in-a-century pandemic with huge local and global impacts is literally being overwritten day by day.

Sometimes this helps them whitewash their screw-ups which have led to widespread false beliefs. For example, after the UK government targeted and hit 100,000 Covid-19 tests in a day, the BBC ran an article falsely claiming Germany had achieved this a month earlier and linked it prominently on their news front page for about a month. A large proportion of the population probably saw this and now falsely believe it; it got brought up all the time as part of the narrative that the government's big "world-leading" achievements were just playing catch-up badly, but it was memory-holed from the article in a rewrite and they used that as an excuse for not publishing any correction - so unless historians dig deep in third-party archives, they'd never understand where that belief came from. (Apparently a previous version of the article also wrongly claimed France was carrying out more Covid-19 tests due to mistaking their weekly numbers for daily ones, according to a correction which disappeared from the article after a few days and only exists in the Internet Archive now. I haven't been able to find the original version of that claim.)


On the bright side, using a tool like Internet archive it should be easy to filter out which articles were removed and/or edited by the BBC, in a way highlighting the most historically important articles.


I mean yes, if you are extremely motivated. But the wayback machine is pretty klunky and slow, honestly. And there's no good "diff" view that summarizes the changes to a URL over time AFAIK.


We've got a pretty solid diff view now, notes at https://blog.archive.org/2019/10/18/the-wayback-machine-figh... and an example at https://web.archive.org/web/diff/20170118202526/201701200403...

Re slowness, we're able to do a lot with a little, but there's always room for improvement. If you're interested in some of the specific infrastructural challenges, I did a presentation in February:

https://archive.org/details/jonah-edwards-presentation

and my colleague did a fantastic presentation detailing some of the internal workings of the Wayback Machine just last week:

https://archive.org/details/bridget-bell-presentation


Do you have statistics on what fraction of the pages on major outlets you have archived, and how often they change?


This is true, but they're also taking on an absolutely monumental task on a shoestring budget. I continue to be amazed at what they are able to accomplish. There aren't many heroes on the internet, but the Internet Archive team qualifies.


Thank you! We're doing what we can with what we have where we are, or maybe just what we must because we can, depending on your preferred aphorism :)


Slow, yes. Klunky? Absolutely not. When you make a request to the Internet Archive, you're searching through a massive amount of data. The fact that it only takes a few seconds to pull up a decade-old webpage is amazing.


But the wayback machine is pretty klunky and slow, honestly.

This is very true. Sometimes it takes 5-10 seconds to load the calendar view for an archived page, and another 5-10, or more, to load a snapshot.

They have a ton of data to manage with limited resources, but it still seems it should be possible to go faster than this. If there's just not enough budget for I/O, maybe they could offer a donate-for-data-dump option, where you can donate in exchange for loading data of interest (say, BBC archives) into a storage medium or query engine of one's choice, so one could do research at a much faster pace.


"But the wayback machine is pretty klunky and slow, honestly."

archive.org is much less annoying if one avoids using a bloated browser and Javascript to make HTTP requests

here is a more lightweight approach, not nearly as klunky/slow, IMO (393 bytes)

    usage 1: (simple html page of all results)
    echo https://www.theatlantic.com/article/619320/|1.sh >1.htm
    firefox ./1.htm

    usage 2: (retrieve last result)
    echo https://www.theatlantic.com/article/619320/|1.sh 1 >2.htm
    firefox ./2.htm

    #!/bin/sh 
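    # read a URL on stdin; with no argument, print an HTML index of all Wayback captures (CDX API);
    # with argument "1", fetch the most recent capture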
    read x0;
    x1=web.archive.org;
    curl -s "https://$x1/cdx/search/cdx?url=$x0&fl=timestamp,original" \
    |case $# in :)
    ;;0)( printf "<h2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$x0</h2><ol><pre>\n";
        sed -n "/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/<li><a href=https:\/\/$x1\/web\/\1\/\2>\1<\/a>/;s/ </</;s/ //2p;}";
        printf "</ol></pre><br>\n" )
    ;;1)curl -s $(sed -n -e "s>.*>https:/$x1/web/&>;s> >/>" -e \$p);
    esac 

haproxy + nc version (965 bytes)

maybe it is faster than curl, maybe not; you be the judge

    #!/bin/sh
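    # same as the curl version above, but TLS is handled by a local haproxy and requests go through nc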
    read x0;
    x1=web.archive.org;
    printf "defaults\ntimeout client 50000ms\ntimeout server 50000ms\ntimeout connect 50000ms 
    \nglobal\npidfile $HOME/1.pid\nfrontend f\nbind 127.0.0.21:80\ndefault_backend b 
    \nbackend b\nserver s ipv4@207.241.237.3:443 ssl ca-file /etc/ssl/certs/ca-certificates.crt\n" \
    |exec haproxy -D -f /dev/stdin;
    printf "GET /cdx/search/cdx?url=$x0&fl=timestamp,original HTTP/1.1\r\nHost:\40$x1 \
    \r\nConnection: close\r\n\r\n"|exec nc -n 127.21 80 \
    |case $# in :)
    ;;0)( printf "<h2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$x0</h2><ol><pre>\n";
        sed -n "/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/<li><a href=https:\/\/$x1\/web\/\1\/\2>\1<\/a>/;s/ </</;s/ //2p;}";
        printf "</ol></pre><br>\n" )
    ;;1) printf "GET %s HTTP/1.1\r\nHost: $x1\r\nConnection: close\r\n\r\n" \
         $(exec sed -n -e '/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/\/web\/\1\/\2/;s/ //p;}'|exec sed -n \$p) \
         |exec nc -vvn 127.21 80;
    esac;
    if [ -f 1.pid ];then kill -9 $(sed b 1.pid);exec rm 1.pid;fi


Edit: Remove Wayback Machine's Javascript inserts (usage #2)

curl version

     #!/bin/sh
     read x0;
     x1=web.archive.org;
     curl -s "https://$x1/cdx/search/cdx?url=$x0&fl=timestamp,original"|case $# in :)
     ;;0)( printf "<h2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$x0</h2><ol><pre>\n";
         sed -n "/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/<li><a href=https:\/\/$x1\/web\/\1\/\2>\1<\/a>/;s/ </</;s/ //2p;}";
         printf "</ol></pre><br>\n" )
     ;;1)x=$(echo x|exec tr x '\002');y=$(echo y|exec tr y '\003');z=$(echo z|exec tr z '\036');
         curl -s $(sed -n -e "s>.*>https:/$x1/web/&>;s> >/>" -e \$p)|exec tr -d '[\02\03\36]' \
         |exec sed "s|<script src=\"//archive.org/includes/analytics.js?v=|$z$x 1|;s/<.-- End Wayback Rewrite JS Include -->/$y 1/;
         s/<.-- BEGIN WAYBACK TOOLBAR INSERT -->/$z$x 2/;s/<.-- END WAYBACK TOOLBAR INSERT -->/$y 2/;" \
         |exec tr '\036' '\012'|exec sed "/$x 1/,/$y 1/d;/$x 2/,/$y 2/d;";
     esac
haproxy + nc version

     #!/bin/sh
     read x0;
     x1=web.archive.org;
     printf "defaults\ntimeout client 50000ms\ntimeout server 50000ms\ntimeout connect 50000ms 
     \nglobal\npidfile $HOME/1.pid\nfrontend f\nbind 127.0.0.21:80\ndefault_backend b 
     \nbackend b\nserver s ipv4@207.241.237.3:443 ssl ca-file /etc/ssl/certs/ca-certificates.crt\n" \
     |exec haproxy -D -f /dev/stdin;
     printf "GET /cdx/search/cdx?url=$x0&fl=timestamp,original HTTP/1.1\r\nHost:\40$x1 \
     \r\nConnection: close\r\n\r\n"|exec nc -n 127.21 80|case $# in :)
     ;;0)( printf "<h2>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$x0</h2><ol><pre>\n";
         sed -n "/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/<li><a href=https:\/\/$x1\/web\/\1\/\2>\1<\/a>/;s/ </</;s/ //2p;}";
         printf "</ol></pre><br>\n" )
     ;;1) x=$(echo x|exec tr x '\002');y=$(echo y|exec tr y '\003');z=$(echo z|exec tr z '\036');
          printf "GET %s HTTP/1.1\r\nHost: ${x1}\r\nConnection: close\r\n\r\n" \
          $(exec sed -n -e '/[0-9]\{14\} [hf]/{s/\(.* \)\(.*\)/\/web\/\1\/\2/;s/ //p;}'|exec sed -n \$p) \
          |exec nc -vvn 127.21 80|exec tr -d '[\02\03\36]' \
          |exec sed "s|<script src=\"//archive.org/includes/analytics.js?v=|$z$x 1|;s/<.-- End Wayback Rewrite JS Include -->/$y 1/;
          s/<.-- BEGIN WAYBACK TOOLBAR INSERT -->/$z$x 2/;s/<.-- END WAYBACK TOOLBAR INSERT -->/$y 2/;" \
          |exec tr '\036' '\012'|exec sed "/$x 1/,/$y 1/d;/$x 2/,/$y 2/d;"
     esac;
     if [ -f 1.pid ];then kill -9 $(sed b 1.pid);exec rm 1.pid;fi


This works for major sites like the BBC. For minor, local news sites, it doesn't, because they don't get archived at all (at least not fully and in time).

News sites should really be mandated to keep previous versions of their news stories with all the edits, especially the ones paid for by taxes.


Wayback Machine censors many websites (like 4chan) from being 'saved'. Wayback Machine also removes previously archived videos/websites in certain cases. They are not neutral.


Don’t spread misinformation. The Wayback Machine is not censoring 4chan. 4chan is ‘censoring’ the Wayback Machine.

https://www.4chan.org/robots.txt

Also, I let a domain of mine expire and the new domain owner (which just plastered ads) had a robots.txt that retroactively removed my “previously archived website” from the Wayback Machine.


4chan asks the WM to censor it, but the WM actually does it.


Fortunately 4chan has (unofficial) archives, but some content was probably lost.


> so unless historians dig deep in third-party archives, they'd never understand where that belief came from

I expect future historical tooling will exist to solve exactly this problem. Assuming Archive.org and the like nabbed it, the evidence is all there for future generations to see.


Assuming archive.org isn't shut down and deleted by court order in some future lawsuit.


How many of these overwrites are cover-ups for failures? Certainly some percentage, just not sure what.


The BBC publishes nothing but garbage and there are other extant sources that are more durable. It's fine to forget things. We're missing entire libraries of classical literature from great authors which would be nice to have. Missing documentary sewage isn't a tragedy.

This should show us that most of the web isn't worth preserving anyway, much like McDonald's burger wrappers aren't worth preserving like sacred artifacts. Most web and social media content is worth less than said greasy burger wrappers.


Reading one of the BBC's technical articles, a cyber security news item, they had 3 errors in the first paragraph. I didn't bother reading to the end of the article.

I'm glad I no longer pay for a TV license.


The BBC (News's) tech section isn't aimed at you. Inaccuracies shouldn't be there but often they will dumb down or gloss over stuff for the mainstream audience they are aiming at.

You notice it cos you are in tech, but the same happens in financial news, science and even sport. Go read a tech publication.

For shits and giggles I did once try to get a technical story on how to copy DVDs published - it got very heavily edited! http://news.bbc.co.uk/2/hi/science/nature/1987665.stm

(I'm a former + early BBC News website employee)


Aside: That old version of BBC News is an absolute gem of history. Especially looking at some of the recommended sidebar stories:

> Britons 'baffled over euro rate'

> Wireless internet arrives in China

> Mobile spam on the rise

Fascinating to see how much our problems have stayed the same, despite the changing context.

I hope this is considered 'archived' and not 'forgotten'.


It has always been a constant of journalism that you read an article in your field and go "Wow, this is terrible, they got all of the details wrong". But then you turn around and trust the reporting on everything outside of your field of expertise.



What other more durable sources do you recommend?


I remember how calling the BBC garbage a few years ago got your comment heavily downvoted here. They'd tell you that they were the best thing since sliced bread and that they were good because both the left and the right hated them, as if that meant something. Now it seems everybody is recognising the BBC for what they are: utter shite.


A few years ago, any comment that didn't add new insight to a topic would get downvoted. I remember once reading a comment where the response was a quip, and someone replied "this response was funny but we don't want this site to become Reddit so I downvoted you".


I see this as a more general pattern on HN: Opinions not-yet-adopted by academia are often downvoted instead of being argued with. This stifles innovation because alternative opinions do not even show up in the casual reader's screen.


“Someone said it on Hacker News” carries no weight. Why should anyone take our comments seriously if they don’t recognize the username? I don’t see this as a bug.

Better to post links to trusted sources and let people judge for themselves.


Absent an explanation of why I've annoyed people, I get as much of a dopamine hit from downvotes as upvotes. I'd rather be polarising than forgettable.

Whenever I take an unpopular stance I remind myself of Rick Sanchez's wise words, "Your boos mean nothing, I've seen what makes you cheer".


Me too, I think a lot of my most upvoted comments are just truisms and preaching to the choir, whereas a lot of the more insightful things I've said quickly get greyed out.


We need to accept that link rot and content drift are part of the web.

And realize that the best place to preserve history is the Internet Archive's Wayback Machine.

Kind of the same way newspapers were never responsible for maintaining their archives, but librarians did on microfiche (remember that?).

But I'd take it farther.

First, the Internet Archive ought to have an official partnership with the Library of Congress and other national libraries across the world, that help provide funding. It shouldn't have to rely on private donations.

And second, it's time browsers integrated with it -- if content no longer exists there should be a built-in option to easily check Wayback Machine with a single click, and use a heuristic to show the most recent "good" version.

In other words, let the Wayback Machine be not just a, but the place for the Internet's history. Let's make it official.


Newspapers kept extensive archives, they called them 'the morgue', and depending on the newspapers and era, they either had microfilm/fiche or actual physical clippings. When I was studying journalism in college I would get access to all the newspapers morgues. Really anyone could call up the newspaper and ask for something from a past issue, although if they needed to actually do much searching there were other hurdles. They weren't public libraries, but they served their communities.

ETA https://en.wikipedia.org/wiki/Morgue_file


And of the eight external links on that page, four are now broken.


Visiting an old forum and all the pictures will be gone because, surprise, free image hosting doesn't make economic sense.


The web forum that I frequent most, https://forum.nasaspaceflight.com , has a policy of not allowing embedded images but requiring them to be attached on each post. The forum has been active for well over a decade now (site was founded in 2004) and has a thriving community that continues to grow. It is the community that keeps the forum alive instead of just one random company (although it technically is a company). This helps prevent link rot. Forums seem anachronistic in 2021, but they have massive benefits versus the gigantic platforms like Facebook or Reddit or Twitter, and the quality of the discourse and analysis is far higher. There are a few ads, but they’re very unobtrusive.

(Also, Twitter posts are often linked, but usually the text is copied for archival purposes.)

My experience with Wikipedia, forums such as those, and Wayback Machine and arxiv.org make me think that people will do a lot of stuff basically for free and that by building communities, you don’t need extremely clever trustless incentive systems like Blockchain or major paywalls (although granted, the forum does have something like a paywall for unverified pre-public info) or massive platforms with multi billion dollar companies in order to disseminate information, analysis, news, etc. Best practices of web forums from the 2000s (active moderation, a sense of common purpose, expectations of non-toxicness, etc), are a really good solution.


Forums are not anachronistic - they are the future again. The only way to guard against the censorship of the current social media platforms is to not rely on them.

What always baffled me is why forums didn't embrace technologies we had in the BBS days. Offline readers were the greatest thing since sliced bread - you could use whatever interface to a message board you liked, whatever editor you liked, etc. It was a lot easier to quickly scan through literally thousands of messages with a native, local client than relying on the constant ping-pong between your client and a remote server.

Decentralized aggregation is what we really need. A combination of RSS and DNS. Ways to foster creation and discovery of hand curated lists like the original Yahoo - but thousands of them. No reliance on Google, Facebook, Twitter, etc. It's a nice dream anyway...


I expected the article to deal more with the rotting landscape of the internet, the rotting of our choices of content, the poor selection of links on the first page of any search engine... I am less concerned with the fact that a link breaks than with what it says: the content that was there is no longer there.


Links are the backbone of the internet. Archive.org is a huge asset, but it relies on individuals being prescient enough to archive pages that might be lost to time. That's not scalable. Plenty of people will visit Wayback Machine to pull up an old page that's gone to the big 404 in the sky, but they won't actively submit links to archive themselves.

The bad design and low quality content is a symptom of the Internet's broken underlying economics. That's a human problem, not a tech problem.


This is why I'm building/curating my personal archive of stuff that I think may be worth saving (not necessarily only for myself).

Perhaps there will be many personal archives like mine that one day can be shared in a similar vein to copy parties.

We will need to treat the information we find online with its impermanence in mind (as authors, making things easy to copy, and as consumers, copying stuff).

Perhaps it is this mindset that, when sufficiently prevalent, could make the internet more like a library again; weed out the garbage and curate the nuggets.

Btw I think archive.org is doing God's work but I don't believe any amount of coding and crawling will be able to save everything (nor should it). It can capture some raw data for (future AI?) historians to sift through though.


For anyone that wants to help with this, check out the Archive Team Warrior project. You can donate bandwidth and some CPU cycles to archiving different parts of the web. There's a VM image you can download that makes it really easy.

https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior

You can choose to help archive Reddit, pastebin, URL shorteners, and other ephemeral parts of the internet:

https://wiki.archiveteam.org/index.php/Warrior_projects

I've also taken to updating the citations in Wikipedia articles with archive links.
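
For the citation work, a rough sketch of proactively requesting a capture before adding the link, via the Wayback Machine's public "Save Page Now" endpoint (anonymous saves are rate-limited and the authenticated SPN2 API offers more control; run this server-side, since browsers would block the cross-origin call):

    // Ask the Wayback Machine to capture a URL so the citation has a snapshot
    // to fall back on; a plain GET to /save/<url> triggers the capture.
    async function requestSnapshot(url: string): Promise<void> {
      await fetch(`https://web.archive.org/save/${url}`);
    }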


The whole architecture of the internet is inside out. People have become numb to the insanity of encountering null pointers multiple times per day. This is understandable since the inside out structure is what allowed the web to grow quickly, but it will also be what ultimately dooms it as a real lasting store for knowledge.

The problem is that the foundations are shifting sands, and we need something that has significantly more integrity at the bottom layer, we can't just bolt URNs on as an afterthought. Some organizations are able to maintain persistent data over time, but it is in spite of the technology, not because of it.

I will also note that a world where it is possible to delete things is a world where individuals can be made to have written anything in the past. On the internet, at a certain point the past can be fabricated from whole cloth.

edit: and ironically, the issue is that this is because the internet wasn't actually academic enough in its original design.


We should have kept developing Usenet. Handing control over to web browser providers was a mistake.


Usenet died because reasons. It had met its effective maximum scale by the early-to-mid 1990s, at a millionfold less use than today's largest Web platforms see.


I feel that the Internet as an archive isn't really feasible. At best, it can augment existing archival efforts such as public libraries. The fact people keep pushing off to webhosting what should be put into a library is a grave misunderstanding of the use cases for the Internet.


I'd highlight two arguments which are mentioned in the article but overshadowed by the main topic of preventing link rot through archiving (no wonder, as it is written by a Perma.cc co-founder):

1. The failure of the Lumen/Chilling Effects initiative to prevent bogus takedown requests. Currently, anybody can take down any page on the Internet by sending bogus requests, and then take down any mention of who did it.

2. Google's failure to “organize the world’s information and make it universally accessible and useful”. As the author said: “no such transparent, academic competitive search engine exists in 2021”.

I see some relation between those two.


I've changed the baity title to a representative phrase from the article body [1], but it is maybe a bit too narrow now, relative to what the article is really about. Suggestions for a better title are welcome. Sometimes we use the HTML doc title but "The Rotting Internet Is a Collective Hallucination" is worse!

[1] That's the best way to get a better title. https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...


The article does not scroll on an older Chrome... the internet is indeed rotten.


Or on Firefox.


Two more reasons to avoid The Atlantic.


Part of it is technical. The most modern stacks usually completely break deep linking:

- even the best SPAs out there often barely shim the normal browser behavior. Yesterday my back button broke once more, in 2021. Infinite scroll doesn't let you pinpoint your position. Users don't expect a copied/pasted link to take them to the content anymore.

- the ecosystem of URL handling is fractured. This month I worked on a Django + React app, and my clients asked that it be able to handle being hosted behind an arbitrary URL prefix if one is provided in the conf. Here are the things I had to tweak:

    * adding the prefix to the proxy pass apache conf (yes, they are still using it);
    * adding the prefix to the react router conf, for which most tutorials were outdated;
    * adding the prefix in the js bundler conf as the base, and for the dev proxy;
    * adding the prefix in the <base> element in the main template;
    * making all urls and ajax calls relative to the <base> (see the sketch at the end of this comment);
    * making all the react router Link and history.push _absolute_ (took me a while to figure this out);
    * serving the index.html as a template file from django, not nginx, to inject all that stuff according to the env var;
    * hacking the build script to replace static file URLs with template placeholders because the js bundler didn't have a hook for that (thanks sed).
 
And that's on top of the regular work of making URLs in an SPA work, which means syncing your backend and frontend URLs for pages and the API. Who is going to do all that work? In fact, how many devs have the knowledge to do it? Pre-SPA, there would have been two well-documented steps to do the same thing. The junior on the team could figure it out.

- we've piled a ton of manure on top of our URLs: AMP, URL shorteners, tracking IDs and redirections, content walls, captchas. Often several of them at the same time. If one of them breaks in the chain, goodbye URL.

- low code means low-skill devs who never heard the mantra "cool URLs don't change". They don't even know they should care.

- some browsers just hide the URL. The users don't know what a URL is anyway.

- apps don't care about deep linking. They could handle URLs fine, mind you. We have the tech for it. But it's not even on the radar of most devs. You don't address the content, you consume whatever pops up, so why bother?

Plus, Google is so good at finding the content you want out of the barely readable drunken mess of letters you feed it that most people don't type URLs anymore. People don't care about URLs just like people don't care about bees dying, because it's too abstract to worry about.
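
As a minimal sketch of the <base>-relative step above, using only standard web APIs (not my actual code):

    // Resolve every API/asset path against <base href="/some/prefix/">,
    // so the app keeps working behind an arbitrary URL prefix.
    const base = new URL(document.baseURI);

    function prefixedUrl(path: string): string {
      // "api/items/" resolved against "https://host/some/prefix/" keeps the prefix
      return new URL(path.replace(/^\//, ""), base).toString();
    }

    async function getItems() {
      const res = await fetch(prefixedUrl("api/items/"));
      return res.json();
    }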


I would love to work on a micro format/protocol which allows me to link to the original content but also include an IPFS reference to a WARC. Have a little CLI/desktop tool which produces and pins the WARC.

    <a href="https://example.com/some-article" 
       data-ipfs-warc="0xB45165ED3CD4 ...">
      Some Article
    </a>
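
A client could then fall back to the archived copy when the original dies. A hedged sketch (the gateway URL is an assumption, and a real implementation would replay the WARC rather than fetch it raw):

    // Hypothetical fallback for links carrying the data-ipfs-warc attribute.
    async function resolveArchivedLink(link: HTMLAnchorElement): Promise<string> {
      try {
        const res = await fetch(link.href, { method: "HEAD" });
        if (res.ok) return link.href;          // original still resolves
      } catch {
        // network error, dead domain, CORS, etc. -- fall through to the archive
      }
      const warcRef = link.dataset.ipfsWarc;   // value of data-ipfs-warc="..."
      return warcRef ? `https://ipfs.io/ipfs/${warcRef}` : link.href;
    }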


This reminded me of a story that hit the front page a few years ago. Even if content sticks around and isn't modified, Google will eventually forget it and you won't be able to find it without a bookmark.

https://news.ycombinator.com/item?id=16153840


I doubt people will ever care about this enough for it to gain momentum, but there are well-known technological solutions: content-addressable file storage. If you do that, the URL is always tied to the file content itself. Of course this requires documents to actually be documents, so I don't think it works for any modern business model.
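
The core idea fits in a few lines (a sketch, assuming SHA-256 content addressing in the style of IPFS or git, where the address is derived purely from the bytes):

    // The document's address is the hash of its bytes: same content, same
    // address; any change yields a different address, so links can't silently drift.
    async function contentAddress(content: string): Promise<string> {
      const bytes = new TextEncoder().encode(content);
      const digest = await crypto.subtle.digest("SHA-256", bytes); // Web Crypto
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }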


Do you have any suggested reading on content-addressable file storage?


I think our obsession with copyright and attribution of ideas isn't helping. Many times you'll see people reference a page or PDF which long ago became a broken link, and not one person bothered to paraphrase or copy the relevant sections from it. And the Wayback Machine can't cover everything.


I think it's important to support tools that are driven with people in mind, rather than money. I like where Brave is going: https://news.ycombinator.com/item?id=27593360


Donate money to archive.org! Some of the FAANG companies match donations to it too.


I think the publishing media that predate the web all had the same problem; posters get torn down, newspapers burn, even stone carvings weather. Hardly a new phenomenon, and possibly not even one that needs solving.


The problem is, that previously, if you wrote a book about a poster, you put the photo of the poster directly in your book, and the reader could see the poster there.

Now you write an article about a poster, and instead of a photo, there is an embedded Instagram photo from the artist, who has since deleted their account, so the content is missing. Or a hotlink to the author's website, and it's missing there too.

Yes, if the book was destroyed too, all info about the poster was lost, but at least the book didn't say "go to the corner of X and Y street, and hope it wasn't removed or destroyed by the weather".


Note that a chief impediment here is copyright rather than the ability to reference an extant work.

A photo of an inscription is not the same as the inscription itself --- detail is lost in any translation.

For digital works, it is possible to faithfully reconstruct an original with full fidelity (if necessary, embed an emulated environment of the host, server, or network originally provisioning the work). But copyright claims make this a legal suicide maneuver, at least for any entity capable of being sued, or being sued effectively.

Note that numerous previous archivists have in fact been pirates or copyists, sometimes under pre- or non-copyright regimes, but very often in direct rejection of copyright.


Project Xanadu was an attempt to fix this, but unfortunately it suffered from the creator wanting to capitalize on its success. Which is, I think, why it ultimately failed.


It's brilliant. If the link is bad, find a different link. Email the webmaster if you believe it's OK. Think about whether you can actually help. Hashtag systematic whine.


Here's an idea, try to make things that need to survive into static downloadable content.

"But I'm the consumer of a service!"

--Ask the service provider to open source their work.


But at least it's apparent that an end must be in sight for the internet as we know it, and that there are still other forms of the internet waiting to be discovered.


Such as / directions / goals to seek / pitfalls to avoid?



Luckily, community-driven emergent self-healing adaptations are also endemic to the Web :)


I always thought something like Ethereum could solve this type of thing; that is, if the content itself lived inside the blockchain. Obviously that wouldn't work for larger formats, but for many text-based or lower-resolution image formats, it wouldn't be too much overhead to just inject it all into the blockchain.


I would simplify it as SEO rot has ruined the internet.


This is why I think Twitter made a mistake in using the normal suspension mechanism to ban @realDonaldTrump.

Not passing judgement on the decision to take away his posting privileges. But by suspending POTUS, everything he posted during his term in office is just... gone. Every hot link to anything he said, on any website, is broken.

This is an enormous loss to any historian of the era. He was using Twitter as his main microphone to speak to the world, and all that content is, while not lost lost, thoroughly and permanently scrambled.

It would have been better to just lock him out of the account, publish a statement that the @realDonaldTrump account is now permanently archived, and that any new account he tried to open will be suspended.


Yes, and paper books go out of print.


"The benefits of internet far outrides its shortcomings"

Stop being pessimistic


Hashes are the answer.

How they get implemented in solving this problem is the question


What specifically do you hash?

How do you hash content which is programmatically determined and changes on every page load?

How do you account for the same work in multiple versions, translations, or updates, strictly using hashes?

(Note that a chained hash, e.g., a git history, is not a strict use of hashes, though it most definitely does use hashes.)


>programmatically determined and changes

Ethereum's Solidity is a great real world example. The instructions are compiled into a binary, and that binary can be addressed via its cryptographic digest.

Changes in state are easily represented as instructions.

>multiple versions, translations, or updates, strictly using hashes?

The same way a git repo does it now: Good software design.

>is not a strict use of hashes

Huh? I'm not sure what you're trying to convey. Are you referring to the fact that git uses diffs between commits? At any commit, a repo may be re-hashed and the current state completely represented by a digest.

Of all things to worry about being represented as a digest, software seems to be one of the smallest concerns.


Git chain hashes.

Merkle trees are useful.

They include hashes. They are not simply hashes.

Content-addressable storage robust against variations is another approach.
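
To make the distinction concrete, here's a toy chained hash in the spirit of a git history (a sketch only; git's real object format differs):

    // Each entry's id covers its content *and* its parent's id, so the whole
    // chain, not any single hash, is what provides tamper-evidence.
    async function sha256Hex(text: string): Promise<string> {
      const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }

    async function entryId(content: string, parentId: string | null): Promise<string> {
      return sha256Hex(`parent:${parentId ?? "none"}\n${content}`);
    }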


I'm at a loss for your criticism.

Didn't the original post say, "How they [hashes] get implemented in solving this problem is the question"?

Merkle trees implement cryptographic hashes.

>Git chain hashes.

Git chains hashes? Yes, of course, along with diffs.

How the permanent web will implement digests in various applications is the question. Whatever the answer is for link rot, the guaranteed solution will implement and depend upon cryptographic hashing algorithms. There are already many examples of this class of problem being solved by utilizing hashes.


You may be trying to read too much into my statement.

It's simply that hashes, alone, do little, and that hash-free solutions might well exist as well.

Bald assertions of simple necessary and sufficient solutions are almost always mistaken.


Fungi serve an important role in nature.


A common misapprehension, even amongst those who have been engaged at foundationally deep levels of the Web, is that the World Wide Web is an archival mechanism. It is not; it is a publishing mechanism.

More specifically, it's an on-demand publishing mechanism, relating to media-based publishing (books, records, physical video media) much in the same way as the electrical grid (a transmission mechanism) relates to fuels (a storage function).

It would be nice (in at least some regards) if publishing content at a specific addressable URL were a promise to 1) eternally provide that resource and 2) never change it. But there's no way to guarantee that this will be the case.

In the past, the means for achieving archival of information was:

- To specifically record that information in some form. As with, say, Plato's memorialising of Socrates's dialogues.

- To create multiple copies of those recordings, so that loss of any one instance doesn't mean total loss.

- To define a referencing or indexing system such that individual works can be identified unambiguously. (Or more specifically, with an acceptable level of ambiguity.)

- To develop a means of agreement as to what the canonical or recognised version of a work is, or absent that, of identifying more canonical forms or lineages. (The historiography of the history of philosophy is an interesting sub-field, with a good introductory treatment in Peter Adamson's History of Philosophy Without Any Gaps podcast.)

URLs of and by themselves address these needs poorly. At best they point to a location at which a document may have been available at a point in time. A URL plus a time-range (which is what the Internet Archive's Wayback Machine effectively delivers) is a much better approximation to what is needed. (The notion of time-bounded identifiers generally seems a useful one, as might be applied also to domain and user names, for example.)

As I see it, the goal of the archival web needs to address a number of points which Zittrain addresses only very indirectly:

- Identification of what really should be archived. Right To Be Forgotten exists for very well-founded reasons, and a world without forgetting (or with very capricious forgetting) is one form of hell.

- True document-centric identification. I've been thinking about this for a while (recent discussion here: https://news.ycombinator.com/item?id=27455520), and something that is based on the actual contents while being resilient to mild changes seems most optimal. Checksums, not so much; tuples or n-grams, possibly warmer (see the sketch after this list).

- A number of archival institutions. The Internet Archive is certainly amongst these. Other libraries, if at all possible spread across multiple institutions and jurisdictions would be preferable. IA have been working with numerous academic institutions and the US Library of Congress, though IA itself still carries most of the burden.
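
A rough sketch of the "tuples or n-grams" idea: identify a document by the set of its word trigrams and compare sets with Jaccard similarity, so small edits move the score only slightly (the trigram size and any acceptance threshold are arbitrary choices here):

    // Two captures of "the same" page should score close to 1 even after minor drift.
    function shingles(text: string, n = 3): Set<string> {
      const words = text.toLowerCase().split(/\W+/).filter(Boolean);
      const out = new Set<string>();
      for (let i = 0; i + n <= words.length; i++) {
        out.add(words.slice(i, i + n).join(" "));
      }
      return out;
    }

    function jaccard(a: Set<string>, b: Set<string>): number {
      let shared = 0;
      for (const s of a) if (b.has(s)) shared++;
      return shared / (a.size + b.size - shared || 1);
    }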

This isn't a new problem. In particular, each time there's been an explosion in some new form of publishing, there's been a scramble by archivists to keep up. Denis Diderot, 18th-century encyclopaedist, has an awesome quote about the information explosion drowning his own generation (the encyclopaedia was his technical solution to that problem).[1] Numerous elements of what we now accept as standard bibliographic elements (titles, authors, tables of contents, indices, references, citations, page numbers, paragraphs, inter-word spaces, ...) were invented to answer specific needs, not always with archivists as a principal focus, though often providing benefits to them. Cataloguing and classification systems likewise.

I appreciate Zittrain's message. He's crying over a lost cause and looking backwards, not to the future.

________________________________

Notes:

1. Diderot: https://www.historyofinformation.com/detail.php?entryid=2877


Yes. I am down to HN and Reddit, Google Calendar and Telegram. I don't even know how to find cool stuff online anymore. Google SERPs are all business-driven now unless you're researching news.


I wish there were something like Reddit that could be organized by topic but had the simplicity of HN's design instead of the monstrosity Reddit has become. My guess is it would still succumb to the Reddit hive-mind effect without a reasonably benevolent moderation team, though. For all the times I've said that HN basically does the same thing, I have to admit that it is much better about keeping it in check.


Reddit has been pushing to be like other social media sites now. They've added profile pictures and avatars, not to mention they're pushing video content like nothing else. They've got livestreaming and try to saturate your front page with as much video as possible. They've also changed the way their app handles video links to be more like TikTok or YouTube.

It used to be a lot like HN - discussions around links to articles. I wish there was a community with the feel of HN with the wide net of Reddit.

It feels like all social media is converging; Snapchat, Instagram, Facebook, Reddit, TikTok, Youtube, all an endless stream of ai-curated short videos that you can swipe through over and over.


I use third party Reddit apps that largely resemble RSS feeds, and pull directly from the API. So you don't see any of the new social media features. You don't even see ads!

The same is true for desktop, with the Reddit Enhancement Suite browser extension. My Reddit has looked largely the same for nearly 10 years!


> It feels like all social media is converging; Snapchat, Instagram, Facebook, Reddit, TikTok, Youtube, all an endless stream of ai-curated short videos that you can swipe through over and over.

It's likely that AI curation is the future because no human can sift through the vast amounts of data and content being made. Things that can't keep up without curation, or with just human curation, have died or will die.

Good AI curation can bring you the exact content you're looking for (can, but not necessarily will). I've seen it work time and time again, but I've also noticed that you have to be aware of the tool's flaws to use it well, or it gets really bad really quick.

You can't let the AI take control. If it derails into content you don't like, you must know what it uses as a quality signal and give it a thumbs down; if it is intentionally derailed, you must stop using the platform.

TikTok recently released an update to their algorithm, it ruined my FYP and replaced my content with inane videos made by people nearby - hyperlocal garbage. The feedback mechanisms given no longer work, before that update they did.

I do think that even people being nostalgic here about the "old internet" should try to learn how to turn AI curation to their own advantage instead of just being sad and nostalgic.


Some subreddits are decent. Use old reddit and go straight to your subreddits so you never see the home page.


I don’t know about userbase but old.reddit with disabled subreddit CSS is quite usable for me.


Yahoo did this in the 1990s and it was cool for a while, but the net got too big to maintain the directory. There was an open alternative but it got inundated by spam of course.


I had exactly the same thought a few weeks back on Twitter. Since I ditched the whole 'social media bubble' for my mental health, I sometimes wish there were some sort of HN-like aggregator for tweets on my favorite topics and from my favorite people.


Gemini protocol might be something here. But it's probably too "techy" to get widespread traction.


The Gemini protocol will never gain widespread traction because it's designed to appeal to a specific tech-contrarian, anti-modernist mindset, with restrictions that most people won't find appealing or useful.

And as soon as the mainstream knows about it, and it no longer feels quirky and niche, it will be declared dead and abandoned anyway.


All the cool stuff has gone back to IRL. Once people stopped making potato gun websites, the internet really stopped blossoming into an amazingly vibrant space. I would recommend hackaday.com because it hasn't changed in quite some time.


I used to go online to escape the boring, unimaginative people and the world they create. Now I have to disconnect to escape them.


You have to know where to look online, very rare these days.


FWIW, Siemens recently bought Hackaday (well, Siemens bought a company called Supplyframe, which was the owner of Hackaday), so let's see how long that will last...


>Once people stopped making potato gun websites, the internet really stopped blossoming into an amazingly vibrant space.

Entire new genres of creative output - music, fiction, fandom, films, cosplay, hobbyist and enthusiast communities have been spawned by the modern web. It's never been more vibrant.

I'll never understand why people on Hacker News seem to believe the internet stopped evolving as an expressive space just because services replaced the need to design websites by hand. That's like believing literature ended once scribes were replaced by the printing press.


Yeah, the internet is bigger now - all of the weird fringe stuff from enthusiasts is still there, and there's even more of it. One thing that has happened is that the mainstream is there too now, so if you don't venture outside it, it's easier not to see the more interesting stuff. In the old days, when just having a "homepage" placed you a bit outside the norm, stumbling on something non-mainstream was more of a given.


And the mainstream has become more interesting as a result. It's not uncommon to watch anime now, D&D and video games are no longer niche, people's interests are becoming more diverse as media is no longer being gatekept by communities, geography or publishers, and everything becomes available to everyone. People might see that as a negative, the Eternal September effect eating their favorite thing, but I see it as a positive.


> Once people stopped making potato gun websites, the internet really stopped blossoming into an amazingly vibrant space.

Have they or have you stopped looking? I'd say the former.


Google still works, you just need to be very specific on your search criteria (e.g. site:). I do agree on the premise that good quality data, art and content has been lost to the winds of time.


Needing to use 'site:______.com' in the search query defeats the purpose of an internet search engine.

I agree that it is necessary today, due to the sheer amount of useless sites that pop up on page 1 of the search. I wish Reddit invested more into making their internal site search better. If people did their searches directly on websites, Google would have an incentive to improve search results so it wasn't always the same 10-20 websites topping the list for nearly every query.


You either have to visit aggregating sites that have likeminded people (HN/Reddit/Obscure FB groups/Group chats/Forums) or know exactly what you're looking for.


The article is about link rot, not cultural rot.


And it seems to overlook a pretty straightforward question: in the era of the search engine, how much of an issue is link rot?

I've hit bad links before. Four out of five times, I can do a general search for the title of the document that should have been at the link or the quoted excerpt that the document I'm reading pulled from the link, and I get a clone of the document posted somewhere else.


TL;DR

The second law of thermodynamics applies to the Internet as well.


For those hitting the "monthly article limit" paywall, this can always be bypassed on The Atlantic by opening the link in a private window.



