
We suggest/encourage people link to original URLs but ALSO (as opposed to instead of) provide Wayback Machine URLs, so that if/when the original URLs go bad (link rot) the archive URL is available, or to give people a way to compare the content associated with a given URL over time (content drift).

BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent.

Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.

It's interesting to think about how HTML could be modified to fix the issue. Initial thought: along with HREF, provide AREF, a list of archive links. The browser could automatically try a backup if the main one fails. The user should be able to right-click the link to select a specific backup. Another idea is to allow the web-page author to provide a rewrite rule to automatically generate Wayback Machine (or whatever) links from the original. This seems less error-prone, and browsers could provide a default that authors could override.
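To make the rewrite-rule idea concrete, here is a minimal sketch in Python. It is not how any browser works today: `archive_candidates` is a hypothetical helper, and the prefix and 14-digit timestamp format are assumptions based on the Wayback Machine URLs quoted elsewhere in this thread.

```python
from urllib.parse import urlsplit

# Assumption: archive links follow the /web/<timestamp>/<url> shape used by
# the Wayback Machine links quoted in this thread.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def archive_candidates(href: str, timestamp: str) -> list:
    """Return the URLs a client could try in order: live link first, archive second.

    `timestamp` is a YYYYMMDDhhmmss string chosen by the page author,
    e.g. the moment the link was added.
    """
    if urlsplit(href).scheme not in ("http", "https"):
        # mailto:, javascript:, etc. have no archived counterpart.
        return [href]
    return [href, f"{WAYBACK_PREFIX}{timestamp}/{href}"]


if __name__ == "__main__":
    for url in archive_candidates("https://example.com/page", "20170628055823"):
        print(url)
```

The point of a rule like this is that the author only writes the original HREF; the fallback list is derived, so it cannot drift out of sync with the link it backs up.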

Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved.

Well as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it:


So this is a little indirect, but it does avoid the case where the Wayback machine goes down (or is subverted): include a HASHREF which is a hash of the state of the content when linked. Then you could find the resource using the content-addressable system of your choice. (Including, it must be said, the wayback machine itself).
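A rough sketch of what computing and checking such a HASHREF could look like, in Python. The `sha256:` prefix and the function names are made up for illustration; nothing here is an existing HTML attribute or Wayback Machine feature.

```python
import hashlib
import hmac


def hashref(content: bytes) -> str:
    """Hash the exact bytes that were fetched when the link was created."""
    # Assumption: SHA-256 with an algorithm prefix, so the scheme could change later.
    return "sha256:" + hashlib.sha256(content).hexdigest()


def matches(content: bytes, expected: str) -> bool:
    """Check whether bytes fetched from any source are the content that was linked."""
    return hmac.compare_digest(hashref(content), expected)


if __name__ == "__main__":
    linked = b"<html><body>Original article</body></html>"
    ref = hashref(linked)
    print(ref)
    print(matches(linked, ref))          # same bytes
    print(matches(linked + b" ", ref))   # one extra byte changes the hash
```

Because the check only depends on the bytes, it does not matter whether they come from the original site, the Wayback Machine, or some other mirror.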

I've found that web pages have so much dynamic content these days that even something that feels relatively static generates two different hashes almost on every pageload.

Indeed. I don't think you could or should hash the DOM, not least because it is, in general, the structured output of a program. Ideally you could hash the source. This might be a huge problem for single-page applications, except you can always pre-render an SPA at any given URL, which solves the problem. (This is done all the time. The most elegant way is to run e.g. React on the server to pre-render, but you can also use another templating system in an arbitrary language, although you end up implementing every feature maybe not twice, but about 1.5x.)

> (Including, it must be said, the wayback machine itself).

Citation needed? Eg something like http://web.archive.org/cdx/search/cdx?url=http://haskell.cs.... produces lines of the form:

  edu,yale,cs,haskell)/wp-content/uploads/2011/01/haskell-report-1.2.pdf 20170628055823 http://haskell.cs.yale.edu/wp-content/uploads/2011/01/haskell-report-1.2.pdf warc/revisit - WVI3426JEX42SRMSYNK74V2B7IEIYHAS 563
But there seems to be no documented way to turn WVI3426JEX42SRMSYNK74V2B7IEIYHAS (which I presume to be the hash) into an actual file. (Though http://web.archive.org/web/$DATEim_/$URL works fine, so it hasn't been a problem in practice.)
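For anyone who wants to poke at that output, here is a small Python sketch that splits such a line into fields and builds the $DATEim_ URL from it. The field names are my reading of the column order, and treating the digest as a base32-encoded SHA-1 of the payload is an assumption (32 base32 characters is 160 bits, which fits); `base32_sha1` is only there so you can compare against a file you already have.

```python
import base64
import hashlib
from typing import NamedTuple

SAMPLE = (
    "edu,yale,cs,haskell)/wp-content/uploads/2011/01/haskell-report-1.2.pdf "
    "20170628055823 "
    "http://haskell.cs.yale.edu/wp-content/uploads/2011/01/haskell-report-1.2.pdf "
    "warc/revisit - WVI3426JEX42SRMSYNK74V2B7IEIYHAS 563"
)


class CdxRow(NamedTuple):
    # Assumed column order, matching the seven space-separated fields above.
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_line(line: str) -> CdxRow:
    """Split one space-separated CDX line into named fields."""
    return CdxRow(*line.split())


def snapshot_url(row: CdxRow) -> str:
    """Build the raw-content URL using the $DATEim_/$URL pattern from the comment."""
    return f"http://web.archive.org/web/{row.timestamp}im_/{row.original}"


def base32_sha1(payload: bytes) -> str:
    """Assumed digest format: SHA-1 of the payload bytes, base32-encoded."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


if __name__ == "__main__":
    row = parse_cdx_line(SAMPLE)
    print(row.digest)
    print(snapshot_url(row))
```

This goes from a CDX row to a file, not from a bare hash to a file, so it does not answer the original question; it only shows the workaround that already works in practice.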

> Citation needed

Oh, sorry, I don't think the WM supports this today. I only meant that it could support it "trivially" (I put that in quotes since I don't know how WM is implemented. But in theory it would be easy to hash all their content and add an endpoint that maps from hashes to URLs).

My point was that you could add an addressing system that is both independent of the Wayback Machine, but which you could still (theoretically) use with it. But you'd have to add the facility to the WM.
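To illustrate what "an endpoint that maps from hashes to URLs" might boil down to, here is a toy in-memory version in Python. `HashIndex` is hypothetical, it uses SHA-256 purely as an example, and it says nothing about how the WM is actually implemented.

```python
import hashlib
from collections import defaultdict


class HashIndex:
    """Toy content-addressable lookup: content hash -> every URL that served it."""

    def __init__(self):
        self._urls_by_hash = defaultdict(list)

    def add(self, url: str, content: bytes) -> str:
        """Record that `url` served `content`; return the content's hash."""
        digest = hashlib.sha256(content).hexdigest()
        if url not in self._urls_by_hash[digest]:
            self._urls_by_hash[digest].append(url)
        return digest

    def lookup(self, digest: str) -> list:
        """Return the URLs known to have served content with this hash."""
        return list(self._urls_by_hash.get(digest, []))


if __name__ == "__main__":
    index = HashIndex()
    digest = index.add("http://a.example/report.pdf", b"report bytes")
    index.add("http://mirror.example/report.pdf", b"report bytes")
    print(index.lookup(digest))
```

The same hash can resolve to several URLs, which is exactly what makes the addressing scheme independent of any single archive.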

Ah, that's disappointing, but oh well.

This is literally where my brain was going and I was glad to see someone went in the same direction. Given the <img> tag’s addition of srcset in recent years, there is precedent for doing something more with href.

Yup, I've been using the extension for probably about a year now and get the same issues they do. It really isn't that bad, most of the time backing out of the message once or twice does the trick, but it's funny because most of the time I get that message when going to the IA web uploader.

This is so much better than INSTEAD.

Not only because it leaves some control to the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow get the diff between them. IMHO, this is especially relevant when the purpose is reference.

I mostly agree... however, given how many "news" sites are now going back and completely changing articles (headlines, content) without any history, I think it's a mixed bag.

Link rot isn't the only reason why one would want an archive link instead of original. Not that I'd want to overwhelm the internet archive's resources.

I love the feature that you easily can add a page to archive: https://web.archive.org/save/https://example.com

Replace https://example.com in the URL above with the page you want to save. I try to respect the cost of archiving by not saving the same page too often.
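If you script this, something along these lines keeps you from hammering the archive. It is only a sketch in Python: the save prefix is the one from the URL above, while `SaveThrottle` and its one-day default are my own assumptions, not a limit the Internet Archive publishes. It builds the URL but deliberately does not send any request.

```python
import time

SAVE_PREFIX = "https://web.archive.org/save/"  # prefix from the URL quoted above


def save_url(page_url: str) -> str:
    """Build the URL that asks the Wayback Machine to capture `page_url`."""
    return SAVE_PREFIX + page_url


class SaveThrottle:
    """Remember when each page was last submitted and skip requests that come too soon."""

    def __init__(self, min_interval_seconds: float = 86400.0):
        # Assumption: once a day per page is polite enough; adjust to taste.
        self.min_interval = min_interval_seconds
        self._last_saved = {}

    def should_save(self, page_url: str, now: float = None) -> bool:
        """Return True (and record the time) if enough time has passed for this page."""
        now = time.time() if now is None else now
        last = self._last_saved.get(page_url)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_saved[page_url] = now
        return True


if __name__ == "__main__":
    throttle = SaveThrottle()
    page = "https://example.com"
    if throttle.should_save(page):
        print("open:", save_url(page))
```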

Thanks so much for running this site. As a small start-up, we often manually request a snapshot of our privacy policy/terms of service/other important announcements whenever we make changes to them (if we don't manually request them, the re-crawl generally doesn't happen, since I guess those pages are very rarely visited, even though they're linked from the main site). It's helped us in a thorny situation where someone tried to claim "it wasn't there when I signed up".

It might be an interesting use case for you to check out, i.e. keep an eye on those rarely used legal sublinks for smaller companies.

Kudos for doing what you do.

I always wonder about the rise in hosting costs in the wake of people linking to the Wayback Machine on popular sites.

How do you think about it?
