
I'm not sure I'm a fan of this because it just turns WayBackMachine into another content silo. It's called the world wide web for a reason, and this isn't helping.

I can see it for corporate sites where they change content, remove pages, and break links without a moment's consideration.

But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine. Apart from anything else, linking to WayBackMachine only drives traffic to WayBackMachine, not my site. Similarly, when I link to other content, I want to show its creators the same courtesy by linking directly to their content rather than to WayBackMachine.

What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine, or (perhaps better) generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.

I think it would probably need to treat redirects like broken links given the prevalence of corporate sites where content is simply removed and redirected to the homepage, or geo-locked and redirected to the homepage in other locales (I'm looking at you and your international warranty, and access to tutorials, Fender. Grr.).

I also probably wouldn't run it on every build because it would take a while, but once a week or once a month would probably do it.
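
A minimal sketch of what that build task could look like, assuming Node 18+ for the built-in fetch; the Wayback Machine's availability API at https://archive.org/wayback/available is real, but everything else here (file names, the link regex, the report format) is just illustrative:

    // check-links.mjs: hedged sketch, not a finished tool
    import { readFile } from "node:fs/promises";

    const WAYBACK = "https://archive.org/wayback/available?url=";

    async function isHealthy(url) {
      // Treat anything other than a clean 200 (including redirects) as broken.
      try {
        const res = await fetch(url, { redirect: "manual" });
        return res.status === 200;
      } catch {
        return false;
      }
    }

    async function waybackFallback(url) {
      const res = await fetch(WAYBACK + encodeURIComponent(url));
      const snap = (await res.json()).archived_snapshots?.closest;
      return snap ? snap.url : null;
    }

    for (const path of process.argv.slice(2)) {
      const text = await readFile(path, "utf8");
      for (const url of new Set(text.match(/https?:\/\/[^\s"'<>)]+/g) ?? [])) {
        if (!(await isHealthy(url))) {
          console.log(`${path}: BROKEN ${url} -> ${await waybackFallback(url) ?? "no snapshot found"}`);
        }
      }
    }

You'd point it at the built output (something like `node check-links.mjs dist/*.html`) from a weekly cron job rather than on every build, and the report leaves the decision to replace a link up to you.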




> But for my personal site, for example, I'd much rather you link to me directly rather than content in WayBackMachine.

That would make sense if users were archiving your site for your benefit, but they're probably not. If I were to archive your site, it would be because I want my own bookmarks/backups/etc. to be more reliable than just a link, not because I'm looking to preserve your website. Otherwise, I'm just gambling that you won't one day change your content, design, etc. on a whim.

Hence I'm in a similar boat as the blog author. If there's a webpage I really like, I download and archive it myself. If it's not worth going through that process, I use the wayback machine. If it's not worth that, then I just keep a bookmark.


The issue is that if this becomes widespread then we're going to get into copyright claims against the wayback machine. When I write content it is mine. I don't even let Facebook crawlers index it because I don't want it appearing on their platform. I'm happy to have wayback machine archive it, but that's with the understanding that it is a backup, not an authoritative or primary source.

Ideally, links would be able to handle 404s and fall back, like we can do with images and srcset in HTML. That way if my content goes away we have a backup. I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify content at the time it was published via the wayback machine.


There already have been copyright claims against The Wayback Machine. They've been responding to it by allowing site owners to use robots.txt to remove their content.


I politely claim that your view is unrealistic (for published content). You may legally own it, but the instant you make content available to a party other than yourself, you lose any guarantee that you control it. Like I said in my earlier comment, if I find your site and like it, it gets downloaded and saved into my archive. Somebody else could trivially copy and paste or screenshot it to facebook.

I feel similarly to you: I want to own and control what I create. However I'm also realistic about the consequences of publishing it, so I don't publish anything I create beyond personally showing stuff to people who are close to me, and preferably from my own equipment directly. Unless you're doing the same, you don't actually control your content.

This may seem like a neurotic approach, but if you actually care about your content, it's not. It's not difficult to find cases of content being stolen and reused without the creator knowing; e.g. https://www.youtube.com/watch?v=w7ZQoN6UrEw


But it’s also not guaranteed to be consistent. What if you don’t delete the content but just change it? (I.e., what if your opinions change, or you’re pressured by a third party to edit the information?)


I addressed this.

> I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify content at the time it was published via the wayback machine.

Updates are usually good. Sometimes you need to verify what was said though, and for that wayback machine works. I agree it would be nice if there was a technical way to support both, but for the average web request it's better to link to the source.


Perhaps the wayback machine can help fix that by telling users to visit the authoritative site and demanding a confirmation clickthrough before showing the archived content.


> Perhaps the wayback machine can help fix that by telling users to visit the authoritative site and demanding a confirmation clickthrough before showing the archived content.

I'm trying to figure out if you're being ironic or serious.

People on here (rightly) spend a lot of time complaining about how user experience on the web is becoming terrible due to ads, pop-ups, pop-unders, endless cookie banners, consent forms, and miscellaneous GDPR nonsense, all of which get in the way of whatever it is you're trying to read or watch, and all of it on top of the more run-of-the-mill UX snafus with which people casually litter their sites.

Your idea boils down to adding another layer of consent clicking to the mess, to implement a semi-manual redirect through the WayBackMachine for every link clicked. That's ridiculous.

I have to believe you're being ironic because nobody could seriously think this is a good idea.


Agreed. Cut the clutter and keep things simple, just like the HN website does.


It's a deep problem with the web as we know it.

Say I want to make a "scrapbook" to support a research project of some kind. Really I want to make a "pyramid": a general overview that is at most a few pages at the top, then some documents that are more detailed, with the original reference material incorporated and linked to what it supports.

In 2020 much of that reference material will come from the web, and you are left either doing the "webby" thing (linking), which is doomed to fall victim to broken links, or archiving the content, which is OK for personal use but will not be OK with the content owners if you make it public. You could say the public web is also becoming a cesspool/crime scene, where even reputable web sites are suspected of pervasive click fraud and the line between marketing and harassment gets harder to see every day.


Is it a deep problem? You can download content you want to keep. There are many services like Evernote and Pocket that can help you with it.


It is, because it ultimately comes down to the owner's control over how their content is being used.

For example, a modern news site will want the ability to define which text is "authoritative" and to make modifications to it on the fly, including unpublishing it. As a reader, OTOH, I want a permanent, immutable copy of everything said site ever publishes, so that silent edits and unpublishing are not possible. These two perspectives are in conflict, and that conflict repeats itself throughout the entire web.


Some consumers will want the latest and greatest content. To please everyone (other than the owner) you'd need to look at the content across time, versions, alternate world views,... Thus "deep".

My central use case is that I might 'scrape' content from sources such as

https://en.wikipedia.org/wiki/List_of_U.S._states_and_territ...

and have the process be "repeatable" in the sense that:

1. The system archives the original inputs and the process to create refined data outputs

2. If the inputs change the system should normally be able to download updated versions of the inputs, apply the process and produce good outputs

3. If something goes wrong there are sufficient diagnostics and tests that would show invariants are broken, or that the system can't tell how many fingers you are holding up

4. and in that case you can revert to "known good" inputs

I am thinking of data products here, but even if the 'product' is a paper, presentation, or report that involves human judgements there should be a structured process to propagate changes.
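
To make the "archive the inputs" and "revert to known good" parts concrete, here is a rough sketch (Node 18+); the source URL and the validate() invariant below are placeholders I made up, not anything from the actual use case:

    // ingest.mjs: archive each raw input with a timestamp + hash, keep a known-good copy
    import { createHash } from "node:crypto";
    import { mkdir, writeFile, copyFile } from "node:fs/promises";

    const SOURCE = "https://example.org/source-page";    // placeholder input URL
    const DIR = "archive";
    const validate = (text) => text.includes("<table");  // stand-in invariant check

    const body = await (await fetch(SOURCE)).text();
    const hash = createHash("sha256").update(body).digest("hex").slice(0, 12);
    const stamp = new Date().toISOString().replace(/[:.]/g, "-");

    await mkdir(DIR, { recursive: true });
    await writeFile(`${DIR}/${stamp}-${hash}.html`, body);   // 1. archive the original input

    if (validate(body)) {
      // 2. the refreshed input looks sane, so it becomes the new known-good copy
      await copyFile(`${DIR}/${stamp}-${hash}.html`, `${DIR}/known-good.html`);
    } else {
      // 3./4. invariant broken: keep processing archive/known-good.html instead
      console.warn("validation failed; reverting to the known-good snapshot");
    }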


> If it's not worth that, then I just keep a bookmark.

I've made a habit of saving every page I bookmark to the WayBackMachine. To my mind, this is the best of both worlds: you'll see any edits, additions, etc. to the source material, and if something you remember has been changed or gone missing, you have a static reference. I just wish there was a simple way to diff the two.

I keep meaning to write browser extensions to do both of these things on my behalf ...
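
For the saving half, a bookmarklet that hits the public Save Page Now endpoint (https://web.archive.org/save/<url>) gets most of the way there without writing an extension; the diffing half is the harder part:

    javascript:void(window.open('https://web.archive.org/save/'+location.href));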


I can understand posting a link, plus an archival link just in case the original content is lost. But linking to an archival site only is IMO somewhat rude.


> What I can see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that will check all links and replace those that are broken with links to WayBackMachine

Addendum: First, that same tool should – at the time of creating your web site / blog post / … – ask WayBackMachine to capture those links in the first place. That would actually be a very neat feature, as it would guarantee that you could always roll back the linked websites to exactly the time you linked to them on your page.
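
The capture step could be as simple as hitting the Save Page Now endpoint for each outbound link during the build. A hedged sketch (Node 18+, no rate limiting or error handling, which a real tool would need):

    // Ask the Wayback Machine to capture each outbound link at publish/build time.
    async function requestSnapshots(urls) {
      for (const url of urls) {
        // A plain GET to https://web.archive.org/save/<url> requests a fresh capture.
        await fetch("https://web.archive.org/save/" + url);
      }
    }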


I don't care enough to look into it, but I think Gwern has something like this set up on gwern.net.


Doesn't Wikipedia do something like this? If not, the WBM/Archive.org does something like it on Wikipedia's behalf.


Gwern.net has a pretty sophisticated system for this: https://www.gwern.net/Archiving-URLs


Would be nice if there's an automatic way to have a link revert to the Wayback Machine once the original link stops working. I can't think of an easy way to do that, though.


Brave browser has this built in: if you end up at a dead link, the address bar offers to take you to the Wayback Machine.

http://blog.archive.org/2020/02/25/brave-browser-and-the-way...


This was first implemented in Firefox, as an experiment, and is now an extension:

https://addons.mozilla.org/ro/firefox/addon/wayback-machine_...


I used this extension for a while but had to stop due to frequent false positives. YMMV


There's a manual extension called Resurrect Pages for Firefox 57+, which supports Google Cache, archive.is, the Wayback Machine, and WebCite.


I just use a bookmarklet

    javascript:void(window.open('https://web.archive.org/web/*/'+location.href.replace(/\/$/,%20'')));
(which is only slightly less convenient than what others have already pointed out — the FF extension and Brave built-in feature).


Another nice solution is to create a "search engine" for https://web.archive.org/web/*/%s; you can then just add the keyword before the URL (for example, I type `<Ctrl-l><Left>w <Enter>`). Search engines like this are supported by Chrome and Firefox.


I would love for there to be a site that redirected eg. better.site/ https://www.youtube.com/watch?v=jzwMjOl8Iyo to https://invidious.site/watch?v=jzwMjOl8Iyo so I could easily open YouTube links with Invidious, and the same for Twitter→Nitter, Instagram→bibliogram, Google Maps → OSM, etc without having to manually remove the beginning of the URL. I’d presume someone on HN has the skill to do this similarly to https://news.ycombinator.com/item?id=24344127


You can make a "search engine" or bookmarklet that is a javascript/data URL that does whatever URL mangling you need. (Other than some minor escaping issues).

Something like the following should work. You can add more logic to support all of the sites with the same script, or make one per site.

    javascript:document.location="%s".replace(/^https:\/\/www\.youtube\.com/, "https://invidious.site")
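
For example, extending it to a small table of rewrites; the Invidious/Nitter/Bibliogram hostnames below are placeholders for whichever instances you actually use, and you'd collapse this to one line before pasting it in as a keyword "search engine":

    javascript:(function (u) {
      var map = [
        [/^https:\/\/(www\.)?youtube\.com/,   "https://invidious.site"],
        [/^https:\/\/(www\.)?twitter\.com/,   "https://nitter.net"],
        [/^https:\/\/(www\.)?instagram\.com/, "https://bibliogram.art"]
      ];
      for (var i = 0; i < map.length; i++) {
        if (map[i][0].test(u)) { document.location = u.replace(map[i][0], map[i][1]); return; }
      }
      document.location = u;  // no match: just go to the original URL
    })("%s")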


wikipedia just does "$some-link-here (Archived $archived-version-link)", and it works pretty well, imo.


For me that is the real solution: you know the archived link is the version the author actually consulted, while the normal link points to the current content (or its evolution).


Agreed, and it shouldn't be too much of a burden to use since the author was quite clear about it being for reference materials. The idea isn't all that different from referring to specific print editions.


IIRC Wikipedia has some logic for this. When you add a reference it automatically makes sure the page is backed up, and if not it triggers a Wayback copy; then it scans for dead links in references, and if one is found it replaces the link with the Wayback version.


Either a browser extension, or an 'active' system where your site checks the health of the pages it links to.



Their browser extension does exactly that...


The International Internet Preservation Consortium is attempting a technological solution that gives you the best of both worlds in a flexible way, and is meant to be extended to support multiple archival preservation content providers.

https://robustlinks.mementoweb.org/about/

(although nothing else like the IA Wayback Machine exists presently, and I'm not sure what would make someone else try to 'compete' when the IA is doing so well; that is a problem, but refusing to use the IA doesn't solve it!)


Or: snapshot a WARC archive of the site locally, then start serving it only in case the original goes down. For extra street cred, seed it to IPFS. (A.k.a. one of too many projects on my To Build One Day list.)


ArchiveBox is built for exactly this use-case :)

https://github.com/pirate/ArchiveBox


I use linkchecker for this on my personal sites:

https://linkchecker.github.io/linkchecker/

There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories, making it particularly easy to use with static websites before deploying them.

https://www.npmjs.com/package/broken-link-checker-local


> There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories

linkchecker can do this as well, if you provide it a directory path instead of a url.


Ah, thanks! I was not aware of that feature.


I made a browser extension which replaces links in articles and stackoverflow answers with archive.org links on the date of their publication (and date of answers for stackoverflow questions): https://github.com/alexyorke/archiveorg_link_restorer


> generate a report of broken links and allow me to update them manually just in case a site or two happen to be down when my build runs.

SEO tools like Ahrefs do this already, although the price might be a bit steep if you only want that functionality. There are probably cheaper alternatives as well.


Yeah, at some point the Wayback Machine needs to be on a WebTorrent/IPFS type of thing where it is immutable.


I was surprised when digital.com got purged

Then further dismayed that the utzoo Usenet archives were purged.

Archive sites are still subject to being censored and deleted.



Is there any active project pursuing this idea?


The largest active project doing this (to my knowledge) is the Inter-Planetary Wayback Machine:

https://github.com/oduwsdl/ipwb

There have been many other attempts though, including internetarchive.bak on IPFS, which ended up failing because it was too much data.

http://archiveteam.org/index.php?title=INTERNETARCHIVE.BAK/i...

http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-...


https://github.com/exp0nge/wayback

Here's an extension to archive pages on Skynet, which is similar to IPFS but uses financial compensation to ensure availability and reliability.

I don't know if the author intends to continue developing this idea or if it was a one-off for a hackathon.


FileCoin is the incentivization layer for IPFS, both built by Protocol Labs.


I'm hoping someone here on Hacker News will pick it up and apply for the next round at Y Combinator. A non-profit would be better than a for-profit in this case. Blockchain-ish tech would be perfect for this. If no one does in a few years, then I'll do it.


> generate a report of broken links

I actually made a little script that does just this. It’s pretty dinky but works a charm on a couple of sites I run.

https://github.com/finnito/link-checker


Not to forget that while I might go to an article written ten years ago, the Wayback archive won't show me a related article that you published two years ago updating its information or correcting a mistake.


And when you die, who will be maintaining your personal site? What happens when the domain gets bought by a link scammer?

Maybe your pages should each contain a link to the original, so it's just a single click if someone wants to get to your original site from the wayback backup.


The Wayback Machine converts all links on a page to Wayback links, so you can navigate a dead site normally.


Well that's a bummer. Any way to defeat it?


If you're viewing a capture of a site, there's always a banner at the top of the page showing the original URL and when the page was captured, along with controls to view other snapshots. I do wish the banner had an "open actual site" button, but it's pretty easy to copy the URL from the text box and paste it into your browser's location bar.
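
A bookmarklet can do that hop in one click, assuming the usual https://web.archive.org/web/<timestamp>/<original-url> shape of capture URLs:

    javascript:document.location=location.href.replace(/^https:\/\/web\.archive\.org\/web\/[^\/]+\//, '');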


I spent hours getting all the stupid redirects working from different hosts, domains and platforms.

People still use RSS either to steal my stuff or to discuss it off-site (as if commenting to the author is so scary!), often in a way that leaves me totally unaware it's happening. So many times people ask questions of the author on a site like this, or bring up good points or things worth going further on, that I would otherwise miss.

It’s a shame ping backs were hijacked but the siloing sucks too.

Sometimes I forget for months at a time to check other sites, not every post generates 5000+ hits in an hour.


What if your personal site is, like so many others these days, on shared IP hosting like Cloudflare, AWS, Fastly, Azure, etc.?

In the case of Cloudflare, for example, we as users are not reaching the target site; we are just accessing a CDN. The nice thing about archive.org is that it does not require SNI. (Cloudflare's TLS 1.3 and ESNI work quite well AFAICT, but they are the only CDN that has it working.)

I think there should be more archive.org's. We need more CDNs for users as opposed to CDNs for website owners.


The "target site" is the URL from the author's domain, and Cloudflare is the domain's designated CDN. The user is reaching the server that the webmaster wants reachable.

That's how the web works.

> The nice thing about archive.org is that it does not require SNI

I fail to see how that's even a thing to consider.


If the user follows an Internet Archive URL (or a Google cache URL or a Bing cache URL or ...), does she still reach "the server the webmaster wants reachable"?

SNI, or more specifically sending domain names in plaintext over the wire when using HTTPS, matters to the IETF because they have gone through the trouble of encrypting the server certificate in TLS 1.3, and eventually they will be encrypting SNI. If you truly know "how the web works", then you should be able to figure out why they think domain names in plaintext are an issue.



