I can see it for corporate sites where they change content, remove pages, and break links without a moment's consideration.
But for my personal site, for example, I'd much rather you link to me directly rather than to content in WayBackMachine. Apart from anything else, linking to WayBackMachine only drives traffic to WayBackMachine, not my site. Similarly, when I link to other content, I want to show its creators the same courtesy by linking directly to their content rather than to WayBackMachine.
What I would like to see, and I don't know if it exists yet (a quick search suggests perhaps not), is some build task that checks all links and replaces those that are broken with links to WayBackMachine, or (perhaps better) generates a report of broken links and lets me update them manually, just in case a site or two happen to be down when my build runs.
I think it would probably need to treat redirects like broken links given the prevalence of corporate sites where content is simply removed and redirected to the homepage, or geo-locked and redirected to the homepage in other locales (I'm looking at you and your international warranty, and access to tutorials, Fender. Grr.).
I also probably wouldn't run it on every build because it would take a while, but once a week or once a month would probably do it.
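Something like this might work as a starting point; an untested sketch, assuming a Node/TypeScript build (the urls.txt input file and the report format are made up for illustration):

    // check-links.ts: report broken or redirected links and suggest a
    // Wayback Machine fallback for each. Assumes Node 18+ (global fetch)
    // and a urls.txt file with one URL per line.
    import { readFileSync } from "node:fs";

    async function checkUrl(url: string): Promise<string | null> {
      try {
        // redirect: "manual" so a 301/302 counts as a failure, per the
        // "treat redirects like broken links" rule above.
        const res = await fetch(url, { method: "HEAD", redirect: "manual" });
        return res.status >= 200 && res.status < 300 ? null : `HTTP ${res.status}`;
      } catch (err) {
        return `network error: ${(err as Error).message}`;
      }
    }

    async function main() {
      const urls = readFileSync("urls.txt", "utf8").split("\n").filter(Boolean);
      for (const url of urls) {
        const problem = await checkUrl(url);
        if (problem) {
          // Report a candidate replacement rather than rewriting automatically,
          // so a transient outage doesn't silently change the site.
          console.log(`${url}\t${problem}\thttps://web.archive.org/web/${url}`);
        }
      }
    }

    main();

Run it weekly from cron rather than on every build; the output is a tab-separated report of broken or redirected links, each with a candidate Wayback Machine replacement, left for manual review.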
That would make sense if users were archiving your site for your benefit, but they're probably not. If I were to archive your site, it's because I want my own bookmarks/backups/etc to be more reliable than just a link, not because I'm looking out to preserve your website. Otherwise, I'm just gambling that you won't one day change your content, design, etc on a whim.
Hence I'm in a similar boat to the blog author. If there's a webpage I really like, I download and archive it myself. If it's not worth going through that process, I use the Wayback Machine. If it's not worth that, then I just keep a bookmark.
Ideally, links would be able to handle 404s and fall back, like we can do with images and srcset in HTML. That way, if my content goes away, we have a backup. I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify the content as it was at the time of publication via the Wayback Machine.
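Nothing native exists for this in HTML today, but it can be roughly approximated in script. A hypothetical sketch (the data-archived attribute is invented, and a cross-origin probe can only detect network-level failures such as a dead domain, not a soft 404):

    // Intercept clicks on annotated links; if the original is unreachable,
    // fall back to the archived copy named in data-archived, e.g.
    // <a href="https://example.com/post"
    //    data-archived="https://web.archive.org/web/2020/https://example.com/post">
    document.addEventListener("click", async (event) => {
      const link = (event.target as HTMLElement)
        .closest<HTMLAnchorElement>("a[data-archived]");
      if (!link) return;
      event.preventDefault();
      try {
        // A no-cors probe only rejects on network failure (dead domain);
        // a cross-origin 404 is invisible to the page, so this is partial.
        await fetch(link.href, { method: "HEAD", mode: "no-cors" });
        window.location.href = link.href;
      } catch {
        window.location.href = link.dataset.archived!;
      }
    });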
I feel similarly to you: I want to own and control what I create. However, I'm also realistic about the consequences of publishing it, so I don't publish anything I create beyond personally showing stuff to people who are close to me, and preferably from my own equipment directly. Unless you're doing the same, you don't actually control your content.
This may seem like a neurotic approach, but if you actually care about your content, it's not. It's not difficult to find cases of content being stolen and reused without the creator knowing; e.g. https://www.youtube.com/watch?v=w7ZQoN6UrEw
> I can still write updates to a blog piece or add translations that people send in, and everyone benefits from the dynamic nature of content, while still being able to either fall back or verify the content as it was at the time of publication via the Wayback Machine.
Updates are usually good. Sometimes you need to verify what was said though, and for that wayback machine works. I agree it would be nice if there was a technical way to support both, but for the average web request it's better to link to the source.
I'm trying to figure out if you're being ironic or serious.
People on here (rightly) spend a lot of time complaining about how user experience on the web is becoming terrible due to ads, pop-ups, pop-unders, endless cookie banners, consent forms, and miscellaneous GDPR nonsense, all of which get in the way of whatever it is you're trying to read or watch, and all of it on top of the more run-of-the-mill UX snafus with which people casually litter their sites.
Your idea boils down to adding another layer of consent clicking to the mess, to implement a semi-manual redirect through the WayBackMachine for every link clicked. That's ridiculous.
I have to believe you're being ironic because nobody could seriously think this is a good idea.
Say I want to make a "scrapbook" to support a research project of some kind. Really I want to make a "pyramid": a general overview of at most a few pages at the top, then some documents that are more detailed, with the original reference material incorporated and linked to what it supports.
In 2020, much of that reference material will come from the web, and you are left either doing the "webby" thing (linking), which is doomed to fall victim to broken links, or archiving the content, which is OK for personal use but will not be OK with the content owners if you make it public. You could say the public web is also becoming a cesspool/crime scene, where even reputable web sites are suspected of pervasive click fraud, and where the line between marketing and harassment gets harder to see every day.
For example, a modern news site will want the ability to define which text is "authoritative", and make modifications to it on the fly, including unpublishing it. As a reader OTOH, I want a permanent, immutable copy of everything said site ever publishes, so that silent edits and unpublishing are not possible. These two perspectives are in conflict, and that conflict repeats itself throughout the entire web.
My central use case is that I might 'scrape' content from sources such as
and have the process be "repeatable" in the sense that:
1. The system archives the original inputs and the process to create refined data outputs
2. If the inputs change, the system should normally be able to download updated versions of the inputs, apply the process, and produce good outputs
3. If something goes wrong there are sufficient diagnostics and tests that would show invariants are broken, or that the system can't tell how many fingers you are holding up
4. and in that case you can revert to "known good" inputs
I am thinking of data products here, but even if the 'product' is a paper, presentation, or report that involves human judgements there should be a structured process to propagate changes.
I've made a habit of saving every page I bookmark to the WayBackMachine. To my mind, this is the best of both worlds: you'll see any edits, additions, etc. to the source material, and if something you remember has been changed or gone missing, you have a static reference. I just wish there was a simple way to diff the two.
I keep meaning to write browser extensions to do both of these things on my behalf ...
Addendum: First, that same tool should – at the time of creating your web site / blog post / … – ask WayBackMachine to capture those links in the first place. That would actually be a very neat feature, as it would guarantee that you could always roll back the linked websites to exactly the time you linked to them on your page.
Something like the following sketch should work (untested; it hits the public Save Page Now endpoint and spaces out requests). You can add more logic to support all of the sites with the same script, or make one per site.
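    // save-outlinks.ts: ask the Wayback Machine to capture each URL given
    // on the command line. Usage: ts-node save-outlinks.ts <url> <url> ...
    async function save(url: string): Promise<void> {
      // Public "Save Page Now" endpoint: GET https://web.archive.org/save/<url>
      const res = await fetch(`https://web.archive.org/save/${url}`);
      console.log(`${url} -> ${res.status}`);
    }

    async function main() {
      for (const url of process.argv.slice(2)) {
        await save(url);
        // Be polite: space out requests instead of hammering the service.
        await new Promise((resolve) => setTimeout(resolve, 5000));
      }
    }

    main();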
(although nothing else like the IA Wayback machine exists presently, and I'm not sure what would make someone else try to 'compete' when IA is doing so well, which is a problem, but refusing to use the IA doesn't solve it!)
There's a similar NodeJS program called blcl (broken-link-checker-local) which has the handy attribute that it works on local directories, making it particularly easy to use with static websites before deploying them.
linkchecker can do this as well, if you provide it a directory path instead of a url.
SEO tools like Ahrefs do this already. Although, the price might be a bit too steep if you only want that functionality. But there are probably cheaper alternatives as well.
I was then further dismayed that the utzoo Usenet archives were purged.
Archive sites are still subject to being censored and deleted.
There have been many other attempts though, including internetarchive.bak on IPFS, which ended up failing because it was too much data.
Here's an extension to archive pages on Skynet, which is similar to IPFS but uses financial compensation to ensure availability and reliability.
I don't know if the author intends to continue developing this idea or if it was a one-off for a hackathon.
I actually made a little script that does just this. It’s pretty dinky but works a charm on a couple of sites I run.
Maybe your pages should each contain a link to the original, so it's just a single click if someone wants to get to your original site from the wayback backup.
People still use RSS either to steal my stuff or to discuss it off-site (as if commenting to the author is so scary!), often in a way that leaves me totally unaware it's happening: so many times people ask questions of the author on a site like this, or bring up good points or something worth going further on, that I would otherwise miss.
It's a shame pingbacks were hijacked, but the siloing sucks too.
Sometimes I forget for months at a time to check other sites; not every post generates 5000+ hits in an hour.
In the case of Cloudflare, for example, we as users are not reaching the target site, we are just accessing a CDN. The nice thing about archive.org is that it does not require SNI. (Cloudflare's TLS1.3 and ESNI works quite well AFAICT but they are the only CDN who has it working.)
I think there should be more sites like archive.org. We need more CDNs for users, as opposed to CDNs for website owners.
That's how the web works.
> The nice thing about archive.org is that it does not require SNI
I fail to see how that's even a thing to consider.
SNI, or more specifically sending domain names in plaintext over the wire when using HTTPS, matters to the IETF: they have gone through the trouble of encrypting the server certificate in TLS 1.3, and eventually they will be encrypting SNI too. If you truly know "how the web works", then you should be able to figure out why they think domain names in plaintext are an issue.
BTW, we archive all outlinks from all Wikipedia articles from all Wikipedia sites, in near-real-time... so that we are able to fix them if/when they break. We have rescued more than 10 million so far from more than 30 Wikipedia sites. We are now working to have Wayback Machine URLs added IN ADDITION to Live Web links when any new outlinks are added... so that those references are "born archived" and inherently persistent.
Note, I manage the Wayback Machine team at the Internet Archive. We appreciate all your support, advice, suggestions and requests.
Anyway, the fix should work even with plain HTML. I'm sure there are a bunch of corner cases and security issues involved.
Well as mentioned by others, there is a browser extension. It's interesting to read the issues people have with it:
Citation needed? E.g. something like http://web.archive.org/cdx/search/cdx?url=http://haskell.cs.... produces lines of the form:
edu,yale,cs,haskell)/wp-content/uploads/2011/01/haskell-report-1.2.pdf 20170628055823 http://haskell.cs.yale.edu/wp-content/uploads/2011/01/haskell-report-1.2.pdf warc/revisit - WVI3426JEX42SRMSYNK74V2B7IEIYHAS 563
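The CDX endpoint also supports output=json, which is easier to consume from code. A small sketch against the same public API (Node 18+, global fetch):

    // List Wayback Machine captures for a URL via the CDX API (JSON output).
    async function captures(url: string): Promise<void> {
      const api = "https://web.archive.org/cdx/search/cdx?url=" +
        encodeURIComponent(url) + "&output=json&limit=10";
      const rows: string[][] = await (await fetch(api)).json();
      // The first row is a header, e.g. ["urlkey","timestamp","original",...]
      const [header, ...data] = rows;
      const ts = header.indexOf("timestamp");
      const orig = header.indexOf("original");
      for (const row of data) {
        console.log(`https://web.archive.org/web/${row[ts]}/${row[orig]}`);
      }
    }

    captures("haskell.cs.yale.edu/wp-content/uploads/2011/01/haskell-report-1.2.pdf");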
Oh, sorry, I don't think the WM supports this today. I only meant that it could support it "trivially" (I put that in quotes since I don't know how WM is implemented. But in theory it would be easy to hash all their content and add an endpoint that maps from hashes to URLs).
My point was that you could add an addressing system that is both independent of the Wayback Machine, but which you could still (theoretically) use with it. But you'd have to add the facility to the WM.
Not for the sole reason that it leaves some control to the content owner while ultimately leaving the choice to the user, but also because things like updates and errata (e.g. retracted papers) can't be found in archives. When you have both, it's the best of both worlds: you have the original version, the updated version, and you can somehow have the diff between them. IMHO, this is especially relevant when the purpose is reference.
Link rot isn't the only reason why one would want an archive link instead of original. Not that I'd want to overwhelm the internet archive's resources.
Replace https://example.com in the URL above.
I try to respect the cost of archiving by not saving the same page too often.
It might be an interesting use-case for you to check out, i.e. keeping an eye on those rarely used legal sublinks for smaller companies.
How do you think about it?
In the worst case one might write a cool article and get two hits, one noticing it exists, and the other from the archive service. After that it might go viral, but the author may have given up by then.
The author is losing out on inbound links, so Google thinks their site is irrelevant and gives it a bad PageRank.
All you need to do is get archive.org to take a copy at the time, you can always adjust your link to point to that if the original is dead.
There is also no reason why that has to become a slippery slope, if anyone is going to ask "but where do you stop!!"
Some kind of CDN-edge-archive hybrid.
I agree, but are you suggesting it's going to be better if WayBackMachine is?
We as a community need to think bigger rather than resigning ourselves to our fate.
Let me put it another way: what specifically are you suggesting as an alternative?
I don't think I like IPFS as an organization, but tech-wise it's probably what I'd go with.
It's not about Google's incentives. It's about directing the traffic where it should go. Google is just the means to do so.
Build an alternative. I'm sure nobody wants Google to be the number one way of finding content; it's just that they are, so pretending they're not and doing something that will hurt your ability to have your content found isn't productive.
What is true for Google in this regard is also true of Bing, DDG and Yandex.
I guess the answer is "don't mess with your old site", but that's also impractical.
And I'm sorry, but if it's my site, then it's my site. I reserve the right to mess about with it endlessly. Including taking down a post for whatever reason I like.
I'm sorry if that conflicts with someone else's need for everything to stay the same but it's my site.
Also, if you're linking to my article, and I decide to remove said article, then surely that's my right? It's my article. Your right to not have a dead link doesn't supersede my right to withdraw a previous publication, surely?
I do think the author is wrong to immediately post links to archived versions of sources. At the least, he could link to both the original and the archived version.
People are free to view it and take pictures for their own records, but I could still take it down and put something else up.
As a motivating example, I wrote some stuff on my MySpace page as a teenager that I'm very glad is no longer available. They were published as "freely accessible" and indeed, I wanted people to see it. But when I read it back 15 years later, I was more than a little embarrassed about it, and I deleted it - despite it also having comments from my friends at the time, or being referenced in their pages.
No great value was contained in those works.
Forgetting is part of living, y'know?
But no-one has a problem with other creative industries withdrawing their publications. Film-makers are forever deciding that movies are no longer available, for purely commercial reasons. Why is writing different? Why is pulling your books from a library unethical but pulling your movie from distribution is OK?
I think we either need to extend this to all creative activity, or reconsider it for writing.
I wouldn't say no one has a problem with this. It does happen, but it certainly doesn't make everyone happy. I for one would like for all released media to be available, or at least not actively removed from access.
Copyright was created to encourage publication of information, not to squirrel it away. Copyright should be considered the exception of the standard - public domain.
Is it unacceptable for an artist to throw her art away after it has finished its museum tour? Should a parent hang on to every drawing their child has ever made?
If you are a software developer - is all of the code you've ever written still accessible online, for free? (To the legal extent that you are able, of course.)
Have you written a blog before, or did you have a MySpace? Have you taken care to make sure your creative work has been preserved in perpetuity, regardless of how you feel about the artistic value of displaying your teen emotions?
Consider why you feel it is unethical for the author or persons responsible for the work to ever stop selling it.
This boils down to the public domain, IMO. We have made a long practice of rescuing art from private caches and trash bins to make it publicly available after the artists' passing (the copyright expiring), regardless of their views on what should happen with those works.
> Consider why you feel it is unethical for the author or persons responsible for the work to ever stop selling it.
Selling something and then pulling it down is fundamentally an attempt to create scarcity for something that would otherwise be freely available. It's a marketing technique that capitalizes on our fear of missing out to make a sale.
Again, the right to even sell writings was enshrined in law as an exception to the norm of them immediately becoming part of the public domain, in an effort to encourage more writing.
Sure, after I'm dead, you can do with my stuff whatever you like.
But while I'm alive.... it's my stuff and I can do with it what I like. Including tearing it up because I hate it now and don't want anyone to look at it.
This thread stems from the second: whether a site owner is justified in deleting or rearranging old pages/information on their website.
Additional benefit: some edits are good (addenda, typo corrections, etc.)
Link: <https://example.com>; rel="original"
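That's the Memento-style link relation (RFC 7089), which the Wayback Machine already sends on its archived captures. A rough sketch of a client reading it back (assuming the server actually emits the header):

    // Discover the original URL behind an archived copy by reading the
    // Memento-style Link header (rel="original", RFC 7089).
    async function originalOf(archivedUrl: string): Promise<string | null> {
      const res = await fetch(archivedUrl, { method: "HEAD" });
      const link = res.headers.get("link") ?? "";
      // Loose parse: grab the <url> that precedes rel="original".
      const match = link.match(/<([^>]+)>;\s*rel="original"/);
      return match ? match[1] : null;
    }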
It’s very demotivating
It's an okay idea to link to WB, because (a) it's de facto assumed to be authoritative by the wider global community and (b) as an archive it provides a promise that its URLs will keep pointing to the archived content come what may.
Such promises, though, are just that: promises. Over a long period of time, no one can truly guarantee the persistence of a relationship between a URI and the resource it references. That's not something technology itself solves.
The "original" URI still does carry the most authority, as that's the domain on which the content was first published. Moreover, the author can explicitly point to the original URI as the "canonical" URI in the HTML head of the document.
Moreover, when you link to the WB machine, what do you link to? A specific archived version? Or the overview page with many different archived versions? Which of those versions is currently endorsed by the original publisher, and which are deprecated? How do you know this?
Part of ensuring persistence is the responsibility of original publisher. That's where solutions such as URL resolving come into play. In the academic world, DOI or handle.net are trying to solve this problem. Protocols such as ORE or Memento further try to cater to this issue. It's a rabbit hole, really, when you start to think about this.
WB also supports linking to the very latest version. If the archive is updated frequently enough, I would say it is reasonable to link to that if you use WB just as a mirror. In some cases I've seen error pages being archived after the original page has been moved or removed, though that is probably just a technical issue caused by website misconfiguration or bad error handling.
Bookmark Location: https://web.archive.org/save/%s
Keyword: save
So searching 'save https://news.ycombinator.com/item?id=24406193' archives this post.
You can use any Keyword instead of 'save'.
You can also search with https://web.archive.org/*/%s
The problem is %s gets escaped, so Firefox generates this URL, which seems to be invalid:
If you are still facing problems, go to https://web.archive.org. In the bottom-right 'Save page now' field, right-click and select 'add keyword for search'. Choose your desired keyword.
Did you try the link provided by the one you replied to?
Because it says "HTTP 400" here, so apparently it doesn't convert well, at least not on my end.
I just use the extension myself:
The permission is there for a simple reason, and it should be off by default: it lets you right-click a link on any page and select 'archive' from the menu. A small function, but it requires access to all sites.
Also, there doesn't seem to be a way to open a URL directly from the extension, which seems a weird omission, so I end up going to the archive site anyway, since I very often want to find old, long-lost sites.
(Don't get me wrong, it is still very annoying for the user regardless what the cause is.)
Most preservation solutions are like that, and in the end funding or business priorities (Google Groups) become a serious problem.
I think we need something like the web itself: distributed, and dead easy for anyone to participate in and contribute preservation space to.
Look, there are torrents that have been available for 17 years. Sure, some uninteresting ones are long gone, but there is always a little chance somebody still has the file and someday comes online with it.
I know about IPFS/Dat/SSB, but that stuff, like Bitcoin, is still too complex for a layman contributor with a plain altruistic motivation.
It should be like SETI@Home: fire and forget. Eventually it could be integrated with a browser to cache content you star/bookmark and share it when the original is offline.
If we had legal deposit web archiving institutions, then academics, and others, could create an archive snapshot of some resource and then reference the URI to that (either with or without the original URI), so as to ensure permanence.
I agree that this is what should be done more often for durable content. This is another reason why social media sites are bad: the users technically have automatic copyright to their contributions, but they often also agree to grant licenses to the content website for things like their photos. The US Copyright Office is releasing a new application format intended for blog posts and other short form content that will hopefully be more straightforward to use.
While this is true in general, I am amused that it is not true for citing Wikipedia. Wikipedia can be trusted to remain online for many more years to come, and it has a built-in wayback machine in the form of its revision history.
The page can be completely correct and accurate, but if you cannot trace the references then it cannot be verified and you cannot make the claims in a new work as a result. The whole point of references is to make it so that the claims can be independently verified. Even when there isn't a link rot problem you will often find junk references that cannot be verified.
Wikipedia isn't a bad starting point and sometimes you can find good references. But it is not anywhere close to reliable: just trace the references in the next 20 Wiki articles you read and your faith will be shaken.
Good-quality information on Wikipedia often refers back to published sources, and at the very least an author should check that source and refer to it, rather than to Wikipedia itself.
Anyone doing research just got screwed.
So many papers have code listed to places that don’t exist anymore.
I don't feel comfortable sending a bunch of web traffic to them for no reason other than it being convenient. The wayback machine is a web archival project, not your personal content proxy to make sure your links don't go stale.
They need our help both in funding and in action; one simple action is not abusing their service.
I hope the author of this piece considers donating and promoting donation to their readers: https://archive.org/donate/
The INTERNETARCHIVE.BAK project (also known as IA.BAK or IABAK) is a combined experiment and research project to back up the Internet Archive's data stores, utilizing zero infrastructure of the Archive itself (save for bandwidth used in download) and, along the way, gain real-world knowledge of what issues and considerations are involved with such a project. Started in April 2015, the project already has dozens of contributors and partners, and has resulted in a fairly robust environment backing up terabytes of the Archive in multiple locations around the world.
Snapshots from 2002 and 2006 are preserved in Alexandria, Egypt. I hope there's good fire suppression.
When scoping out the size of Google+, one of ArchiveTeam's recent projects, it emerged that the typical size of a post was roughly 120 bytes, but total page weight was a minimum of 1 MB, a payload-to-throw-weight ratio of roughly 0.01%. This seems typical of much of the modern Web. And that excludes external assets: images, JS, CSS, etc.
If just the source text and sufficient metadata were preserved, all of G+ would be startlingly small -- on the order of 100 GB I believe. Yes, posts could be longer (I wrote some large ones), and images (associated with about 30% of posts by my estimate) blew things up a lot. But the scary thing is actually how little content there really was. And while G+ certainly had a "ghost town" image (which I somewhat helped define), it wasn't tiny --- there were plausibly 100 - 300 million users with substantial activity.
But IA's WBM has a goal and policy of preserving the Web as it manifests, which means one hell of a lot of cruft and bloat. As you note, increasingly a liability.
In practice, this would likely involve recreating at least some of the presentation side of numerous changing (some constantly) Web apps. Which is a substantial programming overhead.
WARC is dumb as rocks, from a redundancy standpoint, but also atomically complete, independent (all WARCs are entirely self-contained), and reliable. When dealing with billions of individual websites, these are useful attributes.
It's a matter of trade-offs.
Mirroring a website isn't so hard that you need a service to do it for you. Your browser even has such a function; try ctrl-s.
Linking to Archive only makes Archive a single point of failure.
Check out [this link](https://...) ([archived](https://...))
This can also help in the event of a "hug of death"
is an ongoing series of photos of nature with superimposed geometrical shapes drawn by drones.
1. Recognize that it's an Archive.org URL
2. Understand that the link references an archived page whose URL is "clearly" referenced as a parameter
3. Edit the URL (especially pleasant on a cell phone) correctly and try loading that
If you expect the user to be able to go through all this trouble if the Archive is down, you can also expect them to look up the page on the Archive if the link does not load.
But better yet, one shouldn't expect either.
Alternatively, this is a good thing for a user agent to handle natively, or through a plugin.
Have a link checking process you run regularly against your site, using some of the standard tools I've mentioned elsewhere in this thread:
When you run the link check (which should be regularly, perhaps at least weekly), also run a process that harvests the non-local links from your site and 1) adds any new links' content to your own local, unpublished archive of external content, and 2) submits those new links to archive.org.
This keeps canonical URLs canonical, makes sure content you've linked to is backed up on archive.org so a reasonably trustworthy source is available should the canonical one die out, and gives you your own backup in case archive.org and the original both vanish.
I don't currently do this with my own sites, but now I'm questioning why not. I already have the regular link checks, and the second half seems pretty straightforward to add (for static sites, anyway).
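The harvesting half might look something like this for a static site (a naive sketch; the public/ output directory, the hostname, and the regex-based extraction are all assumptions):

    // harvest-outlinks.ts: collect external links from a built static site.
    import { readFileSync, readdirSync, statSync } from "node:fs";
    import { join } from "node:path";

    function htmlFiles(dir: string): string[] {
      return readdirSync(dir).flatMap((name) => {
        const path = join(dir, name);
        if (statSync(path).isDirectory()) return htmlFiles(path);
        return path.endsWith(".html") ? [path] : [];
      });
    }

    function outlinks(dir: string, ownHost: string): Set<string> {
      const links = new Set<string>();
      for (const file of htmlFiles(dir)) {
        // Naive extraction; a real tool would use an HTML parser.
        const html = readFileSync(file, "utf8");
        for (const m of html.matchAll(/href="(https?:\/\/[^"]+)"/g)) {
          if (!m[1].includes(ownHost)) links.add(m[1]);
        }
      }
      return links;
    }

    for (const url of outlinks("public", "mysite.example")) {
      // Feed these to your local archive and to web.archive.org/save/<url>.
      console.log(url);
    }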
The problem with linking to the Wayback Machine is that we are still writing archive.org URLs that point at Wayback Machine servers. What guarantee is there that those archive.org links will not break in the future?
It would have been nice if the web were designed to be content-addressable; that is, if the identifier we use to access a piece of content addressed the content directly, not a location where the content lives. There is good effort going on in this area in the InterPlanetary File System (IPFS) project, but I don't think the mainstream content providers on the Internet are going to move to IPFS anytime soon.
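The core idea is easy to demonstrate: derive the identifier from the bytes themselves, so the same content always has the same address no matter where it lives. A sketch using a plain SHA-256 (IPFS's real addresses are multihash CIDs, not this):

    import { createHash } from "node:crypto";

    // A content address: anyone holding the same bytes derives the same
    // identifier, so the identifier cannot silently point at different content.
    function contentAddress(content: Buffer): string {
      return "sha256-" + createHash("sha256").update(content).digest("hex");
    }

    const page = Buffer.from("<html>...archived page bytes...</html>");
    console.log(contentAddress(page));
    // Retrieval then means asking anyone for bytes matching this hash and
    // verifying them locally, rather than trusting a single server.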
For docs and other texts, I just link to the original site and add an (Archive) suffix, e.g. the "Sources" section in https://doc-kurento.readthedocs.io/en/latest/knowledge/nat.h...
That is a simple and effective solution; yes, it is a bit more cumbersome, but it does not bother me.
Can you believe it? Yesterday, I tried to walk out of the grocery store with a head of lettuce for free, and they instead were more interested in making me pay money to support the grocery and agricultural business!
1. Browsers build in a system whereby if a link appears dead, they first check against the Wayback Machine to see if a backup exists.
2. If it does, they go there instead.
3. In return for this service, and to offset costs associated with increased traffic, they jointly agree to financially support the Internet Archive in perpetuity.
Media often will not be saved, so pages look broken. Iframes, and the iframe-breakers on original sites, can kill any navigation.
The Wayback Machine is okay for research but a poor replacement for a permanent link.
In my experience, this has gotten much, much better in the last few years. I haven't explored enough to know if this is part of the archival process or not, but I've noticed on a few occasions that assets will suddenly appear some time after archiving a page. For instance, when I first archived this page (https://web.archive.org/web/20180928051336/https://www.intel...), none of the stylesheets, scripts, fonts or images were present. However, after some amount of time (days/weeks) they suddenly appeared and I was able to use the site as it originally appeared.
1) The example he uses is The Epoch Times, a questionable source even on the best of days.
2) What he refers to as “spam” is a paywall. He is literally taking away from business opportunities for this outlet that produced a piece of content he wants to draw attention to, but he does not want to otherwise support.
He’s a taker. And while the Wayback Machine is very useful for sharing archived information, that’s not what this guy is doing. He’s trying to undermine the business model of the outlets he’s reading.
The Epoch Times is one thing—it’s an outlet that is essentially propaganda—but when he does this to a local newspaper or an actual independent media outlet, what happens?
For the destination site, this is all of the downsides of AMP with none of the upsides.
They're hyper-right-wing QAnon/antivax spreaders associated with the Falun Gong movement.
Where a WP plugin would add value is by saving to the archive whenever WP publishes a new or edited article.
Not all updates are about "begging for money" as the example in the article.
That way we're not all completely reliant on a central system. (ArchiveBox submits your links to Archive.org in addition to saving them locally).
There are also many other tools that can do this:
It might be useful as a backup if the original site starts getting hugged to death.
Of course, linking to WBM is not the main reason why a site might be in this situation but it piles up.
All records of this page on Archive.org were deleted after a couple of days; a Twitter account posting the details with a screenshot and link was reported, and my account was temporarily suspended.
I assume it must be very easy to remove inconvenient content from archive.org.
The use of the bookmarklet makes this really convenient.
Secondly, I personally don’t like the fact that WayBackMachine doesn’t provide an easy way to get content removed and to stop indexing and caching content (the only way I know is to email them, with delayed responses or responses that don’t help). It’s far easier to get content de-indexed in the major search engines. I know that the team running it have some reasons to archive anything and everything (as) permanently (as possible), but it doesn’t serve everybody’s needs.
I wish I had done this 15 years ago for a small project/website. Nowadays, my website is there, with all of its content, but most of the awesome references which I had linked to are unavailable. I wrote "most", but it is close to all of them.
Users will find the archive link if they really want to, and it will make it easier for me to replace broken links in the future.
I've been building lists of -reference- URLs for over a decade ... and the ones aimed at Archive.org (are slower to load, but) are much more reliable.
Saved Wayback URLs contain the original site URL. It's really easy to check it to see if the site has deteriorated (usually it has). If it's gotten better ... it's easy to update your saved WB link.
The Wayback Machine is backed by WARC files. It's perhaps the only thing on archive.org that can't be downloaded... well, except the original MPG files of the 9/11 news footage.
An anchor type which allows several URLs, to be tried in order, would go a long way. Then we could add automatic archiving and backup links to a CMS.
It isn't real content-centric networking, which is a pity, but it's achievable with what we have.
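On the CMS side, emitting such an anchor could look like this (a sketch; the data-alt attribute is invented, and a small client script like the fallback one sketched earlier in the thread would try the URLs in order):

    // Sketch of the CMS side: given an original URL, emit an anchor that
    // carries ordered fallbacks in a (hypothetical) data-alt attribute.
    function anchorWithFallbacks(originalUrl: string, mirrorUrl?: string): string {
      const fallbacks = [
        ...(mirrorUrl ? [mirrorUrl] : []),
        `https://web.archive.org/web/${originalUrl}`, // latest Wayback capture
      ];
      return `<a href="${originalUrl}" data-alt="${fallbacks.join(" ")}">` +
        `${originalUrl}</a>`;
    }

    console.log(anchorWithFallbacks("https://example.com/some-article"));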
The other day, I noticed that even old links from the front pages of Google and YouTube are dead now. The Internet Archive still has them. These were links on the front page of YT. I was very disappointed that even Google has dead links.
1 - https://github.com/ashishb/outbound-link-checker
I was browsing an old HN post from 2018, with lots of what seemed like useful links to their blog. Upon visiting it, the site had been rebranded and the blog entries had disappeared. The Wayback Machine saved me in this case, but a link to it originally would have saved me a few clicks.
I think I've been curating about 200 essays so far like that. You're now making me rethink my flow.
Or so there is no engagement at the source?
Let's say you write an article on your site, https://yoursite.com/my-article, and from it you want to link to an article https://example.com/some-article
You then create a mirror of https://example.com/some-article to be served from your site at https://yoursite.com/mirror/2019-09-08/some-article (put /mirror/ in robots.txt and set it to noindex, or maybe even better add a rel="canonical" pointing to the original article), and at the top of this mirrored page you add a header bar containing a link to the original article, as well as one to archive.org if you want.
tl;dr instead of linking to https://example.com/some-article you link to https://yoursite.com/mirror/2019-09-08/some-article (which has links to the original)
Some blockchain will end up taking care of this.
if not, links to that are one misconfiguration or one parked domain away from being wiped.
That website spends money creating content for commercial viability, it doesn’t have to bow to you and make sure you can consume it for free, and the Wayback Machine isn’t a tool for you to bypass premium content.
I'm sure there are other examples as well.
Now the archive comes up with the paywall message in it.
It still works on some sites by simply archiving the page.
> This URL has been excluded from the Wayback Machine.
They also do not exclude the archive.org bot in https://www.snopes.com/robots.txt
The example provided in the article, showing how a site looked cleaner before, could simply be the content security policies at the WayBackMachine preventing the clutter from getting loaded, rather than any specific changes on the site - although I haven't checked that particular site.