
Link rot and redirect madness - ca98am79
http://donmelton.com/2015/12/09/link-rot-and-redirect-madness/
======
mekarpeles
For anyone interested, the Internet Archive has a small but tight open source
community called Archive Labs <archivelab.org> (all experimental services)

We're working on several initiatives, like bringing OpenAnnotations to the
Wayback Machine, reducing link rot (see: Vinay Goel &
[http://vinay.blog.archive.org/2014/12/17/9/](http://vinay.blog.archive.org/2014/12/17/9/)),
universalizing access to open-access publications, and supporting efforts to
distribute and decentralize the web.

Send me an email if you're interested in checking out our slack channel and
meeting fellow contributors. Also, if you have a public-good or open-access
project you would like help with or resources for, we'd love to hear about it
--

mekarpeles@gmail.com

------
benten10
Here's a startup idea that I'd been holding close to my heart, but now that my
teammate's working for a BigCorp, earning beyond her dreams, it's time for it
to go public. Steal this idea!

Link rot is only going to get worse from here. It's real, and it's awful.

It's not even random websites. NYT will post a link to a popular Youtube
video, two days later the video gets pulled, and one week into the article,
the links are already stale.

As Wiki gets more 'reputable', newspapers will post links to it. Wiki being
what it is, the reference to the page will go stale.

So here's an idea for a service: hash all the outgoing/inside pages on a
website. If they change, either 1) give options for users to review, 2) update
the link, or 3) delete the reference. If the original website 404s, change the
link to the archive page. If a video link is outdated, provide tools to search
for similar videos to link, from inside the dashboard. For Wikipedia,
automatically link to a certain version of history of a page, not the general
page. This is different from the idea posted below because it integrates with
existing publishing systems, so it'd be more b2b, and one would start cashflow
right away.
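
A minimal sketch of the decision logic described above, assuming the publishing system stored a hash of each linked page when the article went out. The names `LinkRecord`, `fingerprint`, and `choose_action` are hypothetical, and fetching is left out so the logic can be exercised offline. The two URL helpers use the public Wayback Machine snapshot pattern and Wikipedia's `oldid` permalink form.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlencode


@dataclass
class LinkRecord:
    url: str
    stored_hash: str  # SHA-256 of the page body at publication time


def fingerprint(body: bytes) -> str:
    """Return a hex SHA-256 digest of the page body."""
    return hashlib.sha256(body).hexdigest()


def wayback_url(url: str, timestamp: str) -> str:
    """Build a Wayback Machine snapshot URL (timestamp as YYYYMMDDhhmmss)."""
    return f"https://web.archive.org/web/{timestamp}/{url}"


def wikipedia_permalink(title: str, oldid: int) -> str:
    """Link to one specific revision of a page instead of the live page."""
    query = urlencode({"title": title, "oldid": oldid})
    return f"https://en.wikipedia.org/w/index.php?{query}"


def choose_action(record: LinkRecord, status: int, body: Optional[bytes]) -> str:
    """Decide what the dashboard should do with one outgoing link."""
    if status in (404, 410) or body is None:
        return "swap_to_archive"   # original is gone: point at the archive
    if fingerprint(body) != record.stored_hash:
        return "needs_review"      # content changed: let an editor decide
    return "ok"
```

A raw byte hash will flag every trivial change (ads, timestamps), so a real service would probably hash only the extracted article text; that refinement is not shown here.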

For links to social media, use a 'photo snippet' tool that checks whether the
link is still valid, and falls back to an image version of the linked post if
the link is dead.

I'm certain people would pay GOOD MONEY for this service. I know I would if I
were running a publication.

Give me some spare change if it succeeds. : )

~~~
sosuke
The ultimate version being complete capture of the Internet at large with hour
to hour snapshots and history perhaps using deltas between versions.

Git Internet!

I've personally started saving sites I want to keep using the print-to-PDF
feature. Bookmarks aren't enough when you really care to save the data.

~~~
mmebane
I use the Firefox version of Zotero[1] to archive links. It works quite well,
and is more searchable than PDFs.

[1]: [https://www.zotero.org/](https://www.zotero.org/)

~~~
wodenokoto
You can save a lot of space by using a "reader-view" version of the webpage.

------
jerf
In my vision of a perfect blogging platform, it automatically and
periodically scrapes all your old links and looks for either redirections or
wild size changes in the content. It also auto-archives linked content for you
(which used to be easier, but can still be at least tried), and could offer
these archives to either the blog owner or the post's reader. But among the
many challenges of implementing that, the biggest is that it is hard to make
it work at scale, because of the resulting legal issues.
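
The redirect and size check could be sketched roughly like this, assuming the platform keeps a small snapshot of each link from the last successful scrape. `Snapshot`, `looks_rotten`, and the `max_ratio` threshold of 3.0 are all hypothetical choices, not taken from any existing platform.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    final_url: str  # URL after following redirects
    size: int       # body length in bytes


def looks_rotten(old: Snapshot, new: Snapshot, max_ratio: float = 3.0) -> bool:
    """Flag a link if it now lands somewhere else or its size changed wildly."""
    if new.final_url != old.final_url:
        return True
    smaller, larger = sorted((old.size, new.size))
    if smaller == 0:
        # An empty page that stays empty is unchanged; anything else is a jump.
        return larger > 0
    return larger / smaller > max_ratio
```

A flagged link would still need a human look, since a redesign can change the size of a page that is otherwise intact.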

~~~
isoos
I've recently started to consider tools like wkhtmltopdf [1] to easily dump a
website I'd like to link to. Then I'd include both the original link and the
pdf version on the blog page, in order to keep the linked structure and the
archived version at the same spot.

I haven't done it yet, but would be interested in others' experiences. It
would be interesting to have a crawler that does it automatically for each new
post.

[1] [http://wkhtmltopdf.org/](http://wkhtmltopdf.org/)

~~~
dangrossman
This would be problematic legally, even when what you're linking to offers a
permissive license of some sort.

------
Jerry2
I keep all my bookmarks in a Delicious account and I was shocked that around
40% of the links I saved since 2013 are now gone. I managed to find a few
archived pages on archive.org, but many of the pages I "saved" are gone
forever and I can never refer to them again.

I now save every single page I find important or interesting on my local HDD
and then move it to a backup disk later on.

~~~
rahiel
I was having the same problem with my bookmarks disappearing from the
internet. Searching for a solution I eventually made a browser extension to
send bookmarks to an online archiving service (archive.is). I then added an
option to save archives locally in mhtml format. I have around 370 bookmarks
archived that way and it takes 240MB of space.

The addon can be found here:
[https://github.com/rahiel/archiveror](https://github.com/rahiel/archiveror).

~~~
Cyberdog
Saving archives locally is great, but I wouldn't count on using archive.is for
long-term reliable storage. Seeing as how it must cost its operator(s) more
than a few ducats to operate and stores copies of copyrighted material on a
centralized server(s) without being able to hide behind the "lol we're just a
library" shield that Internet Archive can, I wouldn't be surprised to see it
suddenly disappear one day.

------
zeveb
I recently went through a large portion of my old blog. It's astounding how
many excellent resources from just a few years ago have completely vanished
due to link rot; in a few cases I was unable to find archives even on
archive.org.

It's really sad, and more than a little disheartening.

~~~
sosuke
Hosting the content or having your own copy of something is the only real way
to be certain. I had the same experience and I've started saving all my work.
Thanks for the idea about the blog though, I need to go back through mine and
export everything.

------
PuffinBlue
I can add to the sample size here and have found similar results. I wrote an
SEO test piece a while ago which inadvertently became quite popular[0] (and
did rank well for the targeted terms).

In updating it recently (a couple of months ago) I found many of the links
simply 404'd, not even any redirect at all. Wikipedia was pretty bad for this
too, as the author found.

I took to hosting all the files I linked to myself to at least be sure they'd
stay around but it is a bit of a losing battle with regard to linking to
normal webpages.

I might look to link to an archive.org page instead in the future if the link
rot keeps happening. Perhaps that would be a more stable option.

[0] [http://josharcher.uk/blog/why-margaret-thatcher-is-hated/](http://josharcher.uk/blog/why-margaret-thatcher-is-hated/)

------
TazeTSchnitzel
Wikipedia handles redirects mostly fine. Mostly.

The one big problem with Wikipedia redirects is those to sections: if a page
is moved or disappears, the software notices, but it doesn't if a section
changes its name. Thus, frequently the anchor in an intra-wiki link is broken.
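
A rough sketch of how a broken section anchor could be detected from the outside, assuming the target page's HTML is already in hand. `anchor_exists` is a hypothetical helper; it only checks that the link's fragment matches some `id` (or an `<a name>`) in the page, which is how section headings are normally addressed.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class _IdCollector(HTMLParser):
    """Collect every id attribute (and <a name=...>) seen in a document."""

    def __init__(self):
        super().__init__()
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "id" or (tag == "a" and name == "name"):
                self.ids.add(value)


def anchor_exists(link: str, html: str) -> bool:
    """Return True if the link has no fragment or the fragment is in the page."""
    fragment = unquote(urlsplit(link).fragment)
    if not fragment:
        return True
    collector = _IdCollector()
    collector.feed(html)
    return fragment in collector.ids
```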

~~~
cooper12
FWIW broken anchors are tracked here:
[https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Bro...](https://en.wikipedia.org/wiki/Wikipedia:Database_reports/Broken_section_anchors).
Maybe you can help alleviate the problem.

~~~
TazeTSchnitzel
Awesome, I didn't know about that page. I've gone and fixed a few of the top
ones.

------
CM30
To some degree, this is why some sites use services like archive.is to link to
third party sites, because the archiving service will keep a cached version of
the page even if the original source goes down. Maybe this could be the
solution for certain cases? Okay, it won't be super popular with all site
owners (they'd rather traffic came to their sites directly so their ads were
viewed), but it's certainly more stable.

~~~
bigbugbag
This is worse, as centralizing creates a single point of failure over which
you have no control. Specifically, archive.is has repeatedly taken a stance
against going open source and offering a self-hosting solution while being
privately funded with no option for you to take part in it.

Archive.org or maybe something like wallabag or even a screenshot, but anyways
to work around link rot one is better off with a local cached copy.

------
andrewchambers
I always thought blogs should link to immutable content addressed mirrors of
sites. Perhaps it could be a paid service.
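
To illustrate what "content addressed" means here: the address is derived from the bytes themselves, so a link can never silently point at different content. This is a toy in-memory sketch; `MirrorStore` and the `sha256-` prefix are made up for the example and do not describe any existing service.

```python
import hashlib


def content_address(body: bytes) -> str:
    """Derive an address from the content itself."""
    return "sha256-" + hashlib.sha256(body).hexdigest()


class MirrorStore:
    """Toy in-memory mirror keyed by content address."""

    def __init__(self):
        self._blobs = {}

    def put(self, body: bytes) -> str:
        address = content_address(body)
        self._blobs[address] = body
        return address

    def get(self, address: str) -> bytes:
        body = self._blobs[address]
        # Re-hash on read so a corrupted or swapped blob is never served.
        if content_address(body) != address:
            raise ValueError("stored content does not match its address")
        return body
```

A blog post would then link to the returned address rather than to the original URL.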

------
hendry
Wrote a script to help manage linkrot on my blog:
[https://github.com/kaihendry/linkrot](https://github.com/kaihendry/linkrot)

A number of issues remain: people who squat on pages, and accounting for
temporary failures.
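
One common way to account for temporary failures is to report a link as dead only after several consecutive failed checks. The sketch below shows that idea; `FailureTracker` and its default threshold of 3 are assumptions for illustration and are not taken from the linkrot script. It does nothing about squatted pages, which still return a successful response.

```python
from collections import defaultdict


class FailureTracker:
    """Report a link as dead only after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self._failures = defaultdict(int)

    def record(self, url: str, ok: bool) -> bool:
        """Record one check; return True once the link should be reported."""
        if ok:
            self._failures[url] = 0  # any success resets the streak
            return False
        self._failures[url] += 1
        return self._failures[url] >= self.threshold
```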

------
gwbas1c
I very rarely update my website. (Most of it predates modern CMSes, anyway.)

One of the things that I do when I put up a page, though, is make PDFs of
everything that I link to. This way, if the link dries up, or is changed,
there's still something to fall back to.

------
myfonj
> I’ll soon upgrade to HTTPS here.

Pardon my ignorance and probably silly question, but what is the point of
making data transfers to/from a _blog_ hosted at its _own domain_ more secure?
If I get it right, it will just hide which particular articles a reader visits
(not that they visit this domain) and it will prevent caching along the wire.
I suppose there must be some real benefit of this right in front of me (see
_"everyone else seems to have performed the same upgrade"_), but I cannot see
it. Any clues?

~~~
x1798DE
For the user, the benefit is privacy and MITM protection. Hopefully, browser
chrome will eventually mark non-HTTPS sites explicitly as unsafe (like a red
bar at the URL).

~~~
myfonj
Ok, valid point.

------
pc2g4d
One possible approach to link rot: each web page embeds within it (or in
associated resources) a copy of every page linked to. Then when a link is
clicked on it's always possible to show the page as intended.

The biggest issue with this approach would probably be legal, as you'd find
yourself redistributing copyright works. Could a fair use argument be made
since it would be furthering public discourse?

------
DiabloD3
This is why I don't bookmark things anymore, I save them directly to Evernote.

