

About Linkrot: Proportion of Working Links on Pinboard.in, 1997-2011 - chl
http://blog.pinboard.in/2011/05/remembrance_of_links_past/

======
patio11
There is an easy way to increase your detection of redirects and parked pages:
make two requests, one to the real URL and one to a URL which is intentionally
broken. (example.com/i-am-a-link and example.com/fklsdfasdifo for example) Run
a heuristic for difference on the resulting content. This won't catch all of
them, particularly if you use a really naive heuristic that can't deal with
e.g. ads changing, but it's a heck of a lot quicker than comparing manually.

~~~
hcho
I suspect that's what Google does too. I see lots of example.com/fklsdfasdifo
type requests from Big G in my logs.

~~~
bauchidgw
If you see a lot of them, go to Webmaster Tools. If you see them there too, it's
not some kind of test but some other reason, mostly their shitty JS parsing,
which treats anything with a / as a relative URL...

------
bmm6o
"Links appear to die at a steady rate (they don't have a half life), and you
can expect to lose about a quarter of them every seven years."

Is this self-contradictory, or is it just poor wording of his findings?

~~~
pyninja
What's not clear about it? They die at a steady rate of 25% per 7 years.

~~~
bmm6o
As reemrevnivek explains, the answer to "25% of what?" is the issue here. An
exponential decay can be described with the same words.
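
To make the two readings concrete, here is a small sketch. The function names and the 28-year horizon are mine, not the article's; the only numbers taken from the quote are "a quarter" and "seven years".

```python
def linear_survival(years, loss=0.25, period=7.0):
    """Fraction of links alive if a fixed share of the ORIGINAL set dies each period."""
    return max(0.0, 1.0 - loss * years / period)


def exponential_survival(years, loss=0.25, period=7.0):
    """Fraction of links alive if a fixed share of the REMAINING set dies each period."""
    return (1.0 - loss) ** (years / period)


if __name__ == "__main__":
    for years in (7, 14, 21, 28):
        print(years, linear_survival(years), exponential_survival(years))
```

Both readings leave 75% after 7 years, which is why the sentence is ambiguous on its own. They only diverge later: after 14 years the linear reading leaves 50% and the exponential one about 56%, and after 28 years the linear reading has nothing left while the exponential one still has about 32%.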

~~~
erikpukinskis
There's a freaking graph. Are we really being this pedantic that we can't
understand what he's saying?

~~~
bmm6o
I discovered when I got home that the graph was blocked by my office's web
filter. Definitely makes more sense with it there. Thanks for your
understanding!

------
ugh
I think it's about time that some government or billionaire throws a few
millions at an internet archive project. The Internet Archive is nice but more
regular snapshots with a wider coverage would be something I'm certain future
historians would love to get their hands on (and they will hate us if we don't
do it).

~~~
tokenadult
_I think it's about time that some government or billionaire throws a few
millions at an internet archive project._

It may be that one or two governments have already done that. You are, of
course, referring to a publicly accessible Internet archive.

As for what a benevolent millionaire (it wouldn't have to be a full
billionaire for this to start up) could fund, pg has suggested, "There is room
to do to Wikipedia what Wikipedia did to Britannica."

<http://ycombinator.com/ideas.html>

It's interesting that pg thought then that Wikipedia's problem is excessive
deletionism, while I (after being a registered Wikipedian and working on
various articles) think that Wikipedia's problem is lack of thorough research
to prepare article content.

<http://strategy.wikimedia.org/wiki/Wikimedia_Movement_Strategic_Plan_Summary/Improve_Quality>

Whatever one's opinion of what's wrong with Wikipedia, the best way to prompt
improvement in Wikipedia (or replace it, if you prefer that) is to build
another site that does some of what Wikipedia does but does it better somehow.
That's not easy, not easy at all, but it's not terribly expensive. I have
looked at the Wikimedia Foundation financial reports, and building a strong
competitor to Wikipedia is a project that is well within the grasp of several
individual millionaires, and within the grasp of quite a few nonprofit
charitable organizations. A business corporation that can find a way to
monetize a Wikipedia competitor might have a great business opportunity.

~~~
neilk
> It's interesting that pg thought then that Wikipedia's problem is excessive
> deletionism, while I (after being a registered Wikipedian and working on
> various articles) think that Wikipedia's problem is lack of thorough
> research to prepare article content.

Yep, that's the usual distinction. Non-Wikipedians believe that Wikipedia
should be a compendium of any information that could be useful, however
unverifiable or incomplete. Wikipedians want there to be higher standards, but
pg makes the common mistake of thinking this is because they are all OCD.[1]
The Wikipedia system relies on group verifiability. Low quality info imposes
long-term costs on the administrators. The article will be flagged more
frequently than others. So pruning low quality info is a matter of
administrator self-defence, even if you ignore the ideals of achieving a
trustworthy encyclopedia.

A Wikipedia successor would have to abandon trustworthiness (or figure out
some way to indicate that certain pages were untrustworthy). Or figure out how
not to impose the costs of maintaining unverifiable information on
administrators. One way might be to connect the info with the community that
cares about it in a more direct and intimate way. Wikipedia fails REALLY badly
at this, to the point where the wiki-insiders sometimes have more
control than the audience for a topic.

> That's not easy, not easy at all, but it's not terribly expensive.

In terms of software and services, it would be no problem at all. But you are
overlooking the cost of creating a new Wikipedia in a world where Wikipedia
already exists.

Wikipedia content is also famously intractable to reuse in any system other
than MediaWiki. We hope to begin alleviating that this year with the big
parser redesign. A side effect should be to enable competitors to try
different things with our content.

[1] They are, but this is not the primary reason. ;)

------
InclinedPlane
Sadly, I think technological advances have only accelerated this phenomenon.
We've gone from an era of static pages, whose overall layout took considerable
effort to change, to CMSes that we can twiddle and upgrade with nary a concern
for backward link compatibility.

Personally I think it should be a principle of every professional web
developer that you just don't break links, period.

------
gojomo
Users may prune their own bookmarks when they discover the links broken –
especially when considering some of the pre-Pinboard systems (like in-browser
bookmarking) from which the earliest data in this analysis comes. So I suspect
this underestimates link-rot.

~~~
idlewords
Just like it says in the article!

~~~
gojomo
D'oh! Overlooked that in my quick-read (or seeing as how I chose the same
'prune' word, only noticed it subconsciously).

Still an important enough point to pull out for highlight here; I should have
used a direct quote.

