
Testing 3 million hyperlinks, lessons learned - sathyabhat
http://samsaffron.com/archive/2012/06/07/testing-3-million-hyperlinks-lessons-learned
======
akent
(I've had this rant before, but I'll repeat it.)

He points to Stack Overflow's 404 as a good example and claims "We do our best
to explain it was removed, why it was removed and where you could possibly
find it."

Yet there is still no permanent archive of deleted Stack Overflow content; you
have to rely on third party archives like archive.org and even then, you have
to be lucky.

SO moderators have a habit of retrospectively deleting old content that is
off-topic under current rules, even if it was perfectly _on-topic_ at some
time in the past. I feel this is bad internet citizenship -- it's removing
internet history for no good reason.

Fair enough, delete newly created off-topic questions under the current
moderation rules. But these questions _were_ on topic when they were
originally asked. Deleting them retrospectively (completely - no redirect
either) is still poor form.

(Top read otherwise though!)

~~~
sageikosa
And thus we lost dozens of episodes of Doctor Who, and almost lost all of
Monty Python (if Terry Jones hadn't stepped in), because someone decided the
material was no longer relevant and archiving wasn't a chartered function
(though the economics of maintaining an archive are a valid concern).

~~~
grandinj
Actually no, we lost them because of bizarre copyright rules which meant they
had to be deleted.

~~~
pbhjpbhj
Can you explain?

------
citricsquid
> It would be trivial to do some rudimentary parsing on the url string to
> determine where you really wanted to go

Specific to this point: a new project I'm building supports "pretty" URLs,
and I've found my (now) _favourite_ solution is to build an alias system.

It works like so: when a user creates an item, an "alias" is registered, set
to "current", and all future queries to that alias are logged. If the user
later causes the URL to change (a name change, etc.), the new alias is
registered but the old one is retained and 301s to the new alias. All aliases
are accessible by the user, and they can invalidate them manually (if they
want to re-use an alias, for example). _However_, if an alias has had a large
number of hits from a single source _since_ it was retired (say 50 referrals
from website.com to mysite.com/previous-alias), the system assumes the user
posted the link on another website, so invalidating that alias would create a
dead link (and lose my site traffic), and it doesn't allow it.

I guess it's convoluted and adds extra overhead, but I feel that if you have
pretty URLs (which, in my opinion, are something a website should aim for)
you need to make sure changing them won't break links across the rest of the
internet. The easy solution is to have _pseudo_ pretty URLs (e.g.
website.com/123-pretty-url, where 123 is the ID and pretty-url is just an
ignored string) or to never allow URLs to be changed, but I don't like
either.

I wonder if any other websites have a good approach to this.
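
Roughly, the shape of it in code. This is only a sketch, assuming a
relational store; the schema and names are made up for illustration, and the
per-referrer hit tracking is simplified to a single counter:

    import sqlite3

    # Hypothetical schema: every slug ever used for an item is kept.
    # Exactly one alias per item is "current"; retired ones 301 to it.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS aliases (
        slug    TEXT PRIMARY KEY,
        item_id INTEGER NOT NULL,
        current INTEGER NOT NULL DEFAULT 1,  -- 1 = canonical, 0 = retired
        hits    INTEGER NOT NULL DEFAULT 0   -- hits logged since retirement
    );
    """

    def rename_item(db, item_id, new_slug):
        """Register a new current alias; retire the old one but keep it 301ing."""
        db.execute("UPDATE aliases SET current = 0, hits = 0 "
                   "WHERE item_id = ? AND current = 1", (item_id,))
        db.execute("INSERT INTO aliases (slug, item_id, current) VALUES (?, ?, 1)",
                   (new_slug, item_id))

    def resolve(db, slug):
        """Return (item_id, slug_to_301_to_or_None), or None for a real 404."""
        row = db.execute("SELECT item_id, current FROM aliases WHERE slug = ?",
                         (slug,)).fetchone()
        if row is None:
            return None
        item_id, current = row
        if current:
            return item_id, None
        # Retired alias: log the hit, then 301 to the current slug.
        db.execute("UPDATE aliases SET hits = hits + 1 WHERE slug = ?", (slug,))
        target = db.execute("SELECT slug FROM aliases "
                            "WHERE item_id = ? AND current = 1", (item_id,)).fetchone()
        return item_id, target[0]

    def can_invalidate(db, slug, threshold=50):
        """Refuse manual invalidation while a retired alias still sees traffic."""
        row = db.execute("SELECT hits FROM aliases WHERE slug = ? AND current = 0",
                         (slug,)).fetchone()
        return row is not None and row[0] < threshold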

~~~
nulluk
I wouldn't recommend leaving the "pretty" part unvalidated. It can cause some
serious issues with Google and duplicate content, and if someone wants to be
malicious they can create a bunch of fake URLs that essentially point to the
same page; even worse, if those URLs receive enough links they can be indexed
at the new "fake" URL. A similar thing happened to a newspaper's website, but
I can't recall which one off the top of my head.
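
The usual guard is to treat the numeric ID as authoritative and 301 any
request whose slug part doesn't match the real one, so fake variants are
never served at their own URL. A sketch as a hypothetical Django view (the
Article model is made up):

    from django.http import HttpResponsePermanentRedirect
    from django.shortcuts import get_object_or_404

    from myapp.models import Article  # hypothetical model with a slug field

    def article_view(request, article_id, slug):
        article = get_object_or_404(Article, pk=article_id)
        # The ID is authoritative; a stale or fabricated slug gets a 301
        # to the canonical URL instead of a 200 with duplicate content.
        if slug != article.slug:
            return HttpResponsePermanentRedirect(f"/{article.pk}-{article.slug}")
        ...  # render the page as normal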

Another potential solution, and my preferred method: whenever a change is
made that would affect the URL of a page, update a "legacy" table with the
old URL and the location of the new one. The next time a 404 is about to be
thrown, do a search against the database and redirect accordingly if a new
URL is found. I rolled this approach into
<https://github.com/leonsmith/django-legacy-url> and whilst it's not
polished, it's by far the easiest and probably the most
automatic/maintainable solution I have found.
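
The general shape of it in Django terms looks something like this (a sketch
of the approach, not the actual API of django-legacy-url; the model and
middleware names are made up):

    from django.db import models
    from django.http import HttpResponsePermanentRedirect

    class LegacyUrl(models.Model):
        # Written whenever a page's URL changes: old path -> new home.
        old_path = models.CharField(max_length=255, unique=True)
        new_path = models.CharField(max_length=255)

    class LegacyRedirectMiddleware:
        """Before letting a 404 out the door, check the legacy table."""
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            response = self.get_response(request)
            if response.status_code == 404:
                match = LegacyUrl.objects.filter(old_path=request.path).first()
                if match:
                    return HttpResponsePermanentRedirect(match.new_path)
            return response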

~~~
fnulp
"I wouldn't recommend having the "pretty" part not validated. It can cause
some serious issues with google & duplicate content"

Not if you properly generate and apply canonical links :)

~~~
nulluk
canonical links are only hints to google (all be it very strong ones); they
always reserve the right to ignore them if they think a webmaster is shooting
themselves in the foot, and that in itself is where the problem is. If I
built up a few hundred links to example.com/1234-this-site-sucks, I'm sure
google would think that is the correct url rather than the canonical link
version, example.com/1234-the-real-slug

~~~
underwater
FYI: the word you wanted is "albeit".

------
kiba
Julian Assange on Self Destructing Paper
(<http://web.archive.org/web/20071020051936/http://iq.org/>):

 _The internet is self destructing paper. A place where anything written is
soon destroyed by rapacious competition and the only preservation is to
forever copy writing from sheet to sheet faster than they can burn.

If it's worth writing, it's worth keeping. If it can be kept, it might be
worth writing. Would you store your brain in a startup company's vat? If you
store your writing on a 3rd party site like blogger, livejournal or even on
your own site, but in the complex format used by blog/wiki software de jour
you will lose it forever as soon as hypersonic wings of internet labor flows
direct people's energies elsewhere. For most information published on the
internet, perhaps that is not a moment too soon, but how can the muse of
originality soar when immolating transience brushes every feather?_

~~~
micaeked
Is there anything being done to create some persistence? I would imagine a
service like this would be useful, e.g. a small fee (10c) to publish 10kb of
unicode, available indefinitely.

edit: found this: <http://www.chronicleoflife.com/> ... but I was thinking of
something for publishing rather than simple backup

edit2: probably the only company I could trust to pull this off (a one-time
fee for publishing static content) would be amazon. It fits really well with
their core business (infrastructure), and amazon is very good with long-term
stuff

~~~
lucian1900
archive.org strives to help with this problem

~~~
tomerv
Specifically, for webpages there is the Wayback Machine:
<http://archive.org/web/web.php>

------
simias
> Some sites like giving you no information in the URL

For me one of the worst offenders in this category is youtube. I can't
understand why they don't put a slug with the video name in the canonical URL
(especially since they have youtu.be for shortening URLs). It's a real pain
to track down an old video in, say, an IRC log, with only the opaque video ID
to go on.

Vimeo does the same thing. Dailymotion however does put a meaningful slug.

~~~
pbhjpbhj
For me one of the worst offenders, given the audience and content, is HN.

This story for example - <http://news.ycombinator.com/item?id=4077891>.

------
Aissen
Obligatory W3C link: Cool URIs don't change:
<http://www.w3.org/Provider/Style/URI.html>

(note: this page's URL hasn't changed since at least 1999)

~~~
illumen
I find it funny that it is now shown as www.w3.org/Provider/Style/URI.html in
one of my browsers, and <http://www.w3.org/Provider/Style/URI.html> in another
browser. On my mobile it is shown as <http://www.w3.org/Provider/Sty..>.

------
Gring
For mismanaged sites where the site owner changed URLs and could have added
proper redirects but instead chose to just show 404s for all of them (the
article mentions the examples of github and java, but there are countless
more), there should be a wikipedia-style, community-driven reference project
with better redirects. Is anybody working in this direction?

------
ConstantineXVI
Semi-OT:

On the subject of GitHub's robots.txt[0], would anyone have a guess at why
this particular repo[1] is singled out?

[0] <https://github.com/robots.txt>

[1] <https://github.com/ekansa/Open-Context-Data>

~~~
underwater
It could be a honeypot. Any robot that crawls that URL gets auto-banned.
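
The trap itself is easy to build. A sketch of the idea, assuming Flask (the
path and the in-memory ban list are hypothetical; a real site would persist
bans):

    from flask import Flask, abort, request

    app = Flask(__name__)
    banned_ips = set()  # a real deployment would share/persist this

    # robots.txt marks this path Disallow, and nothing on the site links
    # to it, so only a crawler ignoring robots.txt ever requests it.
    @app.route("/do-not-crawl")
    def honeypot():
        banned_ips.add(request.remote_addr)
        abort(403)

    @app.before_request
    def reject_banned():
        if request.remote_addr in banned_ips:
            abort(403)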

------
mattmanser
A variable called stuff? Seriously?

Shame, as the rest of the article is quite good, but that really flags to me
that this is a little bit of cowboy code.

Also interesting to read that some sites are taking a 'white-list' approach
to robots.txt; as he says, this is resulting in people starting to ignore it.

~~~
jackalope
_Starting_ to ignore robots.txt? Unfortunately, there's no way to prove that
anyone respects it. A well-identified bot can back off, then return from a
different IP address with a different User-Agent and attempt to mimic a human
user. Webmasters really have no defense against policy violations. If you run
a bot of any kind, including a link checker or SSL tester, please respect
robots.txt. If not, be prepared to be identified as malicious and blocked by
an IDS.
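
Respecting it costs almost nothing. For instance, with Python's standard
library (the URL and user-agent here are placeholders):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/some/page"
    if rp.can_fetch("MyLinkChecker/1.0", url):
        pass  # safe to fetch and check the link
    else:
        print("Disallowed by robots.txt, skipping:", url)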

~~~
jackalope
Let me add: Since the purpose of your bot is to verify links and protect/serve
your users, consider removing the links from your site if robots.txt prohibits
you from checking them. That's what I would prefer as a webmaster who
explicitly set that policy on a site, since I have no control over who posts
the links.

~~~
grecy
That doesn't make sense.

The point of blocking a link with robots.txt is to say "Hey, web crawlers,
please don't load and index this page". It _does not_ mean "Hey, users,
please don't come and load and read this page".

So the script he wrote is, for all intents and purposes, just the same as a
regular old user clicking the link, reading the page, and keeping a list of
the links that work and those that don't. It's not a crawler, it's an
automated user.

If you are a webmaster that wants to stop people from posting links to your
page all around the web and letting others come and read it, make the page
return a 403.

------
sparknlaunch
What are the common causes of broken links?

Seems unavoidable on large sites.

~~~
sams99
I guess some common ones are:

1. sys-admin reorganisation: moving content from one spot to another without
redirects in place.

2. developer reorganisation: for example, moving from "confusing" urls to
"slug" based urls without adding redirects.

3. fragile content: content that moves depending on external changes (beta to
release, for example).

4. products being retired or companies getting acquired.

5. hackers messing stuff up in a way that cannot be fully repaired (or a
non-recoverable data loss).

------
fnulp
"just ignore robots.txt?"

how about "fuck you"? I guess it's high time to make honeypots, tarpits and
bans common practice.

~~~
Aissen
He explains how and why in the article, and gives arguments. You give none.

The problem with whitelist-only robots.txt is that it favors monopolies, and
startups are the ones getting the "fuck you". But maybe you don't care about
that.

~~~
tomjen3
As a webmaster, why would I want bots on my site that don't bring any (or
much) traffic?

