
Okay, you guys obviously aren't getting the whole picture. A minority of our 404s come from old content that I no longer know the URLs for. We redirect everything published within the last 5 years. Stuff originally published 6-10 years ago could potentially be redirected, but none of it came from a database; it was all static HTML in its initial incarnation, and redirects weren't maintained. This was before I took charge of The Onion's link management.

Spiders make up the vast majority of my 404s. They request URIs that in no sane world should exist. They request http://www.theonion.com/video/breaking-news-some-bullshit-ha...

They request http://www.theonion.com/articles/man-plans-special-weekend-t... even though that resource's domain is explicitly set to http://media.theonion.com/ and it is not a relative URL in the page source.

I can't fix a broken spider or tell it not to request links that don't even exist, but I still have to serve the 404s.

Edit: In other words, because of broken spiders that try to guess URLs, I have roughly 15-20x as many 404s as 200 article responses, and there's nothing that can be done about it except to move every single resource on my page into a CSS sprite map.
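
(A rough sketch of what that would mean, with hypothetical class names and a made-up sprite.png path rather than our actual markup: every icon on the page comes out of a single image file, so the only asset filename left in the source for a spider to scrape and mangle is the sprite itself.)

    /* Hypothetical example: all page icons packed into one sprite image,
       so sprite.png is the only image filename in the page source. */
    .icon {
      background: url("http://media.theonion.com/img/sprite.png") no-repeat; /* made-up path */
      display: inline-block;
      width: 16px;
      height: 16px;
    }
    /* Each icon is just an offset into that one file. */
    .icon-comment { background-position: 0 0; }
    .icon-share   { background-position: -16px 0; }
    .icon-rss     { background-position: -32px 0; }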

Edit #2: You can't see the URLs very clearly, but what's happening is a spider is finding an asset filename on our page, appending it to the end of the article URL, and requesting that to see if it exists.
