A glance at their archive shows The Onion is still hosting articles from ten years ago. What in the world makes webmasters think it's okay to fail to maintain their mappings between URLs and live content?
Cool URIs don't change: http://www.w3.org/Provider/Style/URI
I feel your pain. I assume you've already got something that works for you, but since any problem worth mentioning is probably going to bite someone else here eventually:
1) Don't try to manage 301s in server configs. You'll go insane.
2) Make a simple table mapping old URLs to new URLs. You can update this when you make a change that breaks URLs.
3) If you feel like you want to 404 or 301, and your server is not overloaded (if it is, pfft, let that spider eat a 404 and save resources for real people), check memcached to see if you've resolved this recently. If not, check the table and set the cache accordingly. You can then return control to the webserver to serve the cached error page or, alternatively, send the 301 to the proper page. (There's a sketch of this lookup right after the list.)
4) Give the end-user a quick page which shows which URLs are consistently getting 404ed and asks for the best page to 301 them to. Since that page is behind an admin login, you can make it as expensive as you darn well please -- for example, grepping the heck out of a large log file, going row by row, and searching for a "best guess" (sketched after this comment). You can let the users approve them with one click. (Last time I wrote one I put a little forecast of how much of the marketing budget was saved due to the users' diligence in assigning 301s. Five minutes of work, got me more pats on the back than most projects that take 6 months. Apparently the admin staff was fighting over who got to do the URL corrections every day.)
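To make point 3 concrete, this is roughly the lookup I mean, as a minimal sketch assuming a Django app with memcached behind django.core.cache. The Redirect model and the cache key format are made-up names for illustration, not anything The Onion actually runs:

    # Minimal sketch: on a 404, consult memcached, then a redirect table
    # (hypothetical model mapping old_path -> new_path), and 301 if we know
    # where the old URL went. Otherwise fall through to the cached 404 page.
    from django.core.cache import cache
    from django.http import HttpResponsePermanentRedirect

    from myapp.models import Redirect  # hypothetical old_path -> new_path table

    CACHE_TTL = 60 * 60 * 24  # remember lookups (hits and misses) for a day


    class LegacyRedirectMiddleware:
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            response = self.get_response(request)
            if response.status_code != 404:
                return response

            key = "redirect:%s" % request.path
            target = cache.get(key)
            if target is None:
                row = Redirect.objects.filter(old_path=request.path).first()
                target = row.new_path if row else ""  # "" = known dead URL
                cache.set(key, target, CACHE_TTL)

            if target:
                return HttpResponsePermanentRedirect(target)
            return response  # serve the (cheap, precomputed) 404 as usual

The database only gets touched once per unknown URL per cache lifetime; everything else is a memcached hit.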
Your users will love you, your database load will be low, your SEO will be awesome, and your crusty ol' sysadmin will not tear out your intestines and use them for a necklace the 47th time you ask him to add a 301 to the config file.
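For point 4, the report itself doesn't need to be anything fancier than this kind of thing. A rough sketch only: the log path and common-log format are assumptions, and the "best guess" and one-click approval UI are left out:

    # Rough sketch of the admin-side 404 report: grep the access log for 404s
    # and rank the offending paths. Log location and format are assumptions.
    import re
    from collections import Counter

    LOG_PATH = "/var/log/nginx/access.log"  # hypothetical path
    # matches lines like: "GET /some/old/url HTTP/1.1" 404
    LINE_RE = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[^"]*" 404 ')

    def top_404s(n=50):
        counts = Counter()
        with open(LOG_PATH) as log:
            for line in log:
                m = LINE_RE.search(line)
                if m:
                    counts[m.group("path")] += 1
        return counts.most_common(n)

    if __name__ == "__main__":
        for path, hits in top_404s():
            print("%6d  %s" % (hits, path))

Feed the top of that list to the admins, let them pick a destination for each, and write the approved pairs into the table from point 2.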
To put this in perspective, I get approximately 40000 unique 404s from spiders every day for things that simply don't exist and never did.
In the real world, those tend to break somewhere around the second or third CMS upgrade.
"I have trouble believing any CMS author could fail to support them"
It almost sounds like you are assuming that The Onion has some sort of duty to maintain ancient URIs. IMO, they should maintain their redirects only as long as they deem it necessary - there is a cost in doing so, after all. Sure, the cost of maintaining outdated URIs may or may not outweigh the cost of losing visitors, but that is for them to decide, not the peanut gallery.
I'm sure I'm not the only one who isn't so entitled that I wouldn't forgive The Onion if the links from my 2001 bookmark.htm export fail to land on the original content.
The Internet is trying to give the gift of free traffic and The Onion is saying: "no thanks." Most sites have to pay for traffic, and wouldn't be throwing it away.
Way to play it like Big Media. What's next? A pay wall?
As a web analyst, I can tell you that in general old, deprecated content does not get very many visits except from spiders. I would not at all be surprised if the marginal ad revenue is break-even compared to the extra server load.
Note also that they're not just throwing the traffic away. It's a decent 404. It's not the best I've seen, and there's room for improvement, but it's directing people straight to the archive so they can look for the story they were linked to.
Spidering and organic activity are bundled together in this 66% number. It would be interesting to know what percentage of this was people who wanted to see an Onion page.
The spiders and organic visitors should have been 301 redirected to the correct location. Spiders learn the 301 redirects, and some of their articles may even have benefited from better search rankings.
These hits will never (and should never!) translate into even a single real user in my case.
I don't think you can fix this with code, much less your usual tactic of adding heuristics to moderate/voteweight/ban.
I, for one, won't try to second-guess your decisions. The Onion has been around for a long (internet) time, and I suspect most of the Web101 suggestions have been considered.
Besides, you guys have left me in stitches too many times for me to be particularly critical. Thanks for existing!
Spiders make up the vast majority of my 404s. They request URIs that in no sane world should exist. They request http://www.theonion.com/video/breaking-news-some-bullshit-ha...
They request http://www.theonion.com/articles/man-plans-special-weekend-t... even though that domain is explicitly set to http://media.theonion.com/ and is not a relative URL in the page source.
I can't fix a broken spider and tell it not to request these links that do not even exist, but I still have to serve their 404s.
Edit: In other words, because of broken spiders that try to guess URLs, I have roughly 15-20x as many 404s as 200 article responses, and there's nothing that can be done about it except move every single resource that exists on my page into a CSS sprite map.
Edit #2: You can't see the URLs very clearly, but what's happening is that a spider is finding a filename on our page, appending it to the end of the URL, and requesting that to see if it exists.
> And the biggest performance boost of all: caching 404s and sending Cache-Control headers to the CDN on 404. Upwards of 66% of our server time is spent on serving 404s from spiders crawling invalid urls and from urls that exist out in the wild from 6-10 years ago.
They're still serving the pages, but they're serving them off CDN instead of serving them off their main server.
I'm not sure which is more mind-boggling: spending 66% of their bandwidth and 50% of their CPU on serving these trivially cacheable pages, or the fact that they didn't correct the problem when they were serving more than 5%.
Edit: This is not to say that our 404s were not cached at our origin. Our precomputed 404 was cached and served out without a database hit on every connection; however, this still invokes the regular expression engine for URL pattern matching and taxes the machine's network I/O resources.
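For anyone who wants to copy the trick quoted above, it boils down to marking the 404 response as cacheable so the CDN absorbs the repeats. A rough sketch, again assuming Django; the max-age value is an arbitrary example, not The Onion's actual setting:

    # Sketch: let the CDN cache 404s instead of hitting the origin every time.
    # Assumes Django; the max-age value is illustrative only.
    from django.utils.cache import patch_cache_control


    class Cache404Middleware:
        def __init__(self, get_response):
            self.get_response = get_response

        def __call__(self, request):
            response = self.get_response(request)
            if response.status_code == 404:
                # "public" plus max-age tells the CDN it may store and re-serve
                # this 404 rather than passing every broken spider URL upstream.
                patch_cache_control(response, public=True, max_age=3600)
            return response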