OK, so what they are saying is that The Onion is throwing away 66% of its traffic by serving up 404s instead of serving up page views. Wouldn't they 1) have a better user experience by redirecting to the correct URL, 2) get more page views, and, most importantly, 3) make more money off of ads?
The Internet is trying to give the gift of free traffic and The Onion is saying: "no thanks." Most sites have to pay for traffic, and wouldn't be throwing it away.
Way to play it like Big Media. What's next? A pay wall?
66% of server time, not 66% of traffic. I am not a web developer, but my guess at what's happening: active pages get cached, so 500,000 page views to the homepage use as much server time as a spider coming across a dead link from 2002. The Onion updates frequently, but not hundreds of times a day, and there's only a (relatively) small number of 'current' articles at any given time. It wouldn't take that many dead links to dwarf the server processing time spent on live pages, as long as they were all distinct links.
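A minimal sketch of that arithmetic, assuming a simple render cache (the URLs and counts are made up for illustration): repeated hits to one cached page cost a single backend render, while every distinct dead link misses the cache and hits the backend.

```python
from functools import lru_cache

backend_renders = 0

@lru_cache(maxsize=1024)
def render(url):
    # Stand-in for an expensive page render; thanks to the cache,
    # repeated requests for the same URL cost nothing extra.
    global backend_renders
    backend_renders += 1
    return f"<html>{url}</html>"

# 500,000 views of one cached page -> one backend render.
for _ in range(500_000):
    render("/homepage")

# 1,000 distinct dead links -> 1,000 backend renders.
for i in range(1_000):
    render(f"/dead-link-{i}")

print(backend_renders)  # -> 1001: the dead links dominate backend work
```

So even a trickle of distinct dead URLs can account for most of the uncached server time.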
As a web analyst, I can tell you that in general old, deprecated content does not get very many visits except from spiders. I would not at all be surprised if the marginal ad revenue is break-even compared to the extra server load.
Note also that they're not just throwing the traffic away. It's a decent 404. It's not the best I've seen, and there's room for improvement, but it directs people straight to the archive so they can look for the story they were linked to.
As I said in my post below, spiders that request article_url/reddit.png or article_url/google-analytics.com/ga.js do not get a 301 from me, because they're not following the href of an <a> tag. They're guessing at a URL that never existed. Those are legitimate 404 responses.
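The distinction the commenter draws can be sketched as a tiny dispatch rule (the redirect table and paths here are hypothetical, not The Onion's actual routing): only URLs that genuinely existed at some point earn a 301 to their new home; guessed paths fall through to 404.

```python
# Hypothetical table of URLs that once existed, mapped to their new homes.
REDIRECTS = {
    "/articles/old-slug": "/articles/new-slug",
}

def status_for(path):
    """Return (status_code, location) for a requested path."""
    if path in REDIRECTS:
        # The URL really existed once, so a permanent redirect is honest.
        return 301, REDIRECTS[path]
    # Spider guesses like "/articles/new-slug/reddit.png" never existed
    # as hrefs, so they correctly fall through to a plain 404.
    return 404, None

print(status_for("/articles/old-slug"))             # -> (301, '/articles/new-slug')
print(status_for("/articles/new-slug/reddit.png"))  # -> (404, None)
```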
I feel sorry for John_Onion. The comments here are coming from people who have never looked at server logs (e.g. 'how do you know those are all spiders?'). Looking at my logs I would say at least 75% of my hits (juliusdavies.ca) are spiders. They come and visit every single page a few times a year to see if it's changed. For my own purposes I mirror some open source manuals and specifications (http://juliusdavies.ca/webdocs/). These have been on my site, unchanged, for at least 3 years, and the spiders come every couple months and check every page.
These hits will never (and should never!) translate into even a single real user in my case.
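For anyone who hasn't looked at server logs: telling spider hits apart is usually as simple as scanning the user-agent field. A rough sketch, with toy combined-format log lines standing in for a real access log (real classification would use a fuller bot list):

```python
import re

# Toy access-log lines in combined log format; user agents are illustrative.
log_lines = [
    '1.2.3.4 - - [..] "GET / HTTP/1.1" 200 123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '5.6.7.8 - - [..] "GET /webdocs/ HTTP/1.1" 200 456 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '9.9.9.9 - - [..] "GET /spec.html HTTP/1.1" 200 789 "-" "bingbot/2.0"',
]

# Crude but effective heuristic: most well-behaved crawlers identify
# themselves with "bot", "crawl", or "spider" in the user-agent string.
BOT_PATTERN = re.compile(r"bot|crawl|spider", re.IGNORECASE)

bot_hits = sum(1 for line in log_lines if BOT_PATTERN.search(line))
print(f"{bot_hits}/{len(log_lines)} hits from spiders")  # -> 2/3 hits from spiders
```

Run that over a few months of real logs and a 75%-spider ratio for a low-traffic personal site is entirely plausible.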
This particular link is a perfect troll: the headline stabs at your WTF button, while the linked comment offers zero context clarifying the reasoning but tons of technical details that fuel the nerd rage. People were so trolled by the perceived link rot and cache problems that they didn't even bite at the obvious SQL and ORM bait!
I don't think you can fix this with code, much less your usual tactic of adding heuristics to moderate/voteweight/ban.
Not to be HN-litist, but I think this post is drawing questionable comments in particular because it's linked from several places on Reddit, from which we are likely receiving a flood of users disinclined to self-moderate.