OK, so what they are saying is that The Onion is throwing away 66% of its traffic by serving up 404s instead of serving up page views. Wouldn't they 1) have a better user experience by redirecting to the correct URL, 2) have more page views, and most importantly 3) make more money off of ads?

The Internet is trying to give the gift of free traffic and The Onion is saying: "no thanks." Most sites have to pay for traffic, and wouldn't be throwing it away.

Way to play it like Big Media. What's next? A pay wall?

66% of server time, not 66% of traffic. I am not a web developer, but my guess of what's happening: active pages get cached, so 500,000 page views to the homepage use about as much server time as a spider coming across a dead link from 2002. The Onion updates frequently, but not hundreds of times a day, and there's only a (relatively) small number of 'current' articles at any given time. It wouldn't take that many dead links to dwarf the server processing time, as long as they were all distinct links.
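
To make that guess concrete, here is a rough Python sketch of the arithmetic. It assumes a naive cache keyed on the URL, and the paths and the 1,000 dead links are made-up numbers for illustration, not anything from The Onion's actual setup.

```python
def count_backend_renders(requests):
    """Count requests that miss a simple URL-keyed page cache."""
    cache = set()
    renders = 0
    for url in requests:
        if url not in cache:
            renders += 1    # cache miss: the application server does the work
            cache.add(url)  # repeat requests for this URL are served from cache
    return renders

# Hypothetical traffic: 500,000 homepage views and 1,000 distinct dead links.
homepage_hits = ["/"] * 500_000
dead_links = [f"/content/node/{i}" for i in range(1_000)]

homepage_renders = count_backend_renders(homepage_hits)
dead_link_renders = count_backend_renders(dead_links)
```

Under those assumptions the homepage costs one render no matter how many views it gets, while every distinct dead link costs a render of its own.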

As a web analyst, I can tell you that in general old, deprecated content does not get very many visits except from spiders. I would not at all be surprised if the marginal ad revenue is break-even compared to the extra server load.

Note also that they're not just throwing the traffic away. It's a decent 404. It's not the best I've seen, there's room for improvements, but it's directing people straight to the archive so they can look for the story they were linked to.

Spidering and organic activity are bundled together in this 66% number. It would be interesting to know what percentage of this was people who wanted to see an Onion page.

It's irrelevant that it is spidering mixed in. You can think of spidering as a future page view, assuming that page gets indexed.

The spiders and organic visitors should have been 301 redirected to the correct location. Spiders learn the 301 redirects, and some of their articles might even have benefited from better search rankings.
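
For anyone unsure what that would look like, a minimal sketch in Python follows. The redirect table and both paths are hypothetical; a real site would have to build the mapping from its old URL scheme.

```python
from http import HTTPStatus

# Hypothetical mapping from retired URLs to their current locations.
REDIRECTS = {
    "/content/node/12345": "/articles/example-story",
}

def respond(path):
    """Return (status, headers) for a request path."""
    target = REDIRECTS.get(path)
    if target is not None:
        # Known old URL: tell clients and spiders where it lives now.
        return HTTPStatus.MOVED_PERMANENTLY, {"Location": target}
    # No known mapping: an honest 404.
    return HTTPStatus.NOT_FOUND, {}
```

The catch is that this only works for old URLs you can actually map to something.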

As I said in my post below, spiders that request article_url/reddit.png or article_url/google-analytics.com/ga.js do not get a 301 from me because they're not looking at an href of an <a> tag. They're guessing at a URL that never existed. They are legitimate 404 responses.
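
A rough illustration of the distinction in Python, not the site's real code. The set of known articles and the helper names are invented for the example.

```python
# Hypothetical set of article URLs that really exist on the site.
KNOWN_ARTICLES = {"/articles/example-story"}

def is_guessed_url(path):
    """True when the path is a real article URL with extra segments appended."""
    return any(
        path.startswith(article + "/") and len(path) > len(article) + 1
        for article in KNOWN_ARTICLES
    )

def status_for(path):
    """Serve real articles; everything else, including guessed URLs, is a 404."""
    if path in KNOWN_ARTICLES:
        return 200
    return 404
```

A request for /articles/example-story/reddit.png is flagged as a guess here and gets a 404, because nothing ever lived at that address and there is nothing sensible to redirect it to.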

I feel sorry for John_Onion. The comments here are coming from people who have never looked at server logs (e.g. 'how do you know those are all spiders?'). Looking at my logs I would say at least 75% of my hits (juliusdavies.ca) are spiders. They come and visit every single page a few times a year to see if it's changed. For my own purposes I mirror some open source manuals and specifications (http://juliusdavies.ca/webdocs/). These have been on my site, unchanged, for at least 3 years, and the spiders come every couple months and check every page.

These hits will never (and should never!) translate into even a single real user in my case.

Wouldn't exclusion via robots.txt be appropriate in this case?
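
Something like the following, checked here with Python's standard urllib.robotparser. The robots.txt content and the "ExampleBot" user agent are assumptions for illustration; only the /webdocs/ path comes from the comment above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to skip the mirrored docs.
ROBOTS_TXT = """\
User-agent: *
Disallow: /webdocs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a made-up crawler name.
mirror_allowed = parser.can_fetch("ExampleBot", "http://juliusdavies.ca/webdocs/")
home_allowed = parser.can_fetch("ExampleBot", "http://juliusdavies.ca/")
```

This only helps with spiders that choose to honor robots.txt, and it also takes those pages out of search results, which may or may not be what you want.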

Why are people so angry on the internet?

I'm sorry. Comments here used to be a lot more thoughtful (in both senses). I'm planning on working on this problem soon.

This particular link is a perfect troll — the headline stabs at your WTF button, then the comment linked to has zero context clarifying the reasoning but tons of technical details that fuel the nerd rage — people were so trolled by the perceived linkrot and cache problems that they didn't even bite at the obvious SQL & ORM bait!

I don't think you can fix this with code, much less your usual tactic of adding heuristics to moderate/voteweight/ban.

Not to be HN-litist, but I think this post is drawing questionable comments in particular because it's linked in several places from Reddit, from which we are likely receiving a flood of users uninclined to self-moderate.

I hope that one of the approaches you try is recursive on who tends to vote for whom.

I feel your pain, in more ways than one. Any site that keeps running for a decade is impressive in and of itself.

I, for one, won't try to second-guess your decisions. The Onion has been around for a long (internet) time, and I suspect most of the Web101 suggestions have been considered.

Besides, you guys have left me in stitches too many times for me to be particularly critical. Thanks for existing!

The 404 pages are cached, not dead-ends. I don't know what they do with them. They very well could have "Check out our awesome stories, they're so funny".

A lot of wasted energy and commentary could have been avoided by simply taking a look:


Huh? They're serving their 404 page (and caching it on the CDN). That is a page view. You can have ads on that page.

