Hacker News
The Onion cut 66% of their bandwidth by upstream caching 404s (reddit.com)
127 points by atlbeer on Mar 25, 2010 | 54 comments

> from urls that exist out in the wild from 6-10 years ago

From a glance at their archive, The Onion is still hosting articles from ten years ago. What in the world makes webmasters think it's okay to fail to maintain their mappings between URLs and live content?

Cool URIs don't change: http://www.w3.org/Provider/Style/URI

Please see my other comments in this thread. You're making a judgement about what is happening without knowing all of the information involved in the situation. We diligently redirected as much as possible, and I'm currently managing more than 50000 301s.

> I'm currently managing more than 50000 301s.

I feel your pain. I'm assuming you've already got something that works for you, but on the assumption that any problem worth mentioning is probably going to bite someone here eventually:

1) Don't try to manage 301s in server configs. You'll go insane.

2) Make a simple table mapping old URLs to new URLs. You can update this when you make a change that breaks URLs.

3) If you feel like you want to 404 or 501, and your server is not overloaded (if it is, pfft, let that spider eat a 404 and save resources for real people), check memcached to see if you've resolved this recently. If not, check the table and set the cache accordingly. You can then return control to the webserver to serve the cached error page or, alternatively, send the 301 to the proper page.

4) Give the end-user a quick page which returns what URLs are consistently getting 404ed and asks for a best page to 301 them to. Since that page is behind an admin login, you can make it as expensive as you darn well please -- for example, grepping the heck out of a large log file, going row by row, and searching for a "best guess". You can let the users approve them with one click. (Last time I wrote one I put a little forecast of how much of the marketing budget was saved due to the users' diligence in assigning 301s. Five minutes of work, got me more pats on the back than most projects that take six months. Apparently the admin staff was fighting over who got to do the URL corrections every day.)

Your users will love you, your database load will be low, your SEO will be awesome, and your crusty ol' sysadmin will not tear out your intestines and use them for a necklace the 47th time you ask for him to add a 301 to the config file.
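The table-plus-cache lookup in steps 2 and 3 might be sketched like this. This is a minimal illustration, not anyone's actual code: a plain dict stands in for memcached, and the sqlite table and column names are made up for the example.

```python
import sqlite3

class RedirectResolver:
    """Resolve would-be 404s against a redirect table, caching every answer."""

    def __init__(self, db_path=":memory:"):
        self.cache = {}  # stand-in for memcached: url -> (status, target)
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS redirects "
            "(old_url TEXT PRIMARY KEY, new_url TEXT)"
        )

    def add(self, old_url, new_url):
        # Update this table whenever a change breaks URLs (step 2).
        self.db.execute(
            "INSERT OR REPLACE INTO redirects VALUES (?, ?)", (old_url, new_url)
        )

    def resolve(self, url):
        """Return ('301', target) if a redirect is known, else ('404', None)."""
        if url in self.cache:  # cheap cache hit first (step 3)
            return self.cache[url]
        row = self.db.execute(
            "SELECT new_url FROM redirects WHERE old_url = ?", (url,)
        ).fetchone()
        result = ("301", row[0]) if row else ("404", None)
        self.cache[url] = result  # cache negative answers too, so spiders stay cheap
        return result
```

The point of caching the 404 verdicts as well as the 301s is that repeat requests for the same junk URL never touch the database again.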

It sounds like you've done what you can after someone else's negligent work; thanks for that.

If it makes you feel any better, there are approximately 5000 404s in links "in the wild" (according to google crawl error stats) that I have the capability of improving the quality of the redirect for. I may not be able to get 100% accuracy on the redirect, but I can get the user to an issue page of the archive where they can click the story they intended to see. The problem is that there is no metadata associated with those links to get a better idea of what story it was. news1.html isn't terribly descriptive.

To put this in perspective, I get approximately 40000 unique 404s from spiders every day for things that simply don't exist and never did.

Sure in an ideal world every URL that was ever valid in the last 10 years would still work. Even the ones about a contest that ended in 2002 and the download page for your ActiveX-based Push application.

In the real world, those tend to break somewhere around the second or third CMS upgrade.

Leaving archive.org as the only way to reach your older content because you're running out of disk is one thing. But for content that you actually want online, "new users have no existing URLs" is such an obviously bad assumption that I have trouble believing any CMS author could fail to support them.

"What in the world makes webmasters think it's okay to fail to maintain their mappings between URLs and live content?"

"I have trouble believing any CMS author could fail to support them"

It almost sounds like you are assuming that The Onion has some sort of duty to maintain ancient URIs. IMO, they should maintain their redirects as long as they deem it necessary - there is a cost in doing so, after all. Sure, the savings from not maintaining outdated URIs may or may not outweigh the cost of losing visitors, but that is for them to decide, not the peanut gallery.

I'm sure I'm not the only one who isn't so entitled that I couldn't forgive The Onion if the links from my 2001 bookmark.htm export fail to land on the original content.

The importance of a URL is how heavily it's used by the web, not how new it is. The cost of maintaining a URL is far less than the cost of maintaining the content behind it. If you aren't throwing your work down the memory hole, breaking every link to it damages the web and makes far more work for everyone else than you are saving yourself.

Or a nice 301 redirect.

Only if the resource actually moved and you know its new location. I hate it when people 301 /oldpages/* to the homepage.

One of those "holier than thou" topics, isn't it?

Or self-interest. If I were a beancounter, and they told me that they didn't bother serving new ads with old content, because it was too much hassle to map the URLs, I'd be asking for cost/benefit figures ...

One of those, "I want people to find my website usable" things.

OK, so what they are saying is that The Onion is throwing away 66% of its traffic by serving up 404s instead of serving up page views. Wouldn't they 1) have a better user experience by redirecting to the correct URL, 2) have more page views, and, most important, 3) make more money off of ads?

The Internet is trying to give the gift of free traffic and The Onion is saying: "no thanks." Most sites have to pay for traffic, and wouldn't be throwing it away.

Way to play it like Big Media. What's next? A pay wall?

66% of server time, not 66% of traffic. I am not a web developer, but my guess of what's happening: active pages get cached, so 500,000 page views to the homepage uses as much server time as a spider coming across a dead link from 2002. The Onion updates frequently, but not hundreds of times a day, and there's only a (relatively) small number of 'current' articles at any given time. It wouldn't take that many dead links to dwarf the server processing time, as long as they were all distinct links.

As a web analyst, I can tell you that in general old, deprecated content does not get very many visits except from spiders. I would not at all be surprised if the marginal ad revenue is break-even compared to the extra server load.

Note also that they're not just throwing the traffic away. It's a decent 404. It's not the best I've seen, there's room for improvements, but it's directing people straight to the archive so they can look for the story they were linked to.

"The Onion is throwing away 66% of its traffic by serving up 404s instead of serving up page views."

Spidering and organic activity are bundled together in this 66% number. It would be interesting to know what percentage of this was people who wanted to see an Onion page.

It's irrelevant that it is spidering mixed in. You can think of spidering as a future page view, assuming that page gets indexed.

The spiders and organic visitors should have been 301 redirected to the correct location. Spiders learn the 301 redirects, and some of their articles may even have benefited from better search rankings.

As I said in my post below, spiders that request article_url/reddit.png or article_url/google-analytics.com/ga.js do not get a 301 from me because they're not looking at an href of an <a> tag. They're guessing at a URL that never existed. They are legitimate 404 responses.

I feel sorry for John_Onion. The comments here are coming from people who have never looked at server logs (e.g. 'how do you know those are all spiders?'). Looking at my logs I would say at least 75% of my hits (juliusdavies.ca) are spiders. They come and visit every single page a few times a year to see if it's changed. For my own purposes I mirror some open source manuals and specifications (http://juliusdavies.ca/webdocs/). These have been on my site, unchanged, for at least 3 years, and the spiders come every couple months and check every page.

These hits will never (and should never!) translate into even a single real user in my case.

Wouldn't exclusion via robots.txt be appropriate in this case?
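For the unchanging mirrored docs described above, a robots.txt along these lines would ask well-behaved crawlers to skip that directory (with the side effect of dropping those pages from search indexes, which may or may not be desirable):

```
User-agent: *
Disallow: /webdocs/
```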

Why are people so angry on the internet?

I'm sorry. Comments here used to be a lot more thoughtful (in both senses). I'm planning on working on this problem soon.

This particular link is a perfect troll: the headline stabs at your WTF button, then the comment linked to has zero context clarifying the reasoning but tons of technical details that fuel the nerd rage. People were so trolled by the perceived linkrot and cache problems that they didn't even bite at the obvious SQL & ORM bait!

I don't think you can fix this with code, much less your usual tactic of adding heuristics to moderate/voteweight/ban.

Not to be HN-elitist, but I think this post is drawing questionable comments in particular because it's linked in several places from Reddit, from which we are likely receiving a flood of users disinclined to self-moderate.

I hope that one of the approaches you try is recursive on who tends to vote for whom.

I feel your pain, in more ways than one. Any site that keeps running for a decade is impressive in and of itself.

I, for one, won't try to second-guess your decisions. The Onion has been around for a long (internet) time, and I suspect most of the Web101 suggestions have been considered.

Besides, you guys have left me in stitches too many times for me to be particularly critical. Thanks for existing!

The 404 pages are cached, not dead-ends. I don't know what they do with them. They very well could have "Check out our awesome stories, they're so funny".

A lot of wasted energy and commentary could have been avoided by simply taking a look:


Huh? They're serving their 404 page (and caching it on the CDN). That is a page view. You can have ads on that page.

Seems like there's a bit of a missed opportunity there if a lot of people are looking for content that no longer exists. I wonder if they did something that broke old linked URLs or something.

Okay, you guys obviously aren't getting the whole picture. A minority of our 404s are from old content that I no longer know the URLs for. We redirect everything from 5 years ago and up. Stuff originally published 6-10 years ago could potentially be redirected, but none of it came from a database; it was all static HTML in its initial incarnation, and redirects weren't maintained. This was before I took charge of the Onion's link management.

Spiders make up the vast majority of my 404s. They request URIs that in no sane world should exist. They request http://www.theonion.com/video/breaking-news-some-bullshit-ha...

They request http://www.theonion.com/articles/man-plans-special-weekend-t... even though that domain is explicitly set to http://media.theonion.com/ and is not a relative url in the page source.

I can't fix a broken spider and tell it not to request these links that do not even exist, but I still have to serve their 404s.

Edit: In other words, because of broken spiders that try to guess URLs, I have roughly 15-20x as many 404s as 200 article responses, and there's nothing that can be done about it except move every single resource that exists on my page into a CSS sprite map.

Edit #2: You can't see the URLs very clearly, but what's happening is a spider is finding a filename on our page, appending it to the end of the URL, and requesting that to see if it exists.
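That guessing behavior is easy to reproduce: a spider that wrongly treats a bare asset filename as a relative href will resolve it against the article's URL. A quick illustration (the article path here is shortened and hypothetical):

```python
from urllib.parse import urljoin

# The page the broken spider is crawling (path illustrative).
page = "http://www.theonion.com/articles/some-article/"

# It sees the bare filename "reddit.png" in the page source and
# resolves it as if it were a relative link:
print(urljoin(page, "reddit.png"))
# -> http://www.theonion.com/articles/some-article/reddit.png

# A schemeless "google-analytics.com/ga.js" gets glued on the same way,
# because without "http://" it parses as a relative path:
print(urljoin(page, "google-analytics.com/ga.js"))
# -> http://www.theonion.com/articles/some-article/google-analytics.com/ga.js
```

Neither resulting URL ever existed, so both are legitimate 404s that the origin still has to answer unless a CDN caches them.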

But you have to wonder how much of it really is spiders crawling over old pages with links or people actually clicking on links from old pages. Probably many more of the former, and that wouldn't be hard to test for.

The linked post itself says it's mostly spiders following links from various place on the web to old/invalid URLs:

> And the biggest performance boost of all: caching 404s and sending Cache-Control headers to the CDN on 404. Upwards of 66% of our server time is spent on serving 404s from spiders crawling invalid urls and from urls that exist out in the wild from 6-10 years ago.

They're still serving the pages, but they're serving them off CDN instead of serving them off their main server.
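The mechanism quoted from the linked post can be sketched as a plain function. This is a hedged illustration, not The Onion's actual code; the handler name, markup, and max-age value are all made up for the example.

```python
def not_found_response(max_age=3600):
    """Build a 404 response marked cacheable, so a CDN can absorb
    repeat hits for the same bad URL instead of the origin server."""
    status = "404 Not Found"
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # "public" lets shared caches (the CDN) store the response;
        # max-age bounds how long a stale 404 lingers if the URL
        # later becomes valid.
        ("Cache-Control", "public, max-age=%d" % max_age),
    ]
    body = "<h1>404 Not Found</h1>"
    return status, headers, body
```

With headers like these, the second spider request for the same junk URL is served entirely from the CDN edge.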

Still, I wonder what Google does with page rank that goes to a 404 vs a permanent redirect.

This seems more like a DailyWTF than a sweet optimization to be reminiscing about. I'd love to hear more details about this. I mean a single video is about the equivalent of tens of thousands of 404 pages.

I'm not sure which is more mindboggling... Spending 66% of their bandwidth and 50% of their CPU on serving these trivially cacheable pages or the fact that they didn't correct the problem when they were serving more than 5%.

If you read the entire thing (not just this select comment) you will find that they just re-engineered their site from Drupal to Django, meaning they would have had this in place within days, if not hours, of discovering the bottleneck.


Those Onion web guys must be pretty dense.

Pardon? I did not mean to offend you. I apologize.

Those videos are most likely being served from a CDN.

I can confirm that videos, articles, images, 404 pages are all served by a CDN. Our 404 pages were not cached by the CDN. Allowing them to be cached reduced the origin penetration rate substantially enough to amount to a 66% reduction in outgoing bandwidth over uncached 404s.

Edit: This is not to say that our 404s were not cached at our origin. Our precomputed 404 was cached and served out without a database hit on every connection, however this still invokes the regular expression engine for url pattern matching and taxes the machine's network IO resources.

As are the 404 pages, according to the comment.

Their 404 page is 142 KB... not huge, not small. Just sayin'.

Is all that HTML? Spiders generally only care about HTML -- they're not going to download the js and css and images and flash and whatever.

8 years ago, maybe. Today all search engine spiders pull down and execute javascript & css as well just so that they can index a real user's experience.

The better ones may, but perhaps not for all pages. Generally, when viewing logs I see search engine spiders make a single request, for only the html content and not js or css.

duh, I didn't think about that :) HTML is way smaller

I got 48 KB of HTML, 110 KB of stylesheets, 59 KB of images, and 136 KB of Javascript. I'm curious how you got your 142 KB number. Do you mean after it's been gzipped?

just what firebug is telling me

Most likely the transferred size, including gzipped content encoding.

I believe Jakob Nielsen wrote about this almost 12 years ago, "Fighting Linkrot" http://www.useit.com/alertbox/980614.html

I dunno. I'd be more tempted to put a useful page behind the URLs that are already being requested, or redirect them to some relevant content.

They do put a useful page up (links to articles, archive, and search), that's why it accounts for so much demand.

ah. Makes sense, thanks for clarifying.
