
The Onion cut 66% of their bandwidth by upstream caching 404s - atlbeer
http://www.reddit.com/r/django/comments/bhvhz/the_onion_uses_django_and_why_it_matters_to_us/c0mvow7
======
prodigal_erik
> from urls that exist out in the wild from 6-10 years ago

From a glance at their archive, The Onion is still hosting articles from ten
years ago. What in the world makes webmasters think it's okay to fail to
maintain their mappings between URLs and live content?

Cool URIs don't change: <http://www.w3.org/Provider/Style/URI>

~~~
john_onion
Please see my other comments in this thread. You're making a judgement about
what is happening without knowing all of the information involved. We
diligently redirected as much as possible, and I'm currently managing more
than 50,000 301s.
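A setup on that scale usually comes down to a lookup table consulted before
the request is allowed to fall through to a 404. A minimal sketch, assuming a
plain in-memory dict (the entries and the URL shapes here are invented, not
The Onion's actual mappings):

```python
# Hypothetical legacy-URL redirect table, checked before a 404 fires.
# In practice this would likely live in a database or the web server
# config rather than a Python dict, and the paths below are made up.
LEGACY_REDIRECTS = {
    "/content/node/12345": "/articles/area-man-12345",
    "/onion3805/news1.html": "/issues/38-05/",
}

def resolve_legacy(path):
    """Return (status, location) for a known legacy path, else None."""
    target = LEGACY_REDIRECTS.get(path)
    if target:
        return (301, target)
    return None
```

Anything the table resolves gets a permanent redirect; everything else
proceeds to normal URL resolution and, ultimately, the 404 handler.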

~~~
prodigal_erik
It sounds like you've done what you can after someone else's negligent work;
thanks for that.

~~~
john_onion
If it makes you feel any better, there are approximately 5000 404s in links
"in the wild" (according to google crawl error stats) that I have the
capability of improving the quality of the redirect for. I may not be able to
get 100% accuracy on the redirect, but I can get the user to an issue page of
the archive where they can click the story they intended to see. The problem
is that there is no metadata associated with those links to get a better idea
of what story it was. news1.html isn't terribly descriptive.

To put this in perspective, I get approximately 40000 unique 404s from spiders
every day for things that simply don't exist and never did.
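The "send them to an issue page of the archive" fallback described above could
be sketched like this: pull whatever structure the old path does have (an
issue identifier, say) and redirect there, even when the exact story is
unrecoverable. The URL patterns below are entirely hypothetical, since the
comment gives only `news1.html` as an example:

```python
import re

# Hypothetical fallback for metadata-poor legacy links such as
# /onion3805/news1.html: the filename says nothing about the story,
# but the path segment may encode a volume/issue we can redirect to.
# Both URL shapes here are invented for illustration.
ISSUE_RE = re.compile(r"^/onion(\d{2})(\d{2})/")

def archive_fallback(path):
    """Map a legacy path to its issue archive page, or None."""
    m = ISSUE_RE.match(path)
    if m:
        volume, issue = m.groups()
        return "/issues/%s-%s/" % (volume, issue)
    return None
```

The redirect is imprecise (it lands on the issue, not the story), but it turns
a dead end into a page where the reader can click through to what they wanted.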

------
randomstring
OK, so what they are saying is that The Onion is throwing away 66% of its
traffic by serving up 404s instead of serving up page views. Wouldn't they 1)
have a better user experience by redirecting to the correct URL, 2) have more
page views, and most importantly 3) make more money off of ads?

The Internet is trying to give the gift of free traffic and The Onion is
saying: "no thanks." Most sites have to pay for traffic, and wouldn't be
throwing it away.

Way to play it like Big Media. What's next? A pay wall?

~~~
john_onion
Why are people so angry on the internet?

~~~
pg
I'm sorry. Comments here used to be a lot more thoughtful (in both senses).
I'm planning on working on this problem soon.

~~~
blasdel
This particular link is a perfect troll — the headline stabs at your WTF
button, then the comment linked to has zero context clarifying the reasoning
but tons of technical details that fuel the nerd rage — people were so trolled
by the perceived linkrot and cache problems that they didn't even bite at the
obvious SQL & ORM bait!

I don't think you can fix this with code, much less your usual tactic of
adding heuristics to moderate/voteweight/ban.

------
mattmaroon
Seems like there's a bit of a missed opportunity there if a lot of people are
looking for content that no longer exists. I wonder if they did something that
broke old linked URLs or something.

~~~
frederickcook
But you have to wonder how much of it really is spiders crawling over old
pages with links or people actually clicking on links from old pages. Probably
many more of the former, and that wouldn't be hard to test for.

~~~
grayrest
The linked post itself says it's mostly spiders following links from various
places on the web to old/invalid URLs:

> And the biggest performance boost of all: caching 404s and sending Cache-
> Control headers to the CDN on 404. Upwards of 66% of our server time is
> spent on serving 404s from spiders crawling invalid urls and from urls that
> exist out in the wild from 6-10 years ago.

They're still serving the pages, but they're serving them off CDN instead of
serving them off their main server.
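The mechanics of the quoted fix amount to attaching cache headers to the 404
response so the CDN will hold onto it instead of passing every miss back to
the origin. A rough sketch of the idea, with standard HTTP header names but
an arbitrary one-hour max-age (the post doesn't say what TTL they chose):

```python
# Sketch of a CDN-cacheable 404 response. "Cache-Control: public"
# plus a max-age lets the CDN reuse the response for repeat misses,
# so spiders hammering the same dead URL never reach the origin.
# The one-hour TTL and the body are illustrative choices only.
NOT_FOUND_BODY = "<html><body><h1>404 Not Found</h1></body></html>"

def make_404_response(max_age=3600):
    """Build (headers, body) for a 404 that downstream caches may keep."""
    headers = {
        "Status": "404 Not Found",
        "Content-Type": "text/html",
        "Cache-Control": "public, max-age=%d" % max_age,
    }
    return headers, NOT_FOUND_BODY
```

Without the `Cache-Control` header, most CDNs treat error responses as
uncacheable and forward every request upstream, which is exactly the origin
load being described.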

~~~
dminor
Still, I wonder what Google does with page rank that goes to a 404 vs a
permanent redirect.

------
jeff18
This seems more like a DailyWTF than a sweet optimization to be reminiscing
about. I'd love to hear more details about this. I mean, a single video is
about the equivalent of tens of thousands of 404 pages.

I'm not sure which is more mind-boggling... Spending 66% of their _bandwidth_
and 50% of their CPU on serving these trivially cacheable pages or the fact
that they didn't correct the problem when they were serving more than 5%.

~~~
showerst
Those videos are most likely being served from a CDN.

~~~
john_onion
I can confirm that videos, articles, images, and 404 pages are all served by a
CDN. Our 404 pages were not cached by the CDN. Allowing them to be cached
reduced the origin penetration rate substantially enough to amount to a 66%
reduction in outgoing bandwidth over uncached 404s.

Edit: This is not to say that our 404s were not cached at our origin. Our
precomputed 404 was cached and served out without a database hit on every
connection; however, this still invokes the regular expression engine for URL
pattern matching and taxes the machine's network IO resources.

------
latch
Their 404 page is 142KB...not huge, not small. Just sayin'

~~~
joevandyk
Is all that HTML? Spiders generally only care about HTML -- they're not going
to download the js and css and images and flash and whatever.

~~~
bluesmoon
8 years ago, maybe. Today all search engine spiders pull down and execute
javascript & css as well just so that they can index a real user's experience.

~~~
code_duck
The better ones may, but perhaps not for all pages. Generally, when viewing
logs I see search engine spiders make a single request, for only the HTML
content and not JS or CSS.

------
Chirael
I believe Jakob Nielsen wrote about this almost 12 years ago, "Fighting
Linkrot" <http://www.useit.com/alertbox/980614.html>

------
tjpick
I dunno. I'd be more tempted to put a useful page behind the URLs that are
already being requested, or redirect them to some relevant content.

~~~
johngalt
They do put a useful page up (links to articles, archive, and search), that's
why it accounts for so much demand.

~~~
tjpick
ah. Makes sense, thanks for clarifying.

