
Seems like there's a bit of a missed opportunity there if a lot of people are looking for content that no longer exists. I wonder if they did something that broke old linked URLs.

Okay, you guys obviously aren't getting the whole picture. A minority of our links are from old content that I no longer know the URLs for. We redirect everything from the last 5 years. Stuff originally published 6-10 years ago could potentially be redirected, but none of it came from a database; it was all static HTML in its initial incarnation, and the redirects weren't maintained. This was before I took charge of the Onion's link management.

Spiders make up the vast majority of my 404s. They request URIs that in no sane world should exist. They request http://www.theonion.com/video/breaking-news-some-bullshit-ha...

They request http://www.theonion.com/articles/man-plans-special-weekend-t... even though that domain is explicitly set to http://media.theonion.com/ and is not a relative url in the page source.

I can't fix a broken spider and tell it not to request these links that do not even exist, but I still have to serve their 404s.

Edit: In other words, because of broken spiders that try to guess URLs, I have roughly 15-20x as many 404s as 200 article responses, and there's nothing that can be done about it except move every single resource that exists on my page into a CSS spritemap.

Edit #2: You can't see the URLs very clearly, but what's happening is that a spider finds a filename on our page, appends it to the end of the page's URL, and requests that to see if it exists.
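
A rough Python sketch of the guessing behavior described above. The example.com URLs are made up (the real ones are truncated in this thread), and `broken_spider_guess` is a hypothetical helper that only illustrates what such a spider appears to do; it is not taken from any actual crawler:

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def correct_resolve(page_url, href):
    # Standard resolution: an absolute href is returned unchanged.
    return urljoin(page_url, href)


def broken_spider_guess(page_url, href):
    # Assumed broken behavior: keep only the filename from the href
    # and glue it onto the end of the page's own URL.
    filename = posixpath.basename(urlsplit(href).path)
    return page_url.rstrip("/") + "/" + filename


# Hypothetical page and asset URLs, for illustration only.
page = "http://www.example.com/articles/some-article/"
href = "http://media.example.com/images/photo.jpg"

good = correct_resolve(page, href)
bad = broken_spider_guess(page, href)
print(good)
print(bad)
```

The second URL never existed on the site, so every such guess comes back as a 404.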

But you have to wonder how much of it really is spiders crawling over old pages with links or people actually clicking on links from old pages. Probably many more of the former, and that wouldn't be hard to test for.
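
One way this could be tested, as a sketch only: count 404s in the access log by whether the user agent looks like a crawler. The log format (Apache-style combined), the sample lines, and the `BOT_HINTS` list are all assumptions, and a spider that sends a browser-like user agent would be miscounted:

```python
import re
from collections import Counter

# Assumed Apache-style "combined" log format.
LOG_LINE = re.compile(
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Crude, hypothetical substrings that suggest a crawler.
BOT_HINTS = ("bot", "spider", "crawl")


def classify_404s(lines):
    """Count 404 responses as 'spider' or 'other' based on user agent."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or match.group("status") != "404":
            continue
        agent = match.group("agent").lower()
        kind = "spider" if any(hint in agent for hint in BOT_HINTS) else "other"
        counts[kind] += 1
    return counts


# Made-up sample log lines.
sample = [
    '1.2.3.4 - - [01/Jan/2010:00:00:00 +0000] "GET /articles/x/photo.jpg HTTP/1.1" 404 512 "-" "ExampleBot/1.0"',
    '1.2.3.4 - - [01/Jan/2010:00:00:01 +0000] "GET /articles/x/ HTTP/1.1" 200 9000 "-" "ExampleBot/1.0"',
    '5.6.7.8 - - [01/Jan/2010:00:00:02 +0000] "GET /old/page.html HTTP/1.1" 404 512 "http://example.org/" "Mozilla/5.0"',
]
result = classify_404s(sample)
```

Grouping the "other" bucket by referer would then show which old pages people are actually clicking through from.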

The linked post itself says it's mostly spiders following links from various places on the web to old/invalid URLs:

> And the biggest performance boost of all: caching 404s and sending Cache-Control headers to the CDN on 404. Upwards of 66% of our server time is spent on serving 404s from spiders crawling invalid urls and from urls that exist out in the wild from 6-10 years ago.

They're still serving the pages, but they're serving them off the CDN instead of off their main server.
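
For anyone wondering what that looks like in practice, here is a minimal, framework-agnostic WSGI sketch of sending a Cache-Control header on a 404 so a CDN can cache it. `KNOWN_PATHS`, the 300-second lifetime, and the `call` helper are assumptions for illustration; the post doesn't give the actual values or code:

```python
# Hypothetical set of valid paths; a real site would ask its router or database.
KNOWN_PATHS = {"/", "/articles/example-article/"}

# Assumed cache lifetime for 404 responses, in seconds.
NOT_FOUND_MAX_AGE = 300


def app(environ, start_response):
    """Tiny WSGI app that marks its 404 responses as cacheable."""
    path = environ.get("PATH_INFO", "/")
    if path in KNOWN_PATHS:
        body = b"ok"
        start_response("200 OK", [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    body = b"not found"
    start_response("404 Not Found", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
        # Lets a CDN or shared cache answer repeat requests for this bad URL.
        ("Cache-Control", "public, max-age=%d" % NOT_FOUND_MAX_AGE),
    ])
    return [body]


def call(path):
    """Invoke the app directly, without a server, and capture the response."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call("/video/some-guessed-file.jpg")
```

The trade-off is that a URL that starts existing later keeps returning the cached 404 until the lifetime expires, so the max-age is a judgment call.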

Still, I wonder what Google does with page rank that goes to a 404 vs a permanent redirect.
