> from urls that exist out in the wild from 6-10 years ago
From a glance at their archive, The Onion is still hosting articles from ten years ago. What in the world makes webmasters think it's okay to fail to maintain their mappings between URLs and live content?
Please see my other comments in this thread. You're making a judgement about what is happening without knowing all of the information involved in the situation. We diligently redirected as much as possible, and I'm currently managing more than 50000 301s.
I feel your pain. I'm assuming you've already got something that works for you, but on the assumption that any problem worth mentioning is probably going to bite someone here eventually:
1) Don't try to manage 301s in server configs. You'll go insane.
2) Make a simple table mapping old URLs to new URLs. You can update this when you make a change that breaks URLs.
3) If you feel like you want to 404 or 501, and your server is not overloaded (if it is, pfft, let that spider eat a 404 and save resources for real people), check memcached to see if you've resolved this recently. If not, check the table and set the cache accordingly. You can then return control to the webserver to serve the cached error page or, alternatively, send the 301 to the proper page.
4) Give the end-user a quick page which returns what URLs are consistently getting 404ed and asks for a best page to 301 them to. Since that page is behind an admin login, you can make it as expensive as you darn well please -- for example, grepping the heck out of a large log file, going row by row, and searching for a "best guess". You can let the users approve them with one click. (Last time I wrote one I put a little forecast of how much of the marketing budget was saved due to the users' diligence in assigning 301s. Five minutes of work, got me more pats on the back than most projects that take 6 months. Apparently the admin staff was fighting over who got to do the URL corrections every day.)
Your users will love you, your database load will be low, your SEO will be awesome, and your crusty ol' sysadmin will not tear out your intestines and use them for a necklace the 47th time you ask for him to add a 301 to the config file.
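For anyone who wants to see the shape of steps 2 through 4, here is a minimal sketch in Python. It is not what I run in production, just an illustration under some assumptions: the REDIRECTS dict stands in for the database table, TTLCache is an in-process stand-in for memcached, and the paths, the MISS sentinel, and the function names resolve and top_misses are all made up for the example.

```python
import time
from collections import Counter

# Hypothetical old-URL -> new-URL table (step 2). In practice this would be
# a database table that gets updated whenever a change breaks URLs.
REDIRECTS = {
    "/old/story.html": "/articles/story",
    "/old/about.php": "/about",
}

MISS = "__404__"  # sentinel so "no redirect exists" is cached as well


class TTLCache:
    """In-process stand-in for memcached: get/set with an expiry time."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired, force a fresh table lookup
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def resolve(path, table, cache, misses, ttl=3600):
    """Step 3: called only when the site is about to serve an error.

    Returns (301, new_url) if a mapping exists, else (404, None).
    """
    cached = cache.get(path)
    if cached is None:
        # Not resolved recently: check the table and remember the answer.
        cached = table.get(path, MISS)
        cache.set(path, cached, ttl)
    if cached == MISS:
        misses[path] += 1  # raw material for the admin page in step 4
        return 404, None
    return 301, cached


def top_misses(misses, n=10):
    """Step 4: the URLs that 404 most often, for a human to assign 301s."""
    return misses.most_common(n)


if __name__ == "__main__":
    cache, misses = TTLCache(), Counter()
    print(resolve("/old/story.html", REDIRECTS, cache, misses))
    print(resolve("/never/existed", REDIRECTS, cache, misses))
    print(top_misses(misses))
```

Caching the misses matters as much as caching the hits: most of the traffic hitting this path is for URLs that will never resolve, and you do not want each of those to cost a table lookup.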
If it makes you feel any better, there are approximately 5000 404s in links "in the wild" (according to Google crawl error stats) for which I can improve the quality of the redirect. I may not be able to get 100% accuracy on the redirect, but I can get the user to an issue page of the archive where they can click the story they intended to see. The problem is that there is no metadata associated with those links to get a better idea of what story it was. news1.html isn't terribly descriptive.
To put this in perspective, I get approximately 40000 unique 404s from spiders every day for things that simply don't exist and never did.
Sure, in an ideal world every URL that was ever valid in the last 10 years would still work. Even the ones about a contest that ended in 2002 and the download page for your ActiveX-based Push application.
In the real world, those tend to break somewhere around the second or third CMS upgrade.
Leaving archive.org as the only way to reach your older content because you're running out of disk is one thing. But for content that you actually want online, "new users have no existing URLs" is such an obviously bad assumption that I have trouble believing any CMS author could fail to support them.
"What in the world makes webmasters think it's okay to fail to maintain their mappings between URLs and live content?"
"I have trouble believing any CMS author could fail to support them"
It almost sounds like you are assuming that The Onion has some sort of duty to maintain ancient URIs. IMO, they should maintain their redirects as long as they deem it necessary - there is a cost in doing so after all. Sure, the cost of maintaining outdated URIs may or may not outweigh the cost of losing visitors, but that is for them to decide, not the peanut gallery.
I'm sure I'm not the only one who isn't so entitled that I'd refuse to forgive The Onion if the links from my 2001 bookmark.htm export fail to land on the original content.
The importance of a URL is how heavily it's used by the web, not how new it is. The cost of maintaining a URL is far less than the cost of maintaining the content behind it. If you aren't throwing your work down the memory hole, breaking every link to it damages the web and makes far more work for everyone else than you are saving yourself.
Or self-interest. If I were a beancounter, and they told me that they didn't bother serving new ads with old content, because it was too much hassle to map the URLs, I'd be asking for cost/benefit figures ...