But you have to wonder how much of it really is spiders crawling old pages that contain those links, versus people actually clicking links on old pages. Probably far more of the former, and that wouldn't be hard to test for.
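For what it's worth, a rough way to test it would be to split the 404s in the access log by user agent. This is only a sketch: it assumes the server writes Combined Log Format, and the `BOT_MARKERS` substrings are a hypothetical heuristic of mine, not a real bot list. Spiders that fake a browser user agent would be counted as "other".

```python
import re
from collections import Counter

# Assumption: Combined Log Format, i.e.
# ... "METHOD path PROTO" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"\S+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical heuristic: substrings that commonly appear in crawler user agents.
BOT_MARKERS = ("bot", "spider", "crawl", "slurp")


def classify_404s(lines):
    """Count 404 responses by whether the user agent looks like a spider."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None or match.group("status") != "404":
            continue  # unparseable line, or not a 404
        agent = match.group("agent").lower()
        kind = "spider" if any(m in agent for m in BOT_MARKERS) else "other"
        counts[kind] += 1
    return counts
```

Feed it the lines of the log file (e.g. `classify_404s(open(path))`, where `path` is wherever your server keeps its access log) and compare the two counts.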
The linked post itself says it's mostly spiders following links from various places on the web to old/invalid URLs:
> And the biggest performance boost of all: caching 404s and sending Cache-Control headers to the CDN on 404. Upwards of 66% of our server time is spent on serving 404s from spiders crawling invalid urls and from urls that exist out in the wild from 6-10 years ago.
They're still serving the pages, but they're serving them off the CDN instead of off their main server.
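To make the mechanism concrete, here is a minimal sketch of an origin that marks its 404s as cacheable so a CDN in front of it can answer repeat requests itself. It is not the linked site's actual code: the WSGI app, the `KNOWN_PATHS` table, and the 300-second `NOT_FOUND_MAX_AGE` are all made-up values for illustration.

```python
# Hypothetical content table standing in for the real site.
KNOWN_PATHS = {"/": b"home page"}

# Assumed lifetime: how long a shared cache may reuse a 404, in seconds.
NOT_FOUND_MAX_AGE = 300


def app(environ, start_response):
    """Tiny WSGI app that sends Cache-Control on 404 responses."""
    path = environ.get("PATH_INFO", "/")
    body = KNOWN_PATHS.get(path)
    if body is None:
        headers = [
            ("Content-Type", "text/plain"),
            # Lets a shared cache (the CDN) store and reuse this 404.
            ("Cache-Control", f"public, max-age={NOT_FOUND_MAX_AGE}"),
        ]
        start_response("404 Not Found", headers)
        return [b"not found"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [body]
```

With a header like that, the first request for a dead URL still reaches the origin, but repeats within the lifetime can be served from the CDN's cache. The trade-off is that a URL that starts existing may keep returning 404 until the cached copy expires.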