Google tends to reduce its crawl frequency for content that changes less frequently. Perhaps in SO's case, it struggles to identify recently changed content without crawling the entire site.
A few thousand older posts are edited a day.
Throwing comments into the mix more than doubles the number of "new things", and of course every question, answer, comment, or edit is displayed on multiple pages (generally including the owning user's page).
tl;dr - there's a lot more page churn than you might expect.
"and when Google hits thousands of pages in a few minutes, that can kick off a lot of background work, such as rebuilding related questions. Not expensive by itself, but when multiplied by a hundred at once.. can be quite painful."
Honestly, I'd have guessed a lot higher number.
I was sad that the top result led to a deleted question without an answer, but I was impressed that an exact match to my question was cached two minutes after it was asked and was the top result (hopefully it will be dropped from results soon, since it now leads to SO's 404 page).
10 qps isn't that bad. I remember some ad network that launched using Delicious widgets as their text-ads platform and hit us with a sustained 25 qps.
My two takeaways are: generally you can make anything scale if you cache like hell, and I personally don't see enough value in .NET to justify the licensing costs, either to roll it out initially or over the long term.
The important part isn't caching itself. You can just create HTML copies of a page and serve them through a fast server; that's a very easy cache.
The difficult part is smart caching: caches getting updated when they need to be and persisted properly to the DB. These guys have nailed it, in my opinion.
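To make the distinction concrete, here's a minimal sketch (my own, not SO's code) of the difference between dumping rendered HTML somewhere and invalidating the cache when the underlying record changes; the post shape and the load_post/save_post callables are made up for illustration.

```python
# Hypothetical in-process cache keyed by post id; a real setup would be
# multi-level (Redis, output caching, etc.), this only illustrates the
# invalidate-on-write idea.
_page_cache = {}

def render_page(post):
    # Stand-in for an expensive template render.
    return f"<html><body><h1>{post['title']}</h1><p>{post['body']}</p></body></html>"

def get_page(post_id, load_post):
    """'Dumb' caching: render once, then serve the saved copy forever."""
    if post_id not in _page_cache:
        _page_cache[post_id] = render_page(load_post(post_id))
    return _page_cache[post_id]

def update_post(post_id, new_body, save_post):
    """'Smart' caching: persist to the DB first, then drop the stale copy
    so the next read re-renders from fresh data."""
    save_post(post_id, new_body)      # write through to the database
    _page_cache.pop(post_id, None)    # invalidate the cached HTML
```

The hard part in a real system is knowing every cached page a single write touches (the question page, the user's page, tag pages, related-questions lists), which is exactly the churn described upthread.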
I often look up what these guys are doing to optimize my own sites, and I've been following them since their beta phase.
Yes, BizSpark makes it free, but that basically locks you into the platform long enough for you to rely on it before you start paying those prices.
This old classic applies to business models as well as software:
Initially they weren't using any OSS, but they moved to HAProxy and Redis. They even help update HAProxy from time to time.
For whatever reason, I have it in my head that the difficulty Facebook and Twitter (and even Digg) face in scaling comes from the social aspects of their sites. Those are the things that require custom software (FlockDB and Cassandra) and a lot of machines.
Perhaps I need to use SO again, but back in the day this social aspect of SO didn't exist. That means their scaling challenges are far more traditional, much like Slashdot's: 99% cacheable reads.
If I'm right, SO is really just a case study that, depending on what they are doing, some startups will be able to scale with .NET.
As long as you are able to measure where your bottlenecks are and address them you are fine.
I had a ball working on these performance issues and getting render times for question/show (a totally dynamic page) down to 50ms.
I guess not all developers get a kick out of the same things.
(edit: besides the obvious limitation that Google may not expose such an API to the public)
But Google reserves the right to crawl non-sitemap URLs, for obvious reasons. It would be quite a bad decision for them to restrict their crawls to only API-provided URLs.
We rebuild our sitemaps nightly and make sure that new or recently updated content is listed with an update frequency of "daily" or "weekly", while all other content pages are listed as updated "monthly."
To be honest, I've never measured if it works, but it can't hurt.
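For what it's worth, a nightly sitemap build along those lines might look roughly like this; a sketch under my own assumptions, not PatientsLikeMe's actual job, and the URLs, thresholds, and page list are invented.

```python
from datetime import datetime, timedelta
from xml.sax.saxutils import escape

def changefreq(last_updated, now=None):
    """Pick a <changefreq> hint based on how recently the page changed."""
    now = now or datetime.utcnow()
    age = now - last_updated
    if age <= timedelta(days=1):
        return "daily"
    if age <= timedelta(days=7):
        return "weekly"
    return "monthly"

def build_sitemap(pages):
    """pages: iterable of (url, last_updated) tuples -> sitemap XML string."""
    entries = []
    for url, last_updated in pages:
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(url)}</loc>\n"
            f"    <lastmod>{last_updated.date().isoformat()}</lastmod>\n"
            f"    <changefreq>{changefreq(last_updated)}</changefreq>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>"
    )

if __name__ == "__main__":
    # Example: one fresh page, one stale page.
    pages = [
        ("https://example.com/questions/1", datetime.utcnow()),
        ("https://example.com/about", datetime.utcnow() - timedelta(days=90)),
    ]
    print(build_sitemap(pages))
```

Whether crawlers actually honor changefreq is an open question (as noted, it can't hurt), but lastmod at least gives them a cheap signal about what changed since the last visit.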
The Intel e1000 family is the safe default for NICs; using something else without an explicit reason is just not a good idea.
We added additional caching and manually lowered the crawl rate to address this at PatientsLikeMe.