Hacker News
Google indexes Stack Overflow at a rate of 10 requests per second (stackoverflow.com)
196 points by mwsherman on Mar 29, 2011 | 57 comments



This isn't exactly abnormal. SO is a big site with a lot of fresh content; I'm guessing Google indexes many thousands of sites at that rate. What's surprising to me is that it surprises them.


Indeed. There are only 86K seconds in a day, or 31.5 million seconds in a year. Even if Google refreshed a page only a couple of times per year (absurdly low by their freshness standards), tens of millions of pages would already mean about a crawl per second; at realistic refresh rates, a few million pages in the index is enough that you have to accept multiple crawls per second from the bots on average.
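That back-of-envelope arithmetic can be sketched out directly (the 6-million-page and weekly-refresh figures below are illustrative guesses, not SO's real numbers):

```python
# Back-of-envelope: N indexed pages, each refreshed k times per year,
# implies an average sustained crawl rate of N*k / seconds-per-year.
SECONDS_PER_DAY = 86_400                   # the ~86K figure above
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY   # ~31.5 million

def avg_crawls_per_second(pages: int, refreshes_per_year: float) -> float:
    return pages * refreshes_per_year / SECONDS_PER_YEAR

# Illustrative: ~6 million pages refreshed roughly weekly lands
# right around the 10 req/s that SO is seeing.
print(round(avg_crawls_per_second(6_000_000, 52), 1))  # prints 9.9
```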


Well, how much content needs to be scraped depends on how often it's being updated. Presumably only a few thousand pages on Stack Overflow actually change on any given day, so executing a million-plus page views per day seems a bit overkill.

Google tends to reduce its crawl frequency for content that changes less frequently. Perhaps in SO's case, it struggles to identify recently changed content without crawling the entire site.


SO gets well over 10k completely new posts (Q&As) on a typical day.

A few thousand older posts are edited a day.

Throwing comments into the mix more than doubles the number of "new things", and of course every question, answer, comment, or edit is displayed on multiple pages (generally, on the owning user's page).

tl;dr - there's a lot more page churn than you might expect.


Here are our 10/s crawl stats too; thought I'd share for contrast, though we get about half the page crawls per day. Note the page load times :) Oops, that's a recent regression due to building an internal cloud. So you're right, not out of the ordinary.

http://www.pinkbike.com/photo/6365090/


10 requests per second doesn't sound especially high. That's 36,000 pages per hour, which, while big, doesn't sound too high for a site as popular as SO (Alexa puts it at the 137th most popular site; granted, Alexa isn't the most accurate).


This is addressed in the post - apparently it's hitting pages that haven't been accessed in a while, starting background tasks - but it still seems odd to me. I'd have expected a huge amount of Stack Overflow's traffic to come from long tail searches, which should be basically the same thing. Excerpt for the lazy:

"and when Google hits thousands of pages in a few minutes, that can kick off a lot of background work, such as rebuilding related questions. Not expensive by itself, but when multiplied by a hundred at once.. can be quite painful."


The rules are that you can't send google different page content than regular browsers, but there's no reason they have to run all the background processes on googlebot requests -- can't they just send it the most recent cached version?


Not a bad idea, but seems like it would be tricky to get right. You do kinda want Google to have the most recent version of a page, all other things being equal.


Agreed, there's a sweet spot somewhere though, and it might not be the same for googlebot as for a regular viewer.


Why not put the cached content in the page by default, then do an update via AJAX only in the case where the cached version is old? That way it's not triggered for crawlers. It's probably secondary content anyway.
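A minimal sketch of that serving strategy, with hypothetical `cache`, `queue`, and `render` stand-ins (not SO's actual code): crawlers always get the cached copy, and the expensive background rebuild only fires when a regular visitor hits a stale page.

```python
# Sketch: always serve the cached page; only enqueue the expensive
# background refresh (e.g. rebuilding related questions) when a
# non-crawler visitor hits a stale page.
import time

CACHE_TTL = 300  # seconds before a cached page counts as stale

def handle_request(page_id, user_agent, cache, queue, render):
    entry = cache.get(page_id)
    if entry is None:
        html = render(page_id)             # cold cache: render once
        cache[page_id] = (time.time(), html)
        return html
    cached_at, html = entry
    stale = time.time() - cached_at > CACHE_TTL
    is_bot = "Googlebot" in user_agent
    if stale and not is_bot:
        queue.append(page_id)              # refresh in the background
    return html                            # crawlers always get cache
```

Googlebot still sees real page content, just never a version fresher than what the last human visitor triggered.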


Stack Overflow is almost at 10 questions asked per second, so where does the surprise come from?

Honestly, I'd have guessed a lot higher number.


A few days ago I was searching for some SVG radial background gradient something-or-other that I wasn't even sure existed, and the top hit was an SO question that had been asked 7 hours before. Answered my question, too. I was impressed.


What's funny is when someone asks a question and you think "Oh, I'll bet I could answer this with a little googling." And the top result turns out to be the question you're trying to answer.


What's frustrating is Googling a question and finding forum threads that tell the original questioner to use Google while refusing to answer the question.


I was trying to look up an error[1] regarding Padrino the other day and someone had asked about the same problem I was having. The exact question I typed in had been asked on SO 6 hours before I encountered it and then deleted from SO, but Google had cached it.

I was sad the top result led to a deleted question without an answer, but was impressed an exact match to my question was cached 2 minutes after it was asked and was the top result (hopefully it will be dropped from results soon since it leads to SO's 404 page now).

[1] http://www.google.com/search?q=getting-nomethoderror-for-pad...


Delicious used to get slammed by crawlers, too.

10 qps isn't that bad. I remember some ad network launched using delicious widgets as their text ads platform that hit us for 25 qps sustained.


I remember having a blog that was crushed by a sudden, unthrottled interest from Baidu.


SO is the best case study of a startup scaling with .Net. Whenever I read their infrastructure stuff I cringe I am not in their team.


These articles may be insightful to you

http://highscalability.com/blog/2009/8/5/stack-overflow-arch... http://highscalability.com/blog/2011/3/3/stack-overflow-arch...

My takeaways are: generally you can make anything scale; cache like hell; and I personally don't see enough value in .NET to justify the licensing costs, either to roll it out initially or over the long term.


They are part of the BizSpark program, so they got free licenses. They don't even have much hardware.

The important part isn't plain caching. You can just create HTML copies of a page and serve them through a fast server; that's a very easy cache.

The difficult part is smart caching: caches getting updated when they need to be, and changes persisted properly to the DB. These guys have nailed it, in my opinion.

I often look at what these guys do when optimizing my own sites, and have been following them since their beta phase.


One point on the cost issue: to an individual, a Windows server license looks like a lot of money, but to a business it really doesn't matter. Every server my company buys has many paying customers tied to it, so the license cost barely registers for us.


Besides the monetary cost, there's the opportunity cost of dealing with licenses in the first place. Part is compliance (Does your company have current licenses that cover every bit of software on every virtual machine on every developer laptop? Can you prove it?) and part is procurement (Do you get the plan with free upgrades, or do you buy new? Will you need enough licenses over the next two years that you should get a site-license or is it cheaper to stick with single user licenses?).


The cost of the Windows OS itself is not the real cost. SQL Server gets costly as you grow out, but not nearly as bad as Oracle.

http://www.microsoft.com/sqlserver/2008/en/us/pricing.aspx

Yes, BizSpark makes it free, but that basically locks you into the platform long enough for you to rely on it before you start paying those prices.


Correction: it doesn't matter for some workloads.

This old classic applies to business models as well as software: http://www.joelonsoftware.com/articles/FiveWorlds.html


It's also a great case study of a hybrid model, where .Net application servers are complemented with open source components such as Redis and HAProxy.


Exactly.

Initially they weren't using anything OSS but moved to HAProxy and Redis. They even help update HAProxy from time to time.

http://nosql.mypopescu.com/post/2669915777/powered-by-redis-...


No mention of the "L" word in this thread. Stack(Overflow|Exchange) is running Linux! (and windows of course).


I always thought that, at scale, SO versus Facebook or Twitter was an apples-to-oranges comparison. Not because of the amount of load, but because of the type of load.

For whatever reason, I have it in my head that the difficulty Facebook and Twitter (and even Digg) face in scaling is the social aspect of their sites. That's what requires custom software (FlockDB and Cassandra) and a lot of machines.

Perhaps I need to use SO again, but back in the day this social aspect of SO didn't exist. That means their scaling challenges are far more traditional, like Slashdot's: the 99%-cacheable-reads type of thing.

If I'm right, SO is really just a case study that, depending on what they are doing, some startups will be able to scale with .NET.


Scaling problems are not unique to .Net, you have them with Rails, PHP and even C++.

As long as you are able to measure where your bottlenecks are and address them you are fine.

I had a ball working on these performance issues and tuning down render times for question/show to 50ms (a totally dynamic page).

I guess not all developers get a kick out of the same things.


The interesting thing to know would be how much more efficient a push-based indexing approach would be instead of the current pull-based model. If frequently updated sites could push change notifications to google it would solve this problem. However, I'm not sure how google could trust such sites not to overload its own servers.


Sitemaps?


Yeah, Google supports the sitemaps standard but that doesn't really cater for content as dynamic as Stack Overflow's. The last-updated format is a day rather than a timestamp, for example, making it useless for very-frequently updated content.

http://www.sitemaps.org/protocol.php


This is true, but the big win is being able to cull out a list of URLs that have not been updated and thus do not need recrawling.


My question would be: does SO create content at that rate? It seems to me that Google need not index your site faster than you're creating things for it to see. Is there a way to automatically vary how often Google indexes you with how often your users create content?


Google is constantly re-indexing old pages, so the rate of new content creation isn't that big a factor in the crawl rate (though I imagine it does cause Google to ramp up their crawling rates if they aren't already running at the maximum).


Yes, google webmaster tools include a multiposition slider for crawl rate. It's on the screenshot at the original answer.


Oops. Didn't see it there. Sorry.


This is a problem we had at a large social network I used to work with. Launching a directory of users primarily for Google's consumption was difficult to scale given the huge size of our database. The solution for us was node.js.


Why doesn't Stack Overflow make a push API that feeds Google changes to pages A. when they know they happen and B. when they aren't under peak load?

(edit: besides the obvious limitations that Google may not have an API that is exposed to the public)


I imagine the ONLY reason is that Google don't have such an API that is exposed to the public.


There is an API via which you can describe your URLs (it's called a sitemap), and you can ping Google when your sitemap's content changes. You can have multiple sitemaps and ping only your changes. More on www.sitemaps.org.

But Google reserves the right to crawl non-sitemaps URLs, for obvious reasons. It would be quite a bad decision for them to restrict their crawls only to API-provided URLs.


We try to give search engines hints with the update frequency in our sitemaps.

We re-build our sitemaps nightly and make sure that new or recently-updated content is listed with an update frequency of "daily" or "weekly" and all other content pages are listed as being updated "monthly."

To be honest, I've never measured if it works, but it can't hurt.
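A nightly rebuild like the parent describes could look something like this sketch (the age thresholds for each bucket are assumptions, not the parent's actual rules):

```python
# Hypothetical nightly sitemap rebuild: bucket pages into "daily",
# "weekly", or "monthly" changefreq based on how recently they changed.
from datetime import date, timedelta

def changefreq(last_modified: date, today: date) -> str:
    age = today - last_modified
    if age <= timedelta(days=1):
        return "daily"
    if age <= timedelta(days=7):
        return "weekly"
    return "monthly"

def url_entry(loc: str, last_modified: date, today: date) -> str:
    return (
        "<url>"
        f"<loc>{loc}</loc>"
        f"<lastmod>{last_modified.isoformat()}</lastmod>"
        f"<changefreq>{changefreq(last_modified, today)}</changefreq>"
        "</url>"
    )
```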


How do you ping only changes? Don't you have to resubmit the whole sitemap?


You can have multiple sitemaps. You can ping just 1 sitemap containing only the links that you want to notify. You can use the optional <lastmod> tag to indicate a URL's last change date.
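The ping itself is just an HTTP GET against the search engine's ping endpoint with your sitemap's URL encoded as a query parameter (a sketch; the Google endpoint shown was the conventional one at the time):

```python
# Build the conventional sitemaps.org-style ping URL that tells a
# search engine that one specific sitemap has changed.
from urllib.parse import quote

PING_BASE = "http://www.google.com/ping?sitemap="

def ping_url(sitemap_url: str) -> str:
    # safe="" percent-encodes every reserved character in the URL
    return PING_BASE + quote(sitemap_url, safe="")
```

Pinging a small "recent changes" sitemap this way avoids resubmitting the full URL list.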


There are services in Webmaster Tools for pushing sitemaps, no? Which would not be quite sufficient for this--Google still has to scrape the respective pages in the sitemap--but it's about as close as you are going to get.


Wonder if anyone noticed how crazy their engineers are:

http://blog.serverfault.com/post/performance-tuning-intel-ni...


I'd substitute "crazy" with "inexperienced" for this particular anecdote.

The Intel e1000 family is the safe default for NICs, and using something else without an explicit reason is just not a good idea.


What's interesting to me is that the crawlers suck on average 5GB of data per day (according to their graphs) :)


It's either that or let most of their content be uncrawled and show up on efreedom.com instead.


Isn't this against the spirit of PageRank's neutrality?


No?


Because that's what they asked Google to do in their Webmaster tools.


That screenshot was to point out that when changing from "automatic" to "custom", there was no difference, i.e. the setting Google's automatic mode had settled on was already "full pelt".


I was surprised how much load web crawlers (Google, Bing, Yahoo, etc.) imposed on us at PatientsLikeMe, the majority of it from Google. The "intelligent" rate limiting results in a very high crawl rate for many sites.

We added additional caching and manually lowered the crawl rate to address this at PatientsLikeMe.


Care to add some real numbers for PatientsLikeMe? Your post reads like namedropping for PatientsLikeMe.


So what? So does every other SEO-optimized page. (But how SO dealt with it was interesting nonetheless.)



