
I always figured this was a performance issue related to sharding in distributed systems. Deep pagination is an expensive operation so most search clusters limit the number of visible results by default. That, in addition to an assumption that results beyond a certain number are unlikely to be useful - how many times have you found something on page 10 vs just reformulated your search query? - means that most applications just leave the default limit in place.

Returning a count of results, however (especially if it doesn't need to be precise), is a lot less expensive. Hence why Google is happy to give you the 22,000,000 number.




Yup, deep paging is a huge problem for distributed search systems. It's not just a Google thing, it's every search engine. Here is a section from Elasticsearch's documentation[0]:

"Avoid using from and size to page too deeply or request too many results at once. Search requests usually span multiple shards. Each shard must load its requested hits and the hits for any previous pages into memory. For deep pages or large sets of results, these operations can significantly increase memory and CPU usage, resulting in degraded performance or node failures."

[0] https://www.elastic.co/guide/en/elasticsearch/reference/curr...
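To make the quoted cost concrete, here is a rough sketch of a scatter-gather search in plain Python. It is not Elasticsearch code: the `search` and `shard_top_hits` helpers, the five shards, and the random scores are all made up for illustration. The only point is that every shard has to produce `from_ + size` candidates before the coordinator can cut out one page.

```python
import heapq
import random

def shard_top_hits(shard_hits, k):
    """Hypothetical shard-side step: return the k best (score, doc_id) pairs."""
    return heapq.nlargest(k, shard_hits)

def search(shards, from_, size):
    """Return one globally ranked page and the number of hits buffered for it."""
    k = from_ + size  # each shard must cover every earlier page as well
    candidates = []
    for shard in shards:
        candidates.extend(shard_top_hits(shard, k))
    candidates.sort(reverse=True)  # coordinator merges all shard results
    return candidates[from_:from_ + size], len(candidates)

# Assumed toy index: 5 shards with 10,000 scored documents each.
rng = random.Random(0)
shards = [[(rng.random(), f"s{s}-d{i}") for i in range(10_000)] for s in range(5)]

first_page, cost_first = search(shards, 0, 10)
deep_page, cost_deep = search(shards, 5_000, 10)
print(cost_first, cost_deep)  # 50 25050
```

Both calls return 10 hits, but the deep page makes the coordinator buffer and sort about 500 times as many candidates, and that gap keeps growing with the page offset and the shard count.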


> It's not just a Google thing, it's every search engine.

OK, I see now. I tried it on Bing and got similar results, with two small caveats. First, Bing gave me 861 accessible results, which is a base-2 order of magnitude greater than Google's. Second, Bing's total number isn't nearly as astronomical: it claims only 191K total results, not Google's 22M.

Could it be that Google has just indexed 100x more terms than Bing? Maybe, but my anecdotal use of both doesn't really seem to indicate that Bing is so deficient. For example, I tried a phrase that would come up with just a few results. "bioavailable turmeric extract formulation" (in quotes) yielded 24 results on Google (plus 4 ad results on top). On Bing I got 33 results, plus 2 ads on top. In fact, Bing looks more like "old Google" than new Google looks like old Google.


The number of results (the "match set") can differ even with the same document corpus, e.g. because of differences in tokenization, n-grams, language analysis, stop words, synonyms, etc.
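A toy illustration of that point, assuming two made-up analyzers over the same three documents. Neither is how Google or Bing actually analyzes text; `analyze_simple`, `analyze_aggressive`, and the tiny corpus are hypothetical.

```python
import re

# Assumed toy corpus, identical for both "engines".
DOCS = {
    1: "The bioavailable turmeric extract",
    2: "Turmeric extracts and formulations",
    3: "Bio-available curcumin formulation",
}
STOP_WORDS = {"the", "and"}

def analyze_simple(text):
    """Lowercase and split on non-letters; no stop words, no stemming."""
    return re.findall(r"[a-z]+", text.lower())

def analyze_aggressive(text):
    """Also drop stop words and strip a trailing 's' (a very crude stemmer)."""
    tokens = [t for t in analyze_simple(text) if t not in STOP_WORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def match_set(query, analyzer):
    """IDs of documents whose analyzed tokens contain every query token."""
    wanted = set(analyzer(query))
    return sorted(i for i, text in DOCS.items() if wanted <= set(analyzer(text)))

print(match_set("turmeric extract", analyze_simple))      # [1]
print(match_set("turmeric extract", analyze_aggressive))  # [1, 2]
```

Same corpus, same query, different match set: the crude stemmer turns "extracts" into "extract", so document 2 only matches under the second analyzer.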

The Elasticsearch documentation is actually a pretty good guide to how search engines work in general.



