
Ask HN: Why Google stops printing results after page #1,000 - guessmyname
For a "hello world" query on Google without filtering [1] I get around 60,300,000 results at 10 entries per page. Logic tells me that makes a total of 6,030,000 pages. Google lets you pass an additional parameter "start" with a numeric value to choose the page I want to see, where page 10 corresponds to start 90, page 20 to start 190, and so on, as seen in this table:

    Page | Start | (Page - 1) * Max
    -----|-------|-----------------
      10 |   90  | (  10 - 1) * 10
      20 |  190  | (  20 - 1) * 10
      30 |  290  | (  30 - 1) * 10
      40 |  390  | (  40 - 1) * 10
      50 |  490  | (  50 - 1) * 10
      60 |  590  | (  60 - 1) * 10
      70 |  690  | (  70 - 1) * 10
      80 |  790  | (  80 - 1) * 10
      90 |  890  | (  90 - 1) * 10
     100 |  990  | ( 100 - 1) * 10

This means that going to "start=990" should return the set of 10 results from page #100 [2]. However, it returns an error message saying "Your search - hello world - did not match any documents." and the pagination ends at page #77 with a link pointing to "start=759". If you go to that page, the pagination stops showing the "next" button, but the subtitle still states that there are "60,300,000 results".

Setting aside the joke that "no one goes past page #2 on Google": what is your theory about the missing pagination buttons? Why do they not allow inspecting the results past 1,000 [3]? In that case Google shows this message: "Sorry, Google does not serve more than 1000 results for any query. (You asked for results starting from 1000.)" Why not simply print "More than 770 results", considering that 77 is the last page you can inspect, at least with the "hello world" query?

[1] https://www.google.com/search?q=hello+world&filter=0

[2] https://www.google.com/search?q=hello+world&filter=0&start=990

[3] https://www.google.com/search?filter=0&q=hello+world&start=1000
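
For the curious, here is the same "start" arithmetic as a quick Python sketch; the helper name is purely illustrative:

    from urllib.parse import urlencode

    def search_url(query, page, per_page=10):
        start = (page - 1) * per_page   # page 100 -> start 990
        return ("https://www.google.com/search?"
                + urlencode({"q": query, "filter": 0, "start": start}))

    print(search_url("hello world", 100))
    # https://www.google.com/search?q=hello+world&filter=0&start=990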
======
lacker
To understand this you need to understand a bit about the way a search engine
works. (I used to work at Google on the search engine but I am not revealing
anything besides information retrieval 101 here.)

Roughly, there is a database of the entire internet that is broken into 1000
"shards". The database has a really complex indexing scheme. You can think of
each shard as having one "worker" process, and then there are some "merger"
processes that combine the results from workers.

So there are basically four phases: retrieval, scoring, merging, and display.
In the "retrieval" phase, each worker uses the complicated indexes to retrieve
a bunch of documents that are relevant to the query. In the "scoring" phase,
each worker puts a specific score on each document, using more information
about the document to get a more accurate score than the retrieval phase did.
Then, in the "merging" phase, the results from all the shards are combined
by the mergers and you get your eventual top list.

When you ask for the 10th page of results, it actually only affects the final
"display" step. Everything before that is just producing a list of the top
1000. That way you can reuse caches in all the expensive places.
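
A toy sketch of that flow, assuming a crude count-based scorer (this is illustrative, not Google's actual code):

    import heapq

    NUM_RESULTS_CACHED = 1000   # everything upstream produces this list
    PER_SHARD_LIMIT = 5         # each shard only forwards its best few

    def score(doc, query):
        # stand-in scorer: raw term counts; real scoring is far richer
        return sum(doc.count(term) for term in query.split())

    def shard_search(shard_docs, query):
        # retrieval + scoring, collapsed into one step for the sketch
        scored = [(score(doc, query), doc) for doc in shard_docs]
        return heapq.nlargest(PER_SHARD_LIMIT, scored)

    def merged_results(shards, query):
        # merging: combine per-shard lists into one cached top-1000
        hits = [h for shard in shards for h in shard_search(shard, query)]
        return heapq.nlargest(NUM_RESULTS_CACHED, hits)

    def display(cached, page, per_page=10):
        # display: the page number only slices the cached list
        start = (page - 1) * per_page
        return cached[start:start + per_page]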

Why does it drop everything past the top 1000? In fact, you probably aren't
getting the precise top 1000 in the first place. The retrieval and scoring
phases use a lot of heuristics to determine when it's worth sending a document
on to the next phase. Each shard might only need to send back its top 5 or so
documents. Since it's very rare for anyone to go past the first 100 results,
going a bit faster by sacrificing quality past position 200 or so is almost
always a good tradeoff.
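
You can see the tradeoff with synthetic scores (nothing Google-specific here): the head of the merged list barely changes when shards truncate, while deep ranks do.

    import random

    random.seed(0)
    shards = [[random.random() for _ in range(10_000)] for _ in range(100)]

    exact = sorted((s for shard in shards for s in shard), reverse=True)[:1000]
    approx = sorted((s for shard in shards
                     for s in sorted(shard, reverse=True)[:5]),
                    reverse=True)[:1000]

    # first rank where the truncated merge diverges from the exact one;
    # usually a couple hundred in, so the early pages are unaffected
    first_diff = next((i for i, (a, b) in enumerate(zip(exact, approx))
                       if a != b), len(approx))
    print(first_diff)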

There's a product-centric reason as well as an infrastructure-centric reason
to not go past 1000. Who actually wants those results? Normal people are
almost never looking at the 100th page of results. Viruses looking for
vulnerable webservers, spammers, scrapers, all of these malicious actors still
do want that 100th page of results, though. At some point it's not actually
helping the world to provide the nth page of results.

~~~
tedmiston
^ Great answer. So far, this is the only correct one in the thread.

I took Information Retrieval 101 in grad school and it was an interesting
course. If you're curious to learn more, term frequency–inverse document
frequency (tf–idf) is a good place to start. The underlying idea is
surprisingly simple.

[https://en.wikipedia.org/wiki/Tf–idf](https://en.wikipedia.org/wiki/Tf–idf)
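
A back-of-the-envelope version of the idea, using the plain tf = raw count and idf = log(N/df) variant (real systems use smoothed formulas):

    import math

    docs = ["hello world", "hello there", "world peace"]
    N = len(docs)

    def tf_idf(term, doc):
        tf = doc.split().count(term)                      # term frequency
        df = sum(1 for d in docs if term in d.split())    # document frequency
        return tf * math.log(N / df)

    print(tf_idf("hello", docs[0]))   # ~0.41: common term, low weight
    print(tf_idf("peace", docs[2]))   # ~1.10: rare term, high weight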

Likewise with the core of Google's (original) ranking algorithm, PageRank,
which draws on citation-analysis ideas of the same flavor as the h-index.

[https://en.wikipedia.org/wiki/PageRank](https://en.wikipedia.org/wiki/PageRank)
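
And a minimal power-iteration PageRank on a made-up three-page graph (damping factor 0.85, as in the original paper):

    damping = 0.85
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = {page: 1 / len(links) for page in links}

    for _ in range(50):
        new = {page: (1 - damping) / len(links) for page in links}
        for page, outs in links.items():
            for out in outs:
                new[out] += damping * ranks[page] / len(outs)
        ranks = new

    print(ranks)  # "c" scores highest: it is linked from both a and b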

Also, the "standard" book which we used is quite readable: _Introduction to
Information Retrieval_ by Manning, et al.

[https://www.amazon.com/Introduction-Information-Retrieval-Christopher-Manning/dp/0521865719/](https://www.amazon.com/Introduction-Information-Retrieval-Christopher-Manning/dp/0521865719/)

------
byoung2
_Why do they not allow the inspection of the results past 1,000_

It's probably a lot more work sorting the 1000th best result than the top 10
or 100. Imagine a stadium full of people (your search results) and a guy at
the podium asking for the tallest 10 people (top 10 results). You can do this
efficiently by getting the tallest person from each section of the stadium,
and then lining them up by height and taking the first 10 (sharding). But if
you ask for the 10 tallest people between 5'7.5613" and 5'7.5614", you're
going to have thousands of people to sort and very little to differentiate
them (the long tail).

~~~
saghm
I'm not sure I follow your example; what if the ten tallest people are all on
the same section?

~~~
JimmyAustin
If 100% accuracy is important, then getting the top 10 from each section and
then sorting them would be necessary. If you were willing to trade accuracy
for performance, getting the top 1 from each section would probably give you
a decent approximation.

In the perfect accuracy case, if you were grabbing the range of 990-1000, you
would need to grab the top 1000 from each section, which is 100x more data
transfer. I don't work for Google, but given the amount of money they spend on
building their own routers, they probably want to minimise that number.
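
A toy version of that tradeoff (synthetic heights, made-up section sizes):

    import heapq, random

    random.seed(1)
    sections = [[random.gauss(175, 8) for _ in range(500)] for _ in range(20)]

    exact = heapq.nlargest(10, (h for sec in sections for h in sec))
    cheap = heapq.nlargest(10, (max(sec) for sec in sections))

    # often False: a section holding two of the ten tallest people
    # contributes only its single max, so the cheap answer misses one
    print(exact == cheap)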

~~~
byoung2
The approximation holds up in this case because you only have the results I
return and you can't independently audit every section to see if I'm
accurate. Say I return the 7-foot guy from each section, but one section only
had a 6'8" guy, while another section had ten 7-footers. You can't spot that
without seeing my data.

------
mtmail
Results come from multiple sources (backend servers), and it would be too
computationally expensive to look for duplicates before the user paginates.
That's also the reason Google used to print "about XY results".

I've seen genuine users click 'next' through 50 or even 100 pages, though the
vast majority of that traffic is scripts and scrapers. I used to work on web
search analysis for a large search engine.
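
A rough illustration of why the count is an estimate (toy data, not how any real backend stores documents):

    backends = [
        {"doc1", "doc2", "doc3"},
        {"doc2", "doc4"},            # doc2 exists on two backends
        {"doc3", "doc5", "doc6"},
    ]

    estimate = sum(len(b) for b in backends)   # 8: cheap, no dedup
    actual = len(set().union(*backends))       # 6: needs a full merge
    print(f"About {estimate} results (actually {actual})")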

