
Ask HN: Why is OR-based search still used? - smusamashah
Instead of reducing the results as you add keywords, it increases them exponentially, making it hard to find what you are looking for.

1. Docker Hub
chrome -> ~3800 results
desktop -> ~1200 results
"chrome desktop" -> ~5000 results

2. Udemy
aws -> ~2300 results
docker -> ~750 results
aws docker -> ~2600 results
"aws docker" -> 0 results
"docker aws" -> 8 results

Why is this form of search still the default in some places?
======
PaulHoule
Good question.

The dominant paradigm for search today is the "vector space model".

The rough idea is that you start with "OR" and then score the documents so:

* the more words occur, the higher the rank

* the more frequent the words are in the document, the higher the rank

* less common words contribute more to the score

* tuning has to be done so that small documents are not privileged relative to large documents or vice versa (that turns out to be difficult; the first good algorithm for this was discovered 25 years ago, but it is hardly used commercially because nobody can be bothered to tune it for their specific text corpus)
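
As a rough illustration of the scoring rules above, here is a minimal Python
sketch of an OR query ranked with a TF-IDF style score. It is not any
particular engine's implementation: the function names (`tokenize`,
`rank_or_query`), the whitespace tokenizer, the `log(1 + N/df)` weighting, and
the division by document length are all assumptions made for the example. The
length division in particular is only a crude stand-in for the tuning problem
in the last bullet.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def rank_or_query(query, docs):
    """Return ids of docs matching at least one query term, best first."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scored = []
    for doc_id, tokens in enumerate(doc_tokens):
        if not tokens:
            continue
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # OR semantics: a missing term just adds nothing
            # Rarer terms (low df) get a larger weight.
            idf = math.log(1 + n_docs / df[term])
            # Relative frequency: a crude form of length normalization.
            score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scored.append((score, doc_id))

    # Highest score first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

Every document that matches any keyword is still returned, so the result
count grows with each added keyword, but the document containing all the
keywords ends up at the top.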

Given that you aren't going to read 5000 results for "chrome desktop" (unless
you're researching patents), it is not a problem so long as the results at the
top of the ranking are good.

Google has very different concerns than most. If you have a small collection
the immediate problem is that one of the documents means "chrome desktop" but
uses some other words to mean that, so you miss the document. So you need
tricks to find those documents. If you have a huge collection then there is
going to be some document where somebody used the same language as you and
you'll be satisfied.

If AND were the default, you'd find that ordinary users would frequently find
nothing and give up.

Recruiters, professional patent searchers, and other people who do nothing but
search all day collect large collections of "boolean strings" that help them
in their work. The best search engines, based on the VSM but using additional
tricks such as autoencoders, perform comparably. The typical web search engine
is worse.

~~~
smusamashah
This sounds like a very logical explanation and makes a lot of sense.

The way results are presented in this OR-based setup can probably be improved.
For example, by showing which of the keywords each result contains, by
filtering on rank or number of matched keywords, or by letting the user
specify that results must contain a specific keyword. This should reduce the
number of results significantly.
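
A minimal Python sketch of what that could look like, assuming a plain list of
document strings and whitespace tokenization. The function name
`search_with_matches` and its `min_match` and `required` parameters are
made up for this example and do not correspond to any site's actual API.

```python
def search_with_matches(query, docs, min_match=1, required=()):
    """OR search that reports which keywords each result matched.

    min_match: keep only documents matching at least this many keywords.
    required:  keywords that every returned document must contain.
    """
    # De-duplicate the query terms while keeping their order.
    terms = list(dict.fromkeys(query.lower().split()))
    results = []
    for doc_id, text in enumerate(docs):
        words = set(text.lower().split())
        matched = [t for t in terms if t in words]
        if len(matched) < min_match:
            continue
        if any(r.lower() not in words for r in required):
            continue
        results.append((doc_id, matched))
    # Documents matching more keywords come first; ties keep document order.
    results.sort(key=lambda r: (-len(r[1]), r[0]))
    return results
```

With `min_match` equal to the number of keywords this behaves like an AND
search, so the same code covers both ends of the range.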

------
eesmith
An 'or' search doesn't increase things exponentially.

In the simplest model of an 'or' search, each keyword returns a postings list,
and the result is the union of those lists. This is linear.

The size of the resulting list is at most N times the size of the largest
postings list, where N is the number of keywords.

In practice, the merger uses a weighting scheme so that documents with (say) 5
words of the 7 come before documents with (say) 3 of the 7.

And documents with all terms should be on the top of the list.

The merge is at worst N log(N) for the sort, where N here is the number of
results being sorted.
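
A minimal Python sketch of that simplest model, assuming the index is a dict
mapping each keyword to a list of document ids with no duplicates. The
function name `or_search` and the sample data in the usage note are
hypothetical.

```python
from collections import Counter


def or_search(postings, keywords):
    """Union the postings lists, ranking by number of matched keywords."""
    match_counts = Counter()
    for kw in keywords:
        # One linear pass over each keyword's postings list.
        for doc_id in postings.get(kw, []):
            match_counts[doc_id] += 1
    # Documents with the most matching keywords first, then by id.
    return sorted(match_counts, key=lambda d: (-match_counts[d], d))
```

For example, with `postings = {"chrome": [1, 2, 3], "desktop": [3, 4]}`, the
query `["chrome", "desktop"]` returns four documents, never more than the
3 + 2 postings that were scanned, and document 3 (the only one containing
both words) comes first.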

Most people only look at the first page, or first few pages of results, so it
doesn't really matter, does it?

~~~
smusamashah
OK. It's not exponential; instead it's the sum of all the n's. Three keywords
with 10 results each, provided no result contains more than one of the
keywords, will return 30 results when used together.

~~~
eesmith
And ranked. Plus, "Most people only look at the first page, or first few pages
of results, so it doesn't really matter, does it?"

~~~
smusamashah
If there are hundreds of results then sure, I will not bother going through
many pages. But if I am looking hard and see that there are only 2 pages, I
might check out all of them.

~~~
eesmith
It simply doesn't matter how many pages are returned as a count.

Most people - indeed, nearly all - only look at the first page or two, no
matter what. (PaulHoule pointed out that "researching patents" is one
exception.)

Google reports "532 000 000" results for 'chrome desktop'. Yet Google is
pretty popular.

People still use Google since the results they want are usually on the first
page. Indeed, that's why people switched from AltaVista to Google; AltaVista
often required trawling through tens of pages to find a good match.

