
Meta search engines leave a bad taste in everyone's mouth because they've always failed. Here is why:

https://en.wikipedia.org/wiki/Arrow%27s_impossibility_theore...

You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.



> You can't combine a few different ranked lists and expect to get results better than any of the original ranked lists.

I am skeptical of this application of the theorem. Here is my proposal:

Take the top 10 Google and Bing results. If the top result from Bing is in the top 10 from Google, display the Google results. If the top result from Bing is not in the top 10 from Google, place it at the 10th position. You'd have an algorithm that ties with Google, say, 98% of the time, beats it, say, 1.2% of the time, and loses 0.8% of the time.
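A minimal sketch of that rule (not the commenter's code; the list contents and the exactly-10-results assumption are mine):

    def merge_results(google_top10, bing_top10):
        # google_top10, bing_top10: lists of URLs, most relevant first,
        # each assumed to hold exactly 10 results.
        merged = list(google_top10)
        bing_best = bing_top10[0]
        if bing_best not in merged:
            # Bing's top hit is missing from Google's top 10:
            # surface it in the 10th slot.
            merged[9] = bing_best
        return merged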


Right. Arrow's theorem just says it's impossible to do it in all cases. It's still quite possible to get an improvement in a large proportion of cases, as you're proposing.


I've had jobs tuning up the relevance of search engines with methods like

https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20EVA...

and the first conclusion is "something that you think will improve relevance probably won't"; the TREC conference went for about five years before making the first real discovery

https://en.wikipedia.org/wiki/Okapi_BM25

It's true that Arrow's theorem doesn't strictly apply, but thinking about it makes it clear that the aggregation problem is ill-defined and tricky. (Note also that a ranking function for full-text search might have a range of 0-1, but the score is not a meaningful number like a probability estimate that the document is relevant; it just means a result with a higher score is likely to be more relevant than one with a lower score.)

Another way to think about it is that for any given feature architecture (say "bag of words") there is an (unknown) ideal ranking function.

You might think that a real ranking function is the ideal ranking function plus an error and that averaging several ranking functions would keep the contribution of the ideal ranking function and the errors would average out, but actually the errors are correlated.
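A quick numerical illustration of that last point (mine, not the commenter's): averaging n rankers only cancels the noise if the errors are independent; with strongly correlated errors the variance barely drops.

    import numpy as np

    rng = np.random.default_rng(0)
    n_rankers, n_docs, rho = 5, 100_000, 0.8

    # Each ranker's error = shared (correlated) part + private part,
    # constructed so every pairwise correlation is about rho.
    shared = rng.normal(0, 1, n_docs)
    private = rng.normal(0, 1, (n_rankers, n_docs))
    errors = np.sqrt(rho) * shared + np.sqrt(1 - rho) * private

    print(np.var(errors[0]))            # one ranker: ~1.0
    print(np.var(errors.mean(axis=0)))  # average of 5: ~0.84, not 0.2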

In the case of BM25, for instance, it turns out you have to carefully tune between the biases of "long documents get more hits because they have more words in them" and "short documents rank higher because the document vectors are spiky like the query vectors". Until BM25 there wasn't a function that could be tuned up properly, and just averaging several bad functions doesn't solve the real problem.
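For reference, the standard BM25 term score (Lucene's variant, with +1 inside the log so the IDF stays positive); b is exactly the knob that trades off those two biases: b=0 ignores document length, so long documents win on raw term counts, while b=1 normalizes fully by length, favoring short "spiky" documents.

    import math

    def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len,
                        k1=1.2, b=0.75):
        # Contribution of one query term to one document's score.
        # tf: term frequency in the document; df: number of documents
        # containing the term; doc_len/avg_doc_len drive the length
        # normalization that b interpolates.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1) / (tf + norm)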


Arrow's theorem simply doesn't apply here. We don't need our personalized search results to satisfy the majority.


But in both cases you face the problem of aggregating the preferences of many into one. In one case you are combining personal preferences; in the other, you are aggregating ‘preferences’ expressed by search engines.


But search engines aren't voting to maximize the chances that their preferred candidate shows up on top. The mixed ranker has no requirement to satisfy Arrow's integrity constraints. It has to satisfy the end user, which is quite possible in theory.

Conditions the mixed ranker doesn't have to satisfy: "ranking while also meeting a specified set of criteria: unrestricted domain, non-dictatorship, Pareto efficiency, and independence of irrelevant alternatives"


Sure, but the fact that conventional IR ranking scores are meaningful only as an ordering leads you to the dismal world of political economy, where you can't aggregate people's utility functions. (Thus you can't say anything about inequality, only about Pareto efficiency.)

Hypothetically you could treat these functions as meaningful but when you try you find that they aren't very meaningful.

For instance IBM Watson aggregated multiple search sources by converting all the relevance scores to "the probability that this result is relevant".

A conventional search engine will do horribly in that respect: you can fit a logit curve to make a probability estimator, but you might get p=0.7 at the very most and very rarely get that; in fact, you rarely get p>0.5.
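What that fitting step looks like in practice; a sketch using scikit-learn's logistic regression as the calibrator (the scores and relevance judgments here are made up):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Raw engine scores for past results, with human relevance
    # judgments (1 = relevant, 0 = not). Hypothetical data.
    scores = np.array([[0.91], [0.85], [0.72], [0.60], [0.44], [0.30]])
    relevant = np.array([1, 1, 0, 1, 0, 0])

    calibrator = LogisticRegression().fit(scores, relevant)

    # Calibrated P(relevant) for a new result with raw score 0.8 --
    # per the comment above, typically well below the raw score.
    print(calibrator.predict_proba([[0.8]])[0, 1])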

If you are combining search results from search engines that use similar approaches, you know those p's are not independent, so you can't take a large number of p=0.7's and turn them into a higher p.
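To see why that independence assumption matters: the textbook combination below treats each engine's estimate as independent evidence, and with correlated engines it badly overstates the combined p (an illustration, not Watson's actual formula):

    def combine_independent(ps):
        # 1 - P(every source is wrong), valid only if the sources'
        # errors are independent.
        prod_not = 1.0
        for p in ps:
            prod_not *= 1.0 - p
        return 1.0 - prod_not

    print(combine_independent([0.7, 0.7, 0.7]))  # 0.973
    # Three near-duplicate engines should really leave you near 0.7.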

If you are using search engines that use radically different matching strategies (say they return only p=0.99 results with low recall) the Watson approach works, but you need a big team to develop a long tail of matching strategies.

If you had a good p-estimator for search you could do all sorts of things that normal search engines do poorly, such as "get an email when a p>0.5 document is added to the collection."
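With a trustworthy p-estimator, that feature is just a small loop over saved queries at ingest time; a hypothetical sketch (estimate_relevance and send_email are stand-ins, not real APIs):

    def on_document_added(doc, saved_queries,
                          estimate_relevance, send_email):
        # Alert each user whose stored query clears the p>0.5 bar
        # on the newly added document.
        for query in saved_queries:
            p = estimate_relevance(query.text, doc)
            if p > 0.5:
                send_email(query.owner,
                           f"New match (p={p:.2f}): {doc.title}")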

For now alerting features are either absent or useless and most people have no idea why.


That's an invalid application of this theorem. (It doesn't necessarily hold)

Suppose all people share one unambiguous ranked preference over a set of webpages. Suppose one search engine ranks the top 5 results correctly and the next 5 incorrectly, while another ranks the top 5 incorrectly and the next 5 correctly.

What can happen is that there may be no universally preferred search engine (likely). In practice, as another commenter noted, you can also have most users prefer a certain combination of results (that's not difficult to imagine, for example by combining the top independent results from different engines).
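A toy version of that scenario (my numbers): page i's true rank is i, each engine gets only half the list right, and splicing the two good halves recovers the full truth, so most users would prefer the combination over either engine:

    # Ground-truth order of 10 pages, plus two engines that each get
    # a different half of it right.
    truth = list(range(10))
    engine_a = [0, 1, 2, 3, 4, 6, 7, 8, 9, 5]   # top 5 right
    engine_b = [1, 2, 3, 4, 0, 5, 6, 7, 8, 9]   # bottom 5 right

    combined = engine_a[:5] + engine_b[5:]

    def correct_positions(ranking):
        return sum(r == t for r, t in zip(ranking, truth))

    print(correct_positions(engine_a))  # 5
    print(correct_positions(engine_b))  # 5
    print(correct_positions(combined))  # 10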



