

Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? [FlV] - sarosh
http://videolectures.net/wsdm08_mei_esl/

======
sarosh
From the ACM paper abstract:

How many pages are there on the Web? 5B? 20B? More? Less? Big bets on clusters
in the clouds could be wiped out if a small cache of a few million URLs could
capture much of the value. Language modeling techniques are applied to MSN's
search logs to estimate entropy. The perplexity is surprisingly small:
millions, not billions.

Entropy is a powerful tool for sizing challenges and opportunities. How hard
is search? How hard are query suggestion mechanisms like auto-complete? How
much does personalization help? All these difficult questions can be answered
by estimation of entropy from search logs.
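The basic idea can be sketched with a plug-in (maximum-likelihood) entropy estimate over query frequencies. The paper applies language-modeling techniques to MSN's actual logs; the toy log, the query strings, and the smoothing-free MLE below are illustrative assumptions, not the paper's method:

```python
from collections import Counter
from math import log2

def query_entropy(queries):
    """Plug-in estimate of the entropy of a query distribution, in bits:
    H = -sum_q p(q) * log2 p(q), with p(q) the empirical frequency."""
    counts = Counter(queries)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy log (hypothetical). Perplexity 2**H is the "effective number of
# distinct queries" -- the paper's point is that for real search logs
# this comes out in the millions, not billions.
log = ["weather", "weather", "facebook", "facebook", "facebook", "maps"]
H = query_entropy(log)
perplexity = 2 ** H
```

On the toy log the perplexity is below 3 (there are only 3 distinct queries); the interesting empirical question is what this number is for a real log.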

What is the potential opportunity for personalization? In this paper, we
propose a new way to personalize search, personalization with backoff. If we
have relevant data for a particular user, we should use it. But if we don't,
back off to larger and larger classes of similar users. As a proof of concept,
we use the first few bytes of the IP address to define classes. The
coefficients of each backoff class are estimated with an EM algorithm.
Ideally, classes would be defined by market segments, demographics and
surrogate variables such as time and geography.
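Personalization with backoff can be sketched as a linear interpolation of a per-user model with progressively larger classes of similar users (the paper's proof of concept defines classes by IP-address prefixes), with the mixture weights fit by a textbook EM on held-out data. The counts, the class structure, and the fixed iteration budget below are assumptions for illustration:

```python
from collections import Counter

def ml_prob(counts, item):
    """Maximum-likelihood probability of `item` under a count model."""
    total = sum(counts.values())
    return counts[item] / total if total else 0.0

def backoff_prob(url, models, lambdas):
    """Interpolated click probability: models is ordered from most
    specific (this user) to most general (all users), e.g. classes
    defined by successive IP-prefix truncations. lambdas sum to 1."""
    return sum(l * ml_prob(m, url) for l, m in zip(lambdas, models))

def em_lambdas(heldout, models, iters=20):
    """Standard mixture-weight EM: E-step computes each backoff level's
    responsibility for every held-out click; M-step re-normalizes.
    This mirrors the paper's statement that backoff coefficients are
    estimated with an EM algorithm, but the setup here is a sketch."""
    K = len(models)
    lam = [1.0 / K] * K
    for _ in range(iters):
        resp_sums, n = [0.0] * K, 0
        for x in heldout:
            ps = [lam[k] * ml_prob(models[k], x) for k in range(K)]
            z = sum(ps)
            if z == 0:
                continue  # click unseen at every backoff level
            n += 1
            for k in range(K):
                resp_sums[k] += ps[k] / z
        lam = [s / n for s in resp_sums] if n else lam
    return lam

# Hypothetical counts: this user, their IP-prefix class, everyone.
user_m = Counter({"a": 2})
class_m = Counter({"a": 1, "b": 1})
global_m = Counter({"a": 1, "b": 1, "c": 2})
models = (user_m, class_m, global_m)
lam = em_lambdas(["a", "a", "b"], models)
p = backoff_prob("a", models, lam)
```

The point of the backoff is visible in the structure: when `user_m` is empty or has never seen the URL, the user term contributes nothing and the probability mass comes from the class and global models instead.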

