

Peter Norvig: Tweaking Bayes' Theorem - stagga_lee
http://java.dzone.com/articles/tweaking-bayes%E2%80%99-theorem
In Peter Norvig’s talk The Unreasonable Effectiveness of Data, he describes a translation algorithm based on Bayes’ theorem. Pick the English word that has the highest posterior probability as the translation. No surprise here. Then he says something curious.
======
srean
This is more of a tweak of naive Bayes than of Bayes' theorem, and I suspect he is
being a bit tongue in cheek and not letting on what's behind the tweak.

I am sure you have heard that naive Bayes makes gratuitous assumptions of
independence. What is not mentioned as often is that it also assumes the
document has been generated by a memory-less process.

So if I were generating a document according to the model assumed by naive
Bayes and wanted to decide whether to concatenate another "the" onto the
document, I wouldn't need to keep track of how many "the"s I had already added.
As a result the probability of _n_i_ occurrences of a word _i_ falls off
exponentially, like this

    
    
      P(n_i) = p_i^n_i.
    

Many words do not behave like this. Their probabilities do not decrease
monotonically from the start; rather, for words like "the" the probabilities
(conditioned on their history) climb initially and then go down.

Naive Bayes works surprisingly well in spite of its grossly violated
assumptions. There are many explanations for that. However, we usually forget
that to make NB work we have to throw away "stop-words", and among them are
exactly those "memory-full" words that violate NB's assumptions.

Let's get back to word frequencies: a model that fits word frequencies quite
well is the power law <http://en.wikipedia.org/wiki/Power_law>. It looks like
this

    
    
         P(n_i) \propto  n_i^c
    

where _c_ is a constant for that word id _i_. For English, _c_ usually lies in
the ballpark of -1.5 to -2.1.
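
A quick numerical contrast between the two count models (a rough sketch; the word probability p and the exponent c below are invented, illustrative values, not estimated from any corpus):

    p = 0.06    # hypothetical unigram probability of "the"
    c = -1.8    # hypothetical power-law exponent for "the"

    print("n   memoryless p^n   power law n^c (unnormalized)")
    for n in range(1, 6):
        # The memoryless model pays for every repeat; the power law has a heavy tail.
        print(f"{n}   {p ** n:.2e}         {n ** c:.2e}")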

The tweak that Norvig mentions is not an exact match for the power-law
assumption, but it comes very close. It's using a power-law assumption with
sub-optimal parameters. In fact, using the power-law assumption with its
parameters estimated from the data, you could get an even better classifier.
Be careful, though, of the cottage industry of seeing power laws everywhere:
log-normal distributions can look deceptively similar to a power law and are
more appropriate on many such occasions.

Yes, naive Bayes has a bad assumption of independence, but there is no reason
it has to be memory-less too, and that is partly what the tweak fixes; the
theorem isn't really being violated.
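
To make the contrast concrete, here is a minimal sketch (not Norvig's system; all priors, probabilities and exponents are made up) of a standard multinomial naive Bayes score next to a power-law-style score in which repeated occurrences of a word are discounted:

    import math

    prior = {"spam": 0.4, "ham": 0.6}
    word_prob = {"spam": {"the": 0.05, "viagra": 0.010},
                 "ham":  {"the": 0.06, "viagra": 0.0001}}
    # Per-word, per-class power-law exponents (negative, as for English text).
    word_exp = {"spam": {"the": -1.6, "viagra": -1.9},
                "ham":  {"the": -1.7, "viagra": -2.1}}
    counts = {"the": 7, "viagra": 2}   # toy document

    def nb_score(cls):
        # Memoryless multinomial NB: every repeat of a word costs log p again.
        return math.log(prior[cls]) + sum(n * math.log(word_prob[cls][w])
                                          for w, n in counts.items())

    def powerlaw_score(cls):
        # Power-law variant: the count itself is scored, so repeats are cheap.
        # (Per-word normalizing constants are omitted in this sketch.)
        return math.log(prior[cls]) + sum(word_exp[cls][w] * math.log(n)
                                          for w, n in counts.items())

    print("naive Bayes picks:", max(prior, key=nb_score))
    print("power law picks:  ", max(prior, key=powerlaw_score))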

~~~
bermanoid
What does any of that have to do with modifying P(sentence) relative to
P(sentence is a translation of X), though? Everything you're talking about
involves getting that initial P(sentence) estimate, but I don't see anywhere
in there why uniformly raising all naive estimates to a power would come close
to accounting for power law word frequencies (which presumably would be
incorporated in the initial model that gave us a P(sentence) estimate).

According to Norvig, the real problem this is intended to address is that
P(sentence is a translation of X) tends to be a crappy estimate, and that in
fact P(sentence) itself is much more reliable. That would seem to conflict
with your suggestion that this is really applying a correction for
P(sentence).

~~~
srean
Ah I see your point. Norvig uses a positive exponent on the prior whereas I
have written it in the form where there is a negative exponent on the
conditional. According to Bayes, we predict class 1 when

    
    
       P(C_1) P(words | C_1) > P(C_2) P(words | C_2)
    
    

The positive exponent on P(C_i) and a negative exponent on P(words | C_i) are
equivalent as far as classification is concerned: one raises both sides to a
suitable power so that the exponent on the conditional becomes one, and you
end up with a positive exponent on P(C_i).
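
A quick numerical check of that equivalence (made-up numbers): raising the prior to a power a, or raising the conditional to 1/a, picks the same class, because one decision rule is just the other with both sides raised to a positive power.

    a = 1.5
    prior = {"C1": 0.3, "C2": 0.7}     # made-up class priors
    cond = {"C1": 1e-6, "C2": 4e-7}    # made-up P(words | C)

    pick1 = max(prior, key=lambda c: prior[c] ** a * cond[c])        # exponent on the prior
    pick2 = max(prior, key=lambda c: prior[c] * cond[c] ** (1 / a))  # exponent on the conditional
    print(pick1, pick2)   # same class either way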

Thanks for catching this.

The constant _c_ can and does vary with the class label. As a result the Bayes
classifier will use conditional distributions that have fixed exponents, but
typically these will differ per word and per class. The model Norvig is using
is a very simple variant where he fixes the exponent. As I said, it's not the
exact expression one would have obtained by assuming a power law, but it is
very similar.

------
tel
This has been done for the last 20 years (sort of in secret) in the Automatic
Speech Recognition community. In brief, if A is the auditory observation
you're trying to transcribe and W is a possible word sequence, then you use
Bayes' law to rewrite

    
    
        P(W|A) = P(A|W)P(W)
    

The first is known as the auditory model, which is responsible for
transcribing sounds into phonemes and potential words. The second is the
language model, responsible just for ranking likely word sequences.

The first tends to be a lot harder than the second; its models are fuzzier.
So to compensate, the model actually used in practice is

    
    
        P(W|A) = P(A|W)P(W)^a
    

where a is a positive number. I forget what the typical value of the fudge
factor is, but generally you use cross validation to optimize it.

Exactly one class told me about the existence of it. Most others pretend like
it's not there.
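
A toy sketch of how that weight behaves when rescoring hypotheses (all probabilities and the weight values are invented; real systems get these scores from acoustic and n-gram models):

    import math

    hyps = {
        "recognize speech":   {"p_acoustic": 1e-9, "p_lm": 1e-4},
        "wreck a nice beach": {"p_acoustic": 5e-6, "p_lm": 1e-7},
    }

    def score(h, a):
        # log P(A|W) + a * log P(W); the weight a is tuned, e.g. by cross validation.
        return math.log(h["p_acoustic"]) + a * math.log(h["p_lm"])

    for a in (1.0, 10.0):
        best = max(hyps, key=lambda w: score(hyps[w], a))
        print(f"LM weight {a}: best hypothesis is {best!r}")
    # With these toy numbers the acoustically favoured hypothesis wins at a = 1,
    # and the language model takes over at a = 10.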

~~~
sbmassey
Iwwelewant Cowwection! Actual Bayes law is:

P(W|A) = P(A|W)P(W)/P(A)

~~~
bermanoid
If you're maximizing over all possible values of W, A never changes, which is
why the /P(A) will usually be omitted from the function you're maximizing.

You're right, though, it shouldn't be written as P(W|A) in that case, that's
misleading.
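
The point about dropping P(A) can be checked in a couple of lines (toy numbers): dividing every candidate's score by the same constant cannot change which candidate is largest.

    scores = {"W1": 0.012, "W2": 0.030, "W3": 0.007}   # made-up P(A|W)P(W) values
    p_a = sum(scores.values())                         # stands in for the constant P(A)
    assert max(scores, key=scores.get) == max(scores, key=lambda w: scores[w] / p_a)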

------
jhales
I just saw Norvig give this talk. The quote cuts off before he offers his
explanation: the 1.5 means 'trust this data set more.' For whatever reason
(e.g. bigger sample size) the probabilities from that set tend to be more
accurate.

It's not some sort of deep mystical counter example. It's a clever tweak that
comes from the observation that empirical observations are not uniformly
reliable.

~~~
pool007
As it's probability^1.5, it makes the probability lower, not higher; i.e.,
x^1.5 < x for 0 < x < 1.

I think this might be a tweak due to data scarcity. To compute p(e), there's
tons of data to build a model. On the other hand, p(e|f) requires a lot of
parallel corpora (texts written in both English and the foreign language),
which are not easy to come by.

As a result, in p(e)^1.5 * p(e|f), the p(e) term is lowered, as it's
presumably too high compared to p(e|f), imho.

~~~
carlob
I think it's more helpful to think of it as a gamma correction with γ > 1,
rather than as making things higher or lower. Or, as I said in another
comment, as a lower temperature, in that it concentrates probability around
its peaks.
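
A small illustration of that "concentrates probability around its peaks" effect (arbitrary toy distribution; Norvig's tweak itself doesn't renormalize, but the relative sharpening is the same): raise each probability to 1.5 and renormalize, and the large entries grow at the expense of the small ones.

    probs = [0.5, 0.3, 0.15, 0.05]            # toy distribution
    powered = [p ** 1.5 for p in probs]
    z = sum(powered)
    sharpened = [p / z for p in powered]
    print(sharpened)   # roughly [0.60, 0.28, 0.10, 0.02]: mass moves toward the peak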

------
tensor
There are a few things that really bother me about this post and many of the
comments regarding it.

First, let me address a common theme that I see underlying this discussion and
many others: theory vs. reality. The common claim seems to be that scientists
or other members of the academic community are often disconnected from "the
real world" and blindly hold to models that don't quite work. This is both a
damaging idea and a false one.

Science and math are two different things and _science_ has always been about
_the real world_. The key difference between science and math (or theory as
some like to call it) is _empirical data_. In the physical sciences the fact
that collected data contains errors or biases is a fundamental part of the
process. Characterizing and understanding the error or other problems with
collected data is critical to doing good science.

It seems that people who study _computer science_ often forget this. When does
computer science change from mathematics into a proper science? When you start
applying it to real data or to real systems that contain imperfections, in
situations precisely like the one being described. In this case there is
nothing wrong with Bayes' theorem, and as many have pointed out the theory is
not being tweaked at all. Rather, the change in the mathematics is there to
address a shortcoming of the data. This is simply how science should work.
Science is about the real world, and if you think otherwise you are not
understanding or practicing good science.

My second criticism is of the suggestion that we should not care _why_ this
change improves results. Now, John Cook is not actually advocating this, but a
casual reading of his blog post could certainly give that impression. He is
saying that it's OK to _use_ an algorithm in practice even though we don't
fully understand it. Insofar as it's been properly tested, I doubt most people
or most academics would argue with this.

But it's still important to try to understand what is going wrong in the
application of the mathematics such that a fudge factor is required. Very
often, understanding the root cause can lead to better methodology or a better
modelling of the data.

------
imurray
Some comments are on the original blog posting:
<http://www.johndcook.com/blog/2012/03/09/monkeying-with-bayes-theorem/>

------
carlob
The biggest difficulty in Bayesian modeling is usually choosing the prior i.e.
the P(W) term. There are a few maximum entropy techniques, but basically the
rule is: "what works best".

In this case P(W) is a 1-gram probability, which is clearly wrong, because it
amounts to saying that every word has a probability of appearing that is
independent of its surroundings. For example, while 'the' is very likely to
appear, you don't want it to appear next to another 'the'.
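
A toy illustration (all probabilities invented): a unigram P(W) happily scores "the the", while even a crude bigram model can penalize the repetition.

    unigram = {"the": 0.06}                                   # made-up unigram probability
    bigram = {("<s>", "the"): 0.2, ("the", "the"): 1e-6}      # made-up bigram probabilities

    p_unigram = unigram["the"] * unigram["the"]                   # 3.6e-03: repetition costs nothing extra
    p_bigram = bigram[("<s>", "the")] * bigram[("the", "the")]    # 2.0e-07: repetition is heavily penalized
    print(p_unigram, p_bigram)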

Even when using more sophisticated priors that take n-grams into account, one
never has a real model of the English language, because that will depend on
the context, the period in which it was written, the linguistic domain…

That is why tweaks are actually justified theoretically.

Another interesting thought that popped up in the comments to the original
post is that if one uses a statistical-mechanics interpretation of
probability, where P(W) = exp(-H(W)/T) with H some energy function and T a
temperature, one can interpret the 1.5 factor as lowering the temperature and
making the probability less fuzzy.
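
Spelled out with that notation, raising P(W) to the 1.5 power just rescales the temperature:

    P(W)^1.5 = exp(-H(W)/T)^1.5 = exp(-1.5 H(W)/T) = exp(-H(W)/(T/1.5))

so, up to renormalization, it is the same energy function at the lower temperature T/1.5, which sharpens the distribution around its low-energy (high-probability) configurations.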

------
asknemo
This happens all the time in machine learning applications. Or many other
engineering disciplines I dare say. If theorems and laws never need some
tweaks here and there in the real world, what do we need hackers and engineers
for?

~~~
lars
A good example is regularization. You have nice proofs saying that your
classifier is optimal, then you tack on a regularization term to it, which
breaks your optimality proof but improves your classification accuracy. It
seems unexpected, but it's not really all that surprising when you get down to
the details of it.

~~~
srean
Oops hit the down-arrow without intending to, my bad, hope someone will fix
that.

There is nothing tacked-on about a regularizer, though; it is very sound even
in theory. There are several ways to look at it. One way is to see it as a
natural consequence of Bayes' law: it is just the log of the prior
probability. There are certain things we know or assume about the model even
before looking at the data, for example we expect the predictions to have a
certain smoothness, etc.; all this knowledge can be incorporated into the
prior model, and that is what the regularizer is. Another way is to look at it
from the stability of the parameter estimates. I find the former more
convincing.

~~~
lars
Absolutely, there's a pretty clear mathematical justification for
regularization. However, it is very literally tacked on at the end. Take
logistic regression: if you minimize the cost function without regularization,
you get a maximum-likelihood estimate of the regression parameters. But what
we do is add a regularization term to that cost function. Minimizing that cost
function will no longer give an MLE solution, but it will (likely) give a
better solution. It all comes down to understanding that the MLE's optimality
is an asymptotic result. The same goes for covariance-matrix estimates, where
you have regularization procedures that are guaranteed to never be worse than
the plain MLE solution.
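
A minimal sketch of that point (illustrative only, not any particular library's code): the L2-regularized logistic-regression cost is the plain negative log-likelihood plus a term that equals the negative log of a zero-mean Gaussian prior on the weights, so minimizing it gives a MAP estimate instead of the MLE.

    import numpy as np

    def neg_log_likelihood(w, X, y):
        # Plain logistic regression (labels y in {-1, +1}); minimizing this alone gives the MLE.
        return np.sum(np.log1p(np.exp(-y * (X @ w))))

    def regularized_cost(w, X, y, lam):
        # The "tacked on" L2 term is -log of a Gaussian prior on w (up to a constant),
        # so minimizing this is MAP estimation rather than maximum likelihood.
        return neg_log_likelihood(w, X, y) + lam * np.dot(w, w)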

------
markstoehr
You can't justify this using Bayesian statistical theory in any easy way that
I see. However, one can view the problem as maximizing log(P(W|A)) + lambda *
log(P(W)), where log(P(W)) is a smoothness or regularization term and
log(P(W|A)) can be seen as relating to the expected loss of the function (e.g.
via Markov's inequality). Regularization has plenty of theoretical
justification in machine learning for improving performance on unseen data
(that is reasonably similar to the training data).

This is very much a discriminative tactic in disguise.

------
rjurney
I've done this before, to weight some properties higher than others. I
searched for the exponent, via simulation and cross validation.

It works.

------
mbq
Bypassing maths to make a better model? This is called machine learning (-;

------
alexchamberlain
Could this be considered a weighted iteration?

