This is more of a tweak to naive Bayes than to Bayes' theorem, and I suspect he is being a bit tongue-in-cheek and not letting on what's behind the tweak.
I am sure you have heard that naive Bayes makes gratuitous assumptions of independence. What is not mentioned as often is that it also assumes the document has been generated by a memory-less process.
So if I were to generate a document according to the model assumed by naive Bayes, and I want to decide whether to concatenate another "the" to the document, I don't need to keep track of how many "the"s I have already added to the document. As a result, the probability of n_i occurrences of a word i goes down exponentially, like this
P(n_i) = p_i^n_i.
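A minimal sketch of that memorylessness (the token probabilities here are made-up illustration values, not estimates from any corpus): under a unigram/naive Bayes model, the probability of a specific document is just a product of per-token probabilities, so each extra copy of a word multiplies the probability by the same fixed factor, no matter how many copies are already there.

```python
def sequence_prob(tokens, p):
    """Probability of one specific token sequence under a memoryless
    (unigram / naive Bayes) generative model: a product of per-token probs."""
    prob = 1.0
    for t in tokens:
        prob *= p[t]
    return prob

# Hypothetical per-token probabilities, purely for illustration.
p = {"the": 0.05, "cat": 0.01, "sat": 0.01}
base    = sequence_prob(["cat", "sat"], p)
one_the = sequence_prob(["cat", "sat", "the"], p)
two_the = sequence_prob(["cat", "sat", "the", "the"], p)

# Each additional "the" multiplies the probability by the same factor
# p["the"], regardless of history -- which is exactly the p_i^{n_i} decay.
assert abs(one_the / base - p["the"]) < 1e-12
assert abs(two_the / one_the - p["the"]) < 1e-12
```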
Many words do not behave like this. Their probabilities do not go down monotonically from the start; rather, for words like "the", their probabilities (conditioned on their history) climb initially and then go down.
Naive Bayes works surprisingly well in spite of its grossly violated assumptions. There are many explanations for that. However, we usually forget that to make NB work, we have to throw away "stop-words". Among them are exactly those "memory-full" words that violate NB's assumptions.
Let's get back to word frequencies: a model that fits word frequencies quite well is the power law. http://en.wikipedia.org/wiki/Power_law It looks like this
P(n_i) \propto n_i^c
where c is a constant for that word id i. For English, c usually lies in the ballpark of -1.5 to -2.1.
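To make the contrast with the exponential decay above concrete, here is a small sketch of a discrete power-law pmf (c = -1.8 is just a value picked from the ballpark range above, not fitted to any data). Its signature is that the decay ratio depends only on the multiplicative step in n, so the tail falls polynomially rather than geometrically.

```python
def power_law_pmf(c, n_max):
    """Discrete power law P(n) proportional to n^c, normalized over 1..n_max."""
    weights = [n ** c for n in range(1, n_max + 1)]
    z = sum(weights)
    return [w / z for w in weights]

pmf = power_law_pmf(c=-1.8, n_max=10_000)

# Power-law signature: P(2n)/P(n) = 2^c for every n, i.e. the decay
# depends only on the multiplicative step -- a much heavier tail than
# the geometric p_i^{n_i} decay of the memoryless model.
assert abs(pmf[1] / pmf[0] - 2 ** -1.8) < 1e-12    # P(2)/P(1)
assert abs(pmf[19] / pmf[9] - 2 ** -1.8) < 1e-12   # P(20)/P(10)
```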
The tweak that Norvig mentions is not an exact match for the power-law assumption, but it comes very close. It's using a power-law assumption, but with sub-optimal parameters. In fact, using the power-law assumption with its parameters estimated from the data, you could get an even better classifier. Be careful, though, of the industry of seeing power laws everywhere: log-normal distributions can look deceptively similar to a power law and are more appropriate on many such occasions.
Yeah, naive Bayes has a bad independence assumption, but there is no reason it has to be memory-less too, and that is partly what the tweak fixes; the theorem isn't really being violated.
What does any of that have to do with modifying P(sentence) relative to P(sentence is a translation of X), though? Everything you're talking about involves getting that initial P(sentence) estimate, but I don't see anywhere in there why uniformly raising all naive estimates to a power would come close to accounting for power law word frequencies (which presumably would be incorporated in the initial model that gave us a P(sentence) estimate).
According to Norvig, the real problem this is intended to address is that P(sentence is a translation of X) tends to be a crappy estimate, and that in fact P(sentence) itself is much more reliable. Which would seem to conflict with your suggestion that this is really applying a correction for P(sentence).
Ah I see your point. Norvig uses a positive exponent on the prior whereas I have written it in the form where there is a negative exponent on the conditional. According to Bayes, we predict class 1 when
P(C_1) P(words | C_1) > P(C_2) P(words | C_2)
A positive exponent on P(C_i) and a negative exponent on P(words | C_i) are equivalent as far as classification is concerned: raise both sides of the comparison to a suitable (negative) power so that the exponent on the conditional becomes one, and you end up with a positive exponent on P(C_i).
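A quick numerical sketch of that equivalence (the priors, likelihoods, and exponent below are arbitrary illustration values): because raising both sides of the comparison to the power 1/a (for a > 0) preserves the ordering, exponentiating the prior by a gives the same decision as exponentiating the conditional by 1/a.

```python
def decide(priors, liks, prior_exp=1.0, lik_exp=1.0):
    """Pick the class index maximizing prior^prior_exp * likelihood^lik_exp."""
    scores = [p ** prior_exp * l ** lik_exp for p, l in zip(priors, liks)]
    return max(range(len(scores)), key=lambda i: scores[i])

priors = [0.6, 0.4]          # made-up class priors
liks   = [0.002, 0.005]      # made-up conditionals P(words | C_i)
a = 3.0                      # any positive exponent

# Exponent a on the prior vs exponent 1/a on the conditional:
# both score vectors are related by the monotone map x -> x^(1/a),
# so the argmax -- and hence the classification -- is identical.
assert decide(priors, liks, prior_exp=a) == decide(priors, liks, lik_exp=1 / a)
```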
Thanks for catching this.
The constant c can and does vary with the class label. As a result, the Bayes classifier will use conditional distributions that have fixed exponents; typically, however, these will differ per word and per class. The model Norvig is using is a very simple variant where he fixes the exponent. As I said, it's not the exact expression one would obtain by assuming a power law, but it is very similar.
As I mention below, this tweak shows up even in models that go well out of their way to avoid independence assumptions. I didn't watch the video, so I'm not sure what Norvig mentions, but I don't think citing independence violations makes the fudge factor vanish.