1. As raverbashing points out, the key takeaway here is "removing the impossible and normalizing is equivalent to Bayes rule." This can be represented in a programming language using a "die unless X is true" instruction and some magic behind the scenes (see the sketch after this comment). It's nothing more than a trivial generalization of McCarthy's AMB operator for LISP, with a weight attached to each possibility.
2. The nicest thing about this approach is that it makes it very easy to reason about probability correctly. For example, the Monty Hall problem, which is known for being counter-intuitive, can be stated very clearly with all the assumptions visible.
3. The original audience of this post was programming language designers and people who understand Haskell and monads. For everybody else, I'd like to apologize for the notation. "Simple" is relative, and in this post, I take a lot of background knowledge for granted.
4. To implement this idea, you can choose among several approaches: explicit enumeration of all possibilities (which explodes exponentially), particle systems, or weighted particle systems. The latter two would ideally be compiled down to GPU code for big problems. Unfortunately, you can't use either Kalman filters or modern Bayesian network solvers, which somewhat limits the usefulness of this approach. Basically, it will scale to lots of data points, but not to complex Bayesian networks.
5. Dealing with different false-positive and false-negative rates would add a couple lines of code to this example.
If you'd like a more detailed writeup of this idea, see: http://www.randomhacks.net/darcs/probability-monads/probabil... I'll be back online in a couple of hours and I'll be happy to answer any further questions.
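Since "some magic behind the scenes" is doing a lot of work in point 1, here is a minimal sketch of what that magic could look like. This is my own toy, not the code from the writeup linked above (the names Dist, weighted, and normalize are made up): empty/guard is the "die unless X is true" instruction, and normalize at the end is the Bayes step.

import Control.Applicative (Alternative (..))
import Control.Monad (MonadPlus, guard)

-- A distribution is a weighted list of possible worlds.
newtype Dist a = Dist { runDist :: [(a, Double)] }

weighted :: [(a, Double)] -> Dist a
weighted = Dist

instance Functor Dist where
  fmap f (Dist xs) = Dist [ (f x, w) | (x, w) <- xs ]

instance Applicative Dist where
  pure x = Dist [(x, 1)]
  Dist fs <*> Dist xs = Dist [ (f x, v * w) | (f, v) <- fs, (x, w) <- xs ]

instance Monad Dist where
  Dist xs >>= f = Dist [ (y, v * w) | (x, v) <- xs, (y, w) <- runDist (f x) ]

instance Alternative Dist where
  empty = Dist []                        -- "die": a world with no outcomes
  Dist xs <|> Dist ys = Dist (xs ++ ys)

instance MonadPlus Dist                  -- lets guard prune impossible worlds

-- Removing the impossible happens in guard; normalizing happens here.
normalize :: Dist a -> [(a, Double)]
normalize (Dist xs) = [ (x, w / total) | (x, w) <- xs ]
  where total = sum (map snd xs)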
For example, imagine that we know a priori that a coin has only two possible biases, both of which are neither 0 nor 1. When I use Bayes Rule to update my belief in the two biases in response to observed data, I don't see how I am removing any impossibilities. It seems much more like I am downweighting the unlikely bias and upweighting the likely bias.
fluStatusGivenPositiveTest = do
  fluStatus <- percentWithFlu 10
  testResult <- if fluStatus == Flu
                  then percentPositive 70
                  else percentPositive 10
  guard (testResult == Pos)  -- die unless the test came back positive
  return fluStatus
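To make that block self-contained, here are the supporting definitions it assumes, written against the toy Dist monad sketched earlier; the linked writeup's actual definitions may differ.

data FluStatus = Flu | Healthy deriving (Eq, Show)
data TestResult = Pos | Neg deriving (Eq, Show)

percentWithFlu :: Double -> Dist FluStatus
percentWithFlu p = weighted [(Flu, p / 100), (Healthy, 1 - p / 100)]

percentPositive :: Double -> Dist TestResult
percentPositive p = weighted [(Pos, p / 100), (Neg, 1 - p / 100)]

With those in place, normalize fluStatusGivenPositiveTest evaluates to exactly the normalized worlds computed below.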
Until we look at the test results, there are four possible "worlds" with the following probabilities:
7% (Flu, Pos)
3% (Flu, Neg)
9% (Healthy, Pos)
81% (Healthy, Neg)
The guard statement kills the two worlds where the test came back negative, leaving:

7% (Flu, Pos)
9% (Healthy, Pos)

These sum to only 16%, so we normalize, which gives:

43.75% (Flu, Pos)
56.25% (Healthy, Pos)
As noted above, the number of worlds grows exponentially with the number of probabilistic choices we make. So if you want to use this on a larger problem, you need to use sampling (run the program N times, tabulate the results) or a particle system (which essentially does the same thing in parallel).
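Here is a rough sketch of the sampling version, again assuming the toy Dist monad above (runOnce and tabulate are made-up names): a draw that lands in the probability mass removed by guard counts as a rejected run, and we tabulate whatever survives.

import Control.Monad (replicateM)
import System.Random (randomRIO)
import qualified Data.Map as Map

-- One "run of the program": Nothing means this run died at a guard.
runOnce :: Dist a -> IO (Maybe a)
runOnce (Dist xs) = do
  r <- randomRIO (0.0, 1.0)
  return (go r xs)
  where
    go _ [] = Nothing              -- landed in the eliminated mass: reject
    go r ((x, w) : ws)
      | r < w     = Just x
      | otherwise = go (r - w) ws

-- Run the program N times, drop the dead runs, tabulate the survivors.
tabulate :: Ord a => Int -> Dist a -> IO (Map.Map a Int)
tabulate n d = do
  results <- replicateM n (runOnce d)
  return (Map.fromListWith (+) [ (x, 1 :: Int) | Just x <- results ])

A real particle system would carry weights along instead of rejecting runs outright, but the tabulated frequencies converge to the same normalized distribution.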
-- the same program as a monad comprehension
-- (needs {-# LANGUAGE MonadComprehensions #-} for a monad like Dist)
fluStatusGivenPositiveTest = [ fluStatus | fluStatus <- percentWithFlu 10
                             , testResult <- testRate fluStatus
                             , testResult == Pos ]
  where testRate Flu     = percentPositive 70
        testRate Healthy = percentPositive 10
p(-1|a)p(a)/p(-1) (a, -1)
p(-1|b)p(b)/p(-1) (b, -1)
P(Flu) = 0.1
P(not Flu) = 1 - P(Flu) = 0.9
P(Pos|Flu) = 0.7
P(Pos|not Flu) = 0.1
P(Pos) = P(Pos|Flu)P(Flu) + P(Pos|not Flu)P(not Flu) = 0.07 + 0.09 = 0.16
P(Flu|Pos) = P(Pos|Flu)P(Flu) / P(Pos) = 0.07 / (0.07 + 0.09) = 0.4375
P(not Flu|Pos) = 0.09 / (0.07 + 0.09) = 0.5625
You may need to work through the math in both the Haskell version and the traditional version before it becomes completely obvious that they're exactly the same thing.
P(a)P(heads|a) (a, heads)
P(a)P(tails|a) (a, tails)
P(b)P(heads|b) (b, heads)
P(b)P(tails|b) (b, tails)
In your example, we are eliminating the possible worlds where the flip is heads, leaving a total remaining probability that is precisely p(-1).
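To make that concrete in code: here is the two-bias coin in the same toy Dist monad from upthread, with made-up numbers (a 50/50 prior over biases of 0.3 and 0.6, observing tails). Note that nothing is impossible a priori; guard just removes the heads worlds, and normalizing does the downweighting and upweighting the parent describes.

data Coin = BiasA | BiasB deriving (Eq, Show)
data Toss = Heads | Tails deriving (Eq, Show)

coinGivenTails :: Dist Coin
coinGivenTails = do
  coin <- weighted [(BiasA, 0.5), (BiasB, 0.5)]        -- prior P(a), P(b)
  toss <- case coin of
            BiasA -> weighted [(Heads, 0.3), (Tails, 0.7)]
            BiasB -> weighted [(Heads, 0.6), (Tails, 0.4)]
  guard (toss == Tails)    -- observe tails: the heads worlds are removed
  return coin

normalize coinGivenTails works out to roughly 64% for BiasA and 36% for BiasB: the surviving weights 0.35 and 0.20, each divided by the remaining 0.55.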
Look at how Google does spell checking: it's not based on dictionaries; it's based on word usage statistics of the entire Internet, which is why Google knows how to correct my misspelled name and Microsoft Word doesn't.
It's also why Google knows "Jews should be wiped out" and "Muslims should be exterminated" and "blacks are ruining America" and "whites are neanderthals", all suggestions based on the first two words of each phrase. Yes, I'm provoking Google - but surely people also encounter these by accident.
If Microsoft shipped a version of Office that suggested any of the above as "corrections," they would be lambasted for it, and rightly so. Why does Google get a pass? Is it because our standard of decency is so much lower on the Internet? Or is it because we know that Google is merely reflecting popular sentiment on the Internet, and so the true villain is ourselves?
(In fairness, Bing happily suggests equally monstrous ideas.)
"lawyers are" -> "greedy sharks" (actual suggestion: scum)
"gays should" -> "be killed" (actual suggestion: be killed)
"marijuana should" -> "be legal" (actual suggestion: be legalized)
"drug users should" -> "get help" (actual suggestion: be shot)
"macs are" -> "better than windows" (actual suggestion: better than pcs)
As you can see, google's idea of the largest cluster of opinions on each subject aligns with what I expected in 4 out of 5 cases. Also note how you're getting both the suggestion that marijuana should be legal and that people who smoke it should be shot. These suggestions aren't google's opinion on the matter; they're what google expects you to think. Thankfully, they're wrong most of the time.
In the end, it's not impossible to write an ethical filter the same way google has a spam filter; the only reason they haven't done it is that there's no pressure to do so. If you don't want to see such suggestions on google, you have two options:
1) Make people not talk about killing gays or how all lawyers are scum
2) Pressure google to add a filter
Also, white people really are partly neanderthal https://en.wikipedia.org/w/index.php?title=Neanderthal&o...
Sometimes people search for content that they might not agree with, because they want to see what is being said there out of curiosity. Not every search is someone submitting their opinion to google, I'd expect that most are not.
All the people who want the addicts dead (let's say they're 30% of all people who think seriously about drug addicts) will happily rally under "should be shot", while the ones who want them rehabilitated would form many smaller groups around specific kinds of rehabilitation programs, how those should be administered, and what really is the best program for fixing these people. Though, when you look at it like that, you're really most likely to think "should be rehabilitated" or maybe "should... I don't really have an opinion one way or the other". But then, if google actually did high-level clustering, that is, extracting opinions that are all at the same level of specificity, would those suggestions be useful for a search engine?
I guess the really right way to put it is: that's what the google crawler has seen written most frequently, and we should assume it doesn't really reflect what you or I think about things.
Not what the crawler has seen most, but what people typing the same thing as you have ended up searching for most frequently. (We may be thinking the same thing and just confusing the words.)
I don't believe it's supposed to be "what you're most likely to be thinking"; it's just a commonly searched-for phrase. I don't think Google is trying to autocomplete with your opinion, because people aren't just searching for their own opinion; they're searching for words that will hopefully return the information they want.
It would be neat to see prediction as a feature in Office/Docs etc., though. There's a pretty huge corpus of essays available, and it'd be interesting to see how accurately a new one can be predicted.
Instead, I saw Bayes' rule in Haskell, a form that's more idiosyncratic than probability expressions.
Simple is in the eye of the beholder. Perhaps making monads isn't actually making Bayes' rule simpler.
P(Cause and Effect)
= P(Cause) × P(Effect given Cause)
= P(Effect) × P(Cause given Effect)
P(Cause) = P(Effect) × P(Cause given Effect) ÷ P(Effect given Cause)
P(A and B are both true)
= P(A is true) × P(B is also true, given that we know A is already true)
= P(B is true) × P(A is also true, given that we know B is already true)
I thought it'd be better to make it easier to understand (and more practical) by showing how it helps you infer causes from effects.
ADDED. Explanation: hearing squeaking noises in the night is evidence for the hypothesis that the cheese will have bite marks on it when we look in the morning (in the sense that the squeaking noises increase the probability of the bite-mark hypothesis), even though the squeaking noises do not cause the bite marks, nor do the bite marks cause the squeaking noises. You and I know that the squeaking noises and the bite marks have the same underlying cause, but Bayes's rule is useful in situations where cause and effect remain unknown. E.g., it can be used by a space alien without knowledge of mice, who cannot afford to ponder the possible causes of the bite marks and the squeaking noises.
As a bonus, others will easily be able to understand and fix or extend the code.
posterior = []
repeat many times:
    observables, unknowns = simulate_from_model(priors)
    if observables == observed:
        posterior.append(unknowns)
This is my favorite explanation of Bayesian statistics, since it implements Bayes' Theorem without math. It is also the underlying intuition behind the probabilistic programming approach to Bayesian statistics. Rubin (1984) has a great explanation: https://twitter.com/tristanzajonc/status/325120025428119552
I am sure it depends on the employee, not the company. MS has done amazing things in machine learning. I love Vowpal Wabbit, which is maintained by an ex-Yahoo, currently-MS employee. I am amazed at the complexity of the Kinect software that translates images into body configurations. Just today I saw another amazing technology from MS: IllumiRoom. Check out the vid: http://www.youtube.com/watch?v=nGGMv9RnJIA Check out MAVIS too, a system for transcribing videos and making them searchable. It's a little dated, but very cool nonetheless.
So, while I still think Google is superior, MS is making some amazing stuff too, from time to time.
Google is famous for data driven decision making.
The other quote 'removing the impossible and normalizing is equivalent to Bayes rule' is a good thing to remember.
Just to mention that drug tests have different rates of false positives and false negatives; it's not usually a flat '1% failure'.
If you happen to fall on an edge case, they won't really care, because you fall within the error rate.
Is there a good "bayesian probability for dummies" type resource out there? Any suggestions?
For example: I have two bags of sweets. One has 10 sweets, of which 8 are chocolate and 2 are marshmallows. The other has 20 sweets, of which 10 are chocolate and 10 are marshmallows.
I offer to give you a sweet, which I will select at random. Assuming I empty both bags on to the table and choose one at random (uniformly), what is the chance it comes from the first bag? Assume that I do this where you can't see what I choose.
(Answer: 10/30, or 1 in 3 - as these are the ratios of total sweet numbers).
Now, assume that after I've chosen (you still can't see), I tell you that the sweet is a chocolate (This is the evidence). What is the chance now that it came from the first bag?
(Answer: 8/18 - the ratio of first-bag-chocolates out of any-bag-chocolates)
We have now 'updated' our probability estimates based on the evidence ('It's a chocolate'). Important to note here is that the first bag has a higher ratio of chocolates than the second bag, BUT it's still more likely to have come from the second bag due to the second bag having more sweets in total - what we call the Prior probability.
This relationship can be expressed mathematically, which is what Bayes did.
P(B|A) = P(B & A)/P(A) = P(B)*P(A|B)/(P(B)*P(A|B) + P(!B)*P(A|!B))
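For what it's worth, this example drops straight into the toy Haskell Dist monad sketched earlier in the thread (my sketch, with the bag contents from above):

data Bag = Bag1 | Bag2 deriving (Eq, Show)
data Sweet = Chocolate | Marshmallow deriving (Eq, Show)

bagGivenChocolate :: Dist Bag
bagGivenChocolate = do
  bag <- weighted [(Bag1, 10/30), (Bag2, 20/30)]   -- prior: sweets per bag
  sweet <- case bag of
             Bag1 -> weighted [(Chocolate, 8/10), (Marshmallow, 2/10)]
             Bag2 -> weighted [(Chocolate, 10/20), (Marshmallow, 10/20)]
  guard (sweet == Chocolate)    -- the evidence: "it's a chocolate"
  return bag

normalize bagGivenChocolate returns Bag1 with probability 8/18 and Bag2 with 10/18, matching the arithmetic above.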
Also, a good idea is to avoid "trying to turn your mind inside out", which is what a lot of explanations I have read seem to require.
It is easy, it is simple, it is normal.
I am sorry I do not have a reference handy (because of the above), but these are good suggestions (otherwise I would not be writing them) that have helped me overcome similar "difficulties".
Edit: Cannot help saying it: 'a priori' and 'a posteriori' and all those terms are just fancy words (which mean exactly what they signal) that scare a lot of people away from something that is, I repeat, natural.
An Intuitive Explanation of Eliezer Yudkowsky’s Intuitive Explanation of Bayes’ Theorem
Seeing the world through the lens of Bayes’ Theorem is like seeing The Matrix. Nothing is the same after you have seen Bayes.
His next post in the series is supposed to be on Bayes' theorem.
First, Bayesian formalism shouldn't be hard. Just remember that
P(A | B) = P(A & B) / P(B)
Let's say that I have a coin and I run a game where I always bet heads. If I flip tails, you get $1; on heads, I get $1. Let's also say that 1.00% of the people in the world are depraved enough to use unfair coins that always come up heads. You don't know me, so that's a good assumption for how likely I am to be a scumbag. So there's a prior probability of 0.01 that I'm a scumbag running an unfair game, and 0.99 that I'm using a fair coin. Now, we play, and I flip a head. I win. That's evidence (a signal) suggesting it's more likely that I'm a scumbag, but far from conclusive (fair games will turn up heads as well). How much has the probability changed? Well, let's look at the possibilities:
I'm a scumbag, and flip heads: 1/100 * 1 = 1/100
-- There's a 1% prior prob., and if I'm a scumbag I'll always flip heads.
I'm a scumbag, and flip tails: 1/100 * 0 = 0
-- If I'm a scumbag, I'll never flip tails.
I'm using a fair coin, and flip heads: 99/100 * 1/2 = 99/200.
I'm using a fair coin, and flip tails: 99/100 * 1/2 = 99/200.
Putting the surviving and eliminated worlds in a table:

        | Scumbag | Fair   |
  Heads | 1/100   | 99/200 |
  Tails | 0       | 99/200 |

Having observed heads, we drop the Tails row and normalize what's left:

P(scumbag | heads) = (1/100) / (1/100 + 99/200) = (1/100) / (101/200) = 2/101 ~ 2%
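The same computation in the toy Dist monad from upthread, for anyone following along in Haskell (my sketch, not the parent's code):

data Player = Scumbag | Fair deriving (Eq, Show)
data Toss = Heads | Tails deriving (Eq, Show)

playerGivenHeads :: Dist Player
playerGivenHeads = do
  player <- weighted [(Scumbag, 0.01), (Fair, 0.99)]
  toss <- case player of
            Scumbag -> weighted [(Heads, 1.0)]               -- two-headed coin
            Fair    -> weighted [(Heads, 0.5), (Tails, 0.5)]
  guard (toss == Heads)
  return player

normalize playerGivenHeads gives Scumbag with probability 2/101 and Fair with 99/101, the same ~2%.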
For probabilities over 0.5, "twice as likely" doesn't make sense if you think in terms of probability. What about odds, though? In odds terms, you can derive that "twice as likely" as 55% is 71%.
You can say "twice as likely" if you think in odds, not probability. Odds is probability transformed through p/(1-p), or the ratio between the probability of the event happening and it not happening; it has domain [0, +inf] which means that "twice as likely" doesn't cause that problem. If you think in odds terms, it turns out that odds(I'm a scumbag) exactly doubles each time I flip a head. After the first, that odds number goes from 1/99 to 2/99. If I flip 10 heads, then it's 1024/99; transformed back into a probability it's 1024/1123, or a 91% chance that I'm a scumbag.
In fact, what I think makes Bayesian inference hard is that it involves that subtle context switch between probability and odds.
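The context switch is only two one-liners; here is a quick Haskell check of the numbers above (toOdds and fromOdds are my own names):

-- odds = p / (1 - p); the inverse maps odds back to probability
toOdds, fromOdds :: Double -> Double
toOdds p   = p / (1 - p)
fromOdds o = o / (1 + o)

-- "twice as likely" as 55%:  fromOdds (2 * toOdds 0.55)     ~ 0.71
-- ten heads in a row:        fromOdds (2^10 * toOdds 0.01)  ~ 0.91 (1024/1123)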
If B is observed and there is a known prior _odds_ of A, the posterior _odds_ of A
are k times higher, where k is the ratio of the _probabilities_ of
(B given A) vs. (B given ~A). (Equivalently, the posterior odds equal the ratio of the probabilities of (A & B) vs. (~A & B).)
Ok, so what about Bayesian statistics? Essentially, it comes from the idea that:
(1) There is some prior probability distribution  over the possible states S
of affairs, which you cannot observe directly but whose relative
probabilities you can estimate based on observed events E.
(2) For each event E observed, compute unnormalized posteriors according to
posterior_unnormalized(S) = prior(S) * likelihood(E | S).
(3) To compute normalized posteriors, divide each of the unnormalized ones
by the sum (or integral) of the unnormalized posteriors over all states S.
You now have a sum of 1, which makes it a legitimate probability distribution.
Now, many actual Bayesian inference methods ignore (3). Normalizing is computationally expensive (when there aren't two possible states S, but millions to infinitely many), and often we don't need to do it. For example, if you're doing classification between two classes Q and R-- with "events" being features of what you're trying to classify-- then you generally only care about which is more likely, not whether there's specifically a 23.75% chance of Q. So the normalization step is generally either skipped (if relative probabilities are all that matter) or deferred to the end of the analysis. This "feels wrong" at first, but it actually works, and is often better for numerical stability and thus for the accuracy of the conclusions.
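Steps (1)-(3) fit in a few lines of Haskell for a finite state space (a sketch; prior and likelihood stand for whatever your model supplies):

-- (2) unnormalized posterior = prior * likelihood of the observed event
-- (3) normalization by the total remaining mass
posterior :: [(s, Double)] -> (e -> s -> Double) -> e -> [(s, Double)]
posterior prior likelihood e = [ (s, w / total) | (s, w) <- unnorm ]
  where
    unnorm = [ (s, p * likelihood e s) | (s, p) <- prior ]
    total  = sum (map snd unnorm)

Dropping the division by total gives the unnormalized version described above; for deciding which of two classes is more likely, that's all you need.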
I do wish every explanation of such things explained the syntax better. I still have to keep checking what P(A | B) and P(A & B) mean. You say the hard part is applying the equation, but the first "hard part" is learning what all the symbols mean!
Mathematica knows about Bayes' rule and makes reasoning about it quite simple. If you look here  you'll find Bayes' rule in the section: "Properties and relations".
What makes it extremely powerful is that it's fully symbolic so you can do things like:
Probability[x^2 > a \[Conditioned] x < b, x \[Distributed] NormalDistribution[m, s]]
Say the probability of being a User is 1%. You can initialize your variable, X, to a superposition, with a 99% chance of being clean and a 1% chance of being a user.
Now, the function, f(x) for the test is such that:
f(user) = (99% positive) + (1% negative)
f(clean) = (1% positive) + (99% negative)
Now, look at the quantum state of your machine (this is why we are on an emulator). Each of the 4 potential states has a certain probability associated with it.
The (theoretical) slowdown of using a quantum emulator is the fact that as the size of your memory increases, you have exponentially more states to deal with. But because we would need to deal with the same number of states anyway, this is not actually relevant.

I should also mention that using a real quantum computer does not provide a (trivial) speedup for this process. This is because, while the quantum state contains all 4 possibilities with some probability, when we go to measure it we will only see 1 of the 4 states. Not only does this not give us a probability, but we would need to re-run the computation to measure it again.
That was the only bit of the article that did make sense to me :) I get lost trying to read ML-style languages.
The entire article was a tutorial on Bayes's rule, with a completely irrelevant introductory paragraph thrown in at the top. Nothing about Google or Microsoft's practices was actually mentioned anywhere.
As the article clearly showed, this is wishful thinking.
And there's one simple reason why:
- Bayesian inference is generally computationally intractable.
- "If" statements are quite tractable.
Any language in which you could perform Bayesian inference as easily as you perform "if" statements would result in horrifyingly slow code and become useless.
The implication of the quote is that Google attempts to implement the seemingly intractable algorithms by running them on huge clusters, and returning the results to a large audience that makes the computation worth it. If your language targets a data center then having a language support Bayesian inference might be useful.
Also, if you want to perform Bayesian inference, it is equally difficult computationally whether or not it can be expressed as simple 'if' statements. In terms of code simplicity, however, expressing it as simple 'if' statements is easier.
Quantum computers also make the sun rise every morning.
I can claim anything about what they do, because they don't exist.
However, that is irrelevant, because quantum computers do exist. For example, in 2012, researchers factored the number 21 on one.
However, even the existence of quantum computers is beside the point, because programming languages designed for them exist regardless. Not only that, but you can still run programs written in such languages on a classical computer.
Of course, you could also do this type of Bayesian programming in a language designed for that, instead of one designed for quantum computation.