

Bayes' rule in Haskell (2007) - tikhonj
http://www.randomhacks.net/articles/2007/02/22/bayes-rule-and-drug-tests

======
ekidd
This is a pretty old post of mine (2007), and I was a bit surprised to see it
on Hacker News this morning! For those of you who find all the Haskell
impenetrable, let me summarize:

1. As raverbashing points out, the key takeaway here is "removing the
impossible and normalizing is equivalent to Bayes rule." This can be
represented in a programming language using a "die unless X is true"
instruction and some magic behind the scenes. It's nothing more than a trivial
generalization of McCarthy's "AMB" operator for LISP with some weights
attached to each possibility.

2. The nicest thing about this approach is that it makes it very easy to
reason about probability correctly. For example, the Monty Hall problem, which
is known for being counter-intuitive, can be stated very clearly with all the
assumptions visible.

3. The original audience of this post was programming language designers and
people who understand Haskell and monads. For everybody else, I'd like to
apologize for the notation. "Simple" is relative, and in this post, I take a
lot of background knowledge for granted.

4. To implement this idea, you can choose between several approaches:
explicit enumeration of all possibilities (which explodes exponentially),
particle systems, or weighted particle systems. The latter two would be
ideally compiled down to GPU code for big problems. Unfortunately, you can't
use either Kalman filters or modern Bayesian network solvers, which somewhat
limits the usefulness of this approach. Basically, it will scale to lots of
data points, but not to complex Bayesian networks.

5. Dealing with different false-positive and false-negative rates would add a
couple lines of code to this example.

If you'd like a more detailed writeup of this idea, see:
[http://www.randomhacks.net/darcs/probability-monads/probability-monads.pdf](http://www.randomhacks.net/darcs/probability-monads/probability-monads.pdf)
I'll be back online in a couple of hours and I'll be happy to answer any
further questions.

~~~
apw
Do you really mean that "removing impossibility and renormalizing" is
_equivalent_ to Bayes Rule?

For example, imagine that we know _a priori_ that a coin has only two possible
biases, both of which are neither 0 nor 1. When I use Bayes Rule to update my
belief in the two biases in response to observed data, I don't see how I am
removing any impossibilities. It seems much more like I am downweighting the
unlikely bias and upweighting the likely bias.

~~~
ekidd
Excellent question. Let me unpack a bit. Here's an example from the PDF I
linked above:

    
    
      fluStatusGivenPositiveTest = do
        fluStatus  <- percentWithFlu 10
        testResult <- if fluStatus == Flu
                        then percentPositive 70
                        else percentPositive 10
        guard (testResult == Pos)
        return fluStatus
    

In this code, you can read the operator "<-" as "pick a possible value from a
probability distribution."

Until we look at the test results, there are four possible "worlds" with the
following probabilities:

    
    
       7% (Flu, Pos)
       3% (Flu, Neg)
       9% (Healthy, Pos)
      81% (Healthy, Neg)
    

But once we see that the test result is "Pos", all those worlds with "Neg"
become impossible, giving us:

    
    
      7% (Flu, Pos)
      9% (Healthy, Pos)
    

But our probabilities don't add up to 100% any more. Fortunately, all we need
to do is normalize them. That gives us:

    
    
      43.75% (Flu, Pos)
      56.25% (Healthy, Pos)
    

This normalization step is basically the denominator in Bayes rule.

As noted above, the number of worlds grows exponentially with the number of
probabilistic choices we make. So if you want to use this on a larger problem,
you need to use sampling (run the program N times, tabulate the results) or a
particle system (which essentially does the same thing in parallel).
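
If Haskell isn't your thing, the same enumerate-guard-normalize computation
can be sketched in a few lines of Python (a rough translation I'm making up
here, not the code from the post):

```python
# Hypothetical Python translation of fluStatusGivenPositiveTest.
# Each "world" is a (flu_status, test_result) pair with a weight.
def worlds():
    for flu, p_flu in [("Flu", 0.10), ("Healthy", 0.90)]:
        p_pos = 0.70 if flu == "Flu" else 0.10
        for test, p_test in [("Pos", p_pos), ("Neg", 1.0 - p_pos)]:
            yield (flu, test), p_flu * p_test

# guard (testResult == Pos): discard the now-impossible "Neg" worlds.
survivors = {w: p for w, p in worlds() if w[1] == "Pos"}

# Normalize the survivors -- this division is Bayes' rule's denominator.
total = sum(survivors.values())
posterior = {w: p / total for w, p in survivors.items()}
# posterior is ~43.75% (Flu, Pos) and ~56.25% (Healthy, Pos)
```

Run it and you get the same 43.75%/56.25% split as above.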

~~~
apw
Until we look at the coin flip results for a coin known _a priori_ to have
_only two_ possible biases a and b, where neither a nor b equal 0 or 1, there
are two possible "worlds" with the following probabilities:

    
    
      0.5 (a)
      0.5 (b)
    

But once we see that the first coin flip is tails, which we code as -1, then
we update using Bayes rule, giving us:

    
    
      p(-1|a)p(a)/p(-1) (a, -1)
      p(-1|b)p(b)/p(-1) (b, -1)
    

I just don't see any removing of possible worlds here. Am I doing this wrong?
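
Concretely, with made-up biases (say P(heads|a) = 0.3 and P(heads|b) = 0.6,
numbers I'm inventing for illustration), my update looks like pure
reweighting and normalizing, with nothing removed:

```python
# Two possible biases, neither 0 nor 1, equally likely a priori.
prior = {"a": 0.5, "b": 0.5}
p_heads = {"a": 0.3, "b": 0.6}

def update(prior, likelihood):
    """Reweight each world by its likelihood, drop zero-weight worlds, normalize."""
    unnorm = {s: prior[s] * likelihood[s] for s in prior}
    unnorm = {s: w for s, w in unnorm.items() if w > 0}  # removes nothing here
    total = sum(unnorm.values())
    return {s: w / total for s, w in unnorm.items()}

# Observe one tails flip (-1): the likelihood of tails is 1 - P(heads).
posterior = update(prior, {s: 1.0 - p_heads[s] for s in p_heads})
# a is upweighted (~0.636), b is downweighted (~0.364); no world disappears.
```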

~~~
ekidd
Let's take the flu test I used in the grandparent post, and walk through it
step by step using the traditional notation.

    
    
      P(Flu) = 0.1
      P(not Flu) = 1 - P(Flu) = 0.9
      P(Pos|Flu) = 0.7
      P(Pos|not Flu) = 0.1
    
      P(Flu|Pos)
        = P(Pos|Flu)P(Flu)
          -------------------------------------------
          P(Pos|Flu)P(Flu) + P(Pos|not Flu)P(not Flu)
        = 0.07 / (0.07 + 0.09)
        = 0.4375
    
      P(not Flu|Pos) = 0.09 / (0.07 + 0.09) = 0.5625
    

As you can see, the same numbers show up in both versions of the problem. In
the traditional example, the "impossible worlds" are (Flu, Neg) and (not Flu,
Neg), because we know we have Pos. So those two worlds never contribute terms
to the equation. The division by (0.07 + 0.09) is the same as the
normalization step in my example: once we narrow our world down to 0.07 and
0.09, we need to divide by 0.16 so that everything sums back up to 1.

You may need to work through the math in both the Haskell version and the
traditional version before it becomes completely obvious that they're exactly
the same thing.

------
millstone
Here's more of the quote:

 _Look at how Google does spell checking: it's not based on dictionaries; it's
based on word usage statistics of the entire Internet, which is why Google
knows how to correct my name, misspelled, and Microsoft Word doesn't._

It's also why Google knows "Jews should be wiped out" and "Muslims should be
exterminated" and "blacks are ruining America" and "whites are neanderthals",
all suggestions based on the first two words of each phrase. Yes, I'm
provoking Google - but surely people also encounter these by accident.

If Microsoft shipped a version of Office that suggested any of the above as
"corrections," they would be lambasted for it, and rightly so. Why does Google
get a pass? Is it because our standard of decency is so much lower on the
Internet? Or is it because we know that Google is merely reflecting popular
sentiment on the Internet, and so the true villain is ourselves?

(In fairness, Bing happily suggests equally monstrous ideas.)

~~~
anonymous
Because unethical sentences aren't a criminal offence? I mean, if you gave me
the first two words from those sentences and asked "what's the most probable
next words", I would give you pretty much the same answer. Let me try some
more and I'll put my answer next to google's

"lawyers are" -> "greedy sharks" (actual suggestion: scum)

"gays should" -> "be killed" (actual suggestion: be killed)

"marijuana should" -> "be legal" (actual suggestion: be legalized)

"drug users should" -> "get help" (actual suggestion: be shot)

"macs are" -> "better than windows" (actual suggestion: better than pcs)

As you can see, google's opinion on what is the largest cluster of opinions on
the subject aligns with what I expect in 4 out of 5 cases. Also note how
you're getting both the suggestion that marijuana should be legal and that
people who smoke it should be shot. These suggestions aren't google's opinion
on the matter, it's what google expects you to think. Thankfully, they're
wrong most of the time.

In the end, it's not impossible to write an ethical filter the same way
google has a spam filter; the only reason they haven't done it is that
there's no pressure to do so. If you don't want to see such suggestions on
google, you have two options:

1) Make people not talk about killing gays or how all lawyers are scum

2) Pressure google to add a filter

Also, white people really are partly neanderthal
[https://en.wikipedia.org/w/index.php?title=Neanderthal&oldid=553113335#Interbreeding_hypotheses](https://en.wikipedia.org/w/index.php?title=Neanderthal&oldid=553113335#Interbreeding_hypotheses)

~~~
pmcg
It's not necessarily what google expects you to _think_ , but rather what you
are most likely to be _searching for_.

Sometimes people search for content that they might not agree with, because
they want to see what is being said there out of curiosity. Not every search
is someone submitting their opinion to google, I'd expect that most are not.

~~~
anonymous
You're right, I didn't phrase that right. I should have written "what google
expects you to think of", like you said. Still, isn't that the same as "if we
divide people into groups based on what they think about topic A, what does
the largest group think?" Which to me sounds the same as "what you're most
likely to be thinking".

All the people who want the addicts dead, let's say they're 30% of all people
who think seriously about drug addicts, will happily rally under "should be
shot", while the ones who want them rehabilitated would form many smaller
groups under specific kinds of rehabilitation programs, how those should be
administered and really what is the best program for fixing these people.
Though, when you look at it like that, you're really most likely to think
"should be rehabilitated" or maybe "should... I don't really have an opinion
one way or the other". But then, if google actually did high-level clustering;
that is, extracting opinions that are all at the same level of specificity,
would those suggestions be useful for a search engine?

I guess the really right way to put it is -- That's what the google crawler
has seen written most frequently -- and assume it doesn't really mean what you
or I think about things.

~~~
pmcg
> I guess the really right way to put it is -- That's what the google crawler
> has seen written most frequently -- and assume it doesn't really mean what
> you or I think about things.

Not what the crawler has seen most, but what people typing the same thing as
you have ended up searching for most frequently. (We may be thinking the same
thing and just confusing the words.)

I don't believe it's supposed to be "what you're most likely to be thinking",
it's just a commonly searched-for phrase. I don't think Google's trying to
autocomplete with your opinion because people aren't just searching for their
own opinion, they're searching for words that will hopefully return the
information they want.

------
doctorpangloss
I entered expecting a cool implementation of Bayes through a simple if
statement.

Instead, I saw Bayes' rule in Haskell, a form that's more idiosyncratic than
probability expressions.

Simple is in the eye of the beholder. Perhaps making monads isn't actually
making Bayes' rule simpler.

~~~
wfunction
+1 it was actually a rather horrifying presentation of Bayes's theorem. I
would just explain it in 2 lines:

P(Cause and Effect) = P(Cause) × P(Effect given Cause) = P(Effect) × P(Cause
given Effect)

so P(Cause) = P(Effect) × P(Cause given Effect) ÷ P(Effect given Cause)

~~~
jhdevos
Naming it Cause and Effect is misleading; it makes it seem like Bayes' rule
only applies when there is a causal relation somewhere. It also makes the
symmetry of the first line seem off. I would write the first line as

P(A and B are both true) = P (A is true) x P (B is also true, given that we
know A is already true) = P (B is true) x P (A is also true, given that we
know B is already true)

~~~
wfunction
I was actually going to write A and B at first, but then realized that
wouldn't show the relationship of the formula to the disease example very
obviously.

I thought it'd be better to make it easier to understand (and more practical)
by showing how it helps you infer causes from effects.

------
visarga
> Google uses Bayesian filtering the way Microsoft uses the if statement.

I am sure it depends on the employee, not the company. MS has done amazing
things in machine learning. I love Vowpal Wabbit, which is maintained by an
ex-Yahoo, currently-MS employee. I am amazed at the complexity of the Kinect
software that translates images into body configurations. Just today I saw
another amazing technology from MS: IllumiRoom. Check out the video:
<http://www.youtube.com/watch?v=nGGMv9RnJIA> Check out MAVIS too, a system for
transcribing videos and making them searchable. It's a little dated, but very
cool nonetheless.

So, while I still think Google is superior, MS is making some amazing stuff
too, from time to time.

~~~
abecedarius
The quote was from most of a decade ago; probably truer then. (Though Clippy
was Bayesian. I read somewhere a claim that its awfulness came from management
messing with it.)

------
raverbashing
Well, the quoted statement mentions Bayesian _filtering_, which is an
application of Bayes' rule, but it's not as simple as an if.

The other quote 'removing the impossible and normalizing is equivalent to
Bayes rule' is a good thing to remember.

Just to mention that drug tests have different rates of false positives and
false negatives; it's not usually a flat '1% failure' rate.
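
With made-up rates (say a 5% false-negative rate and a 1% false-positive
rate, numbers chosen only for illustration), the asymmetric version is still
just reweight-and-normalize:

```python
# Illustrative rates: base rate 1%, sensitivity P(Pos | user) = 0.95,
# false-positive rate P(Pos | clean) = 0.01.
p_user = 0.01
p_pos_given_user = 0.95   # false-negative rate is 0.05
p_pos_given_clean = 0.01

# Joint weights of the two surviving "Pos" worlds:
user_pos = p_user * p_pos_given_user
clean_pos = (1.0 - p_user) * p_pos_given_clean

# Remove the "Neg" worlds and normalize, exactly as before:
p_user_given_pos = user_pos / (user_pos + clean_pos)
# roughly 0.49 -- still about a coin flip, despite the "95% sensitive" test
```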

------
Eliezer
I've totally seen a LISP dialect with probabilistic execution and Bayesian
updating built in - it was one of the lecturers at SPARC 2012 - but damned if
I can remember the name of the speaker or the LISP dialect. Roughly, using
this dialect, you could build a program that would randomly branch at various
points, then keep only the cases of the program that matched some property,
and ask what the probabilities were for some other property over those cases.
I think the keyword was rejection-query but the first page of Google doesn't
turn up the desired result for that.

~~~
rawatson
Church? I'm not very familiar with the language but I believe it sounds
similar to what you're describing.
<http://projects.csail.mit.edu/church/wiki/Church>

------
tripzilch
This actually explains a _lot_ about Google's abysmal customer service,
especially where they appear way too trigger-happy with account suspensions
and PageRank penalties for appearing too spammy.

~~~
adlpz
Google _lacks_ customer service. They are not interested in _you_ , but in
volume. And you handle volume through statistics.

If you happen to fall on an edge case, they won't really care, because you
fall within the margin of error.

------
forrestthewoods
I still don't understand bayesian statistics and every article I've read to
try and learn has had syntax I didn't understand. I just don't know
lisp/haskell/etc syntax. My college math stopped at calc 3 and it's been many
years since I took that.

Is there a good "bayesian probability for dummies" type resource out there?
Any suggestions?

~~~
michaelochurch
Let me throw my hat in the ring for this. Tell me how I did.

First, Bayesian _formalism_ shouldn't be hard. Just remember that

    
    
        P(A | B) = P(A & B) / P(B)
    

and it's easy to derive (left as exercise) Bayes's Law. That's quite easy. The
hard part, here, is applying it.

Let's say that I have a coin and I run a game where I always bet heads. If I
flip tails, you get $1; on heads, I get $1. Let's also say that 1.00% of the
people in the world are depraved enough to use unfair coins that always come
up heads. You don't know me, so that's a good assumption for how likely I am
to be a scumbag. So there's a prior probability of 0.01 that I'm a scumbag
running an unfair game, and 0.99 that I'm using a fair coin. Now, we play, and
I flip a head. I win. There's evidence (or signal) that is suggestive that
it's _more likely_ that I'm a scumbag, but far from conclusive (fair games
will turn up heads as well). How much has the probability changed? Well, let's
look at the possibilities:

    
    
        I'm a scumbag, and flip heads: 1/100 * 1 = 1/100
        -- There's a 1% prior prob., and if I'm a scumbag I'll always flip heads.
        I'm a scumbag, and flip tails: 1/100 * 0 = 0
        -- If I'm a scumbag, I'll never flip tails. 
        I'm using a fair coin, and flip heads: 99/100 * 1/2 = 99/200. 
        I'm using a fair coin, and flip tails: 99/100 * 1/2 = 99/200.
    

It might help to draw a rectangle for the four possibilities, like so:

    
    
          Scumbag    Fair
          (1/100)  (99/100)
        +---------+---------+
        | Scumbag | Fair    |
        | Heads   | Heads   |
        |         |         |
        |  1/100  |  99/200 |
        +---------+---------+
        | Scumbag | Fair    |
        | Tails   | Tails   |
        |         |         |
        |     0   |  99/200 |
        +---------+---------+
    

Once we flip a head, we can throw out the "flip tails" sections, because we
didn't. We flipped heads. That leaves us with a subspace that contains 99/200
+ 1/100 = 101/200 of the total space-- we know we're in that space, so we can
consider only it-- while the probability mass of the "I'm a scumbag and flip
heads" remains 1/100. So the probability that I'm a scumbag posterior to
flipping heads is:

    
    
        (1/100)/(101/200) = 2/101 ~ 2%
    

Probability-wise, I'm almost twice as likely to be a scumbag after a head
comes up. That relationship holds very well for small priors, but what if
there's a 55% chance that I'm a scumbag? Flipping heads doesn't put that
probability at 110%-- first of all, that makes no mathematical sense, and
second of all, there's _still_ a chance that I'm fair but just flipped another
head. (The actual posterior probability is about 71%.)
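
You can check the rectangle computation mechanically; here's a quick Python
sketch (using exact fractions so no rounding sneaks in):

```python
from fractions import Fraction

def p_scumbag_given_heads(prior):
    """Throw out the 'tails' cells of the rectangle, then normalize."""
    scumbag_heads = prior                        # a scumbag always flips heads
    fair_heads = (1 - prior) * Fraction(1, 2)    # a fair coin: half the time
    return scumbag_heads / (scumbag_heads + fair_heads)

assert p_scumbag_given_heads(Fraction(1, 100)) == Fraction(2, 101)   # ~2%
assert p_scumbag_given_heads(Fraction(55, 100)) == Fraction(22, 31)  # ~71%, not 110%
```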

For probabilities over 0.5, "twice as likely" doesn't make sense if you think
in terms of probability. In terms of _odds_, however, you can derive that
"twice as likely" as 55% is 71%.

You can say "twice as likely" if you think in _odds_ , not probability. Odds
is probability transformed through p/(1-p), or the ratio between the
probability of the event happening and it not happening; it has domain [0,
+inf] which means that "twice as likely" doesn't cause that problem. If you
think in odds terms, it turns out that odds(I'm a scumbag) exactly doubles
each time I flip a head. After the first, that _odds_ number goes from 1/99 to
2/99. If I flip 10 heads, then it's 1024/99; transformed back into a
probability it's 1024/1123, or a 91% chance that I'm a scumbag.
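
In code, the doubling looks like this (again with exact fractions):

```python
from fractions import Fraction

odds = Fraction(1, 100) / Fraction(99, 100)   # prior odds = 1/99

# Each head multiplies the odds by the likelihood ratio
# P(heads | scumbag) / P(heads | fair) = 1 / (1/2) = 2.
for _ in range(10):
    odds *= 2

assert odds == Fraction(1024, 99)
prob = odds / (1 + odds)              # transform back to a probability
assert prob == Fraction(1024, 1123)   # ~91%
```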

In fact, what I think makes Bayesian inference hard is that it involves that
subtle context switch between _probability_ and _odds_.

    
    
        If B is observed, the posterior _odds_ of A are k times the prior _odds_ 
        of A, where k is the ratio of the _probabilities_ 
        P(B | A) vs. P(B | ~A) (the likelihood ratio). 
    

What makes Bayesian inference neat is that you can run it as an online process
(you can look at events one at a time). If observed events (signals) are
independent, you don't need to process the corpus as a whole, and order
doesn't matter. That becomes nice when you factor in concurrency and
distributed computation. From these principles, you can also derive logistic
regression (multiplication in the log-odds space is a vertical shift along an
S-shaped logistic curve) and you start to see why the logistic curve comes up
so often in machine learning.

Ok, so what about Bayesian statistics? Essentially, it comes from the idea
that:

    
    
        (1) There is some prior probability distribution [0] over the possible states S 
        of affairs, which you cannot observe directly but whose relative 
        probabilities you can estimate based on observed events E. 
    
        (2) For each event E observed, compute unnormalized posteriors according to 
        posterior_unnormalized(S) = prior(S) * likelihood(E | S).
    
        (3) To compute normalized posteriors, divide each of the unnormalized ones 
        by the sum (or integral) of the unnormalized posteriors over all states S. 
        You now have a sum of 1, which makes it a legitimate probability distribution. 
    

[0] Regarding (1), picking priors is more of an art than a science, which is
why some people distrust Bayesian methods. The good news is that if you have a
lot of evidence and your priors are reasonable (few assumptions / Ockham's
Razor) you will converge to something close to the right answer with enough
evidence, regardless of priors.

Now, many actual Bayesian inference methods ignore (3). It's computationally
expensive (when there aren't two possible states S, but millions to infinitely
many) to normalize and often we don't need to do it. For example, if you're
doing classification between two classes Q and R-- with "events" being
features of what you're trying to classify-- then you generally only care
about which is more likely, not whether there's specifically a 23.75% chance
of Q. So the normalization step is generally either skipped (if relative
probabilities are all that matters) or procrastinated to the end of the
analysis. This "feels wrong" at first but it actually works, and is often
better (numerical stability) for the accuracy of the conclusions.
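
Here's the whole recipe in sketch form (helper names and numbers are made up;
note the classification decision is the same with or without step (3)):

```python
def unnormalized_posterior(prior, likelihood, event):
    """Step (2): reweight each state by how well it predicts the event."""
    return {s: p * likelihood(event, s) for s, p in prior.items()}

def normalize(weights):
    """Step (3): divide by the total so the weights sum to 1."""
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Toy two-class problem with an invented likelihood table:
prior = {"Q": 0.5, "R": 0.5}
def likelihood(event, state):
    return {("f", "Q"): 0.8, ("f", "R"): 0.2}[(event, state)]

weights = unnormalized_posterior(prior, likelihood, "f")
norm = normalize(weights)
# Which class wins doesn't depend on step (3):
assert max(weights, key=weights.get) == max(norm, key=norm.get) == "Q"
```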

~~~
ajanuary
Typo, you have Scumbag heads twice in your grid.

~~~
michaelochurch
Good catch. Thanks.

------
carlob
Shameless plug:

Mathematica knows about Bayes' rule and makes reasoning about it quite simple.
If you look here [0] you'll find Bayes' rule in the section: "Properties and
relations".

What makes it extremely powerful is that it's fully symbolic so you can do
things like:

    
    
        Probability[x^2 > a \[Conditioned] x < b, x \[Distributed] NormalDistribution[m, s]]
    

and have it evaluate to a bunch of Erf…

[0]
[http://reference.wolfram.com/mathematica/ref/Probability.html](http://reference.wolfram.com/mathematica/ref/Probability.html)

------
gizmo686
As much as I like Haskell, if your goal is to make Bayes' rule as simple as an
if statement, I doubt you can do better than quantum computer emulators.

Say the probability of being a user is 1%. You can initialize your variable,
X, to a superposition with a 99% chance of being clean and a 1% chance of
being a user.

Now, the function f(x) for the test is such that:

    f(user)  = (99% positive) + (1% negative)
    f(clean) = (1% positive) + (99% negative)

Take y=f(x). Now, look at the quantum state of your machine (this is why we
are on an emulator). Each of the 4 potential states has a certain probability
associated with it.

The (theoretical) slowdown of using a quantum emulator is the fact that as the
size of your memory increases, you have exponentially more states to deal
with. But, because we would need to deal with the same number of states
anyway, this is not actually relevant.

I should also mention that using a real quantum computer does not provide a
(trivial) speedup to this process. This is because, while the quantum state
contains all 4 possibilities with some probability, when we go to measure it,
we will only see 1 of the 4 states. Not only will this not give us a
probability, but we would need to re-run the computation to measure it again.
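
You can mimic the emulator classically by tracking one weight per joint
configuration (a plain-probability sketch of the idea; a real quantum
emulator would track complex amplitudes instead):

```python
from itertools import product

# The emulator's "state": one weight per (x, y) configuration,
# using the 1% base rate and 99%-accurate test from above.
p_x = {"user": 0.01, "clean": 0.99}
p_y_given_x = {
    "user":  {"pos": 0.99, "neg": 0.01},
    "clean": {"pos": 0.01, "neg": 0.99},
}
state = {(x, y): p_x[x] * p_y_given_x[x][y]
         for x, y in product(p_x, ["pos", "neg"])}

# Reading off P(user | pos) is two lookups and a normalization:
pos_mass = state[("user", "pos")] + state[("clean", "pos")]
p_user_given_pos = state[("user", "pos")] / pos_mass
# exactly 0.5 here: a 99%-accurate test with a 1% base rate is a coin flip
```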

------
willvarfar
I would imagine that in real life you track the reliability of a drugs test as
% false positive and % false negative, rather than a single % reliable?

------
wfunction
Is it just me who doesn't understand what the body of the article had to do
with Google or Microsoft?

~~~
sordina
What people are referring to with respect to Google is that if higher-level
statistical operations are as easy as an if statement, then this may explain
why Google takes (presumably amorally) such a statistical approach to its
customers (because it is easy to).

~~~
wfunction
But what did any of that have to do with Microsoft, or even Google's
practices?

The entire article was a tutorial on Bayes's rule, with a completely
irrelevant introductory paragraph thrown in at the top. Nothing about Google
or Microsoft's practices was actually mentioned anywhere.

~~~
sordina
Yes, the article was about simplifying statistical operations. The comments
were inferring that some observed practices may have followed as a
consequence. I'd prefer to discuss the meat of the article, but it's not
unusual to see comments go off on a tangent like this.

~~~
wfunction
What annoys me is when people write irrelevant provocative introductions just
to attract attention to their otherwise not so interesting writings.

~~~
dllthomas
In this case, it was a _quote_ of a _relevant_ provocative sentence, where the
provocation itself was irrelevant. Really, the title was far more intriguing
than the quote, but unfortunately the article didn't quite live up to it.

------
wfunction
> "What if Bayes' rule was as simple as an if statement?"

As the article clearly showed, this is wishful thinking.

And there's one simple reason why:

- Bayesian inference is generally computationally intractable.

- "If" statements are quite tractable.
\- "If" statements are quite tractable.

Any language in which you could perform Bayesian inference as easily as you
perform "if" statements would result in horrifyingly slow code and become
useless.

~~~
gizmo686
As I pointed out elsewhere in this thread, quantum computer languages
literally do this.

Also, if you want to perform Bayesian inference, then it is equally difficult
computationally whether or not it can be expressed as simple 'if' statements.
However, in terms of code simplicity, expressing it as simple if statements is
easier.

~~~
wfunction
> Quantum computer languages literally do this

Quantum computers also make the sun rise every morning.

I can claim anything about what they do, because they don't exist.

~~~
gizmo686
Even if quantum computers do not exist, a theoretical model of them does
exist, so we can discuss them the same way we discuss a Turing machine.
(Although we might need to clarify which model we are talking about.)

However, that is irrelevant, because quantum computers do exist. For example,
in 2012, researchers factored the number 21. [1].

However, even the existence of quantum computers is irrelevant, because
programming languages designed for them still exist. [2] Not only that, but
you can still run programs written in such languages on a classical computer.

Of course, you could also do this type of Bayesian programming in a language
designed for that, instead of one designed for quantum computation.

[1] <http://arxiv.org/pdf/1111.4147v2.pdf> [2]
<http://tph.tuwien.ac.at/~oemer/qcl.html>

------
auggierose
This is funny, because Microsoft Research actually works on that.

[http://link.springer.com/chapter/10.1007%2F978-3-642-19718-5_5](http://link.springer.com/chapter/10.1007%2F978-3-642-19718-5_5)

