

What’s Wrong with Probability Notation? - yarapavan
http://lingpipe-blog.com/2009/10/13/whats-wrong-with-probability-notation/

======
ramanujan
Huh. What I was expecting to see here was a critique of the notation for
expressing causality (and then expecting to cite Judea Pearl's do notation).

But regarding the post...

1) Subscripts basically solve the problem of the identical p's.

2) For iterated expectations, again, subscripts solve that problem as well.

E_X[X^Y]

indicates that it's the expectation over X.

3) However, for dummy variables it definitely can be annoying to use P(X=x),
especially when writing stuff by hand. Your mental dialogue is saying "x
equals x" and it's often very important to distinguish the variable from the
value during manipulation.

That's why I tend to use a different letter for the dummy variable -- P(X=k)
when X is discrete and either f_X(u) or P(X \in [u-du,u+du]) when X is
continuous.
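To make the subscript-plus-dummy-variable convention concrete, here is a sketch combining points (2) and (3), assuming for the last step that X and Y are independent, with Y discrete and X continuous:

```latex
% Tower property with explicit subscripts: outer expectation over Y,
% inner over X; dummy variables k (discrete) and u (continuous) keep
% each random variable distinct from the values it takes on.
\mathbb{E}\left[X^Y\right]
  = \mathbb{E}_Y\!\left[\,\mathbb{E}_X\!\left[X^Y \mid Y\right]\right]
  = \sum_k \left(\int_{-\infty}^{\infty} u^k f_X(u)\,du\right) P(Y=k)
```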

~~~
caffeine
I don't really see the problem in the first place: we're just writing less by
assuming that the subscript is identical to the argument (except maybe for
capitalization) - and if it's not, people usually disambiguate by adding the
subscripts back in.

You would have to be a twisted soul to write P(x|y) if x is drawn from r.v. Y
and y from r.v. X and P[Y|X] is the distribution in question.

~~~
ramanujan
> You would have to be a twisted soul to write P(x|y) if x is drawn from r.v.
> Y and y from r.v. X and P[Y|X] is the distribution in question.

Sure, that would be perverse. What I was referring to was more the fact that
capital X and lowercase x look similar on the page and (more importantly)
sound similar in my head.

Pedagogically I've found that saying "X takes on the value x" confuses a LOT
of undergraduates.

Also, at least for me, if I start slinging around RVs and need to get closed
form solutions, it _can_ start to be very important to distinguish between X
and the value it takes on as k or u, particularly when trying to do
conditional expectations or get explicit distributions on functions of
multiple RVs. It's sort of an aural Hungarian notation.

~~~
caffeine
Sorry to re-open this so late, but I thought about this discussion today as I
was reading a paper which had a typo in the subscript specifying a
distribution: <http://dx.doi.org/10.1103/PhysRevLett.103.138101> (it never
ceases to amaze me that people get away with publishing glaring errors like
that one).

In eq. (4), they specify P_{k|v}[x|v]. Thankfully, here, it's easy to spot the
typo, because k is discrete and x is continuous. But this made me realize that
my objection to these subscripts is really akin to wanting to write DRY,
self-commenting code.

The fundamental information is already there in the equation. Adding extra
subscripts is then like adding unnecessary comments to code - if they're
right, they just add redundancy but maybe help the uninitiated; but if they're
wrong, they're infinitely worse than having put nothing at all (someone who
didn't know that PRL lets all kinds of crap fly could really be thrown for a
loop figuring out how equ. 4 is possible).

> Pedagogically I've found that saying "X takes on the value x" confuses a LOT
> of undergraduates.

I haven't taught this to anyone, so your experience is more valuable than mine
here. Nonetheless - I noticed in my undergraduate statistics class that
programmers (i.e. people who are accustomed to obtuse rules regarding case
sensitivity) had no problem with this, while other people accustomed to
playing fast and loose with notation (economists and physicists in particular)
were somewhat put off.

------
anshul
There is nothing really wrong with probability notation. As inferential steps
between concepts increase in math, abuse of notation becomes indispensable.

All that probability shorthand can be unambiguously translated to formal
definitions quite easily. But doing so would be analogous to writing a complex
program in assembly - doable (and defined pretty much by the very fact that
this is doable) but not very productive (and thus not worth doing unless you
are debugging or something).

~~~
ramanujan
> All that probability shorthand can be unambiguously translated to formal
> definitions quite easily. But doing so would be analogous to writing a
> complex program in assembly - doable (and defined pretty much by the very
> fact that this is doable) but not very productive (and thus not worth doing
> unless you are debugging or something).

Actually I kind of disagree here.

With R or Haskell you can easily work directly with probability densities
learned from data. One frequently uses the exact Bayes' rule expression with
P(X), P(Y), and P(X|Y) all being known functions to get P(Y|X).
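That workflow is just as easy to sketch in Python (a toy discrete example with made-up distributions, not anything learned from data):

```python
# Discrete Bayes' rule with P(Y), P(X|Y), and hence P(X) all available
# as honest functions, yielding P(Y|X) as another function.
# All distributions here are invented for illustration.
ys = ["spam", "ham"]
p_y = {"spam": 0.4, "ham": 0.6}            # prior P(Y)
p_x_given_y = {                            # likelihood P(X|Y)
    "spam": {"offer": 0.7, "meeting": 0.3},
    "ham":  {"offer": 0.2, "meeting": 0.8},
}

def p_x(x):
    # marginal: P(X=x) = sum_y P(X=x|Y=y) P(Y=y)
    return sum(p_x_given_y[y][x] * p_y[y] for y in ys)

def p_y_given_x(y, x):
    # Bayes' rule: P(Y=y|X=x) = P(X=x|Y=y) P(Y=y) / P(X=x)
    return p_x_given_y[y][x] * p_y[y] / p_x(x)

print(p_y_given_x("spam", "offer"))  # posterior, approximately 0.7
```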

See for example functions like ecdf, which takes in an N-vector of points on
the real line and returns an actual _function_ , namely the empirical
cumulative distribution function.

<http://stat.ethz.ch/R-manual/R-patched/library/stats/html/ecdf.html>

Can be very handy when you want empirical quantiles (e.g. "what percentage of
the time do I expect to see 12000 hits in a day, given this single column with
the hits for each of the last 200 days").
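For anyone without R handy, an ecdf-like function is only a few lines of Python (a sketch of the same idea; the sample data is left to the caller):

```python
import bisect

def ecdf(sample):
    # Empirical CDF as an actual function, like R's ecdf():
    # F(t) = fraction of observations <= t.
    s = sorted(sample)
    n = len(s)
    def F(t):
        # bisect_right counts how many sorted observations are <= t
        return bisect.bisect_right(s, t) / n
    return F
```

Given a column of daily hit counts, `ecdf(hits)(12000)` is then the empirical fraction of days with at most 12000 hits.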

~~~
caffeine
I don't really understand why what you said disagrees with what your parent
said?

~~~
ramanujan
Perhaps I read too quickly -- when he said:

> All that probability shorthand can be unambiguously translated to formal
> definitions quite easily. But doing so would be analogous to writing a
> complex program in assembly

One possible interpretation (probably, in retrospect, the right one) is that
he meant that Whitehead/Russell style axiomatization of probability was in
theory possible, but would not be of much value.

I read it initially (likely wrongly in retrospect) as saying that translating
the equations into an unambiguous formal computer readable definition would be
intractable and/or only of theoretical interest.

------
BobCarpenter
I thought I'd jump in as the author of the original post.

The context is that I'm trying to write an introduction to Bayesian stats for
people who know calc and matrices, but may not have taken or understood math
stats. Specifically, I want to (a) use the notation that's commonly used in
the field (e.g. in Andrew Gelman et al.'s books, Michael Jordan et al.'s
papers, etc.), and (b) not confuse readers with a long introduction to sample
spaces and a sketchy description of measures, just so I could introduce
precise random variable notation only to abuse it.

The big problem with trying to define continuous densities is you never get
enough measure theory in an intro to probability (e.g. DeGroot and Schervish,
Larsen and Marx) to bottom out in a real definition. It's not that complex, so
if you're interested, I'd highly recommend Kolmogorov's own intro to analysis,
which has great coverage of both Lebesgue integration (so you can understand
the usual R^n case) and general measure theory (so you can impress your
friends with your knowledge of analysis).

------
psyklic
I don't think that the author's points are valid. He seems to be using
references with sloppy notation.

Suffice it to say that if he picks up a mathematical probability textbook he
should be satisfied.

* I do agree that people use shortcuts to make equations seem simpler, that some standard equations look complicated, and that you need to think hard about which scenario is appropriate for your application.

~~~
psyklic
Just a quick rebuttal of the author's specific points:

(1) Given a set of elements X, E(X) = \sum_{x \in X} x p(x). The problem the
author mentioned is solved, since we are now summing over all elements in X
rather than using the input variable inappropriately.

(2) Given sets of elements X and Y, and the set of ALL elements O, then p(X),
p(Y), and p(X|Y) are all computed in the same manner. p(X) is shorthand for
p(X|O) -- so we are now given three analogous functions, p(X|O), p(Y|O), and
p(X|Y). So, Bayes' can be used to compute all three in the exact same manner,
if you so wish.

The above rebuttals are obviously discrete, but there are analogous continuous
variable scenarios.
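The p(X) = p(X|O) identity in (2) is just the definition of conditional probability applied to the whole space O:

```latex
% Conditioning on the entire sample space O changes nothing,
% since p(O) = 1 and X \cap O = X for any event X.
p(X \mid O) = \frac{p(X \cap O)}{p(O)} = \frac{p(X)}{1} = p(X)
```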

------
nova
Oh yes, I agree. Standard probability notation is handy and fast but just
depends too much on the context.

I wonder if someone has created a more orthogonal notation for probability,
like Sussman did with the Schemish/functional notation for differential
geometry.

~~~
eru
Any good pointers to Sussman's notation?

~~~
gwern
<http://mitpress.mit.edu/sicm/book.html>

~~~
eru
Thanks.

------
catzaa
The author of the article already abused the notation before he said what is
wrong with it. P(A) usually denotes the probability function and p(x) is a
probability density function.

The problem is that the distinction between events and variables isn't always
clear.

------
kvh
The P in P(X) and P(Y) IS actually the same P. It is the probability measure
on the underlying sample space. X and Y are random variables mapping from that
sample space to the real line. P(X=x) is shorthand for P(X^{-1}({x})), the
measure of the preimage of {x}.
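A toy illustration of that measure-theoretic view (two fair dice as the sample space, X = the sum; all names here are invented for the example):

```python
from itertools import product
from fractions import Fraction

# Sample space: ordered pairs of dice rolls, each equally likely.
omega = list(product(range(1, 7), repeat=2))

def X(w):
    # Random variable: a function from the sample space to the reals.
    return w[0] + w[1]

def P(event):
    # Probability measure on (subsets of) the sample space; uniform here.
    return Fraction(len(event), len(omega))

# P(X = 7) is really P applied to the preimage X^{-1}({7}).
preimage = [w for w in omega if X(w) == 7]
print(P(preimage))  # 1/6
```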

------
RyanMcGreal
I was hoping to see something like "It's unambiguous only around three-
quarters of the time." Alas.

