The assumption is that there's some constant underlying probability p that a random person will rate a given thing positively. If we observe, for instance, 4 positive and 5 negative reviews or votes, there's a probability distribution (known as a Beta distribution) which tells us what the plausible values of p are given the votes we observe: it's proportional to p^4 (1-p)^5. graph: https://www.google.com/search?q=x%5E4+(1-x)%5E5%20from%200%2...
Now if we observe 40 and 50, respectively, the curve looks like this:
(I had to do it in the log domain because Google's grapher underflows otherwise -- the 20 is just to make the numbers big enough to graph. The more correct thing involves gamma functions and that just gets in the way right now)
The more you observe, the more sharply peaked the likelihood function is. The funky equation in the article is an approximation to the confidence interval of that graph -- 95% of the probability mass is said to be within those bounds.
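To put numbers on that sharpening, here's a quick sketch using scipy.stats.beta (the 400/500 case is just an extra illustrative data point, not from the graphs above):

```python
# With a uniform prior, U upvotes and D downvotes give a Beta(U+1, D+1)
# posterior for p; its standard deviation shrinks as the counts grow.
from scipy.stats import beta

for u, d in [(4, 5), (40, 50), (400, 500)]:
    dist = beta(u + 1, d + 1)
    print(f"{u}/{d}: mean = {dist.mean():.3f}, std = {dist.std():.3f}")
```

Each tenfold increase in the vote count shrinks the spread by roughly a factor of sqrt(10), which is exactly the "more sharply peaked" effect.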
It's not a great approximation, for one thing because the graph is skewed (try it with 10/50), and it assumes that the mean sits exactly in the middle of the confidence interval. The correct computation involves inverting a messy integral called the incomplete beta function. SciPy's scipy.special module includes betaincinv, which solves this exactly:
>>> import scipy.special
>>> scipy.special.betaincinv(5,6, [0.025, 0.975])
array([ 0.18708603, 0.73762192])
would be the 95% confidence interval for 4 positive and 5 negative votes;
>>> scipy.special.betaincinv(41,51, [0.025, 0.975])
array([ 0.34599562, 0.54754792])
for 40 and 50, respectively.
[edit: apologies, I had to run and get ready for work -- I didn't really have time to make this very comprehensible; but I just now fixed a bug in my confidence interval stuff above]
And to make the description slightly more accurate, at the expense of more complexity: "What number are we 80% certain the approving percentage will exceed?"
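In Beta-distribution terms that question is just a quantile: the number we're 80% certain p exceeds is the 20th percentile of the posterior. A sketch (lower_bound is my own name, nothing standard; same uniform prior as above):

```python
import scipy.special

def lower_bound(up, down, certainty=0.8):
    # The (1 - certainty) quantile of Beta(up + 1, down + 1):
    # p exceeds this value with probability `certainty`.
    return scipy.special.betaincinv(up + 1, down + 1, 1.0 - certainty)

print(lower_bound(4, 5))    # loose bound: few votes
print(lower_bound(40, 50))  # higher bound: same ratio, more votes
```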
Anyway, I personally think 95% confidence intervals are a crutch. The correct Bayesian approach is to consider two items, each with its own up and down votes, integrate over all possible values of p1 and p2 (the underlying upvote probabilities for items 1 and 2, respectively) given the observed data, and compute the probability that p1 exceeds p2.
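That double integral is easy to approximate by sampling from the two posteriors; here's a Monte Carlo sketch (the function name and vote counts are mine):

```python
# Estimate P(p1 > p2) by sampling from each item's Beta posterior;
# Monte Carlo stands in for the double integral over p1 and p2.
import numpy as np

def prob_superior(u1, d1, u2, d2, n=100_000, seed=0):
    rng = np.random.default_rng(seed)
    p1 = rng.beta(u1 + 1, d1 + 1, n)
    p2 = rng.beta(u2 + 1, d2 + 1, n)
    return (p1 > p2).mean()

# 40/50 vs 4/5: nearly identical ratios, so this should land near 0.5.
print(prob_superior(40, 50, 4, 5))
```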
How to turn that into an actual ranking function? No idea. I doubt it would work, but you could compute each item's likelihood of superiority against a benchmark distribution (e.g. the uniform 0-1 distribution).
If you do that, it probably turns out that your ranking function is the mean of the Beta distribution, which is simple: (U+1)/(U+D+2) where U and D are the upvote/downvote counts [note: we started with the prior assumption that p could be anywhere between 0 and 1, uniformly]. Basically, the counts shrink towards 1/2 by 1. This is a hell of a lot less complicated, and it achieves the goal of ranking different items by votes pretty well with more votes being better.
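For concreteness, a sketch of that rule (the vote counts are made up):

```python
def score(up, down):
    # Posterior mean of Beta(up + 1, down + 1) under a uniform prior:
    # the raw ratio with both counts shrunk toward 1/2 by one vote each.
    return (up + 1) / (up + down + 2)

print(score(4, 5))    # 0.4545...
print(score(40, 50))  # 0.4456...
print(score(9, 1))    # 0.8333...
```

Note how 40/50 scores slightly below 4/5 at the same ratio: the shrinkage toward 1/2 matters less as the counts grow.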