
This is called a binomial process. We want to estimate the true proportion of exits for a given city. A city has x exits out of n trials, so one estimate of this proportion is just x/n. However, if you have a bunch of cities like this list, you're going to have cities that by random chance end up with a fraction close to 1. That doesn't mean that startups there are guaranteed to succeed. It means that if you flip a coin 5 times, and replicate that 1000 times, you're going to have some runs of 5 heads and some runs of 5 tails. If you kept flipping, you would find the proportion approaching 1/2 (for a fair coin) in every case.
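A quick way to see this is to simulate it. The following is a minimal Python sketch, assuming a fair coin; the function name `simulate_runs` and the seed are made up for illustration.

```python
import random

def simulate_runs(flips=5, replicates=1000, p=0.5, seed=0):
    """Return the number of heads in each replicate of `flips` coin tosses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(flips)) for _ in range(replicates)]

counts = simulate_runs()
all_heads = sum(c == 5 for c in counts)  # runs that look like a "5/5" city
all_tails = sum(c == 0 for c in counts)  # runs that look like a "0/5" city
print(f"runs with 5 heads: {all_heads}, runs with 5 tails: {all_tails}")

# With many more flips per run, the observed proportion settles near 1/2.
long_run = simulate_runs(flips=100_000, replicates=1)[0] / 100_000
print(f"proportion of heads over 100000 flips: {long_run:.3f}")
```

Each run has a 1/32 chance of coming up all heads, so out of 1000 runs you should expect roughly 30 of them to look "perfect" even though the coin is fair.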

So the problem is how you compare a city with 5/6 exits, like Branford, CT USA, with one at 143/208, like Mountain View. Is Branford that much better because 5/6 > 143/208? Mostly all you know is that the error in your estimate is much larger for Branford than for Mountain View, because your value of n is 6 vs 208. You can't say with statistical confidence that Branford is better.

So one trick to punish the small-n locations is to do some smoothing. Laplace smoothing adds 1 to every outcome: one success is added to x and one failure as well, meaning we add 2 to n. That also means that nothing gets exactly to 0.0 or 1.0. The odds in Saint Petersburg, Russia aren't really 0.0 just because they were 0/11. There is some chance you could succeed, so smoothing gives you a better estimate of "unseen events".

The next thing you want to do is look at confidence intervals, rather than a point estimate like x/n or even (x+1)/(n+2). There are a number of formulas you can use; I used one built into R, a statistical modelling language. This gives you a lower and an upper bound on the true proportion. If the bounds are exact, then 95% of the time the interval will contain this true, unknown proportion.
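For anyone without R at hand, here is a minimal Python sketch of one such formula, the exact (Clopper-Pearson) interval, found by bisection on the binomial tail. Which R function was used is not stated here, so treating it as this interval is an assumption, and the helper names are made up for illustration.

```python
from math import comb

def tail_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def solve_increasing(f, target, iters=60):
    """Bisection on [0, 1]: find p with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_interval(x, n, conf=0.95):
    """Clopper-Pearson interval for x successes out of n trials."""
    alpha = 1 - conf
    # Lower bound: the p at which seeing x or more successes has probability alpha/2.
    lower = 0.0 if x == 0 else solve_increasing(
        lambda p: tail_at_least(x, n, p), alpha / 2)
    # Upper bound: the p at which seeing x or fewer successes has probability alpha/2,
    # i.e. seeing x + 1 or more has probability 1 - alpha/2.
    upper = 1.0 if x == n else solve_increasing(
        lambda p: tail_at_least(x + 1, n, p), 1 - alpha / 2)
    return lower, upper

# Smoothed counts (x + 1, n + 2) for Branford, CT
lb, ub = exact_interval(6, 7)
print(f"Branford: {lb:.3f} to {ub:.3f}")
```

Passing `conf=0.90` or `conf=0.80` gives the narrower intervals.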

The bounds on my smoothed counts are:

                             x+1     n+2     lb      (x+1)/(n+2)   ub
     Mountain View, CA USA   144     209     0.621   0.689         0.751
     Branford, CT USA        6       7       0.421   0.857         0.996
     Los Angeles, CA USA     56      180     0.244   0.311         0.384
So the true proportion for Mountain View is somewhere between 0.621 and 0.751, while Branford's is between 0.421 and 0.996. Since these intervals overlap, we can't really say one is better than the other. Also consider LA, which has a range of 0.244 to 0.384. Since 0.384 < 0.421, we could say that LA has a worse exit ratio than either Branford or Mountain View, with 95% confidence.
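The comparison rule used here (two cities are distinguishable only when their intervals don't overlap) can be sketched in Python using the bounds from the table. The function name `intervals_overlap` is made up for illustration.

```python
def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# (lower bound, upper bound) from the table above
bounds = {
    "Mountain View": (0.621, 0.751),
    "Branford": (0.421, 0.996),
    "Los Angeles": (0.244, 0.384),
}

names = list(bounds)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        if intervals_overlap(bounds[first], bounds[second]):
            verdict = "overlap, can't separate them"
        else:
            verdict = "no overlap, the difference is credible"
        print(f"{first} vs {second}: {verdict}")
```

Non-overlap is a conservative test: two intervals can overlap slightly even when a direct test of the difference would come out significant.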

To sort, it is often good to be conservative and use the lower bound. I used a 95% interval, which is good for saying Branford is better than LA, but might be a bit wide for sorting. You could use a 90% or even 80% interval for that, if desired.
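To see what sorting by the lower bound changes, here is a short Python sketch that ranks the three cities from the table both ways. The tuple layout and variable names are made up for illustration.

```python
# (city, lower bound, smoothed point estimate), copied from the table above
rows = [
    ("Mountain View, CA USA", 0.621, 0.689),
    ("Branford, CT USA", 0.421, 0.857),
    ("Los Angeles, CA USA", 0.244, 0.311),
]

by_point = [r[0] for r in sorted(rows, key=lambda r: r[2], reverse=True)]
by_lower = [r[0] for r in sorted(rows, key=lambda r: r[1], reverse=True)]

print("by point estimate:", by_point)
print("by lower bound:   ", by_lower)
```

Ranked by point estimate, Branford comes first; ranked by lower bound, Mountain View moves ahead of it, because Branford's n of 7 leaves its lower bound far below its point estimate.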

It is really crucial to take into account what you don't know when comparing fractions based on different values of n.

Hope this helps...

(10 days later...)

It does. Thank you.
