Bootstrapped – A Python library to generate confidence intervals (github.com/facebookincubator)
94 points by jimarcey on Feb 22, 2017 | hide | past | favorite | 30 comments


There are better algorithms for bootstrap intervals that you should perhaps look into. Better in the sense of quality, not speed.

Google e.g. "interval BCa"


Thanks for the feedback, Petters! I agree in principle. I am familiar with that method. The use case for this is situations where you have large initial sample counts (so the correction should matter less; we do throw warnings when the initial sample counts are low). We also provide tools to check power (I'll commit an example of this later today).

Also - I gladly accept diffs if you are motivated. It is not clear to me that BCa and other variants provide substantial improvement for most practical situations. I would invite criticism here.

Tldr - thanks for the feedback


From the project README:

How bootstrapped works (tl;dr): percentile-based confidence intervals from bootstrap re-sampling with replacement.
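The README's one-liner can be illustrated in a few lines of numpy. This is a minimal sketch of the percentile method, not the library's actual code; the function name and signature are mine:

```python
import numpy as np

def percentile_bootstrap_ci(x, stat=np.mean, n_boot=10_000, alpha=0.05, rng=None):
    """Percentile bootstrap CI: resample with replacement, take quantiles
    of the bootstrap statistics directly as the interval endpoints."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    # Draw every bootstrap sample's indices in one call: shape (n_boot, len(x))
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_stats = stat(x[idx], axis=1)
    lo, hi = np.quantile(boot_stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```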

---

MIT OCW 18.05 has this to say about the technique:

https://ocw.mit.edu/courses/mathematics/18-05-introduction-t...

The bootstrap percentile method is appealing due to its simplicity. However, it depends on the bootstrap distribution of mean(x') based on a particular sample being a good approximation to the true distribution of mean(x). Rice says of the percentile method, “Although this direct equation of quantiles of the bootstrap sampling distribution with confidence limits may seem initially appealing, its rationale is somewhat obscure.”

In short, don’t use it.

Use the empirical bootstrap instead (we have explained both in the hopes that you won’t confuse the empirical bootstrap for the percentile bootstrap).

---

Updated to reflect suggestions in comments below.
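The empirical (a.k.a. pivotal or basic) bootstrap that 18.05 recommends takes quantiles of the deviations of the bootstrap statistics from the point estimate, then flips them around that estimate, rather than using the bootstrap quantiles directly. A minimal numpy sketch; the function name and signature are mine, not the library's:

```python
import numpy as np

def empirical_bootstrap_ci(x, stat=np.mean, n_boot=10_000, alpha=0.05, rng=None):
    """Empirical/pivotal bootstrap CI: quantiles of the deviations
    delta* = theta* - theta_hat, reflected around theta_hat."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    theta_hat = stat(x)
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    deltas = stat(x[idx], axis=1) - theta_hat
    d_lo, d_hi = np.quantile(deltas, [alpha / 2, 1 - alpha / 2])
    # Note the reversal: subtract the upper deviation to get the lower limit.
    return theta_hat - d_hi, theta_hat - d_lo
```

For symmetric bootstrap distributions the two methods agree closely, which is why the thread treats the swap as a few-line change.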


I'm not a big fan of the percentile bootstrap, but that reference is a little too cavalier with material that deserves to be treated more rigorously. Chapter 8 of Wasserman's "All of Statistics" is much more careful about outlining the conditions under which the percentile bootstrap will work. Moreover, he works through a specific example that demonstrates that the percentile bootstrap does not generate results that are profoundly different from other methods.


John, you are a true wizard. I admire you & will work to incorporate your feedback (gathered offline) into the library =)

Thanks for the feedback!


Just to make sure your comment is clear: the powerful portion of your quote, "In short, don't use it.", is strictly intended for the percentile bootstrap. Immediately after, the paper says "Use the empirical bootstrap instead..."


Yep! You are right. I don't think it is a huge disparity but I would like to implement the pivotal/empirical bootstrap instead. The change is just a few lines of code.


Is there a reason you chose the percentile instead of something like the pivotal (and for not adding the options for other intervals)? Is it solely for simplicity?


Excellent q - we intend to add other options as we go, pivotal being one. I'd also like to add permutation tests. If you have ideas we welcome diffs =)


Would love to contribute. I'll try to put some work in this weekend when I have the time.


<3


While on the topic of confidence intervals, has anyone encountered a package capable of generating multinomial confidence intervals similar to CRAN's MultinomialCI? I've yet to find a Python solution.
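Not aware of a drop-in Python port, but for rough per-category intervals a plain bootstrap works in a few lines of numpy. To be clear, this is NOT the Sison-Glaz method that MultinomialCI implements, and it makes no simultaneous-coverage adjustment; the function name is mine:

```python
import numpy as np

def multinomial_bootstrap_cis(counts, n_boot=10_000, alpha=0.05, seed=None):
    """Per-category percentile bootstrap CIs for multinomial proportions.
    NOTE: marginal intervals only; not Sison-Glaz, no simultaneous coverage."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    n = counts.sum()
    p_hat = counts / n
    # Resample counts from the fitted multinomial: shape (n_boot, k)
    boot = rng.multinomial(n, p_hat, size=n_boot) / n
    # Returns one (lo, hi) row per category: shape (k, 2)
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2], axis=0).T
```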


Neat!

OP, how does this compare to scikits.bootstrap [1] feature/performance-wise?

[1] https://scikits.appspot.com/bootstrap


That uses the BCa method, which in some situations is better.

This library gives you a/b test functionality and should be faster on large input datasets.


It's a nice wrapper on a powerful technique. Could be very useful to some folks - but requiring numpy and pandas is kind of excessive.


Yeah, I'm not sure that this is a fair comment: how would _you_ avoid the need for pandas and numpy? Besides, in most projects where you're interested in confidence intervals you'll probably already have both imported for other functionality anyway.


Accumulating percentile data isn't a very difficult technique. I did the same thing on arduino using nesC so it is surely doable without pandas or numpy.


Completely agree it is easy. Doing it quickly (for Python) is what this is optimized for. We would love a contribution if you have a method for resampling + percentiles that beats numpy.


I will surely give it a look if I have a need for this. Until then, thanks for the contribution and the invitation, and I have nothing against numpy.


Thanks for the feedback - happy hacking =)


TBH, I haven't used python for science in a few years, so maybe numpy is the norm now and I'm showing my age. But when I was doing more python, I wrote bootstrapping, monte carlo and CI code without anything but the standard lib. I probably used pypy to get it fast enough, but if everyone has numpy now then that's definitely the way to go and I retract my comment!

I'm not trying to sound arrogant or anything, if numpy is the standard now then there's definitely no point in reinventing the wheel. (But pandas is still overkill...)


I agree with you! Pandas is only used in the power analysis code (which also pulls in matplotlib for plotting). The best thing would be to pare this down. We would gladly take contributions - I think the path forward on this feedback is clear but will take a little code =)

Most of the important stuff is just numpy, which I feel is pretty fair for most peeps.


pandas is the standard now for most python data munging I've come across. numpy is "low level".


bootstrapping does not require any munging.


I would assume that anyone doing statistical analysis with Python is likely to have numpy and pandas installed already. Has your experience indicated otherwise?


Pretty much any stats work I do in python needs numpy. I don't use pandas as often, but that's because I developed numpy based habits before pandas was really mature, and this is a failing on my part.

"pandas and numpy are imported by default" is a pretty safe assumption in this space, IMO


Thanks for the feedback!

numpy is used to give a speed improvement when generating the bootstrap samples - this would be very slow in a Python for loop.

Pandas is only used in the power analysis code. I'll make that more clear.

Would love more feedback if you have it!
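The speed point is easy to demo: generating bootstrap means in a pure-Python loop versus one vectorized numpy call. A toy comparison, not the library's code; function names are mine:

```python
import random
import statistics

import numpy as np

def boot_means_loop(x, n_boot, seed=0):
    """Pure-Python resampling: one random.choices + mean per iteration."""
    r = random.Random(seed)
    n = len(x)
    return [statistics.fmean(r.choices(x, k=n)) for _ in range(n_boot)]

def boot_means_numpy(x, n_boot, seed=0):
    """Vectorized: draw every bootstrap index in one call, mean along axis 1."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    return x[idx].mean(axis=1)
```

Both produce the same kind of output; on large inputs the numpy version avoids per-iteration interpreter overhead, which is the speed-up being claimed.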


numpy is pretty much a given for any scientific or numeric python code unless you want to start writing things in Cython.


Not a disagreement, but AFAIK NumPy and Cython are not mutually exclusive, as Cython supports NumPy type annotations and can index into its arrays quickly.


I am 87% confident that this has a 74% chance of hitting it off with the HN community.



