Thanks for the feedback, Petters! I agree in principle, and I am familiar with that method. The use case here is situations where you have large initial sample counts (so the correction should matter less; we do throw warnings when the initial sample counts are low). We also provide tools to check power (I'll commit an example of this later today).
Also - I gladly accept diffs if you're motivated. It's not clear to me that BCa and other variants provide substantial improvement in most practical situations; I'd welcome criticism here.
The bootstrap percentile method is appealing due to its simplicity. However, it depends on the bootstrap distribution of mean(x') based on a particular sample being a good approximation to the true distribution of mean(x). Rice says of the percentile method, “Although this direct equation of quantiles of the bootstrap sampling distribution with confidence limits may seem initially appealing, its rationale is somewhat obscure.”
In short, don’t use it.
Use the empirical bootstrap instead (we have explained both in the hope that you won't confuse the empirical bootstrap with the percentile bootstrap).
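To make the distinction concrete, here's a rough sketch of both intervals in plain numpy (the names are illustrative, not any particular library's API):

    import numpy as np

    def bootstrap_cis(x, n_boot=10_000, alpha=0.05, seed=0):
        """Percentile vs. empirical (pivotal) bootstrap CIs for the mean."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x)
        theta_hat = x.mean()
        # Draw all resamples at once and compute their means.
        idx = rng.integers(0, len(x), size=(n_boot, len(x)))
        boot_means = x[idx].mean(axis=1)
        lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        percentile_ci = (lo, hi)
        # Empirical/pivotal: reflect the bootstrap quantiles about the estimate.
        empirical_ci = (2 * theta_hat - hi, 2 * theta_hat - lo)
        return percentile_ci, empirical_ci

The percentile interval reads the quantiles off the bootstrap distribution directly; the empirical interval uses the bootstrap distribution of mean(x') - mean(x) to approximate the error distribution, which is why its quantiles come out reflected.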
I'm not a big fan of the percentile bootstrap, but that reference is a little too cavalier with material that deserves more rigorous treatment. Chapter 8 of Wasserman's "All of Statistics" is much more careful about outlining the conditions under which the percentile bootstrap works. Moreover, he works through a specific example demonstrating that the percentile bootstrap does not produce results profoundly different from other methods.
Just to make sure your comment is clear: the forceful part of your quote, "In short, don't use it," is aimed strictly at the percentile bootstrap. Immediately after, the paper says "Use the empirical bootstrap instead..."
Yep! You are right. I don't think it's a huge disparity, but I would like to implement the pivotal/empirical bootstrap instead. The change is just a few lines of code.
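Roughly - assuming bootstrap means boot_means and a point estimate theta_hat as in the sketch above - the change really is just the reflection step:

    # Percentile: take the bootstrap quantiles directly.
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    # Pivotal/empirical: reflect those quantiles about the point estimate.
    lo, hi = 2 * theta_hat - hi, 2 * theta_hat - lo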
Is there a reason you chose the percentile interval instead of something like the pivotal (and didn't add options for other intervals)? Is it solely for simplicity?
Excellent q - we intend to add other options as we go, pivotal being one. I'd also like to add permutation tests. If you have ideas, we welcome diffs =)
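For the curious, a permutation test here would look something like this sketch (a two-sample test on the difference in means; none of these names are the library's actual API):

    import numpy as np

    def permutation_test(a, b, n_perm=10_000, seed=0):
        """Two-sided permutation test for a difference in means (sketch)."""
        rng = np.random.default_rng(seed)
        a, b = np.asarray(a), np.asarray(b)
        observed = a.mean() - b.mean()
        pooled = np.concatenate([a, b])
        hits = 0
        for _ in range(n_perm):
            # Shuffle the pooled data and re-split into two groups.
            perm = rng.permutation(pooled)
            diff = perm[:len(a)].mean() - perm[len(a):].mean()
            hits += abs(diff) >= abs(observed)
        return hits / n_perm  # approximate two-sided p-value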
While on the topic of confidence intervals, has anyone encountered a package capable of generating multinomial confidence intervals similar to CRAN's MultinomialCI? I've yet to find a Python solution.
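One stopgap, since this thread is about the bootstrap anyway: resample from the fitted multinomial and take per-category percentile intervals. A sketch (note these are marginal intervals, not the simultaneous Sison-Glaz intervals that MultinomialCI computes, and the names are illustrative):

    import numpy as np

    def multinomial_percentile_cis(counts, n_boot=10_000, alpha=0.05, seed=0):
        """Marginal bootstrap percentile CIs for multinomial proportions (sketch)."""
        rng = np.random.default_rng(seed)
        counts = np.asarray(counts)
        n = counts.sum()
        # Resample counts from the fitted multinomial, convert to proportions.
        boot = rng.multinomial(n, counts / n, size=n_boot) / n
        lower = np.percentile(boot, 100 * alpha / 2, axis=0)
        upper = np.percentile(boot, 100 * (1 - alpha / 2), axis=0)
        return lower, upper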
Yeah, I'm not sure that this is a fair comment: how would _you_ avoid the necessity of pandas and numpy? Besides which, in most projects where you're interested in confidence intervals, you'll probably already have both imported for other functionality anyway.
Accumulating percentile data isn't a very difficult technique. I did the same thing on Arduino using nesC, so it is surely doable without pandas or numpy.
Completely agree it is easy. Doing it quickly (for Python) is what this is optimized for. We would love a contribution if you have a method for resampling + percentiles that beats numpy.
TBH, I haven't used Python for science in a few years, so maybe numpy is the norm now and I'm showing my age. But when I was doing more Python, I wrote bootstrapping, Monte Carlo, and CI code with nothing but the standard lib. I probably used PyPy to get it fast enough, but if everyone has numpy now then that's definitely the way to go, and I retract my comment!
I'm not trying to sound arrogant or anything, if numpy is the standard now then there's definitely no point in reinventing the wheel. (But pandas is still overkill...)
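For what it's worth, the standard-lib version I had in mind really is short. A sketch of a percentile bootstrap CI for the mean (statistics.fmean needs Python 3.8+):

    import random
    import statistics

    def stdlib_bootstrap_ci(x, n_boot=10_000, alpha=0.05):
        """Percentile bootstrap CI for the mean, standard library only."""
        # Resample with replacement n_boot times and sort the means.
        boot_means = sorted(
            statistics.fmean(random.choices(x, k=len(x))) for _ in range(n_boot)
        )
        lo = boot_means[int(n_boot * alpha / 2)]
        hi = boot_means[int(n_boot * (1 - alpha / 2)) - 1]
        return lo, hi

It works, but the resampling loop is pure Python, which is exactly where numpy (or PyPy, as above) buys the speed.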
I agree with you! Pandas is only used in the power-analysis code (which also pulls in matplotlib for plotting). The best thing would be to pare this down. We would gladly take contributions - I think the path forward on this feedback is clear but will take a little code =)
Most of the important stuff is just numpy, which I feel is pretty fair for most peeps.
I would assume that anyone doing statistical analysis with Python likely has numpy and pandas installed already. Has your experience indicated otherwise?
Pretty much any stats work I do in Python needs numpy. I don't use pandas as often, but that's because I developed numpy-based habits before pandas was really mature, and that's a failing on my part.
"pands and numpy are imported by default" is a pretty safe assumption in this space, IMO
Not a disagreement, but AFAIK NumPy and Cython are not mutually exclusive, as Cython supports NumPy type annotations and can index into its arrays quickly.
Google e.g. "interval BCa"