Hacker News | e10v_me's comments

I published a practical comparison of Python packages for A/B test analysis: tea-tasting, Pingouin, statsmodels, and SciPy.

Instead of choosing a single "best" tool, I break down where each package fits and how much manual work is needed for production-style experiment reporting.

Includes code examples and a feature matrix covering power analysis, ratio metrics, relative effect CIs, CUPED, multiple testing correction, and working with aggregated statistics for efficiency.

Disclosure: I am also the author of tea-tasting.


I was thinking about the labor market congestion problem and came up with a solution that is often used in service marketplaces: pay to apply. Then I asked myself what this solution has that AI doesn't. That's how I arrived at the analogy that prices act like model weights: they encode market information. An important difference is that prices incorporate signals from dispersed, hard-to-observe data that an AI/ML model may not have access to.

P.S. Paying to apply may sound provocative; it requires thoughtful consideration and careful testing. Payments can be made with platform-issued virtual points available in a limited supply. But here, I focus on why price signals may address this problem better than AI-based screening.


Funny, same here) When I was switching from R (data.table) to Python, it was painful, not only because it was slow, but because of the API. At the time, I thought that maybe it was just the pain of switching to something new. Several years later, switching from Pandas to the Polars API was a real joy (Ibis is also good in that sense). So I learned that it had been Pandas' fault all along))


Congrats! What’s impressive is not just the speed of the tools Astral develops but also the speed of delivery.

I wonder if you plan to extend the functionality for building and publishing packages. For example, support for dynamic versions (from GitHub) and trusted publishers.


Here’s another take on the Dunning–Kruger effect. I made two main points:

1. Consider N independent observations of two variables, X1 and X2, with imperfect correlation. Next, we assign percentiles to them: P1 and P2, respectively. We then take a subsample of size N/4 with the lowest values of P1. It is a statistical fact, not a psychological effect, that in this subsample the average of P1 will always be less than the average of P2. With a large enough sample size, the difference will be statistically significant.

2. A percentile is a measure with bounded support (between 0 and 100). It's not correct to use it as a measure of abilities in this experiment. Participants from the top test score quartile can overestimate their abilities by a maximum of 24 percentile points and an average of 12. Participants from the bottom test score quartile can overestimate by a maximum of 99 percentile points and an average of 87. There is certainly a bias here.
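The first point can be checked with a quick simulation (a minimal NumPy sketch; the correlation of 0.5, the seed, and the sample size are arbitrary assumptions, not the app's exact parameters):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Two imperfectly correlated variables, e.g. test score and self-assessment.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + np.sqrt(1 - 0.5**2) * rng.normal(size=n)

# Assign percentiles P1 and P2 via within-sample ranks.
p1 = x1.argsort().argsort() / (n - 1) * 100
p2 = x2.argsort().argsort() / (n - 1) * 100

# Bottom quartile by P1: its average P1 is below its average P2,
# a purely statistical artifact (regression to the mean).
bottom = p1 <= 25
print(p1[bottom].mean())  # ~12.5
print(p2[bottom].mean())  # noticeably higher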

The first point is not particularly new; there are published papers on this topic. However, not everyone is aware of it, so it's worth mentioning. I also provide a Streamlit app for simulations, along with its source code.

The second point is somewhat novel. At least, I haven't encountered any references to it in the context of the Dunning–Kruger experiment.

I also encourage you to think about the question: if a person scores the maximum number of points in a test, does this mean that they cannot overestimate their abilities?


You would need probability distributions anyway. In Python, SciPy is the most mature and popular package with probability distributions. And it depends on NumPy. But I'll gladly consider better alternatives if you propose them.
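For example, a z-interval for a difference in means needs nothing from SciPy beyond the `norm` distribution (a minimal sketch with made-up aggregated statistics):

```python
from scipy import stats

# Hypothetical aggregated statistics for two variants.
mean_a, var_a, n_a = 10.0, 4.0, 1000
mean_b, var_b, n_b = 10.3, 4.2, 1000

diff = mean_b - mean_a
se = (var_a / n_a + var_b / n_b) ** 0.5

# Two-sided 95% confidence interval based on the normal distribution.
z = stats.norm.ppf(0.975)
ci_lower, ci_upper = diff - z * se, diff + z * se
print(ci_lower, ci_upper)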


I thought about multiple comparison corrections. Here's what my thinking was:

1. Experiments with 3 or more variants are quite rare in my practice. I usually try to avoid them.

2. In my opinion, the Bonferroni correction is just wrong. It's too pessimistic. There are better methods though.

3. The choice of alpha is subjective. Why use a precise smart method to adjust a subjective parameter? Just choose another subjective alpha, a smaller one :)

But I can change my opinion if I see a good argument.


If you work for a large website (as I used to), they probably run hundreds of tests a week across various groups. So false positives are a real problem, and often you don't see the gain suggested by the A/B when rolling it out.

I agree that Bonferroni is often too pessimistic. If you Bonferroni-correct, you'll usually find nothing is significant. And I take your point that you could adjust the $\alpha$. But then, of course, you can make things significant or not as you like by that choice.

The False Discovery Rate is less conservative, and I have used it successfully in the past.

People have strong incentives to find significant results that can be rolled out, so you don't want that person choosing $\alpha$. They will also be peeking at the results every day of a weekly test, wanting to roll it out if it bumps into significance. I mention this because the most useful A/B libraries are the ones resistant to human nature. PMs will talk about things being "almost significant" at 0.2 everywhere I've worked.


Thank you for the explanation and for drawing a vivid picture) I will add FWER and FDR to the roadmap. Which specific controlling procedures do you find the most useful in practice?

I'm considering the following:

- FWER: Holm–Bonferroni, Hochberg's step-up.
- FDR: Benjamini–Hochberg, Benjamini–Yekutieli.
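For illustration, minimal NumPy implementations of one procedure from each family, run on hypothetical p-values (statsmodels' `multipletests` provides production-ready versions of all four):

```python
import numpy as np

def holm_bonferroni(pvals, alpha=0.05):
    """Step-down FWER control: compare the i-th smallest p-value to alpha/(m-i)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for i, idx in enumerate(order):
        if p[idx] <= alpha / (m - i):
            reject[idx] = True
        else:
            break  # step-down: stop at the first failure
    return reject

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR control: reject up to the largest k with p_(k) <= k/m * alpha."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    thresholds = alpha * np.arange(1, m + 1) / m
    below = sorted_p <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest passing index
        reject[order[: k + 1]] = True
    return reject

pvals = [0.01, 0.02, 0.03, 0.5]
print(holm_bonferroni(pvals))      # [ True False False False]
print(benjamini_hochberg(pvals))   # [ True  True  True False]
```

Note how FDR control rejects three hypotheses here while FWER control rejects only one, matching the point above that FDR is less conservative.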


Personally, I've used FDR, but FWER is meant to be good as well. I guess I don't have a preference.


Thank you! I hope it will be useful for you.

Regarding your question, first, I'd like to understand what problem you want to solve, and whether this approach will be useful for other users of tea-tasting.


No problem! I have most of the code in very small functions that I'd be willing to contribute.

At my company we have very time-sensitive A/B tests that we have to run with very few data points (at most 20 conversions per week, after 1000 or so failures).

We found out that Bayesian A/B testing was excellent for our needs, as it could be run with fewer data points than a regular A/B test for the sort of conversion changes we aim for. It gives a probability of group B converting better than A, and we can run checks to see if we should stop the test.

Regular A/B tests would take too long, and the significance of the test wouldn't make much sense because after a few weeks we would be comparing apples to oranges.
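For context, the core of such a Bayesian conversion test fits in a few lines (a minimal Beta-Binomial sketch with made-up numbers, not the commenter's actual method):

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical low-volume conversion data: trials and conversions per variant.
trials_a, conv_a = 1000, 18
trials_b, conv_b = 1000, 25

# Beta-Binomial model with a uniform Beta(1, 1) prior on each conversion rate.
samples_a = rng.beta(1 + conv_a, 1 + trials_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + trials_b - conv_b, size=100_000)

# Posterior probability that B converts better than A.
prob_b_beats_a = (samples_b > samples_a).mean()
print(prob_b_beats_a)
```

A stopping rule can then be a threshold on this posterior probability, though (as discussed below) repeated checks affect the error rates.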


Thank you for the explanation. If I understand correctly, you use this approach to increase sensitivity (compared to NHST) using the same data.

Most probably, in your case, higher sensitivity (or power) comes at the cost of a higher type I error rate. And this might be fine: sometimes making more changes faster is more important than avoiding false positives. In that case, you can just use a higher p-value threshold in the NHST framework.

You might argue that the discrete type I error rate does not concern you, and that the potential loss in metric value is what matters. This might be true in your setting. But in real-life scenarios, in most cases, there are additional costs not accounted for in the proposed solution: increased complexity and more time spent on development, implementation, and maintenance.

I suggest reading this old post by David Robinson: https://varianceexplained.org/r/bayesian-ab-testing/

While the approach might fit your setting, I don't believe most other users of tea-tasting would benefit from it. For the moment, I must decline your kind contribution.

But you can still use tea-tasting and perform the calculations described in the whitepaper. See the guide on how to define a custom metric with a statistical test of your choice: https://tea-tasting.e10v.me/custom-metrics/


Thank you. I will think about it. There are many different things I could add; the idea is to focus on the most needed features first. And the word "exotic" speaks for itself ;)


Not at the moment. If you have a specific method in mind, I will gladly look into it.

