Show HN: Tea-tasting, a Python package for the statistical analysis of A/B tests (e10v.me)
150 points by e10v_me 11 months ago | 48 comments
I'm excited to introduce tea-tasting, a Python package for the statistical analysis of A/B tests.

It features Student's t-test, Bootstrap, variance reduction using CUPED, power analysis, and other statistical methods.

tea-tasting supports a wide range of data backends, including BigQuery, ClickHouse, PostgreSQL, Snowflake, Spark, and more, all thanks to Ibis.

I consider it ready for important tasks and use it for the analysis of switchback experiments in my work.
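
Here's roughly what the basic usage looks like. This is a sketch along the lines of the example in the docs (make_users_data just generates sample data; see the docs for the full example):

    import tea_tasting as tt

    # sample user-level data with sessions, orders, and revenue
    data = tt.make_users_data(seed=42)

    experiment = tt.Experiment(
        sessions_per_user=tt.Mean("sessions"),
        orders_per_session=tt.RatioOfMeans("orders", "sessions"),
        orders_per_user=tt.Mean("orders"),
        revenue_per_user=tt.Mean("revenue"),
    )

    result = experiment.analyze(data)
    print(result)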




It would probably be good to have something that handles multiple comparisons (False Discovery Rate, Bonferroni correction), which is often the bane of running a whole series of A/B tests. And, as another poster has mentioned, an anytime approach that is resistant to early stopping due to peeking [1].

For those who haven't read about Fisher's tea experiment: there was a woman who claimed she could tell whether the milk was put into the cup before or after the tea. Fisher didn't think so, and developed the experimental technique to test this idea. Indeed she could, getting them all right, iirc.

[1] see https://media.trustradius.com/product-downloadables/UP/GB/AD... for a discussion of the problems with a t-test. There is also a more detailed whitepaper from Optimizely somewhere


For anyone interested in anytime-valid testing, I wrote a Python library [1] implementing multinomial and time-inhomogeneous Bernoulli / Poisson process tests based on [2].

[1] https://github.com/assuncaolfi/savvi/

[2] https://openreview.net/forum?id=a4zg0jiuVi


I thought about multiple comparison corrections. Here's what my thinking was:

1. Experiments with 3 or more variants are quite rare in my practice. I usually try to avoid them.

2. In my opinion, the Bonferroni correction is just wrong. It's too pessimistic. There are better methods though.

3. The choice of alpha is subjective. Why use a precise smart method to adjust a subjective parameter? Just choose another subjective alpha, a smaller one :)

But I can change my opinion if I see a good argument.


If you work for a large website (as I used to), they probably run hundreds of tests a week across various groups. So false positives are a real problem, and often you don't see the gain suggested by the A/B when rolling it out.

I agree that Bonferroni is often too pessimistic. If you Bonferroni-correct, you'll usually find nothing is significant. And I take your point that you could adjust the $\alpha$. But then, of course, you can make things significant or not as you like by the choice.

False Discovery Rate is less conservative, and I have used it successfully in the past.

People have strong incentives to find significant results that can be rolled out, so you don't want that person choosing $\alpha$. They will also be peeking at the results every day of a weekly test, and wanting to roll it out if it bumps into significance. I just mention this because the most useful A/B libraries are the ones that are resistant to human nature. PMs will talk about things being "almost significant" at 0.2 everywhere I've worked.


Thank you for the explanation and for drawing a vivid picture :) I will add FWER and FDR to the roadmap. Which specific controlling procedures do you find the most useful in practice?

I'm considering the following:

- FWER: Holm–Bonferroni, Hochberg's step-up.

- FDR: Benjamini–Hochberg, Benjamini–Yekutieli.
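
For reference, statsmodels already implements all of these; a minimal sketch of how they could be applied (the p-values here are made up):

    from statsmodels.stats.multitest import multipletests

    # made-up p-values from several variants/metrics
    pvals = [0.012, 0.034, 0.21, 0.004]

    # Holm (FWER control) and Benjamini-Hochberg (FDR control)
    for method in ("holm", "fdr_bh"):
        reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
        print(method, reject, p_adj.round(4))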


Personally, I've used FDR, but FWER is meant to be good as well. I guess I don't have a preference.


And the Student's t-test, which got its name because William Sealy Gosset's employer (the Guinness brewery) only allowed him to publish anonymously, so he published under the pseudonym "Student".


Great package! I'll test it out in my free time.

I'm wondering if you'd like to accept a contribution for Bayesian A/B testing, based on this whitepaper [0] and developed in NumPy.

If so, we can chat at my email gbenatt92 at zohomail dot com, or I can open a draft PR to discuss the code and paper.

[0]https://vwo.com/downloads/VWO_SmartStats_technical_whitepape...


Thank you! I hope it will be useful for you.

Regarding your question, first, I'd like to understand what problem you want to solve, and whether this approach will be useful for other users of tea-tasting.


No problem! I have most of the code in very small functions that I'd be willing to contribute.

At my company we have very time-sensitive A/B tests that we have to run with very few data points (at most 20 conversions per week, after 1,000 or so failures).

We found that Bayesian A/B testing was excellent for our needs, as it could be run with fewer data points than a regular A/B test for the sort of conversion changes we aim for. It gives a probability of group B converting better than A, and we can run checks to see if we should stop the test.

Regular A/B tests would take too long, and the significance of the test wouldn't make much sense, because after a few weeks we would be comparing apples to oranges.
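
The core of what we run is basically a Beta-Binomial comparison. A rough sketch, assuming uniform Beta(1, 1) priors and made-up counts:

    import numpy as np

    rng = np.random.default_rng(42)

    # made-up conversion counts: (conversions, trials) per group
    conv_a, n_a = 18, 1000
    conv_b, n_b = 27, 1000

    # posterior draws with Beta(1, 1) priors
    post_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
    post_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)

    print("P(B > A) ~", (post_b > post_a).mean())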


Thank you for the explanation. If I understand correctly, you use this approach to increase sensitivity (compared to NHST) using the same data.

Most probably, in your case, the higher sensitivity (or power) comes at the cost of a higher type I error rate. And this might be fine. Sometimes making more changes, faster, is more important than avoiding false positives. In this case, you can just use a higher p-value threshold in the NHST framework.

You might argue that the discrete type I error does not concern you. And that the potential loss in metric value is what matters. This might be true in your setting. But in real life scenarios, in most cases, there are additional costs that are not taken into account in the proposed solution: increased complexity, more time spent on development, implementation, and maintenance.

I suggest reading this old post by David Robinson: https://varianceexplained.org/r/bayesian-ab-testing/

While the approach might fit your setting, I don't believe most other users of tea-tasting would benefit from it. For the moment, I must decline your kind contribution.

But you still can use tea-tasting and perform the calculations described in the whitepaper. See the guide on how to define a custom metric with a statistical test of your choice: https://tea-tasting.e10v.me/custom-metrics/


Imo, with A/B tests, it's really easy to get sucked into the 30 different analysis algos, but the most important thing by far is experiment hygiene.


And knowing beforehand when you won't get enough exposures to reach significance.

Not many people have enough traffic to A/B test small effects and reach significance without running the test for multiple years.

I don't use CUPED in my tests... how much can it reduce wait times?


Strictly speaking, you don't need to wait for some arbitrary significance threshold. I don't know why so many people treat website A/B tests as similar to careful, traditional NHST-controlled experiments. Website A/B testing is much better thought of as an optimization problem than as a true hypothesis test.

What's really important if you want to improve a website via A/B testing is a constant stream of new hypotheses (i.e. new variants). You can call tests "early" so long as you have new tests lined up; it boils down to a classic exploration/exploitation problem. In fact, in early development, rapid iteration often yields superior results to waiting for significance.

As a website matures and gets closer to some theoretical optimal conversion point, it becomes increasingly important to wait until you are very certain of an improvement. But if you're just starting A/B testing, more iteration will yield greater success than more certainty.


> You can call tests "early"

Another way to say that is: you can randomly pick a winner


Of course, at the extreme you are over-tuning for exploitation, but in practice it's never completely random. You always have some information about the probable winner, so long as P(A>B|obs) is not 0.5.

Taking a long time to reach "significance" just means there is a small difference between the two variants, so it's better to just choose one and then try the next challenger, which might have a larger difference.

In the early stages of running A/B tests, being 90% certain that one variant is superior is perfectly fine so long as you have another challenger ready. Conversely, in the later stages of a mature website, when you're searching for minor gains, you probably want a much higher level of certainty than the standard 95%.

In either case thinking in terms of arbitrary significance thresholds doesn't make that much sense for A/B testing.


This may be true when B is missing the “Try/buy” button.

But for incremental, smaller changes, calling early is probably gambling.


You don't want to do that if you have seasonality, or novelty effects.


I don't think CUPED is super useful if you just stratify your users properly before the experiment begins.


CUPED is easier than stratifying users. Or, probably, you mean post-stratification. Still, CUPED is easier, in my personal opinion :)


I cannot agree more. It's one of the reasons I've developed the package. With tea-tasting, I can save time and focus on the more important tasks, like experiment design.


I guess I'm not very versed in website A/B testing, but wouldn't it be much better to analyze these results in a regression framework where you can correct for the covariates?

On top of this, logistic regression makes your units a lot more interpretable than just looking at differences in means, e.g. the odds of buying something are 1.1 times as high when you are assigned to group B.
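
Something like this, for example (statsmodels formula API; the data and column names are made up):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # hypothetical unit-level data: one row per user
    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "converted": rng.integers(0, 2, size=1000),
        "variant": np.repeat(["A", "B"], 500),
        "source": np.tile(["ads", "organic"], 500),
    })

    model = smf.logit("converted ~ C(variant) + C(source)", data=df).fit()
    print(np.exp(model.params))  # exponentiated coefficients, i.e. odds ratios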


This is the correct approach, but having done A/B testing for many years (and having basically moved away from this area of work), nobody in the industry really cares about understanding the problem; they care about promoting themselves as experts and creating the illusion of rigorous marketing.

Correct A/B testing should involve starting with an A/A test to validate the setup, building a basic causal model of what you expect the treatment impact to be, controlling for covariates, and finally ensuring that when the causal factor is controlled for, the results change as expected.

But even the "experts" I've read in this area largely focus on statistical details that honestly don't matter (and if they do matter, the change you're proposing is so small that you shouldn't be wasting time on it).

In practice if you need "statistical significance" to determine if change has made an impact on your users you're already focused on problems that are too small to be worth your time.


Ok so, that's interesting. I like examples, so are you saying I should build a "framework" that presents two (landing) pages exactly the same, and (hopefully) is able to collect things like what source the visitor came from, maybe some demographics? And then I try to get 100 impressions with random blue and red buttons, then check to see if there is some confounding factor (blue was always picked by females linking from google ads), and then remove the randomness next time and show blue ads to half the females from google and half of everyone else?

I think the dumb underlying question I have is: how does one do experimental design?

Edit: and if you aren't seeing giant obvious improvements, try improving something else (I get the idea that my B is going to be so obvious that there is no need to worry about stats - if it's not, that's a signal to change something else?)


There exist some solutions for this that overlay your webpage with a heatmap showing where users' cursors have traveled. More popular areas show "hotter" in red, which could show how effective your changes are, or where you may want to center the content you're trying to get users to notice. I haven't directly worked with the data, but I have seen the heatmaps from Hotjar on sites I've implemented (doing both frontend and backend development, but not involved in the design or SEO/marketing).


Thank you for the interest and for the suggestion.

Yes, one can analyze A/B tests in a regression framework. In fact, CUPED is equivalent to linear regression with a single covariate.
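
For readers who haven't seen it, the adjustment itself is tiny. A sketch (y is the metric measured during the experiment, x is the same metric from the pre-experiment period):

    import numpy as np

    def cuped_adjust(y: np.ndarray, x: np.ndarray) -> np.ndarray:
        # theta is the regression coefficient of y on the pre-experiment covariate x
        theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
        # subtract the part of y explained by x: the mean is unchanged, the variance shrinks
        return y - theta * (x - x.mean())

The adjusted values are then compared between variants with the usual t-test.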

Would it be better? It depends on the definition of "better". There are several factors to consider. Scientific rigor is one of them. So is computational efficiency.

A/B tests are usually conducted at a scale of thousands of randomization units (actually, it's more like tens or hundreds of thousands). There are two consequences:

1. Computational efficiency is very important, especially if we take into account the number of experiments and the number of metrics. And pulling granular data into a Python environment and fitting a regression is much less efficient than calculating aggregated statistics like mean and variance (see the sketch after this list).

2. I didn't check, but I'm pretty sure that, at such a scale, the results of logistic and linear regressions will be very close, if not equal.
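
To illustrate point 1: SciPy can run Welch's t-test directly from summary statistics. A sketch with made-up aggregates (in practice they would come from a single query against the data backend):

    from scipy.stats import ttest_ind_from_stats

    # hypothetical per-variant aggregates: mean, standard deviation, number of users
    result = ttest_ind_from_stats(
        mean1=10.2, std1=3.1, nobs1=52_000,
        mean2=10.5, std2=3.2, nobs2=51_500,
        equal_var=False,  # Welch's t-test
    )
    print(result.pvalue)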

And even if, for some reason, there is a real need to analyze a test using a logistic model, a multilevel model, or clustered errors, it's possible in tea-tasting via custom metrics: https://tea-tasting.e10v.me/custom-metrics/


> And pulling granular data into a Python environment and fitting a regression is much less efficient than calculating aggregated statistics like mean and variance.

This is not true. You almost never need to perform logistic regression on individual observations. Consider that estimating a single Bernoulli rv on N observations is the same as estimating a single Binomial rv from k successes out of N. Most common statistical software (e.g. statsmodels) will support this grouped format.

If all of our covariates are discrete categories (which is typically the case for A/B tests), then you only need to regress on a number of rows equal to the number of unique configurations of the variables.

That is, if you're running an A/B test on 10 million users across 50 states and 2 variants, you only need 100 observations for your final model.
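
A sketch of the grouped form with statsmodels' GLM (the counts are invented; the Binomial family accepts a two-column array of successes and failures per cell):

    import numpy as np
    import statsmodels.api as sm

    # hypothetical aggregated data: one row per (variant, state) cell
    # endog columns: conversions, non-conversions
    endog = np.array([
        [1200, 48_800],  # variant A, state 1
        [800, 29_200],   # variant A, state 2
        [1330, 48_670],  # variant B, state 1
        [850, 29_150],   # variant B, state 2
    ])
    # exog columns: intercept, variant B indicator, state 2 indicator
    exog = np.array([
        [1, 0, 0],
        [1, 0, 1],
        [1, 1, 0],
        [1, 1, 1],
    ])

    fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
    print(np.exp(fit.params[1]))  # odds ratio for variant B vs A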


> Most common statistical software (e.g. statsmodels) will support this grouped format.

Interesting, I didn't know this about statsmodels. But maybe the documentation is a bit misleading: "A nobs x k array where nobs is the number of observations and k is the number of regressors". Source: https://www.statsmodels.org/stable/generated/statsmodels.gen...

I would be grateful for references on how to fit a logistic model in statsmodels using only aggregated statistics. Or not statsmodels; any references will do.


For statsmodels, for the methods I am familiar with, you can pass in frequency weights: https://www.statsmodels.org/stable/generated/statsmodels.gen...

So that will be a bit different from R-style formulas using cbind, but yes, if you only have a few categories of data, using weights makes sense. (Even many of sklearn's functions allow you to pass in weights.)

I have not worked out a closed form for logit regression, but for Poisson regression you can get a closed form for the incidence rate ratio: https://andrewpwheeler.com/2024/03/18/poisson-designs-and-mi.... So there is no need to use maximum likelihood at all in that scenario.
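
For the simple two-group case, the usual closed form looks like this (made-up counts and exposures; this is the generic Wald interval, not necessarily the exact derivation in the post):

    import math

    # hypothetical event counts and exposures (e.g. user-days) per variant
    k_a, n_a = 410, 50_000
    k_b, n_b = 465, 50_000

    irr = (k_b / n_b) / (k_a / n_a)            # incidence rate ratio
    se_log_irr = math.sqrt(1 / k_a + 1 / k_b)  # Wald SE of log(IRR)
    lo = irr * math.exp(-1.96 * se_log_irr)
    hi = irr * math.exp(1.96 * se_log_irr)
    print(f"IRR = {irr:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")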


A logistic regression is the same as a Bernoulli regression, which is the single trial case of a Binomial regression [1].

[1] https://www.pymc.io/projects/examples/en/latest/generalized_...


Thank you, I'm aware of this. But I don't understand how your link answers my previous message. I was asking for an example of how to fit it using only aggregated statistics (focus on "aggregated"). I'm afraid MCMC or other Bayesian sampling algorithms are not the right examples.


I won't use any library that depends on numpy because of all the install issues in the past. Can't you do these tests with pure Python these days?


You would need probability distributions anyway. In Python, SciPy is the most mature and popular package with probability distributions. And it depends on NumPy. But I'll gladly consider better alternatives if you propose them.


What year is this? I have not had problems installing numpy in over a decade. They are a core library that takes its position seriously.

If numpy is out of consideration, so is the entire scientific Python ecosystem. Python is not a fast language, and any kind of math-heavy algorithm is going to suffer significant performance penalties.


What installation issues have you had with Numpy lately?

Python packaging is a mess, but compared to issues with Torch or Nvidia stuff, Numpy has been a cakewalk whether using pip, conda, poetry, rye, etc.


Would be great if it included sequential sampling as well: https://www.evanmiller.org/ab-testing/sequential.html . Especially given how A/B tests usually get run in product companies, a peek-proof method helps quite a bit.
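
If I remember that page right, the one-sided stopping rule is simple enough to sketch (N is chosen up front; check the page for the exact constants and the two-sided variant):

    def sequential_decision(treatment_conv: int, control_conv: int, n: int):
        # Stop and declare the treatment the winner once its lead in conversions
        # reaches 2 * sqrt(N); stop with no winner once combined conversions reach N.
        if treatment_conv - control_conv >= 2 * n ** 0.5:
            return "treatment wins"
        if treatment_conv + control_conv >= n:
            return "no winner"
        return None  # keep collecting data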


I will consider it. Thank you for the suggestion.


Congrats! Let's say I have 5 variants I want to try. Does this package have anything to help with realtime experiment design, where I stop trying the less-promising variants, and focus my experimental budget on the more promising variants?


Not at the moment. If you have a specific method in mind, I will gladly look into it.


Very cool! Just curious, would you consider adding more exotic experimental design setups like Latin Square Design to the roadmap?


Thank you. I will think about it. There are many different things I can add. The idea is to focus on the most needed features first. And the word "exotic" speaks for itself ;)


This is awesome and very useful. Thanks!


It's called tea-tasting but doesn't include Fisher's exact test :(

https://en.wikipedia.org/wiki/Fisher%27s_exact_test?useskin=...


Yeah, I know :) But it's in the roadmap. Btw, aren't Barnard's test or Boschloo's test better alternatives?


What is a good resource for someone looking to learn more about A/B testing? Not specifically about website dark pattern optimization, but fine if that is the framing device.


I recommend “Trustworthy Online Controlled Experiments”. If you’re only going to read one book about it, it should be this one. It will walk you through why we experiment, how it’s typically done, and how to use them to improve your decision making.


Agree. I also suggest looking at Alex Deng's unfinished book on causal inference and, particularly, A/B testing: https://alexdeng.github.io/causal/

Alex Deng worked with Ron Kohavi on the Microsoft Analysis and Experimentation team and co-authored many important papers on the topic, including the paper about CUPED.


Goes way beyond t-tests, but I really like this free online book on causal inference more broadly

https://matheusfacure.github.io/python-causality-handbook/la...



