
To avoid big A/B testing mistakes, perhaps you shouldn't use phrases like "To counter that noise it is important to first “prove the Null hypothesis.” To prove the Null hypothesis you..."

One of the most important underlying statistical principles in inference tests is that the null hypothesis can never be proven. Any data you collect can only reject the null hypothesis or fail to reject it.

Good point. Not sure where I picked up that phrasing - especially since the Wikipedia article linked from the phrase specifically says:

> It is important to understand that the null hypothesis can never be proven. A set of data can only reject a null hypothesis or fail to reject it. For example, if comparison of two groups (e.g.: treatment, no treatment) reveals no statistically significant difference between the two, it does not mean that there is no difference in reality. It only means that there is not enough evidence to reject the null hypothesis (in other words, the experiment fails to reject the null hypothesis).

I'm going to update the post to fix that mistake. Thanks.

Edit: I updated that section. This part and the next part of the series are the ones I'm most anxious about because of all the math and the questions that have right and wrong answers.

Here is better phrasing or a good way to think about hypothesis testing:

"The goal of the test is to determine if the null hypothesis can be rejected. A statistical test can either reject (prove false) or fail to reject (fail to prove false) a null hypothesis, but never prove it true (i.e., failing to reject a null hypothesis does not prove it true)." (Wikipedia)

I might also change some of the wording of these:

"The Null Hypothesis states that if you don’t change anything than nothing will be different."

The null hypothesis states that there is no difference in X between the control experience and any of the test experiences, where X is the metric you are using to evaluate the performance of the different experiences, or recipes. X could be something like Conversion Rate, Click Rate, Time on Page, Revenue per Visitor, etc.

"If you divide your traffic and see significant differences in your metrics you have failed the Null hypothesis."

I would just say "...you would reject the null hypothesis." In this case, you're basically saying that there is evidence to suggest that the differences in X (where X is the variable you're measuring) are not due to chance / noise / natural variation alone. By saying "you would reject the null hypothesis", the implication is that you would accept the alternative hypothesis. Remember, the null is that there are no differences in X and the alternative is that there are differences in X (where X is your success metric).
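To make "reject" vs. "fail to reject" concrete, here is a minimal sketch of one common way to run this kind of test on a conversion rate: a two-sided, pooled two-proportion z-test. The function names, the visitor counts and the 0.05 threshold are all made up for illustration, and this is not necessarily the test the article or any particular A/B tool uses.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rate between two groups.

    Hypothetical helper for illustration. It assumes both groups have
    visitors and that the pooled rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null is true.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

def decide(p_value, alpha=0.05):
    # Note the two possible outcomes: there is no "null is proven" branch.
    return "reject the null" if p_value < alpha else "fail to reject the null"

# Made-up counts: conversions and visitors for control, then for the test experience.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"2.0% vs 2.6%: z = {z:.2f}, p = {p:.4f} -> {decide(p)}")

z, p = two_proportion_z_test(200, 10_000, 210, 10_000)
print(f"2.0% vs 2.1%: z = {z:.2f}, p = {p:.4f} -> {decide(p)}")
```

The second comparison fails to reject, which only says the data are compatible with no difference. It does not show that the two experiences convert at the same rate.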

"Failing the Null hypothesis means that either you have not collected enough results to even out the noise or there is something wrong with the algorithm that you are using to segment your traffic."

By "Failing the Null hypothesis," do you mean "rejecting it" or "failing to reject it"?

If you mean "rejecting it", I'm not sure how to read what you wrote.

If you mean "failing to reject it," you're basically saying here that if you don't reject the null hypothesis, it is due to one of 3 things (or a combination of the 3):

1) Your sample size is not large enough

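This is a tricky boat to get into. Any difference in the metric you are testing, however small, will reach statistical significance with a large enough sample size, as long as the true difference is not exactly zero. The reason is in the formulas: the standard error of the difference shrinks roughly in proportion to 1/sqrt(n), so the test statistic for a fixed difference keeps growing as the sample grows.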

You should decide ahead of time how long you want to run a test for, knowing how much traffic the page you're testing will receive, and how many experiences / groups / segments you are splitting this traffic into.

If you don't reach significance, it could just be that the differences you were testing for are too small.

Knowing the standard deviation / variance of the metric you're testing can help you understand which of the above it is (the sample size is too small, or the differences you are testing for are too small).

2) You don't have enough evidence to suggest that any differences you see are due to anything other than variation / chance / noise.

This can happen a lot when the changes / differences you're testing are too small.

3) You have a Type II error

You have failed to reject a false null hypothesis. The test should have told you there was a difference, but for whatever reason, the sample data you saw did not provide you with evidence to do so.
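The three points above are really the same trade-off seen from different angles, and a rough power calculation shows it. The sketch below uses the normal approximation for a two-sided two-proportion test and ignores the negligible opposite tail. The function name `approx_power` and the 2.0% vs 2.2% conversion rates are assumptions picked purely for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_a, p_b, n_per_group, alpha=0.05):
    """Approximate chance of rejecting the null when the true rates are p_a and p_b.

    Hypothetical helper: normal approximation, equal group sizes,
    and only the tail in the direction of the true effect is counted.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    p_bar = (p_a + p_b) / 2
    # Standard error of the difference assuming the null (a shared rate).
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_group)
    # Standard error of the difference under the true rates.
    se_alt = sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_group)
    return std_normal.cdf((abs(p_b - p_a) - z_crit * se_null) / se_alt)

# Same small true difference, three different sample sizes.
for n in (1_000, 10_000, 100_000):
    power = approx_power(0.020, 0.022, n)
    print(f"n per group = {n:>7}: power ~ {power:.2f}, Type II rate ~ {1 - power:.2f}")
```

With a small true difference, a small sample fails to reject most of the time (a Type II error), while a very large sample almost always rejects. That is why the sample size and the smallest difference you care about should be decided before the test starts.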

Oh - but I do also want to say that you did a great job on 99% of the article. You hit on pretty much all the major considerations, biases and issues one could run into when testing.
