
    During this time there were a total of 67 new
    subscriptions. Of these 58% (39) came from the
    new design and 42% (28) came from the old design.
    Looks like the new one is a clear winner.
Is it? This seems like a small sample to settle on a clear winner.

Using R's prop.test, I get a p-value of 0.22. (Type "prop.test(39,67)" to calculate it).

I think this means that in a world where it makes no difference which design is used, you would get a result as significant as this 22% of the time.
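Spelled out with its defaults (the null proportion of 0.5 is implicit in the original call):

    # One-sample test: is 39 out of 67 total conversions consistent with a 50/50 split?
    prop.test(x = 39, n = 67, p = 0.5)
    # p-value comes out around 0.22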

An alternative is the Adjusted Wald method. You can try it online here:

https://measuringu.com/wald/

That gives confidence intervals which also range from "could be better" to "could be worse", even when you reduce the confidence level from the typical 95% to 90%.
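If you'd rather script it than use the calculator, here is a rough R sketch of the Adjusted Wald (Agresti-Coull) interval for the 39-of-67 split; the site's exact numbers may differ slightly:

    # Adjusted Wald (Agresti-Coull) interval for 39 successes out of 67 trials
    adjusted_wald <- function(x, n, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      p_adj <- (x + z^2 / 2) / (n + z^2)   # shrink the estimate toward 0.5
      se <- sqrt(p_adj * (1 - p_adj) / (n + z^2))
      c(lower = p_adj - z * se, upper = p_adj + z * se)
    }
    adjusted_wald(39, 67)               # roughly 0.46 to 0.69 -- straddles 0.5
    adjusted_wald(39, 67, conf = 0.90)  # still straddles 0.5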

    a quick check with an A/B testing calculator
    even says that this result has significance
    (~90% likely)
Which calculator was that?



Using the techniques described on my blog [0], which are ideal for KPIs like conversion rates and small sample sizes (since no Gaussian approximation is made), I get a p-value of 0.177, which is not significant. The observed treatment effect is a 36.8% lift in conversion rate, but a confidence interval on this effect has endpoints -10.7% and +97.4%. Anything in that range would be considered consistent with the observed result at a 0.05 significance threshold.

With 6000 total impressions and a 50/50 split, the experiment can only reliably detect a 74% lift in conversion rate (with power = 80%).

If you want to rigorously determine the impact, decide what effect size you hope to see. Use a power calculator to decide the sample size needed to detect that effect size. Administer the test, waiting to acquire the planned sample size. When analyzing, be sure to compute a p-value and a confidence interval on the treatment effect.
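For example, a rough sketch with R's power.prop.test (my illustrative numbers: assuming a ~0.9% baseline conversion rate and hoping to detect a 20% lift):

    # Impressions needed per arm to detect a 20% lift on a ~0.9% baseline
    # with 80% power at the usual 5% significance level
    power.prop.test(p1 = 0.009, p2 = 0.009 * 1.2, power = 0.80, sig.level = 0.05)
    # n per group comes out on the order of 47,000 impressions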

[0] https://www.adventuresinwhy.com/post/ab-testing-random-sampl...


That's not how you use prop.test. What you've tested using that invocation is the null hypothesis that the underlying probability of 39/67 is 0.5.

If you want to perform a test of a difference of two proportions, you need to do:

prop.test(c(39, 67), c(total_group_a_impressions, total_group_b_impressions))

I don't have experience with A/B testing, so I'm not sure if this is typically or best handled using this particular statistical test.

Edit: The first parameter should be c(39, 28), meaning the total conversions in each group. I have no excuse beyond being tired.

Edit 2: To clarify, I think he should still use the two-sample form of prop.test, especially since we did not know at the time of his posting that the sample sizes are equal.
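With the 3000-per-group figure that comes up later in the thread, the corrected call looks like this (a sketch):

    # Two-sample test: 39/3000 conversions vs 28/3000 conversions
    prop.test(x = c(39, 28), n = c(3000, 3000))
    # p-value is again about 0.22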


    What you've tested using that invocation is the
    null hypothesis that the underlying probability
    of 39/67 is 0.5.
Isn't that equivalent to my interpretation of the test result? "In a world where it makes no difference which design is used, you would get a result as significant as this 22% of the time".

    If you want to perform a test of a difference of two proportions, you need to do:
    prop.test(c(39, 67), c(total_group_a_impressions, total_group_b_impressions))
Do you mean c(39,28)? Because group_a had 39 hits and group_b had 28. Doing so with the group sizes Bemmu stated (3000/3000) also gives me a p-value of 0.22.

As long as the group sizes are equal, the test is not very sensitive to the sizes.


I think there is a difference in the approaches, given that the Chi-squared test statistic for the two-sample version is ~1.52, while for your one-sample version it is ~1.81. If group size doesn't matter and if you're justified in adding up successes as you have, I'd expect the test statistics to be nearly the same.

Edit: I'd expect them to be nearly the same since the Chi-squared distributions would be parameterized similarly, so if we have similar results, we should see similar test statistics. Maybe my reasoning here is incorrect though!


I used http://www.abtestcalculator.com/ and entered 3000 participants -> 28 conversions and 3000 participants -> 39 conversions.

I neglected to record how many views each version had, but it should be at least 3000, since the conversion rate is about 0.5-1%.


The resource provided uses a very naive approach to determining the outcome of an A/B test. It's not accurate, given the very small numbers.


Yes, I don't have much data to work with, and was also surprised that the calculator considered this significant. But even without significance, I assume it still makes sense to go with the winner?


Does the calculator really use the word "significant"? I don't see it. I am not sure how to interpret the language it uses.

As for going with the winner: Yes, if the test result (39/28) is the only information you have and there are only two choices (go with winner / go with loser) then it makes sense to go with the winner.


The statement I see on that page is "There is a 91% chance that Variation A has a higher conversion rate".

I am not sure how to interpret that. We would have to dive into the GitHub repo and figure out which test it performs I guess.


It looks like the difference of two beta distributions based on the visualization.

So, assuming a uniform prior and updating with 39/3000 and 28/3000 conversions the difference between the two distributions is greater than zero 91% of the time. It's only guaranteed to be above zero at about the 80% credible interval, and since we started with an uninformed prior that'd be about p=.2?
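A quick Monte Carlo sketch of that reading (uniform Beta(1,1) priors, 3000 impressions per arm as assumed above):

    # Posterior draws for each arm's conversion rate under uniform priors
    set.seed(1)
    a <- rbeta(1e6, 1 + 39, 1 + 3000 - 39)  # new design: 39/3000
    b <- rbeta(1e6, 1 + 28, 1 + 3000 - 28)  # old design: 28/3000
    mean(a > b)  # comes out around 0.91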

I'm open to correction here.


You get 91% if you put a uniform prior on the proportion coming from each alternative.


Looks like I misunderstood the result; I will change the post to reflect this.


If you have Google Analytics on your site, you can use the Unique Pageviews metric for each of the two page variations as the view counts, instead of arbitrarily assigning 3000 views to both.


If you had 3000 visits, shouldn't that be 1500 -> 28 and 1500 -> 39? (assuming you're doing a uniform split of both groups)


I don't think that's the correct use of prop.test for this question. When you give it two numbers (success, total), it tests against the null that chance of success is 50%.

Here, we want to test whether p(success|cond) differs across conditions, not whether p(cond|success) is 50%.

This distinction is important because when p(success|cond) for some cond is low, its variance is also very small, but prop.test(39,67) doesn't reflect this. Those 67 conversions could have come from a small number of impressions (implying a high chance of success) or from a huge number (implying a low chance of success).
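To make that concrete with made-up impression counts: the same 39-vs-28 split supports very different conclusions depending on how many impressions it came from, and the one-sample call can't tell them apart.

    # Same 39 vs 28 conversions, very different numbers of impressions
    prop.test(c(39, 28), c(3000, 3000))  # p around 0.22 -- not convincing
    prop.test(c(39, 28), c(50, 50))      # p around 0.03 -- looks like a real difference
    prop.test(39, 67)                    # p around 0.22 either way; ignores impression counts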

Edit: whoops, I didn't notice other comments point out this issue


They said the following in the beginning though:

> For example if you want to test a tweak that results in 5% more conversions, you need about 3000 sales to detect it! For Candy Japan this would mean waiting for about 10 years for the test to complete.

But they still want to do something to try to improve sales. Seems reasonable even if not scientific.


This is not the right approach to take then. There are lots of other approaches to decision-making outside of hypothesis testing; use them! This is not an appropriate use of hypothesis testing and can very much lead you toward making the WRONG decision.

For example, with such small numbers, there isn't much value in aggregate statistics. It would take a day or two to go through each one individually and see what happened, and you'd probably learn way more about your customers.


Whatever test you do, you would need to know the total number of visitors in each group, right?

And unless I missed it the article doesn't state those numbers.

Intuitively, the numbers you quoted would be more significant the bigger the test and control groups are.


That's not very intuitive to me. Let's do some limit analysis: imagine the groups were one million sessions each, but the conversions in the groups were only one and two people respectively. Wouldn't this seem like the result of random chance?

The conversion rate is basically one in a million in both cases.
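For what it's worth, an exact test on that made-up extreme backs up the intuition:

    # 2 conversions vs 1 conversion, one million impressions in each group
    fisher.test(matrix(c(2, 999998, 1, 999999), nrow = 2))
    # two-sided p-value is essentially 1 -- indistinguishable from chance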



