
A/B Testing Duration Data - EvanMiller
http://www.evanmiller.org/ab-testing-duration-data.html
======
shalmanese
The t-test is invalid because duration data is non-normal. You can actually
see duration data as a histogram in GA if you want to test hypotheses on its
distribution.

~~~
btilly
I came here to say this, and am glad someone else said it first.

That, on top of the unwarranted distribution assumption, is very dangerous. How
dangerous? If the true distribution has heavier tails than assumed, then the
true variance can be many times what is assumed. If the true variance is, say,
4x what is assumed, then the true standard deviation is 2x what is assumed, so
an ordinary one-standard-deviation fluctuation will show up as passing a 95%
confidence threshold in a random direction. Getting to 99% confidence by
accident becomes quite easy, with bad results.
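
To put a number on the 4x example, here is a minimal sketch. It assumes the test statistic is approximately normal and that the only error is the underestimated variance. The function names and the `variance_ratio` parameter are made up for illustration.

```python
import math

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal Z."""
    return math.erfc(z / math.sqrt(2))

def false_positive_rate(variance_ratio, nominal_z=1.96):
    """Chance that a no-difference test crosses the nominal threshold.

    variance_ratio = true variance / assumed variance (hypothetical input).
    The threshold sits at nominal_z assumed standard deviations, which is
    only nominal_z / sqrt(variance_ratio) true standard deviations.
    """
    return two_sided_tail(nominal_z / math.sqrt(variance_ratio))

for ratio in (1, 4):
    print(ratio,
          round(false_positive_rate(ratio), 3),
          round(false_positive_rate(ratio, nominal_z=2.576), 3))
```

Under these assumptions, a 4x variance ratio turns the nominal 5% error rate into roughly one in three, and the nominal 1% rate into roughly one in five.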

~~~
tel
Ugh. I didn't even read the article and was looking for this response.
Whenever I see "A/B test" I am usually right in assuming that the author is
about to assume normality.

~~~
btilly
I understand, but the author of this particular article actually does know
better and usually gets his statistics right.

------
RyanZAG
_"Here's the assumption: the probability of a visitor leaving the site at any
given moment is constant."_

Terrible assumption. It's probably something closer to this:

    
    
      P(Leaves site without reading / sub 2 sec): 0.2
      P(Leaves site after skimming / sub 10 sec): 0.6
      P(Leaves site after reading 1 page / 5 minutes): 0.1
      P(Leaves site after reading a bunch of pages / 30 mins): 0.1
    

If you're going to make decisions based on something like duration, then you
should code up a basic bit of JavaScript plus a database that can store real
values.
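
As a rough check of how far the guess above sits from the constant-leave-rate assumption, here is a sketch that collapses each bucket to a single duration (its upper bound, in seconds). That collapsing is my simplification, and the probabilities are the guessed ones, not measured data.

```python
import math

# Hypothetical point masses: each bucket is collapsed to one duration in
# seconds (the bucket's upper bound), with the guessed probabilities.
buckets = [(2, 0.2), (10, 0.6), (300, 0.1), (1800, 0.1)]

mean = sum(t * p for t, p in buckets)
variance = sum(p * (t - mean) ** 2 for t, p in buckets)
sd = math.sqrt(variance)

# A constant leave rate implies an exponential distribution, where sd == mean.
print(f"mean = {mean:.1f}s, sd = {sd:.1f}s, sd/mean = {sd / mean:.2f}")
```

With these made-up numbers the standard deviation comes out at about 2.5 times the mean, so a test that sets the standard deviation equal to the mean would understate the noise.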

~~~
alexatkeplar
No need to hack a whole analytics stack together - just use Snowplow with page
pings turned on:

[http://snowplowanalytics.com/analytics/catalog-analytics/mea...](http://snowplowanalytics.com/analytics/catalog-analytics/measuring-and-comparing-content-page-performance.html)

(You can't measure visit durations including bounces accurately without a
JavaScript tracker that pings back on-page DOM activity.)

------
redajax
Merg. There are contradictory assumptions here. For the approximation on the
standard deviation you assume duration is a Poisson random variable. To use
the t-test you assume duration is a normally distributed r.v. There are ways
to unify the two assumptions, but they can't be used without direct access to
the data. At that point, we could just calculate s.d.

I'd stick with assuming duration time is normally distributed. This leads to
another class of adhociness:
[http://www.statit.com/support/quality_practice_tips/estimati...](http://www.statit.com/support/quality_practice_tips/estimating_std_dev.shtml)

------
Sprint
Especially with timing data, I find it much more useful to look at the median
instead of the average value. If someone lets the browser sit open for a while,
it will skew the average but not the median.
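
A small illustration with made-up durations (in seconds), using only the standard library:

```python
import statistics

# Hypothetical visit durations in seconds.
durations = [30, 45, 60, 75, 90]
# The same sample plus one tab left open for two days.
with_idle_tab = durations + [2 * 24 * 60 * 60]

for label, sample in (("clean", durations), ("with idle tab", with_idle_tab)):
    print(label, statistics.mean(sample), statistics.median(sample))
```

The single idle tab moves the mean from 60 seconds to 28,850 seconds, while the median only moves from 60 to 67.5.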

~~~
jfarmer
If you use the same assumption as Evan's article, you can calculate the median
by multiplying the mean by ln(2), the natural logarithm of 2. I think Evan's
writing this for a situation where your hands are tied WRT the data you're
gathering, but you still want to draw statistically sensible conclusions, e.g.,
Google Analytics.
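
The ln(2) factor follows from the exponential distribution that a constant leave rate implies: its CDF is 1 - exp(-t/mean), and setting that equal to 1/2 gives t = mean * ln(2). A quick check by simulation, assuming a made-up mean of 60 seconds:

```python
import math
import random
import statistics

random.seed(0)    # fixed seed so the run is repeatable
true_mean = 60.0  # hypothetical mean time on site, in seconds

# expovariate takes the rate, i.e. 1 / mean.
sample = [random.expovariate(1 / true_mean) for _ in range(100_000)]

sample_mean = statistics.fmean(sample)
predicted_median = sample_mean * math.log(2)
sample_median = statistics.median(sample)
print(predicted_median, sample_median)
```

The two numbers should agree closely here only because the sample really is exponential; on real duration data the shortcut is as good as that assumption.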

~~~
yummyfajitas
If anyone is interested, I'm building [1] an analytics service which won't tie
your hands. Why settle for the mean and count when you can have a histogram or
even the raw data?

If anyone is interested in trying it out, my email is in my profile. Always
looking for feedback.

[1] Actually I'm waiting to build it. I quit my job a couple of days ago and
will begin work after my notice period is up.

------
darkxanthos
I'm a professional split tester as well and the internet is so full of
opinions and noise. Programmers have plenty of chances to learn by just
downloading someone else's code. You can execute it and see that it works. In
statistics, it's much more difficult. There needs to be a stronger emphasis on
proving one's point using simulation.

Run something like this through 10,000 simulations and you'll start to see how
the theory really gets applied.
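
A minimal sketch of that kind of simulation, in the form of an A/A test (both groups drawn from the same distribution, so every "significant" result is a false positive). It assumes the test under study is a z-test that sets each group's standard deviation equal to its mean; that is my reading of the shortcut discussed in this thread, not the article's code. The function names, sample size, and distribution parameters are all arbitrary choices.

```python
import math
import random

def naive_z(a, b):
    """z statistic that assumes sd == mean in each group."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    se = math.sqrt(mean_a ** 2 / len(a) + mean_b ** 2 / len(b))
    return (mean_a - mean_b) / se

def aa_false_positive_rate(draw, n=100, sims=10_000, z_crit=1.96):
    """Fraction of A/A tests (no real difference) flagged as significant."""
    hits = 0
    for _ in range(sims):
        a = [draw() for _ in range(n)]
        b = [draw() for _ in range(n)]
        if abs(naive_z(a, b)) > z_crit:
            hits += 1
    return hits / sims

random.seed(1)  # fixed seed so the run is repeatable
# Data that matches the assumption: exponential with a 60-second mean.
rate_exponential = aa_false_positive_rate(lambda: random.expovariate(1 / 60))
# Heavier-tailed stand-in for real durations; the parameters are arbitrary.
rate_heavy_tailed = aa_false_positive_rate(lambda: random.lognormvariate(3.0, 1.5))
print(rate_exponential, rate_heavy_tailed)
```

When the data matches the assumption, the false positive rate should land near the nominal 5%; with the heavier-tailed stand-in it should come out far higher. Swapping in a sampler built from your own recorded durations is the more useful version of this exercise.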

------
viennacoder
I am afraid that the assumptions made in the article (regarding the
distribution) are invalid.

One very long (say 2 days) visit will skew the average. And you can't infer
the standard deviation from the average.

For an easy alternative, you could calculate the median time on site ex ante
-- let's say it's 1 minute. Then for your A/B test, see what percentage of
visitors in each variant leave within the first minute. Use those percentages
to determine whether the difference is statistically significant.
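
One way to run that comparison is a pooled two-proportion z-test. The sketch below uses only the standard library; the counts are hypothetical and the function name is made up for illustration.

```python
import math

def two_proportion_z_test(left_a, n_a, left_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = left_a / n_a
    p_b = left_b / n_b
    pooled = (left_a + left_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: visitors who left within the 60-second cutoff.
z, p_value = two_proportion_z_test(left_a=520, n_a=1000, left_b=470, n_b=1000)
print(round(z, 2), round(p_value, 3))
```

Because each visitor contributes only a yes/no outcome, a two-day visit counts the same as a two-minute one, which is what makes this robust to the long tail.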

