
The Surprising Power of Online Experiments (2017) - nadalizadeh
https://hbr.org/2017/09/the-surprising-power-of-online-experiments
======
jameslk
Notably absent from this article is any discussion of multi-armed bandits.
A/B testing only identifies the better of the treatments, whereas
multi-armed bandit algorithms find the optimal treatment _and exploit it_
during testing. If profit is involved, it seems obvious that you should be
exploring treatments but, most importantly, exploiting the treatment that
currently yields the most profit.

Google has a pretty good FAQ on this:

[https://support.google.com/analytics/answer/2847021?hl=en&re...](https://support.google.com/analytics/answer/2847021?hl=en&ref_topic=1745207)
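
For anyone unfamiliar with the explore/exploit tradeoff, here's a toy
epsilon-greedy bandit sketch. The conversion rates, epsilon, and round count
are made up for illustration; this is not anything from Google's FAQ:

```python
import random

def epsilon_greedy(conversion_rates, rounds=100_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over Bernoulli 'treatments'.

    With probability epsilon we explore a random arm; otherwise we
    exploit the arm with the best observed conversion rate so far
    (unpulled arms get an optimistic estimate of 1.0 so each is tried).
    """
    rng = random.Random(seed)
    n_arms = len(conversion_rates)
    pulls = [0] * n_arms
    wins = [0] * n_arms
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)   # explore
        else:                             # exploit current best estimate
            arm = max(range(n_arms),
                      key=lambda a: wins[a] / pulls[a] if pulls[a] else 1.0)
        pulls[arm] += 1
        wins[arm] += rng.random() < conversion_rates[arm]
    return pulls

# Most traffic ends up on the best arm while "testing" is still running,
# instead of a fixed 1/3 split like a classic A/B/C test would use.
pulls = epsilon_greedy([0.02, 0.03, 0.05])
```

The point is that the test itself earns money: traffic shifts toward the
winner long before you'd normally declare significance and ship it.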

~~~
joshuamorton
Gonna toot my own horn here for a second, but here[0] is a presentation I did
on multi-armed bandits (and specifically Thompson sampling, the tragically
underutilized optimal bandit method) to an undergrad ML group. Amusingly, the
diagram I use on slide 3 is from a research paper from Microsoft. We had a
guest speaker from MS's exp-platform team [1] the prior week, and she had
discussed A/B testing, but not touched on bandits, and I felt a need to make
this exact point.

[0]: [https://gtagency.github.io/2016/experimentation-with-no-ragrets](https://gtagency.github.io/2016/experimentation-with-no-ragrets)

[1]: [https://exp-platform.com/](https://exp-platform.com/)
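
Since Thompson sampling keeps coming up: the whole method fits in a few
lines. This is a minimal Beta-Bernoulli sketch with hypothetical arms and
conversion rates, not code from the linked slides:

```python
import random

def thompson_pick(successes, failures, rng):
    """Pick the arm whose draw from its Beta(1+s, 1+f) posterior is highest."""
    draws = [rng.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])

def run_thompson(conversion_rates, rounds=50_000, seed=1):
    rng = random.Random(seed)
    n = len(conversion_rates)
    succ, fail = [0] * n, [0] * n
    for _ in range(rounds):
        arm = thompson_pick(succ, fail, rng)
        if rng.random() < conversion_rates[arm]:
            succ[arm] += 1
        else:
            fail[arm] += 1
    return [s + f for s, f in zip(succ, fail)]  # pulls per arm

pulls = run_thompson([0.02, 0.03, 0.05])
```

Exploration falls out for free: uncertain arms have wide posteriors, so they
occasionally produce the highest draw, and there's no epsilon to tune.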

------
evv
It feels click-baity to reference a $50M revenue improvement without
specifying the revenue before the split-tests. Is $50M a 1% improvement? 100%?

~~~
greglindahl
Clickbait scores well in A/B testing, so I'd expect an article about A/B
testing to have a clickbait title.

------
schwax
Anybody have a recommendation for an A/B testing service?

We've talked to Optimizely, but their pricing was going to land in the same
ballpark as our AWS spend (into the six-figure range), which seems absurd.
They charge based on monthly users, but a lot of our traffic consists of
organic search bounces.

For now we just want to run ~5 experiments per month, want to record events
server-side so we can be sure not to lose any, and are wary about implementing
it ourselves since there are so many ways to screw it up without realizing it.

~~~
rbinv
You could try the free version of Google Optimize if you don't mind using a
Google product.

~~~
fillskills
We use Hansel

------
b_tterc_p
I hypothesize that many of these statistical tests have led to worse outcomes.
Not in the commonly espoused bad-for-society externality sense, but in a
straight-up bad-for-business sense. There are three problems I see with
running A/B tests at vast scale for small changes.

1\. If the effect size is incredibly small, which it will be for most minor UI
changes, reaching statistical significance is really difficult. If you're
looking for an incredibly small positive effect, then even with a sample size
in the hundreds of thousands, the probability of rejecting the null hypothesis
when the true effect is actually negative is surprisingly high! Very easy to
make mistakes.

2\. Short-term gains in engagement may lead to long-term disengagement.

3\. Business incentives for management are easily misaligned. I would imagine
a dominant negative influence is managers exaggerating the statistical effect
found in a test, because that means they get to lead the change in an
otherwise vast tech ecosystem whose performance probably won't change all
that much. Attribution is also hard (how sure are they about how much to
attribute here?), so credit is difficult to allocate beyond the initial value
sizing.
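
The first point is easy to demonstrate with a quick simulation: give the
treatment a tiny *negative* true effect, run many naive two-proportion
z-tests, and count how often a "significant" result points the wrong way.
All rates, sample sizes, and trial counts below are made up for illustration:

```python
import math
import random

def sign_error_counts(p_control=0.0500, p_treat=0.0495, n=100_000,
                      trials=2_000, seed=0):
    """Count 'significant' results and those whose sign is wrong.

    The true effect is slightly NEGATIVE (-0.0005), yet some tests will
    declare the treatment a significant *winner*.
    """
    rng = random.Random(seed)
    sig = wrong_sign = 0
    for _ in range(trials):
        # Normal approximation to the two binomial conversion counts.
        c = rng.gauss(n * p_control,
                      math.sqrt(n * p_control * (1 - p_control)))
        t = rng.gauss(n * p_treat,
                      math.sqrt(n * p_treat * (1 - p_treat)))
        diff = (t - c) / n
        # Standard error of the difference (using p_control for both,
        # a fine approximation at these rates).
        se = math.sqrt(2 * p_control * (1 - p_control) / n)
        z = diff / se
        if abs(z) > 1.96:
            sig += 1
            if z > 0:  # "treatment wins" -- but the truth is negative
                wrong_sign += 1
    return sig, wrong_sign

sig, wrong = sign_error_counts()
# A noticeable share of the "significant" results crown the wrong variant.
```

Even with 100k users per arm, a meaningful fraction of significant results
here have the wrong sign, which is exactly the trap described above.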

~~~
CrazyCatDog
I literally did a research project for a firm recently where we lowered wages.
The workers are super monitored; I have data on every minute of their day.
Post wage cut, workers worked just as hard. Awesome! Five months later, the
best people are (significantly) gone. Seven months later, it is demonstrably
obvious that this was a value-destroying move: average fixed effects look
horrible now, yet this is only clear to those of us watching from the outside.
Internally, the difference is completely overlooked.

I needed to find cites, and ironically I found the parent article this
morning. This one is way better about the lies we tell ourselves:
[https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3204791](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3204791)

~~~
b_tterc_p
How do you measure people like that?

------
RyJones
Source: [https://hbr.org/2017/09/the-surprising-power-of-online-experiments](https://hbr.org/2017/09/the-surprising-power-of-online-experiments)

I used to work with Ron Kohavi in his group.

~~~
dang
Ouch. We changed the URL from [http://blog.rootshell.ir/2019/02/how-to-increase-annual-revenue-by-50-million-using-online-experiments-methods-and-tricks/](http://blog.rootshell.ir/2019/02/how-to-increase-annual-revenue-by-50-million-using-online-experiments-methods-and-tricks/),
which seems to have copied that content, and banned that site.

Thanks for the heads-up.

~~~
RyJones
Thanks. I was wondering why Ron was writing there; he wasn't.

------
skywhopper
A good in-depth overview, but unfortunately it doesn't get into the bigger
issue of the potential ethical concerns of this sort of narrow maximization.
At a certain point you end up becoming Facebook or YouTube, vastly amplifying
toxic and dangerous content because it generates more comments or more time
spent on the site. And even if you believe in pure amoral capitalism, blindly
following what your algorithms tell you to do will eventually lead you into a
trap, and the backlash won't necessarily be pretty.

~~~
CharlesW
Supporting your point, YouTube claims to be backing off its "clicks at any
cost" model.

 _As has been mentioned previously, our business depends on the trust users
place in our services to provide reliable, high-quality information. The
primary goal of our recommendation systems today is to create a trusted and
positive experience for our users. Ensuring these recommendation systems less
frequently provide fringe or low-quality disinformation content is a top
priority for the company. The YouTube company-wide goal is framed not just as
“Growth”, but as “Responsible Growth”._

[1]
[https://blog.google/documents/33/HowGoogleFightsDisinformati...](https://blog.google/documents/33/HowGoogleFightsDisinformation.pdf)

------
ypolito
Nice article, but visual examples would really help.

Someone could change the CSS of the ads, for example making them flash
periodically, and that alone could be the reason for the increased clicks.

------
ape4
I always wonder how long the winner of an A/B test actually stays the better
option.

