
Conservation of Intent: why A/B tests aren’t as effective as they look - dedalus
http://andrewchen.co/conservation-of-intent/
======
snovv_crash
A/B tests tell you about short term gains, but don't tell you about long term
issues you may be accumulating due to things like dark patterns, clickbait
headlines, shoddy article topics, and more. A/B tests don't account for the
loss of prestige or reputation that the winning option can cause.

I've seen this repeatedly with ArsTechnica, which has devolved into so much
political and clickbait material that I don't even really visit anymore. Yes,
I'm guilty myself of clicking on those articles when I do visit, but at a
certain point I've found that Ars doesn't have the news I'm after, so I turn
elsewhere and now Ars has one less viewer.

~~~
citrablue
What's the alternative to A/B testing? Everywhere I've worked, design teams
hate A/B testing, but suggest "trust us we know what we're doing" as the
alternative.

A/B/n testing lets you explore a search space. There are issues that can come
up from that, but IMO it's a better option than this Dilbert comic[0].

[0] [http://dilbert.com/strip/2014-10-27](http://dilbert.com/strip/2014-10-27)

~~~
swied
A/B testing is a great tool for incrementally improving upon a product design
given the correct business strategy. Just don't let the limitations of A/B
testing drive how you choose your business strategy. The strategy is
ultimately about maximizing total profit, and not necessarily short term
profit per user.

~~~
citrablue
How do you create a discipline around making good decisions for business
strategy? I've seen "highest paid person" win, I've seen "Most vocal" win,
I've seen "5 versus 1" win.

The testing advocates I've worked with have an attitude of "I don't know what
will work - that's why we're testing". I have not seen much of that attitude
from most other individuals/teams involved in decision making -- they have
preferred to say "this will work", and get angry when I say, "compared to
what? how do you know?"

edit - fwiw, I totally agree with understanding the limitations of testing.
You have to know what it can be good for. My argument is that it's actually
better for long term strategy than most other decision making processes I have
experienced, which usually boil down to "gut feel".

------
birken
I could not disagree with this more. I remember vividly having this "low-
intent" vs "high-intent" debate at Thumbtack, when we rolled out changes that
A/B tests showed increased conversion (by a lot), but some people in the
company thought the changes were ugly and "off-brand" and argued they brought
in the wrong type of customers. So we ran the test again that we knew raised
conversion by a lot, and then followed the 2 cohorts of customers and watched
their behavior: the control group versus the extra ~10% of customers brought in
by whatever change had increased conversion. They behaved exactly the same. They came back again
at the same rates. They made the same amount of profit (per customer). Their
response rates to emails were the same. They closed jobs at the same rates. As
far as we could tell they were identical.

I have to admit I was a little surprised too, but for our business it didn't
seem this "high-intent" vs "low-intent" distinction existed. And with that out
of the way we continued to optimize conversion rates, and our revenue
continued to go up.

Every company is different so I don't want to generalize too much, but if
somebody tells me they ran an A/B test that said some key flow went up 10%,
but then afterwards the traffic/revenue/whatever didn't go up 10%, I think the
most likely candidate is bad test design. Humans are really good at rigging
A/B tests to produce wrong results in their favor. I guarantee every single
company who isn't maniacal about A/B testing does at least one of the
following:

\- Uses a tool to grade A/B tests that isn't statistically sound

\- Lets people check tests too often and allows them to stop the test when it
hits a good result

\- Runs a test with a lot of similar variations and cherry-picks the best
one

\- Doesn't plan for enough traffic to detect the size of change their test is
likely to produce

All of these create the potential for the perceived gains of an A/B test not
matching up with real-world results.
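
To make the "peeking" one concrete, here is a rough simulation (a sketch with
made-up numbers, Python standard library only, not our actual setup): A and B
are identical, yet checking every few hundred visitors and stopping at the
first significant-looking z-score "finds" a winner far more often than the
nominal 5%.

    # Sketch: repeated peeking inflates false positives even when A and B
    # are identical. All numbers here are made up for illustration.
    import math
    import random

    def z_score(conv_a, n_a, conv_b, n_b):
        # Two-proportion z-test with a pooled conversion rate.
        p = (conv_a + conv_b) / (n_a + n_b)
        se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
        return 0.0 if se == 0 else (conv_a / n_a - conv_b / n_b) / se

    def peeking_trial(p=0.05, peeks=20, batch=250):
        conv_a = conv_b = n = 0
        for _ in range(peeks):
            conv_a += sum(random.random() < p for _ in range(batch))
            conv_b += sum(random.random() < p for _ in range(batch))
            n += batch
            if abs(z_score(conv_a, n, conv_b, n)) > 1.96:
                return True  # stopped early on a spurious "win"
        return False

    trials = 500
    spurious = sum(peeking_trial() for _ in range(trials))
    print(f"declared a winner that isn't there in {100 * spurious / trials:.0f}% of runs")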

I'm not saying the distinction between "low-intent" and "high-intent"
customers doesn't exist, but it is fairly easy to test for. Do that test for
your business and see if that distinction exists. But don't use it as some
magical explanation for why your A/B tests aren't producing the results you
want, as this article suggests.

~~~
sokoloff
> Lets people check tests too often and allows them to stop the test when it
> hits a good result

I admit I've tried to do this in the past and was stopped by our analytics team
(in the sense that they took the time to patiently explain to me why what I was
doing was statistically unsound). It's not obvious, IMO.

~~~
btilly
Meh.

There is statistics for the purpose of uncovering Truth, and statistics for
the purpose of making a business decision. The difference is that when we talk
about Truth, a small error is still an error. When we make business decisions,
it is fine to make a decision that is probably right and that we know isn't far
wrong.

Here is a perfectly valid test procedure that illustrates the difference.
Decide the most time you would be willing to spend to get a test result.
Multiply that by your current rate of conversions (e.g. conversions per day) to
get N, the number of conversions that you expect to see by the end of the test.

Start running the test with two variations. Stop at any point if one variation
is at least sqrt(N) conversions ahead of the other. Stop at N if there is no
clear winner and go with whoever is ahead, even by a hair.
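
In code, a minimal sketch of that rule might look like this (my reading of it,
simulated with made-up conversion rates and a made-up budget):

    # Sketch of the sqrt(N) stopping rule described above; numbers are made up.
    import math
    import random

    def run_test(n_budget, rate_a, rate_b):
        # Serve one visitor to each variant per step. Stop early if one variant
        # is sqrt(N) conversions ahead; otherwise stop once n_budget conversions
        # have been seen in total and take whoever is ahead, even by a hair.
        lead_needed = math.sqrt(n_budget)
        conv_a = conv_b = 0
        while conv_a + conv_b < n_budget:
            conv_a += random.random() < rate_a
            conv_b += random.random() < rate_b
            if abs(conv_a - conv_b) >= lead_needed:
                break
        return "A" if conv_a >= conv_b else "B"

    # Suppose B's true conversion rate is slightly better than A's.
    wins = sum(run_test(2500, 0.050, 0.055) == "B" for _ in range(100))
    print(f"picked the truly better variant in {wins}% of simulated tests")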

Here are features of this test procedure.

o You always make a decision.

o Running a test has a known fixed cost. You know how long it takes. And a bad
idea will cost you no more than sqrt(N) conversions to test.

o The results are very simple and easy to understand.

o Your answers are usually right.

o Your bad decisions are not very bad. If the true conversion rate for one
version is better by 1/sqrt(N), you've got a 95% chance of making the right
choice. You will probably never make a mistake as big as 2/sqrt(N).

The result is a test procedure that is a horrible approach for doing science,
but an excellent tool for improving a business. You'll never find it in a
statistics class. And I'm sure it would horrify your analytics team.

~~~
joshuamorton
I mean, I think the only reason it would horrify your analytics team is that if
you want to do something that sophisticated, you may as well just use a proper
multi-armed bandit algorithm like Thompson Sampling or UCB-1 (which is very,
very similar to what you've described, although more formalized).

So I think it's wrong to say that you'd never find it in a stats class.
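
For reference, the core of Thompson Sampling fits in a few lines. A generic
sketch (Beta(1,1) priors, two variants, not tied to any particular library):

    # Generic Thompson Sampling sketch: sample a plausible conversion rate for
    # each variant from its posterior and serve the variant that samples highest.
    import random

    successes = [0, 0]  # conversions observed per variant
    failures = [0, 0]   # non-conversions observed per variant

    def choose_variant():
        samples = [random.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(2)]
        return samples.index(max(samples))

    def record_outcome(variant, converted):
        if converted:
            successes[variant] += 1
        else:
            failures[variant] += 1

    # Toy usage with made-up rates: the better variant gets served more over time.
    true_rates = [0.05, 0.06]
    for _ in range(10000):
        v = choose_variant()
        record_outcome(v, random.random() < true_rates[v])
    print("times each variant was served:", [s + f for s, f in zip(successes, failures)])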

~~~
srean
Bandits and A/B tests are meant for solving very different problems.

In the bandit setting there is a clear explore/exploit trade-off. There is no
such trade-off in the A/B formulation, although it does get used in scenarios
that have such trade-offs.

If I can pull the lever only a small, finite number of times there is a strong
incentive to use a bandit: I want to pull the wrong lever as few times as
possible. On the other hand, if I am given an unlimited number of pulls, I can
afford to pull the wrong one many more times (still finitely many) for the sake
of 'knowledge', knowing full well that I will have infinitely many
opportunities to exploit that knowledge.

~~~
joshuamorton
And for a business there always is. In an A/B test meant to improve customer
conversion, in a perfect world you use the superior variant on everyone,
converting the maximum number of customers. That saves you money.

In other words, the opportunity cost of putting someone in the wrong group
creates such a trade-off. You can pull the lever as many times as you want,
but each one potentially costs you money. It's textbook bandits.

~~~
srean
To a large extent I agree with you.

Differences creep in when there is ambiguity and judgement involved in which
metric the org wants to optimize. This is fairly common. Typically, in these
situations it's the PMs who make the final call. There the goal of the
experiment protocol is to glean as much knowledge as possible and present it to
the PM. The thinking there is -- if that comes at the cost of exposing some
customers to bad choices, so be it.

------
gfodor
The title is misleading relative to the article's content. Surely, as the
author points out, A/B tests can sometimes be misleading, especially if you
ignore longer term cohort analysis, etc.

But oftentimes, if you fix an obviously broken part of your funnel,
particularly in the early acquisition stages, you're fixing things that are
universally lifting the number of people who ultimately are able to engage
with your brand and product to the point where they can even form intent. The
reality is most people are only willing to give you a tiny bit of their time
during their first one or two engagements with your brand, so at that stage
you're trying to sell them on your product, and build intent. A/B testing
helps reduce the friction needed to get them through the core of your sales
pitch.

It's easy to come up with a thought experiment that shows A/B testing can
sometimes be as simple as you'd imagine: just break the site. Your conversion
drops to 0%, now split test the fix. Like magic, your control stays at 0% and
your variant returns to normal. Nothing about "intent" in this scenario, this
is pure friction resolution. Just a thought experiment, but shows that surely
there are plenty of places where pure A/B testing and removing friction is a
net positive without any fretting over this "conservation of intent" issue.

------
foobaw
Slightly relevant but useful: use mediation modeling
([https://eng.uber.com/mediation-modeling/](https://eng.uber.com/mediation-
modeling/))

~~~
mwexler
This is a nice read. BTW, in case folks get confused, moderators are different
from mediators.
[http://psych.wisc.edu/henriques/mediator.html](http://psych.wisc.edu/henriques/mediator.html)

------
smueller1234
What the article largely discusses seems to be a problem with the metrics one
chooses as a proxy.

Let's say you're actually trying to optimize total transaction value on the
site or total number of transactions or something like the overall fraction of
users with at least one transaction within a certain window of time. Then - as
the article rightly observes - getting users not to bounce on a particular
page is a TERRIBLE proxy to what you're optimizing for. If that's not clear to
you, you have no business running A/B tests without supervision.

Source: co-designed one iteration of the experimentation framework for
Booking.com many years ago. Indirectly managed the team of much more qualified
people that took it a world further.

------
User23
One of my coworkers is a trained particle physicist and he informs me he
almost never sees properly designed experiments used by our A/B testers. The
result is that the testers almost always find what they are looking to find.

------
ben509
I think the idea of "high intent" is the same fallacy as the notion of
"affordable." We say something is "affordable" because we have "enough" money
to buy it, but that's not how people make decisions in aggregate.

The reason economists talk about opportunity cost is because people are
constantly optimizing decisions based on new information. (Humans may not deal
with prices and numbers very well, but they're pretty well evolved to break
time into chunks and work out plans to solve problems.)

If you talk to an individual, they might say "I can't afford it," or you may
talk to someone who didn't click through and they might say, "I was just
browsing." The fallacy behind both is you're creating archetypes and assuming
they represent the modes of the population.

And even if you talk to the individuals you based those archetypes on, there
is a whole history behind how they arrived at "I can't afford it." Those
changing circumstances are why the aggregate behavior doesn't show some
arbitrary level of "affordability," and instead you see a smooth curve of
consumer demand.

And the opportunity cost of continuing to view a web page will not have neatly
quantized levels of intent, but rather individuals have a broad array of
competing interests.

------
MaxBarraclough
> You ship an experiment that’s +10% in your conversion funnel. Then your
> revenue/installs/whatever goes up by +10% right? Wrong :( Turns out usually
> it goes up a little bit, or maybe not at all.

Never mind "The difference between high- and low-intent users", this could be
explained in terms of regression toward the mean, a phenomenon mentioned in
neither the article nor the discussion here.

Have 1000 students do an IQ test. Pick the top 20 students. Have them do
another IQ test next week. Their mean score second time round will almost
certainly be lower than their mean score first time round. The reason they
made the top 20 the first time round was a combination of having a high true
IQ, and being lucky on the day. Second time round, they aren't 'defined to be
lucky', as it were.
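
A quick simulation makes the point (made-up numbers: an observed score is true
ability plus day-to-day luck):

    # Regression toward the mean: the top scorers' retest average is almost
    # always lower, because part of their first score was luck.
    import random

    true_iq = [random.gauss(100, 15) for _ in range(1000)]
    first = [(iq + random.gauss(0, 10), iq) for iq in true_iq]  # score = IQ + luck
    top20 = sorted(first, reverse=True)[:20]

    mean_first = sum(score for score, _ in top20) / 20
    mean_retest = sum(iq + random.gauss(0, 10) for _, iq in top20) / 20
    print(f"top-20 mean, first test: {mean_first:.1f}")
    print(f"top-20 mean, retest:     {mean_retest:.1f}")  # lower nearly every run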

It's the reason movie sequels tend to be worse than the original. The reason
the sequel was made was that the original movie was far more successful than
the average movie, on account of both unusually skillful creators, and
unusually good luck. Second time round, you can't count on the luck component
again.

------
baybal2
Totally true. I've lost count of the people from the web/startupey scene who
A/B tested their companies/business units into insolvency.

------
a-dub
Frustration is not linear. Film at 11.

------
jbob2000
Wait, so you’re telling me the laziest form of scientific analysis, the A/B
test, doesn’t produce accurate results? Colour me shocked.

A/B tests routinely leave out important observations, have way too small a
scope, use uncontrolled populations... I could go on. They run the gamut of
anti-patterns.

~~~
matt4077
....and criticising statistics is the laziest kind of scientific criticism....

A/B tests are fine. They work. They allow inference of causality. They are
easy to understand, and can be fun to run. They get you 90% of the way to
wherever you want to go, and such over-the-top criticism just seems like badly
executed pretentiousness.

~~~
jbob2000
You’re partly right - a scientific endeavor to figure out the color of a
button would be over-the-top, because it’s not _that_ important.

But to the article’s point, if you’re running banking software or something,
your users don’t give a shit what the button colours are; they will slog
through whatever you develop because they _need_ to get stuff done.

A/B tests are a small tool that sometimes get taken too far or used in the
wrong context.

~~~
lostcolony
'Taken too far or used in the wrong context' \- sure.

In the language of this article, banking app users are mostly all 'high
intent'. But that doesn't mean you can't evaluate criteria other than users
who completed the workflow to determine what is a design improvement. You can
still measure time to completion (how long it takes a user to get from entering
the workflow to finishing a given task) and play with the design. Optimize the
things users are doing the most, that sort of thing. A/B testing can help you
there. It's not the be-all and end-all; you still need sound UX
design to figure out what designs to test out, but it can give you measurable
data as to what works, rather than just UX gut feeling, or purely lab based
results which don't reflect reality.

