Management by Statistics (whattofix.com)
23 points by DanielBMarkham on Dec 12, 2011 | 9 comments


This article has a valid message that basically boils down to "correlation does not imply causation." While the discussion centers on the problems with making decisions based on aggregate statistics, the underlying theme, that having a piece of data doesn't mean you know what caused it, is the tried-and-true correlation/causation problem.

Unfortunately, I don't think the author teased these notions apart very well. He does suggest one solution to this problem: running A/B tests. That's a fine solution where it works. However, as he mentions, it is hard to do outside of limited circumstances. Beyond that, more sophisticated tools are necessary.

Coming from a background in economics/econometrics, I'm acutely aware of the statistical difficulties that arise when you cannot perform controlled experiments. Here, the root of the issue seems to be endogeneity [1]. For example, in his discussion of the 'stickiness' of Facebook, he claims:

  "Stickiness" is an aggregate number, it represents the result of the quality of the app. It's a result. It's not a cause.
I think this only gets us halfway to the problem. Yes, to some extent it measures the quality of the app, but there is also a feedback loop, where stickiness increases the quality of the app. So, if you're trying to measure the effect of quality on stickiness, or vice versa, you need some way of disentangling this feedback loop.

These tools exist. If you're interested in estimating both effects jointly, you can use Two-Stage Least Squares [2]. If you're just looking to estimate one side, you can use an instrumental variable approach [3]. These approaches are used all the time in economics, as the data involved is almost always messy.
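
To make the two stages concrete, here is a rough numpy sketch. The data-generating process is invented for illustration, and "marketing spend" is a made-up instrument, standing in for anything assumed to move quality without affecting stickiness directly:

  # Rough 2SLS sketch: estimate the effect of quality on stickiness
  # when shared noise makes the regressor endogenous.
  import numpy as np

  rng = np.random.default_rng(0)
  n = 1000
  instrument = rng.normal(size=n)       # e.g. marketing spend (invented)
  shared_noise = rng.normal(size=n)     # hits both variables: endogeneity
  quality = 1.0 * instrument + shared_noise
  stickiness = 2.0 * quality + shared_noise   # true effect is 2.0

  def ols(y, x):
      # Least squares of y on [1, x]; returns fitted values and slope.
      X = np.column_stack([np.ones(len(x)), x])
      beta, *_ = np.linalg.lstsq(X, y, rcond=None)
      return X @ beta, beta[1]

  # Stage 1: regress the endogenous regressor on the instrument.
  quality_hat, _ = ols(quality, instrument)

  # Stage 2: regress the outcome on the *fitted* regressor only.
  _, naive_slope = ols(stickiness, quality)     # biased upward (~2.5 here)
  _, tsls_slope = ols(stickiness, quality_hat)  # close to the true 2.0
  print("naive OLS:", naive_slope, " 2SLS:", tsls_slope)

The naive regression soaks up the shared noise along with the real effect; the instrument strips that part out, at the cost of needing a variable you can defend as exogenous.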

I'm curious why I hear so much about A/B testing, but rarely hear about other statistical methods for assessing changes in usage patterns. Are they not being used? Am I simply looking in the wrong places? Or is A/B testing really sufficient in most cases?

[1] http://en.wikipedia.org/wiki/Endogeneity_(economics)

[2] http://en.wikipedia.org/wiki/Simultaneous_equations_model#Es...

[3] http://en.wikipedia.org/wiki/Instrumental_variable


Author here. Thanks for the comment. My intent wasn't really to tease apart much or even to shed any light on statistics, but just to persuade the reader that this is both a critical area for startups and an area where our intuition is for shit.

I'd like to hear more about this. Anybody else with applied knowledge out there?

Seems to me there are two problems here. The first is that we live in a world with lots of data yet little information. Understanding the limits of what we can know is important. (And I hope it didn't all come down to "correlation does not equal causation" for you. The Monty Hall problem is definitely not an example of that.) The second is taking the pre-existing math and applying it in a useful manner.

The second problem is probably much more interesting to this crowd, but I was concerned that most of us (myself included) don't recognize the first problem to the degree we should. The essay actually began as a rant against trusting too much in aggregate numbers, but at the end I was left with having to make a recommendation, and the best I could come up with -- the most practical solution that is easily explainable and implementable -- is A/B testing. Are there other options that people are using?
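
Part of why I landed on A/B testing as "easily explainable" is how little machinery a basic version needs. Here's a minimal sketch of evaluating a finished test with a two-proportion z-test; the conversion counts are invented:

  # Minimal two-sided check: "did variant B convert better than A?"
  # Uses the normal approximation to a two-proportion z-test.
  from math import sqrt, erfc

  def ab_test(conv_a, n_a, conv_b, n_b):
      p_a, p_b = conv_a / n_a, conv_b / n_b
      pooled = (conv_a + conv_b) / (n_a + n_b)
      se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
      z = (p_b - p_a) / se
      p_value = erfc(abs(z) / sqrt(2))  # two-sided
      return p_b - p_a, p_value

  # Invented counts for illustration.
  lift, p = ab_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
  print(f"lift: {lift:.4f}, p-value: {p:.4f}")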


> Out of all the startup skills I've studied, this one -- effectively inferring intent from reams of numbers -- is probably the most difficult.

"There are only two types of people in this world; those who can extrapolate from incomplete information."

It's a humorous way to get the same point across. It's particularly fun to say in person during a discussion, then just stop speaking and wait for the point to hit home.

The more interesting question is why human beings get a warm, fuzzy feeling when they extrapolate from incomplete information (or extrapolate from incorrect information, or extrapolate incorrectly from correct information).

I haven't bothered to count them all, but there are plenty of myths about some deity dragging something bright or burning across the sky to explain the rising and setting of the sun.


> I'm curious why I hear so much about A/B testing, but rarely hear about other statistical methods for assessing changes in usage patterns. Are they not being used? Am I simply looking in the wrong places? Or is A/B testing really sufficient in most cases?

I believe part of the problem is that you actually know the definition of "A/B Testing"; the real definition (testing a single, specific change) is mostly lost on the majority of those using the phrase. In other words, "A/B Testing" in common use has essentially become slang for "Multi-Variate Testing" and other forms of testing.

For example, if you only change the color of a button, you're (formally) doing "A/B Testing," but if you change both the color and the location of a button, you're (formally) doing "Multi-Variate Testing." Nonetheless, you'll often see the latter called the former in common usage.
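
A toy way to see the difference in code (the button factors here are invented): an A/B test varies one factor, while a multivariate test crosses every factor into a grid of variants:

  # Toy illustration of A/B vs. multivariate testing.
  from itertools import product

  ab_variants = ["green button", "red button"]  # one factor: A/B

  colors = ["green", "red"]
  positions = ["top", "bottom"]
  mvt_variants = list(product(colors, positions))  # 2 x 2 = 4 cells
  print(mvt_variants)
  # [('green', 'top'), ('green', 'bottom'), ('red', 'top'), ('red', 'bottom')]

The practical consequence is that the multivariate grid grows multiplicatively with each factor, so each cell gets less traffic and significance takes longer to reach.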


Regarding your last point, we're trying to change things! As an economist you might be familiar with the bandit problem. We're bringing bandit algorithms to the web at http://www.mynaewb.com/ and our results show a BIG improvement over A/B (see the blog).
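
For the curious, here is a toy Thompson-sampling sketch of the idea. It's a generic textbook version with invented conversion rates, not Myna's actual implementation:

  # Toy Thompson sampling for a two-armed Bernoulli bandit.
  import random

  true_rates = [0.04, 0.05]  # unknown in real life; invented here
  wins = [0, 0]
  losses = [0, 0]

  for _ in range(10000):
      # Sample a plausible rate for each arm from its Beta posterior,
      # then show the arm whose sampled rate is highest.
      samples = [random.betavariate(wins[i] + 1, losses[i] + 1)
                 for i in range(2)]
      arm = samples.index(max(samples))
      if random.random() < true_rates[arm]:
          wins[arm] += 1
      else:
          losses[arm] += 1

  print("traffic per arm:", [wins[i] + losses[i] for i in range(2)])

Unlike a fixed A/B split, traffic shifts toward the better arm as evidence accumulates, so you pay less for the losing variant while the test runs.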


This seems like a shallow assessment of the use of statistics.

The author is absolutely right in this sense:

> In the first example [involving a low percentage of employed females compared to females in the general population], we are asked to reason by correlation and simile. Because something occurs at one rate in one place, we are asked whether or not a similar thing should occur somewhere else. No, the math does not say with certainty one way or another.

The math does not say with certainty that there is a management problem... but no single statistic will assert that for any situation, not even situation B. The point is that that statistic, combined with other aggregations (what is the % of women in the local population? what is the % of women in comparable companies?), may lead us to assert that there is a management problem.

Just because the problem is one that is particularly political does not mean that the inference from statistics is invalid.


In my experience, managers will look at the statistics and delegate discovering and solving the why. In big companies, statistics are gold (and very, very simple) because the managers don't understand the why. It's a chain of delegation based on "Key Performance Indicators" that ultimately leads to the people who know how to discover & fix the problem. This bottom-up approach may work in reactionary & manufacturing environments, but it fails in most other areas. Startups mean the fixers have to self-manage.


This doesn't even work in a lot of big companies. The "KPIs" many managers see reveal very little. They have often been massaged to present a rosier picture than the reality. Or they may present information that just isn't very actionable.

This problem is due to poor data management by companies: specifically, the data often require human intervention to be turned into summary statistics (e.g. KPIs). Any time a human being touches data, several potential problems arise:

1) unintended mistakes, e.g. misspellings

2) bias, for example truncating "outliers" that disagree with your desired message

3) failure to explore messy data that may contain more powerful messages than the clean data

4) failure to link data in one function to data in another function (e.g. linking operations data with financial data)


Another issue I've seen in the past is that management by statistics can lead to settling at local maxima. More extreme A/B tests can reveal said state.



