I once encountered this in the real world as a data analyst a long time ago. I was working at an e-commerce company, called The Hut Group, and the whole year our marketing team had been saying our marketing cost of goods sold (the percentage of our revenue we needed to spend on marketing) had been declining across every product category. But at year end, the execs were shocked to realize that our cost of goods sold had almost doubled, from 10% to nearly 20%.
The finance team had asked me to double check the marketing team's numbers, to see if there'd been some funny math in the reporting. But the marketing team were totally right: marketing spend across the three main categories - games, beauty, and nutrition - had all fallen (~15% to ~10%, ~30% to ~25%, and ~50% to ~30% respectively). However, the mix of these product categories had shifted massively, with nutrition growing from roughly 10% of our total sales to now nearly 50%.
In net that meant that whilst the marketing team had gotten more cost-efficient at selling every individual product category, the growth in the nutrition category had vastly outstripped the growth in all other categories, and since that was the most expensive category to market, the aggregate marketing cost % had gone up, even though the team had improved every category. I then had the fun job of explaining the Yule-Simpson paradox to a bunch of accountants.
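For anyone who wants to see the arithmetic, here is a minimal sketch of the mix shift. The per-category cost percentages and nutrition's jump from 10% to 50% of sales come from the story above; the games and beauty shares are my own assumption, so the blended totals are illustrative and are not meant to reproduce the recalled company-wide figures.

```python
# Marketing cost as a % of revenue per category (approximate figures from the story).
cost_pct_before = {"games": 15, "beauty": 30, "nutrition": 50}
cost_pct_after = {"games": 10, "beauty": 25, "nutrition": 30}

# Share of total sales per category, in percent.
# ASSUMPTION: only nutrition's 10% -> 50% is from the story; the rest is invented.
mix_pct_before = {"games": 80, "beauty": 10, "nutrition": 10}
mix_pct_after = {"games": 40, "beauty": 10, "nutrition": 50}


def blended_cost_pct(cost_pct, mix_pct):
    """Revenue-weighted average of the per-category cost percentages."""
    return sum(cost_pct[c] * mix_pct[c] for c in cost_pct) / 100


before = blended_cost_pct(cost_pct_before, mix_pct_before)
after = blended_cost_pct(cost_pct_after, mix_pct_after)

# Every category got cheaper to market...
every_category_improved = all(
    cost_pct_after[c] < cost_pct_before[c] for c in cost_pct_before
)
# ...yet the blended figure rises because the mix moved toward nutrition.
print(every_category_improved, before, after)
```

With these assumed shares the blended cost goes up even though each category's cost went down, which is the whole effect in three lines of arithmetic.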
Pretty much every dataset I work with as an SRE is full of these paradoxes. One classic published example comes from Google:
A network engineer took a trip to Indonesia or something (can't find the citation to confirm the exact tale), noticed the service was slow, and when asking around everyone said "that's how it's always been." Basically the local cellular networks are slow and the off-island fiber connections are saturated. Back at the office they decided to attack the problem by optimizing payload sizes. They did the work, cut download sizes in half, and shipped it. Latency metrics? Average and p95 latency actually increased after shipping the work to production.
How does an objectively good change make things worse? Well, the service had improved for those customers so much that they used it a lot more. Even with the lighter demand on bandwidth, their network latency to the datacenter was still worse than that of typical US customers, so as more of these people realized the service sucked way less, they used it more and drove the numbers up.
I have tons of these examples where a data team looks at a particular slice of request telemetry, and comes to a wrong conclusion because they didn't model enough of the system, or controlled for the wrong (or too many) variables. The worst ones are the cyclic finger pointing situations that Simpson's paradox can produce: App developers blaming a regression on the server side component while the server team blames the app team, often because the server and app release schedules accidentally aligned too well. In this case we have canary data to exonerate our side of the equation, but sometimes the problem lies in even deeper spaces, like app updates from an entirely different app.
Good point! I'm just a humble Linux sysadmin dubbed "SRE" who slept through Stats for Engineers and now pays the price every week dealing with SWE eager to blame me for their mistakes.
You were right; that was a case of Simpson's paradox. Every category experienced a latency improvement but the overall statistic worsened. Jevons paradox is what caused the induced demand, but when the new usage data was gathered the initial review was an example of Simpson's paradox.
Effect of the change -> Jevons paradox.
Measurement of Jevons paradox -> Simpson's paradox (in this case; that isn't a general rule).
The fact that the two are easily linked is one of the reasons the statistical paradox is so common in practice.
Latency improved for everyone, but overall average latency increased because usage increased faster in high latency areas. That's Simpson's Paradox. Simpson's Paradox doesn't care where the subpopulations you're measuring came from.
If I recall the YouTube slow-internet optimisation case correctly, I think it is an example of Simpson's paradox. They made it faster for countries with fast internet, and faster for countries with slow internet, and then the average performance across all users/countries was slower, because now the countries with slow internet used YouTube much more than before.
It would be Simpson's paradox if Google services in Indonesia were initially slow because Indonesians tend to use YouTube more often than lighter services.
There wasn't an error in the conclusions of the initial measurement. It was the solution that had problems.
How does "Average and p95 latency actually increased after shipping the work to production. How does an objectively good change make things worse?" relate to Simpson's paradox again?
That's exactly it. After "shipping the work to production" (making it faster for everybody), the overall average and p95 got worse. Each sub-population experienced improvement: countries with fast internet got faster youtube, countries with slow internet got faster youtube. But the overall average and p95 got worse: overall average was slower youtube. Because now more users from the second sub-population bring the overall average speed down (or latency up). That's Simpson's paradox.
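A minimal sketch of that effect with two sub-populations. All request counts and latencies here are invented for illustration; none of them come from the actual Google/YouTube data.

```python
def mean_latency_ms(regions):
    """Request-weighted mean latency over {region: (requests, latency_ms)}."""
    total_requests = sum(requests for requests, _ in regions.values())
    total_time = sum(requests * latency for requests, latency in regions.values())
    return total_time / total_requests


# HYPOTHETICAL numbers: (requests, mean latency in ms) per sub-population.
before = {"fast_network": (900, 100), "slow_network": (100, 1000)}
# After the optimisation both groups are faster, and the slow group uses
# the service six times as much as before.
after = {"fast_network": (900, 90), "slow_network": (600, 700)}

each_group_faster = all(after[r][1] < before[r][1] for r in before)
overall_before = mean_latency_ms(before)
overall_after = mean_latency_ms(after)
print(each_group_faster, overall_before, overall_after)
```

Both groups improve, but the overall mean moves from 190 ms to 334 ms purely because the slow group now makes up a much bigger share of the traffic.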
Ah, you may be right. It's not clear in the story that "Average and p95 latency actually increased after shipping the work to production." means average of Indonesia and ex-Indonesia and not just Indonesian average.
This reminds me of a similar story with YouTube [1] where improving the page weight decreased the metrics because more people with lower end connections could access the page.
Metrics interpretation is as important as the metrics themselves!
It is, but usually the meme misrepresents induced demand. While I don't like cars and we should focus on other infrastructure, adding a lane does help.
It does not reduce congestion, but it does now serve more people at this same current congestion level. And those people have come from somewhere. Sometimes from public transport, which isn't really good, but sometimes from some backwater road.
The bigger problem with induced demand is that it's often poor ROI to add that lane where the demand is highest.
That is, imagine you have a big city. You can add capacity for 1m extra people to travel to the city centre, where there's lots of congestion. Or you can find ways to induce demand around the outer limits of town, even though current demand is low there.
Odds are you'll pick the first, because it's "obvious" and doesn't require much thinking to see it'd help. But we really ought to look at cost-benefit of the second option too, because repeatedly inducing demand in the centre keeps driving up the incremental cost of further improvements, along with plenty of other undesirable second order effects.
Adding lanes is like getting a bigger cache with the same throughput.
It's obvious at the supermarket: what goes faster, a single cashier processing four short lanes of 10 people with round robin, or two cashiers processing a single lane with 40 people?
Is the city center able to process 1m extra people? If not, it doesn't matter how many lanes you build.
Well you often can make it able to "process" 1m extra people: You can build overpasses, and tunnels, and taller buildings. But the cost-per-extra-person will tend to go up accordingly, to the point where you could spend an extraordinary amount attracting people out of the centre.
E.g. London's "Crossrail" / Elizabeth line cost $24 billion. Granted, it also allows some people to go through London faster, but I can't help but wonder what that money could've done if applied to attract businesses out of the centre instead. E.g. upgrading links between towns on the outskirts, upgrading town centres, and generally trying to make it more attractive for businesses to be located further out.
Given the extraordinary costs it takes to do large infrastructure projects in London, I'd be very surprised if you couldn't get a higher return on investment that way, or by investing similar sums elsewhere in the UK entirely.
Until more people choose to live further away because the commute is now tolerable with the extra lane (and it's cheaper), and then you're back to square one.
Covid vaccination rates and deaths were rather famously subject to it. E.g. some combination of stats like “most covid deaths were vaccinated individuals”, “vaccination reduces death rate”, and “population segment with lowest vaccination rates has lowest covid death rates.” were all true at the same time.
Those aren't examples of Simpson's paradox even taken together, but there was a famous (by which I mean it got a lot of press, including being written up in the Times and Post when it came out) study that showed that although every subgroup in Italian demographic data had lower CFRs than their Chinese counterparts, the Chinese group had a lower CFR when taken as a whole:
IME, the "problem" (to the extent there is one) is almost always that the naïvely-chosen KPI metric wasn't specific enough.
Here's a recent example from a friend. You're a SaaS company, and your home page's load time is reported as slow. You set your KPI for the quarter to be "reduce p99 load time of the home page by 50%".
The load time is a function of customer size, so bigger customers = slower home page. It's actually a quadratic function. So the p99 of small customers is like the p50 of large customers. You have 20 small customers and 20 big customers.
That quarter, the sales team onboards 10 new tiny customers, and 10 big customers churn. It's the holiday season in your big customers' geo, so mostly small customers are using the platform. It's the busiest time of year for the small customers, so they're over-using the platform.
All these factors lead to p99 latency dropping by 60%, smashing the KPI goal. Bonuses all around, pats on the back. And no code changes needed, besides!
The solution is: choose a KPI that is tightly coupled to your problem, and not confounded with other variables.
In the above case, a better KPI would have been "p99 latency for large customers", because it is robust to the distribution of customer sizes across current users, churned users, and seasonal differences in usage.
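A toy version of that story, as a sketch under loud assumptions: the quadratic load-time formula, the customer sizes, and the request counts below are all invented, and every request from a customer of a given size gets the same load time to keep the example deterministic.

```python
def p99(values):
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n) using integer arithmetic
    return ordered[rank - 1]


def load_time_ms(customer_size):
    """ASSUMED quadratic relationship between customer size and load time."""
    return 50 + customer_size * customer_size // 50


SMALL_SIZE, LARGE_SIZE = 20, 200  # hypothetical customer sizes


def traffic(n_small, requests_per_small, n_large, requests_per_large):
    """Return (small-customer load times, large-customer load times)."""
    small = [load_time_ms(SMALL_SIZE)] * (n_small * requests_per_small)
    large = [load_time_ms(LARGE_SIZE)] * (n_large * requests_per_large)
    return small, large


# Start of quarter: 20 small and 20 large customers, equally active.
small_q1, large_q1 = traffic(20, 100, 20, 100)
# End of quarter: 10 new small customers, 10 large ones churned, large customers
# mostly on holiday, small customers in their busy season. No code changes.
small_q2, large_q2 = traffic(30, 400, 10, 10)

p99_all_q1, p99_all_q2 = p99(small_q1 + large_q1), p99(small_q2 + large_q2)
p99_large_q1, p99_large_q2 = p99(large_q1), p99(large_q2)
print(p99_all_q1, p99_all_q2, p99_large_q1, p99_large_q2)
```

The pooled p99 collapses once large-customer traffic falls below 1% of requests, while the large-customer p99 does not move at all, which is why the narrower KPI is the one that tracks the actual problem.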
I thought it was pretty common to apply mixed / hierarchical linear models? I didn't study statistics, but in our field, where many problems involve modelling biological effects, we would do that.
> Mathematician Jordan Ellenberg argues that Simpson's paradox is misnamed as "there's no contradiction involved, just two different ways to think about the same data" and suggests that its lesson "isn't really to tell us which viewpoint to take but to insist that we keep both the parts and the whole in mind at once."
Keeping multiple possibilities in mind at once was what allowed the Epicureans to determine survival of the fittest, trait inheritance from each parent, that light was made of discrete units that weighed very little and were moving very fast, and that in order for free will to exist the quanta making up matter had to have multiple possible results under the same governing physical laws and conditions - all several millennia before the scientific method independently found the same results.
It's a great analytical method, especially in data analysis as suggested here.
> Mathematician Jordan Ellenberg argues that Simpson's paradox is misnamed as "there's no contradiction involved, just two different ways to think about the same data"
Isn't this a bit of a misunderstanding on their part on the meaning of the word "paradox"? The fact that they're called paradoxes is that they go against initial intuition and _seem_ contradictory, not that they necessarily are. If anything, I'd guess that most of the named paradoxes turn out to not actually be contradictory because when something seems incorrect and actually is incorrect, it's a lot less likely to be interesting enough to give a name.
There are various categories of paradoxes and different ways people have categorized them.
Quine calls this one a “veridical paradox”, where it seems false but is true.
Examples of different types of paradox are: any proof that 1=0, Russell’s Paradox, and Zeno’s paradox. These are either false in some sense or used to illustrate fallacious reasoning.
Since you didn't specify under what system we need to prove that 0=1 doesn't exist, I vaguely remembered or figured there was a simpler version of arithmetic under which that concept makes sense, but which wouldn't be strong enough to fall into incompleteness territory (so it would have to be weaker than Peano arithmetic, like you said).
> The signature of Presburger arithmetic contains only the addition operation and equality, omitting the multiplication operation entirely. The theory is computably axiomatizable; the axioms include a schema of induction.
So a very dumbed-down version of arithmetic, but which does contain a notion like 0=1, and which is complete and consistent, so it can't contain a proof of 0=1.
Obviously, this is probably not the kind of thing you meant, hence my cheekily bringing it up :)
> that in order for free will to exist the quanta making up matter had to have multiple possible results under the same governing physical laws and conditions
That seems like a strange and out-of-place statement, unless I'm misunderstanding it.
I assume this is talking about Quantum mechanics, but I don't think this represents Quantum mechanics or free will correctly, and I doubt the Epicureans knew anything about QM at all.
Ellenberg says the way to avoid falling into Simpson's paradox is to keep multiple views of the data in mind when doing analyses.
Let's say you were in ancient Greece, and you separately observe a drummer on a hill bang a drum before you hear it.
Then on another day you see lightning before you hear thunder.
If you consider each event on its own, a perfectly logical explanation is that there's something unique to drums that slows down the sound from them so it takes longer to reach you, and that lightning and thunder occur at different points in time.
But if you consider the set of both events together, a hypothesis that solves both at the same time is that things you hear take longer to reach you over long distances than things you see.
This was actually one of the examples directly from Lucretius, who in discussing the multiple hypotheses for why lightning and thunder occur at different times tied his suggestion that they occur at the same time but have different travel speeds to his observations of drummers on hills.
My point is less about Simpson's paradox specifically and more about the general value of Ellenberg's analytical advice for avoiding it: that same habit was at the root of the success (in hindsight) of one of the wiser philosophy schools in antiquity.
I can't recommend enough straight up reading Lucretius's Nature of Things.
But one of the examples in there of how their methodology ends up successful is when he's discussing the possible reasons lightning and thunder occur at different times.
One possibility thrown out is that they are actually occurring at different times. But another is that they occur at the same time but one takes longer to reach the viewer than the other.
On their own, these two ideas don't indicate the correct answer.
But then Lucretius ties the latter to another observation - that this seems similar to how a drummer in the distance can be seen to beat the drums before you would hear the drums.
Essentially in an age without the methodology of testable predictions, they circumvented that shortcoming by considering multiple hypotheses for multiple naturally occurring observations and looking for overlaps between them.
This seems to have pointed them in the correct direction on a number of major topics, especially relative to their contemporaries who were generally arguing for a particular hypothesis with various appeals to rhetoric or principle (like Aristotle claiming the leader of a bee hive couldn't be female because it had a stinger and "the gods don't give women weapons").
The times the Epicureans completely miss the mark are generally when they disregarded their principle of avoiding false negatives and discounted things with insufficient observational evidence (for example, they had pretty bad cosmology and they rejected the Stoic pre-gravity due to their incorrect base assumption of infinite amounts of matter). The times they kept an open mind and considered how concepts overlapped, even when they were wrong about the 'why' of an initial assumption, they were often correct in secondary assumptions when tying it into multiple other systems and observations.
Sorry to nitpick, but "light was made of discrete units that weighed very little and were moving very fast" is not really correct.
First of all, light has exactly zero weight (only a massless particle can travel at exactly the speed of light, and at no other speed for that matter).
Secondly, you're leaving out the wave/particle duality of light, which sort of recalls the description of Simpson's paradox as "just two different ways to think about the same data", and without which you simply can't fully understand the behaviour of light (or of the statistical system you're looking at).
This was written in 50 BCE, nearly two thousand years before Einstein's Nobel winning work proving the discrete qualities of photons.
I'm well aware it's at best a partial description of light.
But it's leagues ahead of Plato's tiny triangles of fire in Timaeus or any other contemporary descriptions.
Also, technically zero mass is very little weight (the least, in fact). And the speed of light is very fast (the fastest). So Lucretius was correct in his statements, if just conservative in the degree to which he stated them (which was in line with the Epicurean commitment to the avoidance of false negatives).
Wave particle duality doesn't really get discussed in Western antiquity outside of a single tangent describing the beliefs of the Peratae who claim the universe has a threefold nature, with the first being continuous and infinitely divisible, the second being a near infinite number of potentialities, and the third being a formal instance. There's a bit of an Everettian quality to their thinking, but outside of its quite broad scope of thought I'm unaware of anyone saying "yeah, reality is both continuous and discrete at the same time" until physicists grappling with contradictory experimental results in the 20th century. The closest in antiquity outside of this group was arguably Plato's theory of forms where the forms were continuous and their physical manifestations discrete, though this is materially different from the idea they are both simultaneously occurring in what's around us (even if Plato's paradigm most likely influenced the much later Peratae).
Well, if we're going to nitpick, light has zero rest mass[0], but does have mass while in motion. This is how solar sails can work, since they use the momentum from the photons.
Weight isn't mass; weight is the force acting on something due to gravity. Gravity affects light, albeit only by a little, so in this sense light has a small but nonzero weight.
I don't believe that's a correct interpretation.
The reason light bends in the presence of gravity is that space time itself is curved, and light follows a "straight line" on that curved space time.
Given weight is defined as `W=mg`, and `m` is `0` for light, light can't have any weight. I think the question is itself incorrect: you can't weigh light because light is not something you can "stop" and put on a balance.
The fact that gravity appears to "attract" light is an illusion. Light only has what is called "relativistic mass" which has very little to do with how we normally think of mass and weight.
> the reason light bends in the presence of gravity ...
This is also why gravity bends the trajectories of massive particles, which also follow geodesics of the curved spacetime (in the absence of other forces).
When I taught intro stats many years ago I used to use house prices as a nice example of Simpson's Paradox (with actual data, for the students to investigate as part of a computational lab). The data I had was on US house sales from 2008, so it's 15 years out of date now--perhaps things have changed since.
At the time, the average price for single-family house sales was higher for houses without central AC than for houses with central AC. Yet when you split the data down by state, in every state the relationship was reversed: houses with central AC were more expensive than houses without.
The higher nationwide average price of houses without central AC was driven primarily by the large number of expensive houses in California without central AC.
I'm reading this outcome to be the reverse of the examples above. Or perhaps it's about identifying the correct stat to use based on your goals.
In other words, in this case I don't really care what the national average is. I care about my house, my street, my area.
In other cases, like in marketing, the stat that matters first is overall net profit. From there we can burrow down to understand the factors. In which case we come across business share before marketing spend.
In the networking example, the goal is usage (throughput). Not speed or latency.
Drawing the wrong stat first leads to incorrect conclusions.
Right, one of the interesting things about Simpson's paradox is that there's not a uniform right answer: sometimes you care about the overall average, sometimes you care about the averages of subpopulations. You have to judge that based on the situation.
One of the other comments linked [1] which includes Judea Pearl's analysis of Simpson's paradox from a causal inference point of view [2], which lays this out nicely (though maybe not easy to understand--it took me many hours of study to get comfortable with Pearl's causal inference work, even with a strong stats background).
> like in marketing, the stat that matters first is overall net profit.
I have a take on that. The stat that matters is the profit per unit of non scalable business resource. As in how much management, marketing, sales, accounting, and engineering time does the product take per unit. It's important because those are often hard to scale. You can have a low margin product that requires zip of the above and it's good business. And the reverse, high margins but requires too much of the above and it's bad.
These two effects explain a lot of the stupid decisions that come out of "data driven" processes. It is common for data to suggest the opposite of the truth.
> It is common for data to suggest the opposite of the truth.
Actually, I think the best takeaway from phenomena like these is that just doing statistics on a set of data can't tell you "the truth". If you don't understand the actual causal factors in play, your knowledge is very limited, no matter how much data you have or how many different ways you slice the statistics.
For example, in the UC Berkeley case described in the Simpson's Paradox article, the data actually doesn't tell you anything useful about "bias" in the sense of "something people are doing that they should do differently to make the admissions process fairer". It doesn't even tell you where to look for possible "bias" without knowing more about the admissions process: is it controlled primarily by departments or by the university as a whole?
Is the UC Berkeley case a good example of the importance of normalizing data before analyzing? Where things need to be put on a level playing field and handicaps applied to remove auxiliary noise.
Normalizing data doesn't fix the issue in the UC Berkeley case, because you still have to pick what to normalize over: do you normalize over the entire university, or separately over each department?
The answer to questions like that can't be found in the data. You have to go look at how the university admission process actually works, and what roles the university vs. the individual departments play in it.
> just doing statistics on a set of data can't tell you "the truth". If you don't understand the actual causal factors in play, your knowledge is very limited
I would argue that ultimately, all your knowledge and understanding comes from "doing statistics on data". Maybe the statistics is done by sloppy slurpy things in the brain instead of in R, and maybe it's actually mathematically unsound most of the time, but it's still some sort of statistics.
I think the key difference is between statistics on passively collected data vs results from active experiments. The former will only ever show correlations, while the latter can prove causal results from the actions of the experimenter.
Also, results from active experiments aren't limited to statistics. You can set up experiments to have discrete results, where no statistics is required to test a hypothesis.
For example, the GHZ experiment [1] can rule out local hidden variable models and confirm QM predictions with no statistics at all: the two different models make contradictory predictions with no continuous variation between them.
Sure, but that doesn't contradict what I said. From the Wikipedia article I referenced:
"For specific combinations of orientations, perfect (rather than statistical) correlations between the three polarizations are predicted by both local hidden variable theory (aka "local realism") and by quantum mechanical theory, and the predictions may be contradictory."
"Perfect" correlations means, as the parenthetical comment shows, "doesn't require statistics to check".
> the GHZ experiment [1] can rule out local hidden variable models and confirm QM predictions with no statistics at all
However, one needs to use statistics to even show GHZ works. That does sound contradictory to me. The correlations you get in experiments are never perfect and in this case they can be pretty far from perfect.
> one needs to use statistics to even show GHZ works
Not for the particular cases described in the quote I gave. For a complete verification of all the GHZ theorem's predictions, yes, you need to do statistics, because some of those predictions are probabilistic.
> The correlations you get in experiments are never perfect
In some cases, like the ones described in the quote I gave, it isn't a matter of correlations. You have contradictory results predicted by two different models, each prediction being 100% certain according to the model. You don't need any statistics to test that: just do one single run and see which way it comes out.
Imagine the horizontal axis is dose of a drug, and the vertical is the response, like hours of sleep. Looking at Lisa's response, it's clear that increasing the dose reduces sleep. Same for Bart. But if you do a linear regression of all the data, shown by the red line, dose increases sleep, which is wrong.
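Since the graph isn't reproduced here, a minimal sketch of the same picture in numbers. The dose and sleep values for Lisa and Bart are invented for illustration, and `ols_slope` is just a hand-rolled least-squares slope, not anyone's library API.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# HYPOTHETICAL data: (dose, hours of sleep). Within each person, more dose
# means less sleep, but one person gets higher doses and sleeps more anyway.
bart_dose, bart_sleep = [1, 2, 3], [4, 3, 2]
lisa_dose, lisa_sleep = [6, 7, 8], [9, 8, 7]

slope_bart = ols_slope(bart_dose, bart_sleep)
slope_lisa = ols_slope(lisa_dose, lisa_sleep)
slope_pooled = ols_slope(bart_dose + lisa_dose, bart_sleep + lisa_sleep)
print(slope_bart, slope_lisa, slope_pooled)
```

Each person's slope is negative, while the pooled regression comes out positive because it is mostly picking up the difference between the two people rather than the effect of the dose.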
I did not know Simpson's paradox was an object lesson in causal inference until the other day. The right paradigm dispels the paradox. Here's a better article: https://plato.stanford.edu/entries/paradox-simpson/
Upon first glance I assumed it was The Simpsons episode about Mr. Burns having "a vast range of diseases so great in fact that they cancel each other out," explained here: https://simpsons.fandom.com/wiki/Three_Stooges_Syndrome
Surprised at the similarity but of course that was probably on purpose by the genius Simpsons writers of the late 90's.
I was reading the example of UC Berkeley appearing to have gender bias in admissions and read the following:
“it showed that women tended to apply to more competitive departments with lower rates of admission, even among qualified applicants (such as in the English department), whereas men tended to apply to less competitive departments with higher rates of admission (such as in the engineering department)”
That’s the opposite of what I would expect. I’d expect that English and the arts in general would be a lot easier to get into than STEM; that’s how it is in Australia.
Edit: When I say get into I mean get into university, not getting into the industry
The data is for application to graduate programs. There is a ton more funding for engineering, and many/most students going for a PhD in engineering don't pay for it. There's very little funding in the humanities, and most students are not willing to pay high costs for a PhD in the humanities, so the department tightly restricts admission.
As a result, it's easier to get into an engineering PhD program - as long as you are competent enough.
I had a friend who was a fellow engineering student. He became disillusioned and wanted to go into journalism. He applied to transfer to the Communications program at the university and told me how competitive it was - they admit less than 10 people per year. He did not get in.
[1] Obvious if you've spent a lot of time in grad school.
I was surprised to read that too. I think the answer is that here we're looking at admissions rate = number admitted / number applied, which is not the same as overall difficulty in a conceptual sense.
The only people applying to grad school in math are people who got a BS in math and did so with good grades (or perhaps some other STEM field + significant theoretical math coursework). On the arts side I suspect they draw from a larger pool (plus people tend to switch from STEM to something else a lot more than the other way around) of backgrounds. It's easier to convince oneself that a short story is great (when others may disagree) than convincing oneself a math proof is correct when it objectively is not. So there's less self-selection on the applicant side, and hence a lower admissions rate.
Competitiveness is one measure of difficulty, but there are others. Engineering departments tend to be qualitatively more difficult to get into than the humanities, which are quantitatively more difficult.
My gripe with most takes on the affirmative action debate is that they completely ignore this issue (in spite of the Berkeley case-study). It's trivial to take a set of admissions data and partition it by race to reveal a "racial bias", which would shrink or disappear if other factors correlated with race (like income) were accounted for.
For all of the examples on Wikipedia, it seems like there was some confounding extra variable that was missed. I wonder if anybody knows of a case where it just sort of happened randomly, with no big underlying cause?
Or maybe I’m thinking of it wrong and this is impossible.
It can happen any time there is a mix shift in the underlying quantity of the subgroups. It's just that random changes in quantities are not likely to be studied or reported. It's easy to generate manually though.
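To back up "easy to generate manually", here is a small brute-force sketch that needs no story behind the numbers at all. It enumerates tiny success/trial tables for two arms (A and B) across two subgroups and returns the first one where A wins in both subgroups but loses when pooled. The function names and the search bound are my own choices for illustration.

```python
from fractions import Fraction
from itertools import product


def rate(cell):
    """Success rate of a (successes, trials) cell, as an exact fraction."""
    return Fraction(cell[0], cell[1])


def pooled_rate(cell_1, cell_2):
    """Success rate after merging two (successes, trials) cells."""
    return Fraction(cell_1[0] + cell_2[0], cell_1[1] + cell_2[1])


def find_reversal(max_trials=4):
    """Return (a_1, a_2, b_1, b_2) cells showing a reversal, or None."""
    cells = [(s, n) for n in range(1, max_trials + 1) for s in range(n + 1)]
    for a_1, a_2, b_1, b_2 in product(cells, repeat=4):
        a_wins_each_subgroup = rate(a_1) > rate(b_1) and rate(a_2) > rate(b_2)
        b_wins_pooled = pooled_rate(a_1, a_2) < pooled_rate(b_1, b_2)
        if a_wins_each_subgroup and b_wins_pooled:
            return a_1, a_2, b_1, b_2
    return None


example = find_reversal()
print(example)
```

Even with at most four trials per cell a reversal exists (for instance A: 1/4 and 1/1 versus B: 0/1 and 3/4), so all it takes is unequal subgroup sizes lining up the wrong way.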
In order for it to count as Simpsons paradox I think there would need to be a confounding variable. It's certainly possible for it to appear spuriously, and for something to look like a confounder when it isn't, but there would need to be some type of subgroup.
Encountered it recently. I had two different datasets from different domains to evaluate model performance on.
One dataset was closer to training data and the other was closer to our business use case. The hypothesis was that performance on the latter dataset would be poorer due to overfitting.
Indeed the accuracy on all categories had reduced. However, overall accuracy was much higher!
This was because the second dataset had higher frequency of easy to predict categories.
If we had just looked at overall number we would have concluded that there was no overfitting to train domain, which was not the case.
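Roughly what that looked like, as a sketch with made-up counts (two categories instead of the real set, and none of these numbers are from the actual evaluation):

```python
# HYPOTHETICAL (correct, total) counts per category for each evaluation set.
near_training = {"easy": (190, 200), "hard": (560, 800)}
business_case = {"easy": (810, 900), "hard": (60, 100)}


def accuracy_by_category(results):
    """Per-category accuracy from {category: (correct, total)}."""
    return {cat: correct / total for cat, (correct, total) in results.items()}


def overall_accuracy(results):
    """Accuracy over all examples, ignoring category."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


per_cat_train = accuracy_by_category(near_training)
per_cat_business = accuracy_by_category(business_case)
dropped_everywhere = all(
    per_cat_business[cat] < per_cat_train[cat] for cat in per_cat_train
)
print(dropped_everywhere, overall_accuracy(near_training), overall_accuracy(business_case))
```

Accuracy falls in both categories, yet the overall number climbs from 0.75 to 0.87 because the second set is dominated by the easy category.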
That second graph reminds me of Shepard tones [1], known e.g. from the Super Mario 64 staircase, where each component is steadily rising in pitch, yet the tone as a whole stays exactly the same in the long term.