
Misleading Graph Generator - gulbrandr
http://www.yrden.de/misleading-graph-generator/
======
jere
This isn't a graph generator. It's just a graph. You can plug data into it and
get a graph, I guess, but the same goes for Excel.

A misleading graph generator that automatically matched concurrent data sets
based on correlation would be quite interesting, but this isn't it.

~~~
gavinpc
So are you saying that the term "graph generator" is misleading here?

~~~
jere
_Ba dum cha_

------
mrtksn
How is this the graph's fault? The only fault the graph has is that it clearly
displays the data; the problem is in the idea being represented. This is
formally called "correlation does not imply causation", and it's the fault of
the person suggesting it.

There is a famous satirical version of this too:
[http://en.wikipedia.org/wiki/File:PiratesVsTemp(en).svg](http://en.wikipedia.org/wiki/File:PiratesVsTemp\(en\).svg)

It has nothing to do with graphs. Graphs are just visual tools, and people
reading data from them are supposed to evaluate it just like data from any
other tool, and not jump to conclusions.

What the graph in question shows is that both "transistor count" and
"average life expectancy in Germany" have risen through time. If you read this
as "rising transistor count increases life expectancy", that's your fault. Why
not read it as "from 1971 to 2011 both transistor count and life expectancy
increased steadily, maybe because of advances in technology; we should look
into it, it's too early to say anything"?

~~~
coldtea
> _It's formally called "Correlation does not imply causation" and it's the
> fault of the person who is suggesting it._

Well, correlation sure DOES imply causation.

In the dictionary sense of "imply", as: "suggested but not directly expressed;
implicit".

What it does not do is necessitate causation, but it sure as hell does imply it.

~~~
mrtksn
This is not true. Correlation means that there is some kind of relation
between two variables (for example, in how they change over time), but it says
nothing about the mechanics of that relation, which is the causation.

Let me investigate this example:

The observation: students who watch less TV have better grades.

It's not like pirates vs. global warming; it actually makes sense at first.
If you are the minister of education whose goal is to raise grades among
students, and you have no idea about the "correlation does not imply
causation" principle, you may actually suggest banning TVs to improve
education. You can even draw a fancy graph to support your idea.

However, after further investigation you may find out that this would not work
because it's not the TV that is causing the lower grades. Maybe the better
students tend to watch less TV because they prefer to read books instead of
watching TV.

This time it may seem to suggest the opposite: increasing students' success
in school reduces the time spent watching TV. However, after even further
investigation you may find out that your better students are from poor
families (maybe the schools in your country accept the best students without
tuition and are not selective about those who can pay) that just don't have a
TV set, or maybe these better students also need to work part-time to support
their families, so they just don't have time for TV.

Thus, you can't really change one value by artificially manipulating the
other one. It turns out the correlation did not mean causation.

~~~
TeMPOraL
It's not like that. Correlation does not imply causation only in the strict
meaning of "imply" (=>) in mathematical logic. But correlation sure as hell
is _evidence_ for causal relationships. Just yesterday I found a blog post by
the author of the "Think Bayes" book explaining that nicely.

[http://allendowney.blogspot.com/2014/02/correlation-is-evidence-of-causation.html](http://allendowney.blogspot.com/2014/02/correlation-is-evidence-of-causation.html)

Moreover, with good enough correlation data, you can even determine _the
direction of the causal relationship_ if you're willing to do a little bit of
maths:

[http://lesswrong.com/lw/ev3/causal_diagrams_and_causal_model...](http://lesswrong.com/lw/ev3/causal_diagrams_and_causal_models/)

Also, [http://xkcd.com/552/](http://xkcd.com/552/) ;).
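
That argument can be sketched as a one-line Bayes update. The probabilities
below are made-up illustration values, not numbers from the linked post: as
long as a causal link makes a correlation more likely than chance alone does,
observing the correlation raises the posterior probability of causation.

    
    
        # Toy Bayes update: correlation as evidence for causation.
        # All numbers are invented for illustration.
        prior_causal = 0.1          # prior belief that X causes Y
        p_corr_given_causal = 0.9   # causal links usually produce correlation
        p_corr_given_not = 0.2      # correlation can arise without causation
    
        p_corr = (p_corr_given_causal * prior_causal
                  + p_corr_given_not * (1 - prior_causal))
        posterior_causal = p_corr_given_causal * prior_causal / p_corr
    
        print(round(posterior_causal, 3))  # 0.333, up from the 0.1 prior
    

Observing the correlation multiplied the prior odds by the likelihood ratio
0.9 / 0.2 = 4.5, yet the posterior is still far from certainty; that's the
"evidence, not proof" distinction.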

~~~
mrtksn
I would not go so far as to call correlation evidence. It can be taken as a
clue for further investigation, because it also does not rule out causation.

~~~
TeMPOraL
It is evidence _for_ causation, in the Bayesian statistics meaning of
"evidence". The first article I linked explains the process in detail.

~~~
mrtksn
Thank you for the links.

------
Homunculiheaded
I hear more and more chanting of "correlation does not equal causation!",
which is great if your goal is to form a causal model of the world, but there
are plenty of insights you can arrive at from correlation alone.

For starters, in the world of machine learning and predictive analytics, it
doesn't really matter whether X causes Y so long as X is a consistently good
predictor of Y. Maybe power lines over someone's home are not the cause of
cancer, but if their presence can be used to predict cancer rates, that's a
good thing.

More important, imho, is the idea of latent or hidden variables. Two things
that are clearly correlated but seem not to have a causal relationship (such
as transistors and longevity) may share a latent variable that is either
non-quantifiable or completely unobservable. In either case, measuring outputs
that share a common latent variable, and thus correlate with each other, might
be the only way to attempt to measure hidden, non-quantifiable causes.

For example, employee happiness might be the cause of employee retention.
However, you can't currently measure or observe 'happiness', but there may be
many seemingly unrelated employee activities that correlate with retention
because they are also driven by this same latent variable. Studying them is
the only way to get a quantifiable understanding of this latent cause.

tl;dr sometimes correlation is just as important as causation.

------
robert_tweed
Ironically, this graph does an excellent job of showing a correlation that
_really exists_ between the two data sets, albeit a non-linear one.

The only misleading thing about this graph is the title, which states a causal
link with no evidence of one.
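
A monotone-but-non-linear relationship is exactly what a rank (Spearman)
correlation picks up. A small sketch with made-up stand-in series, roughly
exponential growth vs. a steady linear rise:

    
    
        # Spearman correlation = Pearson correlation of the ranks.
        # The series below are illustrative stand-ins, not the page's data.
        years = list(range(1971, 2012, 5))
        transistors = [2_300 * 2 ** ((y - 1971) / 2) for y in years]  # Moore's-law-ish
        life_exp = [70.5 + 0.25 * (y - 1971) for y in years]          # linear rise
    
        def ranks(xs):
            order = sorted(range(len(xs)), key=lambda i: xs[i])
            r = [0] * len(xs)
            for rank, i in enumerate(order):
                r[i] = rank
            return r
    
        def pearson(xs, ys):
            n = len(xs)
            mx, my = sum(xs) / n, sum(ys) / n
            cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            sx = sum((x - mx) ** 2 for x in xs) ** 0.5
            sy = sum((y - my) ** 2 for y in ys) ** 0.5
            return cov / (sx * sy)
    
        spearman = pearson(ranks(transistors), ranks(life_exp))
        print(round(spearman, 6))  # 1.0: the relationship is perfectly monotone
    

Both series rise strictly with time, so their ranks coincide and the rank
correlation is perfect even though the raw relationship is far from linear.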

~~~
mjn
Yeah, I think this is completely off-base in blaming the axes. The axes are
perfectly fine in showing the actual correlation that's present here (almost
certainly due to a third common causal factor, of course).

There's no magically "unbiased" way of choosing axes. There seems to be
a popular view recently that you should always start your axes at zero, I
assume as a backlash to some graphs magnifying very small differences by
choosing zoomed-in axis values that visually exaggerate variation. That isn't
really an absolute truth either, since interesting data regions are not always
near zero. For example, if you graph temperature variation in different cities
starting at 0 K, you can make it look like essentially all habitable cities
have around the same temperature, somewhere in the range of 250-300 K give or
take. Of course 250 K versus 300 K is a huge difference to human perception of
temperature, while the entire range 0-200 K is more or less irrelevant when
discussing weather, so starting your axis at 0 would be a poor choice. In this
case that would be the biased choice, intended to visually minimize actually
important variation by choosing an unreasonably low starting point for the Y
axis.
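
The kelvin example is easy to put in numbers: the fraction of the plot's
height the interesting data actually occupies depends entirely on where the
axis starts (240 K below is just an arbitrary nearby floor for contrast).

    
    
        # How much of the plot height a 250-300 K data range occupies,
        # depending on the y-axis starting point.
        span = 300 - 250
    
        frac_from_zero = span / (300 - 0)     # axis starting at 0 K
        frac_from_240 = span / (300 - 240)    # axis starting at 240 K
    
        print(round(frac_from_zero, 2))  # 0.17: the data is squashed into a sliver
        print(round(frac_from_240, 2))   # 0.83: the variation is actually visible
    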

~~~
scotty79
> (almost certainly due to a third common causal factor, of course).

...or, absolutely certainly, associated so remotely that the correlation is
purely accidental and found only by carefully cherry-picking data sources,
ranges, functions to massage the data (log with the right base), and axis
ranges. To sum up: not meaningful.

You could find a similar correlation between ocean temperatures and lottery
numbers, but you'd have to precisely adjust so many inputs that the correlation
would be not so much found as constructed.

~~~
Someone
_" log with the right base"_ doesn't make any difference. That's a linear
mapping. For instance:

    
    
        2log x = log x / log 2 = ln x / ln 2
    

(With 2log denoting the logarithm in base 2, and log and ln logarithms in
whatever bases you pick; think of 10 and e.)

See, for example,
[http://www.purplemath.com/modules/logrules5.htm](http://www.purplemath.com/modules/logrules5.htm),
or derive it from the axioms.
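
This is easy to check numerically: changing the base only divides every value
by the same constant, which a log-scaled axis absorbs into its tick labels.

    
    
        import math
    
        # Change of log base is multiplication by a constant
        # (log_2 x = ln x / ln 2), so it cannot change the shape of a
        # curve, only the axis labels.
        for x in (1, 10, 2_300, 2_600_000_000):
            assert math.isclose(math.log2(x), math.log(x) / math.log(2))
            assert math.isclose(math.log2(x), math.log10(x) / math.log10(2))
    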

~~~
scotty79
I think this ln 2 factor matters when you are drawing a graph. But I guess
that's included in what I called the axis range.

------
zealon
Maybe a bit off-topic here, but I think there is a cause-effect relationship
between the number of transistors and life expectancy. More transistors mean
more computing power. More computing power leads to better/faster information
processing, including of medical information. This leads to faster patient
diagnostics, better treatments (pharmaceutical innovations), earlier and more
precise health warnings (lab tests and medical equipment), and so on.

Faster and better information processing leads also to higher food quality
(food processing plants), higher life quality (environmental temperature and
humidity control), etc.

Germany is a highly industrialized country, so information-processing power
has a big social impact.

~~~
xtacy
Maybe, but there is another, perhaps simpler hypothesis: Both the transistor
count and life expectancy increase with time.

One way to verify one or the other is to look at a linear hypothesis:

    
    
        States: LifeExp, Transistors, Time
        Structural Equation Model [1]:
    

Model 1:

    
    
        LifeExp ~ Time + Transistors + noise_exp_1
        Transistors ~ Time + noise_trans_1
    

(The 2nd equation means that Transistors(t) = a * t + noise, and you try to
estimate "a" from the data.)

vs Model 2:

    
    
        LifeExp ~ Time + noise_exp_2
        Transistors ~ Time + noise_trans_2
    

If model 1 has more predictive power for LifeExp than model 2 (i.e.,
noise_exp_1 is lower than noise_exp_2), then according to _this_ model,
Transistors causally affects LifeExp. However, this model is way too
simplistic and doesn't incorporate other causal paths, such as the one you
describe (Transistors -> Computing -> DiagnosticRate -> ...).

[1]
[http://en.wikipedia.org/wiki/Structural_equation_modeling](http://en.wikipedia.org/wiki/Structural_equation_modeling)
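
A least-squares sketch of the comparison, on synthetic data where time drives
both series (all numbers and names below are illustrative, not the actual
data):

    
    
        import numpy as np
    
        # Synthetic stand-ins: time drives both series.
        rng = np.random.default_rng(0)
        t = np.arange(1971.0, 2012.0)
        transistors = np.exp(0.35 * (t - 1971)) * (1 + 0.05 * rng.standard_normal(t.size))
        life_exp = 70.5 + 0.25 * (t - 1971) + 0.3 * rng.standard_normal(t.size)
    
        def residual_var(X, y):
            # In-sample residual variance of an ordinary least-squares fit.
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            return np.var(y - X @ beta)
    
        ones = np.ones_like(t)
        m1 = residual_var(np.column_stack([ones, t, transistors]), life_exp)  # model 1
        m2 = residual_var(np.column_stack([ones, t]), life_exp)               # model 2
        print(m1, m2)
    

One caveat: in-sample, the larger model's residual variance can never be
worse, so a real comparison would score the models on held-out data or with a
complexity penalty (AIC/BIC) rather than raw residual variance.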

~~~
zealon
Math is not my field of expertise, but I think I get your point. Mine was
based on a social and economic perspective. I think these other causal paths
are fundamental, because they are empirically verified (computing power vs.
health and life quality).

------
NPMaxwell
For causation, you need A/B testing -- real A/B testing (randomized controlled
experiments).

Not everyone understands the need for A/B testing and not everyone understands
how to do A/B testing.

I recently worked on modeling to forecast revenues for a large multinational.
One of the variables that was essential for an accurate model was whether
prior data came from before or after the company hired their current A/B
testing guru. With the new guru, revenues rose every week. With the previous
"guru", changes by Engineering had no impact on revenues.

------
jkrems
Everyone's talking about "correlation vs. causation" but I'm pretty sure the
page explicitly states that that's not what this is about:

> if its axes are chosen unwisely

The graph is misleading because it makes it look like the transistor count and
the life expectancy grow at the exact same rate. It pretends there's a linear
correlation. And that's the misleading part.
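
The trick is mechanical: each y-axis is scaled independently, and scaling a
series to fill the plot is an affine map, so any two rising curves can be
drawn right on top of each other. A sketch with rough stand-in values (not the
page's actual data):

    
    
        # Independently chosen axis limits amount to an affine rescaling of
        # each series, so any two monotone series meet at the plot's corners.
        def to_unit(series):
            lo, hi = min(series), max(series)
            return [(v - lo) / (hi - lo) for v in series]
    
        transistors = [2_300, 29_000, 275_000, 1_180_000, 42_000_000, 2_600_000_000]
        life_exp = [70.6, 71.5, 73.8, 75.2, 78.3, 80.4]
    
        print(to_unit(transistors)[0], to_unit(transistors)[-1])  # 0.0 1.0
        print(to_unit(life_exp)[0], to_unit(life_exp)[-1])        # 0.0 1.0
    

Both curves now start and end at the same height no matter how differently
they grow in between, which is why the growth rates look identical.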

------
IvyMike
Those who have not read "How to Lie With Statistics" are doomed to reinvent
it.

[https://archive.org/details/HowToLieWithStatistics](https://archive.org/details/HowToLieWithStatistics)

~~~
DanBC
_Beyond Numeracy_ (John Allen Paulos) is another great book.

[http://www.amazon.com/gp/aw/d/067973807X](http://www.amazon.com/gp/aw/d/067973807X)

------
squigs25
This is only one misleading graph, which, to be honest, I find misleading.

------
breunigs
I feel that many missed my point: it’s not to say that correlation never
implies causation, that there’s no common source, or that latent factors are
fiction.

In my experience many people believe something is true just because of “math”
or “data”. So this is basically a variation of the joke that 73.37% of people
put more trust in statistics if the value isn’t rounded. Since you all
critically discussed the topic, you are far beyond what this little project
can teach you.

If you have suggestions on how to express this more clearly, please let me
know.

------
ehsanu1
[http://www.correlated.org/](http://www.correlated.org/)

------
yoha
I think this is a very good idea. Lots of people still mix up correlation and
causation. Either this tool will make them understand why this is wrong, or
they will enjoy finding lots of weird relations between data sources.

