
Existence does not imply correlation - spindritf
http://www.chrisstucchio.com/blog/2014/existence_does_not_imply_correlation.html
======
csbrooks
I'm going to point something out here that's not really central to the
argument the article is making.

In programming, in my experience, the nastiest bugs to fix are actually two or
three separate bugs interacting in weird ways. If you find a bug like he did,
and it's easy to fix and unlikely to break something else, but you can't
reason how it could be causing the issue you're seeing, FIX IT ANYWAY. It's
quite possible it's interacting in some subtle way with another bug, and
fixing it may make the other issue start behaving more consistently, and
easier to fix.

It may feel wrong, because you feel like you should set that theoretically
unrelated bugfix aside until you can work out the bug you're trying to focus
on. In my experience, that's often not the right approach.

~~~
sillysaurus3
_If you find a bug like he did, and it's easy to fix and unlikely to break
something else, but you can't reason how it could be causing the issue you're
seeing, FIX IT ANYWAY. It's quite possible it's interacting in some subtle way
with another bug_

Do people really reason like this? I mean, programming is absolutely clear-
cut. It's the most clear-cut aspect of life, in some ways. It's purely a logic
problem. If you see something that can't possibly be interacting with your
problem, then spending additional mental cycles on it is always a waste of
your time for solving that problem.

Now, it may be a good idea to fix that new problem. That's perfectly true. But
if it can't possibly affect your current problem, then fixing the new problem
won't do a darn thing to help you fix your current problem. That sounds like
tautology, but your comment is saying the opposite.

You can prove to yourself that a new problem can't possibly be interacting
with your current problem. If the problems exist in two separate modules, you
look at the connections between the modules and see whether any data can
possibly flow from one to the other. If there is no state that flows between
them, then the problems are necessarily independent.

I think what you're saying is that "most codebases suck, because they have a
lot of interdependencies and are hard to analyze." That's probably true. But
resorting to voodoo thinking isn't going to help.

~~~
kordless
> can't possibly be interacting with your current problem

The only thing not logical with programming is a programmer's ability to fully
simulate the circumstances by which all bugs may occur. That ability will vary
greatly from programmer to programmer.

So yes, people think like this because it's easier to fix the bug than it is
trying to figure out correlation.

~~~
sillysaurus3
Maybe my previous comment was unclear. If so, sorry about that.

The point was that, in programming, there is never any "figure out
correlation." You can rule out whether a bug is being caused by a given line
of code by examining the flow of data between what you're seeing on screen and
the lines of code responsible for what is shown on that screen. A bug is never
"correlated" with any given line of code. The line of code is either logically
related to the bug, or not related at all.

I'd be interested to hear more about how programming could be made into a
correlation game, though. It sounds like a new mental tool that I've never
learned, which means I should learn it.

~~~
yummyfajitas
To make programming into a correlation game, build a distributed system and
work on performance. You suddenly have flow of data, together with lots of
external factors that are difficult to measure (e.g., a spike in latency in
US-East but not US-West for 0.5% of packets, or 2% of your shared instances
having a noisy neighbor).

In such contexts, you usually also have a LOT of code.

Correlation analysis becomes very important in figuring out which piece of
code to even look at. Bugs do become "correlated" with a line of code, because
bugs take the form "noisy neighbor + blocking disk read (code) + high latency
to master DB => slow response".
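
A toy Python sketch of that kind of bug, in case it helps. All the names and
probabilities here are made up for illustration, and the assumption that a
request is slow only when all three factors coincide is a simplification of
the "=>" above, not a measurement from any real system:

```python
import random

def simulate_requests(n=10000, seed=0):
    """Generate synthetic request records (hypothetical data, not real logs)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        noisy_neighbor = rng.random() < 0.2   # external factor, hard to measure
        blocking_read = rng.random() < 0.5    # the code path under suspicion
        high_latency = rng.random() < 0.1     # external factor, hard to measure
        # Assumed model: slow only when all three coincide.
        slow = noisy_neighbor and blocking_read and high_latency
        rows.append((noisy_neighbor, blocking_read, high_latency, slow))
    return rows

def slow_rate(rows, factor_index, present):
    """Fraction of slow requests among those where the factor is present/absent."""
    subset = [r for r in rows if r[factor_index] == present]
    return sum(1 for r in subset if r[3]) / len(subset)

rows = simulate_requests()
rate_with_blocking = slow_rate(rows, 1, True)
rate_without_blocking = slow_rate(rows, 1, False)
```

Most requests that hit the blocking read are still fast, so the code path
doesn't deterministically produce the bug. It only shows up as a difference
in slow-request rates between the two groups, which is what tells you where
to look.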

~~~
sillysaurus3
Thank you. That's a very interesting way to think about it, and I hadn't
considered that before.

I apologize for making so many comments in this submission. I feel pretty
terrible about it, because the number of comments is higher than the number
of upvotes, which has pushed your submission off of the front page. I didn't
realize it was happening until too late. But more than that, in retrospect, I
should have behaved differently altogether, which would've resulted in fewer
comments. Sorry.

------
cynicalkane
This is a good post to begin with but then the author shoots himself in the
foot with claims like

"A couple of weeks ago I pointed out Sam Altman making bogus arguments about
sexism - arguing by juxtaposition that the existence of sexism reduces the
number of women in technology."

I think the fact that sexism reduces the desire of women to participate in
something is established. It doesn't make sense to dismiss this as an
'argument by juxtaposition' since scientific thought begins with drawing new
guesses from established patterns, particularly when the data are found
wanting. So now, reading between the lines, we have the dual inference that

1) the author doesn't understand his own argument, and

2) the author has a reactionary axe to grind regarding gender issues in
technology.

Whether or not these inferences are true (I don't think they are), a large
audience is now lost since most people are socially aware enough to understand
that 'sexism doesn't repel women' is a poor null hypothesis, even if they
can't articulate these feelings in the language of science.

~~~
yummyfajitas
Author here.

Altman provided no argument whatsoever in his post that the existence of a
sexist act is the cause of women's underrepresentation in tech. He was fairly
careful - he didn't even explicitly state causality, he merely implied it.
Maybe an argument is out there somewhere, but not in Altman's post.

_I think the fact that sexism reduces the desire of women to participate in
something is established._

Directionally, maybe, but I've seen little evidence the magnitude is
significant. In spite of the built in sexism, Catholicism and Islam both have
far more women than Atheism (or technology). I don't see any reason to believe
medicine or law are less sexist than tech, or were less sexist back when women
started flooding into those fields.

If you have evidence of correlation, link to it - Sam Altman didn't.

I do have a "reactionary axe" to grind against bad reasoning and this
particular topic is a huge source of bad reasoning.

~~~
cynicalkane
> In spite of the built in sexism, Catholicism and Islam both have far more
> women than Atheism

Let's break down what's bad about this sentence. It contains:

* a complete misunderstanding of the nature of the sexism we're talking about. A woman who walks into a church will not be constantly devalued for her gender and/or treated like a hot piece instead of a colleague, while simultaneously having her feelings devalued with neckbeardy thought-terminating "rationalism". In technology and business, sexism isn't an abstract teaching most people ignore, it's something in the air.

* At least two of the statistical fallacies you just decried.

* The 'appeal to worse problems' fallacy, and

* the hidden assumption that women's participation in religion is comparable to their choice of careers, as though they 'join' the Catholic Church or Islam the same way one might join a technology company.

~~~
JoeAltmaier
Yeah 'something in the air'. What does that mean? That some people (many
women) don't like the culture of technology. That's not sexism; that's just a
fact. E.g. Some people (many women) don't like hunting - does that make it
sexist?

~~~
DanBC
You risk losing money and market share if you think like that.

If you make smaller guns and if you make guns in a range of colours and you
make nice clothing you then open up the market beyond its current range. The
people currently hunting may dislike a powder blue rifle and may scoff at
comfy footwear, but you don't care because you're not selling to those people,
you're selling to the people who are not traditionally hunters.

~~~
JoeAltmaier
Excellent observation. But still not sexism. It's gotten popular to use blaming
language, paint everything with the sexism brush, and generally polarize
discussions about workplaces. I don't think that helps get us anywhere. If
only more folks thought like you.

------
tptacek
Here's why a reasonable reader might come to the conclusion that this is not
in fact a post about correlation, but rather a device for making a drive-by
dig at Sam Altman's post:

* The Redis example is forced. Here's why: most engineers faced with that problem would, sooner rather than later, collect stats about the nodes and immediately observe that the root cause was an overloaded server. As the post admits, the cause was immediately obvious. No statistical reasoning was required, and, in fact, this reader doubts that statistics played too much of a role in Stucchio's own diagnosis. (Whether it actually did or didn't isn't material to my point.)

* The insight being provided about correlation is extremely simplistic. Essentially, it communicates in two graphs the definition of correlation. That might make sense if the post was written in a fashion that tried to communicate the fundamental idea of correlation to someone with literally no acquaintance with the term. But it's not; despite having subheds about what "is" and "isn't" correlation, the writing style more clearly signifies that the author is trying to debunk a mistaken idea about what the term is, implying that its readers are already somewhat acquainted with the idea. Or, put more simply: the fundamental idea the post tries to communicate could have been conveyed in two simple sentences. (Glass houses about my own writing style: duly noted).

* The transition from technical discussion to Sam Altman is abrupt. More importantly, the Altman subject has much more valence than the point about diagnosing Redis failures. Thought experiment: chop everything in this article _after_ Redis out, and imagine it was posted not by Chris but by some random account. Would anyone pay attention to this post?

Unfortunately, the post doesn't have much insight to offer about Altman's
post. It's framed Altman in a manner incompatible with Altman's post ---
suggesting, contrary to reality, that Altman was trying to present a complete
empirical argument about gender disparities in technology. It then tries to
beat Altman over the head with that framing. Altman emerges unscathed, because
the author is swinging at his shadow, not him.

------
digibo
Now I hope for a follow-up post that explains what the east coast problem was.

~~~
yummyfajitas
Network trouble inside US-East combined with a call that involved a lot of
round trips.

I was asynchronously sending maybe 50 messages out and waiting for replies. In
US-East, the standard deviation of network latency went way up, meaning that
while most of those messages returned in 2-3ms, a few took up to 50. It turns
out that I didn't need all 50 responses anyway, so I just set a timeout of
10ms and did the best I could with whatever messages did return.
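
For anyone curious, the "take whatever came back before the deadline" approach
could look roughly like this in Python. This is only a minimal sketch, not the
actual code: `fake_request`, the function names, and the choice of which
requests are slow are all hypothetical, and the sleeps stand in for real
network round trips.

```python
import asyncio
from typing import List

async def fake_request(i: int, fast_delay: float, slow_delay: float) -> int:
    # Hypothetical stand-in for one round trip: every tenth request is slow.
    await asyncio.sleep(slow_delay if i % 10 == 0 else fast_delay)
    return i

async def gather_with_deadline(n: int = 50, timeout: float = 0.010,
                               fast_delay: float = 0.002,
                               slow_delay: float = 0.050) -> List[int]:
    # Fire off all requests at once.
    tasks = [asyncio.create_task(fake_request(i, fast_delay, slow_delay))
             for i in range(n)]
    # Wait until everything finishes or the deadline passes, whichever is first.
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    # Give up on the stragglers and let their cancellation settle.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    # Proceed with whatever replies arrived in time.
    return sorted(task.result() for task in done)

def collect_before_deadline(**kwargs) -> List[int]:
    return asyncio.run(gather_with_deadline(**kwargs))
```

The defaults mirror the numbers above (50 messages, 10ms timeout). The
trade-off is that the caller has to tolerate a partial result, which only
works if not all responses are needed.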

------
thanatropism
I'm waiting for a paper in a peer-reviewed journal - even a journal in the
humanities by a feminist cultural critic will do - that quotes statistics
(even flippantly, en passant, without analysis) to the point that sexism
scares people away from professions.

It sounds true. It kind of feels true to me, even. But there's too many feels
being passed for fact lately.

------
tel
I think that all of these arguments are pivoted off the same idea. Roughly,
but not exactly(*):

    
    
        (Co-)existence is necessary for correlation
        Correlation is necessary for causality
    

Or maybe (E > Co > Ca). What this means is two-fold

    
    
        1. It's valuable and common to note E and Co
           suggestively of Co and Ca respectively. If
           I *notice* E then I begin to hypothesize about
           Co because I now have evidence for its 
           possibility if not yet evidence for Co directly
    
        2. Being lazy/imprecise in speech or reasoning
           might cause someone to "skip a step" and claim
           something more powerful than what they actually
           have.
    

With regard to (1), I think it's completely valuable to make these
observations. At their heart, they're nothing more than broadcasting and
contextualizing _data_ and that's an important function. So (2) is where all
of the danger lies--- _properly_ contextualizing data and what it actually
gives credence to.

So how can we make sure that (2) occurs casually without precluding (1)?

I think that articles like this one and all around the whole "correlation is
not causation" tagline are great. They help to ensure people remember the size
of the space between each step in E > Co > Ca.

I think another powerful technique would be to demonstrate more arguments that
make the jumps and highlight the properties which allow that to happen. Chris
did a good job here demonstrating correlation—what it looks like when it
exists or fails to exist—but much, much more can be said about the E > Co
jump.

The Co > Ca jump is far more complex. Worse, it's often obscured through
opaque words like "randomized, controlled study" or "scientific method" which
are actually quite far away from the mechanisms which allow that jump to be
made—they're more like implementation details obscuring a great API.
Demonstrating clear arguments here (and not "rain + wet grass" oversimplified
ones) could be a great boon to public reasoning.

What else can be done to reduce (2) without precluding (1)?

(*): Really (linear) correlation is just one kind of relationship-of-interest
between things. This is often pointed out when people talk about correlation
in terms like "circular relationships have 0 linear correlation, oh no!"
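
The circular case is easy to check numerically. A minimal Python sketch,
assuming evenly spaced points on a unit circle (the helper names are made up
for illustration):

```python
import math

def pearson(xs, ys):
    """Plain Pearson (linear) correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def circle_points(n=360):
    """n evenly spaced points on the unit circle."""
    angles = [2 * math.pi * k / n for k in range(n)]
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]

xs, ys = circle_points()
r_circle = pearson(xs, ys)                       # essentially zero
r_line = pearson(xs, [2 * x + 1 for x in xs])    # essentially one
```

Every point satisfies x^2 + y^2 = 1, so y is tightly constrained by x, yet the
linear correlation comes out at zero up to floating-point error.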

What I'd like to write instead might be E > M > C where M becomes "model
building". Choosing to highlight linear correlation means that you're choosing
a linear model. That might be perfect, or it might be wrong, but you still
must choose it before you can structure your observations into evidence of
some kind of causality. Generally, at this point, you would also want to begin
developing covariates all in preparation for the C step.

~~~
conistonwater
Bear in mind that causality does not imply correlation. (I know I'm arguing
with just one sentence in your post, but I think it needs to be pointed out.)

Sometimes you are observing controlled variables, in which case a
counterexample like Friedman's thermostat will disprove this notion.
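
A toy simulation of the thermostat idea in Python, assuming an idealized
controller that perfectly cancels the outside temperature. The model, the
numbers, and the function names are all invented for illustration:

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson (linear) correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def simulate_thermostat(n=10000, target=20.0, seed=0):
    """Return (outside, heating, inside) series under the assumed toy model."""
    rng = random.Random(seed)
    outside, heating, inside = [], [], []
    for _ in range(n):
        t_out = rng.uniform(-10.0, 15.0)
        heat = target - t_out            # idealized controller (assumption)
        noise = rng.gauss(0.0, 0.5)      # drafts, open doors, etc.
        t_in = t_out + heat + noise      # heating raises indoor temp one for one
        outside.append(t_out)
        heating.append(heat)
        inside.append(t_in)
    return outside, heating, inside

outside, heating, inside = simulate_thermostat()
r_heating_inside = pearson(heating, inside)    # close to zero
r_outside_heating = pearson(outside, heating)  # close to -1
```

In this model heating directly causes the indoor temperature, yet the two are
nearly uncorrelated, because the controller spends all of the heating on
cancelling the outside temperature.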

~~~
tel
Yeah, I would like to leave "correlation" out of it entirely. The real steps
are more like "model building" and "evidence collection", but those are both
more complex and not common vernacular.

------
rjaco31
'localhost:8000' link on the page? Classy

------
andylei
sexism is not one thing. it doesn't simply exist or not exist. it varies (in
quantity, quality, nature, and scope) from company to company, community to
community, and relationship to relationship.

"sexism in technology" is not an entity, it is a description of an aggregate.
saying that the "sample size" is 1 is absurd.

