Programmers Need To Learn Statistics Or I Will Kill Them All (2005) (zedshaw.com)
174 points by vaksel on May 26, 2009 | 116 comments

The article does seem familiar, and I think I've even seen it on HN before.


Yep. (There seems to be a slight change in the base URL for the submitted article, but this has been discussed here before.)

Still this is well worth discussing again. It's amazing how much most college-educated people think they know about statistics that they really don't. Two of my favorite quick overview articles about statistics, both by Ph.D. professors of statistics, are "Advice to Mathematics Teachers on Evaluating Introductory Statistics Textbooks"


(the link is dead because of site maintenance now, but should be fixed soon)


"The Introductory Statistics Course: A Ptolemaic Curriculum?"


Both are thought-provoking articles about what usually isn't taught to undergraduates about statistics.

Wayback has a copy of MAAFIXED.PDF (I'm not sure whether it's appropriate / a good idea to include the link).

Reminds me of the film clip Alan Kay showed at one of his talks. It was taken at a Harvard graduation, and it shows many, many Harvard graduates, students, and faculty saying that the Earth was warmer in Summer because the orbit takes it closer to the sun.


That's part of a film called "A Private Universe" (http://www.learner.org/resources/series28.html) about how students can learn stuff in class and then promptly revert to their nonscientific understanding unless they are forced to really confront their prior knowledge in light of what they are being taught. It's a fascinating film.

The importance of explicitly confronting prior knowledge and integrating it into learning is one of the basic principles of learning that came out of the "How People Learn" studies (http://www.nap.edu/openbook.php?record_id=6160).

Thanks for the link. I was wondering where that clip was from. Reminds me of Feynman's bit from his time teaching in Brazil. The students there could stand up in class and recite the textbook description of Polarized Light, but when asked to identify it in the real world in the view out the window, couldn't. (The definition mentioned the example of reflected light, and there was sunlight playing off of water in the bay.)

It's caused by tilt/angle of Earth, correct?

> It's caused by tilt/angle of Earth, correct?

Yes. My second-grade teacher made sure her class knew this. A year or two later I learned that the earth is closer to the sun (at perihelion) when the SOUTHERN Hemisphere is having its summer, and the Northern Hemisphere is having its winter.

which is why the northern hemisphere has a slightly milder climate.

I just checked my atlas - I remembered correctly - the northern hemisphere has significantly larger temperature variation than the southern. It's probably the difference in land versus water areas more than offsetting the tiny effects of solar distance.

Huh, is this actually true? I had been under the impression that the orbital difference was insignificant from a climate standpoint.

there are actually several things at play, eccentricity, precession, and obliquity. the net result is that the north gets milder winters and cooler summers.

> the northern hemisphere has significantly larger temperature variation than the southern

> the net result is that the north gets milder winters and cooler summers

One of these statements appears to disagree with the other. If I remember correctly what I've read, the proportion of land rather than ocean in each hemisphere plays a major role in climate, as mentioned by the reply saying that the Northern Hemisphere has more variation in temperature.

yes, since we're on a tilt the two hemispheres get their summer and winter during opposite seasons when their respective hemisphere is 'facing' the sun while the equator pretty much always gets the same amount of sun

One of the things that happens in the film clip, is that a professorial gentleman gives the "orbit is closer" answer, then is asked what season it is in the Southern hemisphere when it's Summer in the Northern one.

600 days ago & HN was still worried about turning into digg. Plus ça change...

Oh, and you wonder why I say, “he”? I never have this problem with female programmers...I think women are better programmers because they have less ego and are typically more interested in the gear rather than the pissing contest.

When I was hanging out in Homer Alaska, squatting in the tent city on the beach, I heard that on the halibut fishing tours, the women would catch more and bigger fish. (And halibut are big fish. A 25 pound halibut is puny. You can catch 300 pound halibut, and it's not a super-rare event.) This is because the husbands would bring their fishing egos with them from the lower 48 states and not listen to the deckhands. But the wives, with no preconceived ideas, would listen carefully and do things the right way to catch halibut. (As opposed to trout.)

I wonder if this is a "feature" of women or all minorities? For example, do male nurses show the same egolessness as female programmers?

As a minority Irish musician, I find that this seems to have a dampening effect on my ego. (Not 100% sufficient to keep me from making a fool of myself occasionally, though.) It's also caused me to be very, very paranoid about whether or not I "get it" and as a result I've had bouts of beneficial studiousness. And yes, I have to do better to win respect, and sometimes I am judged by my skin color, not my actual playing.

I thought Irish people were renowned for their musicality. Eg, singing is a common social activity in Ireland (to the point where it's socially unacceptable not to sing), and Ireland has produced far more commercially successful musicians than their population would suggest. What country are you playing in?

I read his sentence as that he's not Irish, but plays Irish music.

Correct. And for some reason, I am often the only one who ever gets asked about their origins, even though some of my fellow musicians are of Cajun, Hungarian, German, Mexican, and Swedish extraction.

For many, Irish music is tied up in the context of some sort of romanticized Irish Nationalism, and is only appreciated as some sort of ethnic tchotchke. Their imagination fails at the notion that I'm enjoying it in a purely musical context. I associate those people with the ones in the audience who can't clap on the beat.

I've heard more than one person say no one from North America can play proper Irish music. Always struck me as completely insane -- if nothing else, there were enough emigrant Irish musicians in NYC and Chicago to form their own styles of the music!

Liz Carroll has that story about sitting in a session in Ireland, playing tunes for hours, when someone says "Let's play some of your music!" and launches into Turkey in the Straw....

I might be misunderstanding your comment, but women are not a minority.


It's bad usage, but in sociological terms, women are a minority because they aren't the dominant subgroup.

> in sociological terms, women are a minority because they aren't the dominant subgroup.

This is a useful sentence to read, because it helps remind me that "sociological terms" have little or nothing to do with honest intellectual framing of topics.

I'm going with the idea that you generally misunderstood, rather than the idea that you are trolling.

Consider a monarchy. The king is the majority. It's hard to imagine a world where everyone you see doesn't have the same power and opportunity that we enjoy. But you can see that cultures have existed where one guy was more important than everyone else combined.

You and I know, a pointer isn't a dog. However, people are often irresponsible, and use jargon in common conversation. Take a moment, and consider what is "honest intellectual framing" and what is a "sociological term". I think that while you may disagree, the assertion that women have less power (aren't the dominant subgroup) is a fairly honest analysis.

I haven't commented here in months, and don't plan on replying, but like xkcd says... someone on the internet is wrong. Take a moment to consider the possibility your parent poster isn't a fucking moron.

I had largely the same reaction you did when I saw "minority" used that way.

You're right. "Minority" is a really bad word for what biohacker42 was trying to say, because many "minorities" (in terms of social status) are not numeric minorities. (Blacks in apartheid-era South Africa. Women throughout history.) As jfoutz mentioned, we both know a pointer is not a dog.

It's convenient that you misquoted me, too, because doing so leaves out the part where I already conceded that it's bad usage. Which leaves me at a loss as to what you're trying to add to the discussion--did you say something more than what I've already implied (and exhaustively reiterated here)?

female programmers most certainly are though.

And Linda Greenlaw is probably the best fisherman anywhere.

> I think women are better programmers because they have less ego and are typically more interested in the gear rather than the pissing contest.

People need to stop writing shit like this. If it's not okay to say that men are better programmers simply because of their gender (it's not), then it's not okay to say it the other way around either.

> If it's not okay to say that men are better programmers simply because of their gender (it's not), then it's not okay to say it the other way around either.

God forbid that someone make an argument and try to support it with data, when one person has already decided for all of us what the correct conclusion is.

WT* ever happened to debate, intellectual discourse, and the marketplace of ideas?

This is one of the things that I find most disgusting about political correctness: that it tries to just wall off huge swaths of POTENTIAL CONCLUSIONS based on an argument that boils down to a misconstrued sense of manners (at best) or political preferences (at worst).


Want to say that women and men have (a) different average heights; (b) different standard deviations in intelligence; (c) different hormone levels; (d) massively different thicknesses in their corpora callosa, and therefore one or the other might, on average, make better programmers/accountants/engineers? THE DEBATE IS CLOSED. IT IS UNACCEPTABLE TO SPECULATE ON THIS TOPIC.

I call bullshit on that.

Intellectually honest people respond to facts and arguments with OTHER facts and arguments.

Intellectually dishonest people try to shut down debates using social control.

WilliamLP writes

"People need to stop writing shit like this"

That's the phrase of a bully, and/or a censor.

I think you have some interesting points. I'm going to play Devil's advocate:

An assumption that you make there is that all forms of social control used for discouraging debates/speculation on certain topics are inherently bad or stem from dishonesty. It could be that knowing in advance the emotional, legal, political, or otherwise time-wasting repercussions that a certain type of debate causes justifies avoiding the discussion altogether. None of us go after truth in a completely unbiased manner with no agendas whatsoever, though we may fool ourselves in thinking so.

There doesn't seem to be any disagreement about the social rule of not discussing politics or religion in this forum. We don't think of this as a moral rule, but then what are morals?

> It could be that knowing in advance the emotional, legal, political, or otherwise time-wasting repercussions that a certain type of debate causes justifies avoiding the discussion altogether

So person A, and B and C are interested in having a debate.

Person X "knows" that persons A,B,and C (and some bystanders) would be better off if they don't even speculate or speak on the topic.

...so person X responds "People need to stop writing shit like this" ?

My objections:

* Why should I accept person X's assertion that he knows - better than I do - what will make me happy, or what will waste my time ?

* If person X truly thinks that, he should make a compelling case, put it up on a website, and respond not with "People need to stop writing shit like this", but with "I think that this debate is fruitless and time-wasting - check out this blog post for why".

* Even if person X is right, for a large percent of people, trying to shut down speech "for someone's own good" is un-American, and illiberal. If it's done under the color of law it's called "prior restraint".

* Even if person X is right given the conditions on the ground, conditions change, and over time his flawless heuristic for when to force people "to stop writing shit like this" will become more and more disconnected from reality. What's needed is a constant feedback loop that keeps in touch with reality. ...and "ongoing debate" is the name of the ongoing feedback loop.

Agreed. As Sophocles famously wrote: "Knowledge must come through action; you can have no test which is not fanciful, save by trial."

Whether the statement (that women are better programmers) is actually true, or whether it belongs to Paul Graham's set of things you can't say doesn't even matter here.

It was not presented as a nuanced statistical statement. (Ironically, in an article about how statistical statements need to be more nuanced!) It was presented as a boorish, stupid, and unsupported assertion completely irrelevant to the focus of the article.

There are a few "debates" which should be shut down using social control. Among them are the ones that attempt to box people in and remove their individuality by asserting that some subset of their humanity is the most important thing about them in some context.

You don't need to cite a blog post to do that. There also are many debates which for all intents and purposes are closed, and for which social control or even condescension are appropriate. (NO SERIOUS SCIENTIST BELIEVES THAT THE EARTH IS 6000 YEARS OLD, OR THAT MERCURY IN VACCINES COULD BE A CAUSE OF AUTISM.)

> God forbid that someone make an argument and try to support it with data

That's well and fine, but the author stated an opinion based on anecdotal evidence. On its own, I think it's a really stupid thing to say, but it fits fine given the tone of his rant.

While I do agree with your central point, I don't think there's any value in exploring whether or not women might make better programmers on average. If I were to hire someone, I'd base it on their past experience and how well the interview(s) went, not their race or sex. There's no harm in speculating and even doing the research if it interests one that much though, just like there's no harm in researching whether or not painting red stripes on your car will make it go faster.

> If I were to hire someone, I'd base it on their past experience and how well the interview(s) went, not their race or sex.

Absolutely. I agree 100%.

OTOH, discussing things in aggregates also makes sense.

If 4 out of 1,000 women would make excellent engineers, and 1 out of 1,000 men would make excellent engineers, and yet we see that the distribution of actual engineers is something other than 4:1, we should investigate.

If, OTOH, the numbers are 1 out of 1,000 and 10 out of 1,000 respectively, and the ratio of actual engineers is 1:10, then we might choose to spend less time and energy on the investigation.

> If 4 out of 1,000 women would make excellent engineers, and 1 out of 1,000 men would make excellent engineers, and yet we see that the distribution of actual engineers is something other than 4:1, we should investigate.

We might want to investigate, but "should" is a strong word. There are all kinds of reasons why this could happen, and some of them might not need fixing.

It could be that the total number of people who would make excellent engineers is too low for the number needed, and that men (in your example) are more likely to make adequate or good engineers than women, even though women are more likely to make excellent ones.

It could be that women in general, in spite of being four times as likely to make excellent engineers, tend not to enjoy engineering for cultural or other reasons.

It could be that the women who would be excellent engineers are in the group that would be pretty good CEOs, and they all become CEOs because there's more money in it.

But you get my point: the failure of actual people to conform to the occupations that they would be best at is not, in and of itself, evidence of a problem.

Maybe it's because you're using a contrived example, but I don't quite agree. These sorts of studies are usually undertaken at a higher, more abstract level, eg. how do various cognitive abilities differ between sexes? With the resulting data, we might be able to explain certain phenomena, or dispel certain stereotypes. But I don't think we're anywhere close to being in a position where we can say x out of y men/women would make excellent engineers. That kind of information is usually extracted from trend analyses (eg. studying the proportion of male/female engineers who are highly successful at their career vs those who aren't), and they're inherently skewed because of the nature of the society we live in.

It's wrong because it's a faulty abstraction.

It's generalising. That is, assuming that one or many things are one way just because some things are one way.

Just because one person is one way, doesn't mean they all are, or even mostly are.

I hope you now see why it is FUCKING STUPID TO BE A SEXIST, racist or any other kind of generalist.

Best regards, hugs, kiss, love,

ps. sorry for the non-capital PC parts.

Exactly. Why not just say that ego tends to prohibit good programmers? Bringing gender into it is hardly necessary, and it actually detracts from the important point about ego.

Then again, Zed Shaw isn't exactly known for being PC.

Why? You can say whatever you like if you can back it up with analysis. Not to do so is pointless political correctness.

Car insurance companies believe that young male drivers are a bigger risk than older female drivers. They have statistics to back this up. Saying that young, male drivers are worse drivers is not ageist or sexist. This sentence is not judging individuals but a demographic group (i.e. it would be wrong for me to say you are a worse driver than someone else, if I had no proof).

The problem with judging programmers in this way is that it's hard to empirically measure results or to even agree what metrics are good or bad.

Speaking of statistics, FreeBSD comes with a tool called "ministat", which accepts a bunch of input files filled with numbers and tells you how statistically significant the difference is between them. It's frequently used to demonstrate performance improvements in kernel code:
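For anyone without FreeBSD handy, here is a rough sketch of the same kind of question in Python: given two sets of timings, how likely is a difference this large by chance? To be clear, this is not how ministat works (as far as I know it reports a Student's t comparison); I've swapped in a permutation test because it needs nothing outside the standard library. The function name `permutation_p_value` and the timing numbers are made up for illustration.

```python
import random

def _mean(values):
    return sum(values) / len(values)

def permutation_p_value(a, b, rounds=2000, seed=0):
    """Two-sided permutation test on the difference of means.

    Repeatedly shuffles the pooled measurements, splits them into two
    groups of the original sizes, and counts how often the shuffled
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # seeded so repeated runs agree
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:len(a)]) - _mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (rounds + 1)

# Hypothetical benchmark timings in milliseconds, invented for this example.
before = [102.1, 99.8, 101.5, 100.9, 103.2, 100.4, 101.8, 102.6]
after = [97.3, 96.8, 98.1, 97.9, 96.5, 98.4, 97.0, 97.7]

p = permutation_p_value(before, after)
print(f"mean before={_mean(before):.2f}  mean after={_mean(after):.2f}  p~{p:.4f}")
```

A small p suggests the speedup is probably not noise, but it says nothing about whether the benchmark itself measures what you care about.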


You know, I've read a bunch of Zed's articles, and I always end up thinking that he's the source of most of his problems. There was a talk I saw him give (can't find the link) to a bunch of college kids where he basically said "phone in your job, do the stuff you love on your own time, everyone else is retarded and they'll never understand you."

Great attitude you have there. I know guys like that; guys who are extremely egotistical, are always right, and know everything about everything. They're team killing, energy sucking, wastes.

He may be able to hack like a dream, and "thanks for Mongrel" and all, but I wouldn't even want to be in the same room as him. I think from now on I'll do my best to ignore Zed's perspectives on life; they're more than a little skewed.

> phone in your job

I know the talk you're referring to and he did not say this. He does suggest doing your best and working really hard to make sure your employers get what they want. Do the job you're paid to do and do it well.

What he advises against is going the extra mile for a company that does not trust your judgement and that you have no stake in. Leave work at work. Your job is not your life. You don't owe them anything that they aren't paying for, least of all your creativity.

I don't see how that can be construed as telling them to phone it in.

> everyone else is retarded

In fact he points out several times that the people in charge at most companies are not stupid, they're just clueless about technology; not a very contentious statement last I checked. The whole first half of the talk is explaining this to an audience that has yet to encounter it in the real world, and how to advocate for superior technical solutions in spite of it, with such "team killing, energy sucking" advice as be objective, honest, and prepared for hard technical questions. Truly industry-ruining suggestions.

I strongly suggest you watch the talk again. Even for someone that doesn't like him, you seem to have missed the point of it entirely.

You'll forgive my hyperbole, but I was trying to make a point.

What I'm trying to get at is the reason the people in charge of most companies are "clueless about technology" is that some technical people aren't capable of speaking to them in their language. The people in charge are going to be more concerned with how these issues affect their bottom line; if you can't communicate in those terms you'll rarely be effective.

Zed goes on forever about how his solution was technically better in every manner possible, but he was unable to convince the powers that be that he was right. This is a communication issue likely brought about by his "I'm a genius, and you aren't smart enough to understand" attitude. This isn't because business people are arrogant, pig headed assholes (though some certainly are). No, they don't get the tech, but I for one am willing to take the time to help them understand why these things matter and why they should give a shit. Yes, it sucks, but I feel that not doing so would be doing my employer a disservice. What I dislike about that talk is that Zed basically abdicates responsibility for having to communicate with non-technical people because they'll never be smart enough to understand, when it's really his failing that they don't understand in the first place.

Further, Zed's attitude towards others is beyond condescending. He may pay lip service to the people in charge and say "they're just clueless about technology", but it's so abundantly clear how little he values them, and the jobs they provide. "Working hard and doing a good job", but doing something that you know is both wrong and dumb is phoning it in. It's doing what you're supposed to because you couldn't give a shit about the company, or what becomes of it.

It just rubs me the wrong way.

> the reason the people in charge of most companies are "clueless about technology" is that some technical people aren't capable of speaking to them in their language.

The reason I'm clueless about marketing is not that marketing people aren't capable of speaking to me in my language. It's that, fundamentally, I don't care about marketing. It seems likely to me that the people in charge of most companies really don't care about technology (nor should they except in cases it's critical for the company; I'm not criticizing).

> but he was unable to convince the powers that be that he was right

Yes he was. That was the point of him giving advice on the subject. It wouldn't make much sense to suggest a bunch of things that didn't work at all, would it?

> Zed basically abdicates responsibility for having to communicate with non-technical people

He presents a six-point strategy for doing exactly that. I suggest watching the talk again, as you seem to have missed out on most of it.

The talk is from CUSEC, probably this one: http://vimeo.com/2723800

As for his "attitude", I've actually met him and talked to him face-to-face. His internet persona is a put-on and he only talks like that to piss people off. In actuality, he's a really nice guy who is super-smart and who does not take shit from people. It's actually too bad that there aren't more of him in our industry. Scope creeps, memory leaks, etc. would be a thing of the past.

You are who you project yourself as, and to the billions of people who haven't met him who may stumble across his stuff online, he comes across as a selfish dick. IRL he could be happiness and teddy bears stuffed with sunshine, but so far as I'm concerned if the industry was full of people like him it would be in ruins.

Also, that was the exact video I was talking about, so thanks.

Thanks for posting this. I went from despising Shaw (based on his writings) to kind of liking him.

You like someone for being a double-faced troll? IMO, I like him less now because he actually doesn't believe in what he says online.

> They're team killing, energy sucking, wastes.

You do realize that this could have been written by Zed?

Serious question: who here who has at least a BS in CS didn't have a mandatory stats class where you learned all about picking sample sizes to give you an accuracy you're happy with or that an average without a standard deviation is nigh-meaningless?

It's just a fact of our profession, there is a significant percentage of people who just slap together APIs and have zero understanding of the maths behind it.

I don't see why it takes Zed 1000 words to say it or why he has to get sanctimonious about it.

The problem isn't that people know nothing about statistics. The problem is that they think they know more than they actually do. It's fairly easy to get past your "intro to stats" class with only a passing familiarity with the subject and no deep understanding. Only once you truly grasp stats do you realize that you don't actually know anything.

Any statistical formula will predict its own failure. That's kind of mind-blowing, and the implications aren't always grasped after just one semester (or quarter) studying the topic.

I studied CS and I got past my "intro to stats" class with only a passing familiarity with the subject and no deep understanding (aside from the probability aspects of the course).

The professor was very poor overall. I was able to cram enough to do well, but I retained very little after the final. I would love a deep understanding built from the foundations-- the coursework we had was a lot more obtuse and "take my word for it".

In a sense, I felt I got a better foundation for statistics from a freshman sociology class than from the Math professor.

Why aren't there statistics for CS classes that put stats primarily in the context of system profiling and metrics? Are there any out there?

That would have been cool. Stats wasn't required for us, but if it had been Statistical Analysis of Computer Systems (or programs, or whatever), I would have signed up for it anyway.

I heard the stats class offered where I went to school was a joke so I didn't pursue it optionally. I'm sure it's different depending on your university.

Also, my school was not renowned for any sciences.

I once listened to a stats professor bemoaning that he had to teach stats to various groups of students in their first year when they didn't know enough about their own subjects for him to use any relevant examples.

Since stats on its own can be a bit abstract, that's a double whammy.

Then stats in CS should be a 2nd or 3rd year course. By then, there should be a great deal of material usable as examples.

A really good textbook along these lines might really be something.

By "intro to stats" do you mean the calc-based stats class that most people in CS take? I'd hardly call that an intro class considering you need to complete Calc I & II before enrolling.

Like many things, Stats is a vast field. Would you call a class about the basics of groups and rings anything other than an intro to abstract algebra, even though most places require both calculus and a fair amount of mathematical background?

This is sort of Zed's point. It is an intro class into a field with a great deal of depth. All you can do in one undergraduate semester is touch on a few of the most important parts.

At CMU we had a mandatory stats class but it didn't have anything concerning sample size picking. Ironically, the stats class I had to take when I was in the humanities dept. was more comprehensive. When I walked into industry, I realized after two or three years that practically everyone I'd met knew jack shit about statistics. I myself learned most of what I know from humanities-major friends and books like NLP by Manning and Schutze.

Having been through what Zed's been through (on both the giving and receiving end to wit), I can empathize with why he'd get all sanctimonious about it, but I feel that it just makes the problem worse (at least in person). Programmers are already an egotistical lot, and I've learned that directly attacking their ego tends to make things worse.

What can be done then? I don't know. It feels like this is at the core a personality problem, in particular one in which people associate their ability to know with their sense of self, and personality problems are terribly hard to correct.

I'm currently doing a CS masters and I did my undergrad in Physics. I'm really glad that I have the background in mathematics and stats that I have. In our physics classes they really drilled into us that although our theory and homework often used exact numbers, no measurement meant anything without a measurement of the error. "A number without error is meaningless", they would say.

One thing I never fully understood was how to calculate the propagation of error. There are many useful tricks for reducing the impact of error on your final calculation, and a few things you need to watch to make sure it doesn't increase.
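For what it's worth, the textbook rules for the simplest cases fit in a few lines. This is a minimal sketch, assuming the errors are independent and small relative to the values (the usual first-order approximation); the function names are mine, not from any library. Sums and differences add absolute errors in quadrature, products and quotients add relative errors in quadrature, and averaging n repeated measurements shrinks the error by sqrt(n).

```python
import math

def add_uncertainty(sigma_a, sigma_b):
    """Uncertainty of a + b (or a - b) when the two errors are independent."""
    return math.hypot(sigma_a, sigma_b)

def product_uncertainty(a, sigma_a, b, sigma_b):
    """First-order uncertainty of a * b for independent errors.

    Relative errors add in quadrature; a and b must be non-zero.
    """
    relative = math.hypot(sigma_a / a, sigma_b / b)
    return abs(a * b) * relative

def mean_uncertainty(sigma, n):
    """Uncertainty of the average of n independent measurements,
    each with the same uncertainty sigma."""
    return sigma / math.sqrt(n)

# Illustrative numbers only: a length of 10 +/- 1 times a width of 20 +/- 2.
area_error = product_uncertainty(10.0, 1.0, 20.0, 2.0)
print(f"area = {10.0 * 20.0:.0f} +/- {area_error:.1f}")
```

Correlated errors need the covariance terms as well, which this sketch deliberately leaves out.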

I didn't, and I'm graduating with a BS in CS from Stanford in a few weeks. We have an introductory stats requirement, but I found that the stats class I took my sophomore year for my (dropped) economics major was far more enlightening than my CS stats requirement.

I think the fundamental issue is that there's very little focus on statistical practice in most of these courses. Social scientists have it good: they're always taught how to deal with and interpret statistics in the same way they'll have to in their line of work. It's totally useless to throw a bunch of theory at students. I think that teaching statistics in the context of real problems is the only way (most) students will actually learn and come to appreciate how useful it is.

The department revamped the major this year, and they've introduced a mandatory stats class tailored to CS students: cs109.stanford.edu. I haven't looked through it at all, but I think it's a step in the right direction.

Finally, I have to give a shout-out to The Little Handbook of Statistical Practice (http://www.tufts.edu/~gdallal/LHSP.HTM) in this thread. It's an amazing resource for anyone who works with statistics. I've referenced it while doing performance testing, building an A/B testing system, and working on problem sets. From the website:

"My aim is to describe, for better or worse, what I do rather than simply present theory and methods as they appear in standard textbooks. This is about statistical practice--what happens when a statistician (me) deals with data on a daily basis."

Read it now!

For a different view of how useful statistical practice (when applied mindlessly) is, read "The Black Swan". It's all about how people use statistical models with Normal distributions in places where it's patently unjustified, and the price we pay for it.

I played this game myself with a friend. I sent him ten samples of a (for him) unknown distribution and asked him to estimate the mean. Then 100, then 1000. His estimate of the mean kept changing to higher and higher values, because the samples were drawn from a Pareto (power-law) distribution with a mean of 1000. Such a distribution is almost indistinguishable from one with a mean of infinity, because all the signal is in the very rare, large outliers. If you try to analyze samples from such a process assuming it's Gaussian, nothing will make sense, and the standard deviation will give you an estimated uncertainty of the mean that is far, far below the actual uncertainty.
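If you want to replay that game, here is a small sketch of it. The assumptions are mine: I use Python's `random.paretovariate`, which draws from a Pareto distribution with minimum value 1 and mean alpha / (alpha - 1) for alpha > 1, and pick alpha so the theoretical mean is 1000. The seed and sample sizes are arbitrary. The median is printed alongside the mean to show how little the bulk of the samples tells you about the true mean.

```python
import random
from statistics import mean, median

TRUE_MEAN = 1000.0
# With minimum value 1, the Pareto mean is alpha / (alpha - 1),
# so this alpha gives a theoretical mean of 1000.
ALPHA = TRUE_MEAN / (TRUE_MEAN - 1.0)

def running_estimates(sizes=(10, 100, 1000), seed=42):
    """Return (n, sample mean, sample median) for growing prefixes
    of one stream of Pareto-distributed samples."""
    rng = random.Random(seed)
    samples = []
    rows = []
    for n in sizes:
        while len(samples) < n:
            samples.append(rng.paretovariate(ALPHA))
        rows.append((n, mean(samples), median(samples)))
    return rows

for n, m, med in running_estimates():
    print(f"n={n:5d}  sample mean={m:12.2f}  median={med:8.2f}  true mean={TRUE_MEAN:.0f}")
```

With a tail this heavy, the sample mean typically sits far below the true mean until a rare huge value shows up, so the estimate tends to jump upward as more data arrives rather than settle down.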

Data point: I am about Zed's age and majored in math and was one class away from a CS double major and I didn't have to take any stats courses. When I was in school stats was a totally different track than what most math people took. For some reason biology and business majors had to take stats but physics and math majors did not. I feel like the physics/math/CS people assumed stats was easy and you'd pick it up as needed.

There's a big percentage of programmers who don't understand O(n^2). On multiple occasions, I've been seen as a guru because of this one little tidbit.

What if the medical profession has as much egregious and widespread ignorance of the basics as programming? Would you be in favor of certification?

In medical professions, you go through a period of training (clinicals, internship, residency) where people with tons of experience point out that you never learned anything. This humbling experience is not really available in the CS world.

Particularly disturbing, since the only good way I know of to learn programming and system architecture is through mentoring. Basically, our mentoring system is haphazard. There's even a lot of anti-mentoring happening out there.

I find this incredibly hard to believe. Anyone who has gone through a single intro course in CS should understand at least the difference between linear and quadratic time. Perhaps I've been in the university too long. Are these people non technical majors who picked up programming in PHP?

We have de facto certification. It's called moving to silicon valley :)

And everywhere else, people have to put up with crappy programming? I think this system leaves a bit to be desired.

it seems pretty mind boggling that people are that bad at stats ... we have a mandatory stats class for CS ... it doesn't teach some of the stuff mentioned in the article but the stuff mentioned is completely intuitive -- you shouldn't need to take a stats class to understand that.

like if the variance of your result is really high and you're estimating the mean it's clear that the number of samples you take is important !?!

You're proving his point, which is that people's intuition is often dangerously wrong. The sample size is actually often nowhere near as important as your sampling method. For an extreme example, I can conduct a billion coin flips with a weighted coin, but it's going to tell me jack about the general behavior of coin flips. Avoiding sample bias is a hard problem, far harder than most people appreciate.
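
A small simulation makes the weighted-coin point concrete. It is a sketch with made-up numbers: the 0.7 weighting, the sample sizes, and the helper `estimate_heads` are all assumptions chosen for illustration.

```python
import random


def estimate_heads(p_heads, n, seed):
    """Fraction of heads in n flips of a coin that lands heads with p_heads."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n) if rng.random() < p_heads)
    return heads / n


# Hypothetical numbers: the "population" of interest is fair coins (0.5),
# but the big sample comes from a coin weighted to 0.7.
biased_big = estimate_heads(p_heads=0.7, n=100_000, seed=1)
fair_small = estimate_heads(p_heads=0.5, n=100, seed=1)

if __name__ == "__main__":
    print(f"100,000 flips, weighted coin: {biased_big:.3f}")
    print(f"    100 flips, fair coin:     {fair_small:.3f}")
```

The large sample converges very precisely, but on the wrong number. More flips only shrink the noise around 0.7; they never move the estimate toward 0.5.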

To complicate matters further, "accuracy" is an extremely tricky concept in statistical analysis because error rates work very differently than they do in, say, physics. In physics, when you measure something you can be sure that your results are accurate, so long as you stay outside your instrument's range of error. In stats, your confidence interval just tells you how likely it is that your results are completely wrong, or even worse, wrong by a completely unknown amount. Every statistical inference you make has a chance of completely blowing up on you. That chance can be defined and reduced, but it can never be eliminated. There's also things like frequentist vs Bayesian statistics, where the interpretation of the same data can be completely different.

Zed may be a jerk sometimes, but on this topic he's dead right. Most programmers are far more confident about this stuff than they should be.

After doing some research, I think I've completely abused the notion of the confidence interval. Which I think just helps to prove Zed's point. This stuff be hard.

I don't like this article. Yes, he's right: an intuitive knowledge of the relationship between averages and standard deviations is essential. But to presume that the standard tools of statistics should be applied to every problem is to miss the point.

On a superficial level, if you are doing overnight processing of log files, then you probably care more about throughput than latency. In this case, averages are probably a fine metric. On a slightly deeper level, standard deviation is only a useful measure if the distribution is known, and in a lot of real world cases it is not. The right question isn't whether 100 or 1000 tests on the same data provide sufficient statistical power, but whether the range of inputs is sufficient to trigger worst case performance.
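
To illustrate the last point, here is a toy example (not taken from the article) that counts comparisons in a plain insertion sort. The algorithm and the input sizes are assumptions picked because the worst case is easy to trigger and to count exactly.

```python
def insertion_sort_comparisons(items):
    """Sort a copy of items with insertion sort; return the comparison count."""
    a = list(items)
    comparisons = 0
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= key:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return comparisons


n = 1000
already_sorted = list(range(n))
reversed_input = list(range(n, 0, -1))

# Repeating the benchmark on the same friendly input gives the same answer
# every time: zero variance, and no hint of the worst case.
repeat_counts = {insertion_sort_comparisons(already_sorted) for _ in range(100)}
worst_case = insertion_sort_comparisons(reversed_input)

if __name__ == "__main__":
    print("distinct counts over 100 runs on sorted input:", repeat_counts)
    print("comparisons on reversed input:", worst_case)
```

A hundred runs on the sorted input all report n - 1 comparisons, while a single reversed input costs n(n - 1)/2. No number of repetitions on the first input would have revealed the second figure.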

Now, I presume that Zed knows these things and applies them appropriately, but the article strikes me as more snide than helpful. Perhaps as others say he's a great guy in person, but I prefer my stats with less attitude and more insight. Here, for example: http://yudkowsky.net/rational/bayes

[edit: changed my sloppy language from 'has no meaning unless the distribution is normal' to 'is only a useful measure if the distribution is known']

On a slightly deeper level, standard deviation only has meaning if the distribution is presumed to be normal

Are you completely sure about that?

I suppose many readers of this thread are more knowledgeable about statistics than I am. I would appreciate hearing from the knowledgeable readers whether or not variance in the observed values makes a difference in the cases discussed in the submitted article.

The statement "only has meaning if the distribution is presumed to be normal" is wrong. The SD is a summary of the spread of a distribution. In fact, for most centrally concentrated distributions (including a uniform one) +/- 1 sigma corresponds to about 60% of the mass of the distribution. This is an amazingly useful thing to know.
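
The "about 60%" figure can be checked in closed form for a few textbook shapes. The three distributions below are my own choice of examples, and the function names are made up for the sketch.

```python
import math


def uniform_within_one_sd():
    # Uniform on [-1, 1]: sd = 1/sqrt(3), and the mass inside [-sd, sd]
    # is simply sd (interval length 2*sd over total length 2).
    return 1 / math.sqrt(3)


def normal_within_one_sd():
    # Standard normal: P(|Z| < 1) = erf(1 / sqrt(2)).
    return math.erf(1 / math.sqrt(2))


def triangular_within_one_sd():
    # Symmetric triangular on [-1, 1]: variance = 1/6,
    # and P(|X| < s) = 1 - (1 - s)**2 for 0 <= s <= 1.
    sd = math.sqrt(1 / 6)
    return 1 - (1 - sd) ** 2


if __name__ == "__main__":
    print(f"uniform:    {uniform_within_one_sd():.3f}")
    print(f"triangular: {triangular_within_one_sd():.3f}")
    print(f"normal:     {normal_within_one_sd():.3f}")
```

These come out at roughly 58%, 65%, and 68% respectively, so "about 60%" is a fair rule of thumb for these shapes, though it is not a guarantee for arbitrary distributions.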

As the above trivia factoid points out, the standard deviation is an important summary statistic. More interestingly, by using mean, variance (or sd), skew, and kurtosis, you can describe almost any centrally concentrated distribution. Even distributions with heavy tails.

I think what the OP meant is that most 3+ sigma results are not truly 3+ sigma, because most distributions in this world are not gaussian, but instead have large wings. SD is most useful when you know what the underlying distribution is. Currently it's more in fashion to communicate spread using confidence intervals because they presume less about the underlying distribution.

You're right. I was being sloppy.

I should have said something more like "the standard deviation calculated from a sample set is only generally applicable in so far as one is willing to make assumptions that the sample set is representative of the distribution as a whole". The default assumption in traditional statistics (such as quoting p-values) is that the distribution is normal, which in real world situations is often not the case.

Your restatement is right on, although I'd go farther and say that standard deviations (and confidence intervals) are only useful metrics with regard to the particular assumptions one is willing to make about the underlying distribution. Yes, you can calculate these measures, but they won't help you if your assumptions are irreparably flawed.

Are you completely sure about that?

You could quibble about my exact phrasing, but yes, I'm completely sure about that. This is the 'black swan' problem writ small. I don't mean that a high standard deviation should be ignored for real-world distributions, but I do mean that a low standard deviation carries very little weight unless a normal distribution is presumed.

I'm hard pressed to relate this to the cases discussed in the article, as those cases are shy on detail, but the DB2 example seems most applicable. Although he points to standard deviation as the tell-tale flag here, this is sort of misleading. The exact numerical value for the standard deviation across all queries is meaningless here, as not every query has an equal likelihood of being slow. As he states, the real problem was the terrible performance of a single query.

How many similar queries exist? Will a new query added to the system trigger a similar bug? We don't know, and standard statistics isn't going to help us unless we have an understanding of the underlying mechanism. The key here is not to test a statistically significant subset of all possible queries, but to check the performance of the actual queries executed (as he did).
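
As a rough sketch of "check the actual queries", the snippet below groups timings by query instead of pooling them. The query names and timings are invented for illustration; the article does not give the DB2 data.

```python
import statistics
from collections import defaultdict

# Hypothetical (query name, seconds) log entries, invented for illustration.
timings = [
    ("lookup_user", 0.011), ("lookup_user", 0.012), ("lookup_user", 0.010),
    ("list_orders", 0.020), ("list_orders", 0.022), ("list_orders", 0.021),
    ("monthly_report", 4.8), ("monthly_report", 5.1),
]

# Pooled summary: one mean and one sd across every query.
all_times = [t for _, t in timings]
overall_mean = statistics.fmean(all_times)
overall_sd = statistics.stdev(all_times)

# Per-query summary: group first, then summarize each group.
by_query = defaultdict(list)
for name, seconds in timings:
    by_query[name].append(seconds)

per_query_mean = {name: statistics.fmean(ts) for name, ts in by_query.items()}
slowest = max(per_query_mean, key=per_query_mean.get)

if __name__ == "__main__":
    print(f"pooled mean={overall_mean:.3f}s sd={overall_sd:.3f}s")
    for name, mean in sorted(per_query_mean.items(), key=lambda kv: kv[1]):
        print(f"{name:15s} mean={mean:.3f}s")
    print("slowest query:", slowest)
```

The pooled standard deviation is larger than the pooled mean, which flags that something is off, but only the per-query breakdown says which query is responsible.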

I think I've read this before when Zed was "so fucking awesome". Is his whole "dropping the persona" gig just a way to get more mileage out of old articles?

His favorite blog posts are being reposted as essays.

He's the new PG, with most of the wisdom replaced by machismo.

Ma Kiz e Moe

I hadn't seen this, so I was glad it was resubmitted.

I think he has a fair point. Here on HN I see a lot of armchair sociologists critique the various articles in the social sciences that get posted, but it's rather unclear to me whether these are well grounded in an actual understanding of the issues involved, or simply habitual incantations of rules of thumb such as "correlation doesn't imply causation."

um...rules of thumb are a bit different from identifying logical fallacies.

That's a lot of text to say not much more than "Standard deviation can be as important as the mean; be careful about confounding variables; and if you're an engineer, spend more time learning statistics."

Oh, also, "You're all assholes and I rock."

Truly the Carl Sagan of software.

I think a better way to help would have been to spend the time writing a blog post introducing some core stats concepts and showing how to use R to do useful things...

Can anyone recommend a decent (read "not boring") book on statistics?

Something along the lines of "Naked Economics" for the stats realm ...

There's the Cartoon Guide to Statistics ( http://www.amazon.com/Cartoon-Guide-Statistics-Larry-Gonick/... ), which, despite its name, is a pretty solid and comprehensive book on the basics. And definitely not boring!

How to Lie with Statistics (http://www.amazon.com/How-Lie-Statistics-Darrell-Huff/dp/039...) is a short, enjoyable read. It doesn't tell you how to do statistics, but it gives some warning about common problems.

Also the manga guide to statistics http://nostarch.com/mg_statistics.htm

Why is he mixing computer science with programming? Computer science is just a branch of mathematics. It has not much to do with what he talks about afterwards.

Computer science is to programming as physics is to engineering. The way SOME programmers constantly put down computer science, especially academic computer science, is a sign of the anti-professionalism in the field. (I decided I better emphasize some programmers since they actually seem to be a very vocal minority.)

I think this article is great once you get past the rant-iness of it. He makes a bunch of valid points and it's true: a lot of people actually don't take these pretty crucial details into account when they're dealing with statistics.

It's kind of a shame that his message is diluted by its delivery. Case in point: about half of the comments here aren't even about statistics!

I always hated stats lectures :( . But the book "Statistics for the Utterly Confused" helped me in exams. Still, I'm not in love with stats.

Would he kill me?

I don't like your chances.

pat on the back for you -- you understand intro college statistics... maybe you just work with shitty programmers that don't understand statistics????? i dunno but most of my colleagues know stats pretty damn well. but maybe thats just because i'm in school

I love it when Zed Shaw gets pissed. It's fun!!


This is a (probable) re-submission of an older article of his.

It's a very old post.

I read this in 2005.

i think we'll have a huge shortage of programmers shortly

I can see some people here claiming this is old (it is) and that it's a repost (it is), but I'll still vote this up, because especially now with people yelling "scaling! scaling!" all over the place, I can't imagine a more fitting time for developers to read this.

Scaling? People keep using that word. I don't think it means what they think it means.

99% of all statistics are made up

I've read this before. Zed does not provide any details on learning or understanding statistics in this rant.

Statistics and probability is a major course of study - far more than could be usefully taught in a blog. What he tries to do is to motivate people to learn it - which is the first major step to actually learning it.

ahh..the old "motivation by calling people stupid" trick ;)

So after Ruby and Ruby on Rails, Zed has learnt some R and statistics. I say that because, having read almost all of the article, he basically only talks about average, median, and standard deviation.

He should learn a little bit of complexity and computability, by the way: best case, worst case, Big O notation, etc.

Because when edu writes a blog post, he includes his entire life history, and, for that matter, the entire history of the universe in it.

I know you've already been smacked down but this is a big pet peeve of mine and worth pointing out. You can always complain that something should have had more information in it, and since that is always true no matter what, the complaint is information free. (A value that comes from a universe of one value contributes zero bits of information.)

I'm not so sure about that. A request for more information can be incredibly important (for instance: Tell me who committed the murder! also see: stubs on Wikipedia). Essentially, there is a point of diminishing returns, and after a while, possibly negative returns as the work becomes too hard to assimilate.

However, in this case, I tend to think that Zed included just enough information to get someone who's clueless started - he even included references at the end so that if you DO want more, you can easily get it.

> So after Ruby and Ruby on Rails, Zed has learnt some R and statistics.

From the article it sounds like he learned stats first.

> studied statistics in grad school,
