
Programmers Need To Learn Statistics Or I Will Kill Them All - jmonegro
http://zedshaw.com/essays/programmer_stats.html
======
smanek
I strongly disagree with Zed on this one (at least part of his diatribe - he
makes some good points about sample size/ramp-up/etc).

I don't care what your mean or standard deviation are. If your performance
isn't symmetrically distributed about the mean (and it probably isn't),
standard deviation probably doesn't mean what you expect.

I _only_ look at performance at the 99th percentile (or sometimes, the 99.9th
percentile). If 99% of responses are being served within 100ms on my webapp, I'm
happy. I don't care if the mean is 5ms with a 20ms standard deviation. As long
as the performance at the 99th (or 99.9th) percentile meets a level of
'acceptability' that I have predetermined, I'm satisfied. And you should be
too.

This is much easier to deal with than trying to do real math. Just make sure
that no more than 1% (or 0.1%, as the case may be) of requests take longer
than X.
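smanek's rule of thumb is easy to automate. Here's a minimal Python sketch (the threshold, the percentile function, and the sample latencies are all hypothetical, not from the article) that also shows why a great-looking mean can hide an ugly tail:

```python
# Sketch of smanek's rule: judge a service by its 99th-percentile
# latency, not by mean/stddev. All numbers below are made up.

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least
    pct% of samples at or below it."""
    ordered = sorted(samples)
    k = max(0, -(-len(ordered) * pct // 100) - 1)  # ceil(n*pct/100) - 1
    return ordered[int(k)]

def meets_slo(latencies_ms, pct=99, threshold_ms=100):
    """True if the pct-th percentile is within the predetermined
    'acceptability' threshold."""
    return percentile(latencies_ms, pct) <= threshold_ms

# A skewed latency distribution: mostly fast, a few slow outliers.
latencies = [5] * 980 + [80] * 15 + [2000] * 5
mean = sum(latencies) / len(latencies)  # 16.1 ms -- looks great
p99 = percentile(latencies, 99)         # 80 ms -- the real story
print(mean, p99, meets_slo(latencies))
```

Note how the five 2000ms outliers barely move the mean but would dominate a max-based view; the 99th percentile sits usefully in between.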

Pre-emptive answer: I usually get someone asking me why I don't demand
acceptable performance at the 99.99999th percentile or at the slowest request.
It's simply an issue of resources. To get that kind of performance, you can't
use garbage collection, context switches become an issue, disk buffering even
comes into play. You have to basically write a hard real-time (or pseudo real-
time) app, and that level of effort isn't worth it for a consumer facing
webapp. If I was writing code for a pacemaker or something (and please, dear
god, no one ever let me do that) it damn sure better be hard real-time though.

~~~
viraptor
I can't really agree with you. If you care only about 99.9% of your requests
being on time, it means that one in 1000 can time out and die. Not that bad?
Ok - let's have an example:

Facebook - one random user-page view == 63 requests (reported by firebug).
That means 16 people generate >1000 requests. Even with 99.9% of requests within
your "acceptable range", you get every 16th person either missing an element
on a page due to some DB timeout (for example), or seeing an incomplete page
for a not "acceptable" time (with a 1/63 probability of it being the main
document, if we don't know which element fails).
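The arithmetic behind "every 16th person" checks out. A short Python sketch of the same back-of-the-envelope calculation, assuming the 63 requests behind a page view fail independently:

```python
# Reproducing viraptor's estimate: if each of the 63 requests behind
# one page view independently misses the latency target 0.1% of the
# time, how often is a whole page view affected?

requests_per_page = 63
per_request_miss = 0.001  # i.e. 99.9% of requests are "acceptable"

# P(at least one bad request on the page)
page_affected = 1 - (1 - per_request_miss) ** requests_per_page

print(f"{page_affected:.1%} of page views affected")       # ~6.1%
print(f"roughly every {round(1 / page_affected)}th view")  # every 16th
```

Dropping the miss rate to 0.05% (the 99.95% figure discussed below) roughly halves this, to about one page view in 32.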

Is that really an acceptable behaviour?

I don't believe that a requirement saying "the complete website loads in less
than 1s" is affected by GC or context switches at all, tbh. If you actually
tested the website with load of 10x the average expected one (over a couple of
hours of random scripted walk with a natural write/read balance - testing
you'd do on a service you really care about) and you still get random glitches
with the expected load, then there's most likely something wrong on a higher
level (disk seeks failing? shared hosting overloaded? network? etc. etc.)

~~~
Retric
Taking more than 100ms is not the same thing as "time-out and die".
Performance is focused on how long something takes, and it normally assumes
that the system is not broken. It would take a horribly broken system to
time out 1% of requests. If 1% of your requests take long enough that the
user might notice, but they still work, that is generally acceptable
performance for a webapp. So a reasonable goal might be 99% within 150ms and
99.95% don't fail.

~~~
viraptor
smanek wrote: "I don't care what your mean or standard deviation are.", which,
if taken seriously, means exactly that 1 in 1000 requests can take an hour to
finish and he doesn't care. Anything over 10s in real life means a connection
time-out, or the user refreshing, or going to the next page, or...

Going to 99.95% non-failing connections doesn't help you much IMHO. It means
that you've gone from every 16th person having a problem to every 32nd
person. That's not acceptable at all.

~~~
Retric
You're also assuming that he/we measure per element, not per page, which is
just silly. Also, stating a minimum accepted value is not the same as ignoring
the rest. It just sets a floor on acceptable performance in that context.

~~~
viraptor
I read his "Just make sure that no more than 1% [...] of requests" as actual
requests. If it meant whole-page-load-time then fair enough.

"stating a minimum accepted value is not the same as ignoring the rest" - then
what is it if you don't care about mean and stddev? If you do care about other
requests you care about the mean (or stddev, or median, or whatever other
parameter) but just don't put it in a scientific way. If you say reasonable,
it might mean - it doesn't time-out. That means it's below X seconds. That
means with the other maximum load time constraints, you expect a mean time for
Y requests to be less than Z. Saying "Assuming that 1% of your requests taking
long enough the user might notice" you just put a constraint on that
parameter.

(just to explain why I keep arguing this point - because that's what the post
is about (kind-of), people say "reasonable time", but don't want to set the
actual mean/stddev - which is what they do care about)

~~~
Retric
No, because taking less than the threshold is effectively meaningless. So,
when you calculate the mean and standard deviation the calculation is altered
by meaningless trivia. When you add in the latency and rendering times 1ms and
30ms might as well be the same number. What you really want is to round
everything under your thresholds up to that threshold and then look at the
breakout as x% take 100ms or less, y% take 1 second or less, and z% take less
than 3 seconds etc.
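Retric's breakout can be sketched in a few lines of Python. The thresholds are the ones from the comment; the sample latencies are made up for illustration:

```python
# A sketch of Retric's breakout: instead of mean/stddev, report the
# cumulative fraction of requests finishing within each threshold.
# Everything under a threshold is treated the same, as suggested.

def breakout(latencies_ms, thresholds_ms=(100, 1000, 3000)):
    """Return {threshold: fraction of requests at or under it}."""
    n = len(latencies_ms)
    return {t: sum(1 for x in latencies_ms if x <= t) / n
            for t in thresholds_ms}

# Hypothetical sample: mostly fast, some slow, a few very slow.
latencies = [20] * 900 + [400] * 80 + [2500] * 15 + [9000] * 5

for t, frac in breakout(latencies).items():
    print(f"{frac:.1%} take {t} ms or less")
# 90.0% take 100 ms or less
# 98.0% take 1000 ms or less
# 99.5% take 3000 ms or less
```

In practice you would compute one such breakout per page type or time-of-day bucket, as the comment goes on to describe.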

You then look at that breakout based on what the page is, or the time of day,
etc. But from a performance standpoint, the only really important number is
the first one: once enough of your site is fast enough, having a few slow
areas or some issues at peak times is reasonable, and once it's slow enough
for a user to notice, it's a problem.

------
manvsmachine
In case you're interested in the (lengthy) original discussion:
<http://news.ycombinator.com/item?id=626771>

EDIT: that wasn't the original, but it was the longest. This is the original
(I think): <http://news.ycombinator.com/item?id=48006>

~~~
jmonegro
It's a really good article - I came across it today on my twitter feed, and it
surprised me that it still has the same impact it had in 2005.

------
Perceval
I'm trying to teach myself statistics (for social sciences) right now. I got
two books (Agresti & Finlay along with Knoke, Bohrnstedt & Mee) but both seem
to be written in jargon instead of English.

Does anyone have recommendations for statistics books that are actually
written in English?

~~~
bokonist
For a long time, I wanted to teach myself statistics for the social sciences
too. I decided the best way to do this was to find an interesting social
science issue, and then learn statistics as part of tackling the issue. But
there was one problem with this. There are almost no problems in the social
sciences where it makes sense to use advanced statistics to analyze them. Using
very simple regressions can be illuminating and illustrative (though not
conclusive). But when you try to get more advanced, data quality, data coding,
issues with choosing the right controls, too many variables, sensitivity to
variable selection, inability to separate causation from correlation, etc.
render advanced statistics utterly useless. Do you have a particular social
science problem that you think would be a good fit for statistics?

~~~
thomaspaine
I guess it depends on what you mean by "advanced statistics". Is correcting
for heteroskedasticity, a fixed-effects regression, instrumental variable
regression, etc. advanced? Because if so, all of that is pretty common practice
in say, economics. Simple linear regression is easy and often illuminating,
but good luck getting a paper published if that's all you've done.

I agree that in practice, data munging is 90% of the work and can be pretty
tedious and discouraging when working with real world datasets, but if you
don't know about issues like the ones I mentioned above, how do you even know
where to start?

For examples of how statistics can be used to solve interesting problems in
the social sciences, just browse through Freakonomics.

~~~
bokonist
Yes, that's what I mean by advanced statistics. I've read many papers in
economics using such techniques. They are 99.9% utter garbage. The issues I
mentioned above (data quality, data coding, issues with choosing the right
controls, too many variables, sensitivity to variable selection) doom the
project before you can even get to using the statistics. Garbage in, garbage
out. If you data munge and get a result, you can never determine if it's a
real result or a data mining effect.

And yes, I understand you cannot get published in the social sciences without
using advanced statistics. All this means is that economics academia is
falling into the trap of Scholasticism, an insular world that uses
complicated and absurd methods of research, increasingly divorced from
reality.

Social science needs to get over its science envy. As I've written here
before, the proper tools to use as a student of policy and sociology are the
tools of product management: <http://news.ycombinator.com/item?id=836196>

I have not read Freakonomics the book, but I read their blog occasionally.
It's interesting when they find a simple statistic that raises an interesting
point.

It might be easier to explain the problems with reference to a specific
example. If there is some social science paper that uses the techniques you
mentioned above, and you think is particularly good, send me a link, and I can
explain why I think it's likely garbage.

~~~
Perceval
I agree with your assessment of the 'scientism' of social sciences, and the
rather absurd uses of statistical methods (which should be inapplicable because
of data collection and basic assumption issues).

What I aim to do is mostly descriptive statistics, but with a few simple tests
(e.g. an analysis of variance with covariates). I've got three populations of
civil war data and would like to compare features of those populations against
each other (e.g. incidence of civil war in period one vs. period two vs.
period three), controlling for # of states, or # of new states (say, within
five years of their founding).

Most of the complicated stuff really can't be applied to stuff like data on
civil wars, since the data doesn't meet basic assumptions of the models, but I
would like to be able to say something more than just 'the data from period
one looks different than the data from the other two periods.'

~~~
thomaspaine
> Most of the complicated stuff really can't be applied to stuff like data on
> civil wars, since the data doesn't meet basic assumptions of the models

I think this is precisely why we have more advanced statistical techniques.
There are ways to correct for, or at least detect, serial correlation,
heteroskedasticity, etc., all of which are probably in your data. All real world
data is fucked up in some way, having a large toolbox of statistical
techniques helps you cope with this fact. Maybe not perfectly, but at least
you'll know where/why your model is wrong.
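As a concrete example of the kind of diagnostic being described, here is a from-scratch Python sketch of the Durbin-Watson statistic for detecting first-order serial correlation in regression residuals. It is an illustration only, not a substitute for a real stats package, and the residual series below are simulated, not real civil-war data:

```python
import random

def durbin_watson(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
    Values near 2 suggest no first-order serial correlation;
    near 0, positive correlation; near 4, negative correlation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

random.seed(42)
# Independent residuals: what a well-specified model should leave.
independent = [random.gauss(0, 1) for _ in range(500)]

# Autocorrelated residuals: each one drags along 90% of the last.
trending = [independent[0]]
for e in independent[1:]:
    trending.append(0.9 * trending[-1] + e)

print(round(durbin_watson(independent), 2))  # close to 2
print(round(durbin_watson(trending), 2))     # well below 2
```

Seeing a DW far from 2 doesn't fix anything by itself, but it tells you where and why a naive regression on the data would be wrong, which is the point being made here.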

~~~
Perceval
Can you recommend a book on advanced statistical techniques or non-parametric
statistics that's written mostly in English with a minimum of jargon?

------
amoeba
Statistics will save us all.

And for the HN folks that like to Lisp it up:

<http://incanter.org/>

------
thisisnotmyname
This is a great resource if you're studying / reviewing on your own:

<http://spartan.ac.brocku.ca/~jvrbik/MATH2P82/Statistics.PDF>

------
jrockway
Hacker News readers need to stop submitting linkbait titles, or I will kill
them all.

~~~
hyperbovine
Quoi? That is the verbatim title of the post.

~~~
jrockway
That means you took the linkbait. Why reward articles like this with free
views?

~~~
raganwald
If the article delivers on the hype, why not reward the author for having the
walk to match the talk?

~~~
jrockway
Does this article deliver on the hype?

What I got out of it was: "Some people are dumber than me." This is a well-
known problem both inside and outside the programming world, and unfortunately
killing the people dumber than you is not the solution to the problem.

~~~
raganwald
I took it as "some people are ignorant of basic statistics," which isn't quite
as daunting as "some people are dumber than me." The cure is motivation to
learn, and one can take the view that the hyperbole of the title broadens the
post's reach and simultaneously provides a little negative motivation. I
didn't take the title quite that literally.

I personally take the view that you catch more flies with honey than vinegar,
but I suggest the article as a whole including the title provides a general
good.

------
stcredzero
_This article is my call for all programmers to finally learn enough about
statistics to at least know they don’t know shit._

His call can be generalized: "This article is my call for all programmers to
finally learn enough about [insert here] to at least know they don’t know
shit."

The power that we programmers have as little gods inside systems of our own
making, tends to have us over-estimate how much we know. Thus our penchant for
posting stuff that makes experts in various technical and scientific fields
roll their eyes.

------
sireat
Two great books for those who need to re/freshen up their statistics:

[http://www.amazon.com/How-Lie-Statistics-Darrell-
Huff/dp/039...](http://www.amazon.com/How-Lie-Statistics-Darrell-
Huff/dp/0393310728)

[http://www.amazon.com/Cartoon-Guide-Statistics-Larry-
Gonick/...](http://www.amazon.com/Cartoon-Guide-Statistics-Larry-
Gonick/dp/0062731025)

I haven't had a chance to check out the manga guide to statistics but that
might be a decent introduction, as well.

------
dylanz
It's a great post, albeit, probably about 3 years old.

------
asb
Just thought I'd share an example where the graphs really help you to
understand the performance profile.

[http://s3.amazonaws.com/four.livejournal/20090911/benchmark....](http://s3.amazonaws.com/four.livejournal/20090911/benchmark.html)

------
MikeTLive
once again slashdot posts hackernews as "new" news. would be nice if they
linked HERE to gain from the new comments not the 2.5yr old comments. oh well.

[http://developers.slashdot.org/story/10/01/09/2154224/Why-
Pr...](http://developers.slashdot.org/story/10/01/09/2154224/Why-Programmers-
Need-To-Learn-Statistics)

------
ez77
P = "Programmers Need To Learn Statistics"

Q = "Zed Will Kill All Programmers"

If P || Q must hold, go with P!

------
AlphaMonkey
Zed Shaw should stick to his area of expertise. His angry, opinionated rants
on topics he clearly lacks deep knowledge of are perfect examples of the
pseudo-intellectual junk that litters HN from time to time.

~~~
Raphael_Amiard
Thanks for that well-argued, non-opinionated answer!

~~~
AlphaMonkey
At least I don't write like a slightly retarded teenager, and I don't pretend
to be an expert on Statistics. Thanks for the "kudos" anyways.

~~~
Raphael_Amiard
Sorry for the sarcastic answer, never a good thing. My point was that despite
Zed's tone, which I found irritating, he does actually provide content in this
article that a newbie like me can use and profit from (which I did when the
article first came out).

On the other hand, your post sounds angry, gives no insight as to why/how he
lacks knowledge (which would have been interesting), and uses condescending
expressions like "pseudo intellectual garbage"

~~~
AlphaMonkey
Look, I know it's none of my business, but if you're a newbie and you want to
learn some Statistics, wouldn't you be better off checking the MIT OCW pages,
or the countless lecture notes in PDF format that can be found on the web?
This is an honest question.

Contrary to what many short-sighted "pure" mathematicians think, Statistics is
hard. You can't explain statistics in a bunch of blog posts, not even if you
are a true expert. You can't condense all that knowledge into a blog post; one
must read the books, however painful that may seem.

~~~
loumf
To learn statistics, yes, of course. To get psyched up about learning
statistics? That's what this article did for him.

Sometimes it's just fun to read someone go nuts about something like
statistics, and it may give you motivation to go learn more about it so that
you can understand the whole thing.

~~~
tobtoh
I agree. The last time I dabbled with statistics was at university, and whilst
I could see it was useful, it was boring as anything. I couldn't see any
real application in my day-to-day job - after all, besides understanding the
difference between mean and median, what else would I need?

But Zed's article, juvenile as it was in tone, had undeniable passion about
statistics. And it was enough to keep me reading through the whole article and
pay attention to the examples. As it is, I will be doing further reading on
statistics and so Zed's article has been incredibly useful to me.

"Maybe it's just me, but Statistics is a beautiful field, and such
intellectual beauty should be all the "psyching up" one needed."

It's just you. Nine out of ten people think statistics are nothing more than
boring numbers made up on the spot.

~~~
AlphaMonkey
Nine out of ten people learned Statistics from teachers who knew jack of
Statistics. The difference between Statistics and Math is that Statistics is
data-oriented and somewhat "experimental". Yeah, I hated Statistics in high-
school, but when I learned Information Theory and Statistical Signal
Processing, then I understood what one could do with knowledge of Statistics,
and what the field is all about.

If you can't see the beauty in it, you've probably been taught pseudo-
Statistics. Don't feel bad. All the Math that engineering students learn is
kind of pseudo-Math. All one learns in high-school is BS. If you want to learn
something, here's my advice:

i) Don't use modern, over-designed textbooks.. you know... the thick expensive
ones with many colors and boxes highlighting the formulas one must memorize.

ii) Instead read the classic books from the 1960s, many of which are published
by Dover. They look boring at first sight, but their content is rich. Another
option is to get the old Soviet books from the 1960s, which tend to be
forgotten gems. The Russians are the best at marrying theory with application.

