Unlearning descriptive statistics (debrouwere.org)
343 points by stdbrouw on Feb 1, 2017 | 92 comments



For readers who are OK with some math, I recommend John Myles White's eye-opening post about means, medians, and modes: http://www.johnmyleswhite.com/notebook/2013/03/22/modes-medi... He describes these summary descriptive stats in terms of what penalty function they minimize: mean minimizes L2, median minimizes L1, mode minimizes L0.

A single-number statistic is _going_ to leave things out, so if you must boil things down into one number, or even a few numbers, you're going to lose something that you had in the raw data. This is why I find claims along the lines of "statistics don't tell the whole story" a little bemusing - of course they don't, the very definition of a statistic is a summarization of data that is easier to work with. The question is what data is kept or lost, or more generally what importance we place on different aspects of the raw data such that it's reflected in our descriptive statistics.

The lessons for non-technical people who want to communicate with descriptive statistics are to recognize that summarization is inherent in the nature of any descriptive statistic, that they are thereby opinionated in some way in terms of what they've preserved and what they've left out, and to recognize whether those opinions are appropriate for your purpose.


Glad you enjoyed that post so much. It really is a shame that we do such a bad job of teaching students about the inherent subjectivity of descriptive statistics and let students leave their courses with dangerous ideas about the existence of a Holy Grail statistic that will solve all of their problems.


Your followup post (http://www.johnmyleswhite.com/notebook/2013/03/22/using-norm...) is excellent. Thank you!


Thanks! I really should have finished and written the post about the SVD as well. One of these days...


I'd love to read your post on SVDs once it's written


I agree with you, although I think the problem is a focus on procedures rather than principles in general.

It took me a long time to realize that a principled reason for gaussian parametric distributions is the maximum entropy principle. Prior to that point, it had been presented as essentially arbitrary, even by established professors.

There seems to be an assumption that theoretical statistics is "too hard", and as a result there's a middle ground that gets left out. I haven't taught general stats courses in a while (although I've taught advanced ones), but if I did, I might start with Bregman divergences and work down in the manner of your blog post.

I think there's an in-between that gets lost. You can teach principles without deriving long proofs of everything along the way.

Students don't get taught the underlying principles and philosophies to choose from, and I think this leads to the "holy grail" issues you're referring to.

It seems to be changing a bit with new interest in Bayesian methods, but that's just the tip of the iceberg.


The idea of misinterpreting metrics is a very general idea and is not specific to statistics.

Humans want to distill vast amounts of information to a more manageable amount, like for example a single number.

Equity analysts look at accounting metrics, psychologists look at psychometric test scores, doctors look at some function of blood pressure, etc.

Any person with deductive and sceptical mental faculties in place will recognize that these are all simplifications and cannot be used to deliver a unified truth.

Also, being aware of this has very little to do with being technical or not (for example, plenty of programmers only look at a CPU's clock speed to gauge performance).

Anyhow, nice post.


Thanks for writing it! It's one of my favorite math blog posts floating out there.


I always pull out Anscombe's Quartet https://en.wikipedia.org/wiki/Anscombe's_quartet The four datasets have the same mean, variance, and linear regression line, but are very different from one another.
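
To make the Anscombe point concrete, here is a minimal sketch in plain Python that computes the usual summaries for the four datasets. The data values are the ones published with the quartet; the helper name `summarize` is my own, and the summaries agree only to about two decimal places, not exactly.

```python
# Anscombe's quartet: four datasets with nearly identical summary statistics.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return means, sample variances and the least-squares line for one dataset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 2) for key, value in s.items()})
```

The printed rows look interchangeable, which is exactly why the plots are needed to tell the datasets apart.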


Great example, and mentioned in the article.


Statistics are reductionist!

Well yes, that is their purpose.


My statistics professor once told us that statistics are a shadow of the truth, not the actual truth itself.


What a caveman!


This is a great and oft-forgotten point. I like to think all summary numbers are lossy; you are only free to pick your poison.


Okay, so that leads to a very obvious (imo) question I haven't seen anyone ask:

What happens when you minimize Ln, with n > 2? Why don't we use any of those?


Increasing n gives increasing weight to outliers, to the point where L-infinity is just the single maximum value. I would guess this is less useful when attempting to understand how existing data can make conclusions about future aggregate/typical cases vs. analyzing specifically the outliers.


Clarification: The l-infinity norm of a vector is the maximum absolute value of its coordinates. The analogue of the median and the mean that it gives you is the midrange, i.e. the midpoint of the range.
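
A rough numerical illustration of the two answers above: the sketch below finds the value m that minimizes sum(|x_i - m|^p) for a few values of p. The function name `lp_center`, the ternary search, and the toy data are my own choices, not something from the article or the linked post. The search relies on the loss being convex for p >= 1.

```python
def lp_center(data, p, iterations=200):
    """Approximate the m minimizing sum(|x - m|**p) by ternary search (assumes p >= 1)."""
    def loss(m):
        return sum(abs(x - m) ** p for x in data)

    lo, hi = min(data), max(data)
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if loss(m1) < loss(m2):
            hi = m2  # the minimizer cannot lie to the right of m2
        else:
            lo = m1  # the minimizer cannot lie to the left of m1
    return (lo + hi) / 2


data = [0, 1, 2, 3, 10]  # toy data with one large value
centers = {p: lp_center(data, p) for p in (1, 2, 4, 50)}
midrange = (min(data) + max(data)) / 2

for p, center in centers.items():
    print(f"p={p}: center ~ {center:.3f}")
print("midrange:", midrange)
```

On this data the p=1 and p=2 results match the median and the mean, and the minimizer drifts toward the midrange as p grows, which is the "increasing weight to outliers" effect described above.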


I think the message of the article is great: move beyond the "standard" descriptions and pay more attention to what you're trying to show and who your audience is.

That said, it's a slight pet peeve of mine when people recommend the median over the mean to describe center. The median, on its own, does not describe what is "typical" any more than the mean does; it just has a small advantage in that it will always map a real observation, so for discrete data you don't end up with things like "1.9 legs." (That said, a mean of 1.9 legs actually seems much more informative to me than a median of 2 legs, so even in that case I prefer the mean.) It's easy to envision many situations where the median's representation is wildly inaccurate, just as you can imagine ways in which the mean can be misleading.

The median is also insensitive to skew, which is often cited as a good thing, but really that's something you should determine on a case-by-case basis. In many (most?) applications, there's no tangible benefit in having a measure of center that ignores skew. The median's insensitivity also creates strange situations where subgroups in the population end up unrepresented (e.g. if the poorest 20% become even poorer because of changes to the tax code, the median doesn't budge). In general, the median conveys remarkably little information. It's great at showing the center, but it gives zero indication of anything else. The mean certainly has its issues, but it's far from inferior. Use the median when the situation calls for it, certainly, but realize that it's just as limited as any other measure, if not more so.
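
The tax-code example can be sketched with made-up numbers. The incomes below are hypothetical (in thousands), and the 30% drop is an arbitrary assumption chosen only to show the effect.

```python
from statistics import mean, median

# Hypothetical household incomes, in thousands, sorted for readability.
incomes_before = [10, 12, 30, 35, 40, 45, 50, 60, 80, 120]

# Assumed scenario: the poorest 20% (two households) lose 30% of their income.
incomes_after = [7, 8.4] + incomes_before[2:]

print("median before/after:", median(incomes_before), median(incomes_after))
print("mean before/after:  ", mean(incomes_before), mean(incomes_after))
```

The median is identical in both cases because the two middle households did not change, while the mean moves. Whether that insensitivity is a feature or a defect depends on the question being asked.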


> when people recommend the median over the mean to describe center. The median, on its own, does not describe what is "typical" any more than the mean does; it just has a small advantage

I think this swings to the other extreme in selling the 'median' short. As long as we agree that it is only strictly meaningful to talk about the 'center' for symmetric distributions, median does a fine job.

In fact, in many realistic scenarios it does a far better job than the mean. The main trouble is that the normal or Gaussian distribution is nowhere close to being as ubiquitous as it seems, nor is the CLT as universal as it is made out to be. Gauss sort of got away with it: he discovered neither the distribution nor the associated CLT.

Many real data of day-to-day consequence have heavy tails, and the mean is a pathetic measure of 'central tendency' for these. The mean is particularly sensitive to outliers. The median does significantly better than the mean in this non-academic situation. One could do better than the median for symmetric heavy-tailed data (for example with trimmed means), but it's the mean that I find guilty of entirely disproportionate fame. If it's needed to exaggerate the median a bit to get people to grow up beyond the pervasive Normal / Gaussian fetish, I am all for it.

No single number is going to characterize what is 'typical'. One really needs the CDF here, and yes avoid estimating densities as much as possible.

BTW, it is not true that the median always maps to a real observation; you only get a 50% chance of that.


The median is worse than the mean in a skewed distribution if you want to take the skew into account. In fact one of the strengths of the mean is that it is sensitive to changes in the entire sample, rather than only part of it. To repeat my previous example, if the lowest 20% of household incomes drop because of changes to the tax code, that's something I'd generally want reflected in my "household income" statistic. We can't get too hung up on the semantics of central tendency here--there are many good ways to measure center, and it's clear that all of these statistics actually measure very different things even though they're all grouped into the same category of statistic.

When people say the median is "better" than the mean, they often can't explain exactly why (other than to cite skew, which I've addressed). It's really just a gut feeling based on how they felt when they first learned that the median was "more realistic" for some specific set of data that was skewed. They never thought through exactly why or when you'd want to disregard the skew in the first place. I would be insane to argue that the median is never the right statistic to use; my point is merely that it shouldn't be given preference over the mean, and that people need to think about what they're actually trying to show.

The mean is not only useful for the normal distribution, and since the article specifically avoids jumping into inference territory, talking about specific distributions at all is kind of putting the cart before the horse. The "disproportionate fame" of the mean is inherited from its extreme importance throughout all of statistical inference. It's true that if you ignore inference, the mean isn't particularly important as a descriptive statistic, but my argument is that it's certainly no worse than the median.


>one of the strengths of the mean is that it is sensitive to changes in the entire sample, rather than only part of it.

Yes but this is exactly what is bad with it. At least for the majority of practical applications that I can think of. For instance, if you want to measure the "average" income of the population. If there are a handful of uber-rich people, they will really raise the mean. Even if the vast majority of people make much less than the mean. The median will do a much better job of telling you what a "typical" person makes.

I really believe that when most people read or talk about "average", they mentally interpret it as the median. And that the median is generally less misleading.

Sure, there are applications where sensitivity to changes at the tails is important. But even then, mean is misleading and difficult to interpret. In your earlier example, if you raised the income of the bottom 10% by 100%, then just use that as a statistic. "The poorest 10% of people now make twice as much money" is much more informative and impressive than "the average income increased by 0.1%."


> When people say the median is "better" than the mean, they often can't explain exactly why (other than to cite skew, which I've addressed).

Well, 40 to 50 years of literature on robust statistics happens to disagree with the claim that 'people' don't know what they mean (pun unintended) when they say the median is better. Furthermore, skew has less to do with that argument than heavy tails. Heavy tails are extremely (OK, sorry, now it's an insider pun) common.


I agree with the article in that if you want to take the skew into account, you look at the histogram. As a single number, I don't think the mean tells you anything more about the skew/tail of the data than the median does.


It's a better than 50% chance - if the sample size is odd, you always get a real observation; if the sample size is even, you may still get a real observation (if the two median observations are equal).


You are indeed correct, I was incorrectly assuming a continuous density, in which case the probability of a tie would have been zero.


But I think the author was careful to sprinkle caveats so as to avoid universal recommendations. Rather, it takes on a few common abuses.

About the mean vs. median, it might be true that "mean" is appropriate in just as many contexts as "median", but at least in my opinion, people cite a "mean" when a "median" would have been appropriate more frequently than the reverse. There are times when neither is appropriate, but if you're using median, you're more likely to understand the merits of different contexts.


I agree with the article as a whole. I just personally believe that the median is generally worse to use than the mean, so I believe that if you're going to recommend the median over the mean, you should provide guidelines on when to do so. I know not everyone agrees with me on this (as evidenced by some responses in these comments) but that's the perspective I'm coming from.

> it might be true that "mean" is appropriate in just as many contexts as "median", but at least in my opinion, people cite a "mean" when a "median" would have been appropriate more frequently than the reverse.

Sure, but this is only true because the mean is used so much more. If the whole world was instead taught in grade school to use nothing but the median, there would be just as many (I'd argue more) misuses as there are now, except this time the median would be the offender. People using the median are more likely to be using it appropriately because it's the nonstandard option, but that's only because they're not using it blindly. If people started using the median just because, it would suffer the same issue.


>The median [...] just has a small advantage in that it will always map a real observation, so for discrete data you don't end up with things like "1.9 legs."

If you listen to someone like Taleb, the main advantage they mention for the median is that it's a more robust statistic. For fat-tailed distributions the average can jump all over as new data come in.
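
A small simulation can illustrate the robustness claim. This is only a sketch under stated assumptions: the Pareto shape parameter 1.1, the seed, the sample size, and the helper name `largest_jumps` are all arbitrary choices of mine, used to produce a fat-tailed stream of observations.

```python
import bisect
import random


def largest_jumps(samples, burn_in=100):
    """Largest one-step change of the running mean and running median after burn_in samples."""
    ordered = []  # kept sorted so the running median is cheap to read off
    total = 0.0
    prev_mean = prev_median = None
    max_mean_jump = max_median_jump = 0.0
    for n, x in enumerate(samples, start=1):
        bisect.insort(ordered, x)
        total += x
        run_mean = total / n
        mid = n // 2
        run_median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
        if n > burn_in and prev_mean is not None:
            max_mean_jump = max(max_mean_jump, abs(run_mean - prev_mean))
            max_median_jump = max(max_median_jump, abs(run_median - prev_median))
        prev_mean, prev_median = run_mean, run_median
    return max_mean_jump, max_median_jump


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.paretovariate(1.1) for _ in range(5000)]
mean_jump, median_jump = largest_jumps(samples)
print("largest jump in running mean:  ", mean_jump)
print("largest jump in running median:", median_jump)
```

With a tail this heavy, a single new observation can move the running mean noticeably even after thousands of samples, while the running median moves only by the small gaps between neighbouring observations.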


Whether ignoring the tails is an advantage or not depends very much on the situation.

For example, geometric mean of returns is the right number to look at for stocks because you are indeed exposed to the tails.
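
A two-period example shows why the geometric mean is the relevant figure for compounding returns. The returns are hypothetical; `statistics.geometric_mean` requires Python 3.8 or later.

```python
from statistics import geometric_mean, mean

yearly_returns = [0.50, -0.50]  # hypothetical: +50% one year, -50% the next
growth_factors = [1 + r for r in yearly_returns]

arithmetic = mean(yearly_returns)
geometric = geometric_mean(growth_factors) - 1
final_value = 100 * growth_factors[0] * growth_factors[1]

print(f"arithmetic mean return: {arithmetic:+.1%}")
print(f"geometric mean return:  {geometric:+.1%}")
print(f"100 invested ends at:   {final_value:.2f}")
```

The arithmetic mean suggests break-even, but the investment has lost a quarter of its value, and compounding the geometric mean over two periods reproduces that loss.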


> In general, the median conveys remarkably little information. It's great at showing the center, but it gives zero indication of anything else.

The median also has zero ability to make me coffee in the mornings, but I don't think I can hold that against it.

I can imagine situations where getting people to track or listen to even a single number is tough and using the mean as a measure of both central tendency and the stability of the distribution over time might be the least worst option. But is that really a common problem? Have you often encountered situations where it was impossible to communicate something like "the typical customer buys five widgets but more than 1 in 4 of our customers only buy one" because it contains two statistics and management insists on being briefed with just one?


Clearly the best approach is to directly answer the question at hand with the most relevant numbers/statistics available. I'm all for that. I'd argue you'd still want to use the mean more than the median, but it's not really important at that point because you're painting the most complete picture you can.

There are times that, for whatever reason, someone is only presenting one statistic. News headlines are a big one. In this case, you clearly want to pick the "best" statistic available. There are people who think that the median is categorically better than the mean and dispense advice as such. (You can find them saying things like "the mean is worthless, the median would be much better" in the comments of discussion boards.) That's not what this article did, but the author did imply (in its title and language, if nothing else) that the median is better than the mean without offering any sort of weighing mechanism for which to choose. That's what I'm responding to, because it perpetuates the median > mean myth that's prevalent among certain groups of people.


Is it possible to sensibly generalize the concept of median to "higher orders"?

E.g., I can imagine the difference between the 25th and 75th percentile to be descriptive of spread (like standard deviation), but those numbers seem arbitrary.
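
Two quantile-based measures of spread that are commonly used for this are the interquartile range (the 25th-to-75th percentile difference mentioned above) and the median absolute deviation. The sketch below is one way to compute them with the standard library; the function names and the data are my own, and `statistics.quantiles` uses its default interpolation method here.

```python
from statistics import median, quantiles, stdev


def iqr(data):
    """Interquartile range: distance between the first and third quartiles."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1


def mad(data):
    """Median absolute deviation: the median distance from the median."""
    center = median(data)
    return median([abs(x - center) for x in data])


clean = list(range(1, 12))             # 1, 2, ..., 11
dirty = list(range(1, 11)) + [1000]    # same data with one wild value

for name, data in (("clean", clean), ("dirty", dirty)):
    print(name, "IQR:", iqr(data), "MAD:", mad(data), "stdev:", round(stdev(data), 2))
```

Both measures ignore the single wild value, whereas the standard deviation is dominated by it. The choice of the 25th and 75th percentiles is a convention rather than a necessity.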



"""The median, on its own, does not describe what is "typical" any more than the mean does; it just has a small advantage in that it will always map a real observation"""

Only if you have an odd number of data points or the two middling data points have the same value. Example: I have four people with these sizes in cm: 120, 160, 180, 200...the median would be (160+180)/2=170 which is the size of none of the people.


> The median [..] just has a small advantage in that it will always map a real observation

Not if the number of observations is even and the two "middle" observations differ.


Yes that's true. I was being somewhat charitable in affording it that benefit, but I did so because the article is the one that brought up the 1.9 legs example so that was kind of an implied benefit. And it's also true that you can use a "median" that picks one of the two, instead of splitting them, without any real downside if you want to preserve the feasibility of the statistic.
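
Python's standard library happens to offer both conventions, which makes the four-person example above easy to check. The variable name is mine; the three functions are part of the `statistics` module.

```python
from statistics import median, median_high, median_low

heights_cm = [120, 160, 180, 200]

print(median(heights_cm))       # averages the two middle values
print(median_low(heights_cm))   # picks the lower of the two middle values
print(median_high(heights_cm))  # picks the higher of the two middle values
```

Only the "low" and "high" variants are guaranteed to return a value that actually occurs in the data.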


I think the idea is to move away from worshiping summary statistics such as mean or median and to start looking at differences between distributions.


Great points. Discrete data with few categories are a good case where the median might not cut it. As a contrived example..

data: 0 0 0 1 1

mean: 2 / 5

median: 0


With a small number of categories a full summary of the data easily fits into text. It might still be desirable to report a summary statistic, but omitting the totals for 2 or 3 categories and reporting a summary statistic is pretty sloppy.


> With a small number of categories a full summary of the data easily fits into text.

That's fair. In my experience, summary measures are often calculated at many levels of an analysis, not just the final reporting.


It's worth noting that most statistical packages include the ability to generate a five-number summary (https://en.m.wikipedia.org/wiki/Five-number_summary), which generates many of the described skew-independent descriptive statistics automatically. (R, for example, will generate a five-number summary for each feature of a data frame with a simple summary(df) )
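
For readers without R at hand, a five-number summary can be sketched with the Python standard library. The function name is hypothetical and the data is made up. The quartiles use the "inclusive" method of `statistics.quantiles`; R's `fivenum()` uses Tukey's hinges, so results can differ slightly between the two.

```python
from statistics import quantiles


def five_number_summary(data):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, q2, q3, max(data)


values = [1, 2, 2, 3, 3, 4, 5, 8, 40]  # made-up, right-skewed data
summary = five_number_summary(values)
print(dict(zip(("min", "q1", "median", "q3", "max"), summary)))
```

The gap between the third quartile and the maximum is much larger than the gap between the minimum and the first quartile, which is how the summary hints at skew without a plot.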

The TL;DR is to just plot everything if possible to visualize skew explicitly (or tricks like the Quartet), and there are a number of tools which can do that without much LOC overhead (e.g. ggplot2).


Yeah, this is a narrative challenge in reporting. Often the skewness of a data set is a(n important) piece of background context, and figuring out how to get that detail into a story without being like AND HERE IS A BIG CHART WHICH YOU WILL NEED HELP INTERPRETING is non-trivial.


Pandas has a .describe() function on dataframes too.


you can also use the aptly named fivenum() func in R :)


summary(df) will do that for every column of a dataframe with less code... which is why few people use fivenum()

The majority of defaults in R exist because statisticians do these things all the time and don't like typing any more than programmers. Time spent typing could be spent thinking about what assumptions are being violated and whether the results offer any useful insight.


I don't feel particularly convinced by this article; I feel like it misses the mark by going beyond a reasonable standard. The article is phrased as if typical descriptive statistics are not useful, but the more important and more reasonable point is to understand when to use which statistics.

In fact, many of his examples of what to use are just other descriptive statistics, median is no less a descriptive statistic than the mean is.

I think the problem is not so much that descriptive statistics are bad, so much as they are not particularly useful when they lack context. Unfortunately, statistics is a much more complex field than your average product designer doing an A/B test thinks. The majority of statistics and models the average person utilizes has assumptions and qualifications that they don't fully understand, and frankly that's not a slight to them but rather just the additional depth of the field.

Just as an example, ask the average person to derive the degrees of freedom for their t-test without just utilizing the formula they were taught.

As in most fields this is an issue of nuance, not that beginner techniques are bad, but beginner techniques alone are not nuanced enough to capture useful or rigorous insights.


"In fact, many of his examples of what to use are just other descriptive statistics, median is no less a descriptive statistic than the mean is." I think the author would fully agree with this. He is saying we should use more easily understandable descriptive statistics.

"I think the problem is not so much that descriptive statistics are bad, so much as they are not particularly useful when they lack context. Unfortunately, statistics is a much more complex field than your average product designer doing an A/B test thinks." Whatever you say publicly is likely to be stripped of its context and to be received by an audience that does not understand the complexities of your field. The answer to this is to change what you say, not to change the world, which verges on impossible.


This is terrific.

In the field of web ops, I've had great success tracking the 95th percentile of request time rather than the mean, median, mode, or any other descriptive metric.

The systems I worked with were, like many systems, ordinarily very efficient. That meant the mean and median metrics washed away the occasional troublesome request and hid it from the metrics dashboard.

But knowing the 95th percentile was out of bounds allowed my team to investigate and try to remediate the trouble spots.
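
A toy illustration of how a high percentile exposes what the central measures hide. The latency numbers are invented, and `percentile` is a hypothetical helper using the nearest-rank definition, which is only one of several percentile conventions.

```python
import math
from statistics import mean, median


def percentile(data, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(data)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented sample: 94% of requests are fast, 6% are very slow.
latencies_ms = [20] * 940 + [2000] * 60

print("median:", median(latencies_ms))
print("mean:  ", mean(latencies_ms))
print("p95:   ", percentile(latencies_ms, 95))
```

The median reports nothing unusual and the mean only hints at a problem, while the 95th percentile lands directly on the slow requests.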

This was especially useful in troubleshooting fax delivery (yeah, I know, I know: it was healthcare stuff; fax is considered secure and email is not) through unbelievably flaky private branch exchanges (big city hospitals).


Gil Tene has an entire talk about this point called "How NOT to measure latency" (https://www.youtube.com/watch?v=lJ8ydIuPFeU). In many circumstances, a web service (either server or client) has large numbers of dependencies, so a user only has to hit 95th-percentile latency for one dependency to have their overall latency significantly hurt. In other words, users may encounter the "rare" bad case on a majority of requests. Thus, even higher percentiles should be tightly controlled in order to truly keep a web service's overall latency under control.
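
The arithmetic behind the "majority of requests" remark can be sketched as follows, under the simplifying assumption (mine, not necessarily the talk's) that each dependency's latency is independent of the others. The function name is hypothetical.

```python
def chance_of_slow_request(dependencies, percentile=0.95):
    """Probability that at least one of several independent calls exceeds the given percentile."""
    return 1 - percentile ** dependencies


for count in (1, 5, 14, 50):
    print(f"{count:>3} dependencies: {chance_of_slow_request(count):.1%} of requests hit a p95+ call")
```

Under that assumption, a request touching 14 dependencies is already more likely than not to include at least one call slower than that dependency's 95th percentile.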


Sounds fine to me: fits in with 'management by exception' - look for the extreme cases.

Standard in the industrial quality control world (2*sd above or below).


To answer the author's postscript: "why did nobody tell me this?"

Because you didn't pay attention at school? Means, medians, modes and percentiles were standard fare in my high school, and again in freshman university courses. And I didn't go to particularly expensive schools or elite universities. Then again, I also taught intro stat courses to BA students for a few years, and it was easy to tell that the vast majority of them cared very little, since understanding their uses wasn't exam material.


I think the author is trying to point out that mean, median, and mode are things that people think they understand because they know how to calculate them - any one of them is trivial for the majority of the high school educated population (14-18 years old). However, just because you can calculate something doesn't mean you know how to properly apply it.

When you taught your intro stats class, did you try to give the students guidance (and convey understanding) as to when and why they should apply one of the statistics? If so, good on you, but in my obviously anecdotal experience, that has not been the case.


I would like to advertise again my favourite measurement of skewness, the medcouple:

https://en.wikipedia.org/wiki/Medcouple

It's robust (25% breakdown point) and is kind of the "optimal" measurement of skewness, as it's computed as the median of all possible interquartile range-like statistics.
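
For intuition, here is a naive O(n^2) sketch of the medcouple. It is not the fast algorithm, and it deliberately refuses inputs where an observation equals the median, because the full definition needs a special kernel for those ties that this sketch leaves out. The function name is my own.

```python
from statistics import median


def naive_medcouple(data):
    """Median of the kernel h(hi, lo) over all pairs with lo < median < hi."""
    m = median(data)
    if any(x == m for x in data):
        raise ValueError("this sketch does not handle observations equal to the median")
    upper = [x for x in data if x > m]
    lower = [x for x in data if x < m]
    # Each kernel value compares how far hi is above the median with how far lo is below it.
    kernel = [((hi - m) - (m - lo)) / (hi - lo) for hi in upper for lo in lower]
    return median(kernel)


symmetric = [1, 2, 3, 4, 5, 6]
right_skewed = [1, 2, 3, 4, 10, 50]
print("symmetric:   ", naive_medcouple(symmetric))
print("right-skewed:", naive_medcouple(right_skewed))
```

The result lies between -1 and 1, is zero for symmetric data, and is positive when the upper half of the data is more stretched out than the lower half.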

I hope someone implements the faster algorithm for Python's statsmodels.


To be fair the article should not be called "Unlearning descriptive statistics", but "Learning just a bit more (and often better) descriptive statistics", but granted, it's not that snappy.

Certainly better than "I took a statistics course, and you won't believe what happened next!!!1!!"


Agreed, but I've got to say from experience that quantile statistics like the author proposes are often way more informative. The only downside is that they can be considerably more expensive to compute for bigger data sets, whereas mean and standard deviation are cheap and can be updated without taking another pass over the data.
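
The "updated without another pass" property of the mean and standard deviation can be sketched with Welford's online algorithm. The class name and the sample values are my own; exact quantiles, by contrast, generally need the data to be kept or approximated with a sketch structure.

```python
from statistics import mean, stdev


class RunningStats:
    """Single-pass mean and sample standard deviation (Welford's algorithm)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stdev(self):
        if self.n < 2:
            return float("nan")
        return (self._m2 / (self.n - 1)) ** 0.5


values = [2, 4, 4, 4, 5, 5, 7, 9]
stats = RunningStats()
for value in values:
    stats.update(value)  # constant work and memory per observation

print(stats.mean, stats.stdev)
print(mean(values), stdev(values))  # two-pass reference from the standard library
```

Each update touches only three stored numbers, so the summary can be maintained over a stream of any length.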


On a somewhat related note, I think it's interesting that there is data on how people perceive information but that isn't really taught in descriptive statistics (at least it wasn't at my university). I feel like data representation and charts/graphs should be more of a focus if the title of the class is descriptive statistics. I hadn't even heard about "How to Lie with Statistics" or "The Visual Display of Quantitative Information" before randomly finding them mentioned somewhere. I simply learned about some standard graphs without much explanation on why you use them or how they impact the understanding of the audience.


For anyone who is interested in correlation measures, have a look at distance correlation:

https://en.wikipedia.org/wiki/Distance_correlation http://projecteuclid.org/euclid.aoas/1267453933

It's implemented in the R package "energy", and provides a more comprehensive measure of variable dependence than Spearman's rank correlation. Distance correlation is able to detect any kind of dependence, not just monotonic relationships.
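
To show what "any kind of dependence" means in practice, here is a naive pure-Python sketch of the sample distance correlation for one-dimensional data. It follows the usual double-centering recipe, costs O(n^2) time and memory, and is not the implementation from the R package. All function names are my own.

```python
def _centered_distances(values):
    """Pairwise absolute distances, double-centered by row, column and grand means."""
    n = len(values)
    dist = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in dist]  # equal to column means by symmetry
    grand_mean = sum(row_means) / n
    return [[dist[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
            for i in range(n)]


def _mean_product(a, b):
    n = len(a)
    return sum(a[i][j] * b[i][j] for i in range(n) for j in range(n)) / (n * n)


def distance_correlation(xs, ys):
    a, b = _centered_distances(xs), _centered_distances(ys)
    dvar_product = _mean_product(a, a) * _mean_product(b, b)
    if dvar_product == 0:
        return 0.0  # one of the variables is constant
    return (max(_mean_product(a, b), 0.0) / dvar_product ** 0.5) ** 0.5


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # fully determined by x, but not monotonic
print("Pearson:             ", pearson(xs, ys))
print("distance correlation:", distance_correlation(xs, ys))
```

For this parabola the Pearson coefficient is zero, while the distance correlation is clearly positive. Small samples bias the estimate upward, so the exact value should not be over-interpreted.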


I think that the article is unfair to statisticians in saying that they ignore descriptive statistics. The problem is that one cannot choose systematically between different descriptive statistics without clearly specifying our objectives and what we believe about the data-generating process. Otherwise on what grounds can you say that the median is 'better' than the mean or the other way round? But once you specify these things you are doing statistical estimation, which is a major branch of mathematical statistics.


This is a great read. I've been thinking a lot about these basic concepts myself lately, especially the idea of central tendency.

Essentially, we can make up any method we want to summarize data and give us a single value that represents the central location of the data i.e. mean vs. least absolute distance vs. distance squared, etc.

I haven't thought much about the difference between "typical case" or "expected value" so that's a very useful distinction to be made, especially when deciding which method you want to pick.


One good way to think about it is to ask what norm ("distance") between all the datapoints and your "central tendency" statistic you want to minimize, as the article alluded to.

For L2 (squared distance), you get the mean: For fixed x_i,

sum of i=1..N of (x_i - M)^2

is smallest for M = mean.

For L1 (absolute distance), you get the median.

For L0 ("identical or not"), you get the mode.

You can come up with other and alternative concepts, but it's kind of neat that the 3 most common descriptions of central tendency pop out of this one unified approach.
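
The three claims above can be checked by brute force on a small example. The data and the candidate grid are arbitrary choices of mine, and the grid search is only a sanity check, not a proof.

```python
from statistics import mean, median, mode

data = [1, 2, 2, 3, 10]
candidates = [i / 10 for i in range(0, 101)]  # 0.0, 0.1, ..., 10.0


def l2_loss(m):
    return sum((x - m) ** 2 for x in data)  # squared distance


def l1_loss(m):
    return sum(abs(x - m) for x in data)    # absolute distance


def l0_loss(m):
    return sum(x != m for x in data)        # number of points that differ from m


best_l2 = min(candidates, key=l2_loss)
best_l1 = min(candidates, key=l1_loss)
best_l0 = min(candidates, key=l0_loss)

print("L2 minimizer:", best_l2, "mean:  ", mean(data))
print("L1 minimizer:", best_l1, "median:", median(data))
print("L0 minimizer:", best_l0, "mode:  ", mode(data))
```

On this data each grid minimizer coincides with the corresponding textbook statistic.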


Variance and standard deviation get a lot of hate here that I don't think they deserve. They really do have nice properties, generally nicer than the MAD. Squaring feels to me much less of a hack than the absolute value. It is certainly true that most people don't quite understand them, but that's as much a problem with the people as with the method.


I wrote an article for the ACM Queue (also published in CACM) about this very issue called "Statistics for Engineers"

http://queue.acm.org/detail.cfm?id=2903468


I took a "Statistics for Engineers" course while at university. It was mostly focused around ANOVA and hypothesis testing. By the end of the course I was sick of hearing the phrase "Null hypothesis". I wish the course had introduced some more practical applications; instead, it was very high level and theoretical.

Working as an engineer now (Materials Engineering) here is a short list from the top of my head of what I feel are useful statistics applications that were not covered in uni.

- Alternative probability distributions (Rosin-Rammler)
- Harmonic and geometric means
- Multivariate regression
- Statistical sampling techniques (how to obtain a representative sample and avoid sampling bias)
- Statistical process control
- Time series analysis


Thanks for your list.

In the IT Operations / Monitoring domain, time series models play a large role. At the same time, the level of sophistication is rather low. You definitely have to start with the basics when teaching such a course.


> But why do you want a number at all?

Because lots of data analysis questions hinge upon the association between two data sets, and it's nice (crucial) to be able to quantify this value. Especially because

> While statisticians are generally quite good at estimating a correlation from a picture and vice versa, most people are not.

The author says

> Still not happy and absolutely want a number? You would do well to shun correlations even so.

OK, so what else do you suggest?


> OK, so what else do you suggest?

Slopes, a.k.a. the parameters in a regression analysis, ideally as a confidence interval or prediction interval to account for uncertainty. Interpreting and communicating regression analyses is a pretty big topic on its own so I chose to only hint at it, though I understand that might not be very satisfying for some readers.
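
As a rough sketch of what reporting a slope with an interval could look like, the code below fits a least-squares line and attaches an approximate 95% confidence interval. The function name and the data are made up, and the interval uses a normal quantile as a large-sample approximation; for a sample this small a t quantile would be more appropriate.

```python
from statistics import NormalDist


def slope_with_interval(xs, ys, confidence=0.95):
    """Least-squares slope with an approximate (normal-theory) confidence interval."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    std_err = (residual_ss / (n - 2) / sxx) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return slope, slope - z * std_err, slope + z * std_err


# Made-up data: a true slope of 2 plus a small alternating disturbance.
xs = list(range(20))
ys = [1 + 2 * x + (0.5 if x % 2 == 0 else -0.5) for x in xs]

slope, low, high = slope_with_interval(xs, ys)
print(f"each extra unit of x goes with about {slope:.2f} more y "
      f"(approx. 95% interval {low:.2f} to {high:.2f})")
```

Unlike a correlation coefficient, the slope is expressed in the units of the data, which tends to make the sentence at the end easier to communicate.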


1) it is pretty amazing that normit transformations (map the quantiles of a non-normal distribution onto a Gaussian and use that) don't seem to be on this guy's radar. We use distributions with linearly additive and affine invariant properties (normal plus normal is normal, bernoulli plus bernoulli is bitwise bernoulli) because we find linear algebra very useful. Nonparametric tests and procedures erode your power; normit transformations usually increase it. I did part of my dissertation on this; it's partly to do with asymptotics, but also partly due to the robustness of Gaussian error assumptions thanks to the CLT.
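
For readers who have not met the transformation, here is a minimal sketch of a rank-based normit (inverse normal) transform. The function name is mine, the (rank - 0.5) / n offset is one common convention among several, and ties are broken by position instead of being averaged, so treat it as an illustration for distinct values only.

```python
from statistics import NormalDist


def normit(data):
    """Map each value to the standard-normal quantile of its rank."""
    n = len(data)
    order = sorted(range(n), key=lambda i: data[i])  # indices from smallest to largest
    standard_normal = NormalDist()
    transformed = [0.0] * n
    for rank, index in enumerate(order, start=1):
        transformed[index] = standard_normal.inv_cdf((rank - 0.5) / n)
    return transformed


skewed = [3, 1000, 1, 8, 2]  # made-up data with one extreme value
z = normit(skewed)
print([round(value, 3) for value in z])
```

The ordering of the observations is preserved, but the extreme value ends up only as far from the center as its rank warrants.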

I realized recently that a lot of the trouble people have with training neural networks stems from their lack of training in basic model evaluation. If you stack a bunch of shitty penalized regressions (which is what linear/logistic + ReLU hinge loss represents) you now have one gigantic shitty regression which is harder to debug. If your early steps are thrown out of whack by outliers, your later steps will be too. Dropout is an attempt to remedy this, but you tend to lose power when you shrink your dataset or model, so (per usual) there really is no such thing as a free lunch. But most of the tradeoffs make more sense when you are able to evaluate each layer as a predictor/filter. Scaling this up to deep models is hard, therefore debugging deep models is hard. Not exactly a big leap.

There is a reason people say "an expert is a master of the fundamentals". Building a castle on a swamp gives poor results. If you can't design an experiment to test your model and its assumptions, your model will suck. This is not rocket surgery. A GPU allows you to make more mistakes, faster, if that's what you want. If you have the fundamentals nailed down, and enough data to avoid overfitting, then nonlinear approaches can be incredibly powerful.

Most of the time a simple logistic regression will offer 80-90% of the power of a DNN, a kernel regression will offer 80-90% of the power of a CNN, and an HMM or Kalman filter will offer 80-90% of the power of an RNN. It's when you need that 10-20% "extra" to compete, and have the data to do it, that deeper or trickier architectures help.

If you can transform a bunch of correlated data so that it is 1) decorrelated and 2) close enough to Gaussian for government work, you suddenly get a tremendous amount of power from linear algebra and differential geometry "for free". This is one reason why Bayesian and graphical hierarchical mixed models work well -- you borrow information when you don't have enough to feed the model, and if you have some domain expertise, this allows you to keep the model from making stupid or impossible predictions.

Anyways. I have had fun lately playing with various deep, recurrent, and adversarial architectures. I don't mean to imply they aren't tremendously powerful in the right hands. But so is a Hole Hawg. Don't use a Hole Hawg when a paper punch is all you really need.

2) What (good) statisticians excel at is catching faulty assumptions. (I'll leave it to the reader to decide whether this data-scientist-for-hire has done a good job of that in his piece.) So we plot our data, marginally or via projections, all the damned time. If you don't, sooner or later it will bite you in the ass, and then you too can join the ranks of the always-plotting. However, choosing which margins or conditional distributions to plot in a high-dimensional or sparse dataset is important to avoid wasting a lot of time. So whether via transformation or penalization (e.g. graphical lasso) or both, we usually try to prune things down and home in on "the good stuff". Prioritizing what to do first is most easily done if you have a number and can rank the putative significance by that number. Use Spearman, use energy statistics (distance correlation), use marginal score tests -- IDGAF, just use these as guidelines and plot the damned data.

Corollary: if someone shows you fancy plots and never simple ones containing clouds of individual data points, they're probably full of shit. Boxplots should be beeswarms, loess plots should have scatterplots (smoothed or otherwise) behind them. And for god's sake plot your residuals, either implicitly or explicitly.

3) see above. The author is good at fussing, and brings up some classical points. But they're not really his points. Median and MAD are more robust to outliers than mean and standard deviation, but that makes them less sensitive, too. Check your assumptions, plot everything, use the numbers as advisory quantities rather than final results.
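
To make the robustness point concrete, here is a minimal sketch (made-up numbers, standard library only) that compares mean/SD against median/MAD before and after one wild value is appended. The `mad` helper and the sample data are illustrative assumptions, not anything from the article.

```python
import statistics


def mad(values):
    """Unscaled median absolute deviation from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def summarize(values):
    """Return the four summary numbers being compared."""
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "median": statistics.median(values),
        "mad": mad(values),
    }


# Hypothetical measurements, then the same data with one gross outlier.
clean = [9.0, 10.0, 10.0, 11.0, 12.0, 10.0, 9.0, 11.0]
dirty = clean + [1000.0]

clean_summary = summarize(clean)
dirty_summary = summarize(dirty)

for name in ("mean", "sd", "median", "mad"):
    print(name, clean_summary[name], dirty_summary[name])
```

The mean and SD jump by orders of magnitude while the median and MAD do not move at all. That same indifference is the loss of sensitivity mentioned above: the resistant pair ignores the extreme value whether it is noise or signal.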


> it is pretty amazing that normit transformations (map the quantiles of a non-normal distribution onto a Gaussian and use that) don't seem to be on this guy's radar

I don't mean to offend, but this is the PhD ur-response, "you didn't mention my pet theory!" :-)

You've given me some interesting stuff to chew on but I very specifically wanted to write about descriptive statistics as a way to describe data, not as a way to summarize it for computers so it can be used in inference. Mapping non-normal distributions onto a Gaussian ain't gonna cut it for that purpose, and to the extent that I care about robustness in this context it's not robustness of inference but whether a descriptive continues to provide a reasonable description of the data for human consumption in the face of outliers etc.


re: normit/inverse transformation: Pet theory? This isn't a theory, it's a simple inverse transformation. It's used all the time. George Box showed how to do a 2D version of it in the 1950s. Normit may be a pet name for it, though. It's just a play on words (probit, logit, expit... normit).

As far as describing data, what's wrong with median + IQR for marginal distributions, or some flavor of energy statistic for joint distributions? You will always need to trade off robustness for sensitivity and bias for variance. That's simply a mathematical feature of the universe. There are plenty of ways to take advantage of this to highlight outliers, for example, which often gets you thinking in terms of "hey, this really looks more like a mixture of two completely different distributions" and seeing if that intuition holds up.

The whole point of describing data with summary statistics is that if the assumptions are met this decouples them from the underlying data. If you use the median as an estimator for the center of your distribution and the MAD as an estimator of its scale, you may choose dimensions along which it's not very good at partitioning your observations. If you want a resistant way to describe the expected center of your data, the median and MAD are very useful. Sometimes it's even more useful to plot everything and point out "our results fall into K obvious clusters" each of which will have their own center & spread.

What I'm saying is that there's no silver bullet. Most of the time we take descriptions of the data, see if there are interesting inferences to be drawn, lather, rinse, repeat. "Get a lot more samples" is often thrown into the mix. Are there strong clusters in the data? (Usually a plot will show this, whether via projection or in the raw data.) Are there continuums that are interesting in relationship to things we care about? (Usually we'll turn around and model their relationship to said thing-we-care-about, conditioned upon a bunch of other items... multivariate regression, which if you're doing it right, will get you plotting the residuals, themselves descriptive of the model fit.)

You simply can't do responsible statistical inference without exploring your data to see what's going on. In order to explore complex datasets, there are plenty of techniques, and most all of them demand tradeoffs (see MAD vs. SD or other metrics of "interestingness" for clustering). A number of descriptive statistics ("extremality" for example) rely upon limit behavior of specific distributions and are case-by-case.

I don't think you'll find many silver bullets for either descriptive or inferential statistics. You have to choose your tradeoffs based on what you want to accomplish.
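
As one hedged illustration of "median + IQR" being used to highlight outliers: the sketch below uses Tukey-style fences (1.5 × IQR beyond the quartiles). That specific rule is my choice for the example and is not something prescribed above; the data and the `describe` helper are made up.

```python
import statistics


def describe(values, k=1.5):
    """Median, IQR, and the points falling outside Tukey-style fences.

    The multiplier k=1.5 is a conventional default, used here as an assumption.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < low or v > high]
    return statistics.median(values), iqr, outliers


# Hypothetical sample with one value that clearly does not belong.
sample = [9.0, 10.0, 10.0, 11.0, 12.0, 10.0, 9.0, 11.0, 1000.0]
center, spread, flagged = describe(sample)
print(center, spread, flagged)
```

Flagged points are a prompt to go look at the data (is this a second distribution?), not a license to delete them.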


Several great points (normit is the basis for the Gaussian copula, which was used to great effect to model the CDOs (collateralised debt obligations) that blew up in the GFC (global financial crisis)); but it would have been possible to raise them while being less dismissive...


Yeah, sorry about that. By the time I realized the way the tone had come off, I had managed to "noprocrast" myself off the site. I need to write better hot takes.


Please do. That was a lot of interesting material, and the tone was unfortunate.


I largely agree with what you are saying but estimating the population transformation that makes the transformed data Gaussian from a finite sample is far from trivial.

If you have any pointers to results that show distribution free guarantee of increased power I would be super happy to read.

Here's a question for you: why not just deal with the quantiles directly (for example, with quantile regression for regression tasks) rather than mapping them to the quantiles of a Gaussian?


Quantile regression is computationally intensive and inverse transformations are usually less so. Although you could certainly make the case that, given enough data, quantile regression better captures what we actually want to find most of the time (i.e. how this effect diverges towards the extremes).

Normit typically (not sure if universally) has the lovely property of giving you something like a marginal t-test without the assumptions of mixture-of-gaussians errors. You don't jerk around with U-statistics and thus the sample size doesn't make the test statistics so damned granular.
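
For anyone unfamiliar with what quantile regression optimizes, a rough sketch of the intercept-only case: the constant that minimizes the "pinball" (check) loss at level tau is a sample quantile. The brute-force search over data points and the toy values below are assumptions for illustration only; real quantile regression fits covariates and typically uses a linear-programming solver.

```python
def pinball_loss(residual, tau):
    """Check loss: weight tau on under-prediction, (1 - tau) on over-prediction."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def best_constant(values, tau):
    """Brute-force search over the data points for the constant with minimal total loss.

    The total loss is convex and piecewise linear in the constant,
    so a minimizer can always be found among the observed values.
    """
    return min(values, key=lambda c: sum(pinball_loss(v - c, tau) for v in values))


# Toy data with a heavy right tail.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

print(best_constant(values, 0.5))  # tau = 0.5 recovers the median
print(best_constant(values, 0.9))  # high tau chases the upper tail
print(best_constant(values, 0.1))  # low tau chases the lower tail
```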


Thanks for responding, we are definitely in agreement. I have used both; in my experience, quantile regression seems better behaved in the tails than quantile transformation in regression tasks. I think if one can make the transform conditional on the covariates they would be comparable.


Re: the linear methods vs. neural networks

It depends on the domain. Logistic or most other classifiers won't get close to NN when classifying images or text. It's not 80-90% of the power.

You are right when dealing with data that is not high-dimensional and not very non-linear either. Also plenty of other domains.


Logistic won't do anything useful for text, to be sure, although an HMM often will (or if you have continuous-valued sequences, a Kalman filter often will do the same). Logistic or multinomial can be tremendously handy for picking up interactions between measurements that can be followed up on and/or expanded in the limited-data case.

I think that the nonlinearity is what really sets apart problems better handled by NNs (not just nonlinearity, but nonlinearity that resists any sort of linearizing transformation), even for lowish-dimensional data. If you look at a linear fit plus a ReLU, you're just tacking a hinge loss onto a linear/logistic fit. Stack a bunch of these on top of each other and you have a universal function approximator, for which the goodness of fit is limited by the data. If you don't have a crapton of data, the fit isn't likely to be a lot better than linear or transformed linear. If you do have a crapton of data with nonlinear relationships, the implicit structure can be better captured by the flexibility of an NN. But of course you can also spend a lot of time training and debugging the fit, when it might be possible to quickly fit and diagnose a low-dimensional linear or additive model and put it into production. For a long, long time, the most popular "machine learning" method in the valley was logistic regression :-)
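
A toy sketch of the "linear fit plus a hinge" building block, with hand-picked weights (nothing is trained here, and the function names are made up). Each unit is flat until its linear part crosses zero; summing just two of them already produces a V shape that no single linear fit can.

```python
def relu(z):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return max(0.0, z)


def hinge_unit(x, w, b):
    """One 'linear fit plus ReLU' unit: zero until w*x + b crosses 0, then linear."""
    return relu(w * x + b)


def abs_from_two_units(x):
    """|x| written as relu(x) + relu(-x), i.e. two hinge units summed."""
    return hinge_unit(x, 1.0, 0.0) + hinge_unit(x, -1.0, 0.0)


for x in (-3.0, -1.0, 0.0, 2.5):
    print(x, abs_from_two_units(x))
```

More units give more kinks, which is where the flexibility (and the appetite for data) comes from.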

For image classification the way people expect it to be done, you are absolutely right (CNNs are incredibly good at this when given enough labeled data). E.g. for relating histological images to genetics or other markers, there's almost no point in not using a CNN with or without a denoising autoencoder in front of it. For low-detail or sparse-and-low-rank mixtures, often you can use compressive approaches to get a lot faster training. But I'll not argue against CNNs for the general image recognition case.

Linear or logistic is typically a great start, and as you note, it's very general. If after trying the simplest thing that can possibly work (linear or logistic), you need better performance, or the linear models can't give you useful answers, ratcheting up the complexity is a reasonable response. You do need a good deal of data to make the latter step worthwhile in most cases. I see a lot of people skipping the first step or ignoring the need for lots of data, and these are the people who get in trouble.


Yeah totally, and I usually fall back on "classic" methods like linear and forest algorithms. It's always good to remind myself of how many domains ML is applicable to, not just image and text analysis which seems to be the hot topic of the day.


i googled "normit transformations". the only non-publication link on the first page was (i can only assume) an automatic japanese translation...


Sorry, I should have used the phrase "inverse cdf". If you can map a pile of data onto 0 to 1 (i.e. by ranking it) then you can invert it onto the values that occupy those quantiles for a Gaussian. This is nice because you can use quasi-parametric assumptions (very nice when multi-dimensional relationships are to be explored) without the data itself needing to be distributed appropriately. Other alternatives are to use nonparametric smoothers and the like but I kind of hate going to all that trouble when it's usually pointless. (important exception: when you have interesting correlated structure in high-dimensional data)

It's a hack, to be sure, but especially if you want to pool data (e.g. in mixed hierarchical models) for better predictions, it often pays off. The name is a play on "logit", "expit", "probit", "tobit", etc. since the actual transformation is relatively trivial for data that is already sorted. (For large unsorted data or streams, not so much)
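
A minimal sketch of the rank-then-invert idea, standard library only. The `(rank - 0.5) / n` offset is one common convention and is an assumption on my part (other offsets exist); ties are simply broken by order of appearance here, which a careful implementation would replace with averaged ranks.

```python
from statistics import NormalDist


def normit(values):
    """Rank-based inverse normal transform.

    Each value is replaced by the standard normal quantile of its rank.
    The (rank - 0.5) / n offset keeps probabilities strictly inside (0, 1).
    """
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Indices of the data in ascending order; sorted() is stable, so ties
    # keep their original relative order.
    order = sorted(range(n), key=lambda i: values[i])
    transformed = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        transformed[idx] = std_normal.inv_cdf((rank - 0.5) / n)
    return transformed


# Made-up, heavily right-skewed data.
skewed = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 200.0]
z = normit(skewed)
print([round(v, 3) for v in z])
```

The ordering of the data is preserved, but the 200 ends up no further from the center than the 1 does, which is exactly why this helps with Gaussian machinery and is useless as a human-readable description of the raw values.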


Can somebody recommend a good book (or any other resource) on statistics? I mean kind of stuff this post talks about: descriptives, tests, all the basic stuff. Despite being somewhat familiar with that, I sometimes feel I really lack the solid understanding of the subject and am longing for something explanatory, with real life examples and exercises.


Think Stats (http://greenteapress.com/wp/think-stats-2e/) and the follow-up Think Bayes are awesome if you're a coder. https://bblais.github.io/statistical-inference-for-everyone-... is a complete rethink of how basic statistics should be taught and is excellent as well.


Thanks, looks awesome.


Great article overall, but I wish it went a little deeper in explaining how we should interpret the statistics it recommends.

It basically says "don't use the normal numbers, use these instead, because they are closer to what you would expect them to mean"

However I don't find all of these statistics super obvious.


This is actually a great post. I think it's a wonderful idea to move away from worshipping summary statistics (ave, median) and start studying the underlying distributions.


Surely there has to be an analysis somewhere that someone has done which links Kolmogorov complexity and one-statistic-describes-all approaches of interpreting numbers?


Something like what you describe is more or less how statisticians do in fact reason about statistics: https://en.wikipedia.org/wiki/Sufficient_statistic#Minimal_s..., though only for i.i.d. data from a parametric distribution.


I think "skew" is one of those things that are taught not because it is useful, but because it is easy to teach, easy to memorize and easy to test for.


Who the hell would consider Tukey a fringe statistician?


Lauded, sure, but just about all of his books are out of print and his work is rarely a part of the curriculum.


What's the "ordinal center"?


If you've got a bunch of ordered categories, like "bad", "okay" and "good", the category in the middle. (So if humans can have 0, 1 or 2 legs, then 1 would be at the center.)



