A single-number statistic is _going_ to leave things out, so if you must boil things down into one number, or even a few numbers, you're going to lose something that you had in the raw data. This is why I find claims along the lines of "statistics don't tell the whole story" a little bemusing - of course they don't, the very definition of a statistic is a summarization of data that is easier to work with. The question is what data is kept or lost, or more generally what importance we place on different aspects of the raw data such that it's reflected in our descriptive statistics.
The lessons for non-technical people who want to communicate with descriptive statistics are to recognize that summarization is inherent in the nature of any descriptive statistic, that they are thereby opinionated in some way in terms of what they've preserved and what they've left out, and to recognize whether those opinions are appropriate for your purpose.
It took me a long time to realize that a principled reason for using Gaussian parametric distributions is the maximum entropy principle. Prior to that point, it had been presented as essentially arbitrary, even by established professors.
There seems to be an assumption that theoretical statistics is "too hard", and as a result there's a middle ground that gets left out. I haven't taught general stats courses in a while (although I've taught advanced ones), but if I did, I might start with Bregman divergences and work down in the manner of your blog post.
I think there's an in-between that gets lost. You can teach principles without deriving long proofs of everything along the way.
Students don't get taught the underlying principles and philosophies to choose from, and I think this leads to the "holy grail" issues you're referring to.
It seems to be changing a bit with new interest in Bayesian methods, but that's just the tip of the iceberg.
Humans want to distill vast amounts of information into something more manageable, for example a single number.
Equity analysts look at accounting metrics, psychologists look at psychometric test scores, doctors look at some function of blood pressure, etc. etc.
Any person with deductive and sceptical mental faculties in place will recognize that these are all simplifications, and cannot be used to deliver a unified truth.
Also, being aware of this has very little to do with being technical or not (for example, plenty of programmers only look at a CPU's clock speed to gauge performance).
Anyhow, nice post.
Well yes, that is their purpose.
What happens when you minimize Ln, with n > 2? Why don't we use any of those?
That said, it's a slight pet peeve of mine when people recommend the median over the mean to describe center. The median, on its own, does not describe what is "typical" any more than the mean does; it just has a small advantage in that it will always map to a real observation, so for discrete data you don't end up with things like "1.9 legs." (Then again, a mean of 1.9 legs actually seems much more informative to me than a median of 2 legs, so even in that case I prefer the mean.) It's easy to envision many situations where the median's representation is wildly inaccurate, just as you can imagine ways in which the mean can be misleading.
The median is also insensitive to skew, which is often cited as a good thing, but really that's something you should determine on a case-by-case basis. In many (most?) applications, there's no tangible benefit in having a measure of center that ignores skew. The median's insensitivity also creates strange situations where subgroups in the population end up unrepresented (e.g. if the poorest 20% become even poorer because of changes to the tax code, the median doesn't budge). In general, the median conveys remarkably little information. It's great at showing the center, but it gives zero indication of anything else. The mean certainly has its issues, but it's far from inferior. Use the median when the situation calls for it, certainly, but realize that it's just as limited as any other measure, if not more so.
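That tax-code point is easy to sketch with toy numbers (hypothetical incomes; the specific figures are mine, not the commenter's):

```python
from statistics import mean, median

# Five hypothetical incomes; the bottom 40% get poorer, everyone else unchanged.
incomes = [10, 20, 30, 40, 50]
poorer = [5, 10, 30, 40, 50]

print(median(incomes), median(poorer))  # median is 30 in both cases
print(mean(incomes), mean(poorer))      # mean drops from 30 to 27
```

The median is identical before and after, so the change to the poorest subgroup is invisible; the mean at least moves.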
I think this swings the other extreme in selling the 'median' short. As long as we agree that it is only strictly meaningful to talk about the 'center' for symmetric distributions, median does a fine job.
In fact, in many realistic scenarios it does a far better job than the mean. The main trouble is that the normal (Gaussian) distribution is nowhere close to being as ubiquitous as it seems, nor is the CLT as universal as it is made out to be. Gauss sort of got away with it; he discovered neither the distribution nor the associated CLT.
Many real datasets of day-to-day consequence have heavy tails, and the mean is a pathetic measure of 'central tendency' for these. The mean is particularly sensitive to outliers. The median does significantly better than the mean in this non-academic situation. Although one could do better than the median for symmetric heavy-tailed data (for example, trimmed means), it's the mean that I find guilty of entirely disproportionate fame. If it takes exaggerating the median a bit to get people to grow up beyond the pervasive Normal/Gaussian fetish, I'm all for it.
No single number is going to characterize what is 'typical'. One really needs the CDF here, and yes avoid estimating densities as much as possible.
BTW, mapping to a real observation is not guaranteed; you only get a 50% chance of that.
When people say the median is "better" than the mean, they often can't explain exactly why (other than to cite skew, which I've addressed). It's really just a gut feeling based on how they felt when they first learned that the median was "more realistic" for some specific set of data that was skewed. They never thought through exactly why or when you'd want to disregard the skew in the first place. I would be insane to argue that the median is never the right statistic to use; my point is merely that it shouldn't be given preference over the mean, and that people need to think about what they're actually trying to show.
The mean is not only useful for the normal distribution, and since the article specifically avoids jumping into inference territory talking about specific distributions at all is kind of putting the cart before the horse. The "disproportionate fame" of the mean is inherited from its extreme importance throughout all of statistical inference. It's true that if you ignore inference, the mean isn't particularly important as a descriptive statistic, but my argument is that it's certainly no worse than the median.
Yes but this is exactly what is bad with it. At least for the majority of practical applications that I can think of. For instance, if you want to measure the "average" income of the population. If there are a handful of uber-rich people, they will really raise the mean. Even if the vast majority of people make much less than the mean. The median will do a much better job of telling you what a "typical" person makes.
I really believe that when most people read or talk about "average", they mentally interpret it as the median. And that the median is generally less misleading.
Sure, there are applications where sensitivity to changes at the tails is important. But even then, mean is misleading and difficult to interpret. In your earlier example, if you raised the income of the bottom 10% by 100%, then just use that as a statistic. "The poorest 10% of people now make twice as much money" is much more informative and impressive than "the average income increased by 0.1%."
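The handful-of-uber-rich example sketches out like this (toy numbers of my own choosing):

```python
from statistics import mean, median

# Hypothetical town: 99 people earn 40k, one person earns 10M.
incomes = [40_000] * 99 + [10_000_000]

print(mean(incomes))    # mean is 139,600: far above what anyone "typical" earns
print(median(incomes))  # median is 40,000
```

One outlier pulls the mean to more than triple the income of 99% of the population, while the median sits exactly where a typical person does.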
Well, 40~50 years of literature on robust statistics happens to disagree with the claim that 'people' don't know what they mean (pun unintended) when they say the median is better. Furthermore, skew has less to do with that argument than heavy tails. Heavy tails are extremely (ok sorry, now it's an insider pun) common.
About the mean vs. median, it might be true that "mean" is appropriate in just as many contexts as "median", but at least in my opinion, people cite a "mean" when a "median" would have been appropriate more frequently than the reverse. There are times when neither are appropriate, but if you're using median, you're more likely to understand the merits of different contexts.
> it might be true that "mean" is appropriate in just as many contexts as "median", but at least in my opinion, people cite a "mean" when a "median" would have been appropriate more frequently than the reverse.
Sure, but this is only true because the mean is used so much more. If the whole world was instead taught in grade school to use nothing but the median, there would be just as many (I'd argue more) misuses as there are now, except this time the median would be the offender. People using the median are more likely to be using it inappropriately because it's the nonstandard option, but that's only because they're not using it blindly. If people started using the median just because, it would suffer the same issue.
If you listen to someone like Taleb, the main advantage they mention for the median is that it's a more robust statistic. For fat-tailed distributions the average can jump all over as new data come in.
For example, geometric mean of returns is the right number to look at for stocks because you are indeed exposed to the tails.
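A toy illustration of why (my numbers, not the commenter's): a stock that gains 50% one year and loses 50% the next has an arithmetic mean return of 0%, yet you've actually lost 25%. The geometric mean reflects the compounding you're actually exposed to.

```python
from statistics import geometric_mean

growth = [1.50, 0.50]  # +50% one year, then -50% the next

arithmetic = sum(growth) / len(growth)  # 1.0, i.e. "0% average return"
geometric = geometric_mean(growth)      # sqrt(0.75), about 0.866

# Two periods of compounding: 1.5 * 0.5 = 0.75, a 25% loss overall.
# The geometric mean squared recovers that; the arithmetic mean does not.
print(geometric ** 2)
```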
The median also has zero ability to make me coffee in the mornings, but I don't think I can hold that against it.
I can imagine situations where getting people to track or listen to even a single number is tough and using the mean as a measure of both central tendency and the stability of the distribution over time might be the least worst option. But is that really a common problem? Have you often encountered situations where it was impossible to communicate something like "the typical customer buys five widgets but more than 1 in 4 of our customers only buy one" because it contains two statistics and management insists on being briefed with just one?
There are times that, for whatever reason, someone is only presenting one statistic. News headlines are a big one. In this case, you clearly want to pick the "best" statistic available. There are people who think that the median is categorically better than the mean and dispense advice as such. (You can find them saying things like "the mean is worthless, the median would be much better" in the comments of discussion boards.) That's not what this article did, but the author did imply (in its title and language, if nothing else) that the median is better than the mean without offering any sort of weighing mechanism for which to choose. That's what I'm responding to, because it perpetuates the median > mean myth that's prevalent among certain groups of people.
E.g., I can imagine the difference between the 25th and 75th percentile to be descriptive of spread (like standard deviation), but those numbers seem arbitrary.
Only if you have an odd number of data points, or the two middle data points have the same value. Example: I have four people with these heights in cm: 120, 160, 180, 200. The median would be (160+180)/2 = 170, which is the height of none of them.
Not if the number of observations is even and the two "middle" observations differ.
data: 0 0 0 1 1
mean: 2/5 = 0.4
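Both examples check out with Python's statistics module:

```python
from statistics import mean, median

heights = [120, 160, 180, 200]
print(median(heights))  # 170: the height of nobody in the sample

data = [0, 0, 0, 1, 1]
print(mean(data))    # 0.4, i.e. 2/5: a value no observation takes
print(median(data))  # 0
```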
That's fair. In my experience, summary measures are often calculated at many levels of an analysis, not just the final reporting.
The TL;DR is to just plot everything if possible to visualize skew explicitly (or use tricks like Anscombe's quartet), and there are a number of tools (e.g., ggplot2) that can do that without much LOC overhead.
The majority of defaults in R exist because statisticians do these things all the time and don't like typing any more than programmers. Time spent typing could be spent thinking about what assumptions are being violated and whether the results offer any useful insight.
In fact, many of his examples of what to use are just other descriptive statistics; the median is no less a descriptive statistic than the mean is.
I think the problem is not so much that descriptive statistics are bad as that they are not particularly useful when they lack context. Unfortunately, statistics is a much more complex field than your average product designer doing an A/B test thinks. The majority of statistics and models the average person utilizes have assumptions and qualifications that they don't fully understand, and frankly that's not a slight against them so much as a reflection of the field's depth.
Just as an example, ask the average person to derive the degrees of freedom for their t-test without just utilizing the formula they were taught.
As in most fields this is an issue of nuance, not that beginner techniques are bad, but beginner techniques alone are not nuanced enough to capture useful or rigorous insights.
"I think the problem is not so much that descriptive statistics are bad, so much as they are not particularly useful when they lack context. Unfortunately, statistics is a much more complex field than your average product designer doing an A/B test thinks." Whatever you say publicly is likely to be stripped of its context and to be received by an audience that does not understand the complexities of your field. The answer to this is to be change what you say, not to change the world, which verges on impossible.
In the field of web ops, I've had great success tracking the 95th percentile of request time rather than the mean, median, mode, or any other descriptive metric.
The systems I worked with were, like many systems, ordinarily very efficient. That meant the mean and median metrics washed away the occasional troublesome request and hid it from the metrics dashboard.
But knowing the 95th percentile was out of bounds allowed my team to investigate and try to remediate the trouble spots.
This was especially useful in troubleshooting fax delivery (yeah, I know, I know: it was healthcare stuff; fax is considered secure and email is not) through unbelievably flaky private branch exchanges (big-city hospitals).
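The effect is easy to reproduce with synthetic latencies (hypothetical numbers: a fast majority plus a flaky 5%; the distributions are my assumption, not the commenter's actual data):

```python
import numpy as np

# Hypothetical request times in seconds: 95% healthy, 5% pathological.
rng = np.random.default_rng(42)
times = np.concatenate([rng.exponential(0.05, 950),   # the efficient majority
                        rng.exponential(5.0, 50)])    # the flaky tail

print(np.median(times))          # tiny: the troublesome requests vanish
print(np.mean(times))            # inflated by the tail, but hard to act on
print(np.percentile(times, 95))  # surfaces the slow requests directly
```

The median looks healthy no matter how bad the worst 5% get; p95 moves as soon as the tail does.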
Standard in the industrial quality control world (2*sd above or below).
Because you didn't pay attention at school? Means, medians, modes and percentiles were standard fare in my high school, and again in freshman university courses. And I didn't go to particularly expensive schools or elite universities. Then again, I also taught intro stat courses to BA students for a few years, and it was easy to tell that the vast majority of them cared very little, since understanding their uses wasn't exam material.
When you taught your intro stats class, did you try to give the students guidance (and convey understanding) as to when and why they should apply one of the statistics? If so, good on you, but from my, obviously anecdotal experience, that has not been the case.
It's robust (25% breakdown point) and is kind of the "optimal" measure of skewness, as it's computed as the median of all possible interquartile-range-like statistics.
I hope someone implements the faster algorithm for Python's statsmodels.
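For the curious, here's a naive O(n²) sketch (assuming the statistic in question is the medcouple; this version skips the special kernel needed for ties at the median, which the real algorithm handles):

```python
import numpy as np

def medcouple(x):
    """Naive O(n^2) medcouple: median of the kernel h over pairs
    straddling the sample median. Ignores ties at the median."""
    x = np.sort(np.asarray(x, dtype=float))
    med = np.median(x)
    lower = x[x <= med]
    upper = x[x >= med]
    h = [((xj - med) - (med - xi)) / (xj - xi)
         for xi in lower for xj in upper if xi != xj]
    return np.median(h)

print(medcouple([1, 2, 3, 4, 5]))    # 0 for symmetric data
print(medcouple([1, 2, 3, 4, 100]))  # positive for right skew
```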
Certainly better than "I took a statistics course, and you won't believe what happened next!!!1!!"
It's implemented in the R package "energy", and provides a more comprehensive measure of variable dependence than Spearman's rank correlation. Distance correlation is able to detect any kind of dependence, not just monotonic relationships.
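For readers without R handy, the univariate version is small enough to sketch directly (naive O(n²) implementation of my own, following the standard double-centering construction):

```python
import numpy as np

def distance_correlation(x, y):
    """Naive O(n^2) distance correlation for univariate samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    a = np.abs(x[:, None] - x[None, :])  # pairwise distances in x
    b = np.abs(y[:, None] - y[None, :])  # pairwise distances in y
    # Double-center each distance matrix.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    return np.sqrt(dcov2 / np.sqrt((A * A).mean() * (B * B).mean()))

x = np.linspace(-1, 1, 101)
y = x ** 2  # dependence that is neither linear nor monotonic

print(np.corrcoef(x, y)[0, 1])     # ~0: Pearson misses it entirely
print(distance_correlation(x, y))  # clearly positive: dependence detected
```

This is exactly the selling point over Pearson or Spearman: a parabola has zero linear and zero monotonic correlation but a clearly nonzero distance correlation.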
Essentially, we can make up any method we want to summarize data and give us a single value that represents the central location of the data, e.g. least squared distance (the mean) vs. least absolute distance (the median), etc.
I haven't thought much about the difference between "typical case" or "expected value" so that's a very useful distinction to be made, especially when deciding which method you want to pick.
For L2 (squared distance), you get the mean: for fixed x_i, the sum over i = 1..N of (x_i - M)^2 is smallest at M = mean.
For L1 (absolute distance), you get the median.
For L0 ("identical or not"), you get the mode.
You can come up with other and alternative concepts, but it's kind of neat that the 3 most common descriptions of central tendency pop out of this one unified approach.
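Numerically, at least, the claim is easy to check on a toy dataset with a grid search over candidate centers M (the dataset is my own made-up example):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])  # mean 3.6, median 2, mode 2
M = np.linspace(0, 12, 1201)                 # candidate centers, step 0.01

l2 = [np.sum((data - m) ** 2) for m in M]    # squared distance
l1 = [np.sum(np.abs(data - m)) for m in M]   # absolute distance
l0 = [np.sum(~np.isclose(data, m)) for m in M]  # count of mismatches

print(M[np.argmin(l2)])  # ~3.6 -> the mean
print(M[np.argmin(l1)])  # ~2.0 -> the median
print(M[np.argmin(l0)])  # ~2.0 -> the mode
```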
Working as an engineer now (Materials Engineering) here is a short list from the top of my head of what I feel are useful statistics applications that were not covered in uni.
- Alternative probability distributions (Rosin-Rammler)
- Harmonic and Geometric means
- Multivariate regression
- Statistical Sampling techniques (how to obtain a representative sample and avoid sampling bias)
- Statistical process control
- Time Series analysis
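As a taste of the second bullet (toy numbers of my own): averaging rates over a fixed distance calls for the harmonic mean, not the arithmetic one.

```python
from statistics import harmonic_mean

# Drive 60 km/h out and 30 km/h back over the same distance.
speeds = [60, 30]

print(sum(speeds) / len(speeds))  # 45: the naive (wrong) average speed
print(harmonic_mean(speeds))      # 40: total distance / total time
```

The harmonic mean is correct here because equal distances, not equal times, are spent at each speed.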
In the IT Operations / Monitoring domain time series models play a large role.
At the same time the level of sophistication is rather low. You definitely have to start with the basics when teaching such a course.
Because lots of data analysis questions hinge upon the association between two data sets, and it's nice (crucial) to be able to quantify this value. Especially because
> While statisticians are generally quite good at estimating a correlation from a picture and vice versa, most people are not.
The author says
> Still not happy and absolutely want a number? You would do well to shun correlations even so.
OK, so what else do you suggest?
Slopes, a.k.a. the parameters in a regression analysis, ideally as a confidence interval or prediction interval to account for uncertainty. Interpreting and communicating regression analyses is a pretty big topic on its own so I chose to only hint at it, though I understand that might not be very satisfying for some readers.
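A sketch of what that looks like in practice (synthetic data; SciPy's linregress with a t-based interval on the slope — the specifics here are my own illustration, not the author's):

```python
import numpy as np
from scipy import stats

# Synthetic data with a known slope of 2 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + rng.normal(0, 1, 100)

res = stats.linregress(x, y)
t = stats.t.ppf(0.975, len(x) - 2)  # critical value, n-2 df
lo, hi = res.slope - t * res.stderr, res.slope + t * res.stderr
print(f"slope {res.slope:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Reporting "each extra unit of x is associated with about 2 more units of y, give or take" communicates both the relationship and its uncertainty, which a bare correlation cannot.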
I realized recently that a lot of the trouble people have with training neural networks stems from their lack of training in basic model evaluation. If you stack a bunch of shitty penalized regressions (which is what linear/logistic + ReLU hinge loss represents) you now have one gigantic shitty regression which is harder to debug. If your early steps are thrown out of whack by outliers, your later steps will be too. Dropout is an attempt to remedy this, but you tend to lose power when you shrink your dataset or model, so (per usual) there really is no such thing as a free lunch. But most of the tradeoffs make more sense when you are able to evaluate each layer as a predictor/filter. Scaling this up to deep models is hard, therefore debugging deep models is hard. Not exactly a big leap.
There is a reason people say "an expert is a master of the fundamentals". Building a castle on a swamp gives poor results. If you can't design an experiment to test your model and its assumptions, your model will suck. This is not rocket surgery. A GPU allows you to make more mistakes, faster, if that's what you want. If you have the fundamentals nailed down, and enough data to avoid overfitting, then nonlinear approaches can be incredibly powerful.
Most of the time a simple logistic regression will offer 80-90% of the power of a DNN, a kernel regression will offer 80-90% of the power of a CNN, and an HMM or Kalman filter will offer 80-90% of the power of an RNN. It's when you need that 10-20% "extra" to compete, and have the data to do it, that deeper or trickier architectures help.
If you can transform a bunch of correlated data so that it is 1) decorrelated and 2) close enough to Gaussian for government work, you suddenly get a tremendous amount of power from linear algebra and differential geometry "for free". This is one reason why Bayesian and graphical hierarchical mixed models work well -- you borrow information when you don't have enough to feed the model, and if you have some domain expertise, this allows you to keep the model from making stupid or impossible predictions.
Anyways. I have had fun lately playing with various deep, recurrent, and adversarial architectures. I don't mean to imply they aren't tremendously powerful in the right hands. But so is a Hole Hawg. Don't use a Hole Hawg when a paper punch is all you really need.
2) What (good) statisticians excel at is catching faulty assumptions. (I'll leave it to the reader to decide whether this data-scientist-for-hire has done a good job of that in his piece) So we plot our data, marginally or via projections, all the damned time. If you don't, sooner or later it will bite you in the ass, and then you too can join the ranks of the always-plotting. However, choosing which margins or conditional distributions to plot in a high-dimensional or sparse dataset is important to avoid wasting a lot of time. So whether via transformation or penalization (e.g. graphical lasso) or both, we usually try to prune things down and home in on "the good stuff". Prioritizing what to do first is most easily done if you have a number and can rank the putative significance by that number. Use Spearman, use energy statistics (distance correlation), use marginal score tests -- IDGAF, just use these as guidelines and plot the damned data.
Corollary: if someone shows you fancy plots and never simple ones containing clouds of individual data points, they're probably full of shit. Boxplots should be beeswarms, loess plots should have scatterplots (smoothed or otherwise) behind them. And for god's sake plot your residuals, either implicitly or explicitly.
3) see above. The author is good at fussing, and brings up some classical points. But they're not really his points. Median and MAD are more robust to outliers than mean and standard deviation, but that makes them less sensitive, too. Check your assumptions, plot everything, use the numbers as advisory quantities rather than final results.
I don't mean to offend, but this is the PhD ur-response, "you didn't mention my pet theory!" :-)
You've given me some interesting stuff to chew on but I very specifically wanted to write about descriptive statistics as a way to describe data, not as a way to summarize it for computers so it can be used in inference. Mapping non-normal distributions onto a Gaussian ain't gonna cut it for that purpose, and to the extent that I care about robustness in this context it's not robustness of inference but whether a descriptive continues to provide a reasonable description of the data for human consumption in the face of outliers etc.
As far as describing data, what's wrong with median + IQR for marginal distributions, or some flavor of energy statistic for joint distributions? You will always need to trade off robustness for sensitivity and bias for variance. That's simply a mathematical feature of the universe. There are plenty of ways to take advantage of this to highlight outliers, for example, which often gets you thinking in terms of "hey, this really looks more like a mixture of two completely different distributions" and seeing if that intuition holds up.
The whole point of describing data with summary statistics is that if the assumptions are met this decouples them from the underlying data. If you use the median as an estimator for the center of your distribution and the MAD as an estimator of its scale, you may choose dimensions along which it's not very good at partitioning your observations. If you want a resistant way to describe the expected center of your data, the median and MAD are very useful. Sometimes it's even more useful to plot everything and point out "our results fall into K obvious clusters" each of which will have their own center & spread.
What I'm saying is that there's no silver bullet. Most of the time we take descriptions of the data, see if there are interesting inferences to be drawn, lather, rinse, repeat. "Get a lot more samples" is often thrown into the mix. Are there strong clusters in the data? (Usually a plot will show this, whether via projection or in the raw data) Are there continuums that are interesting in relationship to things we care about? (Usually we'll turn around and model their relationship to said thing-we-care-about, conditioned upon a bunch of other items... multivariate regression, which if you're doing it right, will get you plotting the residuals, themselves descriptive of the model fit)
You simply can't do responsible statistical inference without exploring your data to see what's going on. In order to explore complex datasets, there are plenty of techniques, and most all of them demand tradeoffs (see MAD vs. SD or other metrics of "interestingness" for clustering). A number of descriptive statistics ("extremality" for example) rely upon limit behavior of specific distributions and are case-by-case.
I don't think you'll find many silver bullets for either descriptive or inferential statistics. You have to choose your tradeoffs based on what you want to accomplish.
If you have any pointers to results that show distribution free guarantee of increased power I would be super happy to read.
Here's a question for you: why not just deal with the quantiles directly (for example, with quantile regression for regression tasks) rather than mapping them to the quantiles of a Gaussian?
Normit typically (not sure if universally) has the lovely property of giving you something like a marginal t-test without the assumption of mixture-of-Gaussians errors. You don't jerk around with U-statistics, and thus the sample size doesn't make the test statistics so damned granular.
It depends on the domain. Logistic or most other classifiers won't get close to NN when classifying images or text. It's not 80-90% of the power.
You are right when dealing with data that is not high-dimensional and not very non-linear, and in plenty of other domains too.
I think that the nonlinearity is what really sets apart problems better handled by NNs (not just nonlinearity, but nonlinearity that resists any sort of linearizing transformation), even for lowish-dimensional data. If you look at a linear fit plus a ReLU, you're just tacking a hinge loss onto a linear/logistic fit. Stack a bunch of these on top of each other and you have a universal function approximator, for which the goodness of fit is limited by the data. If you don't have a crapton of data, the fit isn't likely to be a lot better than linear or transformed linear. If you do have a crapton of data with nonlinear relationships, the implicit structure can be better captured by the flexibility of an NN. But of course you can also spend a lot of time training and debugging the fit, when it might be possible to quickly fit and diagnose a low-dimensional linear or additive model and put it into production. For a long, long time, the most popular "machine learning" method in the valley was logistic regression :-)
For image classification the way people expect it to be done, you are absolutely right (CNNs are incredibly good at this when given enough labeled data). E.g. for relating histological images to genetics or other markers, there's almost no point in not using a CNN with or without a denoising autoencoder in front of it. For low-detail or sparse-and-low-rank mixtures, often you can use compressive approaches to get a lot faster training. But I'll not argue against CNNs for the general image recognition case.
Linear or logistic is typically a great start, and as you note, it's very general. If after trying the simplest thing that can possibly work (linear or logistic), you need better performance, or the linear models can't give you useful answers, ratcheting up the complexity is a reasonable response. You do need a good deal of data to make the latter step worthwhile in most cases. I see a lot of people skipping the first step or ignoring the need for lots of data, and these are the people who get in trouble.
It's a hack, to be sure, but especially if you want to pool data (e.g. in mixed hierarchical models) for better predictions, it often pays off. The name is a play on "logit", "expit", "probit", "tobit", etc. since the actual transformation is relatively trivial for data that is already sorted. (For large unsorted data or streams, not so much)
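One common recipe for that transform (a sketch of a rank-based inverse normal transform; "normit" isn't fully standardized, so the exact offsets here are my assumption):

```python
import numpy as np
from scipy import stats

def normit(x):
    """Rank-based inverse normal transform (one common recipe)."""
    r = stats.rankdata(x)                      # average ranks for ties
    return stats.norm.ppf((r - 0.5) / len(x))  # map ranks to normal quantiles

x = [3.1, 0.2, 0.2, 7.5, 1.0]
z = normit(x)
print(z)  # order preserved; ties map to the same value; roughly normal spacing
```

As the parent notes, for already-sorted data this is nearly trivial: the ranks are just positions.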
It basically says "don't use the normal numbers, use these instead, because they are closer to what you would expect them to mean"
However, I don't find all of these statistics super obvious.