Machine Learning Is the New Statistics (danielmiessler.com)
81 points by danielrm26 on Oct 16, 2016 | 28 comments



Machine learning is a subset of statistics.

The standard text in ML, "The Elements of Statistical Learning", is authored by statistics professors.

Statistics is the new statistics. The rest is marketing bullshit.


I think names for fields are largely defined by the social forces behind them, not their ideas. Terms like "statistics" are sociological artifacts, not mathematical ones. The difference between stats and machine learning is analogous to, say, the difference between Canada and the United States. They're neighbours, separated at birth, share many of the same values, but still each have their own personality, worldview, goals and culture.

Walk into a statistics department in a university, and you can feel the differences in culture. It seems to boil down to what one considers sacred. Confidence Intervals, Hypothesis Testing and Rigor are things statisticians hold near and dear to their hearts. Want to be a statistics PhD? You'd better understand all the differences between convergence in probability and convergence almost surely. And things like Probably Approximately Correct learning, VC Dimension, Random Forests, and even Deep Learning should sound like heresy to a true, card-carrying statistician.

Another odd cultural difference: statisticians love citing old papers. In ML you can call a paper that's about 20 years old "the classical work"; stats has a much higher standard for this. The classics go way, way back, to a time before I was born. It's got to be an issue of pride.

But yeah, for the most part I do feel the fields are converging, even if it seems like ML is consuming statistics rather than the other way round. Stats majors are publishing in places like NIPS rather than the traditional journals, not the other way round. Just a feeling though; I have no stats to back up this claim.

But there's no point saying "A is a subset of B" unless you want to spice up your happy hour conversations in the pub.


It's surprising to me that you would describe the field of statistics as a separate 'sociological artifact', but then refer to the actual definition of the term when using the abbreviated word 'stats', as in your sentence 'I have no stats to back up this claim'.

Statistics are tallied numbers and represent actual measured values. The field of statistics is concerned with the tallied numbers that are collected, probability is concerned with the likelihood of those numbers being produced under specific assumptions, and machine learning is a process that uses statistics to verify and adjust the probability model being used for study.

Those are all definitions used by mathematicians and statisticians (who are a subset of mathematicians), not 'sociological artifacts'. Things don't sound like heresy to a statistician unless he is arguing implicit versus explicit logic. That holds regardless of how it feels when he walks into his department.

There is no need to prescribe 'artifacts' if we can just keep the correct definitions clear and not conflate them.


Machine learning is a subset of statistics. The standard text in ML, "The Elements of Statistical Learning", is authored by statistics professors.

To be fair, I think "Machine Learning" was an academic marketing term coined by Computer Science departments. It seems that the term "Statistical Learning" was coined in response by Statistics departments. Other similar marketing buzzwords used by various factions (comp sci, stats, actuarial science, industrial engineers, etc.) include "Data Science," "Predictive Analytics," "Data Mining," "Knowledge Discovery," "Knowledge Engineering," "Soft Computing," "Artificial Intelligence," "Big Data," "Deep Learning"... To be honest, it's tiresome and troubling to see academic departments invent and adopt overlapping and vacuous marketing buzzwords.


This view isn't really accurate – the relationship is one of non-empty intersection rather than inclusion in either direction. Machine learning algorithms are meta-algorithms where a large portion of the specific algorithm to be applied is filled in based on training data. Most ML algorithms are statistically unsound and many statistical methods aren't machine learning.

To make the case that ML - stats isn't empty, consider neural networks and singular value decomposition (Netflix winners used SVD): both are wildly successful ML techniques – and neither is remotely statistically sound. Their correctness is at best heuristic, yet, partly because they can be efficiently applied to huge amounts of data – far more than classical statistical methods can handle – they are very effective.
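For anyone who hasn't seen the SVD trick in that context, here is a minimal sketch of the idea (toy made-up ratings, assuming only numpy). Note there's nothing statistically principled in it; it's just a low-rank factorization:

    # Sketch: low-rank SVD for rating prediction on a toy user x item matrix.
    # Unobserved cells (0) are filled with column means; this is a heuristic, not a model.
    import numpy as np

    R = np.array([[5., 4., 0., 1.],
                  [4., 0., 0., 1.],
                  [1., 1., 0., 5.],
                  [0., 1., 5., 4.]])
    col_means = np.nanmean(np.where(R > 0, R, np.nan), axis=0)
    R_filled = np.where(R > 0, R, col_means)

    # Keep only the top-k singular vectors and reconstruct.
    k = 2
    U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
    R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(R_hat, 2))  # predicted ratings, including the formerly missing cells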

Statistics has traditionally focused on making the most of a modest amount of data – because in previous eras, data was the limiting factor. Huge data sets today render most statistical tests useless since no hypothesis is strictly true given enough precision, and the focus on eking the most out of every data point is overkill. It turns out that a naive, imprecise method with a huge amount of data is often more effective than a sophisticated method with far less data. As ML gets better at building statistically sound models and stats gets better at scaling, the two disciplines are converging slowly – but we are still very far away from inclusion in either direction.


>Huge data sets today render most statistical tests useless since no hypothesis is strictly true given enough precision, and the focus on eking the most out of every data point is overkill.

Mind, huge data sets and huge amounts of computing power also make the use of Bayesian methods feasible, which don't have the NHST problem that anything is significant with enough data-points.
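A quick numerical illustration of that NHST issue (a sketch with made-up numbers, assuming numpy and scipy are available):

    # Sketch: with enough data, a practically irrelevant difference is "significant".
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 2_000_000
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.01, scale=1.0, size=n)  # tiny, meaningless true effect

    t, p = stats.ttest_ind(a, b)
    print(f"t = {t:.1f}, p = {p:.2e}")  # p is essentially zero despite the trivial effect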


I finally figured out the best way to respond to this.

The central concept of Machine Learning is self-improvement of the models that are used, based on data alone.

That is NOT the central concept of Statistics.

So yes, Machine Learning may be related to, part of, semantically linked, a subset of, or whatever you want to say there, but the fact that Machine Learning is designed to self-improve is (in my opinion anyway) the reason it should be considered a "new" way of evaluating the world.

Saying it's all just Statistics sounds a lot like calling consciousness "just another information processing mechanism" (yawn). I know it's not that big of a difference, but it's similar.

Self-improvement of data analysis models matters enough to warrant the separate name and the attention that comes with it.
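To be concrete about what I mean by self-improvement, here is a minimal sketch (toy 1-D regression, all numbers invented): an online update loop in which each new observation nudges the model's parameter.

    # Sketch: a model that improves itself as data streams in (online gradient descent).
    import numpy as np

    rng = np.random.default_rng(1)
    w = 0.0        # model parameter, starts uninformed
    lr = 0.05      # learning rate
    true_w = 3.0   # the unknown relationship hidden in the data

    for step in range(1, 501):
        x = rng.normal()
        y = true_w * x + rng.normal(scale=0.1)   # one new observation
        grad = 2 * (w * x - y) * x               # gradient of the squared error
        w -= lr * grad                           # the model updates itself from the data
        if step % 100 == 0:
            print(f"after {step} points: w = {w:.3f}")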


Statistics as a field typically concerns itself with fitting models to data by determining the distribution of the process that the data was generated from. Machine learning follows a different approach. Read Breiman's "Two Cultures."

http://projecteuclid.org/euclid.ss/1009213726


> by determining the distribution of the process that the data was generated from.

Well, each random variable has a distribution. And there are a few distributions that are common, so they are taught. Then, presto, bingo, too many students conclude that an important first step is to find a distribution. However, commonly in practice, with just samples and without more in the way of mathematical assumptions, finding a distribution ranges from not very promising to hopeless. Hopeless? Yes: consider a random variable that takes values in 50-dimensional Euclidean space.

But there is a lot of statistics that is distribution-free, where we make no assumptions on probability distributions. E.g., I published such a paper in Information Sciences. In addition, with some meager assumptions (say, that the random variables have expectations, that the squares of the random variables have finite expectations, etc.), we can do more.

For model fitting, if we can assume that the data has a Gaussian distribution, is homoscedastic, is independent and identically distributed (i.i.d.), etc., then we can get some more results, e.g., know that some of the results of the computations have a Gaussian or F distribution, etc. Then we can do a lot of classic hypothesis tests, confidence intervals, etc.

But with just meager assumptions, we commonly can still proceed and know that we are still making a best L^2 approximation. Then we can drag out the classic result that a sequence of (such) random variables that is Cauchy convergent in L^2 does converge in L^2, that L^2 is complete (i.e., a Hilbert space), and that some subsequence converges almost surely. That's a lot -- might be able to take that to the bank. And made no more than meager, general assumptions about distributions.
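In symbols, the classic result being leaned on (a paraphrase, not a quote from any particular text):

    X_n \text{ Cauchy in } L^2,\ \text{i.e. } \mathbb{E}\big[(X_m - X_n)^2\big] \to 0 \text{ as } m, n \to \infty
    \;\Longrightarrow\; \exists\, X \in L^2 \text{ with } \mathbb{E}\big[(X_n - X)^2\big] \to 0,
    \text{ and } X_{n_k} \to X \text{ almost surely for some subsequence } (n_k).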

Really, we often get some of the well-known distributions from theorems, not from analysis of empirical data. E.g., we get a Gaussian assumption from the central limit theorem. We can get an exponential distribution from the Poisson process (e.g., E. Cinlar's text), and can get that from the very general, even astounding, renewal theorem (e.g., W. Feller's second volume).


Indeed.

In fact the birth of statistical learning theory was Vapnik's (rather unintuitive) insight that although the theoretically optimal strategy in a classification task is given by the class-conditional distribution, estimating the class-conditional distribution from the data is not a promising way to do it. Estimating distributions is freaking hard; you may not (ever) have the data to estimate that distribution. However, you can short-circuit the process and solve the classification problem directly (without explicitly modeling the class-conditional distribution).
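A toy illustration of that short circuit (a sketch assuming scikit-learn; the data are synthetic): the classifier below never estimates a class-conditional density, it only searches for a decision boundary that works on the training points.

    # Sketch: solving classification directly, with no density estimation anywhere.
    import numpy as np
    from sklearn.svm import LinearSVC

    rng = np.random.default_rng(4)
    X0 = rng.normal(loc=-1.0, size=(200, 2))
    X1 = rng.normal(loc=+1.0, size=(200, 2))
    X = np.vstack([X0, X1])
    y = np.array([0] * 200 + [1] * 200)

    clf = LinearSVC().fit(X, y)        # fits a separating hyperplane directly
    print(clf.coef_, clf.intercept_)   # no estimate of p(x | class) was ever formed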

That said, machine learning has models that do not rely on a stochastic sampling process. The sampling process can be adversarial too. In such situations the algorithms guarantee that you won't be too far off (even in finite time) from the strategy that would be optimal in hindsight. These are non-asymptotic guarantees. I am of course sweeping large swathes of theory under the carpet, because this is hardly the forum for that.

The beauty of the CLT notwithstanding, what is surprising is how rarely the Gaussian assumption holds in practice. Many distributions are too heavy-tailed to have a finite variance, and in such scenarios the CLT does not yield a Gaussian but a stable distribution (the Gaussian is just one member, in fact the only one in the stable family with finite variance; the Gaussian distribution was not discovered by Gauss, and neither is it that normal). To make a tongue-in-cheek claim, the prince of mathematicians sort of got away with it.
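A quick numerical illustration of the heavy-tail point (a sketch with simulated data, assuming numpy): Cauchy samples have no finite variance, so their sample mean never settles down the way a Gaussian's does.

    # Sketch: sample means of Cauchy data stay erratic, unlike Gaussian data.
    import numpy as np

    rng = np.random.default_rng(2)
    for n in (100, 10_000, 1_000_000):
        gauss_mean = rng.normal(size=n).mean()
        cauchy_mean = rng.standard_cauchy(size=n).mean()
        print(f"n={n:>9}: Gaussian mean {gauss_mean:+.4f}, Cauchy mean {cauchy_mean:+.4f}")
    # The Gaussian mean shrinks toward 0; the Cauchy mean does not, because the
    # mean of n Cauchy samples is itself Cauchy distributed.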

Machine learning theory was never far from statistics and probability. How could it be, when its traditional theoretical bedrock (PAC) is explicitly a statement about convergence in probability, with all its epsilons and deltas in your face? Where the two differ is in the amount of focus on prediction (as opposed to parameter estimation), familiarity with optimization and algorithms, and the lack of fascination with hypothesis testing, the Gaussian distribution and asymptotic normality.


>And made no more than meager, general assumptions about distributions.

I don't see how assuming the data are IID from a Gaussian is a meager assumption.


> I don't see how assuming the data are IID from a Gaussian is a meager assumption.

It's not. Somehow we have failed to communicate accurately.

I wrote

> But with just meager assumptions, we commonly can still proceed and know that we are still making a best L^2 approximation.

In that sentence, I didn't suggest that those "assumptions" were the Gaussian i.i.d. of the previous paragraph. Instead, I left the "assumptions" unspecified. Why? Because there is quite a variety available. But lots of the options are "meager".

Typically, with more assumptions, we can get more results. But for model fitting, building, constructing, discovering, whatever, we can still get a lot with next to nothing in assumptions.


Hmm. Maybe you're right.

But let me try to push back.

If we're trying to draw human-consumable wisdom about the state of the world from data, simply capturing a snapshot and then applying some basic analysis is one thing.

Creating self-improving mechanisms for doing this is another, even if the latter uses statistics in the process. Perhaps in a similar way that Statistics uses Algebra yet is distinct in its description and capabilities.

I'm not convinced you're wrong here, just trying to talk through it.


What the parent is saying is that machine learning is a buzzword for statistics or at best another word for applied statistics.

Saying that applied physics will be the new physics because it changes the real world, is nonsensical.

In earlier days Google called their translation algorithms statistical, then they changed it to ML, and now AI has come into favour again, so that is the word being used.


I think that one of the problems I have with the description in the post is that it draws a line where, in reality, no real line (or at the very least, only a blurry mess) exists. For example, where do hierarchical models and Bayesian nonparametric models fit? Where does model selection fit? These notions have existed in statistics for some time. I'm not one of these particularly dogmatic people who believe that "it's all just statistics" or, conversely "machine learning is entirely new / different". In fact, I think it's the deep (and continuing) connection between these fields that make them both so interesting and powerful. However, I do tend to agree that the type of hype used in this post massively oversells ML while simultaneously underselling Stats, based, partly, on the false dichotomy drawn between them.


HN should add a gild option.


Is that the standard? I always thought it was "Pattern Recognition and Machine Learning" by Bishop, but it's not my field.


People who think ML/AI isn't statistics typically haven't studied statistics, or have a marketing agenda. I can tolerate the latter as a fact of life. But the former ... there is often a disturbing lack of statistical understanding in "ML/AI" practitioners at the ground level, even though the vast majority of their tooling is built on basic multivariate statistics. It's rather inevitable given the sudden sex appeal of the field, but it will lead to an 'AI winter' as those folks over-promise and under-deliver. Computational statistics, statistical learning, machine learning, pick your term: it certainly continues to progress as computational horsepower improves. But as another commenter noted, physicists/chemists still self-identify with quantum mechanics even though the computational methods/approximations for molecular dynamics continue to rapidly improve.


Not sure about an AI winter. I think a lot of advances are fueled by GPUs, ASICs and colossal datasets. Also, open-sourcing the frameworks makes it easier for newcomers.

We can recognize objects and recognize speech at almost human-level accuracy. That's a big milestone when you think about it.

Also technology improves exponentially when a ton of smart people funded by crazy fuck you money work on pushing it forward.


I would second what you said ...


While I'm not a fan of the modern abuse of the term "machine learning" as a marketing buzzword, this article is tautological ("Machine Learning is the new statistics because it is not statistics") and does not provide much insight aside from invoking other buzzwords ("Reality Analysis Level 1"?).


The "insight" would be in the form of a falsifiable claim, i.e. that ML will be bigger than the advent of statistics in terms of its ability to improve our understanding of the world.

It's a claim, nothing more.


The statement "More wisdom potentially gets extracted when you apply Statistics to more (and better) data, but the analysis itself doesn’t improve with better data." simply isn't true. A hierarchical model, for example, is increasingly able to model subgroups and additional levels of hierarchy as more data are added. Penalized regression (or Bayesian regression) is another example -- the model is structurally different as you change the quantity of data.
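To make the penalized-regression point concrete, here is a small sketch (synthetic data, assuming scikit-learn): with cross-validated regularization, the fitted model typically retains more non-zero coefficients as the sample grows, so its effective structure changes with the quantity of data.

    # Sketch: a cross-validated lasso changes structure with the sample size.
    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(3)
    p = 20
    true_coef = np.zeros(p)
    true_coef[:5] = [3.0, -2.0, 1.5, 1.0, 0.5]   # only 5 of 20 features matter

    for n in (30, 300, 3000):
        X = rng.normal(size=(n, p))
        y = X @ true_coef + rng.normal(scale=1.0, size=n)
        model = LassoCV(cv=5).fit(X, y)
        print(f"n={n:>5}: non-zero coefficients = {np.sum(model.coef_ != 0)}")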

The difference between ML and statistics is entirely semantic. Is logistic regression a ML method or a statistical method? It is both!


Machine Learning is generically defined as a method of data analysis that automates analytical model building.

That's the part that's going to have the impact---automated improvement of the way Statistics is applied to data analysis.

Is that not major enough to be considered and discussed separately?


I think the kicker is that most machine learning is incrementing on a single model, typically one known from statistics before. The weak-learner track of combining models almost goes against this, but I think even then it is usually the same shape of model.

So, I actually agree to an extent. Much as computers can be seen as the "next logic". Only, it is such a "builds on" relationship that I think calling it next is dubious.


While I am not statistically sophisticated enough to say whether the statement is true or not, it seems that the mathematical machinery is strongly converging, as witnessed by papers like [1], where some of the statistical machinery is being developed on the ML side. It might be historically interesting to note that 20-30 years ago, ML also more evenly spanned work that would be considered AI-based, as including logic or knowledge representation; see [2-4] as examples.

[1] Variational Inference: A Review for Statisticians, https://arxiv.org/abs/1601.00670

[2] A Summary of Machine Learning Papers from IJCAI-85 http://web.engr.oregonstate.edu/~tgd/publications/mlj-ijcai8...

[3] Chunking in Soar: The Anatomy of a General Learning Mechanism, http://repository.cmu.edu/cgi/viewcontent.cgi?article=2552&c...

[4] Explanation-Based Learning: An Alternative View, http://www.cs.utexas.edu/~ml/papers/ebl-mlj-86.pdf


I think Deep Learning specifically is where all the magic is. The rest of machine learning (SVMs, clustering, decision trees, etc.) consists of old methods that were invented in the '90s or earlier; recent lifts in data storage and compute power have made them proportionally more powerful, but they haven't unlocked new technology as far as I know.

Deep Learning wasn't even possible until recently, though; data and compute power have made it possible as opposed to just proportionally better [1]. There have also been a lot of breakthroughs in Reinforcement Learning riding on the wave of Deep Learning, and both of those (DL and RL) are more than applied statistics.

1. I think of this as a 0 to 1 innovation versus a 1 to n innovation if you're familiar with Peter Thiel's terminology on that.


Modeling is the next modeling.



