What’s the difference between statistics and machine learning? (thestatsgeek.com)
202 points by jwb133 72 days ago | 93 comments

The classic explanation is Leo Breiman's 'Two Cultures' paper. He was a statistics professor who left for industry, came back, and tried to get academics to adopt industry approaches. The paper is very readable.


An oversimplified version may be:

Statistics focuses on fitting data to formally understandable models, whereas data science focuses on solving problems -- even if that means using techniques that aren't formally understood.

Leo Breiman is also known for pioneering random forest and bootstrap aggregation.

I think of machine learning as a subset of data science where you trick linear algebra into thinking.

This seems perfectly fair. And I think that historically there was plenty of use for statistics where people didn't care about the formal understanding, so they were doing crude machine learning before the term became widespread.

I've sat through lengthy discussions of machine learning exercises, and could not silence the voice in my head, saying: "This is just curve fitting." Fitting data to an arbitrary curve, and then extrapolating the fitting function, is as old as the hills.
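To make the "curve fitting, then extrapolating" point concrete, here is a minimal sketch in plain Python. The quadratic ground truth and the function names (`fit_line`, `true_curve`) are made up for illustration; nothing here comes from a specific ML exercise.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def true_curve(x):
    # Hypothetical ground truth, chosen only for illustration.
    return x * x


# Training data covers only the interval [0, 1].
train_x = [i / 10 for i in range(11)]
train_y = [true_curve(x) for x in train_x]
slope, intercept = fit_line(train_x, train_y)

# Inside the training range the straight line looks acceptable...
worst_in_range = max(abs(slope * x + intercept - true_curve(x)) for x in train_x)
# ...but extrapolating the same fitted function to x = 3 is far off.
extrapolation_error = abs(slope * 3 + intercept - true_curve(3))

print(f"fit: y = {slope:.2f} * x + {intercept:.2f}")
print(f"worst error on [0, 1]: {worst_in_range:.2f}")
print(f"error at x = 3: {extrapolation_error:.2f}")
```

On this toy data the fitted line is roughly y = x - 0.15: it is never off by more than about 0.15 on the training interval, but it misses by more than 6 at x = 3. The fit itself says nothing about where it stops being trustworthy.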

Formal understanding is critical to actually know the limitations of any given system. Treating it as a black box has short lifetime, as your method of analysis will miss key features of the system or oversimplify it.

Understanding modern ML algebra is really "general relativity hard" if not actually harder. Spiking NNs are "quantum physics hard". The math is very much translatable between these domains.

> Treating it as a black box has short lifetime, as your method of analysis will miss key features of the system or oversimplify it.

On the other hand, it can work indefinitely if it solves a problem well enough and it is used always on the same kind of problem.

Sure, you will need to understand it better to apply it to new markets, but often that kind of research is outside the scope of a single business, and they don't need it.

How do you know it solves the problem, instead of just tricking your measure?

Do you know when it doesn't work?

Business does not care about rigor, but then terrible things happen when your face recognition system happens to not work for people with dark skin tone.

Even more terrible things happen when it detects cats as people, and worse once it's used to link data into a police database.

Even pretty bad things happen when it's used for detecting potential (treated as absolute) copyright violations.

Security holes when it's used for detecting security problems.

Loss of business when it's used for ticket prioritization.

Business world does not care that it sells a broken solution as long as it's not obvious and someone has been paid. Everyone else pays for the failures. And you cannot sue an ML system really.

Why don't we do online learning? Each failure is a new training sample.
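For readers who haven't seen it, "each failure is a new training sample" can be sketched with the classic perceptron rule, which only touches the model when a prediction is wrong. This is a toy under stated assumptions: the data stream, the names `predict` and `online_update`, and the separable grid are all hypothetical, and real online learning has to cope with noisy labels and drift, which this does not.

```python
def predict(weights, bias, point):
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else -1


def online_update(weights, bias, point, label):
    """Perceptron rule: change the model only when it gets an example wrong."""
    if predict(weights, bias, point) == label:
        return weights, bias, False
    new_weights = [w + label * x for w, x in zip(weights, point)]
    return new_weights, bias + label, True


# Hypothetical stream: grid points labelled by which side of x + y = 4 they
# fall on (points exactly on the line are skipped, so the classes are separable).
stream = [((x, y), 1 if x + y > 4 else -1)
          for x in range(5) for y in range(5) if x + y != 4]

weights, bias = [0.0, 0.0], 0.0
for _ in range(1000):  # replay the stream until a full pass has no failures
    mistakes = 0
    for point, label in stream:
        weights, bias, was_wrong = online_update(weights, bias, point, label)
        mistakes += was_wrong
    if mistakes == 0:
        break

print("weights:", weights, "bias:", bias, "mistakes in last pass:", mistakes)
```

Note what this does and doesn't buy you: the model only corrects failures that someone noticed and labelled, which is exactly the concern raised upthread about knowing when it doesn't work.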

Why don't we do active learning? Or adversarial networks, or....

Sure it is easy to fail in machine learning. But it doesn't mean that we need to understand how and what our models learn. That's the only big advantage of machine learning: the model learns, so I don't have to.

I don't necessarily disagree (in the sense that all models are wrong but some are useful) but to play devil's advocate, it's damn near impossible to claim a model will work indefinitely on a subset of problems if we don't know its underpinnings.

It's like saying software has been tested completely when we really should be saying it's passed a subset of test cases out of the complete set of possible scenarios.

The only way we can make that claim is if we can interpret the model in a mathematical proof.

> Understanding modern ML algebra is really "general relativity hard" if not actually harder. Spiking NNs are "quantum physics hard". The math is very much translatable between these domains.

That's not my experience at all. What kind of ML requires something beyond basic linear algebra? Being comfortable manipulating matrices is certainly harder than plugging data into an sklearn function, but it's also significantly easier than understanding general relativity.

The NN complexity is not a question of doing the calculations. Go on a 5x5 board would be a very straightforward game; Go on a 19x19 board is not. Similarly, understanding how NNs function at a nuts and bolts level stops being enough as you build larger networks.

> Go on a 5x5 board would be a very straightforward game

I think it is not a good comparison. A small board calls for simpler algorithms: just try out all possible games and choose the most favorable one. For larger boards this simple approach doesn't work any more and you start looking for something different. But the architecture of neural networks doesn't change at all if you move from small to large scales. Only the number and size of layers increases.

Edit: actually, that's the reason they call it "deep learning". it has more layers..

The architecture doesn't change but the way it works often changes dramatically with scale.

But isn't the main theme of NNs that the low level understanding is the best we have? In other words, there is no body of theory we can use to analyse architectural decisions so we just have to go off experimentation and heuristics derived from that.

I think a huge number of people publishing on the field would disagree with that assessment. Which is not to say they are right, but what looks like oversimplification is rarely accurate.

As a curious onlooker to this discussion, could you point to some references for learning the math that applies to emergent behavior in these larger networks?

Breiman is also responsible for ACE (Alternating Conditional Expectation) which is in many ways magical. https://en.wikipedia.org/wiki/Alternating_conditional_expect...

*Nonlinear algebra nowadays. With sprinkling of discrete algebra on top.

(Gated and threshold units.)

I'm not sure about machine learning specifically, but I heard somewhere that a data scientist is someone who does statistics, on a Mac, in San Francisco.

The other one is:

> What’s the difference between statistics and machine learning?

> About $50k a year.

After spending 80% of their time cleaning up garbage input data.

...and after spending 15% plotting graphs and diagrams.

And there's this helpful translator:


Can confirm

Or in Berlin eating a Big Mac.

Now let’s go back to the system admin dungeon stereotype.

Inferential statistics is about explaining an observed outcome in terms of its causing factors. Once we have explained it, then we can make predictions. Machine learning skips the explaining part and goes straight to making predictions, without attempting to understand the underlying process that led to the particular outcome. This would be the main difference, in my opinion.

Yes, this is my observation as well. Put another way, statistics is primarily concerned with understanding the mechanisms behind something. Predictive power can and will often be sacrificed if it aids explanatory power or conceptual elegance.

By contrast, machine learning is primarily concerned with making the best prediction possible, even if that means sacrificing an understanding of the underlying mechanisms.

The Venn diagram of methods can have a fair degree of overlap.

Neither disposition is right or wrong, but they tend to have natural places where each makes more sense. If you’re trying to predict whether a picture is of a cat or a dog, you probably don’t care much about the constituent contribution of factors to one picture’s dogness or catness. On the other hand, if you’re trying to predict traffic collisions based on characteristics of a roadway, you’re probably less concerned with the predicted number of crashes and more concerned with the relative contribution of a handful of independent variables.

If you are trying to predict the failure modes of a cat/dog discriminator on arbitrary images, an understanding of catness/dogness is more useful.

>machine learning skips the explaining part and goes straight to making predictions

what? Then we should call it machine oracle. It uses magic to make the right predictions without any understanding. It's like seeing a strong AI beating the turing test and saying: it was just lucky. I can think only of two possible ways to make right predictions without understanding: luck and cheating. Sure, sometimes ML cheats.

    statistics is about explaining an observed
    outcome in terms of its causing factors.
Isn't it the same for ML? When we train a NN to map input factors to the observed outcome, it will set the weights of the factors that do not influence the outcome to 0.
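The claim about weights going to 0 can be checked in the most favorable case: a single linear layer, noise-free data, and two uncorrelated inputs of which only the first matters. The sketch below is a toy in plain Python; `train_linear`, the grid data and the starting weights are assumptions made for illustration. With noise, correlated inputs or hidden layers, the weights generally end up only near zero, or are not individually interpretable at all.

```python
def train_linear(rows, targets, weights, learning_rate=0.1, steps=500):
    """Plain gradient descent on mean squared error for y ~ w . x (no bias term)."""
    n = len(rows)
    weights = list(weights)
    for _ in range(steps):
        residuals = [sum(w * x for w, x in zip(weights, row)) - t
                     for row, t in zip(rows, targets)]
        for j in range(len(weights)):
            gradient = 2.0 / n * sum(r * row[j] for r, row in zip(residuals, rows))
            weights[j] -= learning_rate * gradient
    return weights


grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
rows = [(a, b) for a in grid for b in grid]   # two input factors
targets = [2.0 * a for a, _ in rows]          # outcome depends on the first factor only

# Start from arbitrary non-zero weights so the second one has to be "unlearned".
weights = train_linear(rows, targets, weights=[0.5, -0.7])
print([round(w, 6) for w in weights])
```

Here the first weight converges to 2 and the second shrinks to (numerically) zero, so in this narrow setting the fitted weights do read like an explanation.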

This is only the case if your model’s features are easy to understand, and one of the motivations for some NN approaches is to avoid having to create your own features.

Causal inference and the ability to explain predictions are also fields of study for machine learning.

This may sound a little like trivializing, but don't we have to know what "statistics" are and what "machine learning" is to say anything about the difference(s) between them?

Looking at this through even the lens of multinomial logistic regression, or of econometrics generally, I don't think that "statistics draws population inferences from a sample, while machine learning finds generalizable predictive patterns" even makes sense as a difference. Any prediction is an inference about the population of future events, or of contemporary events not present in the sample. You plug 1000s of events each described by 100 columns into a logistic regression, and you're hoping to get something predictive out of it. Further, as nice as the idea is that you can tease out "factors" from your 100 columns, you don't have to look at "3.3375905e-5 x (spent five years before age 18 in a smoker's home)" for very long to wonder how much 'explanation' you're getting out of the terms in the exponents of your probability functions.

I still can't resist tweaking ML enthusiasts and data scientists: Statistics is what people who know what they're doing are doing. Machine learning is the rest!

Everybody owes it to themselves to read Breiman's Two Cultures[1]:

> There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

The difference between inference and prediction can be illustrated with something like a decision tree or a random forest. Statistical inference is "the theory, methods, and practice of forming judgments about the parameters of a population and the reliability of statistical relationships, typically on the basis of random sampling." If you look at something like linear regressions, it makes a lot of assumptions about the data[2]: the distribution of residual errors is normal, there's no multicollinearity, etc.

Random forest don't care. Random forest is a set of steps. You follow the steps, you get an answer. You made no assumptions about your data, about the distribution of it, any of it. You just followed an algorithm.
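To show what "a set of steps" looks like, here is a stripped-down relative of a random forest: bootstrap aggregation of one-split regression trees (stumps) on a single input. It is a sketch, not Breiman's algorithm. A real random forest grows deeper trees and also subsamples features at each split. All names and the step-shaped toy data are made up for illustration.

```python
import random


def fit_stump(xs, ys):
    """Find the single threshold on x that minimises squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # every x in the sample was identical: predict the mean
        mean = sum(ys) / len(ys)
        return float("inf"), mean, mean
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees, rng):
    """Step 1: resample rows with replacement. Step 2: fit a stump. Repeat."""
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Step 3: average the individual predictions."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Hypothetical data: a step from 0 to 1 at x = 0.5.
xs = [i / 20 for i in range(20)]
ys = [0.0 if x < 0.5 else 1.0 for x in xs]
forest = fit_bagged_stumps(xs, ys, n_trees=25, rng=random.Random(0))
print(predict_bagged(forest, 0.1), predict_bagged(forest, 0.9))
```

Nowhere does the procedure posit a distribution for the residuals. The choices it does make (axis-aligned splits, squared error, averaging) are still choices, which is where the "implicit assumptions" argument comes in.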

Algorithms are powerful! Not everything needs to be an inference problem. We're better off when we have lots of tools. You don't have to choose _between_ the two camps. But the "Everything is Statistics!" and the "Everything is Machine Learning!" points of view rob us of ways of thinking about our tools that help us understand what those tools are.

1) https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...

2) https://thestatsgeek.com/2013/08/07/assumptions-for-linear-r...

Nice comment, but there's more nuance than you claim.

> You made no assumptions about your data, about the distribution of it, any of it. You just followed an algorithm.

You made implicit assumptions that you are now unaware of, which might come and bite you later (eg: using zip codes or names as a proxy for race in credit scoring models, neural networks overfitting to texture and classifying a leopard print couch as a leopard). This means that you are liable to overfit to the data, and generalize poorly, or in ways you are not supposed to.

This is what leads to the perspective that "machine learning" might end up being a powerful tool for laundering bias: https://idlewords.com/talks/sase_panel.htm

I think what you're saying is orthogonal to what I'm saying.

Yes, ZIP codes can be a proxy for race in models dealing with credit scores (and recidivism, which is also a really bad place to put racial bias), as an example. But if I put it in a mixed-effects model, it shows the same bias, and a mixed-effects model is just an extended version of linear regression. Both statistical and ML models suffer from the problem you're stating.

What you have not made any assumptions about in a random forest is about the distribution of the data you're looking at. One example of a case where the assumptions that bog-standard OLS makes about your data can cause you problems is zero-dominated data -- data with a lot of zeros in it. Basically any time you're trying to make predictions about things that are rare in your measured population.

OLS does a bad job on zero-dominated data. If you throw a zero-dominated dataset into a random forest, you will get back better answers than if you use OLS on zero-dominated data.

To be clear: there are strategies for dealing with zero-dominated data using statistical inference. You don't have to resort to non-inferential learning just because you have data that doesn't look like a bell curve. But machine learning is a powerful way to get pretty good results on a lot of problem spaces without having to understand the probability function involved (or in cases where the probability function is too complicated to be tractable computationally, like the probability function that determines the color of pixels in a dataset where you're classifying dogs versus cats).

> machine learning is a powerful way to get pretty good results on a lot of problem spaces without having to understand the probability function involved

The NFL theorems essentially state that for every dataset on which an algorithm generalizes well, there is another on which it generalizes poorly. So, there are always implicit biases, even for a random forest model (eg: if you try to model Boolean functions with a random forest, those functions which can be approximated effectively with few trees will form a set of very small measure. Specifically, it is my intuition that those which have many terms in the sum of products form might need many trees to approximate). It then becomes a question of whether the bias of your models is compatible with the dataset/domain under consideration.

See this Minsky-Sussman koan: http://www.catb.org/jargon/html/koans.html#id3141241

> Both statistical and ML models suffer from the problem you're stating.

And that is precisely why I don’t see how ML algorithms are “more powerful” than statistics in any way, as per your claim.

> And that is precisely why I don’t see how ML algorithms are “more powerful” than statistics in any way, as per your claim.

Well I'm glad you don't see that, but I'm a bit confused why you think I said it because I didn't say it and don't believe it. I said that ML is "powerful," not "more powerful."

That seems to paint random forests as too magical. Rather, to me, random forests are just assuming that there is a BDD to get your answer. You aren't making statistical assumptions of the data, per se. However, you are assuming you can keep doing some sort of split on the data. Such that lower levels of the tree should be somewhat interpretable. Same for the highest levels. It is just the walk from high to low that is hard to fully explain. Right?

I don't think random forests are "magical" at all. I think you can walk a non-technical audience from decision trees to bagged and boosted forests on a whiteboard in under an hour. (I don't think you can do this nearly as well for maximum likelihood estimation, honestly.) But there's still a difference between a mathematical function that you're trying to find the optimal values for and something that's defined by an algorithm that there's no single function to optimize.

But, at the end of the day, you can get some easy to understand statistics from building a decision tree. Based on population makeup, you'll know solid numbers on how many will be expected to hit each outcome.

Such that optimizing some other function based on the application of a decision tree is effectively a statistical question. A natural one.

Indeed, I'm not sure I see the difference. One is just explainable based on understandings of distributions and trusting that they hold. One is explainable based on understanding of decisions, and trusting those remain the best decisions.

I used my bike ride as an example earlier. It is made up of many decisions to get home. Knowing what all of those decisions are can help build up a solid range of when I'm likely to get home. For the level of accuracy I typically care about, so does just knowing the rough distribution of how long I typically take.

How are both of those not statistical in nature? One is just more fully exploring a space with ridiculous computational power, whereas the other is generalizing to a much quicker answer. Right?

Breiman is right and wrong at the same time.

Powerful tools like nonlinear polynomial models, OLS, HMM and RBMs were implemented and devised by statisticians. And not from tribe 2. The difference is the data model is explicit but general.

Have fun trying to model extremely nonlinear physical effects with statistics

Oh yes, we just include all the possible nonlinear effects and pick the ones which look meaningful [0][1]

Then we go back and perturb the model to do some hand wavey guessing at real actionable insights

but hey clients are impressed by overfitting so who the fuck cares as long as the money is coming in and the press releases are going out!!!!

[0] https://www.featuretools.com/ [1] https://scikit-learn.org/stable/modules/generated/sklearn.pr...

That's easy - statisticians take pride in models that are understandable, while machine learning practitioners take pride in models that are not.

I'd change that to say machine learning practitioners take pride in performance. If the best method of predicting planetary motions is to model them as a hierarchy of triangular epicycles then that's exactly what a machine learning practitioner will do.

That's kind of my point, that they have different tradeoffs, although I do not hide my preference for explainability (or model soundness). :-) The example you give reminds me of Chomsky and Norvig debate on the topic.

It reminds me of Dirac on Poetry:

>In science one tries to tell people, in such a way as to be understood by everyone, something that no one ever knew before. But in poetry, it's the exact opposite.

I would take out the snark and say:

statisticians take pride in models that are understandable, while machine learning practitioners take pride in models that have high accuracy.

Statisticians take pride in models that are understandable, while machine learning practitioners take pride in models that are profitable.

No, accuracy is but one possible goal. "Machine learning practitioners take pride in models that optimize the value of that model."

And since we cannot define exactly what this value is (accuracy, RMSE, bias or a specific confusion matrix), we have to make a more abstract definition.

We are now left with a question. Is this abstract definition quantifiable? If so, we can still exclude the statistician w.r.t. understandably. However, if we allow qualitative value as well, the traditional statistician is back at the table.

Now, a further refinement needs to be made. Since fairness, accountability and transparency in Data Science are in the limelight, we can make a point that our ability to understand is a key metric for Machine Learning models as well. It is interesting to see the tendency to associate words or phrases to key neurons in embeddings. For example, in embeddings of faces, we can associate gender, ethnicity, fatness and gaze with some of the embedded neurons.

Understandable? Two words: statistical significance :P I've seen university employees who still don't understand what it is and can't explain it...and just about everyone who uses it gets it wrong...

Not sure why you're downvoted. Also noticed that everyone uses statistics because you're expected to, but hardly anyone knows how to do it right.

Most scientists who use statistics are not statisticians.

95% of scientists who use statistics are not statisticians, plus or minus 5%, 19 times out of 20.

"statistical significance" - it has too many problems. A colleague once complained that his test kept flipping back and forth between statistically significant and not significant. I know it's not how it is supposed to work, but most people using it don't know (or care) how to use it. My opinion: just don't use this concept at all.

On the +J+

Statistics is science and ML is engineering.

I don't know. My former ML profs do a lot of science too...

The main difference is not in techniques, but in goals. The primary goal of statistics is to help a human make an informed decision by quantifying uncertainty. That quantification of uncertainty can be used to explain to other humans why the decision was made.

The primary goal of machine learning is to help a machine make better decisions. As long as it gets the "right" answer, the explanations to humans are not as important.

I think that's close, but disagree with the uncertainty quantification bit. Often in both fields, you'll have a proof that "errors" converge at some asymptotic rate. Often in both fields, when you apply that tool to a finite data set, you don't know the error magnitude in this instance. I don't mean exact error, in the sense of 'if I knew the error, I could correct for it.' I mean you know big-O notation error, but not the constant factors.
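The gap between knowing the rate and knowing the constant can be written out. As an illustration only: the $n^{-1/2}$ rate and the symbols $\hat{\theta}_n$, $\theta$, $C$, $n_0$ are assumed for the example, and the statement is written deterministically for simplicity, whereas real results of this kind are usually probabilistic.

```latex
\[
  \bigl\lvert \hat{\theta}_n - \theta \bigr\rvert = O\!\left(n^{-1/2}\right)
  \quad\Longleftrightarrow\quad
  \exists\, C > 0,\ n_0 :\
  \bigl\lvert \hat{\theta}_n - \theta \bigr\rvert \le C\, n^{-1/2}
  \ \text{ for all } n \ge n_0 .
\]
```

The theorem pins down the exponent. $C$ and $n_0$ typically depend on the unknown data-generating process, so for one finite data set the bound does not translate into a number.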

I do like the rest of the point you make, and it seems to match OP. Inference vs otherwise. Helping people learn/act/decide based on data vs helping machines learn/act/decide based on data.

Reminded me of this [1] by Prof. Rob Tibshirani.

[1]: https://statweb.stanford.edu/~tibs/stat315a/glossary.pdf

I like Tom Mitchell's definition in his machine learning book:

'A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.'

Merriam Webster's definition of statistics:

'a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data'

So establish a couple of definitions, and take it from there.

I believe such discussions can easily be taken too far: surely the point of a subject is to group together related topics. Who is to say that the set of topics comprising multiple subjects need to be disjoint?

The amount of hair-splitting that goes into discussing this subject is unbelievable. Clearly, they both are fairly closely intertwined - particularly given that one potential explanation boils down to the motivation of the user.

So do we need an explicit taxonomy to say that one application is machine learning and another is statistics? Or does the form not matter as much as the function?

Pace the OP, who I'm sure didn't do this with this end in mind, but a lot of the cases I've seen of people trying to bring this up come down to wanting to self-identify in a certain way as opposed to actually talking about the subject at hand.

I think it’s actually both important and possible to differentiate between stats and ML (and AI for that matter). Statisticians, ML engineers, and data scientists all have related but distinct skill sets, knowledge, methodological experience, and worldviews. The identification can be important because it suggests what sorts of tasks and projects an individual is well-suited for. (Notwithstanding the fact that I believe anyone can acquire skills/knowledge to be competent at any of those jobs and transition between them with a bit of effort).

For example, consider the work that the FDA does in evaluating clinical drug trial data. In this case, you need statisticians on staff that can critically analyze the statistical/methodological rigor, not to mention understand the medical domain and relevant regulations/processes. It doesn’t matter if they can only use the GUI versions of SAS and have never heard of TensorFlow, their time is better spent deeply understanding nuances of experimental design, sampling methods, causal analysis, estimation, etc. I would argue that a “good” statistician in this context might very well be “not so good” at data analysis, as long as they can clearly and accurately critique experiments and analyses and identify when things go wrong. Also consider—what would happen if you slotted a ML engineer into a role like this?

ML engineers are best at optimizing performance (not just accuracy) at some defined task, where ideally the cost of the model being wrong in some exotic way is not astronomical (e.g. approving a dangerous drug, convicting an innocent person, initiating a stock crash, corrupting the attention span of the human race). A good use case for machine learning comes from the book Pattern Recognition—sorting fish on a conveyor belt. No one gives an ounce of chum about statistical rigor in this case, and the consequences of being inaccurate are quantifiable and manageable. But building and tuning a ML system to do this at an acceptable performance level requires a ton of work. You’d need to define the sensor array / inputs, collect and label data, engineer features (more relevant before deep learning, in the case of computer vision), train and evaluate an object detection model, avoid overfitting the training data, build an anomaly detector to weed out the stray crab, ensure it works fast enough to be used in production, turbo-browse arXiv to make sure your stuff isn’t obsolete (damn, it is). This is a vastly different focus than statisticians, who wouldn’t get very far with this sort of performance-optimization problem.

Data scientists are somewhere in between a statistician and ML engineer—like a hybrid class in an RPG. They can cast a few key ML spells and have dabbled in arcane math, but can also slice up data goblins by hand (maybe even with style). With a foundation in a bit of everything, they can glue a team together and flexibly grow in multiple directions. But they can also become a waste of shiny gold if pitted against a specialized task without support from, you know, specialists.

I can't speak for all of machine learning, but classic statistics has some probabilistic assumptions and uses those to prove some theorems about the results of how statistical methods manipulate data.

E.g., in regression analysis, the assumptions are (i) there really is a linear mode with the variables to be used; (ii) typically the data will not fit the model exactly and instead there are errors; (iii) the errors are in the sense of probability independent, all have the same distribution which is Gaussian with mean zero. Then estimate the regression coefficients and get, say, an F ratio statistic on the fit and t-tests on the coefficients. For details can see, say, Mood, Graybill, and Boes, or Morrison, or Draper and Smith, or Rao, etc. If you need detailed references, request them here, but these are all classic references going back decades. In that case, maybe some of current machine learning does qualify as statistics.
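Written out, assumptions (i) to (iii) and the two test statistics look roughly like this. The notation ($p$ predictors, $n$ observations, SSR and SSE for the regression and residual sums of squares) is the common textbook convention and is assumed here, not taken from the comment.

```latex
\[
  y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
  \qquad
  \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2),
  \qquad i = 1, \dots, n .
\]
\[
  F = \frac{\mathrm{SSR}/p}{\mathrm{SSE}/(n-p-1)},
  \qquad
  t_j = \frac{\hat{\beta}_j}{\operatorname{se}(\hat{\beta}_j)},
  \qquad
  \mathrm{SSR} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2,
  \quad
  \mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]
```

Under those assumptions, $F$ follows an $F_{p,\,n-p-1}$ distribution when all slopes are zero, and $t_j$ follows a $t_{n-p-1}$ distribution when $\beta_j = 0$. That is the kind of theorem about the results that the assumptions buy.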

Here regression was just an example, however, apparently especially close to much of current machine learning. But there is no end of (i) having data, (ii) having some probabilistic assumptions, (iii) manipulating the data, (iv) getting results, and (v) proving some theorems about the probabilistic properties of the results.

Might argue that in machine learning the results of the training data on the model yields some estimates of some probabilistic properties that can be used to make probabilistic statements about the results of the trained model on future real data.

To a rough approximation they differ in the theorems/results they care about. In statistics one would care about consistent estimate of parameters -- as sample size goes to infinity and the model is such and such here is an estimator that will converge (according to some interesting mode of convergence) to the true parameter generating the data.

In ML one wouldn't care much about recovering the parameters. The results/theorems of interest would be that with large enough samples the predictions and the new data will converge (according to some interesting mode of convergence). If this comes at a cost of doing poorly in terms of parameter recovery, ML wouldn't be bothered.

According to ML the cycloids and epicycloid based geocentric model of planetary motion would be perfectly acceptable.
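One way to write the two kinds of result side by side. This is a sketch with assumed notation: $\theta_0$ is the true parameter, $\hat{f}_n$ the fitted predictor, $\ell$ a loss function, and convergence in probability stands in for "some interesting mode of convergence".

```latex
\[
  \text{statistics (parameter recovery):}\qquad
  \hat{\theta}_n \xrightarrow{\;p\;} \theta_0
  \quad\text{as } n \to \infty,
\]
\[
  \text{ML (prediction):}\qquad
  R(\hat{f}_n) - \inf_{f} R(f) \xrightarrow{\;p\;} 0,
  \qquad
  R(f) = \mathbb{E}\bigl[\ell\bigl(f(X), Y\bigr)\bigr].
\]
```

The second statement can hold even when no parameter of $\hat{f}_n$ corresponds to anything in the process that generated the data, which is the epicycle situation.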

There are non-parametric statistical methods too though (such as bootstrap methods) which don't make such assumptions.

Have two calculus teachers, A and B, each with 20 students. Look at the final exam numbers. Put all the numbers in a bucket, stir briskly, draw out 20 numbers (test scores) and average, average the other 20, and get the difference in the averages. Do this many times. Get the empirical distribution of the differences in the averages.

Now look at the difference in the actual average for A and B. If this is out in a tail of the empirical distribution, then we reject the null hypothesis that the two teachers are equally good.
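The bucket-stirring recipe translates almost line for line into code. A minimal sketch in plain Python follows; the exam scores are invented, the function names are mine, and using a two-sided comparison with an add-one correction is one common choice rather than the only one.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(scores_a, scores_b, n_resamples, rng):
    """Two-sample permutation test on the difference in group means."""
    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)                      # "stir briskly"
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Add-one correction: counts the observed labelling itself, so p is never 0.
    p_value = (at_least_as_extreme + 1) / (n_resamples + 1)
    return observed, p_value


# Made-up exam scores for illustration only; not data from the thread.
teacher_a = [62, 71, 55, 80, 67, 74, 59, 90, 66, 73,
             58, 84, 69, 77, 61, 70, 65, 88, 72, 63]
teacher_b = [68, 75, 60, 85, 70, 79, 64, 93, 71, 78,
             62, 89, 74, 81, 66, 76, 69, 91, 77, 67]

observed, p_value = permutation_test(teacher_a, teacher_b, 5000, random.Random(0))
print(f"observed difference in means: {observed:.2f}, p-value: {p_value:.3f}")
```

A small p-value means the actual difference sits far out in the tail of the shuffled differences. Shuffling is only justified if the scores are exchangeable under the null hypothesis, so the procedure is distribution-free but not assumption-free.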

For this non-parametric, distribution-free, resampling, two-sample test, to prove theorems about it, which we should, we will likely need at least an independence assumption and likely an i.i.d. (independent, identically distributed) assumption. Else maybe each student of teacher B is an older sibling of a student of teacher A!!!!

In a nutshell, what's going on in statistical hypothesis testing is that we make the null hypothesis, and that gives us enough assumptions, e.g., i.i.d., to calculate the probability of our calculated test statistic, e.g., the difference in the two averages, being way out in a tail. Without some such null hypothesis assumptions, we have no basis on which to reject anything, are not testing anything.

There's a chance of getting all twisted out of shape philosophically here: E.g., I outlined one statistical hypothesis test for the two teachers A and B. Okay, now consider ALL reasonably relevant hypothesis tests: Maybe on the test I outlined, the two teachers look very different, not equal, maybe teacher B better. But among ALL those hypothesis tests, maybe in one of the tests the two teachers look equally good or even teacher A looks better. Now what do we do? That is, there is a suspicion that teacher B looked better ONLY because of the particular test we chose. Maybe there has been some research to clean up this issue.

Sorry, was way too busy and typed way too fast and had LOTS of typing errors.

Errata: Replace "mode" with "model"!

Statistics is a set of theories and methods that can be successfully executed on any list of numbers and provably provides mathematically valid predictions under a set of conditions that never occurs in the real world.

This means, more precisely, statistics only works on data series that obey the law of large numbers, and combining/using 2 or more statistical predictions nearly always requires total independence of the predictions and their inputs, which is never the case (for one thing, they always occur on the same planet). Furthermore, the reason people make statistical predictions is to change the outcome, but doing anything to change the outcome always invalidates the statistical method used to collect the data. There are a couple of things statistics never does. Used correctly, it can never predict extreme values. It can never correctly predict values in systems that are too complex, where too many independent variables determine the outcome. And "too many" is something like 50 to 500. It can never correctly be used to verify if a deliberate change worked.

Machine learning is much the same, except it never provides mathematically valid predictions.

Despite this, it should probably be mentioned that both do provide useful results, occasionally getting things very, very wrong.


I think a better question is what is the difference between a non linear regression and machine learning!

Statistics is good for modelling things that are simplistic & low dimensional. Machine learning is good for modelling things that are nuanced & high dimensional. Statisticians want to understand things but imho overestimate human ability to make sense of a complex world. Machine learning people want the machine to understand the things for us and then teach us about it when it's making us breakfast.

Whatever it is, I imagine it's similar to the difference between math, stats, ml, ai, logic, econometrics, actuaries, epidemiology, etc.

For the record, I don't consider those intellectually separate fields, but do accept them to be culturally separate, for better or worse...

But then I could never understand why physics, biology or chemistry were considered separate fields either...

Or psychology, economics, philosophy, etc etc etc...

They're considered separate fields because they focus on different problems which are amenable to different techniques, leaving their expert practitioners with very different knowledge bases. You're right that it is a cultural distinction, but that doesn't mean it isn't an important or practical one.

One actually knows what's going on.

My $0.02: statistics is generally not very "sensitive" to underlying data -- it deals with small numbers of dimensions/features and lots of similar examples and classification is about falling within easily understood bounds.

Machine learning, on the other hand, can be extremely "sensitive" to underlying data -- it deals with high numbers of dimensions/features on potentially extremely sparse data sets, and a small change in one of them could result in a radically different classification. The potential accuracy is far, far higher but so is the risk of overfitting.

Or: statistics looks for ranges, machine learning looks for patterns.

I agree that in general statistics is more concerned about inference, while machine learning focuses on prediction. On the other hand, there is such a huge overlap between the fields that it is hard to make a distinction. Also there are statistical fields which focus more on prediction, the same way ML does. For example, in geostatistics prediction is often the only goal, e.g. to predict heavy metal concentration across a domain. People accept that it is impossible to explain every bit of spatial variation and just model it by a Gaussian process (the same Gaussian process used in ML).

Statistics + Computer Science = Machine Learning?

I think to me that's the case. But also, Machine Learning describes a computer science goal, to have a computer that learns certain or ideally all parts of its algorithm on its own, either from example or by experimentation.

It just so happens that some of the ideas from statistics lend themselves to help computer science implement such machines that can learn. One can imagine techniques for Machine Learning being discovered in the future which leverage ideas from other fields apart from statistics.

One difference is that in Machine Learning you must think of data structures and algorithms, i.e. the practical ways to compute a model: how to represent and transform data while building a model. I think this is given less emphasis in statistics. Standard models are often used and theory is built around these different models, for example aspects such as power calculations for a regression model.

Can someone correct me, if I'm wrong, but statistics doesn't allow Turing-complete models, right? Machine learning certainly allows that (for example an RNN).

This is kind of a fun question because it places the notion of “computational” computability (classical CS) alongside that of classical computability (finding a model by, say, minimizing an error which in turn boils down to calculus or linear system solution).

Long story short, because of the power of real numbers — with each one encoding an infinite sequence of bits — it’s not clear that even “simple” computations important to elementary statistical models, like exponentiating, have low computational complexity. Computing, say, “e” can be done by a Turing machine, but it’s nontrivial.

Another observation along the same lines is that some simple statistical models turn out to be really complex, like “find a such that the model

  y = sin (a x)

fits sample data (x1, y1),..., where abs(y) < 1”. This simple model class is well known to have infinite VC dimension, because a can be arbitrarily large.
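A quick numerical check of the infinite VC dimension claim, as a sketch only. It treats sign(sin(a x)) as a classifier and uses the standard textbook construction (not something spelled out in this thread): take points x_i = 10^-i and, for labels y_i in {+1, -1}, set a = pi * (1 + sum of 10^i over the i with y_i = -1). The function names are made up for this example, and floating point limits it to small n.

```python
import math
from itertools import product


def shattering_frequency(labels):
    """Pick a so that sign(sin(a * 10**-i)) equals labels[i-1] for each i.

    labels is a sequence of +1 / -1 values (textbook construction).
    """
    # (1 - y) // 2 is 0 for y = +1 and 1 for y = -1
    digits = sum(((1 - y) // 2) * 10**i for i, y in enumerate(labels, start=1))
    return math.pi * (1 + digits)


def classify(a, x):
    """Classifier induced by the model y = sin(a x): the sign of the sine."""
    return 1 if math.sin(a * x) > 0 else -1


def shatters(n):
    """Check that every one of the 2**n labelings of n points is realized."""
    points = [10.0**-i for i in range(1, n + 1)]
    for labels in product([1, -1], repeat=n):
        a = shattering_frequency(labels)
        if any(classify(a, x) != y for x, y in zip(points, labels)):
            return False
    return True
```

`shatters(n)` tries all labelings of n points. Each extra point needs a roughly ten times larger, which is the "a can be arbitrarily large" part of the argument.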

Another way to say it may be that Turing completeness is not a very sharp tool to separate model classes.

I found this helpful: http://www.cs.cmu.edu/~lblum/PAPERS/TuringMeetsNewton.pdf

Depends what you mean. At the least, you can easily use statistics to talk about how well an rnn works. It may be that we can reduce an rnn to a distribution, someday.

It does seem true that statistics are more suited to describe analytical models. But, that seems as much a quirk of history as a foregone conclusion.

As an easy example, I can give you the statistics of my bike ride. Such that you can get a pretty solid understanding of what my next week's worth of rides will be like, per the parameters used in the description. This does basically nothing to help you build a bike. Or make the ride yourself. So too, would a statistical model of an rnn be.

And indeed, this is no different than a statistical model for how often humans will make mistakes in any process. Or a model for how many students will successfully learn a topic.

One will give you a grant (if you're lucky), the other will give you multi-million dollar funding.

the difference is branding

Everything's a function, right?

From Michael Jordan’s reddit AMA

I personally don't make the distinction between statistics and machine learning that your question seems predicated on.

Also I rarely find it useful to distinguish between theory and practice; their interplay is already profound and will only increase as the systems and problems we consider grow more complex.

Think of the engineering problem of building a bridge. There's a whole food chain of ideas from physics through civil engineering that allow one to design bridges, build them, give guarantees that they won't fall down under certain conditions, tune them to specific settings, etc, etc. I suspect that there are few people involved in this chain who don't make use of "theoretical concepts" and "engineering know-how". It took decades (centuries really) for all of this to develop.

Similarly, Maxwell's equations provide the theory behind electrical engineering, but ideas like impedance matching came into focus as engineers started to learn how to build pipelines and circuits. Those ideas are both theoretical and practical.

We have a similar challenge---how do we take core inferential ideas and turn them into engineering systems that can work under whatever requirements that one has in mind (time, accuracy, cost, etc), that reflect assumptions that are appropriate for the domain, that are clear on what inferences and what decisions are to be made (does one want causes, predictions, variable selection, model selection, ranking, A/B tests, etc, etc), can allow interactions with humans (input of expert knowledge, visualization, personalization, privacy, ethical issues, etc, etc), that scale, that are easy to use and are robust. Indeed, with all due respect to bridge builders (and rocket builders, etc), but I think that we have a domain here that is more complex than any ever confronted in human society.

I don't know what to call the overall field that I have in mind here (it's fine to use "data science" as a placeholder), but the main point is that most people who I know who were trained in statistics or in machine learning implicitly understood themselves as working in this overall field; they don't say "I'm not interested in principles having to do with randomization in data collection, or with how to merge data, or with uncertainty in my predictions, or with evaluating models, or with visualization". Yes, they work on subsets of the overall problem, but they're certainly aware of the overall problem. Different collections of people (your "communities") often tend to have different application domains in mind and that makes some of the details of their current work look superficially different, but there's no actual underlying intellectual distinction, and many of the seeming distinctions are historical accidents.

I also must take issue with your phrase "methods more squarely in the realm of machine learning". I have no idea what this means, or could possibly mean. Throughout the eighties and nineties, it was striking how many times people working within the "ML community" realized that their ideas had had a lengthy pre-history in statistics. Decision trees, nearest neighbor, logistic regression, kernels, PCA, canonical correlation, graphical models, K means and discriminant analysis come to mind, and also many general methodological principles (e.g., method of moments, which is having a mini-renaissance, Bayesian inference methods of all kinds, M estimation, bootstrap, cross-validation, EM, ROC, and of course stochastic gradient descent, whose pre-history goes back to the 50s and beyond), and many many theoretical tools (large deviations, concentrations, empirical processes, Bernstein-von Mises, U statistics, etc). Of course, the "statistics community" was also not ever that well defined, and while ideas such as Kalman filters, HMMs and factor analysis originated outside of the "statistics community" narrowly defined, they were absorbed within statistics because they're clearly about inference. Similarly, layered neural networks can and should be viewed as nonparametric function estimators, objects to be analyzed statistically.

In general, "statistics" refers in part to an analysis style---a statistician is happy to analyze the performance of any system, e.g., a logic-based system, if it takes in data that can be considered random and outputs decisions that can be considered uncertain. A "statistical method" doesn't have to have any probabilities in it per se. (Consider computing the median).

When Leo Breiman developed random forests, was he being a statistician or a machine learner? When my colleagues and I developed latent Dirichlet allocation, were we being statisticians or machine learners? Are the SVM and boosting machine learning while logistic regression is statistics, even though they're solving essentially the same optimization problems up to slightly different shapes in a loss function? Why does anyone think that these are meaningful distinctions?

I don't think that the "ML community" has developed many new inferential principles---or many new optimization principles---but I do think that the community has been exceedingly creative at taking existing ideas across many fields, and mixing and matching them to solve problems in emerging problem domains, and I think that the community has excelled at making creative use of new computing architectures. I would view all of this as the proto emergence of an engineering counterpart to the more purely theoretical investigations that have classically taken place within statistics and optimization.

But one should definitely not equate statistics or optimization with theory and machine learning with applications. The "statistics community" has also been very applied, it's just that for historical reasons their collaborations have tended to focus on science, medicine and policy rather than engineering. The emergence of the "ML community" has (inter alia) helped to enlarge the scope of "applied statistical inference". It has begun to break down some barriers between engineering thinking (e.g., computer systems thinking) and inferential thinking. And of course it has engendered new theoretical questions.

I could go on (and on), but I'll stop there for now...
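As a small illustration of the point in the AMA that the SVM and logistic regression differ mainly in the shape of the loss function, here is a sketch of the two standard losses written as functions of the margin m = y * f(x), with y in {+1, -1}. The function names are just for this example.

```python
import math


def hinge_loss(margin):
    """Loss minimized by the (soft-margin) SVM: zero once margin >= 1."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Loss minimized by logistic regression: smooth, never exactly zero."""
    return math.log1p(math.exp(-margin))


if __name__ == "__main__":
    # Print both losses side by side for a few margins
    for m in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

Both penalize confidently wrong predictions (very negative margins) roughly linearly and both decrease as the margin grows. The hinge loss cuts off at exactly zero while the logistic loss only approaches it.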

The difference is in how the output is evaluated.

An ML model is evaluated empirically. You compare to real world results and get an accuracy measure. An ML model tells you what something should be. And if it’s a good model, you will get a pretty high frequency of that model telling you what the thing is.

Statistics does something entirely different. It tells you what something could be.

If you flip a coin, an ML model will tell you if it’s heads or tails. Statistics will tell you how often it will be heads or tails.

Another way to think about it is the difference between probability and likelihood.

Probability is measured by your theoretical priors and hypotheses. Likelihood is measured by the results of actual trials.

The probability of a fair coin landing on heads is .5

But the likelihood of that happening isn’t actually .5 because pure frequentist probabilities depend on some fundamentally problematic things. Like a performative infinite number of trials.

The actual line between ML and statistics is really blurry because all useful statistical models are at least a little Bayesian. Priors get updated with each trial. This is essentially machine learning.
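"Priors get updated with each trial" can be made concrete with the coin example. The sketch below assumes a Beta prior on the probability of heads, which is the conventional conjugate choice but is not something the comment itself specifies; the function names are hypothetical.

```python
def update_beta(alpha, beta, flips):
    """Update a Beta(alpha, beta) prior on P(heads) one flip at a time.

    flips is an iterable of "H" / "T". With a Beta prior, each trial just
    adds one to the matching count.
    """
    for flip in flips:
        if flip == "H":
            alpha += 1
        else:
            beta += 1
    return alpha, beta


def posterior_mean(alpha, beta):
    """Current point estimate of P(heads) under Beta(alpha, beta)."""
    return alpha / (alpha + beta)


if __name__ == "__main__":
    # Start from a flat Beta(1, 1) prior and feed in some flips
    alpha, beta = update_beta(1, 1, "HHTHHTHHHT")
    print(alpha, beta, posterior_mean(alpha, beta))
```

Starting from a flat Beta(1, 1) prior, the estimate moves toward the observed frequency of heads as flips come in, and the prior matters less the more trials there are.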

Outside of mostly bad/soft sciences (sociology, psychology, neuroscience, nutrition, and climatology are all pretty godawful about abusing classical statistics) pure frequentist statistics don’t get used much because they are really only useful for getting papers published and generating squawking headlines.

Most useful statistical methods are machine learning methods. Specifically, they are applied Bayesian methods with weak, randomized priors. Which is exactly what ML is.

I sound like I hate statistics. I don’t really. ML models can do a bunch of wacky things. There isn’t a coherent theory behind an ML model. You could point a very good classifier at your wife, and it might tell you [(bird,.1), (apple,.3), (woman,.9)]

That’s not a realistic interpretation of what could be. It just happened to get that correct.

A really excellent ML model in 2016 could’ve given the following result for president [(trump,.8), (obama,.6), (rock,.5)]. And after the fact when we can compare it to what happened, it would seem accurate.

But that doesn’t tell us the range of possibilities in our future. The reality in 2016 was that we weren’t going to elect a fucking rock as president. There was no chance of that. There is zero chance that if I point my camera at my girlfriend, she might actually be a potato. Yeah, the ML model might be right because it has guessed right, and that’s often all we care about.

But if I need to know what my chances are of my girlfriend becoming my wife or the mother of my children, that’s where we need statistics. An ML model can’t have those kinds of priors baked in unless you force it. And if you do that you’re just paying someone to do some really expensive Bayesian regression.

No, an ML model is not evaluated empirically at all. Measures like accuracy and precision are not empirical. Neither is generalization as evaluated by using some model data set. These measures are statistical predictions that may or may not be correct.

Failure rate in the wild is empirical. Accuracy in the wild, the same.

Since the designer does not have access to real data, they are actually not working empirically at all and as such you get common overfitting and methods that work well on model datasets but not on real data.

You know that data scientists have access to real data, right? It's not all just kids doing tricks on kaggle? That we have the ability to actually measure shit?

Do you think that all of this is a total joke? I mean, you're allowed to think that. But, umm, yeah. Real people do real things with real data. Believe it or not.

statistics - descriptive, ML - predictive

Machine Learning = Heuristics + Statistics

In short:

Machine Learning - making predictions

Statistics - distilling huge amounts of data into a few indicators

Machine Learning uses algorithms, optimization/operations research, differential calculus, probability and statistics as it sees fit.

Statistics makes predictions too. Indicators are usually predictions - a mean is called an expected value for a reason.

I find the spectacle of machine-learning practitioners and statisticians furiously using different language to agree with each other quite amusing.
