Translating Between Statistics and Machine Learning (cmu.edu)
143 points by BillPollak 6 months ago | hide | past | web | favorite | 22 comments

It's fascinating to see the differences in language around statistics across disciplines. I used to work in a group where my colleagues came from different backgrounds (one had a PhD in particle physics, another a PhD in stats, and the third a PhD in economics). We were all hired around the same time. We spent the first month learning how to communicate with each other about basic statistical concepts so that we were on the same page.

I wonder if the economics PhD had the most difficulty due to graph axis inversion[1].

[1] https://en.wikipedia.org/wiki/Wikipedia:Reference_desk/Archi...

That didn't really come up. We were doing institutional research for a university to support senior leaders. Most of our work entailed working with transactional HR data, educational outcomes data, and other information collected at the university to provide policy analysis and strategic recommendations.

The stats and econ PhDs came from the same school, so they had more shared vocabulary, but they definitely thought about problems differently. The physics colleague came from Europe and also thought about problems at a much different scale. So a lot of the initial time was spent deriving proofs so that everyone felt comfortable with the different statistical methods used for analysis. The data we worked with was often small scale, sparse, and not really IID. The proofs all essentially converged, but it was interesting because different language and assumptions were used depending on their distinct disciplinary backgrounds.

Sorry for the vague speak. A lot of the work we did was confidential, so can't really talk specifics.

A few others I've noticed:

    Statistics | Machine Learning
    Dummy variable | one-hot encoding
    Fitting a model | training a model

I still hear fitting a model, but usually it's just shorthanded to "training."
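The dummy-variable / one-hot correspondence in the table above can be sketched in a few lines of Python (the category list here is hypothetical, just for illustration):

```python
# A "dummy variable" (stats) and a "one-hot encoding" (ML) are the same
# construction: one indicator column per category.

def one_hot(value, categories):
    """Return an indicator vector with a 1 in the slot for `value`."""
    return [1 if value == c else 0 for c in categories]

colors = ["red", "green", "blue"]
print(one_hot("green", colors))  # [0, 1, 0]
```

One practical wrinkle: statisticians usually drop one category's column to avoid collinearity with the intercept (dummy coding), while ML pipelines often keep all the columns.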

I love how Machine Learning has just taken over Decision Theory/Game Theory/Economics as well, with Reinforcement Learning, or rather "Inverse Reinforcement Learning" - like just renaming utility to reward.

There seem to be tons of scholars who have become big names by taking results and concepts from other fields, such as heuristic/approximate dynamic optimization and game theory... Now I wonder: were these independent discoveries, or did they just read it and not provide citations?

I spent an impressive amount of time with a biostats PhD (who had veto power on my IRB protocol) working this sort of stuff out. In the end it became clear he really didn't care about the machine's training process at all; he was only interested in validation as a way to talk about comparative statistics for evaluating the results of many machines against each other.

I drew many, many tables on the whiteboard that day.

X causes Y if surgical (or randomized controlled) manipulations in X are correlated with changes in Y

X causes Y if it doesn't obviously not cause Y.

This seems to conveniently overlook the decades of quantitative social science built on "I controlled for a couple things, and p is less than .05, so X causes Y."

But I'm not clear on the community context here. Is this just good-natured ribbing?

I do not think it does overlook that research: those papers generally assume that once those things are controlled for, the remaining variation is as good as random and thus we indeed recover the causal effect of X on Y. Many papers are probably _wrong_ on this, but they still use "causation" in the first sense and not in the second sense.

I would like to know what statisticians mean by "nonparametric", and what machine learning people mean by "nonparametric", because they seem to be something very different.

In ML: when your model is not defined by a fixed set of parameters, but the number of "parameters" varies depending on the training data. For example, k-nearest-neighbor classification requires storing the entire training set in order to be able to make predictions. Gaussian process regression and Dirichlet process based clustering (mixture fitting) are other examples. Linear regression, on the other hand, is parametric, as the model is defined by a fixed set of coefficients whose count does not depend on the number of training examples/observations.
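A minimal sketch of that distinction (toy 1-D data, made up for illustration): the fitted k-NN "model" is literally the training set, so its size grows with the data, whereas a fitted line is always just two numbers.

```python
# k-NN is "nonparametric" in the ML sense: prediction requires keeping
# every training point around, so the "model" grows with the data.

def knn_predict(x, train, k=3):
    """1-D k-nearest-neighbor classifier: majority vote among the
    k training points closest to x."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

train = [(0.1, "a"), (0.2, "a"), (0.9, "b"), (1.1, "b"), (1.0, "b")]
print(knn_predict(0.15, train))  # "a" - the two closest points are "a"
```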

this is how I understand it. but then I’ve heard statisticians describe neural networks as “nonparametric”, even though they typically have a fixed number of parameters. (millions of parameters! arguably they are the MOST parametric.)

A neural network in general is indeed nonparametric, because the number of weights is not something fixed in advance but learned from data. If the parameter count is considered fixed, as for logistic regression, then it's considered parametric.

The number of parameters in a neural net as used today, specifically in computer vision, is basically never learned from training data. I actually cannot recall any practically used methods that would do that.

Another example but not related to statistics: "convolution" in machine learning is not exactly the same thing as in signal processing.

> Statistics: regressions


> ML: supervised learners, machines

I think ML uses the term regression for situations where the output is a numeric value (as opposed to a label), and supervised learning is more than just regression. Usually regression models have a mean squared error loss function; that's one way to spot them.
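That MSE spot-check is a one-liner; a sketch with made-up numeric targets (e.g. house prices):

```python
# Mean squared error: the usual loss for regression (numeric targets),
# versus accuracy-style metrics for label prediction.

def mse(y_true, y_pred):
    """Average of squared residuals over the observations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

prices_true = [200.0, 310.0, 150.0]  # numeric targets -> regression
prices_pred = [210.0, 300.0, 155.0]
print(mse(prices_true, prices_pred))  # 75.0
```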

Yes, in ML (and I believe for regular statistics as well) regression is when you are trying to predict a number rather than a label. Predicting house prices is a classical example.

An exception to this is "logistic regression", a name accepted by both the stats and ML communities.

Supervised learning is whenever you have a target variable for the observations you use to train/fit your model.

If you only have targets for some of your observations, it is called "semi-supervised learning". Although in deep learning you often talk about "pre-training" your model, which often is adjusting weights in an unsupervised way.

So you can have supervised regression as well as supervised label predictors. These can also be semi-supervised.

Oh jesus, this is super useful. I've had a hard time with ML-people speak.

I once interviewed at a biostats shop where the interviewer kept using the word "responses" to refer to feature values. I could not pin him down on the problem statement. Pretty sure that dumbass thinks I'm a dumbass.

That doesn't make sense. By "response" he probably meant response variables, as in the dependent variables. Features would be the independent variables (or "predictors" in some stats/biostats circles).

Precisely! Post hoc, I figured out that he must have meant responses to an assay, which makes sense in context, but like, I would have expected someone with any stats background whatsoever to be able to clarify.
