
Machine Learning Done Wrong - denzil_correa
http://ml.posthaven.com/machine-learning-done-wrong
======
mdbco
_" Statistical modeling is a lot like engineering."_

I can certainly see why this is a good comparison: both engineering methods and
statistical methods rely on sets of given assumptions. But it's also important
not to take the analogy too far. Engineering is ultimately done in a
mechanistic world with primarily deterministic outcomes, whereas statistical
modeling is conducted in a stochastic world with probabilistic outcomes, so it
would be a mistake to think of machine learning as predominantly mechanistic in
nature (in spite of its name). Of course, many of the seven points in the post
emphasize the importance of stochastic factors (e.g. outliers, variance issues,
collinearity), so the author is clearly not making this mistake, but it might
be worth clarifying for other readers.

 _" 6\. Use linear model without considering multi-collinear predictors"_

This is a great point. To expand on it a bit, you can also have situations
with simultaneity, i.e. two or more of your features or predictors are
functions of each other and/or of some third variable. This type of problem is
more difficult to detect, but it can cause serious problems when interpreting
the regression coefficients, as it's ultimately a type of endogeneity, which
means that common approaches like OLS will not be consistent.
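
For plain multicollinearity, a quick diagnostic is the variance inflation
factor. Here's a minimal sketch in Python (statsmodels assumed, synthetic data;
the VIF > 5 cutoff is just a common rule of thumb):

```python
# Detect multicollinearity with variance inflation factors (VIFs).
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)  # nearly a function of x1
x3 = rng.normal(size=n)                   # independent predictor
X = np.column_stack([np.ones(n), x1, x2, x3])  # include an intercept

for i, name in enumerate(["const", "x1", "x2", "x3"]):
    vif = variance_inflation_factor(X, i)
    flag = "  <-- suspicious" if name != "const" and vif > 5 else ""
    print(f"{name}: VIF = {vif:.1f}{flag}")
```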

~~~
chengtao
great comment and +1

------
dustintran
I strongly disagree with not using linear models, at least to build some
theory and intuition before continuing with more sophisticated algorithms.
What I find more egregiously misused in machine learning practice is that
everyone too often flocks to the state of the art with little understanding of
why. There's no reason, for example, to spend weeks (or months) tuning an
incredibly deep neural network if the current predictive ability is enough and
there are higher-priority matters to work on.

Moreover, there's just too much emphasis on prediction. Design and analysis of
experiments, handling missing data and the context of the data sets, and
quantifying one's uncertainty about parameters in a principled manner for
robust estimators are very underappreciated skills in the community. Using
p-values arbitrarily, or "95% confidence intervals" based on an unchecked
normal approximation, is far more harmful than not doing anything at all.
There's just so much more to machine learning than supervised learning.
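
To make the normal-approximation point concrete, here's a minimal sketch (an
editorial illustration, not from the comment) contrasting a normal-approximation
95% interval with a nonparametric bootstrap interval on a small, skewed sample,
where the normal approximation can be misleading:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.lognormal(mean=0.0, sigma=1.5, size=40)  # small, skewed sample

# Normal-approximation interval: mean +/- 1.96 * standard error.
se = sample.std(ddof=1) / np.sqrt(len(sample))
normal_ci = (sample.mean() - 1.96 * se, sample.mean() + 1.96 * se)

# Bootstrap percentile interval: resample with replacement.
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(10_000)]
boot_ci = tuple(np.percentile(boot_means, [2.5, 97.5]))

print(f"normal approx. CI: ({normal_ci[0]:.2f}, {normal_ci[1]:.2f})")
print(f"bootstrap CI:      ({boot_ci[0]:.2f}, {boot_ci[1]:.2f})")
```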

~~~
kylebgorman
In natural language processing, we can get close to state of the art
performance on nearly every major task with a linear model; usually, the
feature sets contain what are essentially conjunctions of features, but these
are chosen by hand, by domain experts, rather than produced with, say, a
polynomial kernel.

------
karthikv2k
"2\. Use plain linear models for non-linear interaction" It should be noted
that Linear models are only linear in the model parameters, while the features
can be transformed using non-linear functions. This trick makes linear models
very powerful. Also if you have big data (in millions/billions) then you are
better off with linear models, as SVM is very difficult to scale.
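
Here's a minimal sketch of the "linear in the parameters" point (scikit-learn
assumed, synthetic data): a linear model fit on nonlinearly transformed
features recovers a quadratic relationship.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

# Expand x into [x, x^2]; the model stays linear in its coefficients.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
model = LinearRegression().fit(X_poly, y)
print("intercept, coefficients:", model.intercept_, model.coef_)  # ~ (1, 2, -0.5)
```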

In my experience (all in big data), I have rarely seen people use SVMs; the
usual choices are logistic regression and tree-based models. In some finance
and insurance industries you are restricted to using only interpretable
models, which linear models are.

~~~
chengtao
As you pointed out, transforming features is powerful, and I believe that's
exactly what makes SVMs powerful. Though the ways features can be combined in
an SVM are limited, that limitation is what makes SVM training in the dual
space fast.

On the other hand, if you want to compare logistic regression with SVMs, the
details are pretty tricky. One simplified view is to compare a linear SVM,
which is essentially hinge loss with L2 regularization, against logistic
regression with L2 regularization, which is essentially negative binomial
log-likelihood loss with L2 regularization. If you plot the two loss
functions, it's easy to see how they penalize negative and positive cases
differently.
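
For a concrete picture of that comparison, here's a minimal sketch (matplotlib
assumed; an editorial illustration, not code from the thread) plotting the two
losses as functions of the margin:

```python
import numpy as np
import matplotlib.pyplot as plt

margin = np.linspace(-3, 3, 300)          # margin = y * f(x)
hinge = np.maximum(0.0, 1.0 - margin)     # linear SVM hinge loss
logistic = np.log1p(np.exp(-margin))      # logistic (log) loss

plt.plot(margin, hinge, label="hinge (SVM)")
plt.plot(margin, logistic, label="logistic")
plt.xlabel("margin  y * f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```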

------
orting
I think the points are good, but I am not very happy about this statement:

"When dealing with small amounts of data, it’s reasonable to try as many
algorithms as possible and to pick the best one since the cost of
experimentation is low. But as we hit “big data”, it pays off to analyze the
data upfront and then design the modeling pipeline (pre-processing, modeling,
optimization algorithm, evaluation, productionization) accordingly."

If done correctly, then I agree. But we have to be careful about overfitting
when we try out several models or make an initial analysis to determine which
model to use. In this sense, choosing a model is no different from fitting the
parameters of the model.

~~~
idunning
If you are disciplined and separate your data into training and testing sets,
you can try as many models as you want without fear of overfitting. Indeed,
optimizing over the parameters of a model on the training set is essential
(pruning parameters in a tree, regularization weights, etc.) and can be
thought of as training a large number of models.

If you aren't doing this correctly, then you can't really interpret the
performance of even a single model. I've seen people screw this up in so many
ways; my favorite recent one, which was quite high on HN, was someone using
the full dataset for variable selection before doing a training-testing split.

~~~
stiff
If you use performance on the test set for model selection, this is not true.
It follows from simple probabilistic reasoning: the more models you try, the
higher the chance that one will score well on both the training set and the
test set by "luck", and this is especially true with small datasets. In fact,
it is best practice to use a separate validation set for model selection and
to use the test set only for final performance evaluation; see e.g. the
answer to this question:

[http://stats.stackexchange.com/questions/9357/why-only-three-partitions-training-validation-test](http://stats.stackexchange.com/questions/9357/why-only-three-partitions-training-validation-test)
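
Here's a minimal sketch of the three-way split that answer describes
(scikit-learn assumed, synthetic data): select the model on the validation
set, then report performance once on the untouched test set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic labels

# 60% train, 20% validation, 20% test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Model selection: pick the regularization strength on the validation set.
best_C, best_score = None, -np.inf
for C in [0.01, 0.1, 1.0, 10.0]:
    score = LogisticRegression(C=C).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# Final estimate: evaluate the chosen model once on the test set.
final = LogisticRegression(C=best_C).fit(X_train, y_train)
print(f"chosen C = {best_C}, test accuracy = {final.score(X_test, y_test):.3f}")
```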

------
ma2rten
_Returning to fraud detection, high order interaction features like "billing
address = shipping address and transaction amount < $50" are required for good
model performance._

I agree that non-linear models are often able to beat linear ones, but if you
have limited amounts of data, feature engineering will always beat clever
algorithms.

~~~
chengtao
Yes, and IMO, most of the time the insight behind the data is far more
important than the modeling algorithm for achieving high performance, with few
exceptions (say computer vision, NLP, etc., which really require A LOT OF
data). This holds even for some large data sets; take PageRank as an example.
The fundamental insight was that the popularity of a site would be a great
signal for ranking search results, and that a random walk would be a great way
to approximate that popularity. As a result, Google had great success in
search ranking.
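
Here's a minimal sketch of that random-walk intuition on a toy 3-page graph
(an editorial illustration): power iteration converges to the PageRank scores,
with 0.85 as the commonly cited damping factor.

```python
import numpy as np

# Column-stochastic link matrix: page 0 links to 1 and 2,
# page 1 links to 2, page 2 links to 0.
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])

d, n = 0.85, M.shape[0]
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank   # one step of the random surfer

print("PageRank scores:", np.round(rank, 3))
```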

------
ai_maker
Good post! A bunch of mistakes is a bunch of opportunities to improve. It kind
of complements Domingos' paper:

[http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf)

------
chengtao
Also, thank you all for reading the post. I'm the author and I'll be happy to
clarify any of the points in the blog~

------
mystique
Good list. I am new to machine learning, with only ~1 year of real work, and
sometimes I slip and make one of these mistakes.

I have a question on #7. I have not used the coefficients to mean feature
importance, but I sometimes get tempted to. How do you explain to non-stats
people which factors matter most for some outcome?

~~~
idunning
Some techniques, e.g. random forests, give variable importance indicators for
free. If you can test it out, give it a go; you don't have to use the random
forest as the final model.
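
For instance, a minimal sketch of those "free" importances (scikit-learn
assumed, synthetic data where only the first two features matter):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)   # features 0 and 1 matter

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("feature importances:", np.round(forest.feature_importances_, 3))
```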

~~~
ma2rten
You can use correlation, or the coefficients of a linear model if the features
are on the same scale. Another method is to retrain the model with each
feature left out in turn and see how much accuracy drops; see the sketch
below.
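
A minimal sketch of that leave-one-feature-out idea (scikit-learn assumed,
synthetic data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (2 * X[:, 0] + X[:, 1] > 0).astype(int)   # features 0 and 1 matter

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
baseline = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Retrain with each feature dropped; a large drop means the feature matters.
for j in range(X.shape[1]):
    keep = [k for k in range(X.shape[1]) if k != j]
    acc = LogisticRegression().fit(X_tr[:, keep], y_tr).score(X_te[:, keep], y_te)
    print(f"without feature {j}: accuracy drop = {baseline - acc:+.3f}")
```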

------
moultano
I'm glad #1 is #1. If I had to partition these into two sets, they would be
"make sure it works (2-7)" and "make sure your definition of 'works' works
(1)."

------
antonb2011
You suggest up/down sampling rare cases. Can you please elaborate on the
standard approaches for this kind of problem, for both linear and nonlinear
classifiers? Thank you.

~~~
chengtao
Great question. My main point is less about up-sampling the rare cases and
more that the default loss function used in model training might not directly
align with the final business metric (which is the metric practitioners should
care more about). As a result, it's important to align the two. For some
algorithms it's easy to incorporate a different loss function; for others it
might not be. Over- or under-sampling is one fairly generally applicable way
to tweak the loss function.

While I'm not an expert on the theory behind sampling, if you do find the need
to tweak sampling to align the default loss function with the business metric,
I would suggest doing a grid search first and validating the result against
business insight, e.g. if you find that getting the rare cases right is much
more important than getting the common cases right, does that align with the
business insight?
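
A closely related alternative to resampling is reweighting the loss directly.
Here's a minimal sketch (scikit-learn assumed, synthetic imbalanced data) using
the class_weight option, which rescales each class's contribution to the
training loss:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 5))
y = (X[:, 0] + rng.normal(scale=2.0, size=5000) > 3).astype(int)  # rare positives

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Compare the default loss with one reweighted toward the rare class.
for w in [None, "balanced"]:
    model = LogisticRegression(class_weight=w).fit(X_tr, y_tr)
    rec = recall_score(y_te, model.predict(X_te))
    print(f"class_weight={w}: recall on rare class = {rec:.2f}")
```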

------
trunkation
Is it somehow inspired by Linear Algebra Done Wrong by Treil, which itself was
probably inspired by Linear Algebra Done Right by Axler? :)

~~~
chengtao
Not really; it was actually more inspired by Statistics Done Wrong.

~~~
trunkation
Oh, Ok. Never knew there was a book by that title.

~~~
kimolas
It was written by someone in my PhD cohort—it's available for preorder now.
[http://www.statisticsdonewrong.com](http://www.statisticsdonewrong.com)

