
Approaching Almost Any Machine Learning Problem - Anon84
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
======
glial
Interesting post, but there are several errors, or at least suggestions that
don't make sense to me:

> Doing so will result in very good evaluation scores and make the user happy
> but instead he/she will be building a useless model with very high
> overfitting.

Testing on your training data doesn't in-and-of-itself lead to overfitting,
but it will hide overfitting if it does exist (and is a terrible practice for
that reason).
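
The point is easy to sketch in a few lines of scikit-learn (assuming it's installed). The labels here are pure noise, so there is nothing real to learn, yet the score on the training data looks perfect:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
# Pure-noise labels: no model can genuinely do better than chance on unseen data.
X = rng.randn(200, 5)
y = rng.randint(0, 2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unlimited-depth tree simply memorizes the training set.
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = clf.score(X_tr, y_tr)  # looks perfect
test_acc = clf.score(X_te, y_te)   # near chance on held-out data

print(train_acc, test_acc)
```

The overfitting was there all along; only the held-out score reveals it.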

> At this stage only models you should go for should be ensemble tree based
> models.

Not sure why this should be the case. Many ensemble models are very memory
hungry & slow, Random Forests being a good example. They are flexible and have
minimal assumptions, sure, but that doesn't mean you shouldn't try any other
modeling technique, especially if you have domain knowledge.

> We cannot apply linear models to the above features since they are not
> normalized.

Not true at all. You won't be able to meaningfully compare coefficient
magnitudes, but you can certainly apply linear models.
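
A quick illustration (numpy only; the feature scales here are made up): ordinary least squares fits unnormalized features just fine, because the coefficients simply absorb the scale differences.

```python
import numpy as np

rng = np.random.RandomState(0)
# Two features on wildly different scales (think metres vs. millimetres).
X = np.c_[rng.randn(100), rng.randn(100) * 1e6]
y = 3.0 * X[:, 0] + 2e-6 * X[:, 1] + rng.randn(100) * 0.1

# Ordinary least squares via the normal equations; no normalization anywhere.
Xb = np.c_[np.ones(len(X)), X]
coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
pred = Xb @ coef

print(coef)                        # coefficients absorb the scale difference
print(np.mean((pred - y) ** 2))    # fit is fine despite unnormalized inputs
```

The coefficient magnitudes are incomparable across features, exactly as noted above, but the fit itself is unaffected.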

> These normalization methods work only on dense features and don’t give very
> good results if applied on sparse features.

Since normalization is done feature-by-feature, there is no reason why this
should be true.
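
In fact scikit-learn ships scalers that accept sparse input directly. A sketch with `MaxAbsScaler` (assuming scipy and scikit-learn are available): it scales each column independently and never touches the zeros, so sparsity is preserved.

```python
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# A sparse matrix: mostly zeros, as with bag-of-words counts.
X = sparse.random(1000, 50, density=0.05, random_state=0, format="csr") * 100

# Each column is divided by its own max absolute value, feature by feature.
X_scaled = MaxAbsScaler().fit_transform(X)

print(X_scaled.max())          # every feature now lands in [-1, 1]
print(X.nnz == X_scaled.nnz)   # no zeros were disturbed
```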

> generally PCA is used decompose the data

If your latent features add linearly, ok, but do they? Is it meaningful to
have negative values for your features? If not, consider using something like
sparse non-negative matrix factorization.
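
A toy contrast (using scikit-learn's plain NMF rather than a sparse variant, for brevity; the synthetic data is made up): on non-negative data built from non-negative parts, PCA loadings go negative while NMF keeps both factors non-negative.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.RandomState(0)
# Non-negative data, e.g. word counts: a parts-based latent structure.
W_true = rng.rand(200, 3)
H_true = rng.rand(3, 20)
X = W_true @ H_true

# PCA components mix positive and negative loadings...
pca_components = PCA(n_components=3).fit(X).components_

# ...while NMF constrains both factors to be non-negative.
nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)

print((pca_components < 0).any())  # PCA: negative loadings appear
print((W >= 0).all())              # NMF: all non-negative
```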

Lastly, using this approach means you have a HUGE model & parameter space to
search through. Because of this you will need a ton of data to get meaningful
results.

He seems to be treating machine learning methods as a bag of tricks. That's ok
so far as it goes, but in my experience it's much more valuable to try and
understand your data, and the process that generates it, and then build a
model that tries to reflect that data generation process.

~~~
nl
A lot of the criticisms here are about inefficiencies in the approach. This is
true, but the approach here does generalise well (provided you have sufficient
data).

It's worth noting that Abhishek is working on the "automatic data science"
problem[1] and is viewing things through that lens.

AutoML is an interesting area - there have been workshops at ICML, and good
progress is being made in the area.

[1] [https://arxiv.org/abs/1507.02188](https://arxiv.org/abs/1507.02188)

~~~
visarga
This talk is very interesting: they do automatic model selection, pipeline
stacking (pick the right sequence of pre-processing stages to place in the
pipeline) and hyperparameter optimization. They use meta-learning to improve
search speed.

 _Automatic Machine Learning? SciPy 2016 by Andreas Mueller_ -
[https://www.youtube.com/watch?v=Wy6EKjJT79M](https://www.youtube.com/watch?v=Wy6EKjJT79M)

------
vonnik
This is a great post as far as it goes. While Abhishek mentions Keras for
neural networks (and Keras is a great Python library), he doesn't really go
into the cases where deep neural networks are more useful than other
algorithms, and how that changes a data scientist's workflow.

DNNs are really well suited to unstructured data, which isn't the kind he
highlights. One reason is that they automatically extract features
from data using optimization algorithms like stochastic gradient descent with
backpropagation. What that means is they bypass the arduous process of feature
engineering. They help you get around that chokepoint, so that you can deal
with unstructured blobs of pixels or blobs of text.

Because unstructured data is most of the data in the world, and because DNNs
excel at modeling it, they have proven to be some of the most useful and
accurate algorithms we have for many problems.

Here's an overview of DNNs that goes into a bit more depth:

[http://deeplearning4j.org/neuralnet-overview.html](http://deeplearning4j.org/neuralnet-overview.html)

~~~
bhntr3
I think the post represents the typical data scientist's approach to a machine
learning problem pretty well. Deep learning is exciting but, for practical
problems, most people aren't using it yet. XGBoost is still dominant in Kaggle
competitions. Lots of companies with lots of data still use simple single
machine models.

So, I think it's a little misleading to say that deep learning eliminates the
need for feature engineering. If we're going to provide business value to a
company, we're often dealing with somewhat structured data and not
"unstructured blobs of pixels or blobs of text". There are several simple
linear models with highly engineered features in production where I work that
resist attempts to replace them with more clever models and less feature
engineering.

Deep learning is awesome and there may be a time when it solves every problem.
But let's not oversell it. For now, if someone wants to do machine learning
and get paid for it, they're more likely to see their colleagues using the
techniques in this article than training deep multi-layer neural nets.

~~~
nl
Deep Learning is heavily used in Kaggle competitions when there is image data.

Even in non-image data competitions, a deep neural network will often be one
of the models chosen to ensemble. They generally perform slightly worse than
XGBoost models, but have the advantage that they often aren't closely
correlated with them, which helps fight overfitting when tuning ensembling
hyperparameters.
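
The variance-reduction argument behind ensembling uncorrelated models is easy to sketch with synthetic predictions (numpy only; the two "models" here are stand-ins for, say, an XGBoost model and a neural net):

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.randn(10_000)  # ground truth

# Two models that are equally noisy, but with independent errors.
pred_a = y + rng.randn(10_000) * 0.5
pred_b = y + rng.randn(10_000) * 0.5

def mse(p):
    return np.mean((p - y) ** 2)

print(mse(pred_a), mse(pred_b))    # each individual model's error
print(mse((pred_a + pred_b) / 2))  # averaging independent errors cuts the variance
```

If the two models' errors were perfectly correlated, averaging would buy nothing; the less correlated they are, the bigger the win.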

~~~
selectron
For image competitions you are right. Neural networks are often in winning
teams' ensembles, but they require a lot more work than something like xgboost
(gradient-boosted decision trees). For a dataset that isn't image processing
or NLP, xgboost is in general much more widely used than neural nets. Neural
nets suffer from the amount of computing resources and knowledge needed to
apply them, though given infinite knowledge and computing power they are
probably on par with or better than xgboost. And if you need to analyze an
image they are great.

------
lqdc13
Approaching (Almost) Any Machine Learning _Classification_ Problem.

If you are doing sequence labeling, learning something about the data,
tackling partially unlabeled data or time-varying data, you generally have to
take a different approach.

------
pesfandiar
It's a very insightful article about the nitty-gritty details of working on ML
problems. However, as an outsider, I can't decide if some of the very specific
statements given without any reasoning (e.g. good ranges of values for
parameters) are highly valuable wisdom coming from years of experience, or
merely overfitted patterns that he's adopted in his own realm.

~~~
selectron
I would say that table is really quite valuable. Kaggle problems come from all
types of companies, so it doesn't make sense to say that it is "overfitted
patterns that he's adopted in his own realm". With that said, validation on
your own dataset will trump general knowledge, so you shouldn't view these
parameters as hard and fast rules. But the parameters in that table will
provide a useful starting point, and if you stray too far from them that is a
warning sign that you might be overfitting.

------
slv77
Throwing out a couple of stupid questions since we're talking about rules of
thumb...

1) How much impact does feeding irrelevant features into a model have? For
example, what if you added several columns of normally distributed random
numbers?

2) How much impact does having several highly correlated features have on a
model?

3) If you had limited time and budget, would it be better spent on cleaning
data (removing bad labels, noise in data), feature engineering (relevant
features), or feature selection (removing irrelevant or redundant features)?

~~~
selectron
1) It depends heavily on the model. Something like xgboost (gradient boosted
decision tree) will handle irrelevant features fairly well, while other models
(like linear models, especially without lasso regularization) will have much
more trouble. In virtually all cases adding noise will decrease model
performance.
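
A rough sketch of 1) with scikit-learn (the dataset sizes and model choices here are arbitrary): tacking 50 columns of pure noise onto a small dataset degrades plain linear regression's cross-validated score, while a tree ensemble tends to shrug it off better.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n = 120
X_signal = rng.randn(n, 5)
y = X_signal @ np.array([1.0, 2.0, -1.0, 0.5, 0.25]) + rng.randn(n) * 0.1

# Tack on 50 columns of pure noise.
X_noisy = np.c_[X_signal, rng.randn(n, 50)]

scores = {}
for model in (GradientBoostingRegressor(random_state=0), LinearRegression()):
    name = type(model).__name__
    scores[name] = (
        cross_val_score(model, X_signal, y, cv=3).mean(),  # clean R^2
        cross_val_score(model, X_noisy, y, cv=3).mean(),   # with noise columns
    )
    print(name, scores[name])
```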

2) Same as 1), depends on the model. With good hyper-parameters xgboost can
handle correlated features well, while other models may struggle.

3) With a good model (again like xgboost), feature engineering is usually the
best use of your time. Removing "bad" labels and "noise" in the data is
especially dangerous, because if you are not extremely careful you can make your
model worse. If you can identify why the label is "bad" then you can remove or
correct it, but you need a reason why you wouldn't have these bad labels on
your test dataset. Removing outliers can help your model, but it is risky. In
contrast smart feature engineering is low risk and can provide large gains if
you see a pattern the model could not see. Feature selection can be important
as well, and is generally pretty quick assuming you have good hardware, so you
might as well do it, especially if you have some knowledge about which
features you expect to be not that useful.

------
danvayn
Awesome post and site. My only complaint about this blog is that you'd think
the top left would be a link and 'No Free Hunch' would not be one, but in fact
it's the opposite.

------
cmdrfred
I've been looking for a place to begin with machine learning, thanks.

------
elgabogringo
Awesome post. Too hungover to read all that today though. Definitely Saturday
morning over a cup of coffee - assuming I'm not in the same shape then.

------
greenpizza13
Great post, but how am I supposed to find it credible when he uses quotation
marks for emphasis?

> The splitting of data into training and validation sets “must” be done
> according to labels. In case of any kind of classification problem, use
> stratified splitting. In python, you can do this using scikit-learn very
> easily.
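
For what it's worth, the stratified split the quoted passage describes really is a one-liner in scikit-learn (the dataset here is synthetic): `stratify=y` keeps the class ratio identical in both halves.

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced labels: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)

# stratify=y preserves the class proportions in train and test alike.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(Counter(y_tr), Counter(y_te))  # same ~9:1 ratio in both splits
```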

~~~
millisecond
I don't think that's emphasis, but a general rule that you should break if you
need to.

~~~
lgas
I think the author used real quotation marks as stand-ins for "air quotes".

