
When not to use deep learning - jpn
http://hyperparameter.space/blog/when-not-to-use-deep-learning/
======
peterburkimsher
My naïve understanding of deep learning is that it works by finding patterns
in the answers, instead of actually solving problems.

If I take a multiple-choice exam and always answer "C", then I have a good
chance of getting more than 25%.

For image recognition, I think the classifier is doing the real work (trying
to actually answer the question), and the deep learning is just seeing if the
answer matches the pattern of expected answers.

Somehow, this actually works. I think that it's because true randomness is
hard to find.

The problem that I've found is that it's really difficult to teach deep
learning. I'm making a Chinese-English teaching tool (
[http://pingtype.github.io](http://pingtype.github.io) ) and sourcing my
translations from Google Translate. I find a lot of mistakes in my dictionary
that obviously came from Google's model getting the word spacing wrong. I can
fix it in my own dictionary immediately. If I submit the correction to Google,
it just changes some weightings, and hundreds of people will have to submit
the same correction before their model finally catches on that it needs to
change something.

~~~
daddyo
Your naive understanding is supported by at least one deep learning authority:

> I haven’t found a way to properly articulate this yet but somehow everything
> we do in deep learning is memorization (interpolation, pattern recognition,
> etc) instead of thinking (extrapolation, induction, etc). I haven’t seen a
> single compelling example of a neural network that I would say “thinks”, in
> a very abstract and hard-to-define feeling of what properties that would
> have and what that would look like.

> All the while I'm thinking: this thinking process this person goes through
> as he analyzes this data: THAT is what Machine Learning SHOULD do

-- Andrej Karpathy

Deep learning for image recognition works because our visual world is made up
of structured hierarchical features: Dark/Light, Texture, Edge, Part of
Object, Object, Scene. Deep learning layers create increasingly higher-level
features in a computationally feasible way.
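
A minimal sketch of that hierarchy, assuming Keras (the layer comments are
the standard intuition about what each stage tends to learn, not a
guarantee):

    from tensorflow.keras import layers, models

    # Each conv/pool stage tends to learn features one level up the
    # hierarchy described above: edges -> textures -> parts -> objects.
    model = models.Sequential([
        layers.Conv2D(16, 3, activation="relu", input_shape=(64, 64, 3)),  # edges
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation="relu"),  # textures / simple parts
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),  # object parts
        layers.GlobalAveragePooling2D(),
        layers.Dense(10, activation="softmax"),   # object / scene classes
    ])
    model.summary()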

~~~
amelius
So a better name for "deep learning" would be "shallow understanding"?

~~~
dmreedy
I personally prefer 'generic hashing/parsing'; deep learning excels at the
automatic creation of a mapping of unstructured information to structured
information, after a sufficient period of training.

~~~
zebrafish
Hmm... but isn't that what our brains do as well? Unstructured intensities of
light bounce off our retinas and become a structured, recognized object.

~~~
dmreedy
It definitely seems to be part of what our brain does. The visual cortex is an
apt comparison since that's where a lot of the structural inspiration for
modern ANNs comes from. But there does seem to be a little more than that
too; it's not clear whether all the brain does is reducible to a hash function
(reducible in any useful sense, at least; a very very very big, very very very
sparse hash function, perhaps).

------
nilkn
I was expecting more discussion of alternatives.

For instance, in cases where deep neural networks aren't desirable or don't
outperform classical approaches, I'm a big fan of boosted decision trees, due
to their accuracy on many real-world datasets, their ease of use, and the
existence of great open source implementations. xgboost (which routinely wins
Kaggle competitions) and Spark MLLib both have high-performance distributed
training algorithms for gradient boosted trees. And as far as hyperparameter
searches go, there just aren't as many parameters to optimize. (And frameworks
like Spark are already fantastic for embarrassingly parallel tasks like
hyperparameter searches.)
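
For a concrete picture, here's a minimal sketch using xgboost's scikit-learn
API (the dataset and hyperparameter values are illustrative, not
recommendations):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Only a handful of knobs matter in practice: tree count, depth, learning rate.
    clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    clf.fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))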

~~~
nl
_Spark MLLib both have high-performance distributed training algorithms for
gradient boosted trees._

Well, it exists, but I wouldn't describe it as high-performance in either
accuracy or speed.

I'm a big fan of Spark, but Spark ML needs some love from people who actually
use it.

Until that happens, just use XGB (which now has Spark integration[1])

[1] [http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html](http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html)

------
therajiv
The author discusses how linear models are generally more interpretable than
deep learning methods, but I'd argue that's actually changing pretty quickly.
Especially for large image/sequence inputs (which cover most of the
applications that are getting hyped up), linear regressions don't perform very
well, and often that performance difference prevents them from picking out
important features. Given that fast, scalable methods for feature importance
are on the rise (e.g.
[https://arxiv.org/abs/1704.02685](https://arxiv.org/abs/1704.02685), which
the author mentions), you often get equally interpretable feature scores from
deep models that are more accurate than analogous ones from linear models.

Basically, my point is that model interpretation strongly depends on how
accurate your model is, and because deep learning models are so much better
than linear models for some tasks, it makes sense to use them - even if your
primary goal _is_ interpretability.
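
As a toy illustration of the kind of feature scores I mean, here's a minimal
sketch of plain gradient-based attribution (much simpler than the DeepLIFT
method in the linked paper, but the same spirit), assuming TensorFlow:

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    x = tf.random.normal((1, 20))  # one example with 20 features (illustrative)

    # The gradient of the prediction w.r.t. each input feature plays the
    # role that coefficients play in a linear model.
    with tf.GradientTape() as tape:
        tape.watch(x)
        y = model(x)
    importance = tape.gradient(y, x)
    print(importance.numpy())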

That said, I do believe that if you ever care at all about interpretation, you
should almost never be using multilayer perceptrons (which have recently
become part of the widening umbrella term "deep learning"), because they
rarely work better than decision tree models or basic linear models, and MLPs
are generally no more interpretable than those traditional methods.

~~~
daddyo
Feature importance is not quite the same as interpretability.

Random Forests can give feature importance, but that does not account for
interactions between features. So, in the end, you don't know how a model made
a decision (it could be because there is a feature with high importance, but
it could also be because there is an informative interaction between
lower-importance features).

If you want to compare deep learning with linear models, you should leave
image data out of it. Compare them on structured data and bag of words.

MLPs and boosted decision trees, in my experience, definitely beat decision
trees and linear models on structured data. But they lack long-term robustness
(complex forecasting models need constant retraining, which can hamper their
adoption by business units) and don't pass regulatory review (it is not enough
to say "has_asthma" is a high-importance feature).

In finance and health care, interpretability is enormously valued. It is a
constant trade-off between accuracy and interpretability.

A long time ago, Caruana made hospital triage models, with neural networks
being the clear winner in generalization performance. Instead, they opted for
a simple logistic regression when productionizing. Why?

> [...] patients with pneumonia who have a history of asthma have lower risk
> of dying from pneumonia than the general population. Needless to say, this
> rule is counterintuitive. But it reflected a true pattern in the training
> data: patients with a history of asthma who presented with pneumonia usually
> were admitted not only to the hospital but directly to the ICU (Intensive
> Care Unit). The good news is that the aggressive care received by asthmatic
> pneumonia patients was so effective that it lowered their risk of dying from
> pneumonia compared to the general population. The bad news is that because
> the prognosis for these patients is better than average, models trained on
> the data incorrectly learn that asthma lowers risk, when in fact asthmatics
> have much higher risk (if not hospitalized).

[http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf](http://people.dbmi.columbia.edu/noemie/papers/15kdd.pdf)

Though there is nothing holding you back from using both simple linear and
complex non-linear models at the same time: only when the models severely
disagree do you pick the interpretable model. Or use the linear model to find
data issues, like those mentioned above, that are tremendously obscured (if
not impossible to identify) when only using deep learning in a train-test
framework.
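
A minimal sketch of that side-by-side setup with scikit-learn (the
disagreement threshold is an arbitrary choice, and the synthetic data is just
a stand-in):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=1000, random_state=0)
    linear = LogisticRegression(max_iter=1000).fit(X, y)
    boosted = GradientBoostingClassifier().fit(X, y)

    # Flag cases where the interpretable and complex models severely
    # disagree; these are the ones worth manual inspection for data issues.
    p_lin = linear.predict_proba(X)[:, 1]
    p_gbm = boosted.predict_proba(X)[:, 1]
    suspect = np.flatnonzero(np.abs(p_lin - p_gbm) > 0.3)
    print(len(suspect), "examples where the models severely disagree")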

~~~
tome
The anecdote about pneumonia and the ICU is pretty puzzling. Why wasn't
admission to the ICU one of the classification "labels"?

~~~
nonbel
Here is a talk about that paper:
[https://www.youtube.com/watch?v=UqPcq0n59rQ](https://www.youtube.com/watch?v=UqPcq0n59rQ)

I see that it has also gotten mainstream news coverage as some kind of lesson
about the dangers of machine learning. The real problem is they didn't have
data that could answer the question they had, P(Death|No hospitalization), so
instead they fit models to answer a different question,
P(Death|Hospitalization).

Then they didn't like that the complex models answered the second question too
well, so they used simpler ones that made it easier to manually filter out any
results that didn't make sense as answers _to the first question_ (which isn't
one they could answer to begin with).

No model they fit is safe. You could only use one in domains where
P(Death|No hospitalization) ~ P(Death|Hospitalization), which isn't something
they assessed.
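
A toy simulation of that gap, with entirely made-up numbers, just to show how
a model fit on hospitalized patients gets the sign of asthma wrong:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    asthma = rng.random(n) < 0.2
    icu = asthma  # in the training data, asthmatics were routed to the ICU

    base_risk = np.where(asthma, 0.30, 0.10)       # P(Death|No hospitalization)
    observed_risk = np.where(icu, base_risk * 0.2, base_risk)
    death = rng.random(n) < observed_risk          # what the dataset records

    # Any model fit on this data sees P(Death|Hospitalization):
    print("death rate with asthma:   ", death[asthma].mean())   # ~0.06
    print("death rate without asthma:", death[~asthma].mean())  # ~0.10
    # ...so it learns asthma lowers risk, even though base risk is 3x higher.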

------
andreyk
"The point is that training deep nets carries a big cost, in both
computational and debugging time. Such expense doesn’t make sense for lots of
day-to-day prediction problems and the ROI of tweaking a deep net to them,
even when tweaking small networks, might be too low."

As a Master's student who has been training deep models for a while now, I
think this point is underemphasized. Doing something novel (so, not just image
classification) requires a TON of engineering, not to mention the research
considerations. And there are so many tiny decisions and hyperparameters that
even when I thought I had considerable domain knowledge, I found it very
lacking. I guess it should not be surprising, given that 'Deep Learning'
refers to a very broad set of models related only by having a learned
hierarchical representation. There are a few problems where you can use
existing deep learning almost off the shelf (most notably image classification
and segmentation), but for most applications I think we're not there yet. As
long as this remains true (which I suspect will be for a long time), SVMs and
decision trees and linear models are still definitely worth knowing and
understanding.

------
digitalzombie
If NNs were able to handle small data, would they be better than their
counterparts?

I mean, if they worked well on small data, we would see them dominate Kaggle
in all problem domains. Maybe small-data problems belong to other algorithms
(such as tree-based methods, random forests, and SVMs).

Disclaimer: I'm biased toward tree-based algorithms on medium and small data,
since they are the subject of my thesis.

~~~
andreyk
I think no - SVMs are explicitly optimized to generalize well from small data
([https://en.wikipedia.org/wiki/Hinge_loss](https://en.wikipedia.org/wiki/Hinge_loss)
- 'The hinge loss is used for "maximum-margin" classification'), whereas NNs
have more hacky regularization methods. I am not sure whether the same is true
for tree-based methods, but of course those are lovely due to how
interpretable they are when you have few features.
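
For reference, the hinge loss from that link, for a true label t in {-1, +1}
and a classifier score y:

    \ell(y) = \max(0,\ 1 - t \cdot y)

Examples classified correctly and beyond the margin (t*y >= 1) contribute zero
loss, so the only way for the optimizer to reduce loss is to push points past
the margin - exactly the maximum-margin behavior that helps on small data.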

------
fnl
Pretty much agree, and _particularly_ on the budget/time aspect.

