
Don't use deep learning when your data isn't that big - simplystats2
https://simplystatistics.org/2017/05/31/deeplearning-vs-leekasso/
======
dbecker
_If you are Google, Amazon, or Facebook and have near infinite data it makes
sense to deep learn. But if you have a more modest sample size you may not be
buying any accuracy_

The author explores sample sizes up to 85, and then suggests this is the
relevant range except at Google, Amazon, Facebook, etc.

But the VAST majority of people considering deep learning have sample sizes
between those extremes. Results on small samples are interesting, but it's
disingenuous to market this as typical of the world outside Google.

~~~
billf1953
Yeah, what a dog of an argument. Basically it either shows the author has
little comprehension of ML, or he happened to design a terrible demonstration
of small-data issues.

------
minimaxir
> For low training set sample sizes it looks like the simpler method (just
> picking the top 10 and using a linear model) slightly outperforms the more
> complicated methods.

This is a _very_ bad argument for the given clickbaity headline. A methodology
that works well for one dataset with few observations will not necessarily
work well for another dataset.

You can do almost whatever you want with small datasets; it's just harder than
with big data (and it's necessary when obtaining data is expensive, e.g. in
medical trials). Specifically, you'll want to use bootstrapping: resampling
your existing data to estimate the uncertainty that comes with a small sample.

The "almost" is that you can't have hundreds of features if you have a small
dataset (Curse of Dimensionality:
[https://en.wikipedia.org/wiki/Curse_of_dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality))
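
For concreteness, here is a minimal numpy sketch of the "pick the top 10 features and fit a linear model" approach the quote describes, plus a bootstrap pass to estimate the coefficient uncertainty. The dataset, sizes, and seed are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny synthetic dataset: 30 samples, 100 features, only the first 5 informative.
n, p = 30, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

# "Pick the top 10": keep the 10 features most correlated with the outcome.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
top10 = np.argsort(corr)[-10:]

# Ordinary least squares on just those 10 features.
coef, *_ = np.linalg.lstsq(X[:, top10], y, rcond=None)

# Bootstrap: resample rows with replacement and refit, to estimate how much
# the coefficients vary because of the small sample.
boot = []
for _ in range(200):
    idx = rng.integers(0, n, n)  # sample n row indices with replacement
    boot.append(np.linalg.lstsq(X[idx][:, top10], y[idx], rcond=None)[0])
se = np.std(boot, axis=0)        # rough per-coefficient standard errors
```

The bootstrap doesn't manufacture new information; it just quantifies how shaky the fitted coefficients are given the 30 samples you actually have.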

------
krosaen
The benefits of deep learning have more to do with the number of features than
the size of the dataset; e.g. when you are dealing with million-pixel images,
you need a deep net to extract useful higher-level features automatically.
From there, yes, more data is better, but a better post in this vein would be
"Don't use deep learning when your data doesn't have that many features".

~~~
ovi256
Hmm I would refrain from saying DL is useful only when approaching 1 megapixel
images.

State-of-the-art performance on MNIST is held by a 6-layer convnet (4
convolutional layers, 2 fully connected). MNIST is just 28 x 28 grayscale
images, so 784 dimensions. There are many more datasets of the same order of
dimensionality. CIFAR-10/100 (32 x 32 pixel images) is also dominated by DL
convnets, AFAIK.

~~~
krosaen
Sure, I didn't mean to imply that megapixel images were the lower threshold,
just that it has more to do with the number of features and the need to
automatically extract higher-level features.

------
saosebastiao
I've been baffled by this as well. I can understand why deep learning has done
well in fields that can roughly be described as sensory perception, but it has
_never_ improved on basic random forests or SVMs for problems in my domain.
And we have lots of data (at least more than can fit in an R instance).

Even taking data size out of the picture, functionally it is not there yet for
most tasks. Maybe it will be in the future, but the big problem is that with
_n_ neurons there are exponentially many possible topologies (one for every
subset of the possible pairwise connections), and finding the right neural
topology is a major optimization problem for which we're only barely learning
basic human heuristics.

I'm willing to bet the deep learning thing is just one more Neat fad that will
eventually cause disillusionment at its lack of results, reverting us back to
the Scruffy view that intelligence is far too complex to be described
holistically by small sets of simple algorithms. The great thing about the
Scruffy philosophy is that it isn't derogatory...deep learning will _always_
have a place as a tool in its tool set. It merely doesn't hold unreasonable
expectations.

~~~
eachro
What is your domain?

~~~
saosebastiao
Transportation, Logistics, Supply Chain Management, (physical) Operations.

I suspect the reason why deep learning has done so poorly in my domain is that
the underlying data is a result of things that are very poorly abstracted as a
"function". We have lots of discrete events, stateful buffering, hard non-
linearities, discontinuities, numerical bounds, etc. It's more like learning
business rules and physical process design than learning a mathematical
function. This is part of the reason I don't see deep learning being a
holistic solution for self driving cars...once you get past sensory perception
and simple 2d path planning, driving is more of a rule based process than
anything.

That being said, ML tends to be a pretty niche technique for us anyway. If a
process and its components are well known and understood, we tend towards
solutions that come from Operations Research over Machine Learning. It is only
when things are poorly understood that we use ML (example: predicting product
demand fluctuations based on media coverage or predicting truck arrival times
given severe weather patterns and traffic backups). PGMs do really well here,
but are far more difficult to understand, formulate, and train...for most
tasks Random Forests are almost always Good Enough(TM).

~~~
mrmaximus
+1 For real world business problems that I most frequently encounter doing
consulting, it's hard to beat Random Forests and/or Gradient Boosting. Truth
be told, most business problems I encounter turn out to be largely helped by
good old linear models.

~~~
mswen
Agreed! I have often done more sophisticated analysis and then stepped back
and concluded that a simpler analysis was actually better for the business. It
moved the business into a better place for informed decisions and gave them
simple (analysis-backed) rules of thumb that every manager/director/VP could
understand and use by just checking a couple of numbers and doing a simple,
easily remembered bit of math.

Understandable models with clear intervention points are what most businesses
seem to need once you get to digging around in their operations, customer and
sales data.

------
bjornsing
It's true that simple models often outperform more complex ones on small
datasets. But the comparison seems rather unfair in this case: the "deep
learning" model employed seems to be a simple feedforward discriminative
classifier, and these are known to perform badly on small datasets. There are
other "deep learning" models that would likely perform _much_ better on small
sample sizes. I've written a blog post about one idea [1]. If you prefer
published peer-reviewed research (and of course you should), then e.g. Semi-
Supervised Learning with Deep Generative Models [2] is a good starting point.

1\. [http://www.openias.org/hybrid-generative-discriminative](http://www.openias.org/hybrid-generative-discriminative)

2\. [https://pdfs.semanticscholar.org/b6b9/39ffc9920cd8521299a6fe9ec55775f9bf3c.pdf](https://pdfs.semanticscholar.org/b6b9/39ffc9920cd8521299a6fe9ec55775f9bf3c.pdf)

------
AndrewKemendo
It is commonly understood in the field^ that 60,000 examples is the sweet spot
for training and validation data: 50k for training, 10k for test/validation.
This is largely because the MNIST set is exactly that size and is so commonly
used successfully; you get very high accuracy and fewer instances of
overfitting.

That said, you can do a lot with a relatively small dataset. This 2012 paper
puts the range between 80-570 samples [1], again depending on the model and
required outcomes. Leslie Smith at NRL has been working on this problem as
well, showing great progress on really small sample sets.

The major takeaway is that there is such a thing as too big, and too small, a
dataset for classification accuracy, but those definitions are rapidly
changing.

^Your mileage may vary depending on model, fine tuning, transfer learning
etc...

[1][https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307431/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3307431/)

~~~
zeroxfe
> Commonly understood in the field^ is that 60,000 examples is the sweet spot
> for training and validation data, 50k for training 10k for test/validation.

I don't understand how this could be true. Shouldn't the sweet spot be a
function of the dimensionality of the data?

~~~
amelius
No, the deep learning network should be smart enough to reduce the data to its
essence (and that's what it ultimately does).

If DL would need more training data for higher dimensional inputs, then DL
would lose against a simple pattern matching (correlation) algorithm at some
point.

~~~
landon32
Imagine you are trying to use a neural network to classify single bit data as
either being 1 or 0 (I know you obviously don't need a neural net for this,
but it's an example). Aside from not needing a very deep network, you would
not need much training data.

Then imagine classifying the color of a single pixel as "light" or "dark".
There are three dimensions: red, green, and blue. You would still need much
less training data here than if you were training a network to recognize a
car, right?

I think this is what zeroxfe is referring to.
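
To make that concrete, here is a toy numpy sketch of the three-dimensional "light vs dark pixel" case: a plain logistic regression learns the task from only 50 labelled pixels. Everything here, including the 0.5 brightness threshold and the sample counts, is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_pixels(n):
    X = rng.random((n, 3))                     # random RGB values in [0, 1]
    y = (X.mean(axis=1) > 0.5).astype(float)   # "light" iff mean channel > 0.5
    return X, y

def train_logreg(X, y, lr=0.5, steps=2000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)        # gradient step on log-loss
        b -= lr * (p - y).mean()
    return w, b

X_train, y_train = make_pixels(50)              # only 50 labelled pixels
w, b = train_logreg(X_train, y_train)

X_test, y_test = make_pixels(1000)
acc = (((X_test @ w + b) > 0).astype(float) == y_test).mean()
```

With only 3 input dimensions and a linear true boundary, 50 examples are plenty; the same model on raw car photos would be hopeless at that sample size.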

------
anurag
> But I’ve always thought that the major advantage of using deep learning over
> simpler models is that if you have a massive amount of data you can fit a
> massive number of parameters.

The major advantage of deep learning is not that it works better on more data.
It's that it automatically learns features that would otherwise take expert
humans a lot of time and energy to figure out and hardcode into the system.

~~~
jclos
That's an advantage of convolutional nets. Deep fully connected nets don't do
that afaik.

~~~
IanCal
They do, at least as far as I understand the statement. Historically the big
benefit was training them layer by layer, which was like training a feature
detector then a feature of features detector etc. If that's still how they're
trained (been nearly a decade for me now) then they discover features rather
than you engineering them.

This meant that you could train on large unlabelled data and then small
amounts of labelled data.

~~~
jclos
Yeah now that I think about it my statement didn't make any sense, since each
intermediate layer computes a projection of the previous one, which is
technically feature learning. I still disagree with the original comment
though, because the intermediate representations of the data computed by a
fully connected network are nothing like the ones that would be built by a
human doing feature engineering. The ones learned by a convolutional layer
would be closer to human-understandable features.

------
dragandj
The article is spot on, but it also misses a simple thing: like all hype, the
DL hype is built on human irrationality. Most people do not understand DL well
(or at all), but they see some high-profile teams boasting about their
successes over and over. So, if _I_ would just use that magical DL tool, just
like them, maybe _I_ could do those awesome things too. Of course, there is
also a simpler, more mundane explanation: beefing up the CV with yet another
hyped technology. Hadoop: check; blockchain: check; deep learning: check.
Maybe even deep learning on the blockchain distributed over a million-node
Hadoop cluster. Keep them coming!

------
andreyk
I am surprised by all the criticism. Sure, this does not make the point
perfectly (with various details one could nitpick), but the basic premise
seems completely agreeable and even boring - there are ML and statistics
techniques that are simpler, more interpretable, and faster than deep learning
ones that can often be sufficiently robust for various problems/goals (SVMs
and random trees/forests in particular are lovely). At least, that's how it
seemed to me. The wording might be improved by altering "when your dataset
isn't that big" to "when your target function isn't that complex" or "when
your data isn't that complex", sure, but the point stands.

------
jclos
The main issue is that the VC dimension of a deep network is very high (IIRC
it grows with the number of edges in the network, which grows combinatorially
with its depth), and on any dataset smaller than that the network can simply
memorize the examples and achieve 100% training accuracy. However,
regularizing the network usually solves that problem.
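
The memorization point can be demonstrated without any deep learning machinery at all. This numpy sketch uses an overparameterized linear model (more parameters than samples) as a stand-in for a big network: it fits completely random labels perfectly, which is exactly the failure mode described above.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 10, 50                                # far more parameters than samples
X = rng.standard_normal((n, p))
y = rng.integers(0, 2, n).astype(float)      # completely random labels

# Minimum-norm least-squares fit: with p >= n and full row rank,
# the model can interpolate any labels, even pure noise.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

train_pred = (X @ w > 0.5).astype(float)
train_acc = (train_pred == y).mean()         # perfect "accuracy" on noise
```

Perfect training accuracy on random labels tells you nothing about generalization; that is why the comment's point about regularization (or just more data) matters.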

------
mandeepj
If your dataset is small, there are techniques to handle it. I bet the author
was never enrolled in a deep learning class; he just came up with a title
borrowed from 'don't use big data if your data isn't big'.

------
aub3bhat
There are so many issues with this post, let me enumerate:

1\. A straw-man tweet by some non-practitioner is used to set up the whole
argument.

2\. The whole digits example is ridiculous; statisticians "love" toy problems
to prove theorems & make "arguments" etc. ML is empirical, and not just the
performance but the entire pipeline from data to application matters.

Let me illustrate. Suppose your aim is to predict 1 vs 0 from images of
digits. As an ML researcher I would write a program to synthesize images in
all the different combinations of fonts, font colors, background colors, and
locations available. The data would easily be more than ~100,000 images. At
that point one cannot use LASSO on the top 10 pixels (due to jittering), and a
deep model would be necessary. In reality my model would outperform, because
my thinking process as an ML researcher was not to make an "argument" but to
"solve" the problem of detecting 1 vs 0.
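
The synthesis idea can be sketched in a few lines of numpy. The crude 16 x 16 renderer and jitter range below are made up, but they show how arbitrary amounts of labelled 1-vs-0 data can be generated, with enough translation jitter that no fixed set of 10 pixels stays informative.

```python
import numpy as np

rng = np.random.default_rng(7)

def render(digit, size=16):
    """Draw a crude 0 or 1 on a blank canvas at a random offset (jitter)."""
    img = np.zeros((size, size))
    dx, dy = rng.integers(-3, 4, 2)                # random translation
    cx, cy = size // 2 + dx, size // 2 + dy
    if digit == 1:
        img[cy - 5:cy + 5, cx] = 1.0               # vertical stroke
    else:
        for t in np.linspace(0, 2 * np.pi, 40):    # ring of lit pixels
            img[int(cy + 4 * np.sin(t)), int(cx + 3 * np.cos(t))] = 1.0
    return img

# Synthesize as many labelled examples as we like: 500 of each class here.
X = np.stack([render(d) for d in (0, 1) * 500])
y = np.array([0, 1] * 500)
```

A real pipeline would also vary fonts, colors, and backgrounds as the comment describes; the point is that the dataset size is under the practitioner's control.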

3\. But the biggest flaw is the following argument """The sample size matters.
If you are Google, Amazon, or Facebook and have near infinite data it makes
sense to deep learn."""

This is another issue with biostatisticians (the author of this post is a
biostatistics professor): they are fundamentally unable to recognize the
importance of programming and of the ability to collect data. Even if you are
not Google, Amazon, or Facebook, you can easily collect data; even labeled
data at terabyte scale can be collected within days or a week. Every single
PhD student I know is limited not by the size of the data but by the
computational power and storage available to them. I personally have several
terabytes of video and data from YFCC 100M that I would love to process and
build models on, but I am limited only by computational power & AWS costs. If
you want a concrete example, see the Google PlaNet paper [1]. Today I have
enough data (~5 TB) to replicate it and build an open-source geolocation
model; the only hurdles are storage and computation costs.

[1] [https://arxiv.org/abs/1602.05314](https://arxiv.org/abs/1602.05314)

~~~
sbov
> 3.

How much of this is students doing research where they already have access to
big data, which makes sense if your goal is to do deep learning research, vs
being given a problem a business wants to solve? Can you make the same
statement for the average problem at your average small-medium sized business?
Can you really get big data that is relevant to the local, non-chain coffee
shop down the street?

If you can it seems like an amazing business opportunity - to bring Google
level insights to businesses that don't directly have Google-level data.

~~~
PeterisP
The issue of whether some business like "the local, non-chain coffee shop down
the street" has any reason to use machine learning whatsoever seems orthogonal
to the problem discussed in the article, which is the choice of approaches
_if_ you're going to do some machine learning.

There's a classic quote from Tukey: "The combination of some data and an
aching desire for an answer does not ensure that a reasonable answer can be
extracted from a given body of data." Yes, it's quite likely that an average
small-to-medium-sized business has no problems where the possible benefit of
ML-driven insights would match the costs required to analyze whatever data
they have.

However, _if_ a small-medium business has some problem with a large enough
likely payback to justify making some ML system, it _is_ quite likely that
deep learning may be applicable on their data.

A big issue is transfer learning: in many domains, while _you_ may have a
small amount of data, you'd want a system that has learned to generalize on a
huge quantity of similar _external_ data and is then just tuned on your data.
For example, if a cookie bakery needs analysis of cookie pictures or reviews
of cookies, and has limited data samples, it would be reasonable to include
e.g. ImageNet data or an Amazon review corpus. You'd "teach" the system how
pictures/internet reviews/the English language/whatever else works on the
biggest data available, and just retrain/adapt it to your particular problem
afterwards.

