
You can probably use deep learning even if you don't have a lot of data - deepnotderp
http://beamandrew.github.io/deeplearning/2017/06/04/deep_learning_works.html
======
rkaplan
This post doesn't even mention the easiest way to use deep learning without a
lot of data: download a pretrained model and fine-tune the last few layers on
your small dataset. In many domains (like image classification, the task in
this blog post), fine-tuning works extremely well, because the pretrained
model has learned generic features in its early layers that are useful for
many datasets, not just the one it was trained on.

Even the best skin cancer classifier [1] was pretrained on ImageNet.

[1]:
[http://www.nature.com/articles/nature21056](http://www.nature.com/articles/nature21056)
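
For a sense of how little code this takes, here is a minimal sketch of the
fine-tuning recipe in Keras; the 80-image dataset below is a random
placeholder standing in for your small dataset, and the head sizes are
arbitrary choices, not a prescription.

```python
# Minimal fine-tuning sketch: freeze a pretrained ImageNet base and train
# only a small new head. x_small / y_small are placeholders, not real data.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the generic early-layer features fixed

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(2, activation="softmax"),  # new head for the small dataset
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

x_small = np.random.rand(80, 224, 224, 3).astype("float32")  # placeholder
y_small = np.random.randint(0, 2, size=(80,))                # placeholder
model.fit(x_small, y_small, epochs=5, batch_size=16)
```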

~~~
a_bonobo
This is how the great fast.ai course begins - download VGG16, fine-tune the
top with a single new dense layer, and get amazing results. The second or
third class shows how to make the top layers a bit more complex for even
better accuracy.

~~~
alexcnwy
Can't recommend the course highly enough!

~~~
yamaneko
I'm skimming through the content and it seems really great! I'm interested in
the last lesson (7-Exotic CNN Arch), but I'm afraid of missing other cool
stuff in past lessons.

What do you suggest for someone who has experience with Deep Learning?

EDIT: found this wiki with the course notes:
[http://wiki.fast.ai/index.php/Main_Page](http://wiki.fast.ai/index.php/Main_Page)

One can use it as a guide to avoid missing anything.

~~~
dirtyaura
I think one of the strengths of the course is that Jeremy shows parts of the
process of working on an ML problem. If you have time, I recommend watching
the earlier lessons, even if you already know the theoretical aspects of the
content covered.

------
shadowmint
You probably can... but is that really the issue?

I think the problem isn't that you can't solve problems with small amounts of
data; it's that you can't solve 'the problem' at a small scale and then just
apply that solution at a large scale... and that's not what people want or
expect.

People expect that if you have an industrial welder that can assemble
aeroplanes (apparently), then you should easily be able to check it out by
welding a few sheets of metal together, and if it welds well at a small
scale, that should be representative of how well it welds entire vehicles.

...but that's not how DNN models work. Each solution is a specific selection
of hyperparameters for the specific data and the specific shape of that data.
As we see here, specific even to the volume of data available.

It doesn't scale up _and_ it doesn't scale down.

To solve a problem you just have to sort of... mess around with different
solutions until you get a good one. ...and even then, you've got no really
strong proof your solution is good; just that it's better than the other
solutions you've tried.

That's the problem: it's really hard to know when DNNs are the wrong choice
vs. when you're just 'doing it wrong'.

------
erickscott
Andrew Beam's post offered very persuasive evidence that Jeff Leek's intuition
(that deep learning yields poor performance at small sample sizes) is
incorrect. The error bars and the consistent trend of higher accuracy with a
properly implemented deep learning model, particularly at smaller sample
sizes, are devastating to Leek's original post.

I think this is a fantastic example of the speed and self-correcting nature of
science in the internet age.

As an aside, @simplystats blocked me on Twitter, which I assume is in response
to this tweet:
[https://twitter.com/ErickRScott/status/871586233599893505](https://twitter.com/ErickRScott/status/871586233599893505)
and it seems that I'm likely not the only one blocked:
[https://twitter.com/jtleek/status/871693250947624961](https://twitter.com/jtleek/status/871693250947624961)

What's most concerning about @simplystats' blocking activity is the chilling
effect it has on discourse between differing perspectives. I've tried to come
up with a rationale for why highlighting the most recent evidence in a reply
to someone who sympathized with Leek's original post (btw, @thomasp85 liked
the tweet) is grounds for blocking, but I can't think of a reasonable one.

Further aside, is irq11 Rafael Irizarry?

Update: after I emailed the members of @simplystats, they removed the block
on my account and offered a reasonable explanation. SimplyStats is a force for
good in the world
([https://simplystatistics.org/courses/](https://simplystatistics.org/courses/))
and I look forward to their future contributions.

------
ska
It's an interesting conversation, but it's really weakened by failing to take
on the generalization problem head-on. This is something I see in a lot of
discussions about deep nets on smaller data sets, transfer learning or not.
The answer "it's built in" is particularly unsatisfying.

The plots shown certainly should raise the spectre of overtraining - and
rather than handwaving about techniques to avoid it, it would be great to see
a detailed discussion of how you convince yourself (i.e. with additional data)
that your model generalizes reasonably well. Deep learning techniques are no
panacea here.
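
One concrete version of that check, sketched below with scikit-learn and
synthetic data (nothing here comes from the post itself): hold out folds and
watch whether validation accuracy keeps tracking training accuracy as the
sample grows.

```python
# A sketch of one sanity check for generalization: a learning curve over
# held-out folds. X and y are synthetic stand-ins for your own data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # A persistent train/validation gap as n grows is the overtraining
    # signal being asked about.
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
```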

------
m3kw9
People keep saying "a lot" without even thinking that it's a relative term.
For images, a lot means enough to get to x percent accuracy; for OCR of a
single font, a lot means 26 letters plus special characters and numbers. Stop
saying "a lot" blindly as if everyone understands.

------
irq11
...but why would you?

The fact that there are people "getting their jimmies up" on questions of
training massively parameterized statistical models on tiny amounts of data
should tell you exactly where we are on the deep-learning hype cycle. For a
while there, SVMs were the thing, but now the True Faithful have moved on to
neural networks.

The argument this writer is making is essentially: "yes, there are lots of
free parameters to train, and that means that using it with small data is a
bad idea in general, but neural networks have overfitting tools now and
they're flexible so you should use them with small data anyway." This is
literally the story told by the bulleted points.

Neural networks are a tool. Don't use the tool if it isn't appropriate to your
work. Maybe you can find a way to hammer a nail with a blowtorch, but it's
still a bad idea.

~~~
tsiki
I think you're missing the point. The jimmies are getting rustled because
someone provided false information about performance to make his own argument
seem better. That's something anyone should be against.

~~~
irq11
but they didn't.

the writer makes an unconvincing claim that the original post was wrong. the
data presented shows only that if you try really hard and get lucky enough,
you can probably do as well as a simple regression in this case.

the author himself admits that deep learning is probably misapplied here, and
that training with such small data is difficult, at best. which again brings
us back to the important question (i.e. the point being made by the original
post): why would you ever do this?

~~~
dbecker
_if you try really hard and get lucky enough, you can probably do as well as a
simple regression in this case._

Maybe you aren't familiar with deep learning, but this isn't "trying really
hard." This is basic stuff that anyone using deep learning probably knows.

And deep learning doesn't just "do as well" as the simpler model. It does
meaningfully better at all sample sizes.

~~~
blueblob
Define "meaningfully better." Perhaps you mean statistically significantly
better? It may have better accuracy, but it has significantly less
interpretability. What does it capture that regression couldn't capture? At
least with regression you can interpret the relationship between all of the
variables and their relative importance by looking at the coefficients of the
regression. With deep learning, the best approaches for explanation are to
train another model at the same time that you use for explanation.
Additionally, it was proven that a perceptron can learn any function, so in
some senses the "deep" part of deep learning is because people are being lazy
because at least you could get a better interpretation of the perceptron. I
don't mean to imply that there's not a place for deep learning, but I think
this isn't a great refutation of the argument that fitting a deep model is
somewhat inappropriate for a small dataset.

~~~
dbecker
The model we are comparing against makes 10X as many errors.

I hadn't imagined someone would argue that's not a meaningful difference.

Though the difference is statistically significant too.

~~~
blueblob
Not sure what kind of argument that is. If something overfits, it will show
lower error; does that make it better? It may mean it would generalize a lot
less when run on more data. Whether or not something is meaningful depends on
what you take the meaning to be.

~~~
toth
Not the OP, but I wanted to point out that it has 10X less error on _the
holdout sample_, so it is not simply overfitting.

~~~
blueblob
It doesn't matter that it's on the holdout; he's partitioning an already small
dataset into 5 partitions and talking about the accuracy of using 80 points to
predict 20 points. The usual argument is that with large numbers of samples
you can establish a statistically significant difference in accuracy. When
you're predicting 20 points each with 5 (potentially different) models, you
likely don't have enough to talk about statistical significance.

~~~
dbecker
_We tried to mirror the original analysis as closely as possible - we did
5-fold cross validation but used the standard MNIST test set for evaluation
(about 2,000 validation samples for 0s and 1s). We split the test set into 2
pieces. The first half was used to assess convergence of the training
procedure while the second half was used to measure out of sample predictive
accuracy._

Predictive accuracy is measured on 1000 samples, not 20.
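
For what it's worth, significance at that holdout size is easy to check
directly; a sketch with made-up error counts (illustrative only, not the
post's actual numbers):

```python
# Sketch of the significance check on a ~1000-point holdout. The error
# counts are hypothetical, purely to show the calculation.
from scipy.stats import fisher_exact

n = 1000
errors_deep, errors_linear = 5, 50  # made-up: 0.5% vs. 5% error rates
table = [[errors_deep, n - errors_deep],
         [errors_linear, n - errors_linear]]
_, p_value = fisher_exact(table)
print(f"p = {p_value:.2e}")  # at n=1000, a 10x error gap is far from noise
```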

------
j7ake
Of course you can use it, but does it perform better than "shallow" methods
such as Gaussian processes, SVMs, and multivariate linear regression, either
by theoretical or by empirical evidence?

~~~
minimaxir
The original post used a linear regression (and apparently misimplemented an
intended logistic regression); this post sees better results at all sample
sizes with a proper deep learning approach.
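
To make the distinction concrete, here is a minimal sketch of linear vs.
logistic regression on a synthetic binary task (the data is a placeholder,
not the post's):

```python
# Sketch of the distinction at issue, on synthetic binary data: linear
# regression predicts an unbounded value, logistic regression a probability.
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression

X, y = make_classification(n_samples=80, n_features=64, random_state=0)

linear = LinearRegression().fit(X, y)                    # what the post did
logistic = LogisticRegression(max_iter=1000).fit(X, y)   # what it intended

print(linear.predict(X[:3]))           # raw values, can fall outside [0, 1]
print(logistic.predict_proba(X[:3]))   # proper class probabilities
```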

------
zensavona
Maybe what you should do is deep learn some data and then do some deep
learning with your deep learned [deep] data.

Deep.

------
known
aka Wisdom of Crowds

------
deepnotderp
Also, transfer learning is a big one.

Case in point: the Silicon Valley "Not Hotdog" classifier, which they limited
to hotdog-or-not due to a lack of training data, when in reality they could've
just used a net pretrained on ImageNet. Lol, I was literally cringing through
that episode so hard xD

~~~
minimaxir
The developer behind the real-world app mentioned why they did not use
pretraining, and why you can't always use pretraining:
[https://news.ycombinator.com/item?id=14347513](https://news.ycombinator.com/item?id=14347513)

> We ended up with a custom architecture trained from scratch due to runtime
> constraints more so than accuracy reasons (the inference runs on phones, so
> we have to be efficient with CPU + memory), but that model also ended up
> being the most accurate model we could build in the time we had. (With more
> time/resources I have no doubt I could have achieved better accuracy with a
> heavier model!)

~~~
deepnotderp
You could still have pretrained the custom architecture.

------
aub3bhat
This debate is meaningless for several reasons:

1. The original argument is a strawman. What do they mean by "data"? Is it
survey results, microarrays, "natural" images, natural-language text, or
readings from an audio sensor? No ML researcher would argue that applying
complex models such as CNNs is useful for, say, survey data. But if the data
is domain specific - natural-language text, images taken in a particular
context, etc. - then a model and parameters that are known to exhibit good
performance are a good starting point.

2. Unlike statisticians, who view data as, say, a matrix of measurements or a
"data frame," machine learning researchers view data at a higher level of
representation. E.g., an image is not merely a matrix but an object that can
be augmented by flipping it horizontally, changing its contrast, etc. In the
case of text, you can render characters using different fonts, colors, etc.
(a minimal sketch of this follows at the end of this comment).

3. Finally, the example used in the initial blog post, predicting 1 vs. 0
from images, is itself ill-chosen. Sure, a statistician would "train" a linear
model to predict 1 vs. 0, but as an ML researcher I would NOT train any model
at all; I would just use [1], which has state-of-the-art performance on
character recognition in widely varying conditions. When you have only 80
images, why risk assuming that they are sampled in an IID manner from the
population, instead of simply using a model that's trained on a far larger
population?

Now, the final argument might look suspicious, but it's crucial to
understanding the difference between AI/ML/CV and statistics. In AI/ML/CV the
assumption is that there are higher-level problems (character recognition,
object recognition, scene understanding, audio recognition) which, once
solved, can be applied in the wide variety of situations where they appear.
Thus, when you encounter a problem like digit recognition, the answer an ML
researcher would give is to use a state-of-the-art model.

[1] [https://github.com/bgshih/crnn](https://github.com/bgshih/crnn)
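
The augmentation sketch promised in point 2, using tf.image as one plausible
toolkit; the image tensor is a random placeholder, not real data.

```python
# Sketch of the augmentation idea: one image yields several label-preserving
# training examples via flips and contrast/brightness jitter.
import tensorflow as tf

image = tf.random.uniform((224, 224, 3))              # stand-in for a photo

augmented = [
    tf.image.random_flip_left_right(image),           # horizontal flip
    tf.image.random_contrast(image, 0.8, 1.2),        # contrast jitter
    tf.image.random_brightness(image, max_delta=0.2), # brightness jitter
]
# Each variant is an equally valid training example for natural images.
# (Flips would be wrong for digits - think 6 vs. 9 - so the augmentations
# chosen must respect the domain.)
```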

~~~
Sean1708
Are you using some accepted definition of Statistics that I'm unaware of?
Because I always thought Machine Learning was a branch of Statistics.

~~~
aub3bhat
>> Machine Learning was a branch of Statistics

That's incorrect.

"Machine learning is the subfield of computer science that, according to
Arthur Samuel in 1959, gives 'computers the ability to learn without being
explicitly programmed.'"

-- Wikipedia

~~~
Sean1708
That's fair enough, although I'm still not really sure why statisticians have
to think about data in such a different way from machine learning researchers.
I feel that if a statistician _didn't_ look at the bigger picture of what
their data is actually about and whether there are existing techniques to
tackle the problem, they'd make a pretty terrible statistician.

~~~
autokad
I don't know if the commenter's assertions are correct, but here's my
anecdotal experience: a statistician asks me if adding a feature could make a
model worse (in R^2), and I'm like, of course! He gets testy and snidely
chimes back, 'I don't know why you think that could happen.' And I think, 'I
don't know why you don't think you can overfit data...'

Then it hit me: statisticians' thinking revolves around running the model on
the entire data set, whereas my thinking revolves around how it performs on
test data.
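
That disagreement is easy to demonstrate; a sketch with synthetic data, where
pure-noise features raise in-sample R^2 but drag down the test score:

```python
# Sketch of the disagreement: junk features can only raise in-sample R^2,
# yet they tend to lower R^2 on held-out test data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_signal = rng.normal(size=(60, 1))
y = 2 * X_signal[:, 0] + rng.normal(size=60)
X_noisy = np.hstack([X_signal, rng.normal(size=(60, 5))])  # add 5 junk cols

for name, X in [("signal only", X_signal), ("plus noise ", X_noisy)]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                              random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    print(name, "train R^2:", round(model.score(X_tr, y_tr), 3),
          "test R^2:", round(model.score(X_te, y_te), 3))
```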

~~~
aub3bhat
There is a good paper by Leo Breiman [1] which discusses exactly the issue you
mention. Statisticians believe that there is a "true" model that generates the
data, and that the observed errors are merely noise. ML, on the other hand,
does not assume the existence of a "true" model; the assumption is that the
data alone is the source of truth, and any model that predicts the data with
the lowest error is preferable, subject to performance on
test/cross-validation sets, etc. This is a powerful approach and distinguishes
ML as a field separate from statistics.

My favorite example is predicting real estate prices. A statistician will
build a multi-level model that takes into account various effects (zip-code
level, city level, school-district level, year/month of acquisition, etc.) and
then build a regression model, treating the remaining errors as simply noise.
An ML approach would be to simply use weighted K-Nearest Neighbors with
geographic location as part of the distance metric. Sure, there are no effects
to adjust, but the K-NN regression model can account for hard-to-capture
quirks of geography by representing a local, non-linear decision surface.

[1]: [http://projecteuclid.org/euclid.ss/1009213726](http://projecteuclid.org/euclid.ss/1009213726)
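
A hedged sketch of what that K-NN approach might look like; the coordinates
and prices below are random placeholders, not real listings.

```python
# Distance-weighted K-NN regression with geographic location as part of
# the feature space, per the example above. Data is synthetic.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = np.column_stack([
    rng.uniform(37.0, 38.0, 500),      # latitude
    rng.uniform(-123.0, -122.0, 500),  # longitude
    rng.uniform(40, 200, 500),         # size in square meters
])
prices = rng.uniform(2e5, 2e6, 500)

# weights="distance" makes nearby comparables count for more, which is how
# the model picks up local quirks of geography. In practice you'd rescale
# the columns so location carries the intended share of the distance.
model = KNeighborsRegressor(n_neighbors=10, weights="distance").fit(X, prices)
print(model.predict([[37.5, -122.5, 100]]))
```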

