
Using neural nets to recognize handwritten digits - ivoflipse
http://neuralnetworksanddeeplearning.com/chap1.html
======
svantana
Very well written, and I applaud the effort. But personally I don't care for
the "magical" aura that writers tend to give ANNs - to me, they are simply
(non-linear) function approximators that have a nice fitting algorithm. They
work well for some problems and poorly for others. Also, beware of over-
fitting - ANNs tend to be parameter-heavy, although there are approaches to
prune the connections.

~~~
Houshalter
>they are simply (non-linear) function approximators that have a nice fitting
algorithm.

Couldn't "function approximator" describe most machine learning approaches?
And a nice fitting algorithm is of course the goal.

~~~
shoo
Yes.

I'd go further - the "nice fitting algorithm" is to minimise the error on the
training set as a function of the weight parameters, and one obvious way to
(locally) minimise that is gradient descent + the chain rule.

The math / applied math machinery is all incredibly general, and a useful way
to think about many machine learning algorithms.

[http://en.wikipedia.org/wiki/Gradient_descent](http://en.wikipedia.org/wiki/Gradient_descent)
[http://en.wikipedia.org/wiki/Chain_rule#Higher_dimensions](http://en.wikipedia.org/wiki/Chain_rule#Higher_dimensions)

Not to say that applied math is the only valuable perspective; there are
clearly statistical and computational views as well. E.g. we're trying to
approximate a function that we can never directly evaluate (the error on the
samples we haven't seen yet).
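
To make that concrete, here's a minimal numpy sketch of the whole view: a
one-hidden-layer net fit to a toy 1-D function by minimising squared error on
the training set with gradient descent, the gradients coming from the chain
rule. Sizes and the learning rate are arbitrary illustrative choices:

    # Fit y = sin(x) with a tiny one-hidden-layer net: minimise squared
    # training error by gradient descent, gradients via the chain rule.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-3, 3, size=(200, 1))      # training inputs
    y = np.sin(x)                              # training targets

    W1 = rng.normal(scale=0.5, size=(1, 20))   # input -> hidden weights
    b1 = np.zeros(20)
    W2 = rng.normal(scale=0.5, size=(20, 1))   # hidden -> output weights
    b2 = np.zeros(1)

    lr = 0.05
    for step in range(5000):
        h = np.tanh(x @ W1 + b1)               # forward pass
        pred = h @ W2 + b2
        err = pred - y

        d_pred = 2 * err / len(x)              # d(loss)/d(pred)
        dW2 = h.T @ d_pred                     # chain rule, layer by layer
        db2 = d_pred.sum(axis=0)
        d_h = (d_pred @ W2.T) * (1 - h**2)     # tanh'(z) = 1 - tanh(z)^2
        dW1 = x.T @ d_h
        db1 = d_h.sum(axis=0)

        W1 -= lr * dW1; b1 -= lr * db1         # (local) gradient descent step
        W2 -= lr * dW2; b2 -= lr * db2

    print("final training loss:", float(np.mean(err**2)))

The backward pass is just the chain rule applied layer by layer, which is all
backpropagation is.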

------
ivoflipse
If you liked his first chapter, consider supporting his Indiegogo campaign for
the whole book ([http://www.indiegogo.com/projects/neural-networks-and-
deep-l...](http://www.indiegogo.com/projects/neural-networks-and-deep-
learning-book-project/)).

~~~
corin_
In case the author is reading this thread - might be worth adding a couple
more reward tiers, for example an equivalent of the "major sponsor" on an
individual level, maybe $30-60 to be named somewhere as a supporter. $15-$200
seems like a very big gap (and of course you can choose to donate in between,
but I presume that reward tiers are effective in pushing people to donate
more).

Looking forward to reading chapter one when I have time, though I suspect it
will confuse me quite a lot...

edit: I see he is; he already commented before I wrote this

------
joe_the_user
_" The adder example demonstrates how a network of perceptrons can be used to
simulate a circuit containing many NAND gates. And because NAND gates are
universal for computation, it follows that perceptrons are also universal for
computation."_

I think this comment from the article needs caveats. Of course, a neural
network would not qualify as Turing complete, if only because it's finite.
Keep in mind also that a neural network, lacking anything like counters, tape,
or recursion, couldn't approximate a Turing machine in the way that a finite
von Neumann architecture machine does. (A NN can represent any given function
over a domain if it gets large enough, which is more like the universality of
a finite automaton.)

I know this is a reference to this generation of NNs having overcome an
earlier problem of _not_ being able to represent an XOR gate, but still, it's
worth keeping in mind that an ordinary computer can simulate an NN with just a
program while this doesn't work vice versa, so NNs in that sense are far from
universal.

~~~
michael_nielsen
Networks of perceptrons are universal in the standard sense used when talking
about circuits --- they can compute any finite Boolean function.

I agree that the relationship between circuit complexity and Turing machines
is somewhat subtle, for the reasons you mention. The relationship is greatly
clarified by the notion of uniform circuit complexity, which makes it possible
to prove an equivalence between a (carefully defined notion of) circuit
complexity and Turing machine complexity. Unfortunately, I don't know of a
good online treatment of uniform circuit complexity. I learnt it through a
1993 paper by Andy Yao, but that's definitely not a good introductory
reference!

In any case, in my book I'm using the term universal in the same way as people
usually use it for circuits, i.e., it means the same thing as when people say
that the NAND gate is universal for computation. Hope that clarifies things.
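
To spell that sense of "universal" out: a perceptron outputs 1 when w.x + b >
0 and 0 otherwise, and with weights (-2, -2) and bias 3 (the NAND example from
the chapter) it computes NAND. A quick check:

    # Perceptron rule: output 1 if w.x + b > 0, else 0. With weights
    # (-2, -2) and bias 3 this computes NAND.
    def perceptron(x1, x2, w1=-2, w2=-2, b=3):
        return 1 if w1 * x1 + w2 * x2 + b > 0 else 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "->", perceptron(x1, x2))
    # 0 0 -> 1, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0: the NAND truth table

Since NAND is universal for Boolean circuits, networks of such units can
compute any finite Boolean function.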

------
tba
This is a cool exercise! After completing it, I wanted to find out exactly
what each NN hidden node represented. I trained a tiny (10 hidden node) NN on
an OCR dataset and created a visualization here:
[https://rawgithub.com/tashmore/nn-
visualizer/master/nn_visua...](https://rawgithub.com/tashmore/nn-
visualizer/master/nn_visualizer.html) .

Can anyone figure out what each hidden node represents?

You can also select a node and press "A" (Gradient Ascent). This will change
the input in a way that increases the selected node's value. By selecting an
output node and mashing "A", you can run the NN in reverse, causing it to
"hallucinate" a digit.

------
MechSkep
What about convolutional neural nets? They weren't mentioned, but that's
really what most of the deep learning approaches use...

~~~
michael_nielsen
They're discussed later in the book. The first chapter is an introduction, and
I didn't want to introduce convolutional nets before (for example) fundamental
techniques such as stochastic gradient descent and backpropagation.

~~~
MechSkep
Great! How far does the book go in terms of advanced approaches? Up to the
current state of research?

~~~
michael_nielsen
My current plan is to describe some pretty recent results -- most likely, the
big breakthrough on ImageNet by Krizhevsky, Sutskever and Hinton
([http://www.cs.utoronto.ca/~ilya/pubs/2012/imgnet.pdf](http://www.cs.utoronto.ca/~ilya/pubs/2012/imgnet.pdf)),
which uses convolutional nets. I may also describe the famous Google-Stanford
"cat neuron" paper
([http://ai.stanford.edu/~ang/papers/icml12-HighLevelFeaturesU...](http://ai.stanford.edu/~ang/papers/icml12-HighLevelFeaturesUsingUnsupervisedLearning.pdf)
). But at this point things are moving so quickly that I'll keep my options
open, and if more exciting things come up, I may change my plans.

Of course, there's a tremendous amount going on, so my broader philosophy is
to focus on fundamentals. Readers who thoroughly master the core ideas
shouldn't have much trouble later getting up to speed with the result-of-the-
month.

~~~
VladRussian2
>some pretty recent results -- most likely, the big breakthrough on ImageNet
by Krizhevsky, Sutskever and Hinton
([http://www.cs.utoronto.ca/~ilya/pubs/2012/imgnet.pdf](http://www.cs.utoronto.ca/~ilya/pubs/2012/imgnet.pdf)),
which uses convolutional nets.

The kernels learned by the first convolutional layer (figure 3 on page 6)
have an uncanny resemblance to the Gabor-function-modeled orientation-
selective cells ("bar and grating cells") in the primary visual cortex. Looks
like computers are on the right track :)

[http://www.cs.rug.nl/~petkov/publications/bc1997.pdf](http://www.cs.rug.nl/~petkov/publications/bc1997.pdf)

"The discovery of orientation-selective cells in the primary visual cortex of
monkeys almost 40 years ago and the fact that most of the neurons in this part
of the brain are of this type ..."

The difference here is a "numbers game" - the visual cortex contains cells
whose receptive fields' positions, eccentricities, sizes, orientations, and
numbers of excitatory and inhibitory zones (e.g. Fig. 1 in the link) make a
reasonable coverage of the space of possible values. I.e. the number of these
cells is in the millions, vs. 96. Of course, it is only a matter of computing
power to run all reasonable combinations of kernels emulating the real visual
cortex, yet it would put an immense computational challenge onto the second
and subsequent layers until we understand what happens [or should happen]
there.
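
For readers who haven't met Gabor functions: they are just sinusoidal gratings
under a Gaussian envelope, which is why oriented, frequency-tuned kernels end
up looking like V1 receptive fields. A minimal sketch, with all parameter
values purely illustrative:

    # A Gabor kernel: a sinusoidal grating (orientation theta, wavelength
    # lam) windowed by an elongated Gaussian envelope -- the standard
    # model of orientation-selective V1 receptive fields.
    import numpy as np

    def gabor(size=11, theta=0.0, lam=4.0, sigma=2.5, gamma=0.5, psi=0.0):
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)    # rotated coordinates
        yr = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
        return envelope * np.cos(2 * np.pi * xr / lam + psi)

    kernel = gabor(theta=np.pi / 4)    # one 45-degree-oriented kernel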

~~~
apu
FWIW, many vision researchers believe that the resemblance of the first
convolutional layer to Gabor filters is perhaps more a case of selection bias
than anything else. The argument goes that were they _not_ the output of the
first layer, that paper wouldn't get accepted =)

I'm not sure if I fully believe this, but certainly there doesn't seem to be a
very principled way to choose your network architecture. Different people
propose different ones, and the fundamental justification for each one seems
to be: "look, we recreate Gabor filters in layer 1 and we get good numbers at
the end!"

Of course, NN people argue that that's almost exactly what vision people do as
well, except in "feature-land" rather than "architecture-land".

~~~
VladRussian2
>FWIW, many vision researchers believe that the resemblance of the first
convolutional layer to Gabor filters is perhaps more a case of selection bias
than anything else. The argument goes that were they not the output of the
first layer, that paper wouldn't get accepted =)

Well, I can see the temptation - orientation and spatial-frequency selectivity
are the major characteristics of cells in V1, and the receptive fields of the
first layer there do look like Gabor functions:

[http://www.scholarpedia.org/article/Area_V1#Receptive_fields](http://www.scholarpedia.org/article/Area_V1#Receptive_fields)

I agree that such a close resemblance of the learned kernels to Gabor
functions is almost too good; this is why I used "uncanny" :) If it is real,
then I think it manifests very interesting and, no pun intended, deep emergent
properties of the neural net learning process (something along the lines of
"maximum-entropy kernels that still do the job" as the asymptotic state).

Btw, is it really selection or confirmation bias?

And to expand on the previous point about convolving the input with many, many
kernels - it happens to be on the order of 40 per "pixel":

"V1 contains a vast number of neurons. In humans, it contains about 140
million neurons per hemisphere (Wandell, 1995), i.e. about 40 V1 neurons per
LGN neuron. Such divergence gives scope for extensive processing of the images
received from LGN."

------
cdurr
Ahh, the MNIST database of handwritten digits. I never took an ML course and I
was only able to achieve an 87% recognition rate 6 years ago for a university
software engineering project. I read about others achieving 99.9% recognition
rates with their ANNs, so I wasn't happy with my result. I tried to self-study
ANNs but found most material to be either too simple or too complicated. I
finally found some articles about ANNs
([http://visualstudiomagazine.com/Articles/List/Neural-
Network...](http://visualstudiomagazine.com/Articles/List/Neural-Network-
Lab.aspx)) with code samples in C#, so I'll finally be looking into rewriting
my old code to get a better result.

------
snarfy
That reminds me of a video I saw about something called restricted Boltzmann
machines:

[http://www.youtube.com/watch?v=AyzOUbkUf3M&t=24m0s](http://www.youtube.com/watch?v=AyzOUbkUf3M&t=24m0s)

~~~
kyzyl
Caveat emptor. Geoff's videos are a great way to launch into the field of deep
learning, but do bear in mind that they are beginning to age. A lot of the
stuff about why things work, what is state-of-the-art, and where the work is
headed is now dated (even according to Hinton himself).

------
Draco6slayer
I'm not sure if I am making a mistake, but I couldn't use the command listed
in order to clone the repository. I'm using Windows, with Git installed, and I
received the error: "Permission Denied: publickey"

I was able to get everything by looking you up on GitHub and using the URL of
the repository.

Edit: Also, you might mention the repository earlier, because it's rather
large and I've had to break from the book while it downloads.

~~~
michael_nielsen
Thanks for the tip, I'll look into it!

------
bcuccioli
I did exactly this for a school project about a year ago:

[https://github.com/bcuccioli/neural-ocr](https://github.com/bcuccioli/neural-
ocr)

There's a paper in there that explains the design of the system and my
results, which weren't great, probably due to the small size of the training
data.

------
hcarvalhoalves
Isn't that the same material covered in Andrew Ng's Coursera course "Machine
Learning", down to the training data?

~~~
diab0lic
It may well be similar material to Ng's Coursera course, but the table of
contents on the right shows that this is obviously going to transition into
the topic of deep networks -- part of Ng's research but not his Coursera
course.

The training data is the MNIST dataset, derived from data NIST released some
time ago; anyone is allowed to use it. It is truly no surprise to see it here,
as it is a very commonly used dataset in ML tutorials/books. It receives some
discussion in Artificial Intelligence: A Modern Approach by Russell and
Norvig, and even in the Theano getting-started tutorials.
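
As an aside, pulling MNIST yourself is easy; here's a sketch using
scikit-learn's OpenML fetcher (just one convenient route, assuming
scikit-learn is available; the book provides its own loader):

    # Fetch the 70,000 MNIST digits (784 pixels each) from OpenML and
    # scale pixel values to [0, 1].
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml("mnist_784", version=1, as_frame=False)
    X, y = mnist.data / 255.0, mnist.target
    print(X.shape, y[:10])    # (70000, 784) and the first ten labels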

~~~
kyzyl
That's right. In fact, for a subclass of image recognition tasks, MNIST has
become the standard benchmark. Pretty much all of Geoff Hinton's early work on
deep learning used MNIST to track the progress of his (their) methods.

Indeed, if the goal of the book is to educate people about the field, then I
would say it's definitely best to use the standard benchmarks. It lets readers
relate what's in the book to the literature, should they desire, and just as
in academia, it lends credence to the author's statements. I've seen people
take a _lot_ of flak for publishing work that doesn't use the standard
datasets but makes claims of progress.

