
A Beginner's Guide to Understanding Convolutional Neural Networks - kilimchoi
https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-Neural-Networks/
======
Dzugaru
I have yet to see an illustration that conveys the multichannel convolution
filter (MCCF) concept clearly. Why does the channel stack size keep growing?
How are the channels actually connected?

The thing is that each conv filter consists of one kernel per input channel
(that's why first-layer filter visualisations are in color, btw: a color image
has 3 channels, so each first-layer filter is itself a small 3-channel image).
We convolve each kernel with its corresponding input channel, then _sum_
(that's the key) the responses. Having multiple MCCFs (usually more at each
layer) yields a new multi-channel image (say, 16 channels), and we then apply
a new set of (say, 32) 16-channel MCCFs to it (which we cannot visualise by
themselves anymore, since we'd need a 16-dimensional image for each filter),
yielding a 32-channel image. That sort of thing is almost never explained
properly.
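
A minimal numpy sketch of that summing step (the shapes below are made-up
illustration, not anything from the article):

```python
import numpy as np
from scipy.signal import convolve2d

x = np.random.randn(3, 32, 32)     # a 3-channel (e.g. RGB) input: (channels, h, w)
filt = np.random.randn(3, 5, 5)    # ONE filter = one 5x5 kernel per input channel

# Convolve each kernel with its corresponding channel, then SUM the responses:
# a single multi-channel filter produces a single 2-D response map.
response = sum(convolve2d(x[c], filt[c], mode='valid') for c in range(x.shape[0]))
print(response.shape)              # (28, 28)

# A layer with 16 such filters stacks 16 response maps into a 16-channel
# "image"; the next layer's filters must then each contain 16 kernels.
```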

~~~
jhj
Valid-only convolution (in the MATLAB sense) by itself reduces the
dimensionality of the input; for images, it will go from (h x w) to
(h - kh + 1) x (w - kw + 1) for each plane.

You can think of a convnet as a series of feature transformations, each
consisting of a normalization/whitening stage, a filter bank that is a
projection into a higher dimension (on an overcomplete basis), a non-linear
operation in that higher-dimensional space, and then possibly pooling down to
a lower-dimensional space.

The “filter bank” (aka convolution) and non-linearity produce a non-linear
embedding of the input in a higher dimension; in convnets, the “filter bank”
itself is learned. Classes or features are easier to separate in the higher-
dimensional space. There are still-developing ideas on putting all of this on
firmer mathematical ground via connections to wavelet theory and the like, but
for the most part, it just works "really well".
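
As a rough sketch of one such stage (the specific choices here are
assumptions: per-plane standardization for the whitening, ReLU for the
non-linearity, 2x2 max pooling, and made-up sizes):

```python
import numpy as np
from scipy.signal import convolve2d

def conv_stage(x, filters):
    """x: (in_planes, h, w); filters: (out_planes, in_planes, kh, kw)."""
    # Normalization/whitening stage (here: per-plane standardization).
    x = (x - x.mean(axis=(1, 2), keepdims=True)) / (x.std(axis=(1, 2), keepdims=True) + 1e-5)

    # "Filter bank": project in_planes onto out_planes (a higher dimension).
    out = np.stack([
        sum(convolve2d(x[i], f[i], mode='valid') for i in range(x.shape[0]))
        for f in filters
    ])

    # Non-linearity in the higher-dimensional space (ReLU).
    out = np.maximum(out, 0.0)

    # Pooling down to a lower spatial resolution (2x2 max pool).
    p, h, w = out.shape
    out = out[:, :h // 2 * 2, :w // 2 * 2]
    return out.reshape(p, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = np.random.randn(3, 32, 32)              # e.g. a 3-channel input
filters = np.random.randn(16, 3, 5, 5)      # 16 filters, 3 kernels each
print(conv_stage(x, filters).shape)         # (16, 14, 14)
```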

For an image network, at each layer there are (input planes x output planes)
convolution kernels of size (kh x kw).

Each output plane `j` is a sum over all input planes `i` individually
convolved using the filter (i, j); the reduction dimension is the input plane.
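
A minimal loop-nest sketch of that reduction (no stride or padding, using the
cross-correlation convention common in deep learning frameworks; the shapes
are assumptions):

```python
import numpy as np

def conv_forward(x, weight):
    """x: (in_planes, h, w); weight: (out_planes, in_planes, kh, kw)."""
    in_planes, h, w = x.shape
    out_planes, _, kh, kw = weight.shape
    out = np.zeros((out_planes, h - kh + 1, w - kw + 1))
    for j in range(out_planes):        # each output plane j ...
        for i in range(in_planes):     # ... reduces (sums) over input planes i
            for y in range(h - kh + 1):
                for x0 in range(w - kw + 1):
                    out[j, y, x0] += np.sum(x[i, y:y + kh, x0:x0 + kw] * weight[j, i])
    return out
```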

~~~
jhj
see
[https://github.com/facebook/fbcunn/blob/master/test/Referenc...](https://github.com/facebook/fbcunn/blob/master/test/ReferenceConvolutions.cpp#L192)

for a loop nest that shows what the forward pass of a 2-d image convnet
convolution module does. That's gussied up with convolution stride and padding
and a bunch of C++11 mumbo jumbo, but you should be able to see what it is
doing.

------
chrisruk
[http://arxiv.org/abs/1602.04105#](http://arxiv.org/abs/1602.04105#) \-- This
paper is an awesome use of CNNs: automatic modulation recognition of RF
signals.

I'm attempting to use their approach with GNU Radio currently -

[https://radioml.com/blog/2016/07/18/towards-a-gnu-radio-cnn-...](https://radioml.com/blog/2016/07/18/towards-a-gnu-radio-cnn-tensorflow-block/)

------
danielmorozoff
Great writeup from the Stanford CS231n course:
[http://cs231n.github.io/convolutional-networks/](http://cs231n.github.io/convolutional-networks/)

------
sjnair96
Damn, the author is a freshman!

------
thallukrish
A human child learns much more easily: after seeing only a handful of images
of a cat, it can recognize almost any type of cat image as it grows (without
ever seeing a million or a billion images). So there seems to be something
beyond the sheer amount of data; the "reality" of seeing a real cat probably
includes all possible aspects of a cat. There seems to be something missing
from this whole deep learning approach and the way it tries to simulate human
cognition.

------
vonnik
Here's an intro to ConvNets in Java:
[http://deeplearning4j.org/convolutionalnets.html](http://deeplearning4j.org/convolutionalnets.html)

Karpathy's stuff is also great:
[https://cs231n.github.io/](https://cs231n.github.io/)

------
crncosta
Very well illustrated.

------
cynicaldevil
I am new to CNNs/machine learning, but here's my $0.02: Regardless of which
technique you use, it seems that the amount of data required to learn is too
high. This article talks about neural networks accessing billions of
photographs, a number which is nowhere near the number of
photos/objects/whatever a human sees in a lifetime. Which leads me to the
conclusion that we aren't extracting much information from the data. These
techniques aren't able to calculate how the same object might look under
different lighting conditions, different viewing angles, positions, sizes, and
so on. Instead, companies just use millions of images to 'encode' the
variations into their networks.

Imo there should be a push towards adapting CNNs to calculate/predict how the
object might look under different conditions, which might lead to other
improvements. This could also be extended to areas other than image
recognition.

~~~
karpathy
People rarely train on billions of images; we're usually around the scale of
~a million. This already works quite well in many respects. A back-of-the-envelope
calculation assuming vision at about 10 fps gives ~1B images by the age of 5.
And humans aren't necessarily starting from scratch the way our machine
learning systems do.
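
For what it's worth, a rough version of that back-of-the-envelope arithmetic
(the waking-hours figure is an assumption):

```python
# ~10 fps of visual input, assuming ~12 waking hours/day, for 5 years.
frames_by_age_5 = 10 * 3600 * 12 * 365 * 5
print(frames_by_age_5)  # 788,400,000 -> on the order of ~1B "images" by age 5
```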

It's not clear that people can calculate what an object might look like from
different viewing angles; even if they could, it's not clear you would want to
in an application; and even if you did, there's quite a bit of work on this
(e.g. many related papers here:
[http://www.arxiv-sanity.com/1511.06702v1](http://www.arxiv-sanity.com/1511.06702v1)).
At least so far I'm not aware of convincing results suggesting that doing so
improves recognition performance (which, in most applications, is what people
care about).

~~~
cynicaldevil
Even if we assume that a 5-year-old has seen 1,000-1,500 pictures of, say,
cats in their lifespan, that is still far fewer than the number of images
required to train a CNN to label them as accurately as a human can.

And of course, I am not talking about just viewing angles. There are several
other factors, but I only mentioned the ones which I could think of.

~~~
jm547ster
Every second a human opens their eyes, they are seeing a constant stream of
changing "pictures" on which to train on.

~~~
kimolas
This is the right perspective. It seems the OP believes actual photographs are
privileged in some way. In reality, any visual input from our eyes counts as
training data, as you said.

~~~
cynicaldevil
You seem to forget that the photos are labelled, which counts as supervised
learning. What us humans excel at is unsupervised learning, which is difficult
for machines. But yes, I agree that humans have the advantage of continuous
video access.

