
An Intuitive Explanation of Convolutional Neural Networks (2016) - jlukecarlson
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
======
jatsign
How do CNNs work when the output is multiple categories? For instance, the
same image contains a cat and a dog and a car. What does the architecture look
like: multiple CNNs, each of which can predict one category? Or does one CNN
have multiple outputs, and if a score > threshold, that category is added to
the list shown to the user?

Also, how do CNNs draw a box around the target in the image?

~~~
T_D_K
First question: The network is trained to recognize a fixed set of outputs.
That's what makes it a classifier -- it classifies an input into a single
output. It does this by giving each output possibility a score, and the
highest score is its guess for what the original image is. So if I have a
network that I train to recognize cats, dogs, and cars, and I get an output
like {cat: .13, dog: .85, car: .02}, then the input was most likely a dog. The
network calculates all of those values simultaneously.

You can, of course, tell the network to output whatever you want: all of the
guesses, best guess, top five guesses, all guesses over a threshold, etc.

Note, this is a gross oversimplification, but it gets the general concept
across.
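The score-and-threshold idea above can be sketched in a few lines. The class names, scores, and threshold here are made-up values, not output from any real network:

```python
# Hypothetical class scores, as a classifier's final layer might produce them.
scores = {"cat": 0.13, "dog": 0.85, "car": 0.02}

# Best single guess: the class with the highest score.
best = max(scores, key=scores.get)

# Multi-label style: report every class whose score clears a threshold.
threshold = 0.5
labels = [c for c, s in scores.items() if s > threshold]

print(best)    # dog
print(labels)  # ['dog']
```

This is just the decision rule applied after the network has computed all of the scores simultaneously, as described above.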

------
air7
> Parameters like number of filters, filter sizes, architecture of the network
> etc. have all been fixed before Step 1 and do not change during training
> process – only the values of the filter matrix and connection weights get
> updated.

Is this just the article's over-simplification or are these values really just
randomly selected?

~~~
T_D_K
These are called hyperparameters (number of filters, filter sizes, stride
length, pooling function, activation function, and a whole host of others not
mentioned in this article). They are chosen "randomly" in the sense that it
isn't an exact science, i.e. there is no "right" answer. However, intuition
and experience are used as a guide to select reasonable values.

The values in the filter matrices and the weights and biases of the fully
connected layers are truly random though. They are often initialized with
Gaussian random values. Sometimes they are just initialized as all 1's, or
0's. Again, there's no "right" answer (there is probably research out there
that recommends one initialization approach over another). These are the
values that are trained using gradient descent.
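The split between fixed hyperparameters and randomly initialized trainable parameters can be sketched in NumPy. The specific numbers (8 filters, 3x3 size, Gaussian with standard deviation 0.1) are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters: chosen by hand before training and never updated.
num_filters = 8    # how many filters in the conv layer
filter_size = 3    # each filter is 3x3
stride = 1

# Trainable parameters: initialized randomly (Gaussian here), then
# updated by gradient descent during training.
filters = rng.normal(0.0, 0.1, size=(num_filters, filter_size, filter_size))
biases = np.zeros(num_filters)  # biases are often just initialized to zero

print(filters.shape)  # (8, 3, 3)
```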

------
biocomputation
I actually don't think this is a good explanation at all. I'm not saying it's
badly written, just that it's not a good explanation for the stated purpose
(serving as an intuitive explanation).

To this point, the article is certainly NOT intuitive if you don't already
understand image convolution. The explanation is also very long and rambling.
While I understand the author has made an effort, I don't think the article
really presents the subject matter in a new way: I can learn all of this
elsewhere. This is a common problem when people write about complex subject
matter without fully understanding the knowledge gap between teacher and
audience.

If I were the author, I might read up on technical communication and spend
some time figuring out how to simplify something correctly. As it stands, this
article uses the typical strategy of information hiding to simplify the
subject matter. The problem is that information hiding doesn't work very well
unless it is expertly done. I do like the animation, but again, it only serves
to show how image convolution works, and doesn't actually teach us anything
about a CNN.

I would suggest the author break the document into three separate sections,
the first being very simple (maybe start with the part that says 'images are
just matrices') and then add more details in each section. The final section
would have a lot of detail. That way you counteract the information blindness
that occurs from simplification by providing the information later.

Otherwise, this article is really more of a data dump than an intuitive
explanation, and since it doesn't really teach us anything we can't learn
elsewhere, I don't see what it contributes.

A cleaner explanation, expertly prepared, could really elevate the effort that
went into this.

~~~
ardit33
Jesus, chill. I am reading it (and I know nothing about CNNs), and learning
what I need to read first about them. The author makes it clear that there are
a few things you need to read beforehand.

I think it is a good article/blog post (thanks, dude, whoever you are that
wrote it).

You, on the other hand, didn't give any better alternatives in your "rant".

~~~
biocomputation
The comments are for discussing HN submissions. As written, the article is yet
another data dump on CNNs. There are a lot of these on the web already, and I
don't think this explanation is better than what already exists.

I stand by my comments.

> You on the other hand didn't give any better alternatives on your "rant".

I don't have to provide better alternatives. Note that my response did provide
suggestions on how to improve the article.

------
junkcollector
The article is all right, but newbies reading it should be a little careful.
The author is sloppy with terminology in a way that can trip up someone who is
just learning. For example, a kernel and a filter are not the same thing.

~~~
mechaxl
Can you explain the difference between the two? I'm new to CNNs and have been
wondering this myself - this SO answer says that they're the same thing:

https://stats.stackexchange.com/questions/154798/difference-between-kernel-and-filter-in-cnn

~~~
junkcollector
Sorry, edited a bit for clarity after thinking some more about it.

The kernel of a filter would be its impulse response, which is what you
convolve with to get the filter's response. That's where the sloppy
terminology comes from. A kernel, though, does not need to be a filter.

A kernel is a function whose product maps a point in one domain onto another
domain. For example, the Fourier transform has a kernel of e^jwt. The integral
(or sum, if discrete) of these products over the function is the transform,
because it maps the entire function into its new space. A filter is a function
typically defined as having product behavior in the frequency (transformed)
domain, which is equivalent to convolution in the time (original) domain. A
window is a function that has product behavior in the time (original) domain,
and thus convolution behavior in the frequency (transformed) domain.

Particularly in linear algebra (matrix math), if something is a kernel
function, there are certain mathematical implications.

Another confusing bit here is that the convolution they are performing to
project the original function (the larger image matrix) onto the smaller one
isn't a proper convolution: there is a hidden window function in the way the
operation is performed, restricting the output to only the fully overlapped
area of an otherwise linear 2D convolution. This is typically called a cropped
convolution in image processing.
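The cropped convolution described above (output restricted to the fully overlapped area) can be sketched in NumPy. The kernel flip is what makes this a true convolution rather than a cross-correlation; the 4x4 image and averaging kernel are made-up example values:

```python
import numpy as np

def valid_convolve2d(image, kernel):
    """2D convolution restricted to fully overlapped positions
    (the "cropped" convolution described above)."""
    k = np.flipud(np.fliplr(kernel))  # flip: convolution, not correlation
    kh, kw = k.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((3, 3)) / 9.0  # simple 3x3 averaging filter

out = valid_convolve2d(image, kernel)
print(out.shape)  # (2, 2): smaller than the 4x4 input
```

A full linear 2D convolution of a 4x4 image with a 3x3 kernel would produce a 6x6 output; the hidden window crops that down to the 2x2 region where the kernel fits entirely inside the image.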

~~~
banachtarski
Personally, I think you're being a bit pedantic and fuzzy on the terminology
yourself. For the purposes of CNNs, it's perfectly fine to think of them as
the same thing, and the kernel in this case is simply not the same as the
"kernel" in linear algebra you alluded to. In fact, it's so different that I
don't even know why you'd bother to mention it.

------
sigstoat
anyone happen to be familiar with any uses of CNNs on 1D "images"? (like you'd
get from linear image sensors:
https://toshiba.semicon-storage.com/ap-en/product/sensor/linear-sensor.html )

i hit up google scholar occasionally looking for references, but literally
everything seems to be applying them to 2D images.

~~~
Florin_Andrei
Well, what happens if you build a 1D-input CNN in TensorFlow and train it the
usual way? Does it work? Seems like it should.

What's even the difference between 1D inputs and 2D inputs? It's all a bunch
of numbers anyway. I don't think it really matters if the pixels are arranged
(as you see them) in a neat rectangle vs in a straight line. You could take a
2D matrix and enumerate it as a linear string of numbers and it would still be
the same matrix, just represented differently. I don't think the CNN cares
either way.

I would go as far as saying that the 1D-ness of the input is just "in your
head".

~~~
kmmlng
I would argue that in a signal (1D) you can expect some sort of relationship
between consecutive elements. In an image (in essence a 2D signal), you can
expect a relationship between consecutive elements not just on the horizontal,
but also on the vertical axis.

If you arbitrarily represent a signal as a 2D matrix, then abrupt changes in
the gradient on the vertical axis are meaningless. But the same is not true in
an image, which is naturally represented as a 2D matrix. Here, a sudden change
on the vertical axis usually corresponds to an edge in the image.

If you represent an image as a 1D array, you throw away spatial information.
So I'm not sure the 1D-ness is just in one's head.
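The spatial-information point can be made concrete with index arithmetic: flattening a 2D image row by row pushes vertical neighbours `w` positions apart, so a small 1D filter never sees them in the same window. A sketch with a made-up 3x4 image:

```python
import numpy as np

h, w = 3, 4
img = np.arange(h * w).reshape(h, w)
flat = img.ravel()  # row-by-row flattening

# Horizontal neighbours stay adjacent after flattening...
assert flat[0 * w + 1] == img[0, 1]

# ...but vertical neighbours end up w positions apart.
i, j = 1, 2
pos_above = (i - 1) * w + j   # index of img[0, 2] in `flat`
pos_here = i * w + j          # index of img[1, 2] in `flat`
print(pos_here - pos_above)   # 4 == w
```

So a 1D convolution over the flattened array can still relate horizontal neighbours, but the vertical relationships an image's 2D filters exploit are out of reach of any filter shorter than the row width.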

------
AlphaWeaver
This article was very helpful. The animations did wonders to show how the
networks iterate.

~~~
dnautics
The Computerphile CNN video is quite good.

Of course, Andrej Karpathy's Stanford lecture on the subject is as well.

------
tehsauce
Breezes right over back-propagation, arguably the most crucial part :/

