The thing is that each conv filter consists of one kernel per input channel (that's why first-layer filter visualisations are colored, btw - a color image is a "3-dimensional" image): we convolve each kernel with its corresponding input channel, then sum (that's the key) the responses. Then having multiple MCCFs (usually more at each layer) yields a new multi-channel image (say, 16 channels), and we apply a new set of (say, 32) 16-channel MCCFs to it (which we cannot visualise by themselves anymore; we'd need a 16-dimensional image for each filter), yielding a 32-channel image. That sort of thing is almost never explained properly.
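To make the convolve-then-sum part concrete, here's a rough NumPy/SciPy sketch (shapes and names are just for illustration, not anyone's actual code):

    import numpy as np
    from scipy.signal import correlate2d  # CNN "convolution" is really cross-correlation

    def apply_filter(image, filt):
        # image: (in_ch, H, W); filt: (in_ch, kh, kw) -> one (H', W') output plane
        # convolve each kernel with its own input channel, then sum the responses
        return sum(correlate2d(image[c], filt[c], mode='valid')
                   for c in range(image.shape[0]))

    def conv_layer(image, filters):
        # filters: (out_ch, in_ch, kh, kw) -> (out_ch, H', W') multi-channel output
        return np.stack([apply_filter(image, f) for f in filters])

    x = np.random.randn(16, 32, 32)    # a 16-channel "image"
    w = np.random.randn(32, 16, 3, 3)  # 32 filters, each holding 16 kernels
    y = conv_layer(x, w)               # -> 32-channel output, shape (32, 30, 30)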
A CNN does this for localized sections of the images. Each layer looks over a wider area of the original image (because of the convolution) and embeds the "concepts" in that area of the image in a higher-dimensional space than the layer before it.
You can think of a convnet as a series of feature transformations, each consisting of a normalization/whitening stage, a filter bank that is a projection into a higher dimension (on an overcomplete basis), a non-linear operation in the higher-dimensional space, and then possibly pooling down to a lower-dimensional space.
The “filter bank” (aka convolution) and non-linearity produce a non-linear embedding of the input in a higher dimension; in convnets, the “filter bank” itself is learned. Classes or features are easier to separate in the higher-dimensional space. There are some still-developing ideas on putting all this on firmer mathematical ground (connections to wavelet theory and the like), but for the most part, it just works "really well".
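As a toy sketch of those stages (made-up shapes and constants, not any particular network): the filter bank lifts a 3-channel input into, say, 64 channels, the non-linearity acts pointwise in that higher-dimensional space, and pooling brings the spatial resolution back down:

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view

    def conv(x, weights):
        # x: (in_ch, H, W); weights: (out_ch, in_ch, kh, kw) -> (out_ch, H', W')
        patches = sliding_window_view(x, weights.shape[2:], axis=(1, 2))
        return np.einsum('ihwkl,oikl->ohw', patches, weights)

    x = np.random.randn(3, 64, 64)
    x = (x - x.mean()) / x.std()               # normalization/whitening (crudely)
    z = conv(x, np.random.randn(64, 3, 5, 5))  # filter bank: project 3 -> 64 channels
    a = np.maximum(z, 0)                       # non-linearity in the higher-dim space
    c, h, w = a.shape
    p = a[:, :h//2*2, :w//2*2].reshape(c, h//2, 2, w//2, 2).max(axis=(2, 4))  # 2x2 max pool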
For an image network, at each layer there are (input planes x output planes) convolution kernels of size (kh x kw).
Each output plane `j` is a sum over all input planes `i` individually convolved using the filter (i, j); the reduction dimension is the input plane.
for a loop nest that shows what the forward pass of a 2-d image convnet convolution module does. That's gussied up with convolution stride and padding and a bunch of C++11 mumbo jumbo, but you should be able to see what it is doing.
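In case it helps, a bare-bones Python rendering of such a loop nest (stride 1, no padding; my own names, not the code being referred to) looks like this:

    import numpy as np

    def conv_forward(inp, weight):
        # inp: (in_planes, ih, iw); weight: (out_planes, in_planes, kh, kw)
        out_planes, in_planes, kh, kw = weight.shape
        ih, iw = inp.shape[1:]
        oh, ow = ih - kh + 1, iw - kw + 1
        output = np.zeros((out_planes, oh, ow))
        for j in range(out_planes):        # each output plane j ...
            for i in range(in_planes):     # ... sums over all input planes i
                for y in range(oh):
                    for x in range(ow):
                        # dot product of filter (i, j) with the patch under it
                        output[j, y, x] += np.sum(inp[i, y:y+kh, x:x+kw] * weight[j, i])
        return output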
I'm attempting to use their approach with GNU Radio currently -
Karpathy's stuff is also great: https://cs231n.github.io/
I doubt he's trying to be a thought leader; rather, this post looks like notes he made while learning about CNNs, published since they might be useful as a quick start for someone else.
Please stop creating accounts to break the site rules with.
Imo there should be a push towards adapting CNNs to calculate/predict how the object might look under different conditions, which might lead to other improvements. This could also be extended to areas other than image recognition.
It's not clear that people can calculate what an object might look like from different viewing angles; even if they could, it's not clear you'd want to in an application; and even if you did, there's already quite a bit of work on this (e.g. many related papers here: http://www.arxiv-sanity.com/1511.06702v1). At least so far I'm not aware of convincing results suggesting that doing so improves recognition performance (which in most applications is what people care about).
We can train a CNN to do that in a few days.
They'd also have gained the ability to do very (seriously) difficult things like walking, climbing objects, the rudiments of folk physics, and picking things up and throwing them. They'd have some rudimentary ability at modeling other agents.
It's good to be happy with current progress, and I do not suffer from the AI effect, but being too lenient can hamper creativity and impede progress by obscuring limitations.
And of course, I am not talking about just viewing angles. There are several other factors, but I only mentioned the ones which I could think of.
I've not yet looked carefully into it, but I expect that sort of feedback should drastically reduce the amount of raw data required. Machines might not (at first) get to build predictive models from interactions, but even our best approaches to transfer and multi-task learning are very constrained compared to the free-form, multi-modal, integrative learning a parrot is capable of. With very little energy spent.
This is good, it means there are still a lot of exciting things left to work out.
Since we know that things usually don't change instantly (for example, a cat won't suddenly change into a dog), if we assume 10fps vision, 1500 pictures of cats would mean looking at a cat for two and a half minutes in total over 5 years. And since we know cats won't change into something else, if we see a cat walking somewhere, we'll still know it's a cat, giving us the labels we need for the training.
I think that if we assume 30fps (which still seems kind of low), and we assume that the human looks at a cat for 15 minutes (which still isn't much) that's already 27000 pictures.
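Just to spell out the arithmetic behind both estimates (the fps and viewing times are the assumptions being argued about):

    low  = 10 * (2.5 * 60)   # 10 fps for 2.5 minutes -> 1500 frames
    high = 30 * (15 * 60)    # 30 fps for 15 minutes  -> 27000 frames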
The most common example is digit recognition with the MNIST dataset. This is a common problem given to beginners, and even many beginners to CNNs achieve human-level accuracy. That dataset has tens of thousands of examples.
there should be a push towards adapting CNNs to calculate/predict how the object might look under different conditions
Data augmentations like rotations, horizontal flipping, and random cropping are a widespread practice.
or check out what's available in a deep learning library like keras
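A minimal NumPy/SciPy sketch of those augmentations (parameters are arbitrary; a real pipeline would randomize them per example, or just use whatever augmentation utilities the library ships):

    import numpy as np
    from scipy.ndimage import rotate

    def augment(img, rng):
        # img: (H, W, C) array -> one randomly augmented 24x24 crop
        out = rotate(img, angle=rng.uniform(-15, 15), reshape=False, mode='nearest')
        if rng.random() < 0.5:
            out = out[:, ::-1]                  # horizontal flip
        h, w = out.shape[:2]
        top, left = rng.integers(0, h - 23), rng.integers(0, w - 23)
        return out[top:top + 24, left:left + 24]  # random crop

    rng = np.random.default_rng(0)
    img = np.random.rand(32, 32, 3)
    batch = np.stack([augment(img, rng) for _ in range(8)])  # 8 augmented variants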
Sorry I don't have better academic references.