Hacker News
Understanding Convolution in Deep Learning (timdettmers.wordpress.com)
93 points by p1esk on March 31, 2015 | hide | past | favorite | 9 comments



It's nice to see basic signal processing [finally] entering machine learning. As an EE in the machine learning field, it hurts to read "sliding window" when you can do the same thing with a few FFTs.
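The sliding-window/FFT equivalence is easy to check numerically. A minimal sketch (NumPy/SciPy are my choice of tools here, not the commenter's):

```python
import numpy as np
from scipy.signal import fftconvolve

# A 1-D signal and a small kernel (the "sliding window" filter).
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
k = rng.standard_normal(31)

# Direct sliding-window convolution: O(N*K) multiply-adds.
direct = np.convolve(x, k, mode="full")

# The same result via FFTs: O(N log N), independent of kernel size.
fast = fftconvolve(x, k, mode="full")

print(np.allclose(direct, fast))  # True — the two methods agree to float precision
```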

The author mentions correlation, but doesn't properly give the intuition of what's happening:

> When we perform convolution of an image of a person with an upside-down image of a face, then the result will be an image with one or multiple bright pixels at the location where the face was matched with the person.

Well yeah, but the best way to explain it is that when you flip the signal and convolve, you get the largest summation where the two align. Simple as that.
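To make "largest summation when the two align" concrete, here's a small sketch: convolving with the flipped template is cross-correlation, and the score peaks exactly where the template sits in the signal (the specific numbers are made up for illustration):

```python
import numpy as np

# A template and a longer signal containing that template at a known offset.
template = np.array([1.0, 3.0, 2.0, -1.0])
signal = np.zeros(50)
offset = 17
signal[offset:offset + len(template)] = template

# Convolution with the *flipped* template is cross-correlation:
# the summation is largest exactly where the template lines up.
score = np.convolve(signal, template[::-1], mode="valid")
print(int(np.argmax(score)))  # 17 — the location of the match
```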

I'm sure we'll see the Circular Harmonic Transform for rotation-invariant (up to a point) Deep Nets, and maybe even the Mellin Transform for scale-invariant (again, up to a point) Deep Nets. <- Note I haven't done the math per se showing that these will work, but I can't think of a reason they wouldn't.


Here's a write-up about a neural net which was used to win a Kaggle image classification challenge; they did a lot of transformations on the input data to a) prevent overfitting and b) provide invariance. Some other cool tricks are mentioned there, too. https://benanne.github.io/2015/03/17/plankton.html


Thanks for the plug :) This is not quite the same thing, though: we used a bunch of affine transformations for data augmentation, but we're not using any transforms with fancy invariance properties to compute the feature maps inside the networks, which I think is what therobot is talking about.

I have experimented with FFT convolutions (the Theano implementation for this is based on my code), but they are only really beneficial with large filters, and the current trend is towards convnets with very small filters (1x1, 2x2 or 3x3).
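A back-of-the-envelope way to see why small filters favor direct convolution: compare rough operation counts (the 5·n·log2(n) per-FFT constant is a common rule of thumb, not a measurement, and the real crossover depends heavily on the implementation):

```python
import math

def direct_cost(n, k):
    # Sliding-window convolution: one multiply-add per tap per output sample.
    return n * k

def fft_cost(n):
    # Roughly three length-n FFTs plus a pointwise product in frequency;
    # 5*n*log2(n) per FFT is a textbook rule of thumb, not a benchmark.
    return 3 * 5 * n * math.log2(n) + n

n = 4096
for k in (3, 9, 255):
    cheaper = "direct" if direct_cost(n, k) < fft_cost(n) else "fft"
    print(k, cheaper)  # tiny filters favor direct; only large ones favor FFT
```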


> I'm sure we'll see the Circular Harmonic Transform for rotation invariant (up to a point) Deep Nets

A fairly easy way to introduce rotation invariance in DCNNs is to perform random rotations on the inputs during training. Likewise for scale invariance. Translation invariance is already introduced by the convolution operation itself.
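A minimal sketch of that kind of rotation augmentation (the rotation range, interpolation order, and the SciPy library choice are my assumptions, not anything the commenter specified):

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

def augment(image, max_angle=30.0):
    """Rotate an input by a random angle so the network sees many
    orientations of the same content during training."""
    angle = rng.uniform(-max_angle, max_angle)
    # reshape=False keeps the output the same size as the input;
    # corners rotated out of frame are filled with zeros (cval=0.0).
    return rotate(image, angle, reshape=False, order=1, cval=0.0)

image = rng.random((32, 32))
augmented = augment(image)
print(augmented.shape)  # (32, 32)
```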

The thing about deep learning is that transform kernels are learned, not precomputed as in classical signal processing. A DCNN will learn whatever convolution kernels it needs to perform the task at hand. I wouldn't be surprised if a DCNN trained on a classical signal processing task ended up rediscovering some well-known transform kernels originally derived from physical first principles...


> A fairly easy way to introduce rotation invariance in DCNNs is to perform random rotations on the inputs during training. Likewise for scale invariance.

It is a bit silly to call these invariances: different filter/kernel combinations will be activated when a rotated or scaled input is encountered, so the individual filters are not rotation- or scale-invariant. The entire network can only deal with rotations and scales it encountered during training, while having to learn 'redundant' features to a certain extent.

It will get the job done for many tasks, but it's a brute force sort of approach that will complicate the learning process (i.e. more scales and rotations require more filters, thus needing a more complex network that is harder to train).

I think there's definitely a lot that can be learnt from (classical) signal processing in order to come up with a much more elegant and efficient solution.


> A fairly easy way to introduce rotation invariance in DCNNs is to perform random rotations on the inputs during training. Likewise for scale invariance. Translation invariance is already introduced by the convolution operation itself.

Just to be clear (and I'm sorry if I'm being pedantic), you're talking about invariance of two separate things. In the first case, you're talking about the invariance of the overall network, F(x), i.e. if R is a rotation operator, F(Rx) = F(x). The network's prediction does not change for a suitable set of R's.

On the other hand, convolution is a shift-invariant operator, meaning it acts the same no matter where it is applied. If Cx is the output of a convolutional layer and Sx is a shifted signal, then C(Sx) = S(Cx) (in the deep learning literature this property is usually called shift equivariance). This is not shift invariance of the output.

The shift invariance of the operator means the convolution will detect features that resonate well with its kernel irrespective of their location in the signal. However, this does not automatically guarantee that the network's prediction will be shift invariant, i.e. F(Sx) = F(x).
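The C(Sx) = S(Cx) identity is easy to verify numerically. A sketch using circular convolution (so shifts wrap around and edge effects don't spoil the equality; the setup is mine, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)   # input signal
k = rng.standard_normal(64)   # convolution kernel (same length as the signal)

def circ_conv(a, b):
    # Circular convolution via the DFT convolution theorem.
    return np.real(np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)))

shift = 10
Sx = np.roll(x, shift)                  # S x : shifted input
C_Sx = circ_conv(Sx, k)                 # C(S x)
S_Cx = np.roll(circ_conv(x, k), shift)  # S(C x)

# The operator commutes with shifts: C(Sx) == S(Cx) ...
print(np.allclose(C_Sx, S_Cx))              # True
# ... but the output itself is NOT shift invariant: C(Sx) != C(x).
print(np.allclose(C_Sx, circ_conv(x, k)))   # False
```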


I've been working on a project for matching template images myself for a couple of months. I'm using a self-made Java library to match static images with pixels on a user's screen. Mainly, my project is about finding things that look like static images on a user's screen, fast (i.e. <10 ms).

The post here is a really good resource. Are there any Java libraries (excluding OpenCV bindings) for this kind of template matching?


Check out correlation filters - there's a ton out there - OTSDF, MMCF, MOSSE, ZACFs, etc. They're basically designed to do template matching, but in such a way that the input statistics are considered to refine the output for better matching (fewer errors, improved separation between classes, etc.). I don't know of any Java libraries, but here is a MATLAB library of different types (https://github.com/vboddeti/CorrelationFilters) and here is a very basic implementation of OTSDF using C++ via the Eigen library (https://github.com/jsmereka/PatchBasedCorrelation/tree/maste...).
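For the shape of the computation those filters refine, here is the naive baseline they all start from: plain FFT cross-correlation of a template against a scene (this is not OTSDF/MOSSE/etc., which replace the raw template spectrum with one shaped by the input statistics; the data here is synthetic):

```python
import numpy as np

def match_template(scene, template):
    """Plain FFT cross-correlation: returns (row, col) of the best match.
    Filters like MOSSE or OTSDF refine the template spectrum using input
    statistics; this is the unrefined baseline."""
    H, W = scene.shape
    F = np.fft.fft2(scene)
    T = np.fft.fft2(template, s=(H, W))       # zero-pad to the scene size
    corr = np.real(np.fft.ifft2(F * np.conj(T)))
    return np.unravel_index(np.argmax(corr), corr.shape)

rng = np.random.default_rng(0)
scene = rng.standard_normal((64, 64))
template = scene[20:36, 33:49].copy()   # crop a patch to act as the "template"
print(match_template(scene, template))  # (20, 33)
```

Frequency-domain correlation like this is also why the <10ms goal is plausible: the cost is a few FFTs of the screen capture, independent of template size.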


I am an "average" ML researcher and an enthusiast, and have worked on quite a few small and large projects myself. Every time I see the kind of engineering complexity involved in stuff like deep learning, I wonder how evolution could ever lead to general intelligence like human intelligence, or whether learning is a logical process at all.



