
Deep Neural Decision Forests [pdf] - fitzwatermellow
http://research.microsoft.com/pubs/255952/ICCV15_DeepNDF_main.pdf
======
pjf
Could someone explain in layman words what is the qualitative contribution of
the paper and why it is important?

~~~
gamegoblin
Super high level of deep nets, skip if you already know:

Deep nets are good at taking a vector of size N and transforming it to a
vector of size M, where M is perhaps a more general, abstract, or "useful"
representation.

e.g.

Your N-vector might be a length-784 vector of floating point values
representing black-intensity in a 28x28 grayscale image. This is the case in
the MNIST dataset (lots of 28x28 images of the digits 0-9), a classic dataset
in machine learning.

After a layer or two of a deep net, this N-vector might be transformed into an
M-vector where each component represents some particular edge, curve, or blip
within the source image.

So you've gone from the representation of "pixel 0 is gray, pixel 1 is dark
gray, pixel 2 is white..." to a representation of "There is a vertical edge on
the central lefthand side of the image, there is an upwards facing curve in
the central top part of the image....".

It's clear that the latter representation is more compact and useful for the
purpose of digit recognition.

It's also worth noting that this representation is specific to the problem at
hand. The edges and curves you have learned would probably be unable to
accurately reproduce, say, letters of the alphabet, as they are specialized
for digits. The net has learned a more compact representation by using
statistics to figure out that most of the information is redundant. There are
only 10 possible outputs, but the input space is 256 grayscale values ^ 784
pixels.

In a traditional deep net, your output layer for this particular problem
(digit recognition) might be a vector of length 10, where each element of the
vector is the probability of that digit being the one shown. So a result of
<0.001, 0.002, 0.99, 0.001 ... 0.001> would indicate that the net thought the
digit was a 2.
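To make that output layer concrete, here is a minimal sketch of how raw
scores from a final layer get turned into a probability vector via softmax.
The score values are made up for illustration; only the mechanism is real.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the final layer for the 10 digit classes.
# Class 2 has by far the largest score, so the net "thinks" it saw a 2.
scores = [0.1, 0.3, 5.0, 0.2, 0.0, 0.1, 0.0, 0.2, 0.1, 0.0]
probs = softmax(scores)
predicted_digit = probs.index(max(probs))  # argmax over the probabilities
```

The subtraction of `max(scores)` before exponentiating is a standard trick
for numerical stability; it doesn't change the resulting probabilities.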

================

Super high level of decision trees and forests, skip if you already know:

A decision tree is somewhat self-explanatory -- it's kind of like a flow chart
for making judgments. Here is an example, classifying cool vs. uncool based on
3 attributes.

My dataset:

    
    
              | bow_tie | socks | sandals | cool
        ------|---------|-------|---------|------
        Alice | true    | true  | true    | true
        Bob   | false   | true  | true    | false
        Carol | false   | true  | false   | true
        Doug  | true    | false | true    | true
        Ella  | false   | false | false   | false
    

A possible decision tree:

    
    
        if bow_tie == true
            return true
        else
            if socks == true
                if sandals == true
                    return false
                else
                    return true
            else
                if sandals == true
                    return true
                else
                    return false
    

It's also worth noting that decision trees can be equivalent despite having
different representations. The following tree will always return the same
value as the one above for the elements of the dataset:

    
    
        if socks == true
            if sandals == true
                if bow_tie == true
                    return true
                else
                    return false
            else
                return true
        else
            if sandals == true
                return true
            else
                if bow_tie == true
                    return true
                else
                    return false
           

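Both pseudocode trees above can be written as runnable Python and checked
against the table; this sketch just transcribes them and verifies they agree
with the labels for every person in the dataset.

```python
# The dataset from the table above: (bow_tie, socks, sandals, cool)
dataset = {
    "Alice": (True,  True,  True,  True),
    "Bob":   (False, True,  True,  False),
    "Carol": (False, True,  False, True),
    "Doug":  (True,  False, True,  True),
    "Ella":  (False, False, False, False),
}

def tree_one(bow_tie, socks, sandals):
    # First tree: split on bow_tie first.
    if bow_tie:
        return True
    if socks:
        return not sandals
    return sandals

def tree_two(bow_tie, socks, sandals):
    # Second tree: split on socks first.
    if socks:
        if sandals:
            return bow_tie
        return True
    if sandals:
        return True
    return bow_tie

# Both trees reproduce every label in the dataset, despite their
# different shapes.
for bow_tie, socks, sandals, cool in dataset.values():
    assert tree_one(bow_tie, socks, sandals) == cool
    assert tree_two(bow_tie, socks, sandals) == cool
```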
The difference is in how you pick the splits. The first tree uses a more
entropy-reducing strategy -- we notice that the bow_tie split gives a simple,
hard rule: bowties are cool. The second is more of a randomly built decision
tree, so it's less "efficient" in that it must potentially make more
judgments before reaching an answer.

Why would we ever want to be less efficient? It turns out if you train several
(hundreds, thousands, etc) decision trees on subsets of the data, and then
average their results together, they are alarmingly good classifiers.
Extremely simple to code, train, and use. This is called a decision forest. A
decision tree on its own is often weak, but decision forests are a powerful
tool.
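The bagging-plus-voting mechanism can be sketched in a few lines. This is a
deliberately toy version: each "tree" here is just a one-attribute stump
picked at random, whereas real forests grow full trees with smarter split
selection. The `bootstrap`/`train_stump` names are made up for illustration.

```python
import random

random.seed(0)

def bootstrap(data):
    """Sample len(data) rows with replacement, as in bagging."""
    return [random.choice(data) for _ in data]

def train_stump(sample):
    """A deliberately weak learner: pick a random attribute index and
    predict 'cool' whenever that attribute is True."""
    attr = random.randrange(3)
    return lambda row: row[attr]

# Rows: ((bow_tie, socks, sandals), cool) from the table above.
data = [((True, True, True), True), ((False, True, True), False),
        ((False, True, False), True), ((True, False, True), True),
        ((False, False, False), False)]

# Train many stumps, each on its own bootstrap resample of the data.
forest = [train_stump(bootstrap(data)) for _ in range(101)]

def forest_predict(row):
    votes = sum(stump(row) for stump in forest)
    return votes > len(forest) / 2  # majority vote across the forest
```

With an odd number of trees the majority vote is never tied; real
implementations average predicted class probabilities instead of hard votes.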

================

High level of why this work is interesting:

The traditional means of training a deep neural net is with gradient descent.
The most common form of this is some method of "backpropagation". You run a
training example through your network, calculate the error between the result
and the expected result, and then propagate this error gradient back through
the network to tune the transformations so they produce something closer to
what you want. This method generally requires the functions used within the
deep network to be differentiable.
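The core loop can be shown on the simplest possible "network": one weight,
one training example, squared-error loss. The gradient computed by hand here
is exactly what backpropagation computes layer by layer in a real deep net;
the numbers are chosen arbitrarily for illustration.

```python
# Model: prediction = w * x.  Loss: L = (w*x - target)**2.
# Gradient: dL/dw = 2 * (w*x - target) * x.
w = 0.0                 # initial weight
x, target = 2.0, 6.0    # one training example (the true weight is 3.0)
lr = 0.1                # learning rate

for _ in range(50):
    error = w * x - target
    grad = 2 * error * x
    w -= lr * grad      # step against the gradient

# After enough steps, w converges to 3.0, where the loss is zero.
```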

As mentioned above, there are several training strategies for decision trees,
but the most common is some form of "mostly random".

To extend my example from the first section, one could use the deep neural net
to transform the 784 grayscale pixel values into perhaps 30 higher level
edge/curve features. Then one could use this length-30 vector as the input to
train a decision forest.

This might end up getting better results than either strategy by itself. You
use the neural net to do the abstracting and the decision forest to make the
final decision. This uses both of their advantages in tandem -- deep neural
nets are great at generating more abstract and general features, and decision
forests are quite good at producing accurate classifications given high-
quality, lower-dimensional input data.
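The two-tier pipeline can be sketched with stand-ins for each stage. Both
functions below are hypothetical placeholders: in the real system,
`extract_features` would be a trained deep net and `classify` a trained
decision forest.

```python
def extract_features(pixels):
    """Stand-in for the deep net: map 784 pixel values to a handful of
    higher-level features (here, just the mean intensity of each image
    quarter -- purely illustrative)."""
    n = len(pixels) // 4
    return [sum(pixels[i * n:(i + 1) * n]) / n for i in range(4)]

def classify(features):
    """Stand-in for the decision forest: make the final call from the
    low-dimensional feature vector."""
    return "bright" if sum(features) / len(features) > 0.5 else "dark"

image = [0.9] * 784  # a dummy all-bright "image"
label = classify(extract_features(image))
```

The point is the division of labor: the first stage compresses the raw input
into a small, informative vector, and the second stage only ever sees that
vector.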

This idea of multi-tiered systems isn't particularly new. What this paper
does, though, is introduce a _differentiable_ decision tree, which can be
trained with gradient descent the same way the neural network is. Rather than
training the two tiers of the system individually, they can be trained
together, producing even better results.
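The key idea of making a tree differentiable can be sketched at depth one:
replace the hard if/else at a split with a sigmoid routing probability, and
make the output a probability-weighted mix of the leaf distributions. Every
operation is then differentiable in the routing parameters, so gradient
descent applies. The parameters and leaf values below are made up; the
paper's construction is more elaborate (stochastic routing, learned leaf
distributions over deep trees).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for a single soft split node.
w, b = 2.0, -1.0
left_leaf  = [0.9, 0.1]   # class distribution at the left leaf
right_leaf = [0.2, 0.8]   # class distribution at the right leaf

def soft_tree(x):
    # Route left with probability sigmoid(w*x + b), right otherwise,
    # and return the routing-weighted mix of the two leaves.
    p_left = sigmoid(w * x + b)
    return [p_left * l + (1 - p_left) * r
            for l, r in zip(left_leaf, right_leaf)]

probs = soft_tree(1.5)  # a valid class distribution, smooth in w and b
```

A hard tree's output jumps discontinuously as the input crosses a split
threshold; this soft version varies smoothly, which is what makes
backpropagation through it possible.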

~~~
FlyingLawnmower
Fantastic explanation.

------
strebler
Why do they only compare their method against GoogLeNet? VGG was the winner in
2014, most papers seem to use VGG, and my team has observed that it always
gets higher accuracy than GoogLeNet.

~~~
nl
GoogLeNet "won" ImageNet 2014[1], but VGG is often found to be more flexible.

I'm not entirely sure, but I think that the multiple softmax layers in
GoogLeNet might make it easier to modify for this purpose than the VGG
architecture.

[1] [http://image-net.org/challenges/LSVRC/2014/results](http://image-
net.org/challenges/LSVRC/2014/results) (look for "Classification+localization
with provided training data: Ordered by classification error")

~~~
strebler
Oh that's right, GoogLeNet did edge out VGG in the pure classification task.
Localization is essential to us, so VGG was the clear winner for us :)

------
themichaellai
fwiw, Criminisi also was an author on this:
[http://research.microsoft.com/pubs/158806/CriminisiForests_F...](http://research.microsoft.com/pubs/158806/CriminisiForests_FoundTrends_2011.pdf)

