
Deep Forest: Towards an Alternative to Deep Neural Networks - kercker
https://arxiv.org/abs/1702.08835
======
rkaplan
"In contrast to deep neural networks which require great effort in hyper-
parameter tuning, gcForest is much easier to train."

Hyperparameter tuning is not as much of an issue with deep neural networks
anymore. Thanks to BatchNorm and more robust optimization algorithms, most of
the time you can simply use Adam with a default learning rate of 0.001 and do
pretty well. Dropout is not even necessary with many models that use BatchNorm
nowadays, so tuning there is generally not an issue either. Many layers of 3x3
conv with stride 1 are still magical.

Basically: deep NNs can work pretty well with little to no tuning these days.
The defaults just work.
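
To make that concrete, a rough sketch of the kind of "defaults just work"
setup I mean (Keras, with placeholder layer sizes and input shape rather than
any particular dataset):

    # A small conv net trained with the stock Adam optimizer (lr=0.001 by
    # default in Keras) plus BatchNorm, i.e. essentially no tuning.
    from keras.models import Sequential
    from keras.layers import Conv2D, BatchNormalization, Activation, Flatten, Dense

    model = Sequential([
        Conv2D(32, (3, 3), strides=1, padding='same', input_shape=(28, 28, 1)),
        BatchNormalization(),
        Activation('relu'),
        Conv2D(32, (3, 3), strides=1, padding='same'),
        BatchNormalization(),
        Activation('relu'),
        Flatten(),
        Dense(10, activation='softmax'),
    ])

    # 'adam' here uses Keras's default learning rate of 0.001
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])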

~~~
computerex
I couldn't disagree more. The defaults don't just work, and the architecture
of the network could also be considered a hyperparameter, in which case what
would be a reasonable default for all the types of problems ANNs are used for?

~~~
ipunchghosts
Are you using batch normalization? If you are, an issue I see all the time is
folks not setting the EMA filter coefficient correctly. In Keras, it defaults
to something like 0.99, which in my mind makes no sense. I use something
around 0.6 and life is good. You want to get an overall good measurement of
the statistics, and in my mind the frequency cutoff when coef=0.99 is just way
too high for most applications. You usually want something that filters out
just about everything except very close to DC.
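
Concretely, in Keras that EMA coefficient is the `momentum` argument of the
BatchNormalization layer; a minimal sketch of the tweak I mean (the 0.6 value
is just my own rule of thumb, not a documented recommendation):

    # momentum is the EMA coefficient used for the running mean/variance
    from keras.layers import BatchNormalization

    bn_default = BatchNormalization()              # momentum=0.99, the Keras default
    bn_faster  = BatchNormalization(momentum=0.6)  # tracks the batch statistics much faster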

~~~
gcr
The response to "the defaults should work just fine without any hyperparameter
tuning" is "try fiddling with the EMA filter coefficient hyperparameter" ?

(Just poking fun. :P)

~~~
joe_the_user
It's like the joke about the mathematician giving an exposition of a complex
proof. At one point he says "It is obvious that X", pauses, scratches his
head, and does a few calculations. He leaves the room for twenty minutes,
returns, continues "it is obvious that X", and goes on to the next step.

Deep in the field, it's fine for machine learning experts to say "everything
just works" [if you've mastered X, Y, Q esoteric fields and tuning methods],
since they're welcome to humble-brag as much as they want. But when this gets
in the way of figuring out what really "just works", it's more of a problem.

------
throw_away_777
I've always found it curious that neural networks get so much hype when
xgboost (gradient boosted decision trees) is by far the most popular and
accurate algorithm for most Kaggle competitions. While neural networks are
better for image-processing types of problems, there is a wide variety of
machine learning problems where decision tree methods perform better and are
much easier to implement.

~~~
nojvek
Any good links you recommend for learning xgboost? I've never quite figured
out how they work.

~~~
BickNowstrom
[http://xgboost.readthedocs.io/en/latest/model.html](http://xgboost.readthedocs.io/en/latest/model.html)

[http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf](http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf)

[https://www.youtube.com/watch?v=wPqtzj5VZus](https://www.youtube.com/watch?v=wPqtzj5VZus)
Trevor Hastie - Gradient Boosting Machine Learning

[https://www.youtube.com/watch?v=sRktKszFmSk](https://www.youtube.com/watch?v=sRktKszFmSk)
Ensembles (3): Gradient Boosting, Ihler
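
As a complement to those, a minimal usage sketch with xgboost's scikit-learn
style wrapper (synthetic data, and the parameters shown are just the usual
ones people start with):

    import numpy as np
    from xgboost import XGBClassifier

    # toy synthetic binary classification problem
    X = np.random.rand(1000, 20)
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

    # each new tree is fit to correct the errors of the ensemble built so far
    model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
    model.fit(X, y)
    print(model.predict(X[:5]))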

------
jre
I don't know about the others, but the two vision datasets they compare on
(MNIST and the face recognition one) are small, and the CNN they compare
against doesn't seem very state of the art.

It also seems each layer of random forests just concatenates a class
distribution onto the original feature vector, so this doesn't seem to get the
same "hierarchy of features" benefit that you get in large-scale CNNs and
DNNs.

~~~
jkbschwarz
To your point that they are comparing on small datasets: I don't see that as a
problem. If they achieve better results on small datasets, that is a great
achievement, as often the bottleneck is the size of the dataset rather than
computation time.

~~~
krona
> often the bottleneck is the size of the dataset rather than computation time

That's generally true for DNNs, which is a good place to be if you have lots
of data. This typically isn't true for tree-based approaches, which is why
they fell out of fashion in some problem domains; they don't generalize as
well. This paper doesn't seem to change what we already know in this respect.

------
FrozenVoid
Related: Deep neural decision forests (ConvNets+Random Forests)
[http://www.cv-foundation.org/openaccess/content_iccv_2015/pa...](http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Kontschieder_Deep_Neural_Decision_ICCV_2015_paper.pdf)
[http://matthewalunbrown.com/papers/1603.01250v1.pdf](http://matthewalunbrown.com/papers/1603.01250v1.pdf)
[https://topos-theory.github.io/deep-neural-decision-forests/](https://topos-theory.github.io/deep-neural-decision-forests/)

------
ungzd
Was about to joke about Deep Support Vector Machines, but found out they exist
too:
[https://www.esat.kuleuven.be/sista/ROKS2013/files/presentati...](https://www.esat.kuleuven.be/sista/ROKS2013/files/presentations/DSVM_ROKS_2013_WIERING.pdf)
[http://deeplearning.net/wp-content/uploads/2013/03/dlsvm.pdf](http://deeplearning.net/wp-content/uploads/2013/03/dlsvm.pdf)

~~~
soVeryTired
Deep linear regression?

~~~
eanzenberg
That's quite literally deep learning, so it made me lol.

~~~
soVeryTired
Can you elaborate on that? I feel like I've missed something.

~~~
eanzenberg
Linear regression on top of linear regression with non-linearities in between
is a neural network.
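
In other words, roughly (a toy numpy sketch with made-up shapes and random,
untrained weights):

    import numpy as np

    x = np.random.rand(5, 10)              # 5 samples, 10 features
    W1, b1 = np.random.randn(10, 32), np.zeros(32)
    W2, b2 = np.random.randn(32, 1), np.zeros(1)

    h = np.maximum(0, x @ W1 + b1)         # "linear regression" + ReLU nonlinearity
    y_hat = h @ W2 + b2                    # another linear map stacked on top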

------
paulsutter
No Free Lunch theorem refresher:

"if an algorithm performs well on a certain class of problems then it
necessarily pays for that with degraded performance on the set of all
remaining problems"

[https://en.m.wikipedia.org/wiki/No_free_lunch_theorem](https://en.m.wikipedia.org/wiki/No_free_lunch_theorem)

~~~
CuriouslyC
That doesn't apply to an ensemble of algorithms where the weights of a given
member of the ensemble are adapted based on observations from the given
domain. If it did, humans wouldn't be able to choose a good algorithm for
specific cases, and obviously we can.

Deep neural networks can be thought of as ensembles of smaller neural
networks, though of course each member of the ensemble is going to share some
degree of algorithmic bias. This suggests that perhaps deep neural networks
with heterogeneous activation functions and branching structures will perform
better than homogeneous networks.

~~~
deong
> If it did, humans wouldn't be able to choose a good algorithm for specific
> cases, and obviously we can.

This is a surprisingly commonly held fallacy in some AI circles. It's the idea
that humans are mathematically perfect. When you phrase it that way, it's
fairly obviously false, but you still see a lot of people argue things like
"NFL doesn't apply to ensembles because humans..." or "machines can never be
as intelligent as humans because...".

The reality is that humans are subject to the same mathematical laws as
machines. It's far more likely that my brain can't solve an NP-hard problem in
polynomial time either. My brain can't beat random search on the set of all
possible problems.

~~~
CuriouslyC
It doesn't imply that humans are mathematically perfect, but our brains are
basically algorithm-generating algorithms - we're not just weighting a set of
preexisting solutions. To generalize quite a bit, saying NFL applies to
general intelligence ends up implying there are problems for which efficient
algorithms exist but which intelligence is literally incapable of discovering.
That seems pretty absurd to me.

~~~
deong
The problem is that if our brains are not "magic" (essentially making this a
religious argument), then they're operating via some algorithmic principles.
If our brains are "algorithm generating algorithms", then I can in principle
write an "algorithm generating algorithm" in silicon that does the same thing.
And I know with mathematical certainty that my digital version is subject to
NFL. So either my brain is too, or we go back to religion to explain what
non-physical process is responsible for our super-Turing capabilities.

The second point is that NFL is often interpreted in a weird way, where we
only think about "interesting problems". It is defined on the set of all
search or optimization problems. It says if your algorithm is better than
exhaustive search over some subset of problems, it must be worse on the
complement of that set.

What does it mean for an algorithm to be efficient? Well, it's roughly
speaking the number of steps it needed to take (assuming each step is the same
amount of work, blah blah). OK, so an "efficient" algorithm must, by
definition, prioritize some steps over others -- it's picking the "best" steps
to take each time it has the choice. OK, so I'll just make up some instances
of the problem that are custom-tailored so that what the algorithm thinks are
the "best" steps always lead me in the wrong direction. You algorithm will
then be worse than exhaustive search on my set of problems, precisely because
it's choosing to avoid the steps I know to be good -- I defined the problem to
make that happen.

That is true of any algorithm you can conceive of. There will be problems for
which the bias that makes the algorithm good on the problems you intended it
to work on will be exactly the wrong bias.

It doesn't matter if you say, "Aha, I'll let my algorithm generate new
algorithms! Gotcha!" I'll just design a set of problems for which your
algorithm generating algorithm will generate the wrong algorithms.

Search is always about bias. Without bias, you have random search -- that's
literally the textbook machine learning definition of bias. All NFL says is
that if you have to worry about every possible problem, any bias you choose
will be worse than random sometimes. There's no escape clause here. Your
algorithm generating algorithm is still a search algorithm with its own
biases, and it will still be worse than random on some subset of all possible
problems.

------
DanielleMolloy
So if this works well why is there no comparison on ImageNet?

~~~
kercker
In the conclusion section, the authors said that:

"If we had stronger computational facilities, we would like to try big data
and deeper forest, which is left for future work." and that:

"As a seminar study, we have only explored a little in this direction."

~~~
DanielleMolloy
If someone proposes a method as an alternative for a field, they need to test
this method on the accepted benchmark dataset for that field. For object
recognition in static images this dataset is the ImageNet competition.
Computing power can be bought from AWS if no cluster is available. The lack of
it can't be an argument.

I'm not saying that the paper has no reason to exist; I think it is generally
well written, and decision trees certainly deserve attention. If they can do
representation learning at a high level, this is certainly something to look
into. But it shouldn't claim to be an alternative to state-of-the-art deep
learning if there is no data for this comparison. Everyone can solve MNIST (or
even CIFAR).

~~~
cgearhart
I agree their claim is a bit hyperbolic, but that's an unreasonably high bar
to expect for the scope of this paper.

------
KirinDave
While the cost savings from dropping GPUs as a requirement are important (and
presumably these systems would benefit from GPU optimization as well; it seems
unclear from my skim of the paper), cheaper training is NOT just about saving
your wallet.

A real emerging area of opportunity is having systems train new systems. This
has numerous applications, including assisting DSEs in the construction of new
systems or allowing expert systems to learn more over time and even integrate
new techniques into a currently deployed system.

I'm not an expert here, but I'd like to be, so I'm definitely going to ask my
expert friends more about this.

------
Dim25
Please note the CPUs they used are fairly high-end: 2x Intel E5 2670 v3 CPU
(24 cores), approx. price $1.5k per unit
([http://ark.intel.com/products/81709/Intel-Xeon-Processor-E5-...](http://ark.intel.com/products/81709/Intel-Xeon-Processor-E5-2670-v3-30M-Cache-2_30-GHz)).

Looking forward to trying the code (especially on CIFAR or ImageNet); Zhi-Hua
Zhou, one of the authors, said they are going to publish it soon.

------
throwaway312780
XGBoost also appears to have a GPU implementation.

~~~
pilooch
Yes, as a plugin, and on its way into the trunk AFAIK. I've integrated it into
deepdetect recently because even as a beta it works well and complements
GPU-based DNNs fairly naturally. Deep learning practitioners are already well
equipped with GPUs, so having XGBoost run on them as well is a good bonus!
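
For anyone who wants to try it, a rough sketch of the kind of invocation
involved, assuming a GPU-enabled xgboost build; the exact parameter value has
varied across versions (e.g. tree_method='gpu_hist' in later releases), so
treat this as illustrative rather than exact:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(1000, 20)
    y = (X[:, 0] > 0.5).astype(int)
    dtrain = xgb.DMatrix(X, label=y)

    # select the GPU-accelerated tree construction algorithm
    params = {'objective': 'binary:logistic', 'tree_method': 'gpu_hist'}
    booster = xgb.train(params, dtrain, num_boost_round=100)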

------
edixon
None of these experiments actually do anything to show feature learning - if
this is the claim, I would like to see a transfer learning experiment. I would
be surprised if this works well, since they can't jointly optimize their
layers (so you can't just use ImageNet to induce a good representation). Not
quite clear why we should think that trees will turn out to be inherently
cheaper that a DNN with similar accuracy, unless perhaps the model structure
encodes a prior which matches the distribution of the problem?

------
argonaut
The method's performance on MNIST is relatively mediocre. You might think
98.96% is amazing, but what matters is relative performance. It is a fairly
easy exercise nowadays to get above 99% with neural nets. Even I can get that
kind of performance with hand-written Python neural nets, on the CPU, with no
convolutions.
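
For reference, a sketch of the kind of plain fully-connected setup I mean,
using scikit-learn as a stand-in for a hand-written net (the layer sizes are
arbitrary, and the final accuracy obviously depends on the architecture and
training budget):

    from sklearn.datasets import fetch_openml
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # plain fully-connected net on MNIST: no convolutions, CPU only
    X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
    X = X / 255.0
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10000)

    clf = MLPClassifier(hidden_layer_sizes=(512, 256), max_iter=50)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))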

For the rest of the (non-image) datasets, it's already common knowledge that
boosting methods are competitive with neural nets.

------
uptownfunk
Would like to see three things coming out of this

1. R code implementation (could probably write this myself, but it would make
things easier)

2. How to get feature importance? Otherwise it's difficult to use in a
business context. (A rough sketch of what I mean is below.)

3. Better benchmarks
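
On (2), for the plain forests that gcForest stacks, per-feature importances
are already exposed; the open question is how to aggregate them across cascade
levels. A rough sketch of the single-forest case (scikit-learn, toy data):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    data = load_iris()
    forest = RandomForestClassifier(n_estimators=200).fit(data.data, data.target)

    # mean impurity-based importance of each input feature across the trees
    for name, imp in zip(data.feature_names, forest.feature_importances_):
        print(name, round(imp, 3))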

------
DrNuke
Progress in this field is astonishing and it really propagates to the masses
in the form of easy-to-use black boxes with a pinch of undergraduate-level
maths. Just wow!

------
bamboozled
HN just won't be the same

