
The often-overlooked random forest kernel - stablemap
https://rmarcus.info/blog/2017/10/04/rfk.html
======
jdonaldson
This is a nice article, but there are some caveats with this approach that
deserve a note:

1) Random forests are by nature supervised learners. They will completely
ignore information that they can't use to predict their target output. This
will have a big impact on any similarity scores derived from these types of
models. At best, you'll see which observations the model considers to be
similar in the training set (for the purposes of predicting the given output).
This is still useful, but not as generalizable.

2) Even setting aside the above, the distribution of splits is commonly
skewed or biased in some way. The depth of the trees in the forest will vary,
the distribution of training data at leaf nodes will vary, categorical
features may cause more splits than numeric features, and so on. Some manner
of re-weighting (a la information gain ratio) is needed to balance each
feature's influence on the overall model behavior.

I think this is a useful technique, and I would like to see more tree-based
metrics used. The metrics just need some adjustments to account for all the
distributional kinks in the trees, and the users need to know what kind of
caveats are applicable for their data.

~~~
tvladeck
> 1) Random forests are by nature supervised learners

You can run unsupervised random forests. The way this procedure works is to
first make a "copy" of the original dataset (the reason for the quotes will be
evident in a second), and then train the forest to distinguish the original
data from the copy with a binary target vector.

The "copy" is made by sampling column-wise from the original dataset, so that,
in the copy, each variable has the same univariate distribution as the
original, but any interdependencies are destroyed.

The forest then learns to distinguish the original from the copy, and the only
information that it has at its disposal are interdependencies in the original
dataset.
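
A minimal sketch of that procedure in scikit-learn terms (the dataset here is
a stand-in, with an interdependency planted just so there's something to
find):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))        # stand-in for the real dataset
    X[:, 1] += 2 * X[:, 0]               # plant an interdependency

    # Build the "copy": sample each column independently with replacement,
    # preserving the univariate distributions but destroying dependencies.
    X_copy = np.column_stack([rng.choice(col, size=len(col)) for col in X.T])

    # Label originals 1 and copies 0; the forest can only tell them apart
    # by exploiting interdependencies in the original data.
    X_all = np.vstack([X, X_copy])
    y_all = np.concatenate([np.ones(len(X)), np.zeros(len(X_copy))])
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_all, y_all)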

------
RMarcus
I'm the author -- this is a technique to use a trained random forest as a
kernel function for similar data. It can be used as a sort of primitive
transfer learning, or just as a way to get a very high quality kernel for a
(labeled) dataset.
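
The core idea, as a rough scikit-learn sketch (the post has the full details;
this is roughly the leaf-co-occurrence construction from Davies and
Ghahramani): two points are similar in proportion to how often the trees
route them to the same leaf.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    def rf_kernel(forest, A, B):
        """K[i, j] = fraction of trees where A[i] and B[j] share a leaf."""
        leaves_a = forest.apply(A)    # shape (len(A), n_trees)
        leaves_b = forest.apply(B)    # shape (len(B), n_trees)
        return (leaves_a[:, None, :] == leaves_b[None, :, :]).mean(axis=2)

    K = rf_kernel(forest, X, X)       # Gram matrix usable with kernel PCA/SVMs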

Happy to answer any questions!

~~~
stochastic_monk
Very interesting; I've ignored random forests more than I should. Thank you!

I won't comment on your methods except to say that comparing classification
over linear PCA against kernel PCA on linearly inseparable data isn't exactly
fair; I think an SVM run on a different kernel PCA decomposition, or a kernel
SVM itself, would be more illustrative.
(Is your code available?)
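
To make that concrete, here's a rough sketch of the baseline I have in mind,
on a stand-in dataset: RBF kernel PCA feeding a linear SVM, next to an RBF
kernel SVM applied directly.

    from sklearn.datasets import make_moons
    from sklearn.decomposition import KernelPCA
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.1, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Baseline 1: kernel PCA (RBF) followed by a linear SVM.
    kpca_svm = make_pipeline(KernelPCA(n_components=2, kernel="rbf"),
                             SVC(kernel="linear"))
    # Baseline 2: an RBF-kernel SVM applied directly.
    kernel_svm = SVC(kernel="rbf")

    for model in (kpca_svm, kernel_svm):
        model.fit(X_tr, y_tr)
        print(model.score(X_te, y_te))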

I generally think of neural networks as enormous meta-kernels (composites of
<composites of ...> kernels). This leads me to think of ways that kernels can
be turned into neural network layers.

Great work has been done turning powerful tools like Random Fourier
Features/Kitchen Sinks into layers in neural networks (e.g., Alex Smola's
Deep-Fried ConvNets [https://arxiv.org/abs/1412.7149] and Choromanski's
Structured Adaptive/Random Spinners [https://arxiv.org/abs/1610.06209]). Deep
Forest [https://arxiv.org/abs/1702.08835] is a method which claims to work
well, but it's somewhat odd; it's not quite what I would have imagined.

My biggest criticism of random forests is that the more expressive models are
more memory-hungry and more expensive at both training and run time than many
comparable methods. But bounded-complexity trees with smart implementations
seem to be a lot more useful and have broader applications than I give them
credit for.

~~~
RMarcus
I agree the comparison could be a lot fairer. My goal was to show how adding
in labels can give you better principal components, but I should've used an
RBF kernel, or some kind of unsupervised kernel, instead of linear PCA.

I really should go and make the code available. I spent a lot of time writing
the RF kernel in Cython to get it to be fast, but it is so fragile and
incomplete that I'm not sure I'm ready to release and maintain it. Maybe I'll
just post the scripts.

Thanks for the links and insights about ANNs as combined kernel learners +
classifiers. I'll check out the papers (I'm also surprised at what Deep Forest
is. Not what I was expecting.)

~~~
theSage
What were you expecting deep forests to be like?

------
srean
Let me add another often-overlooked tweak that you can use with a random
forest: applying random rotations to the data before learning the trees.

This is extremely powerful. To give you some intuition for why, recall that
simple trees make axis-parallel splits of the space. If the trees have to
approximate an inclined line, they have to do a high-resolution stair-stepping
to approximate it, like so:

    
    
        __
          |
          ------
                |
                 ---------
                          |_______

This ends up requiring a lot of trees, or a very deep one. Now if you equip
the algorithm with a few inclined splits, which is exactly what a random
rotation would do, you gain a lot of power.
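
A minimal sketch of the tweak (one shared rotation for the whole forest, for
brevity; rotation-forest variants draw a fresh rotation per tree), on a
stand-in dataset:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))                  # stand-in dataset
    y = (X @ rng.normal(size=4) > 0).astype(int)   # inclined decision boundary

    # Draw a random orthogonal matrix by QR-decomposing a Gaussian matrix,
    # then fit the forest on the rotated features.
    Q, _ = np.linalg.qr(rng.normal(size=(X.shape[1], X.shape[1])))
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X @ Q, y)

    # Apply the same rotation at prediction time.
    preds = forest.predict(X @ Q)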

~~~
haeffin
How is that different from allowing non-axis-aligned splits?

~~~
srean
That is a very interesting question, and one can go deep into the woods with
it; not everything is known about this question. There are two distinct
effects at play here: computational cost and learning efficiency.

A brutally shortened response would be that searching for non-axis-aligned
splits in the nodes can be very expensive CPU-wise; this addresses the first
effect. Applying a random rotation, on the other hand, is quite cheap.

The second effect is that ensembling loses its power when you optimize over
oblique splits, because the trees lose diversity.

It comes down to a trade-off.

------
zintinio5
This is really awesome, thanks for posting! I wasn't aware that random
forests could be used as kernels, and the link to Davies and Ghahramani is
especially useful; I'll definitely read it through later.

~~~
amelius
That reference indeed seems very accessible.

