
Can Gradient Boosting Learn Simple Arithmetic? - mariofilho
http://mariofilho.com/can-gradient-boosting-learn-simple-arithmetic/
======
YeGoblynQueenne
I hate to be the source of all the bad vibes in the room, again, but I'd be
careful with claims about "learning arithmetic". It's obvious that the trained
model in the article is only memorising its training data, rather than
actually _learning_ arithmetic functions, in the general sense. If you gave it
any numbers outside the range it was trained on, those graphs would start to
look real silly. This is probably the (actual) reason why no testing split was
used- because the model is too weak to generalise to held-out data [1].

The point to keep in mind here is that addition is a recursive function [2]
and as such cannot be learned by learners that cannot model recursion, which
is basically all statistical machine learners. The best thing that can be done
is to approximate it within some range, at which point you're only memorising
which pairs of numbers X and Y map to which third number Z - like the model in
the article does. And it doesn't even do it very well, hence the need for
noise (which allows it to luck out and cover more XYZ triples).
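
To make the recursion point concrete, a minimal Python sketch of the
Peano-style definition referenced in [2] (names purely illustrative):

    # add(x, 0)       = x
    # add(x, succ(y)) = succ(add(x, y))
    def add(x, y):
        if y == 0:
            return x
        return 1 + add(x, y - 1)  # "succ" of the recursive result

    print(add(3, 4))  # 7 -- works for any naturals, not just a memorised range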

So let's say that a better title would be "Can GB approximate arithmetic
functions over a tiny range of numbers?". Which is not that exciting, for sure
[3].

____________________

[1] The only good reason I can think of to not validate on a test partition is
that you only have a single example. I can think of one use case, trying to
learn plans from examples of starting and goal states. But arithmetic? How can
you claim to have learned "arithmetic" when you can't show that your model
works on even one pair of numbers it hasn't seen in training?

[2]
[https://en.wikipedia.org/wiki/Peano_axioms#Addition](https://en.wikipedia.org/wiki/Peano_axioms#Addition)

[3] This is also a good example of the limitations of statistical machine
learning, in general. Having "learned" addition, a strong learner should be
able to use it to learn the other three functions. Except, statistical machine
learners can only learn one concept at a time, and they can't reuse their
models as features to learn new concepts.

~~~
agent008t
It seems like the big question is how can the gap between statistical and
symbolic ML be bridged. How can the same architecture both learn to e.g.
recognize cats in images, and perform logical inference on symbols?

~~~
YeGoblynQueenne
The δILP paper from DeepMind did something like this using one architecture,
trained end-to-end:

[https://deepmind.com/blog/learning-explanatory-rules-noisy-data/](https://deepmind.com/blog/learning-explanatory-rules-noisy-data/)

They have one experiment where they learn the less-than relation between
_images_ of digits. Pretty cool stuff.

However- my preference would be a system that combined statistical and
symbolic learning. For machine vision specifically, the statistical learner (a
CNN most likely) would extract features and these would then become the
universe of discourse for the background knowledge of the symbolic learner.
This is highly speculative and I haven't done any sort of practical work
towards that, but I've discussed some of the complications with colleagues and
I think that they can be overcome.

~~~
agent008t
I have only briefly looked through their long paper, and it looks like what
you suggest is pretty much exactly what they do.

Their main contribution seems to be that they have incrementally improved on a
symbolic learner. They explicitly tell it to search a predefined set of
programmes. They use pretrained MNIST CNN classifiers as inputs into it, which
already know that e.g. MNIST has 10 classes.

What I was talking about is symbolic reasoning somehow 'emerging' from a
connectionist approach, without being explicitly designed. The cool thing
about deep nets is that they are both kind of biologically and evolutionarily
plausible (except back-propagation, I suppose), and are able to achieve great
performance on a bunch of traditionally difficult perception tasks (vision,
hearing - again, with some caveats).

But how such a system could develop symbolic reasoning is not at all clear.
Are any other biological systems apart from humans capable of symbolic
reasoning?

Of course, planes don't flap their wings and so a practical system will
probably have a symbolic reasoning engine designed top-down rather than
emerging from some neural net. But it is still an interesting question that
would give us more insight into how our brains work.

~~~
YeGoblynQueenne
I see what you mean- thank you for the clarification.

Well, I'm not really the right person to discuss this issue since I can't
claim to understand how the brain works. However, I do understand a few things
about connectionist methods and it's my understanding that they are not very
good models of the way the brain works at all (am I misrepresenting your turn
of phrase, "biologically plausible"?). In that sense, I doubt it's possible
that a neural net would develop symbolic reasoning by dint of being in some
way similar to a biological brain.

In general, my experience with neural nets and gradient optimisation
techniques is that unless a great deal of effort is spent directing their
learning, they are prone to learn whatever is convenient, which is very often
not what the human users want them to learn.

For instance, see the following collection of anecdotes of evolutionary
algorithms learning whatever they please, rather than what the researchers were
trying to teach them:

[https://arxiv.org/abs/1803.03453](https://arxiv.org/abs/1803.03453)

Or the descriptions of the difficulties of training Deep Reinforcement
Learning models in this article:

[https://www.alexirpan.com/2018/02/14/rl-hard.html](https://www.alexirpan.com/2018/02/14/rl-hard.html)

Also, while I'm a staunch symbolicist myself, I'm not 100% convinced that
symbolic reasoning is "natural". I think it rather took a lot of effort to
develop such systems and most people still have a great deal of trouble using
them with precision. If you meant to say that symbolic reasoning should arise
spontaneously in a neural network, I see that as very unlikely.

------
yorwba
It's actually quite obvious why gradient boosted trees don't work on the
synthetic dataset, if you know how they're trained. I guess it's a testament
to their usefulness as black-box predictors that the author didn't realize
that.

For gradient boosted trees, you first need to grow a single tree. That tree
starts with a single leaf and then needs to be split to try and improve
performance. But because the data is perfectly antisymmetric, every candidate
split leaves the mean target at zero on both sides (each side still spans the
full symmetric range of the other feature), so no suitable split can be found
and the growing process terminates. Gradient boosting can't help you, because
the residuals to train the next tree on are identical to the original data.

If you add even the slightest amount of imbalance to the data, e.g. by
sampling random positions instead of using a grid, the problem disappears.
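
For anyone who wants to poke at this, here is a rough sketch of that failure
mode. It assumes a multiplication target on a grid centred at zero (my reading
of the article's setup), and the hyperparameters are arbitrary:

    import numpy as np
    from xgboost import XGBRegressor

    grid = np.linspace(-10, 10, 21)               # perfectly symmetric grid
    x1, x2 = np.meshgrid(grid, grid)
    X = np.column_stack([x1.ravel(), x2.ravel()])
    y = X[:, 0] * X[:, 1]

    model = XGBRegressor(n_estimators=100, max_depth=6)
    model.fit(X, y)
    print(np.unique(model.predict(X)).size)       # ~1: the trees never split

    # Break the symmetry by sampling random positions instead of a grid:
    rng = np.random.default_rng(0)
    Xr = rng.uniform(-10, 10, size=(441, 2))
    yr = Xr[:, 0] * Xr[:, 1]
    model.fit(Xr, yr)
    print(np.corrcoef(model.predict(Xr), yr)[0, 1])  # now close to 1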

~~~
mariofilho
Hi, thanks for your comment:

I opened an issue on GitHub. This didn't seem obvious to many people who read
the article, so it's nice that now we can keep this in mind while using the
model:

[https://github.com/dmlc/xgboost/issues/4069](https://github.com/dmlc/xgboost/issues/4069)

------
luckyt
It's really interesting that XGBoost failed when run on a dataset that had no
noise. I've also seen a similar thing occur with the Adam optimizer when
training neural networks on perfect synthetic data [1]. Always interesting to
take a dive down and understand why this is happening -- it gives you a
glimpse into the internals of your algorithms.

[1]:
[https://datascience.stackexchange.com/questions/25024/strange-behavior-with-adam-optimizer-when-training-for-too-long](https://datascience.stackexchange.com/questions/25024/strange-behavior-with-adam-optimizer-when-training-for-too-long)

~~~
svantana
I wouldn't call that comparable -- Adam gets 99.99% of the way, and this
person is wondering why it doesn't go all the way. The answer of course is
that it wasn't designed to. In this case XGBoost fails to do anything at all,
which seems like a major bug.

------
shoo
Friedman's gradient boosting machine paper from ~1999 is worth a read:
[https://projecteuclid.org/download/pdf_1/euclid.aos/1013203451](https://projecteuclid.org/download/pdf_1/euclid.aos/1013203451)

------
pfortuny
Possibly silly question: what would happen if one were to extrapolate? I mean,
using the trained model in a larger grid.

I should do it myself but have no background in applications.

~~~
mariofilho
It will very likely fail to predict new data. The model can approximate the
interactions over the range it saw on training, but we need to show it larger
ranges if we want to predict for a bigger grid.
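
Roughly speaking, the trees only output piecewise-constant values, so
predictions saturate outside the training range. A quick sketch (ranges and
settings are only illustrative):

    import numpy as np
    from xgboost import XGBRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-10, 10, size=(5000, 2))
    y = X[:, 0] + X[:, 1]                  # addition, seen only on [-10, 10]

    model = XGBRegressor(n_estimators=200, max_depth=4).fit(X, y)
    print(model.predict(np.array([[5.0, 3.0],       # in range  -> roughly 8
                                  [50.0, 30.0]])))  # out of range -> stuck near 20, not 80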

~~~
hopler
So, Betteridge's Law of Headlines is supported.

------
ggerules
Interesting read. Thanks for posting! It reminds me of a simple
find-the-equation (symbolic regression) problem encountered in genetic
programming (see Koza's first genetic programming book).

