
Yandex open sources CatBoost, a gradient boosting ML library - bobuk
https://techcrunch.com/2017/07/18/yandex-open-sources-catboost-a-gradient-boosting-machine-learning-librar/
======
kgwgk
> gradient boosting — the branch of ML that is specifically designed to help
> “teach” systems when you have a very sparse amount of data, and especially
> when the data may not all be sensorial (such as audio, text or imagery), but
> includes transactional or historical data, too.

That's some strange definition.

~~~
mjn
Weird, yeah. Seems like a roundabout way of trying to preemptively answer the
"why not deep learning?" question omnipresent among ML newcomers. The bits
identified aren't really wrong: you could argue that gradient boosting's
comparative strength is that it works well (often out-of-the-box, with little
tuning) on structured data sets, including relatively small data sets. Hence
the good performance on Kaggle-type problems, whereas deep learning is ahead
in audio/text/image/video data; and hence the lack of gradient boosting being
used on ImageNet-type problems.

But these points all belong in some section entitled "why use gradient
boosting instead of another ML method?", not in a definition of gradient
boosting.
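
To make the "out-of-the-box on structured data" point concrete, here's a
minimal sketch: default-parameter gradient boosting on a small tabular dataset.
scikit-learn's GradientBoostingClassifier and its bundled breast-cancer data
are used purely as stand-ins here, nothing specific to CatBoost:

    # Gradient boosting with default settings on a small, structured dataset:
    # no feature scaling, no encoding, no hyperparameter tuning.
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)   # ~570 rows, 30 numeric features
    model = GradientBoostingClassifier()         # default hyperparameters
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean())                         # usually strong with zero tuning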

~~~
daddyo
Seems that deep learning can benefit from gradient boosting too (at least,
from a computational perspective).

[https://arxiv.org/abs/1706.04964](https://arxiv.org/abs/1706.04964) "Learning
Deep ResNet Blocks Sequentially using Boosting Theory"

(As for the layman description: I thought boosting performed better out-of-
the-box on dense data than on sparse data, because most feature sub-selections
for bagging fall on zeroed features.)

------
amrrs
The benchmark scores seem to have been measured against Kaggle datasets, which
makes them more reliable. With categorical feature support and a smaller tuning
burden, CatBoost might be the library that XGBoost enthusiasts have been
looking for. But then again, how is a gradient boosting library making news
while everyone's talking about deep learning?
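
For anyone wondering what the categorical feature support looks like in
practice, here's a rough sketch with made-up toy data (assuming the catboost
Python package): string columns are passed straight to fit() via cat_features,
with no one-hot encoding step.

    # CatBoost consuming raw string categories directly (toy data, illustrative only).
    import pandas as pd
    from catboost import CatBoostClassifier

    X = pd.DataFrame({
        "city":   ["moscow", "spb", "moscow", "kazan"],
        "device": ["mobile", "desktop", "mobile", "mobile"],
        "visits": [3, 11, 1, 7],
    })
    y = [1, 0, 0, 1]

    model = CatBoostClassifier(iterations=100, verbose=False)
    model.fit(X, y, cat_features=["city", "device"])  # no manual encoding needed
    print(model.predict(X))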

~~~
zitterbewegung
I think it's because of three reasons:

1. Yandex is announcing a new ML library, and that makes it news because Yandex
is well established

2. Gradient boosting is quite effective and popular

3. Not everything has to be about deep learning

~~~
daddyo
Adding to 1.: This came out of MatrixNet research, which was state-of-the-art
(and well-guarded) for years.

------
sandGorgon
Catboost is implemented in C. Does anyone know how stuff like this is run at
scale over multiple machines? For example, if I want to run a distributed
computation in Spark, I use primitives that are distributed in nature.

But how does someone use CatBoost across a cluster of 10 machines? All the
help documents are heavily single-machine. Is there any kind of infra
framework that will distribute the jobs across all the machines running
CatBoost, etc.?

~~~
prdonahue
Last time I did something in C requiring spreading work across a cluster of
machines it was with MPI(CH). Docs available at mpich.org. This was Monte
Carlo simulation for hyper-dimensional (~8 IIRC) asset allocation and thus the
simulations were easily divisible—recombining was just simple arithmetic—and
the network I/O was minimal.

Interesting tidbit (to me anyway): This was 2006, before you could use
something like AWS for the purpose and we were trying to keep costs to a
minimum. (IBM had their "public one" grid but it was unusable.) The Core 2 Duo
processor had just come out so I hired an intern from Cal—now a PhD and
brilliant engineer at Netflix—to figure out the optimal overclocking rig and
we built 32 chassis consisting of just motherboard, RAM, NIC, custom
cooling/heatsink, and power supply. The problem was then how to deploy these
in a colo. At the time there were some low end providers at 200 Paul willing
to get creative with a cabinet so I found a machinist (metal worker?) able to
cut some custom aluminum shelving on top of which we could stack the ATX
cases. Rigged the boxes up to network boot off one of the nodes, compiled our
application with Intel's C++ compiler to take advantage of the SIMD/SSE3
instruction set, and away we went running billions of simulations on a startup
budget.
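
For anyone who hasn't touched MPI, the pattern was roughly the sketch below (a
minimal mpi4py example estimating pi rather than asset allocations): each rank
runs its own share of simulations independently, and a single reduce recombines
the results.

    # Embarrassingly parallel Monte Carlo over MPI: each rank simulates on its
    # own; recombining the results is just arithmetic (a single reduce).
    # Run with e.g.: mpiexec -n 32 python montecarlo.py
    import random
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    n_local = 1_000_000                          # simulations per rank
    random.seed(rank)                            # independent stream per rank
    hits = sum(random.random() ** 2 + random.random() ** 2 <= 1.0
               for _ in range(n_local))

    total = comm.reduce(hits, op=MPI.SUM, root=0)
    if rank == 0:
        print(4.0 * total / (n_local * size))    # Monte Carlo estimate of pi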

~~~
srean
If you were running a distributed workload (it's not clear from your comment
whether you were), I'm curious if you tried out multiple NICs per machine. By
2006 there already were multiple queues per machine, but prior to that,
multiple NICs were sometimes helpful.

------
nl
_“Reduced overfitting” which Yandex says helps you get better results in a
training program._

So that's awesome...

The benchmarks at the bottom of
[https://catboost.yandex/](https://catboost.yandex/) are somewhat useful,
though. I do remember that when LightGBM came out, its benchmarks vs XGB
were... very selective.

~~~
autokad
I love both LightGBM and XGB; together they make a good ensemble. It should be
interesting to see how this one turns out.

~~~
nerdponx
Interesting idea to ensemble them. The main difference is the way the trees
are constructed, right? So ensembling them is kind of like saying "I don't
know which one is better, so screw it, I'll do both and average the results",
right?
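
Concretely, the simplest version of that would be something like this sketch:
fit both, then average their predicted probabilities (the xgboost and lightgbm
packages and stand-in data are assumed).

    # The simplest "do both and average the results" blend: train XGBoost and
    # LightGBM separately, then average their predicted probabilities.
    from xgboost import XGBClassifier
    from lightgbm import LGBMClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    xgb = XGBClassifier().fit(X_tr, y_tr)
    lgb = LGBMClassifier().fit(X_tr, y_tr)

    # Unweighted average of the two probability estimates.
    blend = (xgb.predict_proba(X_te)[:, 1] + lgb.predict_proba(X_te)[:, 1]) / 2
    preds = (blend > 0.5).astype(int)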

~~~
s0ulmate
Hey! CatBoost team here.

Yes, stacking different gradient boosting algorithms works well in practice.
One example is a just-finished Kaggle-like competition:
[http://mlbootcamp.ru/round/12/sandbox/](http://mlbootcamp.ru/round/12/sandbox/)
where mpershin stacked CatBoost with LGBM and took 7th place.

The kernel for this solution can be found in one of our tutorials:
[https://github.com/catboost/catboost/blob/master/catboost/tu...](https://github.com/catboost/catboost/blob/master/catboost/tutorials/mlbootcamp_v_tutorial.ipynb)
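
For a rough idea of what stacking (as opposed to plain averaging) means here,
a minimal sketch, not the competition solution itself: the base models'
out-of-fold predictions become input features for a small second-level model.

    # Minimal stacking sketch: out-of-fold predictions from CatBoost and LightGBM
    # become the features of a second-level (meta) model.
    import numpy as np
    from catboost import CatBoostClassifier
    from lightgbm import LGBMClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

    cat = CatBoostClassifier(iterations=200, verbose=False)
    lgb = LGBMClassifier()

    # Out-of-fold probabilities, so the meta-model never sees leaked targets.
    cat_oof = cross_val_predict(cat, X, y, cv=5, method="predict_proba")[:, 1]
    lgb_oof = cross_val_predict(lgb, X, y, cv=5, method="predict_proba")[:, 1]

    meta_X = np.column_stack([cat_oof, lgb_oof])
    meta = LogisticRegression().fit(meta_X, y)   # second-level model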

------
kensoh
Thanks OP, I really like this part of the article: 'It also uses an API
interface that lets you use CatBoost from the command line or via API for
Python or R'

