
Hummingbird: Compile trained ML models into tensor computation - polm23
https://github.com/microsoft/hummingbird
======
vladf
This is an interesting idea, with the main non-trivial win really being
vectorized GBDT inference. Instead of serially going down a DT, you can
convert it to vectorizable GEMM code.
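To make the idea concrete, here is a toy sketch of the tree-as-GEMM trick (my own minimal reconstruction of the strategy described in the Hummingbird paper, not its actual code): a 3-leaf decision tree is encoded as a handful of matrices, and inference becomes two matrix products plus elementwise comparisons.

```python
def matmul(X, W):
    """Plain-Python matrix multiply: (m x k) @ (k x n) -> (m x n)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W)))
             for j in range(len(W[0]))] for x in X]

# Tree: if x[0] < 0.5 -> leaf 10
#       else if x[1] < 0.5 -> leaf 20, else leaf 30
A = [[1, 0],            # A[f][n] = 1 iff internal node n tests feature f
     [0, 1]]
B = [0.5, 0.5]          # per-node thresholds
C = [[1, -1, -1],       # C[n][l] = +1 / -1 if leaf l lies in the left /
     [0,  1, -1]]       #           right subtree of node n, 0 otherwise
D = [1, 1, 0]           # D[l] = number of left turns on the path to leaf l
E = [10, 20, 30]        # leaf values

def predict(X):
    # Step 1: evaluate every node's condition at once: T = (X @ A) < B
    T = [[1 if v < b else 0 for v, b in zip(row, B)] for row in matmul(X, A)]
    # Step 2: a leaf is selected iff its path signature matches: (T @ C) == D
    S = matmul(T, C)
    return [sum(e for e, s, d in zip(E, row, D) if s == d) for row in S]

print(predict([[0.2, 0.9], [0.8, 0.2], [0.8, 0.9]]))  # [10, 20, 30]
```

Note that every internal node is evaluated for every input regardless of the path actually taken, which is exactly the redundancy the comment above points out: the dense version trades wasted FLOPs for hardware-friendly GEMMs, a trade that only pays off when the trees are shallow.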

From a cursory look at hummingbird/ml/operator_converters/_tree_commons.py,
this seems to just be doing a GraphBLAS-style graph traversal with dense GEMM,
which strikes me as resulting in incredibly redundant computation. I think
this would only give you acceleration for very small trees (edit: granted, I
think that's what the defaults are for a lot of these packages).

I'd be interested in seeing how this stacks up against:

* native GPU execution for XGB/LGBM

* a _sparse_ GEMM implementation of their algorithm

* the classical reference for DT vectorization, Quickscorer, see [https://github.com/hpclab/quickscorer](https://github.com/hpclab/quickscorer)

~~~
interesaaat
In the technical report
([https://scnakandala.github.io/papers/TR_2020_Hummingbird.pdf](https://scnakandala.github.io/papers/TR_2020_Hummingbird.pdf)),
Table 8, we compared against NVIDIA's FIL library. In their blog post
([https://medium.com/rapids-ai/rapids-forest-inference-library...](https://medium.com/rapids-ai/rapids-forest-inference-library-prediction-at-100-million-rows-per-second-19558890bc35))
they claim to be faster than the original XGB/LGBM GPU implementations, so we
decided to pick FIL as the GPU baseline.

Regarding the GEMM strategy: you are right, it only works well for shallow
trees. In Figure 7 we compared GEMM against the other strategies, which
implement typical tree traversal.

At some point we had an implementation using sparse GEMM, but only for sparse
input feature vectors.

------
aasasd
You'd think that a machine-learning thing would be called ‘_mocking_ bird’.

------
1337shadow
Nothing to do with the Hummingbird notation, if anyone is wondering.

[https://www.hummingbirdnotation.com/](https://www.hummingbirdnotation.com/)

------
fxtentacle
How is this different from replacing numpy with cupy?

~~~
interesaaat
With CuPy you limit yourself to NVIDIA hardware. By using PyTorch instead, you
can run on whatever hardware it supports, either directly or indirectly
(e.g., through ONNX conversion).

~~~
fxtentacle
But I believe ONNX also doesn't support AMD GPUs, so for people using a
regular workstation, the ONNX options are CPU and NVIDIA GPU, just like with
NumPy and CuPy.

~~~
interesaaat
I am referring to IPUs and TPUs, for example. Or even FPGAs.

------
KorfmannArno
Is anyone else interested in reading groups for the open-source book D2L.AI?

