
Vectorized processing for Apache Arrow - gizmodo59
https://github.com/dremio/gandiva
======
gizmodo59
There is another component called Apache Arrow Flight, a new way for
applications to interact with Arrow. Now that there is an established way for
representing data between systems, Flight is an RPC protocol for exchanging
that data between systems much more efficiently. You can think of this as a
much faster alternative to ODBC/JDBC for in-memory analytics.

~~~
wesm
Arrow Flight is still being developed and may be while before it drops to real
user code. See ARROW-249. Indeed having a open standard runtime memory format
for tabular / columnar data sets is key to improved performance

------
cafxx
No mention anywhere of how it was ensured that the benchmark measurements were
taken after the code had been compiled by the JIT (specifically, C2 on
HotSpot)... seriously?

------
DanielShir
I'd like to use it, could you guys add a license to this thing?

------
jpfr
The technology behind this (and Apache Arrow in general!) are really cool.

The HN submission title is much more clickbait than the actual content
deserves. Any algorithmic improvement (in terms of big-O runtime complexity)
can be scaled up to 100x by making the data set large enough.

~~~
lmeyerov
This is cool! Maybe a different perspective:

\-- Numbers are a red herring to me: HPC libs plugged onto the columnar data
should destroy LLVM-generated code, including SIMD (not sure if being done
yet). HPC libs should cover the typical alg shapes anyways (e.g., everything
in the article), and GPU code should generally beat whatever is here anyways.
For a fun time, look into SIMD polyhedral vs skeleton & manual GPU codes..

The numbers here we really care about are expression compile time - will a
system respond quickly to an analyst or adaptive system generating code. Is it
< 10ms overhead, < 100ms, < 1s, < 5s, ...?

\-- All about the UDFs: So why do we care? A big problem w/ the algorithm
template libs is they stink at accepting friendly user-defined expressions
(UDFs). Fast systems are increasingly adopting LLVM to solve this. Speaking of
SIMD.. MapD does exactly this: columnar SQL expressions -- LLVM --> CUDA. Our
team has been thinking about how to do stuff like js/python/pandas UDFs ->
CUDA, and pragmatically, most roads came down to this (vs say Numba's
internals). So Gandiva providing a modular bit here is quite cool, esp. for
the growing Arrow community!

\-- For framework devs: Likewise, there are some cool choices here like code
caching. Users shouldn't directly interact with this, but framework devs will.
We've eaten cycles here for similar projects, and most Gandiva consumer would
have to as well if it wasn't here too. So just as Apache Calcite is easing dev
of a lot of similar systems, it's cool to see the team is not just doing the
modular system, but seemingly, doing it well!

