
The Origin and History of Apache Arrow - jonbaer
https://www.dremio.com/origin-history-of-apache-arrow/
======
chrisaycock
This blog post is the first mention I'd seen of Gandiva:

[https://github.com/dremio/gandiva](https://github.com/dremio/gandiva)

As I began reading about that, I thought it sounded a lot like Weld, so I was
happy to see it listed in the prior art:

[https://www.weld.rs](https://www.weld.rs)

These projects present a compiler IR that can be emitted by an analytics
framework.

For example, Spark and pandas currently run functions directly to perform
filters and aggregations. These can be fast for simple use cases, but complex
user code might result in a lot of branching and temporary values. So instead,
the libraries could emit a mid-level IR that is then JITed via LLVM. That's
exactly what Weld is for, and it looks like Gandiva is similar (but on top of
Arrow data).
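
Neither Weld nor Gandiva, but a minimal Numba sketch gives a feel for the
payoff of handing a kernel to LLVM instead of chaining eager library calls
(the function and data here are invented for illustration):

```python
import numpy as np
from numba import njit

@njit  # LLVM-compiles the function on first call
def filtered_sum(values, threshold):
    # Filter + aggregate fused into one compiled loop:
    # no temporary filtered array, no per-element Python overhead.
    total = 0.0
    for v in values:
        if v > threshold:
            total += v
    return total

data = np.random.randn(1_000_000)
print(filtered_sum(data, 0.0))
```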

I've seen a lot of interest in LLVM-based array programming lately, including
TensorFlow XLA and Intel's nGraph. So I'm glad to see these techniques applied
to tabular data as well.

~~~
kstirman
More on the Gandiva Initiative here: [https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/](https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/)

------
nevi-me
This has been a great and informative write-up. I learnt of Arrow when Wes and
Hadley announced Feather. I think Arrow was still at 0.4.0 then.

At the end of 2016, we had a project at work where we were trying to migrate a
credit modeling architecture away from SAS to something different.

There was an enterprise "SAP HANA" hammer+nail contingent in the room (I'm
biased because I think HANA is rubbish vaporware), and directors who thought
"open source is immature and unstable".

They gave me the lead on the project, and I started out with a naive Pandas
port, where I hit the data size issue. I got more resources and we made an R
attempt, with similar perf. We made a PySpark impl; serde and high latency
were the problem. We wrote another version in Scala; better, but similar.

Given the nature of the business, we canned the project. I learnt a lot
though.

Data serialisation is a pain. When we started using Feather, we were able to
speed up some of our implementations by orders of magnitude. At the time,
Parquet interop was sketchier, but much better than CSV. I tried taking
advantage of Spark and R by building a bridging service where I could run the
fast parts of each tool, dump the data to Feather files, and have the next
service pick up processing. R hasn't had a good prod/serve API story, so I
didn't get good results. I used gRPC for Python and Scala, and tried some REST
thing as a proxy for R. What I could have benefited from was a proper IPC
mechanism.
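
On the Python side, that kind of Feather handoff between stages looks roughly
like this (a minimal sketch; the file name and columns are invented, and
pyarrow's feather module is assumed):

```python
import pandas as pd
import pyarrow.feather as feather

# Stage 1: the pandas service dumps its result for the next stage.
df = pd.DataFrame({"account_id": [1, 2, 3], "exposure": [100.0, 250.5, 80.2]})
feather.write_feather(df, "stage_output.feather")

# Stage 2: the next service (Python here; R would use its feather package)
# picks up processing with no CSV parsing or type guessing.
df2 = feather.read_feather("stage_output.feather")
```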

Across the Arrow releases and the various integrations of the past year and a
bit, a lot of what we struggled with has been solved.

The recent improvements make me excited, as I'm keenly interested in Arrow for
compute.

I'm starting a project where I'll be building web-based viz and reporting. I'm
planning on using Arrow, but I'm undecided on whether to use the JVM or Rust
on the back end: JDBC to Arrow will be a thing in the next release, or there's
Diesel.rs to Arrow from DataFusion.

Nonetheless, exciting future for data engineers on the horizon!

------
Jabbermonkey
I hit a roadblock trying to read Stata files into Pandas a few months ago. I
discovered that not all versions of the Stata file format are supported by
Pandas in Python. R has much better support for Stata files.

With the help of Feather, which is built on Arrow, I was able to read Stata
files into R, write the dataframe out to Feather, and read the Feather file
into a Pandas dataframe with no manipulation.
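
The Python half of that round trip is essentially a one-liner (a sketch with
an invented file name; the R half would be along the lines of
haven::read_dta() followed by feather::write_feather()):

```python
import pandas as pd

# Read the Feather file that R wrote after ingesting the Stata data.
df = pd.read_feather("survey.feather")
print(df.dtypes)  # column types survive the trip, unlike a CSV round trip
```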

Without Feather I would have had to resort to using CSVs as intermediate
files, which would have meant additional pre-processing in R and
post-processing in Pandas. Feather and Arrow saved me a bunch of time on this.

I'm looking forward to using Arrow more broadly but, even with just Feather,
Wes and Hadley have vastly simplified the effort of interfacing between R and
Python/Pandas. I'm also very excited to see what else comes out of their
partnership at Ursa Labs: [https://ursalabs.org](https://ursalabs.org)

------
amelius
I was wondering what is a columnar data store and found this on Wikipedia:

> by storing data in columns rather than rows, the database can more precisely
> access the data it needs to answer a query rather than scanning and
> discarding unwanted data in rows. Query performance is increased for certain
> workloads.
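
For intuition, the mechanism they're describing looks like this in a minimal
numpy sketch, with a single-column sum standing in for the query:

```python
import numpy as np
import timeit

rows, cols = 1_000_000, 20
row_store = np.random.rand(rows, cols)    # C order: each row is contiguous
col_store = np.asfortranarray(row_store)  # Fortran order: each column is contiguous

# A query that touches one column reads strided memory in the row store,
# but a single contiguous block in the column store.
t_row = timeit.timeit(lambda: row_store[:, 3].sum(), number=100)
t_col = timeit.timeit(lambda: col_store[:, 3].sum(), number=100)
print(f"row layout: {t_row:.3f}s, column layout: {t_col:.3f}s")
```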

This seems true only in very specific circumstances. And why can't a database
figure out the best way to store data, rather than let the user make the
decision upfront (which they may later regret when requirements change)?

~~~
zukzuk
Columnar data stores tend to be used in applications where performance is the
primary concern. Abstractions tend to be leaky, but especially so when it
comes to performance.

This is maybe comparable to the way high-performance applications are still
usually written in C instead of higher-level languages. You'd think that at
this point we could write compilers smart enough to generate more performant
code than something written by a human hand, but in practice that often
doesn't turn out to be true. Not yet anyway. And even when it is possible,
those abstractions, due to their inevitable complexity, tend to misbehave in
unexpected ways, with unpredictable performance hiccups.

------
roryisok
I misread the title and thought this was going to be about Native-American
archers.

------
rb808
Does anyone know when we can expect a 1.0 release that I can use in prod? I
have a perfect project for it and have tried it out, but am reluctant to
implement fully until it's nailed down to an LTS version.

~~~
crb002
Should be stable now for 99% of use cases.

~~~
rb808
Thanks, that's helpful. I'm excited about this one.

------
mistrial9
very interesting as an outsider to see the high-stakes engineering like this
unfold, and notice the pull of lowly scripting language python for serious
architecture CS teams.

