
Apache Arrow: A new open source in-memory columnar data format - jkestelyn
https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces87
======
bcoates
If you don't speak press-release, this is a cool project to create an in-
memory interop format for columnar data, so various tools can share data
without a serialize/deserialize step, which is very expensive compared to
copying or sharing memory on the same machine or within a local network.
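
To make that cost concrete, here's a toy Python comparison (JSON standing in
for a conversion-based serialization hop; numbers will vary by machine):

    import json
    import time
    import numpy as np

    # Toy comparison: re-encoding a column through a serialization format
    # dwarfs the cost of just copying its bytes.
    col = np.random.rand(1_000_000)

    t0 = time.perf_counter()
    restored = np.array(json.loads(json.dumps(col.tolist())))  # serialize/deserialize hop
    t1 = time.perf_counter()
    copied = col.copy()                                        # plain memory copy
    t2 = time.perf_counter()
    print(f"serde: {t1 - t0:.3f}s  copy: {t2 - t1:.3f}s")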

[https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=blob;f...](https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=blob;f=format/Layout.md;h=c393163bf894bab283641882d9aa4a8c2ef0ef8e;hb=HEAD)

(edited post because I failed at reading the git repo and didn't notice the
Java implementation)

~~~
jaltekruse
Disclosure: I am a committer on Apache Drill and Apache Arrow.

This isn't actually true. The Java implementation has been complete and used
in Apache Drill, a distributed SQL engine, for the past few years. While we
anticipate a few small changes to make sure the standard works well across new
systems, this is by no means an announcement without tested code.

[https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=commit...](https://git-wip-us.apache.org/repos/asf?p=arrow.git;a=commit;h=fa5f0299f046c46e1b2f671e5e3b4f1956522711)

~~~
perishabledave
I stumbled upon Google's whitepaper on Dremel. IIRC it explained how to store
the data in columnar format, but I didn't quite get how that translated into
quick queries. Happen to know where I can look to better understand how it
works?

~~~
TallGuyShort
I'll take a stab at explaining it myself: "transactional queries" are faster
in a traditional format because they access many columns in few rows. For
instance, if you want to log in to a website, you access the username,
password, and possibly other authentication factors for a single user: this
ends up being faster if you can go to the user's row, and then read all those
fields in a contiguous scan.

"Analytical queries" are faster in a column-based format, because you're doing
things like computing the correlation between 2 variables. Instead of looking
at many columns in a specific row, you're looking at most of the data in a few
columns. So instead of reading a whole row at once, it would be nice to skip
the columns you don't care about and just grab big chunks of two or three
columns.
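
A toy sketch of the two layouts (numpy here, purely illustrative):

    import numpy as np

    # Row layout: one record's fields are contiguous, so fetching a whole
    # row (the "login" case) is a single short scan.
    users = np.zeros(100_000, dtype=[('name', 'U16'), ('pw_hash', 'U32'), ('age', 'i4')])
    one_user = users[42]           # all fields of one user, read together

    # Column layout: one field across all records is contiguous, so an
    # analytical scan (mean age, correlations) reads only the bytes it needs.
    ages = np.zeros(100_000, dtype='i4')
    mean_age = ages.mean()

    # Reading the same column out of the row layout strides across every
    # record, dragging the unused fields through the cache.
    ages_strided = users['age']    # a strided view, not contiguous
    print(ages_strided.flags['C_CONTIGUOUS'])  # False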

Does that make sense?

edit: For a long time I was confused about why HBase was described as a
columnar data store when access was still pretty row-based. I think the reason
is that you group columns into column families which can be stored and
retrieved separately, so you still get some of the benefit of a true column-
oriented store.

~~~
vdm
> For a long time I was confused about why HBase was described as a columnar
> data store when access was still pretty row-based

The term was overloaded by column-family stores, which were often referred to
as just 'column stores', probably by people who were not aware of systems like
Vertica and MonetDB.

[http://dbmsmusings.blogspot.co.uk/2010/03/distinguishing-two...](http://dbmsmusings.blogspot.co.uk/2010/03/distinguishing-two-major-types-of_29.html)

------
xtacy
Nice initiative. Cheap serde and cross-language compatibility with an eye
towards scan-intensive workloads are important components!

Have you folks considered the Supersonic engine from Google, which was
designed with similar (though not as extensive as Arrow's) goals in mind?

[https://github.com/google/supersonic](https://github.com/google/supersonic)

~~~
infinite8s
Do you have any experience with Supersonic? It appears to be abandoned (at
least there's been no activity on the mailing list since 2014 -
[https://groups.google.com/forum/#!forum/supersonic-query-eng...](https://groups.google.com/forum/#!forum/supersonic-query-engine))

------
rch
In-memory only... How is this better than the SFrame implementation from Dato
(2015) that was posted here a couple of days ago?

[https://news.ycombinator.com/item?id=11106501](https://news.ycombinator.com/item?id=11106501)

~~~
sandGorgon
Even I had that question, especially when people are talking about using
SFrame as the underlying structure for Julia. But then I saw this:

 _" Arrow's cross platform and cross system strengths will enable Python and R
to become first-class languages across the entire Big Data stack," said Wes
McKinney, creator of Pandas._

 _Code committers to Apache Arrow include developers from Apache Big Data
projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating),
Parquet, Phoenix, Spark, and Storm as well as established and emerging Open
Source projects such as Pandas and Ibis._

this is pretty much nuke-from-orbit.

~~~
rch
> this is pretty much nuke-from-orbit

That analogy might imply overkill, which would highlight the tactical
advantages of the SFrame approach for something like processing a month's
worth of 1-10GB daily-generated SQLite files.

~~~
tanlermin
Python's Dask out-of-core dataframe can also do that.

~~~
infinite8s
Dask's out-of-core dataframes are just a thin wrapper around pandas dataframes
(aided by the recent improvement in pandas to release the GIL on a bunch of
operations).

~~~
tanlermin
Uh, no they are not. They lazily scale pandas to on-disk and distributed files.

[http://dask.pydata.org/en/latest/dataframe.html](http://dask.pydata.org/en/latest/dataframe.html)

"Dask dataframes look and feel like pandas dataframes, but operate on datasets
larger than memory using multiple threads."

[http://blaze.pydata.org/blog/2015/09/08/reddit-comments/](http://blaze.pydata.org/blog/2015/09/08/reddit-comments/)
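
For the curious, a minimal usage sketch (the file pattern is hypothetical):

    import dask.dataframe as dd

    # Lazily define a dataframe over many on-disk CSV files; nothing is
    # loaded until .compute() is called, and then only in chunks,
    # across multiple threads.
    df = dd.read_csv('reddit-comments-*.csv')   # hypothetical files
    top = df.groupby('subreddit')['score'].mean().compute()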

~~~
sandGorgon
Why doesn't Pandas have anything to save the entire workspace to disk (like
.RData)? There are all these cool file formats like Castra, HDF5, even vanilla
pickle - but I don't see anything for a one-shot save of the workspace
(something like Dill).

Is this an antipattern for Pandas?
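
(The closest thing I can think of is manually pickling a dict of frames
yourself - a workaround sketch, not a pandas feature:)

    import pickle
    import pandas as pd

    # Bundle the frames you care about into one dict and save/restore
    # them in a single shot.
    workspace = {'trades': pd.DataFrame({'px': [1.0, 2.0]}),
                 'users': pd.DataFrame({'id': [1, 2]})}
    with open('workspace.pkl', 'wb') as f:
        pickle.dump(workspace, f)
    with open('workspace.pkl', 'rb') as f:
        workspace = pickle.load(f)   # everything back at once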

------
rodionos
I had a hard time parsing out the technical bits behind the facade of titles
and endorsements. I didn't know you could be a Vice President of an open
source product. Two roles I'm familiar with: committer and PMC chair. Now VPs.
Just saying...

~~~
jkestelyn
PMC Chair = Apache VP

~~~
rodionos
Good to know, thanks.

------
agentgt
Not to completely denigrate Apache, but I have found contributing to Apache
projects (not just Apache-licensed, but real Apache projects) to be somewhat
of a hassle. They generally do not take PRs but rather patches via email or
attached to bug reports (perhaps some do, but I have yet to see one that
does), they require signatures for contributions, and JIRA is really getting
slow these days.

I'm somewhat ignorant as I don't run any Apache projects, but I'm curious as
to why people choose Apache to back their projects these days. I guess I
wonder why you would choose a committee instead of just leaving it on GitHub.
I suppose it's the whole voting and board stuff.

~~~
rectang
Projects at Apache have varying levels of integration with GitHub. For
projects with good integration (and there are lots: Spark, Cordova, etc.), a
pull request is fine.

As for why projects choose Apache over GitHub, there are lots of reasons. Some
of them apply to choosing any foundation (Apache, Eclipse, etc.) over GitHub:
legal rigor, being a known quantity for enterprisey consumers, and so on.
Probably the biggest reason many company-sponsored projects end up at the ASF
is that the ASF has a reputation as a good place for competing companies to
collaborate on a common code base.

Personally, I have chosen to donate a great deal of my own time and energy to
the ASF because I greatly treasure its emphasis on governance by individual
contributors rather than corporations. (That the ASF is a 501(c)(3) non-profit
rather than a 501(c)(6) like some of the more slick, consortium-like
foundations is related.)

------
ccleve
Could someone explain the difference between this and Avro or Parquet? Do they
serve the same purpose?

~~~
jaltekruse
Parquet is designed specifically to store large amounts of data efficiently on
disk. As such it defaults to compressing and encoding data to save space.
Arrow is designed for immediate consumption without any materialization into a
different in-memory data structure. It is already in a format well suited to
be used for sending over the wire or reading directly from an API.

I don't know as much about the internals of Avro, but I know it is a bit
different from Parquet, in that it can be used to serialize and deserialize
smaller amounts of data. It is used to store large datasets in files, although
it will in most cases be less space efficient than Parquet. It has also been
used as a way of embedding complex structures into other systems (similarly to
how JSON can be embedded in a database), or for serializing individual
structures between systems. The binary representation of Avro needs to be read
into a system-specific format like a C/C++ struct/object, Java object, etc.
for consumption.

In contrast, Arrow is designed to represent a list of objects/records
efficiently. It is designed to allow a chunk of memory to be handed to a
lightweight language-specific container that can immediately reference into
the memory to grab a specific value, without reading each of the records into
its own individual object or structure.
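
A rough illustration of that idea (not Arrow's actual layout - just typed
views over one flat buffer):

    import numpy as np

    # One flat "record batch" buffer holding two columns back to back.
    buf = (np.arange(10, dtype=np.int64).tobytes()
           + np.linspace(0.0, 1.0, 10, dtype=np.float64).tobytes())

    # A lightweight reader just wraps the memory with typed views and
    # indexes into it directly - no per-record objects are materialized.
    ids = np.frombuffer(buf, dtype=np.int64, count=10, offset=0)
    scores = np.frombuffer(buf, dtype=np.float64, count=10, offset=80)
    print(ids[3], scores[3])   # direct access by offset, zero copies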

~~~
infinite8s
Parquet is also designed to efficiently store nested data on disk (by
efficient I mean it can retrieve a field at arbitrary depth without needing to
walk from the root of the record).
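
(Conceptually - each leaf field of the nested schema becomes its own column,
so reading one deep field touches only that column:)

    # Dremel-style "shredding", sketched: leaf fields become columns.
    records = [
        {'user': {'name': 'a', 'address': {'zip': '02139'}}},
        {'user': {'name': 'b', 'address': {'zip': '94301'}}},
    ]
    # On disk, user.address.zip is stored as its own column, so a query
    # for it never walks the record trees above it.
    zip_column = [r['user']['address']['zip'] for r in records]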

------
buremba
Nice to see a new columnar data format alternative. Just a quick question,
though.

The existing columnar data formats such as Parquet and ORC aim to be space-
efficient, since the data is stored on disk and IO operations are usually the
bottleneck. Columnar data formats shine in the big-data area, so the amount of
data will be huge. Given that columnar data formats can be compressed
efficiently, and that's one of the main points of columnar data formats such
as Parquet and ORC, I'm not sure that I understand the main point of in-memory
columnar data formats.

Once the data is in memory and we can access any column of a row in constant
time, what's the difference between a row-oriented data format and a columnar
data format?

~~~
khc
Random memory access isn't really constant time when you factor in hardware
prefetching and cache lines. See "What Every Programmer Should Know About
Memory".
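
A quick toy benchmark of the effect (numbers are machine-dependent):

    import time
    import numpy as np

    # Summing one field out of a wide "row-oriented" table strides over
    # 19 unwanted fields per record; the contiguous column stays cache-
    # and prefetch-friendly and runs several times faster.
    rows = np.random.rand(2_000_000, 20)       # row layout: 20 fields/record
    col = np.ascontiguousarray(rows[:, 3])     # same field, stored contiguously

    t0 = time.perf_counter(); s1 = rows[:, 3].sum(); t1 = time.perf_counter()
    s2 = col.sum();                             t2 = time.perf_counter()
    print(f"strided: {t1 - t0:.4f}s  contiguous: {t2 - t1:.4f}s")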

~~~
TallGuyShort
That, and regardless of row- or column-orientation, it would be nice to have a
common in-memory format that is as well-established across Big Data projects
as this one looks like it will be.

------
david-given
Is it streamable? Could I use this as an intermediate format to send columnar
data between two processes via a pipe?

~~~
barneso
Better: if the processes are on the same machine, you could use it to share
the data via shared memory or a common memory mapping, avoiding copies of the
data on each end of the pipe.
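
Something like this, sketched in Python (the file path is hypothetical):

    import mmap
    import numpy as np

    # Both processes map the same file; the OS keeps one copy of the
    # pages, and each side gets a zero-copy typed view of the column.
    with open('/tmp/column.buf', 'r+b') as f:      # hypothetical shared buffer
        mm = mmap.mmap(f.fileno(), 0)              # common memory mapping
        col = np.frombuffer(mm, dtype=np.float64)  # no serialize/deserialize
        print(col.sum())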

------
tycho01
I'm interpreting this as saying they want to use the same representation in
memory (for querying) and for 'serialization' (sending the same thing over the
wire). This raises the question of why separate serialized representations
ever became a thing in the first place.

My understanding is that serialization became a thing because in-memory
representations tend to use pointers to shared data structures, which may thus
be referenced multiple times while being stored only once. This would not
translate 1:1 to a serialized representation (where memory addresses would no
longer hold meaning) - much less in any language-agnostic way.

So I have this suspicion that Apache Arrow would not support reusing duplicate
data while storing it only once. Would anyone mind clarifying this point?
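
To illustrate the concern (plain Python, nothing Arrow-specific):

    # In memory, two records can point at one shared structure:
    shared_tags = ['a', 'b', 'c']
    record_1 = {'tags': shared_tags}
    record_2 = {'tags': shared_tags}       # one copy, two references
    assert record_1['tags'] is record_2['tags']

    # In a flat, offset-based buffer there is no generic way to say
    # "same bytes as over there"; a format would have to add something
    # like explicit dictionary encoding to get that sharing back.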

------
NovaX
"Modern CPUs are designed to exploit data-level parallelism via vectorized
operations and SIMD instructions. Arrow facilitates such processing."

How will Arrow use vectorized instructions on the JVM? That seems to be
available only via the JIT and JNI, which is a frustrating limitation.

------
crudbug
"All systems utilize the same memory format"

Will Cassandra Java drivers support this?

------
jkestelyn
More technical details available here:

[http://blog.cloudera.com/blog/2016/02/introducing-apache-arr...](http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/)

