
Do we need a third Apache project for columnar data representation? - riboflavin
http://dbmsmusings.blogspot.com/2017/10/apache-arrow-vs-parquet-and-orc-do-we.html
======
wesm
One of the lead Arrow developers here
([https://github.com/wesm](https://github.com/wesm)). It's a little
disappointing to see the Arrow project scrutinized through the one-
dimensional lens of columnar storage for database systems -- i.e. considering
Arrow to be an alternative technology (part of the same category of
technologies) to Parquet and ORC.

The reality (at least from my perspective, which is more informed by the data
science and non-analytic database world) is that Arrow is a new, category-
defining technology. So you could choose to use it as a main memory storage
format as an alternative to Parquet/ORC if you wanted, but that would be one
possible use case for the technology, not the primary one.

What's missing from the article is the role and relationship between runtime
memory formats and storage formats, and the costs of serialization and data
interchange, particularly between processes and analytics runtimes. There are
also some small factual errors about Arrow in the article (for example, Arrow
record batches are not limited to 64K in length).

I will try to write a lengthier blog post on wesmckinney.com when I can, going
deeper into these topics to help provide some color for onlookers.

~~~
Thrymr
In fact, the conclusion of the article is more positive than the headline
makes it out to be: "Therefore, it makes sense to keep Arrow and Parquet/ORC
as separate projects, while also continuing to maintain tight integration."

~~~
makmanalp
I think he buried the lede a bit with the title - his answer being "yes, we
should have a separate format for this." The way he phrased his points was a
bit odd and seemed inimical at times even though in the end it wasn't - e.g.
he mentioned the X100 paper and his C-store compression paper, which both talk
about lightweight in-memory compression schemes that would suit Arrow's use
case well, but then he goes back and says "Arrow probably won't support gzip"
(which is much more heavyweight but offers a better compression ratio and is
more suitable for disk-based storage formats) - OK, so that's fine and to be
expected then? It turns out, yep, it's expected.

The main idea I took away was his dispelling of the notion that we should have
instead been putting effort into slightly repurposing an existing format for
our in-memory data layout.

It's definitely exciting to see the data science world and the databases world
finally interacting a lot more - I think each has a lot to learn from the
other. Battle tested techniques on one side, and entirely new use cases to
deal with on the other.

\---

For the curious: Abadi is a well-known name in the databases community,
especially with regards to column stores. A few papers that he's co-authored
that I like:

"Column-Stores vs. Row-Stores: How Different Are They Really?" \- speaks about
how a different style of query execution is a big part of what drives column
store performance, and not just the memory layout itself:
[http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf](http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf)

And then there's the magnum opus "The Design and Implementation of Modern
Column-Oriented Database Systems", which is a huge survey of the subject:
[https://stratos.seas.harvard.edu/publications/design-and-implementation-modern-column-oriented-database-systems](https://stratos.seas.harvard.edu/publications/design-and-implementation-modern-column-oriented-database-systems)

~~~
jacquesnadeau
You nailed it. The most exciting part about all of this is being able to move
between a "data science" context and a "database" context (and back again)
without pain or penalty.

------
jacquesnadeau
I'm also a developer on Arrow
([https://github.com/jacques-n/](https://github.com/jacques-n/)), similar to
WesM. It is always rewarding (and also sometimes challenging) to hear how
people understand or value something you're working on.

I think Dan's analysis is evaluating Arrow from one particular and fairly
constrained perspective of "if using Arrow and Parquet for RDBMS purposes,
should they exist separately". I'm glad that Dan comes to a supportive
conclusion even with a pretty narrow set of criteria.

If you broaden the criteria to all the different reasons people are
consuming/leveraging/contributing to Arrow, the case only becomes more clear
for its existence and use. As someone who uses Arrow extensively in my own
work and professionally
([https://github.com/dremio/dremio-oss](https://github.com/dremio/dremio-oss)),
I find many benefits, including two biggies: processing speed AND
interoperability (now two different apps can share in-memory data without
serialization/deserialization or a duplicate memory footprint). And best of
all, the community is composed of collaborators trying to solve similar
problems. When you combine all of these, Arrow is a no-brainer as an
independent community, and it is developing quickly because of that (80+
contributors, many language bindings (6+), and more than 1300 GitHub stars
in just a short amount of time).

------
rectang
The Apache Software Foundation has no problem with hosting competing projects.
_There is no top-level technical direction by the Board or any other entity._
There's no "steering committee" or the like where companies pay to play and
push the organization to pump up their preferred project.

This is one of the fundamental reasons that the ASF is successful. Any time
you see the ASF criticized for hosting competing projects without addressing
this point, feel free to dismiss the critique as facile and uninformed.

~~~
wesm
Apache Arrow is not competing with Apache Parquet or Apache ORC.

~~~
osi
and _even if it were_, the ASF doesn't mind. contributors are free to dedicate
their time to whatever they choose.

------
chubot
This point is interesting:

 _So indeed, it does not matter whether the data is stored on disk or in
memory --- column-stores are a win for these types of workloads. However, the
reason is totally different._

He says that for tables on disk, column layout is a win due to less data
transferred from disk. But for tables in memory, the win is due to the fact
that you can vectorize operations on adjacent values.
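
A minimal C++ sketch of that distinction (illustrative; it assumes the
thread's 24-byte, 6-column rows): the column-layout loop is trivially auto-
vectorizable, while the row-layout loop strides 24 bytes between the values
it needs:

    #include <cstdint>
    #include <vector>

    struct Row { uint32_t a, b, c, d, e, f; };  // a 24-byte row

    // Row layout: consecutive `a` values are 24 bytes apart, so compilers
    // generally leave this loop scalar.
    uint64_t sum_a_rows(const std::vector<Row>& rows) {
        uint64_t s = 0;
        for (const Row& r : rows) s += r.a;
        return s;
    }

    // Column layout: the values are contiguous, so at -O2/-O3 compilers
    // auto-vectorize this loop with SSE/AVX.
    uint64_t sum_a_col(const std::vector<uint32_t>& a) {
        uint64_t s = 0;
        for (uint32_t v : a) s += v;
        return s;
    }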

I also think the convergence of these two approaches is interesting: one is
from a database POV; the other is from a programming language POV.

Column Databases came out of RDBMS research. They want to speed up SQL-like
queries.

Apache Arrow / Pandas came at it from the point of view of R, which descended
from S [1]. It's basically an algebra for dealing with scientific data in an
interactive programming language. In this view, rows are observations and
columns are variables.

It does seem to make sense for there to be two separate projects now, but I
wonder if eventually they will converge. Probably one big difference is what
write operations are supported, if any. R and I think Pandas do let you mutate
a value in the middle of a column.

\-----

On another note, I have been looking at how to implement data frames. I
believe the idea came from S, which dates back to 1976, but I only know of a
few open source implementations:

\- R itself -- the code is known to be pretty old and slow, although rich with
functionality.

\- Hadley Wickham's dplyr (the tbl type is an enhanced data.frame). In C++.

\- The data.table package in R. In C.

\- Pandas. In a mix of Python, Cython, and C.

\- Apache Arrow. In C++.

\- Julia DataFrames.jl.

If anyone knows of others, I'm interested.

[1]
[https://en.wikipedia.org/wiki/S_(programming_language)](https://en.wikipedia.org/wiki/S_\(programming_language\))

~~~
Notre1
For data frame implementations, you can also look at Spark's Dataset/DataFrame
and Graphlab Create's SFrame.

The company/team behind Graphlab Create was bought by Apple and the open
sourced components haven't been updated since then. Because of that, I
wouldn't use it in production, but if you are just looking for functioning
implementations to compare, that gives you one more.

~~~
chubot
Thanks, I'd never heard of Graphlab Create. It is a substantial piece of code!
It says it's "out of core", which means it's probably more similar to
Parquet/ORC than Arrow. But still interesting.

[https://github.com/turi-code/SFrame/tree/master/oss_src/sframe](https://github.com/turi-code/SFrame/tree/master/oss_src/sframe)

For comparison, dplyr and arrow:

[https://github.com/tidyverse/dplyr/tree/master/src](https://github.com/tidyverse/dplyr/tree/master/src)

[https://github.com/apache/arrow/tree/master/cpp/src/arrow](https://github.com/apache/arrow/tree/master/cpp/src/arrow)

C++ does seem to be useful for stuff like this.

------
tedmiston
A pretty glaring flaw or omission in the analysis is using a table that's just
6 columns wide. Tables used in data analytics workloads are much wider; 50-100
columns or more is common. That number of columns means scanning significantly
more data for row-oriented storage.

~~~
hyperpape
Depending on the details, I'd wonder if you'd just end up pulling in a single
cache line from the row, regardless of how wide the row is. Even in his
benchmark, you're wasting a good 5/6ths of the memory bandwidth you're using
(pulling a 24 byte row and only examining 4 bytes of it).

------
tshiran
I would argue that the fact that Arrow has been integrated into so many
projects over the last year is proof that a separate project made sense.
Dremio, PySpark, Pandas, MapD, NVIDIA GPU Data Frame, Graphistry, Turbodbc,
...

~~~
rch
Here's the conclusion of the article:

 _It makes sense to keep Arrow and Parquet/ORC as separate projects, while
also continuing to maintain tight integration._

You might enjoy reading it to see why.

------
d--b
Before people get all negative about the article, be sure to read it, because
the author does answer that yes, it does make sense.

~~~
ellisv
> Therefore, it makes sense to keep Arrow and Parquet/ORC as separate
> projects, while also continuing to maintain tight integration.

I wonder how many people will be lost before getting to the last line.

------
bazizbaziz
FWIW it should be possible to vectorize the search across the row store. With
24-byte tuples (assuming no inter-row padding) you can fit 2.6 rows into a
64-byte cache line (a 512-bit SIMD register). Then it's just a matter of
proper bit masking and comparison. Easier said than done, I figure, because
that remainder is going to be a pain. Another approach is to use gather
instructions to load the first column of each row, effectively packing the
first column into a register as if it were loaded from a column store, and
then do a vectorized compare as in the column-store case.

All of that to underscore it's not that one format vectorizes and the other
doesn't. The key takeaway here is that with the column store, the compiler can
_automatically_ vectorize. This is especially a bonus for JVM based languages
because afaik there is no decent way to hand-roll SIMD code without crossing a
JNI boundary.

~~~
glangdale
This isn't that hard. A sane person would do 3 cache lines at once, as 192
bytes = 8 x 24-byte tuples. You would do 3 AVX512 loads, a masked compare at
the proper places (actually, I think the masked compare instructions can take
a memory operand, which might get you uop fusion so the load+compare is one
instruction), yielding 3 masks each with 16 bits (of course, most of the bits
would be junk). The 16-bit masks can be shift+or'ed together (whether as
"k-regs" or in GPRs) and the correct bits can be extracted with PEXT.

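In rough C++ with AVX-512 intrinsics, that might look like the following (an
illustrative sketch, not glangdale's code; it popcounts matches rather than
extracting row indices with PEXT, and the function name and layout
assumptions are mine):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstddef>

    // Count rows whose first column equals `key`, over 24-byte (6 x u32)
    // tuples. Each iteration covers 192 bytes = 3 cache lines = 8 tuples.
    std::size_t count_matches(const uint32_t* rows, std::size_t n_rows,
                              uint32_t key) {
        const __m512i k = _mm512_set1_epi32((int)key);
        // First-column positions within each of the three 16-lane loads:
        const __mmask16 m0 = 0x1041;  // u32 indices 0, 6, 12
        const __mmask16 m1 = 0x4104;  // u32 indices 18, 24, 30 -> lanes 2, 8, 14
        const __mmask16 m2 = 0x0410;  // u32 indices 36, 42     -> lanes 4, 10
        std::size_t hits = 0, i = 0;
        for (; i + 8 <= n_rows; i += 8) {
            const uint32_t* p = rows + i * 6;
            hits += __builtin_popcount(
                _mm512_mask_cmpeq_epi32_mask(m0, k, _mm512_loadu_si512(p)));
            hits += __builtin_popcount(
                _mm512_mask_cmpeq_epi32_mask(m1, k, _mm512_loadu_si512(p + 16)));
            hits += __builtin_popcount(
                _mm512_mask_cmpeq_epi32_mask(m2, k, _mm512_loadu_si512(p + 32)));
        }
        for (; i < n_rows; ++i)  // scalar tail for the leftover rows
            hits += (rows[i * 6] == key);
        return hits;
    }
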
The downside of this is that you are still reading 6 times as much data. A
straightforward implementation of this should not be CPU bound IMO. If a
Skylake Server can't keep up with memory doing 32-bit compares I'll eat my
hat.

Gather is _not_ a good idea for this purpose. Gather is very expensive. It's
really mainly good for eliminating pointer chasing and keeping a SIMD
algorithm in the SIMD domain.

------
dgudkov
I've built a columnar in-memory data transformation engine [1] and wrote [2]
about the need for a common columnar data format to avoid re-compression. The
problem I have with the Apache projects is that they all require strongly
typed columns (if I'm not missing something). In our case, the app is actively
used for processing spreadsheets and sometimes XML files, which requires
supporting mixed data types (e.g. timestamps and strings) in a column.

Another issue is storing 128-bit decimals. They are preferable for financial
calculations, but are not supported in the Apache projects.

So maybe we need a fourth standard for columnar data representation. Or expand
the existing ones to make them more versatile.

[1] [http://easymorph.com/in-memory-engine.html](http://easymorph.com/in-memory-engine.html)

[2] [http://bi-review.blogspot.ca/2015/06/the-world-needs-open-source-columnar.html](http://bi-review.blogspot.ca/2015/06/the-world-needs-open-source-columnar.html)

~~~
jacquesnadeau
Arrow supports a union type for heterogeneous columns (we use it for random
JSON in Dremio) and a 128-bit decimal.
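
For a rough idea of how that is declared with the Arrow C++ type factories
(a sketch; these factory names follow the current C++ API and have shifted
across releases):

    #include <arrow/api.h>

    // A union column whose values can be either a timestamp or a UTF-8
    // string, e.g. a spreadsheet column holding mixed types.
    std::shared_ptr<arrow::DataType> mixed = arrow::dense_union({
        arrow::field("ts",  arrow::timestamp(arrow::TimeUnit::MILLI)),
        arrow::field("str", arrow::utf8())});

    // A 128-bit decimal type with precision 38 and scale 9.
    std::shared_ptr<arrow::DataType> money = arrow::decimal128(38, 9);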

------
glangdale
It's an interesting article, but surely a modern CPU should be able to manage
more than 4 u32 compares per cycle? Any machine with AVX2 (or beyond) should
be able to do 2x256-bit AVX2 VPCMPEQD instructions (or 1x512 with AVX-512).
The code to marshal up the results of this compare and do something useful
with it would push out to another cycle or two, but IMO we can surely manage
to compare an entire cache line (64 bytes) in 1 cycle's worth of work.
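
Something along these lines in C++ with AVX2 intrinsics (an illustrative
sketch; the function name is mine) covers a full 64-byte cache line of u32
column values with two VPCMPEQD instructions:

    #include <immintrin.h>
    #include <cstdint>

    // Compare 16 contiguous u32 column values (one 64-byte cache line)
    // against `key`; returns a 16-bit bitmap of matches.
    uint32_t match16(const uint32_t* col, uint32_t key) {
        __m256i k  = _mm256_set1_epi32((int)key);
        __m256i e0 = _mm256_cmpeq_epi32(
            _mm256_loadu_si256((const __m256i*)col), k);
        __m256i e1 = _mm256_cmpeq_epi32(
            _mm256_loadu_si256((const __m256i*)(col + 8)), k);
        // movemask_ps takes the sign bit of each 32-bit lane: one bit per value.
        uint32_t m0 = (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(e0));
        uint32_t m1 = (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(e1));
        return m0 | (m1 << 8);
    }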

This doesn't invalidate his point, and there are even more interesting things
that can be done with packed data formats - and of course if you're searching
a 6-bit column for some data item (or set of data items) you might be even
happier to be using a columnar form (especially if the row contains lots of
stuff: the 6:1 ratio in the article might be understating the advantage of
working at the column level).

------
nartz
I don't have any strong feelings about this either way, but the main question
in my head after reading this is:

\- Are the differences SIGNIFICANT enough to support two (three?) different
codebases? A lot of your points seem to be more about building in
configurations/plugins versus a completely different (core) storage format.
For instance, adding in separate plugins for different dictionary
compressions, or being able to specify the block read size or schemes. I would
just think you may be spending a lot of time reinventing 80% of the wheel and
20% on 'memory specific' functionality.

(I'm naively oversimplifying, but it's something to think about).

------
dogruck
I think the author comes across as tone deaf for two reasons:

1\. Burying the lede, by tacking the affirmative conclusion on as the final
sentence,

2\. This quote reads as pompous, not humble:

> I assume that the Arrow developers will eventually read my 2006 paper on
> compression in column-stores and expand their compression options to include
> other schemes which can be operated on directly (such as run-length-encoding
> and bit-vector compression). I also expect that they will read the X100
> compression paper which includes schemes which can be decompressed using
> vectorized processing.

------
lowbloodsugar
If the CPU test is simply reading 1/6 of the data, then it should be memory
that is the bottleneck and not the CPU (even unoptimized). Something smells
very wrong about his ad-hoc test. The next piece of data is in the cache line
seven times out of eight. And it's reading 1/6 as much data from memory. It
should be way faster even without -O3. And if there's something fundamentally
broken about the code without -O3, then _why post that benchmark at all_?
Seems dishonest.

It'd be good to post the code when making such claims.

~~~
zlynx
I thought it made the point that if you aren't vectorizing your code then it
hardly matters if you use an optimized data format.

SSE makes a huge difference. With AVX and multiple threads you can actually
exceed the memory bandwidth of a CPU socket, which looks funny in performance
graphs as adding threads to cores suddenly stops scaling.

------
dkural
Columnar data representation can be optimized and designed for a wide range of
query and data types that have non-aligned constraints. So yes, you can have
3, or 12 for that matter. I think we'll see more & more purpose-specific
database technologies as gains from Moore's law and its analogues start
slowing down, and some previously niche application areas become high-$ areas
with larger user bases.

------
jetblackio
Not sure if this is a good place to ask, but how do Apache Arrow and Parquet
compare to Apache Kudu ([https://kudu.apache.org/](https://kudu.apache.org/))?
Seems like all three are columnar data solutions, but it's not clear when
you'd use one over the other.

Kind of surprised the article didn't mention Kudu for that matter.

~~~
massaman_yams
This covers the distinction a bit better:
[https://www.slideshare.net/HadoopSummit/the-columnar-era-leveraging-parquet-arrow-and-kudu-for-highperformance-analytics](https://www.slideshare.net/HadoopSummit/the-columnar-era-leveraging-parquet-arrow-and-kudu-for-highperformance-analytics)

~~~
jetblackio
Right on, this is perfect. Thanks!

~~~
jacquesnadeau
One quick note to make on this. Kudu is a storage implementation (similar to
Parquet in some ways). Arrow isn't about persistence and is actually built to
be complementary to both Kudu and Parquet.

Also note: Kudu runs as a distributed service. Arrow and Parquet are libraries
that can be embedded into your existing applications.

------
bradhe
> However, a modern CPU processor runs at approximately 3 GHz --- in other
> words they can process around 3 billion instructions a second. So even if
> the processor is doing a 4-byte integer comparison every single cycle, it is
> processing no more than 12GB a second

This is so laughably simplified.

------
polskibus
Does Arrow itself support and optimize for multicore processing? Or is that
the responsibility of a higher layer like Dremio? If so, do such layers
optimize Arrow query execution to utilize multiple cores as much as possible?

------
js8
I wonder if one could have an FPGA attached to the CPU, load a piece of code
into it for pipelined decompression and processing of a chunk of compressed
column store, and then vroom! It would process data really fast.

~~~
aheilbut
Like Netezza?

~~~
js8
Yeah, but I want to have it in my home computer. Any application could use
this thing.

------
polskibus
How can I query Apache Arrow without the entire Hadoop stack? It seems it
could be a great in-memory OLAP engine, if only there were an efficient way to
slice and dice it.

~~~
jacquesnadeau
We built Dremio (github.com/dremio/dremio-oss, Apache licensed) entirely on
top of Apache Arrow, specifically for the purpose of creating high-speed
analytical capabilities, including MOLAP-like work as well as other forms of
caching/acceleration for analytical workloads. Other products/projects are
also starting to adopt a similar technical architecture.

~~~
polskibus
Dremio looks very interesting indeed. What would you recommend for interacting
with Arrow with more control, as a library? I'm interested in creating new
Arrow-based data sources, not using it as an intermediary to other data
sources.

On a side note - what other products/projects did you mean?

~~~
jacquesnadeau
The Arrow project itself is a set of libraries. One of the things we'll try to
do over time is add more algorithms to it, so that if you want, say, a fast
Arrow sort or Arrow predicate application, it's there. Full SQL is always far
more complex, and I can't see the project itself taking that on.

The engine inside of Dremio is something we call Sabot (a shoe for modern
arrows; see "sabot round" on Wikipedia). We hope to make it modular enough one
day to use as a library, but it isn't there yet.

In regards to your other question re: projects/products: Arrow contributors
are actively trying to get more adoption of Arrow as an interchange format for
several systems. We've had discussions around Kudu (no serious work done yet
afaik). Parquet-to-Arrow for multiple languages is now available. Arrow
committers include committers from several other projects such as HBase,
Cassandra, Phoenix, etc. The goal is ultimately to figure out integrations
with all of them.

In most cases, these data storage systems are saddled with slow interfaces for
data access. (Think row-by-row, cell-by-cell interfaces.) Arrow, among other
things, allows them to communicate through a much faster mechanism (shared
memory--or at least a shared representation if not node-local).

~~~
polskibus
How does dremio differ from PrestoDB? As far as I know, PrestoDB can also
virtualize access to many data sources and join data between them. We didn't
go deep with PrestoDB because our basic tests for multi-source joins ran very
slowly, and it seemed to pull all data from both joined tables into one place.
I'm not a Prestodb expert, so maybe there's a better way to do it (all
suggestions welcome).

What's the differentiator? Is Dremio smarter somehow, avoiding copying all the
data to perform a simple join? Or does it copy the data the same way, but
Arrow lets it be faster than Presto? What's on your roadmap?

~~~
jacquesnadeau
PrestoDB is similar to Impala, Hive and other SQL engines. Each is designed to
do distributed SQL processing. Dremio does embed an OSS distributed SQL
processing engine (Sabot, built natively on Arrow) as well, but we see that as
only a means to an end. Our focus is much more on being a BI & data
fabric/service.

At the core of this vision are: very advanced pushdowns (far beyond other OSS
systems), a powerful self-service UI for managing, curating and sharing data
(designed for analysts, not just engineers) and--most importantly--the first
open source implementation of distributed relational caching for all types of
data. You can see more details about this last part in a deck I presented at
DataEngConf earlier today:
[https://www.slideshare.net/dremio/using-apache-arrow-calcite-and-parquet-to-build-a-relational-cache-81440786](https://www.slideshare.net/dremio/using-apache-arrow-calcite-and-parquet-to-build-a-relational-cache-81440786)

------
fs111
Well, there is another one here:
[https://carbondata.apache.org/](https://carbondata.apache.org/)

------
alexandercrohde
Let me give the benefit of the doubt instead of simply doubting -

Can somebody provide a justification for why performance-centric
implementation details justify an entire new project? Couldn't this be done as
simply a storage engine? For that matter, couldn't all columnar datastores
merely be storage engines?

~~~
jacquesnadeau
It's more than an implementation detail because we're also targeting
interoperability between multiple separate technologies. One of the key things
that the article didn't fully cover is that Arrow serves two purposes: high
performance processing and interoperability.

A key part of the vision is: two systems can share a representation of data to
avoid serialization/deserialization overhead (and potentially copying in a
shared memory environment). This is only possible if the in-memory format is
also highly efficient for processing. This allows the processing systems (say
Pandas and Dremio) to share a representation, both process against it, and
then move the data between each other with zero overhead.

If you shared the data representation on the wire but then each application
had to transform it into a better structure for processing, you'd still be
paying for a form of ser/deser. By using Arrow for both processing and
interoperation, you benefit from near-zero-cost movement between systems and
also a highly efficient representation for processing data (including some
tools to get you started in the form of the Arrow libraries).
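
As a concrete illustration of that no-copy handoff, here's a minimal sketch
using the Arrow C++ IPC stream format (my own example against the current C++
API, not Dremio's code): opening the stream yields zero-copy views into the
buffer rather than deserialized copies.

    #include <arrow/api.h>
    #include <arrow/io/memory.h>
    #include <arrow/ipc/reader.h>
    #include <arrow/ipc/writer.h>

    // Write a record batch in the IPC stream format, then map it back.
    arrow::Status RoundTrip(const std::shared_ptr<arrow::RecordBatch>& batch) {
        ARROW_ASSIGN_OR_RAISE(auto sink, arrow::io::BufferOutputStream::Create());
        ARROW_ASSIGN_OR_RAISE(auto writer,
                              arrow::ipc::MakeStreamWriter(sink, batch->schema()));
        ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
        ARROW_RETURN_NOT_OK(writer->Close());
        ARROW_ASSIGN_OR_RAISE(auto buf, sink->Finish());

        // A second process could map `buf` from shared memory; opening the
        // stream does not copy or decode the column data.
        auto source = std::make_shared<arrow::io::BufferReader>(buf);
        ARROW_ASSIGN_OR_RAISE(auto reader,
                              arrow::ipc::RecordBatchStreamReader::Open(source));
        ARROW_ASSIGN_OR_RAISE(auto view, reader->Next());  // zero-copy view
        return arrow::Status::OK();
    }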

