One of the lead Arrow developers here (https://github.com/wesm). It's a little bit disappointing for me to see the Arrow project scrutinized through the one-dimensional lens of columnar storage for database systems -- i.e. considering Arrow to be an alternative technology (i.e. part of the same category of technologies) to Parquet and ORC.
The reality (at least from my perspective, which is more informed by the data science and non-analytic database world) is that Arrow is a new, category defining technology. So you could choose to use it as a main memory storage format as an alternative for Parquet/ORC if you wanted, but that would be one possible use case for the technology, not the primary one.
What's missing from the article is the role and relationship between runtime memory formats and storage formats, and the costs of serialization and data interchange, particularly between processes and analytics runtimes. There are also some small factual errors about Arrow in the article (for example, Arrow record batches are not limited to 64K in length).
I will try to write a lengthier blog post on wesmckinney.com when I can going deeper into these topics to help provide some color for onlookers.
In fact, the conclusion of the article is more positive than the headline makes it out to be: "Therefore, it makes sense to keep Arrow and Parquet/ORC as separate projects, while also continuing to maintain tight integration."
I think he buried the lede a bit with the title - his answer being "yes, we should have a separate format for this." The way he phrased his points was a bit odd and seemed inimical at times even though in the end it wasn't - e.g. he mentioned the X100 paper and his C-store compression paper which both talk about lightweight in-memory compression schemes which would suit Arrow's use case well, but then he goes back and says "Arrow probably won't support gzip" (which is much more heavyweight but offers better compression ratio and is more suitable for disk based storage formats) - OK, so that's fine and to be expected then? It turns out, yep, it's expected.
His main idea I took away was dispelling the notion that we should have instead been putting effort into slightly repurposing an existing format for our in-memory data layout.
It's definitely exciting to see the data science world and the databases world finally interacting a lot more - I think each has a lot to learn from the other. Battle tested techniques on one side, and entirely new use cases to deal with on the other.
---
For the curious: Abadi is a well-known name in the databases community, especially with regards to column stores. A few papers that he's co-authored that I like:
"Column-Stores vs. Row-Stores: How Different Are They
Really?" - speaks about how a different style of query execution is a big part of what drives column store performance, and not just the memory layout itself: http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf
You nailed it. The most exciting part about all of this is being able to move between a "data science" context and a "database" context (and back again) without pain or penalty.
Thanks wesm for clearing this. My understanding (albeit limited) of Arrow was that it would complement Parquet. An example would be speeding up the in-memory representation of Parquet files. I believe mitigating the serialization costs will also help projects like PySpark.
I was under the impression that Apache Arrow was supposed to be the unified storage representation of data that can be used across Apache projects.
i.e HBase stores data as Apache Arrow, which can be directly queried by Spark or Hive without the need for serialization. As in eliminating the current overhead of going from HBase HFiles to Spark's internal RDD/DataFrame representation with lots of ser/deser going on on both fronts.
I think the comparisons to Parquet and ORC are surface level, I expected Arrow to become the "One storage to rule-them-all".
Arrow is all about in-memory, not long-term persistence. Systems can write it to disk but it is the one in-memory representation to rule them all, not storage/disk. Disk has its own requirements and challenges outside the scope of Arrow.
I'm also a developer on Arrow (https://github.com/jacques-n/), similar to WesM. It is always rewarding (and also sometimes challenging) to hear how people understand or value something you're working on.
I think Dan's analysis is evaluating Arrow from one particular and fairly constrained perspective of "if using Arrow and Parquet for RDBMS purposes, should they exist separately". I'm glad that Dan comes to a supportive conclusion even with a pretty narrow set of criteria.
If you broaden the criteria to all the different reasons people are consuming/leveraging/contributing to Arrow, the case only becomes more clear for its existence and use. As someone who uses Arrow extensively in my own work and professionally (https://github.com/dremio/dremio-oss), I find many benefits including two biggies: processing speed AND interoperability (now two different apps can share in-memory data without serialization/deserialization or duplicate memory footprint). And best of all, the community is composed of collaborators trying to solve similar problems, etc. When you combine all of these, Arrow is a no brainer as an independent community and is developing quickly because of that (80+ contributors, Many language bindings (6+) and more than 1300 github stars in just a short amount of time).
The Apache Software Foundation has no problem with hosting competing projects. There is no top-level technical direction by the Board or any other entity. There's no "steering committee" or the like where companies pay to play and push the organization to pump up their preferred project.
This is one of the fundamental reasons that the ASF is successful. Any time you see the ASF criticized for hosting competing projects without addressing this point, feel free to dismiss the critique as facile and uninformed.
So indeed, it does not matter whether the data is stored on disk or in memory --- column-stores are a win for these types of workloads. However, the reason is totally different.
He says that for tables on disk, column layout is a win due to less data transferred from disk. But for tables in memory, the win is due to the fact that you can vectorize operations on adjacent vaules.
I also think the convergence of these two approaches is interesting: one is from a database POV; the other is from a programming language POV.
Column Databases came out of RDBMS research. They want to speed up SQL-like queries.
Apache Arrow / Pandas came at it from the point of view of R, which descended from S [1]. It's basically an algebra for dealing with scientific data in an interactive programming language. In this view, rows are observations and columns are variables.
It does seem to make sense for there to be two separate projects now, but I wonder if eventually they will converge. Probably one big difference is what write operations are supported, if any. R and I think Pandas do let you mutate a value in the middle of a column.
-----
On another note, I have been looking at how to implement data frames. I believe the idea came from S, which dates back to 1976, but I only know of a few open source implementations:
- R itself -- the code is known to be pretty old and slow, although rich with functionality.
- Hadley Wickham's dplyr (the tbl type is an enhanced than data.frame). C++.
My objective, in fact, is to see the data science world unify data frame representations around Apache Arrow, so code written for Python, Julia, R, C++, etc. will all be portable across programming languages. See https://www.youtube.com/watch?v=wdmf1msbtVs
For data frame implementations, you can also look at Spark's Dataset/DataFrame and Graphlab Create's SFrame.
The company/team behind Graphlab Create was bought by Apple and the open sourced components haven't been updated since then. Because of that, I wouldn't use it in production, but if you are just looking for functioning implementations to compare, that gives you one more.
Thanks, I never heard of Graphlab Create. It is a substantial piece of code! It says it's "out of core", which means it's probably more similar to Parquet/ORC than Arrow. But still interesting.
I thought GraphLab Create had some nice functionality, it is too bad that the Apple acquisition appears to have effectively killed the publicly available product. It doesn't look like you can even pay for the commercial part anymore.
Its not publicly available, but I'm also working on an implementation of data-frame-esque structures.
You've got most of the ones I'm familiar with there, barring arguably the SAS data set. Albeit that one is "row based", in terms of storage, like the database tables of yore, albeit the application of formats/types is done intellectually on a columnar level.
Oh cool, what resources are you using to design the data structures? Is it just what you know from using data frames, existing code, etc.? Or any books or journal articles?
The main thing I know of is that R's data.table package has the concept of indices on columns like SQL. R doesn't have that.
I don't know if any of the others like Pandas, Arrow, or dplyr have it.
I just Googled and found this article and a very old gist:
I would guess that most implementations don't have indices, and that might be an indication that they're not essential. But for someone with a computer science background it seems annoying not to use hash tables when they are so obviously applicable.
The short answer is I'm using my own experience in the field and the limitations/frustrations I've had working with the existing ones :)
As you've said, I think the inclusion of indices comes from people used to working with more of a database background speeding up frequent common queries. SAS data sets can be indexed, but I don't think there's anything intellectually to be gained there that isn't gotten just from general database indexes, since SAS data sets are designed to be on-disk data structures and operate very similarly to database tables.
Talking completely off the cuff here, I imagine it has to do with the fact that these are mostly designed to be in-memory/columnar based structures, by people doing particular kinds of analysis on those structures. I'm not saying indices can't play a part, but that given the context, they're mainly in the realms of micro and query optimisations.
I think one has to ask what kind of use case would an index support, and what does it gain in efficiency vs take away in complexity?
For in memory columnar analysis on highly mutable data structures, its not quite so clear that it gains that much.
First of all, if you're scanning an entire column, you're not gaining anything from indexes. Indeed, you're losing out because of the cost/space to store one.
Second of all, if you're updating or mutating the data structure frequently in memory, then the cost of re-indexing the data structure needs to be paid each time. A cost even more questionable when in-memory access is already pretty fast. And in a lot of this field, that's a common use case.
Thirdly, if you're just randomly accessing into points within the column, well usually columns are set out as collections of continuous arrays of memory. To offset into the array at a particular index i is likely faster than the double indirection of looking up an index (also in memory) and then offsetting into another point in the original array.
R and the like already have the concept of indexing into a data frame using an array of integers, which you could argue not only covers the concept of an index, but possibly might even be more efficient than storing them in a hash.
Lets say you want to index on a string or a complex object/combination of objects/values? Well there's a potential use case (since a string doesn't usually index straight into a low-level continuous array), but the likes of R does have the concept of row names (albeit, i can't remember how efficiently its implemented), which already lets you access a dataframe notionally based on the name of a row.
Indeed, that's one of the places hashes have in my own implementation: figuring out how to index into the objects via strings labels. But I deliberately left the inclusion of row/column names optional (though column names are included by default since that seems to be the defacto norm).
In practice, many programmers/programs in this space either legitimately don't need or don't use the concept.
Indeed, I struggled quite intently whether to include rowname indexing as a concept at all, since I personally have barely used them in my day to day life for the last 10+ years. But I can't say there isn't a use for them. I suppose transition matrices on real world entities come to mind...
Yeah row names are another good example. I've never used them, so I would be inclined to leave them out. But I'm not sure what other people use them for.
Maybe they are used for fast lookup? In other words, its like an index? If you want to retrieve a particular row, it's usually an O(n) scan. But maybe row names are like a limited kind of index and you can retrieve by name in O(1) time.
I tend to follow Hadley Wickham's model -- column labels are essential, but I'm pretty sure he would say row names are not in the model? I haven't seen it addressed actually.
If I'm understanding correctly the transition matrix example, I think data frames and matrices are vastly different. And it makes sense to have labels for every dimension on matrices, but not data frames.
What language are you implementing your data frame library in? I'm curious about both the API and the algorithms. I looked at Hadley's newer dplyr code, and it's mostly in C++., as opposed to plyr which was mostly in R.
Are you a R user? If I were doing this (which I may but not for like a year), I would probably use Hadley's API and model the code after his. I'm curious if you have any criticisms of the way he does things.
FWIW the summary of using indices is that it changes the computational complexity of an algorithm -- and O(n + m) algorithm can be turned into O(m log n), and this makes an orders of magnitude difference in running time. It's not small difference. I've used this to good effect in real programs.
It's more like data preparation than analysis, but that is a big part of what people use data frames for.
Do you have a working definition for what constitutes a 'data frame'? API/operations you can do to it, or maybe performance requirements of some sort? It seems like, for instance, a 2d array might qualify as a 'data frame' under some definitions.
Data frame is fairly well defined, but it's a common source of confusion. In a data frame, the columns are of heterogeneous types (string, integer, bool, double, factor or "enum", etc.). In a vector/matrix, every cell has the same type.
A short way to understand it is that R and Matlab aren't the same thing. They are not substitutes for one another.
In R, the core type is the data frame. In Matlab, it is the matrix, of which a 2D array is a special case. (Both Matlab and R have grown some of the others' functionality, but it's not good at all. You can tell which one is primary.)
Another way to say it: Julia at its core isn't an R replacement; it's a Matlab replacement. Apparently, the representation of missing values was a problem for DataFrames.jl. Whether or not individual entries in a vector are "nullable" has a large impact on how you write vectorized mathematical operations and how well it performs.
And if you notice that data frames sound like SQL tables, well that's exactly why these Apache projects are converging/overlapping :) They are coming from different points of view, but they have the same underlying data model.
One of my slogans is that "R is the only language without the ORM problem", because its native data model matches the data model of a relational database. I'm designing a shell which will feature data frames (eventually, probably not for a year or more):
SQL is a very limited language, with essentially no capabilities for abstraction. Writing programs using the relational model in a Turing-complete language like R or Python is very powerful.
A 2D array has a different algebra from a Dataframe. (though some operators are analogous, e.g. a transpose = a pivot)
A dataframe is basically behaves like a database table, with all the attendant operations (aggregation, pivots, filters, etc.) The basic algebra for dataframes is relational algebra plus a few other useful database operations.
The algebra for a 2D array (matrix) is linear algebra.
I think there are similarities, but I don't think one is a special case of the other. At a very high level, they are both collection data structures that share certain properties but that are optimized for very different use cases.
There are instances where the difference can be palpable. I work with legacy MATLAB code where database type data are stored in cell arrays. They are tremendously difficult to reason with for data science work because the primitive operations are all based on positional cell coordinates. (on the other hand, arrays are much easier to work with for numerical type algorithms)
If you understand the theory of coordinatized data, choosing the right coordinate system can greatly simplify math/data manipulation. A data frame is a type of dataset where the coordinates are simply dimensions in other columns, which makes it amenable to aggregations, pivots, slicing/dicing and such on mixed categorical and numerical data.
Do you have any experience with data frames in Spark? Do they work well?
My impression that it's more of an API to bolt the data frame abstraction onto whatever Apache Spark already did, rather than a native data frame representation. But I could be wrong -- I haven't used it.
Is the query language Scala or something else? As far as I remember, native Spark is like MapReduce where you write the distributed functions in the same language that the framework is written in -- Scala.
Also, I'm pretty sure Apache Spark is more immutable/functional, whereas in R and Python, data frames are mutable. But that model also has benefits, for redundancy and resilience to nodes being down, which I believe is the whole point of Spark.
Although, I have a pet-theory that you don't really need distributed data frames. For 95% or 99% of data science, use stratified sampling to cut your data down to the size of a single machine. Then load it up in R or Python. You can make almost any decision with say 4 GiB of data. Often you can make it with 4 MB or 4 KB of data. It's not the amount of data that is the problem, but the quality.
> Do you have any experience with data frames in Spark? Do they work well?
Yes, for the most part.
> Is the query language Scala or something else?
You can use Spark SQL, which is essentially SQL; Spark parses it and pumps through a query planner and optimizer. You can also call methods in the Spark API (from Python, Scala, etc.)
> I'm pretty sure Apache Spark is more immutable/functional.
Yes, but that can be a good thing. It makes data transformations easier to reason about.
> Although, I have a pet-theory that you don't really need distributed data frames.
Yes, for most data science work on small datasets, you really don't. That said, there is some work being done in the Spark community to enable Spark to work as a computation engine for local datasets. That way one can standardize on one interface for both small data and big data. As of right now though, there is a fair bit of overhead if you want to use Spark on a smaller datasets -- it's way slower than R/Python dataframes. Hopefully some of those overhead issues will be addressed.
This is why I think Arrow is such an exciting development. The promise is that you can use any compute engine on the same dataframe representation with no serialization overheads. So you can use nimbler tools like R and Python when the data is small, and Spark (or PySpark) on the same data when it needs to scale up. It abstracts your data away from your tooling while maintaining high performance.
OK thanks for the information. What I'm suggesting is that a two-phase workflow for analytics may actually make more sense than trying to unify things. (Here, I'm considering analytics as distinct from big data in products, where you really need to compute on every user's data or every web page).
Basically you have a disk-backed distributed column store with petabytes of data. At Google this is Dremel:
"Dremel: interactive analysis of web-scale datasets"
And then you write SQL queries against it to load a sample of gigabytes onto a single machine. Stratified sampling is useful here, so you can representative samples of both big groups and small groups, which tends to happen a lot. And then you use something like R, Python, or Julia to analyze that subset.
Despite the title of the Dremel Paper, it's not really good for interactive analysis. It's good for a quick "top K" query, which is sometimes all you need. But for anything beyond that, it's too slow.
And SQL is a horrible language for writing analytics in. The data frame model is much better. SQL is awesome for loading data though.
So I'm basically suggesting that the SQL/disk/distributed vs. the Programming language/memory/local distinction actually works well for most problems. You can cleanly divide your workflow into "what data do I need to answer this question?" vs. all the detailed and iterative analysis on a smaller data set.
We do need a much better interchange format than CSV though.
I'm not sure I understand exactly what you are proposing that is new. Could you help me see what I'm missing? (not being snarky, genuinely curious)
1) 2-phase workflow for analytics - Yes, this is being done today with different distributed column stores (Vertica, SQL Server columnstores etc.), queried into local dataframes. Spark can be used for that.
2) SQL good for data retrieval, dataframes better for analytics - OK. Ok, Spark can retrieve content via SQL into Spark Dataframes, and then apply RDD operations which are similar to dplyr type operations.
3) Better interchange format than CSV? OK, you can always write your SQL resultset as a Parquet file which can be natively accessed by Spark or Pandas (via PyArrow - there's Arrow again)
It's not so much proposing something new, but calling into question the assumption that this is desirable:
That way one can standardize on one interface for both small data and big data.
For big data, I believe you want a set of operations that are efficient for data extraction:
- Selecting and filtering.
- Explicit and efficient joins. Disallow pathological joins not related to data extraction!
- Sampling. Especially an efficient and accurate stratified sampling over groups. I believe this is missing from most databases, although I could be wrong since there are many of them I'm not familiar with. They typically have a fast inaccurate algorithm one, or a slow accurate one. Or they just allow you to sample an entire set of records from a table, not do "group by"-like semantics. I did what you would now call "data engineering" for many years and this was probably the #1 thing that came up over and over.
That's pretty much it as far as I can tell. The goal is just to cut your data down to manageable size, not do anything overly smart.
On the other hand, for small data, you want an expressive interface, not necessarily an efficient one. You want all the bells and whistles of R and dplyr (and I suppose Pandas, although I'm less familiar with it).
A lot of those things are hard to implement efficiently, but you need them for scripting. Moreover sometimes it's just a lot easier to use an inefficient algorithm on small data, since most of the code is throwaway.
I spent awhile looking at Apache Arrow a few weeks ago. I like the idea a lot. I feel like it may be trying to serve too many masters (but that's just a feeling). It definitely seems possible to underestimate the difference in requirements between small data and big data.
I'm saying those differences are fundamental and not accidental, and you want different APIs so that programs are aware of the differences.
Also, in most other domains there is a distinction between "interchange format" and "application format" (e.g. png/tiff files vs. photoshop files). It would be nice if that didn't exist, but it's there for a reason. This is a long argument, but some problems are more social than technical. Dumb, inefficient interchange formats solve a social problem.
Ah I wasn't sure which part of my reply was being referred to. I think there are two different phases of the data science process: (1) EDA and exploratory modeling. (2) productionizing models on the large.
I believe Arrow can be part of (1) and (2). It is an in-memory format that aims to avoid serialization to/from different use points, so it is tooling and data-size agnostic. You can still sample from it and perform your exploration and analysis.
Spark on the other hand and is mostly (2) -- it can be used for (1) but it is fairly weak at it due to overheads. It also doesn't support the full range of capabilities that dplyr (and other R packages) bring to the table. I would not use Spark for EDA or throwaway modeling.
I think we can agree to disagree but I see a lot of advantages to having a common in-memory abstraction (such as Arrow) to small/big data whose difference is only that of cardinality. We don't have to mandate the use of such a format of course, but I think having such a format solves a large class of problems for a lot of people.
At the same time, no one will stop using CSV (tabular)/JSON (hierarchical) even though these are inefficient, "dumb" formats. It is good enough for most applications and are what I consider local optimum solutions. Even Excel XLSX is a pretty good format for interchange.
I am open to changing my mind but my experiences so far have not convinced me otherwise yet.
Right, EDA is all about making quick iterations, and my point is that I can't think of ANY use cases for EDA on more than a machine's worth of data. (Other than "top" queries, which are useful, but SQL will give you that. You don't need data frames for simple analysis.)
In my experience, when people think they want to "explore" hundreds of terabytes of data (if that's even possible), they either don't have a good method of sampling (due to database limitations), or they haven't considered what kind of sampling they need to answer the question. I'm open to counterexamples where you need terabytes or petabytes of data and you can't sample, but I can't think of any myself.
And yes R is too inefficient to handle some problems on a single machine (or multiple machines), but data.table and dplyr are more efficient. I guess one anti-pattern I would see is -- R is too inefficient, let's run R on multiple machines! I think the better solution is to make use of your machine better, with packages like data.table and dplyr (or a future platform based on Arrow). I think it's easy to underestimate the practical problems of distributed computing.
You should really avoid distributed computing when you can!
And I agree that once you reach production, you do need different tools. In my experience, when you are dealing with hundreds of terabytes or petabytes of data, you're writing a fairly optimized program with a big data framework. NOT using data frames. Big data frameworks are already pretty inefficient -- as you mention, if you run them on a local machine, you see evidence of that. This is true of all the frameworks at Google written in C++ too. When you add the data frame abstraction, it's only going to get worse. Every layer adds some overhead.
So I am very big on data frames for data analysis obviously, but I question the value of distributed data frames.
Bringing it back to the article, having 2 abstractions definitely makes sense. You want a distributed disk-backed column store for storing huge amounts of data permanently. And then you want an in-memory format like Arrow for iteratively analyzing data using the data frame abstraction.
What's not clear to me is that you want a distributed in-memory abstraction as well. It's possible but I think there are so many other problems in the tools for data science -- I can easily name 10 problems with both R and Python ecosystems that are more important than distributed data frames.
To play devil's advocate, I do see the value in not rewriting your analysis when you go from prototype to production. I'm pretty big on that -- I like to ship the prototypes. That's pretty much the philosophy of Julia -- the prototype should be efficient.
But I see multiple challenges there for data science. Even if you had distributed data frames, it wouldn't solve that problem. For one thing, most analysts don't write their code with things like indices in mind, see my other comments here:
If the code could benefit from indices, and is not using them (like most programs that use data frames), then IMO they need major surgery anyway. It's not just a matter of flipping a switch and magically going from small to big. But yes I think the general idea of shipping your prototypes is appealing.
> Although, I have a pet-theory that you don't really need distributed data frames. For 95% or 99% of data science, use stratified sampling to cut your data down to the size of a single machine. Then load it up in R or Python. You can make almost any decision with say 4 GiB of data. Often you can make it with 4 MB or 4 KB of data. It's not the amount of data that is the problem, but the quality.
To add fuel to that fire, you can get AWS EC2 instances with 4TB of main system RAM now. At least where inference is the goal, the times that you'd need exotic systems are truly rare indeed.
Data frames do work well. You are correct that it is mainly an abstraction on top of the existing spark RDD representation. The language is Scala, though a slightly strange string/macro-based sql-ish API. Perhaps you can tell from that description I'm a bit luke-warm on it.
There are a few big benefits to immutability. Redundancy and resilience is a big one, but you also get some improved performance because the compiler and planner can more aggressively optimize, and do things like predicate push-down since it doesn't have to materialize everything. It also gets you a lot of the more general programming benefits of immutability/FP in preventing bugs and such.
I somewhat agree with your point about data volume. The situations that require Spark are relatively infrequent. But they are becoming more common. A lot of data is required when you start getting into building more complex machine learning models. But for a lot of your run-of-the-mill regression or basic understanding analysis, Spark is way overkill.
A pretty glaring flaw or omission in the analysis is using a table that's just 6 columns wide. Tables used in data analytics workloads are much wider, 50-100 columns or more is common. That number of columns means scanning significantly more data for the row-oriented storage.
Depending on the details, I'd wonder if you'd just end up pulling in a single cache line from the row, regardless of how wide the row is. Even in his benchmark, you're wasting a good 5/6ths of the memory bandwidth you're using (pulling a 24 byte row and only examining 4 bytes of it).
I would argue that the fact that Arrow has been integrated into so many projects over the last year is proof that a separate project made sense. Dremio, PySpark, Pandas, MapD, NVIDIA GPU Data Frame, Graphistry, Turbodbc, ...
Well... use in projects is an indicator, not a proof, of utility. Not all choices are good ones, inferior & redundant products are crop up on a regular basis, and to the dismay of many in this community, sometimes win out. Happily, from the article, Arrow, doesn't appear one of them, and has its own unique and justified niche.
FWIW it should be possible to vectorize the search across the row store. With 24 byte tuples (assuming no inter-row padding) you can fit 2.6 rows into a 64 byte cache line (a 512 bit simd register). Then it's just a matter of proper bit masking and comparison. Easier said than done, I figure because that remainder is going to be a pain. Another approach is to use gather instructions to load the first column of each row and effectively pack the first column into a register as if it were loaded from a row store and then do a vectorized compare as in the column-store case.
All of that to underscore it's not that one format vectorizes and the other doesn't. The key takeaway here is that with the column store, the compiler can automatically vectorize. This is especially a bonus for JVM based languages because afaik there is no decent way to hand-roll SIMD code without crossing a JNI boundary.
This isn't that hard. A sane person would do 3 cache lines at once as 192 bytes = 8 24 byte tuples. You would do 3 AVX512 loads, a masked compare at the proper places (actually, I think the masked compare instructions can take a memory operand, which might get you uop fusion so the load+compare is one instruction) yielding 3 masks each with 16 bits (of course, most of the bits would be junk). The 16 bit masks can be shift+or'ed together (whether as "k-regs" or in GPRs) and the correct bits can be extracted with PEXT.
The downside of this is that you are still reading 6 times as much data. A straightforward implementation of this should not be CPU bound IMO. If a Skylake Server can't keep up with memory doing 32-bit compares I'll eat my hat.
Gather is not a good idea for this purpose. Gather is very expensive. It's really mainly good for eliminating pointer chasing and keeping a SIMD algorithm in the SIMD domain.
I've built a columnar in-memory data transformation engine [1] and wrote [2] about the need in common columnar data format to avoid re-compression. The problem I have with the Apache projects is that they all require strongly typed columns (if I'm not missing something). In our case, the app is actively used for processing spreadsheets and sometimes XML files which requires supporting mixed data types (e.g. timestamps and strings) in a column.
Another issue is storing 128-bit decimals. They are more preferable for financial calculations, but are not supported in the Apache projects.
So may be we need a forth standard for columnar data representation. Or expand the existing ones to make them more versatile.
It's an interesting article but surely a modern instance should be able to manage more than 4 u32 compares per cycle? Any machine with AVX2 (or beyond) should either be able to do 2x256 AVX2 VPCMPEQD instructions (or AVX 512 can do 1x512). The code to marshal up the results of this compare and do something useful with would push out to another cycle or two but IMO we can surely manage to compare an entire cache line (64 bytes) in 1 cycle worth of work.
This doesn't invalidate his point, and there are even more interesting things that can be done with packed data formats - and of course if you're searching a 6-bit column for some data item (or set of data items) you might be even happier to be using a columnar form (especially if the row contains lots of stuff: the 6:1 ratio in the article might be understating the advantage of working at the column level).
I don't have any strong feelings about this either way, but the main question in my head after reading this is:
- Are the differences SIGNIFICANT enough to support two (three?) different codebases? A lot of your points seem to be more about building in configurations/plugins versus a completely different (core) storage format. For instance, adding in separate plugins for different dictionary compressions, or being able to specify the block read size or schemes. I would just think you may be spending a lot of time reinventing 80% of the wheel and 20% on 'memory specific' functionality.
(I'm naively oversimplifying, but its something to think about).
If the CPU test is simply reading 1/6 of the data, then it should be memory that is the bottleneck and not the CPU (even unoptimized). Something smells very wrong about his adhoc test. The next piece of data is in the cache-line seven times out of eight. And its reading 1/6 of the data into memory. Should be way faster even without -O3. And if there's something fundamentally broken about the code without -O3, then why post that benchmark at all? Seems dishonest.
I thought it made the point that if you aren't vectorizing your code then it hardly matters if you use an optimized data format.
SSE makes a huge difference. With AVX and multiple threads you can actually exceed the memory bandwidth of a CPU socket, which looks funny in performance graphs as adding threads to cores suddenly stops scaling.
Columnar data representation can be optimized and designed for a wild range of query and data types that have non-aligned constraints. So yes, you can have 3, or 12 for that matter. I think we'll see more & more purpose-specific database technologies as gains from Moore's laws and its analogues start slowing down, and some previously niche application areas become high-$ areas with larger user bases.
Not sure if this is a good place to ask, but how do Apache Arrow and Parquet compare to Apache Kudu (https://kudu.apache.org/)? Seems like all three are columnar data solutions, but it's not clear when you'd use one over the other.
Kind of surprised the article didn't mention Kudu for that matter.
I have been working full-time on Kudu since its early development. As others have mentioned, Arrow and Kudu are quite different. Despite the controversial-sounding title of Daniel Abadi's article, his content was actually reasonable and his conclusion in the final paragraph of the article is worth reading. In summary, he acknowledges that in-memory and on-disk columnar formats have different goals and both have their place (Arrow being an in-memory format).
Apache Kudu is much more than a file format - it is a columnar distributed storage engine. One way to think of Kudu is as mutable Parquet, but really it's a database backend that integrates with Impala and Spark for SQL, among other systems. It's fault tolerant, manages partitioning for you, secure, and much more. For a quick introduction to Kudu you can check out this short slide deck I put together over a year ago... it's a bit dated but a good overview: https://www.slideshare.net/MichaelPercy3/intro-to-apache-kud...
For more up-to-date information, follow the Apache Kudu Blog at http://kudu.apache.org/blog/ or follow the official Apache Kudu twitter account @ApacheKudu.
One quick note to make on this. Kudu is a storage implementation, (similar to Parquet in some ways). Arrow isn't about persistence and is actually built to be complementary to both Kudu and Parquet.
Also note: Kudu is a distributed process. Arrow and Parquet are libraries that can be embedded into your existing applications.
> However, a modern CPU processor runs at approximately 3 GHz --- in other words they can process around 3 billion instructions a second. So even if the processor is doing a 4-byte integer comparison every single cycle, it is processing no more than 12GB a second
Does arrow itself supports and optimizes for multicore processing? Or is that a responsibility of a higher layer like dremio? If so, do such layers optimize Arrow queries execution to utilize multiple cores as much as possible?
I wonder if one could have an FPGA attached to the CPU, load a piece of code into it for pipelined decompressing and processing a piece of compressed column store, and then vroom! It would process data really fast.
How can I query Apache Arrow without entire Hadoop stack? It seems it could be a great in-memory OLAP engine, if only there was an efficient way to slice and dice it?
We built Dremio (github.com/dremio/dremio-oss, apache licensed) entirely on top of Apache Arrow specifically for the purposes of creating a high speed analytical capabilities including MOLAP like work as well as other forms of caching/acceleration for analytical workloads. Other products/projects are also starting to adopt a similar technical architecture.
Dremio looks very interesting indeed. What would you recommend for interacting with Arrow with more control, as a library? I'm interested in creating new Arrow-based data sources, not using it as an intermediary to other data sources.
On a side note - what other products/projects did you mean?
The Arrow project itself is a set of libraries. One of the things we'll do is try to add more algorithms over time to it so if you want say, a fast arrow sort or arrow predicate application. Full SQL is always far more complex and I can't see the project itself .
The engine inside of Dremio is something we call Sabot (a shoe for modern arrows, see sabot round on wikipedia). We hope to make it modular enough one day to use a library but it isn't there yet.
In regards to your other question re projects/products: Arrow contributors are actively trying to get more adoption of Arrow as an interchange format for several systems. We've had discussions around Kudu (no serious work done yet afaik). Parquet-to-Arrow for multiple languages is now available. Arrow committers include committers from several other projects such as HBase, Cassandra, Phoenix, etc. The goal is ultimately to figure integrations with all.
In most cases, these data storage systems are saddled with slow interfaces for data access. (Think row-by-row, cell-by-cell interfaces.) Arrow, among other things, allows them to communicate through a much faster mechanism (shared memory--or at least shared representation if not node local).
How does dremio differ from PrestoDB? As far as I know, PrestoDB can also virtualize access to many data sources and join data between them. We didn't go deep with PrestoDB because our basic tests for multi-source joins ran very slowly, and it seemed to pull all data from both joined tables into one place. I'm not a Prestodb expert, so maybe there's a better way to do it (all suggestions welcome).
What's the differentiator? Is dremio smarter somehow and avoids copying all data to perform a simple join? Or does it copy the data the same way but Arrow lets it be faster than Presto? What's on your roadmap?
PrestoDB is similar to Impala, Hive and other SQL Engines. Each is designed to do distributed SQL processing. Dremio does embed an OSS distributed SQL processing engine (Sabot, built natively on Arrow) as well but we see that as only a means to an end. Our focus is much more on being a bi & data fabric/service.
At the core of this vision are: very advanced pushdowns (far beyond other OSS systems), a powerful self-service UI for managing, curating and sharing data (designed for analysts, not just engineers) and--most importantly--the first open source implementation of distributed relational caching for all types of data. You can see more details about this last part in a deck I presented at DataEngConf early today: https://www.slideshare.net/dremio/using-apache-arrow-calcite...
Thank you very much for a thorough response. I think I would be happy with a library without SQL support, as long as filtering, grouping would be supported. Seems like that would be Sabot :) Maybe one day I'll be able to use it.
Dremio is focused on a combination of data access, acceleration and a self-service analyst experience (Tableau/Qlik one-click integration, data curation and data management).
We also invest heavily in our pushdowns. For example, we invested heavily in our Elasticsearch capabilities (support for CONTAINS and Lucene syntax, Painless and Groovy pushdowns, index leveraging, etc) and I'm pretty comfortable saying we're the best in world at exposing Elastic in a SQL context. Similarly for our other sources (join & window pushdowns in Oracle, SqlServer, etc, high speed predicate application in Hadoop, Mongo aggregation pipeline pushdowns including support unwind, etc).
I don't know Querona well (just a cursory site review). They seem much more focused on classic federation. It would also be important to understand when they repackage the Spark source connectors versus where they have built something that pushes down more powerfully (which you typically need for the best performance with these newer data source systems).
I was looking for something local, a library or sth like that. I don't want to run other processes just to query that memory. Spark is enormous and slow in comparison with local in memory solutions.
Gotcha. Just some confusion around what you meant by in-memory OLAP engine. It seems that many storage format libraries get integrated into other platforms that bring the other supporting features (and the other undesirable heavyweight things you didn't want) with them.
Calcite is more a query optimization framework. Actually, I think parent has a good question, which is, how to use these tools, formats, etc. from something simpler than an entire full blown distributed technology - and I think the answer to that is, there aren't too many technologies out of the box that do that right now.
I would think maybe extending SQLite to use columnar storage would be one possible path that might make sense.
There are columnar DBMS engines you can embed like MonetDB, but I personally haven't seen a lightweight query engine sitting right on top of Parquet/ORC, or implementing Arrow for in-memory columnar. You can get Apache Spark up and running pretty quickly on a single node, but I haven't seen anything even easier than that.
Let me give benefit of the doubt instead of simply doubting -
Can somebody provide a justification why performance-centric implementation-details justify an entire new project? Couldn't this be done as simply a storage engine? For that matter, couldn't all columnar datastores merely be storage engines?
It's more than an implementation detail because we're also targeting interoperability between multiple separate technologies. One of the key things that the article didn't fully cover is that Arrow serves two purposes: high performance processing and interoperability.
A key part of the vision is: two systems can share a representation of data to avoid serialization/deserialization overhead (and potentially copying in a shared memory environment). This is only possible if the in-memory format is also highly efficient for processing. This allows the processing systems (say Pandas and Dremio) to share a representation, both process against it, and then move the data between each other with zero overhead.
If you shared the data representation on the wire and then each application had to transform it to a better structure for processing, you're still paying for a form of ser/deser. By using Arrow for both processing and interoperation you benefit from near-no-cost movement between systems and also a highly efficient representation to process data (including some tools to get you started in the form of the Arrow libraries).
I think the author comes across as tone deaf for two reasons:
1. Burying the lede, by tacking the affirmative conclusion on as the final sentence,
2. This quote reads as pompous, not humble:
> I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression). I also expect that they will read the X100 compression paper which includes schemes which can be decompressed using vectorized processing.
The reality (at least from my perspective, which is more informed by the data science and non-analytic database world) is that Arrow is a new, category defining technology. So you could choose to use it as a main memory storage format as an alternative for Parquet/ORC if you wanted, but that would be one possible use case for the technology, not the primary one.
What's missing from the article is the role and relationship between runtime memory formats and storage formats, and the costs of serialization and data interchange, particularly between processes and analytics runtimes. There are also some small factual errors about Arrow in the article (for example, Arrow record batches are not limited to 64K in length).
I will try to write a lengthier blog post on wesmckinney.com when I can going deeper into these topics to help provide some color for onlookers.