The reality (at least from my perspective, which is more informed by the data science and non-analytic database world) is that Arrow is a new, category defining technology. So you could choose to use it as a main memory storage format as an alternative for Parquet/ORC if you wanted, but that would be one possible use case for the technology, not the primary one.
What's missing from the article is the role and relationship between runtime memory formats and storage formats, and the costs of serialization and data interchange, particularly between processes and analytics runtimes. There are also some small factual errors about Arrow in the article (for example, Arrow record batches are not limited to 64K in length).
I will try to write a lengthier blog post on wesmckinney.com when I can going deeper into these topics to help provide some color for onlookers.
His main idea I took away was dispelling the notion that we should have instead been putting effort into slightly repurposing an existing format for our in-memory data layout.
It's definitely exciting to see the data science world and the databases world finally interacting a lot more - I think each has a lot to learn from the other. Battle tested techniques on one side, and entirely new use cases to deal with on the other.
For the curious: Abadi is a well-known name in the databases community, especially with regards to column stores. A few papers that he's co-authored that I like:
"Column-Stores vs. Row-Stores: How Different Are They Really?" - speaks about how a different style of query execution is a big part of what drives column-store performance, and not just the memory layout itself: http://db.csail.mit.edu/projects/cstore/abadi-sigmod08.pdf
And then there's the magnum opus "The Design and Implementation of Modern Column-Oriented Database Systems" which is a huge survey into the subject: https://stratos.seas.harvard.edu/publications/design-and-imp...
Looking forward to your post.
e.g. HBase stores data as Apache Arrow, which could then be directly queried by Spark or Hive without the need for serialization. As in eliminating the current overhead of going from HBase HFiles to Spark's internal RDD/DataFrame representation, with lots of ser/deser happening on both ends.
I think the comparisons to Parquet and ORC are surface-level; I expected Arrow to become the "one storage format to rule them all".
Did I get this wrong? Kindly correct if so.
I think Dan's analysis is evaluating Arrow from one particular and fairly constrained perspective of "if using Arrow and Parquet for RDBMS purposes, should they exist separately". I'm glad that Dan comes to a supportive conclusion even with a pretty narrow set of criteria.
If you broaden the criteria to all the different reasons people are consuming/leveraging/contributing to Arrow, the case only becomes more clear for its existence and use. As someone who uses Arrow extensively in my own work and professionally (https://github.com/dremio/dremio-oss), I find many benefits, including two biggies: processing speed AND interoperability (now two different apps can share in-memory data without serialization/deserialization or duplicate memory footprint). And best of all, the community is composed of collaborators trying to solve similar problems. When you combine all of these, Arrow is a no-brainer as an independent community and is developing quickly because of that (80+ contributors, many language bindings (6+), and more than 1,300 GitHub stars in a short amount of time).
This is one of the fundamental reasons that the ASF is successful. Any time you see the ASF criticized for hosting competing projects without addressing this point, feel free to dismiss the critique as facile and uninformed.
So indeed, it does not matter whether the data is stored on disk or in memory --- column-stores are a win for these types of workloads. However, the reason is totally different.
He says that for tables on disk, column layout is a win due to less data transferred from disk. But for tables in memory, the win is due to the fact that you can vectorize operations on adjacent values.
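As a rough illustration of the in-memory half of that point (a NumPy sketch with made-up data, not anything from the article): summing a column that lives in one contiguous array is a single vectorized reduction, whereas a row layout forces per-record field access.

```python
import numpy as np

# Row-oriented layout: a list of (id, price, qty) records. Summing
# prices means striding over heterogeneous fields, one record at a time.
rows = [(i, i * 1.5, i % 10) for i in range(1000)]
row_total = sum(r[1] for r in rows)

# Column-oriented layout: prices sit in one contiguous array, so the
# sum is a single vectorized reduction over adjacent values.
prices = np.array([r[1] for r in rows])
col_total = prices.sum()
```

Both produce the same total; the difference is that the columnar version is one tight loop over adjacent memory.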
I also think the convergence of these two approaches is interesting: one is from a database POV; the other is from a programming language POV.
Column Databases came out of RDBMS research. They want to speed up SQL-like queries.
Apache Arrow / Pandas came at it from the point of view of R, which descended from S. It's basically an algebra for dealing with scientific data in an interactive programming language. In this view, rows are observations and columns are variables.
It does seem to make sense for there to be two separate projects now, but I wonder if eventually they will converge. Probably one big difference is what write operations are supported, if any. R and I think Pandas do let you mutate a value in the middle of a column.
On another note, I have been looking at how to implement data frames. I believe the idea came from S, which dates back to 1976, but I only know of a few open source implementations:
- R itself -- the code is known to be pretty old and slow, although rich with functionality.
- Hadley Wickham's dplyr (the tbl type is an enhanced version of data.frame). In C++.
- The data.table package in R. In C.
- Pandas. In a mix of Python, Cython, and C.
- Apache Arrow. In C++.
- Julia DataFrames.jl.
If anyone knows of others, I'm interested.
The company/team behind Graphlab Create was bought by Apple and the open sourced components haven't been updated since then. Because of that, I wouldn't use it in production, but if you are just looking for functioning implementations to compare, that gives you one more.
For comparison, dplyr and arrow:
C++ does seem to be useful for stuff like this.
You've got most of the ones I'm familiar with there, barring arguably the SAS data set. Albeit that one is "row based" in terms of storage, like the database tables of yore, even though formats/types are applied conceptually at the column level.
The main thing I know of is that R's data.table package has the concept of indices on columns, like SQL. Base R doesn't have that.
I don't know if any of the others like Pandas, Arrow, or dplyr have it.
I just Googled and found this article and a very old gist:
I would guess that most implementations don't have indices, and that might be an indication that they're not essential. But for someone with a computer science background it seems annoying not to use hash tables when they are so obviously applicable.
As you've said, I think the inclusion of indices comes from people used to working with more of a database background speeding up frequent common queries. SAS data sets can be indexed, but I don't think there's anything intellectually to be gained there that isn't gotten just from general database indexes, since SAS data sets are designed to be on-disk data structures and operate very similarly to database tables.
Talking completely off the cuff here, I imagine it has to do with the fact that these are mostly designed to be in-memory/columnar based structures, by people doing particular kinds of analysis on those structures. I'm not saying indices can't play a part, but that given the context, they're mainly in the realms of micro and query optimisations.
I think one has to ask what kind of use case would an index support, and what does it gain in efficiency vs take away in complexity?
For in-memory columnar analysis on highly mutable data structures, it's not quite so clear that it gains that much.
First of all, if you're scanning an entire column, you're not gaining anything from indexes. Indeed, you're losing out because of the cost/space to store one.
Second of all, if you're updating or mutating the data structure frequently in memory, then the cost of re-indexing the data structure needs to be paid each time. A cost even more questionable when in-memory access is already pretty fast. And in a lot of this field, that's a common use case.
Thirdly, if you're just randomly accessing points within the column: columns are usually laid out as contiguous arrays of memory. Offsetting into the array at a particular index i is likely faster than the double indirection of looking up an index (also in memory) and then offsetting to another point in the original array.
R and the like already have the concept of indexing into a data frame using an array of integers, which you could argue not only covers the concept of an index, but might even be more efficient than storing them in a hash.
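In NumPy/pandas terms (a toy sketch, not R code), that integer-array style of indexing looks like this: a precomputed vector of positions selects rows by direct offset into the contiguous array, with no hash lookup involved.

```python
import numpy as np

# One column of a data frame, stored as a contiguous array.
col = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# A precomputed integer vector acts like an index: it selects rows
# by direct offset rather than by hashing a key.
positions = np.array([0, 2, 4])
selected = col[positions]
```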
Let's say you want to index on a string or a complex object/combination of objects/values. There's a potential use case (since a string doesn't map straight onto a low-level contiguous array), but the likes of R do have the concept of row names (albeit I can't remember how efficiently it's implemented), which already lets you access a data frame notionally by the name of a row.
Indeed, that's one of the places hashes have in my own implementation: figuring out how to index into the objects via string labels. But I deliberately left the inclusion of row/column names optional (though column names are included by default, since that seems to be the de facto norm).
In practice, many programmers/programs in this space either legitimately don't need or don't use the concept.
Indeed, I struggled quite a bit over whether to include rowname indexing as a concept at all, since I personally have barely used row names in my day-to-day work for the last 10+ years. But I can't say there isn't a use for them. I suppose transition matrices on real-world entities come to mind...
Maybe they are used for fast lookup? In other words, it's like an index? Retrieving a particular row is usually an O(n) scan, but maybe row names are a limited kind of index that lets you retrieve by name in O(1) time.
I tend to follow Hadley Wickham's model -- column labels are essential, but I'm pretty sure he would say row names are not in the model? I haven't seen it addressed actually.
If I'm understanding correctly the transition matrix example, I think data frames and matrices are vastly different. And it makes sense to have labels for every dimension on matrices, but not data frames.
What language are you implementing your data frame library in? I'm curious about both the API and the algorithms. I looked at Hadley's newer dplyr code, and it's mostly in C++, as opposed to plyr, which was mostly in R.
Are you an R user? If I were doing this (which I may, but not for a year or so), I would probably use Hadley's API and model the code after his. I'm curious if you have any criticisms of the way he does things.
FWIW the summary of using indices is that they change the computational complexity of an algorithm -- an O(n × m) algorithm can be turned into O(m log n), and that makes an orders-of-magnitude difference in running time. It's not a small difference. I've used this to good effect in real programs.
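To make that concrete, here's a toy Python sketch (using a hash index, which makes the whole pass O(n + m) rather than the O(m log n) of a tree index; the key names are invented):

```python
# m lookups against n rows. A linear scan per lookup is O(n * m);
# building a hash index once is O(n), after which each lookup is O(1).
rows = [("id%04d" % i, i * 2) for i in range(10_000)]
keys = ["id0005", "id9000", "id1234"]

# Without an index: one O(n) scan per key.
scanned = [next(v for k, v in rows if k == key) for key in keys]

# With an index: one O(n) build, then O(1) per lookup.
index = dict(rows)
indexed = [index[key] for key in keys]
```

Same results, but for large n and m the indexed version is orders of magnitude faster.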
It's more like data preparation than analysis, but that is a big part of what people use data frames for.
A short way to understand it is that R and Matlab aren't the same thing. They are not substitutes for one another.
In R, the core type is the data frame. In Matlab, it is the matrix, of which a 2D array is a special case. (Both Matlab and R have grown some of the other's functionality, but it's not good at all. You can tell which one is primary.)
Another way to say it: Julia at its core isn't an R replacement; it's a Matlab replacement. Apparently, the representation of missing values was a problem for DataFrames.jl. Whether or not individual entries in a vector are "nullable" has a large impact on how you write vectorized mathematical operations and how well it performs.
And if you notice that data frames sound like SQL tables, well that's exactly why these Apache projects are converging/overlapping :) They are coming from different points of view, but they have the same underlying data model.
One of my slogans is that "R is the only language without the ORM problem", because its native data model matches the data model of a relational database. I'm designing a shell which will feature data frames (eventually, probably not for a year or more):
SQL is a very limited language, with essentially no capabilities for abstraction. Writing programs using the relational model in a Turing-complete language like R or Python is very powerful.
A dataframe basically behaves like a database table, with all the attendant operations (aggregation, pivots, filters, etc.). The basic algebra for dataframes is relational algebra plus a few other useful database operations.
The algebra for a 2D array (matrix) is linear algebra.
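For instance, in pandas terms (a sketch with invented data, assuming pandas is installed), the relational-style data frame operations look like:

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "year":   [2016, 2016, 2017, 2017],
    "sales":  [100, 200, 150, 250],
})

# Filter (selection), aggregate (group by), and pivot -- the core
# relational-style operations on a data frame.
filtered = df[df["sales"] > 120]
totals = df.groupby("region")["sales"].sum()
pivoted = df.pivot(index="region", columns="year", values="sales")
```

All of these have direct SQL analogues (WHERE, GROUP BY, and a crosstab, respectively), which is exactly the "same underlying data model" point.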
There are instances where the difference can be palpable. I work with legacy MATLAB code where database-type data are stored in cell arrays. They are tremendously difficult to reason about for data science work because the primitive operations are all based on positional cell coordinates. (On the other hand, arrays are much easier to work with for numerical-type algorithms.)
If you understand the theory of coordinatized data, choosing the right coordinate system can greatly simplify math/data manipulation. A data frame is a type of dataset where the coordinates are simply dimensions in other columns, which makes it amenable to aggregations, pivots, slicing/dicing and such on mixed categorical and numerical data.
My impression is that it's more of an API to bolt the data frame abstraction onto whatever Apache Spark already did, rather than a native data frame representation. But I could be wrong -- I haven't used it.
Is the query language Scala or something else? As far as I remember, native Spark is like MapReduce where you write the distributed functions in the same language that the framework is written in -- Scala.
Also, I'm pretty sure Apache Spark is more immutable/functional, whereas in R and Python data frames are mutable. But the immutable model also has benefits, for redundancy and resilience to nodes going down, which I believe is the whole point of Spark.
Although, I have a pet-theory that you don't really need distributed data frames. For 95% or 99% of data science, use stratified sampling to cut your data down to the size of a single machine. Then load it up in R or Python. You can make almost any decision with say 4 GiB of data. Often you can make it with 4 MB or 4 KB of data. It's not the amount of data that is the problem, but the quality.
To add fuel to that fire, you can get AWS EC2 instances with 4TB of main system RAM now. At least where inference is the goal, the times that you'd need exotic systems are truly rare indeed.
Yes, for the most part.
> Is the query language Scala or something else?
You can use Spark SQL, which is essentially SQL; Spark parses it and pumps through a query planner and optimizer. You can also call methods in the Spark API (from Python, Scala, etc.)
> I'm pretty sure Apache Spark is more immutable/functional.
Yes, but that can be a good thing. It makes data transformations easier to reason about.
> Although, I have a pet-theory that you don't really need distributed data frames.
Yes, for most data science work on small datasets, you really don't. That said, there is some work being done in the Spark community to enable Spark to work as a computation engine for local datasets. That way one can standardize on one interface for both small data and big data. As of right now, though, there is a fair bit of overhead if you want to use Spark on smaller datasets -- it's way slower than R/Python dataframes. Hopefully some of those overhead issues will be addressed.
This is why I think Arrow is such an exciting development. The promise is that you can use any compute engine on the same dataframe representation with no serialization overheads. So you can use nimbler tools like R and Python when the data is small, and Spark (or PySpark) on the same data when it needs to scale up. It abstracts your data away from your tooling while maintaining high performance.
Basically you have a disk-backed distributed column store with petabytes of data. At Google this is Dremel:
"Dremel: interactive analysis of web-scale datasets"
And then you write SQL queries against it to load a sample of gigabytes onto a single machine. Stratified sampling is useful here, so you can get representative samples of both big groups and small groups; that kind of size skew comes up a lot. And then you use something like R, Python, or Julia to analyze that subset.
Despite the title of the Dremel Paper, it's not really good for interactive analysis. It's good for a quick "top K" query, which is sometimes all you need. But for anything beyond that, it's too slow.
And SQL is a horrible language for writing analytics in. The data frame model is much better. SQL is awesome for loading data though.
So I'm basically suggesting that the SQL/disk/distributed vs. the Programming language/memory/local distinction actually works well for most problems. You can cleanly divide your workflow into "what data do I need to answer this question?" vs. all the detailed and iterative analysis on a smaller data set.
We do need a much better interchange format than CSV though.
1) 2-phase workflow for analytics - Yes, this is being done today with different distributed column stores (Vertica, SQL Server columnstores etc.), queried into local dataframes. Spark can be used for that.
2) SQL good for data retrieval, dataframes better for analytics - OK. Spark can retrieve content via SQL into Spark DataFrames, and then apply RDD operations, which are similar to dplyr-type operations.
3) Better interchange format than CSV? OK, you can always write your SQL result set as a Parquet file, which can be natively accessed by Spark or Pandas (via PyArrow - there's Arrow again).
> That way one can standardize on one interface for both small data and big data.
For big data, I believe you want a set of operations that are efficient for data extraction:
- Selecting and filtering.
- Explicit and efficient joins. Disallow pathological joins not related to data extraction!
- Sampling. Especially efficient and accurate stratified sampling over groups. I believe this is missing from most databases, although I could be wrong, since there are many of them I'm not familiar with. They typically have either a fast inaccurate algorithm or a slow accurate one. Or they just let you sample an entire set of records from a table, without "group by"-like semantics. I did what you would now call "data engineering" for many years, and this was probably the #1 thing that came up over and over.
That's pretty much it as far as I can tell. The goal is just to cut your data down to manageable size, not do anything overly smart.
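A crude version of the per-group (stratified) sampling described above, sketched in pandas with invented group sizes: cap every stratum at a fixed number of rows, so big groups shrink drastically while small groups stay represented.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["big"] * 1000 + ["small"] * 10,
    "value": range(1010),
})

# Take at most 5 rows per group: the 1000-row group is cut down
# hard while the 10-row group still contributes a full stratum.
sample = pd.concat(
    g.sample(min(len(g), 5), random_state=0)
    for _, g in df.groupby("group")
)
counts = sample["group"].value_counts()
```

A real database-side implementation would do this before the data ever leaves the cluster, which is exactly the missing feature being complained about.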
On the other hand, for small data, you want an expressive interface, not necessarily an efficient one. You want all the bells and whistles of R and dplyr (and I suppose Pandas, although I'm less familiar with it).
A lot of those things are hard to implement efficiently, but you need them for scripting. Moreover sometimes it's just a lot easier to use an inefficient algorithm on small data, since most of the code is throwaway.
I spent a while looking at Apache Arrow a few weeks ago. I like the idea a lot. I feel like it may be trying to serve too many masters (but that's just a feeling). It definitely seems easy to underestimate the difference in requirements between small data and big data.
I'm saying those differences are fundamental and not accidental, and you want different APIs so that programs are aware of the differences.
Also, in most other domains there is a distinction between "interchange format" and "application format" (e.g. png/tiff files vs. photoshop files). It would be nice if that didn't exist, but it's there for a reason. This is a long argument, but some problems are more social than technical. Dumb, inefficient interchange formats solve a social problem.
I believe Arrow can be part of (1) and (2). It is an in-memory format that aims to avoid serialization to/from different use points, so it is tooling and data-size agnostic. You can still sample from it and perform your exploration and analysis.
Spark, on the other hand, is mostly (2); it can be used for (1) but is fairly weak at it due to overheads. It also doesn't support the full range of capabilities that dplyr (and other R packages) bring to the table. I would not use Spark for EDA or throwaway modeling.
I think we can agree to disagree but I see a lot of advantages to having a common in-memory abstraction (such as Arrow) to small/big data whose difference is only that of cardinality. We don't have to mandate the use of such a format of course, but I think having such a format solves a large class of problems for a lot of people.
At the same time, no one will stop using CSV (tabular) or JSON (hierarchical), even though these are inefficient, "dumb" formats. They are good enough for most applications and are what I consider local-optimum solutions. Even Excel XLSX is a pretty good format for interchange.
I am open to changing my mind but my experiences so far have not convinced me otherwise yet.
In my experience, when people think they want to "explore" hundreds of terabytes of data (if that's even possible), they either don't have a good method of sampling (due to database limitations), or they haven't considered what kind of sampling they need to answer the question. I'm open to counterexamples where you need terabytes or petabytes of data and you can't sample, but I can't think of any myself.
And yes R is too inefficient to handle some problems on a single machine (or multiple machines), but data.table and dplyr are more efficient. I guess one anti-pattern I would see is -- R is too inefficient, let's run R on multiple machines! I think the better solution is to make use of your machine better, with packages like data.table and dplyr (or a future platform based on Arrow). I think it's easy to underestimate the practical problems of distributed computing.
You should really avoid distributed computing when you can!
And I agree that once you reach production, you do need different tools. In my experience, when you are dealing with hundreds of terabytes or petabytes of data, you're writing a fairly optimized program with a big data framework. NOT using data frames. Big data frameworks are already pretty inefficient -- as you mention, if you run them on a local machine, you see evidence of that. This is true of all the frameworks at Google written in C++ too. When you add the data frame abstraction, it's only going to get worse. Every layer adds some overhead.
So I am very big on data frames for data analysis obviously, but I question the value of distributed data frames.
Bringing it back to the article, having 2 abstractions definitely makes sense. You want a distributed disk-backed column store for storing huge amounts of data permanently. And then you want an in-memory format like Arrow for iteratively analyzing data using the data frame abstraction.
What's not clear to me is that you want a distributed in-memory abstraction as well. It's possible but I think there are so many other problems in the tools for data science -- I can easily name 10 problems with both R and Python ecosystems that are more important than distributed data frames.
To play devil's advocate, I do see the value in not rewriting your analysis when you go from prototype to production. I'm pretty big on that -- I like to ship the prototypes. That's pretty much the philosophy of Julia -- the prototype should be efficient.
But I see multiple challenges there for data science. Even if you had distributed data frames, it wouldn't solve that problem. For one thing, most analysts don't write their code with things like indices in mind, see my other comments here:
If the code could benefit from indices, and is not using them (like most programs that use data frames), then IMO they need major surgery anyway. It's not just a matter of flipping a switch and magically going from small to big. But yes I think the general idea of shipping your prototypes is appealing.
There are a few big benefits to immutability. Redundancy and resilience is a big one, but you also get some improved performance because the compiler and planner can more aggressively optimize, and do things like predicate push-down since it doesn't have to materialize everything. It also gets you a lot of the more general programming benefits of immutability/FP in preventing bugs and such.
I somewhat agree with your point about data volume. The situations that require Spark are relatively infrequent. But they are becoming more common. A lot of data is required when you start getting into building more complex machine learning models. But for a lot of your run-of-the-mill regression or basic understanding analysis, Spark is way overkill.
It makes sense to keep Arrow and Parquet/ORC as separate projects, while also continuing to maintain tight integration.
You might enjoy reading it to see why.
I wonder how many people will be lost before getting to the last line.
All of that to underscore that it's not that one format vectorizes and the other doesn't. The key takeaway here is that with the column store, the compiler can vectorize automatically. This is an especially big bonus for JVM-based languages because, AFAIK, there is no decent way to hand-roll SIMD code without crossing a JNI boundary.
The downside of this is that you are still reading 6 times as much data. A straightforward implementation of this should not be CPU-bound, IMO. If a Skylake server can't keep up with memory while doing 32-bit compares, I'll eat my hat.
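To make the kind of scan under discussion concrete, here's a NumPy sketch (NumPy is vectorized C under the hood rather than JVM auto-vectorization, but it illustrates the same contiguous 32-bit compare; the data is invented):

```python
import numpy as np

# A contiguous column of one million 32-bit integers.
values = np.arange(1_000_000, dtype=np.int32)

# A predicate over the whole column compiles down to a tight loop
# over adjacent memory -- the kind of scan that should be bound by
# memory bandwidth, not by the CPU doing the compares.
mask = values < 500_000
matches = int(mask.sum())
```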
Gather is not a good idea for this purpose. Gather is very expensive. It's really mainly good for eliminating pointer chasing and keeping a SIMD algorithm in the SIMD domain.
Another issue is storing 128-bit decimals. They are preferable for financial calculations, but are not supported in the Apache projects.
So maybe we need a fourth standard for columnar data representation. Or expand the existing ones to make them more versatile.
This doesn't invalidate his point, and there are even more interesting things that can be done with packed data formats - and of course if you're searching a 6-bit column for some data item (or set of data items) you might be even happier to be using a columnar form (especially if the row contains lots of stuff: the 6:1 ratio in the article might be understating the advantage of working at the column level).
- Are the differences SIGNIFICANT enough to support two (three?) different codebases? A lot of your points seem to be more about building in configurations/plugins versus a completely different (core) storage format. For instance, adding in separate plugins for different dictionary compressions, or being able to specify the block read size or schemes. I would just think you may be spending a lot of time reinventing 80% of the wheel and 20% on 'memory specific' functionality.
(I'm naively oversimplifying, but it's something to think about.)
1. Burying the lede, by tacking the affirmative conclusion on as the final sentence,
2. This quote reads as pompous, not humble:
> I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression). I also expect that they will read the X100 compression paper which includes schemes which can be decompressed using vectorized processing.
It would be good to post the code when making such claims.
SSE makes a huge difference. With AVX and multiple threads you can actually exceed the memory bandwidth of a CPU socket, which looks funny in performance graphs as adding threads to cores suddenly stops scaling.
Kind of surprised the article didn't mention Kudu for that matter.
Apache Kudu is much more than a file format - it is a columnar distributed storage engine. One way to think of Kudu is as mutable Parquet, but really it's a database backend that integrates with Impala and Spark for SQL, among other systems. It's fault tolerant, manages partitioning for you, secure, and much more. For a quick introduction to Kudu you can check out this short slide deck I put together over a year ago... it's a bit dated but a good overview: https://www.slideshare.net/MichaelPercy3/intro-to-apache-kud...
For more up-to-date information, follow the Apache Kudu Blog at http://kudu.apache.org/blog/ or follow the official Apache Kudu twitter account @ApacheKudu.
Also note: Kudu is a distributed process. Arrow and Parquet are libraries that can be embedded into your existing applications.
This is so laughably simplified.
On a side note - what other products/projects did you mean?
The engine inside of Dremio is something we call Sabot (a shoe for modern arrows; see "sabot round" on Wikipedia). We hope to make it modular enough one day to use as a library, but it isn't there yet.
In regard to your other question re projects/products: Arrow contributors are actively trying to get more adoption of Arrow as an interchange format for several systems. We've had discussions around Kudu (no serious work done yet, AFAIK). Parquet-to-Arrow is now available for multiple languages. Arrow committers include committers from several other projects such as HBase, Cassandra, Phoenix, etc. The goal is ultimately to figure out integrations with all of them.
In most cases, these data storage systems are saddled with slow interfaces for data access. (Think row-by-row, cell-by-cell interfaces.) Arrow, among other things, allows them to communicate through a much faster mechanism (shared memory, or at least a shared representation if not node-local).
What's the differentiator? Is dremio smarter somehow and avoids copying all data to perform a simple join? Or does it copy the data the same way but Arrow lets it be faster than Presto? What's on your roadmap?
At the core of this vision are: very advanced pushdowns (far beyond other OSS systems), a powerful self-service UI for managing, curating and sharing data (designed for analysts, not just engineers) and--most importantly--the first open source implementation of distributed relational caching for all types of data. You can see more details about this last part in a deck I presented at DataEngConf early today: https://www.slideshare.net/dremio/using-apache-arrow-calcite...
We also invest heavily in our pushdowns. For example, we invested heavily in our Elasticsearch capabilities (support for CONTAINS and Lucene syntax, Painless and Groovy pushdowns, index leveraging, etc.) and I'm pretty comfortable saying we're the best in the world at exposing Elastic in a SQL context. Similarly for our other sources (join & window pushdowns in Oracle, SqlServer, etc., high-speed predicate application in Hadoop, Mongo aggregation pipeline pushdowns including support for unwind, etc.).
I don't know Querona well (just a cursory site review). They seem much more focused on classic federation. It would also be important to understand when they repackage the Spark source connectors versus where they have built something that pushes down more powerfully (which you typically need for the best performance with these newer data source systems).
With a little bit more poking: Apache Calcite?
I would think maybe extending SQLite to use columnar storage would be one possible path that might make sense.
There are columnar DBMS engines you can embed like MonetDB, but I personally haven't seen a lightweight query engine sitting right on top of Parquet/ORC, or implementing Arrow for in-memory columnar. You can get Apache Spark up and running pretty quickly on a single node, but I haven't seen anything even easier than that.
Can somebody provide a justification why performance-centric implementation-details justify an entire new project? Couldn't this be done as simply a storage engine? For that matter, couldn't all columnar datastores merely be storage engines?
A key part of the vision is: two systems can share a representation of data to avoid serialization/deserialization overhead (and potentially copying in a shared memory environment). This is only possible if the in-memory format is also highly efficient for processing. This allows the processing systems (say Pandas and Dremio) to share a representation, both process against it, and then move the data between each other with zero overhead.
If you shared the data representation on the wire and then each application had to transform it into a better structure for processing, you'd still be paying for a form of ser/deser. By using Arrow for both processing and interoperation, you benefit from near-zero-cost movement between systems and also a highly efficient representation for processing data (including some tools to get you started, in the form of the Arrow libraries).