Arrow is the most important thing happening in the data ecosystem right now. It's going to let you run your choice of execution engine on top of your choice of data store, as though they were designed to work together. It will mostly be invisible to users; the key thing that needs to happen is for all the producers and consumers of batch data to adopt Arrow as the common interchange format.
Snowflake has adopted Arrow as the in-memory format for their JDBC driver, though to my knowledge there is still no way to access data in parallel from Snowflake, other than to export to S3.
As Arrow spreads across the ecosystem, users are going to start discovering that they can store data in one system and query it in another, at full speed, and it's going to be amazing.
"A common, efficient serialized and wire format across data engines is a transformational development. Many previous systems and approaches (e.g., [26, 36, 38, 51]) have observed the prohibitive cost of data conversion and transfer, precluding optimizers from exploiting inter-DBMS performance advantages. By contrast, inmemory data transfer cost between a pair of Arrow-supporting systems is effectively zero. Many major, modern DBMSs (e.g., Spark,
Kudu, AWS Data Wrangler, SciDB, TileDB) and data-processing
frameworks (e.g., Pandas, NumPy, Dask) have or are in the process of incorporating support for Arrow and ArrowFlight. Exploiting this is key for Magpie, which is thereby free to combine data from different sources and cache intermediate data and results, without needing to consider data conversion overhead."
I wish MS would put some resources behind Arrow in .NET. I tried raising it on the dotnet repos (esp. within ML.NET), but to no avail. Hopefully that will change now that Arrow is more popular, and is even being written about by MS itself.
Arrow is definitely one of the top 10 new things I'm most excited about in the data science space, but not sure I'd call it the most important thing. ;)

It is pretty awesome, however, particularly for folks like me who are often hopping between Python/R/JavaScript. I've definitely got it on the roadmap for all my data science libraries.

Btw, Arquero from that UW lab looks really neat as well, and is supporting Arrow out of the gate (https://github.com/uwdata/arquero).
Much of the value of Arrow is in the things that will get built after Arrow is widely supported by data warehouses. Much of the data ecosystem we have today was designed to avoid the cost of moving data between systems. The whole Hadoop ecosystem is written in Java and shoehorned into map-reduce for this reason.
Imagine if, for example, you could use Mathematica or R to analyze data in your Snowflake cluster, with no bottleneck reading data from the warehouse even for giant datasets. This is the future that’s going to be enabled by Arrow.
FWIW, you can support tensors etc. natively today by using Arrow's structs, such as `column i_am_packed: {x: int32, y: int32, z: int32, time: datetime64[ns], temp: int32}`. This gives you a dense, packed representation for whatever fixed types the column has, including dynamically generated ones. It ends up being quite nice to pass around, esp. vs. say protobufs.
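For concreteness, here's a minimal pyarrow sketch of a packed struct column like that (the field values are just made up):

```
import datetime
import pyarrow as pa

# dense struct column: every row carries the same fixed-width fields
packed = pa.struct([
    ("x", pa.int32()),
    ("y", pa.int32()),
    ("z", pa.int32()),
    ("time", pa.timestamp("ns")),
    ("temp", pa.int32()),
])
col = pa.array(
    [{"x": 1, "y": 2, "z": 3, "time": datetime.datetime(2021, 2, 1), "temp": 21}],
    type=packed,
)
```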
You wouldn't be sending mixed-length values, and ~all metrics systems I've worked with would be fine there. The nullables end up being a win for those. However, you'd need to do a manual (likely zero-copy) cast between the struct type and whatever tensor type you're using. Massively popular ML systems like Hugging Face do this fine, afaict, for their Arrow-based tensor work. Likewise, as we do a lot of GPU stuff, it's also common for us to compact the in-memory data into big memory blocks ('long record batches') instead of CPU-land's typically more fragmented ones, which makes the casts even easier. Having to add an explicit cast for some interop cases is annoying, and writing the cast is the hard part, esp. for whoever does it first, but it preserves end-to-end type safety and hasn't been a deal breaker for us.

More frustrating for us has been sparse data and compression controls, but most formats are even worse there.
Almost no database systems support multidimensional arrays. So are they not appropriate for many use cases?
* BigQuery: no
* Redshift: no
* Spark SQL: no
* Snowflake: no
* Clickhouse: no
* Dremio: no
* Impala: no
* Presto: no
... list continues
We've invited developers to add the extension types for tensor data, but no one has contributed them yet. I'm not seeing a lot of tabular data with embedded tensors out in the wild.
I think that implementing good ndim=2 support would already be a huge leap forward; it doesn't have to be something super generic. E.g., given that most classic machine learning essentially uses 2-dimensional data (samples x features) as input, this is a very common use case.

E.g., as of right now, having to concatenate hundreds of columns manually just to pass them to some ML library in a contiguous format is always a pain and often doubles the max RAM requirement.
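As an illustration of the manual step I mean (a pyarrow/numpy sketch with made-up column names), this is roughly what you end up writing today, and the stacking copy is where the RAM doubles:

```
import numpy as np
import pyarrow as pa

# toy table standing in for "hundreds of feature columns"
table = pa.table({f"feat_{i}": np.random.rand(1_000) for i in range(5)})

# the manual step: copy every column out and stack into one contiguous
# (samples x features) matrix, which temporarily doubles peak memory
X = np.column_stack([table.column(name).to_numpy() for name in table.column_names])
```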
This may help you do zero-copy for a column of multidimensional values without losing the value types; what you lose is only the hint that it's multidimensional. This example is for values that are 3x3 int8s:
```
import pyarrow as pa

# a struct type with nine int8 fields, one per cell of each 3x3 value
my_col_of_3x3s = pa.struct([(f'f_{x}_{y}', pa.int8()) for x in range(3) for y in range(3)])
```
If using ndarrays, I think our helpers are another ~4 lines each. Interop with C is even easier, just cast. You can now pass this data through any Arrow-compatible compute stack / DB and not lose the value types. We do this for streaming into webgl's packed formats, for example.
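As a rough sketch of what I mean (hypothetical values, re-stating the struct type from above so it runs standalone):

```
import pyarrow as pa

# same struct type as above: nine int8 fields for a 3x3 value
my_col_of_3x3s = pa.struct([(f'f_{x}_{y}', pa.int8()) for x in range(3) for y in range(3)])

# build a tiny array of that type, then view one field without copying
arr = pa.array(
    [{f'f_{x}_{y}': 0 for x in range(3) for y in range(3)} for _ in range(4)],
    type=my_col_of_3x3s,
)
field_view = arr.field('f_0_0')    # zero-copy view of one dense child array
np_view = field_view.to_numpy()    # numpy view over the same Arrow buffer
```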
What you don't get is a hint to downstream systems that the data is multidimensional. Tableau would just let you do individual bar charts rather than, say, a heatmap, even assuming it supported rank-2 data. To convert, you'd need to do that zero-copy cast to whatever representation they do support. I agree a targetable standard would avoid the need for that manual conversion and increase the likelihood that systems use the same data representation.

Native support would also avoid some header bloat from using structs. However, we find that's fine in practice since it's only metadata. E.g., our streaming code reads the schema once at the beginning and then passes it along, so the actual payloads are pure data and skip resending metadata.
I suspect that AllegroCache accepts arrays with rank>=2, although I never got around to trying it out. (At the very least its documentation has nothing to say about any limitations on what kinds of arrays can be stored, so I'm assuming it stores all of them.)
TileDB[1] supports both sparse and dense multidimensional arrays. We support returning data in arrow arrays or an arrow table as one of many ways we interoperate with computational tools. You can also access the data directly via numpy array, pandas, R data.frames and a number of integrations with MariaDB, Presto, Spark, GDAL, PDAL and more.
Agreed! Thank you Arrow community for such a great project. It's a long road but it opens up tremendous potential for efficient data systems that talk to one another. The future looks bright with so many vendors backing Arrow independently and Wes McKinney founding Ursa Labs and now Ursa Computing. https://ursalabs.org/blog/ursa-computing/
It's nice that the cyclical technology pendulum is finally swinging back from XML/JSON to files-with-C-structs again, but any serious analytics data store (e.g. Clickhouse) uses its own aggressively optimized storage format, so the process of loading from Arrow files to database won't go away.
Are these execution engines using the Arrow columnar format internally, or are they just exposing Arrow as a client wire format? AFAIK Spark and Presto do not use Arrow as their execution columnar format, just for data sources/sinks.
You can configure Spark to use Arrow for passing data between Java and Python via spark.sql.execution.arrow.pyspark.enabled, but yes, Spark uses Java datatypes internally.
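A quick sketch of what that looks like, assuming PySpark 3.x:

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# enable Arrow for JVM <-> Python transfers (Spark 3.x config key)
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000)
pdf = df.toPandas()  # columnar transfer via Arrow instead of row-by-row pickling
```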
Any Snowflake developers reading this - the current snowflake-connector-python is pinned to 0.17, almost 1 year out of date now. Would be great to get that bumped to a more recent version :-)
True. Arrow is awesome, and Dremio is using it as its built-in memory format as well. I tried it and it is incredibly fast. The future of the data ecosystem is gonna be amazing.
If you found/wrote an adapter to translate your structured PDF into Arrow's format, yes - the idea is that you can wire up anything that can produce Arrow data to anything that can consume Arrow data.
I'm kinda confused. Is that not the case for literally everything? "You can send me data of format X, all I ask is that you be able to produce format X" ?
I'm assuming that I'm missing something fwiw, not trying to diminish the value.
The difference is that Arrow's mapping behind the scenes enables automatic translation to any implemented "plugin" available in the user's implementation of Arrow. You can extend Arrow's format to make it automatically map to whatever you want, basically.
And it’s all stored in memory - so much faster access to complex data relationships than anything that exists to my knowledge.
Could I write a plug-in that mapped to Cypher? I’ve got a graph use case in mind where I want to use RedisGraph but don’t feel comfy with Redis as a primary DB and would totally consider a columnar store as a primary if I didn’t have to serialize.
Uhh.. maybe. It's a serde that's trying to be cross-language / platform.
I guess it also offers some APIs to process the data so you can minimize serde operations. But, I dunno. It's been hard to understand the benefit of the library, and the posts here don't help.
There's no serde by design (aside from inspecting a tiny piece of metadata indicating the location of each constituent block of memory). So data processing algorithms execute directly against the Arrow wire format without any deserialization.
I challenge you to have a closer look at the project.
Deserialization by definition requires bytes or bits to be relocated from their position in the wire protocol to other data structures which are used for processing. Arrow does not require any bytes or bits to be relocated. So if a "C array of doubles" is not native to the CPU, then I don't know what is.
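A small pyarrow sketch of that idea (file path and data are made up): write a table in the Arrow IPC file format, memory-map it, and access the values in place:

```
import pyarrow as pa
import pyarrow.ipc as ipc

# write a small table in the Arrow IPC file format (illustrative data)
table = pa.table({"x": pa.array([1.0, 2.0, 3.0])})
with pa.OSFile("/tmp/example.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# memory-map the file and read it back: the arrays point into the mapped
# buffer, so no per-value decoding or repacking happens on the read path
with pa.memory_map("/tmp/example.arrow", "r") as source:
    loaded = ipc.open_file(source).read_all()
    print(loaded.column("x").chunk(0).to_numpy())  # view over the mapped memory
```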
CPUs come in many flavors. One area where they differ is the way the bytes of a word are represented in memory. Two common formats are big-endian and little-endian. This is an example where a "C array of doubles" would be incompatible and some form of deserialization would be needed.
My understanding is that an apache arrow library provides an API to manipulate the format in a platform agnostic way. But to claim that it eliminates deserialization is false.
Hiya, a bit of OT (again, last one promise!): I saw your comment about type systems in data science the other day (https://news.ycombinator.com/item?id=25923839). From what I understood, it seems you want a contract system, wouldn't you think? The reason I'm asking is that I'm fishing for opinions on building data science infra in Racket (and saw your deleted comment in https://news.ycombinator.com/item?id=26008869 so thought you'd perhaps be interested), and Racket (and R) dataframes happen to support contracts on their columns.
You are right that if you want to do this in a heterogeneous computing environment, one of the layouts is going to be "wrong" and require an extra step regardless of how you do this.
But ... (a) this is way less common than it was decades ago (we are talking about rare use cases here), and (b) it seems to be addressed in a sensible way (i.e., Arrow defaults to little-endian, but you could swap it on a big-endian network). I think it includes utility functions for conversion also.

So the usual case incurs no overhead, and the corner cases are covered. I'm not sure exactly what you are complaining about, unless it's that people aren't liberally sprinkling caveats like "no deserialization in most use cases" around the comments?
Big-endian is pretty rare among anything you'd be doing in-memory analytics on. Looks like you can choose the endianness of the format if you need to, but it's little-endian by default: https://arrow.apache.org/docs/format/Columnar.html. I'd suggest reading up on the format; it covers the properties it provides to be friendly to direct random access.
If it works as a universal intermediate exchange language, it could help standardize connections among disparate systems.
When you have N systems, it takes N^2 translators to build direct connections to transfer data between them; but it only takes N translators if all them can talk the same exchange language.
Can you define what a translator is? I don't understand the complexity you're constructing. I have N systems and they talk protobuf. What's the problem?
By a translator, I mean a library that allows accessing data from different subsystems (either languages or OS processes).
In this case, the advantages are that 1) Arrow is language agnostic, so it's likely that it can be used as a native library in your program and 2) it doesn't copy data to make it accessible to another process, so it saves a lot of marshalling / unmarshalling steps (assuming both sides use data in tabular format, which is typical of data analysis contexts).
Arrow has non-serde as its selling point. But I am wondering, how does it achieve no serde with Python? By cleverly allocating PyObjects on top of a network packet buffer?
If I convert an Arrow int8 array to normal python list of int's, will this involve copying?
This is similar to NumPy. You operate on Python objects that describe the NumPy structure, but the actual memory is stored in C objects, not Python objects.
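To make the distinction concrete, a small pyarrow sketch: converting to a Python list boxes every value and therefore copies, while to_numpy can hand back a view when there are no nulls:

```
import pyarrow as pa

arr = pa.array([1, 2, 3], type=pa.int8())

# materializing Python ints boxes every value, so this copies
as_list = arr.to_pylist()

# with no nulls, this returns a numpy view over the Arrow buffer instead
as_np = arr.to_numpy(zero_copy_only=True)
```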
Not GP post, but it might have been better stated as 'eliminating serde overhead'. Arrow's RPC serialization [1] is basically Protobuf, with a whole lot of hacks to eliminate copies on both ends of the wire. So it's still 'serde', but markedly more efficient for large blocks of tabular-ish data.
So, there are several components to Arrow. One of them transfers data using IPC, and naturally needs to serialize. The other uses shared memory, which eliminates the need for serde.

Sadly, the latter isn't (yet) well supported anywhere but Python and C++. If you can/do use it, though, data are just kept as arrays in memory, which is exactly what the CPU wants to see.
Oh, that's fantastic to hear. Right now I'm living in Python because that's the galactic center, but I've also been anxious to find a low-cost escape hatch that doesn't just lead to C++.
You should definitely check out Julia then. There are a few parts of the language that use C/C++ libraries (blas and mpfr are the main ones), but 95% of the time, your stack will be Julia all the way down.
In the best case, your CPU needn't really be involved beyond a call to set up mapped memory. It is (among other things) a common memory format, so you can be working with data in one framework and then start working with it in another, and its location and format in memory haven't changed at all.
The term likely predates the Rust implementation. SerDe is Serializer & Deserializer, which could be any framework or tool that allows the serialization and deserialization of data.
BigQuery recently implemented the storage API, which allows you to read BQ tables, in parallel, in Arrow format: https://cloud.google.com/bigquery/docs/reference/storage
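A quick sketch of what reading into Arrow looks like, assuming the google-cloud-bigquery client library (with google-cloud-bigquery-storage installed for the parallel read streams):

```
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`"

# to_arrow() returns a pyarrow.Table; when the BigQuery Storage client
# library is available, the transfer happens over the Arrow-based read streams
table = client.query(sql).result().to_arrow()
```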