Apache Arrow 3.0 (apache.org)
561 points by kylebarron on Feb 3, 2021 | 192 comments



Arrow is the most important thing happening in the data ecosystem right now. It's going to allow you to run your choice of execution engine, on top of your choice of data store, as though they were designed to work together. It will mostly be invisible to users; the key thing that needs to happen is for all the producers and consumers of batch data to adopt Arrow as the common interchange format.

BigQuery recently implemented the storage API, which allows you to read BQ tables, in parallel, in Arrow format: https://cloud.google.com/bigquery/docs/reference/storage

Snowflake has adopted Arrow as the in-memory format for their JDBC driver, though to my knowledge there is still no way to access data in parallel from Snowflake, other than to export to S3.

As Arrow spreads across the ecosystem, users are going to start discovering that they can store data in one system and query it in another, at full speed, and it's going to be amazing.


Microsoft is also on top of this with their Magpie project

http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf

"A common, efficient serialized and wire format across data engines is a transformational development. Many previous systems and approaches (e.g., [26, 36, 38, 51]) have observed the prohibitive cost of data conversion and transfer, precluding optimizers from exploiting inter-DBMS performance advantages. By contrast, inmemory data transfer cost between a pair of Arrow-supporting systems is effectively zero. Many major, modern DBMSs (e.g., Spark, Kudu, AWS Data Wrangler, SciDB, TileDB) and data-processing frameworks (e.g., Pandas, NumPy, Dask) have or are in the process of incorporating support for Arrow and ArrowFlight. Exploiting this is key for Magpie, which is thereby free to combine data from different sources and cache intermediate data and results, without needing to consider data conversion overhead."


I wish MS would put some resources behind Arrow in .NET. I tried raising some remarks about it on dotnet repos (esp. within ML.NET), but to no avail. Hopefully that will change now that Arrow is more popular, and also written about by MS itself.


Way cool! Is Magpie end-user facing anywhere yet? We were using the azureml-dataprep library for a while, which seems similar but doesn't cover all of what Magpie does.


Arrow is definitely one of the top 10 new things I'm most excited about in the data science space, but not sure I'd call it the most important thing. ;)

It is pretty awesome, however, particularly for folks like me that are often hopping between Python/R/Javascript. I've definitely got it on the roadmap for all my data science libraries.

Btw, Arquero from that UW lab looks really neat as well, and is supporting Arrow out of the gate (https://github.com/uwdata/arquero).


Much of the value of Arrow is in the things that will get built after Arrow is widely supported by data warehouses. Much of the data ecosystem we have today was designed to avoid the cost of moving data between systems. The whole Hadoop ecosystem is written in Java and shoehorned into map-reduce for this reason.

Imagine if, for example, you could use Mathematica or R to analyze data in your Snowflake cluster, with no bottleneck reading data from the warehouse even for giant datasets. This is the future that’s going to be enabled by Arrow.


Isn't that what Presto (now Trino) is solving as well (among other things)? Even though it is focused only on analytics and not on ML use cases.


Until Arrow has proper support for multidimensional arrays [1], it's not really appropriate for many use cases.

Not having first class support for multidimensional arrays, in a modern framework, really surprised me and my sensor data.

[1] https://lists.apache.org/x/thread.html/9b142c1709aa37dc35f1c...


FWIW, you can support tensors etc. natively today by using Arrow's structs, such as `column i_am_packed: {x: int32, y: int32, z: int32, time: datetime64[ns], temp: int32}`. This will be a dense, packed representation for whatever fixed type the column has, including dynamically generated. It ends up being quite nice to have that passed around, esp. vs say protobufs.

You wouldn't be sending mixed-length values, and ~all metrics systems I've worked with would be fine there. The nullables end up a win for those. However, you'd need to do a manual (likely zero-copy) cast between the struct type and whatever tensor type you're using. Massively popular ML systems like huggingface do this fine afaict for their Arrow-based tensor work. Likewise, as we do a lot of GPU stuff, what's additionally common is compacting the in-memory stuff into big memory blocks ('long recordbatches') instead of CPU-land's typically more fragmented ones, and that ends up making casts even easier. It's annoying to have to add an explicit cast for some interop cases, but it preserves end-to-end type safety & hasn't been a deal breaker for us. Having to write the cast is the annoying/difficult part, esp. for whoever does it first.

More frustrating for us has been sparse data and compression controls, but most formats are even worse here..


Almost no database systems support multidimensional arrays. So they are not appropriate for many use cases?

* BigQuery: no * Redshift: no * Spark SQL: no * Snowflake: no * Clickhouse: no * Dremio: no * Impala: no * Presto: no ... list continues

We've invited developers to add the extension types for tensor data, but no one has contributed them yet. I'm not seeing a lot of tabular data with embedded tensors out in the wild.


I think that implementing good ndim=2 support would already be a huge leap forward, it doesn't have to be something super generic. E.g., given that most of the classic machine learning is essentially using 2-dimensional data (samples x features) as inputs, this is a very common use case.

E.g., as of right now, having to concatenate hundreds of columns manually just in order to pass them to some ml library in a contiguous format is always a pain and often doubles the max ram requirement.


This may help you do zero-copy for a column of multidimensional values without losing value types; the caveat is that the encoding doesn't announce itself as multidimensional. This example is for values that are 3x3 int8s:

```
import pyarrow as pa

my_col_of_3x3s = pa.struct([(f'f_{x}_{y}', pa.int8()) for x in range(3) for y in range(3)])
```

If using ndarrays, I think our helpers are another ~4 lines each. Interop with C is even easier, just cast. You can now pass this data through any Arrow-compatible compute stack / DB and not lose the value types. We do this for streaming into webgl's packed formats, for example.

What you don't get is a hint to the downstream systems that it is multidimensional. Tableau would just let you do individual bar charts, not say a heatmap, assuming they support rank 2's. To convert, you'd need to do that zero-copy cast to whatever they do support. I agree a targetable standard would avoid the need for that manual conversion, and increases the likelihood they use the same data rep.

Native support would also avoid some header bloat from using structs. However, we find that's fine in practice, it's metadata. E.g., our streaming code reads the schema at the beginning and then passes it along, so actual payloads are pure data, and skip resending metadata.
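
If it helps, here's a hedged pyarrow sketch of the kind of zero-copy round trip being discussed, using a fixed-size-list encoding as an alternative to the struct packing above (all names are illustrative, not from the comments):

```
import numpy as np
import pyarrow as pa

# Four 3x3 int8 "tensors", packed as fixed-size lists of 9 values each;
# the child buffer is one contiguous run of int8s.
cells = pa.array(np.arange(4 * 9, dtype=np.int8))
col = pa.FixedSizeListArray.from_arrays(cells, 9)

# Zero-copy view back out: flatten to the child values, then reshape.
tensors = col.flatten().to_numpy(zero_copy_only=True).reshape(-1, 3, 3)
print(tensors.shape)  # (4, 3, 3)
```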



If you put a blank line between your bullet points, they'll display properly:

* BigQuery: no

* Redshift: no

* Spark SQL: no

* Snowflake: no

* Clickhouse: no

* Dremio: no

* Impala: no

* Presto: no


I suspect that AllegroCache accepts arrays with rank>=2, although I never got around to trying it out. (At the very least its documentation has nothing to say about any limitations on what kinds of arrays can be stored, so I'm assuming it stores all of them.)


On a side note, Clickhouse had some Arrow support

https://github.com/ClickHouse/ClickHouse/issues/12284


ClickHouse has support for multidimensional arrays.


The only database I know of that natively supports multidimensional arrays is SciDB (one of Stonebraker's products)

https://en.wikipedia.org/wiki/SciDB


TileDB[1] supports both sparse and dense multidimensional arrays. We support returning data in arrow arrays or an arrow table as one of many ways we interoperate with computational tools. You can also access the data directly via numpy array, pandas, R data.frames and a number of integrations with MariaDB, Presto, Spark, GDAL, PDAL and more.

Disclaimer: I am a member of the TileDB team.

[1] https://tiledb.com/


ClickHouse has support for multidimensional arrays with arbitrary types and number of dimensions.

They are stored in tables in efficient column-oriented format.


What are the use cases where multi-dimensional arrays are important and why?


Astronomy, seismology, microscopy.

There's an interesting query language, SciQL[1], built to support such use cases. It can be used with MonetDB.[2]

[1] https://projects.cwi.nl/scilens/Resources/SciQL.html

[2] https://www.youtube.com/watch?v=WaMo3oGTstY


R&D/manufacturing of consumer facing sensors, in my case.


Agreed! Thank you Arrow community for such a great project. It's a long road but it opens up tremendous potential for efficient data systems that talk to one another. The future looks bright with so many vendors backing Arrow independently and Wes McKinney founding Ursa Labs and now Ursa Computing. https://ursalabs.org/blog/ursa-computing/


It's nice that the cyclical technology pendulum is finally swinging back from XML/JSON to files-with-C-structs again, but any serious analytics data store (e.g. Clickhouse) uses its own aggressively optimized storage format, so the process of loading from Arrow files to database won't go away.


Are these execution engines internally using the Arrow columnar format, or are they just exposing Arrow as a client wire format? AFAIK Spark and Presto do not use Arrow as their execution columnar format, just for data sources/sinks.


You can configure Spark to use Arrow for passing data between Java and Python via spark.sql.execution.arrow.pyspark.enabled, but yes, Spark uses Java datatypes internally.
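
A minimal sketch of flipping that on (assuming Spark 3.x; the toPandas() hop is where Arrow kicks in):

```
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hand columnar Arrow batches to Python instead of pickling rows one by one.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(1_000_000).selectExpr("id", "id * 2 AS doubled")
pdf = df.toPandas()  # Arrow-backed transfer from the JVM to pandas
```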


Any Snowflake developers reading this - the current snowflake-connector-python is pinned to 0.17, almost 1 year out of date now. Would be great to get that bumped to a more recent version :-)


True. Arrow is awesome, and Dremio is using it as its built-in memory format as well. I tried it and it is incredibly fast. The future of the data ecosystem is gonna be amazing


Wait so you're telling me I could store data as a PDF file, and access it easily / quickly as SQL?


If you found/wrote an adapter to translate your structured PDF into Arrow's format, yes - the idea is that you can wire up anything that can produce Arrow data to anything that can consume Arrow data.


I'm kinda confused. Is that not the case for literally everything? "You can send me data of format X, all I ask is that you be able to produce format X" ?

I'm assuming that I'm missing something fwiw, not trying to diminish the value.


The difference is that Arrow's mapping behind the scenes enables automatic translation to any implemented "plugin" that is on the user's implementation of Arrow. You can extend Arrow's format to make it automatically map to whatever you want, basically.

And it’s all stored in memory - so much faster access to complex data relationships than anything that exists to my knowledge.


Could I write a plug-in that mapped to Cypher? I’ve got a graph use case in mind where I want to use RedisGraph but don’t feel comfy with Redis as a primary DB and would totally consider a columnar store as a primary if I didn’t have to serialize.


Uhh.. maybe. It's a serde that's trying to be cross-language / platform.

I guess it also offers some APIs to process the data so you can minimize serde operations. But, I dunno. It's been hard to understand the benefit of the library and the posts here don't help.


There's no serde by design (aside from inspecting a tiny piece of metadata indicating the location of each constituent block of memory). So data processing algorithms execute directly against the Arrow wire format without any deserialization.


Of course there is. There is always deserialization. The data format is most definitely not native to the CPU.


I challenge you to have a closer look at the project.

Deserialization by definition requires bytes or bits to be relocated from their position in the wire protocol to other data structures which are used for processing. Arrow does not require any bytes or bits to be relocated. So if a "C array of doubles" is not native to the CPU, then I don't know what is.


Perhaps "zero-copy" is a more precise or well-defined term?


CPUs come in many flavors. One area where they differ is in the way that bytes of a word are represented in memory. Two common formats are Big Endian and Little Endian. This is an example where a "C array of doubles" would be incompatible and some form of deserialization would be needed.

My understanding is that an apache arrow library provides an API to manipulate the format in a platform agnostic way. But to claim that it eliminates deserialization is false.


I wish I had your confidence, to argue with Wes McKinney about the details of how Arrow works


Iunno. That looks like "where angels fear to tread" territory to me.


Hiya, a bit of OT (again, last one promise!): I saw your comment about type systems in data science the other day (https://news.ycombinator.com/item?id=25923839). From what I understood, it seems you want a contract system, wouldn't you think? The reason I'm asking is that I'm fishing for opinions on building data science infra in Racket (and saw your deleted comment in https://news.ycombinator.com/item?id=26008869 so thought you'd perhaps be interested), and Racket (and R) dataframes happen to support contracts on their columns.


You are right that if you want to do this in a heterogeneous computing environment, one of the layouts is going to be "wrong" and require an extra step regardless of how you do this.

But ... (a) this is way less common than it was decades ago (rare use cases we are talking about here) and (b) it seems to be addressed in a sensible way (i.e. Arrow defaults to little-endian, but you could swap it on a big-endian network). I think it includes utility functions for conversion also.

So the usual case incurs no overhead, and the corner cases are covered. I'm not sure exactly what you are complaining about, unless it's the lack of caveats like "no deserialization in most use cases" liberally sprinkled around the comments?


Big endian is pretty rare among anything you'd be doing in-memory analytics on. Looks like you can choose the endianness of the format if you need to, but it's little endian by default: https://arrow.apache.org/docs/format/Columnar.html. I'd suggest reading up on the format; it covers the properties it provides to be friendly to direct random access.


If it works as a universal intermediate exchange language, it could help standardize connections among disparate systems.

When you have N systems, it takes N^2 translators to build direct connections to transfer data between them; but it only takes N translators if all of them can talk the same exchange language.


Can you define what a translator is? I don't understand the complexity you're constructing. I have N systems and they talk protobuf. What's the problem?


By a translator, I mean a library that allows accessing data from different subsystems (either languages or OS processes).

In this case, the advantages are that 1) Arrow is language agnostic, so it's likely that it can be used as a native library in your program and 2) it doesn't copy data to make it accessible to another process, so it saves a lot of marshalling / unmarshalling steps (assuming both sides use data in tabular format, which is typical of data analysis contexts).


It's not just a serde. One of its key use cases is eliminating serde.


Arrow has "no serde" as one of its selling points. But I am wondering: how does it achieve no-serde with Python? By allocating PyObjects cleverly within a network packet buffer?

If I convert an Arrow int8 array to normal python list of int's, will this involve copying?


I think the idea is that you'd have a Python object that behaves exactly like a list of ints would, but with Arrow as its backing store.


This is similar to Numpy. You operate on Python objects that describe the Numpy structure, but the actual memory is stored in C objects, not Python objects.


It would have to, but to a numpy vector, maybe not.


I just don't believe you. My CPU doesn't understand Apache Arrow 3.0.


Not GP post, but it might have been better stated as 'eliminating serde overhead'. Arrow's RPC serialization [1] is basically Protobuf, with a whole lot of hacks to eliminate copies on both ends of the wire. So it's still 'serde', but markedly more efficient for large blocks of tabular-ish data.

[1]: https://arrow.apache.org/docs/format/Flight.html


> Arrow's serialization is Protobuf

Incorrect. Only Arrow Flight embeds the Arrow wire format in a Protocol Buffer, but the Arrow protocol itself does not use Protobuf.


Apologies, off base there. Edited with a pointer to Flight :)


So, there are several components to Arrow. One of them transfers data using IPC, and naturally needs to serialize. The other uses shared memory, which eliminates the need for serde.

Sadly, the latter isn't (yet) well supported anywhere but Python and C++. If you can/do use it, though, data are just kept as arrays in memory. Which is exactly what the CPU wants to see.


Shared memory format is supported in Julia too!


Oh, that's fantastic to hear. Right now I'm living in Python because that's the galactic center, but I've also been anxious to find a low-cost escape hatch that doesn't just lead to C++.


You should definitely check out Julia then. There are a few parts of the language that use C/C++ libraries (blas and mpfr are the main ones), but 95% of the time, your stack will be Julia all the way down.


In the best case, your CPU needn't really be involved beyond a call to set up mapped memory. It is (among other things) a common memory format, so you can be working with it in one framework, and then start working with it in another, and its location and format in memory haven't changed at all.


For those wondering what a SerDe is: https://docs.serde.rs/serde/


The term likely predates the Rust implementation. SerDe is Serializer & Deserializer, which could be any framework or tool that allows the serialization and deserialization of data.

I first came across the concept in Apache Hive.


I first came across the concept when AWS Glue was messing up my CSVs on import


In case you weren't aware, Glue's metastore is an implementation of Apache Hive, so not too different from the person you're replying to.


I meant serialize / deserialize in the literal sense.


Excited to see this release's official inclusion of the pure Julia Arrow implementation [1]!

It's so cool to be able to mmap Arrow memory and natively manipulate it from within Julia with virtually no performance overhead. Since the Julia compiler can specialize on the layout of Arrow-backed types at runtime (just as it can with any other type), the notion of needing to build/work with a separate "compiler for fast UDFs" is rendered obsolete.

It feels pretty magical when two tools like this compose so well without either being designed with the other in mind - a testament to the thoughtful design of both :) mad props to Jacob Quinn for spearheading the effort to revive/restart Arrow.jl and get the package into this release.

[1] https://github.com/JuliaData/Arrow.jl


Especially impressive that the first official version ships with such broad feature coverage! Really great work by Jacob.


> a separate "compiler for fast UDFs" is rendered obsolete

Agreed. I am excited too. Thanks Jacob!



Can someone ELI5 what problems are best solved by apache arrow?


The premise around Arrow is that when you want to share data with another system, or even between processes on the same machine, most of the compute time is spent serializing and deserializing data. Arrow removes that step by defining a common columnar format that can be used in many different programming languages. There's more to Arrow than just the format; it also makes working with data easier through things like better over-the-wire transfers (Arrow Flight). How would this manifest for your customers using your applications? They'd likely see speeds increase. Arrow makes a lot of sense when working with lots of data in analytical or data science use cases.


Not only in between processes, but also in between languages in a single process. In this POC I spun up a Python interpreter in a Go process and passed the Arrow data buffer between them in constant time. https://github.com/nickpoorman/go-py-arrow-bridge


How is this any better than something like flatbuffers though?


For one, Arrow is columnar.

The idea of zero-copy serialization is shared between Arrow and FlatBuffers.


This is explained in the FAQ for Arrow, which says they use flatbuffers internally. Arrow supports more complex data types.


> saying they use flatbuffers internally

Only for IPC support - Arrow data format does not use flatbuffers.


Exactly. Some specific examples.

Read a Parquet file into a Pandas DataFrame. Then read the Pandas DataFrame into a Spark DataFrame. Spark & Pandas are using the same Arrow memory format, so no serde is needed.

See the "Standardization Saves" diagram here: https://arrow.apache.org/overview/


Is this merely planned, or does pandas now use Arrow’s format? I was under the impression that pandas was mostly numpy under the hood with some tweaks to handle some of the newer functionality like nullable arrays. But you’re saying that arrow data can be used by pandas without conversion or copying into new memory?


Pandas is still numpy under the hood, but you can create a numpy array that points to memory that was allocated elsewhere, so conversion to pandas can be done without copies in nice cases where the data model is the same (simple data type, no nulls, etc.): https://arrow.apache.org/docs/python/pandas.html#zero-copy-s...
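
A tiny pyarrow sketch of that nice case (primitive type, no nulls, so the numpy array is a view over the Arrow buffer rather than a copy; purely illustrative):

```
import numpy as np
import pyarrow as pa

arr = pa.array(np.arange(1_000_000), type=pa.int64())

# zero_copy_only=True raises instead of silently copying, so this line
# only succeeds because the column is null-free and fixed-width.
view = arr.to_numpy(zero_copy_only=True)
```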


I hope you don’t mind me asking dumb questions, but how does this differ from the role that say Protocol Buffers fills? To my ears they both facilitate data exchange. Are they comparable in that sense?


Better to compare it to Cap'n Proto instead. Arrow data is already laid out in a usable way. For example, an Arrow column of int64s is an 8-byte aligned memory region of size 8*N bytes (plus a bit vector for nullity), ready for random access or vectorized operations.

Protobuf, on the other hand, would encode those values as variable-width integers. This saves a lot of space, which might be better for transfer over a network, but means that writers have to take a usable in-memory array and serialize it, and readers have to do the reverse on their end.

Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout, and Protobuf as a lightweight purpose-built compression algorithm for structs.
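
To make that layout concrete, a small pyarrow sketch (illustrative only, not from the comment above):

```
import pyarrow as pa

col = pa.array([1, 2, None, 4], type=pa.int64())

# Exactly two buffers back the column: a validity bitmap and a contiguous
# block of 8-byte values, usable in place for random access or SIMD.
print(col.buffers())
```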


> Think of Arrow as standardized shared memory using struct-of-arrays layout, Cap'n Proto as standardized shared memory using array-of-structs layout

I just want to say thank you for this part of the sentence. I understand struct-of-arrays vs array-of-structs, and now I finally understand what the heck Arrow is.



Protobuf provides the fixed64 type and when combined with `packed` (the default in proto3, optional in proto2) gives you a linear layout of fixed-size values. You would not get natural alignment from protobuf's wire format if you read it from an arbitrary disk or net buffer; to get alignment you'd need to move or copy the vector. Protobuf's C++ generated code provides RepeatedField that behaves in most respects like std::vector, but in as much as protobuf is partly a wire format and partly a library, users are free to ignore the library and use whatever code is most convenient to their application.

TL;DR variable-width numbers in protobuf are optional.


protobufs still get encoded and decoded by each client when loaded into memory. arrow is a little bit more like "flatbuffers, but designed for common data-intensive columnar access patterns"


Arrow does actually use flatbuffers for metadata storage.



is that at the cost of the ability to do schema evolution?


Yes.

A second important point is the recognition that data tooling often re-implements the same algorithms again and again, often in ways which are not particularly optimised, because the in-memory representation of data is different between tools. Arrow offers the potential to do this once, and do it well. That way, future data analysis libraries (e.g. a hypothetical pandas 2) can concentrate on good API design without having to re-invent the wheel.

And a third is that Arrow allows data to be chunked and batched (within a particular tool), meaning that computations can be streamed through memory rather than the whole dataframe needing to be stored in memory. A little bit like how Spark partitions data and sends it to different nodes for computation, except all on the same machine. This also enables parallelisation by default. With the core counts of modern CPUs, this means Arrow is likely to be extremely fast.
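
A rough pyarrow sketch of that batch-at-a-time flow, using the IPC stream API (details are assumptions from the pyarrow docs, not from the comment above):

```
import pyarrow as pa

schema = pa.schema([("x", pa.int64()), ("y", pa.float64())])
sink = pa.BufferOutputStream()

# Write the data as a stream of small record batches...
with pa.ipc.new_stream(sink, schema) as writer:
    for i in range(10):
        batch = pa.record_batch(
            [pa.array(range(i, i + 5)), pa.array([float(i)] * 5)],
            schema=schema,
        )
        writer.write_batch(batch)

# ...and consume it batch by batch, never holding the whole thing at once.
for batch in pa.ipc.open_stream(sink.getvalue()):
    _ = batch.num_rows
```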


Re this second point: Arrow opens up a great deal of language and framework flexibility for data engineering-type tasks. Pre-Arrow, common kinds of data warehouse ETL tasks like writing Parquet files with explicit control over column types, compression, etc. often meant you needed to use Python, probably with PySpark, or maybe one of the other Spark API languages. With Arrow now there are a bunch more languages where you can code up tasks like this, with consistent results. Less code switching, lower complexity, less cognitive overhead.


> most of the compute time spent is in serializing and deserializing data.

This is to be viewed in light of how hardware evolves now. CPU compute power is no longer growing as much (at least for individual cores).

But one thing that's still doubling on a regular basis is memory capacity of all kinds (RAM, SSD, etc) and bandwidth of all kinds (PCIe lanes, networking, etc). This divide is getting large and will only continue to increase.

Which brings me to my main point:

You can't be serializing/deserializing data on the CPU. What you want is to have the CPU coordinate the SSD to copy chunks directly -and as is- to the NIC/app/etc.

Short of having your RAM doing compute work*, you would be leaving performance on the table.

----

* Which is starting to appear (https://www.upmem.com/technology/), but that's not quite there yet.


> What you want is to have the CPU coordinate the SSD to copy chunks directly -and as is- to the NIC/app/etc.

Isn't that what DMA is supposed to be?

Also, there's work in getting GPUs to load data straight from NVME drives, bypassing both the CPU and system memory. So you could certainly do similar things with the PCIE bus.

https://developer.nvidia.com/blog/gpudirect-storage/

A big problem is that a lot of data isn't laid out in a way that's ready to be stuffed in memory. When you see a game spending a long time loading data, that's usually why. The CPU will do a bunch of processing to map on disk data structures to a more efficient memory representation.

If you can improve the on-disk representation to more closely match what's in memory, then CPUs are generally more than fast enough to copy bytes around. They are definitely faster than system RAM.


This is backward -- this sort of serialization is overwhelmingly bottlenecked on bandwidth (not CPU). (Multi-core) compute improvements have been outpacing bandwidth improvements for decades and have not stopped. Serialization is a bottleneck because compute is fast/cheap and bandwidth is precious. This is also reflected in the relative energy to move bytes being increasingly larger than the energy to do some arithmetic on those bytes.


An interesting perspective on the future of computer architecture but it doesn't align well with my experience. CPUs are easier to build and although a lot of ink has been spilled about the end of Moore's Law, it remains the case that we are still on Moore's curve for number of transistors, and since about 15 years ago we are now also on the same slope for # of cores per CPU. We also still enjoy increasing single-thread performance, even if not at the rates of past innovation.

DRAM, by contrast, is currently stuck. We need materials science breakthroughs to get beyond the capacitor aspect ratio challenge. RAM is still cheap but as a systems architect you should get used to the idea that the amount of DRAM per core will fall in the future, by amounts that might surprise you.


I'm curious too. Does this mean data is first normalized into this "columnar format" as the primary source and all applications are purely working off this format?

I don't yet see clearly how the data is being transferred if no serializing/deserializing is taking place, if someone here can help fill in further. It almost sounds like there is some specialized bridge for the data transfer and I don't have the right words for it.


I think you've got it. Data is shared by passing a pointer to it, so the data doesn't need to be copied to different spots in memory (or if it is it's an efficient block copy not millions of tiny copies).


is there a strong use case around passing data from a backend to a frontend, e.g. from pandas data frame on the server into a js implementation on the client side, to use in a UI? As opposed to data pipelining among processing servers.


Yes definitely. For example, the perspective pivoting engine (https://github.com/finos/perspective) supports arrow natively, so you can stream arrow buffers from a big data system directly to be manipulated in your browser (treating the browser as a thick client application). It's a single copy from network buffer into the webassembly heap to get it into the library.


Not as much, because you are still paying the overhead of shipping data across the internet and through the browser.


Question, doesn't Parquet already do that?


From https://arrow.apache.org/faq/: "Parquet files cannot be directly operated on but must be decoded in large chunks... Arrow is an in-memory format meant for direct and efficient use for computational purposes. Arrow data is... laid out in natural format for the CPU, so that data can be accessed at arbitrary places at full speed."


Yes. But parquet is now based on Apache Arrow.


Parquet is not based on Arrow. The Parquet libraries are built into Arrow, but the two projects are separate and Arrow is not a dependency of Parquet.


Arrow has definitely influenced the design of Parquet; they're meant to complement each other.


In Perspective (https://github.com/finos/perspective), we use Apache Arrow as a fast, cross-language/cross-network data encoding that is extremely useful for in-browser data visualization and analytics. Some benefits:

- super fast read/write compared to CSV & JSON (Perspective and Arrow share an extremely similar column encoding scheme, so we can memcpy Arrow columns into Perspective wholesale instead of reading a dataset iteratively).

- the ability to send Arrow binaries as an ArrayBuffer between a Python server and a WASM client, which guarantees compatibility and removes the overhead of JSON serialization/deserialization.

- because Arrow columns are strictly typed, there's no need to infer data types - this helps with speed and correctness.

- Compared to JSON/CSV, Arrow binaries have a super compact encoding that reduces network transport time.

For us, building on top of Apache Arrow (and using it wherever we can) reduces the friction of passing around data between clients, servers, and runtimes in different languages, and allows larger datasets to be efficiently visualized and analyzed in the browser context.


To really grok why this is useful, put yourself in the shoes of a data warehouse user or administrator. This is a farm of machines with disks that hold data too big to have in a typical RDBMS. Because the data is so big and stored spread out across machines, a whole parallel universe of data processing tools and practices exists that run various computations over shards of the data locally on each server, and merge the results etc. (Map-reduce originally from Google was the first famous example). Then there is a cottage industry of wrappers around this kind of system to let you use SQL to query the data, make it faster, let you build cron jobs and pipelines with dependencies, etc.

Now, so far all these tools did not really have a common interchange format for data, so there was a lot of wheel reinvention and incompatibility. Got file system layer X on your 10PB cluster? Can't use it with SQL engine Y. And I guess this is where Arrow comes in, where if everyone uses it then interop will get a lot better and each individual tool that much more useful.

Just my naive take.


I recently found it useful for the dumbest reason. A dataset was about 3GB as a CSV and 20MB as a parquet file created and consumed by arrow. The file also worked flawlessly across different environments and languages.

So it’s a good transport tool. It also happens to be fast to load and query, but I only used it because of the compact way it stores data without any hoops to jump through.

Of course one might say that it’s stupid to try to pass around a 2GB or 20MB file, but in my case I needed to do that.


Parquet is not Arrow. Parquet has optimizations for storage size at the expense of compute readiness. Arrow maximizes computation efficiency, at the expense of storage size.

Read more here: https://stackoverflow.com/questions/56472727/difference-betw...


I used arrow to read and write parquets across environments. That's my dumb reason for finding arrow so useful.


Did you have to reformat the data to get that size saving?


Not at all. It was just swapping out readcsv with readparquet in R and Python. It was painless. Granted my dataset was mostly categorical so that’s why it compressed down so much, but it was a real lifesaver.

It was also nice to be able to read whole bundles of parquet files into a single dataframe easily. So it's nice for "sharding" really big parquets over multiple files, or fitting under file size limits on git repos.
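
On the Python side the swap really is about this small (a sketch; pandas uses pyarrow under the hood when it's installed, and the file names are placeholders):

```
import pandas as pd

df = pd.read_csv("data.csv")
df.to_parquet("data.parquet")            # columnar + compressed on disk
same_df = pd.read_parquet("data.parquet")
```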


gzip probably would have been fine for that case even without parquet?


Gzip did help, but not as much as parquet. I don’t remember the exact sizes but I think parquet was was 3GB->20MB and gzip was like 3GB->180MB or something.

Also gzip was an extra step of unzip then read.


The big thing is that it is one of the first standardized, cross language binary data formats. CSV is an OK text format, but parsing it is really slow because of string escaping. The files it produces are also pretty big since it's text.

Arrow is really fast to parse (up to 1000x faster than CSV), supports data compression, enough data-types to be useful, and deals with metadata well. The closest competitor is probably protobuf, but protobuf is a total pain to parse.


CSV is a file format and Arrow is an in memory data format.

The CSV vs Parquet comparison makes more sense. Conflating Arrow / Parquet is a pet peeve of Wes: https://news.ycombinator.com/item?id=23970586


To be fair, arrow absorbed parquet-cpp, so a little confusion is to be expected.


The worst part about CSV is that it appears to be so simple that many people don't even try to understand how complicated it is and just roll their own broken dialect instead of just using a library. There is also no standardized encoding. Delimiters have to be negotiated out of band and people love getting them wrong.


Mostly agree, but RFC-4180 does specify delimiters. I've used pipe-delimited files for years to ensure explicitly-defined conformity, but recently due to the plethora of libraries have been thinking that CSV may be simpler.


The closest competitor would be HDF5, not Protobuf.


Seems nice. How does it compare to hdf5?


HDF5 is pretty terrible as a wire format, so it's not a 1-1 comparison to Arrow. Generally people are not going to be saving Arrow data to disk either (though you can with the IPC format), but serializing to a more compact representation like Parquet.


As I understand, Arrow is particularly interesting since its wire format can be immediately queried/operated on without deserialization. Would saving an Arrow structure as Parquet not defeat that purpose, since you would need the costly deserialization step again on read? Honest question


The FAQ [1] and this SO answer [2] explain it better than I can, but basically yes. However, the (de)serialization overhead is probably better than most alternative formats you could save to.

[1] https://arrow.apache.org/faq/ [2] https://stackoverflow.com/questions/56472727/difference-betw...


I wrote a bit about this from the perspective of a data scientist here: https://www.robinlinacre.com/demystifying_arrow/

I cover some of the use cases, but more importantly try to explain how it all fits together, justifying why - as another commenter has said - it's the most important thing happening in the data ecosystem right now.

I wrote it because i'd heard a lot about Arrow, and even used it quite a lot, but realised I hadn't really understood what it was!


The press release on the first version of Apache Arrow discussed here 5 years ago is actually a good introduction IMO:

https://news.ycombinator.com/item?id=11118274


Rather curious myself.

https://en.wikipedia.org/wiki/Apache_Arrow was interesting, but I think many of us would benefit from a broader, problem focused description of Arrow from someone in the know.


When you want to process large amounts of in-memory tabular data from different languages.

You can save it to disk too using Apache Parquet but I evaluated Parquet and it is very immature. Extremely incomplete documentation and lots of Arrow features are just not supported in Parquet unfortunately.


Do you mean the Parquet format? I don't think Parquet is immature; it is used in so many enterprise environments, and it's one of the few columnar file formats for batch analysis and processing. It performs so well... But I'm curious to know your opinion on this, so feel free to add some context to your position!


Yeah I do. For example Apache Arrow supports in memory compression. But Parquet does not support that. I had to look through the code to find that out, and I found many instances of basically `throw "Not supported"`. And yeah as I said the documentation is just non-existent.

If you are already using Arrow, or you absolutely must use a columnar file format then it's probably a good option.


Is that a problem in the Parquet format or in PyArrow? My understanding is that Parquet is primarily meant for on-disk storage (hence the default on-disk compression), so you'd read into Arrow for in-memory compression or IPC.


I don't know whether it is a limitation of the format or of the implementation. I used the C++ library (or maybe it was C, I can't remember), which I assume PyArrow uses too.

> hence the default on-disk compression

No, Parquet doesn't support some compression formats that Arrow does.


Python Pandas can output to lots of things. R can read lots of things. Arrow lets you pass data between them (mostly*) seamlessly. (I hit one edge case with an older version with JSON as a datatype).

Not having to maintain data type definitions when sending around the data, nor caring whether my colleagues were using R or Python, worked great.


This is not the greatest answer, but one anecdote: I am looking at it as an alternative to MS SSIS for moving batches of data around between databases.


This link is a 404. Perhaps they weren't intending this post to be public yet?

At any rate, archive.org managed to grab it https://web.archive.org/web/20210203194945/https://arrow.apa...


Thanks for the heads up. The post is intended to be up but there's an intermittent error happening. It's been reported to the Apache infrastructure team.


For use as a file format, where one priority is to compress columnar data as well as possible, the practical difference between Arrow (via Feather?), Parquet, and ORC is still somewhat vague to me. During my last investigation, I got the impression that Arrow worked great as a standard, interoperable, in-memory columnar format, but didn't compress nearly as well as ORC or Parquet due to lack of RLE and other compression schemes (other than dictionary). Is this still the case? Is there a world where Arrow completely supplants Parquet and/or ORC?

EDIT: Just found https://wesmckinney.com/blog/arrow-columnar-abadi, which helps answer this question.


Can someone dig into the pros and cons of the columnar aspect of Arrow? To some degree there are many other data transfer formats but this one seems to promote its columnar orientation.

Things like eg. protobuffers support hierarchical data which seems like a superset of columns. Is there a benefit to a column based format? Is it an enforced simplification to ensure greater compatibility or is there some other reason?


The main benefit of a columnar representation in memory is it's more cache friendly for a typical analytical workload. For example, if I have a dataframe:

  (A int, B int, C int, D int)
And I write:

  A + B
In a columnar representation, all the As are next to each other, and all the Bs are next to each other, so the process of (A and B in memory) => (A and B in CPU registers) => (addition) => (A + B result back to memory) will be a lot more efficient.

In a row-oriented representation like protobuf, all your C and D values are going to get dragged into the CPU registers alongside the A and B values that you actually want.

Column-oriented representation is also more friendly to SIMD CPU instructions. You can still use SIMD with a row-oriented representation, but you have to use gather-scatter operations which makes the whole thing less efficient.
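
A rough way to feel the difference with numpy (the structured dtype below stands in for the row-oriented layout; purely illustrative):

```
import numpy as np

n = 10_000_000

# Columnar (struct-of-arrays): A and B are each one contiguous block.
a = np.random.randint(0, 100, n, dtype=np.int64)
b = np.random.randint(0, 100, n, dtype=np.int64)
col_sum = a + b                      # streams only A and B through the cache

# Row-oriented (array-of-structs): A, B, C, D interleaved per record.
rows = np.zeros(n, dtype=[("A", np.int64), ("B", np.int64),
                          ("C", np.int64), ("D", np.int64)])
row_sum = rows["A"] + rows["B"]      # drags C and D through the cache too
```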


Could you expand more on this columnar data vs row data? I missed something here about how the data is organized and what you mean by the C, D values getting dragged along.


To rephrase the sibling comment, if you had an array of four {ABCD} structs, there's basically two ways of storing them on disk:

1. AAAA BBBB CCCC DDDD

2. ABCD ABCD ABCD ABCD

One major heuristic in how CPUs make your code fast is to assume that if you access some memory, you're probably interested in the memory nearby. So when you access the first "A" bit of memory (common to both sequences above), depending on the memory layout you use, the CPU might also be smart and load the next bits into memory too -- maybe the next "AA", maybe "BC".

Depending on your workload, one or the other of those might be faster. If you're only interested in the first ABCD element because you're doing

  SELECT * FROM users WHERE id=$1
then you'll likely want "row-oriented" data -- the #2 scheme above. But if you're interested in all of the A values and none of the values from B/C/D because you're doing

  SELECT AVG(age) FROM users
then you'll likely want something "column-oriented" -- the #1 scheme above.


Row data is an array of structs, with each struct representing a record.

Columnar data is a struct of arrays, with each array representing a column.


It might be a stretch, but isn't it kind of the same idea as why ECS is faster than OO in video game development?


A columnar format is almost always what you want for analytical workloads, because their access patterns tend to iterate ranges of rows but select only a few columns at random.

About the only thing protocol buffers has in common is that it's a standardized binary format. The use case is largely non-overlapping, though. Protobuf is meant for transmitting monolithic datagrams, where the entire thing will be transmitted and then decoded as a monolithic blob. It's also, out of the box, not the best for efficiently transmitting highly repetitive data. Column-oriented formats cut down on some repetition of metadata, and also tend to be more compressible because similar data tends to get clumped together.

Coincidentally, Arrow's format for transmitting data over a network, Arrow Flight, uses protocol buffers as its messaging format. Though the payload is still blocks of column-oriented data, for efficiency.


This is intended for analytical workloads where you're often doing things that can benefit from vectorization (like SIMD). It's much faster to SUM(X) when all values of X are neatly laid out in-memory.

It also has the added benefit of eliminating serialization and deserialization of data between processes - a Python process can now write to memory which is read by a C++ process that's doing windowed aggregations, which are then written over the network to another Arrow compatible service that just copies the data as-is from the network into local memory and resumes working.


> It also has the added benefit of eliminating serialization and deserialization of data between processes

Is that accurate? It still has to deserialize from apache arrow format to whatever the cpu understands.


The important part to focus on is _between processes_.

Consider Spark and PySpark. The Python bits of Spark are in a sidecar process to the JVM running Spark. If you ask PySpark to create a DataFrame from Parquet data, it'll instruct the Java process to load the data. Its in-memory form will be Arrow. Now, if you want to manipulate that data in PySpark using Python-only libraries, prior to the adoption of Arrow it used to serialize and deserialize the data between processes on the same host. With Arrow, this process is simplified -- however, I'm not sure if it's simplified by exchanging bytes that don't require serialization/deserialization between the processes or by literally sharing memory between the processes. The docs do mention zero-copied shared memory.


The Arrow Feather format is an on-disk representation of Arrow memory. To read a Feather file, Arrow just copies it byte for byte from disk into memory. Or Arrow can memory-map a Feather file so you can operate on it without reading the whole file into memory.
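
A small pyarrow sketch of that (memory-mapping a Feather file; the file name is a placeholder):

```
import pyarrow as pa
import pyarrow.feather as feather

feather.write_feather(pa.table({"a": list(range(1_000_000))}), "data.feather")

# Memory-map rather than read: column data is paged in lazily by the OS,
# and no per-value decoding happens on access.
table = feather.read_table("data.feather", memory_map=True)
print(table.num_rows)
```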


That's exactly how I read every data format.

The advantage you describe is in the operations that can performed against the data. It would be nice to see what this API looks like and how it compares to flatbuffers / pq.

To help me understand this benefit, can you talk through what it's like to add 1 to each record and write it back to disk?


Redshift's explanation is pretty good. Among other things, if you only need a few columns there are entire blocks of data you don't need to touch at all. https://docs.aws.amazon.com/redshift/latest/dg/c_columnar_st...

It's truly magical when you scope down a SELECT to the columns you need and see a query go blazing fast. Or maybe I'm easily impressed.


- Selective/lazy access (e.g. I have 100 columns but I want to quickly pull out/query just 2 of them).

- Improved compression (e.g. a column of timestamps).

- Flexible schemas being easy to manage (e.g. adding more columns, or optional columns).

- Vectorization/SIMD-friendly.


So if I understand this correctly from an application developers perspective:

- for OLTP tasks, something row based like sqlite is great. Small to medium amounts of data mixed reading/writing with transactions

- for OLAP tasks, arrow looks great. Big amounts of data, faster querying (datafusion) and more compact data files with parquet.

Basically prevent the operational database from growing too large, offload older data to arrow/parquet. Did I get this correct?

Additionally there seem to be further benefits like sharing arrow/parquet with other consumers.

Sounds convincing, I just have two very specific questions:

- if I load a ~2GB collection of items into arrow and query it with datafusion, how much slower will this perform in comparison to my current rust code that holds a large Vec in memory and „queries“ via iter/filter?

- if I want to move data from sqlite to a more permanent parquet „Archive" file, is there a better way, like appending, than recreating the whole file or writing additional files?

Really curious, could find no hints online so far to get an idea.


I think the best way to think about it is, you have teams where hadoop queries take 5+ hours to run. You port that same data into arrow, now that same query takes 30 seconds.


This is something I understood, but I ask specifically as a person at the intersection point between those worlds, owning the operational database and wanting to see if I can enhance systems for both speedy operations _and_ seamless analytics using Arrow.


Related - recent Apache Arrow support in Julia announcement - https://julialang.org/blog/2021/01/arrow/


I got interested in Arrow recently after reading this blog post showing that Arrow (and Ray) are much faster than Pickle: https://rise.cs.berkeley.edu/blog/fast-python-serialization-...

I have a question about whether it would fit this use-case:

* I need a SUPER fast KV-store.

* I'm on a single machine.

* Keys are 10 bytes if you compress (or strings with 32 characters if you don't); unfortunately I can't store it as an 8-byte int. sqlite said it supports arbitrary precision numerics, but then I got burned finding out that it casts integers to arbitrary precision floats and only keeps the first 14 digits of precision :\

* Values are 4-byte ints. Maybe 3 4-byte ints.

* I have maybe 10B - 100B rows.

* I need super fast lookup and depending upon my machine can't always cache this in memory, might need to work from disk.

Would arrow be useful for this? Currently just using sqlite.


"much faster than Pickle"

That isn't saying much, though.

https://www.benfrederickson.com/images/python-serialization/...


Unrelated: What is this *pickle* term origin?

Python errors sometimes generate some "Could not pickle" errors and not sure what it tries to convey ...


Pickling a cucumber stores and preserves it for later use, so the term is used for serialization. Still an odd analogy though, since you can't unpickle a pickle. I've heard freeze/thaw which is at least a little better.


It's Python. So it's Monty Python.


Pickle is the Python module that (de)serializes Python objects to some sort of binary format. Failure to pickle means the object couldn't be serialized; the most common instance is probably trying to pass a lambda to a multiprocessing worker, because lambdas don't have names (you should use a def or functools.partial instead).


Since you need fast lookup, traditional transactional row store like RocksDB seems a better fit.


I'm surprised they are still making breaking changes, and they plan to make more (they are already working on a 4.0).


I didn't see any breaking changes in the release notes but I may have missed them. Maybe they don't use SemVer?


Arrow uses SemVer, but the library and the data format are versioned separately: https://arrow.apache.org/docs/format/Versioning.html


Uh... 404 error for the link....


Cudf and Cylon are two execution engines natively supporting Arrow format

https://github.com/rapidsai/cudf https://github.com/cylondata/cylon


Has anyone had success in getting the page to load? I'm on my desktop and can't see what's behind the link.


I'm on mobile and getting a white screen of death also.


How is Arrow when it comes to streaming workflows? Like, could Arrow replace Storm as analytics in a pipeline from Flume?


Would really love to see first class support for Javascript/Typescript for data visualization purposes. The columnar format would naturally lend itself to an Entity-Component style architecture with TypedArrays.


Have you seen https://github.com/finos/perspective? Though they removed the JS library a few months ago in favour of a WASM build of the C++ library.


The JS library hasn't been removed - it's the main UI and API over the WASM library that allows for Perspective to be used in the browser: https://perspective.finos.org/


I think the GP meant the Typescript arrow library.


There is a Typescript arrow library that can operate off the IPC format. As an example use case, take a look at the arquero data processing library from the University of Washington (from the same group that created D3). https://github.com/uwdata/arquero


https://arrow.apache.org/docs/js/ has existed for a while and uses typed arrays under the hood. It's a bit of a chunky dependency, but if you're at the point where that level of throughput is required bundle size is probably not a big deal.


I was mostly looking at this: https://arrow.apache.org/docs/status.html


Not sure I follow, that page indicates that JS support is pretty good for all but the more obscure features (e.g. decimals) and doesn't mention data visualization at all? Anyhow, I've successfully used https://github.com/vega/vega-loader-arrow for in-browser plots before, and Observable has a fair few notebooks showing how to use the JS API (e.g. https://observablehq.com/@theneuralbit/introduction-to-apach...)


Why no love for PHP I wonder? Don't see a supported library there.


If you're successfully doing data science or data engineering in PHP, you're already a god among informaticians, and don't need any extra help.


Hiya, a bit of OT: I saw your comment about type systems in data science the other day (https://news.ycombinator.com/item?id=25923839). From what I understood, it seems you want a contract system, wouldn't you think? The reason I'm asking is that I'm fishing for opinions on building data science infra in Racket (and saw your deleted comment in https://news.ycombinator.com/item?id=26008869 so thought you'd perhaps be interested), and Racket (and R) dataframes happen to support contracts on their columns.


I notice PHP compatibility conspicuously absent from so many libraries. Which is kind of amazing to me, since I "think" in PHP and MATLAB in a declarative and data-driven way. I do everything in sync blocking functional style and mostly just pipe data around. So I've found that the runners up (like Javascript, Python and Ruby) mostly just get in the way and force me to adopt their style. They're all roughly equivalent in power and expressivity, but nothing lets me go between a spreadsheet and the shell quite as easily as PHP.


has anyone had success using arrow in js to feed tabular data to the frontend from the backend, such as a pandas data frame?


Last time I worked in ETL was with Hadoop, looks like a lot happened.


There's actually a lot of overlap between Hadoop and Arrow's origins - a lot of the projects that integrated early and the founding contributors had been in the larger Hadoop ecosystem. It's a very good sign IMO that you can hardly tell anymore - very diverse community and wide adoption!


I worked at a large company a few years ago on a team implementing this. It’s super cool and works great. Definitely where the future is headed


Does no one do load testing anymore? Anyone got a working mirror?


Yay, another ad-tech support engine from Apache, great


I'm done with Hacker News, you guys just upvote marketing and politics.


I genuinely would like to know what's the issue with a post about a major release of an innovative project. What's political about it? It's not like it's a new release of an enterprise/paid product based on an open source project.


Yet your only contribution here is fully political...



