
Apache Arrow 1.0 - dragonsh
https://arrow.apache.org/blog/2020/07/24/1.0.0-release/
======
derriz
Parquet/Arrow is a great format for Pandas: fast, and nicely compact in file
size for those of us who don't have the luxury of directly attached NVMe SSDs
and for whom I/O bottlenecks are a consideration.

Even after hitting issues with its inability to map datetime64 values
properly, I was reasonably happy with my design choice.

I became less happy on discovering that it's very weak as an interchange
format for cross-language work.

In my case I wanted to use some existing JVM-based tooling. This caused huge
pain. The JVM/Java library/API is a complete mess, sorry, and if people are
complaining about the Python/C++ documentation, there's basically nothing for
the Java library. It's barely usable and the dependencies are horrific: the
whole thing is mingled with Hadoop dependencies, even the API itself.

And the API is barely above exposing the file format. There's nothing like
"load this Parquet file" into some object which you can then query for its
contents; you're dealing with blocks and sections and other file-format-level
entities.

The other issue is caused by its flexibility: for example, Pandas dataframes
are written with what's effectively a bunch of "extension metadata", which
means it works great for reading and writing Pandas from Python, but don't
expect anything to be able to work with the files out-of-the-box in other
languages.

In the end, the only way I could get reliable reading and writing from the
JVM was to store only numeric and string data from the Python side. Even then
it feels flaky, with a bunch of Hadoop warnings and deprecation warnings. I
know the JVM gets little appreciation in the data science world, which is
maybe a reason for the sorry state of the Java library.

Edit: to be specific, I am talking about my experiences with Arrow/Parquet.

~~~
wesm
What you've written sounds like a criticism of the JVM data analytics
ecosystem (the Java Parquet library in particular) and not Apache Arrow
itself. Parquet for Java is an independent open source project and developer
community. For example, you said

> It's barely usable and the dependencies are horrific: the whole thing is
> mingled with Hadoop dependencies, even the API itself.

These are comments about
[http://github.com/apache/parquet-mr](http://github.com/apache/parquet-mr),
which is a different open source project.

For C++ / Python / R many of the developers for both Apache Arrow and Apache
Parquet are the same and we currently develop the Parquet codebase out of the
Arrow source tree.

So, I'm not sure what to tell you; we Arrow developers cannot take it upon
ourselves to fix up the whole JVM data ecosystem.

~~~
derriz
I'm not expecting anything really and I do appreciate your work and effort.
And it's a specific use case for arrow, I guess.

But your landing page claims that "Apache Arrow defines a language-independent
columnar memory format for flat and hierarchical data, organized for efficient
analytic operations on modern hardware like CPUs and GPUs" and that "Libraries
are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby,
and Rust." This certainly gave me the impression that more than just Python,
C++, and R would be well supported.

The JVM isn't completely irrelevant in data science, given the position of
Spark/Scala. This also raised my expectations of Arrow/Parquet, because it
seems to be the de facto standard for table storage on the JVM platform. And I
experienced no issues on that platform.

To be clear, I'm not blaming you for my design decision (I'm a software
engineer, not a data scientist, btw), and I still think Parquet/Arrow rocks
for Python, but in my experience it doesn't really deliver a usable "cross-
language" file format at the moment.

~~~
mumblemumble
FWIW, while the JVM isn't completely irrelevant in data work, I will say, even
as a big user of Spark via Scala, that JVM languages are quickly becoming so.
Spark's Scala API is simultaneously the core of the platform and very much a
second-class citizen that lacks a lot of important features the Python API
has: easy interop with a good math library, for example.

Similarly, the reference implementation of Parquet may be in Java, but
consuming it from a Java language, outside of a Spark cluster, is still a
royal pain. Whereas doing it from Python isn't too bad.

Long story short, I think that expecting a project that's just trying to
implement a columnar memory format to also muck out the world's filthiest
elephant pen is perhaps asking too much. Though perhaps a project like Arrow
could serve as the cornerstone of an effort to douse it all with kerosene and
make a fresh start.

~~~
pjmlp
I spent a couple of years doing consultancy for life-sciences research labs;
most people were just using Excel and Tableau, plugged into OLAP and SQL
servers, alongside Java- and .NET-based stores.

Stuff like Arrow doesn't even come onto IT's radar.

------
wesm
Hi, Wes (Apache Arrow co-creator and Python pandas creator) here! If you're
wondering what this project is all about, my JupyterCon keynote (18 min long)
from 3 years ago is a good summary, and the vision/scope for what we've been
doing since 2016 has been pretty consistent:

[https://www.youtube.com/watch?v=wdmf1msbtVs](https://www.youtube.com/watch?v=wdmf1msbtVs)

~~~
tgb
Thanks Wes, pandas and Arrow are great projects. Is Feather now ready for
long-term storage with the V2 release? And now that it's just a renaming of
the Arrow IPC format, what's its future?

~~~
tbenst
I'm also very confused about the relationship between Arrow's stability
guarantees and those of the on-disk Feather format. Can we safely switch from
Parquet to Feather for long-term data storage?

~~~
sandGorgon
Same question here: why Feather, and why is it named differently?

Also, are Parquet and Arrow the same? df.to_parquet('df.parquet.gzip',
compression='gzip') will not use Arrow, I presume? I have to use a separate
library to save to Parquet using Arrow. A bit confused.

~~~
lmeyerov
Graphistry uses parquet as a more stable and thus persistent storage format
when folks save data, and arrow for ephemeral internal data where we're ok
(and somewhat enjoy) version changes, as that just means code version
upgrades. There are performance differences in practice such as parquet having
more per-column compression modes built in, making it attractive for colder
storage, and arrow for in-memory/rpc/streaming/etc for similar reasons.

RE: Feather -- Arrow itself isn't necessarily a full file format -- you can
imagine memory buffers scattered all over the heap with giant gaps in between
because different columns were generated at different times -- but in practice
folks will indeed serialize consolidated buffers to disk (pa.Table -> write
stream -> file). If we couldn't do that, RPC wouldn't work ;-) My
understanding of Feather is that it standardizes ideas around this
consolidation, but we are able to save to disk (within versions) without it.
We found it more predictable to stick to ~Parquet for storage and Arrow buffer
passing for streaming, but now that Arrow's networking APIs for accelerated
bulk transfers may be stabilizing, there may be speed advantages to using them
over manual buffer streaming (while still sticking with Parquet for persistent
files).
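
Roughly, a minimal sketch of what that consolidation looks like with pyarrow's
feather module (file name made up); Feather V2 is essentially the Arrow IPC
file format written to disk:

    import pyarrow as pa
    import pyarrow.feather as feather

    # Build an in-memory Arrow table, then persist its buffers consolidated.
    table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.5, 0.9]})
    feather.write_feather(table, "data.feather")

    # Read it back as a pa.Table.
    roundtrip = feather.read_table("data.feather")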

Arrow<>Parquet conversion is super fast because of the co-design around
similar concepts: both use record batches of dense binary column buffers,
which means implementations can pointer-copy, memory-map, use bulk copy
primitives, etc., for zero-copy or at least highly accelerated interop. Python
RAPIDS GPU kernels can therefore selectively stream a few Parquet columns from
many Parquet files through a single 900 GB/s GPU, compute over them, and write
back out to Arrow or a new Parquet file.
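
For a concrete feel (hypothetical file/column names), the column selectivity
that makes this attractive looks like this in pyarrow:

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"user": ["a", "b"], "clicks": [10, 20], "ts": [1, 2]})
    pq.write_table(table, "events.parquet")

    # Only the named columns are decoded; the other column chunks are skipped.
    subset = pq.read_table("events.parquet", columns=["user", "clicks"])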

------
lmeyerov
Congrats! And a big thank you to Wes for being such a devoted community
leader!

We've been on quite the journey here:
[https://www.graphistry.com/blog/graphistry-2-29-5-upload-100...](https://www.graphistry.com/blog/graphistry-2-29-5-upload-100x-more-rapids-0-13-learnrapids-com-and-more)
. Think json -> protobuf -> arrow, paralleled in our compute work going from
parallel JS -> OpenCL+JS -> Python RAPIDS/CUDA for accelerated native compute
over it. The blog post demos the ~100X bigger & faster datasets that came from
supporting Arrow in our uploader, and RAPIDS in our parser for when you can't
send Arrow and want us to convert for you.

Something folks miss with Arrow, IMO, is that it is like Google protobufs for
everyone else. Arrow is not just a nice binary format; it is also ready
out-of-the-box for streaming, rich datatypes, and, for larger/longer-term
projects, standardized & self-describing schemas. If you've had to manually
decipher, maintain, and update generated protobuf schemas (because you aren't
Google, with all the internal ~protobuf tooling / integrations / infra /
etc.), that should sound pretty good ;-) A lot more to do, but already way
ahead of most other things, especially in aggregate.

------
neurobashing
Given the sheer number and scope of Apache projects, it would be nice to have
the title mention what it is (e.g., "Apache Arrow - in-memory analytics -
1.0.0"), especially when the linked page doesn't directly say.

~~~
wenc
Normally I would agree, but Apache Arrow already has significant name
recognition among the people who are likely to use it -- data
engineers/scientists etc. It's a library that provides an in-memory columnar
format and supplies a data engine currently used in Apache Spark, Pandas, and
libraries like turbodbc, helping these tools achieve high performance on
operations over tabular data.

Having a single high-performance in-memory format means different programs
can read and write the same data without serializing, copying, and
deserializing it. For instance, if you wanted to pass a huge table of data
from R to Java to Python (because your tools span different languages),
normally you'd have to copy and serde (Protobuf? JSON?) the data, which means
huge overheads. With Arrow, each of those languages can directly interact with
the same copy of in-memory data, in-process, with the highest possible
performance.

You also get the performance of columnar databases without implementing your
own columnar data structure.
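
A minimal sketch of that in Python (file name made up): one process writes an
Arrow IPC file, and any Arrow-capable reader can memory-map it and read it
back with essentially no copying or deserialization:

    import pyarrow as pa

    # Writer side: serialize a table in the Arrow IPC file format.
    table = pa.table({"x": [0, 1, 2, 3, 4], "y": [1.0, 2.0, 3.0, 4.0, 5.0]})
    with pa.OSFile("shared.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Reader side: memory-map and reconstruct the table zero-copy.
    with pa.memory_map("shared.arrow") as source:
        loaded = pa.ipc.open_file(source).read_all()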

But of course, there's no harm in adding a short description to the title to
broaden its audience. Arrow is truly something amazing, and the more people
who know about it the better. [1] Folks who program against traditional
databases might not know about it, and I think they should, especially if they
need to generate analytics (i.e., fast filtering/aggregation for dashboarding
or for data pipeline tasks).

[1] Overview:
[https://arrow.apache.org/overview/](https://arrow.apache.org/overview/)

~~~
buryat
where is it used in production?

~~~
wenc
Here's a list (and as mentioned in my comment, Apache Spark, Pandas, turbodbc)

[https://arrow.apache.org/powered_by/](https://arrow.apache.org/powered_by/)

------
tomhoule
I checked recently, and the Rust implementation of Arrow (Parquet as well) is
sadly still not usable in projects built with a stable version of the
compiler, because it relies on specialization.

There are some Jira issues on this, but there doesn't seem to be a consensus
on the way forward. Does someone have more information? Is the general idea to
wait for specialization to stabilise, or is there a plan, or even an
intention, to stop relying on it?

~~~
nemothekid
Last I checked, when I tried the library, the blocker on stable was
packed_simd, which provides vector instructions in Rust. I imagine the
arrow/datafusion folks aren't too keen on dropping vectorized instructions, as
that would be giving up a huge advantage, and I imagine it's used liberally
throughout the code.

As for stabilizing packed_simd, it's completely unclear to me when that will
land in stable Rust. I recently had a project where I just ended up calling
out to C code to handle vectorization.

EDIT: According to
[https://github.com/rust-lang/lang-team/issues/29](https://github.com/rust-lang/lang-team/issues/29),
the effort looks abandoned/deprioritized, so it may be a long time before it
sees stable Rust.

~~~
lordsunland
The main blocker is specialization, as it's used in several places in both the
parquet and arrow crates. Since this feature is unlikely to be stabilized,
we'll need to replace it with something else, but so far that has proven
challenging.

~~~
tomhoule
Thanks for clarifying that the plan is to stop relying on it :) Is there a
specific place/issue/ticket to discuss this? If time allows, I would be
interested in helping out.

~~~
lordsunland
There is
[https://issues.apache.org/jira/browse/ARROW-6717](https://issues.apache.org/jira/browse/ARROW-6717)
tracking moving to stable Rust, and
[https://issues.apache.org/jira/browse/ARROW-4678](https://issues.apache.org/jira/browse/ARROW-4678)
is also relevant. Feel free to create a sub-task under the former to remove
specialization :)

------
mands
Congrats on the 1.0 release!

We've been using Arrow for a few years at our startup (link in profile); it's
been great as a common format for passing tabular data between Python and
JavaScript.

We haven't used the in-memory/zero-copy features as much, but as a binary,
high-performance, typed format for the data analytics world it can't be beat.
And now that the Feather format is basically the Arrow format, I expect to see
it really take off as a common interchange and even storage format for medium-
to long-term projects.

(Also nice to see the reduced Python wheel size - a little bonus for the 1.0
release :) )

------
oxfordmale
We use Apache Arrow extensively to store data as Parquet files on S3. This is
a cheap way to store data that doesn't require the query speed of a relational
(or non-relational) database. The main advantage is that Parquet is a columnar
format, and it loses no information in transit, unlike the nightmare of CSV
files.

~~~
lsorber
Have you benchmarked this against pickling those data files? In our
experience, Parquet's overhead isn't worth it for smaller data files.

~~~
alfalfasprout
I just did some benchmarks and it's pretty similar for small files. The
difference would only be noticeable if you're serializing a ton of small
files.
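
For anyone curious, a minimal sketch of this kind of comparison (synthetic
data, file names made up):

    import pickle, time
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"a": np.random.rand(10_000), "b": np.random.rand(10_000)})

    t0 = time.perf_counter()
    with open("df.pkl", "wb") as f:
        pickle.dump(df, f)
    pickle_secs = time.perf_counter() - t0

    t0 = time.perf_counter()
    df.to_parquet("df.parquet", engine="pyarrow")
    parquet_secs = time.perf_counter() - t0

    print(f"pickle: {pickle_secs:.4f}s, parquet: {parquet_secs:.4f}s")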

~~~
lsorber
Huh, it makes a pretty big difference for us. We were using Pandas' built-in
to_parquet, though, which seems to suffer from some overhead.

------
unclepadre
Wes McKinney (Pandas creator and Arrow co-creator) will be giving an Arrow
talk live this Thursday for those who want to learn more ->
[https://subsurfaceconf.com/summer2020](https://subsurfaceconf.com/summer2020)

------
Abishek_Muthian
Apache Arrow is a nice in-memory data structure that is finding use in a wide
variety of projects, especially in data science thanks to its Feather format,
which can be used to save a dataframe from memory to disk.

This can be particularly useful for low-memory systems like ARM SBCs when
conducting long-duration research. If you want to build Apache Arrow from
source for ARM, I've written a how-to here [1].

[1][https://gist.github.com/heavyinfo/04e1326bb9bed9cecb19c2d603...](https://gist.github.com/heavyinfo/04e1326bb9bed9cecb19c2d603c8d521)

------
gwittel
Glad to see the 1.0.0 release. I'm hoping it will help reduce resistance when
I nudge people toward it. It's also good to see progress on the documentation;
for a long time it's been very hard to dig into anything but the C++ or Python
libraries.

As I try to move to Arrow/Flight over JSON or binary formats like Protobuf,
one thing I see missing is tooling such as schema-to-code generation. It would
be nice to see some progress or a roadmap in that direction as well.

------
snicker7
Polyglot, zero-copy serialization protocols (Arrow, FlatBuffers, Cap'n Proto)
are a trend I really appreciate.

I like to imagine a future where data is freed from the languages/tools used
to operate on it. In-memory objects would then be views into data stored in
shared memory (or on disk), with the data easily read and manipulated from
multiple languages.

------
lloydatkinson
I said this before, but:

Oh look, yet another Apache real-time/batch/big-data/stream-processing/
ingestion/workflow/whatever product.

    
    
      Apache Druid
      Apache Spark
      Apache Storm
      Apache Flink
      Apache Beam
      Apache Apex
      Apache Airavata
      Apache Samza
      Apache TEZ
      Apache Hama
    

It's basically a terrible joke at this point. There's no single Apache page
helping you decide which one you want, and they all seem to have such large
overlap. Most of them seem to have bad documentation and give the appearance
of not really being maintained. This puts me off even trying to use them. If
there's this much scope creep/NIH/reinventing the wheel happening across the
board, I can't imagine how bad each product is individually.

Apache Kafka seems to be the only exception...

~~~
wowi42
Apache Pulsar?

------
x87678r
When I first saw Arrow I thought the biggest benefit was a good file format
(Feather, I think, is the format). However, it seems to be designed for
passing tables between languages in the same process.

If I never use multiple languages in the same process, can I safely ignore it,
or are there other benefits?

~~~
lmeyerov
In Python, we use Arrow a lot for going between different frameworks: file ->
(pandas <> cudf <> cugraph). When we work with DB vendors, we now push them to
provide Arrow instead of whatever proprietary in-house thing they've been
pushing for the same purpose, e.g., slow JSON / ODBC, untyped CSV, or some
ad-hoc wire protocol. We also use it for going between our services, within a
server and across servers.

On the Node.js services + frontend JS side, there was no real equivalent tool
for interop -- the traditional solution is slowly round-tripping through
SQL/ORM or manually doing protobuf through some sort of pubsub -- so Arrow is
part of how we have been taming that mess too. It'll be a longer journey for
Arrow or something like it to get adopted in JS land, but I can see folks
doing TypeScript and serverless wanting it as a cleaner solution for typed
data, faster serialization & streaming, etc. (There is no true JS equivalent
of pandas, nor a typed variant.) We were a bit early here because we wanted
streaming through WebGL + OpenCL/CUDA, and while we are fans of typed data, we
found protobuf tooling to be too unintegrated and manual in practice.

~~~
infinite8s
How successful have you been in pushing database vendors to use Arrow on the
wire? We've started using turbodbc (to connect to SQL Server), but as I
understand it the data on the wire is still in ODBC's row-oriented wire
format, and turbodbc is responsible for packing it into Arrow's
column-oriented buffers.

As you mention in your second paragraph, Arrow is perfect for getting typed
query results from a database server directly to a web frontend with minimal
overhead. With the Transferable interface, it's even possible to transfer the
Arrow data zero-copy from the network buffer to a web worker.

------
yingw787
I wish Excel / Google Sheets supported Parquet export. I've worked with both
CSV files and Parquet files and found the latter a vast improvement: chunking
data is trivial, you have fixed-size data types, and the existing types are
well supported across a multitude of frameworks.

In comparison, CSVs are pretty much unstructured text files with added
suggestions.

More pipelines should natively support Parquet files, or something like them,
with a thin CSV-to-Parquet conversion layer on top.
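
Such a conversion layer can be pretty thin; a sketch with pyarrow (file names
made up):

    import pyarrow.csv as pacsv
    import pyarrow.parquet as pq

    table = pacsv.read_csv("export.csv")     # column types inferred from the CSV
    pq.write_table(table, "export.parquet")  # typed, chunked columnar output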

P.S. I'd love to create a Parquet file visualization tool (something like
[https://github.com/in3rsha/sha256-animation](https://github.com/in3rsha/sha256-animation))
in the near future.

------
e12e
Nice. Is it just me, or is Julia notably missing an up-to-date Arrow
implementation?

------
pyuser583
Yay! Apache Arrow is a great project. Whenever I use it, people think I'm some
kind of genius.

It should get a lot more press.

~~~
michaelcampbell
The docs are dire. There's lots and lots about internals and how it works, but
after a few minutes of searching, I've yet to see what it actually does or
what I'd want to use it for. Even the "use cases" are overly technical things
that are building blocks for ... something. What is that something?

~~~
asdffdsa
Judging from the FAQ
([https://arrow.apache.org/faq/](https://arrow.apache.org/faq/)), it seems
like it's a storage format for large structured datasets, designed
specifically for in-memory use and for serialization across the network. It
provides a spec for the data format, as well as implementations for several
programming languages.

That seems pretty straightforward; what part was confusing for you?

~~~
michaelcampbell
> what part was confusing for you?

The passive aggressive isn't necessary.

~~~
asdffdsa
His snark isn't necessary; the docs are fine.

------
maximilianburke
Big fan of Arrow (and Parquet) and we use it at UrbanLogiq extensively. It's a
wonderful toolset and great for data interop across multiple languages and
environments!

------
thamer
I used Apache Arrow to store and process a relatively large amount of data:
2-3B records with ~10 fields each (it still fits on one machine, but it's
large enough that I have to think about it a little bit).

I was originally using a simple binary format for fast decoding, and switched
to Arrow to be able to select only a few columns at a time. I was impressed by
the speed gains, and the size benefits of storing data in column order.

The one thing I wish Arrow had was a way to attach some metadata to its files
(if there is one, I haven't found it). I originally tried to write a small
header before starting to write the Arrow data, but that made it impossible to
read the data back: as far as I can tell, Arrow looks at the whole file,
stores its column definitions at the end of the file, and computes data
offsets from the start of the file, meaning there's nowhere left for me to
store anything.

It's still a very promising library and I'll definitely be checking out the
1.0 release.

~~~
lmeyerov
pyarrow's schema metadata in the header is for stuffing in uninterpreted
bytes, so we put semantic stringified JSON data there for version numbers etc.
that go beyond the data-representation typing. Super useful when maturing from
localized notebook code to software services standardized on Arrow interop.
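
A sketch of what I mean (keys/values are made up; the metadata map is just
uninterpreted bytes attached to the schema):

    import json
    import pyarrow as pa
    import pyarrow.feather as feather

    table = pa.table({"x": [1, 2, 3]})
    meta = {b"app_version": b"1.4.2",
            b"context": json.dumps({"run": 7}).encode()}
    table = table.replace_schema_metadata(meta)

    feather.write_feather(table, "data.feather")
    loaded = feather.read_table("data.feather")
    print(loaded.schema.metadata)  # custom metadata survives the round trip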

~~~
thamer
Oh cool thanks! I'll check it out.

------
mindv0rtex
How does this format compare to HDF5, which is common among scientific
developers? Are they meant for similar use cases or not?

~~~
tbenst
HDF5 has no support for dataframes/tables; projects like PyTables have been
built on top of it, but these are non-standard and don't work across
languages.

------
jakearmitage
So, maybe I'm stupid and have never faced enough big data, but what's the
advantage of this versus a custom script that queries A and inserts into B?

I do that kind of stuff all the time with Go, and it's pretty fast with 20-40
million records averaging 100 KB each. Are these tools oriented toward
billions instead of millions? What are the benefits?

~~~
wesm
Arrow:

* Standardizes binary interop and "serialization" of large structured data, removing all conversions / serialization at ingest and export boundaries. This alone can mean > 2-100x performance improvement in an application that processes a lot of data

* The Arrow in-memory format is an ideal data structure to code analytical algorithms against.
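
As a minimal sketch of the second point (made-up column names), pyarrow ships
compute kernels that run directly against the columnar buffers:

    import pyarrow as pa
    import pyarrow.compute as pc

    table = pa.table({"region": ["eu", "us", "eu"], "sales": [10, 25, 5]})

    # Filter and aggregate without converting to another representation.
    mask = pc.equal(table["region"], "eu")
    eu_sales = pc.sum(pc.filter(table["sales"], mask))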

Check out my 18min talk from a few years ago about the vision for the project
[https://www.youtube.com/watch?v=wdmf1msbtVs](https://www.youtube.com/watch?v=wdmf1msbtVs)

~~~
boomskats
This is a great talk and should be a top-level comment explaining this
release.

------
teleforce
Now that Arrow is stable, hopefully TileDB can properly support it, since they
really complement each other [1].

[1][https://news.ycombinator.com/item?id=15561090](https://news.ycombinator.com/item?id=15561090)

------
glogla
What data types are available in Arrow?

Avro and Parquet can't really do timestamp with timezone, which is a major
pain when processing data from geographically distributed locations.

But if Arrow can do it, then maybe there's hope for Parquet as well.

~~~
andygrove
You should be able to find this info in the specification:
[https://arrow.apache.org/docs/format/Columnar.html](https://arrow.apache.org/docs/format/Columnar.html)
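
For the timestamp-with-timezone case specifically, the Arrow type carries an
optional timezone; a minimal pyarrow sketch (values made up):

    import pyarrow as pa

    ts_type = pa.timestamp("us", tz="America/New_York")
    arr = pa.array([0, 1_595_592_000_000_000], type=ts_type)  # micros since epoch
    print(arr.type)  # timestamp[us, tz=America/New_York]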

------
billman
Congrats to the team for the 1.0 release.

------
dreamcompiler
"Columnar Memory Formats"

Maybe I'm just getting old, but back when we had to melt our own sand to build
a computer, we called these things arrays.

~~~
erik_seaberg
Arrays of structs fit pretty naturally into many languages, but structs of
arrays (and keeping corresponding values in order) are unusual enough to need
plumbing that probably isn't there.
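
A small illustration of the difference (made-up fields), using pyarrow for the
struct-of-arrays side:

    import pyarrow as pa

    # Array of structs: one object per record, natural in most languages.
    rows = [{"id": 1, "price": 9.5}, {"id": 2, "price": 3.2}]

    # Struct of arrays: one contiguous buffer per field, values kept in row order.
    table = pa.table({
        "id": pa.array([r["id"] for r in rows]),
        "price": pa.array([r["price"] for r in rows]),
    })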

------
stilisstuk
Say I have a 50 GB (bigger than RAM) CSV file I want to analyse locally with
R. Can I somehow use Apache Arrow for that?

------
Niccizero
Now that it has a 1.0 release, it might finally be allowed to have a Wikipedia
page.

------
mamcx
Is this format also good for CRUD-like/mixed workloads?

------
stokedmartin
Noob question -- what are some of the fundamental differences (performance-
related or not) between Thrift and Apache Arrow?

~~~
andygrove
Thrift is a serialization format. Arrow is a memory format.

------
endlessvoid94
Is this like MemSQL back in the day?

