
Apache Arrow and the “10 Things I Hate About Pandas” - jbredeche
http://wesmckinney.com/blog/apache-arrow-pandas-internals/
======
drej
I started with Python because of (or rather, thanks to) pandas; it was my
gateway drug. Over the past ~5 years I've done all sorts of things with it,
including converting the whole company I worked at. At one of my employers, I
sampled data from our big data platform (because it was tedious and slow to
work with) and used pandas instead.

All that being said, I'd stress pretty clearly that I never let a single line
of pandas into production. There are a few reasons that I've long wanted to
summarise, but just real quick: 1) It's a heavy dependency and things can go
wrong. 2) It can act in unexpected ways - throw an empty value into a list of
integers and you suddenly get floats (I know why, but still), or increase the
number of rows beyond a certain threshold and type inference works
differently. 3) It can be very slow, especially if your workflow is write-heavy
(at the same time it's blazing fast for reads and joins in most cases,
thanks to its columnar data structure). 4) The API evolves and breaking
changes are not infrequent - that's a great thing for exploratory work, but
not when you want to update libraries in your production stack.
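
A minimal illustration of point 2, assuming a recent pandas version (my own sketch, not drej's code): a single missing value silently promotes an integer column to float64, because NumPy integer arrays have no native missing-value representation.

    import pandas as pd

    ints = pd.Series([1, 2, 3])
    print(ints.dtype)         # int64

    # One missing value and the whole column quietly becomes floating point.
    with_gap = pd.Series([1, 2, None])
    print(with_gap.dtype)     # float64
    print(with_gap.tolist())  # [1.0, 2.0, nan]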

pandas is an amazing library, the best at exploratory work, bar none. But I
would not let it power some unsupervised service.

~~~
krit_dms
so you prototyped in pandas, and built production code around numpy arrays?

~~~
jawilson2
That's what we have done (algo trading). Our research backend uses pandas, but
we ended up taking about a month removing it from prod code. It does
surprising things with memory usage, and the functionality we needed was more
or less wrappers around numpy anyway. Most of our performance critical code is
in cython as well. For this trading application, speed obviously isn't the
biggest concern, so python+numpy is fine. It is C++/Java everywhere else
though.

~~~
sandGorgon
any opinions on cython vs numba? especially now that numba has GPU
acceleration

~~~
jawilson2
Never tried numba. I write all of our cuda stuff by hand anyway, and wrap that
into cython from c++ where needed.

------
stdbrouw
Wes seems to be very focused on performance and big data applications these
days, and of course it'd be great if Pandas could be used for bigger datasets,
but when I hear people complain about Pandas they complain about:

1. the weird handling of types and null values (#4)

2. the verbosity of filtering like `dataframe[dataframe.column == x]` and
transformations like `dataframe.col_a - dataframe.col_b`, compared to `dplyr`
in R

3. warts on the indexing system (including MultiIndex, which is very powerful
but confusing)

For those of us who use Pandas as an alternative to R, these usability
shortcomings matter way more than memory efficiency.
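
As a rough sketch of point 2 above (my own example, assuming current pandas; `df` is a hypothetical frame): `query` and `eval` are the built-in, slightly terser alternatives to the verbose spellings, though still not as fluid as `dplyr`.

    import pandas as pd

    df = pd.DataFrame({"column": ["x", "y", "x"],
                       "col_a": [5, 3, 9],
                       "col_b": [2, 2, 2]})

    # The verbose spellings referred to above:
    subset = df[df["column"] == "x"]
    diff = df["col_a"] - df["col_b"]

    # The somewhat terser built-in alternatives:
    subset2 = df.query("column == 'x'")
    diff2 = df.eval("col_a - col_b")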

~~~
vonnik
People interested in perf and big data ETL for the JVM may want to check out
DataVec:

[https://github.com/deeplearning4j/datavec](https://github.com/deeplearning4j/datavec)

[https://deeplearning4j.org/datavec](https://deeplearning4j.org/datavec)

It vectorizes/tensorizes most major data types to put them in shape for
machine learning. It also lets you save the data pipeline as a reusable
object.

------
geocar
I often want one of these middlewares.

Strings are a killer -- indeed any variable-length object makes array
programming tricky when it's nested, so a sound strategy is to intern your
strings first (in some way; KDB has enumerations, but language support isn't
necessary: hashing the strings and saving an inverted index works well enough
for a lot of applications). Interning strings means you see integers in your
data operations, which is about as un-fun to program in as it sounds. People
_want_ to be able to write something like:

    
    
        ….str.extract('([ab])(\d)', expand=False)
    

and then get disappointed that it's slow. Everything is slow when you do it a
few trillion times, but slow things are _really_ slow when you do them a few
trillion times.

If we think about how we _build_ our tables -- storing these as a single-byte
column (or even a bitmask) plus an int (or long) column -- then we get fast
again.
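
A minimal sketch of the interning idea (my own illustration, under the assumption that NumPy is available): `np.unique` dictionary-encodes a string column into small integer codes plus a lookup table, so the hot loop only touches integers.

    import numpy as np

    strings = np.array(["red", "blue", "red", "green", "blue"])

    # 'codes' are small integers pointing into 'table'; data operations run on
    # the integer column, and strings are only materialised at the very end.
    table, codes = np.unique(strings, return_inverse=True)

    counts = np.bincount(codes)  # a fast group-by-count on plain ints
    print({str(s): int(c) for s, c in zip(table, counts)})
    # {'blue': 2, 'green': 1, 'red': 2}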

However, it is clear that "fast" and "let's use JSON" are incompatible, and a
good middleware or storage system isn't going to make me trade one for the
other.

~~~
lifeisstillgood
Could you expand on this? (I'm asking because you clearly intend something you
have thought about a lot, but I am missing the point - it's me, not you.)

As far as I understand you want to handle nested arrays of strings in your
data. Ok

The "right" way is to build an index of the strings we are storing and then
store the index values (hashes of some kind) in the arrays as longs

This way our arrays deal only in numbers, and we hand-wavily search for or use
strings through some wrapper

Is this right?

And I am guessing the middleware you want does this transparently? Maybe
storing the index alongside the data in some fashion

~~~
geocar
> The "right" way is to build an index of the strings we are storing and then
> store the index values (hashes of some kind) in the arrays as longs

I don't think the rule is that firm:

You don't have to use longs: if there are only 256 unique values, why waste so
many bytes?

Meanwhile, if there are very many unique values, are longs even enough?

> And I am guessing the middleware you want does this transparently?

Maybe.

I personally think explicit is fine, as long as it's the easiest and most
obvious way to do it.

But I get that there are a lot of "data scientists" (probably the great
majority) who really struggle with (what I think are) basic data structures
-- they'll rarely produce an efficient solution, and we'll see "yet another"
article about how awk+sort is faster than a 160-node hadoop cluster...

> Maybe storing the index alongside the data in some fashion

Maybe not. What works for 30k strings falls over with 300m unique strings.

The real trick with nested objects is to invert the query -- to convert the
select statements you want to run, into an insert/upsert statement you run
when you're loading the data.
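
A rough sketch of "inverting the query" (my own example with made-up records): if you know ahead of time that you'll select by tag, build that lookup while loading instead of scanning nested objects at query time.

    from collections import defaultdict

    # Nested records as they might arrive from a JSON-ish source.
    records = [
        {"id": 1, "tags": ["a", "b"]},
        {"id": 2, "tags": ["b"]},
        {"id": 3, "tags": ["a", "c"]},
    ]

    # The "select all records with tag T" query, inverted into load-time work.
    by_tag = defaultdict(list)
    for rec in records:
        for tag in rec["tags"]:
            by_tag[tag].append(rec["id"])

    print(by_tag["a"])  # [1, 3]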

------
jampekka
I would love to see some sort of "smarter" indexing in the engine. I use
pandas quite a bit, but I've never really understood the rationale behind the
indexing, especially why indexes are treated so separately from data columns.
I seem to be resetting and recreating indexes all the time, and I use .values
a lot.

More SQL-style indexing would be a lot more intuitive at least for me.
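
For anyone who hasn't felt this friction, a small sketch (assuming current pandas): a grouped result moves the key into the index, so you end up round-tripping through reset_index/set_index, or dropping to .values, to treat it as ordinary data again.

    import pandas as pd

    df = pd.DataFrame({"city": ["a", "a", "b"], "sales": [1, 2, 5]})

    totals = df.groupby("city")["sales"].sum()  # 'city' is now the index
    flat = totals.reset_index()                 # back to a plain column
    again = flat.set_index("city")              # ...and back to an index
    raw = again["sales"].values                 # escape hatch to a bare numpy array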

~~~
nerdponx
I used to hate it, but I've come around to its usefulness in some cases.

However I do prefer the R data.table model, which is what you describe. You
can set an index on one or more columns in the table, and that's that.

------
rkwasny
Apache Arrow is the next big thing in data analytics. Imagine doing a SQL
query in MapD (it executes in milliseconds on a GPU), then passing the data
zero-copy into Python, doing some calculations in pandas, and outputting the
result to a web interface. Because everything is zero-copy, it can be done
faster than ever before.
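
A minimal sketch of the Python end of that hand-off (my own example, assuming the pyarrow package; the GPU/SQL side is out of scope here): an Arrow table converts to a pandas DataFrame without an intermediate serialization step.

    import pyarrow as pa

    # An Arrow table, as it might arrive from an Arrow-producing engine.
    table = pa.table({"price": [1.5, 2.0, 3.25], "qty": [10, 20, 5]})

    df = table.to_pandas()  # Arrow's columnar buffers handed over to pandas
    print((df["price"] * df["qty"]).sum())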

~~~
F_J_H
What is your "go to" web interface for displaying/visualizing data?

I use Oracle APEX because it has a killer "interactive report" feature (i.e. a
data grid on steroids), which enables non-programmers to easily filter,
aggregate, export, report on, etc., the data. However, although APEX is a free
option that comes with the DB, it ties you to Oracle.

It would be great if there was a similar, database independent, low-code tool
like APEX out there, so am curious what you have seen to work well.

~~~
tmostak
Have you tried the native visual analytics client that comes with MapD, MapD
Immerse?
([https://www.mapd.com/platform/immerse/](https://www.mapd.com/platform/immerse/)).
It’s not a full BI tool but is very good for interactively slicing and dicing
datasets from 1M to 100B rows. Here’s a demo of it on 11.6B rows of US Ship
data ([https://www.mapd.com/demos/ships](https://www.mapd.com/demos/ships)).

~~~
F_J_H
No, I have not - thank you for pointing that out. I'll take a look.

------
kornish
(Disclaimers: I don't have much experience using Python to build data science
products; potentially silly questions)

In industry, does Pandas tend to power the application layer, or does it find
more use as an exploratory data tool?

If the latter, do people prefer to push computation down into OLAP databases
for performance reasons?

And if so, what impact will the convergence of libraries and database
functionality have on product development? These features strike me as things
you'd find in a database, e.g. a query optimizer. I know in the past couple of
years there have been a few commercial acquisitions of in-memory execution
engines, e.g. Hyper by Tableau.

~~~
wesm
Wes here. From what I understand, pandas is the middleware layer powering
about 90% (if not more) of analytical applications in Python. It is what
people use for data ingest, data prep, and feature engineering for machine
learning models.

The existence of other database systems that perform equivalent tasks isn't
useful if they are not accessible to Python programmers with a convenient API.

~~~
FridgeSeal
This is exactly how I use it.

I pull the relevant data out of a production database, clean it, add relevant
columns, filter out trash data, and use seaborn to produce some simple plots
to see roughly what my data looks like and how it's structured; then it's off
to sklearn.

~~~
atupis
Definitely this, and I like how simple plotting is in pandas; usually
df['some_column'].plot() gives a decent plot out of the box.

------
Dowwie
Anyone here who is smart enough to write alternative solutions to the problems
that pandas solves is capable of making a meaningful contribution to the
pandas project, yet it seems that once pandas reaches the limits of its
usefulness people go off and write a proprietary solution, never giving back.

What stopped you from contributing improvements to pandas? Have you taken
alternate routes to open source your work?

~~~
antod
Not very familiar with pandas, but it looks like the author of this post is
the creator of pandas.

~~~
Dowwie
Yes, Wes is. My comment isn't about him.

------
robochat42
This is exciting stuff but will it have any downsides for the majority (??) of
users who don't use pandas for big data? Also, this all sounds very similar to
the Blaze ecosystem, whatever happened to that? Finally, will arrow/feather
replace hdf5 and bcolz in the future?

~~~
misnome
Blaze was also my thought - I'd love to know how this/these proposals match up
with what Blaze is doing/planning to do.

~~~
wesm
The first talk about Blaze was in November 2012. It was marketed as many
things over the years, including "pandas for Big Data". My understanding is
that Anaconda (fka Continuum Analytics) is no longer working on it.

------
sevensor
Pandas is the library I wish I'd had in the late '00s when my employer decided
our site license for JMP was too costly. Well, really pandas plus matplotlib
plus Jupyter notebooks. My job frequently involved creating plots and putting
them in Powerpoint. Often the same plot, day in and day out, with new data
from the production line. An interactive tool that can automate this, with a
low barrier to entry, can save an incredible amount of time. Since I
discovered pandas, I've been recommending it to anybody who works in a
putting-plots-in-powerpoint job. And there are a _lot_ of people who have jobs
like that.

~~~
taeric
That sounds more amenable to an Excel sheet, honestly. Which I suppose is not
that surprising, since spreadsheets were the original freeform notebook style
program.

~~~
sevensor
Excel is what we had to fall back on when they pulled our JMP license. There's
a lot you can do with Excel, but automating Excel is incredibly error-prone. I
came away from that experience with the conclusion that Excel is great as long
as you stick to writing formulas, but as soon as you start writing macros
things go bad in a hurry. And if you're not using macros for automation, that
means pasting data in by hand every day, which is quite possibly worse.

~~~
taeric
I feel like I could replace "Excel" with "Jupyter" and not much changes there,
honestly. Having a data ingest process in a notebook worries me because it
seems to take a ton of the engineering practices we have managed to get in
software, and completely ignore them.

And I get that I am being a little harsher than reality dictates. However, the
testing and "build" process that surrounds most notebooks is laughably close
to what we specifically moved away from in software when we said your build
should be standardized in an external file, not scripted against whichever IDE
you happen to be using.

Indeed, I am perplexed by folks that don't know how to move between IDEs or
who won't bother to understand how they are pulling dependencies into their
system. Notebooks, though, seem to embrace that.

Which, as I've indicated elsewhere, is great for interactive use, but seems a
major step backwards for serious solutions.

~~~
sevensor
The advantage I see with Jupyter is that at least there's a path from the
notebook to a proper programming language. Excel macros live in Excel and are
tied to the sheets in a particular workbook. Pandas / Matplotlib / Jupyter
lets you turn your exploratory analysis history into a script that runs
outside of Jupyter. That's a huge advantage -- schedule it to run at 5am for
your 6am meeting, and you can come in to work half an hour later! Excel macros
can do this, but because they rely on the interaction between formula
evaluation and procedural code, it's such a headache in comparison that it's
much less likely to be worth it. Overwrite the wrong cell and the whole thing
falls apart. I'll take the Python world any day.

~~~
taeric
Sadly, I'm cynical enough to think that just because there is a path, doesn't
mean it is encouraged or used. In fact, most of the excitement seems to be
about doubling down on the Jupyter infrastructure so that you can have
"executable notebooks."

I don't know why that bothers me, but it definitely does.

~~~
sevensor
> I don't know why that bothers me, but it definitely does.

It bothers me, and I can tell you why! Between pandas and matplotlib, the
royal road to liberation from routine analysis tasks is paved with Python
scripts. Jupyter has an important ancillary role in aiding discovery. But this
whole notion of "executable notebooks" seems designed to keep people in
bondage to fragile workflows based on capturing and replaying user input. It
caters to the lowest common denominator, to the one person on the team who
can't be trusted to read things. I'm infuriated on behalf of anybody subjected
to such foolishness.

~~~
taeric
I was really hoping someone would give a counter argument to this. Do you know
of any common "devil's advocates" in this vein?

~~~
sevensor
No, but I'd like to see them too. The closest I can come is what the now-
retired Jupyter Dashboards project had to say:

[http://jupyter-dashboards-layout.readthedocs.io/en/latest/us...](http://jupyter-dashboards-layout.readthedocs.io/en/latest/use-cases.html)

> Alice is a Jupyter Notebook user. Alice prototypes data access, modeling,
> plotting, interactivity, etc. in a notebook. Now Alice needs to deliver a
> dynamic dashboard for non-notebook users. Today, Alice must step outside
> Jupyter Notebook and build a separate web application. Alice cannot directly
> transform her notebook into a secure, standalone dashboard application.

I find this pretty unconvincing. The gap between "stuff I did in a notebook"
and a secure, let alone correct, application is nontrivial. There's no way for
Alice to do this without learning to write computer programs for real. And if
she does that, she'll find that it's a lot easier when you don't pull in a
huge dependency like Jupyter.

------
adwhit
I would be interested to know what Wes thinks of the Weld project, which seems
to have some similar goals, but takes the 'query planner' concept much
further.

[https://weld-project.github.io/](https://weld-project.github.io/)

~~~
timClicks
A deeper link outlining the Weld-pandas integration: [https://github.com/weld-project/weld/blob/master/python/griz...](https://github.com/weld-project/weld/blob/master/python/grizzly/README.md)

------
pleasecalllater
The pandas memory consumption is hilarious.

The last time I was trying to use pandas, it was on the Hacker News data dump.
It wasn't big. However, when pandas started using memory, my 32GB was just too
little.

I ended up just converting the data within postgres - much faster, with
sensible memory usage.

~~~
F_J_H
I'm always a little baffled as to why people don't just dump the data into a
database and then use SQL for further data manipulation and analysis.

I recently moved some data processing from Python/pandas into a database, and
with SQL the processing time went from several minutes to a couple of seconds
(and that on a tiny VM).

I understand that not everyone is familiar with databases and SQL, and so they
default to the toolset they know. But the performance gains can make learning
databases and SQL highly worthwhile. (And much can be learned in just a few
days, especially for those already familiar with working with data.)

~~~
jampekka
SQL is a total pain for many types of data. E.g. time-series analysis is
horrible, and large datasets are almost impossible to fetch into the
application due to hilariously inefficient serialization formats.

~~~
pleasecalllater
If large datasets are impossible to fetch from a SQL database, loading them
all into memory is even more impossible.

From my point of view, the SQL database can store the huge dataset, with its
changes, and I can iterate through the results and make lots of nice queries
that get only the data I need.
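
A small sketch of that iterate-instead-of-load pattern (my example; the query, table, and connection string are placeholders), using pandas' chunked reader so only one slice of the result set is in memory at a time:

    import pandas as pd
    import sqlalchemy

    engine = sqlalchemy.create_engine("postgresql://localhost/hn")  # placeholder DSN

    total = 0
    # With chunksize set, read_sql yields DataFrames instead of materialising
    # the whole result set at once.
    for chunk in pd.read_sql("SELECT score FROM items", engine, chunksize=100_000):
        total += chunk["score"].sum()
    print(total)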

~~~
jampekka
This depends largely on the use case. For example, PostgreSQL insists on
transferring all data using ASCII encoding, so e.g. high-sampling-rate
floating point sensor readings are extremely slow to fetch from the database.

And not all operations can be done incrementally by iterating through the
results.

~~~
anarazel
> This depends largely on the use case. For example, PostgreSQL insists on
> transferring all data using ASCII encoding, so e.g. high-sampling-rate
> floating point sensor readings are extremely slow to fetch from the
> database.

It doesn't. There's a binary version of the protocol. The output conversion
for that is near-trivial (a transformation to big-endian byte order).

~~~
jampekka
I stand corrected. The problem was due to Python drivers at the time not
supporting it. Now there appears to be asyncpg.
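
For the curious, a minimal sketch of asyncpg usage (my example; the DSN and table are placeholders). asyncpg speaks the binary protocol, so e.g. float8 columns come back as Python floats without text parsing:

    import asyncio
    import asyncpg

    async def main():
        conn = await asyncpg.connect("postgresql://localhost/sensors")  # placeholder DSN
        rows = await conn.fetch("SELECT ts, reading FROM samples LIMIT 5")
        for row in rows:
            print(row["ts"], row["reading"])  # values arrive already decoded
        await conn.close()

    asyncio.run(main())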

~~~
anarazel
Yea, the driver situation around the binary protocol isn't ideal :(. A bunch
of the newer drivers have it, but a lot of the old stuff doesn't. In
particular, transferring some columns in binary and others not isn't supported
widely enough - even though it's pretty crucial.

------
goatlover
How does Julia DataFrames compare to R & Pandas for the 11 issues he
mentioned?

------
twic
Noob question: what is the relationship between Arrow, Parquet, and ORC? Do we
need all three?

~~~
jamesblonde
Parquet and ORC are columnar on-disk data formats that power SQL-on-Hadoop
engines (Impala/SparkSQL and Hive, respectively). Arrow is an in-memory
representation (Parquet/ORC are on-disk). The idea is that workflows in
different languages or frameworks can share the same in-memory representation,
rather than having to rebuild it just because you're going from Spark to
another framework.
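
A small illustration of how the layers meet in Python (my own example, assuming pyarrow and a local Parquet file at a placeholder path): Parquet on disk, Arrow in memory, pandas on top.

    import pyarrow.parquet as pq

    table = pq.read_table("events.parquet")  # on-disk Parquet -> in-memory Arrow table
    print(table.schema)

    df = table.to_pandas()                   # Arrow columns -> pandas DataFrame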

------
wodenokoto
Does anyone know how data frames in R compare on these 10/11 points?

~~~
olympus
I'm not an R guru, but I did my thesis with 5k lines of R, so I'd call myself
proficient:

Point 10: R has lazy evaluation, which here means that a function will not be
evaluated when you define it; it will be evaluated when you call it (maybe not
quite the same as some other languages' lazy eval). I'm not aware of any
built-in feature for query planning: if you ask for nrow(some_func(myframe)),
it will evaluate the some_func(myframe) call and then count up the rows. You
could always write your own query planning function, I suppose.

Point 11: R has several multicore/cluster libraries, and some are actually
decent. If you are like most R users, you use StackOverflow a lot, and you'll
end up with one algorithm that uses the snow package and another algorithm
that uses multicore, one that uses parallel, and so on. A few very well
written packages have hooks that make going to multiple cores easy, but most
do not and you typically have to roll your own.

~~~
wesm
Hi, Wes here. R data frames have limited to no query planning and are not
multithreaded in general, so Problems 10 and 11 are problems in R also.

~~~
gigatexal
And R’s syntax leaves much to be desired: it’s ugly compared to python’s.

~~~
huac
To each their own - I find the grammar of dplyr and the tidyverse quite
elegant.

~~~
shoyer
Agreed! Sadly it is difficult to fully emulate in Python without some form of
delayed evaluation in the language.
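
For context, the closest pandas gets today is passing callables so that column references are resolved lazily against each intermediate frame - a rough approximation of dplyr-style piping (my own sketch):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 4, 2], "b": [1, 1, 1]})

    # Each lambda receives the frame produced by the previous step, which is
    # the nearest plain-Python stand-in for dplyr's delayed column references.
    out = (
        df.assign(diff=lambda d: d["a"] - d["b"])
          .query("diff > 0")
          .sort_values("diff")
    )
    print(out)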

------
sandGorgon
This is why I was hoping ONNX (Facebook and Microsoft's new machine learning
serialization format) would be built on top of Arrow rather than proto2.

Just like Feather is built on top of Arrow, ONNX could be based on top of
Arrow.

~~~
chubot
Wow I totally missed that it's based on protobuf 2. I thought Facebook used
thrift and not protobufs.

Well it's good to see that open source works and competitors can benefit from
each other's work.

------
StreamBright
This is huge. Performance matters (even in 2017) and we need to do things the
right way. Projects like Julia and Apache Arrow are paving the way for high
performance analytics, even for large data sets.

------
anentropic
> Logical operator graphs for graph dataflow-style execution (think TensorFlow
> or PyTorch, but for data frames)

> A multicore scheduler for parallel evaluation of operator graphs

Does anything like this already exist somewhere?

~~~
almostkorean
We are using Airflow for this

~~~
RHSman2
We use a luigi type app for this.

------
quotemstr
Pandas is definitely powerful, if somewhat mind-bending at first when you're
coming from a relational, SQL world. It's never been clear to me why Pandas
wasn't more copy-on-write from the start: it's difficult to predict which
operations copy.
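
A hedged example of the unpredictability being described (my own sketch; behaviour differs across pandas versions, and copy-on-write changes it again):

    import pandas as pd

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

    # Chained indexing: df[df["a"] > 1] returns a copy here, so the assignment
    # below does not touch df (older pandas emits SettingWithCopyWarning).
    df[df["a"] > 1]["b"] = 0

    # A single .loc call is unambiguous and actually modifies df in place.
    df.loc[df["a"] > 1, "b"] = 0
    print(df)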

------
Myrmornis
As has been said before, the problem with pandas is the confusing and hard-to-
remember python API.

------
dogruck
I really like this post.

Is there a list of major projects that are leveraging Apache Arrow?

~~~
olympus
It's not even two years old and doesn't have a 1.0 release. It wouldn't be
smart for "major" projects to be using it.

There are probably folks using it for exploration, but probably not in a
production system.

~~~
wesm
Yikes! I disagree with you -- these are assertions made on no evidence. This
software is suitable for production systems as long as you are OK with
occasional API changes / deprecations as the software evolves. We are not far
off from a 1.0 release, but release numbers are a synthetic construct anyway.

~~~
olympus
I'm not putting the project down, but your comment "as long as you are OK with
occasional API changes / deprecations" is exactly my point. Facebook doesn't
want a system borking because an Arrow dev decided that some API feature was
redundant or needed a new name. Presumably they have a test environment to
check for that sort of thing. They certainly don't want to refactor a large
code base because of some tiny API change.

If someone is using Pandas, Spark, or whatever for an important product, it's
probably best for them to maintain whatever underlying data layer until the
Arrow devs (I guess that means you) are willing to commit to a somewhat stable
API. A _stable_ API and a relatively bug-free experience is what typically
marks a 1.0 release.

There are plenty of smaller projects that should be perfectly happy to use the
0.7 release and grow/evolve as Arrow does. Especially when using Pandas+Arrow,
since it's probably not a production environment and I can spare a few hours
to fix a confusing bug.

~~~
wesm
I disagree with your premise that production systems require API stability in
all third-party dependencies.

~~~
dotancohen
Stability does not imply immutability, but rather being established.

In new projects it is common for a method or class to change direction or role
as it develops, which may bring with it a refactor of identifier names,
parameters, and such. After the class has been used for some time in different
contexts these changes happen less and less, hence we call the class stable. A
1.0 release implies this stability.

Third party dependencies are still free to mutate their APIs, but when
maintaining a production system you don't want to be playing whack-a-mole with
API changes every point release.

------
fish2000
tl;dr this article has no Native American weaponry review nor far-Eastern
bear-hate listicles, meh

------
wpietri
A tip for authors: briefly explain what you're talking about in the first
paragraph. Or at least link to it. Because then people don't have to go
hunting all over to find out that, e.g., pandas is a python data analysis
library: [http://pandas.pydata.org/](http://pandas.pydata.org/)

~~~
Blackthorn
This is directly from the blog of the Pandas creator. "What is Pandas" is
assumed to be known, for good reason, and that bikeshed is the correct color.

~~~
phaemon
_> "What is Pandas" is assumed to be known, for good reason_

What good reason? Why on earth would every single person on HN who happened to
click on that link know what `pandas` is? How would they even know whose blog
it is on?

Unless your blog is specifically, only for people who are already familiar
with your `thing` (in which case, a mailing list might be better), then it
simply makes sense to always have a header with a tagline explaining what your
`thing` is and a link back to the main project website. Just in case it gets
featured on HN or something.

~~~
Blackthorn
> Why on earth would every single person on HN who happened to click on that
> link know what `pandas` is? How would they even know whose blog it is on?

It's the author of Pandas's blog. He didn't submit it to HN; someone else did.
It's totally reasonable to expect that if people are reading his blog - the
blog of the author of Pandas - then they already know what Pandas is.

~~~
wpietri
Sorry, but I don't think it's reasonable to expect that. It's the web. People
link to things. Which is why I think it's reasonable to add a link somewhere
in the first paragraph to context.

