Apache Arrow and the “Things I Hate About Pandas” (wesmckinney.com)
394 points by jbredeche on Sept 26, 2017 | 139 comments



I started with Python because of (or rather, thanks to) pandas; it was my gateway drug. Over the past ~5 years I've done all sorts of things with it, including converting the whole company I worked at. At one of my employers, our big data platform was tedious and slow to work with, so I sampled the data and used pandas instead.

All that being said, I'd stress pretty clearly that I never let a single line of pandas into production. There are a few reasons that I've long wanted to summarise, but just real quick: 1) It's a heavy dependency, and things can go wrong. 2) It can act in unexpected ways: throw an empty value into a list of integers and you suddenly get floats (I know why, but still), or increase the number of rows beyond a certain threshold and type inference works differently. 3) It can be very slow, especially if your workflow is write-heavy (at the same time it's blazing fast for reads and joins in most cases, thanks to its columnar data structure). 4) The API evolves and breaking changes are not infrequent; that's a great thing for exploratory work, but not when you want to update libs in production.
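The int-to-float surprise in point 2 is easy to reproduce (a minimal sketch using the classic NumPy-backed dtypes; newer pandas versions also offer nullable integer types that avoid it):

```python
import pandas as pd

# A clean integer column keeps an integer dtype...
ints = pd.Series([1, 2, 3])
print(ints.dtype)  # int64

# ...but a single missing value forces promotion to float, because
# NumPy's int64 has no representation for NaN.
with_gap = pd.Series([1, 2, None])
print(with_gap.dtype)  # float64
```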

pandas is an amazing library, the best at exploratory work, bar none. But I would not let it power some unsupervised service.


> pandas is an amazing library, the best at exploratory work,

I will add ... "in python." That is definitely true, but to call it "the best at exploratory work" is not accurate. I might be opening up a completely separate debate, but for down and dirty exploratory work nothing beats R's dplyr and ggplot.

With that being said, I now do most of my work in python because of putting models into production. I also haven't had any issues with pandas in production; maybe because I'm not doing high throughput operations and our ML application is relatively lightweight.


You mean R's data.table and ggplot ;)


1) Conda helps with that quite a bit; Pandas is not a much heavier dependency than NumPy itself.

2) Depends a bit on your background, but to me this is not really unexpected. Integers don't have a well-defined "missing" value while Floats do, so Pandas is trying to help you by not using python objects and instead converting to the "most useful" array type. It only does so if it can convert the integers without loss of precision.

3) This one I totally get, I wrote a custom, msgpack-based serialisation due to that for our usage (before Arrow was around, seriously considering that for data exchange now).

4) Apart from the changes to `resample` all of those breaking changes had a prior `DeprecationWarning`, IIRC.


3) msgpack. Can you expand on that? I use this method to store/cache data for temporary passing between tasks in a workflow. I'm looking to Arrow to be able to extend the capability of this utility.


I needed a binary, stable format to pass large dataframes efficiently to our JS frontend as well as to other Python processes. The builtin msgpack converter in Pandas was very pandas-centric in that it was basically just a msgpack representation of Pandas's internal structures, which didn't fit our purposes.

In the end, I implemented a format that is essentially a list of columns, each having a specific type and some other metadata. The format for each column depends on the stored type, fallback is a msgpack sequence of strings/ints/dates mixed with nils, but array data is stored as-is and datetimes are stored either as an array of int64 with the unit attached or as begin, end, frequency.
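The commenter's actual format is msgpack-based and not public; as a rough illustration of the column-as-typed-block idea, here is a hypothetical sketch using only the standard library's struct module (names and layout invented):

```python
import struct

def pack_int64_column(name, values):
    """Pack one int64 column: name length, name, row count, then raw data."""
    name_b = name.encode("utf-8")
    payload = struct.pack(f"<I{len(name_b)}sI", len(name_b), name_b, len(values))
    payload += struct.pack(f"<{len(values)}q", *values)
    return payload

def unpack_int64_column(buf):
    """Reverse of pack_int64_column."""
    (name_len,) = struct.unpack_from("<I", buf, 0)
    name = buf[4:4 + name_len].decode("utf-8")
    (n,) = struct.unpack_from("<I", buf, 4 + name_len)
    values = list(struct.unpack_from(f"<{n}q", buf, 8 + name_len))
    return name, values

col = pack_int64_column("ts", [1, 2, 3])
print(unpack_int64_column(col))  # ('ts', [1, 2, 3])
```

The key property is the one described above: fixed-width array data is stored as-is, so a reader can reconstruct the column without parsing element by element.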


I just got a 3x speedup in initial tests using Arrow.

Then I found plasma and that has blown my mind.

@wesm, how hard have you pushed plasma?


Holy crap, thank you for the tip about Plasma. That project looks amazing. I've been in desperate need for something like this.

For everyone else: http://arrow.apache.org/blog/2017/08/08/plasma-in-memory-obj...

> Plasma, an in-memory object store that is being developed as part of Apache Arrow. Plasma holds immutable objects in shared memory so that they can be accessed efficiently by many clients across process boundaries.

...

> One of the goals of Apache Arrow is to serve as a common data layer enabling zero-copy data exchange between multiple frameworks. A key component of this vision is the use of off-heap memory management (via Plasma) for storing and sharing Arrow-serialized objects between applications.


Gotcha. Thanks. I see how arrow will help!


Ditto here (prop trading): pandas in notebooks and for quick scripts (quick to hack together, not to run); pure NumPy or C++ via pybind11 for releases.


pandas is a great example of data science code at its best and its worst. If you look at the source code, you will see that every object and every function allows for way too many variations in input options, and therefore contains about 20 conditional statements. For instance, I believe DataFrame's init method can take a dictionary, DataFrame, Series, etc., versus a class method for each one. Contrast that with requests, where the public interface is a nice requests.get, requests.post. Yet, have a csv file you are only loading once or twice to peek at? Then it's super efficient. I think my biggest issue is all the effort that goes into pandas-like APIs, e.g. https://github.com/ibis-project/ibis. To me, it doesn't make sense to take something stable and known (SQL) and build a complex DSL so it works like pandas.
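The constructor polymorphism being criticized is easy to see; all three of these produce the same frame:

```python
import pandas as pd

# One __init__ accepts dicts, nested lists, Series, other DataFrames...
from_dict = pd.DataFrame({"a": [1, 2]})
from_list = pd.DataFrame([[1], [2]], columns=["a"])
from_series = pd.DataFrame(pd.Series([1, 2], name="a"))

assert from_dict.equals(from_list) and from_dict.equals(from_series)
```

Each input shape goes through its own branch inside the constructor, which is exactly the conditional sprawl the comment describes.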


So you prototyped in pandas, and built production code around numpy arrays?


Production wouldn't usually be in Python, but if it were, it'd probably be numpy (if it was numerical). numpy is also fairly heavy (we'd usually exclude MKL for that reason), but it's less 'smart' (fewer defaults, more explicit in most places), so it's a lot safer.


That's what we have done (algo trading). Our research backend uses pandas, but we ended up taking about a month removing it from prod code. It does surprising things with memory usage, and the functionality we needed was more or less wrappers around numpy anyway. Most of our performance critical code is in cython as well. For this trading application, speed obviously isn't the biggest concern, so python+numpy is fine. It is C++/Java everywhere else though.


Any opinions on Cython vs. numba? Especially now that numba has GPU acceleration.


Never tried numba. I write all of our cuda stuff by hand anyway, and wrap that into cython from c++ where needed.


We went through sort of a similar exercise. The features that pandas provided were compelling. For example, our main research guy uses R, so something like data frames was wanted. My conclusion was that pandas was too heavy to add as a dependency, however; sloccount says it's about 200k lines of code.

Instead, I wrote a small wrapper around numpy to provide a data frame like object (850 lines of code by sloccount). So far, this has worked well for us.
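The internal wrapper isn't public; as a hypothetical sketch of what a few hundred lines over NumPy buys you (all names invented):

```python
import numpy as np

class Frame:
    """Minimal data-frame-like object: a dict of named NumPy columns."""

    def __init__(self, **columns):
        self.cols = {k: np.asarray(v) for k, v in columns.items()}

    def __getitem__(self, name):
        return self.cols[name]

    def where(self, mask):
        # Apply one boolean mask to every column, returning a new Frame.
        return Frame(**{k: v[mask] for k, v in self.cols.items()})

f = Frame(a=[1, 2, 3], b=[10.0, 20.0, 30.0])
g = f.where(f["a"] > 1)
print(g["b"])  # [20. 30.]
```

A wrapper like this keeps the explicit NumPy semantics (no type inference, no hidden index) while giving back the named-column ergonomics.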


Wes seems to be very focused on performance and big data applications these days, and of course it'd be great if Pandas could be used for bigger datasets, but when I hear people complain about Pandas they complain about:

1. the weird handling of types and null values (#4)

2. the verbosity of filtering like `dataframe[dataframe.column == x]` and transformations like `dataframe.col_a - dataframe.col_b`, compared to `dplyr` in R

3. warts on the indexing system (including MultiIndex, which is very powerful but confusing)

For those of us who use Pandas as an alternative to R, these usability shortcomings matter way more than memory efficiency.
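For what it's worth, some of dplyr's pipeline flow can be approximated with method chaining, at the cost of lambdas (a sketch with made-up column names):

```python
import pandas as pd

df = pd.DataFrame({"col_a": [5, 1, 4], "col_b": [2, 3, 1]})

# Roughly: df %>% mutate(diff = col_a - col_b) %>% filter(diff > 0)
out = (df.assign(diff=lambda d: d.col_a - d.col_b)
         .loc[lambda d: d["diff"] > 0])
print(out)
```

It works, but the `lambda d:` boilerplate is precisely the verbosity being complained about; dplyr gets to omit it because R has delayed evaluation.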


People interested in perf and big data ETL for the JVM may want to check out DataVec:

https://github.com/deeplearning4j/datavec

https://deeplearning4j.org/datavec

It vectorizes/tensorizes most major data types to put them in shape for machine learning. It also lets you save the data pipeline as a reusable object.


I too would welcome a friendlier pandas library, but every time I've tried to think of an API that would work I fail. Well actually, I keep on wanting pandas to understand SQL.


There’s a pandasql library which lets you execute SQL on dataframes. It’s a little slower because it needs to serialize via SQLite, but it’s a quick way to get going.

https://pypi.python.org/pypi/pandasql
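Under the hood pandasql does essentially this round-trip, which you can also sketch by hand with the standard library's sqlite3 module:

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Serialize the frame into an in-memory SQLite table, then query it.
con = sqlite3.connect(":memory:")
df.to_sql("df", con, index=False)
out = pd.read_sql("SELECT a, b FROM df WHERE a > 1", con)
print(out)
```

The copy into SQLite is the "little slower" part; for moderate frames it's a perfectly workable way to get SQL semantics over pandas data.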


I will definitely echo (2). dplyr is amazing and works in far, far fewer lines of code than pandas. That was my largest issue when migrating over from R.

There is dplython, but it doesn't quite work the same so I don't use it much. https://github.com/dodger487/dplython


I created plydata; you may find it sufficient for your needs.

https://github.com/has2k1/plydata


Is the only difference the placeholder X? I was running into issues with dplython in executing arbitrary functions outside the tidyverse. Can your package handle situations such as this?

df %>% select(var1, var2) %>% rbind(df2) %>% na.omit()

etc.? That was the big benefit I saw from using dplyr.


Yes, there is no X placeholder. And at the moment you cannot pipe to arbitrary functions (a Python limitation). I'll get around this by providing a helper function, e.g.

    df >> call(pd.dropna, axis=1)


#2 is a big issue for me. Filtering and subsetting are really arcane-feeling transformations, there's a lot of weirdness with view vs. copy, etc.


You can write 2) as dataframe.query("column == @x") (the @ pulls in the local variable x).
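A quick check that the two spellings agree (note @x to reference a local variable inside the query string):

```python
import pandas as pd

df = pd.DataFrame({"column": [1, 2, 2], "other": [10, 20, 30]})
x = 2

mask_style = df[df.column == x]          # boolean-mask spelling
query_style = df.query("column == @x")   # query-string spelling
assert mask_style.equals(query_style)
```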


I often want one of these middlewares.

Strings are a killer -- indeed any variable-length object makes array programming tricky when it's nested, so a sound strategy is to intern your strings first (in some way; KDB has enumerations, but language support isn't necessary: hashing the strings and saving an inverted index works well enough for a lot of applications). Interning strings means you see integers in your data operations, which is about as un-fun to program in as it sounds. People want to be able to write something like:

    ….str.extract('([ab])(\d)', expand=False)
and then get disappointed that it's slow. Everything is slow when you do it a few trillion times, but slow things are really slow when you do them a few trillion times.

If we think about how we build our tables, we can store these as a single-byte column (or even a bitmask) and an int (or long) column, then we get fast again.
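Concretely, for the '([ab])(\d)' example above, the idea is to do the extraction once at load time and store the captured parts as typed arrays (a sketch; the data is made up):

```python
import numpy as np

raw = ["a1", "b2", "a3"]

# Extract once on ingest, then store the letter as a single-byte column
# and the digit as an integer column. Queries then touch only fixed-width
# numeric arrays instead of re-parsing object strings every time.
letters = np.frombuffer("".join(s[0] for s in raw).encode("ascii"), dtype=np.uint8)
digits = np.array([int(s[1]) for s in raw], dtype=np.int64)

print(letters)  # [97 98 97]
print(digits)   # [1 2 3]
```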

However, it is clear that "fast" and "let's use JSON" are incompatible, and a good middleware or storage system isn't going to make me trade one for the other.


Could you expand on this? (I'm asking because you clearly intend something you have thought about a lot, but I am missing the point; it's me, not you.)

As far as I understand you want to handle nested arrays of strings in your data. Ok

The "right" way is to build an index of the strings we are storing and then store the index values (hashes of some kind) in the arrays as longs

This way our arrays are doing numbers and we handwavy search for or use strings through some wrapper

Is this right?

And I am guessing the middleware you want does this transparently? Maybe storing the index alongside the data in some fashion
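For what it's worth, the core of the scheme can be sketched in a few lines (hypothetical; real systems like KDB's enumerations or Arrow's dictionary encoding do this with more care about code width and sizing):

```python
# String interning: store each unique string once, keep integer codes.
values = ["GOOG", "AAPL", "GOOG", "MSFT", "AAPL"]

index = {}   # string -> code, assigned in first-seen order
codes = []
for s in values:
    codes.append(index.setdefault(s, len(index)))

inverted = list(index)  # code -> string (dicts preserve insertion order)
print(codes)            # [0, 1, 0, 2, 1]
assert [inverted[c] for c in codes] == values
```

Array operations (group-by, joins, comparisons) then run over the integer codes, and strings are only materialized at the edges.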


> The "right" way is to build an index of the strings we are storing and then store the index values (hashes of some kind) in the arrays as longs

I don't think the rule is that firm:

You don't have to use longs because if there are only 256 unique values, why waste so many bytes?

Meanwhile if there are so many unique values, is longs enough?

> And I am guessing the middleware you want does this transparently?

Maybe.

I personally think explicit is fine, as long as it's the easiest and most obvious way to do it.

But I get that there are a lot of "data scientists" (probably the great majority) who really struggle with (what I think are) basic data structures -- they'll rarely produce an efficient solution, and we'll see "yet another" article about how awk+sort is faster than a 160 node hadoop cluster...

> Maybe storing the index alongside the data in some fashion

Maybe not. What works for 30k strings falls over with 300m unique strings.

The real trick with nested objects is to invert the query -- to convert the select statements you want to run, into an insert/upsert statement you run when you're loading the data.


I would love to see some sort of "smarter" indexing in the engine. I use pandas quite a bit, but I've never really understood the rationale behind the indexing, especially why indexes are treated so separately from data columns. I seem to be resetting and recreating indexes all the time, and use the .values a lot.

More SQL-style indexing would be a lot more intuitive at least for me.


I used to hate it, but I've come around to its usefulness in some cases.

However I do prefer the R data.table model, which is what you describe. You can set an index on one or more columns in the table, and that's that.


Apache Arrow is the next big thing in data analytics. Imagine doing a SQL query in MapD (it executes in milliseconds on the GPU), then passing the data zero-copy into Python, doing some calculations in pandas, and outputting the result to a web interface. Because everything is zero-copy, it can be done faster than ever before.


What is your "go to" web interface for displaying/visualizing data?

I use Oracle APEX because it has a killer "interactive report" feature (i.e. a data grid on steroids), which enables non-programmers to easily filter, aggregate, export, report on, etc., the data. However, although APEX is a free option that comes with the DB, it ties you to Oracle.

It would be great if there was a similar, database independent, low-code tool like APEX out there, so am curious what you have seen to work well.


Have you tried the native visual analytics client that comes with MapD, MapD Immerse? (https://www.mapd.com/platform/immerse/). It’s not a full BI tool but is very good for interactively slicing and dicing datasets from 1M to 100B rows. Here’s a demo of it on 11.6B rows of US Ship data (https://www.mapd.com/demos/ships).


No, I have not - thank you for pointing that out. I'll take a look.


(Disclaimers: I don't have much experience using Python to build data science products; potentially silly questions)

In industry, does Pandas tend to power the application layer, or does it find more use as an exploratory data tool?

If the latter, do people prefer to push computation down into OLAP databases for performance reasons?

And if so, what impact will the convergence of libraries and database functionality have on product development? These features strike me as things that you'd find in a database, e.g. query optimizer. I know in the past couple years there have been a couple commercial acquisitions of in-memory execution engines, e.g. Hyper by Tableau.


We currently use a combination of Pandas and Scikit-Learn to run our production models. We're not in the big data space; instead, we create small, tightly tuned models for a very specific purpose at a large energy company.

At the moment the general work flow is:

* An internal library built on top of Pandas, which abstracts our mess of internal databases

* Application specific model code that utilises the internal library to pull data in. This is then fed into a trained scikit-learn model and then further processed by Pandas.

* Internal monitoring tools (dashboards based upon Ploty and Flask as well as an alerting system) are built using the internal library and Pandas as the glue.

From a design decision we focused upon Pandas as the root source of all data. Everything is a DataFrame throughout the entire application.

Painpoints:

* Writing to a database is pretty painful (SQL Server here as Windows shop).

* Minor API changes can be irritating.

* Pandas MultiIndexing is both very painful and mind-bending when trying to get the slice syntax to work.

Overall though, Pandas is a huge value add and we've gradually rolled out from 2 people to approximately 9-10 people who hadn't used python in anger before.

Almost all reporting functionality is being migrated into Pandas instead of SQL stored procs, Excel, Tableau, etc. for the additional flexibility it provides.


Wes here. From what I understand, pandas is the middleware layer powering about 90% (if not more) of analytical applications in Python. It is what people use for data ingest, data prep, and feature engineering for machine learning models.

The existence of other database systems that perform equivalent tasks isn't useful if they are not accessible to Python programmers with a convenient API.


This is exactly how I use it.

I pull the relevant data out of a production database, clean it, add relevant columns, filter out trash data, and use seaborn to produce some simple plots to see approximately what my data looks like and how it's structured; then it's off to sklearn.


Definitely this, and I like how simple plotting is in pandas; usually df['some_column'].plot() gives a decent plot out of the box.


We use pandas as a last-mile library for offline exploration. Typically, datasets have been sampled or aggregated enough that they can fit inside 10gb so you can work with them comfortably using pandas. I don't like using pandas in prod because the performance is really sensitive to stuff like missing a type declaration or calling the wrong method and the API is really convoluted.

E: that's not to say pandas isn't good. It's really good. Thanks for the software, Wes!


> In industry, does Pandas tend to power the application layer, or does it find more use as an exploratory data tool?

My experience echoes yours. Pandas, from my observation, is more of a post-modeling tool that people use to further process data they digest from a DB query or Spark jobs.

After reading through the Arrow homepage, I am left somewhat baffled about where it sits. If my reading is correct, it is a client-side protocol that abstracts away the underlying data storage implementations? If so, isn't it still limited by how much data the client machine can handle? Or is the benefit the unified interface for accessing different storage systems? No matter what, it seems pretty ambitious. Looking forward to seeing how it goes.


I recommend my JupyterCon keynote to explain how it fits into the picture: https://www.youtube.com/watch?v=wdmf1msbtVs

Data processing systems need runtime memory formats. Arrow is an efficient one for analytical data processing. It has the additional benefit of zero-copy data interchange layer for sharing memory between processes written in any language.


I'll be honest, it took me a while to really 'get' what it's for; I think an example use case would really help. I now get it, and it applies to my use case 100% (probably because my use case evolved from mirroring your lead), but it took a while to understand this somewhat intangible concept (for a non computer scientist).


I usually use pandas as an exploratory data tool and then rewrite the code using numpy, because pandas needs much more memory and is a lot slower.


Anyone here who is smart enough to write alternative solutions to the problems that pandas solves is capable of making a meaningful contribution to the pandas project, yet it seems that once pandas reaches the limits of its usefulness people go off and write a proprietary solution, never giving back.

What stopped you from contributing improvements to pandas? Have you taken alternate routes to open source your work?


Not very familiar with pandas, but it looks like the author of this post is the creator of pandas.


Yes, Wes is. My comment isn't about him.


You probably never took a deep look at pandas. It is a very complex library with lots of dependencies. It is not surprising that it is easier to implement an alternative rather than change the existing one.


There are pretty huge barriers to open sourcing work in certain large companies.


This is exciting stuff but will it have any downsides for the majority (??) of users who don't use pandas for big data? Also, this all sounds very similar to the Blaze ecosystem, whatever happened to that? Finally, will arrow/feather replace hdf5 and bcolz in the future?


Blaze was also my thought - I'd love to know how this/these proposals match up with what Blaze is doing/planning to do.


The first talk about Blaze was in November 2012. It was marketed as many things over the years, including "pandas for Big Data". My understanding is that Anaconda (fka Continuum Analytics) is no longer working on it.


Pandas is the library I wish I'd had in the late '00s when my employer decided our site license for JMP was too costly. Well, really pandas plus matplotlib plus Jupyter notebooks. My job frequently involved creating plots and putting them in Powerpoint. Often the same plot, day in and day out, with new data from the production line. An interactive tool that can automate this, with a low barrier to entry, can save an incredible amount of time. Since I discovered pandas, I've been recommending it to anybody who works in a putting-plots-in-powerpoint job. And there are a lot of people who have jobs like that.


I'd add seaborn to that; it works perfectly with pandas DataFrames. Often it creates exactly what you want with minimal input (just a dataframe), i.e.:

    import seaborn as sns
    sns.violinplot(data=dataframe)

Set %matplotlib inline and you don't need more commands in your notebook.


That sounds more amenable to an Excel sheet, honestly. Which I suppose is not that surprising, since spreadsheets were the original freeform notebook style program.


Excel is what we had to fall back on when they pulled our JMP license. There's a lot you can do with Excel, but automating Excel is incredibly error-prone. I came away from that experience with the conclusion that Excel is great as long as you stick to writing formulas, but as soon as you start writing macros things go bad in a hurry. And if you're not using macros for automation, that means pasting data in by hand every day, which is quite possibly worse.


I feel like I could replace "Excel" with "Jupyter" and not much changes there, honestly. Having a data ingest process in a notebook worries me because it seems to take a ton of the engineering practices we have managed to get in software, and completely ignore them.

And I get that I am being a little harsher than reality dictates. However, the testing and "build" process that surrounds most of "notebooks" is laughably like what we specifically avoided in software when we said your build should be standardized in an external file. And not scripting the main IDE that you happen to be using.

Indeed, I am perplexed by folks that don't know how to move between IDEs or who won't bother to understand how they are pulling dependencies into their system. Notebooks, though, seem to embrace that.

Which, as I've indicated elsewhere, is great for interactive use, but seems a major step backwards for serious solutions.


The advantage I see with Jupyter is that at least there's a path from the notebook to a proper programming language. Excel macros live in Excel and are tied to the sheets in a particular workbook. Pandas / Matplotlib / Jupyter lets you turn your exploratory analysis history into a script that runs outside of Jupyter. That's a huge advantage -- schedule it to run at 5am for your 6am meeting, and you can come in to work half an hour later! Excel macros can do this, but because they rely on the interaction between formula evaluation and procedural code, it's such a headache in comparison that it's much less likely to be worth it. Overwrite the wrong cell and the whole thing falls apart. I'll take the Python world any day.


Sadly, I'm cynical enough to think that just because there is a path, doesn't mean it is encouraged or used. In fact, most of the excitement seems to be about doubling down on the Jupyter infrastructure so that you can have "executable notebooks."

I don't know why that bothers me, but it definitely does.


> I don't know why that bothers me, but it definitely does.

It bothers me, and I can tell you why! Between pandas and matplotlib, the royal road to liberation from routine analysis tasks is paved with Python scripts. Jupyter has an important ancillary role in aiding discovery. But this whole notion of "executable notebooks" seems designed to keep people in bondage to fragile workflows based on capturing and replaying user input. It caters to the least common denominator, to the one person on the team who can't be trusted to read things. I'm infuriated on behalf of anybody subjected to such foolishness.


I was really hoping someone would give a counter argument to this. Do you know of any common "devil's advocates" in this vein?


No, but I'd like to see them too. The closest I can come is what the now-retired Jupyter Dashboards project had to say:

http://jupyter-dashboards-layout.readthedocs.io/en/latest/us...

> Alice is a Jupyter Notebook user. Alice prototypes data access, modeling, plotting, interactivity, etc. in a notebook. Now Alice needs to deliver a dynamic dashboard for non-notebook users. Today, Alice must step outside Jupyter Notebook and build a separate web application. Alice cannot directly transform her notebook into a secure, standalone dashboard application.

I find this pretty unconvincing. The gap between "stuff I did in a notebook" and a secure, let alone correct, application is nontrivial. There's no way for Alice to do this without learning to write computer programs for real. And if she does that, she'll find that it's a lot easier when you don't pull in a huge dependency like Jupyter.


Look into Jupyter dashboard


Yikes, I didn't know about this. I would strongly discourage this sort of thing. There's a world of difference between exploratory data analysis and BI platforms. Just because they both produce scatter plots, doesn't mean they should work the same way. Confusion about this, in my opinion, is how TIBCO basically destroyed Spotfire. Fortunately, the dashboards project seems not to be moving forward.

https://github.com/jupyter/enhancement-proposals/blob/master...

Also, there's a world of difference between this and the way I'd recommend people use Jupyter. Jupyter is great for exploratory data analysis. It captures every step you take along the way, especially if you're disciplined about not reusing cells. At the end, you have something you can paste into Powerpoint. If the boss asks you for that same plot the next day, you don't reuse the notebook -- that means repeating all of your mistakes. You pull out the parts of the analysis you want to keep into a Python script, and you run that in the future. In no way is it a good idea to try to use the notebook operationally.


I would be interested to know what Wes thinks of the Weld project, which seems to have some similar goals, but takes the 'query planner' concept much further.

https://weld-project.github.io/


A deeper link outlining the Weld-pandas integration https://github.com/weld-project/weld/blob/master/python/griz...


I see Weld as an embeddable component in systems like pandas, not a replacement.


The pandas memory consumption is hilarious.

The last time I tried to use pandas, it was on the Hacker News data dump. It wasn't big, but once pandas started allocating memory, my 32GB was just too little.

I ended up converting the data within postgres instead: much faster, with sensible memory usage.


I'm always a little baffled as to why people just don't dump the data into a database, and then use SQL for further data manipulation and analysis.

I recently moved some data processing from Python/pandas into a database, and with SQL, the processing time went from several minutes to a couple seconds, (and that on a tiny VM).

I understand that not everyone is familiar with databases and SQL, and so default to the toolset they know. But, the performance gains can make learning databases and SQL highly worthwhile. (And, much can be learned in a just few days, especially for those already familiar with working with data.)


SQL is a total pain for many types of data. Eg time-series analysis is horrible and large datasets are almost impossible to fetch to the application due to the hilariously inefficient serialization formats.


If large datasets are impossible to fetch from a SQL database, loading them into memory is even harder.

From my point of view, the SQL database can store the huge dataset, with its changes, and I can iterate through the results and make lots of nice queries that fetch only the data I need.


This depends largely on the use case. For example, PostgreSQL insists on transferring all data using ASCII encoding, so e.g. high-sampling-rate floating point sensor readings are extremely slow to fetch from the database.

And not all operations can be done incrementally by iterating through the results.


> This depends largely on the use case. For example, PostgreSQL insists on transferring all data using ASCII encoding, so e.g. high-sampling-rate floating point sensor readings are extremely slow to fetch from the database.

It doesn't. There's a binary version of the protocol. The output conversion for that is near trivial (transformation to big endian).


I stand corrected. The problem was due to Python drivers at the time not supporting it. Now there appears to be asyncpg.


Yea, the driver situation around the binary protocol isn't ideal :(. A bunch of the newer drivers have it, but a lot of the old stuff doesn't. In particular, transferring some columns in binary and others not isn't supported widely enough, even though it's pretty crucial.


hmmm...that's not been my experience, but YMMV.


At what point did your memory consumption go too high?

https://stackoverflow.com/questions/25962114/how-to-read-a-6...


At the point when it was loading the csv file with the data; I think it was 5-6GB. My machine has 32GB of RAM, and when the memory usage of the python process alone increased to 35GB, I just stopped it. I tried a couple of times, with similar results.

I loaded that csv file into postgres, and the database was a similar size; with indexing it was 15GB on disk. All queries were quite fast.

So instead of loading the data into pandas and searching there, I just wrote some SQL and got the same results, only much faster and with much smaller memory usage.
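When the frame genuinely doesn't need to be in memory all at once, pandas can also stream a csv in chunks, which bounds peak memory (a sketch using an in-memory buffer so it's self-contained):

```python
import io
import pandas as pd

# Stand-in for a multi-GB file on disk.
csv = io.StringIO("value\n1\n2\n3\n4\n5\n")

total = 0
for chunk in pd.read_csv(csv, chunksize=2):  # each chunk is a small DataFrame
    total += int(chunk["value"].sum())

print(total)  # 15
```

This only helps for aggregations that can be computed incrementally; anything that needs the whole table at once still hits the memory wall described above.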


How does Julia DataFrames compare to R & Pandas for the 11 issues he mentioned?


Noob question: what is the relationship between Arrow, Parquet, and ORC? Do we need all three?


Parquet and ORC are columnar on-disk data formats that power SQL-on-Hadoop engines (Impala/Spark SQL and Hive, respectively), while Arrow is an in-memory representation. The idea is that workflows in different languages or frameworks can share the same in-memory representation, without having to rebuild it just because you're going from Spark to another framework.


Does anyone know how dataframe in R compares on these 10/11 points?


I'd say native data frames in R aren't great at these (maybe #4, maybe #9). I'm excited to see how Arrow can perform, and hopefully we'll see solid bindings to R as well.

The data.table package (https://github.com/Rdatatable/data.table/wiki) does make progress on some of these - I'd say #1, #3, maybe #7, #8. Dplyr has a query planner too, fwiw.


I'm not an R guru, but I did my thesis with 5k lines of R, so I'd call myself proficient:

Point 10: R has lazy evaluation, which means here that a function will not be evaluated when you define it, it will be evaluated when you call it (maybe not quite the same as some other language's lazy eval). I'm not aware of any built in feature for query planning, if you ask for nrow(some_func(myframe)), it will evaluate the some_func(myframe) function and then count up the rows. You could always write your own query planning function I suppose.

Point 11: R has several multicore/cluster libraries, and some are actually decent. If you are like most R users, you use StackOverflow a lot, and you'll end up with one algorithm that uses the snow package and another algorithm that uses multicore, one that uses parallel, and so on. A few very well written packages have hooks that make going to multiple cores easy, but most do not and you typically have to roll your own.


Hi, Wes here. R data frames have limited to no query planning and are not multithreaded in general, so Problems 10 and 11 are problems in R also.


Native data.frames don't, but data.table does to some extent. Its parallelization is still relatively new/nascent and requires OpenMP, which doesn't have good support on macOS, but it is present.


And R’s syntax leaves much to be desired: it’s ugly compared to python’s.


To each their own - I find the grammar of dplyr and the tidyverse is quite elegant.


Agreed! Sadly it is difficult to fully emulate in Python without some form of delayed evaluation in the language.


R is one of the few languages that popularized the data frame.

Out of SAS, R, Python, and even C++ and Java, data frames are native only in the first two.

So personally, I find the syntax, especially for data frames, beautiful compared to Python's.

Python doesn't even have missing values built in, or native data frame subsetting.


> R has lazy evaluation, which means here that a function will not be evaluated when you define it, it will be evaluated when you call it.

It seems unlikely you meant this as stated. How is it possible to "evaluate a function" when you define it? Certainly you need to give it arguments, and that can only happen when you call it.


A clearer explanation is the lazy evaluation here: http://adv-r.had.co.nz/Functions.html#function-arguments

I can pass an argument but R won't try to evaluate it unless it needs it. This can be beneficial when only some of a function's branches need the argument. You can pass a solver for the traveling salesman problem but R won't waste CPU cycles until it reaches a point where it has to solve the TSP to get an answer. Maybe the first branch of the function is a feasibility check, and the TSP will be skipped for something else.

This is a little harder to explain for data frames, but you can create R functions that act a little like generators in python. This can help with memory management where instead of a gigantic matrix you have a function that generates the part of the matrix that you need.
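For readers coming from eagerly-evaluated languages, you can approximate R's lazy arguments by passing zero-argument callables (thunks) and only invoking them on the branches that need them. A rough sketch (the TSP solver here is a made-up stand-in for any expensive computation):

```python
# Emulating R's lazy argument evaluation in Python with thunks:
# pass a zero-argument callable and only call it when actually needed.

def expensive_tsp_solver():
    # Pretend this takes a long time; we want to skip it when possible.
    return "optimal tour"

def plan_route(is_feasible, tour_thunk=lambda: expensive_tsp_solver()):
    if not is_feasible:
        # The thunk is never called, so the TSP is never solved --
        # roughly what R's promise-based laziness gives you for free.
        return "infeasible"
    return tour_thunk()

print(plan_route(False))  # -> infeasible (solver skipped entirely)
print(plan_route(True))   # -> optimal tour
```

The difference is that in R the caller writes a plain expression and the language builds the promise for you; in Python the caller has to wrap it in a lambda explicitly.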


Wow, if I'm understanding that correctly, R's function arguments can have default values that depend on the values that variables within the function have when the argument is first used. That seems insane!


It's a sort of lisp-y idea. Arguments passed to functions get "quoted", so the function can get or change data about the expression and its scope/environment before evaluating it. Like others have mentioned, it's what makes R quite good for developing DSLs like R's formula language or dplyr. (And other conveniences, like auto-labeled plots, etc.) But similar to lisp macros, it can make for unpleasant surprises if not used wisely.

If you look at attempts to do this stuff in python---e.g. patsy, which emulates R's formula DSL, and there's another project that emulates dplyr I don't recall the name of---you see they have to resort to parsing and eval'ing strings instead of working on expressions (language objects that represent ASTs), which is not nearly as nice or safe.

Edit: But just to emphasize your surprise -- yes, you can definitely be surprised by delayed evaluation in many contexts if you're used to more traditional languages.

    > y <- 10
    > wat <- function(x=10*y) { y = -y; x }
    > wat()
   [1] -1000
But (1) good library writers don't play these kinds of tricks, so it doesn't come up too often in practice; and (2) when writing/debugging my own code, I've not found it too hard to reason about, anticipate, and avoid these effects.

    > y <- 10
    > less_wat <- function(x=10*y) { force(x); y = -y; x }
    > less_wat()
   [1] 1000


What's even crazier is how (according to this snippet) R thinks that 10*10 is 1000 :)


"That seems insane!"

Welcome to R!


It's odd but it enables a lot of useful things (e.g. magrittr's pipe operators). It's possible to write functions that change their behaviour depending on what name they were called by too.


R data frames have excellent support for missing and categorical data. Everything else is just as much of a problem.


Disagree. Factors (categorical data) in R only support strings


But they can be ordered and are represented as integers. So you can turn your numerics to factors and keep them in order and do whatever goofy math you want to do on your categorical data.


Right, but I don't think it's fair to say that R's support for categorical data is "excellent" if only strings can be category labels/levels. Categories (aka dictionary-encoded data in other systems) semantically may be any type in practice (strings, timestamps, numbers, etc.). The function as.factor in R is lossy because the input type is coerced to string.
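To illustrate the general idea independently of any particular library: dictionary encoding just means storing each unique value once and representing the column as small integer codes, and nothing about that requires string labels. A minimal pure-Python sketch with date-typed categories:

```python
from datetime import date

# A minimal dictionary-encoded column: unique category values (any
# type -- dates here) live in one "dictionary", and the column itself
# is just integer codes pointing into it.
values = [date(2017, 1, 1), date(2017, 2, 1), date(2017, 1, 1)]

categories = sorted(set(values))                  # unique category values
code_of = {cat: i for i, cat in enumerate(categories)}
codes = [code_of[v] for v in values]              # the encoded column

assert codes == [0, 1, 0]
# Decoding round-trips without ever coercing anything to strings:
assert [categories[c] for c in codes] == values
```

This is the representation that pandas Categoricals and Arrow dictionary-encoded arrays generalize; R's factors restrict the dictionary to character vectors.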


I'm trying to think of an example where you would want to use a factor with levels that are timestamps rather than just a vector of timestamps. And likewise for other data types, even strings. The only case I can think of would be if you wanted to limit the possible values to specific set.

My impression is that factors in R are borderline-deprecated, especially in the tidyverse, in favor of just using the equivalent non-factor vector.


I think you're mixing up scales. Timestamps and numbers can be recoded to categories but when doing this you're throwing away information -- the same way you lose information when converting an integer to a string. When you work with factors you're no longer interested in qualities the data would have on its original scale like distance etc.

Could you please provide a real-world example where this actually is a problem?


> the input type is coerced to string.

Wow, I didn't know that... strings are the one thing I always minimize in my datasets, due to speed and memory considerations.

BTW, I wanted to thank you for your 2012 slides on how you used hashes to group and join data. It led me to learn more about categoricals and I ended up implementing a Factor() object [1] in the other tool I use (Stata) that ended up being a life saver. In fact, once you have a powerful and fast categorical type, with a set of key functions, you can do anything from group the data, to count distinct categories, to run fixed effect regressions in no time.

[1] http://fmwww.bc.edu/repec/scon2017/Baltimore17_Correia.pdf


Memory isn't the issue. Both factors and string vectors only store each unique string once in memory. The issue is the limitation that factor levels can only be strings and not any other data type. Also, I think testing strings for equality is O(1) since it should simplify to a pointer comparison.
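Python's string interning gives the same effect, for what it's worth; a quick illustration:

```python
import sys

# Two separately constructed but equal strings...
a = sys.intern("some long column label " * 10)
b = sys.intern("some long column label " * 10)

# ...intern to the same object, so equality can short-circuit to an
# O(1) identity (pointer) check instead of a character-by-character scan.
assert a is b
assert a == b
```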


That's not right. Factor _levels_ can only be integers; factor _labels_ can only be strings, since labels are the printable representation of a level. Maybe you could store the "real" values of a level as an attribute (i.e. metadata) of the factor. Anyway, I think that's a solvable problem.


Well, I don't know if there's a standard terminology used elsewhere, but what you call labels, R calls levels.


Erm ... Well, I admit it's somewhat confusing.

    factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x), nmax = NA)

    [...]

    levels
    :   an optional vector of the values (as character strings) that x might 
        have taken. The default is the unique set of values taken by 
        as.character(x), sorted into increasing order of x. Note that this set 
        can be specified as smaller than sort(unique(x)).

    labels
    :   either an optional character vector of (unique) labels for the 
        levels (in the same order as levels after removing those in exclude), or 
        a character string of length 1.
This way, you can do something like that:

    > x <- 1:3
    > factor(x, levels = 1:2, labels = c("foo", "bar"))
    [1] foo  bar  <NA>
    Levels: foo bar
But this actually is:

    > factor(as.character(x), levels = c("1", "2"), labels = c("foo", "bar"))
    [1] foo  bar  <NA>
    Levels: foo bar


This is why I was hoping ONNX (Facebook+Microsoft's new machine learning serialization format) was built on top of Arrow rather than proto2.

Just like Feather is built on top of Arrow, ONNX could be built on top of Arrow.


Wow I totally missed that it's based on protobuf 2. I thought Facebook used thrift and not protobufs.

Well it's good to see that open source works and competitors can benefit from each other's work.


This is huge. Performance matters (even in 2017) and we need to do things the right way. Projects like Julia and Apache Arrow are paving the way for high-performance analytics, even for large data sets.


> Logical operator graphs for graph dataflow-style execution (think TensorFlow or PyTorch, but for data frames)

> A multicore scheduler for parallel evaluation of operator graphs

Does anything like this already exist somewhere?


Spark SQL does exactly that: declarative graph dataflow computations + a multicore (also distributed) scheduler and executor, although it doesn't really scale down that well to a single machine, and the execution layer happens mostly on the JVM, even when using PySpark (with an ungodly mix of serialized Python and JVM objects and code).

Ibis (also by Wes McKinney) does the first part, but it offloads scheduling and execution to the underlying database you are using.
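The pattern itself is easy to sketch: build an expression graph of operators without executing anything, then evaluate the whole graph on demand (which is what lets a real system inspect and optimize the plan first, or schedule it across cores). A toy single-threaded version, with all names invented for illustration:

```python
# Toy deferred operator graph: nothing runs until .execute() is called.

class Node:
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def map(self, fn):
        return Node(lambda xs: [fn(x) for x in xs], self)

    def filter(self, pred):
        return Node(lambda xs: [x for x in xs if pred(x)], self)

    def execute(self):
        # Recursively evaluate children, then apply this node's operator.
        args = [child.execute() for child in self.inputs]
        return self.op(*args) if args else self.op()

def source(data):
    return Node(lambda: list(data))

plan = source([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# No work has happened yet; `plan` is just an operator graph.
assert plan.execute() == [20, 30, 40]
```

Systems like Ibis, Spark SQL, or Dask build graphs like this and insert a planning/scheduling step between construction and execution.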


We are using Airflow for this


We use a luigi type app for this.


Pandas is definitely powerful, if somewhat mind-bending at first if you're used to a relational, SQL world. It's never been clear to me why Pandas wasn't more copy-on-write from the start: it's difficult to predict which operations copy.


As has been said before, the problem with pandas is the confusing and hard-to-remember python API.


I really like this post.

Is there a list of major projects that are leveraging Apache Arrow?


Graphistry is pretty under the radar as we're focusing more on big enterprise/federal security users and certain data science users before opening to others, but we've been investing big on Arrow for our V2.

Ex: We recently released https://github.com/apache/arrow/tree/master/js and are generally using Arrow as a way to compose interactive-time columnar CPU/GPU visual analytics technologies, including our own: https://devblogs.nvidia.com/parallelforall/goai-open-gpu-acc... . I'm hoping we'll have the cycles to start describing how the pieces fit together in how we're rethinking visual analytics web apps (and interactive-time ETL in general), but you can start to guess one aspect of it based on the above.


we're due to add a "Powered By" page to the website, the list of users is growing pretty quickly (e.g. Spark has Arrow support now)


It's not even two years old and doesn't have a 1.0 release. It wouldn't be smart for "major" projects to be using it.

There are probably folks using it for exploration but probably not a production system.


Yikes! I disagree with you -- these are assertions made on no evidence. This software is suitable for production systems as long as you are OK with occasional API changes / deprecations as the software evolves. We are not far off from a 1.0 release, but release numbers are a synthetic construct anyway.


I'm not putting the project down, but your comment "as long as you are OK with occasional API changes / deprecations" is exactly my point. Facebook doesn't want a system borking because an Arrow dev decided that some API feature was redundant or needed a new name. Presumably they have a test environment to check for that sort of thing. They certainly don't want to refactor a large code base because of some tiny API change.

If someone is using Pandas, Spark, or whatever for an important product, it's probably best for them to maintain whatever underlying data layer until the Arrow devs (I guess that means you) are willing to commit to a somewhat stable API. A stable API and a relatively bug-free experience is what typically marks a 1.0 release.

There are plenty of smaller projects that should be perfectly happy to use the 0.7 release and grow/evolve as Arrow does. Especially when using Pandas+Arrow, since it's probably not a production environment and I can spare a few hours to fix a confusing bug.


I disagree with your premise that production systems require API stability in all thirdparty dependencies.


Stability does not imply immutability, but rather being established.

In new projects it is common for a method or class to change direction or role as it develops, which may bring with it a refactor of identifier names, parameters, and such. After the class has been used for some time in different contexts these changes happen less and less, hence we call the class stable. A 1.0 release implies this stability.

Third-party dependencies are still free to mutate their APIs, but when maintaining a production system you don't want to be playing whack-a-mole with API changes every point release.


I also disagree. I would agree with "won't use because the prod release has a lot of bugs." But APIs change.


It's useful to separate the Arrow file format, which standardizes passing columnar data between systems, from the native C library, which saves you from reinventing the wheel when doing so. Likewise for who this is to be used by: framework devs, and only indirectly and in niche cases data analysts/engineers.

For example, Apache Drill and the follow-on startup Dremio is in use in various places, and I believe they use the Arrow runtime. In contrast, when we worked on Graphistry/NodeJS <> MapD bindings, we did zero-copy data interop by agreeing on just the file format.

For people used to building frameworks like the above, Arrow is at a fine point. We hoped to use the runtime, but it wasn't necessary so far. More importantly, as framework builders, it got us past the typical decision of roll-your-own format via ~flatbuffers vs. dealing with orc/parquet. As a user, you'd be leveraging Pandas, Spark, etc., and only calling Arrow when occasionally talking between systems using your framework's internally supported interop layer. Data engineers will eventually be more exposed to this as they focus more on say streaming arrows vs non-streaming parquets, but that seems early for now.


tl;dr this article has no Native American weaponry review nor far-Eastern bear-hate listicles, meh


A tip for authors: briefly explain what you're talking about in the first paragraph. Or at least link to it. Because then people don't have to go hunting all over to find out that, e.g., pandas is a python data analysis library: http://pandas.pydata.org/


This is directly from the blog of the Pandas creator. "What is Pandas" is assumed to be known, for good reason, and that bikeshed is the correct color.


> "What is Pandas" is assumed to be known, for good reason

What good reason? Why on earth would every single person on HN who happened to click on that link know what `pandas` is? How would they even know whose blog it is on?

Unless your blog is specifically, only for people who are already familiar with your `thing` (in which case, a mailing list might be better), then it simply makes sense to always have a header with a tagline explaining what your `thing` is and a link back to the main project website. Just in case it gets featured on HN or something.


Usually I ignore comments like this, but the author of the post is actually coming through and responding to comments, and I appreciate that.

You're leveling the criticism that since his blog deals with a niche that he should drop to a mailing list. Seriously? HN has been developing a bad reputation, and nitpicky comments presented without decorum create an incentive for people to stay away.

It's perfectly reasonable for the author of a blog post to assume that people familiar with his blog, his usual readers, will know what's up.

It's not too unreasonable to suggest that the author provide a summary if it's not clear what the piece or blog is about, except in this case it is described on the blog's About page on the list of open source projects. Redefining all of these things in every post would get annoying very fast.

It's reasonable to say that people working with statistical and data computing are likely as aware of Pandas as a web developer is aware of several of the largest JavaScript libraries.


> You're leveling the criticism

No, I'm not. I'm offering a suggestion. You're being absurdly thin-skinned on someone else's behalf.

> Redefining all of these things in every post would get annoying very fast.

So put it in a header so it automatically appears at the top of every post. As I suggested originally.

This is simply good practice for any project where you want to attract people who haven't heard of your software before. If you don't care about these people, then a mailing list is a better idea so you can hide it away from everyone. Again, just a suggestion...


Maybe he doesn't need to attract people who haven't heard of his software before because his software is literally the default for data science. It's like complaining that someone didn't define what jQuery is.

> No, I'm not. I'm offering a suggestion. You're being absurdly thin-skinned on someone else's behalf

No, we're being annoyed at the crazy level of bike shedding that hacker news is starting to see.


> Why on earth would every single person on HN who happened to click on that link know what `pandas` is? How would they even know whose blog it is on?

It's the author of Pandas's blog. He didn't submit it to HN, someone else did. It's totally reasonable to expect that if people are reading his, the author of Pandas, blog, then they already know what Pandas is.


Sorry, but I don't think it's reasonable to expect that. It's the web. People link to things. Which is why I think it's reasonable to add a link somewhere in the first paragraph to context.


> He didn't submit it to HN, someone else did.

What difference does that make? It's still here, right? So people are still reading it from here, right?

>It's totally reasonable to expect that if people are reading his, the author of Pandas, blog, then they already know what Pandas

No, not when it's linked from elsewhere. Which is what happens with blogs. Like it was right now.



