Innnnteresting! We've been using Pandas as our slow CPU fallback in the GPU Arrow world b/c of issues like this.
Today, we default to using PyData GPU Arrow tools like BlazingSQL or Nvidia RAPIDS directly. They ~guarantee perf, and, subtle yet critical for maintaining our < 100ms SLA, the Arrow format stays clean. (Ex: we don't want a column schema to get coerced to something less structured.) We'll use Pandas as a fallback for when they lack features or are hard to use.
The ideal would be to use Pandas directly. Today it is a bit of a crapshoot whether schemas will break across calls, and the above libraries are really replacements rather than integrated accelerator extensions. So thinking like this project gets us closer to predictable (and GPU-level) performance within Pandas, vs fully replacing it. So cool!
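To make the "coerced to something less structured" point concrete, here's a generic pandas illustration (not our actual pipeline, just the kind of silent coercion we have to guard against):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3]})
right = pd.DataFrame({"key": [1, 2], "count": [10, 20]})  # count is int64

# The unmatched key produces a missing value, which silently upcasts
# the int64 column to float64 -- the schema changed across one call.
merged = left.merge(right, on="key", how="left")
print(merged["count"].dtype)  # float64
```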
I can't say this is going to make a big difference in how I use pandas, but I've run into the bizarre "can't have NaNs in an int Series" annoyance in almost every pandas project I've worked on, so good on them for fixing that.
It might be annoying, but it's certainly not bizarre. In the floating point standard (IEEE 754) there are reserved bit values (with hardware support) to represent NaNs. For integers no such thing exists, so you're left with a bunch of different implementation choices, all with different tradeoffs. A long time ago the NumPy devs chose not to support NaN in integer arrays at all (for maximum performance), and Pandas (starting as a wrapper around numpy arrays) inherited that.
Not allowing nullable types in raw integer `Series` is the least bad solution to this problem. If you really want nulls in numerical data, either use floating point or `IntegerArray`.
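For example, a minimal sketch contrasting the NumPy-backed path with pandas' nullable `Int64` dtype (which is backed by `IntegerArray`):

```python
import pandas as pd

# Plain NumPy-backed integer Series: introducing a missing value forces float64.
s = pd.Series([1, 2, 3], dtype="int64")
print(s.reindex([0, 1, 2, 3]).dtype)  # float64

# Nullable extension dtype: stays integer and tracks missingness in a separate mask.
s2 = pd.Series([1, 2, None], dtype="Int64")
print(s2.dtype)            # Int64
print(s2.isna().tolist())  # [False, False, True]
```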
```python
import string

import fletcher as fl
import numpy as np
import pandas as pd

# 2**25 random single letters, stored four different ways.
n = 2**25
data = np.random.choice(list(string.ascii_letters), n)
df = pd.DataFrame({
    'string': data,
    'arrow': fl.FletcherArray(data),
    'categorical': pd.Categorical(data),
    'ints': np.arange(n)
})
```
For a groupby operation on string/arrow/categorical, summing the ints, the results are:
- String: 1.58 s
- Arrow: 886 ms
- Categorical: 406 ms
The base type of this string `FletcherArray` is `pyarrow.lib.ChunkedArray`, so maybe there is another, more convenient Arrow array type, or the bottleneck is in an upper layer around the groupby operation (since it is only about twice as fast). Anyway, for this kind of operation the Categorical is the winner, and double speed is quite a notable improvement.
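A rough sketch of how one might time the comparison (reusing the `df` from the snippet above; not necessarily the exact harness behind the numbers):

```python
import timeit

# Group by each key column and sum the ints, averaging a few runs.
for col in ["string", "arrow", "categorical"]:
    t = timeit.timeit(lambda: df.groupby(col)["ints"].sum(), number=3) / 3
    print(f"{col:12s} {t:.3f} s")
```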