
Pandas extension arrays - ptype
https://pandas-dev.github.io/pandas-blog/pandas-extension-arrays.html
======
lmeyerov
Innnnteresting! We've been using Pandas as our slow CPU fallback in the GPU
Arrow world b/c of issues like this.

Today, we default to using PyData GPU Arrow tools like BlazingSQL or Nvidia
RAPIDS directly. They ~guarantee perf, and, subtly yet critically for
maintaining our < 100ms SLA, the Arrow format stays clean. (Ex: we don't want
a column schema to get coerced to something less structured.) We'll use Pandas
as a fallback for when they lack features or are hard to use.

The ideal would be to use Pandas directly. Today it is a bit of a crapshoot
whether schemas will break across calls, and the above libraries are really
replacements rather than integrated accelerator extensions. So thinking like
this project's gets us closer to predictable (and GPU-level) performance
within pandas, vs. fully replacing it. So cool!

~~~
superdimwit
I have no idea what any of this means!

~~~
dang
Could you please stop posting unsubstantive comments to Hacker News?

If you don't understand and want to understand, you can always politely ask
for an explanation.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
mactrey
I can't say this is going to make a big difference in how I use pandas, but
I've run into the bizarre "can't have NaNs in an int Series" annoyance in
almost every pandas project I've worked on, so good on them for fixing that.

~~~
em500
It might be annoying, but it's certainly not bizarre. In the floating point
standard (IEEE 754) there are reserved bit values (with hardware support) to
represent NaNs. For integers no such thing exists, so you're left with a bunch
of different implementation choices, all with different tradeoffs. Long ago
the NumPy devs chose not to support NaN in integer arrays at all (for maximum
performance), and Pandas (which started as a wrapper around NumPy arrays)
inherited that.
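
Concretely (a quick sketch; the nullable Int64 dtype is the extension-array
fix that shipped in pandas 0.24):

    import numpy as np
    import pandas as pd

    # IEEE 754 reserves bit patterns for NaN, so float arrays hold it natively.
    floats = np.array([1.0, np.nan, 3.0])  # dtype stays float64

    # Integers have no reserved "missing" pattern: classic pandas coerces the
    # whole column to float64 the moment a missing value appears.
    s = pd.Series([1, None, 3])
    print(s.dtype)  # float64

    # The nullable extension dtype keeps int64 values and tracks missingness
    # in a separate boolean mask instead.
    s2 = pd.Series([1, None, 3], dtype="Int64")
    print(s2.dtype)  # Int64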

For a more technical discussion of the performance implications of one
possible implementation, see
[http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/](http://wesmckinney.com/blog/bitmaps-vs-sentinel-values/)
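
The gist of the tradeoff discussed there, sketched with a plain bool array
standing in for Arrow's packed one-bit-per-value validity bitmap:

    import numpy as np

    # Sentinel approach: reserve one in-band value (e.g. INT64_MIN) to mean
    # "missing". That value becomes unusable as data, and every kernel must
    # branch to special-case it.
    SENTINEL = np.iinfo(np.int64).min
    values_sentinel = np.array([10, SENTINEL, 30], dtype=np.int64)
    missing = values_sentinel == SENTINEL

    # Bitmap approach (Arrow's choice): values stay untouched; validity lives
    # out-of-band, one bit per element (a bool array here for clarity).
    values = np.array([10, 0, 30], dtype=np.int64)
    validity = np.array([True, False, True])
    print(values[validity].sum())  # 40 -- kernels mask instead of branching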

------
bpchaps
Has anyone done any perf analysis between this and previous versions?

~~~
batxu
I just did one using fletcher
([https://github.com/xhochy/fletcher](https://github.com/xhochy/fletcher)), a
library that backs pandas extension arrays with Arrow arrays:

    
    
        import string

        import numpy as np
        import pandas as pd
        import fletcher as fl

        n = 2**25
        data = np.random.choice(list(string.ascii_letters), n)
        df = pd.DataFrame({
            'string': data,
            'arrow': fl.FletcherArray(data),
            'categorical': pd.Categorical(data),
            'ints': np.arange(n)
        })
    

For a groupby on each of string/arrow/categorical, summing the ints (roughly
the loop sketched at the end of this comment), the results are:

    
    
      - String: 1.58 s
      - Arrow: 886 ms
      - Categorical: 406 ms
    

The base type of this string FletcherArray is pyarrow.lib.ChunkedArray, so
maybe there is another, more convenient Arrow array type, or the bottleneck is
in an upper layer around the groupby operation (since it is only twice as
fast). Anyway, for this kind of operation the Categorical is the winner, and
double speed is quite a notable improvement.
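
The timed operation, roughly (a sketch, reusing the df built above; the exact
timing harness isn't shown here):

    from timeit import timeit

    # Group by each key column and sum the ints, averaging three runs.
    for col in ['string', 'arrow', 'categorical']:
        seconds = timeit(lambda: df.groupby(col)['ints'].sum(), number=3) / 3
        print(f'{col}: {seconds:.3f} s per run')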

~~~
gulda
Nice!

