Innnnteresting! We've been using Pandas as our slow CPU fallback in the GPU Arrow world b/c of issues like this.
Today, we default to using PyData GPU Arrow tools like BlazingSQL or Nvidia RAPIDS directly. They ~guarantee perf, and, subtle yet critical for maintaining our < 100ms SLA, the Arrow format stays clean. (Ex: we don't want a column schema to get coerced to something less structured.) We'll use Pandas as a fallback for when they lack features or are hard to use.
The ideal would be to use Pandas directly. Today it is a bit of a crapshoot whether schemas will break across calls, and the above libraries are really replacements rather than integrated accelerator extensions. So thinking like this project gets us closer to predictable (and GPU-level) performance within Pandas, vs fully replacing it. So cool!
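To make the "coerced to something less structured" point concrete, here's a generic pandas illustration (not our actual pipeline, just the kind of silent coercion we have to guard against):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3]})
right = pd.DataFrame({"key": [1, 2], "count": [10, 20]})  # count is int64

# The unmatched key produces a missing value, which silently upcasts
# the int64 column to float64 -- the schema changed across one call.
merged = left.merge(right, on="key", how="left")
print(merged["count"].dtype)  # float64
```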
I can't say this is going to make a big difference in how I use pandas, but I've run into the bizarre "can't have NaNs in an int Series" annoyance in almost every pandas project I've worked on, so good on them for fixing that.
It might be annoying, but it's certainly not bizarre. In the floating point standard (IEEE 754) there are reserved bit values (with hardware support) to represent NaNs. For integers no such thing exists, so you're left with a bunch of different implementation choices, all with different tradeoffs. A long time ago the NumPy devs chose not to support NaN in integer arrays at all (for maximum performance), and Pandas (starting as a wrapper around numpy arrays) inherited that.
Not allowing nullable types in raw integer `Series` is the least bad solution to this problem. If you really want nulls in numerical data, either use floating point or `IntegerArray`.
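For example, a minimal sketch contrasting the NumPy-backed path with pandas' nullable `Int64` dtype (which is backed by `IntegerArray`):

```python
import pandas as pd

# Plain NumPy-backed integer Series: introducing a missing value forces float64.
s = pd.Series([1, 2, 3], dtype="int64")
print(s.reindex([0, 1, 2, 3]).dtype)  # float64

# Nullable extension dtype: stays integer and tracks missingness in a separate mask.
s2 = pd.Series([1, 2, None], dtype="Int64")
print(s2.dtype)            # Int64
print(s2.isna().tolist())  # [False, False, True]
```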
```python
import string

import fletcher as fl
import numpy as np
import pandas as pd

# 2**25 random single letters, stored four different ways.
n = 2**25
data = np.random.choice(list(string.ascii_letters), n)
df = pd.DataFrame({
    'string': data,
    'arrow': fl.FletcherArray(data),
    'categorical': pd.Categorical(data),
    'ints': np.arange(n)
})
```
For a groupby operation on string/arrow/categorical, summing the ints, the results are:
- String: 1.58 s
- Arrow: 886 ms
- Categorical: 406 ms
The base type of this string `FletcherArray` is `pyarrow.lib.ChunkedArray`, so maybe there is another, more convenient Arrow array type, or the bottleneck is in an upper layer around the groupby operation (since it is only about twice as fast). Anyway, for this kind of operation the Categorical is the winner, and double speed is quite a notable improvement.
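A rough sketch of how one might time the comparison (reusing the `df` from the snippet above; not necessarily the exact harness behind the numbers):

```python
import timeit

# Group by each key column and sum the ints, averaging a few runs.
for col in ["string", "arrow", "categorical"]:
    t = timeit.timeit(lambda: df.groupby(col)["ints"].sum(), number=3) / 3
    print(f"{col:12s} {t:.3f} s")
```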