

PandaPy has the speed of NumPy and the usability of Pandas (10x to 50x faster) - firedup
https://github.com/firmai/pandapy
======
shoyer
It's a lovely idea to build pandas-like functionality on top of NumPy's
structured dtypes, but these benchmarks comparing PandaPy to Pandas are
extremely misleading. The largest input dataset has 1258 rows and 9 columns,
so basically all these tests show is that PandaPy has less Python overhead.

For a more representative comparison, let's make everything 1000x larger,
e.g., closing = np.concatenate(1000 * [closing])
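
A minimal version of that experiment (the synthetic data and timing setup
here are illustrative, not the original benchmark code):

    import timeit

    import numpy as np
    import pandas as pd

    # Stand-in for the benchmark data: a structured array with two float
    # fields and ~1258 rows, then scaled 1000x as suggested above
    closing = np.zeros(1258, dtype=[("Open", "f8"), ("Close", "f8")])
    closing_big = np.concatenate(1000 * [closing])
    df_big = pd.DataFrame(closing_big)

    # Time a column-wise add on both representations at the larger size
    t_np = timeit.timeit(lambda: closing_big["Close"] + closing_big["Open"], number=100)
    t_pd = timeit.timeit(lambda: df_big["Close"] + df_big["Open"], number=100)
    print(f"structured array: {t_np:.3f}s  pandas: {t_pd:.3f}s")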

Here's how a few representative benchmarks change:

- describe: PandaPy was 5x faster, now 5x slower

- add: PandaPy was 2-3x faster than pandas, now ~15x slower

- concat: PandaPy was 25-70x faster, now 1-2x slower

- drop/rename: PandaPy is now ~1000x faster (NumPy can clearly do these
operations without any data copies)

I couldn't test merge because it needs a sorted dataset, but hopefully you get
the idea -- these benchmarks are meaningless, unless for some reason you only
care about manipulating small datasets very quickly.

At large scale, pandas has two major advantages over NumPy/PandaPy:

- Pandas (often) uses a columnar data format, which makes it much faster to
manipulate large datasets.

- Pandas has hash tables it can rely on for fast look-ups instead of
sorting. (Both effects are sketched below.)
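
An illustrative sketch of both (not taken from the PandaPy benchmarks):

    import numpy as np
    import pandas as pd

    # Columnar vs row-major: a structured array stores whole rows
    # contiguously, so a single field is a strided view of memory,
    # while a pandas column is a dense block
    arr = np.zeros(1_000_000, dtype=[("a", "f8"), ("b", "f8"), ("c", "f8")])
    print(arr["a"].flags["C_CONTIGUOUS"])                           # False (24-byte stride)
    print(pd.DataFrame(arr)["a"].to_numpy().flags["C_CONTIGUOUS"])  # True

    # Hash-table look-ups: a pandas Index locates a label without sorting
    idx = pd.Index(np.random.permutation(1_000_000))
    print(idx.get_loc(123456))   # hash look-up, no sort required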

~~~
meowface
This is why you can never accept benchmarks provided solely by the software's
creators. The same goes for accepting studies about a company's product when
the company has commissioned and funded the studies.

It'd be cool if there were neutral third parties, kind of like Jepsen, that
any project could defer rigorous benchmarking to, perhaps in exchange for a
flat fee (everyone pays the same fee, no matter how big or small they are).

~~~
munmaek
And then they learn how to game the benchmarks.

You just can’t win.

~~~
skrebbel
No, because the trick is that you're paying a knowledgeable person to run the
benchmark. That person would presumably actively iterate on the benchmarks and
try to detect / avoid cheating.

~~~
kmbriedis
People would probably find out what hardware they use for benchmarks and
optimize for that, leading to a performance decrease for many others.

------
smabie
Pandas is usable? I had no idea...

Pandas is really badly designed, in the same way that most Python libraries
are: each function has so many parameters, and a parameter can often be one
of a bunch of different types. Pandas is useful, especially for time-series
data, but no one particularly loves it. And it's embarrassingly slow. Maybe
PandaPy is better, but I doubt it. When you start using Python-implemented
functions (vs. C ones), things are going to get bad no matter what you do.

Speaking of which, I decided to port a statistical model for betting from
Python to Julia a week ago. I'm not done yet, and this is my first major
experience with Julia, but it's been _so_ much nicer than using Python. The
performance can easily be 10x-50x faster without really doing any extra work.

Also, the language feels explicitly designed for scientific computing and
really meshes well with the domain. Python the language was never really good
for this, but the libraries were pretty compelling. Julia libraries have
almost caught up with (or, in some domains like linear algebra, have actually
exceeded) what's available for Python. Moreover, if you need to, PyCall is
really easy to use.

I'm going to go out on a limb and say that people shouldn't be using Python
for new scientific computing projects. Julia has arrived, and is better in
every way. (I'm still unsure about the 1-based indexing, but I'm sure I'll
get over it; 0-based was never that great in the first place.)

~~~
woah
How's the package management story on Julia? Python package management is a
fractal of badness.

~~~
ddragon
Usually pretty good. Package management is completely centered around
Pkg.jl, which is integrated into the REPL, and you can also import it into
your program for more advanced scripting. If you don't create an environment,
everything is added to the global user library; if you do create one, it will
automatically manage your project's dependency files, and each
environment/package can have its own independent versions of each library (so
you don't really have dependency-hell issues, but you might use more disk
space due to multiple versions of the same library, although it will respect
semver when keeping multiple versions).

Pkg.jl is based on git/GitHub, with a central registry that is, I believe,
automatically updated with new packages by bots. The current version also
natively supports automatically deploying binaries and other artifacts like
datasets, which can be optionally loaded on demand.

Most of the trouble I hear about involves stricter enterprise firewall
scenarios, and perhaps Julia's JIT compiling the libraries every time (though
that's not an issue with the package manager).

------
fjp
Some Python devs seem to pull in Pandas whenever any math is required.

IMO the Pandas documentation manages to document every parameter of every
method, and somehow it's almost as helpful as no documentation at all.
Combined with the fact that it's a huge package, I avoid it unless I really,
really need it.

A version with human-understandable docs could convince me otherwise

~~~
powowowow
I've found Pandas extremely easy to learn and to use, to the point where I
find it confusing to see somebody say that it's not human-understandable.

If you're reading this thread and wondering whether it's easy or hard to use,
I suggest taking a look at the docs
([https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html))
and making your own decision.

I find the combination of basic intros, user guides, and the API reference to
be extremely usable and understandable; and I am reasonably sure I am human.
But opinions may vary.

~~~
jfim
Pandas has enough gotchas that it looks friendly until you hit one of them.
Examples of gotchas:

Want to join two dataframes together like you'd join two database tables?
df.join(other=df2, on='some_column') does the wrong thing, silently; what you
really wanted was df.merge(right=df2, on='some_column').

Got a list of integers that you want to put into a dataframe?
pd.DataFrame({'foo': [1,2,3]}) will do what you want. What if they're
optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently change your
integers to floating point values. Enjoy debugging your joins (sorry, merges)
with large integer values.

Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn a
dataframe into a truth value will throw ValueError.
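
The last two are quick to demonstrate (a minimal sketch):

    import pandas as pd

    # The None silently upcasts the integer column to float64
    print(pd.DataFrame({"foo": [1, 2, 3, None]}).dtypes)  # foo    float64

    # Unlike a list or dict, a DataFrame refuses to act as a truth value
    try:
        if pd.DataFrame():        # raises, even for an empty frame
            pass
    except ValueError as err:
        print(err)                # suggests .empty, .any(), .all(), etc.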

~~~
qwhelan
>Want to join two dataframes together like you'd join two database tables?
df.join(other=df2, on='some_column') does the wrong thing, silently, what you
really wanted was df.merge(right=df2, on='some_column')

It's simply a matter of the default join type - join defaults to left while
merge defaults to inner. They use the exact same internal join logic.

>What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently
change your integers to floating point values.

This was a long standing issue but is no longer true.

>Want to check if a dataframe is empty? Unlike lists or dicts, trying to turn
a dataframe into a truth value will throw ValueError.

Those are 1D types where that's simple to reason about. It's not as
straightforward in higher dimensions (what's the truth value of a (0, N)
array?), which is why .empty exists.
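
For instance (illustrative):

    import pandas as pd

    # Zero rows but three columns: shape (0, 3)
    df = pd.DataFrame(columns=["a", "b", "c"])
    print(df.shape)   # (0, 3)
    print(df.empty)   # True: any zero-length axis counts as empty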

~~~
jfim
> Simply a matter of default type of join - join defaults to left while merge
> defaults to inner.

No, join does an index merge. For example, if you try to join with string
keys, it'll throw an error (because strings and numeric indexes aren't
compatible).

    
    
      left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
      right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something_else": [4,3,1,2]})
      left.join(other=right, on="abcd")
      
      ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
    

If you try to join with numeric keys:

    
    
      left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [10,20,30,40]})
      right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [40,30,10,20]})
      
      left.join(other=right, on="something", rsuffix="_r")
      
        abcd  something abcd_r  something_r
      0    a         10    NaN          NaN
      1    b         20    NaN          NaN
      2    c         30    NaN          NaN
      3    d         40    NaN          NaN
    

Or even worse, if your numeric values fall within the range of the other
frame's index, you get something that kind of looks right if you're not
paying attention:

    
    
      left = pd.DataFrame({"abcd": ["a", "b", "c", "d"], "something": [1,2,3,4]})
      right = pd.DataFrame({"abcd": ["d", "c", "a", "b"], "something": [4,3,1,2]})
      left.join(other=right, on="something", rsuffix="_r")
      
        abcd  something abcd_r  something_r
      0    a          1      c          3.0
      1    b          2      a          1.0
      2    c          3      b          2.0
      3    d          4    NaN          NaN
    

Whereas merge does what one would expect:

    
    
      left.merge(right=right, on="something", suffixes=['', '_r'])
      
        abcd  something abcd_r
      0    a         10      a
      1    b         20      b
      2    c         30      c
      3    d         40      d
    

>> What if they're optional? pd.DataFrame({'foo': [1,2,3,None]}) will silently
change your integers to floating point values.

> This was a long standing issue but is no longer true.

This occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3
don't mention such a change), so it is likely still the case in the latest
stable release.

    
    
      pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807]})
      
                  foo
      0  1.000000e+00
      1  2.000000e+00
      2  3.000000e+00
      3  4.000000e+00
      4           NaN
      5  9.223372e+18
    

It's also a lossy conversion if the integer values are large enough:

    
    
      df = pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]})
      
                  foo
      0  1.000000e+00
      1  2.000000e+00
      2  3.000000e+00
      3  4.000000e+00
      4           NaN
      5  9.223372e+18
      6  9.223372e+18
      
      df["foo"].unique()
      
      array([1.00000000e+00, 2.00000000e+00, 3.00000000e+00, 4.00000000e+00, nan, 9.22337204e+18])
    

>> Want to check if a dataframe is empty? Unlike lists or dicts, trying to
turn a dataframe into a truth value will throw ValueError.

> Those are 1D types where that's simple to reason about. It's not as
> straightforward in higher dimensions (what's the truth value of a (0, N)
> array?), which is why .empty exists

It's not very pythonic, though. A definition of "all dimensions greater than
0" would have been much less surprising.

~~~
qwhelan
> Occurs in pandas 0.25.1 (and the release notes for 0.25.2 and 0.25.3 don't
> mention such a change), so that would likely be still the case in the latest
> stable release.

It was released in 0.24.0:
[https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html](https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html)

For example:

    
    
        pd.DataFrame({"foo": [1,2,3,4,None]}, dtype=pd.Int64Dtype())
    
            foo
        0     1
        1     2
        2     3
        3     4
        4  <NA>
    
        pd.DataFrame({"foo": [1,2,3,4,None,9223372036854775807,9223372036854775806]}, dtype=pd.Int64Dtype())
    
                           foo
        0                    1
        1                    2
        2                    3
        3                    4
        4                 <NA>
        5  9223372036854775807
        6  9223372036854775806

~~~
jfim
Sure, if you specify the type. It's still a gotcha because the default
behavior is to upcast to floating point unless the type is defined for every
integer column of every data frame, which isn't very pythonic.

The example with the (incorrect) join above shows how even other operations
can cause this type conversion.

~~~
qwhelan
Yes, there's a lot of existing code written assuming the old behavior. But
most code has only a few ingestion points, so it's pretty simple to turn on.
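
For example, opting in at a (hypothetical) ingestion point:

    import io

    import pandas as pd

    # Inline CSV stands in for a real data source; the empty qty field
    # becomes <NA> rather than forcing the column to float64
    csv = io.StringIO("qty,tag\n1,a\n,b\n")
    df = pd.read_csv(csv, dtype={"qty": "Int64"})  # capital-I nullable Int64
    print(df["qty"].dtype)                         # Int64

    # Or convert an existing frame to the best nullable dtypes
    # (convert_dtypes was added later, in pandas 1.0)
    df = df.convert_dtypes()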

------
gewa
I've worked with Pandas and numpy on different projects, and I really like
the low-level, component-based way numpy works. In most cases where I used
Pandas, I regretted it at some point. OOP plus numpy in the first place would
have been a better solution, especially because of the ease of Numba
integration.

------
sriku
Nice to see, but I think Julia is pretty much targeted at not having to do
this kind of juggling.

(Don't get me wrong, I actually appreciate the work; I just also use Julia.)

------
anakaine
The one reference I didn't see was to chunking. I'm currently using Dask
because of its graceful chunking of large and medium-sized data, but PandaPy
doesn't make reference to this capability.

------
beefield
Slightly off-topic: I have been occasionally trying to learn to use pandas,
but having worked quite a lot with SQL, there is one thing that I can't get
over. Is there a way to force pandas to have the same data type for each
element in a column? (In particular, pandas seems to think that NaN is a
valid replacement for None, and after that you really can't trust anything to
run on a column because the data types may change.)

Or, more likely, I have missed some idiomatic way to work with pandas.

~~~
TheGallopedHigh
Off the top of my head, there is an astype method to set a column to a type.

You can also choose how to fill None values, namely what value you want
instead; see the fillna method.
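
Something like this (a sketch):

    import pandas as pd

    # Pin a column's dtype: replace the missing values, then cast back
    df = pd.DataFrame({"x": [1, 2, None]})        # x silently becomes float64
    df["x"] = df["x"].fillna(0).astype("int64")   # back to plain int64

    # Or use the nullable integer dtype, so None stays <NA> without upcasting
    df2 = pd.DataFrame({"x": [1, 2, None]}, dtype="Int64")
    print(df["x"].dtype, df2["x"].dtype)          # int64 Int64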

------
gww
There's a cool Python library called anndata
([https://icb-anndata.readthedocs-hosted.com/en/stable/anndata.AnnData.html](https://icb-anndata.readthedocs-hosted.com/en/stable/anndata.AnnData.html)).
It's designed for single-cell RNA-seq experiments, where datasets have
multiple 2D matrices of data along with row/column annotation data. Its use
of NumPy structured arrays is interesting.

------
enriquto
My whole job consists of manipulating arrays of numbers, mostly in Python,
and I have never found any use for pandas. Whenever I receive some code that
uses pandas, it is easy to remove the dependency without much ado (it was not
really necessary for anything).

Can anybody point me to a reasonable use case of pandas? I mean, besides
printing a matrix with lines of alternating colors.

~~~
cerved
Lots of built-in statistical stuff and powerful visualization make exploring
datasets easy.

~~~
enriquto
> Lots of built in statistical stuff and powerful visualization makes
> exploring datasets easy

I see. For linear algebra stuff it does not offer anything essential. You
rarely see a matrix as a "dataset".

------
ben509
If you've mucked with numpy dtypes, they're shockingly powerful, but this
seems like a much nicer way to do it. Great idea!
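
For reference, a bare structured dtype looks like this (field names are
illustrative):

    import numpy as np

    # Named, typed fields packed into each element of the array
    dt = np.dtype([("date", "datetime64[D]"), ("close", "f8"), ("volume", "i8")])
    arr = np.array([("2020-01-02", 300.35, 33870100),
                    ("2020-01-03", 297.43, 36580700)], dtype=dt)
    print(arr["close"])   # access a field by name, like a column
    print(arr[0])         # or pull out a whole record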

------
kristianp
Has anyone here compared Turi Create with pandas and numpy recently? It was
open-sourced by Apple:
[https://github.com/apple/turicreate](https://github.com/apple/turicreate)

Seems like it's good for creating ml models and deploying them to apple
devices.

------
hsaliak
Nice to see more libraries in Python that embrace optional static typing.

------
throwlaplace
Isn't pandas already built on top of numpy? So what does this mean, and what
makes it faster?

~~~
skyyler
Did you read the README.md? The author discusses the motivations of the
project there.

