
Data Science: Performance of Python vs. Pandas vs. Numpy - lukasz_km
http://machinelearningexp.com/data-science-performance-of-python-vs-pandas-vs-numpy/
======
em500
I left the following comment at the blog:

For this example, idiomatic Pandas would look like this:

    
    
      def gen_stats_pandas2(dataset_pandas):
          # group once, then compute all three aggregates in a single pass
          grouped = dataset_pandas.groupby('product_id')
          stats = grouped.aggregate(['count', 'sum', 'mean'])
          # keep only the (column, aggregate) pairs the original function returned
          return stats[[('product_order_num','count'), ('quantity','sum'), ('price','mean')]]
    
    

which is simpler and about 500x faster than gen_stats_pandas() in the example:

    
    
      %%time 
      res1 = gen_stats_pandas(dataset_pandas)
    
      CPU times: user 8.48 s, sys: 32 ms, total: 8.51 s
      Wall time: 8.58 s
    
      %%time 
      res2 = gen_stats_pandas2(dataset_pandas)
    
      CPU times: user 16 ms, sys: 4 ms, total: 20 ms
      Wall time: 17.6 ms

~~~
blattimwind
Sometimes I wonder whether actual optimization (as a craft, if you like) has
been poisoned by scripting languages (Python etc.) and frameworks.

For one thing, they completely changed how the speed of computers is
perceived: e.g. it is _actually considered fast_ when a web app spends "only"
20 server CPU milliseconds on a request.

And because most things are so inefficient, it is relatively easy to get
integer-factor speed-ups. When looking at applications written in C++ by
actual engineers, just getting 5 or 10% on some functionality is often
appreciated, and achieving more can be seriously difficult. With software in
scripting languages, improvements below a few hundred percent don't raise an
eyebrow.

However, actually optimizing code in a scripting language (as opposed to
changing the way a native library is used) can be seriously difficult and
annoying, because most scripting languages simply don't have efficient tools —
obviously, if a scripting language (like Python) had something substantially
faster than, say, attribute access, then that something would simply be used
to power attribute access in the general case. So in C++ (or other native
languages) you can actually make good use of the processor's and memory
system's resources, but you just _can't_ when writing in a scripting language.

It is also surprising that most tooling available for e.g. Python is rather
immature, insufficient and cumbersome to use compared to the tooling
(especially commercial) available for native code (or Java). This likely has
something to do with people a) not caring about performance at all, and/or b)
the fact that, since it is basically not possible to write fast Python code,
anything performance-critical gets written in another language and used from
Python; and for analyzing the performance of that code, which is now an
external library, you don't need tools for Python.

~~~
stestagg
This comment contains some fairly pejorative language.

I can assure you that there are real-world high-performance Python systems
with per-event budgets of far less than 1ms being used for critical systems in
large organisations. Sometimes with the help of compiled components, but made
understandable by being coordinated in Python.

Of course these systems are atypical, but the idea that ‘actual’ engineers
always use C/Java while Python is just a scripting language is simply ignorant.

~~~
blattimwind
That's not what I wrote.

------
aleyan
There are a couple of things wrong with this article.

The first problem, as pointed out by apl and em500, is the non-idiomatic
pandas code. Pandas is fast not when you rewrite your algorithms one-for-one
with pandas functions, but when you use the functionality pandas uniquely
provides; that is when you start thinking in pandas. My rewrite[0] with
groupby is _47 times faster_ than pure Python; it would be _1848 times faster_
if I didn't convert the result from a DataFrame to a list of lists as in the
other cases.

Secondly, if your analysis is taking too long on your MacBook Air, you should
switch to faster hardware before investing your time in time-consuming
optimizations. Presumably this is work done for an employer, and for them the
data scientist's time is much more expensive than a powerful AWS instance. Of
course, if you made big-O complexity mistakes, no amount of hardware is going
to fix that.

[0] https://github.com/aleyan/machinelearningexp/blob/master/DataScience_Performance_Python_Pandas_Numpy.ipynb

------
apl
I only skimmed the notebook, but the code looks fairly inadequate. The pandas
portion, for instance, treats the data frame as a dumb array and critically
ignores grouping functionality which should offer a tremendous speedup.
Moreover, for the exact same task pandas should never be slower than NumPy.

Benchmarks are hard to get right, but this one falls way short of saying
anything at all about the performance penalties incurred by the various
libraries and abstractions.

------
Radim
No comment on the poor implementations (why quadratic complexity, in all three
cases?), but some general notes:

For reporting benchmarks, you probably want the _minimum_, rather than the
average, while keeping an eye on the distribution/variation of the individual
timings as well (to mitigate caching effects). Check out Python's _timeit_
module or the _%timeit_ magic command in IPython.
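
A minimal sketch of the pattern (the sorted() call is just a stand-in
workload):

    import timeit
    
    # Repeat the measurement several times and report the fastest run;
    # the minimum is the timing least disturbed by caching, GC and
    # background load.
    timings = timeit.repeat(
        "sorted(data)",
        setup="import random; data = [random.random() for _ in range(10000)]",
        repeat=5, number=100)
    print(min(timings) / 100)  # best per-call time, in seconds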

With more numeric-heavy workflows, the gap between Python/C grows even larger.
The more you can push to native-compiled-land, away from Python loops, the
better. As another data point, when optimizing the word2vec algorithm, the
numbers came out something like [0]:

1x (pure Python) - ~120x (NumPy) - ~2880x (C/Cython)

[0] https://rare-technologies.com/word2vec-in-python-part-two-optimizing/

------
petters
The data consists of 50k records. The fastest code in the post takes 3
seconds.

Have many programmers completely lost the ability to know roughly how fast
code "should" be?

~~~
Johnny_Brahms
We use stupid, simple maths problems (like Project Euler 1-30) when hiring for
junior positions.

We are not looking for someone to do the maths tricks that let you solve most
of the problems with pen and paper, but for someone who recognizes the 100x
speed increase you get from memoization.
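
Something along these lines is the kind of observation we're after (a toy
illustration, not one of our actual problems):

    from functools import lru_cache
    
    @lru_cache(maxsize=None)  # memoize: each n is computed only once
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)
    
    # Without the cache this recursion is exponential in n; with it,
    # linear: far more than a 100x speedup already around n = 35.
    print(fib(35))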

You'd think people with some programming experience would be able to reason
about how to speed up a simple brute-force approach, but those parts of the
hiring process usually weed out 75%, and I don't expect someone to give me a
perfect solution; heck, not even a working piece of code. Just being able to
say "hmm, we could maybe cut half of the computation by just using the
relevant parts of the alphabet" is all I want.

Then we have people like my colleague, who just wrote print(103), which was
the correct answer. (He could also give several fast approaches to solving the
problem, one of which we hadn't thought of before.)

------
akilat90
While we're on the subject, I'd like to see what this community has to offer
on the efficiency of pandas groupby-apply operations, which I believe are an
idiomatic pandas construct. The question [0] on SO demonstrates a case where
groupby-apply loses to a complete iteration over the dataset (also using some
pandas, though I'm not sure that counts as idiomatic pandas).

Thanks!

[0] https://stackoverflow.com/questions/45012503/conditionally-populating-elements-in-a-pandas-groupby-object-a-vectorized-solu

------
autokad
I usually do all my feature engineering and data exploration in pandas, and
then I convert my data from pandas to numpy just before I throw it to the
classifiers / regressors.
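
The hand-off itself is tiny; a minimal sketch (the model object is a
placeholder):

    import pandas as pd
    
    df = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0], "target": [0, 1]})
    
    X = df[["f1", "f2"]].values  # plain numpy ndarray for the estimator
    y = df["target"].values
    # model.fit(X, y)  # e.g. any scikit-learn classifier / regressor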

~~~
homarp
Have you looked at PyTorch? It can be seen as a GPU-accelerated numpy.

see https://vwrideout.github.io/QR-Blog/index.html

------
karuth
This post is rather naive. Pandas performance improves when one uses the
built-in functions. Also, pandas performs better than numpy at larger data
sizes [1].

[1] http://gouthamanbalaraman.com/blog/numpy-vs-pandas-comparison.html

------
lukasz_km
There are a lot of great Python and Pandas code snippets here, but I am not
sure anyone posted a NumPy-based solution. Below is mine. It gives a 100x
speed-up over the base (quadratic Python) solution. Still, the suggested pure
Python solutions and idiomatic Pandas solutions are much faster. I suspect
NumPy has more power than that.

    def gen_stats_numpy_l(dataset_numpy):
        start = time.time()
        # first occurrence index of each product_id (rows assumed sorted by id)
        unique_products, unique_indices = np.unique(dataset_numpy[:, 0],
                                                    return_index=True)
        product_stats = []
        # one sub-array per product; drop the empty chunk before index 0
        split = np.split(dataset_numpy, unique_indices)[1:]
        for item in split:
            length = len(item)
            product_stats.append([int(item[0, 0]),           # product_id
                                  int(length),               # order count
                                  int(np.sum(item[:, 2])),   # total quantity
                                  float(np.round(np.sum(item[:, 3]) / length, 2))])  # mean price
        end = time.time()
        working_time = end - start
        return product_stats, working_time
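
For what it's worth, the remaining Python loop can also be removed entirely
with np.add.reduceat; a sketch, again assuming the rows are sorted by
product_id:

    def gen_stats_numpy_reduceat(dataset_numpy):
        ids = dataset_numpy[:, 0]
        # first index and size of each contiguous product group
        _, idx, counts = np.unique(ids, return_index=True, return_counts=True)
        qty_sums = np.add.reduceat(dataset_numpy[:, 2], idx)  # per-group sums
        price_means = np.add.reduceat(dataset_numpy[:, 3], idx) / counts
        return np.column_stack([ids[idx], counts, qty_sums,
                                np.round(price_means, 2)])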

------
Stasis5001
This comparison isn't so great because of the following:

    
    
        for product_id in unique_products:
            product_items = [x for x in dataset_python if x[0]==product_id ]
    

This is O(unique_products * observations), and it looks like
O(unique_products) = O(observations), so we have a quadratic scan where a
linear one would suffice. You'll get the best performance from whichever
solution lets you write this as a linear pass the fastest: e.g. in pure
Python, build a dict from product_id to observations and iterate over that
(sketched below); in pandas, use groupby.
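
A minimal sketch of the pure-Python version, assuming the article's column
layout (product_id in column 0, quantity in column 2, price in column 3):

    from collections import defaultdict
    
    def gen_stats_dict(dataset_python):
        # single linear pass: bucket rows by product_id
        groups = defaultdict(list)
        for row in dataset_python:
            groups[row[0]].append(row)
        stats = []
        for product_id, items in groups.items():
            n = len(items)
            stats.append([product_id, n,
                          sum(x[2] for x in items),                 # total quantity
                          round(sum(x[3] for x in items) / n, 2)])  # mean price
        return stats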

------
0xfaded
As other comments point out, performance mostly depends on correct
implementation.

Once the scripting overhead is under control, you're really just measuring the
performance of your BLAS library.

A much more useful comparison, although 2.5 years old:

http://blog.nguyenvq.com/blog/2014/11/10/optimized-r-and-python-standard-blas-vs-atlas-vs-openblas-vs-mkl/

------
gravypod
Maybe someone can tell me the benefit, but why use Pandas rather than a real
database? Pandas seems like a non-standard, in-memory, Python-specific
database system that has no indexing, caching, or persistence layer aside from
dumping to flat files. It's an RDBMS without the Relational bit.

~~~
eesmith
I don't think you have looked at Pandas that closely. For example, Pandas has
multiple ways to persist besides a flat file, including to SQLite and HDF5:
http://pandas.pydata.org/pandas-docs/stable/io.html

It's also aware of the relational data model. The 10 minute intro at
http://pandas.pydata.org/pandas-docs/stable/10min.html?highlight=relational
says:

"pandas provides various facilities for easily combining together Series,
DataFrame, and Panel objects with various kinds of set logic for the indexes
and _relational algebra_ functionality in the case of join / merge-type
operations."

and http://pandas.pydata.org/pandas-docs/stable/comparison_with_sql.html
gives a mapping of basic SQL operations to Pandas.

There is also some caching. The documentation mentions things like "A large
range of dates for various offsets are pre-computed and cached under the hood
in order to make generating subsequent date ranges very fast (just have to
grab a slice)" and "Bug in Series update where the parent frame is not
updating its cache based on changes".

Perhaps you can think of it as an in-memory relational database that has no
SQL or declarative language but instead an imperative Python API, containing a
large number of helper functions not found in most RDBMSes but which are
important in data analysis. Think of it as a database engine where the user
has control over the internal memory layout, so C extensions and other code (a
"cartridge" or "datablade", if you will) can work with it directly.

Suppose you want to read a CSV into a data frame, compute the difference
between columns "X" and "Y", and show those differences as a histogram. That's
one line of Pandas. How do you do that in an RDBMS?
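
For the record, a sketch of that one-liner (the file name and column names are
hypothetical):

    import pandas as pd
    
    df = pd.read_csv("data.csv")  # hypothetical file with numeric columns X, Y
    (df["X"] - df["Y"]).hist()    # element-wise difference, plotted as histogram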

------
wcr3
Much of pandas is actually built on numpy...

