Data Science: Performance of Python vs. Pandas vs. Numpy (machinelearningexp.com)
118 points by lukasz_km on July 15, 2017 | 40 comments



I left the following comment at the blog:

For this example, idiomatic Pandas would look like this:

  def gen_stats_pandas2(dataset_pandas):
      grouped = dataset_pandas.groupby('product_id')
      stats = grouped.aggregate(['count', 'sum', 'mean'])
      return stats[[('product_order_num','count'), ('quantity','sum'), ('price','mean')]]

which is simpler and about 500x faster than gen_stats_pandas() in the example:

  %%time 
  res1 = gen_stats_pandas(dataset_pandas)

  CPU times: user 8.48 s, sys: 32 ms, total: 8.51 s
  Wall time: 8.58 s

  %%time 
  res2 = gen_stats_pandas2(dataset_pandas)

  CPU times: user 16 ms, sys: 4 ms, total: 20 ms
  Wall time: 17.6 ms


Sometimes I wonder whether actual optimization (as a craft, if you like) has been poisoned by scripting languages (Python etc.) and frameworks.

For one thing they completely changed how the speed of computers is perceived, e.g. it is actually considered fast when a web app spends "only" 20 server CPU milliseconds on a request.

And because most things are so inefficient, it is relatively easy to get integer-factor speed-ups. When looking at applications written in C++ by actual engineers, often just getting 5 or 10% on some functionality is appreciated, and achieving more can be seriously difficult. With software in scripting languages, improvements below a few hundred percent don't raise an eyebrow.

However, actually optimizing code in a scripting language (and not changing the way a native library is used) can be seriously difficult and annoying, because most simply don't have efficient tools — obviously, if a scripting language (like Python) had something substantially faster than, say, attribute access, then that something would simply be used to power attribute access in the general case. So in C++ (or other native languages) you can actually make good use of the processor's and memory system's resources, but you just can't writing in a scripting language.

It is also surprising that most tooling available for e.g. Python is rather immature, insufficient and cumbersome to use compared to the tooling (especially commercial) available for native code (or Java). This likely has something to do with people a) not caring about performance at all, and/or b) the fact that, since it is basically not possible to write fast Python code, anything that becomes important for performance is written in another language and used from Python; and for analyzing the performance of that code, which is now an external library, you don't need tools for Python.


When Python is being used for these data analytical tasks, random attribute access is not the common case. Instead, SIMD and vectorized mathematical transformations are the order of the day, and those can be specified very concisely and powerfully within the syntax of a "scripting" language.
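
For instance, a minimal sketch of what that kind of vectorized code looks like in NumPy (toy data, not from the article): the whole transformation is a couple of expressions executed in native code, with no explicit Python loop over elements.

    import numpy as np

    prices = np.array([10.0, 12.5, 9.99])
    quantities = np.array([3, 1, 7])
    revenue = prices * quantities                                 # elementwise multiply in C
    discounted = np.where(revenue > 30, 0.9 * revenue, revenue)   # vectorized branching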

More importantly, that conciseness means the language is accessible to people who understand the right maths to use. This is frequently not your CS-trained C++/Java software engineer.

It's the age-old philosophical quandary: is it more useful to have the fastest bomber that can only be piloted by specialized pilots who must receive targeting instructions via long-distance radio, or is it better to have a moderately fast bomber that can be piloted by people who can actually see the targets?

Python is succeeding and taking over the world because businesses are overwhelmingly seeing the value of the latter approach.


Ad-hoc statistical analysis that takes a few seconds to run is a different technical case than a production environment where every second is critical. For most datasets/workflows, as long as the data fits into a few GB of memory, it can be processed in a reasonable amount of time with any modern data analysis library like pandas or R's dplyr, without having to dive into bare metal.

The exception is when you code your algorithm very poorly (as in this case) and the bad time complexity causes the run time to balloon at large amounts of data.


This comment contains some fairly pejorative language.

I can assure you that there are real-world high performance python systems with per-event budgets of far less than 1ms being used for critical systems in large organisations. Sometimes with the help of compiled components, but made understandable by being coordinated in python.

Of course these systems are atypical, but the idea that ‘actual’ engineers always use C/Java, while Python is just a scripting language, is just ignorant.


That's not what I wrote.


It's especially weird in this case because the value-proposition of pandas is the ability to do things like aggregation in 3 lines of code and not have to iterate through every row.


Well pandas iterates through every row anyway, just in native-land rather than pure Python. That's not the difference.

The trouble with this code was the choice of the quadratic algorithm (which the idiomatic pandas version avoids). You could achieve a similar speed-up as OP for the pure Python code too, by dropping the doubly-nested loop.

The point stands as usual: better algo > clever tricks :) Personally, I find the code brevity that pandas offers does not outweigh the added complexity and cognitive load of its (IMO hard to remember and reason about) API.

EDIT: pure Python version

  def gen_stats_python2(dataset_python):
      aggr = {}
      for product_id, product_order_num, quantity, price in dataset_python:
          item = aggr.get(product_id, [0, []])
          item[0] += quantity
          item[1].append(price)
          aggr[product_id] = item
      return [
          (int(product_id), len(prices), int(quantity), round(float(sum(prices)) / len(prices), 2))
          for product_id, (quantity, prices) in aggr.items()
      ]
761x speedup!

  %timeit gen_stats_python(dataset_python)
  1 loops, best of 3: 36.3 s per loop

  %timeit gen_stats_python2(dataset_python)
  10 loops, best of 3: 47.7 ms per loop


It's a lot faster if you use python's itertools and operator itemgetter:

  import itertools
  import time
  from operator import itemgetter

  def gen_stats(data):
    start = time.time()
    stats = []
    data.sort(key=itemgetter(0))
    for p_id, p_items in itertools.groupby(data, key=itemgetter(0)):
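      # zip(*p_items) transposes the rows into columns and map(sum, ...) totals each
      # column; the product_id column sums to count * p_id, divided back out below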
      num_orders, _, total_quantity, total_price = map(sum, zip(*p_items))
      p_id, num_orders, total_quantity = map(int, [p_id, num_orders/p_id, total_quantity])
      stats.append([p_id, num_orders, total_quantity, round(float(total_price/num_orders),2)])
    end = time.time()
    working_time = end-start
    return stats, working_time
Results:

  Dataset size 55096 records
  [[1, 8, 340, 22.55], [2, 4, 94, 15.83], [3, 2, 97, 3.89]]
  [[1, 8, 340, 22.55], [2, 4, 94, 15.83], [3, 2, 97, 3.89]]
  All results equal: True
  Runtime (s)
  gen_stats_slow: 28.489468574523926
  gen_stats_fast:  0.06000018119812012


That doesn't look faster -- 60ms? Nor clearer, at least to me :)

Additionally, you might have a bug in the code: groupby() only groups consecutive keys, so your code will fail if the dataset is not already sorted by product_id.
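
A toy illustration of that caveat (hypothetical keys, not the article's data): groupby() only merges consecutive runs of equal keys, so unsorted input yields duplicate groups.

    from itertools import groupby

    data = [1, 1, 2, 1]
    print([(key, len(list(group))) for key, group in groupby(data)])
    # [(1, 2), (2, 1), (1, 1)]  -- key 1 shows up as two separate groups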

I guess that's the lesson here -- use pandas to make sure you get a sane, fast, tested implementation of some common use-cases. That is, if you can remember its API, when it makes copies vs views under the hood, and all that gnarly stuff :-)


Wee! Pointless optimization Saturday!

Looks like my gen_stats_eesmith is even faster. Radim's is about 400x faster, iClaudiusX is 450x faster, and eesmith is 580x faster.

Under pypy2-v5.8.0 those are 65x, 16x, and 91x, respectively. The original algorithm is 10x faster under pypy.

    import itertools
    import time
    from collections import defaultdict
    from operator import itemgetter

    def gen_stats_python(dataset_python):
        start = time.time()
        product_stats = []
        unique_products = set([x[0] for x in dataset_python])
        for product_id in unique_products:
            product_items = [x for x in dataset_python if x[0]==product_id ]
            num_orders = len(product_items)
            total_quantity = 0
            total_price = 0
            for row in product_items:
                total_quantity += row[2]
                total_price += row[3]
            avg_price = float(total_price/num_orders)
            product_stats.append([int(product_id),int(num_orders),int(total_quantity),round(avg_price,2)])
        end = time.time()
        working_time = end-start
        return product_stats,working_time

    def gen_stats_radim(dataset_python):
      start = time.time()
      aggr = {}
      for product_id, product_order_num, quantity, price in dataset_python:
          item = aggr.get(product_id, [0, []])
          item[0] += quantity
          item[1].append(price)
          aggr[product_id] = item
      return [
          [int(product_id), len(prices), int(quantity), round(float(sum(prices)) / len(prices), 2)]
          for product_id, (quantity, prices) in aggr.items()
      ], time.time()-start

    def gen_stats_iclaudiusx(data):
        start = time.time()
        stats = []
        data.sort(key=itemgetter(0))
        for p_id, p_items in itertools.groupby(data, key=itemgetter(0)):
          num_orders, _, total_quantity, total_price = map(sum, zip(*p_items))
          p_id, num_orders, total_quantity = map(int, [p_id, num_orders/p_id, total_quantity])
          stats.append([p_id, num_orders, total_quantity, round(float(total_price/num_orders),2)])
        end = time.time()
        working_time = end-start
        return stats, working_time

    def gen_stats_eesmith(dataset_python):
        start = time.time()
        product_stats = []
        total_num_orders = defaultdict(int)
        total_quantity_by_product_id = defaultdict(int)
        total_price_by_product_id = defaultdict(float)

        for product_id, order_id, quantity, price in dataset_python:
            total_num_orders[product_id] += 1
            total_quantity_by_product_id[product_id] += quantity
            total_price_by_product_id[product_id] += price

        for product_id in total_price_by_product_id:
            num_orders = total_num_orders[product_id]
            total_quantity = total_quantity_by_product_id[product_id]
            total_price = total_price_by_product_id[product_id]

            avg_price = float(total_price)/num_orders
            product_stats.append([int(product_id),int(num_orders),int(total_quantity),round(avg_price,2)])
        end = time.time()
        working_time = end-start
        return product_stats,working_time

    reference_result = reference_dt = None
    gen_stats_functions = (gen_stats_python, gen_stats_radim,
        gen_stats_iclaudiusx, gen_stats_eesmith)
    for gen_stats_function in gen_stats_functions:
        result, dt = gen_stats_function(dataset_python)
        if reference_result is None:
            reference_result = result
            reference_dt = dt
        assert result == reference_result, (gen_stats_function.__name__,
                 result[:2], reference_result[:2])
        print(gen_stats_function.__name__, dt, reference_dt/dt)


Who doesn't like a good pointless optimisation day? In the same vein (and because what else was I going to do with my Sunday morning?) I ran yours under 3.x and came up with radim/iclaudiusx at ~430x, eesmith at ~600x, tech2 at ~850x. I didn't get a chance to try pypy. Thanks for the amusement!

    def gen_stats_tech2(dataset_python):
        start = time.time()
        product_stats = []

        # totals = (num_orders, quantity, price)
        totals = defaultdict(lambda: (0, 0, 0.0))

        for product_id, order_id, quantity, price in dataset_python:
            o_count, o_quant, o_price = totals[product_id]
            totals[product_id] = (o_count + 1, o_quant + quantity, o_price + price)

        for product_id, product_info in totals.items():
            num_orders, total_quantity, total_price = product_info
            avg_price = total_price / num_orders
            product_stats.append(
                [product_id, num_orders, int(total_quantity),
                 round(avg_price, 2)]
            )
        end = time.time()
        working_time = end - start
        return product_stats, working_time


And, just because I'm in a golfing mood:

    def gen_stats_tech2(dataset_python):
        start = time.time()

        totals = defaultdict(lambda: (0, 0, 0.0))

        for product_id, order_id, quantity, price in dataset_python:
            o_count, o_quant, o_price = totals[product_id]
            totals[product_id] = (o_count + 1, o_quant + quantity, o_price + price)

        product_stats = [
            [product_id, num_orders, total_quantity,
             round(total_price / num_orders, 2)]
            for product_id, (num_orders, total_quantity, total_price)
            in totals.items()
        ]
        end = time.time()
        working_time = end - start
        return product_stats, working_time
950x improvement. I think the former reads more pleasantly though, and you're likely to do better with something like pypy rather than dragging the code through the muck.

Most improvements are from use of some of the builtins like dictionary's .items method rather than making a getitem call per loop, removal of references to globals (int) since the values stored already have an integer value, removing N calls to append and letting the list-comp manage, and removing a store-then-unpack.
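
For example, a small sketch of two of those tweaks (toy code, not from the thread): replacing an explicit append loop with a list comprehension, and binding a global to a local name so it isn't looked up on every iteration.

    # explicit loop: one append lookup/call and one global lookup of round() per row
    def averages_loop(pairs):
        out = []
        for total, count in pairs:
            out.append(round(total / count, 2))
        return out

    # list comprehension, with round bound to a local name via a default argument:
    # same result, fewer per-iteration lookups
    def averages_comp(pairs, _round=round):
        return [_round(total / count, 2) for total, count in pairs]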

Sadly I don't think the top loop can be reduced in a similar manner because it's self-referential. Still, that made for a fun hour or so, I was really hoping to hit 1,000x improvement :/


Nice work!


Nice :)

Interestingly, the O(n) Python is still about 4x slower than the O(n) Pandas -- the same relative difference as in the original O(n^2) implementations from the blog post.

So despite the mishaps of this "Senior IT Business Analyst" choosing an unsuitable algorithm, his relative graphs pretty much stand.


I have had much fun reading this thread of comments today :)

This comparison was never about concrete (optimal) implementations but about comparing exactly the same implementations, doing exactly the same operations. Such an approach would be the same for comparing the performance of any languages, like C and Java. This is not a "best algo" race.

So, after a day of discussion, the result is the following: regardless of whether the implementation is quadratic or linear, the relative speed of all 3 technologies is (roughly) the same. I could not agree more, as this is primary school math: dividing by the same denominator means the proportions stay the same :)

I plan to update my blog post, based on many interesting comments you have all provided. I definitely think this analysis should be more complex. Thank you for your insights.

I have many great pieces of code for inspiration, but I am struggling with providing exactly the same implementations. So it may cost me some time to provide all 3 optimized implementations that are exactly equivalent to each other.


"the relative speed of all 3 technologies is (roughly) the same"

I'll be a bit pedantic. The original essay says "pure Python", which is not a technology but a language. There are several implementations of Python, which I'll argue is the "technology". The PyPy implementation of the pure Python algorithm is 10x faster than the CPython implementation.

According to the original essay, that's close to the CPython/NumPy performance, and faster than the CPython/Pandas version.

It's tempting to omit PyPy because "nobody uses it in data science". That's a self-fulfilling argument.

PyPy is doing amazing work to support both NumPy and Pandas, but it's limited by funding. Why aren't data science people pumping money into PyPy? It would give a huge performance boost even for naive, throw-it-together algorithms.

If you don't know what PyPy can do for you, how do you learn if/when you should use it?


I have tried PyPy once and it gives a huge speed-up. I am not sure how it works with Pandas/Numpy. I have also tried Numba JIT. Easy to use, but in my experience it saves only up to 10% of the time. There is also Cython (https://pandas.pydata.org/pandas-docs/stable/enhancingperf.h...) and it definitely works with Pandas. Have not tried it yet.
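
For reference, Numba is typically applied something like this (a minimal sketch, not the blog's code; how much it helps depends on how loop-heavy and numeric the function is):

    import numba
    import numpy as np

    @numba.jit(nopython=True)
    def total_price(prices):
        total = 0.0
        for p in prices:           # this loop gets compiled to machine code
            total += p
        return total

    total_price(np.random.rand(1000))  # the first call triggers JIT compilation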


If you have a pure Python implementation, then why does it matter how well PyPy works with Pandas and NumPy?

My point is a minor one. "Python" isn't a technology. "CPython" and "PyPy" are technologies. Or you can try MicroPython, Iron Python, Jython, and others.

In any case, you'll see the best idiomatic Pandas code is reported to be 500x faster than the original, while the best idiomatic Python is reported to be 950x faster. The proportions are not the same, so it's not like "dividing by the same denominator".

The rank order is unchanged, though that's a weaker observation than what you wrote.


Alright man, good luck. I know HN can be brutal.

But this was really not your best piece (technically; it may have met its marketing / traffic goals).


>> Well pandas iterates through every row anyway

The power of pandas comes - partly - from vectorized operations. This means that instead of iterating through every row in Python, it processes the data in whole sets, transforming the operations into matrix/vector manipulation. For matrix manipulation it uses low-level, high-performance packages borrowed from C, NumPy, etc.

Writing "pandas code" without using the pandas operations is like eating soup with a fork. It works, but it is pretty slow.
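
A tiny sketch of the difference (hypothetical columns, not the article's data):

    import pandas as pd

    df = pd.DataFrame({'quantity': [2, 4, 3], 'price': [1.5, 2.0, 0.5]})

    # row by row, in Python:
    totals = [row.quantity * row.price for row in df.itertuples()]

    # vectorized: one call, dispatched to the native layer for the whole column
    totals = df['quantity'] * df['price']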


Even vectorized operations still have to go through every data item.

I'm not sure what you imagine happens inside pandas, but I assure you it's not magic.


Thanks for coding this up. The code in the blog post is, at best, incredibly misleading.


Was the comment redacted?


There are a couple of things wrong with this article.

The first problem, as pointed out by apl and em500, is the non-idiomatic pandas code. Pandas is fast not when you rewrite your algorithms one-for-one with pandas functions, but when you use functions provided uniquely by pandas; that is when you start thinking in pandas. My rewrite[0] with groupby is 47 times faster than pure python; it would be 1848 times faster if I didn't convert the result from a DataFrame to a list of lists, as in the other cases.

Secondly, if your analysis is taking too long on your MacBook Air, you should switch to faster hardware before investing your time in time-consuming optimizations. Presumably this is work done for an employer, and for them the data scientist's time is much more expensive than a powerful AWS instance. Of course, if you made big Big-O mistakes, no amount of hardware is going to fix that.

[0] https://github.com/aleyan/machinelearningexp/blob/master/Dat...


I only skimmed the notebook, but the code looks fairly inadequate. The pandas portion, for instance, treats the data frame as a dumb array and critically ignores grouping functionality which should offer a tremendous speedup. Moreover, for the exact same task pandas should never be slower than NumPy.

Benchmarks are hard to get right but this one falls way short of saying anything at all about performance penalties incurred by various libraries and abstractions.


No comment on the poor implementations (why quadratic complexity, in all three cases?), but some general notes:

For reporting benchmarks, you probably want the minimum rather than the average, while keeping an eye on the distribution/variation of the individual timings as well (to mitigate caching effects). Check out Python's timeit module or the %timeit magic command in IPython.
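
Something along these lines (a sketch, reusing gen_stats_python2 and dataset_python from elsewhere in this thread):

    import timeit

    # run the function once per repeat and report the best and worst of 5 repeats
    times = timeit.repeat(lambda: gen_stats_python2(dataset_python), number=1, repeat=5)
    print('best: %.3fs  worst: %.3fs' % (min(times), max(times)))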

With more numeric-heavy workflows, the gap between Python and C grows even larger. The more you can push to native-compiled-land, away from Python loops, the better. As another data point, when optimizing the word2vec algorithm, the numbers came out something like [0]:

1x (pure Python) - ~120x (NumPy) - ~2880x (C/Cython)

[0] https://rare-technologies.com/word2vec-in-python-part-two-op...


The data consists of 50k records. The fastest code in the post takes 3 seconds.

Have many programmers completely lost the ability to know roughly how fast code "should" be?


We use stupid, simple maths problems (like Project Euler 1-30) when hiring for junior positions.

We are not looking for someone to do the maths tricks that let you solve most of the problems with pen and paper, but for someone who recognizes a 100x speed increase using memoization.
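
Something like this toy sketch (not an actual interview question) is the level we mean: memoizing a recursive function collapses the exponential repeated work.

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def fib(n):
        # each distinct n is computed once and then served from the cache
        return n if n < 2 else fib(n - 1) + fib(n - 2)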

You'd think people with some programming experience would be able to reason about how to speed up a simple brute force approach, but those parts of the hiring process usually weed out 75%, and I don't expect someone to give me a perfect solution. Heck, not even a working piece of code. Just being able to say "hmm, we could maybe cut half of the computation by just using the relevant parts of the alphabet" is all I want.

Then we have people like my colleague who just wrote print(103), which was the correct answer. (He could also give several fast approaches to solving the problem, one of which we hadn't thought about before.)


While we're on the subject, I'd like to see what this community has to offer on the efficiency of the pandas groupby-apply operations, which I believe are the idiomatic pandas way. The question [0] on SO is a demonstration where groupby-apply loses to complete iteration of the dataset (which also uses some pandas, though I'm not sure that's idiomatic pandas).

Thanks!

[0] https://stackoverflow.com/questions/45012503/conditionally-p...
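
For what it's worth, the contrast usually being discussed looks something like this sketch (hypothetical frame, not the data from the SO question): apply() calls a Python function once per group, while the built-in grouped reductions stay on the compiled fast path.

    import pandas as pd

    df = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, 2.0, 3.0]})

    slow = df.groupby('key').apply(lambda g: g['val'].sum())   # Python call per group
    fast = df.groupby('key')['val'].sum()                      # built-in grouped sum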


I usually do all my feature engineering and data exploration in pandas, and then I change my data from pandas to numpy just before I throw it to the classifiers / regressors.


Have you looked at PyTorch? It can be seen as a GPU-accelerated numpy.

see https://vwrideout.github.io/QR-Blog/index.html


Likewise. There's also sklearn_pandas which helps you plug a DataFrame directly into a Transformer or Classifier.


This post is rather naive. Pandas performance improves when one uses built-in functions. Also, the performance of pandas is better than numpy for larger sizes [1].

[1] http://gouthamanbalaraman.com/blog/numpy-vs-pandas-compariso...


There are a lot of great Python and Pandas code snippets here, but I am not sure anyone posted a Numpy-based solution. Below is mine. It gives a 100x speed-up over the base (quadratic Python) solution. Still, the suggested pure Python solutions and idiomatic Pandas solutions are much faster. I suspect Numpy has more power than that.

    import time
    import numpy as np

    def gen_stats_numpy_l(dataset_numpy):
        start = time.time()
        unique_products, unique_indices = np.unique(dataset_numpy[:,0], return_index=True)
        product_stats = []
        split = np.split(dataset_numpy, unique_indices)[1:]
        for item in split:
            length = len(item)
            product_stats.append([int(item[0,0]), int(length), int(np.sum(item[:,2])),
                                  float(np.round(np.sum(item[:,3])/length, 2))])
        end = time.time()
        working_time = end - start
        return product_stats, working_time
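
For comparison, here is a fully vectorized sketch using np.bincount for the grouped sums (assuming the same 4-column layout of product_id, order_num, quantity, price; not benchmarked here):

    import numpy as np

    def gen_stats_numpy_vec(dataset_numpy):
        ids, inverse, counts = np.unique(dataset_numpy[:, 0],
                                         return_inverse=True, return_counts=True)
        total_quantity = np.bincount(inverse, weights=dataset_numpy[:, 2])
        total_price = np.bincount(inverse, weights=dataset_numpy[:, 3])
        avg_price = np.round(total_price / counts, 2)
        return [[int(i), int(c), int(q), float(p)]
                for i, c, q, p in zip(ids, counts, total_quantity, avg_price)]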


This comparison isn't so great because of the following:

    for product_id in unique_products:
        product_items = [x for x in dataset_python if x[0]==product_id ]
This is O(unique_products * observations), and it looks like O(unique_products) = O(observations). Thus we have a quadratic scan when a linear one would suffice. You'll get the best performance using whichever solution lets you make this linear the fastest. E.g. in pure Python, make a dict from product_id to observations and iterate over that; for pandas, use groupby.


As other comments point out, performance mostly depends on correct implementation.

Once the scripting overhead is under control, you're now measuring the performance of your BLAS library.

A much more useful comparison, although 2.5 years old:

http://blog.nguyenvq.com/blog/2014/11/10/optimized-r-and-pyt...


Maybe someone can tell me the benefit, but why use Pandas rather than a real database? Pandas seems like a non-standard, in-memory, Python-specific database system that has no indexing, caching, or persistence layer aside from dumping to flat files. It's an RDBMS without the Relational bit.


I don't think you have looked at Pandas that closely. For example, Pandas has multiple ways to persist besides a flat file, including to SQLite and HDF5. - http://pandas.pydata.org/pandas-docs/stable/io.html

It's also aware of the relational data model. The 10 minute intro at http://pandas.pydata.org/pandas-docs/stable/10min.html?highl... says:

"pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations."

and http://pandas.pydata.org/pandas-docs/stable/comparison_with_... gives a mapping of basic SQL operations to Pandas.

There is also some caching. The documentation mentions things like "A large range of dates for various offsets are pre-computed and cached under the hood in order to make generating subsequent date ranges very fast (just have to grab a slice)" and "Bug in Series update where the parent frame is not updating its cache based on changes".

Perhaps you can think of it as an in-memory relational database without SQL or a declarative language, but with an imperative API for Python instead, containing a large number of helper functions not in most RDBMSes but which are important in data analysis. Think of it as a database engine where the user has control over the internal memory layout so C extensions and other code (a "cartridge" or "datablade", if you will) can work with it directly.

Suppose you want to read a CSV into a data frame, compute the difference between column "X" and "Y", and show those differences as a histogram. That's one line of Pandas. How do you do that in an RDBMS?
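
Roughly like this (a sketch, assuming a file named data.csv with numeric columns X and Y; the histogram is drawn by matplotlib under the hood):

    import pandas as pd

    pd.read_csv('data.csv').eval('X - Y').hist()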


Much of pandas is actually built on numpy...



