which is simpler and about 500x faster than gen_stats_pandas() in the example:
%%time
res1 = gen_stats_pandas(dataset_pandas)
CPU times: user 8.48 s, sys: 32 ms, total: 8.51 s
Wall time: 8.58 s
%%time
res2 = gen_stats_pandas2(dataset_pandas)
CPU times: user 16 ms, sys: 4 ms, total: 20 ms
Wall time: 17.6 ms
Sometimes I wonder whether actual optimization (as a craft, if you like) has been poisoned by scripting languages (Python etc.) and frameworks.
For one thing they completely changed how the speed of computers is perceived, e.g. it is actually considered fast when a web app spends "only" 20 server CPU milliseconds on a request.
And because most things are so inefficient, it is relatively easy to get integer-factor speed-ups. When looking at applications written in C++ by actual engineers, often just getting 5 or 10% on some functionality is appreciated, and achieving more can be seriously difficult. With software in scripting languages, improvements below a few hundred percent don't raise an eyebrow.
However, actually optimizing code in a scripting language (without changing the way a native library is used) can be seriously difficult and annoying, because most such languages simply don't have efficient tools: obviously, if a scripting language like Python had something substantially faster than, say, attribute access, then that something would simply be used to power attribute access in the general case. So in C++ (or other native languages) you can actually make good use of the processor's and memory system's resources, but you just can't when writing in a scripting language.
It is also surprising that most tooling available for e.g. Python is rather immature, insufficient and cumbersome to use compared to the tooling (especially commercial) available for native code (or Java). This likely has something to do with people a) not caring about performance at all, and/or b) the fact that, since it is basically not possible to write fast Python code, anything that becomes important for performance gets rewritten in another language and called from Python; and for analyzing the performance of that code, which is now an external library, you don't need tools for Python.
When Python is being used for these data analytical tasks, random attribute access is not the common case. Instead, SIMD and vectorized mathematical transformations are the order of the day, and those can be specified very concisely and powerfully within the syntax of a "scripting" language.
More importantly, that conciseness means the language is accessible to people who understand the right maths to use. This is frequently not your CS-trained C++/Java software engineer.
It's the age-old philosophical quandary: is it more useful to have the fastest bomber that can only be piloted by specialized pilots who must receive targeting instructions via long-distance radio, or is it better to have a moderately fast bomber that can be piloted by people who can actually see the targets?
Python is succeeding and taking over the world because businesses are overwhelmingly seeing the value of the latter approach.
Ad-hoc statistical analysis that takes a few seconds to run is a different technical case than production environments where every second is critical. For most datasets/workflows, as long as the data fits into a few GB of memory, it can be processed in a reasonable amount of time with any modern data analysis library like pandas or R's dplyr, without having to dive into bare metal.
The exception is if you code your algorithm very poorly (like in this case), with bad time complexity that causes the run time to balloon at large amounts of data.
This comment contains some fairly pejorative language.
I can assure you that there are real-world high-performance Python systems with per-event budgets of far less than 1 ms being used for critical systems in large organisations. Sometimes with the help of compiled components, but made understandable by being coordinated in Python.
Of course these systems are atypical, but the idea that ‘actual’ engineers always use C/Java while Python is just a scripting language is simply ignorant.
It's especially weird in this case because the value-proposition of pandas is the ability to do things like aggregation in 3 lines of code and not have to iterate through every row.
Well pandas iterates through every row anyway, just in native-land rather than pure Python. That's not the difference.
The trouble with this code was the choice of the quadratic algorithm (which the idiomatic pandas version avoids). You could achieve a similar speed-up as OP for the pure Python code too, by dropping the doubly-nested loop.
The point stands as usual: better algo > clever tricks :) Personally, I find the code brevity that pandas offers does not outweigh the added complexity and cognitive load of its (IMO hard to remember and reason about) API.
EDIT: pure Python version
def gen_stats_python2(dataset_python):
    # Single linear pass: accumulate total quantity and the list of prices per product.
    aggr = {}
    for product_id, product_order_num, quantity, price in dataset_python:
        item = aggr.get(product_id, [0, []])
        item[0] += quantity
        item[1].append(price)
        aggr[product_id] = item
    # One row per product: (id, number of orders, total quantity, average price).
    return [
        (int(product_id), len(prices), int(quantity), round(float(sum(prices)) / len(prices), 2))
        for product_id, (quantity, prices) in aggr.items()
    ]
761x speedup!
%timeit gen_stats_python(dataset_python)
1 loops, best of 3: 36.3 s per loop
%timeit gen_stats_python2(dataset_python)
10 loops, best of 3: 47.7 ms per loop
That doesn't look faster -- 60ms? Nor clearer, at least to me :)
Additionally, you might have a bug in the code: groupby() only groups consecutive keys, so your code will fail if the dataset is not already sorted by product_id.
I guess that's the lesson here -- use pandas to make sure you get a sane, fast, tested implementation of some common use-cases. That is, if you can remember its API, when it makes copies vs views under the hood, and all that gnarly stuff :-)
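A tiny illustration of that groupby() caveat (made-up data, for illustration only): itertools.groupby merges only consecutive equal keys, so unsorted input silently splits a group in two.

from itertools import groupby

data = [(2, 10), (1, 5), (2, 7)]  # (product_id, quantity), deliberately unsorted
print([(key, len(list(grp))) for key, grp in groupby(data, key=lambda row: row[0])])
# -> [(2, 1), (1, 1), (2, 1)]   product_id 2 is split into two groups
data.sort(key=lambda row: row[0])
print([(key, len(list(grp))) for key, grp in groupby(data, key=lambda row: row[0])])
# -> [(1, 1), (2, 2)]           sorting by the key first gives the expected grouping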
Who doesn't like a good pointless optimisation day? In the same vein (and because what else was I going to do with my Sunday morning?) I ran yours under 3.x and came up with radim/iclaudiusx at ~430x, eesmith at ~600x, tech2 at ~850x. I didn't get a chance to try pypy. Thanks for the amusement!
950x improvement. I think the former reads more pleasantly, and you're likely to do better with something like PyPy anyway, rather than dragging the code through the muck.
Most improvements come from using builtins like the dictionary's .items() method rather than making a getitem call per loop, removing references to globals (int) since the stored values are already integers, removing N calls to append and letting the list comprehension manage that, and removing a store-then-unpack.
Sadly I don't think the top loop can be reduced in a similar manner because it's self-referential. Still, that made for a fun hour or so; I was really hoping to hit a 1,000x improvement :/
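(The commenter's actual code isn't shown above. As a hedged sketch, a version applying those kinds of micro-optimizations to gen_stats_python2 might look like the following; the name gen_stats_python3 and the running-sum approach are assumptions, not the commenter's code.)

def gen_stats_python3(dataset_python):
    # Accumulate running totals instead of a list of prices, so there are no
    # per-row .append() calls and no int()/float() conversions at the end.
    totals = {}
    for product_id, _order_num, quantity, price in dataset_python:
        if product_id in totals:
            t = totals[product_id]
            t[0] += quantity
            t[1] += price
            t[2] += 1
        else:
            totals[product_id] = [quantity, price, 1]
    # .items() avoids a getitem call per key; the list comprehension builds
    # the result without explicit append calls.
    return [
        (pid, n, qty, round(float(price_sum) / n, 2))
        for pid, (qty, price_sum, n) in totals.items()
    ]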
Interestingly, the O(n) Python is still about 4x slower than the O(n) Pandas -- the same relative difference as in the original O(n^2) implementations from the blog post.
So despite the mishaps of this "Senior IT Business Analyst" choosing an unsuitable algorithm, his relative graphs pretty much stand.
I have had much fun reading this thread of comments today :)
This comparison was never about concrete (optimal) implementations, but about comparing exactly the same implementations doing exactly the same operations. The approach would be the same for comparing the performance of any languages, like C and Java. This is not a "best algo" race.
So, after a day of discussion, the result is the following: regardless if the implementation is quadratic or linear, the relative speed of all 3 technologies is (roughly) the same. I could not agree more as this is primary school math: dividing by the same denominator still means that proportions are the same :)
I plan to update my blog post, based on the many interesting comments you have all provided. I definitely think this analysis should be more comprehensive. Thank you for your insights.
I have many great pieces of code for inspiration, but I am struggling with providing exactly the same implementations. So it may cost me some time to provide all 3 optimized implementations that are exactly comparable to each other.
"the relative speed of all 3 technologies is (roughly) the same"
I'll be a bit pedantic. The original essay says "pure Python", which is not a technology but a language. There are several implementations of Python, which I'll argue is the "technology". The PyPy implementation of the pure Python algorithm is 10x faster than the CPython implementation.
According to the original essay, that's close to the CPython/NumPy performance, and faster than the CPython/Pandas version.
It's tempting to omit PyPy because "nobody uses it in data science". That's a self-fulfilling argument.
PyPy is doing amazing work to support both NumPy and Pandas, but it's limited by funding. Why aren't data science people pumping money into PyPy? It would give a huge performance boost even for naive, throw-it-together algorithms.
If you don't know what PyPy can do for you, how do you learn if/when you should use it?
I have tried PyPy once and it gives a huge speed-up. I am not sure how it works with Pandas/NumPy. I have also tried the Numba JIT. Easy to use, but in my experience it saves just up to 10% of time.
There is also Cython (https://pandas.pydata.org/pandas-docs/stable/enhancingperf.h...) and it definitely works with Pandas. Have not tried it yet.
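(To illustrate the "easy to use" point about Numba: applying it is typically just a decorator on a plain-Python loop. A minimal hedged sketch with made-up data, not from the post:)

import numpy as np
from numba import njit

@njit
def total_quantity(quantities):
    # Plain Python loop; Numba compiles it to machine code on first call.
    total = 0.0
    for q in quantities:
        total += q
    return total

print(total_quantity(np.arange(1000000, dtype=np.float64)))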
If you have a pure Python implementation, then why does it matter how well PyPy works with Pandas and NumPy?
My point is a minor one. "Python" isn't a technology. "CPython" and "PyPy" are technologies. Or you can try MicroPython, Iron Python, Jython, and others.
In any case, you'll see the best idiomatic Pandas code is reported to be 500x faster than the original, while the best idiomatic Python is reported to be 950x faster. The proportions are not the same, so it's not like "dividing by the same denominator".
The rank order is unchanged, though that's a weaker observation than what you wrote.
The power of pandas comes - partly - from vectorized operations. This means that instead of iterating through every row, it processes the data as whole datasets, transforming the operations into matrix/vector manipulations. For the matrix manipulation it uses low-level, high-performance packages from C, NumPy, etc.
Writing "pandas code" without using the pandas operations is like eating soup with a fork: it works, but it is pretty slow.
There are a couple of things wrong with this article.
The first problem, as pointed out by apl and em500, is the non-idiomatic pandas code. Pandas is fast not when you rewrite your algorithm one-for-one with pandas functions, but when you use the functionality provided uniquely by pandas; that is when you start thinking with pandas. My rewrite[0] with groupby is 47 times faster than pure Python; it would be 1848 times faster if I didn't convert the result from a DataFrame to a list of lists, as in the other cases.
Secondly, if your analysis is taking too long on your MacBook Air, you should switch to faster hardware before investing your time in time-consuming optimizations. Presumably this is work done for an employer, and for them the data scientist's time is much more expensive than a powerful AWS instance. Of course, if you made big Big O mistakes, no amount of hardware is going to fix that.
I only skimmed the notebook, but the code looks fairly inadequate. The pandas portion, for instance, treats the data frame as a dumb array and critically ignores grouping functionality which should offer a tremendous speedup. Moreover, for the exact same task pandas should never be slower than NumPy.
Benchmarks are hard to get right but this one falls way short of saying anything at all about performance penalties incurred by various libraries and abstractions.
No comment on the poor implementations (why quadratic complexity, in all three cases?), but some general notes:
For reporting benchmarks, you probably want the minimum rather than the average, while keeping an eye on the distribution/variation of the individual timings as well (to mitigate caching effects). Check out Python's timeit module or the %timeit magic command in IPython.
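(A hedged sketch of doing that with the standard-library timeit module, assuming gen_stats_python2 and dataset_python are defined in the running script:)

import timeit

# repeat() returns one total time per run; report the minimum and keep an eye
# on the spread across runs to spot caching or other interference.
runs = timeit.repeat(
    "gen_stats_python2(dataset_python)",
    setup="from __main__ import gen_stats_python2, dataset_python",
    repeat=5,
    number=10,
)
print(min(runs) / 10)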
With more numeric-heavy workflows, the gap between Python/C grows even larger. The more you can push to native-compiled-land, away from Python loops, the better. As another data point, when optimizing the word2vec algorithm, the numbers came out something like [0]:
We use stupid, simple maths problems (like project Euler 1-30) when hiring for junior positions.
We are not looking for someone to do the maths tricks that let you solve most of the problems with pen and paper, but for someone who recognizes a 100x speed increase using memoization.
You'd think people with some programming experience would be able to reason about how to speed up a simple brute force approach, but those parts of the hiring process usually weed out 75%, and I don't expect someone to give me a perfect solution. Heck, not even a working piece of code. Just being able to say "hmm, we could maybe cut half of the computation by just using the relevant parts of the alphabet" is all I want.
Then we have people like my colleague who just wrote print(103), which was the correct answer. (He could also give several fast approaches to solving the problem, one of which we hadn't thought about before.)
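(A classic toy illustration of the memoization point, not one of the actual interview problems:)

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Without the cache this recursion is exponential; with it, each value
    # is computed once, which is the kind of speed-up being described.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(80))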
While we're on the subject, I'd like to see what this community has to offer on the efficiency of the pandas groupby-apply operations, which I believe are an idiomatic pandas way. The question [0] on SO is a demonstration where groupby-apply falls to complete iteration of the dataset (also using some pandas, not sure if that's idiomatic pandas though).
I usually do all my feature engineering and data exploration in pandas, and then I convert my data from pandas to numpy just before I throw it to the classifiers / regressors.
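(That hand-off is typically a one-liner; a minimal sketch with hypothetical column names:)

import pandas as pd

df = pd.DataFrame({"f1": [1.0, 2.0], "f2": [3.0, 4.0], "target": [0, 1]})
X = df[["f1", "f2"]].values  # 2-D numpy array of features
y = df["target"].values      # 1-D numpy array of labels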
This post is rather naive. Pandas performance improves when one uses its built-in functions. Also, the performance of pandas is better than numpy for larger sizes [1].
There are a lot of great Python and Pandas code snippets here, but I am not sure anyone has posted a NumPy-based solution.
Below is mine. It gives a 100x speed-up over the base (quadratic Python) solution.
Still, the suggested pure Python solutions and idiomatic NumPy solutions are much faster. I suspect NumPy has more power than that.
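(The commenter's code is not reproduced here. As a hedged sketch, one linear NumPy approach to the same aggregation uses np.unique plus np.bincount; the column layout below is an assumption matching the blog's dataset.)

import numpy as np

def gen_stats_numpy_sketch(dataset_numpy):
    # Columns assumed: product_id, order_num, quantity, price.
    product_ids = dataset_numpy[:, 0]
    quantities = dataset_numpy[:, 2]
    prices = dataset_numpy[:, 3]
    uniq, inv = np.unique(product_ids, return_inverse=True)
    counts = np.bincount(inv)                           # orders per product
    total_qty = np.bincount(inv, weights=quantities)    # quantity sums per product
    avg_price = np.bincount(inv, weights=prices) / counts
    return list(zip(uniq.astype(int), counts, total_qty.astype(int),
                    np.round(avg_price, 2)))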
This comparison isn't so great because of the following:
for product_id in unique_products:
    product_items = [x for x in dataset_python if x[0] == product_id]
This is O(unique_products * observations), and it looks like O(unique_products) = O(observations). Thus we have a quadratic scan when a linear one would suffice. You'll get the best performance using whichever solution lets you code this as linear the fastest; e.g., in pure Python, make a dict from product_id to observations and iterate over that, or for pandas, use groupby, as sketched below.
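(A hedged sketch of the linear pure-Python bucketing, reusing the column layout from the post's dataset:)

from collections import defaultdict

# One linear pass: bucket each observation under its product_id, instead of
# rescanning the whole dataset once per unique product.
items_by_product = defaultdict(list)
for row in dataset_python:
    items_by_product[row[0]].append(row)

# Each group's statistics now come from its own bucket only.
stats = [
    (product_id, len(rows), sum(r[2] for r in rows),
     round(sum(r[3] for r in rows) / len(rows), 2))
    for product_id, rows in items_by_product.items()
]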
Maybe someone can tell me the benefit, but why use Pandas rather than a real database? Pandas seems like a non-standard, in-memory, Python-specific database system that has no indexing, caching, or persistence layer aside from dumping to flat files. It's an RDBMS without the Relational bit.
I don't think you have looked at Pandas that closely. For example, Pandas has multiple ways to persist besides a flat file, including to SQLite and HDF5. - http://pandas.pydata.org/pandas-docs/stable/io.html
"pandas provides various facilities for easily combining together Series, DataFrame, and Panel objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations."
There is also some caching. The documentation mentions things like "A large range of dates for various offsets are pre-computed and cached under the hood in order to make generating subsequent date ranges very fast (just have to grab a slice)" and "Bug in Series update where the parent frame is not updating its cache based on changes".
Perhaps you can think of it as an in-memory relational database without SQL or a declarative language, but with an imperative API for Python, containing a large number of helper functions not found in most RDBMSes but which are important in data analysis. Think of it as a database engine where the user has control over the internal memory layout, so C extensions and other code (a "cartridge" or "datablade", if you will) can work with it directly.
Suppose you want to read a CSV into a data frame, compute the difference between column "X" and "Y", and show those differences as a histogram. That's one line of Pandas. How do you do that in an RDBMS?
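(Something along these lines; the file name and columns are hypothetical:)

import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical file with columns "X" and "Y"
(df["X"] - df["Y"]).hist()      # histogram of the per-row differences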
For this example, idiomatic Pandas would look like this:
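(The actual snippet isn't reproduced above; a hedged reconstruction of what gen_stats_pandas2 might look like, with the column names product_id, product_order_num, quantity, and price assumed from the pure-Python version earlier:)

def gen_stats_pandas2(dataset_pandas):
    # Group once by product_id and aggregate count/sum/mean in a single pass.
    grouped = dataset_pandas.groupby("product_id").agg(
        {"product_order_num": "count", "quantity": "sum", "price": "mean"}
    )
    grouped["price"] = grouped["price"].round(2)
    return grouped.reset_index()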
which is simpler and about 500x faster than gen_stats_pandas() in the example: