Python performance the easy(ish) way (jiaaro.com)
47 points by craigkerstiens on Sept 9, 2012 | hide | past | favorite | 31 comments


I've tried this with GCC 4.7.1 on Debian x86_64 and did not get it to work in constant time with -O2.

I'm guessing (from the OP's usage of -install_name) that he's been compiling this on OSX. I wonder what my compiler missed that the OP's didn't.

EDIT: just tried with clang and got constant-time behaviour. Interesting.

EDIT 2: reading the comments in the post, I now suspect it has to do with integer overflow. However, compiling with -fwrapv did not change anything. Need to dig into it more.

EDIT 3: it seems that clang simply notices that the computation can be done in constant time, whereas gcc does not. I'm not sure if it's actually useful in real-world code, but it's certainly somewhat magical to see a compiler understand that it can replace the entire loop with a simple calculation.


I assume the OP would be using clang/LLVM, as that is the default OSX compiler, and the gcc command is usually just an alias or link to clang.

In real world code this kind of optimisation is designed to catch things programmers aren't aware of, for a large decrease in required computation. In this case, sum(1 .. n) == n*(n+1)/2.
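The identity can be sanity-checked in Python; a small sketch (the function names are mine, not from the article) showing the linear-time loop next to the constant-time closed form clang substitutes for it:

```python
# A linear-time summation loop next to the constant-time closed form
# n*(n+1)/2 that clang substitutes for it.
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_closed_form(n):
    # Integer division is exact here: n*(n+1) is always even.
    return n * (n + 1) // 2

for n in (0, 1, 10, 10**5):
    assert sum_loop(n) == sum_closed_form(n)
print(sum_closed_form(10**5))  # 5000050000
```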


An overflow exception is plausible (it overflows anyway, but the question is whether the system will let it continue). Performance timing without validating your results is, well, not worth anything.


Did anyone even read his code? The C code gets the wrong answers once its integers overflow. The Python code will produce the correct answer using longs once the integer calculations overflow. Ctypes is the right answer for a different problem.
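The wrong-answer claim is easy to demonstrate; this sketch assumes a 32-bit C int and simulates its wraparound in Python (the helper name is mine):

```python
# Python ints grow without bound, while a C int (assumed 32-bit here)
# wraps around. Simulate the wrap by reducing modulo 2**32 and
# reinterpreting the result as a signed value.
def as_int32(x):
    x &= 0xFFFFFFFF
    return x - 0x100000000 if x >= 0x80000000 else x

n = 100_000
exact = n * (n + 1) // 2     # what Python returns: 5000050000
wrapped = as_int32(exact)    # what 32-bit C would return: 705082704
print(exact, wrapped)
```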


I agree that it'd be more correct to make the C and Python code output the same answer.

The important thing here is how to hook up C code to Python.

The benchmarks are still (mostly) valid because it's still the same number of addition operations (regardless of the overflow).

Honestly, I'm not a very good C programmer (as you can see). I was really just documenting how to do this for myself. I never expected to get such a big surge of traffic from HN.

That being said, I'll take a second look at the code and try to make it more correct, suggestions welcome!
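For reference, the ctypes mechanism the article relies on looks roughly like this; to stay self-contained, this sketch calls labs from the C library already loaded into the process (Unix-only) instead of compiling a custom shared library as the post does:

```python
import ctypes

# Load the C library already linked into the running process (Unix-only);
# the article instead compiles its own shared library and loads that.
libc = ctypes.CDLL(None)

# Declare the C signature so ctypes converts arguments correctly.
libc.labs.argtypes = [ctypes.c_long]
libc.labs.restype = ctypes.c_long

print(libc.labs(-42))  # 42
```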


I updated the post to use unsigned long long for the values (which doesn't overflow for this input).

I timed the calls again, but didn't bother to update the values in the post as they were within 1% of the original timings.


I'd also recommend the cython library (http://docs.cython.org/src/tutorial/). It is as easy as writing ordinary Python code with some type annotations, and achieves similar speedups.


So I've done some limited work with writing C/numpy extensions (for numerical work) and it's gone swimmingly. However, a lot of performance-critical things I need to do don't necessarily fit into array/matrix numerical computing - lots of times there are difficult, complex data structures involved that need to be accessed in the "inner loop" (not really a loop, but you get the point). I typically end up using Java for these sorts of problems, although sometimes PyPy does okay.

I rather dislike the constant advice I see doled out to those who mention Python's performance issues - namely, that they can just rewrite the performance critical parts in C. Lots of the time, the performance critical parts are just as sophisticated and require just as much abstract, high-level coding as the rest of the program. Some awareness of that would be nice.


I hate responding with such a flip answer, but have you tried Cython? If you have to do some serious munging inside inner loops and need something that is a bit more aware of data structures and the shape of your data I have found Cython to be a nice option for building the equivalent of a Python C-extension module with mostly Python-like code and structure.


CTypes is neat for simple tasks but Cython is the industrial strength solution for large extensions.


First of all, the C code will return wrong answers because of integer overflow for sufficiently large inputs. But since this is only for the sake of an explanation on Python performance, I'm willing to take that.

What disturbs me much more is that the article basically concludes that if Python performance isn't good enough for you, you should write it in C. How is that Python performance? If speeding up Python is what you want, I'd suggest having a look at PyPy, which is a JIT-compiled Python runtime. It doesn't work with some libraries (for reasons I haven't looked into yet) but works wonders on performance!

Oh yeah, and if anyone ever sums up long runs of consecutive integers using a linear-time algorithm: please use n*(n+1)/2. This is obvious, but since the article omitted it I thought I'd say it here.


As awesome as PyPy is, it isn't going to give you anywhere near the sort of speedup you'll get from dropping down to C for your critical functions. The best I've seen on actual code I've written (as opposed to benchmarks) is a 4-5x speedup, which is certainly nothing to sneeze at, but still a long way from comparing with C. For what it's worth, in the contrived example used in the article I only got a 1.2-1.3x speedup using pypy (pypy 1.9 vs python 2.7).

(of course you can use ctypes from pypy as well, so you can get the best of both worlds)


This is true, yes. The kind of performance C delivers is often only achieved by C, Fortran and some other compiled languages. But that's not the point of what I was saying. I was talking about speeding up the Python language. That's what PyPy is incredibly good at.

I once wrote a small traffic simulation in Python. Not real-time. It would take about 15 minutes to run a medium sized simulation using the CPython interpreter. PyPy ran the same simulation in slightly less than a minute. This is a substantial speedup. In this case it was because function calls can be incredibly slow in Python and PyPy takes care of that.

I'm not claiming that implementing parts of the program in C is bad (I'm mainly a C programmer anyway). All I'm saying is that it's not really "speeding up Python" when you're not using Python. It's definitely an option though, and a very good one too!

ctypes might not be the best way to do it for some use cases though. In my opinion, the most beautiful way is extending the Python API with native C. That option produces the most LOCs though, and therefore probably more bugs. However, if you've got large portions of a Python program in need of speedup and PyPy won't cut it, it's probably the best option to write a decent part of the low level stuff in C and provide native Python bindings. This way you get the best of both worlds: The performance of C for your underpinnings and the flexibility of Python for developing all the other parts, saving you time and therefore money and sweat.


If you like this kind of dynamic code generation and usage, LLVM with ctypes is also a promising avenue, as you don't (necessarily) have to invoke an external compiler. You can build the module on-the-fly.

For example see bitey, an abstraction to directly import LLVM bitcode files: https://github.com/dabeaz/bitey

That said, there's so many ways to speed up Python these days...

In most practical cases where I need performance I'd try to run it in PyPy (easiest), use Cython (great Python-C hybrid), use Theano (to generate GPU code), or even write a plain Python C API extension (no dependencies to deploy, and unlike ctypes you can directly offer Python classes without a Python wrapper).


Thanks for showing it was that easy. I knew this was possible, but I thought it to be way more complicated. Now I'm more inclined to use it (if I really need that level of optimisation).


I wrote a program in assembly a few weeks ago to do this exact thing.

It can spawn worker threads to speed up the summation. I have a quad core CPU, so I've set it to 4. If you have a different number of cores, you can modify line 8 to change the number of workers it will spawn.

https://gist.github.com/3369946


I'm not a big user of Python, but it seems to me that any language that requires you to embed a different language for performance is inherently broken.

To regular pythonistas: Is this a common issue with Python, or only in the author's contrived scenario of summing a ridiculous amount of numbers?


I'd actually call it a feature, and not a bug or flaw, in Python.

You can prototype something in Python (or another language with native hooks) very, very quickly. If that thing works, you can roll in ctypes or cython and get your code running as quickly as C. Thus, you get quick development + numerous paths to native-level performance.

That's much better than taking longer to develop a library or model in C/C++, only to find it doesn't satisfy your needs.


By your definition all dynamic scripting-style languages (Lua, Ruby, Perl, etc.) are inherently broken. Very few languages out there can reasonably expect to get C-like speed for numerical work without dropping down to C. All this shows is how relatively easy it is to do when you need it.

One of Python's really big and popular uses is as a powerful language for tying together high-performance, low-level C/C++/Fortran libraries for serious numerical work (see libraries like numpy, numexpr, scipy and so on), so calling C libraries from Python is about as standard a scenario as you can get.


It's a contrived scenario. I've used Python a ton and never needed to break out to C, certainly not for summing numbers. Granted, I use Python mostly for web development, but I do care about performance.

In the case of number crunching, I've heard numpy is very good and fast with arrays, and the crunchy parts of numpy are written in C.

All that said, why is a language "inherently broken" if it requires you to use another language for tight-loop performance such as this? Game and graphics developers writing in C break out to assembly for things like this -- is C also "inherently broken"?


Wow, I don't think I've ever had such a visceral reaction to anything I've posted on HN before. By people who clearly didn't take the time out of their day to even read what I posted!

Basically, I don't use python very much and wanted to know if python had performance issues that required you to drop down to C code often, or if that was just for a scenario such as this.

I think that any language that REQUIRES you to drop down to a lower level for performance (not one that SUPPORTS it, but one where it's REQUIRED) is inherently broken. I would think that would be a pretty noncontroversial statement - if every python script you wrote had to have embedded C code because otherwise it was too slow, that would be pretty broken. That's why I wanted to know how common this was.


I think the reaction was because you put Python in a broadly negative camp by implying ctypes extensions are required for any sort of performant code. It's actually relatively uncommon to use ctypes, and most developers who use Python know this.

Python is a great general-purpose language and is used as such. No one would be silly enough to rely on it solely in highly performance-sensitive environments (e.g. 3D renderers or high-frequency trading). Just like no one would be silly enough to write an entire CRUD web app in C.

The point is, Python isn't broken and claiming that it may be because you read a short article on a Python feature got you the negative reaction.


I never implied that ctypes was required - in fact, I explicitly did the opposite. I asked if it was required for performant code. See the part of my post where I said "To regular pythonistas: Is this a common issue with Python, or only in the author's contrived scenario of summing a ridiculous amount of numbers?"

I also never claimed that python was broken.


> I'm not a big user of Python, but it seems to me that any language that requires you to embed a different language for performance is inherently broken.

It seems like you imply both of those things, though perhaps not as explicitly as "Python is broken because ctypes are required", which you did not say.


Words like "slow" and "performance" are very relative. Yes, Python (well, at least CPython) is a very slow language (on par with the likes of Ruby, Lua, Perl, PHP, etc.), and yes, if you really need performance you really need C (again, not that different from the above-mentioned languages), and yes, most non-trivial Python code relies on C/C++ at some point.

However, very few Python coders find themselves actually writing C code when working in Python. Most rely on existing libraries that put a nice Python wrapper around highly tuned C code, and in many cases Python programmers are only vaguely aware of which libraries are pure Python and which are wrappers.

> I think that any language that REQUIRES you to drop down to a lower level for performance.

Again without defining "performance" this statement means nothing. All languages (even C) have cases where you are required to drop down to a lower level to get the fastest possible performance.

> That's why I wanted to know how common this was.

Depends very much on what you're doing. If you're doing web development, it's very uncommon. If you're doing fairly performance critical numerical work over very large data sets, then it's much more common (but even in that case the kind folks over at numpy and friends have done 98% of the hard work for you).


Whether calling c is "required" depends on a few things. The relative importance of run-time vs programming-time, the nature of the problem, etc.

I've been doing scientific programming in Python for 4 years, and I've never felt I needed to drop to C (though I was aware of the option.)

The hard computing for most problems is done in calls to a few libraries that wrap highly optimized code from lower-level languages. So matrix inversion, arithmetic on arrays, etc. are just as fast in Python as in C or Fortran.

If your work can't be done using library calls, pure Python is going to be much slower.


As dagw mentions - treat it as a feature. I taught High Performance Python at PyCon, and I've published all the source and a 55-page PDF covering a range of techniques to optimise a Mandelbrot example (including cython, shedskin, numpy, multiprocessing, pycuda): http://ianozsvald.com/2012/03/18/high-performance-python-1-f... With some languages this is painful; with Python you can easily profile (cProfile, line_profiler) and then work on easy or complex optimisations (both within Python and using external tools) - great flexibility for when you really need it.
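The profile-first workflow described here can be sketched with the stdlib profiler (the toy hotspot below is hypothetical, not from the PyCon material):

```python
import cProfile

def slow_sum(n):
    # Deliberately naive Python loop acting as the hotspot to find.
    total = 0
    for i in range(n):
        total += i
    return total

# Profile the call and print where the time went, before deciding
# whether cython/numpy/a C extension is worth the effort.
pr = cProfile.Profile()
result = pr.runcall(slow_sum, 200_000)
pr.print_stats("cumulative")
print(result)
```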


Obviously broken, because all possible uses of a programming language require the absolute utmost speed. Every other consideration of language design, usability, and ridiculous fluff such as "productivity" must kowtow before the god of performance.


So are C/C++ also inherently broken because game engines break out to raw ASM at their most critical points? Or Java, because it exposes native code via JNI?

This is actually totally standard behavior in pretty much every language.


Hi there, I'm the author of the post (my username matches the domain of the post =D )

The purpose is just to show an example of how to drop to C.

I chose a contrived example so the reader could expend the minimum possible number of brain cycles (so to speak) trying to understand that part of the code.


I think it's exactly the opposite--any language that doesn't allow you to break out to a different language for performance is broken.



