Why Python Runs Slow, Part 1: Data Structures (lukauskas.co.uk)
244 points by Sauliusl 1316 days ago | 162 comments

If you're using Python for performance-critical applications, simple use of the built-in ctypes module (which uses libffi behind the scenes) can make a world of difference. There's a modest call overhead, but with proper use of ctypes you're basically getting near-C performance.

Coincidentally, one of the usual examples given in the ctypes howto is a Point structure, just like in this post. It's simple:

  from ctypes import *

  class Point(Structure):
      _fields_ = [("x", c_int), ("y", c_int)]
Then you can use the Point class the same way you'd use a regular Python class:

  p = Point(3, 4)
  p.x == 3
  p.y == 4
Really, taking a half-day to learn how to use ctypes effectively can make a world of difference to your performance-critical Python code, when you need to stop and think about data structures. Actually, if you already know C, it takes less than half a day: just an hour or so to read the basic documentation:


If you plan to write an entire application using ctypes, it'd be worth looking at Cython, which is another incredible project. But for just one or two data structures in a program, ctypes is perfect.

And best of all, ctypes is included in any regular Python distribution -- no need to install Cython or any additional software. Just run your Python script like you normally do.

EDIT: I was just demonstrating how to get an easy-speed up using ctypes, which is included in the Python standard library. To be clear, you would usually use this type of data structure in ctypes along with a function from a C shared library.
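For instance, a minimal sketch of that pattern: declaring a C function's prototype and calling it through ctypes. (The library name resolution varies by platform; this assumes a standard libm can be found.)

```python
import ctypes
import ctypes.util

# Load the system math library; find_library resolves the
# platform-specific name (e.g. "libm.so.6" on Linux).
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C prototype: double hypot(double, double);
libm.hypot.argtypes = [ctypes.c_double, ctypes.c_double]
libm.hypot.restype = ctypes.c_double

print(libm.hypot(3.0, 4.0))  # 5.0
```
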

Furthermore, if you're serious about optimizations, and you can permit additional dependencies, you should absolutely look at cython and/or numpy, both of which are much faster than ctypes, although they do bring along additional complexity. Other commenters are also pointing out cffi, which I've never used but also bears consideration.

Using ctypes here is slower in CPython 2.7.5 than a simple class. Summing 10000 points takes 1.81ms using a simple class vs 2.22ms using the code above. ctypes is designed for compatibility not speed.

Using __slots__ is 0.06ms faster, while namedtuple is slower.

A bigger speedup in CPython can be achieved by just using the builtin sum function twice to avoid the for loop.

Here's a gist: https://gist.github.com/absherwin/8976587
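To illustrate the approach, here's a sketch of what the gist benchmarks, with hypothetical Point data:

```python
class Point(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

points = [Point(i, i + 1) for i in range(10000)]

# Explicit Python-level loop:
loop_x = loop_y = 0
for p in points:
    loop_x += p.x
    loop_y += p.y

# Two calls to the builtin sum: the iteration runs inside
# the interpreter's C code, which is typically faster in CPython.
sum_x = sum(p.x for p in points)
sum_y = sum(p.y for p in points)

assert (sum_x, sum_y) == (loop_x, loop_y)
```
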

Please expand on your assertion. How are you storing your "two ints"? Two int tuple? Dict? List? I don't even have enough information to respond.

I do agree that ctypes is primarily meant for compatibility, but it can be used for speed optimizations in certain cases, particularly when there is a speed penalty for non-contiguous storage of data structures (like a list of dicts, or a list of tuples).

Edit: Thanks for the additional information. I totally agree that Python built-ins are often faster, as in this case. However, there are times when no built-in is available, and you have to use a custom function (usually from C). This gets back to the compatibility question, and I totally agree that ctypes is primarily meant for compatibility.

Apologies for the confusion. The comparison was against a new-style class as described in the original post. All of the types have two ints. The code is now linked so you can inspect and respond.

Edit to respond to edit: Built-ins and slots are even faster but the key point is that the ctypes-based class is actually slower than simply using a class with ordinary Python variables.

The reason ctypes is slower is that addition isn't defined on c_ints which requires casting them back to Python ints for each operation. This can be avoided by using ctypes to call a C function to perform addition.
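A quick sketch of that conversion cost: the field is stored as a C int, but every read produces a fresh Python int, and raw c_int objects don't support arithmetic at all.

```python
from ctypes import Structure, c_int

class Point(Structure):
    _fields_ = [("x", c_int), ("y", c_int)]

p = Point(3, 4)

# Field access converts the C value back into a plain Python int:
print(type(p.x))   # int
print(p.x + p.y)   # 7, ordinary Python arithmetic on the converted values

# Addition isn't defined on c_int objects themselves:
try:
    c_int(3) + c_int(4)
except TypeError:
    print("no + on raw c_int")
```
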

Response to edit: you may be right; I don't have time to do comprehensive benchmarking now. I never use ctypes on its own, but rather always with an external library. I assumed that ctypes Structures in a ctypes array would be faster because they're stored contiguously, but it's possible that's incorrect. I'll have to come back to this later when I have time.

I stand by my assertion that ctypes along with an external C library is a great way to do Python speed-ups. It's very simple to do, see here:


This is the kind of optimizations I usually use ctypes for, or for interfacing with a third-party shared library.

The ctypes structures are stored contiguously with a C-compatible memory layout, but every time an element is accessed, a new Python object is constructed to hold that return value. It's like it defines getattr to call int(<C value>). That's why they end up slower than regular classes - every access needs to create a new object.

ctypes + C code can be quite efficient, but you have to write the entire fast-path in C, not flip-flop between C and Python. It's best when you have a certain operation that needs to be fast (say, serving your most common path on a webserver, or running a complex matrix operation in Numpy), and then a bunch of additional features that are only called once in a while.

You might also want to look at cffi – it's not in stdlib but the interface is pretty nice for this kind of thing:


In the specific example above, however, I'd try to reduce that overhead by going even more C-style and allocating an array or two so you avoid the overhead of individual instances and instead have one Python object for many elements.

>In the specific example above, however, I'd try to reduce that overhead by going even more C-style and allocating an array or two so you avoid the overhead of individual instances and instead have one Python object for many elements.

How exactly would you avoid the overhead of individual instances using a ctypes structure? Sure, you can allocate an array very easily, using Point * 10 and then putting your instances in the array. In fact, you'd have to do this if you wanted a collection of these structs. I'm pretty sure that this array is stored contiguously, but you still have the overhead of the Structure base class. How are you proposing to avoid this in ctypes? Or were you referring to cffi?

If you really want to be serious about large, performance-intensive arrays in Python, it's time to pull out numpy. It actually would fit well here.

Re: cffi, I do love luajit but I'm not a huge fan of its C interface. I've not used cffi, but it seems to be inspired by, or at least shares a lot with, luajit's C FFI, which means you basically have to paste all the C header files into a string in your Lua or Python code to use the corresponding shared library.

The beautiful thing about ctypes is that you don't have to declare all the function prototypes and type declarations when you use a shared library. You simply make sure to feed the functions the properly typed variables, and store the returned value properly, and it just works.

As a result, I find ctypes a lot easier to use for simple optimizations. YMMV.

> The beautiful thing about ctypes is that you don't have to declare all the function prototypes and type declarations when you use a shared library. You simply make sure to feed the functions the properly typed variables, and store the returned value properly, and it just works.

It just works... until the argument that you passed as int but is actually size_t makes the program crash on 64-bit. Or you have to use an API which includes a giant structure which you then must transcribe manually into Python. Or you need to use a system library located at different paths on different platforms.

In my opinion, ctypes isn't the worst, but a sane C FFI should allow you to specify some #includes and -ls and automatically extract the proper declarations and linkage. cffi only partially satisfies this, since you have to add declarations manually and then have cffi "verify" them using a C compiler, but it's still better than ctypes.

> How exactly would you avoid the overhead of individual instances using a ctypes structure?

This optimization only works if you have an upper bound for the total number of items, but if so I'd at least try something like declaring point_x and point_y arrays, so you'd have two Python objects accessed by index rather than many Point references. This technique can be particularly useful when you don't access all of the elements at the same time, because you can read a subset of the elements in nice linear passes and get more efficient memory access patterns.
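A sketch of that layout with ctypes arrays (point_x / point_y are the hypothetical names from above; N is the assumed upper bound):

```python
from ctypes import c_int

N = 10000  # assumed upper bound on the number of points

# Two contiguous C arrays instead of N separate Point instances:
point_x = (c_int * N)(*range(N))
point_y = (c_int * N)(*range(N))

# Reading one coordinate for every point is a linear scan
# over contiguous memory:
total_x = sum(point_x)
```
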

> If you really want to be serious about large, performance-intensive arrays in Python, it's time to pull out numpy. It actually would fit well here.

Very strong agreement - numpy is a treasure.

> As a result, I find ctypes a lot easier to use for simple optimizations. YMMV.

Ditto - I've found cffi to be faster in some cases but it's definitely more overhead to use.

Ah ok, so you're proposing a nested array, not an array of structure pointers. I thought you were saying there was some way to use ctypes structures and yet avoid the overhead of the class instances.

And yes, I agree that if you're going to create this type of data structure, then it's worth the trouble to pull out numpy if you can.

This would be one of the approaches to use if you wanted to squeeze the last pieces of performance out of this class. I would recommend using Cython for this though.

The problem with both of these approaches is that if you use them too often, you suddenly realise you are not really coding in Python anymore.

Also, correct me if I'm wrong, overuse of tricks like these is one of the key reasons why we cannot have nice things like PyPy for all modules.

I actually agree with everything you write here. If I could re-write the post to include these points, I would.

I do think PyPy is very near to having full ctypes support, if they don't already.

This is like saying that if you call C programs from shell scripts, or run node programs that make use of C libraries, that you are just writing C.

I am a python programmer who has yet to dive into ctypes, and I am curious if you have ever played with cffi[1], and if so, thoughts? The project seems to be aimed at the same use cases that you would use ctypes for.

1: http://cffi.readthedocs.org/en/release-0.8/

cffi's API is similar to LuaJIT's, so if you're familiar and comfortable with that, then cffi might be a good option.

If you're looking at building your whole application around C extensions for Python, as is the case with some Cython apps, then cffi might be a good alternative. I've never used it, so I can't make any empirical observations about cffi (unlike Cython, which I've used -- it's awesome).

The great thing about all these newish ways to interface with C code in Python is that we no longer have to write Python extensions in C. I like C, but writing Python extensions in C was very painful and tedious for me. So all these options are a good thing.

Otherwise, I think ctypes is a more convenient option (it's always available, no extra installations required). Also, the ctypes API is much more Pythonic, in my opinion.

Thanks for the thoughts!

I really like cffi for interacting with external C libraries. See for example my post on talking to OpenSSL using cffi:


If you're using Python for performance-critical work, would you not prototype it in Python and then rewrite it in C, C++, or even Fortran if it's HPC?

Would you call me a heretic for writing it all in Haskell then rewriting hot spots in C[1] if performance isn't good enough?

1. http://book.realworldhaskell.org/read/interfacing-with-c-the...

No. I don't know how Haskell fans feel about it, but I don't see anything wrong with it if it works and lets you write more good code than you would writing everything in C. C isn't that scary.

C isn't that sca... Segmentation fault

I've never understood why people always seem to immediately bring up segfaults as scary C behavior. Is it really that different than Python dying with "AttributeError: 'NoneType' object has no attribute 'foo'", or a NullPointerException in Java? Sure, in the latter two you get a stack trace, but you can rig up C to give you that too if you want.

What's actually potentially scary in C is code that doesn't segfault, but just continues running along with silently-corrupted memory. (And that's something that doesn't have an analogous failure mode in most other languages, as far as I know.)

Yes it's actually really different because the error is non-specific (unhelpful) and can be caused by a completely unrelated operation, so debugging is harder.

  $ gdb myprog myprog.core
  (gdb) backtrace full
  ...[stack frames with program line numbers printed here]...

Edit: formatting

Indeed, but that backtrace can be completely unrelated to what caused the memory corruption, so it's definitely more difficult than with managed languages, which was the point I made.

And the AttributeError can be completely unrelated to what set a variable to something of the wrong type, so it's definitely not dissimilar to C's memory corruption.

The kind of errors that scare me are silently writing past the end of an array on the stack, which will silently introduce wrong values that will probably not crash your program.

The AttributeError is in a different category, because the interpreter protects you from doing unsafe things, and you get a clue as to what is wrong because you get the name of an attribute and the object which was supposed to have it. Segmentation faults can be much more difficult. For example the other day I got a segmentation fault deep in the CPython code; there was no bug there, and moving up the stack trace also did not reveal a bug either. It turned out that I had messed up a reference count _in a completely different (Cython compiled) module_, and I only found out after poring over several sections of code repeatedly.

The reason that this can happen in C is actually exactly what you bring up, silently corrupting memory. It's just that with the segmentation fault the silence is broken abruptly.

That is what Valgrind and LLVM's MemorySanitizer are for.

We can't know how much data was destroyed between the truncation of the message and the actual segfault message...

Segfaults are just the most user-visible consequence of lack of memory safety.

You can't have everything at once

"My experience in working on Graphite has reaffirmed a belief of mine that scalability has very little to do with low-level performance but instead is a product of overall design. I have run into many bottlenecks along the way but each time I look for improvements in design rather than speed-ups in performance. I have been asked many times why I wrote Graphite in Python rather than Java or C++, and my response is always that I have yet to come across a true need for the performance that another language could offer. In [Knu74], Donald Knuth famously said that premature optimization is the root of all evil. As long as we assume that our code will continue to evolve in non-trivial ways then all optimization is in some sense premature.

For example, when I first wrote whisper I was convinced that it would have to be rewritten in C for speed and that my Python implementation would only serve as a prototype. If I weren't under a time-crunch I very well may have skipped the Python implementation entirely. It turns out however that I/O is a bottleneck so much earlier than CPU that the lesser efficiency of Python hardly matters at all in practice."

-- Chris Davis, ex Google, Graphite creator

This is the key. Very often, the app is IO bound, and moving to C will make little difference.

This is one reason why I'm very excited about Python's new asyncio module (Python 3 only). It's asyncore done right, and is a great way to write network applications. I look forward to seeing what is built on top.

Would this work with existing libraries? For example, if you want to use whisper without blocking the thread.

> Very often, the app is IO bound

I hear this all the time but it really doesn't square up with my personal experiences. I've had to re-implement dynamic language stuff in C++ many times to get acceptable performance. No real design changes, just straight ports to C++. The entire value proposition of Golang is that the scripting language dynamic type systems are too slow, and it seems like loads of folks agree.

That is why they never used Dylan, Lisp, SELF or any other dynamic language with native compilers.

Those dynamic languages are only performant when you add a bunch of type information. You wind up writing a simulacrum of C and it still under-performs.

The idea is that native compilation gives useful speed for much of the application. Higher speed might be needed only in parts of the program. Then a type inferencer, etc., comes into play. If that's still not enough, one calls C or assembler routines - but, say, 95% of the application can be written in Lisp.

> I've had to re-implement dynamic language stuff in C++ many times to get acceptable performance.

Performance in a certain sense isn't just speed of a single uncoupled module, but also the end-to-end seamlessness of a system. The higher level languages might 'waste' 300ms in every function call, but save perhaps years in development time.

Also, architecting and developing medium-sized projects is not just the 'ENTERPRISE THING' anymore. A vastly greater number of people now have a shot at larger projects. So this creative freedom and technical unchaining also add to performance in a very indirect way, by allowing better minds to enter the game.

This is all true, but the endlessly repeated line about "IO Bound" needs to die. It's just not true. Most stuff written in dynamic languages is a lot slower than it would be in a statically typed language.

What I find interesting here is that Python (and most other dynamically-typed languages) treat instances of classes as arbitrary bags of properties, with the negative performance implications of that, even though that feature is rarely used.

I'd love to have the time to examine codebases and get real data, but my strong hunch is that in Python and Ruby, most of the time every instance of a class has the exact same set of fields and methods. These languages pay a large performance penalty for all field accesses, to enable a rare use case.

JavaScript VMs (and maybe Ruby >=1.9?) don't pay the perf cost because they do very advanced optimizations ("hidden classes" in V8 et al.) to handle this. But then they pay the cost of the complexity of implementing that optimization.

I'm working on a little dynamically-typed scripting language[1] and one design decision I made was to make the set of fields in an object statically-determinable. Like Ruby, it uses a distinct syntax for fields, so we can tell just by parsing the full set a class uses.

My implementation is less than 5k lines of not-very-advanced C code, and it runs the DeltaBlue[2] benchmark about three times faster than CPython.

People think you need static types for efficiency, but I've found you can get quite far just having static shape. You can still be fully dynamically-dispatched and dynamically type your variables, but locking down the fields and methods of an object simplifies a lot of things. You lose some metaprogramming flexibility, but I'm interested in seeing how much of a trade-off that really is in practice.

[1] https://github.com/munificent/wren [2] https://github.com/munificent/wren/blob/master/benchmark/del...

Python actually has static shapes for objects with __slots__:

  class vertex(object):
      __slots__ = ["x", "y", "z"]

I use __slots__ in most code after the first refactor for a very important reason - I'm a hyper-caffeinated code monkey.

I wasted an entire day because I set self.peices in one place and self.pieces in another & couldn't figure out the bug, until I used the "grep method".
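For anyone who hasn't hit this: with __slots__, the misspelled assignment fails loudly instead of silently creating a second attribute. A contrived reconstruction of that bug (Board is a made-up name for illustration):

```python
class Board(object):
    __slots__ = ["pieces"]

    def __init__(self):
        self.pieces = []

b = Board()
b.pieces.append("rook")   # fine: "pieces" is a declared slot

try:
    b.peices = []         # the typo: no such slot
except AttributeError as e:
    print("caught:", e)   # fails immediately, no grep needed
```
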

Right, but the problem here is you have to opt in to that. The most terse, simple way to express something is also the slowest.

Most classes have static shapes, so for the few types that are more dynamic, I think it makes sense for users to have to opt out.

pylint would have told you that as well.

try pycharm, it highlights occurrences of self.pieces with different colors where you are reading from and writing to it. It even makes it visible on the scrollbar.

Common Lisp implementations implement CLOS and structure objects as vectors. In high-performance implementations CLOS dynamism can be reduced so that slot-accessors can be compiled to inlined vector references.

I think this is a very crucial point. There are dynamic languages which can be implemented efficiently (e.g., Lisp), and with others it is difficult because of fundamental features of the language. Another example is monkey patching, e.g., replacing a method of an object at runtime. Restricting such language features and thereby speeding up the interpreter would have been an ideal selling point for Python 3, in my opinion, but of course it's too late now.

> Restricting such language features and thereby speeding up the interpreter would have been an ideal selling point for Python 3, in my opinion, but of course it's too late now.

This could trivially be implemented in a point-release as an opt-in feature – the scientific community has paved the way for this with Cython and Python 3's function annotations[1] would be an excellent way to do something like that where a function could e.g. promise not to change things dynamically, guarantee that you only accept certain data types and thus a particular list operation can be run without checking for dynamic dispatch / duck typings, etc.

1. http://www.python.org/dev/peps/pep-3107/
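For reference, the PEP 3107 syntax looks like this (the annotations are stored on the function but not enforced by CPython, so any optimizer would have to opt in to interpreting them; scale is a hypothetical example):

```python
# Annotate parameters and the return value with arbitrary metadata.
def scale(vector: list, factor: float) -> list:
    return [x * factor for x in vector]

print(scale([1.0, 2.0], 3.0))           # [3.0, 6.0]
print(scale.__annotations__["factor"])  # <class 'float'>
```
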

You can add some bolt-on features to speed some things up, but it is much nicer when things are fast by default, by design. Instead you could have the opposite kind of annotations, e.g., decorators for classes @monkeypatchable or @attributes_can_be_added_outside_of_init; after all, those features are rarely needed, and it's good to optimize for the common case.

The feature is maybe rarely used directly, but it is crucial to making dynamic typing actually useful. Things like SQLAlchemy are possible thanks to the dynamic aspects of Python. Writing a simple plugin system in Python is trivial. Mocking is more or less trivial in these languages as well.

It is of course possible to write a faster Python, as PyPy proves, but some of the design choices in Python do make it hard to optimize. This is not inherently a problem with dynamic languages, as the development of Julia proves. In Julia it is easy to write programs that are in the same ballpark as C efficiency-wise, while still feeling very dynamic and having a REPL. Python's problem is that both the interpreter and the language are quite old, and existed before modern JIT technology was developed, so that was not considered when designing the language and the interpreter.

That said, Python's emphasis on fast development as opposed to fast running code is very often the right tradeoff, and is why it is so popular.

Yes, the sacred place C still enjoys is often related to a lack of knowledge of how to use the tools properly.

With proper implementations and correct use of data structures, there are seldom reasons to use C on user space applications.

What language features would you have or not have if you designed a language with knowledge of "modern JIT technology"?

If I would guess, I would say that you want to design the language in such a way that things such as types are typically easy to guess for the JIT compiler.

There's no silver bullet, but NumPy comes pretty close sometimes. Here's an example where I achieved a 300x speedup with regular old CPython: http://stackoverflow.com/questions/17529342/need-help-vector...

The trick is often to somehow get Python to hand the "real work" off to something implemented in C, Fortran, C++, etc. That's pretty easy for a lot of numerical and scientific applications thanks to NumPy, SciPy, Pandas, PIL, and more. Lots of libraries are exposed to Python, and you can expose more either by compiling bindings (see Boost.Python) or using ctypes. If you do this well, nobody will notice the speed difference, yet your code will still look like Python.
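A small sketch of that hand-off with NumPy, using the running point example as a single contiguous array:

```python
import numpy as np

# 10000 (x, y) points in one contiguous array: a single Python object,
# with the per-coordinate summation loop executed in compiled code.
points = np.arange(20000).reshape(10000, 2)
sums = points.sum(axis=0)  # per-coordinate totals, computed in C

print(sums)
```
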

And sometimes your performance problem is not Python's fault at all, such as this lovely performance bug: https://github.com/paramiko/paramiko/issues/175 - a 5-10x speedup can be had simply by tweaking one parameter that is no fault of CPython or the GIL or anything else (in fact, Go had basically the same bug).

What I'm getting at here is that performance is not just one thing, and the GIL is a real and worthy spectre but hardly matters for most applications where Python is relevant. Your Python is slow for the same reason that your Bash and your C and your Javascript and your VBA are slow: you haven't profiled enough, you haven't thought long and hard with empirical results in front of you about where your program is spending its time, your program is improperly factored and doing work in the wrong order, or perhaps you should just throw some hardware at it!

The part that really surprised me was this:

  We only had 23 years of Python interpreter development,
  how would things look like when Python is 42, like C?
C, which I always think of as an ancient venerable systems language, is less than twice as old as Python, which I think of as a hot new kid on the block.

In thirty years, when my career will probably be drawing to a close, Python will be 53 years old and C will be 72 years old. Barely any difference at all.

>We only had 23 years of Python interpreter development, how would things look like when Python is 42, like C?

Well, mostly like it is today. Maybe a little better, maybe a little slower. Python hasn't progressed much. Compare it to the last 10 years and it's mostly a flatline, if not a slight decline with some 3.x versions.

Whereas C was from the onset one of the fastest or THE fastest language on any machine.

Getting fast is not usually due to some incremental effort. You either go for it or not. If you start from a slow language, incremental changes without a big redesign of the compiler/interpreter never get you that far.

Javascript got fast in 3 years -- not by incrementally changing the Javascript interpreters they had before, but by each of the major players (Apple, Google, Mozilla) writing a new JIT interpreter from scratch. In 2007 it was dog slow, and then in 2008 all three major JS interpreters got fast.

Sure, they have been enhanced since, but the core part was changing to JIT, adding heuristics all around, etc. Not incrementally improving some function here and some operation there.

>Whereas C was from the onset one of the fastest or THE fastest language on any machine.

You can't write performance critical code in C. C is great for high level code, but if performance is important there is no alternative to assembly.

>You can't write performance critical code in C. C is great for high level code, but if performance is important there is no alternative to assembly.

That statement makes no sense.

Of course you can write performance critical code in C. People write all kinds of performance critical code in C. Heck, kernels are written in C. What's more "performance critical" than a kernel? Ray tracers are written in C. Real time systems are written in C. Servers are written in C. Databases are written in C. Socket libraries. Multimedia mangling apps like Premiere are written in C (or C++). Scientific number crunching libraries for huge datasets are written in C (and/or Fortran).

The kind of "performance critical code" you have to write in assembly is a negligible part of overall performance critical code. And the speed benefit is not even that impressive.

That's true now (and perhaps has been true for easily the past fifteen years). But up until the early-to-mid-90s, it was relatively easy for a decent assembly language programmer to beat a C compiler. In fact, it wasn't until the early-to-mid-90s that general-purpose CPUs broke 100MHz.

Just be very thankful you don't have to deal with C compilers under MS-DOS what with the tiny, small, medium, compact, large and huge memory models (I think that's all of them).

Even now performance-critical code is assembly. Just take a look at your libc's memcpy implementation, for example. Most likely there's a default C implementation that is reasonably fast and a bunch of assembly versions for individual architectures.

>Even now performance-critical code is assembly.

The problem I have with this statement is that it implies other code, like the kernel, a ray-tracer, video editor, number crunching etc, stuff usually done in C/C++, are not "performance critical".

Let's call what you describe "extremely performance critical" if you wish.

With that said, MOST performance critical code is written in C/C++.

Code that's not that much performance critical is written in whatever.

True. The Linux kernel and most image processing libraries are written in assembly, after all.

Actually, assembly will generally not give you much of a speedup over C. The exception is when you use assembly-only features like vectorization in ways that compilers can't figure out.

Well, actually there is no alternative to machine code:


Speedup is not the only kind of progress. Programming in Python has gotten more convenient due to more libraries, while avoiding buffer overflows in C will always require enormous care.

C still requires the programmer to know the language inside and out, but all kinds of static and dynamic checking tools have become available.

Not really, PyPy happened. We do have a nice JIT for Python, we just need to use it more.

You assert that Python hasn't progressed much, but you thumb down the project to progress it (Python 3).

Should Python try to improve, or just remain in the same state forever? You can't have both.

Your assertion that optimization must always happen all at once is completely bogus.

Take a deep breath: I believe the parent post's comment about Python not progressing was referring to performance, not a general dig.

>You assert that Python hasn't progressed much, but you thumb down the project to progress it (Python 3).

I was talking performance wise. There are benchmarks from the Python devs out there that confirm what I wrote.

I did not speak about Python as a language in general. That said, if you want my opinion, Python 3 is not much progress in that regard either.

>Your assertion that optimization must always happen all at once is completely bogus

That's good, because I never asserted that. I didn't say that "optimization must always happen all at once", just that incremental optimization doesn't yield much benefit compared to a thorough interpreter rewrite with optimization in mind.

Python's history confirms this totally. And not only is Javascript a counter-example (all at once optimization getting huge results), but PyPy also is.

Who would use a dictionary for a point? If it's a 2D grid, you are only going to have two axes and by convention x-axis is always first, and y-axis is always second. Perfect use case for a tuple.

   point = (0, 0)

It's just an example. There are many cases where you need more than what you can safely stuff into a tuple, which means the choice becomes 'naked dictionaries vs. classes', it's a tradeoff I've had to make many times writing python code.

Of course there are many cases where using tuples is the best way to represent some data item, (x, y) point data is an obvious example. But when your data structures become more complex, need to be mutable, need to have operations defined on them, have relations to other objects, etc, tuples are a bad choice. You don't want to end up with code that uses tuples of varying lengths, composed of elements of varying types, consumed and produced by functions that are hard-coded to rely on the exact length, ordering and element type of their tuple inputs. Named tuples are one step in the direction of dictionaries, but they are still nowhere as flexible as mutable dictionaries or proper classes.

If you can't use a tuple, it's not like you could have really used a C-struct either. So the point is valid - the code should be using tuples and not dictionaries or classes.

Again, I assume the article only served as an example of how in Python you usually end up using dictionaries or classes, which have fundamentally different performance characteristics compared to C-structs. Yes, you could use tuples instead, in some cases, and get performance characteristics comparable to C structs. Often this is not an option though, and you would still have to use classes or dictionaries. There are many scenarios you could conveniently implement in C using structs, that would end up as an ungodly mess in Python if you would implement them using nothing but tuples.

Just as a thought exercise, try to come up with a way to represent and manipulate a hierarchical, mutable C-structure with 20 fields, using nothing but tuples in Python.
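A tiny two-level illustration of the pain (a hypothetical structure, nowhere near 20 fields): since tuples are immutable, "changing" one field means rebuilding every tuple on the path to it.

```python
# A point with metadata, as nested tuples: (label, (x, y)).
p = ("origin", (0, 0))

# "Set" y to 5: rebuild the inner tuple AND the outer tuple.
p = (p[0], (p[1][0], 5))
assert p == ("origin", (0, 5))
```

Now imagine that rebuilding cascade across 20 fields and several levels of nesting.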

Like... named tuples? That's the nice thing about python, you're not restricted to using one data type. If you have a data structure that doesn't change, and speed matters, you can use something better suited to the task at hand.

Of course, you're not restricted to using one datatype in C either, but the constant comparison this article makes between structs and dictionaries is misleading. Comparing hash map implementations with dictionaries would be much more apt.

named tuples are classes, but they use __slots__ so they are sealed against extension and thus use less memory. Take a look at the `namedtuple` source; it is a fun little bit of metaprogramming.



I am wrong. namedtuple does NOT use __slots__. Weird, wonder why?

Because it directly subclasses `tuple`

point = namedtuple("Point","x y z",verbose=True)


Or named tuples, if you want fancy . notation. Although I haven't actually looked at how the interpreter handles them, I assume they are faster than dicts since the fields are pre-defined.

> So the point is valid - the code should be using tuples and not dictionaries or classes.

Yes if this was "production code" but for the sake of comparison in this example it should be a Python class vs. C struct.

And as the text describes, a dict behaves similarly to a class in Python when it comes to runtime perf.

It's not "just an example", it's the core of their argument. They say "Pythonic code uses a hash table-like data structure" - but that's completely wrong.

His other argument, about summing points, makes no sense either. He doesn't seem to have even profiled his "slow" point summing code or tried basic Python optimisations like using a list comprehension or generator to sum the points. And if that doesn't work well enough, and point summing is a critical part of your program, then you'd use a library like NumPy to handle that, rather than using raw python.

Perfect use case for a tuple.

Or a namedtuple, if you wanted to have access to the x and y fields as named attributes.

The article actually does mention this, but only in a footnote, and the footnote gets the facts wrong:

"One would be right to argue that collections.namedtuple() is even closer to C struct than this. Let’s not overcomplicate things, however. The point I’m trying to make is valid regardless."

No, it isn't, because a namedtuple does not work the same under the hood as the straightforward class definition he gives. If p = Point(0, 0) is a namedtuple instance, and you do p.x or p.y, what actually happens is that the "attribute" is looked up by index, which is equivalent to a C struct field access.

Is that faster now? Back in Python 2.4 or so (last time I looked under the hood of CPython many moons ago) all member access required a hashtable operation. For __slots__-enabled objects (the namedtuple of their day) this meant hitting the class hashtable to get the __slots__ index before using that index to fetch the actual member from the object. So __slots__ was mostly about space efficiency, not performance.

Has that changed?

Attribute access on namedtuples isn't any quicker, as you've surmised; it requires a hash lookup and an index lookup. The index lookup is as quick as a normal tuple index lookup. The main reasons I reach for namedtuples are memory efficiency and much quicker creation times vs. objects.

Memory efficiency? The last time I checked namedtuple dynamically generates code for a class with the specified fields, and evals the code. So it would be exactly as efficient as a normal class. For memory efficiency you want to have multiple points in a single array, to avoid the overhead of objects, but you'd need Numpy or Cython for that.

Not quite. The namedtuple will not have a per-instance attribute dictionary, so there is a significant memory saving when creating loads of them.

Right. But you can fit 2 floats in 8 bytes so if it's memory saving and efficiency you're after, you would use an array to avoid object creation overhead. Namedtuples are nice for their self-documenting properties though, but it's misleading to advertise them as an optimization.
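For example, the standard-library array module gives you that packed layout without any extra dependency (a sketch; NumPy is the more capable option):

```python
from array import array

# Flat array of C doubles: x0, y0, x1, y1, ...
# One allocation, no per-point object overhead.
points = array('d', [0.0, 0.0, 3.0, 4.0])

x1, y1 = points[2], points[3]  # second point
```

The tradeoff is exactly the self-documenting quality mentioned above: `points[2]` says much less than `p.x`.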

Attribute access on namedtuples isn't any quicker, as you've surmised; it requires a hash lookup and an index lookup.

Yes, you're right, I was mistaken about how namedtuples work under the hood.

You're right, a point was a bit contrived example and some more practical use case should have been the example. But the bottom line still stands.

If you have a class in Python, the member variable lookup will essentially be a hash table lookup (requiring at least a little pointer traversal). By contrast, in C, looking up a member variable is a compile-time constant offset from a known pointer value. It's O(1) vs. O(1), but with a huge difference in the constant factor.
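You can poke at that hash table directly: in CPython, a plain class (no __slots__) stores its attributes in a per-instance dict (a small sketch):

```python
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(3, 4)

# In CPython, p.x is (roughly) a hash lookup in this per-instance dict:
assert p.__dict__['x'] == p.x == 3
assert p.__dict__ == {'x': 3, 'y': 4}
```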

Yep, using a tuple would be a little better if you were actually writing Python code but it would be comparing apples to oranges when it comes to this example case.

edit note: a dict and class are equal only in CPython, see this video linked in the OP http://vimeo.com/61044810 - it discusses various optimizations regarding classes vs. dicts in PyPy etc

The pythonic way of creating a Point like data structure is not

  point = {'x': 0, 'y': 0}
but instead, this

  Point = namedtuple('Point', ['x', 'y'], verbose=True)
The resulting class, I believe to be more performant than using a dictionary - but I haven't actually measured this.

I'm probably doing it wrong but I measured exactly this yesterday and the dict version was faster:

    $ python -m timeit --setup "from collections import namedtuple; Point = namedtuple('Point', ['x', 'y']); p = Point(x=0, y=0)" "p.x + p.y"
    1000000 loops, best of 3: 0.284 usec per loop

    $ python -m timeit --setup "p = {'x': 0, 'y': 0}" "p['x'] + p['y']"
    10000000 loops, best of 3: 0.0737 usec per loop
Maybe the usage isn't right, because I agree with your belief that namedtuple is supposed to be more performant.

You should be creating a dict vs. a namedtuple instance once, then accessing it many, many times. What you're doing is creating a lot of dict vs. namedtuple instances, then accessing each one only once. That's mostly testing the instance creation time, not the field access time.

That code should only create the dict/namedtuple instance once, then access it many, many times.

The creation occurs in the --setup portion. The field accesses occur in the actual looping portion of the timeit code.

Ah, ok, I see, for some reason my browser wasn't showing the namedtuple code snippet right. Now I need to go look at the namedtuple code to see why accessing the fields by name is slower than accessing them by index.

Yep. Because the dictionary literal gets created via opcodes in the interpreter loop whereas the named tuple pays the price of calling a function and setting up an interpreter frame.
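The opcode claim is easy to check with the dis module (a quick sketch; the exact opcode name varies across CPython versions, e.g. BUILD_MAP vs. BUILD_CONST_KEY_MAP):

```python
import dis

# Disassemble just the dict-literal expression: the dict is built by a
# dedicated opcode, with no function call or frame setup involved.
ops = [ins.opname for ins in dis.Bytecode(compile("{'x': 0, 'y': 0}", '<s>', 'eval'))]
print(ops)
```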

The --setup code (creating the namedtuple and dict) is only executed once.

The timed portion is simply the field accesses.

Here's a neat (and totally unpythonic) way to bring that down a bit:

    python -m timeit --setup "from collections import namedtuple; Point = namedtuple('Point', ['x', 'y']); p = Point(x=0, y=0); get_y = Point.y.fget; get_x = Point.x.fget" "get_x(p) + get_y(p)"

I think you're timing the import in there, too. Plus, creating an object of a class rather than using a literal will always be slower.

Does the --setup statement really get timed?

My reading of the docs[1] has always led me to believe that it doesn't. For example, "It is possible to provide a setup statement that is executed only once at the beginning"

[1]: http://docs.python.org/2/library/timeit.html

It doesn't get timed. You can quickly test this by adding an expensive operation to your setup and checking that it doesn't affect the resulting time.

Aaah, ok. I totally didn't read the --setup arg.

I did a stupid microbenchmark and it turned out that simple class with __slots__ was the fastest method.
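For reference, the shape of that microbenchmark (a sketch; absolute numbers will vary by machine and CPython version, so no timings are asserted here):

```python
from timeit import timeit

class SlotPoint:
    __slots__ = ('x', 'y')        # fixed layout, no per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

class DictPoint:                  # ordinary class, for comparison
    def __init__(self, x, y):
        self.x, self.y = x, y

sp, dp = SlotPoint(0, 0), DictPoint(0, 0)
print('slots:', timeit(lambda: sp.x + sp.y))
print('dict :', timeit(lambda: dp.x + dp.y))
```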

While we are on the subject of stupid microbenchmarks: the following loop in C (line being a struct, length an int):

  for (int i = 1; i <= 400000000; ++i) {
    ++line.length;
  }
Compiles to the following ASM:

  inc eax
  dec edx
  jne @8
...as expected. The compiler realized that there is no need to actually copy the intermediate values to line.length and does everything in registers.

Now here is the same loop in Lua (everything dynamically typed, line.length must be hashed (in theory) just like in Python):

  for i = 1, 400000000 do
    line.length = line.length + 1
  end
LuaJIT 2.1 generates the following ASM:

  addsd xmm7, xmm0
  movsd [rax], xmm7
  add edi, +0x01
  cmp edi, 0x17d78400
  jle 0x7fee13effe0  ->LOOP
The C program executes in ~0.3s, the Lua one in ~0.5s .. and those 0.5s include LuaJIT's startup, profiling, and JIT compilation time. So for a longer running program the difference would be even smaller.

Tl;dr: modern JIT compilers are amazing and can optimize away the things mentioned in the article.

I think you forgot to turn on your optimizer. GCC will optimize that loop away almost entirely, because it can predict the loop result at compile time.

For example, with the entire program being your loop plus a line to print the result (gcc 4.3.4/Linux, -O3), the whole thing compiles to:

  mov    $0x17d78401,%esi
  mov    $0x400654,%edi
  xor    %eax,%eax
  jmpq   0x400430 <printf@plt>
In fact, without the printf all you get is "repz ret" (not counting .init overhead in the binary). That is because the compiler detects that line.length is not used and doesn't even bother to set it.

  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      struct lines {
          char *somedata;
          int length;
      } line;
      int i;

      line.length = 0;
      for (i = 0; i <= 400000000; ++i)
          ++line.length;

      printf("Line len %d\n", line.length);
      return 0;
  }

>I think you forgot to turn on your optimizer.

I did not. I used Pelles C to compile it (with the optimizer turned on). I am not surprised that GCC managed to eliminate the pointless loop entirely in this situation. In fact I was happy that both Pelles C and LuaJIT did not realize that the whole loop was pointless and thus I did not have to come up with a more complex example.

The primary point of this was to show that JIT compilers can optimize away hash table access and dynamic typing in hot loops, not a code generation competition between LuaJIT and GCC.

I am a pretty big LuaJIT fanboy and not even I would claim that Lua compiled with LuaJIT can compete against C compiled by current versions of GCC with all optimizations turned on, at least not in most real-world cases. However, it does get amazingly close, most of the time within an order of magnitude, which means that I personally do not need to use C anymore.

Curious if you included ctypes.Structure in your benchmark.

I just did. And either a) I'm doing something terribly wrong or b) ctypes.Structure is slower than Python class with __slots__.

I would bet on a) - I've never worked with ctypes before.

EDIT: I bet on a) and I was wrong: ctypes offers no speed advantage.

I'm curious where one would go to learn specifically about performance issues in Python - patterns to avoid, and those to stick to. Any recommendations?

Wouldn't it be more like this?

    points = [(0, 0), (3, 4), ...]
    sum_x = sum([x for x,_ in points])
    sum_y = sum([y for _,y in points])

It's mentioned in the article, albeit at the end.

> The resulting class, I believe to be more performant than using a dictionary - but I haven't actually measured this.

If you didn't measure it then how did you come to believe this?

Hehe, came here to say exactly that.

The article was kind of interesting, it's always good to talk about why things are slow, since that might help highlight what you can do about it (short of switching away from Python, that is).

I was a bit annoyed about the benchmarks part; there was no C-based benchmark for comparison, and it was not clear how many points were being processed, which made the exact times kind of pointless. Also, C and C++ are not the same, though the author seems to think they are.

Author here:

There is a difference between C/C++, but for the purposes of this article it does not matter -- the point I am trying to make is that you would not use a hash table in any of them, would you?

The exact times are kind of pointless indeed; all that matters are the ratios. A C benchmark would probably be pointless as well. What I was trying to show was that your runtime comparisons are not necessarily comparing the same things.

> C and C++ are not the same which the author seems to think.

The part where he lost me was when he said this referring to C:

> Here we tell the interpreter that each point would have exactly two fields, x and y.

Maybe I misread that, but I don't think I did, he was referring to the C struct when he said that. It's one of those glaring mistakes that is hard for me to look past when someone says something like this.

Not sure why I put an "interpreter" there when C is compiled, of course, I fixed this now.

Again, this does not change the fact that structs and dictionaries are very different data structures.

He probably didn't care enough to find/create a hash map for C.

The sentence "Python runs slow" is a little misleading. While it is true that Python does more stuff than other languages (C, for example) for similar operations, it is also true that in the majority of cases the relevant time-consuming parts of execution are in other areas, like waiting for I/O, etc.

In most cases, a program written in Python (or Ruby, or JavaScript) is not slower than a program written in C or even assembly. Sure, in some cases it may be necessary to take care of rough spots, but I think that just saying "Python is slow" is not a very fortunate way to put it...

Another bottleneck in addition to file or network I/O can be memory consumption. It's been my experience that trading off extra CPU work for lower memory consumption can make a world of difference when your application is paging (which is I guess just I/O). So in looser terms, the algorithms I write may be purposefully less optimized in order to reduce memory consumption. Like more iterations, more scans, more stuff instead of sticking references in memory.

> the relevant time consuming parts of execution are in other areas, like waiting for I/O, etc.

It's safe to assume that people complaining about a language's performance deal with CPU bound cases.

> In most of the cases, a program done in Python (or Ruby, or JavaScript) is not slower than a program done in C or even assembly.

This is false: http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?t...

> It's safe to assume that people complaining about a language's performance deal with CPU bound cases.

That's a very unlikely assumption – many people will complain about the language before even profiling their code. I've seen people do things like complain about Python's performance before, say, realizing that they were making thousands of database queries or allocating a complex object inside an inner loop. Because they've heard “The GIL makes Python slow” it's incredibly common for even fairly experienced programmers to simply assume that they can't make their program faster without rewriting it in C and they never actually confirm that assumption.

>> In most of the cases, a program done in Python (or Ruby, or JavaScript) is not slower than a program done in C or even assembly.

> This is false: http://benchmarksgame.alioth.debian.org/u64q/benchmark.php?t....

That's a micro-benchmark collection. It's interesting for low-level performance but it doesn't tell you much about the kind of programs most programmers write. For example, I serve a lot of image derivatives over HTTP; it's certain that there are many operations which could be significantly faster if they were written in C (i.e. reading HTTP headers in place rather than decoding them into string instances), but in practice none of that matters because almost all of the total runtime is already spent in a C library.

> That's a micro-benchmark collection. <

No, it's a meagre dozen "toy programs". (See Hennessy and Patterson "Computer Architecture").

> ... many operations which could be significantly faster if they were written in C ... none of that matters because almost all of the total runtime is already spent in a C library. <

See Sawzall quote -- http://benchmarksgame.alioth.debian.org/dont-jump-to-conclus...

> See Sawzall quote -- http://benchmarksgame.alioth.debian.org/dont-jump-to-conclus....

Agreed – my point was simply that over a couple of decades I've only seen a handful of cases where a performance issue was due to the language rather than the algorithm or simple data volume. Such problems exist, but not for the majority of working programmers.

It's safe to assume that people complaining about a language's performance deal with CPU bound cases.

Not in my experience. I've seen countless cases where people just assume they have a CPU bound bottleneck, but profiling reveals that it is actually IO bound.

And that's ignoring all the cases where people rewrite their code in a new language to get a 10x performance boost, ignoring the fact that a better choice of algorithm and data structure would have given them a 100x boost.

This is false:

benchmarksgame doesn't come even close to covering what I'd consider "most cases" of programs written.

My concern is that we repeat "Python is slow" so often that we forget to check whether it's a valid assumption. I've already dealt with people using a low-level language from the start of a project because "it's faster" (when the project was obviously not CPU-bound). The same goes for explaining a huge latency in connecting to a web service with "you're using Python on the client".

Is Python slower than C? In some cases, YES. Is Python a slow language? NO, not always. Do I think that titles like "Why Python is slow" are misleading? YES, because they are too generic and give the wrong kind of idea.

Speed, like scalability and a lot of other concepts, is WAY more a problem of architecture and design than of language. This doesn't mean that languages are irrelevant, but their importance in achieving (or failing to achieve) those goals is largely overstated.

PS: You can replace Python with Ruby or JavaScript.

The language shootout is not "most cases", it's a set of artificial microbenchmarks. Python is very slow on CPU-bound problems, but IME most real-world software is not CPU-bound.

The benchmarks game is not "most cases"; and the benchmarks game says "Measurement is highly specific" and "Measurement is not prophesy".

Rather than using a better data structure from the start (class vis-a-vis dictionaries), why not move to a better Python implementation (PyPy vis-a-vis CPython)? When I need a data structure, I look for one that is a natural fit to the problem. Perhaps everything is implemented as a hash table in a certain language (not taking names here :)) but would I replace every data structure with a hash table for efficiency? That would be Premature Optimisation.

I would recommend to code in idiomatic Python as much as possible (in this case, use a tuple for Point). Primarily because it is easy for other fellow programmers and your future self to understand. Performance optimisations are best done by the compiler for a reason, it is usually a one-way path that affects readability.

I can't use pypy yet, I need scipy/numpy.

Wouldn't a regular tuple be the most pythonic (and possibly most efficient) way to represent a point?

I think that's what most image libraries do (at least Pillow).

It was just an example to illustrate the problem. Pythons don't have legs. So they can't run, not even slowly.

It would have been more instructive if the author showed the alternate solutions classified by measured speed.

That depends very much on what you are going to do with your points

For my casual use, Python has been fast enough that I can be pretty carefree about how I program. But I had one shocking experience when I ported a program with a fair amount of math over to a Raspberry Pi, and it practically ground to a halt. That was a disappointment. It won't drive me away from Python, but it means I've got some learning to do on how to better optimize Python code.

Python and performance can be summed up as "just think about what it has to do to add two ints together". The I/O requirements of that vastly exceed the CPU requirements.

Basically any tight loop, heavy use of hash tables (so that's Python, Ruby etc.) or floating point (since this is a sore spot for ARM) is going to create an explosion in time used. C/C++/C#/Java can, by being careful about the types being used, be much faster.

I don't understand what I/O has to do with that. The issue is all the extra branching and copying with every op code. Python is going to be 20X to 100X slower than C++ at pretty much anything. It's an inevitability of the type system.

Learn how to use Python's integrated ctypes effectively to create C-like data structures when performance is absolutely critical, and you can use Python for almost any imaginable scenario (systems and embedded programming excepted).

For math you can get, literally, a hundred or thousandfold speedup by going to a compiled language.

Speaking from experience porting algorithms to low-powered embedded ARM devices, perhaps you should consider porting your code directly to C instead? The space, memory and processor penalty considerations are much greater than one would think.

NumPy, eventually?

NumPy didn't help much on the Raspberry Pi, though I admit that I could probably learn more about how to use it effectively. The routine has a weird bunch of branching inside the inner loop, so I can't just let NumPy whale away at an array unless I come up with a bright idea.

I ended up putting my numerical routine in a C program. Nice and snappy.
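For what it's worth, one standard trick for branchy inner loops is to compute both branches over the whole array and select elementwise with np.where (a sketch with made-up branch logic; it doesn't fit every case, especially when the branches are expensive):

```python
import numpy as np

# Hypothetical branchy inner-loop rule:
#   out = v * 2  if v > 0  else  -v
v = np.array([1.0, -2.0, 3.0, -4.0])
out = np.where(v > 0, v * 2, -v)
print(out)  # [2. 2. 6. 4.]
```

Both branch expressions are evaluated for every element, so this trades extra arithmetic for eliminating the Python-level loop.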

This should be bringing up the classic array-of-structures versus structure-of-arrays debate instead of what was mentioned. If only NumPy were standard Python and so we could have an easy, standard answer that is usually fast enough.

what is stopping you from using NumPy?

I do use it - it's just that it can't quite be given as the 'standard' answer for cases like this. I just wish it could become the definitive rather than the de facto standard.

I'm surprised that Python's interpreter tech isn't advanced enough to try and JIT some of this away. Python could probably learn a lot from various JS interpreter advances.

The PyPy project is doing that http://pypy.org/ (is mentioned on the article)

There is also Numba, a JIT Python-to-LLVM compiler that supports optional type annotations. Unlike PyPy, it is designed to integrate seamlessly with CPython.

On a related note, the field of compilers seems to have ignored data optimization almost completely in favor of code optimization.

Applying a few basic tricks lets a human typically save 50-75% off an object's storage requirements and a compiler could combine this with profile feedback data about access patterns & making it more cache friendly.

This would be a good fit for JIT where the compiler could do deopt when the assumptions are violated, and this would let you make more aggressive optimizations.

This article is making a misleading comparison. Yes, dynamically typed languages have overheads that statically typed languages don't. But the better question to ask is why Python implementations are factors slower than leading javascript implementations, which have the same theoretical overheads that Python does.

V8 is now about a factor of 4 or 5 off the speed of C for general-purpose code, while CPython is 30x to 100x off.

Imagine if Google invested the same engineering effort in Python.

Runs 'slowly'. Led by the population of the North American continent, the adverb is dying gradual.

Yet another case of someone (the blog engine?) being too smart and replacing " with “ and ” in code.

If you don't want to mess with ctypes, this sounds like a straightforward use case for namedtuple.

    std::hash_set<std::string, int> point;
    point[“x”] = x
    point[“y”] = y
This is a very interesting implementation of a set...

If you ever watch Robert's talk http://www.youtube.com/watch?v=ULdDuwf48kM you will find that different versions of Python can have different run times for the same code.

Wouldn't a tuple be more appropriate than a class?

Then you have stuff like point[0] and point[1] littered throughout your code, or you write getter functions and then have function-call overhead, which is roughly as expensive as the class. An instance of namedtuple is actually the Correct (tm) solution.

>An instance of namedtuple is actually the Correct (tm) solution.

Do you have an implementation to support that? Because a comment from TFA shows that's actually the worst idea in cpython. http://lukauskas.co.uk/articles/2014/02/13/why-your-python-r...

Sorry, which comment are you referring to? All I see is a footnote saying "One would be right to argue that collections.namedtuple() is even closer to C struct than this. Let’s not overcomplicate things, however. The point I’m trying to make is valid regardless" which doesn't really say it's a bad idea, just that he didn't decide to talk about it in that article.

You can pull the values out with destructuring assignment (tuple unpacking) as well, which is a little nicer:

(x, y) = your_fun()

Not sure how this performs comparatively; I'm not much of a Python programmer, and I didn't realize namedtuples were a thing (TIL).

Something along the lines of <x, y = point> would be a little cleaner than directly referencing the indexes. This technique can also be used in a loop <for x, y in [(0, 1), (23, 41)]: print(x, y)>.
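Spelled out as runnable code, the two forms look like this:

```python
point = (3, 4)
x, y = point                      # unpack a single point
print(x, y)                       # 3 4

for x, y in [(0, 1), (23, 41)]:   # unpack while iterating
    print(x, y)
```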
