Python Practices for Efficient Code: Performance, Memory, and Usability (codementor.io)
109 points by happy-go-lucky on Aug 18, 2017 | 42 comments

> Use format instead of + for generating strings — In Python, str is immutable, so the left and right strings have to be copied into the new string for every pair of concatenations.

It isn't always faster to use string formatting.

    $ python -m timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' 'a + b + c + d'
    10000000 loops, best of 3: 0.181 usec per loop
    $ python -m timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' '"{}{}{}{}".format(a, b, c, d)'
    1000000 loops, best of 3: 0.447 usec per loop
    $ python --version
    Python 2.7.13

On PyPy, + also wins for the 4x10 case:

  $ pypy -mperf timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' 'a + b + c + d'
  Mean +- std dev: 1.06 ns +- 0.04 ns
  $ pypy -mperf timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' '"{}{}{}{}".format(a, b, c, d)'
  Mean +- std dev: 45.8 ns +- 0.9 ns
  $ pypy -mperf timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' '"".join([a, b, c, d])'
  Mean +- std dev: 62.0 ns +- 4.8 ns
  $ pypy -mperf timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' '"%s%s%s%s" % (a, b, c, d)'
  Mean +- std dev: 78.3 ns +- 1.9 ns

On Python 2.7.10:

    In [2]: %timeit a+b+c+d
    The slowest run took 6.66 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000000 loops, best of 3: 247 ns per loop

    In [4]: %timeit "{}{}{}{}".format(a, b, c, d)
    The slowest run took 6.37 times longer than the fastest. This could mean that an intermediate result is being cached.
    1000000 loops, best of 3: 709 ns per loop
On Python 3.6.1:

    In [3]: %timeit a+b+c+d
    355 ns ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [4]: %timeit "{}{}{}{}".format(a, b, c, d)
    630 ns ± 38.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [5]: %timeit f"{a}{b}{c}{d}"
    21.8 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Edit: that last %timeit made me suspicious, so I dug a bit deeper with the dis module:

    In [8]: import dis

    In [9]: def f1():
       ...:     return a+b+c+d

    In [10]: def f2():
        ...:     return "{}{}{}{}".format(a, b, c, d)

    In [11]: def f3():
        ...:     return f"{a}{b}{c}{d}"

    In [12]: dis.dis(f1)
      2           0 LOAD_GLOBAL              0 (a)
                  2 LOAD_GLOBAL              1 (b)
                  4 BINARY_ADD
                  6 LOAD_GLOBAL              2 (c)
                  8 BINARY_ADD
                 10 LOAD_GLOBAL              3 (d)
                 12 BINARY_ADD
                 14 RETURN_VALUE

    In [13]: dis.dis(f2)
      2           0 LOAD_CONST               1 ('{}{}{}{}')
                  2 LOAD_ATTR                0 (format)
                  4 LOAD_GLOBAL              1 (a)
                  6 LOAD_GLOBAL              2 (b)
                  8 LOAD_GLOBAL              3 (c)
                 10 LOAD_GLOBAL              4 (d)
                 12 CALL_FUNCTION            4
                 14 RETURN_VALUE

    In [14]: dis.dis(f3)
      2           0 LOAD_GLOBAL              0 (a)
                  2 FORMAT_VALUE             0
                  4 LOAD_GLOBAL              1 (b)
                  6 FORMAT_VALUE             0
                  8 LOAD_GLOBAL              2 (c)
                 10 FORMAT_VALUE             0
                 12 LOAD_GLOBAL              3 (d)
                 14 FORMAT_VALUE             0
                 16 BUILD_STRING             4
                 18 RETURN_VALUE

    In [15]: %timeit f1()
    415 ns ± 30.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [16]: %timeit f2
    31.4 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

    In [17]: %timeit f2()
    727 ns ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [18]: %timeit f3()
    344 ns ± 27.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My understanding: in the %timeit f"..." example, the value of the expression must have been computed once, before the timed runs. By wrapping each expression in a function, I'm forcing the interpreter to actually evaluate the f-string on every call, so it's a more apples-to-apples comparison. With the dis module I also verified that no constant result is baked into any of the functions.

Timings are now closer, but the f-string is still a bit faster.

Edit 2 (final edit?): still on string concatenation, "".join(...) is the king:

    In [19]: %timeit "".join([a, b, c, d])
    365 ns ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

    In [20]: l = [a, b, c, d]

    In [21]: %timeit "".join(l)
    229 ns ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Notice that in [19] %timeit is also measuring the time to build the list, which is why I retested with a prebuilt list in [21]. Building the list is a cost you have to pay at some point, so join isn't entirely free either.

Still, all these optimizations matter most for really long strings and many fragments (substrings that you want to join). An extreme case, 256 one-character strings:

    In [43]: %timeit l[0] + l[1] + l[2] + l[3] + ... + l[254] + l[255]  # full 256-term + expression elided
    27.9 µs ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

    In [44]: %timeit "".join(l)
    3.71 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So the conclusion is: have an existing list to join? Use str.join(list). Otherwise, operator + might be your friend, but f-strings are probably cleaner to handle (though you only get those in Python 3.6+). As for str.format(): nah, not good.

I didn't expect str.format() to turn out to be quite so poor, though.

Note that this conclusion was reached without looking into the "old" formatting approach, `"%s%s%s%s" % (a, b, c, d)`.

The reason join is fast in these cases is that join basically converts the iterable to a list first and figures out exactly how much it has to join.

It walks the iterables (no need to convert them to lists), but its main trick is actually knowing how much memory it'll need for the end result, so no re-allocations happen: simply a one-pass copy (through multiple iterables).

No, it really does convert the iterable to a list first (there is a utility function in the C API to do just that, because it's kind of hairy to do by hand: PySequence_Fast). Otherwise it would not be possible to calculate how much space is required beforehand, since arbitrary iterables cannot be walked twice.
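To make the two-pass idea concrete, here is a rough Python sketch (not CPython's actual implementation, which is in C) of how join can size its output exactly once before copying:

```python
def join_sketch(sep, iterable):
    parts = list(iterable)              # iterables can't be walked twice
    if not parts:
        return ""
    # pass 1: compute the exact output size; this is what lets the real
    # C code allocate the result buffer exactly once
    n = sum(len(p) for p in parts) + len(sep) * (len(parts) - 1)
    # pass 2: a single copy into a buffer of exactly that size
    buf = [""] * n                      # stands in for a C character buffer
    pos = 0
    for i, p in enumerate(parts):
        if i:
            buf[pos:pos + len(sep)] = sep
            pos += len(sep)
        buf[pos:pos + len(p)] = p
        pos += len(p)
    assert pos == n                     # no re-allocation was ever needed
    return "".join(buf)                 # only to turn the buffer back into str
```

This is why `+` in a loop is quadratic (each concatenation copies everything so far) while join stays linear.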

I think they're really good to be aware of, but it's overreaching to advise "Use slots when defining a Python class."

I'm surprised there's no mention of exceptions. Constructing, throwing, catching, and discarding an exception can be relatively slow (especially in a tight loop). My usual advice is "exceptions should be the exceptional case."

In general, get familiar with inspection tools so your code is easy to measure and you can clean up hotspots. Trying to optimize code without profiling it often makes the code harder to read and may not even address the slow parts. Otherwise you spend hours trying to speed things up, only to realize the slow parts were in a different part of the code, and now it's harder for the next person to reason about what's going on.

One of the best tools for profiling I've used with Python is gprof2dot, which analyzes cProfile's output and generates a function call graph that gives deep insight into what is really taking execution time.

It can help you see a lot more clearly why things are slow. Though it has its nuisances, it significantly improves your understanding of where time is spent. It's the same data as cProfile, just much better organized.

That tool, combined with line_profiler (kernprof), can work wonders.
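For reference, a typical invocation looks something like this (file names are placeholders; requires gprof2dot, installable via pip, and Graphviz's `dot`):

```shell
# 1. collect profiling stats with cProfile
python -m cProfile -o out.prof myscript.py
# 2. render the stats as an annotated call graph
gprof2dot -f pstats out.prof | dot -Tsvg -o out.svg
```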

I made a wrapper of cProfile and gprof2dot, which can be found here:


so that you don't have to remember how to call gprof2dot all the time.

Nice, it looks handy!

Slots have benefits outside of the memory savings. By preventing arbitrary attribute assignment on an object, slots can provide nice guarantees about the shape of an object, making it easier to reason about.
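A minimal sketch of that guarantee:

```python
class Point:
    __slots__ = ("x", "y")        # fixed attribute set, no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)
p.x = 10                          # fine: 'x' is a declared slot
try:
    p.z = 3                       # not declared: rejected at assignment time
except AttributeError:
    print("arbitrary attribute assignment is rejected")
```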

> exception[s] can be relatively slow

That really depends on your hit rate. Say you have a generator returning objects and you don't know if they have an attribute. You can use hasattr(..) to check or just try to use the attribute and catch the AttributeError.

One-on-one, the exception is slower, but there comes a point where the exception is rare enough that actively checking every single item is slower overall. E.g., if only 1 in 1000 items lacks the attribute, just mop up with an exception.
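A sketch of the two styles being compared (the classes and the `value` attribute are made up for illustration):

```python
class WithValue:
    def __init__(self, value):
        self.value = value

class WithoutValue:
    pass

# LBYL ("look before you leap"): pay for a hasattr() check on every item
def total_lbyl(items):
    return sum(obj.value for obj in items if hasattr(obj, "value"))

# EAFP ("easier to ask forgiveness"): pay nothing in the common case,
# mop up the rare miss with an exception
def total_eafp(items):
    total = 0
    for obj in items:
        try:
            total += obj.value
        except AttributeError:
            pass                  # the rare item without the attribute
    return total
```

With a 1-in-1000 miss rate, the EAFP version does no extra work for the 999 common cases.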

That's what I was trying to get at: not avoiding exceptions, but writing your code in such a way that exceptions are rare. I usually encounter this in file I/O situations like you're describing: an extra file stat in a loop would be really slow and opens you up to race conditions (the file being modified between testing and acting). If an exception is thrown 90% of the time, you want to rethink your logic, because performance might be equally poor. Catching the rare exception is ideal.

As an aside (since you mentioned generators), I wish exception handling in comprehensions were easier to deal with.

Out of curiosity, is there any overhead to the Python exception mechanism if you don't hit an exception (i.e., just by wrapping things in a try block)?

I come from an embedded background, where in some C++ projects we would disable exceptions for various reasons, including the memory overhead, which is why I ask.

Exception handlers in CPython work by having an instruction push the offset of the handler code onto a stack attached to the frame (IIRC). So the cost of a "try:" block is pretty low (for Python, anyway).

The handlers themselves are pretty basic, an "except SomeError" pretty much translates to "if isinstance(exc, SomeError):".
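You can see this for yourself with dis; the exact opcodes vary between CPython versions, but the handler setup is visible as a cheap instruction at the top of the guarded block:

```python
import dis

def guarded():
    try:
        return 1
    except ValueError:
        return 0

# prints the bytecode; opcode names differ across CPython versions,
# but entering the try block is not where the real cost lives
dis.dis(guarded)
```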

It's 2017. After Python 3.6, I hope the Python 2 vs. Python 3 debate is put to an end.

I can't believe this is still a discussion; Python 3 already supports practically every Python 2 library.


In general, Python 3 should have been what you start with for the past few years. In visual effects, most of our apps have an embedded Python runtime. Those apps have a lot of internal code built around Python 2, in addition to clients that have a decade's worth of Python 2 infrastructure. Everyone needs to move at the same time (and it's been hard even with minor versions of Python).

As an industry, we're only now talking about migrating to Python 3 in 2019: http://www.vfxplatform.com/ Looking around, most people I worked with just do what they're told and have probably never seen Python3 yet. The past few years the industry has pushed off Python3 because of other large changes requested for third-party apps: new versions of C++, gcc, boost, Qt.

It's actually been kind of tough to use Python 2 for the past couple years. I noticed the PyCon talks are a lot less relevant because they're all Python 3 oriented. Python bindings for Qt are a lot more difficult because of the old compiler used for Python 2 on Windows.

I wonder how many other industries are similarly stuck?

I stopped reading there.

In fact, I'm finding more and more libraries written for Python 3 exclusively, or for which Python 2 support has been discontinued. Stating "you may find a lot of packages that only support Python2" makes me think the author is just spouting off random things he's heard.

> Multiprocess, not Multi-thread

or gevent, which is built on libev and provides constructs like queues, etc., to make your concurrent-programming life much better.

I still find monkey patching all the internal functions with gevent terrifying; to be honest, it caused loads of weird bugs with celery that were impossible to debug. Given my experience, I'm pretty sure I'd avoid it in the future.

I've used gevent monkey patching for years in real codebases with success although I have not used celery. Most problems I've seen with this come from not doing the monkey patching before importing anything else.
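For anyone unfamiliar, the ordering requirement looks like this (a sketch, assuming gevent is installed):

```python
# The patching must run before anything else imports the modules it
# replaces (socket, ssl, threading, ...), so it goes at the very top
# of the program's entry point, before all other imports.
from gevent import monkey
monkey.patch_all()

import socket      # this socket module is now the cooperative version
```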

Also, generally avoiding c-extensions for talking to databases unless they have explicit support for greenlets is advisable.

My main webapp currently is Flask on gunicorn with async gevent workers. Mix in Flask-Sockets (using gevent-websocket) and I have websocket support as well and use redis pubsub as a broker to communicate intraprocess. It's a nice and capable system and you can reach into gevent if you want to do scatter-gather sorts of patterns without dealing with threads.
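A hypothetical launch command for that kind of stack (the module path and bind address are placeholders):

```shell
# --worker-class gevent gives gunicorn cooperative async workers;
# gevent-websocket ships its own worker class if you need websockets.
gunicorn --worker-class gevent --workers 4 --bind 0.0.0.0:8000 myapp:app
```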

Do you have a small sample of a gunicorn + flask + websockets + sqlalchemy config? I've been exploring a little here and looking for something that works well.

Sorry, I don't have something that is easily sharable.

For standalone programs, there is simply no better concurrency library than gevent. Forget the libev performance thing; the API is super pleasant, even more so than asyncio.

When you use gevent to run explicit green threads, you do NOT monkey patch; you call the library functions. You only monkey patch if you want stuff happening for free, which also works pretty well in production with gunicorn, and yes, with celery.
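A sketch of the explicit, no-monkey-patching style (assumes gevent is installed; `fetch` is a stand-in for real blocking I/O):

```python
import gevent

def fetch(n):
    gevent.sleep(0.01)            # stands in for a blocking network call
    return n * n

# scatter: spawn one greenlet per task, no monkey patching involved
jobs = [gevent.spawn(fetch, n) for n in range(5)]
# gather: wait for all of them and collect the results
gevent.joinall(jobs)
results = [job.value for job in jobs]
print(results)
```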

gevent is officially supported by celery - https://github.com/celery/celery/tree/master/examples/gevent


Sure, though the example doesn't monkey patch. My issues happened around complex multi-part chords, and removing gevent fixed things, but eventually I moved away from celery too and just used Pika and RabbitMQ directly. This gave me more control and let me see what was happening more easily; celery sometimes seemed to hide exceptions from me. Glad you haven't had issues.

Curious if you have tried or evaluated Stackless in production. One doesn't hear a lot about Stackless Python these days. BTW, this is for my education only, not advocating one over the other.

Not really. I'm really going with the community here and the recent community is huge.

Gevent works in production with celery, and even newrelic works with it pretty OK (not perfectly).

When we're talking about production systems, I need the whole ecosystem to work. So I'd rather use vanilla Python with a few libraries like gevent.

> On the other hand, you may find a lot of packages that only support Python2, and Python3 is not backward-compatible. This means that running your Python2 code on a Python3.x interpreter can possibly throw errors.

I've never found this situation, though I've found the inverse (Py3 but no Py2 support).

Apple's coremltools[0] are only available on Python 2.7

0: https://developer.apple.com/documentation/coreml/converting_...

The Numba JIT function decorator called with (nopython=True, nogil=True) is very useful for small functions with heavy use of NumPy. Often faster than Cython.


Big fan of using Cython for things where you need significant performance optimization.

How do you develop in Cython? Doesn't the compile, load, and test cycle break your flow inside the IDE?

You can either compile-before-run (as you would with any compiled language), or you can leverage Cython's compile-on-import mechanism, which does exactly the same but means you don't have to think about compiling.

Compile time is usually down to a few seconds (or less), so it's not really a burden.
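The compile-on-import mechanism mentioned above is pyximport; a minimal sketch (`mymod` is a hypothetical module name):

```python
import pyximport
pyximport.install(language_level=3)   # hook .pyx imports into the import system

# From here on, `import mymod` would find mymod.pyx next to your code,
# compile it with your C compiler, cache the extension, and load it:
# import mymod
```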

Hey, thanks for this. What do you personally use? It's more interesting to learn from someone who has been using it for a while.

I'm especially concerned about IDE support, debuggability and testability.

For IDE support, I'm totally in love with PyCharm and been using it for years, first community and then I upgraded to Pro for Cython support. PyCharm Pro has syntax highlighting and does inspections (linting) on Cython code, so it's a wonderful tool, totally worth its price.

Can't really recommend any other IDE because I haven't used another one for over 4 years, though a friend uses (loves) Sublime Text and has high praise for it too. It has Cython syntax support, but there's really no linter for Cython -- what's more, you may have to turn off warnings if you're using a Python linter because it'll complain about cdefs and such.

As for day-to-day use, I don't do that much Cython really, just occasionally to speed things up. I start from a Python function, then make a .pyx file (or a bunch of .pyx files) and replicate what's needed. Compile, load both versions (Py and Cy), and compare outputs and speed from IPython or with benchmark scripts.

I used Cython's HTML annotation view of the source a lot at the beginning, though you get the hang of what transpiles well with time and use it less and less. Still, sometimes it's useful.

Why did you not choose to use Numba? The performance is faster in a lot of cases, and most importantly, the IDE/debugging integration is seamless.

'Use static code analysis tools' is missing a mention of mypy and pytype.
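For instance, here is a sketch of the kind of bug mypy catches statically (`scale` is a made-up example; running `mypy` on the file reports the bad call):

```python
# CPython happily executes the bad call below (str * int repeats the
# string), but mypy flags it before the code ever runs.
def scale(value: int, factor: int = 2) -> int:
    return value * factor

scale(3)      # ok
scale("3")    # mypy: Argument 1 to "scale" has incompatible type "str"; expected "int"
```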

Hint: switch to Scala, or Jython.

Skip straight to assembly
