
Python Practices for Efficient Code: Performance, Memory, and Usability - happy-go-lucky
https://www.codementor.io/satwikkansal/python-practices-for-efficient-code-performance-memory-and-usability-aze6oiq65
======
davidwtbuxton
> Use format instead of + for generating strings — In Python, str is
> immutable, so the left and right strings have to be copied into the new
> string for every pair of concatenations.

It isn't always faster to use string formatting.

    
    
        $ python -m timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' 'a + b + c + d'
        10000000 loops, best of 3: 0.181 usec per loop
        $ python -m timeit -s 'a, b, c, d = "1234567890", "abcdefghij", "ABCDEFGHIJ", "0987654321"' '"{}{}{}{}".format(a, b, c, d)'
        1000000 loops, best of 3: 0.447 usec per loop
        $ python --version
        Python 2.7.13

~~~
dr_zoidberg
On Python 2.7.10:

    
    
        In [2]: %timeit a+b+c+d
        The slowest run took 6.66 times longer than the fastest. This could mean that an intermediate result is being cached.
        1000000 loops, best of 3: 247 ns per loop
    
        In [4]: %timeit "{}{}{}{}".format(a, b, c, d)
        The slowest run took 6.37 times longer than the fastest. This could mean that an intermediate result is being cached.
        1000000 loops, best of 3: 709 ns per loop
    

On Python 3.6.1:

    
    
        In [3]: %timeit a+b+c+d
        355 ns ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
        In [4]: %timeit "{}{}{}{}".format(a, b, c, d)
        630 ns ± 38.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
        In [5]: %timeit f"{a}{b}{c}{d}"
        21.8 ns ± 1.4 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    

Edit: that last %timeit made me suspicious, so I dug a bit deeper with the dis
module:

    
    
        In [8]: import dis
    
        In [9]: def f1():
           ...:     return a+b+c+d
           ...:
    
        In [10]: def f2():
            ...:     return "{}{}{}{}".format(a, b, c, d)
            ...:
    
        In [11]: def f3():
            ...:     return f"{a}{b}{c}{d}"
            ...:
    
        In [12]: dis.dis(f1)
          2           0 LOAD_GLOBAL              0 (a)
                      2 LOAD_GLOBAL              1 (b)
                      4 BINARY_ADD
                      6 LOAD_GLOBAL              2 (c)
                      8 BINARY_ADD
                     10 LOAD_GLOBAL              3 (d)
                     12 BINARY_ADD
                     14 RETURN_VALUE
    
        In [13]: dis.dis(f2)
          2           0 LOAD_CONST               1 ('{}{}{}{}')
                      2 LOAD_ATTR                0 (format)
                      4 LOAD_GLOBAL              1 (a)
                      6 LOAD_GLOBAL              2 (b)
                      8 LOAD_GLOBAL              3 (c)
                     10 LOAD_GLOBAL              4 (d)
                     12 CALL_FUNCTION            4
                     14 RETURN_VALUE
    
        In [14]: dis.dis(f3)
          2           0 LOAD_GLOBAL              0 (a)
                      2 FORMAT_VALUE             0
                      4 LOAD_GLOBAL              1 (b)
                      6 FORMAT_VALUE             0
                      8 LOAD_GLOBAL              2 (c)
                     10 FORMAT_VALUE             0
                     12 LOAD_GLOBAL              3 (d)
                     14 FORMAT_VALUE             0
                     16 BUILD_STRING             4
                     18 RETURN_VALUE
    
        In [15]: %timeit f1()
        415 ns ± 30.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
        In [16]: %timeit f2
        31.4 ns ± 1.45 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
    
        In [17]: %timeit f2()
        727 ns ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
        In [18]: %timeit f3()
        344 ns ± 27.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

My understanding: in the bare %timeit f"..." run, the expression was
effectively reduced to a constant before the timed loop started. Wrapping each
version in a function forces the interpreter to actually evaluate the f-string
on every call, so it's a more apples-to-apples comparison. The dis output also
confirms that none of the three functions has the result baked in as a
constant.

Timings are now closer, but the f-string is still a bit faster.

Edit 2 (final edit?): still, for string concatenation, "".join(...) is
king:

    
    
        In [19]: %timeit "".join([a, b, c, d])
        365 ns ± 39.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    
        In [20]: l = [a, b, c, d]
    
        In [21]: %timeit "".join(l)
        229 ns ± 21.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

Notice that in [19] %timeit is also measuring the time to build the list in
the first place, which is why I also tested with a prebuilt list in [21].
Building the list is a cost you have to pay somewhere, so join isn't entirely
free either.

Still, all these optimizations matter most for really long strings and many
fragments (substrings that you want to join). An extreme case: 256
one-character strings:

    
    
        In [43]: %timeit l[0] + l[1] + l[2] + l[3] + l[4] + l[5] + l[6] + l[7] + l[8] + l[9] + l[10] + l[11] + l[12] + l[13] + l[14] + l[15] + l[16] + l[17]
            ...:  + l[18] + l[19] + l[20] + l[21] + l[22] + l[23] + l[24] + l[25] + l[26] + l[27] + l[28] + l[29] + l[30] + l[31] + l[32] + l[33] + l[34] +
            ...: l[35] + l[36] + l[37] + l[38] + l[39] + l[40] + l[41] + l[42] + l[43] + l[44] + l[45] + l[46] + l[47] + l[48] + l[49] + l[50] + l[51] + l[5
            ...: 2] + l[53] + l[54] + l[55] + l[56] + l[57] + l[58] + l[59] + l[60] + l[61] + l[62] + l[63] + l[64] + l[65] + l[66] + l[67] + l[68] + l[69]
            ...: + l[70] + l[71] + l[72] + l[73] + l[74] + l[75] + l[76] + l[77] + l[78] + l[79] + l[80] + l[81] + l[82] + l[83] + l[84] + l[85] + l[86] + l
            ...: [87] + l[88] + l[89] + l[90] + l[91] + l[92] + l[93] + l[94] + l[95] + l[96] + l[97] + l[98] + l[99] + l[100] + l[101] + l[102] + l[103] +
            ...: l[104] + l[105] + l[106] + l[107] + l[108] + l[109] + l[110] + l[111] + l[112] + l[113] + l[114] + l[115] + l[116] + l[117] + l[118] + l[11
            ...: 9] + l[120] + l[121] + l[122] + l[123] + l[124] + l[125] + l[126] + l[127] + l[128] + l[129] + l[130] + l[131] + l[132] + l[133] + l[134] +
            ...:  l[135] + l[136] + l[137] + l[138] + l[139] + l[140] + l[141] + l[142] + l[143] + l[144] + l[145] + l[146] + l[147] + l[148] + l[149] + l[1
            ...: 50] + l[151] + l[152] + l[153] + l[154] + l[155] + l[156] + l[157] + l[158] + l[159] + l[160] + l[161] + l[162] + l[163] + l[164] + l[165]
            ...: + l[166] + l[167] + l[168] + l[169] + l[170] + l[171] + l[172] + l[173] + l[174] + l[175] + l[176] + l[177] + l[178] + l[179] + l[180] + l[
            ...: 181] + l[182] + l[183] + l[184] + l[185] + l[186] + l[187] + l[188] + l[189] + l[190] + l[191] + l[192] + l[193] + l[194] + l[195] + l[196]
            ...:  + l[197] + l[198] + l[199] + l[200] + l[201] + l[202] + l[203] + l[204] + l[205] + l[206] + l[207] + l[208] + l[209] + l[210] + l[211] + l
            ...: [212] + l[213] + l[214] + l[215] + l[216] + l[217] + l[218] + l[219] + l[220] + l[221] + l[222] + l[223] + l[224] + l[225] + l[226] + l[227
            ...: ] + l[228] + l[229] + l[230] + l[231] + l[232] + l[233] + l[234] + l[235] + l[236] + l[237] + l[238] + l[239] + l[240] + l[241] + l[242] +
            ...: l[243] + l[244] + l[245] + l[246] + l[247] + l[248] + l[249] + l[250] + l[251] + l[252] + l[253] + l[254] + l[255]
        27.9 µs ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
        In [44]: %timeit "".join(l)
        3.71 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
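For reproducibility, the 256-term expression above can be generated instead of typed out by hand. A sketch using the stdlib timeit module (the variable names are mine, not from the thread):

```python
import timeit

l = ["x"] * 256  # 256 one-character strings
expr = " + ".join(f"l[{i}]" for i in range(256))  # l[0] + l[1] + ... + l[255]

# sanity check: both spellings build the same string
assert eval(expr) == "".join(l)

# 255 pairwise concatenations vs. one pre-sized copy
t_concat = timeit.timeit(expr, globals={"l": l}, number=10_000)
t_join = timeit.timeit('"".join(l)', globals={"l": l}, number=10_000)
print(f"concat: {t_concat:.4f}s  join: {t_join:.4f}s")
```

The gap should match the thread's numbers: join wins by a wide margin once the number of fragments grows.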

~~~
dom0
The reason join is fast in these cases is that it basically converts the
iterable to a list first and figures out exactly how much it has to join.

~~~
dr_zoidberg
It walks the iterable (no need to convert it to a list), but its main trick is
knowing up front how much memory the end result will need, so no
re-allocations happen: just a one-pass copy through the fragments.

~~~
dom0
No, it really does convert the iterable to a list first (there is a utility
function in the C API to do just that, because it's kinda hairy to do by hand:
PySequence_ToList or so). Otherwise it would not be possible to calculate how
much space is required beforehand, since iterables cannot be walked twice.
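Either way, join happily consumes a one-shot iterator, which supports the point that it materializes the input internally before sizing the result. A small sketch:

```python
def fragments():
    # a generator: it can only be walked once
    yield "spam"
    yield "eggs"

g = fragments()
result = "".join(g)   # join drains the generator into a sequence first,
                      # computes the total length, then copies in one pass
print(result)         # spameggs
next(g, None)         # the generator is now exhausted
```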

------
pfranz
I think they're really good to be aware of, but it's overreaching to advise
"Use slots when defining a Python class."

I'm surprised there's no mention of exceptions. Constructing, throwing,
catching, and discarding an exception can be relatively slow (especially in a
tight loop). My usual advice is "exceptions should be the exceptional case."

In general, get familiar with inspection tools so your code is easy to measure
and you can clean up hotspots. Trying to optimize code without inspecting it
often makes the code harder to read and may not even touch the
slower-performing parts. Too many people spend hours trying to speed things up
only to realize the slow parts were somewhere else entirely, and now the code
is harder for the next person to reason about.
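Agreed on measuring first. A minimal sketch of one such stdlib inspection tool; the hotspot function here is made up for illustration:

```python
import cProfile
import io
import pstats

def hotspot():
    # deliberately CPU-heavy toy function
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
hotspot()
profiler.disable()

# print the 5 most expensive entries by cumulative time
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report shows where the time actually went, which is the information you need before touching any code.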

~~~
oliwarner
> exception[s] can be relatively slow

That really depends on your hit rate. Say you have a generator returning
objects and you don't know if they have an attribute. You can use hasattr(..)
to check or just try to use the attribute and catch the AttributeError.

One-on-one the exception _is_ slower, but there comes a point where the
exception is rare enough that actively checking every single item is slower.
Eg if only 1-in-1000 doesn't have the attribute, just mop up with an
exception.
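A sketch of the two styles (the Item class is made up for illustration); which one wins depends entirely on the miss rate, as noted above:

```python
class Item:
    def __init__(self, name=None):
        if name is not None:
            self.name = name

# 1-in-1000 objects lacks the attribute
items = [Item("x")] * 999 + [Item()]

# LBYL: actively check every single item
names_check = [it.name if hasattr(it, "name") else "?" for it in items]

# EAFP: assume the attribute exists, mop up the rare miss
names_eafp = []
for it in items:
    try:
        names_eafp.append(it.name)
    except AttributeError:
        names_eafp.append("?")

assert names_check == names_eafp
```

With a high hit rate the try/except path does no per-item work beyond the attribute access itself, while hasattr pays a lookup on every item.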

~~~
TFortunato
Out of curiosity, is there any overhead to the Python exception mechanism if
you don't hit an exception (i.e. just from wrapping things in a try block)?

I come from an embedded background, where in some C++ projects we would
disable exceptions for various reasons, including the memory overhead, which
is why I ask.

~~~
dom0
Exception handlers in CPython work by having an instruction store the offset
of the start of the handler code on a stack attached to the frame (iirc). So
the cost of a "try:" block is pretty low (for Python, anyway).

The handlers themselves are pretty basic; an "except SomeError:" pretty much
translates to "if isinstance(exc, SomeError):".
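You can see this with the dis module on a toy function (exact opcodes vary by version; note that CPython 3.11+ moved to "zero-cost" exception tables, making the no-exception path cheaper still):

```python
import dis

def with_try(x):
    try:
        return x + 1
    except TypeError:
        return None

# On older CPython this shows a cheap setup opcode (e.g. SETUP_FINALLY) at
# the top of the function; on 3.11+ there is no setup opcode at all on the
# happy path, just an out-of-band exception table.
dis.dis(with_try)
```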

------
ram_rar
It's 2017. After Python 3.6, I hope the Python 2 vs 3 debate is put to an
end.

~~~
Kagerjay
I can't believe this is still a discussion; Python 3 already supports
practically every Python 2 library:

[http://py3readiness.org/](http://py3readiness.org/)

~~~
pfranz
In general, Python 3 should have been what you start with for the past few
years. In visual effects, most of our apps have an embedded Python runtime.
Those apps have a lot of internal code built around Python 2, in addition to
clients that have a decade's worth of Python 2 infrastructure. Everyone needs
to move at the same time (and that's been hard even with minor versions of
Python).

As an industry, we're only now talking about migrating to Python 3 in 2019:
[http://www.vfxplatform.com/](http://www.vfxplatform.com/) Looking around,
most people I worked with just do what they're told and have probably never
seen Python3 yet. The past few years the industry has pushed off Python3
because of other large changes requested for third-party apps: new versions of
C++, gcc, boost, Qt.

It's actually been kind of tough to use Python 2 for the past couple of years.
I've noticed the PyCon talks are a lot less relevant because they're all
Python 3 oriented. Python bindings for Qt are a lot more difficult because of
the old compiler used for Python 2 on Windows.

I wonder how many other industries are similarly stuck?

------
sandGorgon
> _Multiprocess, not Multi-thread_

or Gevent, which is built on libev and provides constructs like queues, etc.,
to make your multi-processing life much better.
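For the stdlib route the article itself recommends, here's a minimal multiprocessing sketch (the square function is made up for illustration):

```python
from multiprocessing import Pool

def square(n):
    return n * n

if __name__ == "__main__":
    # one OS process per worker sidesteps the GIL for CPU-bound work;
    # the __main__ guard is required on platforms that spawn workers
    with Pool(processes=4) as pool:
        print(pool.map(square, range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```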

~~~
andy_ppp
I still find it terrifying, monkey patching all the internal functions with
Gevent; to be honest it caused loads of weird bugs with celery that were
impossible to debug. Given my experience, I'm pretty sure I'd avoid it in the
future.

~~~
sandGorgon
for standalone programs, there is simply no better multi-processing library
than gevent. Forget the libev performance thing, the api is super pleasant...
even more so than asyncio.

When you use gevent to run explicit green threads, you do NOT monkey patch;
you call the library functions. You only monkey patch if you want stuff to
happen for free. That also works pretty well in production with gunicorn...
and yes, with celery.

gevent is officially supported by celery -
[https://github.com/celery/celery/tree/master/examples/gevent](https://github.com/celery/celery/tree/master/examples/gevent)

[http://docs.celeryproject.org/en/latest/_modules/celery/conc...](http://docs.celeryproject.org/en/latest/_modules/celery/concurrency/gevent.html)

~~~
srean
Curious if you have tried or evaluated Stackless in production. One doesn't
hear a lot about Stackless Python these days. BTW this is for my education
only, not advocating one over the other.

~~~
sandGorgon
Not really. I'm going with the community here, and the recent community around
gevent is huge.

Gevent works in production with celery, and even newrelic works with it
pretty OK (not perfect).

When we are talking about production systems, I need the whole ecosystem to
work. So I'd rather use vanilla Python with a few libraries like gevent.

------
dr_zoidberg
> On the other hand, you may find a lot of packages that only support Python2,
> and Python3 is not backward-compatible. This means that running your Python2
> code on a Python3.x interpreter can possibly throw errors.

I've never run into this situation, though I have found the inverse (Py3 but
no Py2 support).

~~~
Kpourdeilami
Apple's coremltools[0] are only available on Python 2.7

0:
[https://developer.apple.com/documentation/coreml/converting_...](https://developer.apple.com/documentation/coreml/converting_trained_models_to_core_ml)

------
jjawssd
The Numba JIT function decorator, called with (nopython=True, nogil=True), is
very useful for small functions that make heavy use of NumPy. It's often
faster than Cython.

[https://numba.pydata.org/](https://numba.pydata.org/)

------
EternalData
Big fan of using Cython for things where you need significant performance
optimization.

~~~
sandGorgon
how do you develop in cython - doesn't the compile, load, and test cycle break
your flow inside the IDE?

~~~
dr_zoidberg
You can either compile-before-run (as you would with any compiled language),
or you can leverage Cython's compile-on-import mechanism, which does exactly
the same thing but saves you from thinking about the compile step.

Compile time is usually down to a few seconds (or less), so it's not really a
burden.

~~~
sandGorgon
hey - thanks for this. What do _you_ personally use? It's more interesting to
learn from someone who has been using it for a while.

I'm especially concerned about IDE support, debuggability and testability.

~~~
dr_zoidberg
For IDE support, I'm totally in love with PyCharm and have been using it for
years, first Community and then Pro, which I upgraded to for the Cython
support. PyCharm Pro has syntax highlighting and does inspections (linting) on
Cython code, so it's a wonderful tool, totally worth its price.

Can't really recommend any other IDE because I haven't used another one in
over 4 years, though a friend uses (and loves) Sublime Text and has high
praise for it too. It has Cython syntax support, but there's really no linter
for Cython; what's more, you may have to turn off warnings if you're using a
Python linter, because it'll complain about cdefs and such.

As for day-to-day use, I don't do that much Cython really, just occasionally
to speed things up. I start from a Python function, then make a .pyx file (or
a bunch of .pyx files) and replicate what's needed. Compile, load both
versions (Py and Cy), and compare outputs and speed from IPython or with
benchmark scripts.

I used Cython's annotated HTML view of the source a lot at the beginning,
though you get the hang of what transpiles well with time and use it less and
less. Still, it's sometimes useful.

~~~
sandGorgon
why did you not choose numba? Its performance is faster in a lot of cases,
and most importantly, the IDE/debugging integration is seamless.

------
shalabhc
'Use static code analysis tools' is missing a mention of mypy and pytype.
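Agreed. As a quick illustration (the function is hypothetical, not from the article), mypy catches type mismatches from annotations alone, without running the code:

```python
from typing import List

def mean(values: List[float]) -> float:
    """Average of a non-empty list of floats."""
    return sum(values) / len(values)

print(mean([1.0, 2.0, 3.0]))  # 2.0

# A call like mean("abc") is flagged by mypy before anything runs:
#   error: Argument 1 to "mean" has incompatible type "str"
```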

------
jjtheblunt
hint: switch to scala, or jython.

~~~
number6
Skip straight to assembly

