
Loops in Python – Comparison and Performance - duomly
https://www.blog.duomly.com/loops-in-python-comparison-and-performance/
======
chronial
Note that this article is missing the most pythonic implementation, which is

    
    
       z = [a + b for a, b in zip(x, y)]
    

This is also twice as fast as their fastest pure-python implementation on my
machine.

~~~
BerislavLopac
Or even better, and likely faster for huge lists:

    
    
        z = list(map(sum, zip(x, y)))
    

Even that list() is superfluous; map returns a lazy iterator, so it's better if
you only need one number at a time rather than the whole list (e.g. looping
through results until you find a certain value).
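
A sketch of that lazy version (assuming x and y are the lists from above; the
threshold is just an arbitrary example):

    
    
        sums = map(sum, zip(x, y))   # nothing is computed yet
        for s in sums:
            if s > 150:              # stop as soon as some condition is met
                break
    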

~~~
dagw
_likely faster for huge lists:_

Nope, at least not on my computer. The map solution was ~50% slower than the
list comprehension when run on a list with 1 million ints, and almost twice as
slow when run on a list with 100k ints.

In fact even using a for loop in the obvious way was slightly faster than map
with 1 million elements (about halfway between the map and list comprehension
performance-wise).

Map without the last list call is however very slightly faster than the
equivalent generator comprehension.

~~~
BerislavLopac
That is interesting. It's possible that they have done some optimisations on
loops since the last time I measured it, but that was back in Python 2.6 (I
think), and map was _much_ faster than a loop (no comprehensions back then),
and it was particularly obvious with a very high number of input elements.

Also, from your description I suspect that the list() call over a generator
might introduce extra overhead, rather than map itself...

~~~
dagw
Your last comment seems reasonable. I guess the new rule of thumb should be:
map is still fast, but only if you want an iterator. If you need a list, use a
list comprehension.

------
underdeserver
I seem to remember other analyses show that the differences are consistent
with the number of variable name lookups.

The while loop looks up i, n, z, x, y in every iteration.

The for loop looks up i, z, x and y in every iteration.

The list comprehension looks up i, x and y in every iteration.

The numpy one doesn't look up anything.

If you check the proportions you'll find this is compatible with this guy's
data.
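
For reference, roughly what those three pure-Python versions look like (a
sketch; x and y are the input lists):

    
    
        # while loop: re-resolves i, n, z, x and y on every iteration
        i, n, z = 0, len(x), []
        while i < n:
            z.append(x[i] + y[i])
            i += 1

        # for loop: re-resolves i, z, x and y on every iteration
        z = []
        for i in range(len(x)):
            z.append(x[i] + y[i])

        # list comprehension: re-resolves i, x and y on every iteration
        z = [x[i] + y[i] for i in range(len(x))]
    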

~~~
hedora
Veteran python hater here.

Last I checked, accessing global variables is slow in python because it goes
through a hash table (or something) on every lookup. On the other hand, the
interpreter is able to optimize accesses to stuff on the stack.

I suspect wrapping the benchmark code (and all the variables) in a method, and
then invoking that once would substantially improve the raw python numbers.
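
A minimal sketch of that trick (the loop body is just a stand-in for the
article's benchmark):

    
    
        import timeit

        def bench():
            # locals are resolved by index into a fixed array, not by dict lookup
            x = list(range(1_000_000))
            total = 0
            for v in x:
                total += v
            return total

        print(timeit.timeit(bench, number=10))
    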

By the time I figured this out, I’d already written a pile of python code that
was mysteriously >10x slower than it would be in perl 5.

~~~
chronial
> On the other hand, the interpreter is able to optimize accesses to stuff on
> the stack.

That is correct – there is no hash lookup for local variables and the
variables in a list comprehension are local.

------
wyldfire
Let's take a step back from the empirical measurements. It's always good to
take measurements when considering performance but in this case I think the
concepts are pretty simple. Simple enough that we can analytically predict
performance.

Pure python is slow. Dog slow. For the most part, its slowness is due to its
power. It's so terribly powerful that very nearly nothing is determined
statically. The next line of code we're about to execute is "i = i + 1". What
sort of computation is that in terms of the computer itself -- registers,
load/stores with memory? We have no idea: none! Until we get to that line and
start to break it down. We can't even guess (or even bound) whether it does
one operation or a billion operations. And not just the first time, either.
Every time†! i could refer to something other than an integer this time, or
you could have patched i's __add__(). Sometimes we capitalize on python's
mega-dynamic features but in simpler cases like these loops we don't.
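
A toy illustration of why the interpreter can't assume anything about
"i = i + 1" (the class is made up):

    
    
        class SlowInt(int):
            def __add__(self, other):
                # arbitrary work can hide behind "+"
                for _ in range(1_000_000):
                    pass
                return int(self) + other

        i = 5
        i = i + 1           # plain int addition: cheap
        i = SlowInt(5)
        i = i + 1           # same source line, a million extra iterations
    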

So we should expect to see vast differences between numpy and pure python
because numpy's guts are implemented in an extension.

But I'd love it if this article had gone a little deeper. What are some
specific details about why we see some differences here? Does one of these
versions suffer more heap fragmentation? Are the differences for bigger
matrices dominated by cache misses? What's the difference in the disassembly
between the for/while/comprehension loops?

† this is technically an implementation detail of CPython IIRC because tracing
JIT optimizers like PyPy are somehow able to reason that this can't happen (or
trap the cases when it could) and make this go faster when we don't use
dynamic features.

~~~
jcranmer
> For the most part, its slowness is due to its power. It's so terribly
> powerful that very near nothing is determined statically.

No. It's slow because the language is interpreted, not JITed. Powerful,
highly-dynamic languages can be quite amenable to JIT compilation, especially
if you're not using the features. Shape optimizations (assuming the code in
question is going to be used more or less monomorphically) can convert name
lookups to structure offsets. You can make code work with dynamic
modifications by implementing deoptimization guards.

The CPython implementation explicitly favored simplicity of the interpreter
over making it a competitively fast runtime.

~~~
m_mueller
Yes, and I recommend trying numba for a mostly seamless JIT optimization.
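
A minimal numba sketch of the kind of loop discussed in the article (the
function name and sizes are illustrative):

    
    
        import numpy as np
        from numba import njit

        @njit  # compiled to machine code on first call
        def add_arrays(x, y):
            z = np.empty_like(x)
            for i in range(x.size):
                z[i] = x[i] + y[i]
            return z

        x = np.random.rand(1_000_000)
        y = np.random.rand(1_000_000)
        z = add_arrays(x, y)
    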

------
koblas
Pitting numpy append against python append is an odd choice for a performance
comparison.

~~~
vsmhn
Precisely my thought :).

Also, what about the builtin array module?
[https://docs.python.org/3.7/library/array.html](https://docs.python.org/3.7/library/array.html)
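
For reference, a minimal sketch of the array module:

    
    
        from array import array

        # typed, C-backed storage without pulling in numpy
        x = array('d', [1.0, 2.0, 3.0])
        x.append(4.0)
        total = sum(x)
    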

~~~
dr_zoidberg
Well, given that they didn't even use pythonic constructs, I'm not quite sure
what to think of the article:

    
    
        In [1]: import random
        In [2]: r = [random.randrange(100) for _ in range(100000)]
        In [3]: x, y = random.sample(r, 1000), random.sample(r, 1000)
        In [4]: %timeit z = [x[i] + y[i] for i in range(1000)]
        106 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
        In [5]: %timeit z = [i + j for i, j in zip(x, y)]
        67.3 µs ± 3.38 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

(under python 3.6.3)

For those who "don't see it": instead of looping over a zip of the iterables,
as they should be doing, they're using an "index range" to access each element
by position -- not best practice, and it also results in a noticeable slowdown.

Edit: And for the curious who might think they give different results:

    
    
        In [6]: z1 = [x[i] + y[i] for i in range(1000)]
        In [7]: z2 = [i + j for i, j in zip(x, y)]
        In [8]: z1 == z2
        Out[8]: True
    

Final edit: all in all, I'd say this is a low-effort post, aimed at gathering
attention and showing "look how good we are that we know how to speed up python
loops using numpy" (/s)...

And I've successfully been python-nerd-sniped.

------
dotdi
The article is interesting, but not at face value.

On the surface, the comparison is rather weak because it is comparing loops in
a dynamic language to optimized, low-level vector operations. Which is, you
know, "whoa, low-level hand-optimized routines are faster than Python", or, as
I said, weak.

The interesting bit is that Python cannot figure out that it could unroll or
vectorize these rather simple operations. Surely there are languages with
optimizers that can figure that out. Is that impossible to do in Python? Is
that by design or by accident?

And, as many people said, if you are doing a lot of math in pure Python, you
are doing it wrong. There are other, much more useful, ways to use loops. I
like Python especially because it beautifully handles those "other cases".

~~~
gmiller123456
_On the surface, the comparison is rather weak because it is comparing loops
in a dynamic language to optimized, low-level vector operations. Which is, you
know, "whoa, low-level hand-optimized routines are faster than Python", or, as
I said, weak._

I'm not sure where you're trying to go with that, but comparing Python's
interpreted, non-optimized speed to low level OMG optimized code is exactly
the point of the article.

The comparison is very useful for anyone looking to increase the speed of code
they've written in Python. This points out the limits of what can be achieved
in pure Python, and where you'd need to look to improve things, even if it
means rewriting the whole thing in a different language. From what you've
said, it sounds like you were expecting an article on something different.

~~~
BerislavLopac
> The comparison is very useful for anyone looking to increase the speed of
> code they've written in Python.

Python has numerous mechanisms that are much more efficient than the solution
in this article, like the zip, map and filter functions, as well as the excellent
built-in libraries like itertools and functools.
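
A sketch along those lines (assuming x and y are the input lists; the slice
size is arbitrary):

    
    
        from itertools import islice
        from operator import add

        # lazily add corresponding elements and only materialise the first ten
        z = list(islice(map(add, x, y), 10))
    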

------
xaedes
So a conclusion would be to use numpy instead of for loops for number
crunching.

But behold! This does not scale as one could expect.

Turning a slightly more complex elementwise computation into vectorized numpy
calls produces a lot of memory overhead. For each vectorized numpy call, which
does a basic computation (like adding two operands, negating, taking the sine
of, ...), you allocate a whole new array. The calls _can_ leverage parallel
computation, but you run into cache issues because of the massive memory
overhead.

A simple C-loop computing each element completely can be much more memory
local and cache friendly. Turning it into a simple chain of numpy operations
can destroy that.

I think using Numba to compile a ufunc for numpy can be a better approach to
that problem.
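
A minimal sketch of that approach with numba's vectorize (the expression and
sizes are made up):

    
    
        import math
        import numpy as np
        from numba import vectorize

        @vectorize(["float64(float64, float64)"])
        def fused(a, b):
            # one compiled kernel per element instead of a chain of temporary arrays
            return math.sin(a) + a * b - b

        x = np.random.rand(1_000_000)
        y = np.random.rand(1_000_000)
        z = fused(x, y)
    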

~~~
jph00
> Turning a little more complex elementwise computation into vectorized numpy
> calls produces a lot of memory overhead. For each vectorized numpy call,
> which does a basic computation (like adding two operands, or negating, or
> sinus of, ...), you allocate a whole new array.

I believe this is no longer correct, within a single expression. In the last
couple of years things were fixed so that `a = x+y+z`, for instance, will not
create an intermediate array.

~~~
nerdponx
This is also what the 'numexpr' library does.
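
A minimal numexpr sketch (the expression is made up):

    
    
        import numexpr as ne
        import numpy as np

        x = np.random.rand(1_000_000)
        y = np.random.rand(1_000_000)
        # evaluated in one pass over the data, without full-size temporaries
        z = ne.evaluate("sin(x) + x * y - y")
    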

------
wombatpm
Isn't using numpy just an example of custom C code for a python bottleneck?

------
zeckalpha
I’d be curious to see map with a lambda, as well as numba used directly rather
than through numpy, included in the comparison.

~~~
ben509
In CPython, a list comprehension looks like this internally:

    
    
         dis.dis(lambda y: [x + 1 for x in y])
            1    0 LOAD_CONST               1 (<code object <listcomp> at 0x106f441e0, file "<ipython-input-2-1f82f6afa2d5>", line 1>)
                  2 LOAD_CONST               2 ('<lambda>.<locals>.<listcomp>')
                  4 MAKE_FUNCTION            0
                  6 LOAD_FAST                0 (y)
                  8 GET_ITER
                 10 CALL_FUNCTION            1
                 12 RETURN_VALUE
    
        Disassembly of <code object <listcomp> at 0x106f441e0, file "<ipython-input-2-1f82f6afa2d5>", line 1>:
           1      0 BUILD_LIST               0
                  2 LOAD_FAST                0 (.0)
            >>    4 FOR_ITER                12 (to 18)
                  6 STORE_FAST               1 (x)
                  8 LOAD_FAST                1 (x)
                 10 LOAD_CONST               0 (1)
                 12 BINARY_ADD
                 14 LIST_APPEND              2
                 16 JUMP_ABSOLUTE            4
            >>   18 RETURN_VALUE
    

You get very similar disassembled code to this:

    
    
        def foo(y):
            def bar():
                l = []
                a = l.append
                for x in y:
                    a(x + 1)
                return l
            return bar()
    

Obviously you can't use the `LIST_APPEND` primitive in a for loop. Also, note
that CPython doesn't try to preallocate the list. `l = [None] * len(y)` and
then iterating over a range might beat a comprehension for long lists.
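
A sketch of that preallocation variant (the function name is made up):

    
    
        def baz(y):
            # preallocate, then fill by index, instead of repeated appends
            l = [None] * len(y)
            for i in range(len(l)):
                l[i] = y[i] + 1
            return l
    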

The performance story with numba and numpy is basically that you have a
boundary between Python and their internal representation of data, and
crossing that boundary is slow.

For instance, in numpy, even if you specify a dtype, `np.array` must check and
cast every element to that type. Likewise, `tolist` must construct new python
objects for every value in the array.
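
A small sketch of that boundary cost:

    
    
        import numpy as np

        py_list = list(range(1_000_000))
        arr = np.array(py_list, dtype=np.int64)  # checks and casts every element on the way in
        back = arr.tolist()                      # builds a fresh Python int for every value on the way out
    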

Once you're dealing with ndarrays, though, operations are fast as the array
has a single type and the code is heavily optimized. But iterating over
ndarrays from Python tends to be slower than iterating over native Python containers.

Numba is a mixed bag. It's no worse than numpy at interpreting ndarrays, and
bad with Python containers. It has to inspect arguments to call a JITted
function, and this can be far more surprising than numpy.

But if you can combine many JITted functions, you can combine a lot of logic
and completely avoid introspection in a way you can't do with numpy alone. The
difficulty of doing that is simply that you end up reimplementing many of the
numpy features in an early-binding language.

------
eesmith
The 'premature optimization is the root of all evil' quote is from Knuth's
1974 paper "Structured Programming With Go To Statements" (p. 268, second
column), not from "The Art of Computer Programming".

~~~
nine_k
Sometimes optimization is not premature: when you have a known performance
problem, have done proper profiling, and see that it's a tight loop (as
usual).

Then you can shave off some execution time by using a slightly different
construct, though bigger wins usually lie elsewhere.

~~~
magicalhippo
If you've got a known performance problem it's not premature. Knuth's quote is
the programming version of "don't buy clothes for your baby before you've had
sex for the first time."

There are some important distinctions though, which get lost if one takes the
quote at face value without much context. For example, I think one should
strive to avoid O(n^2) loops and similar unless one is absolutely positively
certain n will forever be bounded to some sane value.

My point being that O(n^2) code can make the application completely unusable,
which makes it something more than a pure optimization problem.

~~~
eesmith
(As an aside, some asexual people, who never had sex and are not interested in
ever having sex, are still interested in adopting, and would buy clothes for
their baby before having sex for the first time.)

------
ddavis
This comparison would benefit from showing numba
([https://numba.pydata.org/](https://numba.pydata.org/)) results for explicit
loops.

------
google2342
You shouldn't use the mean when doing benchmarking. Better to use the median
or fastest time. Lots of random things can happen on computers (usually in the
OS) that can result in some operation taking 1000x longer.

~~~
chronial
Interesting point. The %timeit functionality actually used to output the mean
of the x fastest runs; they seem to have changed that at some point. The docs
still explain the old behavior [1].

I assume that they feel your concerns are addressed because they display the
stddev together with the mean, so you can see if there were any extreme
outliers.

[1] [https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit](https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-timeit)

------
jsilence
TL;DR: Use Numpy! Numpy rocks!

