
Hunting Performance in Python Code: Measuring Memory Consumption - palecsandru
https://pythonfiles.wordpress.com/2017/05/18/hunting-python-performance-part-2/
======
mherrmann
The thing I have to do most often is find out where a function spends its
time. It's a trivial problem. But there's no good out-of-the-box solution. I
ended up creating one [1]. Example:

    
    
      with Timer('Task') as timer:
        with timer.child('First step'):
          sleep(1)
        for _ in range(5):
          with timer.child('Baby steps'):
            sleep(.5)
    

Outputs:

    
    
      Task: 3.520s
        Baby steps: 2.518s (71%)
        First step: 1.001s (28%)
    

This way, you know where most of the time is spent and what you need to
optimise.

It's on PyPI as well:

    
    
      pip install timer_cm
    

The code is a grand 42 lines. But once you've had to do it a few times it
becomes so tedious to code up from scratch that it's much nicer to have a
ready-made solution.
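
The core idea is roughly this (a simplified sketch, not the real timer_cm code; the actual implementation handles more details):

      import time

      class Timer:
          def __init__(self, name, parent=None):
              self.name = name
              self.parent = parent
              self.elapsed = 0.0
              self.children = {}

          def child(self, name):
              # Reuse the same child timer so repeated entries accumulate.
              if name not in self.children:
                  self.children[name] = Timer(name, parent=self)
              return self.children[name]

          def __enter__(self):
              self._start = time.perf_counter()
              return self

          def __exit__(self, *exc):
              self.elapsed += time.perf_counter() - self._start
              if self.parent is None:
                  self.report()

          def report(self, indent=0):
              pct = ''
              if self.parent is not None:
                  pct = ' (%d%%)' % (100 * self.elapsed / self.parent.elapsed)
              print('%s%s: %.3fs%s' % ('  ' * indent, self.name, self.elapsed, pct))
              for c in sorted(self.children.values(), key=lambda c: -c.elapsed):
                  c.report(indent + 1)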

[1]: [https://github.com/mherrmann/timer-cm](https://github.com/mherrmann/timer-cm)

~~~
mattmanser
I hope I'm not teaching you to suck eggs, but you can use a profiler to do
that rather than changing your code with what are basically print statements.

I don't use Python any more and have never profiled it, but there seem to be
good options:

[https://docs.python.org/2/library/profile.html](https://docs.python.org/2/library/profile.html)

There are reasons to do what you're doing, e.g. Stack Exchange have a mini-
profiler to watch production, but to identify problems in a development
environment, profilers are usually quicker/easier once you've learnt to use
them. They also help you identify problems you didn't even know about.
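
For example, a minimal cProfile sketch (slow_function is just a stand-in workload):

      import cProfile
      import pstats

      def slow_function():
          # Stand-in workload to profile.
          return sum(i * i for i in range(10**6))

      cProfile.run('slow_function()', 'out.prof')     # dump stats to a file
      stats = pstats.Stats('out.prof')
      stats.sort_stats('cumulative').print_stats(10)  # top 10 by cumulative time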

~~~
dom0
A couple of notes about cProfile:

\- The slowdown is difficult to predict and somewhat unstable; never compare
timings from runs with/without cProfile

\- It's a deterministic, _wall-clock_ profiler

\- For visualization, use pyprof2calltree and KCacheGrind (see the sketch
below).

\- Loading a profile file = arbitrary code execution. Don't ever load a
profile file you did not generate yourself.
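
For example, a sketch using pyprof2calltree's Python API (assuming a profile already dumped to out.prof, and that KCacheGrind is installed):

      from pyprof2calltree import convert, visualize

      convert('out.prof', 'callgrind.out')  # write a KCacheGrind-readable file
      visualize('out.prof')                 # convert and launch KCacheGrind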

~~~
szatkus
I find SnakeViz much simpler and more user-friendly.

------
existencebox
If we're talking methods/tooling, I'd throw Heapy in the pile. It's a PITA to
get set up in some environments, but it was the best tool I've used thus far
for answering the question of "what objects are eating my memory".

I'd also, with the _massive_ disclaimer that I'm an MSFTie working for the
Python group, mention that the Visual Studio IDE's Python support has a _mean_
profiler: hotpath, comparative reports, drill-down, pretty pictures.

That all being said, I'd be curious if anyone else would echo my sentiment that
maybe 85% of the Python perf issues I've seen in practice (in both my own code
and others', and even in light of the more sophisticated "best practices" I
often read) come down to one of:

\- linear data structures/algos instead of dicts/sets (see the sketch after
this list)

\- deeply nested loops

\- excessive conversion between "stuff" (which basically falls back to #2)
and/or lack of memoization.
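
To make the first point concrete, a rough sketch of the list-vs-set membership gap:

      import timeit

      items_list = list(range(100_000))
      items_set = set(items_list)

      # Membership tests: O(n) scan for the list, O(1) hash lookup for the set.
      print(timeit.timeit(lambda: 99_999 in items_list, number=1_000))
      print(timeit.timeit(lambda: 99_999 in items_set, number=1_000))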

Tooling aside, it gives me a nice dose of mental conflict re: the utility of
algo knowledge in interviews, as someone who typically rails against it...

~~~
bobosha
> mention that the visual studio IDE's Python support has a _mean_ profiler.
> hotpath/comparative reports/drill down/pretty pictures.

Is that VS Code Python, or the full Visual Studio?

~~~
existencebox
The full Visual Studio. VS Code does debugging, but no profiling yet. (I have
no knowledge as to that particular roadmap, but I can hope.)

------
boltzmannbrain
Some guidelines I like for Python optimization (and on loops specifically):

\- Rule 1: profile first! Only optimize when there is a proven speed
bottleneck; I like using kernprof to profile. When you do go to optimize, test
it in a tight loop with time.clock().

\- Go for built-in functions; you can't beat a loop written in C.

\- Use intrinsic operations. An implied loop is faster than an explicit for
loop, and a while loop with an explicit loop counter is even slower.

\- Avoid calling functions written in Python in your inner loop. This includes
lambdas. In-lining the inner loop can save a lot of time.

\- Local variables are faster than globals; if you use a global constant in a
loop, copy it to a local variable before the loop. And in Python, function
names (global or built-in) are also global constants!

\- Try to use map(), filter() or reduce() to replace an explicit for loop, but
only if you can use a built-in function: map with a built-in function beats
for loop, but a for loop with in-line code beats map with a lambda function!

\- Check your algorithms for quadratic behavior, but notice that a more
complex algorithm only pays off for large N.

More at
[https://www.python.org/doc/essays/list2str/](https://www.python.org/doc/essays/list2str/)
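
To make a couple of these rules concrete, a rough timeit sketch (localizing a method lookup vs. map() with a built-in; absolute numbers will vary by machine):

      import timeit

      def explicit_loop(items):
          out = []
          append = out.append            # localize the method lookup
          for x in items:
              append(str(x))
          return out

      def with_map(items):
          return list(map(str, items))   # map with a built-in: the loop runs in C

      data = range(100_000)
      print(timeit.timeit(lambda: explicit_loop(data), number=50))
      print(timeit.timeit(lambda: with_map(data), number=50))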

~~~
joncampbelldev
This advice seems to really highlight the difference between two kinds of
high-level managed-memory languages: those that fall back to native functions
for performance vs. those that run on a fast VM or compile to native code.

It seems like the former gets some quick performance wins, but ends up unable
to use the language as intended in any remotely hot spot.

~~~
fnord123
An almost free FFI to C is one of Python's best features. Using this great
feature is definitely using it as intended.
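
For instance, ctypes gives you that FFI straight from the standard library (a tiny sketch; find_library may need help locating libc on some platforms):

      import ctypes
      import ctypes.util

      # Call straight into the C library, no extension module required.
      libc = ctypes.CDLL(ctypes.util.find_library('c'))
      print(libc.abs(-5))  # -> 5, computed by libc's abs()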

~~~
joncampbelldev
Only if every Python programmer can easily implement and distribute their own
C functions for every performance hotspot they find in their program.

This is the trade-off I was talking about, you cannot trust the language
itself to be performant, you have to rely on others or jump through extra
hoops. It's certainly one way to ensure you won't spend the time to fix
performance issues until they're really biting you (premature optimisation
quote goes here).

------
sprt
Relevant:
[https://github.com/python/cpython/commit/30d00e54dde47b11f5b...](https://github.com/python/cpython/commit/30d00e54dde47b11f5b338aaba17760b641e1705#diff-aec9f5760d05df1e3c7f6f503e781a11R350)

------
falcolas
I find it very telling that both memory hotspots were explicit loops, and the
fixes were to use implicit looping constructs (range and a list
comprehension). I wonder if the underlying problem was the loop, or the
constant append() calls.

~~~
makmanalp
That's an interesting point - there are competing factors here. The list
comprehension is usually a bit more efficient due to not having to look up the
local variable for the list on every iteration, I think. Take a look at this
(albeit a bit old) article:

[https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loo...](https://wiki.python.org/moin/PythonSpeed/PerformanceTips#Loops)

I imagine that lookup overhead would happen on every iteration, but the list
add only happens when the conditional evaluates to true - so it seems a bit
less likely that the append is the overhead. Then again, maybe the cost of
allocating memory is bad enough (it can bottom out in a syscall, after all)
that it compensates for the infrequency.

Generators would help with this too: instead of materializing and appending
the list at the end every time (i.e. [2] + [x for x in s if x]), you could do
something like itertools.chain([2], (x for x in s if x)). Then you avoid ever
allocating a massive list -- provided that the rest of your code avoids that
too.
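
Something like this, as a sketch of the lazy version:

      import itertools

      s = range(1_000_000)

      # Materializes the whole list up front:
      eager = [2] + [x for x in s if x]

      # Lazy equivalent: nothing is allocated until the consumer iterates.
      lazy = itertools.chain([2], (x for x in s if x))

      for value in lazy:
          pass  # one item at a time; no big list is ever built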

~~~
lilbobbytables
Localizing append helps, but that's not all of it:

Building a list of 10,000,000 items:

      List comprehension: 599ms
      For loop: 930ms
      For loop with append localized: 736ms
~~~
dom0
Comprehensions have their own opcodes which directly call e.g. PyList_Append
with no further checks or lookups in each iteration.
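
You can see this with dis - look for the LIST_APPEND opcode:

      import dis

      # The comprehension compiles to a loop that uses the dedicated
      # LIST_APPEND opcode instead of a generic method call.
      dis.dis('[x for x in s]')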

------
flavio81
Well, in total honesty, the very fact of using PyPy is already a performance
optimization over the regular CPython interpreter.

(That said, this comparison is very interesting: it shows that PyPy will not
always outperform regular CPython; it depends on what you are trying to do and
on whether the JIT has warmed up yet (PyPy is, after all, a JIT compiler)):

[https://lincolnloop.com/blog/speed-comparison-cpython-pypy-pyston/](https://lincolnloop.com/blog/speed-comparison-cpython-pypy-pyston/)

All in all, I'm happy that there are alternatives for speeding Python up, but
I'm a bit worried that the real bottleneck on performance might be the Global
Interpreter Lock (GIL), the elephant in the room that isn't talked about often
enough when discussing Python. Mind you, Python is one of my favorite
languages and I've used it a lot, so no hate here.

There are projects for overcoming the GIL, for example PyPy with Software
Transactional Memory, but they're still work in progress:

[http://doc.pypy.org/en/latest/stm.html](http://doc.pypy.org/en/latest/stm.html)

