The thing I have to do most often is find out where a function spends its time. It's a trivial problem. But there's no good out-of-the-box solution. I ended up creating one [1]. Example:
    from time import sleep
    from timer_cm import Timer

    with Timer('Task') as timer:
        with timer.child('First step'):
            sleep(1)
        for _ in range(5):
            with timer.child('Baby steps'):
                sleep(.5)
Outputs:
    Task: 3.520s
      Baby steps: 2.518s (71%)
      First step: 1.001s (28%)
This way, you know where most of the time is spent and what you need to optimise.
It's on PyPI as well:
    pip install timer_cm
The code is a grand 42 lines. But once you've had to do it a few times it becomes so tedious to code up from scratch that it's much nicer to have a ready-made solution.
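For anyone curious what such a helper involves, here is a minimal sketch of the idea. It is not the timer_cm source: the class name SketchTimer and its report() method are made up for illustration, and the sleeps are shortened so it runs quickly.

```python
import time


class SketchTimer:
    """Hypothetical stand-in for illustration; not the timer_cm source."""

    def __init__(self, name):
        self.name = name
        self.elapsed = 0.0
        self.children = {}  # child name -> SketchTimer

    def __enter__(self):
        self._start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        # Accumulate, so a child entered repeatedly in a loop sums its runs.
        self.elapsed += time.perf_counter() - self._start
        return False  # never swallow exceptions

    def child(self, name):
        # Reuse the same child object when a name is seen again.
        if name not in self.children:
            self.children[name] = SketchTimer(name)
        return self.children[name]

    def report(self):
        lines = [f"{self.name}: {self.elapsed:.3f}s"]
        slowest_first = sorted(
            self.children.values(), key=lambda c: c.elapsed, reverse=True
        )
        for c in slowest_first:
            share = 100 * c.elapsed / self.elapsed if self.elapsed else 0
            lines.append(f"  {c.name}: {c.elapsed:.3f}s ({share:.0f}%)")
        return "\n".join(lines)


with SketchTimer("Task") as timer:
    with timer.child("First step"):
        time.sleep(0.02)
    for _ in range(5):
        with timer.child("Baby steps"):
            time.sleep(0.01)

print(timer.report())
```

The only real design decision is accumulating elapsed time per child name, which is what lets a step inside a loop show up as one line in the report.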
I hope I'm not teaching you to suck eggs, but you can use a profiler to do that rather than changing your code with what are basically print statements.
I don't use Python any more and have never profiled it, but there seem to be good options:
There are reasons to do what you're doing, e.g. Stack Exchange have a mini-profiler to watch production, but to identify problems in a development environment, profilers are usually quicker and easier once you've learnt to use them. They also help you identify problems you didn't even know about.
Interesting point, though again much more complex.
[edit] Anyone reading this: Do not discard cProfile like I did. It just let me shave 900 ms off the startup time of my file manager (mentioned in a child comment). Wow!
I admit they take time up front to learn to use and interpret the results, but it's a valuable skill and worth doing every year or so even on smaller projects to see if there are any easy wins.
Omg, cProfile just let me shave 900ms off the startup-time of my PyQt-based file manager [1]. I would not have found this that quickly without your hint. Thank you!
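For readers who have not tried cProfile yet, a minimal sketch using only the standard library might look like this. The functions task, slow_step and fast_step are made-up placeholders for your own code.

```python
import cProfile
import io
import pstats
import time


def slow_step():
    time.sleep(0.05)  # stands in for the real hotspot


def fast_step():
    return sum(range(1000))


def task():
    slow_step()
    for _ in range(5):
        fast_step()


profiler = cProfile.Profile()
profiler.enable()
task()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

To profile a whole script without touching its code, python -m cProfile -s cumulative your_script.py does the same thing from the command line (your_script.py being whatever you want to measure).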
I really like the simplicity of your solution, but you might be interested in kernprof: https://github.com/rkern/line_profiler It seems to be the nearly-canonical line-by-line profiler for Python, or was some years ago.
Its implementation is much more complex, but I'd recommend trying it as it's trivial to use in practice:
- Find a suspiciously slow function using cProfile
- pip install line_profiler
- Edit function to start with @profile
- Run kernprof
- Take a look at the output and act accordingly
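The steps above can be sketched roughly as follows. The file name slow_module.py and the function count_pairs are made up for illustration; the try/except fallback is my own addition so the file still runs when kernprof is not involved.

```python
# slow_module.py (hypothetical file name)
try:
    profile  # `kernprof -l` injects this name into builtins
except NameError:
    def profile(func):
        # No-op fallback so the script also runs without kernprof.
        return func


@profile
def count_pairs(values):
    # Deliberately quadratic, so the inner lines dominate the report.
    total = 0
    for a in values:
        for b in values:
            if a + b == 10:
                total += 1
    return total


if __name__ == "__main__":
    print(count_pairs(list(range(200))))
```

Running kernprof -l -v slow_module.py then prints per-line hit counts and timings for every function decorated with @profile.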
If we're talking methods/tooling, I'd throw Heapy in the pile. It's a PITA to get set up in some environments but was the best tool I've used thus far for answering the question of "what objects are eating my memory"
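If Heapy is too painful to install somewhere, the standard library's tracemalloc answers a related question (which source lines allocated the most memory). To be clear, this is a different tool from Heapy, shown only as a sketch; the hoard list is a made-up workload.

```python
import tracemalloc

tracemalloc.start()

# Hypothetical workload: many small strings held alive in a list.
hoard = [str(i) * 10 for i in range(50_000)]

snapshot = tracemalloc.take_snapshot()
tracemalloc.stop()

# Group live allocations by source line, largest first.
top = snapshot.statistics("lineno")[:3]
for stat in top:
    print(stat)
```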
I'd also, with the _Massive_ disclaimer that I'm a MSFTie working for the Python group, mention that the Visual Studio IDE's Python support has a _mean_ profiler: hot path, comparative reports, drill-down, pretty pictures.
That all being said, I'd be curious if anyone else would echo my sentiment that maybe 85% of the Python perf issues I've seen in practice (in both my own code and others', and even in light of the more sophisticated "best practices" I often read) come down to one of:
- linear datastructures/algos instead of dicts/sets
- deeply nested loops
- excessive conversion between "stuff" (which basically falls back to #2) and/or lack of memoization.
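Two of those are easy to demonstrate. The sketch below times membership tests on a list against a set, and memoizes a naive recursive function with functools.lru_cache. The sizes and the fib example are arbitrary choices for illustration, and the timings will differ from machine to machine.

```python
import timeit
from functools import lru_cache

haystack_list = list(range(20_000))
haystack_set = set(haystack_list)
needle = 19_999  # worst case for the list: the last element

# Membership is a linear scan on a list and a hash lookup on a set.
list_time = timeit.timeit(lambda: needle in haystack_list, number=200)
set_time = timeit.timeit(lambda: needle in haystack_set, number=200)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")


@lru_cache(maxsize=None)
def fib(n):
    # Without the cache this recursion is exponential in n.
    return n if n < 2 else fib(n - 1) + fib(n - 2)


print(fib(80))
print(fib.cache_info())
```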
Tooling aside, it gives me a nice dose of mental conflict re: the utility of algo knowledge in interviews, as someone who typically rails against it...
On Windows, VTune supports managed code sampling (Java, Python, ...); it basically decodes the internal state of the VM (such as PyEval_FrameEx). The basics work on Linux as well, but as usual setup is more painful due to kernel issues. The most basic analyses do not require kernel support.
It's one of these "Developers, Developers, Developers, Developers, Developers, Developers, Developers" Tools...
PyCharm has a Concurrency Visualizer, which sounds nice, but it only applies to Python locks, excluding the GIL. Oops...
Some guidelines I like for Python optimization (and on loops specifically):
- Rule 1: profile first! Only optimize when there is a proven speed bottleneck; I like using kernprof to profile. When you do go to optimize, test it in a tight loop with time.perf_counter() (time.clock() was removed in Python 3.8) or the timeit module.
- Go for built-in functions; you can't beat a loop written in C.
- Use intrinsic operations. An implied loop is faster than an explicit for loop, and a while loop with an explicit loop counter is even slower.
- Avoid calling functions written in Python in your inner loop. This includes lambdas. In-lining the inner loop can save a lot of time.
- Local variables are faster than globals; if you use a global constant in a loop, copy it to a local variable before the loop. And in Python, function names (global or built-in) are also global constants!
- Try to use map(), filter() or reduce() to replace an explicit for loop, but only if you can use a built-in function: map with a built-in function beats for loop, but a for loop with in-line code beats map with a lambda function!
- Check your algorithms for quadratic behavior, but notice that a more complex algorithm only pays off for large N.
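A few of these guidelines are easy to check for yourself. The sketch below compares four ways of computing the same list; the function names and the words data are made up for illustration. I have not included timings because they vary a lot by interpreter version, and recent CPython releases have narrowed some of these gaps.

```python
import timeit

words = ["alpha", "beta", "gamma"] * 10_000


def explicit_loop():
    result = []
    for w in words:
        result.append(len(w))  # attribute and built-in lookups on every pass
    return result


def local_names():
    result = []
    append = result.append  # bind the method once, outside the loop
    local_len = len         # copy the built-in to a local name
    for w in words:
        append(local_len(w))
    return result


def builtin_map():
    return list(map(len, words))  # the loop itself runs in C


def map_with_lambda():
    return list(map(lambda w: len(w), words))  # a Python-level call per item


for fn in (explicit_loop, local_names, builtin_map, map_with_lambda):
    seconds = timeit.timeit(fn, number=20)
    print(f"{fn.__name__:16s} {seconds:.3f}s")
```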
Only if every Python programmer can easily implement and distribute their own C functions for every performance hotspot they find in their program.
This is the trade-off I was talking about: you cannot trust the language itself to be performant, so you have to rely on others or jump through extra hoops. It's certainly one way to ensure you won't spend the time to fix performance issues until they're really biting you (premature optimisation quote goes here).
I find it very telling that both memory hotspots were explicit loops, and the fixes were to use implicit looping constructs (range and a list comprehension). I wonder if the underlying problem was the loop, or the constant append() calls.
That's an interesting point - there are competing factors here. The list comprehension is usually a bit more efficient due to not having to look up the local variable for the list on every iteration, I think. Take a look at this (albeit a bit old) article:
I imagine that lookup overhead would happen on every iteration, but the list add only happens when the conditional evaluates to true - so it seems a bit less likely that the append is the overhead, but maybe the cost of allocating memory is so bad (it is a syscall after all) that it compensates for the infrequency.
Generators would help with this too: instead of materializing and appending the list at the end every time (i.e. [2] + [x for x in s if x]), you could do something like itertools.chain([2], (x for x in s if x)). Then you avoid ever allocating a massive list -- provided that the rest of your code avoids that too.
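A small sketch of the difference, with a made-up list s standing in for the sieve-like data (zeros being the crossed-out entries):

```python
from itertools import chain

s = [0, 3, 0, 5, 7, 0, 11]  # hypothetical data; falsy entries are skipped

# Eager: builds [2], builds the filtered list, then builds their concatenation.
eager = [2] + [x for x in s if x]

# Lazy: nothing is filtered until something iterates over the chain.
lazy = chain([2], (x for x in s if x))

total = sum(lazy)  # consumes the items one at a time
print(eager, total)

# A generator can only be consumed once; iterating again yields nothing.
leftover = list(lazy)
```

The catch is the last line: if later code needs to iterate twice or index into the result, you are back to materializing a list.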
Well, the fact that they're running it in PyPy makes all of this that much more challenging to debug - you'd have to look at the runtime-generated bytecode to identify the actual causes.
Well, in total honesty, the very fact of using PyPy is already a performance optimization over the regular CPython interpreter.
(Although this comparison is very interesting: it shows that PyPy will not always outperform regular CPython; it depends on what you are trying to do and on whether the process has only just started up (after all, PyPy is a JIT compiler)):
All in all, I'm happy that there are alternatives to speed Python up, but I'm a bit worried that the bottleneck on performance might be the Global Interpreter Lock (GIL), the big white elephant in the room that is never talked about often enough when discussing Python. Mind you, Python is one of my favorite languages and I've used it a lot, so no hate here.
There are projects for overcoming the GIL, for example PyPy with Software Transactional Memory, but they're still work in progress:
[1]: https://github.com/mherrmann/timer-cm