    from time import sleep
    from timer_cm import Timer

    with Timer('Task') as timer:
        with timer.child('First step'):
            sleep(1)        # stand-in for the real work
        for _ in range(5):
            with timer.child('Baby steps'):
                sleep(0.5)  # stand-in for the real work

Output:

    Baby steps: 2.518s (71%)
    First step: 1.001s (28%)
It's on PyPI as well:
    pip install timer_cm
I don't use Python any more and have never profiled it, but there seem to be good options:
There are reasons to do what you're doing (e.g. Stack Exchange have a mini-profiler to watch production), but to identify problems in a development environment, profilers are usually quicker and easier once you've learnt to use them. They also help you find problems you didn't even know about.
- The slowdown is difficult to predict and somewhat unstable; never compare timings from runs with/without cProfile
- It's a deterministic, wall-clock profiler (basic usage is sketched after this list)
- pyprof2calltree & kcachegrind.
- Loading a profile file = arbitrary code execution. Don't ever load a profile file you did not generate yourself.
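For anyone who hasn't used it, a minimal session looks something like this (the function name and file names are made-up placeholders; `python -m cProfile -o out.prof yourscript.py` works too):

    import cProfile
    import pstats

    def slow_function():
        # stand-in workload
        return sum(i * i for i in range(10**6))

    # Dump stats to a file, then inspect the hottest entries.
    cProfile.run('slow_function()', 'out.prof')
    stats = pstats.Stats('out.prof')
    stats.sort_stats('cumulative').print_stats(10)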
Anyone reading this: Do not discard cProfile like I did. It just let me shave 900 ms off the startup time of my file manager (mentioned in a child comment). Wow!
This was on the front page today but flagged as a dupe; it's also an interesting read, as it explains flame graphs too:
I'd also, with the _massive_ disclaimer that I'm a MSFTie working for the Python group, mention that the Visual Studio IDE's Python support has a _mean_ profiler: hot paths, comparative reports, drill-down, pretty pictures.
That all being said, I'd be curious if anyone else would echo my sentiment that maybe 85% of the Python perf issues I've seen in practice (in both my own code and others', and even in light of the more sophisticated "best practices" I often read) come down to one of the following (a quick sketch follows the list):
- linear datastructures/algos instead of dicts/sets
- deeply nested loops
- excessive conversion between "stuff" (which basically falls back to #2) and/or lack of memoization.
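A toy sketch of the first and last points (everything here is made up for illustration):

    from functools import lru_cache

    needles = list(range(1000))
    haystack_list = list(range(100_000))
    haystack_set = set(haystack_list)

    # Linear scan: O(n) per lookup, so the whole thing is accidentally O(n*m).
    hits_slow = [n for n in needles if n in haystack_list]

    # Hash lookup: O(1) average per lookup.
    hits_fast = [n for n in needles if n in haystack_set]

    # Memoization: cache a pure function that gets called repeatedly.
    @lru_cache(maxsize=None)
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    print(len(hits_slow) == len(hits_fast), fib(100))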
Tooling aside, it gives me a nice dose of mental conflict re: the utility of algo knowledge in interviews, as someone who typically rails against it...
It's one of those "Developers, Developers, Developers, Developers, Developers, Developers, Developers" tools...
PyCharm has a Concurrency Visualizer, which sounds nice, but it only applies to Python-level locks, excluding the GIL. Oops...
Is that the Python support in VS Code, or the full Visual Studio?
- Rule 1: profile first! Only optimize when there is a proven speed bottleneck; I like using kernprof to profile. When you do go to optimize, test it in a tight loop with time.perf_counter() (time.clock(), which the old essays recommend, was removed in Python 3.8).
- Go for built-in functions; you can't beat a loop written in C.
- Use intrinsic operations. An implied loop is faster than an explicit for loop, and a while loop with an explicit loop counter is even slower.
- Avoid calling functions written in Python in your inner loop. This includes lambdas. In-lining the inner loop can save a lot of time.
- Local variables are faster than globals; if you use a global constant in a loop, copy it to a local variable before the loop (see the sketch after this list). And in Python, function names (global or built-in) are also global constants!
- Try to use map(), filter() or reduce() to replace an explicit for loop, but only if you can use a built-in function: map with a built-in function beats a for loop, but a for loop with inline code beats map with a lambda!
- Check your algorithms for quadratic behavior, but notice that a more complex algorithm only pays off for large N.
More at https://www.python.org/doc/essays/list2str/
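A small sketch of two of those tips, localizing names and map() with a builtin (the function names are made up; timings vary by interpreter and version):

    import timeit

    def upper_all(words):
        result = []
        append = result.append   # hoist the attribute lookup out of the loop
        upper = str.upper        # ditto for the global/builtin name
        for w in words:
            append(upper(w))
        return result

    def upper_all_map(words):
        # map() with a built-in avoids the explicit Python-level loop
        return list(map(str.upper, words))

    words = ['spam'] * 100_000
    print(timeit.timeit(lambda: upper_all(words), number=10))
    print(timeit.timeit(lambda: upper_all_map(words), number=10))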
There seem to be two camps: languages that fall back to native functions for performance, vs. those that run on a fast VM or compile to native code.
It seems like the former gets some quick performance wins but ends up unable to use the language as intended in any remotely hot spot.
This is the trade-off I was talking about: you cannot trust the language itself to be performant; you have to rely on others or jump through extra hoops. It's certainly one way to ensure you won't spend time fixing performance issues until they're really biting you (insert the premature-optimisation quote here).
I imagine that lookup overhead would happen on every iteration, but the list add only happens when the conditional evaluates to true, so it seems a bit less likely that the append is the overhead. But maybe the cost of allocating memory is so bad (it can bottom out in a syscall, after all) that it compensates for the infrequency.
Generators would help with this too: instead of materializing and appending the list at the end every time (i.e. lst + [x for x in s if x]), you could do something like itertools.chain(lst, (x for x in s if x)). Then you avoid ever allocating a massive list, provided that the rest of your code avoids that too.
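Concretely, something like this (lst and s are stand-in names for whatever the real code uses):

    import itertools

    s = [0, 1, 2, 0, 3]
    lst = ['a', 'b']

    # Materializes a new list on every concatenation:
    combined = lst + [x for x in s if x]

    # Lazily yields the same elements, with no big intermediate list:
    for item in itertools.chain(lst, (x for x in s if x)):
        print(item)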
Building a list of 10,000,000 items:
List comprehension: 599ms
For loop: 930ms
For loop with append localized: 736ms
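A sketch of how such a comparison can be reproduced with timeit (exact numbers will differ by machine and Python version):

    import timeit

    N = 10_000_000

    def build_comprehension():
        return [i for i in range(N)]

    def build_for_loop():
        result = []
        for i in range(N):
            result.append(i)
        return result

    def build_append_localized():
        result = []
        append = result.append   # localize the bound method
        for i in range(N):
            append(i)
        return result

    for fn in (build_comprehension, build_for_loop, build_append_localized):
        print(fn.__name__, '%.0f ms' % (timeit.timeit(fn, number=1) * 1000))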
(This comparison is very interesting, though: it shows that PyPy will not always outperform regular CPython; it depends on what you are trying to do and on whether the process has just bootstrapped, since PyPy is a JIT compiler and needs warm-up):
All in all, I'm happy that there are alternatives for speeding Python up, but I'm a bit worried that the real bottleneck on performance might be the Global Interpreter Lock (GIL), the elephant in the room that isn't talked about often enough when discussing Python. Mind you, Python is one of my favorite languages and I've used it a lot, so no hate here.
There are projects for overcoming the GIL, for example PyPy with Software Transactional Memory, but they're still work in progress:
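For anyone who hasn't hit it first-hand, the GIL's effect on CPU-bound threads is easy to demonstrate with a toy benchmark (numbers vary by machine; the function names are made up):

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def burn(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    def bench(pool_cls, label):
        start = time.perf_counter()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(burn, [2_000_000] * 4))
        print(label, round(time.perf_counter() - start, 2), 's')

    if __name__ == '__main__':
        bench(ThreadPoolExecutor, 'threads:')     # serialized by the GIL
        bench(ProcessPoolExecutor, 'processes:')  # real parallelism, at the cost of pickling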