
Measure code execution time accurately in Python - bknasmueller
https://knasmueller.net/measure-code-execution-time-accurately-in-python
======
hinkley
One of my special skills as a young developer was intuiting that the perf tool
timings for small methods were highly inaccurate (at the time I characterized
this as lying), and in a GC’d language that inaccuracy can be quite amplified.

I found a lot of performance issues that others missed, and there were a
number of quite notable cases with rather extraordinary results. Like a 20%
gain from tweaking (not even deleting) a method measured as taking 5% of run
time. Not the weirdest result I ever got, but a solid number of my top-ten
entries came from this category.

How did I find them? Sorting by invocation count instead of run time. You can
blindly jump into any of these functions and get some useful results, but a
more sophisticated approach is to think first about expected versus actual
results. You think to yourself, “I just ran this task ten times. It has this
much data to deal with. How many calls to this particular function should I
have expected? 250? Then why was it called 759 times?”

What you find next will be architectural problems, or problems of data flow
through the system (while some, including me, would argue that I just repeated
myself, not everyone accepts that).

The fastest methods are the ones you never call, and I’m surprised at how
frequently people ignore call count even though it’s often right there in the
results. Maybe because it’s often the last value displayed, as if to say “This
is barely worth mentioning.”
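
For anyone who wants to try this in Python, a minimal sketch using the stdlib
profiler, sorting by call count instead of time (`workload` is a hypothetical
stand-in for the real task):

    import cProfile
    import pstats
    from pstats import SortKey

    def workload():  # hypothetical stand-in for the real task
        sum(i * i for i in range(10_000))

    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(10):
        workload()
    profiler.disable()

    # Sort by number of calls so suspiciously frequent functions
    # float to the top of the report.
    pstats.Stats(profiler).sort_stats(SortKey.CALLS).print_stats(10)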

~~~
zimbatm
That's great advice.

Another one is to only look at memory allocations. Usually when there is a lot
of allocations, a lot of data is being handled. This gives a bit of a
different perspective where you start thinking of folding chunks of data
together.

The nice thing about those two approaches is that they don't vary across CPUs
and architectures.
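
In Python, the stdlib's tracemalloc gives exactly this allocation view; a
minimal sketch (`workload` is a hypothetical placeholder):

    import tracemalloc

    def workload():  # hypothetical placeholder
        return [str(i) for i in range(100_000)]

    tracemalloc.start()
    workload()
    snapshot = tracemalloc.take_snapshot()

    # Show the source lines responsible for the most allocation.
    for stat in snapshot.statistics("lineno")[:5]:
        print(stat)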

~~~
hinkley
Certainly. When a quick function allocates lots of memory, the slowdown gets
blamed on the functions that follow. In purely sequential code this leaves
some clues behind. With event-based or concurrent code these problems can be
more elusive.

Cache invalidation can follow a similar misattribution pattern.

When you do succeed in making an improvement, the % of time for everything
else should increase: shave 10% off the total and something that was 10% of
the old run time becomes 10/90 ≈ 11% of the new one. Look back at the old
results and see if anything else went up disproportionately, and see if you
can understand why.

Related to that notion, and as a parting thought, since this is getting some
love: game and A/V programmers understand something the rest of us do not. And
that is a computation/time budget. It doesn’t matter if one thing takes four
times as long as the other if they both take too long. You haven’t fixed “the
problem” until you’ve fixed them all. Late is late.

There’s another “systematic” approach to making very large perf improvements
that doesn’t fix problems from most to least significant, but by functional
unit. Fix all of the problems in that functional unit even if it’s 1% of the
total (if you’re after a 5x speed gain, the final runtime is only 20% of the
original, so that 1% is actually 5% of the total budget).
If you make large scale changes you are going to have to do some expensive
regression testing on this part of the code, tests or no tests. Once you’ve
paid that price once, you will never go back for that last 2% because you
cannot justify it to your organization. You’re stuck with those in ten different
functional areas, forever. What you _can_ justify is boosting a raft of work
from 18% to a 20% overall improvement.

------
oefrha
No mention of PSL timeit?
[https://docs.python.org/3/library/timeit.html](https://docs.python.org/3/library/timeit.html)

Also, quote from the timeit page:

> Note: It’s tempting to calculate mean and standard deviation from the result
> vector and report these. However, this is not very useful. In a typical
> case, the lowest value gives a lower bound for how fast your machine can run
> the given code snippet; higher values in the result vector are typically not
> caused by variability in Python’s speed, but by other processes interfering
> with your timing accuracy. So the min() of the result is probably the only
> number you should be interested in. After that, you should look at the
> entire vector and apply common sense rather than statistics.
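
As a concrete sketch, `timeit.repeat` plus `min()` follows that advice
directly (the timed statement is an arbitrary example):

    import timeit

    # Five repetitions of one million runs each; report the fastest.
    results = timeit.repeat("sorted(range(100))", repeat=5, number=1_000_000)
    print(min(results))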

~~~
hyperpape
It's technically true that the lowest value will approach the lower bound of
how fast your machine can run the code, but there's no guarantee that number
is valuable. If there is any significant non-determinism in the execution, the
minimum time will be a value that you'll only expect to reach in a fraction of
executions, even if there are no other processes contending for resources.

The only real solution for avoiding noise is to control its sources.

Reference here:
[https://tratt.net/laurie/blog/entries/minimum_times_tend_to_...](https://tratt.net/laurie/blog/entries/minimum_times_tend_to_mislead_when_benchmarking.html).

~~~
oefrha
If there's significant nondeterminism in the execution, the method discussed
in TFA is also worthless. It's pretty clear determinism is the basic
assumption of this discussion; no one's arguing that you should use the
minimum execution time among random cases as the performance characteristics
of timsort, for instance.

Of course, caching of deterministic code complicates matters too.

~~~
hyperpape
I would have to spend a long time studying the relevant Python implementation
to think I knew whether there was nondeterminism present in a given benchmark.
That's a tall order.

~~~
oefrha
Now that I think about it, you’re right that randomized hashing in Python
itself could cause nondeterminism in seemingly deterministic code. However,
timeit does mitigate that to some extent by repeating the same code a large
number of times — 1 million times by default, and it’s better used that way.
Of course, depending on the actual code, it could be the case that all 1M
invocations are skewed by the specific choice of PYTHONHASHSEED; that is not
something you can solve in-process, so there's no panacea, and you must ditch
the idea of getting a single stable number in that case.
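
One workaround is to pin or vary the seed across separate processes, e.g. with
the timeit CLI from a shell (the hash-heavy expression is an arbitrary
example):

    PYTHONHASHSEED=0 python -m timeit "set(str(i) for i in range(1000))"
    PYTHONHASHSEED=1 python -m timeit "set(str(i) for i in range(1000))"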

------
__s
Victor Stinner (python dev) wrote a good series on trying to get stable
benchmarks in python:

[https://vstinner.github.io/journey-to-stable-benchmark-system.html](https://vstinner.github.io/journey-to-stable-benchmark-system.html)
[https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html](https://vstinner.github.io/journey-to-stable-benchmark-deadcode.html)
[https://vstinner.github.io/journey-to-stable-benchmark-average.html](https://vstinner.github.io/journey-to-stable-benchmark-average.html)

------
poooogles
Eurgh. For starters, using `time.time()` is a bad idea because it calls
gettimeofday, which is susceptible to clock changes. `time.perf_counter` is
what you want to use: it can never go backwards and it always uses the
highest-resolution clock available.
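
A minimal sketch of the usual pattern (`workload` is a hypothetical
placeholder):

    import time

    def workload():  # hypothetical placeholder
        sum(range(1_000_000))

    start = time.perf_counter()
    workload()
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.6f} s")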

~~~
gmrple
Next, if you really want to time something fast, you should probably know
about cache. In a modern computer there are levels of memory; for now we’ll
say there are three: cache, RAM, and a hard drive. Cache is small but very
fast, and it is the closest level to the CPU. RAM is bigger but slower, and
the hard drive is much, much slower. Because the cache is small, the computer
has to pick what goes in it. At program load the cache will be full of
information related to whatever was running before you. After you run, you
will have left some information in the cache which will not need to be fetched
from slower levels of memory.

Unless your use case is to run your test code repeatedly in quick succession,
you should be wary of timing multiple calls in fast sequence, as the later
calls will execute faster due to the cache being primed for your code.
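
A common mitigation, if warm-cache numbers are what you actually want, is an
explicit warmup phase; a sketch (`fn` is whatever you are measuring):

    import time

    def bench(fn, warmup=10, runs=100):
        # Prime caches, branch predictors, etc. before measuring.
        for _ in range(warmup):
            fn()
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            fn()
            times.append(time.perf_counter() - start)
        return times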

~~~
hyperpape
Great point. In addition to caching, there's potential for branch prediction
effects.

This is why the JMH samples warn against including loops[0] of the operations
you're trying to measure, and have an API designed to keep the programmer from
ever having to write their own looping code.

[0] [https://hg.openjdk.java.net/code-tools/jmh/file/7bc6011260aa/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_11_Loops.java](https://hg.openjdk.java.net/code-tools/jmh/file/7bc6011260aa/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_11_Loops.java)

~~~
stygiansonic
Great point about using JMH. As of JDK 12, a JMH-based microbenchmark suite is
included in the JDK sources:
[https://openjdk.java.net/jeps/230](https://openjdk.java.net/jeps/230)

What’s the equivalent of JMH for Python?

~~~
hyperpape
I'm afraid I don't know, I really don't have any particular knowledge of
performance engineering in Python.

------
cscheid
I don't know how else to put it, but the technique suggested in this paper is
irreparably incorrect in the setting that it recommends. Attempting to fix a
linear least squares fit by removing points with high residuals does not work.
Please don't do that: they won't necessarily be the outliers in the dataset
and your model will converge to the wrong thing.

Statistics is hard. Like, really really hard. Stuff goes wrong all the time.
Please leave it to the experts.

If you're going to do this, please look up the methods behind (for example)
robust least squares, outlier detection, L_1 regression, etc. The right way to
do this is to start with a small dataset that is, with very high probability,
free of outliers, and slowly grow it, never adding points which have large
residuals. (If you've done 3D scan registration or image alignment, this is
what RANSAC does.)

The principle is, intuitively, that once an out-of-distribution point gets
into your linear model, the model is poisoned forever. You can't trust the
model to tell you that the bad points are the outliers. The way this paper
does it is irreparably broken, sorry :/
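
To make the contrast concrete, a rough numpy sketch of the grow-from-a-clean-
core idea. It is illustrative only: the seeding heuristic (taking the lowest
time-per-repetition points as the least-contaminated core, since interference
only ever adds time) and the threshold constant are assumptions, not a
published method:

    import numpy as np

    def grow_inlier_fit(x, y, seed=5, k=3.0, max_iter=20):
        # Fit on a small core assumed outlier-free, then grow it,
        # never admitting points with large residuals.
        idx = np.argsort(y / x)[:seed]  # heuristic seed; assumes x > 0
        for _ in range(max_iter):
            slope, intercept = np.polyfit(x[idx], y[idx], 1)
            resid = np.abs(y - (slope * x + intercept))
            scale = np.median(resid[idx]) + 1e-12
            grown = np.flatnonzero(resid < k * scale)
            if len(grown) <= len(idx):
                break
            idx = grown
        return slope, intercept, idx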

~~~
colmmacc
There's another reason too: there are strong reasons why the timing of
successive code executions should not be linear. Typically the first call to a
function will incur the cost of fetching data into a cache; at minimum, the
function's basic blocks will be loaded through the L1/L2/L3 cache hierarchy.

2, 3, 4 function calls in a row will benefit from that caching, but will also
help train the speculative execution engine. At some point that speculation
may radically alter performance. Later, with a sufficient number of runs, the
timing will also suffer from preemption, context switches, or other
interruptions that can invalidate both the caches and the speculative state.

The idea that there is a fixed performance time for a function is naive on
modern hardware.

~~~
cscheid
You make a good point I hadn't considered. In that case, not even robust least
squares will save you, because the samples are not independent of one another.
You'd have to start looking into explicit time-dependent methods and yikes, I
have no idea of what the literature says about time dependence _and_ outliers.
I wouldn't be surprised if it's "here be dragons" territory.

~~~
gdxhyrd
You are speaking as a stats person. :)

While there are dozens of things that can go south in benchmarking, how to
solve them is not about applying advanced stats, but about understanding and
eliminating the sources of noise.

Most benchmarks in CS can be done quite easily as long as one understands all
the technology stack.

~~~
cscheid
I'm not a stats person! But I have been burned in real projects by doing the
wrong thing. I've just learned, the hard way, to honestly appreciate it.

------
pletnes
It’s good to know about turbo boost when measuring CPU-bound code. When the
CPU gets hot, the clock drops, sometimes by a large factor; when it’s cooler
and the OS wants to run code, the clock goes back up. Turbo boost can be
disabled in the BIOS for a more stable result. On Windows 10 you can monitor
the CPU clock in Resource Monitor to see how big a deal the issue is. It
varies by CPU model, temperature, the cooling of your particular machine, etc.

~~~
kqr
...but obviously only disable it if it is disabled in the environment which
you want to predict. Don't try extrapolating from an environment in which it
is disabled to one in which it is not, and vice versa.

~~~
pletnes
Not really true. If the particular computer you’re running the code on is warm
due to weather conditions, which affects clock frequency, the measurement
doesn’t tell you anything about the performance of your code itself.

------
spankalee
Isn't figuring out the slope a red herring? If you time k repetitions for
k = 1..n, the total run time across all measurements is:

    n * timing_overhead + n * (n + 1) / 2 * runtime

Seems like you'd do better by measuring n^2 runs and dividing by n^2. But
that's not very good either, because it doesn't take the observed variance
into account. What you really want to know is something like the 95%
confidence interval after observing n runs.

My teammate has been working on a library to do just this for JavaScript, with
added automation for running a set of benchmarks until some set of thresholds
is met, like A being 95% likely faster than B.

[https://www.npmjs.com/package/tachometer](https://www.npmjs.com/package/tachometer)

~~~
kqr
> What you really want to know it something like the 95% confidence interval
> after observing n runs.

Great in theory, but this almost always relies on a normality assumption,
which just doesn't hold in general for performance measurements.

How well does this package handle Pareto-distributed run times?

~~~
mattkrause
Though this particular package doesn't, you can certainly derive confidence
(or better, credible) intervals for non-normally distributed data.
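
For instance, a percentile bootstrap makes no normality assumption; a minimal
sketch (not taken from the package above):

    import numpy as np

    def bootstrap_ci(samples, stat=np.median, n_boot=10_000, alpha=0.05):
        # Resample with replacement, recompute the statistic each time,
        # and take empirical quantiles of the results.
        samples = np.asarray(samples)
        rng = np.random.default_rng()
        idx = rng.integers(0, len(samples), size=(n_boot, len(samples)))
        stats = stat(samples[idx], axis=1)
        return np.quantile(stats, [alpha / 2, 1 - alpha / 2])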

------
longemen3000
[https://arxiv.org/abs/1608.04295](https://arxiv.org/abs/1608.04295) This
paper is the statistical basis of time measurement in Julia. It is, at the
least, a good read when trying to seriously measure the execution of some
program at a low level. It also uses a logistic function as a predictor to
estimate the number of evaluations.

------
cbsks
Off topic, but why is there a pop up on the site that says “This website uses
cookies to ensure you get the best experience on our website”? What is the
“best experience” that I am missing out on by refusing to allow cookies? I
didn’t think cookies on a blog were useful to anyone except the website owner
for tracking, but maybe I am missing something?

~~~
userbinator
One thing I've learned over the years is that "best experience" is a
euphemism for "what we want to force down your throat".

------
rcthompson
So the "algorithm" is just... running the code different numbers of times and
then using least squares linear regression? Or am I missing something?
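
A minimal sketch of that scheme, for reference (`fn` is whatever you want to
measure):

    import time
    import numpy as np

    def per_run_time(fn, max_reps=50):
        # Time k repetitions for k = 1..max_reps, then fit a line:
        # the slope is the per-run time, and the intercept absorbs
        # the fixed timing overhead.
        reps = np.arange(1, max_reps + 1)
        totals = []
        for n in reps:
            start = time.perf_counter()
            for _ in range(n):
                fn()
            totals.append(time.perf_counter() - start)
        slope, intercept = np.polyfit(reps, totals, 1)
        return slope, intercept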

------
speedplane
I used to develop in C and C++. Back then, the software ran on a single
machine, and the most common performance issues we faced came from
unintentionally introducing some O(N^2) complexity, usually within a deep call
stack.

Today I develop mostly in Python on a web backend system. Now, the most common
performance issues I see are making multiple network calls when one would
suffice (e.g., not caching repeated database queries). I have no doubt that
there are many hidden O(N^2) issues lurking in the python code, but because
network calls are so expensive, they seem to dominate optimization efforts.

I suspect that I'm not alone here, as today there are many more I/O
performance tools than there were in the past.

------
jniedrauer
I haven't used Python full time in a few years, but it seems like this has
been a solved problem for a while with timeit[0]. Am I missing something?

[0]:
[https://docs.python.org/3/library/timeit.html](https://docs.python.org/3/library/timeit.html)

------
orf
Just use the timeit module, or the %timeit IPython magic command.

------
xurukefi
You could also do the following to compute the overhead of the timing function
itself:

    import time

    time_start = time.perf_counter()
    time_end = time.perf_counter()
    timing_overhead = time_end - time_start  # cost of back-to-back timer calls
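
For context, the stdlib can also report the clock's advertised resolution:

    import time
    print(time.get_clock_info("perf_counter").resolution)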

~~~
fleur-de-lotus
I initially thought it was an entry for a Rube-Goldberg contest.

------
luord
We made measurements this way in my physics I lab classes. Guess there's no
reason why it wouldn't apply to software.

------
cmonnow
how about a simple average?

------
mastrsushi
20 minutes

------
rasz
It's Python! The moment you need to measure the precise execution time of
functions is the moment you should be looking at a rewrite in C, C++, Rust,
maybe even Java.

