
When optimising code, never guess, always measure - ColinWright
http://www.solipsys.co.uk/new/WhenOptimisingCodeMeasure.html?HNri05
======
hyperpape
I've never really spent any time reading Python bytecode, but disassembly
suggests the difference could be:

    
    
        f3
        ------
        92 LOAD_FAST                2 (a2)
        94 LOAD_FAST                3 (c)
        96 BINARY_ADD
        98 LOAD_FAST                3 (c)
        100 LOAD_CONST              1 (2)
        102 BINARY_ADD
        104 ROT_TWO
        106 STORE_FAST              2 (a2)
        108 STORE_FAST              3 (c)
    
        f4
        ------
        92 LOAD_FAST                2 (a2)
        94 LOAD_FAST                3 (c)
        96 INPLACE_ADD
        98 STORE_FAST               2 (a2)
        100 LOAD_FAST               3 (c)
        102 LOAD_CONST              1 (2)
        104 INPLACE_ADD
        106 STORE_FAST              3 (c)
    

And perhaps the culprit is that INPLACE_ADD requires more work than
BINARY_ADD:
[https://stackoverflow.com/questions/15376509/when-is-i-x-different-from-i-i-x-in-python](https://stackoverflow.com/questions/15376509/when-is-i-x-different-from-i-i-x-in-python)
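
The linked question is about when `i += x` genuinely differs from
`i = i + x`; a minimal sketch of that semantic difference, using a list
since for ints the two give identical results (INPLACE_ADD just has to look
for an __iadd__ first, which is the extra work):

    
    
        a = [1, 2]
        b = a
        b += [3]       # INPLACE_ADD: list.__iadd__ mutates in place
        print(a)       # [1, 2, 3] - a sees the change
    
        a = [1, 2]
        b = a
        b = b + [3]    # BINARY_ADD: builds a brand-new list
        print(a)       # [1, 2] - a is untouched
    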

I strongly doubt that it's running on multiple cores. Python doesn't
implicitly schedule any user code on separate threads, and if it did, the
coherence cost of doing so would swamp any benefit of doing two simple
operations in parallel.

Disassembly is as simple as:

    
    
        import dis
    
        # Dump the disassembly of each implementation to a file
        with open('f3', 'w') as f:
            dis.dis(factor_fermat3, file=f)
    
        with open('f4', 'w') as f:
            dis.dis(factor_fermat4, file=f)
    

Edit: Whoops, I see that I used Python 3, while the post was about 2.7.4, and
there's some wrangling over the differences below. I reran the disassembly
under 2.7.4. The bytecode was more verbose, but differed in the same way.

~~~
bazizbaziz
This comment is excellent. The title of the original post should be: "When
optimising code, never guess, always read the bytecode/assembly."

Without actually reading the assembly/bytecode/etc., you end up speculating
about silly things like 'the two evaluations and assignments can happen in
parallel, and so may happen on different cores'.

~~~
ColinWright
Indeed, I was unduly influenced by the code I was writing in the late 80s and
early 90s, for languages with multiple assignment like this, where the
assignments really did run on different processors. You say it's a silly
thing, but we used to do it - things have changed.

 _Added in edit: The magic term is "execution unit" not "core". As I say,
things have changed, and the bundling of multiple execution units into each
core, and multiple cores into each processor, is different in interesting and
subtle ways from the situation I used to code, where I had a few hundred, or a
few thousand, processors in each machine, but the individual processors were
simpler._

~~~
bazizbaziz
I didn't mean to say this was a silly thing to do - most modern processors
execute instructions out of order on multiple ALUs.

The problem is that the abstraction layer between the python code in question
and the processor's instruction stream is so thick that it's hard to say one
way or the other that the processor is indeed executing that _particular_ pair
of instructions in parallel. It's definitely executing many instructions out
of order, but it's unclear (without inspection of the python interpreter and
its assembly) what's happening at the machine level.

Looking at the bytecode of the Python program at least begins to tell us that
the python bytecode of the two versions is fundamentally different, which
could account for the performance difference. Although, what exactly makes the
material difference is also under debate elsewhere in the thread. :)

------
kristofferc
A nice thing with using a compiled language (like Julia) is that you typically
do not need to worry about this type of thing. Using the identical code in
the blog post in Julia (with some trivial syntactic modifications,
[https://pastebin.com/RGYunpF4](https://pastebin.com/RGYunpF4)) we not only
get a ~500x speedup but the speed difference is pretty much negligible between
all the different implementations:

    
    
        factor_fermat0:  1.599 ms
        factor_fermat1:  1.488 ms 
        factor_fermat2:  1.387 ms 
        factor_fermat3:  1.421 ms 
        factor_fermat4:  1.422 ms 
    

Of course, you should still measure when you optimize, but dealing with issues
such as whether a=a+1 or a+=1 is faster seems annoying. At that point, you are
benchmarking the language and not your own code.

~~~
ColinWright
I was surprised to see that your version of 0 ran nearly as fast as the
others, despite the multiple calls to the square root. Then I realised that
you're using the library sqrt routine, and I wonder if that would be accurate
enough for the case I'm usually in where my numbers have hundreds of digits. I
don't know what algorithm Julia uses for that, but I've found floats (and
their friends) insufficiently accurate.
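
To make that concrete, here's the kind of failure I mean, sketched in Python
(the post's language) since I don't know the Julia equivalent - a float
square root silently losing the low digits of a large integer:

    
    
        import math
    
        n = (10**30 + 1) ** 2        # a perfect square, 61 digits
        r = int(math.sqrt(n))        # float sqrt keeps only ~16 digits
        print(r * r == n)            # False - the root came back wrong
    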

But that's useful feedback - thank you.

~~~
kristofferc
If you need higher precision you could always either simply bump up the
accuracy a bit using e.g.
[https://github.com/JuliaMath/DoubleDouble.jl](https://github.com/JuliaMath/DoubleDouble.jl)
or you can use the arbitrary precision numbers in Julia
([https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/index.html#Arbitrary-Precision-Arithmetic-1](https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/index.html#Arbitrary-Precision-Arithmetic-1))

The performance will of course change accordingly.

~~~
ColinWright
Is there an arbitrary precision square root routine? I've not found it ...

~~~
kristofferc
The concrete routine that is dispatched to is based on the type of the input
number.

So `sqrt(2.0)` will, in the end, call the assembly instruction (depending on
your specific CPU) `vsqrtsd`, since the input type is a 64-bit float;
`sqrt(Float32(2.0))` will end up calling `vsqrtss`; and `sqrt(big"2.0")` will
call the sqrt from the arbitrary-precision library (MPFR), since we now have
a "BigFloat" number, etc.

Directly from a Julia REPL session:

    
    
        julia> sqrt(2.0)
        1.4142135623730951
    
        julia> sqrt(Float32(2.0))
        1.4142135f0
    
        julia> sqrt(big"2.0")
        1.414213562373095048801688724209698078569671875376948073176679737990732478462102
    
        julia> setprecision(512) # Bump the precision
        512
    
        julia> sqrt(big"2.0")
        1.41421356237309504880168872420969807856967187537694807317667973799073247846210703885038753432764157273501384623091229702492483605585073721264412149709993586

~~~
ColinWright
Yes, but I'm talking about getting exact BigInt ceil square roots of BigInt
inputs. I've not found anything about whether that is supported natively or in
a library.
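
(In Python terms - again, the post's language - what I'm after is the
ceiling analogue of math.isqrt, which is exact for arbitrary-size ints on
3.8+; something like:)

    
    
        from math import isqrt   # exact floor square root of any int, 3.8+
    
        def ceil_isqrt(n):
            """Smallest integer r with r*r >= n."""
            r = isqrt(n)
            return r if r * r == n else r + 1
    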

------
linsomniac
I spent much of the Need for Speed sprint on Python doing benchmarking between
2.4 and the 2.5 alpha that we were trying to speed up. It was significantly
slower than 2.4 at pybench.

Measuring performance is hard. It's easy when there are big differences, like
the new exceptions in 2.5 were 60% slower. But small things can add up, and
it's hard to measure them.

One piece of wisdom I got from Tim Peters: Run short benchmarks. The longer
you run them the more likely they are to be influenced by context switches and
the like. I used to always run long benchmarks hoping to push these changes
down into the noise. Instead I started running short tests and looking at the
fastest of the runs.
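
In Python that advice maps naturally onto timeit.repeat; its docs make much
the same point, that the min() of the runs is usually the only number worth
looking at:

    
    
        import timeit
    
        # Best of many short runs: the fastest run is the one least
        # disturbed by context switches and other noise.
        times = timeit.repeat("sum(range(1000))", repeat=20, number=1000)
        print("best of 20 runs: %.6fs" % min(times))
    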

Another area of speed improvement though was less about A/B testing and more
about thinking hard about where time is used and how to just get rid of it
entirely. Some use cases really suffered because in Python strings are
immutable. If you are reading from the network and, say, looking for the end
of a data frame, you may create a lot of new objects as you append incoming
data. IIRC, a "bytearray" was created that was basically a mutable string to
get rid of that problem entirely.
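
The difference is easy to sketch (chunk sizes here are arbitrary, and exact
timings vary a lot between CPython versions):

    
    
        import timeit
    
        def with_bytes(chunks):
            buf = b""
            for chunk in chunks:
                buf += chunk        # may build a brand-new object each time
            return buf
    
        def with_bytearray(chunks):
            buf = bytearray()
            for chunk in chunks:
                buf.extend(chunk)   # mutates in place
            return bytes(buf)
    
        chunks = [b"x" * 100] * 10000
        print(timeit.timeit(lambda: with_bytes(chunks), number=10))
        print(timeit.timeit(lambda: with_bytearray(chunks), number=10))
    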

------
doomslice
Also remember that the maximum speedup your program can get from a particular
optimization is bounded by the fraction of total time your program spends
executing that part.

You can speed up an algorithm 1000-fold, but if it only accounts for 1% of
your total execution time, you've only gained about a 1% speedup.
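
That's Amdahl's law; a quick sanity check of the arithmetic:

    
    
        # Amdahl's law: overall speedup when a fraction p of the runtime
        # gets a local speedup of s.
        def overall(p, s):
            return 1.0 / ((1.0 - p) + p / s)
    
        print(overall(0.01, 1000))   # ~1.0101, i.e. about a 1% overall gain
    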

~~~
paulddraper
Exactly. Carrying the logic a step further, that means you should focus first
on the longest part of the program. And confidently knowing the longest part
of the program requires measuring it. QED

~~~
taeric
Not necessarily longest, though. Right? Most executed would also suffice.

Beware measuring, though. Sometimes, just thinking about the problem ahead of
time can clearly expose where you are spending time. At least, it can give a
hypothesis to test for where you are spending time. With measurements to
confirm/refute. :)

~~~
paulddraper
> Not necessarily longest, though. Right? Most executed would also suffice

"Longest total." I don't care if its a big slice of pie or a lot of little
slices. But it better be one of those.

> Sometimes, just thinking about the problem ahead of time can clearly expose
> where you are spending time.

That's certainly true. But 99% of people overestimate their ability to do
this. And are surprised by the results.

~~~
CarolineW
So when you say "longest" you don't mean "longest section of code", I'm
guessing you mean "section of code in which execution spends the longest
time".

If so, good, but it wasn't clear to me that that's what you meant. If you mean
something else then I don't know what you mean at all.

~~~
paulddraper
"Longest" seemed clear to me. (Over "largest" and "slowest".)

~~~
PeterisP
IDK, "longest part of code" to me _clearly_ seems to refer to length of the
code (i.e. lines of code), not to length of its execution time; so I'd say
that there definitely seems to be some confusion caused by the choice of
words.

~~~
taeric
Just to further these two posts, that is why I asked my question at the top
here. "Hottest" code, I thought, was already well established for most
executed. I had never seen "longest" for anything.

To that end, I was taking it to mean the longest "synchronous" path through
your system. Not necessarily a single method, by any measure. But systems
have plenty of what I will call "checkpoints", where code can be
restarted/rerun with no hard-to-recover penalty. That is what I took
"longest" to mean.

------
ColinWright
This was prompted by unexpectedly finding that in Python 2.7.4, the fragment:

    
    
        a2, c = a2+c, c+2
    

ran faster than the fragment:

    
    
        a2 += c
        c += 2
    

Context and timings in the post. Speculation welcome as to why this might be
true, and your predictions for Python 3 invited.
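
If you want to reproduce the comparison without the full program, a rough
sketch with timeit (the iteration count and starting values are arbitrary
here, so absolute numbers will differ from the post's):

    
    
        import timeit
    
        setup = "a2 = 12345; c = 67890"
    
        print(timeit.timeit("a2, c = a2+c, c+2", setup=setup, number=1000000))
        print(timeit.timeit("a2 += c\nc += 2", setup=setup, number=1000000))
    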

~~~
snovv_crash
If you have performance bottlenecks, Python is the wrong language. The only
place where this kind of investigation makes sense is if you're a Python VM
dev.

~~~
ColinWright
> _The only place where this kind of investigation makes sense is if you're a
> Python VM dev._

You are mistaken - this investigation does make sense in my context.

~~~
snovv_crash
If you're having Python bottlenecks, it is the wrong language. First try Pypy,
and if that doesn't/can't work switch to Julia/Go/R/C++ as your domain
requires.

~~~
ColinWright
You are right when you say:

> _If you're having Python bottlenecks, it is the wrong language._

But you are missing the point. The object of the underlying exercise was not
to try to make this specific instance of the algorithm run fast; Python was
not the bottleneck. The activity was exploring the effects of algorithmic
changes and investigating the Big-O performance of the different variants.
As part of that, the underlying code was being reworked and modified to allow
the larger algorithmic changes to be made, and this was one of those changes.
It says so in the post:

> _Then I made a small change to the code layout ready for another shot at an
> optimisation._

As always I ran the timing tests (in addition to the functional tests) to make
sure nothing substantial had changed, and there was an anomaly which attracted
my attention and piqued my curiosity. Hence the write up.

And I believe you are wrong when you say that this sort of thing is, or should
be, only of interest to Python VM Devs. I think that curiosity about this sort
of thing is, or should be, important to everyone. Perhaps you will then decide
you don't have time to pursue it, or that your time is better spent elsewhere,
and that's fair enough, but curiosity and a desire to learn have been a
pervasive theme among the best devs I've ever worked with.

So again, you are wrong, this investigation does make sense in my context.
Your mistake is in assuming the post is about a Python bottleneck. As I say,
you've missed the point.

~~~
snovv_crash
The issue is that this investigation is so biased by what is slow in
specifically the Python VM that I'm not sure what the findings would mean in
a case where performance is actually something you are trying to optimize
for. The moment you translate this into C, C++, Java, etc., you suddenly have
a smart compiler that can handle things like function inlining, to the point
where cache sizes and memory streaming speed present as the bottlenecks
instead of object-to-object pointer jumping and bytecode interpreter
overhead.

Languages don't get slower on a single dimension, and computers have multiple
dimensions of performance. Hitting a performance bottleneck in one dimension
on one language doesn't mean you'll hit the same bottleneck on a different
language. Different scales might have different bottlenecks as well - think
of exceeding the L2 cache, or finally using all your GPU cores on problems
that parallelize with scale. If you're using Python you'll never hit those
other bottlenecks, because the VM is always 40x slower than a naive native
implementation.

Unless, of course, your bottleneck is in native code, which is where the easy
prototyping of Python really shines.

------
wruza
In fact, this shows that “when optimizing code, know what your asymptotes
are”. There is a more-than-order-of-magnitude - and _trivial_ to obtain -
difference between 34 and 1.2, but the conclusion bases itself on a silly 0.1
which plays no sensible role at all.

People take advice like “don't optimize”, and then in 2030 someone patches
system updates with 300 bytes so they install at roughly SSD write speed
instead of a sudden 1.5 hours. And everyone is like: wow, how did we overlook
that? You never hired an engineer, that's how.

~~~
ColinWright
Your comment shows that you've missed the point. You're assuming that,
because I'm doing this kind of micro-optimisation which is, in the grand
scheme of things, irrelevant, I'm concentrating on the wrong thing.

You say:

> _the conclusion bases itself on silly 0.1 which plays no sensible role at
> all._

You've missed the point. This specific code change is not the point for the
task overall. This specific code change was preparation for a significant
algorithmic shift that was about to happen. On the way I happened to notice
something odd, and I thought people would be mildly curious, so I wrote it
down for people to see.

Please don't assume that I'm clueless about the wider picture.

------
maxxxxx
That's good advice that should be repeated over and over. I so often see
people not writing clean code because clean code is considered inefficient,
when they don't have any real data.

Especially when you deal with optimizing compilers, it's often totally
counterintuitive where the real bottlenecks are.

Obviously you shouldn't be stupid and always shout "premature optimization",
but make it a habit to profile from time to time and learn from it.

------
ncmncm
It looks to me like you found out that Python is slow in even more ways than
we knew. But we knew that. That is another cost of such a language -- that
even things that rationally look faster are slower.

But machine code, and thus optimized compiled code, has moved into the same
space -- the machine instructions are interpreted, too, now. Speeds are a
thousand times more, but things affecting speed are way more complex, so
reasoning is similarly compromised. It's a deal with the devil: in exchange
for typical speedups of an order of magnitude or more, you lose the ability to
tell what is making it slow, and even whether it is slow at all. What does
slow mean, now? Only that another program is faster. But you need to have
found it to know.

On modern chips (Haswell and later), a loop of only four instructions -- two
comparisons, two branches -- which may run for a microsecond or for months
depending on input, may take time t or 2t depending on whether the compiler
guesses right about which way one of the tests goes. It happens that gcc
tends, systematically, to guess wrong on these. The difference comes down to
whether the loop takes one cycle or two per typical iteration.

Using __builtin_expect() can overpower the compiler's guess, but who wants
that in their code? And it doesn't always work either: slightly older chips
(Sandy Bridge) would always take 2t, regardless. So we got a really
significant speedup, sometimes, at the cost of not knowing whether we are
getting it, or what we need to do to get it.

Engineering is turned into shamanism and guessing.

At least we know that a Python program will always be slower than any
equivalent compiled program. But Mercurial is fast...

------
dustingetz
How about:

1. Get the data structure right

0. Use a language that helps you get the data structure right

~~~
ColinWright
How about assuming I have a good reason to be doing this task in this
language. Probably you were trying to be helpful, but your comment comes
across as incredibly condescending.

~~~
dustingetz
Hi Colin, the comment was not addressed to you.

------
bartread
Nowadays, I'd agree. Back in the mid-80s when I was learning BASIC on my ZX
Spectrum, not so much: there was no tooling. I mean you could, and I did,
measure elapsed time, but you also had to experiment by tweaking the program.
I suppose you were still measuring, but perhaps not in the way we'd mean when
we say "measure" today. Nowadays, when I ask someone to measure something it's
shorthand for, "your program is complicated: use a tool to help you." Back in
the day it was easier: programs were simpler, and there was only one thing
running on the CPU, and using devices attached to the system, at a time.

------
barrkel
When designing for performance, never stop measuring.

Establish time budgets for high-level operations. Measure performance
continuously as part of integration testing to catch regressions.
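
As a sketch of what that can look like in practice (everything here - the
names and the 50 ms budget - is invented for illustration):

    
    
        import time
    
        BUDGET_SECONDS = 0.050        # hypothetical budget for one operation
    
        def handle_request():
            sum(range(100000))        # stand-in for the real operation
    
        def test_handle_request_within_budget():
            start = time.perf_counter()
            handle_request()
            elapsed = time.perf_counter() - start
            assert elapsed < BUDGET_SECONDS, "budget blown: %.3fs" % elapsed
    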

~~~
dragontamer
It would seem like if performance is a big deal, you should have "unit tests"
(or as they're called: benchmark code) with well-documented runtimes on
specific hardware.

Stockfish Chess has a benchmark command, Cinema4d has Cinebench, and Blender
now has its benchmark suite.

And really, a benchmark is just a unit test where you're testing for speed.

~~~
barrkel
It depends on the app. For a game engine (and I don't really mean Stockfish, I
mean a game loop with input, network, world scripts, rendering etc.) you have
a target frame rate which gives a fixed time budget for everything to happen,
and you might divvy that up between different bits of code, where it makes
sense to budget at a fine-grained level.

For something with a core loop, and all the value is in the core loop, sure,
you can perf test the core loop.

But many applications are big, with lots of entry points, and data being
worked with can come in different shapes and sizes. Sometimes it can have
different dimensions that stress different bits of code. And it may be
difficult to plan the breakdown of a global budget in this context.

Part of the catch-22 of optimization is that you need to profile before you
optimize, otherwise you're likely to optimize the wrong thing. If you put your
benchmarks at the unit level, you run the risk of having individual bits that
are fast, but the whole doesn't run as fast as you'd hoped.

~~~
dragontamer
But I stand by my claim. Game engines often ship with some kind of benchmark
suite. For example, the "Total War" games often have a benchmark where the
camera moves across a pre-set map and a battle takes place.

Now true, there can still be UI issues which need to be optimized, but my
point is that if you want to test the speed of your rendering engine, of the
animation framework, unit-pathing, or whatnot, it's best to build a benchmark
where you can repeatedly document and execute the critical code path.

A single test which can be run nearly automatically (without user
intervention), together with a regime for documenting progress on the
benchmark, is key to improving the code.

------
sbr464
If you built the compiler, can’t you just emit a few helpful
notifications/logs letting a user know when they are using suboptimal
paths/techniques? I’m by no means a compiler developer, but I'm curious
whether it could be made more automatic without resorting to debugging tools.

~~~
mhh__
Yes, LLVM has analysis passes you could do this with. However, just profiling
the code is usually good enough (it's usually pretty obvious where the
problem code is, e.g. something blocking on the hardware).

It is worth pointing out that, currently, compilers can basically only do
what they're told to do (like a state machine), so this kind of thing seems
unlikely: the effort would either be wasted or end up as only part of a
larger program (just optimize it away).

------
ummonk
Is the Python executor actually using multiplication for the square and left-
shift for the multiply by 2? If not, that would probably be the biggest change
he should be making...

Edit: guess it doesn't matter - Python is so ridiculously slow either way...
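
It's at least easy to check what bytecode CPython emits for each form; in
Python 3, dis accepts source strings directly (recent versions show a
generic BINARY_OP instead of the opcodes commented here):

    
    
        import dis
    
        dis.dis("b*b")     # BINARY_MULTIPLY
        dis.dis("b << 1")  # BINARY_LSHIFT
    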

------
mesozoic
Or you know. Guess, then implement a test, then measure. Visa vi the
scientific method.

~~~
Bjartr
"vis a vis", I believe, is what you wanted to say.

~~~
mesozoic
What you didn't see my zero length space?

------
SamReidHughes
When in doubt, I've usually found it faster to just write the optimized
version and never measure.

~~~
mhh__
A famous quote by Knuth comes to mind...

Fast code is often unreadable or unmaintainable; software should be written
with its lifespan in mind. That's not an invitation to write slow code, but
micro-optimizations can be a disastrous temptation (especially when writing
low-level code). Compilers are pretty fucking good these days, so I think
trusting them can be very helpful.

How can you know if you don't measure?

~~~
SamReidHughes
I know what compilers do and might not do to code, so they're not the issue.
I'm not talking about an x ^= y ^= z here, I'm talking about alternate
algorithms and data representations.

------
laythea
Surely this statement is not well formed, for to optimise anything, a measure
of it is implied.

