When optimising code, never guess, always measure (solipsys.co.uk)
124 points by ColinWright on Sept 5, 2018 | 103 comments



I've never really spent any time reading Python bytecode, but disassembly suggests the difference could be:

    f3
    ------
    92 LOAD_FAST                2 (a2)
    94 LOAD_FAST                3 (c)
    96 BINARY_ADD
    98 LOAD_FAST                3 (c)
    100 LOAD_CONST              1 (2)
    102 BINARY_ADD
    104 ROT_TWO
    106 STORE_FAST              2 (a2)
    108 STORE_FAST              3 (c)

    f4
    ------
    92 LOAD_FAST                2 (a2)
    94 LOAD_FAST                3 (c)
    96 INPLACE_ADD
    98 STORE_FAST               2 (a2)
    100 LOAD_FAST               3 (c)
    102 LOAD_CONST              1 (2)
    104 INPLACE_ADD
    106 STORE_FAST              3 (c)
And perhaps the culprit is that INPLACE_ADD requires more work than BINARY_ADD: https://stackoverflow.com/questions/15376509/when-is-i-x-dif...

I strongly doubt that it's running on multiple cores. Python doesn't implicitly schedule any user code on separate threads, and even if it did, the coherence cost of doing so would swamp any benefit of doing two simple operations in parallel.

Disassembly is as simple as:

    import dis

    with open('f3', 'w') as f:
        dis.dis(factor_fermat3, file=f)

    with open('f4', 'w') as f:
        dis.dis(factor_fermat4, file=f)
Edit: Whoops, I see that I used Python 3, while the post was about 2.7.4, and there's some wrangling over the differences below. I reran the disassembly under 2.7.4. The bytecode was more verbose, but differed in the same way.


This comment is excellent. The title of the original post should be: "When optimising code, never guess, always read the bytecode/assembly."

Without actually reading the assembly/bytecode/etc., you end up speculating about silly things like 'the two evaluations and assignments can happen in parallel, and so may happen on different cores'.


Possibly a subtitle: It's never what you think it is. I've worked in embedded and real time environments before and optimisation used to be my thing, but I am always surprised at how badly I guess what the problem is. It's a hard lesson to teach other people though because programmers are smart and smart people tend to default to thinking that they are correct :-)

But you are right: even when you isolate where the code is slow, you've still got a lot of work to do to find out why it is slow.


Indeed, I was unduly influenced by the code I was writing in the late 80s and early 90s, which really did take multiple assignments like this and run them on different processors. You say it's a silly thing, but we used to do it - things have changed.

Added in edit: The magic term is "execution unit", not "core". As I say, things have changed, and the bundling of multiple execution units into each core, and multiple cores into each processor, is different in interesting and subtle ways from the situation I used to code for, where I had a few hundred, or a few thousand, processors in each machine, but the individual processors were simpler.


I didn't mean to say this was a silly thing to do - most modern processors execute instructions out of order on multiple ALUs.

The problem is that the abstraction layer between the python code in question and the processor's instruction stream is so thick that it's hard to say one way or the other that the processor is indeed executing that particular pair of instructions in parallel. It's definitely executing many instructions out of order, but it's unclear (without inspection of the python interpreter and its assembly) what's happening at the machine level.

Looking at the bytecode at least begins to tell us that the two versions are fundamentally different at the bytecode level, which could account for the performance difference. Although, what exactly makes the material difference is also under debate elsewhere in the thread. :)


I'd say that without measuring first, you have too much to look through for anything large enough to be interesting. So measure, drill in, eliminate the first obvious culprits, measure again, then look at assembly.

I mention "obvious" because sometimes it really is obvious, like doing a linear search in a hotspot where it should be a binary tree, and things like that.


Measuring is more important. Knowing "a is faster than b" without understanding why is more useful than guessing what should be faster based on assembly and getting it wrong because you don't perfectly understand how the CPU is actually executing your code.

Looking at assembly/byte code is interesting to understand what you could tweak, but again you need measurements to verify.


As a non python user, that stack overflow post is very interesting. It seems that in python x += y is a significantly different operation than x = x + y, despite my expectation that one is just syntax magic for the other.
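A small illustration of where the two genuinely diverge (for mutable types rather than the ints in the post): `+=` calls `__iadd__` and mutates in place when it can, while `x = x + y` always builds a new object and rebinds the name.

    a = [1, 2]
    b = a          # b refers to the same list object
    a += [3]       # in-place extend via list.__iadd__
    print(b)       # [1, 2, 3] -- b sees the change

    a = [1, 2]
    b = a
    a = a + [3]    # builds a brand new list and rebinds a
    print(b)       # [1, 2] -- b is untouched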


If you want to test your theory, you can look at the implementation of those opcodes.

Here's INPLACE_ADD:

https://github.com/python/cpython/blob/874809ea389e6434787e7...

And here's BINARY_ADD:

https://github.com/python/cpython/blob/874809ea389e6434787e7...

They're almost identical :)


Thanks for the links! However:

Line 1296 is PyNumber_Add(left, right)

Line 1495 is PyNumber_InPlaceAdd(left, right)

Those go to:

PyNumber_Add https://github.com/python/cpython/blob/874809ea389e6434787e7...

PyNumber_InPlaceAdd https://github.com/python/cpython/blob/874809ea389e6434787e7...

And those are different, though I can't tell how it should matter.


> And perhaps the culprit is that INPLACE_ADD requires more work than BINARY_ADD

It would not surprise me if the culprit was instead:

     98 STORE_FAST 2 (a2)
    100 LOAD_FAST  3 (c)
Due to them being the same amalgamated memory location in practice, with the load hitting the in-flight store. But that's more typical of a POWER performance characteristic than an Intel one...


Here it's worth knowing a bit about how Python manages data.

Each CPython VM stack frame has three tuples associated with it:

* co_consts holds the constant/literal values used in the frame

* co_varnames holds all the local variable names in the frame

* co_names holds all the nonlocal names referenced in the frame

Under the hood, Python creates a structure called 'fastlocals', which is basically an array of pointers to PyObject, where the PyObject at each index of the array is the value corresponding to the name at the same index of co_varnames. LOAD_FAST and STORE_FAST are extremely simple opcodes; they do a tiny bit of error checking, refcount management (an increment on load, a decrement of the replaced value on store), and then just manipulate 'fastlocals'.

So, for example, 'LOAD_FAST i' consists in its entirety of:

1. Grab the PyObject at fastlocals[i]

2. If it's null, throw an UnboundLocalError

3. Otherwise, increment its refcount and push it to the top of the frame's evaluation stack.
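If you want to poke at this yourself, the name tuples and the opcodes are easy to inspect from Python. A minimal sketch (the function is just an example, and the exact bytecode printed varies by version):

    import dis

    def f(x):
        y = x + 2
        return y

    code = f.__code__
    print(code.co_consts)    # constant values used in the frame, e.g. (None, 2)
    print(code.co_varnames)  # local names, e.g. ('x', 'y')
    print(code.co_names)     # nonlocal/global names referenced (empty here)

    dis.dis(f)               # LOAD_FAST/STORE_FAST arguments are indices into co_varnames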


I mean this is CPython, so its bytecode never, ever runs in parallel anyway.


You mean concurrently, if you're referring to the GIL.


No. I mean parallel.

concurrent: multiple tasks at the same time are "in progress"

parallel: multiple tasks are progressing at the same time


Are you certain that CPython doesn't compile to any of the SIMD parallel instructions available in e.g., AVX?

You mean concurrent.


A nice thing about using a compiled language (like Julia) is that you typically do not need to worry about these types of things. Using the identical code from the blog post in Julia (with some trivial syntactic modifications, https://pastebin.com/RGYunpF4), we not only get a ~500x speedup but the speed difference is pretty much negligible between all the different implementations:

    factor_fermat0:  1.599 ms
    factor_fermat1:  1.488 ms 
    factor_fermat2:  1.387 ms 
    factor_fermat3:  1.421 ms 
    factor_fermat4:  1.422 ms 
Of course, you should still measure when you optimize, but dealing with issues such as whether a=a+1 or a+=1 is faster seems annoying. At that point, you are benchmarking the language and not your own code.


I was surprised to see that your version of 0 ran nearly as fast as the others, despite the multiple calls to the square root. Then I realised that you're using the library sqrt routine, and I wonder if that would be accurate enough for the case I'm usually in where my numbers have hundreds of digits. I don't know what algorithm Julia uses for that, but I've found floats (and their friends) insufficiently accurate.

But that's useful feedback - thank you.


If you need higher precision you could either bump up the accuracy a bit using e.g. https://github.com/JuliaMath/DoubleDouble.jl, or use the arbitrary precision numbers in Julia (https://docs.julialang.org/en/v1/manual/integers-and-floatin...)

The performance will of course change accordingly.


Is there an arbitrary precision square root routine? I've not found it ...


The concrete routine that is dispatched to is based on the type of the input number.

So `sqrt(2.0)` will, in the end, call the assembly instruction (depending on your specific CPU) `vsqrtsd` since the input type is a 64-bit float, `sqrt(Float32(2.0))` will end up calling `vsqrtss`, `sqrt(big"2.0")` will call the sqrt from the arbitrary precision library (MPFR) since we now have a "BigFloat" number, etc.

Directly from a Julia REPL session:

    julia> sqrt(2.0)
    1.4142135623730951

    julia> sqrt(Float32(2.0))
    1.4142135f0

    julia> sqrt(big"2.0")
    1.414213562373095048801688724209698078569671875376948073176679737990732478462102

    julia> setprecision(512) # Bump the precision
    512

    julia> sqrt(big"2.0")
    1.41421356237309504880168872420969807856967187537694807317667973799073247846210703885038753432764157273501384623091229702492483605585073721264412149709993586


Yes, but I'm talking about getting exact BigInt ceil square roots of BigInt inputs. I've not found anything about whether that is supported natively or in a library.


I haven't used Julia in quite a while, but given its focus on numerical computation it would be fairly surprising to me if it didn't have the ability to meet your precision needs.


Having gone through a Julia training recently, I am betting on it slowly overtaking Python in data analysis - either that, or on JIT usage in Python finally being taken seriously.


> the speed difference is pretty much negligible

really? I'd say that's quite significant, if it's repeatable.


The change from 1.6ms to 1.4ms is small compared with the difference I was seeing in the code I was using that calls sqrt. There the change was about 50% to 80%.

But in effect it's an algorithm change, so it's not surprising, and it would be interesting to see how it changes as the numbers involved get bigger. That's not why I was doing it in this case, but it is intriguing.


I spent much of the Need for Speed sprint on Python doing benchmarking between 2.4 and the 2.5 alpha that we were trying to speed up. It was significantly slower than 2.4 at pybench.

Measuring performance is hard. It's easy when there are big differences, like the new exceptions in 2.5 were 60% slower. But small things can add up, and it's hard to measure them.

One piece of wisdom I got from Tim Peters: Run short benchmarks. The longer you run them the more likely they are to be influenced by context switches and the like. I used to always run long benchmarks hoping to push these changes down into the noise. Instead I started running short tests and looking at the fastest of the runs.
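For reference, a minimal sketch of that approach using the standard library (the statement and the run counts here are just illustrative):

    import timeit

    runs = timeit.repeat(
        stmt="a2, c = a2 + c, c + 2",
        setup="a2 = 1; c = 1",
        repeat=7,          # several short runs...
        number=1_000_000,  # ...each of modest length
    )
    print("best:", min(runs))  # keep the fastest run, not the average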

Another area of speed improvement though was less about A/B testing and more about thinking hard about where time is used and how to just get rid of it entirely. Some use cases really suffered because in Python strings are immutable. If you are reading from the network and, say, looking for the end of a data frame, you may create a lot of new objects as you append incoming data. IIRC, a "bytearray" was created that was basically a mutable string to get rid of that problem entirely.
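A tiny sketch of the kind of pattern that motivates it (the chunks here just stand in for data arriving from a socket):

    buf = bytearray()
    for chunk in (b"HEADER ", b"payload", b"\r\n\r\n trailing"):  # stand-ins for recv() results
        buf += chunk                  # extends in place; no new string object per append
        end = buf.find(b"\r\n\r\n")   # look for the end of a data frame
        if end != -1:
            frame = bytes(buf[:end])  # complete frame found
            del buf[:end + 4]         # drop it from the buffer, still in place
            break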


Also remember that the maximum speedup your program can get from a particular optimization is related to the total time your program spends executing that part.

You can speed up an algorithm 1000 fold but if it only accounts for 1% of your total execution time you've only gained a 1% speedup.


Amdahl's law, for further reading. [0]

[0] https://en.wikipedia.org/wiki/Amdahl%27s_law
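Plugging the parent's 1%/1000x numbers into the formula, as a quick sanity check (a tiny Python illustration):

    # Amdahl's law: speedup = 1 / ((1 - p) + p / s), where p is the fraction of
    # time spent in the optimized part and s is the speedup of that part.
    p, s = 0.01, 1000
    print(1 / ((1 - p) + p / s))  # ~1.0101, i.e. roughly a 1% overall gain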


>You can speed up an algorithm 1000 fold but if it only accounts for 1% of your total execution time you've only gained a 1% speedup.

In a pure program, yes. In an impure one, not necessarily. The change in algorithm could result in some different data structures or layout, and this could cause another part of the program to run much faster or slower (e.g. because of cache locality issues).


I often test out things like this by completely disabling components.

It's one way of measuring, and can really help narrow things down in larger programs.

i.e., pipeline takes 300s to run. Suspect the final output may be worth profiling and optimizing. Stick a 'return' at the top of the output function and see the pipeline now takes 295s to run. Conclusion: final output not the problem.
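A minimal sketch of the trick, with a hypothetical `write_output` stage (the output code inside is made up):

    def write_output(results):
        return  # temporarily short-circuited for measurement: if the pipeline's
                # total runtime barely changes, this stage isn't the problem
        with open("out.csv", "w") as f:       # original (made-up) output code
            for row in results:
                f.write(",".join(map(str, row)) + "\n")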


Profilers with a flame graph get you the same information without having to do a manual search through your execution.


Yes, although it's not exactly the same, because there can be cross-cutting effects where the behavior of one function affects another.

For example, one function might thrash the cache, which ends up greatly slowing down a second function. If you remove the first function entirely (comment it out, etc.), the change will reflect that, but a flame graph won't (since it only shows the actual runtime for the second function: it has no way to determine how much impact it is having on other functions indirectly).


Iff your language has one, and it doesn't choke on the codebase...


Exactly. Carrying the logic a step further, that means you should focus first on the longest part of the program. And confidently knowing the longest part of the program requires measuring it. QED


Not necessarily longest, though. Right? Most executed would also suffice.

Beware measuring, though. Sometimes, just thinking about the problem ahead of time can clearly expose where you are spending time. At least, it can give a hypothesis to test for where you are spending time. With measurements to confirm/refute. :)


> Not necessarily longest, though. Right? Most executed would also suffice

"Longest total." I don't care if its a big slice of pie or a lot of little slices. But it better be one of those.

> Sometimes, just thinking about the problem ahead of time can clearly expose where you are spending time.

That's certainly true. But 99% of people overestimate their ability to do this. And are surprised by the results.


So when you say "longest" you don't mean "longest section of code", I'm guessing you mean "section of code in which execution spends the longest time".

If so, good, but it wasn't clear to me that that's what you meant. If you mean something else then I don't know what you mean at all.


"Longest" seemed clear to me. (Over "largest" and "slowest".)


But you wrote it, so of course it's clear to you. To me, when someone says "the longest part of the program" I immediately think of the routine that stretches over the most lines. That's the longest part of the code.


IDK, "longest part of code" to me clearly seems to refer to length of the code (i.e. lines of code), not to length of its execution time; so I'd say that there definitely seems to be some confusion caused by the choice of words.


Just to further these two posts, that is why I asked my question at the top here. "Hottest" code, I thought, was already well established for most executed. I had never seen "longest" for anything.

To that end, I was taking it to mean the longest "synchronous" path through your system. Not necessarily a single method, by any measure. But systems have plenty of what I will call "checkpoints", where code can be restarted/rerun with no hard-to-recover penalty. That is what I took "longest" to mean.


> Beware measuring, though. Sometimes, just thinking about the problem ahead of time can clearly expose where you are spending time.

Why beware measuring? In my experience, people are generally terrible at guessing where the bulk of the time is being spent. Measure as soon as you care.


Apologies, I meant beware of blind measuring. Don't just get a reading and take action. Hypothesize about where you think the time is spent and measure to test your hypothesis. And always question why the time is spent where it is.


This was prompted by unexpectedly finding that in Python 2.7.4, the fragment:

    a2, c = a2+c, c+2
ran faster than the fragment:

    a2 += c
    c += 2
Context and timings in the post. Speculation welcome as to why this might be true, and your predictions for Python 3 invited.


You're so far away from the hardware when you're in an interpreted environment that reasoning about performance is fraught with danger. Happy paths can be an order of magnitude faster than very similar but just-different-enough code.

Usually, in an interpreted environment, I reason about other things: number of database calls, HTTP fetches, that kind of thing. If I need CPU performance, I'll fire up something faster.

Finding the meatiest primitive that does the job in one go is often best - batching, where the batch operation is implemented in native code - but not always; you're at the mercy of the implementer.
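One everyday Python illustration of that "meatiest primitive" idea (purely illustrative): push a loop's worth of work into a single call that runs in native code.

    parts = [str(i) for i in range(100_000)]

    # Python-level loop, building many intermediate strings
    s = ""
    for p in parts:
        s += p

    # one batch operation implemented in native code
    s2 = "".join(parts)
    assert s == s2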


If you have performance bottlenecks, Python is the wrong language. The only place where this kind of investigation makes sense is if you're a Python VM dev.


> The only place where this kind of investigation makes sense is if you're a Python VM dev.

You are mistaken - this investigation does make sense in my context.


If you're having Python bottlenecks, it is the wrong language. First try Pypy, and if that doesn't/can't work switch to Julia/Go/R/C++ as your domain requires.


You are right when you say:

> If you're having Python bottlenecks, it is the wrong language.

But you are missing the point. The object of the underlying exercise was not to try to make this specific instance of the algorithm run fast; Python was not the bottleneck. The activity was exploring the effects of algorithmic changes and investigating the Big-O performance of the different variants and changes. As a part of that, the underlying code was being reworked and modified to allow the larger algorithmic changes to be made, and this was one of those changes. It says so in the post:

> Then I made a small change to the code layout ready for another shot at an optimisation.

As always I ran the timing tests (in addition to the functional tests) to make sure nothing substantial had changed, and there was an anomaly which attracted my attention and piqued my curiosity. Hence the write up.

And I believe you are wrong when you say that this sort of thing is, or should be, only of interest to Python VM Devs. I think that curiosity about this sort of thing is, or should be, important to everyone. Perhaps you will then decide you don't have time to pursue it, or that your time is better spent elsewhere, and that's fair enough, but curiosity and a desire to learn has been a pervasive theme among the best devs I've ever worked with.

So again, you are wrong, this investigation does make sense in my context. Your mistake is in assuming the post is about a Python bottleneck. As I say, you've missed the point.


The issue is that this investigation is so biased by what is slow in specifically the Python VM that I'm not sure what the effects would be in a case where performance is actually something you are trying to optimize for. The moment you translate this into C, C++, Java etc you suddenly have a smart compiler that can deal with things like function inlining, to where things like cache sizes and memory streaming speed will present as bottlenecks instead of object to object pointer jumping and bytecode interpreter overhead.

Languages don't get slower on a single dimension, and computers have multiple dimensions of performance. Hitting a performance bottleneck in one dimension on one language doesn't mean you'll hit the same bottleneck on a different language. Different scales might have different bottlenecks as well, think exceeding L2 cache or finally using all your GPU cores on problems that parallelize with scale. If you're using Python you'll never hit those other bottlenecks because the VM is always 40x slower than a naive native implementation.

Unless, of course, your bottleneck is in native code, which is where the easy prototyping of Python really shines.


> My guess is that in the first case the two evaluations and assignments can happen in parallel, and so may happen on different cores, whereas in the second case they must happen in serial, but that's just a guess.

That would require using multiple threads, which I'm pretty sure python won't do automagically. And even if it did, trying to do those calculations concurrently would almost certainly be slower due to the additional synchronization overhead.

It would be interesting to know how different python implementations react to that change.


I thought that modern CPUs would send instruction streams down multiple cores, if available, and that the interpreter could and would then, in some sections, get executed in parallel for you. Perhaps I was wrong, and the parallelism happens at too fine a scale for the interpreter to get any advantage from it that would be visible for code like this. I'm probably suffering from the hangover of my experiences from 20 years ago writing schedulers on parallel machines for serial code.


A modern CPU may have multiple cores, each executing a serial instruction stream (or, with SMT aka hyperthreading, each core may execute some specific number of instruction streams). Within a core, there may be several execution units, such as instruction decoders and adders and what not; if you have multiple adders, you can sometimes be processing arithmetic instructions in parallel. A CPU that can do that is a superscalar CPU, and the parallelism shown is often discussed as instruction level parallelism.

Instruction level parallelism (on common platforms) is an optimization the processor applies, based on the conditions when an instruction is decoded, but it isn't visible directly in the code. You would really need to inspect the low-level interpreter loop, with some idea of the specific bytecode it was processing and specifics about the processor you're using. Setting up the order of instructions to reduce data dependencies is one of the things an optimizing C compiler does, to help the processor utilize instruction level parallelism, but computing things with thread level parallelism would require too much coordination -- starting a thread is a many-stepped thing, and so is transferring the values; even if you're writing to memory addresses, there's hidden coordination there.


Red herring perhaps - seems due to the in-place add. It's quicker yet if you take out the multiple assignment:

    $ for STMT in "a2, c = a2+c, c+2" "a2 += c; c += 2" "a2 = a2+c; c = c+2"; do
        echo \[$(python3 -m timeit "a2 = 1; c = 1; $STMT")\] $STMT
    done | sort -n
    [... 0.0547 usec per loop] a2 = a2+c; c = c+2
    [... 0.0569 usec per loop] a2, c = a2+c, c+2
    [... 0.0608 usec per loop] a2 += c; c += 2


You're using python3. My brief experiments showed it actually ran faster in python2. You might want to re-run your experiments to see if I've done anything stupid with that test.

But definitely it seemed that python2 and python3 had significantly different performance.


Oh you're right, that's interesting - have replicated your result.

"a2 = a2+c; c = c+2" is still fastest on both 2 and 3, though don't know enough to say why.


Try python 3.7, iirc it should be faster again


Well that’s....surprising

Edit: did you compare the bytecode?


I could, but I would have a lot to learn, and the benefit would be minimal. Better to learn simply to be wary of these sorts of things, and always measure.

I'm hoping someone with a better background will find this sufficiently surprising and be provoked into doing that analysis, writing it up, and letting us know.

I've got it in my boomerang queue.


What is a boomerang queue?


I use a wiki, email, and a bunch of scripts to organise my projects (short and long term), tasks, reference material, and idle thoughts. As part of that system I use the "Do, Delegate, Defer, Delete" technique for things that come to hand.

So everything that comes to hand gets a unique reference (auto-generated) and I put tags on them. I don't try to be complete, I don't try to make sure tags are unique, I just put what's on my mind and seems relevant in a meta field. The system then auto-cross-references by content and tags, so when I put tags related to stuff I'm working on it uses a sort of TF-IDF system to find relevant documents and other reference material and puts links between them.

But sometimes I don't want to deal with something just yet, so there is a tool that lets me send things into the future - I call that my "Boomerang Queue".

Edit: it's been a while since you asked the question, so I tried to find some way to let you know I've replied, but you have no contact information in your profile. I hope you see this.


Thanks for caring that I get this.

I've long considered and wanted a system like you've described. It seems like it would have incredible long-term value. Is everything in yours self-hosted? What wiki do you use?


For years I've used the wiki I wrote. That has the feature that links are auto-created from free text, so if a pagename happens to turn up in the text anywhere, a link is made. This requires a little discipline - you don't want to create a page called "And".

But recently I've been experimenting with zim to see if I can use that instead. It saves pages in markdown in a directory structure, so my external scripts can be adapted to run over them and create the augmenting pages. It could possibly insert the free-text links as well in the background. Just an idle thought.

It's all running on my home machine, but I can connect to it from (nearly) anywhere. It's effectively unusable on my phone, so I have a staging area in which I make notes, and then transfer things later. The act of transferring implies a review of the material, and that's of value anyway, so it's not wasted effort.

For years, decades, I've been thinking of polishing, bundling, and releasing it, but it would be one of a gazillion organisation systems and no one would find it. If they did, I'd then be committed to supporting it, so that's not going to happen. I might, one day, convert these replies into a blog post, but again, limited value.


Well I appreciate you sharing this with me. It's moved me a little closer towards putting something like this together for myself. The free text linking is an interesting idea.

I worry about not having the discipline to regularly curate such a system. That's probably my biggest roadblock towards just going for it.


What I found is that it was important to (a) not over-engineer what I was trying to do, (b) do something that worked for the document I had in my hand[0], and (c) be able to extend what I was doing. If I then "fell off the perch" I could come back to it later, look at the thing in my hand, and put that into the system, thus climbing back on the perch.

Quite quickly the system was useful, and I had fewer decisions to make by using than by not using it, so it saved me effort. That was the key.

[0] ... or file/email/document on the screen


When optimizing, don't forget to consider the byte code. Without looking, I'd guess the first one has 1 less assignment operation.


In fact, this shows that “when optimizing code, know what your asymptotes are”. There is a more-than-order-of-magnitude difference between 34 and 1.2, but the conclusion bases itself on a silly 0.1 which plays no sensible role at all.

People take advice like “don’t optimize”, and then in 2030 someone 300-byte-patches system updates so they install at ~SSD write speed instead of suddenly taking 1.5 hours. And everyone is like, wow, how did we overlook that? You never hired an engineer, that’s how.


Your comment shows that you've missed the point. You're assuming that because I'm doing this kind of micro-optimisation that is, in the grand scheme of things, irrelevant, that I'm concentrating on the wrong thing.

You say:

> the conclusion bases itself on silly 0.1 which plays no sensible role at all.

You've missed the point. This specific code change is not the point for the task overall. This specific code change was preparation for a significant algorithmic shift that was about to happen. On the way I happened to notice something odd, and I thought people would be mildly curious, so I wrote it down for people to see.

Please don't assume that I'm clueless about the wider picture.


That's good advice that should be repeated over and over. I so often see people not writing clean code because it's considered inefficient, even though they don't have any real data.

Especially when you deal with optimizing compilers it's often totally counterintuitive where the real bottlenecks are.

Obviously you shouldn't be stupid and always shout "premature optimization", but make it a habit to profile from time to time and learn from it.


It looks to me like you found out that Python is slow in even more ways than we knew. But we knew that. That is another cost of such a language -- that even things that rationally look faster are slower.

But machine code, and thus optimized compiled code, has moved into the same space -- the machine instructions are interpreted, too, now. Speeds are a thousand times greater, but the things affecting speed are way more complex, so reasoning is similarly compromised. It's a deal with the devil: in exchange for typical speedups of an order of magnitude or more, you lose the ability to tell what is making it slow, and even whether it is slow at all. What does slow mean, now? Only that another program is faster. But you need to have found it to know.

On modern chips (Haswell and later), a loop of only four instructions -- two comparisons, two branches -- which may run for a microsecond or for months depending on input, may take time t or 2t depending on whether the compiler guesses right about which way one of the tests goes. It happens that gcc tends, systematically, to guess wrong on these. The difference comes down to whether the loop takes one cycle or two per typical iteration.

Using __builtin_expect() can overpower the compiler's guess, but who wants that in their code? And it doesn't always work either: slightly older chips (Sandy Bridge) would always take the 2t path, regardless. So we got a really significant speedup, sometimes, at the cost of not knowing whether we are getting it, or what we need to do to get it.

Engineering is turned into shamanism and guessing.

At least we know that a Python program will always be slower than any equivalent compiled program. But Mercurial is fast...


How about:

1. Get the data structure right

0. Use a language that helps you get the data structure right


At scale, this is the right approach, but hand optimizations can still be extremely important in (say) a large codebase where you can't change to the right data structure, or in (let's say) a game where the same basic operations are done over and over again, so getting them fast will be very helpful.


How about assuming I have a good reason to be doing this task in this language. Probably you were trying to be helpful, but your comment comes across as incredibly condescending.


Hi Colin, the comment was not addressed to you.


Nowadays, I'd agree. Back in the mid-80s when I was learning BASIC on my ZX Spectrum, not so much: there was no tooling. I mean you could, and I did, measure elapsed time, but you also had to experiment by tweaking the program. I suppose you were still measuring, but perhaps not in the way we'd mean when we say "measure" today. Nowadays, when I ask someone to measure something it's shorthand for, "your program is complicated: use a tool to help you." Back in the day it was easier: programs were simpler, and there was only one thing running on the CPU, and using devices attached to the system, at a time.


When designing for performance, never stop measuring.

Establish time budgets for high-level operations. Measure performance continuously as part of integration testing to catch regressions.


It would seem like if performance is a big deal, you should have "unit tests" (or as they're called: benchmark code) with well-documented runtimes on specific hardware.

Stockfish Chess has a benchmark command, Cinema4d has Cinebench, and Blender now has its benchmark suite.

And really, a benchmark is just a unit test where you're testing for speed.
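A minimal sketch of what such a test can look like (the routine and the budget are made up, and any wall-clock threshold is inherently tied to specific hardware):

    import timeit

    def critical_path():
        # stand-in for the hot routine being guarded
        return sum(i * i for i in range(10_000))

    def test_critical_path_budget():
        best = min(timeit.repeat(critical_path, repeat=5, number=100))
        assert best < 0.5, f"perf regression: best run took {best:.3f}s"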


Performance testing is not a boolean yes/no like a unit test. You often have to run the same benchmark ~50 times to get a properly significant midpoint, which means the tests run significantly longer.

Performance issues also often come from surprising places, so most of the value of time-budget testing comes from some sort of integration, UI, or end-to-end test. You also need in-the-wild perf tracking to find the regional or subset performance issues.


It depends on the app. For a game engine (and I don't really mean Stockfish, I mean a game loop with input, network, world scripts, rendering etc.) you have a target frame rate which gives a fixed time budget for everything to happen, and you might divvy that up between different bits of code, where it makes sense to budget at a fine-grained level.

For something with a core loop, and all the value is in the core loop, sure, you can perf test the core loop.

But many applications are big, with lots of entry points, and data being worked with can come in different shapes and sizes. Sometimes it can have different dimensions that stress different bits of code. And it may be difficult to plan the breakdown of a global budget in this context.

Part of the catch-22 of optimization is that you need to profile before you optimize, otherwise you're likely to optimize the wrong thing. If you put your benchmarks at the unit level, you run the risk of having individual bits that are fast, but the whole doesn't run as fast as you'd hoped.


But I stand by my claim. Game engines often ship with some kind of benchmark suite. For example, the "Total War" games often have a benchmark where the camera moves across a pre-set map and a battle takes place.

Now true, there can still be UI issues which need to be optimized, but my point is that if you want to test the speed of your rendering engine, of the animation framework, unit-pathing, or whatnot, it's best to build a benchmark where you can repeatedly document and execute the critical code path.

A single automated test, which can be run with little or no user intervention, as well as a documentation regime which can record progress on the benchmark, is key to improving the code.


If you built the compiler, can’t you just emit a few helpful notifications/logs letting a user know when they are using suboptimal paths/techniques? I’m by no means a compiler developer, but I’m curious whether it could be more automated without resorting to debugging tools.


Yes, LLVM has analysis passes which you could do this with. However, just profiling the code is usually good enough (it's usually pretty obvious where the problem code is, e.g. something blocking on the hardware).

It is worth pointing out that, currently, compilers can basically only do what they're told to do (like a state machine), so this kind of stuff would be unlikely, given that the effort would either be wasted or only be part of a larger program (just optimize it away).


You’d be more likely to just make the optimization. That’s why we have optimizing compilers, rather than “perf linters”. Though the second does have its merits for suggesting not-quite-identical approaches.


Is the Python executor actually using multiplication for the square and left-shift for the multiply by 2? If not, that would probably be the biggest change he should be making...

Edit: guess it doesn't matter - Python is so ridiculously slow either way...
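For what it's worth, this is easy to check with `dis`. A quick sketch (the exact opcode names differ between Python versions, but CPython generally emits a generic multiply rather than a strength-reduced shift):

    import dis

    dis.dis(lambda x: x * x)  # a plain multiply opcode, not a dedicated square
    dis.dis(lambda x: x * 2)  # likewise a multiply by the constant 2, not a shift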


Or you know. Guess, then implement a test, then measure. Visa vi the scientific method.


"vis a vis", I believe, is what you wanted to say.


What, you didn't see my zero-length space?


When in doubt, I've usually found it faster to just write the optimized version and never measure.


A famous quote by Knuth comes to mind...

Fast code is often unreadable or unmaintainable; software should be written with its lifespan in mind. That's not an invitation to write slow code, but micro-optimizations can be a disastrous temptation (especially when writing low level code). Compilers are pretty fucking good these days, so I think trusting them can be very helpful.

How can you know if you don't measure?


I know what compilers do and might not do to code, so they're not the issue. I'm not talking about an x ^= y ^= z here, I'm talking about alternate algorithms and data representations.


I've found situations where the code that everyone thought would be faster was, in fact, slower. Your assertion of "just write the optimized version" assumes that you always know which version would be faster - I'd suggest that there will be times when you are wrong, and you won't always know which times they'll be.


If you don't have the time budget for that, you have to make the trade off of probabilities. It's perfectly possible to hold a basic model of the CPU in your head and predict the performance of your code. The whole reason why people tell you to hold a model of CPU and memory latencies in your head is so that you can avoid writing the slow version in the first place.


In how many cases are you certain that the "optimized version" is actually better without measuring? People tend to carry a lot of ideas about what should be faster that don't necessarily hold in practice.


In tons of cases you are in fact "certain." You don't need an experiment to know that pointer chasing is going to be slower than array iteration, or that in some circumstance pulling a function out of a loop might be better but won't be substantially worse than the compiler's output.

Generally speaking, the cases where you are unsure, it's sensitive to platform details or unstable compiler optimizations, or you don't know what sort of user input you'll really get.

Or in the case of the blog post, if you don't have a good model of the execution environment in your head (because Python).


Ok, fair, given your examples - for some reason I wasn't thinking about high-ish level decisions like that (and I have some annoying memories of people writing "optimized" code that was hard to read and not any faster than what the compiler did with the simple version).


That's a recipe for convoluted code with no benefit. It's what we call premature optimization.


No, it's called saving effort, because the situation requires determining whether the optimization is worth it anyway. Convoluted architecture is a problem, not some code that's an O(1) rewrite away from being simplified.


I think your definition is off: Optimized does not mean simplified. Optimized code tends to be harder to understand for the reader. And readability is the first metric any code is judged by.

If you want to optimize away from the plain version you have to support the introduced loss in legibility with relevant measurements.


> Optimized does not mean simplified.

I didn't say that and was assuming the opposite, so I think you've misread me.

Obviously if the faster version is substantially more complicated it's a downside.


"Optimized code" has an established meaning in our field. It's beyond simplified.

Don't tell me I'm misreading you when I make you aware of the distinction. I've understood what you wanted to say, albeit with difficulty. Just say "Ah, I meant simplified."


I never meant simplified by optimized. Simplification is what you do (with O(1) effort) when removing the optimization.

When they are the same thing, there is no trade-off to be made.


Surely this statement is not well formed, for to optimise anything, a measure of it is implied.



