
Outsmarting the compiler: Short story about optimization - dpzmick
http://dpzmick.com/2017/05/25/outsmarting-the-compiler/
======
cperciva
When looking at code performance, it's important to remember that conditional
branches are almost always cheap _inside microbenchmarks_ , because the CPU
can figure out when the same branches get taken on every loop... but far more
expensive in the real world. A similar issue applies to cache: Your code might
fit inside the L1 cache in your benchmarks, but when it's used in the real
world you get cache misses since the rest of the program accesses data too.
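A minimal C sketch of the branch half of this pitfall (all names here are illustrative, not from the post; compile at an optimization level where the branch survives, since at `-O2` the compiler may turn it into a branchless `cmov`):

```c
/* The same conditional fed predictable data is learned by the branch
 * predictor and looks almost free; fed unpredictable data, the
 * identical code pays for mispredictions. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

long sum_if_positive(const int *v, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        if (v[i] > 0)                       /* the branch under test */
            s += v[i];
    return s;
}

void compare_predictability(void) {
    enum { N = 1 << 20 };
    int *steady = malloc(N * sizeof *steady);
    int *mixed  = malloc(N * sizeof *mixed);
    for (long i = 0; i < N; i++) {
        steady[i] = (i < N / 2) ? -1 : 1;   /* predictable: learned fast */
        mixed[i]  = (rand() & 1) ? 1 : -1;  /* unpredictable per element */
    }
    clock_t t0 = clock();
    long a = sum_if_positive(steady, N);
    clock_t t1 = clock();
    long b = sum_if_positive(mixed, N);
    clock_t t2 = clock();
    printf("predictable: sum=%ld  %f s\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("shuffled:    sum=%ld  %f s\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
    free(steady);
    free(mixed);
}
```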

------
Veedrac
> Then, a bit later, we need to very quickly finish populating the structs.

I am finding it extremely hard to envision a circumstance where this is a
bottleneck for anything. Care to clarify the context?

I also find the struct layout really odd; why not just move c?

Your benchmarks are also probably broken; branch predictors use global state
so will almost certainly predict fine the way you've used things. You need to
_repopulate_ a significantly-sized _array_ each time with randomly chosen
values. You can't use the same array because it'll be learnt, and you can't
use a short array because it'll be predicted globally.
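A sketch of the benchmark structure being suggested here, refilling a large array with fresh random values before every timed pass so the predictor cannot memorize the outcome sequence (all names are illustrative, not from the original benchmark):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

enum { N = 1 << 20, PASSES = 16 };

/* the conditional under test: count values above a threshold */
long count_over(const int *v, long n, int threshold) {
    long c = 0;
    for (long i = 0; i < n; i++)
        if (v[i] > threshold)
            c++;
    return c;
}

void bench(void) {
    int *v = malloc(N * sizeof *v);
    double total = 0.0;
    for (int pass = 0; pass < PASSES; pass++) {
        for (long i = 0; i < N; i++)    /* refill each pass: a reused or
                                           short array would be learned */
            v[i] = rand();
        clock_t t0 = clock();
        long c = count_over(v, N, RAND_MAX / 2);
        total += (double)(clock() - t0) / CLOCKS_PER_SEC;
        if (c < 0) abort();             /* keep the result observable */
    }
    printf("mean pass: %f s\n", total / PASSES);
    free(v);
}
```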

~~~
dpzmick
I'll give that benchmark trick a try. Thanks for the feedback.

For some additional context, these structs are packed wire-line protocol
structs. In reality the padding bytes are full of other fun details.

~~~
Veedrac
The question for context is how this became something you want to optimise. It
seems almost anything else you do with the values would cost more than the
part you tried to optimise.

~~~
dpzmick
Very little is done with these values in the fast path. The application in
question is basically a smart proxy that performs a very minimal amount of
formatting. Most of the code in the fast path is pretty linear, so I was
curious whether there was a good way to keep this particular case free of
conditionals.

But, you are right that it probably doesn't matter. The minimal benchmarking I
did convinced us that there wouldn't be much of a cost. The code that actually
ended up being committed to our project has the branch.

------
BenjiWiebe
Quick upvote for a well designed mobile friendly site. The last few HN posts I
read were awful on mobile.

~~~
dpzmick
Thanks! I started with a fork of this Jekyll site:
[https://github.com/johnotander/pixyll](https://github.com/johnotander/pixyll)
so I can't take much credit for the design.

------
JoachimSchipper
This is not so much "outsmarting the compiler" as "working with the compiler"
- tweaking code to trigger certain optimizations/generate particular output.
The missing part, of course, is showing that these optimizations actually
help...

~~~
dpzmick
In this case I don't think that being clever can actually help us much. The
microbenchmarks towards the end of the post don't indicate any performance
gained by writing this "clever" code.

------
Dreami
How do this wrapper and each variant connect together? I assumed he would
include a pointer to the actual struct in the wrapper, but I just see payload
(which I'm going to assume needs to be written to that array, with padding, in
the actual struct).

For me, a wrapper includes the original thing and just wraps stuff around it.
How does this work here? May also be a question to the author, I suppose
now...

~~~
JoachimSchipper
In this wrapper, the payload[] field holds an instance of either struct A or
of struct B - that is, the wrapper is "prepended" to the actual data.

------
bloaf
So why wouldn't you just put c before the arr[padding#] chars?

~~~
dpzmick
In the real-world version of this problem the padding bytes contain some real
values, and the structs are both written straight to a network line (the
protocol is defined externally).

------
dmh2000
gcc warning: ISO C++ forbids zero-size array ‘payload’ [-Wpedantic]

clang warning: flexible array members are a C99 feature [-Wc99-extensions]

is this still the case?

~~~
dpzmick
Try compiling with c++11 turned on.

~~~
dmh2000
I used -std=c++14

~~~
slrz
Use -std=gnu++14 then.

------
taneq
Outsmarting the compiler: A short story about optimisation:

"Don't."

The end.

(Basically, as the story shows, with this kind of micro-optimization you may
or may not beat the compiler but you're almost certainly wasting your time
compared with more effective optimization methods, like rethinking the
problem.)

~~~
frozenport
This is stupid because it categorically denies the business case for micro-
optimization.

Yet, a business case might exist when the library is heavily utilized, or
often when a compiler isn't able to produce the correct code.

There are also cases, primarily in finance, where single-threaded low latency
distinguishes competing groups. Some of those guys count every nanosecond.

The techniques described here (and in other places) are universally
applicable.

~~~
spc476
I recently played around with some code I wrote in college [1] which involves
a pair of intertwined functions. I decided to play around with the SSE
extensions on modern CPUs and write a vectorized version of the code.

Try as I might, I could _not_ beat GCC [2], which used non-vectorized code. I
chalk it up to no longer knowing how best to write optimized x86 code (it's
been years since I did any real assembly language programming); I might be
hitting some scheduling or pipeline issues, I just don't know.

[1] I described the code years ago here:
[http://boston.conman.org/2004/06/09.2](http://boston.conman.org/2004/06/09.2)

[2] I beat clang easily though.

