
Three Optimization Tips for C++ - JSno
https://www.facebook.com/notes/facebook-engineering/three-optimization-tips-for-c/10151361643253920
======
dmunoz
Another presentation by Andrei on similar matters from GoingNative 2013:
Writing Quick Code in C++, Quickly [0]. From memory, it has (at least) these
three major points:

* Measure, then optimize

* Pay attention to data layout to keep the cache hot

* Techniques for devirtualization

[0] [http://channel9.msdn.com/Events/GoingNative/2013/Writing-
Qui...](http://channel9.msdn.com/Events/GoingNative/2013/Writing-Quick-Code-
in-Cpp-Quickly)

------
eco
Nice to see them closing those <uint32_t> tags. C++ users always forget to
close their tags.

~~~
etfb
Your comment would have been amusingly surreal if there were no closing tags
for the <uint32_t> stuff in the article. I'm almost disappointed that there
were. I wonder who was responsible for that particular bug in the Facebook
formatting code.

------
btilly
On my random forays into C++, I saw different things. Here were my three most
important optimizations.

1\. Replace strings with integers wherever possible. Create a mapping table,
use the indexes.

2\. Lay out related objects close together in memory, even if that requires
special data structures. Suppose that when I need A, I am likely to need B
soon. If B is physically close to A, then I am likely to find it in cache,
which is about 10x as fast. So, for instance, don't store a tree as a bunch of
random pointers. Store it as offsets inside a fixed container of memory.

3\. Process data according to how it is laid out in memory. For instance, walk
through a vector; don't do random access into it. Again, this is all about
trying to make sure that stuff is in cache as often as possible.

For the second and third points I found it was a case of do it perfectly or
don't worry about it. For example, if I am missing cache regularly in three
places, fixing one only made my code 50% faster. Fixing the second one almost
doubled my speed. Getting the last one was an order of magnitude speed
increase.

~~~
TillE
After making sure you're using the correct basic algorithms, cache
optimization is probably the one major thing that's really worth thinking
about.

Even with hyperthreading and all the other fancy tricks that modern CPUs have,
it's still a huge deal. I learned about all this stuff in general terms at
university years ago, but it didn't really click until recently that even when
the CPU is at 100%, it's very very often stuck waiting for a memory read to
complete.

------
yalue
I'd be interested in seeing how much of this advice applies to architectures
other than x86.

As a side note, definitely don't follow the advice about using position-
dependent code unless you are working in a very performance-intensive and
isolated environment. Just a single non-relocatable module/dll completely
undermines the benefits of ASLR, so an attacker with a copy of your program
will have a much easier time crafting an attack against it.

~~~
maximilianburke
A lot of it generally holds true if you're programming for x86/x64, ARM, or
PowerPC. The "hierarchy of speed" could use some updating; I think these days
it is more along the lines of:

1\. Comparisons (no branch, just compare)

2\. u/int add/subtract/bit operations

3\. FP mul

4\. FP add/sub

5\. FP div

6\. u/int mul

7\. Indexed array access (data in L1, including latency for data to move to
registers)

8\. u/int div

That rough order seems to generally hold on a modern Haswell processor or the
in-order PowerPC cores in an Xbox 360, though the cycle counts will vary.

Being thoughtful with respect to how you access your data, and paying
attention to the effects of access patterns on caching, will almost always
help regardless of the target architecture :)

~~~
nkurz
_1\. Comparisons (no branch, just compare) 2\. u/int add/subtract/bit
operations_

Is there a reason you follow Andrei's lead and suggest that comparison
operations are faster than integer subtraction? At least for x64 on Intel, I
think they would always be the same speed and thus should be grouped together.
Are they different on the other platforms?

~~~
mattgodbolt
The only justification I can think of for that statement is that on modern
processors (Sandy Bridge and above), micro-op fusion in the decode stage can
fuse a comparison with a branch that immediately follows it, turning it into a
"compare and branch" micro-op.

------
eco
Vimeo link to the talk is dead but I found one on Archive.org:
[https://archive.org/details/AndreiAlexandrescu-Three-
Optimiz...](https://archive.org/details/AndreiAlexandrescu-Three-Optimization-
Tips)

------
nly
This post misses one of the most fundamental and important optimisations for
'digits10' that his compiler is likely doing for him: turning the divide by 10
in to a multiply by using a multiplicative inverse mod 2^64

    
    
        $ cat div10.cpp 
        #include <cstdint>
    
        uint64_t div10(uint64_t y) {
            return y / 10;
        }
    
        $ g++ -std=c++11 -march=native -Ofast -c div10.cpp
        $ objdump -C -d --no-show-raw-insn div10.o
    
        0000000000000000 <div10 (unsigned long)>:
        0:   movabs $0xcccccccccccccccd,%rdx
        a:   mov    %rdi,%rax
        d:   mul    %rdx
        10:   mov    %rdx,%rax
        13:   shr    $0x3,%rax
        17:   retq
    

His basic implementation likely isn't using division at all.

~~~
IvyMike
Doesn't this quote from the article address that? Or was it a late add?

"Truth be told, it's a multiplication because many compilers transform all
divisions by a constant into multiplications; see e.g.
[http://goo.gl/LhPeH](http://goo.gl/LhPeH) "

~~~
nly
Guess _I_ missed it. Interestingly it seems to be impossible to do this
optimisation yourself in C or C++ and get either Clang or GCC to generate the
same machine code (on x86-64).

~~~
ridiculous_fish
It is possible to coax that codegen out of clang. The trick is that you need
the high half of the product, so you must use a larger type, in this case
__uint128_t:

    
    
        uint64_t div10(uint64_t y) {
            const uint64_t magic = 0xCCCCCCCCCCCCCCCDULL;
            __uint128_t prod = magic * (__uint128_t)y;
            return (uint64_t)(prod >> (64 + 3));
        }
    

See [http://libdivide.com](http://libdivide.com) for how to get this codegen
with runtime constants (I am the author).

~~~
mattgodbolt
Clang since at least v3.0 has done this automatically at -O1 and above - no
need for this cool but ungainly, hard-to-maintain approach.

[http://goo.gl/Cx0E0c](http://goo.gl/Cx0E0c)

~~~
nly
GCC has done this for 20 years... the point is that this optimisation requires
a wider integer type than 64 bits (in this case, apparently, 67 bits). This
optimisation is potentially important for anything generating bytecode, or
native code inside a JIT.

------
ANTSANTS

      >The speed hierarchy of operations is:
      >
      >  comparisons
      >  (u)int add, subtract, bitops, shift
      >  floating point add, sub (separate unit!)
      >  ...
      >  (u)int division, remainder
    

This can't be right. A comparison necessitates either a branch or a cmov, and
there's no way that either of those is faster than a basic ALU operation.

~~~
flebron
I'm guessing he doesn't mean branching on a result, he means just comparing
two things via cmp.

------
bello
Could someone explain why array writes are so expensive? I understand that a
write would mark a cache line as dirty (significantly increasing eviction
time, because you have to write the line back) and that it could probably
prevent the compiler from enregistering stuff. However, I don't get the
aliasing and the "a write is really a read and a write" argument.

------
berkut
Good tips; however, in a lot of cases when speed matters it's still often
beneficial to do reciprocal float division, i.e.:

    
    
      float rcpDiv = 1.0f / value;
      float finalValue = myValue * rcpDiv;
    

With fpmath=fast, I've seen compilers do that for you, but not always, as it
potentially changes the answer very slightly.

------
TillE
> Prefer 64-bit code and 32-bit data.

Wait, how is that possible? Like, you can have amd64 optimized code that uses
32-bit pointers?

~~~
eco
Pointers would be an exception, of course. As others wrote, he's talking about
general data. Just because 64-bit ints are faster on 64-bit processors than on
32-bit processors doesn't mean you should use them when a 32-bit int would
suffice.

~~~
protopete
Pointers don't have to be the exception. I'm working on a programming language
runtime where objects are stored as 32-bit values, with 3/4 of the range
representing an integer and 1/4 of the range (30 bits) representing a pointer.
Given the constraint that pointer objects are aligned to 8 bytes, the 30 bits
multiplied by 8 yield an 8 GB addressable memory space. Pages within that
space can be allocated using mmap with the MAP_FIXED flag. Smaller memory
usage results in more things in cache, resulting in higher performance.

~~~
TorKlingberg
So, your pointers will not be able to address bytes individually? No char
pointers, or is it 8 byte chars? Or is this a completely different language
without raw pointers?

~~~
protopete
The language is currently without raw pointers, suitable for an object system
like Python/Ruby/JS.

------
gfodor
Any good books for someone who wants to get up to speed on modern assembly
programming, who has been out of sync for the last 10-15 years?

------
dchichkov
Liked it. Good, well-reasoned advice from an expert. I especially like him
making the point that too often good discussions get smoothed over.

I'll complement that with another piece of advice. On any modern Linux there
is "perf top". If you haven't yet, I'd suggest learning how to use it. And
using it.

------
JSno
Another set of C++ optimization tips, from a Clemson professor:

[http://people.cs.clemson.edu/~dhouse/courses/405/papers/opti...](http://people.cs.clemson.edu/~dhouse/courses/405/papers/optimize.pdf)

