Branch predictor: How many “if”s are too many? (cloudflare.com)
324 points by majke on May 6, 2021 | 105 comments



In the old days people used #ifdef to compile things out so that the "production code" didn't have any unnecessary branches. I was shocked when I first saw these living if()s in server-side C++ code, but then realized they were vital for debugging in production.

People also used to prefer table-based jumps over long if-else-if chains for anecdotal performance reasons. That has gradually changed as CPUs evolved various optimizations like branch prediction. Now some JIT implementations, like tracing JITs, replace virtual function calls with if-elses.

By the way, I'm glad that the article is so thoughtfully written that it doesn't include any ambiguous "best practices" at the end. It could have created yet another myth. The three top tips in the article are more about stating facts. I admire the author for taking that stance.


Jump tables can still be vastly faster than if-else-if, though it really depends on the specifics, such as how frequently each option is the correct one. Best practices are always subservient to benchmarks when optimizing.


The greatest advantage of jump tables is that they are flexible, within certain margins. You could populate them with entries from a config file or a database without having to go through a full release cycle.
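
For illustration, a minimal sketch (all names hypothetical) of a function-pointer table whose entries could be filled in at startup:

    #include <stdio.h>

    typedef void (*handler_t)(void);

    static void handle_get(void)  { puts("GET");  }
    static void handle_post(void) { puts("POST"); }

    /* Populated at init time, e.g. from a config file or database. */
    static handler_t handlers[] = { handle_get, handle_post };

    void dispatch(int op) {
        if (op >= 0 && op < 2)
            handlers[op]();   /* one indirect call instead of an if-else-if chain */
    }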


> Now some JIT impls like tracing JIT replace virtual function calls with if-elses.

I think it was about 2010ish when I first noticed a C compiler (Microsoft's) doing this optimization for function pointer calls. Might have been a link time optimization, because caller and callee were in different compilation units.


I think ifs based on constants get optimized to nothing anyway. So really no need to use #ifdef.
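
E.g., a toy sketch: with the flag a compile-time constant, both the branch and the dead call disappear from the generated code:

    void log_message(void);
    void do_work(void);

    enum { DEBUG = 0 };   /* compile-time constant */

    void process(void) {
        if (DEBUG)          /* constant-false: the compiler deletes branch and body */
            log_message();
        do_work();
    }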


Big if/else lists will often get compiled into a jump table anyway.


Great article, just to add one thing: conditional branches can seriously mess with the compiler's ability to optimize, especially the ability to auto-vectorize. So even if a never-taken branch is more or less free when the CPU executes it, just having it there might have resulted in much less efficient codegen. Just another thing to keep in mind with this issue.
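
For instance, a toy sketch: a branch that reduces to a conditional assignment usually still vectorizes (it becomes a select/blend), while a branch containing a side effect such as a call does not:

    void log_msg(void);

    /* Usually vectorizes with optimizations on: the if becomes a masked select. */
    void clamp(float *a, int n) {
        for (int i = 0; i < n; i++)
            if (a[i] < 0.0f)
                a[i] = 0.0f;
    }

    /* Usually does not vectorize: the call in the branch has side effects. */
    void clamp_log(float *a, int n) {
        for (int i = 0; i < n; i++)
            if (a[i] < 0.0f) {
                a[i] = 0.0f;
                log_msg();
            }
    }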


This is a pretty amazing analysis that answers questions we probably all had as newbie programmers before realizing that code readability mattered more than efficiency 99.9% of the time.


The goalposts are constantly changing, but there are refactorings that achieve both.

As a student, real world performance analysis was one of the ways I kept myself engaged in otherwise sometimes tedious curricula. After college I landed a job far away from home only to discover their software was so slow I was embarrassed to be associated with it, so I got a quick self-directed study in practical optimization.

On any given day, you're much more likely to be amused by your own cleverness than anyone else is, so you either do it for your own reasons, stop doing it entirely, or find palatable forms of expression.

My most concrete example is probably the bizarre love triangle between locality of reference, lazy evaluation, and common subexpression elimination. There are several islands of sanity where reading comprehension is improved by refactoring in ways that happen to reduce or avoid calculations.

You clean up the code and it gets 5% faster. You use the same pattern in three places and you're comfortably into double digits.

As more evidence piled up for me about the dangers of confusing code, I spent more time thinking about the legibility and very little time thinking about the performance, but the fact is that I still do it intuitively. 25yo me would still be pretty happy with the performance of my code, even though <mumble>yo me gets a little concerned about that level of enthusiasm.


I appreciate this comment. We can develop these little ingrained patterns that keep things 'optimal' and at the same time think about future us having to read the code again.

I know that even 1 week future me appreciates readable code at the expense of a _little_ bit of performance.


In most applications you will not be able to get to the level of performance where any of this is going to matter.

For example, if you program in Python, there are so many more conditionals in the Python runtime itself that you will not see any difference from a couple of your own conditionals being predicted better or worse. You will also be doing a lot of other stuff (like spending time in framework code, calling libraries, performing I/O) that will make any gains from this pretty insignificant.

Where this comes into the picture is when you are really bent on improving the performance of your code on a grand scale (say, when you develop algorithmic trading systems or an operating system), when you have a very dense, busy inner loop (for example video encoding), or when you develop compilers.


I think that's more of an indictment of how bad the concept of "production python" is.


I disagree with that. Production doesn't mean that the code runs fast, but that it is stable and reliable.

There are use cases where you need a performant system that uses CPU really efficiently, in these cases Python is probably not the right tool.


Lots of code doesn't run so often that it needs to be optimal. If you use proper data structures and algorithms then you'll avoid the real problems. The constant factor slowdown you get from language choice won't matter for the vast majority of lines of code.


This is given as an excuse to write sloppy code and then we get the bloated mess that is modern software and the modern web.

Plus python has many problems besides performance.


Sloppy code would be using the wrong data structures and algorithms. Sloppy code would be optimizing nothing at all.

If you write things with good big O performance, for almost all of your lines of code you're done. It will perform fine.

Bloated messes don't come from writing a bunch of business logic in Python instead of C, not while processors are this many orders of magnitude faster than they used to be. They come from layers and layers of abstractions, or doing things completely the wrong way.


But if you're processing tens of thousands of records in an O(n) loop, it will still be tens of thousands of times slower than processing a single record in that loop. I.e., having a good big O still doesn't mean the code will be fast.


Are you saying this is a scenario where O(1) is possible? Then "good big O" is O(1). Code that is O(n) is a mistake.

If you're saying the fastest method is O(n), then sure, no method can guarantee fast processing of the entire list. But that fact sure isn't python's fault!


My takeaway was the opposite. You have a fairly generous budget of if's and code size if you use a low level language. But if you use a higher level language you could run into the limits.


Bit of a tangent but in the userland app space I'm increasingly interested in sidestepping the whole "conditionals" question with finite state machines, eg with a lib like XState.


I agree of course but I'd argue that an array lookup would be more readable and more maintainable than this `if` soup in the `getCountry` example code. One added benefit for instance is that the language will never allow you to declare two values for the same index, whereas you could very easily mess up a copy/paste and forget to update the condition.

More generally I don't think we should consider that readability and efficiency are at opposing ends of the same axis. With a bit of care and good tools it's often possible to get the best of both worlds, or close enough.


This kind of statement always makes me feel funny. It's true that algorithmic complexity is the most important part, you always have to measure, and micro optimisations are often not worthwhile. But this type of statement seems a lot stronger than that.

I'm a graphics programmer working in game development and most of my time is spent on our performance task force. I spend a lot of time thinking about the performance impact of things I write and revisiting things I've written to improve their performance.

My job would have been easier if more people spent a bit more time considering the performance impact of their solutions, as well.


Your job is very different than mine. Gaming/graphics are their own domain with different priorities. When your goal is to process data on the scale of several seconds, readability matters more. When your goal is performance per frame, you will definitely have different priorities.


If you care about performance and the number of ifs, you should look into profile-guided optimization. This can help the compiler arrange the code such that branches like the "if (debug)" in the example are never taken and so don't occupy precious resources. This will also make instruction fetch faster, and if the body of the if is placed on another cache line it can also improve instruction cache performance (and even TLB performance and I/O performance if the unused code is placed into another page).

https://llvm.org/docs/HowToBuildWithPGO.html
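
A sketch of the usual two-step flow with gcc (these flags are standard; clang's flow additionally needs an llvm-profdata merge step):

    gcc -O2 -fprofile-generate -o app app.c
    ./app < representative_workload      # run with realistic input; writes profile data
    gcc -O2 -fprofile-use -o app app.c   # recompile; branches laid out per the profile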


Unrelated to the actual topic, but I believe there is a copy-paste bug in the first sample:

    const char *getCountry(int cc) {
        if(cc == 1) return "A1";
        if(cc == 2) return "A2";
        if(cc == 3) return "O1";
        if(cc == 4) return "AD";
        if(cc == 5) return "AE";
        if(cc == 6) return "AF";
        if(cc == 7) return "AG";
        if(cc == 1) return "AI";
        ...
        if(cc == 252) return "YT";
        if(cc == 253) return "ZA";
        if(cc == 254) return "ZM";
        if(cc == 255) return "ZW";
        if(cc == 256) return "XK";
        if(cc == 257) return "T1";
        return "UNKNOWN";
    }
This will never return "AI" (Anguilla, as it seems!)


Full example doesn't have this mistake, so I would assume it's not in production :) https://godbolt.org/z/KWYEW3d9s


Wouldn't the compiler optimize this to an array lookup?

    static const char* countries[257];
Or even:

    static const char countrycodes[514];
    return countrycodes[i * 2];


Doesn't look like it. Each case is a compare and jump:

https://godbolt.org/z/KWYEW3d9s


Playing around a bit, it does seem that clang will do so: https://godbolt.org/z/1czW91bMM

A presentation on algorithm improvements popped up when looking into it (2015): https://llvm.org/devmtg/2015-10/slides/Wennborg-SwitchLoweri...


Turn the gcc version up to 11.


    static const char countrycodes[514];
    return countrycodes[i * 2];
Err... each of the strings is 3 bytes long.


If that was javascript you could make it return "AI" with the right `valueOf` call - https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


> This code is in a performance critical loop and it looks like a waste - we never run with the "debug" flag enabled.[1]

> [1] One historical solution to this specific 'if debug' problem is called "runtime nop'ing".

You know what another historical solution is called? Use a compile-time constant and let the damn optimizer optimize. (Virtually) no modern, widely used compiler would fail to optimize out `if (0) { ... }`.

Also, why does M1 performance matter to Cloudflare's software? I kinda doubt their servers are on Macbooks.


We do run aarch64. We are very interested in whether the M1 is faster than the ARMs we have.

https://blog.cloudflare.com/arms-race-ampere-altra-takes-on-...

https://blog.cloudflare.com/porting-our-software-to-arm64/


The “if (debug) log” example in the article is likely compiled to an always taken conditional jump past the logging call. And therefore not quite free. Nitpicking of course, I love such in-depth articles.


You can easily verify this in your specific code base but I believe the compiler prefers not-taken jumps for paths it predicts to be less frequently taken.

You can provide hints that the path is a cold one to ensure the compiler uses the most efficient conditional branch layout. On some architectures the compiler will also provide this hint to the CPU (with a prefix, the instruction it selects, or a flag bit in the instruction) which avoids the perf hit on the first iteration.

Both Clang and GCC recognize __builtin_expect which can be wrapped in a macro easily:

#define unlikely(x) __builtin_expect(!!(x), 0)
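
Applied to the article's example, usage would look something like:

    if (unlikely(debug)) {
        log("...");   /* compiler biases code layout so this is off the hot path */
    }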


C++ 20 introduces standard likely/unlikely attributes: https://en.cppreference.com/w/cpp/language/attributes/likely


clang also has `__builtin_unpredictable` which is at least as useful as likely/unlikely.


I feel like I'm missing something in the conclusion.

The initial question is:

  if (debug) {
    log("...");
  }
> Is it ok to have if clauses that will basically never be run?

And the summary is:

> If it's never-taken, it's probably ok. I found no evidence that such branches incur any extra cost. But do avoid always-taken branches and function calls.

However, if debug is false in the initial example, the branch is always taken; it's jumping over the call to log. Such code incurs a penalty - correct?

https://www.godbolt.org/z/o1Yd7oWYa


Not that this at all answers your question, and I'm guessing you know this already, but for a language with a preprocessor, that should be something more like:

  #ifndef NDEBUG
  #define DEBUG(msg) if (debug) { log (msg); }
  #else
  #define DEBUG(msg) 
  #endif 
This eliminates any question of branch cost from an NDEBUG ("production" or "optimized") build, and you don't have to wonder.

Obviously there are more sophisticated methods too.


Given appropriate likeliness annotations or profile-guided optimization, it's possible for the compiler to _invert_ the conditional and move the log statement to a "cold" section of code outside of the main flow, making it a never-taken branch.
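
A minimal sketch of the annotation route with gcc/clang attributes (`log` standing in for the article's hypothetical logger):

    void log(const char *msg);   /* the article's hypothetical logging call */

    __attribute__((cold, noinline))
    static void log_debug(const char *msg) {
        log(msg);
    }

    void hot_path(int debug) {
        if (debug)              /* laid out as never-taken; the cold body is   */
            log_debug("...");   /* typically moved out of the hot .text region */
        /* ... hot work ... */
    }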


Yes. I guess that's covered by "it's probably ok"... I'm going to play in godbolt now.


-O2 turns it into a "jump if nonzero" without any extra annotations: https://www.godbolt.org/z/j9dfrbosv


I'm wondering how you guys would optimize that code?

My naive approach would be something like this:

    const int numCountryIndicies = /* however many there are.. */;
    const char *countries = "A1\0A2\0...";
    return (cc < numCountryIndicies) ? &countries[cc*2] : "UNKNOWN";


You have an off-by-one error. The strings are 3 bytes long.

I'd write something similar, more or less; probably the following, not for optimization, but more as a matter of style:

    const char countries[][3] = {
        "A1",
        "A0",
       // [...]
    };

    // [...]

    int total_countries = (int) (sizeof(countries) / sizeof(countries[0]));
    return cc < total_countries ? countries[cc] : "UNKNOWN";
No need to hard code the length, and the cast is guaranteed to be within the bounds of an int on all platforms as long as you don't go over 2^16-1 countries.


I think that's the best solution (the only potential optimization I can think of would be to add a padding byte after every country to force 32-bit alignment, which would make the offset computation faster on some architectures, although probably not on x86).

Style-wise, I think the best solution would be to write the array like:

    const char countries[][3] = {
        [0] = "A1",
        [1] = "A0",
    };
This way the codes are explicit when you read the code, and it makes editing the array a little easier. Unfortunately gcc only warns if you set the same index twice with -Wextra; it remains silent with -Wall.


> This way the codes are explicit when you read the code

Good call. It might have caught the off-by-one mistake I made (the codes start from 1 and not 0 in the blog post). Maybe even switch to using an enum as our index instead of an int.


Well caught! I have two off by one errors :D seems they start from 1 instead of 0, and I didn't account for the null byte :D your solution is nicer and easier to read :)


This jibes with the overall point of the article: keep it simple and readable. The semi-fancy indexing caused two errors, and the upside is negligible after the compiler gets done with it.


The freshest gcc 11 can optimize the ifs nicely: https://godbolt.org/z/771foExcG

  getCountry:
        mov     eax, OFFSET FLAT:.LC0
        cmp     edi, 258
        ja      .L1
        mov     edi, edi
        mov     rax, QWORD PTR CSWTCH.1[0+rdi*8]
  .L1:
        ret


Anyone know what the purpose of the mov edi, edi instruction there is?

Edited to add: I understand that it's a NOP, but why would the compiler emit one here?


That instruction clears the upper 4 bytes of the rdi register. Note the next instruction uses rdi in the address.

The edi register is the lower 4 bytes of rdi. Instructions that write these smaller pieces zero out the unused higher bytes of the destination register. This helps with performance because it eliminates data dependencies on the old values in those higher bytes.


This is not a NOP; it explicitly clears the upper 32 bits of RDI, since the compiler does not know that they are zero in this situation. If you change cc from an int to size_t (long on x86-64) the compiler will generate:

        mov     eax, OFFSET FLAT:.LC0
        cmp     rdi, 258
        ja      .L1
        mov     rax, QWORD PTR CSWTCH.1[0+rdi*8]
Note that in some cases the compiler can do this automatically via lifetime analysis but not in this freestanding example.


It does nothing, so it's a no-op of sorts. I wonder if it's a branch-delay tactic of some kind? Surely the algorithm would be unchanged if it were removed.


It’s just a two-byte NOP. IIRC, it’s Intel’s recommended form for one of that length. Windows uses it for hot patching[0], but I can’t imagine that’s the reason here.

[0]: https://devblogs.microsoft.com/oldnewthing/20110921-00/?p=95...


clang does the same - I tested it back to clang 7.0.

So clang did this for a long time it seems.


Does a ternary actually eliminate a logic branch? I always assumed it was just high level shorthand, and would be the same in assembly as writing out the if block.


It does not eliminate the conditional jump; it's just syntactic sugar.


And to make it more complicated, both "regular" ifs and ternaries may not generate a branch at all, e.g. by using the CMOV (conditional move) instruction on x86. This avoids a branch but obviously cannot eliminate the data dependency between the condition and a later consumer of the result.
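
A toy example worth checking on godbolt: with optimizations on, compilers commonly emit cmov for both forms below, so neither executes a conditional jump:

    int max_if(int a, int b) {
        if (a > b) return a;     /* often becomes cmov, not a branch */
        return b;
    }

    int max_ternary(int a, int b) {
        return a > b ? a : b;    /* typically identical machine code */
    }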


Until C++14 (i.e., in C++11), the ternary was the only way to use "if then" in constexpr functions. So it has slightly more use from a C++ point of view.


Following that line, it's never exactly been syntactic sugar. When initializing a const local, you can use the ternary operator on the right hand side of the assignment, but you can't get the same effect using if (short of writing a new function). Similarly, C++ lets you use the ternary operator for the left hand side of an assignment, (a ? b : c) = 42; although you could use an if for this pretty easily.


Wow I had no idea you could use it on the LHS. TIL.


A TIL of my own: it seems not to be allowed in C.


It works if you take the addresses of b and c and then dereference. I assume this works without those in C++ because of references.
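
Concretely, a tiny sketch of the difference:

    int a = 1, b = 0, c = 0;
    (a ? b : c) = 42;        /* legal C++ (lvalue ternary), a compile error in C */
    *(a ? &b : &c) = 42;     /* the C-compatible spelling                        */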


In a funny turn of events, in C++17 there is no ?: shorthand for 'constexpr if' .


It’s probably a switch on the index (or in this case you could actually have an array and use the index).

I don’t know my compilers that well but if I had to guess I would say there is a good chance this will be optimized away by the compiler.


As pointed out above, gcc ≥11 optimises it to a switch / lookup table (https://godbolt.org/z/PxvGKqGrd).


Oh wow, this is really nice. I was expecting some sort of gcc documentation or a blog post; got the real deal. Thanks for the link.


> I'm wondering how you guys would optimize that code ?

Run it in an infinite multiverse, and construct a mechanism to destroy the universe every time the branch predictor fails to predict every single conditional correctly.


It reaches out it reaches out it reaches out it reaches out— One hundred and thirteen times a second.


Won't we need quantum computing for that first?


You don't need that.

The branch predictor can select a random branch, so it doesn't even need to actually predict anything. It should use quantum noise so that it has a chance of selecting different branches in different copies of the universe. There is nothing to compute, quantum or otherwise.

The bomb to destroy the universe will take care of all universes where it made wrong predictions, so that is where we should concentrate our development resources.

It has more utility than just branch prediction. Imagine the bomb going off automatically whenever a war starts.

Any copy of the universe that starts a war is automatically eliminated and this guarantees that you, as an observer, will never observe any wars.


I immediately wanted to rewrite it as an array, yep, since the countries are already indexed.

And to minimize the risk of errors such as the "Anguilla" error above.


Couple of thoughts here:

> if (debug)

The language Elixir, during its compilation phase, actually automatically removes these in the production environment. That is to say, it is a macro which behaves as expected in every environment but "prod", in which case it removes itself.

> conditionals

A number of years ago I used Ruby to experiment with writing a version of fizzbuzz that avoided conditionals entirely, and was purely functional:

https://github.com/pmarreck/ruby-snippets/blob/master/functi...

(I actually regret using currying here because it hurts the portability of the algorithm)

While it may be difficult (if possible) to convert all conditional logic to functional logic in a given algorithm, perhaps a compiler could do the work, if certain patterns emerged.

I'm not a C guy, but I have to wonder if such an algorithm would run faster on Intel or ARM chips by avoiding all branching and branch prediction. Can someone chime in on this?


> The language Elixir, during its compilation phase, actually automatically removes these in the production environment. That is to say, it is a macro which behaves as expected in every environment but "prod", in which case it removes itself.

I mean, you can do this in C too. Make "debug" an #ifdef, and when it is defined to "false" the compiler will obviously optimize that out.


Right.


There's a bunch of things you can do in C to "clue in" the compiler on what your code is going to do.

The easiest is using the preprocessor to hardcode values such that branches can be eliminated. This is nice and all but it can start to cause problems with things getting confusing.

---

The other common thing you can do is provide compiler attributes to hint at what the code is doing. For example, you can specify that a function is pure (doesn't affect the observable state of the program) or const (pure attribute with the addition that the function is not affected by changes to anything other than the inputs. i.e. same inputs always give you the same output). There are also cold/hot attributes for improving branch prediction.

Similarly there is the leaf attribute which restricts the control flow of a function largely to the current translation unit and allows the compiler to deduce significantly more information about what the code is intending to do.

---

Now on the more fancy/dangerous side is using strict aliasing, the restrict keyword, and array parameters in functions. Strict aliasing tells the compiler that types can only contain what they say they contain (i.e. any two pointers for the same location in memory must have the same type).

Likewise the restrict keyword states that any memory accessed by a restrict qualified pointer can only be accessed by that pointer or by pointers derived from said pointer (member accesses, array access, pointer sliding/offsets). This allows the compiler to know that memory isn't being touched by other accesses (which without restrict it could be). Realistically this allows you some small to large performance gains any time you are interacting with more than one pointer at a time as the alternative is that the compiler may have to recheck every cached value any time you write to another pointer. An example would be `f(char * restrict x, char * restrict y)`. Here you know that no value ever possibly accessed or written to in x will affect any value ever possibly accessed or written to in y and vice versa. Note that the restrict keyword is valid on variables, pointers, members, and arrays.

In a similar vein, you can leverage array arguments in functions to guarantee to the compiler that a pointer argument is non-null and contains some number of elements. For example, the difference between the functions `f(char * x)`, `g(char y [5])`, and `h(char z [static 5])` is that the compiler can't necessarily know for sure that x is a valid address in memory or how many elements it contains. The compiler may treat y as documented to contain 5 elements (though without `static` this isn't a hard guarantee), and it does know that z points to at minimum 5 valid elements. The compiler can then potentially elide null checks and bounds checks (say, for null-terminated strings). Within the function itself this may help a bit, but it benefits more in that with inlined functions, macros, and LTO the compiler can optimise within the scope of said function trees to elide those checks/branches. Something to note, because people don't always pick up on it: array arguments do in fact work with size 1, which allows you to specify that all your arguments are guaranteed valid pointers. Basically, for any internal functions you probably always want to use an array argument over a plain pointer so you can elide null checks and the like.
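
A compact sketch of those last two points (function names are made up):

    #include <stddef.h>

    /* restrict: dst and src are promised not to alias, so the compiler can
       vectorize the loop without emitting overlap re-checks. */
    void scale2(size_t n, float * restrict dst, const float * restrict src) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] * 2.0f;
    }

    /* [static 1]: the caller promises a valid, non-null pointer to at least
       one element, so the compiler may elide null checks. */
    size_t my_strlen(const char s[static 1]) {
        size_t n = 0;
        while (s[n]) n++;
        return n;
    }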

---

Combining all of the above in a C project can sufficiently restrict the search space so that the compiler can elide most "unused" code sections, restructure a decent bit of branching code into branch free forms (since it can be deduced equivalent at compile time), and reasonably tag the branch weights for the remaining branches.

TLDR: Yes; however, some of the techniques come with the downside that if the code doesn't fit the requirements of the technique, the optimisations can introduce bugs into the program. There are some levels of compiler warnings to help you when using these features, but they only catch the obvious cases (since if they could catch all of them, they could implement the techniques automatically). It's C, it comes with its footguns, but if you know how to use them you'll probably only be burned a few times and hopefully not too badly.


Are there any rules-of-thumb for avoiding branch mispredictions, other than reducing the number of conditional branches?

For example (not that I expect this to be true), something of the same sort as "your if block should contain the rare condition".


If you have perf critical branches in your code, you should try and make them as predictable as possible. Sometimes that means using a different approach that is algorithmically worse, but has much more predictable branches.

Classic example: linear search is faster than binary search for “small” lists. The item == myItem branch is only taken once, at the end. Meanwhile binary search will take the branches of its comparison (item < myItem, item > myItem) in equal proportion to each other, so the branch predictor is stuck at a 50% guess for that branch. There is a great talk on this but I can’t remember what it’s called...
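
E.g., a sketch of why the linear case is friendly to the predictor:

    /* The loop branch is almost always "not found yet, keep going", so the
       predictor is wrong at most once, on the final iteration. */
    int find(const int *a, int n, int key) {
        for (int i = 0; i < n; i++)
            if (a[i] == key)
                return i;
        return -1;
    }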


As you point out, the exact properties that make binary search algorithmically fast can slow it down on real computer hardware.

Branch predictors like low entropy (unsurprising) branches, but binary search is algorithmically fast because it maximizes the entropy (information gained) from the few branch checks that it does make.


It is possible to create a branchless binary search, though. Here is one article: https://schani.wordpress.com/2010/04/30/linear-vs-binary-sea..., and there are probably other ways to do it.
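
For instance, a minimal sketch of one branchless variant (not necessarily the linked article's exact code): the comparison feeds a select, which compilers typically turn into a conditional move rather than a jump:

    #include <stddef.h>

    /* Returns the index of the first element >= key in the sorted array a. */
    size_t lower_bound(const int *a, size_t n, int key) {
        if (n == 0) return 0;
        const int *base = a;
        while (n > 1) {
            size_t half = n / 2;
            base = (base[half] < key) ? base + half : base;  /* select, not branch */
            n -= half;
        }
        return (size_t)(base - a) + (*base < key);
    }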

Another interesting question where one could sink a lot of time is how to do binary search on GPUs using multiple threads. There, branching in multiple threads at the same time is also bad, but for slightly different reasons.


On a GPU you could do a parallel n-ary search. I.e. instead of making 1 test in the center, make n tests. E.g. 1 round of 32-ary search should be equivalent to 5 rounds of binary search. Not only do you avoid the branch predictor, but the memory access is also parallel, which isn't true in branchless binary search. If the width of the search is larger than a warp/wavefront (32/64), then you need to add synchronization and it might not be faster anymore.

GPUs aren't really built for latency, though, and this will waste a lot of accesses on speculation. You are probably better off with a good B-tree.


Rule of thumb #1: don't. The compiler is probably smarter than you, it can handle things itself most of the time.

Rule of thumb #2: always test low-level performance improvements on real data. This is very similar to rule #1 - the compiler might already be implementing the optimisation you're changing to, so you might be making the code less readable for no benefit.

Rule of thumb #3: sort your arrays. Sorting algorithms are among the most aggressively optimised functions in modern computer science, so sorting is a very small overhead compared to regular branch-prediction failures. Make sure to keep rules #1 and #2 in mind, but this is probably the most common significant performance improvement to be gleaned from branch prediction.



I might be off base here, but you would probably do best to not overthink this and let the compiler do its job. Attempting to pre-optimize if statements relative to a branch predictor (a very hardware-specific thing) is probably a massive fool's errand.


The compiler doesn't know how the CPU branch predictor works either. It's not especially good at optimizing this, or guessing if a branch is predictable or not, so you might as well do it yourself.


You can use the __builtin_expect builtin on GCC and clang.

It does make a difference (HPC code uses it quite a bit because of this), but as it's done manually, results obviously vary depending on processor, and there's a limit to how far you can take it. Trying not to branch (not always possible) is best...

Basically, it's like manually doing profile-guided optimisation - you look at perf or vtune, and add the above to branches which have a high penalty and which are rarely or often taken, and see if it makes a difference.


Note that this is not targeted to branch prediction, but compiler optimizations. It should be done carefully, because it tends to have a larger downside than its upside, and the changes the compiler makes based on it can be unpredictable, and unstable.

> Trying not to branch (not always possible) is best...

Branches are often faster than branch-free methods, since they can speculate certain data dependencies away.


> Branches are often faster than branch-free methods, since they can speculate certain data dependencies away.

And often they're slower, since predicting some branches is often impossible due to the data :) So the processor ends up mispredicting most of the time, thinking it knows what happened last time.

It all depends on what you're doing.


You can hint the compiler: https://en.cppreference.com/w/cpp/language/attributes/likely

Or you can use profile directed optimization, as others mentioned.

In my test recently it reduced branches as a whole by 30%, according to perf stat ./a.out


Branch prediction works predictably well when the branch is heavily biased towards one outcome, typically follows a short repeating pattern (e.g. taken once every three iterations), or is strongly correlated with another preceding (not necessarily local) branch. Conversely, branch prediction works badly when branches are data-dependent on irregular data.


My rule of thumb is to partition my data sets to make sure the conditional gives the same result in as long a run as possible.
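
E.g., a toy sketch of the classic form of this trick: group the data first so the hot branch streaks in one direction:

    #include <stdlib.h>

    static int cmp_int(const void *p, const void *q) {
        int a = *(const int *)p, b = *(const int *)q;
        return (a > b) - (a < b);
    }

    long sum_big(int *a, size_t n) {
        qsort(a, n, sizeof a[0], cmp_int);  /* group values: the branch below now
                                               changes direction only once */
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            if (a[i] >= 128)
                sum += a[i];
        return sum;
    }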


> Is it ok to have if clauses that will basically never be run?

Yep.

> But when I thought about it more: should it be improved?

Nope.

If you are an average programmer, you should write your code first to be legible and maintainable.

If you are a smart programmer, you should add performance tests to your test suite (since you need to do performance testing anyway for any production-critical code) and run your app on the appropriate-sized machine.

If you are a very smart programmer, you should rely on performance hacks when your application no longer meets performance requirements under heavy load.

If you are a freaking genius, you should think about optimizing your code.


> The chart shows funny alignment issues. It's unclear what they are caused by.
I think you've found its heartbeat. :)


> This is visible with block size 64 breaking at 3072 mark 307264=196K, and for block 32 at 6144: 614432=196K.

You mean 192K.


Markdown is a lovely and amazing solution to text formatting. Formatting fixed below:

> This is visible with block size 64 breaking at 3072 mark 3072*64=196K, and for block 32 at 6144: 6144*32=196K.

You mean 192K.


They mean 196K, aka 196 thousand. You mean 192KB.


No, they mean 192KB. Because that's how they refer to it literally every other time. Oh, and because it's NOT 196 thousand. 3072*64 = 196608. Which rounds to... 197 thousand. (But is exactly 192KB - or 192KiB if you prefer that sort of thing)


> No, they mean 192KB. Because that's how they refer to it literally every other time.

Are you suggesting that you believe that both the number and the unit were a typo?

> Oh, and because it's NOT 196 thousand. 3072*64 = 196608. Which rounds to... 197 thousand.

Truncation is a totally valid way of representing numbers when pinpoint accuracy isn't really necessary. "32K" is a very common way to refer to the number 32768, for example.

> or 192KiB if you prefer that sort of thing

I appreciate the caveat here as I very much do not, and believe that the concept of kibibytes being distinct from kilobytes is a scam perpetuated by hard-drive and floppy disk manufacturers as a way to cut costs while still advertising the same storage space. It definitely makes sense to use the same definitions for the SI prefixes, but even two decades after that ISO standard was published no-one actually says "kibibytes", so clearly it's not much of a standard.

But that's another discussion entirely.


> I appreciate the caveat here as I very much do not and believe that the concept of kibibytes being distinct from kilobytes are a scam perpetuated by hard-drive and floppy disk manufacturers as a way to cut costs while still advertising the same storage space

Except that you just said that "32768->32K" involves truncation. What is 32*1024? 32 KiB is exactly, no truncation involved, 32768.

(BTW, kibibytes are distinct from kilobytes because 1024 is distinct from 1000. Similarly, 1048576 is distinct from 1000000.)

> Truncation is a totally valid way of representing numbers [with lower precision]

No. Rounding is.

--

Once again. THEY, the authors of the post, use unit multipliers of 1024 in every other figure on the page. Except in this one particular figure where you claim they intentionally use 1000. I frankly don't care whether they define KB as 1000B or 1024B. Either way works. But they should, at the very least, be consistent within the space of a single blog post.


How about gcc's (and others') ‘__builtin_expect’, most commonly wrapped as ‘likely’/‘unlikely’ macros, to nudge the thing along in the right direction?


Why don't processors fetch and decode both branches?


Because doing so would require twice the cache and execution bandwidth. It is usually better to take the gamble and fetch just one branch. If you don't want to gamble you could stop the current thread and continue with the execution of another thread (if you have simultaneous multithreading/hyper-threading).

In some cases the compiler may decide to evaluate both paths and then "undo" the path that was wrong. This is very common on VLIW architectures like Itanium which have hardware support for this (predicates), but it can also be done with other instruction sets.


That would get exponential real quick. But Intel’s Tremont does have an instruction decoder that’s 6-wide only when decoding a taken branch, since it also starts decoding the branch target in the same cycle.





