In the old days people used #ifdef to compile things out so that the "production code" wouldn't have any unnecessary branches. I was shocked when I first saw these living if()s in server-side C++ code, but then realized they were vital for debugging in production.
People also used to prefer table-based jumps over long if-else-if chains for anecdotal performance reasons. That has gradually changed over the evolution of CPUs with various optimizations like branch prediction. Now some JIT impls like tracing JIT replace virtual function calls with if-elses.
By the way, I'm glad that the article is so thoughtfully written that it doesn't include any ambiguous "best practices" at the end. It could have created yet another myth. The three top tips in the article are more about stating facts. I admire the author for taking that stance.
Jump tables can still be vastly faster than if-else-if though it really depends on the specifics such as the frequency of each option being correct. Best practices are always subservient to benchmarks when optimizing.
The greatest advantage of jump tables is that they are flexible, within certain margins. You could populate them with entries from a config file or a database without having to go through a full release cycle.
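For instance, something like this (a hedged C sketch; the handler names are made up, and the table entries could just as well be filled in at startup from a config file or a database instead of being hard-coded):

    #include <stdio.h>

    /* Hypothetical handlers -- names are illustrative only. */
    static void handle_get(void) { puts("GET"); }
    static void handle_put(void) { puts("PUT"); }
    static void handle_del(void) { puts("DEL"); }
    static void handle_bad(void) { puts("unknown op"); }

    enum { OP_GET, OP_PUT, OP_DEL, OP_COUNT };

    /* The jump table: could be populated at startup from config instead. */
    static void (*dispatch[OP_COUNT])(void) = {
        [OP_GET] = handle_get,
        [OP_PUT] = handle_put,
        [OP_DEL] = handle_del,
    };

    void run(int op) {
        /* one indexed call instead of an if-else-if chain */
        (op >= 0 && op < OP_COUNT ? dispatch[op] : handle_bad)();
    }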
> Now some JIT impls like tracing JIT replace virtual function calls with if-elses.
I think it was about 2010ish when I first noticed a C compiler (Microsoft's) doing this optimization for function pointer calls. Might have been a link time optimization, because caller and callee were in different compilation units.
Great article, just to add one thing: conditional branches can seriously mess with the compiler's ability to optimize, especially the ability to auto-vectorize. So even if a never-taken branch is more or less free when the CPU executes it, just having it there might have resulted in much less efficient codegen. Just another thing to keep in mind with this issue.
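A rough illustration (hedged; compilers often if-convert simple cases anyway, so always check the generated code or measure):

    /* The early-exit branch can prevent auto-vectorization of the loop,
       even though it is almost never taken at run time: it makes the
       trip count data-dependent. */
    int sum_until_negative(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++) {
            if (a[i] < 0)
                break;
            s += a[i];
        }
        return s;
    }

    /* Without the branch the loop is a plain reduction and vectorizes readily. */
    int sum_all(const int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }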
This is a pretty amazing analysis that answers questions we probably all had as newbie programmers before realizing that code readability mattered more than efficiency 99.9% of the time.
The goalposts are constantly changing, but there are refactorings that achieve both.
As a student, real world performance analysis was one of the ways I kept myself engaged in otherwise sometimes tedious curricula. After college I landed a job far away from home only to discover their software was so slow I was embarrassed to be associated with it, so I got a quick self-directed study in practical optimization.
On any given day, you're much more likely to be amused by your own cleverness than anyone else is, so you either do it for your own reasons, stop doing it entirely, or find palatable forms of expression.
My most concrete example is probably the bizarre love triangle between locality of reference, lazy evaluation, and common subexpression elimination. There are several islands of sanity where reading comprehension is improved by refactoring in ways that happen to reduce or avoid calculations.
You clean up the code and it gets 5% faster. You use the same pattern in three places and you're comfortably into double digits.
As more evidence piled up for me about the dangers of confusing code, I spent more time thinking about the legibility and very little time thinking about the performance, but the fact is that I still do it intuitively. 25yo me would still be pretty happy with the performance of my code, even though <mumble>yo me gets a little concerned about that level of enthusiasm.
I appreciate this comment. We can develop these little ingrained patterns that keep things 'optimal' and at the same time think about future us having to read the code again.
I know that even 1 week future me appreciates readable code at the expense of a _little_ bit of performance.
In most applications you will not be able to get to the level of performance where any of this is going to matter.
For example, if you program in Python, there are so many more conditionals in the Python runtime itself that you will not see any difference from a couple of your own conditionals being predicted better or worse. You will also be doing a lot of other stuff (like spending time in framework code, calling libraries, performing I/O) that will make any gains from this pretty insignificant.
Where this comes into the picture is if you are really bent on improving the performance of your code on a grand scale (when you develop algorithmic trading or maybe an operating system), when you have a very dense, busy inner loop (for example video encoding), or when you develop compilers.
Lots of code doesn't run so often that it needs to be optimal. If you use proper data structures and algorithms then you'll avoid the real problems. The constant factor slowdown you get from language choice won't matter for the vast majority of lines of code.
Sloppy code would be using the wrong data structures and algorithms. Sloppy code would be optimizing nothing at all.
If you write things with good big O performance, for almost all of your lines of code you're done. It will perform fine.
Bloated messes don't come from writing a bunch of business logic in Python instead of C, not while processors are this many orders of magnitude faster than they used to be. They come from layers and layers of abstractions, or from doing things completely the wrong way.
But if you're processing tens of thousands of records in an O(n) loop, it will still be tens of thousands of times slower than processing a single record in that loop. I.e. having good big O still doesn't mean the code will be fast.
Are you saying this is a scenario where O(1) is possible? Then "good big O" is O(1). Code that is O(n) is a mistake.
If you're saying the fastest method is O(n), then sure, no method can guarantee fast processing of the entire list. But that fact sure isn't python's fault!
My takeaway was the opposite. You have a fairly generous budget of if's and code size if you use a low level language. But if you use a higher level language you could run into the limits.
Bit of a tangent but in the userland app space I'm increasingly interested in sidestepping the whole "conditionals" question with finite state machines, eg with a lib like XState.
I agree of course but I'd argue that an array lookup would be more readable and more maintainable than this `if` soup in the `getCountry` example code. One added benefit for instance is that the language will never allow you to declare two values for the same index, whereas you could very easily mess up a copy/paste and forget to update the condition.
More generally I don't think we should consider that readability and efficiency are at opposing ends of the same axis. With a bit of care and good tools it's often possible to get the best of both worlds, or close enough.
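Something along these lines (a hedged sketch; the article's actual mapping is longer, its codes start at 1, and the names here are just for illustration):

    static const char *country_name[] = {
        "unknown",          /* codes start at 1, so index 0 doubles as "unknown" */
        "United States",
        "Canada",
        "Mexico",
    };

    const char *getCountry(unsigned cc) {
        if (cc >= sizeof country_name / sizeof country_name[0])
            cc = 0;
        return country_name[cc];
    }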
This kind of statement always makes me feel funny. It's true that algorithmic complexity is the most important part, you always have to measure, and micro optimisations are often not worthwhile. But this type of statement seems a lot stronger than that.
I'm a graphics programmer working in game development and most of my time is spent on our performance task force. I spend a lot of time thinking about the performance impact of things I write and revisiting things I've written to improve their performance.
My job would have been easier if more people spent a bit more time considering the performance impact of their solutions, as well.
Your job is very different than mine. Gaming/graphics are their own domain with different priorities. When your goal is to process data on the scale of several seconds, readability matters more. When your goal is performance per frame, you will definitely have different priorities.
If you care about performance and the number of if's, you should look into profile-guided optimization. This can help the compiler arrange the code such that branches like the "if (debug)" in the example are never taken and so don't occupy precious resources. This will also make instruction fetch faster, and if the body of the if is placed on another cache line it can also improve instruction cache performance (and even TLB performance and I/O performance if the unused code is placed into another page).
> This code is in a performance critical loop and it looks like a waste - we never run with the "debug" flag enabled[1].

1. One historical solution to this specific 'if debug' problem is called "runtime nop'ing".
You know what another historical solution is called? Use a compile-time constant and let the damn optimizer optimize. (Virtually) no modern, widely used compiler would fail to optimize out `if (0) { ... }`.
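A minimal sketch of that, assuming the flag can be fixed at build time:

    #include <stdio.h>

    /* With a compile-time constant the test folds to if (0) and the whole
       body is dropped once optimizations are on. */
    enum { DEBUG = 0 };

    void hot_path(int x) {
        if (DEBUG)
            fprintf(stderr, "x = %d\n", x);
        /* ... real work ... */
    }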
Also, why does M1 performance matter to Cloudflare's software? I kinda doubt their servers are on Macbooks.
The “if (debug) log” example in the article is likely compiled to an always taken conditional jump past the logging call. And therefore not quite free. Nitpicking of course, I love such in-depth articles.
You can easily verify this in your specific code base but I believe the compiler prefers not-taken jumps for paths it predicts to be less frequently taken.
You can provide hints that the path is a cold one to ensure the compiler uses the most efficient conditional branch layout. On some architectures the compiler will also provide this hint to the CPU (with a prefix, the instruction it selects, or a flag bit in the instruction) which avoids the perf hit on the first iteration.
Both Clang and GCC recognize __builtin_expect which can be wrapped in a macro easily:
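    /* The customary definitions: */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* e.g. */
    if (unlikely(debug)) {
        log("...");
    }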
I feel like I'm missing something in the conclusion.
The initial question is:
    if (debug) {
        log("...");
    }
> Is it ok to have if clauses that will basically never be run?
And the summary is:
> If it's never-taken, it's probably ok. I found no evidence that such branches incur any extra cost. But do avoid always-taken branches and function calls.
However, if debug is false in the initial example, the branch is always taken, it's jumping over the call to log. Such code incurs a penalty - correct?
Not that this at all answers your question, and I'm guessing you know this already, but for a language with a preprocessor, that should be something more like:
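    /* compiled in only when built with -DDEBUG */
    #ifdef DEBUG
        log("...");
    #endif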
Given appropriate likeliness annotations or profile-guided optimization, it's possible for the compiler to _invert_ the conditional and move the log statement to a "cold" section of code outside of the main flow, making it a never-taken branch.
No need to hard code the length, and the cast is guaranteed to be within the bounds of an int on all platforms as long as you don't go over 2^16-1 countries.
I think that's the best solution (the only potential optimization I can think of would be to add a padding byte after every country to force 32bit alignment which would make the offset computation faster on some architectures, although probably not on x86).
Style-wise, I think the best solution would be to write the array like:
This way the codes are explicit when you read the code and it makes editing the array a little easier. Unfortunately gcc only warns if you set the same index twice with -Wextra; it remains silent with -Wall.
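    /* illustrative codes -- the real list is much longer */
    static const char countries[][3] = {   /* two letters + the NUL byte each */
        [1] = "US",
        [2] = "CA",
        [3] = "MX",
    };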
> This way the codes are explicit when you read the code
Good call. It might have caught the off-by-one mistake I made (the codes start from 1 and not 0 in the blog post). Maybe even switch to using an enum as our index instead of an int.
Well caught! I have two off-by-one errors :D Seems they start from 1 instead of 0, and I didn't account for the null byte :D Your solution is nicer and easier to read :)
This jibes with the overall point of the article: keep it simple and readable. The semi-fancy indexing caused two errors, and the upside is negligible after the compiler gets done with it.
That instruction clears the upper 4 bytes of the rdi register. Note the next instruction uses rdi in the address.
The edi register is the lower 4 bytes of rdi. Instructions which write these smaller pieces zero out the unused higher bytes of the destination register. This helps with performance because it eliminates data dependencies on the old values in those higher bytes.
This is not a NOP; it explicitly clears the upper 32 bits of RDI since the compiler does not know that they are zero in this situation. If you change cc from an int to size_t (long on x86-64) the compiler will generate:
It’s just a two-byte NOP. IIRC, it’s Intel’s recommended form for one of that length. Windows uses it for hot patching [0], but I can’t imagine that’s the reason here.
Does a ternary actually eliminate a logic branch? I always assumed it was just high level shorthand, and would be the same in assembly as writing out the if block.
And to make it more complicated, both "regular" ifs and ternaries may not generate a branch at all, e.g. by using the CMOV (conditional move) instruction on x86. This avoids a branch but can obviously not eliminate the data dependency between the condition and a later consumer of the result.
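For example (a sketch; whether the compiler actually picks cmov depends on the target and its cost model):

    int max_of(int a, int b) {
        return a > b ? a : b;   /* gcc/clang commonly emit cmov here on x86-64 */
    }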
Following that line, it's never exactly been syntactic sugar. When initializing a const local, you can use the ternary operator on the right hand side of the assignment, but you can't get the same effect using if (short of writing a new function). Similarly, C++ lets you use the ternary operator for the left hand side of an assignment, (a ? b : c) = 42; although you could use an if for this pretty easily.
> I'm wondering how you guys would optimize that code ?
Run it in an infinite multiverse, and construct a mechanism to destroy the Universe every time the branch predictor does not predict every single conditional correctly.
The branch predictor can select a random branch, so it doesn't even need to actually predict anything. It should use quantum noise so that it has a chance of selecting different branches in different copies of the universe. There is nothing to compute, quantum or otherwise.
The bomb to destroy the universe will take care of all universes where it made wrong predictions, so that is where we should concentrate our development resources.
It has more utility than just branch prediction. Imagine the bomb going off automatically whenever a war starts.
Any copy of the universe that starts a war is automatically eliminated and this guarantees that you, as an observer, will never observe any wars.
The language Elixir, during its compilation phase, actually automatically removes these in the production environment. That is to say, it is a macro which behaves as expected in every environment but "prod", in which case it removes itself.
> conditionals
A number of years ago I used Ruby to experiment with writing a version of fizzbuzz that avoided conditionals entirely, and was purely functional:
(I actually regret using currying here because it hurts the portability of the algorithm)
While it may be difficult (if possible) to convert all conditional logic to functional logic in a given algorithm, perhaps a compiler could do the work, if certain patterns emerged.
I'm not a C guy, but I have to wonder if such an algorithm would run faster on Intel or ARM chips by avoiding all branching and branch prediction. Can someone chime in on this?
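For what it's worth, a branch-free selection is also easy in plain C (a hedged sketch, not the Ruby version above; the comparisons typically compile to flag-setting instructions rather than jumps, though the loop itself still branches):

    #include <stdio.h>

    int main(void) {
        /* index 0..3 chosen arithmetically from the two divisibility tests;
           the excess argument to printf is simply ignored for the non-"%d"
           formats, which the standard allows */
        static const char *fmt[] = { "%d\n", "Fizz\n", "Buzz\n", "FizzBuzz\n" };
        for (int i = 1; i <= 100; i++)
            printf(fmt[(i % 3 == 0) + 2 * (i % 5 == 0)], i);
        return 0;
    }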
> The language Elixir, during its compilation phase, actually automatically removes these in the production environment. That is to say, it is a macro which behaves as expected in every environment but "prod", in which case it removes itself.
I mean, you can do this in C too. Make "debug" an #ifdef, and when it is defined to "false" the compiler will obviously optimize that out.
There's a bunch of things you can do in C to "clue in" the compiler on what your code is going to do.
The easiest is using the preprocessor to hardcode values such that branches can be eliminated. This is nice and all but it can start to cause problems with things getting confusing.
---
The other common thing you can do is provide compiler attributes to hint at what the code is doing. For example, you can specify that a function is pure (doesn't affect the observable state of the program) or const (pure attribute with the addition that the function is not affected by changes to anything other than the inputs. i.e. same inputs always give you the same output). There are also cold/hot attributes for improving branch prediction.
Similarly there is the leaf attribute which restricts the control flow of a function largely to the current translation unit and allows the compiler to deduce significantly more information about what the code is intending to do.
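For reference, the GCC/Clang spellings look roughly like this (the function names are made up for illustration):

    #include <stddef.h>

    __attribute__((const)) int square(int x);                  /* depends only on its arguments */
    __attribute__((pure))  size_t my_strlen(const char *s);    /* may read memory, no side effects */
    __attribute__((cold))  void report_error(const char *msg); /* rarely executed path */
    __attribute__((hot))   void handle_request(void);          /* hot path: optimize and lay out accordingly */
    __attribute__((leaf))  int external_helper(int x);         /* won't call back into this translation unit */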
---
Now on the more fancy/dangerous side is using strict aliasing, the restrict keyword, and array parameters in functions. Strict aliasing tells the compiler that types can only contain what they say they contain (i.e. any two pointers for the same location in memory must have the same type).
Likewise the restrict keyword states that any memory accessed by a restrict qualified pointer can only be accessed by that pointer or by pointers derived from said pointer (member accesses, array access, pointer sliding/offsets). This allows the compiler to know that memory isn't being touched by other accesses (which without restrict it could be). Realistically this allows you some small to large performance gains any time you are interacting with more than one pointer at a time as the alternative is that the compiler may have to recheck every cached value any time you write to another pointer. An example would be `f(char * restrict x, char * restrict y)`. Here you know that no value ever possibly accessed or written to in x will affect any value ever possibly accessed or written to in y and vice versa. Note that the restrict keyword is valid on variables, pointers, members, and arrays.
In a similar vein, you can leverage array arguments in functions to guarantee to the compiler that a pointer argument is non-null and contains some number of elements. For example, the difference between the functions `f(char * x)`, `g(char y [5])`, and `h(char z [static 5])` is that the compiler can't necessarily know for sure that x is a valid address in memory or how many elements it contains. The compiler does however know that the variable y contains exactly 5 elements and that the variable z contains at minimum 5 elements. The compiler can now potentially elide null checks and bounds checks (say for null terminated strings). Since the function is aware of this, it may help a bit, but it benefits more in that with inlined functions, macros, and LTO the compiler can optimise within the scope of said function trees to elide those checks/branches. Something to note, because people don't always pick up on it: array arguments do in fact work with size 1, which allows you to specify that all your arguments are guaranteed valid pointers. Basically, for any internal functions you probably always want to use an array argument over a bare pointer so you can elide null checks and the like.
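Putting the last two together, a hedged sketch of what this buys you:

    /* restrict: dst and src never overlap; "static 16": both point to at
       least 16 valid elements.  Together the compiler can assume non-null
       pointers, skip re-loading values after stores, and vectorize freely. */
    void saturating_add16(unsigned char dst[restrict static 16],
                          const unsigned char src[restrict static 16])
    {
        for (int i = 0; i < 16; i++) {
            unsigned sum = dst[i] + src[i];
            dst[i] = sum > 255 ? 255 : (unsigned char)sum;
        }
    }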
---
Combining all of the above in a C project can sufficiently restrict the search space so that the compiler can elide most "unused" code sections, restructure a decent bit of branching code into branch free forms (since it can be deduced equivalent at compile time), and reasonably tag the branch weights for the remaining branches.
TLDR: Yes; however, some of the techniques come with the downside that if the code doesn't fit the requirements of the technique, the optimisations can introduce bugs into the program. There are some levels of compiler warnings to help you when using these features, but they only catch the obvious cases (since if they could catch all of them, they could implement the techniques automatically). It's C, it comes with its footguns, but if you know how to use them you'll probably only be burned a few times and hopefully not too badly.
If you have perf critical branches in your code, you should try and make them as predictable as possible. Sometimes that means using a different approach that is algorithmically worse, but has much more predictable branches.
Classic example: linear search is faster than binary search for “small” lists. The item == myItem branch is only taken once, at the end. Meanwhile binary search will take the branches of its comparison (item < myItem, item > myItem) in equal proportion to each other, so the branch predictor is stuck at a 50% guess for that branch. There is a great talk on this but I can’t remember what it’s called...
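Roughly the shape being described (a sketch):

    /* For small n this often beats binary search: the "not found yet" branch
       goes the same way on almost every iteration, so the predictor nails it,
       and the memory access is sequential.  Binary search's comparison is
       close to a coin flip on random queries. */
    int linear_find(const int *a, int n, int key) {
        for (int i = 0; i < n; i++)
            if (a[i] == key)
                return i;
        return -1;
    }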
As you point out, the exact properties that make binary search algorithmically fast can slow it down on real computer hardware.
Branch predictors like low entropy (unsurprising) branches, but binary search is algorithmically fast because it maximizes the entropy (information gained) from the few branch checks that it does make.
Another interesting question where one could sink a lot of time is how to do binary search on GPUs using multiple threads. There, branching in multiple threads at the same time is also bad, but for slightly different reasons.
On a GPU you could do a parallel n-ary search. I.e. instead of making 1 test in the center, make n tests. E.g. 1 round of 32-ary search should be equivalent to 5 rounds of binary search. Not only do you avoid the branch predictor, but the memory access is also parallel, which isn't true in branchless binary search. If the width of the search is larger than a warp/wavefront (32/64), then you need to add synchronization and it might not be faster anymore.
GPUs aren't really built for latency though and this will waste a lot of accesses on speculation. You are probably better off with a good btree.
Rule of thumb #1: don't. The compiler is probably smarter than you, it can handle things itself most of the time.
Rule of thumb #2: always test low-level performance improvements on real data. This is very similar to rule #1 - the compiler might already be implementing the optimisation you're changing to, so you might be making the code less readable for no benefit.
Rule of thumb #3: sort your arrays. Sorting is one of the most aggressively-optimised operations in modern computing; it adds a very small overhead compared to regular branch prediction failures. Make sure to keep rules #1 and #2 in mind, but this is probably the most common significant performance improvement to be gleaned from branch prediction.
I might be off base here, but you would probably do best to not overthink this and let the compiler do its job. Attempting to pre-optimize if statements relative to a branch predictor (a very hardware-specific thing) is probably a massive fool's errand.
The compiler doesn't know how the CPU branch predictor works either. It's not especially good at optimizing this, or guessing if a branch is predictable or not, so you might as well do it yourself.
You can use the __builtin_expect keyword on GCC and clang.
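For example (placeholder names):

    /* "err" and "handle_error" are placeholders; the second argument to
       __builtin_expect is the expected value of the condition (0 = "expect false") */
    if (__builtin_expect(err != 0, 0)) {
        handle_error(err);
    }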
It does make a difference (HPC code uses it quite a bit because of this), but as it's done manually, results obviously vary depending on processor, and there's a limit to how far you can take it. Trying not to branch (not always possible) is best...
Basically, it's like manually doing profile-guided optimisation - you look at perf or vtune, and add the above to branches which have a high penalty and which are rarely or often taken, and see if it makes a difference.
Note that this is not targeted at branch prediction, but at compiler optimizations. It should be done carefully, because it tends to have a larger downside than its upside, and the changes the compiler makes based on it can be unpredictable and unstable.
> Trying not to branch (not always possible) is best...
Branches are often faster than branch-free methods, since they can speculate certain data dependencies away.
> Branches are often faster than branch-free methods, since they can speculate certain data dependencies away.
And often they're slower, since predicting some branches is often impossible due to the data :) So the processor ends up mispredicting most of the time, thinking it knows what happened last time.
Branch prediction works predictably well when the branch is heavily biased towards one outcome, typically follows a repeating pattern of a short length (eg. taken once every three iterations), or is strongly correlated with another preceding (not necessarily local) branch. Conversely, branch prediction works badly when they are data dependent on irregular data.
> Is it ok to have if clauses that will basically never be run?
Yep.
> But when I thought about it more: should it be improved?
Nope.
If you are an average programmer, you should write your code first to be legible and maintainable.
If you are a smart programmer, you should add performance tests to your test suite (since you need to do performance testing anyway for any production-critical code) and run your app on the appropriate-sized machine.
If you are a very smart programmer, you should rely on performance hacks when your application no longer meets performance requirements under heavy load.
If you are a freaking genius, you should think about optimizing your code.
No, they mean 192KB. Because that's how they refer to it literally every other time. Oh, and because it's NOT 196 thousand. 3072*64 = 196608. Which rounds to... 197 thousand. (But is exactly 192KB - or 192KiB if you prefer that sort of thing)
> No, they mean 192KB. Because that's how they refer to it literally every other time.
Are you suggesting that you believe that both the number and the unit were a typo?
> Oh, and because it's NOT 196 thousand. 3072*64 = 196608. Which rounds to... 197 thousand.
Truncation is a totally valid way of representing numbers when pinpoint accuracy isn't really necessary. "32K" is a very common way to refer to the number 32768, for example.
> or 192KiB if you prefer that sort of thing
I appreciate the caveat here as I very much do not, and believe that the concept of kibibytes being distinct from kilobytes is a scam perpetuated by hard-drive and floppy disk manufacturers as a way to cut costs while still advertising the same storage space. It definitely makes sense to use the same definitions for the SI prefixes, but even two decades after that ISO was published no-one actually says "kibibytes", so clearly it's not much of a standard.
> I appreciate the caveat here as I very much do not, and believe that the concept of kibibytes being distinct from kilobytes is a scam perpetuated by hard-drive and floppy disk manufacturers as a way to cut costs while still advertising the same storage space
Except that you just said that "32768->32K" involves truncation. What is 32*1024? 32 KiB is, with no truncation involved, exactly 32768.
(BTW, kibibytes are distinct from kilobytes because 1024 is distinct from 1000. Similarly, 1048576 is distinct from 1000000.)
> Truncation is a totally valid way of representing numbers [with lower precision]
No. Rounding is.
--
Once again. THEY, the authors of the post, use unit multipliers of 1024 in every other figure on the page. Except in this one particular figure where you claim they intentionally use 1000. I frankly don't care whether they define KB as 1000B or 1024B. Either way works. But they should, at the very least, be consistent within the space of a single blog post.
How about gcc's (and others') ‘__builtin_expect’, which is most commonly wrapped in ‘likely’/‘unlikely’ macros, to nudge the thing along in the right direction?
Because doing so would require twice the cache and execution bandwidth. It is usually better to take the gamble and fetch just one branch. If you don't want to gamble you could stop the current thread and continue with the execution of another thread (if you have simultaneously multithreading/hyper threading).
In some cases the compiler may decide to evaluate both paths and then "undo" the path that was wrong. This is very common on VLIW architectures like Itanium which have hardware support for this (predicates), but it can also be done with other instruction sets.
That would get exponential real quick. But Intel’s Tremont does have an instruction decoder that’s 6-wide only when decoding a taken branch, since it also starts decoding the branch target in the same cycle.