
Mispredicted branches can multiply your running times - ingve
https://lemire.me/blog/2019/10/15/mispredicted-branches-can-multiply-your-running-times/
======
kstenerud
This is the main reason why I've codified a bunch of bit tricks so I don't
forget them [1]. They're often not worth using, but can sometimes work as a
last-mile optimization after you've done all your algorithmic changes, cache
locality, sizing, alignment, etc.

[1] [https://github.com/kstenerud/bit-tricks](https://github.com/kstenerud/bit-tricks)

~~~
omgtehlion
> [https://github.com/kstenerud/bit-tricks/blob/master/bytes_required_for_bits.h](https://github.com/kstenerud/bit-tricks/blob/master/bytes_required_for_bits.h)

looks like over-engineering of (bits + 7) >> 3

~~~
skrimp
Your solution can overflow.

~~~
omgtehlion
this won't: (bits + 7LL) >> 3

------
mac01021
I am not a computer engineer and I always wondered:

What if we had a modern, pipelined CPU architecture with no branch predictor,
which instead unconditionally executed the X instructions immediately
following every branch before (possibly) proceeding with the branch's target
instruction?

Would compilers and programmers be able to find most of the efficiencies that
we currently rely on the branch predictor to find? What would be the common
cases where it would be hard to do that?

How much circuitry would we save by leaving out the predictor? Enough to allow
a measurable speedup in the CPU clock speed?

~~~
gpderetta
You just reinvented delay slots [1].

Two issues:

\- It is not easy for the compiler to always fill even a single delay slot.
Filling dozens of them (as required by a deeply pipelined modern processor)
would be significantly harder.

\- The number of delay slots would depend on the depth of the CPU pipeline. If
you do not want to expose microarchitectural details in your ISA, you need
either JIT or install-time specialization. edit: or stick with a potentially
suboptimal number.

[1]
[https://en.wikipedia.org/wiki/Delay_slot](https://en.wikipedia.org/wiki/Delay_slot)

~~~
planteen
Oh this takes me back to a horrible time in my life when I was tracking down a
terrible intermittent crash on a LEON (SPARC) processor. I was down at the
assembly level, stepping over instructions. It was a bit mind bending to
always have an instruction execute AFTER the branch (you wouldn't know if the
branch was taken or not until AFTER the delay slot instruction). Often the
delay instruction was a NOP, showing the difficulty in filling the delay slots
as you mentioned. If you were really lucky, your branch would also result in a
register window overflow. By the time handling that was done, you lost your
mental context of what was going on. _shudder_

~~~
gpderetta
Right, SPARC has both register windows and branch delay slots, truly a great
collection of architectural dead ends :).

------
saagarjha
> If you are accessing the content of an array, many languages will add “bound
> checking”: before accessing the array value, there will be a hidden check to
> see whether the index is valid. If the index is not valid, then an error is
> generated, otherwise the code proceeds normally. Bound checks are
> predictable since all accesses should (normally) be valid. Consequently,
> most processors should be able to predict the outcome nearly perfectly.

Warning! This is just a stone's throw from the kind of thing that got us
Spectre, so keep in mind this sort of optimization can occasionally come back
to bite you…

~~~
rdc12
At the software level you don't get a choice about this, though: the CPU is
going to speculate past a bounds check regardless. The optimization is at a
different abstraction layer.

Pretty much anything on the market with the compute power of a cellphone is
going to have a feature like this, too.

------
Const-me
> However, there are tricks that compilers cannot use safely.

Why not? VC++ emits cmov quite often, for the ternary operator (?:) and similar code.

------
dvhh
Related Stack Overflow example:

[https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array](https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array)

------
ncmncm
> most loops are actually implemented as branches.

No kidding!

But the article does call attention to a very important technique. It merits
serious thought over all the ways it can be applied, that may not look much
like this one, on the surface.

~~~
ithkuil
> no kidding

Well, that also means that some loops are fully unrolled :-)

~~~
mac01021
And some loops use only unconditional jumps.

~~~
ithkuil
Yes, those of the infinite variety

~~~
0xdeadbeefbabe
Haven't seen any, and I never expect to. (will it halt? == yes)

~~~
ithkuil
Sure, pulling the plug is a perfectly valid mechanism for terminating a loop,
albeit an external one :-)

------
XJ6
And just like that, a micro-optimisation has introduced a bug.

If the last number generated is even, it will still appear in the result set
in the new code, and not in the old code.

~~~
lifthrasiir
The code assumes that, once the loop finishes, out[0] through out[index-1]
contains the desired output. out[index] is _not_ a part of that output.

~~~
XJ6
Then you've got a bug which crashes if all your results were even and index is
left zero.

~~~
kstenerud
The original code can end with index = 0 as well. The only difference is that
in the optimized code, it will write to index 0, whereas in the original it
will not.

~~~
XJ6
In the original code, an index = 0 would not be a problem.

Here, trying to access out[index-1] would error.

Yes, you could then guard against that of course, but then that's even more
code to maintain.

If this code is a critical hot-path then sure, micro-optimizations can make
sense but doing so without over-commenting and a rigorous test suite to catch
introduced bugs is a recipe for disaster.

~~~
kstenerud
I think you may have misread the code:

    while (howmany != 0) {
        val = random();
        if (val & 1) {            /* val is odd */
            out[index] = val;
            index += 1;
        }
        howmany--;
    }

vs

    while (howmany != 0) {
        val = random();
        out[index] = val;         /* unconditional store */
        index += (val & 1);       /* advance only if val was odd */
        howmany--;
    }


Both of these store a list of odd numbers in out[], with "index" containing
the resulting count of how many numbers are in out[]. Both will have an
"index" (count) value of 0 if all inputs were even, and neither attempts to
access out[index-1].

~~~
XJ6
You were right, I misinterpreted "out[0] to out[index-1]", but your next
statement:

> count of how many numbers are in out[]

is not true, in the latter case it's a count of how many numbers you want to
be in out[].

Consider what happens if howmany is 1 and it generates a single even number.
In the original you have an empty array and index 0, in the newer one you have
an array e.g. [2] and an index 0.

Yes, you can solve this by 'manually' returning only the first "index"
elements, but that's asking for trouble and a source of bugs, as seen in how
your own description of the code differed from what it actually does.

You might not consider that to matter, but the point I'm trying to make isn't
just nitpicking, it's that there are certain expected behaviours from code,
and an array which is "all odd numbers, except maybe all odd but one last even
number" is bound to lead to broken expectations.

~~~
kstenerud
> > count of how many numbers are in out[]

> is not true, in the latter case it's a count of how many numbers you want to
> be in out[].

I'm failing to see how "index" is the count of how many numbers you want to be
in out[] rather than the count of how many numbers actually ARE in out[], or
why this would be different between code examples.

> Consider what happens if howmany is 1 and it generates a single even number.
> In the original you have an empty array and index 0, in the newer one you
> have an array e.g. [2] and an index 0.

Yes, that's the point, as was explicitly stated in the article.

> Yes, you can solve this by 'manually' only returning the first "index"
> elements but it's just asking for trouble and a source of bugs as seen in
> your attempt to describe it have a difference between the reality and your
> description.

What is there to solve? The end result in both cases is that variable "index"
is 0, because 0 of the values in out[] are valid.

> You might not consider that to matter, but the point I'm trying to make
> isn't just nitpicking, it's that there are certain expected behaviours from
> code, and an array which is "all odd numbers, except maybe all odd but one
> last even number" is bound to lead to broken expectations.

Yes, and the expectation is that "index" tells you where the first garbage
element is in the array (in both examples). You're either going to have an
array with all garbage (index = 0), all garbage except the first element
(index = 1), all garbage except the first and second elements (index = 2), and
so on. What actual value sits in a garbage index (even or odd or 0 or 1 or 2
or Martin Luther's birth year) is irrelevant, because you're not going to
read from those indices.

------
naikrovek
This is one of those things that is completely lost on someone who has never
written in a low level language. I automatically assume JavaScript developers
to be completely oblivious to this entire class of software development
knowledge.

It is important to understand your platform all the way down to the CPU,
including things like branch prediction and caches if you want to have
performant software.

Software has been getting slower more rapidly than hardware has been getting
faster for nearly a decade, and the free performance gains that software
people have taken advantage of when better hardware is made available are just
about exhausted.

It's time to learn your platforms, software people.

[later addition] This kind of thing is also one of the reasons that teaching
OOP principles as they are taught today is so bad for software performance.
Modeling object relationships to match the real world will, in every non-toy
program, produce object structures that are actively unfriendly to cache
efficiency, and will therefore produce software which performs very poorly
when compared to software that was written with the hardware platform in mind.

~~~
vsareto
>It is important to understand your platform all the way down to the CPU,
including things like branch prediction and caches if you want to have
performant software.

Let's not drastically increase job requirements for no good reason.

>It's time to learn your platforms, software people.

Many of these platforms have undocumented CPU instructions, so until you get a
full accounting of that, what's the point? You can't learn the platform fully
if they keep that a secret.

Secondly, we've had CPU-level issues like spectre and meltdown introduced that
affected performance in some cases. We can't even trust the platform makers to
get it right!

~~~
naikrovek
> Let's not drastically increase job requirements for no good reason.

Well, I would say that it's a very good reason, and that learning about branch
prediction and caches is not a "drastic" step by any means.

Is there _any_ software that you write whose users would not be made happier
if the software performed better? Any at all?

> Many of these platforms have undocumented CPU instructions

You don't need to know the secrets of a platform to understand branch
prediction and caches, or to use that knowledge to produce software that
performs far better than software written without that knowledge. You don't
need to know the platform on a logic gate level, you need to understand how
the platform executes your code, so that you can take advantage of the
strengths of the platform. You don't need to know any hidden instructions or
secrets to take advantage of the platform.

> we've had CPU-level issues like spectre and meltdown introduced that
> affected performance in some cases

Those things didn't affect performance; the fixes for them did. The fixes
also required no code changes outside of the firmware and the operating
system, and knowing the platform is still the best way to write performant
software, no matter what is going on underneath to avoid hardware
vulnerabilities.

~~~
vsareto
>Is there any software that you write whose users would not be made happier if
the software performed better? Any at all?

I'd say security is a bigger issue than performance most of the time.

And most gains are going to happen within the code itself by, e.g., not
writing an n^2 solution when there's a log(n) one, or something similar.

Plus we're talking about JavaScript, and that's likely to be software with
network concerns, so your optimizations might be a rounding error compared to
performance degradation from slow network connections.

>You don't need to know any hidden instructions or secrets to take advantage
of the platform.

You don't know what they do though. Some of those ops could be more
advantageous to performance to use in some cases. You can't fully know the
platform if there are secret ops.

You can still make optimizations with partial knowledge, but don't pretend to
know the platform when the platform manufacturer doesn't tell you everything
about it.

>The fixes also required no code changes outside of the firmware and the
operating system, and knowing the platform is still the best way to write
performant software, no matter what is going on to avoid hardware
vulnerabilities.

Good algorithm knowledge and practice is the most cost-effective way of
writing performant code and is more than likely going to be the lion's share
of issues.

~~~
kasey_junk
> Good algorithm knowledge and practice is the most cost-effective way of
> writing performant code

I’d love to see that common thought validated because in practice I’ve seen it
to not be true at all.

There are lots of cases where the complexity effects of the algorithm are
swamped by cache effects. In fact basic foundational assumptions about
complexity analysis are dangerously untrue on modern systems.

In my experience in either high-throughput or low-latency systems,
algorithmic complexity is never the issue. It’s always cache coherence, CPU
prefetching/prediction, lock contention, or over-copying of data.

~~~
vsareto
>I’d love to see that common thought validated because in practice I’ve seen
it to not be true at all.

If you have Javascript devs that came out of some boot camp with no knowledge
of either, do you teach algorithms 101 or low-level CPU programming 101 first?

I would argue your codebase would benefit from teaching them algorithms first,
then the other one.

>In my experience in either high throughput or low latency systems algorithmic
complexity is never the issue.

Do you encounter those workloads written in Javascript often?

~~~
emsy
>If you have Javascript devs that came out of some boot camp

Do you also go to a doctor that came out of a bootcamp? Is your house built by
a contractor that came out of a bootcamp? Would you fly with an aviator that
came out of a bootcamp? Would you run banking software made by a developer
that came out of a bootcamp?

~~~
NotATroll
So how, precisely, is anyone supposed to get experience if it's unacceptable
to hire them fresh out of "boot camp"?

Even if you go "INTERNSHIP". Well what, is everyone supposed to stick the
unpaid intern on toy apps that don't give them any actual real world
experience writing actual production software?

Cause then the next argument will just be "Do you also go to a doctor that
came out of an internship? Is your house built by a contractor that came out
of an internship?".

Elitism at its finest.

~~~
emsy
>Elitism at its finest.

Do you think doctors in training get to do open heart surgeries by themselves
fresh out of university? Do you think we train aviators that can only fly
using the autopilot? Because that's the way we treat software developers. This
has nothing to do with elitism and everything with professionalism. Our
industry has built training wheels in form of various VMs and high level
languages because it missed the opportunity to properly train its workforce.

~~~
NotATroll
Again, just more insane elitism.

You keep going to the "OH MY GOD, PEOPLE WOULD DIE" examples to try and make a
fairly weak point.

No one is going to die, because some noob made a crappy little site out of the
millions of crappy little sites, and it's not performing like a demi-god.

VMs and high level languages aren't "training wheels". Especially not VMs;
that's just complete and utter nonsense. Unless you think literally every
website on the web should have a 100% dedicated server box.

VMs are good for a great many things, both noob-friendly and not.

As for high level languages, they were meant for one particular thing. To get
a task done quickly. Which is largely the real reason why so much software out
in the wild performs like crap.

Anyone can sit down and spend years making a highly performant piece of
software. But when things have to move fast, corners get cut and there's not
enough time for the research needed to make the product as performant as it
could be.

~~~
emsy
Just to be clear: VM means a language VM like the JVM. I don’t have a problem
with high level languages in general, but the gains in developer efficiency
tend to be paid for by the CPU.

No I don’t think people die (although I wouldn’t be surprised if that was the
case). I just don’t want to have to buy a $3000 PC so I can run a fucking chat
app, an editor and a browser somewhat decently. The opportunity cost of bad
software is paid by billions of users every day.

~~~
pjmlp
Yet all that performance optimization gets thrown away when running on
virtualized containers alongside sanitizers.

By the way, my language VM is your language runtime.

~~~
emsy
Yes runtime is the better word and the JVM was a bad example anyway. There are
much worse offenders when it comes to throwing away CPU power.

