Mispredicted branches can multiply your running times (lemire.me)
114 points by ingve on Oct 16, 2019 | 110 comments


This is the main reason why I've codified a bunch of bit tricks so I don't forget them [1]. They're often not worth using, but can sometimes work as a last-mile optimization after you've done all your algorithmic changes, cache locality, sizing, alignment, etc.

[1] https://github.com/kstenerud/bit-tricks


> https://github.com/kstenerud/bit-tricks/blob/master/bytes_re...

looks like over-engineering of (bits + 7) >> 3


Your solution can overflow.


this won't: (bits + 7LL) >> 3
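and neither will this, even at the top of the size_t range (a sketch; the helper name is mine): shift first, then round up from the remainder, so the intermediate bits+7 never happens:

    #include <stddef.h>

    /* Round a bit count up to whole bytes without overflow:
       divide first, then add one byte if any remainder bits exist. */
    static size_t bytes_for_bits(size_t bits) {
        return (bits >> 3) + ((bits & 7) != 0);
    }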



I am not a computer engineer and I always wondered:

What if we had a modern, pipelined CPU architecture with no branch predictor, which instead unconditionally executed the X instructions immediately following every branch before (possibly) proceeding with the branch's target instruction?

Would compilers and programmers be able to find most of the efficiencies that we currently rely on the branch predictor to find? What would be the common cases where it would be hard to do that?

How much circuitry would we save by leaving out the predictor? Enough to allow a measurable speedup in the CPU clock speed?


You just reinvented delay slots [1].

Two issues:

- It is not easy for the compiler to always fill even a single delay slot. Filling dozens of them (as required for a deeply pipelined modern processor) would be significantly harder.

- The number of delay slots would depend on the depth of the CPU pipeline. If you do not want to expose microarchitectural details in your ISA, you need either a JIT or install-time specialization. edit: or stick with a potentially suboptimal number.

[1] https://en.wikipedia.org/wiki/Delay_slot


Oh, this takes me back to a horrible time in my life when I was tracking down a terrible intermittent crash on a LEON (SPARC) processor. I was down at the assembly level, stepping over instructions. It was a bit mind-bending to always have an instruction execute AFTER the branch (you wouldn't know whether the branch was taken until AFTER the delay-slot instruction). Often the delay instruction was a NOP, showing the difficulty of filling the delay slots, as you mentioned. If you were really lucky, your branch would also result in a register window overflow. By the time handling that was done, you'd lost your mental context of what was going on. shudder


Right, SPARC has both register windows and branch delay slots, truly a great collection of architectural dead ends :).


Plus, the delay slots are typically _fixed_ at 1-5 cycles depending on the CPU architecture. The problem is that sometimes the real delay could easily be 200+ cycles (TLB/cache reload, bus contention, etc.) and _variable_.

There's no way you could cover all variable latency scenarios with fixed delay slots, unless you have a highly specific scenario like a GPU where you control all the internals.


Not a computer engineer either, but I did do some GPGPU in college (~6 years ago).

Notably, branching on a flag f on the GPU was so slow that most of the time it was faster to (manually) compute both branches b1 and b2 and then calculate the result as r = f * b1 + (1-f) * b2 (where f is either 0 or 1).
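For concreteness, a minimal scalar C sketch of the same trick (values invented for illustration):

    #include <stdio.h>

    int main(void) {
        int x  = 7;
        int f  = x & 1;                  /* flag: 1 if odd, 0 if even */
        int b1 = x * 2;                  /* "then" result, always computed */
        int b2 = x / 2;                  /* "else" result, always computed */
        int r  = f * b1 + (1 - f) * b2;  /* arithmetic blend, no branch */
        printf("%d\n", r);               /* prints 14 */
        return 0;
    }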


Just like you can do on x86/ARM SIMD. It often pays to compute both sides and mask out the undesired results.
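A sketch of what that looks like with SSE2 intrinsics (the operation is arbitrary; it computes r[i] = a[i] > 0 ? 2*a[i] : -a[i] across four lanes):

    #include <emmintrin.h>  /* SSE2 */

    __m128i blend_example(__m128i a) {
        /* All-ones lanes where a > 0, all-zeros elsewhere. */
        __m128i mask   = _mm_cmpgt_epi32(a, _mm_setzero_si128());
        __m128i then_v = _mm_add_epi32(a, a);                    /* 2*a, every lane */
        __m128i else_v = _mm_sub_epi32(_mm_setzero_si128(), a);  /* -a, every lane */
        /* Keep then_v where the mask is set, else_v elsewhere. */
        return _mm_or_si128(_mm_and_si128(mask, then_v),
                            _mm_andnot_si128(mask, else_v));
    }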


This is pretty awesome.


GPUs are designed to do massively parallel computations which all branch one way, and don't have much (any?) branch prediction logic.

You trade it off against increased code size, which might spill the cache, but for small, tight loops that stay within the cache it would still be a good win.


Would there be any benefit to adding branch prediction logic to a GPU?


You would probably get some benefit on a per-CUDA-core basis, but the extra space required would mean fewer CUDA cores; overall it would probably be a net loss on problems well suited to GPU acceleration.


Aside from other comments about weaknesses of delay slots, modern processors already do this automatically in the form of out of order execution. A branch can be moved up and assigned priority to an execution unit if it doesn't depend on some of the previous instructions. These instructions can be queued or assigned to other execution units.


It looks as though you're being downvoted but I'm not sure why. This matches my mental model of OoO execution, for instance if you have a loop like:

  sub eax, 1
  (several non-flags-touching instructions)
  jne loop_start
I would expect that a branch predictor wouldn't be needed, as the jump only needs the flags register to figure out where the branch is going to go.


That's not how OoO CPUs work in practice. Jumps are resolved at the fetch stage, always by consulting the predictor; at this stage there is no opportunity to actually execute the jne instruction and check whether the zero flag is set. When the jne is executed later in the pipeline, its only effect is to either commit or roll back the speculation.

This is not specific to OoO CPUs. In-order CPUs work the same way, assuming they have a predictor of course.

edit: to be more precise, jumps are not resolved at the fetch stage but in one of the early stages; in fact there is often a bubble for taken branches, as the fetcher will fetch the next instruction in the stream by default.


> How much circuitry would we save by leaving out the predictor? Enough to allow a measurable speedup in the CPU clock speed?

Considering the complex predictors on current PCs, we would save quite a lot of circuitry. But that circuitry is there because it is the most effective place to spend transistors to increase CPU speed; if you used it for something else, speed would go down, not up (but power consumption would improve).

Also, actual clock speed isn't really relevant and has a complex relation to CPU speed. That circuitry isn't affecting clock speed, so it wouldn't change.


> speed would go down, not up (but power consumption would improve).

Power consumption per second might go down, but overall may go up, depending on how many more seconds the computation takes.


To be slightly pedantic here: power is energy per second, so power consumption itself would go down. Total energy consumption, which is power*time, could go up.


The correction is welcome - I am just a little sad we trained a cohort of developers to care about power, but not well enough to understand that race-to-sleep is usually the best way to achieve that.
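A toy calculation (all numbers invented for illustration): say a core draws 4 W busy and 0.1 W asleep, and the same job can run fast in 1 s or slowly in 10 s at 1 W:

    race-to-sleep: 4 W * 1 s + 0.1 W * 9 s = 4.9 J
    run slowly:    1 W * 10 s              = 10 J

Lower instantaneous power, but more total energy for the slow run.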


Hum... Race to sleep is usually not the best way to save power. Undervolting is. Race to sleep only gets the spotlight once you have an extremely optimized CPU that cannot be undervolted any further in a practical way.

The rule of thumb you learned is useful for optimizing PC software, but has no place in CPU architecture discussions.

Also, "usually" leaves a lot of important cases out.


In fact, to increase clock speed you need more pipeline stages, which means a higher mispredict penalty, which means you need a better branch predictor...


> If you are accessing the content of an array, many languages will add “bound checking”: before accessing the array value, there will be a hidden check to see whether the index is valid. If the index is not valid, then an error is generated, otherwise the code proceeds normally. Bound checks are predictable since all accesses should (normally) be valid. Consequently, most processors should be able to predict the outcome nearly perfectly.

Warning! This is just a stone's throw from the kind of thing that got us Spectre, so keep in mind this sort of optimization can occasionally come back to bite you…
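For reference, the Spectre v1 gadget is literally a bounds check of this shape (roughly the example from the Spectre paper; the names are theirs):

    /* If the branch is predicted taken with an out-of-bounds x, the
       read of array1[x] happens speculatively, and the dependent load
       leaves a cache footprint in array2 that survives the rollback. */
    if (x < array1_size) {
        y = array2[array1[x] * 4096];
    }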


At the software level, you don't get a choice about this though; the CPU is going to speculate past a bounds check regardless. The optimization is at a different abstraction layer.

Pretty much anything on the market with the compute power of a cellphone is going to have a feature like this too.


> However, there are tricks that compilers cannot use safely.

Why not? VC++ emits cmov quite often, for the ternary operator and similar code.
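For example (a minimal sketch; whether cmov is actually chosen depends on the target, flags and the compiler's heuristics):

    /* With optimizations on, x86 compilers typically emit cmp + cmov
       here instead of a conditional jump, since both operands are
       side-effect-free and already evaluated. */
    int max_int(int a, int b) {
        return a > b ? a : b;
    }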



> most loops are actually implemented as branches.

No kidding!

But the article does call attention to a very important technique. It merits serious thought about all the ways it can be applied, which may not look much like this one on the surface.


> no kidding

Well, that also means that some loops are fully unrolled :-)


And some loops use only unconditional jumps.


Yes, those of the infinite variety


Haven't seen any, and I never expect to. (will it halt? == yes)


sure, pulling the plug is a perfectly valid mechanism for terminating a loop, albeit an external one :-)


Most DSPs have zero-overhead loops (still branches) which don't incur a branch penalty. So something like the below may take 1-2 cycles at most, per iteration.

    for (i = 0; i < howmany; i++) {
        out[i] = random();
    }
Of course, variable-length loops would still have the branches.


Haswell and later can run loops with 4 instructions, including 2 ALU ops and two conditional branches -- provided only one is taken -- in one cycle.

Compilers prefer to order the instructions so they take two or more cycles.


And just like that, a micro-optimisation has introduced a bug.

If the last number generated is even, it will still appear in the result set in the new code, and not in the old code.


It strikes me that the example is a bit contrived: why is howmany decremented unconditionally? The first loop populates out with howmany numbers. The second loop may set none (other than the ignored value that you point out). I suppose you can use the same trick:

  howmany -= (val bitand 1);
But that might complicate the benchmark.

Coincidentally, this was the subject of the first Stack Overflow question Bjarne addressed in an interview the other day:

https://stackoverflow.com/questions/11227809/why-is-processi...


The code assumes that, once the loop finishes, out[0] through out[index-1] contains the desired output. out[index] is not a part of that output.


Then you've got a bug which crashes if all your results were even and index is left zero.


The original code can end with index = 0 as well. The only difference is that in the optimized code, it will write to index 0, whereas in the original it will not.


In the original code, an index = 0 would not be a problem.

Here, trying to access out[index-1] would error.

Yes, you could then guard against that of course, but then that's even more code to maintain.

If this code is a critical hot-path then sure, micro-optimizations can make sense but doing so without over-commenting and a rigorous test suite to catch introduced bugs is a recipe for disaster.


I think you may have misread the code:

    while (howmany != 0) {
        val = random();
        if( val is odd) {
          out[index] =  val;
          index += 1;
        }
        howmany--;
    }
vs

    while (howmany != 0) {
        val = random();
        out[index] = val;
        index += (val bitand 1);
        howmany--;
    }
Both of these store a list of odd numbers in out[], with "index" containing the resulting count of how many numbers are in out[]. Both will have an "index" (count) value of 0 if all inputs were even, and neither attempts to access out[index-1].


You were right, I misinterpreted "out[0] to out[index-1]", but your next statement:

> count of how many numbers are in out[]

is not true; in the latter case it's a count of how many numbers you want to be in out[].

Consider what happens if howmany is 1 and it generates a single even number. In the original you have an empty array and index 0; in the newer one you have an array, e.g. [2], and an index 0.

Yes, you can solve this by 'manually' only returning the first "index" elements, but it's just asking for trouble and a source of bugs, as seen in how your attempt to describe it differed from the reality.

You might not consider that to matter, but the point I'm trying to make isn't just nitpicking; it's that there are certain expected behaviours from code, and an array which is "all odd numbers, except maybe a trailing even number" is bound to lead to broken expectations.


> > count of how many numbers are in out[]

> is not true; in the latter case it's a count of how many numbers you want to be in out[].

I'm failing to see how "index" is the count of how many numbers you want to be in out[] rather than the count of how many numbers actually ARE in out[], or why this would be different between the code examples.

> Consider what happens if howmany is 1 and it generates a single even number. In the original you have an empty array and index 0; in the newer one you have an array, e.g. [2], and an index 0.

Yes, that's the point, as was explicitly stated in the article.

> Yes, you can solve this by 'manually' only returning the first "index" elements, but it's just asking for trouble and a source of bugs, as seen in how your attempt to describe it differed from the reality.

What is there to solve? The end result in both cases is that the variable "index" is 0, because 0 of the values in out[] are valid.

> You might not consider that to matter, but the point I'm trying to make isn't just nitpicking; it's that there are certain expected behaviours from code, and an array which is "all odd numbers, except maybe a trailing even number" is bound to lead to broken expectations.

Yes, and the expectation is that "index" tells you where the first garbage element is in the array (in both examples). You're either going to have an array with all garbage (index = 0), all garbage except the first element (index = 1), all garbage except the first and second elements (index = 2), and so on. What actual values are in a garbage index (even or odd or 0 or 1 or 2 or Martin Luther's birth year) are irrelevant, because you're not going to read from those indices.


Granted, the examples lack enough detail to know precisely how `out` is allocated. However, it seems safe to assume from the context that `out` is allocated by the client and `index` is returned to indicate the total number. There is no case where `out` is an "empty array", it always must have enough space for `howmany` integers.


You are reading too much into a minimalistic example intended to show one specific detail.


> And just like that, a micro-optimisation has introduced a bug.

If only the function had a unit test to catch such bugs


This is one of those things that is completely lost on someone who has never written in a low level language. I automatically assume JavaScript developers to be completely oblivious to this entire class of software development knowledge.

It is important to understand your platform all the way down to the CPU, including things like branch prediction and caches if you want to have performant software.

Software has been getting slower more rapidly than hardware has been getting faster for nearly a decade, and the free performance gains that software people have taken advantage of when better hardware is made available are just about exhausted.

It's time to learn your platforms, software people.

[later addition] This kind of thing is also one of the reasons that OOP principles as they are taught today are so bad for software performance. Modeling object relationships to match the real world will, in every non-toy program, produce object structures that are actively unfriendly to cache efficiency, and will therefore produce software which performs very poorly when compared to software that was written with the hardware platform in mind.


> This is one of those things that is completely lost on someone who has never written in a low level language. I automatically assume JavaScript developers to be completely oblivious to this entire class of software development knowledge.

This is a shame because not only are there lots of developers who write JS that have low-level backgrounds there are also a lot who haven't and are still interested. It seems rather snobby to jump in and write off a bunch of people because of the language they use on the assumption that they use the language without being aware of the implications.

On OOP versus more cache-friendly approaches and branch prediction, this talk from CppCon 2019 is great: https://www.youtube.com/watch?v=HG6c4Kwbv4I

The interesting result is that although the data-oriented design (DoD) approach is eventually faster, a branch misprediction causes havoc until it is uncovered.


I realize that it is unfair to categorize ALL JS devs in this way, but it certainly fits the JS developers that I have worked with in the past.


OT: Why did you feel the need to add this disclaimer? Isn't it assumed that there are always exceptions anytime somebody makes a statement on the macro level? I don't think anybody would mistake "JS devs" for "Every single last individual JS dev".


I don't think that is a general assumption, particularly when the generality is made not as a statement of fact but as opinion, and in a disparaging manner.


Are you telling me when somebody says "Programmers make the web", you can't infer from context that they're not talking about all programmers?


Of course, but that is a substantially different statement than "I automatically assume JavaScript developers to be completely oblivious to this entire class of software development knowledge."

"Programmers make the web" is a true statement of fact; it's just that it's a subset of programmers who do it.

The OP's statement is a lazy opinion about JavaScript developers in general, unambiguously applied. So I welcome their later amendment.


> I automatically assume JavaScript developers to be completely oblivious to this entire class of software development knowledge.

Surprisingly, in JavaScript the technique described in the article is quite effective on Firefox, whereas the gain is almost negligible on Chrome.

https://jsperf.com/mispredicted-branches

Edit: Jsperf seems to be down. Here are the 2 snippets of code I tested:

    // Unoptimized
    let howmany = 5000000;
    let index = 0;
    const out = [];
    while (howmany) {
        const val = Math.floor(Number.MAX_SAFE_INTEGER * Math.random());
        if (val % 2) {
          out[index] = val;
          index += 1;
        }
        howmany--;
    }


    // Optimized for branch prediction
    let howmany = 5000000;
    let index = 0;
    const out = [];
    while (howmany) {
        const val = Math.floor(Number.MAX_SAFE_INTEGER * Math.random());
        out[index] = val;
        index += (val & 1);
        howmany--;
    }


Yeah, of course performance gains can be had in JavaScript, and I will add to that fact with my experience working with JavaScript developers: only one that I've ever known has ever considered performance and worked to produce performant JS before multiple users complained and an issue was raised. All others simply do not care about JavaScript performance until there is an issue created to address it.

This single person is also the only JS developer I know who has any clue how to actually improve performance in existing code.


This matches my experience as well (as a former full-time JavaScript dev), but in most cases, where JavaScript devs don't care about performance, you don't even need to understand how the machine works to speed things up. 90% of the time, the algorithmic complexity isn't thought about at all and there are plenty of quadratic (or worse) behaviors everywhere.
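The classic C analogue (a sketch; both functions assume buf is large enough, and the same shape shows up in JS with repeated string concatenation or indexOf in a loop):

    #include <string.h>

    /* Accidentally O(n^2): strcat rescans buf from the start on every append. */
    void join_slow(char *buf, const char **parts, size_t n) {
        buf[0] = '\0';
        for (size_t i = 0; i < n; i++)
            strcat(buf, parts[i]);
    }

    /* O(n): remember where the end is instead of re-finding it. */
    void join_fast(char *buf, const char **parts, size_t n) {
        size_t len = 0;
        for (size_t i = 0; i < n; i++) {
            size_t plen = strlen(parts[i]);
            memcpy(buf + len, parts[i], plen + 1);  /* +1 copies the NUL */
            len += plen;
        }
    }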


I tried both of those out in the console in firefox and chrome and found that in chrome the optimized version was about 10% faster, but in firefox the unoptimized version was about 5% faster.

Also it ran 30x faster in chrome... (~200ms vs ~6000ms)


>It is important to understand your platform all the way down to the CPU, including things like branch prediction and caches if you want to have performant software.

Let's not drastically increase job requirements for no good reason.

>It's time to learn your platforms, software people.

Many of these platforms have undocumented CPU instructions, so until you get a full accounting of that, what's the point? You can't learn the platform fully if they keep that a secret.

Secondly, we've had CPU-level issues like spectre and meltdown introduced that affected performance in some cases. We can't even trust the platform makers to get it right!


> Let's not drastically increase job requirements for no good reason.

Well, I would say that it's a very good reason, and that learning about branch prediction and caches is not a "drastic" step by any means.

Is there any software that you write whose users would not be made happier if the software performed better? Any at all?

> Many of these platforms have undocumented CPU instructions

You don't need to know the secrets of a platform to understand branch prediction and caches, or to use that knowledge to produce software that performs far better than software written without that knowledge. You don't need to know the platform on a logic gate level, you need to understand how the platform executes your code, so that you can take advantage of the strengths of the platform. You don't need to know any hidden instructions or secrets to take advantage of the platform.

> we've had CPU-level issues like spectre and meltdown introduced that affected performance in some cases

Those things didn't affect performance; the fixes for them did. The fixes also required no code changes outside of the firmware and the operating system, and knowing the platform is still the best way to write performant software, no matter what is being done underneath to avoid hardware vulnerabilities.


>Is there any software that you write whose users would not be made happier if the software performed better? Any at all?

I'd say security is a bigger issue than performance most of the time.

And most gains are going to happen within the code itself by, e.g., not writing n^2 when there's a log(n) solution or something similar.

Plus we're talking about javascript, and that's likely to be software with network concerns, so your optimizations might be a rounding error compared to performance degradation from slow network connections.

>You don't need to know any hidden instructions or secrets to take advantage of the platform.

You don't know what they do, though. Some of those ops could be more advantageous for performance in some cases. You can't fully know the platform if there are secret ops.

You can still make optimizations with partial knowledge, but don't pretend to know the platform when the platform manufacturer doesn't tell you everything about it.

>The fixes also required no code changes outside of the firmware and the operating system, and knowing the platform is still the best way to write performant software, no matter what is going on to avoid hardware vulnerabilities.

Good algorithm knowledge and practice is the most cost-effective way of writing performant code, and algorithmic problems are more than likely going to be the lion's share of the issues.


> Good algorithm knowledge and practice is the most cost-effective way of writing performant code

I’d love to see that common thought validated because in practice I’ve seen it to not be true at all.

There are lots of cases where the complexity effects of the algorithm are swamped by cache effects. In fact basic foundational assumptions about complexity analysis are dangerously untrue on modern systems.

In my experience, in either high-throughput or low-latency systems, algorithmic complexity is never the issue. It's always cache coherence, CPU prefetching/prediction, lock contention, or over-copying of data.


>I’d love to see that common thought validated because in practice I’ve seen it to not be true at all.

If you have Javascript devs that came out of some boot camp with no knowledge of either, do you teach algorithms 101 or low-level CPU programming 101 first?

I would argue your codebase would benefit from teaching them algorithms first, then the other one.

>In my experience in either high throughput or low latency systems algorithmic complexity is never the issue.

Do you encounter those workloads written in Javascript often?


>If you have Javascript devs that came out of some boot camp

Do you also go to a doctor that came out of a bootcamp? Is your house built by a contractor that came out of a bootcamp? Would you fly with an aviator that came out of a bootcamp? Would you run banking software made by a developer that came out of a bootcamp?


So how, precisely, is anyone supposed to get experience if it's unacceptable to hire them fresh out of "boot camp"?

Even if you go the "INTERNSHIP" route: well, what, is everyone supposed to stick the unpaid intern on toy apps that don't give them any actual real-world experience writing actual production software?

Cause then the next argument will just be "Do you also go to a doctor that came out of an internship? Is your house built by a contractor that came out of an internship?".

Elitism at its finest.


>Elitism at its finest.

Do you think doctors in training get to do open-heart surgeries by themselves fresh out of university? Do you think we train aviators who can only fly using the autopilot? Because that's the way we treat software developers. This has nothing to do with elitism and everything to do with professionalism. Our industry has built training wheels in the form of various VMs and high-level languages because it missed the opportunity to properly train its workforce.


Again, just more insane elitism.

You keep going to the "OH MY GOD, PEOPLE WOULD DIE" examples to try and make a fairly weak point.

No one is going to die, because some noob made a crappy little site out of the millions of crappy little sites, and it's not performing like a demi-god.

VMs and high-level languages aren't "training wheels". Especially not VMs; that's just complete and utter nonsense. Unless you think literally every website on the web should have a 100% dedicated server box.

VMs are good for a great many things, both noob-friendly and not.

As for high-level languages, they were meant for one particular thing: to get a task done quickly. Which is largely the real reason why so much software out in the wild performs like crap.

Anyone can sit down and spend years making a highly performant piece of software. But when things have to move fast, corners get cut and there's not enough time dedicated to the research needed to make said product highly performant.


Just to be clear: VM means a language VM like the JVM. I don’t have a problem with high level languages in general, but the gains in developer efficiency tend to be paid for by the CPU.

No, I don't think people die (although I wouldn't be surprised if that was the case). I just don't want to have to buy a $3000 PC so I can run a fucking chat app, an editor and a browser somewhat decently. The opportunity cost of bad software is paid by billions of users every day.


Yet all that performance optimization gets thrown away when running on virtualized containers alongside sanitizers.

By the way, my language VM is your language runtime.


Yes runtime is the better word and the JVM was a bad example anyway. There are much worse offenders when it comes to throwing away CPU power.


> If you have Javascript devs that came out of some boot camp with no knowledge of either, do you teach algorithms 101 or low-level CPU programming 101 first?

The latter. It's also known as "computer architecture" and is something typically taught in the first two years of a bachelor's degree. How can you hope to learn to properly program a computer if you don't know what a computer is?


That's an unpopular opinion on HN, and not because you can't learn those things outside of university (because you absolutely can), but because there's a push by the industry to make programming a low-skill/low-wage job. Many are still enjoying a low-skill/high-pay career and want to remain blissfully ignorant of their future.


> That’s an unpopular opinion on HN

Too bad for HN, then, because commodifying web development has done it no favors.

> Many are still enjoying a low skill/high pay career and want to remain blissfully ignorant of their future.

That's where I was a few years ago. I'm a web developer who recently went back and re-learned all the low level stuff I forgot and didn't think was necessary 20 years ago. I was wrong.


>I'd say security is a bigger issue than performance most of the time.

Yes! I mean, isn't the tradeoff exactly what got us Spectre? "Hey, let's aggressively pre-compute and cache for performance. Oh, darn, turns out that leaks information..."

https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...


Exactly; usually the big mistakes are made at the macro level. Worrying too much about the micro level takes attention away from where it belongs.

There is certainly worth in awareness of performance issues; the key is to recognize the (for most of us) rare instances where it matters.


[flagged]


>you are fucking SKILLED at dodging questions and changing the topic to suit your own needs. you would fail out of a debate class in the first week, though.

Look, you're trying to argue that javascript dev jobs should have this kind of knowledge in their descriptions, and I'm arguing there's way more shit to worry about; this is near the bottom of the list of things javascript devs should know. That's why I opened with "don't increase these job requirements unnecessarily".

>Is there any software that you write whose users would not be made happier if the software performed better? Any at all?

Yeah, a lot of websites. It just doesn't matter after a certain point.

From 10 seconds to 1 seconds? Definitely happier.

From 1 second to 0.1 seconds? Happier, but not as much as the first gain.

0.1 to 0.01 or further? They might not even notice it. And this is where the network makes further gains cost ineffective.

Take my arguments to C or assembly land with systems programming and I immediately lose on most fronts. Especially if you can translate fractions of a second into more money, there should be no restrictions on how much performance to go after IF speed == money.

But if you're doing systems programming with javascript and want the utmost performance, this just looks silly for both of us.

>This is one of those things that everyone "knows" but is actually not true.

You and kasey_junk have different/better experience than I do, because those questions come up way more often in javascript jobs than the things you're promoting.


> 0.1 to 0.01 or further? They might not even notice it. And this is where the network makes further gains cost ineffective.

True. Of course, the vast majority of web sites are nowhere near this point; most haven't even made the 10-seconds-to-1-second improvement.


Right, and moving from 0.1 seconds to 0.01 seconds means you can use one TENTH of the CPU time or Lambda time or whatever, so you can do more with fewer resources or scale down your infrastructure.

Performance matters, a LOT, and not just to end users; if you perform better, you save money on infrastructure.


>Let's not drastically increase job requirements for no good reason.

If anything, job requirements for programming are way too low. I wouldn't let a mechanic anywhere near my car if I knew s/he didn't understand the basics of an Otto engine. I'm the first to admit I don't know as much about modern CPUs as I should, but I don't live in a fantasy world where I convince myself that I don't need to know it.

>Many of these platforms have undocumented CPU instructions

So we should get the vendors to publish proper documentation. This doesn't change anything about software developers' responsibilities.

>CPU-level issues

Perfectly answered in a sibling comment so I'll just leave it at that.


>I wouldn't let a mechanic anywhere near my car if I knew s/he didn't understand the basics of an Otto engine.

That's hilarious because most of the electronics of cars are utter shit and it has nothing to do with how well those developers knew the CPU.

>So we should get the vendors to publish a proper documentation. This doesn't change anything about software developers responsibilities.

Most responsibilities are not concerned with low-level optimizations, given that javascript in particular typically deals with network connections.


>That's hilarious because most of the electronics

That was an analogy. I wasn't talking about car electronics. I was talking about the lack of professional education in programming.

>Most responsibilities are not concerned with low-level optimizations when javascript in particular typically deals with network connections

I have yet to encounter a javascript program that doesn't run on a CPU. Also, javascript is one of the most versatile and widely used languages. It runs on clients, servers, high-end machines and embedded systems. So of course performance matters. If it didn't, we wouldn't need WASM, asm.js and countless ridiculous JS engine optimizations. The time "saved" by not having to learn proper programming is dwarfed by the CPU time and end-user time wasted.


>That was an analogy. I wasn't talking about car electronics. I was talking about the lack of professional education in programming.

Yeah, and I'm pointing out that for a Javascript dev, there's a lot of knowledge with a higher priority to learn than low-level CPU code.

That's why we shouldn't add it to job descriptions unnecessarily.

>If it didn't we wouldn't need WASM, asm.js and countless ridiculous JS engine optimizations.

This is probably why Javascript was the worst language the OP could have chosen to make the argument. If you want to know the platform, you'll have to learn all these extra layers as well.

OTOH you can just write your super-high-performance thing in C or assembly and spend less time learning the extra layers between JS and the CPU. Then the low-level tuning becomes more important and obvious, and its merit is far more understandable.

This kind of knowledge is totally okay for jobs in C or assembly. It's not a great idea as a requirement for javascript jobs.


I've been thinking about this topic a lot lately and I find myself in agreement with you vsareto. I'm largely a JS dev by day (working on backend services) but I'm very passionate about low-level programming. I tend to hack around in Rust or C, even starting to hack with assembly language. Heck, I'm even reading the AMD x86_64 architecture books for fun :P.

There are so many challenges trying to understand what your Javascript is doing under the hood. Take Node.js development for example. Your code is running on V8, likely running within a Docker container, perhaps even on a Kubernetes node or other distributed cluster. To truly understand what your code is doing at the CPU level is a _very_ difficult thing to do. More often than not, the performance issues I've run into have to do with topics like: slow performing network communication, poorly thought out algorithm design (excessive code), "noisy neighbors" where other containers are causing performance problems unrelated to your own code (for example, one issue we had with excessive times for DNS resolution was caused by the networking stack/tech in our k8s cluster being misconfigured and doing a GC every 60 seconds for 4-5 seconds at a time), or poor database-access patterns that cause excessive times for queries.

The JavaScript code itself, and inefficiencies in the resulting machine code, rarely seem to be the problem for me. But that may also be due to the domain that I work in (it's actually one of the things that has me super bored with the software I work on these days).

To echo your own points, I find myself most often coaching junior developers on data-access patterns and design, minimizing network activity (if you think missing the cache and going to memory is slow for software, just imagine hitting the network and going to different machines), etc.

Like I said though, it may just be due to the domains I work in and being on the backend.

And all of this from a person who is _loving_ learning more about the internals and playing around with lower-level programming. I feel like the more I have learned at that level has made me a much better programmer as well. I wish I would have learned this stuff 10-15 years ago. So I'm torn.


>what's the point? You can't learn the platform fully if they keep that a secret.

By that logic we should never learn anything.


You can still learn it and use it and make low-level performance adjustments; just don't claim to know it fully. Subtle differences, kind of like low-level code.

Those ops could be better in some cases, and until you know what they do, you can't answer that.


> Let's not drastically increase job requirements for no good reason.

Here's the short version of branch predictor awareness. If you can write your inner loop with fewer conditionals, it will probably perform better.

Is it relevant and worth doing? At some level, only if your loop is frequent and fairly tight -- otherwise the difference is likely to be in the noise. On the other hand, writing your code to avoid branches in general can be a reasonable style and get you in the right place by default (see the sketch below).

Everybody is claiming to be a 'Full Stack' developer these days, but apparently that doesn't include much of the stack.
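A concrete sketch of that branch-avoiding style (plain C; the function is invented for illustration):

    #include <stddef.h>

    /* The comparison result is 0 or 1 and is added directly, so the
       loop body carries no data-dependent branch for the predictor to
       miss; only the well-predicted loop-back branch remains. */
    size_t count_above(const int *v, size_t n, int threshold) {
        size_t count = 0;
        for (size_t i = 0; i < n; i++)
            count += (v[i] > threshold);
        return count;
    }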


And if your platform is a virtual machine or interpreter running across a number of chip archs?

Perhaps it would be better to simply have appropriate tests and benchmarks to see what's slow on what platform?

Otherwise it's guesswork based on an incomplete understanding of the many layers below.


Generically cache- and branch-predictor-friendly code is worth most of the benefit. Micro-tuning is small potatoes compared to accidentally wiping the cache on each instruction (e.g. with badly designed objects) or mispredicting a thousand times in a row (e.g. with a poorly designed loop).


Even then, you're not going to be running on a wildly different set of architectures, I'm betting; it will be a finite set, of course, so you can attend to all the common optimizations and give the virtual machine what it needs to optimize the best it can.

No one is going to write a single piece of code that runs on both a z80 and a 56-core Xeon, for example.


> "And if your platform is a virtual machine or interpreter running across a number of chip archs?"

Outside of microcontrollers, I can't think of a processor in common use that doesn't have branch prediction or caches.


True, though I suspect that the length of the pipeline and the cost of a cache miss will be quite different between ARM and Intel.

While I'm a big fan of not writing poorly performing code, since otherwise you get death by a thousand cuts, is it not better to test rather than make educated guesses?

I.e., is the reason there is a lot of poorly performing code not a lack of understanding at a low level about caches and branch prediction, but a lack of care about performance?


And when you outsmart the compiler, the CPU gets a microcode update and those clever micro-optimizations are thrown out of the window.


Your logic would also apply, to the same degree, to the compiler back-end. Should it even emit optimized code, then?


Yep, it is easier to update the compiler than to do manual clever tricks, especially if the compiler happens to be an AOT/JIT compiler with a PGO feedback loop.

Most people aren't able to outsmart their compiler's optimizers.


The article is about C++, so I'm not sure why you are bringing up AOT/JIT, since it doesn't apply here. And clearly there is a large gain possible by doing a calculation rather than a conditional with a high-entropy result; otherwise, the article's measured speedup wouldn't have happened. A PGO feedback loop isn't going to do anything for a high-entropy conditional. I saw a great talk at CppCon 2019 by Andrei Alexandrescu about speeding up std::sort. He does this same trick, among others, to speed up the C++ standard library sort routine.

https://www.youtube.com/watch?v=FJJTYQYB1JQ


Because implementation and programming language are not the same thing?

Speaking of which, better catch up on the C++ interpreters and JIT related talks from CppCon 2019.


I can't tell if you are trying to troll here. So you reject the notion of branchless coding as a premature optimization because you think it is going to be made irrelevant by a potential future microcode update? Or are you saying that a C++ interpreter/JIT would have made this branchless at run-time, but that optimization isn't made at compile time?


The first one, naturally.

The point of my previous remark is that optimization algorithms are orthogonal to programming languages; a language-agnostic backend optimizer working on AST and SSA passes doesn't care what the input language was.


(This missed my question. Once the compiler is updated for the uArch change, we don't rebuild all the binaries out there that were built by older versions, and their speed is possibly compromised by the uArch change. Should we then even care about emitting binary code optimized for a given uArch, if we cannot guarantee it is stable over time?)


Depends on the use case.

One reason mainframes still use bytecode as the executable format is that it allows for AOT-compiling the world after an OS or hardware upgrade.

And there are solutions for doing that to plain native opcodes as well.

Finally, there is the whole issue that many micro-optimizations, while fun to implement, seldom contribute visible improvements to the business case at hand.


> It is important to understand your platform all the way down to the CPU, including things like branch prediction and caches if you want to have performant software.

While I would agree it can be useful to understand the lower levels, for the majority of developers macro-optimisations are far, far more important. Things like not forcing unnecessary redraws client-side or excess DOM manipulations in general, caching values and references instead of rederiving them on each iteration of a loop, avoiding excess database hits, not using a naive sorting method, being aware of network latency issues, properly indexing the database, and so on, will dwarf the effect of micro-optimisations for branch prediction and CPU caching if you get them wrong.

> understand your platform all the way down to the CPU

Which platform, though? Not everyone is a back-end or embedded dev with constrained hardware to support. Are you targeting amd64 or ARM or something else? Any particular generation of CPU? In many cases an optimisation for one will have no effect elsewhere, or worse, will make others slower. Unless you are doing a lot of tight-loop number crunching, is faffing around at this level really worth your time? Some understanding of cache concepts will help with overall algorithm design, but most of what that knowledge helps you with generally (rather than for specific CPUs) is good practice for other reasons too.

> JavaScript developers

JS and other JIT compiled languages make tweaks for specific platforms even less important. For the most part let the optimising compiler worry about that for the platform it is compiling for at the time or consider a lower level language.

> never written in a low level language

I've written in assembly in the distant past (6502 & related, Z80, early x86 and a little of later x86). I did once use knowledge of cache sizes and population behaviour to optimise a little image processing (having the routine work on blocks of the pixel data that were small enough to remain in cache for longer than they would if not dividing into smaller chunks or jumping around more randomly), done in inline assembly because Delphi's compiler produced a massively less optimal result otherwise.

I have enough low-level knowledge to feel safe saying that most developers (especially junior devs away from embedded or other low-resource systems) don't really need it, at least not in as much detail, in the current multi-platform world. They aren't working on code where the benefit of optimisations enabled by that knowledge isn't dwarfed by many other factors.


How is it not a good thing that you don't need to worry about CPU branch prediction when you're writing Javascript to POST a form to your server?

In most cases you don't need to go that deep to fix performance issues. Your SPA is probably slow because your app bundle is 10 MB, not because you have an inner loop that is tripping up the branch predictor.


Instead of peephole optimizing your code, I think more can be gained by ensuring that your code doesn't perform useless computations. For example, a game that recomputes the entire scene every frame, or a UI that recomputes an entire virtual-dom tree after every user action.


Or a machine learning algorithm which recomputes everything on every iteration


What good software modeling techniques could one learn instead of OOP?


Data Driven Development is the single most impactful way to write performant code, if I had to single out one thing. CPUs work on data, not on a developer's abstractions of the data.

Everything from your structs and objects all the way up should be written with the data in mind.

Think about what pieces of data are needed at the same time, or in sequence, and not what classes you need to perform a certain action. Structuring your data types so that pieces of data which are operated on or read together are kept together can dramatically improve performance.

It's difficult for me to explain, but I'm confident that once one sees examples that demonstrate the concept well, it will become clear pretty quickly.
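A minimal sketch of the kind of example I mean (C, with invented fields):

    #include <stddef.h>

    /* Array-of-structs: updating every position also drags the cold
       fields (name, hp) through the cache. */
    struct entity {
        float x, y, z;
        char  name[64];
        int   hp;
    };

    /* Struct-of-arrays: a position update touches only the arrays it
       needs, so many more entities' worth of data fit per cache line. */
    struct entities {
        float *x, *y, *z;
        int   *hp;
        size_t count;
    };

    void move_all_x(struct entities *e, float dx) {
        for (size_t i = 0; i < e->count; i++)
            e->x[i] += dx;  /* sequential accesses, prefetcher-friendly */
    }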


I am quite sure that writing performant code for current architectures is a great idea. I am sure one can learn that skill.

My question is quite orthogonal to that. OOP lets you model your application in a way that is somewhat natural. Its principles are well-researched. The resulting code is widely understood. It can be applied across many different domains.

I know no competing way to model software that comes even close to those qualities. Maybe I am wrong and the data-driven approach provides all that.


As Wirth put it: algorithms + data structures.

Programming paradigms are orthogonal to that.



