
SSE: mind the gap - Audiophilip
https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/
======
johnt15
It's much better to use any of the numerous SIMD wrappers such as libsimdpp or
Vc and get various benefits for free. It's possible to target everything from
SSE and NEON to AVX512 with what is essentially a single code path.

~~~
andrewf
Would you be able to point me towards a shipping product/library that does
this? It's easy to find examples of people hardcoding x64 assembly (x264,
zlib, libyuv) but I haven't stumbled across anybody making good use of a high
level wrapper.

~~~
johnm1019
Although it's way more than an SSE wrapper, the Eigen library is excellent in
my experience and targets multiple platforms.

[http://eigen.tuxfamily.org/index.php?title=Main_Page](http://eigen.tuxfamily.org/index.php?title=Main_Page)

~~~
Ono-Sendai
I had a look at the matrix*vector multiplication code for Eigen once and it
was rubbish.

------
cm3
I'm trying to understand speculative execution.

Given this C code:

    int result = foo != bar ? do_side_effect_and_return() : safe_return();

am I right to assume both functions will be executed speculatively?

What other potential bugs/gotchas are lurking with speculative execution?

~~~
daemin
In the context of a CPU, especially one with a deep pipeline, the comparison
result will only be known at some stage deep within the pipeline. Therefore,
rather than stall the pipeline until the result is known, the CPU starts
executing one of the branches. Once the value is known, if it guessed the
branch correctly, it continues executing as normal, having already partially
executed it. If it guessed incorrectly, it has to flush the work it has done
and start executing the other branch.

Edit: One particular thing to note is that side effects do not actually take
effect until the later stages of the pipeline, where the writes to memory and
registers are realised. By that stage the condition result is already known
and the correct branch is executing within the processor.

~~~
cm3
Yeah, but what if do_side_effect_and_return() deletes a file? Surely that
cannot be prevented.

What I'm mostly wondering is how half of our code doesn't break all the time
due to speculative execution.

I'm looking for an explanation of how it's prevented, or whether it only
works because compiler and runtime writers took careful precautions.

~~~
daemin
Ok, from what I see you are thinking at too high a level for this.

The CPU has a pipeline where it executes instructions. This pipeline has
stages for: fetching the instruction from L1 cache (a), decoding the
instruction (b), fetching data from registers or memory (c), computing the
result (d), and storing the data back into registers or memory (e). This is a
rough grouping; each of these stages (a-e) is itself made up of smaller
stages, one of which executes per clock cycle.

So between the first stage, where the CPU starts executing a comparison
instruction (a), and the point where it knows the result (d), many clock
cycles can pass. So instead of waiting and stalling, it guesses which branch
will be taken and starts feeding it into the pipeline.

It is only in stage (e) where it stores values to registers or memory where
actions can actually take place, and by the time it gets there it is executing
the correct branch. Either because it guessed the correct branch from the
beginning or it has mispredicted, flushed the pipeline, and is now executing
the correct branch.

Edit: Note that there is no such instruction as "delete file on SSD". The CPU
has various ways of working with external devices (such as sound/video chips,
SSDs, etc.), with memory mapping being one of the more popular ones, but
there are also I/O pins and a variety of hardware protocols it can use. If
you want to read up on this, get a small device like a Raspberry Pi and play
around.

~~~
cm3
I see, are CLWB and PCOMMIT for NVM (persistent memory) safe?

~~~
vardump
Yes. They're no exception. They actually make DMA transfer safe, without doing
unnecessary work.

~~~
cm3
So all this speculative business is more about using idle parts of the chip
to warm it up for either branch, thereby reducing some of the time for that
code to execute, but any op that would actually modify memory or access
anything on the bus is exempt from speculation.

About right?

~~~
vardump
> About right?

Close.

There are two kinds of speculative execution.

1) The CPU guesses (branch-predicts) _one_ path by default. If it guesses
wrong, it has to throw away the speculative results and execute the
alternative path. No speculative state is leaked to other CPU cores or to
memory (writes or I/O). Typically CPUs guess right well over 99% of the time
-- there are of course cases where prediction fails, sometimes
pathologically. When it guesses wrong, 15-20 cycles are lost. To put that in
perspective, with wide SIMD and FMA that's enough cycles for roughly 500
floating-point operations.

2) Programmer- or compiler-produced speculative execution. Typical for SIMD,
on both CPUs and GPUs. For example, with AVX2 you can compute results for 16x
16-bit integer lanes (256 bits wide) per instruction (so maybe about 32x
16-bit operations per clock cycle). Computations for both branches are done
in parallel and the right results are selected with a mask before the data is
written anywhere. The benefit is avoiding the branch entirely, which gains
performance.

Sometimes the CPU branch predictor does really badly. For example, if you
have somewhat random data-dependent branching, the CPU is going to guess
wrong 50% of the time. In that case, computing both sides in parallel and
throwing 50% of the results away can mean an order-of-magnitude speedup!

~~~
cm3
How does it determine it was wrong without checking the condition? Or does it?

~~~
vardump
It catches up with the branch condition after some delay, because CPUs have a
lot of pipeline stages. The CPU is not executing one or two instructions at a
time, but a window of 10-50 (a guesstimate, maybe more) instructions over
roughly 15 clock cycles -- the pipeline depth.

So once it knows the condition result that determines the taken branch, say
15 cycles later, it compares the guess to the actual path that needs to be
taken. If they agree, the speculative results are marked valid. If not,
they're thrown away, the CPU pipeline is flushed, and execution restarts from
the other path.

CPUs also have a reorder buffer (ROB) holding a couple of hundred uops (read:
instructions) in flight -- Intel Haswell's is 192 entries -- in which the CPU
tries to sort cross-instruction dependencies into an order that's faster to
execute. A deep pipeline means that if it can't reorder the instructions, it
has to stall waiting for earlier results -- it simply doesn't know the value
of a given register (or cached memory location) until the earlier
computation's dependency chain has finished.

They're complicated and weird machines. They don't really execute the code
sequentially at all, they just make it look as if they did.

Everything I said above is oversimplified. I left out register renaming, cross
core / CPU socket cache coherency -- and _so_ much more. I can't really say I
completely understand the beast myself.

