
An introduction to last branch records - signa11
http://lwn.net/Articles/680985/
======
jheriko
When did division become expensive again?

I'm also not sure how this adds anything over being able to stick timers in
code. Timers afford me the same kind of analysis without relying on
architectural details... maybe I'm just missing some key insight here that the
article failed to get across to me.

~~~
aktau
Last time I checked (admittedly ~3 years ago), division was about 10x as
expensive as addition/subtraction. That's a rule of thumb that doesn't take
the complexity of modern processors into account, but if you're not mixing too
many kinds of operations it should pan out.

This became noticeable in a codepath I was optimizing.

~~~
nkurz
Once you factor in the "complexity", it actually becomes much worse. A better
rule of thumb for 64-bit math might be that division is 100x the cost of
addition. 64-bit division on current Intel takes 40-80 cycles, and the
divisions don't overlap well. Addition takes only 1 cycle, and the processor
has three execution ports that can each perform an addition in parallel.
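
A rough way to see the gap for yourself (just a sketch, not a careful
benchmark; the loop count and the volatile trick are arbitrary choices of
mine):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Sketch: time a dependent chain of 64-bit divisions against a
     * dependent chain of 64-bit additions. Each result feeds the next
     * iteration, so both loops are bound by instruction latency. The
     * divisor is read through a volatile so the compiler can't turn
     * the division into a multiply-by-reciprocal. */
    int main(void) {
        volatile uint64_t v = 12345;  /* opaque to the optimizer */
        uint64_t d = v | 1;           /* runtime divisor, never zero */
        const uint64_t n = 100000000;

        uint64_t x = v;
        clock_t t0 = clock();
        for (uint64_t i = 0; i < n; i++)
            x = x / d + i;            /* serial dependency through x */
        double div_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        uint64_t y = v;
        t0 = clock();
        for (uint64_t i = 0; i < n; i++)
            y = (y + d) ^ i;          /* add chain; the XOR keeps the
                                         loop from being folded away */
        double add_s = (double)(clock() - t0) / CLOCKS_PER_SEC;

        printf("div: %.2fs  add: %.2fs  (%llu %llu)\n", div_s, add_s,
               (unsigned long long)x, (unsigned long long)y);
        return 0;
    }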

Moreover, with SIMD, you can add four 64-bit floats with each instruction.
There are no vector instructions for division.

The most recent case where I found this to be the limiting factor was ranged
random number generation. If you have a fast generator, reducing its output to
an unbiased range by traditional means can often be the most expensive step.
It can be much faster to substitute a lot of other arithmetic just to avoid
the division/modulo.
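
For the curious, the usual shape of the trick (this is a sketch of Daniel
Lemire's multiply-shift reduction; it assumes GCC/Clang's __uint128_t and a
64-bit rng() you supply, and I'm not claiming it's the exact code I used):

    #include <stdint.h>

    /* Sketch of a nearly division-free unbiased range reduction
     * (Lemire's multiply-shift method). Maps a uniform 64-bit word
     * onto [0, range) with a 128-bit multiply; the one modulo below
     * runs only on the rare rejection path, so the common case costs
     * a MUL instead of a DIV. */
    static uint64_t random_bounded(uint64_t range, uint64_t (*rng)(void)) {
        __uint128_t m = (__uint128_t)rng() * range;
        uint64_t lo = (uint64_t)m;
        if (lo < range) {                      /* possible bias: check */
            uint64_t t = (0 - range) % range;  /* 2^64 mod range */
            while (lo < t) {                   /* reject and redraw */
                m = (__uint128_t)rng() * range;
                lo = (uint64_t)m;
            }
        }
        return (uint64_t)(m >> 64);            /* high half is the result */
    }

The slow path triggers with probability range / 2^64, so for any sane range
the modulo almost never executes.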

~~~
aktau
I need to adjust my mental factor to 100x then :).

However, you mention that there's no SIMD instruction for division. Then what
are divps/divpd? Those do SIMD division (on x86).
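
For reference, the intrinsic form (a minimal sketch, assuming AVX and
GCC/Clang with -mavx):

    #include <immintrin.h>
    #include <stdio.h>

    /* Minimal sketch: one _mm256_div_pd (compiled to vdivpd) divides
     * four doubles at once. */
    int main(void) {
        __m256d num = _mm256_set_pd(8.0, 6.0, 4.0, 2.0);  /* high..low */
        __m256d den = _mm256_set_pd(2.0, 3.0, 4.0, 0.5);
        __m256d q   = _mm256_div_pd(num, den);

        double out[4];
        _mm256_storeu_pd(out, q);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        return 0;
    }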

~~~
nkurz
Wow, you're absolutely right. I should have said that there is no SIMD
instruction for integer division (which is what I needed for the ranged random
numbers), and somehow I got the idea that they didn't exist for floating point
either. The 64-bit division timing I quoted came from looking at IDIV on
Haswell:

    
    
      Instruction | Operands | µops fused | µops unfused | Ports       | Latency | Recip. throughput
      IDIV        | r64      | 59         | 59           | p0 p1 p5 p6 | 39-103  | 24-81
    

http://www.agner.org/optimize/instruction_tables.pdf

This is saying that it takes 59 µops, spread across Ports 0, 1, 5, and 6, to
do a signed integer division. The latency of a single operation will be
between 39 and 103 cycles, but if you are doing many signed integer divisions
in a row you'll finish one every 24 to 81 cycles.
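
That latency/throughput gap shows up in how you structure a loop (a sketch;
the function names are just for illustration, and d has to be a runtime value
or the compiler strength-reduces the division away):

    #include <stdint.h>

    /* Latency-bound: each division needs the previous result, so you
     * pay the full 39-103 cycle latency on every iteration. */
    uint64_t div_chain(uint64_t x, uint64_t d, int n) {
        for (int i = 0; i < n; i++)
            x = x / d + 1;            /* serial dependency through x */
        return x;
    }

    /* Throughput-bound: the four divisions per iteration are mutually
     * independent, so the divider can start one before the previous
     * finishes, approaching the 24-81 cycle inverse throughput. */
    uint64_t div_overlap(const uint64_t *a, uint64_t d, int n) {
        uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i + 3 < n; i += 4) {
            s0 += a[i]     / d;
            s1 += a[i + 1] / d;
            s2 += a[i + 2] / d;
            s3 += a[i + 3] / d;
        }
        return s0 + s1 + s2 + s3;
    }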

For VDIVPD (the AVX form of DIVPD) on Haswell we have:

    
    
      Instruction | Operands | µops fused | µops unfused | Ports   | Latency | Recip. throughput
      VDIVPD      | y,y,y    | 3          | 3            | 2p0 p15 | 19-35   | 16-28
    

Here it takes only 3 µops (two on Port 0, one on Port 1 or 5), has a
single-instruction latency of 19 to 35 cycles, and when doing many in a row we
can finish one vector's worth of divisions every 16 to 28 cycles. Since a
256-bit YMM register holds 4 doubles, this is about 6 cycles per double
precision division.

Performance on Skylake is dramatically improved:

    
    
      Instruction | Operands | µops fused | µops unfused | Ports | Latency | Recip. throughput
      VDIVPD      | y,y,y    | 1          | 1            | p0    | 13-14   | 8
    

Only one µop (on Port 0), a latency of 13-14 cycles, and an inverse throughput
of just 8 cycles. At 4 doubles per vector, the best case is only 2 cycles per
double precision division!

By contrast, double precision addition on Skylake is 1 µop on Port 0 or Port
1, has 4 cycles of latency, and optimally we can complete 2 of these
instructions per cycle (inverse throughput of 0.5). At 4 doubles per YMM
vector, this means we can do 8 double precision additions per cycle, or 16 in
the two cycles that one division takes.

So on the most recent Intel processors, in the best optimized case, double
precision division should be 16 times slower than double precision addition
--- much closer to your 10x estimate than my 100x. I think my numbers are
still about right for 64-bit integer division, but I appreciate the pushback.

