
Some small suggestions for the Intel instruction set - zdw
http://blog.cr.yp.to/20140517-insns.html
======
pascal_cuoq
> What's crazy about this is that we're limited to multiplying integers
> of about 25 bits [...]
> Why? Because the multiplier insists on rounding
> its low output bits rather than giving them back to the programmer.
> Why? Because the multiplier is buried inside the floating-point unit.

The reference to the 25-bit limit makes it seem like the author is thinking
of bits 53-105 of the 106-bit result of multiplying two 53-bit inputs.

The multiplier is not going to “give them back” to the programmer because it
never computed them in the first place. Floating-point 53x53->53
multiplication is implemented with a couple of guard bits, the last of which
is “sticky”. The multiplier the author is asking for is not “buried”; it simply
does not exist in the processor.

I am not sure why integer multiplication is not preferred. It goes up to
64x64->128, if the author wants all the bits of the product.
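
For what it's worth, the full product is already reachable from C on 64-bit
targets. A quick sketch of mine (not from the article), using the unsigned
__int128 type that GCC and Clang turn into a single widening MUL on x86-64:

    
    #include <stdint.h>
    
    /* Sketch (mine, not from the article): full 64x64 -> 128-bit product. */
    static void mul_64x64_128(uint64_t a, uint64_t b,
                              uint64_t *hi, uint64_t *lo) {
        unsigned __int128 p = (unsigned __int128)a * b;
        *hi = (uint64_t)(p >> 64);   /* upper 64 bits of the product */
        *lo = (uint64_t)p;           /* lower 64 bits of the product */
    }
    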

> A 53-bit multiplier producing a full 106-bit product won't be much larger
> than a 53-bit multiplier producing a correctly rounded 53-bit floating-point
> result.

Yes, it will. It will have to be 106 bits wide instead of 56 bits wide.

~~~
stromgo
How do you correctly round, say, 6172293634027511 * 7059478094414279 / 2^53 =
4837593847918366.50000000000000011102 without essentially computing the full
106-bit product?
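
(For reference, the brute-force answer, computing the full product with a
128-bit integer type and splitting it at 2^53, looks like this. It is my own
throwaway sketch and is exactly the work the question wants to avoid; per the
figure above, the low half should land just barely above the midpoint 2^52.)

    
    #include <stdint.h>
    #include <stdio.h>
    
    int main(void) {
        uint64_t a = 6172293634027511u, b = 7059478094414279u;  /* 53-bit inputs */
        unsigned __int128 p = (unsigned __int128)a * b;         /* exact 106-bit product */
        uint64_t q = (uint64_t)(p >> 53);                       /* integer part of p / 2^53 */
        uint64_t r = (uint64_t)(p & (((uint64_t)1 << 53) - 1)); /* low 53 bits */
        printf("q = %llu  r = %llu  2^52 = %llu\n",
               (unsigned long long)q, (unsigned long long)r,
               (unsigned long long)1 << 52);
        return 0;
    }
    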

~~~
nwhitehead
The idea is that you can use guard digits instead of full products and get
correct rounding (as if to infinite precision). For IEEE-754 math you need
three guard bits, typically named Guard, Round, and Sticky. The trick at the
hardware level is that Sticky represents the OR of a bunch of bits. Instead of
keeping the full 106-bit partial product at each step you only need to keep
the 53+3-bit partial result. As you shift and add the partial products you use
special rules to keep track of the Sticky bit, and you will still get the
right answer.
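
(The final rounding rule itself is small once you have those bits. A toy
sketch of my own, not from any spec, taking the kept significand plus the
three bits just below it:)

    
    #include <stdint.h>
    
    /* Toy sketch (my naming, not from any spec): round-to-nearest-even given
       the kept significand and the Guard/Round/Sticky bits just below it,
       where sticky is the OR of every bit further to the right. */
    static uint64_t round_nearest_even(uint64_t mant, int guard, int round, int sticky) {
        /* Round up when the dropped part is more than half an ulp, or exactly
           half an ulp with an odd significand (ties to even). The caller
           renormalizes if this carries out of 53 bits. */
        if (guard && (round || sticky || (mant & 1)))
            return mant + 1;
        return mant;
    }
    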

~~~
fdej
I'm aware of the 53+3-bit algorithm for addition and subtraction, but I don't
see how this could possibly work for multiplication. Could you please point me
to a paper that explains how to compute a correctly rounded 53-bit product
without computing the full 106-bit product?

~~~
pascal_cuoq
I can't find a handy reference, but here is a 5x5 multiplication:

    
    
         1XXXX
       * 1XXXX
     _________
         XXXXX
        XXXXX
       XXXXX
      XXXXX
     1XXXX
    

Using a sticky bit means computing the logical-or of all the bits that are
sufficiently far to the right in all these partial results, instead of
computing them individually and summing them.

As with addition, the idea is that the result you obtain is different from
the exact result but rounds the same way. The sticky bit's only job is to make
sure that (midpoint + a small quantity) is correctly rounded up.

~~~
stromgo
Let's say that the full product in your example is

    
    
      1XXXXYZZZZ
    

To round it we need:

(1) the X bits for our answer

(2) the Y bit for rounding

(3) the logical-or of the Z bits for rounding

Ok so there might be a trick to compute (3) without computing ZZZZ, but we
still want to know how to compute 1XXXXY without computing ZZZZ.

~~~
pascal_cuoq
Perhaps a better way to describe it is that during the additions, the
accumulator only ever needs to be shifted right in order to be aligned with
the next number to add, never left. So the accumulator can be 53+3 bits wide
and carry at each step the 53+3-bit representation of the partial sum already
computed, and this is enough to produce the correctly rounded result after all
numbers have been added.
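
(Here is a toy model of that sequential scheme in C. It is my own sketch with
my own naming, not a description of real hardware: the 53 partial products are
added low-to-high into a 56-bit accumulator that only ever shifts right, a
single sticky flag collects every bit shifted out, and the result is compared
against what the FPU's own 53x53->53 multiply returns for the pair of inputs
from upthread.)

    
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    
    /* Toy model (mine, not a hardware description): correctly rounded 53x53->53
       multiplication using a 56-bit accumulator plus a one-bit sticky flag.
       a and b must be full 53-bit significands, i.e. in [2^52, 2^53). */
    static double mul53_rne_sequential(uint64_t a, uint64_t b) {
        uint64_t acc = 0;   /* running sum, with 2 extra fraction bits at the bottom */
        int sticky = 0;
        for (int k = 0; k < 53; k++) {
            if (k > 0) {
                sticky |= (int)(acc & 1); /* fold the shifted-out bit into sticky */
                acc >>= 1;                /* align with the next partial product  */
            }
            if ((b >> k) & 1)
                acc += a << 2;            /* add one partial product; acc stays below 2^56 */
        }
        /* acc now equals bits 105..50 of the full product; sticky ORs bits 49..0. */
        int wide = (acc >> 55) != 0;      /* did the product land in [2^105, 2^106)? */
        uint64_t mant, rnd;
        int st;
        if (wide) {
            mant = acc >> 3; rnd = (acc >> 2) & 1; st = ((acc & 3) != 0) | sticky;
        } else {
            mant = acc >> 2; rnd = (acc >> 1) & 1; st = (int)(acc & 1) | sticky;
        }
        if (rnd && (st || (mant & 1)))
            mant++;                       /* round to nearest, ties to even */
        return ldexp((double)mant, wide ? 53 : 52);
    }
    
    int main(void) {
        uint64_t a = 6172293634027511u, b = 7059478094414279u;  /* the pair from upthread */
        printf("sequential: %.17g\n", mul53_rne_sequential(a, b));
        printf("fpu:        %.17g\n", (double)a * (double)b);
        return 0;
    }
    

Every bit shifted out sits at a distinct weight below bit 50 of the product, so
the dropped bits can never sum to a carry back into the accumulator; the OR of
them is all the final rounding decision needs.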

This explanation assumes that the N partial products are added sequentially. I
don't know what techniques are used to achieve the 4- or 5-cycle latencies of
modern desktop processors. Implementations with more parallelism and/or that
are part of an FMA unit have it harder, but I still expect that they don't
naively compute a full-width partial result before rounding it (in the case of
the FMA, it would be impractical, as the partial result could be thousands of
bits wide).

~~~
stromgo
Looks like you're actually computing the full product. The only optimization
you're doing is forgetting the low-order bits as soon as you're done with
them. The author is asking to keep them, which sounds reasonable. You
initially claimed they were "never computed in the first place", and we're
asking you to back up that claim if you still think it's true.

------
userbinator
_The worst case is that today you promise secrecy for, e.g., the MUL inputs,
and then realize in several years that you really want to make a faster
variable-time multiplier; the commitment would then force you to add a new
MULABORT instruction rather than violating the secrecy of the MUL instruction.
Of course, if you don't make any commitments, then programmers have no choice
but to rely on observations of CPU behavior; many of those programmers will
assume constant-time MUL, and if you switch to variable-time MUL then you will
be breaking cryptographic security._

I disagree with this and think he has it backwards; the majority of
applications will benefit from early-out multiplication, and it's only crypto
that wants constant-time. Also, MUL has never been constant-time on x86, ever
since the 8086. My suggestion would be to add a control register bit that puts
the CPU into a mode where the guarantees needed for crypto are true. All that
software would need to do is set the bit before performing sensitive
operations, and clear it afterwards.

~~~
haldean
I believe that's his point; I read that section less as "please make these
instructions constant-time" and more as "please document whether these
instructions are constant-time, so that we can rely on them being constant-
time". If an instruction is constant-time but then a new chipset comes out
with early-exit semantics for multiplication, everything previously compiled
for that architecture is suddenly vulnerable to timing attacks when it wasn't
previously. He's just looking for a commitment on the leakage and timing
semantics of various instructions, even if the commitment is "you cannot rely
upon this being constant-time".

------
jph
DJB is spot-on about the pipeline documentation. This can help many apps and
workflows, ranging from the one he describes (tiny, tightly optimized
cryptography code) all the way up to the entire JDK, JavaScript V8, Erlang's
BEAM, and similar large VMs.

~~~
dragontamer
Except modern architectures don't have "pipelines"... at least not pipelines
the way ARM does them. Modern architectures have decoders that translate the
assembly into micro-ops, which are then stored in a cache of their own (in
addition to the L1 cache) and then executed.

Heck, the entire concept of Intel's "Hyperthreading" kills any pipeline
determinism you would hope to achieve.

Consider your typical ARM read-after-write stall: add r1, r2 / add r1, r3. A
stall needs to be inserted into the pipeline because the second add reads the
r1 value that the first one is still producing.

In contrast, the typical out-of-order (OoO) x86 CPU (Steamroller, Ivy Bridge,
and Bay Trail) will look ahead for later, independent instructions to execute
during the pipeline stall. In the case of hyperthreaded CPUs (Intel i7, i3,
older Atoms), the core will grab instructions from _another thread_ to execute
while the stall is being resolved.

And worse yet, as Intel and AMD upgrade their systems, sometimes there are
massive architectural differences which cause regressions in some code. The
infamous AMD Bulldozer (a 2011 chip) regressed on FPU and vectorized code, for
instance, while AMD tried to up the core count and improve integer
performance.

If you really want to step into this arena, just use Agner Fog's optimization
manual.

[http://www.agner.org/optimize/instruction_tables.pdf](http://www.agner.org/optimize/instruction_tables.pdf)

But... it isn't as simple as ARM, because x86 CPUs are far more complicated.
Out-of-order processors just kill any determinism you have... let alone
hyperthreading or the other tricks these bigger CPUs do.

~~~
lgg
Out of order processors do not "kill" determinism. If you can guarantee your
initial state is the same, you will get the same result in the same number of
cycles, repeatably. The fact that the time it takes is non-obvious does not
make it non-deterministic.

Now, hyperthreading does introduce real non-determinism from the standpoint
of a single-threaded calculation, because that thread doesn't control the
other thread's use of processor resources, but hyperthreading introduces that
sort of non-determinism even on in-order processors.

~~~
hrjet
> Out of order processors do not "kill" determinism.

It may be deterministic at the micro-architecture level, but it _is_ non-
deterministic at the architecture level.

And then there are many other things in the mix: speculative execution,
instruction tracing, etc.

~~~
lgg
Well sure, different implementations behave differently. It still has nothing
to do with being OoOE...

A 486 and an original Atom are both in-order implementations of the i386
architecture (though both also implement other extensions), but knowing how
code runs on one tells me nothing about the other, and if I just have an ISA
reference for the i386, without additional documentation or hardware to test
on, I won't be able to predict how code I write performs on either of them.

------
pygy_
Earlier on his blog:

 _How to design an elliptic-curve signature system — There are many choices of
elliptic-curve signature systems. The standard choice, ECDSA, is reasonable if
you don't care about simplicity, speed, and security._

He goes on to demonstrate his point in minute detail.

~~~
X-Istence
It does not surprise me, considering he also co-authored
[http://safecurves.cr.yp.to](http://safecurves.cr.yp.to) with Tanja Lange.

------
akuma73
At what point does it make sense to have a separate crypto-core / offload-
engine?

You could provide all of the isolation and funky instructions that you want
without affecting the general-purpose CPU. It may also be more efficient, as
it is a special-purpose processor.

~~~
SoftwareMaven
Those certainly exist and are used frequently in systems that require very
high performance encryption, but what DJB is trying to do is ubiquitize
encryption capability. A massive first step was taken when AES instructions
were added to Intel CPUs, but more could be done. The advantage of having it
be ubiquitous in the CPU is overall cost is reduced and the ability to utilize
encryption in more and more places is increased since users aren't stuck
waiting.

------
f0xf1re
Feedly on iOS displays the linked page in black-on-almost-black for some
reason.

