
X86 is a high-level language - alexcasalboni
http://blog.erratasec.com/2015/03/x86-is-high-level-language.html
======
jstarks
The author's conclusion that side channel attacks are unpreventable because
x86 instructions execute in a variable amount of time does not follow.

Side channel attacks that exploit variable instruction timing depend on
_content-dependent_ timing. For example, the time spent for a mov from a
memory location does not depend on the contents of the memory, but it does
depend on the address of that memory. If that address is a function of the key
or plain text, then the mov will leak sensitive timing information. If not,
then the mov will not.
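
To make the distinction concrete, here is a toy C sketch (illustrative only,
not from the article): the first function's load address is a function of the
secret, so cache timing can leak it; the second touches only registers, so its
timing cannot.

    #include <stdint.h>

    uint8_t leaky_lookup(const uint8_t table[256], uint8_t secret) {
        return table[secret];   /* address depends on the secret */
    }

    uint8_t register_only(uint8_t a, uint8_t b) {
        return a ^ b;           /* timing independent of a and b */
    }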

None of the author's various examples have anything to do with side channel
attacks. The fact that xor eax,eax is just a register operation doesn't mean
it can leak sensitive information.

~~~
jlebar
> The fact that xor eax,eax is just a register operation doesn't mean it can
> leak sensitive information.

AIUI tfa's point is that you can't assume that

xor eax,eax

and

xor eax,ebx

take the same amount of time, because "x86 is a high level language".
Similarly for his other examples. This, he claims, makes it difficult to write
code that resists timing attacks, if you ever want to have a branch that does
nothing. Thus the discussion of cmov.
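
As a concrete (illustrative) C sketch of the cmov point: both functions below
select one of two values, but the first branches on the secret, while the
second is the branch-free form compilers can lower to cmov (though that
lowering isn't guaranteed).

    #include <stdint.h>

    uint32_t with_branch(uint32_t secret_bit, uint32_t a, uint32_t b) {
        if (secret_bit)              /* branch direction leaks the secret */
            return a;
        return b;
    }

    uint32_t branch_free(uint32_t secret_bit, uint32_t a, uint32_t b) {
        return secret_bit ? a : b;   /* often lowered to cmov, not always */
    }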

~~~
mreiland
Can someone explain to me why you wouldn't just count clock ticks at a higher
level than the CPU? If the operation finishes early, just... wait. I understand
the resolution issue, and yes, you'll probably only be within a delta of some
sort, but isn't that the point?

Why wouldn't that approach work?

~~~
jstarks
Even if you could figure out the worst-case timings and how to simulate them
in all cases, the performance would be unacceptable. The usual operations that
are data-dependent in crypto algorithms are mov with a data-dependent address,
and conditional jump with a data-dependent condition. To slow these down,
you'd have to simulate the absolute worst case, which means no caches of any
kind, which means possibly thousands or even millions of cycles (to simulate a
page fault from disk) for every operation of unknown duration. Crypto would
become infeasible.

There are much better ways to achieve this, by always computing both sides of
a (logical) branch or using computation instead of branching or table lookups.
You likely have to write this in assembly to be sure that the compiler doesn't
"optimize" any of your tricks back into branches. Some of this is hard in
older crypto algorithms (AES was designed to use table lookups in software),
but newer crypto algorithms are much more amenable to safe implementation.
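
A minimal sketch of the "compute both sides, select with arithmetic" idea (my
code; it assumes cond is exactly 0 or 1):

    #include <stdint.h>

    uint32_t ct_select(uint32_t cond, uint32_t a, uint32_t b) {
        uint32_t mask = (uint32_t)0 - cond;   /* 0x00000000 or 0xFFFFFFFF */
        return (a & mask) | (b & ~mask);      /* both inputs always used */
    }

As noted above, you'd still want to inspect the compiler output, since an
optimizer may turn a pattern like this back into a branch.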

------
kazinator
"high level" doesn't refer to how it is implemented.

Even a RISC instruction set can be implemented by, say, either some very slick
silicon gates, or by an emulator written in Javascript.

The existence of the Javascript emulator doesn't make the RISC instruction set
a high level language, on the grounds that an instruction like "ADD R1, R2,
R3" triggers a complicated traversal within the Javascript code.

The "level" of a language should refer to the "abstraction level". Years ago,
I came to favor the definition of "abstraction level" given in the glossary of
the Tunes project:

[http://tunes.org/wiki/abstraction_20level.html](http://tunes.org/wiki/abstraction_20level.html)

~~~
na85
I think that's the author's point: x86 is now a significant abstraction from
the actual details of execution.

~~~
Animats
That's true of all superscalar machines. That's what killed RISC. The original
idea of RISC was one instruction per clock and a very simple CPU control
section. Early MIPS CPUs realized that.

Then out-of-order superscalar execution came to x86 with the Pentium Pro. It
took 3000
engineers at Intel to design that CPU, but it did much better than one
instruction per clock while still handling all the weird cases in x86
instructions. As the fabs improved, the Pentium Pro technology moved to the
mainstream. The Pentium II and III were Pentium Pro architecture.

This killed the basic advantage of RISC. The "all instructions the same
length" concept really killed it - it meant 2x code bloat. That meant bigger
caches or worse cache performance. It meant more RAM and more RAM bandwidth or
worse memory performance. The x86 instruction set, for all its faults, is
compact.

For the crypto problem, the trouble is that modern crypto algorithms are not
branch free. DES was. Vernam was. Rotor machines were. RSA and elliptic curve
stuff, no. This is independent of the CPU architecture.

~~~
sdevlin
RSA is not "modern".

Modern elliptic curves are chosen with side channels in mind. You won't find
data-dependent branches or look-ups in straightforward implementations of
Curve25519 scalar multiplication, for example.

~~~
rdtsc
> You won't find data-dependent branches or look-ups in straightforward
> implementations of Curve25519

As was pointed out already, starting with the Pentium 4, for example, the same
assembly instruction "add dest, src" might take a different amount of time
depending on the values of dest and src.

Unless those modern elliptic curve implementations compensate for the specific
models of CPU they run on, their straight C and assembly code might behave as
if it had data-dependent branches.

~~~
sdevlin
Are you talking about this paper?
[https://gmplib.org/~tege/x86-timing.pdf](https://gmplib.org/~tege/x86-timing.pdf)

I only saw evidence of data-dependent timing with respect to the div
operation, but maybe I missed something.

Of course, I am not suggesting other operations are inherently immune to data-
dependent timing leaks.

EDIT: I see there are also some notes on adc and sbb in some situations, i.e.
chains of instructions that all light up the carry flag.

~~~
raverbashing
Don't worry about the instruction, worry about dest and src. If they:

- are in cache

- were touched by a previous instruction (regardless of cache)

- share a cache line touched by a previous instruction

- will be read/written (immediately) after that instruction

- are data shared by multiple cores (lock prefix)

then timing will vary (the sketch below shows the cache effect directly).
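
A rough C sketch of the first point (my own code, x86-specific, using the
__rdtsc and _mm_clflush intrinsics): the same load is timed once cold and once
warm.

    #include <x86intrin.h>
    #include <stdio.h>

    static volatile int data = 42;

    static unsigned long long time_load(volatile int *p) {
        unsigned long long t0 = __rdtsc();
        (void)*p;                            /* the load being measured */
        return __rdtsc() - t0;
    }

    int main(void) {
        _mm_clflush((const void *)&data);    /* evict: next load is cold */
        printf("cold: %llu cycles\n", time_load(&data));
        printf("warm: %llu cycles\n", time_load(&data));
        return 0;
    }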

~~~
sdevlin
That's the whole point, though - the algorithm is very straightforward to
implement _without_ data-dependent look-ups. So while any of those could
affect execution time, none should do so in a way that leaks data.

------
tptacek
Every major server ISA is, in this sense, "high level": the ISA is documented,
but the microarchitecture isn't, and there are timing-relevant details known
only to the manufacturers. Aciicmez famously demonstrated this with a timing
attack on the branch prediction cache.

~~~
schoen
Any idea why the manufacturers haven't shown more interest in helping out
crypto implementers? Surely timing attacks are a pretty prominent problem in
microprocessor design circles by now.

~~~
brohee
Intel added AES, SHA-1, and SHA-256 instructions. They also added the
PCLMULQDQ instruction to efficiently implement ECC. VIA PadLock also has AES
and SHA, plus instructions for Montgomery multiplication (an RSA speedup).
Plenty of ARM SoC vendors offer crypto cores.

So the toolbox to implement e.g. TLS securely is pretty much there.

What isn't there is a way to implement new crypto primitives that would be
timing attack and power analysis attack proof. But short of shipping an FPGA,
I don't see how they could do it...

~~~
sdevlin
> They also added the PCLMULQDQ instruction to efficiently implement ECC.

Isn't this more typically used in GHASH implementations? Maybe it's applicable
to both.

~~~
pbsd
PCLMULQDQ is a godsend to both (GCM and binary elliptic curves), since both
rely heavily on multiplication performance over F_{2^n}. The current fastest
elliptic curve implementations are over binary fields using this instruction:
[http://eprint.iacr.org/2013/131](http://eprint.iacr.org/2013/131).
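
A minimal sketch of what the instruction computes, via its intrinsic (my code;
needs -mpclmul with gcc/clang): one 64x64 -> 128-bit carry-less multiply, the
building block for multiplication over F_{2^n}.

    #include <emmintrin.h>
    #include <wmmintrin.h>   /* _mm_clmulepi64_si128 (PCLMULQDQ) */

    __m128i clmul64(long long a, long long b) {
        __m128i va = _mm_set_epi64x(0, a);
        __m128i vb = _mm_set_epi64x(0, b);
        return _mm_clmulepi64_si128(va, vb, 0x00);  /* low qword of each */
    }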

~~~
sdevlin
Ah, interesting - so this applies in particular to elliptic curves over binary
fields.

I may have missed this, but did they note how performance fared in the absence
of hardware support?

Also, have binary curves (this or the NIST ones or any others) seen widespread
deployment anywhere? I was under the impression that prime field curves were
more widely used.

~~~
pbsd
As far as I know they didn't try to make a good implementation without CLMUL.
However, the older endomorphism-free curve2251 implementation [2, 3] is eye-
opening:

- the SSSE3 implementation is ~2.7 times slower than with CLMUL

- the generic implementation (using mpfq, which should actually be pretty
good) is 5-6 times slower than with CLMUL

Binary curves used to be a lot more popular than they are now, before we all
had fat multipliers in CPUs. The patent situation is worse for binary fields
too, I think. That said, I'm pretty sure there are deployments somewhere using
them; Dan Boneh's TLS survey [1] shows an overwhelming 96% of TLS clients
using NIST's P-256, but the second most popular curve is NIST's B-233, at
3.6%. I would guess that this is due to hardware accelerators.

[1]
[http://www.w2spconf.com/2014/papers/TLS.pdf](http://www.w2spconf.com/2014/papers/TLS.pdf)

[2] [http://bench.cr.yp.to/web-impl/amd64-titan0-crypto_dh.html](http://bench.cr.yp.to/web-impl/amd64-titan0-crypto_dh.html)

[3] [https://eprint.iacr.org/2011/170](https://eprint.iacr.org/2011/170)

~~~
sdevlin
Great info - thanks!

------
mackwic
I think it's time to dig this:
[http://yarchive.net/comp/linux/x86.html](http://yarchive.net/comp/linux/x86.html)

It's from when Linus worked at Transmeta, building a weird kind of processor,
and x86 was considered "a charming oddity that works well".

------
ChuckMcM
And this is perhaps a response to the 'heat leaks' story [1] about how, by
tracking processor temperature over time, you can get information out, or at
least guess at what the processor is doing (aka a 'side channel').

The basic observation is that in addition to the fact that CISC instructions
do more than just one thing (they are essentially subroutine calls into
microcode at one level of abstraction), out of order execution has added even
more variability to _when_ they are executing. That makes the goal of
"constant time" programming (popular in crypto code and video timing loops)
very difficult to achieve, as the time may vary based on the data in play,
instruction ordering, etc.

We're a long way from the time when you could write the number of cycles
something would take in the right-hand column of your assembly listing.

[1]
[https://news.ycombinator.com/item?id=9250611](https://news.ycombinator.com/item?id=9250611)

------
yxhuvud
I wonder if introducing a sleep that is longer than the computation could
conceivably take would solve the actual problem (or at least make it a _lot_
harder to exploit). The failure case, where the computation actually is slow
enough to matter, could be detected and thrown away.

This of course would require a fully async crypto lib...

~~~
matthewmacleod
That's not an effective technique, unfortunately. You still end up leaking
timing information.

Let's say you add a small, random sleep after each operation – this still
leaks information, as the delay can be averaged out over multiple runs. A
fixed sleep after each operation is no use either, for obvious reasons.

One approach I've seen is to break time into discrete quanta – for example,
you could guarantee that every operation will take an integer number of
seconds to complete (i.e. an operation takes exactly 1 second, or exactly 2
seconds, or… scaled as required). There are still statistical techniques to
extract timing information regardless, however!
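
A sketch of that quantization idea (my code, POSIX clock_gettime/nanosleep;
illustrative only): round every operation up to the next whole quantum so that
small timing variations are masked.

    #include <time.h>
    #include <stdint.h>

    #define QUANTUM_NS 1000000000LL   /* 1 second, as in the example */

    void run_quantized(void (*op)(void)) {
        struct timespec t0, t1, pause;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        op();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        int64_t elapsed = (t1.tv_sec - t0.tv_sec) * 1000000000LL
                        + (t1.tv_nsec - t0.tv_nsec);
        int64_t remain = ((elapsed / QUANTUM_NS) + 1) * QUANTUM_NS - elapsed;
        pause.tv_sec  = remain / 1000000000LL;
        pause.tv_nsec = remain % 1000000000LL;
        nanosleep(&pause, NULL);      /* sleep out the rest of the quantum */
    }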

The takeaway is that cryptography is really, really hard; system integrity is
even harder.

~~~
JonnieCache
The solution I've seen is:

    sleep(float(hash(request_content)) % n)

This assumes that the attacker cannot control any non-relevant part of
request_content.

~~~
matthewmacleod
As far as I can tell, that would not be secure – wouldn't it end up being
essentially cryptographically secure random timing noise? We already know that
random jitter doesn't work, because it can be averaged out given enough
samples.

I'm not a cryptography guy though, so I could well be wrong!

~~~
TheLoneWolfling
It cannot be averaged out, as there is no way to take multiple samples.

Look at it: for any specific input it always sleeps for a deterministic amount
of time. Unlike random timing noise, which can be averaged away.

You take averages and all you know is the value of (actual time + some unknown
value) very precisely. That doesn't help you.

(He is, however, missing that it should also have a random salt, generated
once and stored.)
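
A hypothetical C sketch of that salted variant (hmac_u64 is a placeholder I am
assuming, standing in for a real HMAC truncated to 64 bits): the delay is a
PRF of (secret salt, request), deterministic per input but unpredictable to
the attacker.

    #include <stdint.h>
    #include <stddef.h>
    #include <time.h>

    uint64_t hmac_u64(const uint8_t *key, size_t keylen,
                      const uint8_t *msg, size_t msglen);   /* assumed */

    void padded_sleep(const uint8_t *salt, size_t saltlen,
                      const uint8_t *req, size_t reqlen, uint64_t max_ns) {
        uint64_t d = hmac_u64(salt, saltlen, req, reqlen) % max_ns;
        struct timespec ts = { d / 1000000000ULL, d % 1000000000ULL };
        nanosleep(&ts, NULL);
    }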

Of course, this is still breakable most of the time.

------
scott_karana
I'm not sure what the takeaway from this is.

Don't implement crypto in software, only hardware? Does that cut out
algorithm-writers who won't have FPGAs and fabrication available to them?...

~~~
sliverstorm
You don't have to have FPGA or ASIC capabilities to develop a great new
algorithm, do you? Do your research, prove it in software, get it into a
product later... right?

~~~
scott_karana
True! My takeaway was needlessly bleak. Thanks :)

------
mjhoy
I had this very same thought today, reading _Inside the Machine_ by Jon
Stokes. Recommended. Amazing how abstract the programmer model can be from the
real hardware.

------
rdc12
One of the links in the post points to a Stack Overflow question that has this
code. What is the purpose of the do-while loop with the condition 0? How is
that different from not having the loop at all?

    #define BN_CONSTTIME_SWAP(ind) \
      do { \
        t = (a->d[ind] ^ b->d[ind]) & condition; \
        a->d[ind] ^= t; \
        b->d[ind] ^= t; \
      } while (0)

~~~
stephencanon
It's a common C macro idiom that makes a multi-statement macro behave like a
single statement. See
[http://stackoverflow.com/questions/154136/do-while-and-if-else-statements-in-c-c-macros](http://stackoverflow.com/questions/154136/do-while-and-if-else-statements-in-c-c-macros)
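
A self-contained toy (my names) showing why it matters: swap the macro body
for a plain brace block and the else below stops compiling, because the
block's "}" plus the user's ";" terminates the if early.

    #define SWAP(x, y) \
      do {             \
        int t = (x);   \
        (x) = (y);     \
        (y) = t;       \
      } while (0)

    void demo(int cond, int *a, int *b) {
        if (cond)
            SWAP(*a, *b);   /* expands to a single statement */
        else
            *a = *b;
    }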

------
DrJokepu
Forget the controversial opinion and drama; let's talk about something
technical: how does the mov eax, ebx thing work? He said it will update eax to
point to the same underlying register ebx points to. But then what happens
when something like mov ebx, 9000h is executed? How does the CPU know that
this will only apply to ebx, not eax?

~~~
stephencanon
The subsequent mov ebx, 9000h will cause ebx to "point to" a new underlying
register, leaving the one that corresponds to eax intact.

~~~
DrJokepu
So every time a new value is loaded into one of the "high level" registers,
it's actually loaded into a new underlying register? So the underlying
registers are basically immutable?

~~~
caf
Yes, it's an allocation technique called register renaming – the same idea as
"static single assignment" in compilers.

------
JohnBooty
If "constant time" programming isn't possible, could crypto code thwart side
channel analysis by moving in the "other" direction - adding random amounts of
spurious computation?

If I can't hide the work my code is doing by making it take the same amount of
time (and use the same amount of power) no matter what I'm doing, seems like I
could obscure it by adding a random amount of work - so that even the same
"real" crypto workload would use varying amounts of power on subsequent runs.

Although I guess that would only be a partial solution at best. If an attacker
could monitor many runs of the same "real" workload, some simple statistical
analysis would still tell you some things.

~~~
jlebar
> Although I guess that would only be a partial solution at best. If an
> attacker could monitor many runs of the same "real" workload, some simple
> statistical analysis would still tell you some things.

Yes, exactly. If you're doing a timing attack under non-ideal conditions,
there are already plenty of sources of random delays -- e.g. network delays,
delays introduced by the kernel scheduler, and so on. Adding additional random
delay doesn't solve this fundamental problem.

------
malka
"Inside the CPU, the results always appear as if the processor executed
everything in-order, but outside the CPU, things happen in strange order."

I think the words 'outside' and 'inside' have been swapped.

~~~
TheLoneWolfling
Nope.

Look at load/store ordering, for instance. To your thread (i.e. inside the CPU
you're running on), things look normal. But outside your CPU, the order may be
different than expected.
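
A sketch of the classic store-buffer litmus test (my code, C11 atomics plus
pthreads): each thread stores 1 and then loads the other flag. Observing
r1 == r2 == 0 means the stores became visible to the other core after the
loads ran - the "outside" order differs from program order.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int x, y;
    int r1, r2;

    void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        r1 = atomic_load_explicit(&y, memory_order_relaxed);
        return NULL;
    }

    void *t2(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        r2 = atomic_load_explicit(&x, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        for (int i = 0; i < 1000000; i++) {
            atomic_store(&x, 0);
            atomic_store(&y, 0);
            pthread_t a, b;
            pthread_create(&a, NULL, t1, NULL);
            pthread_create(&b, NULL, t2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            if (r1 == 0 && r2 == 0) {
                printf("reordering observed at iteration %d\n", i);
                break;
            }
        }
        return 0;
    }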

------
soup10
Yet another reason high-performance code will continue to be written in c/c++.
Every abstraction layer from actual machine architecture severely convolutes
the optimization process.

~~~
Qwertious
On x86 machines (which includes any standard desktop/laptop PC) C/C++ is
compiled into x86 before it can be executed. Writing in C or C++ is actually
_more_ abstracted than x86.

As a rule, hand-written x86 assembly will outperform C or C++, when both are
very thoroughly optimised.

