
Some AMD CPU's RDRAND might not return random data after a suspend/resume - Aissen
https://github.com/systemd/systemd/issues/11810#issuecomment-489727505
======
lelf
RDRAND is not guaranteed to always succeed (and never was). You’re supposed to
retry on failure.

(Although the systemd code linked in that thread has a fallback anyway, so I’m not
sure how it fails at all.)

 _Edit_ : not to mention that it’s better just to not use it, ever. Quite sane
and sensible thing.

~~~
zaroth
Sounds like a bit of code which was hard to trigger the negative test and
therefore the fallback failed to work properly.

Not sure how the kernel devs generally go about testing the “this virtually
never happens” code paths without adding debug switches to every unhappy path.

Certainly I doubt they are using DI/IoC to wrap an interface to RDRAND which
allows unit testing the failure modes.

At least the result is a failure to generate a key, not a compromised key.

~~~
loeg
> Sounds like a bit of code which was hard to trigger the negative test and
> therefore the fallback failed to work properly.

No; it turns out that's giving systemd too much credit (sadly). See [1].

The problem appears to be that RDRAND _was_ signalling success, but producing
a nonrandom value. This is bad and a violation of the specification.

Can't speak to Linux kernel development, and in this particular case, that
isn't the problem.

The linked bug involves systemd using the world's worst random number
generator. A security engineer goes into more detail on this twitter
thread[1]:
[https://twitter.com/FiloSottile/status/1125840275346198529](https://twitter.com/FiloSottile/status/1125840275346198529)
(or unrolled:
[https://threadreaderapp.com/thread/1125840275346198529.html?...](https://threadreaderapp.com/thread/1125840275346198529.html?refreshed=yes)
).

> At least the result is a failure to generate a key, not a compromised key.

In fact, the result _is_ a compromised key -- the bug report is due to
colliding "globally unique" identifiers generated through a flawed random
gathering process.

------
lvh
Is there a good reason this doesn't just use urandom/getrandom? This doesn't
look like it's in the "we're so early in boot I haven't restored the random
seed yet" case, for example.

~~~
jsnell
No. There's a bad reason though. Want to guess what?
[https://github.com/systemd/systemd/blob/master/src/basic/ran...](https://github.com/systemd/systemd/blob/master/src/basic/random-util.c#L83)

Yes, the old "draining entropy" fallacy.

~~~
dzdt
Why do you say draining entropy is a fallacy? It is certainly true that
entropy recorded from I/O sources accumulates at a very limited rate.

~~~
wolf550e
Because after you gather ~256 bits of entropy from I/O sources (or
rdrand^H^H^H rdseed [1]), using it to generate unlimited /dev/urandom output
does not drain it. Adding more entropy only helps if the entropy pool's inner
state has leaked; it is not needed to replace entropy that was supposedly
drained by use.

Proof:

1\. Take 256 bits of entropy.

2\. Use HKDF to generate 128 bit key and 64 bit nonce.

3\. Use key and nonce for AES-CTR with all possible 2^64 counter values to
produce 2^68 bytes of output.

4\. goto 2.

If you can compute the inner state of the RNG (and thus predict future output)
by just observing the output (not through side channels), using any amount of
output, then all modern symmetric crypto is broken. If you can't, then using
the entropy in an RNG does not drain the entropy pool.
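
A minimal Python sketch of the loop above, under stated assumptions: AES-CTR is swapped for HMAC-SHA256 in counter mode (the Python standard library has no AES), the rekey interval is shrunk from 2^64 blocks to something small, and `ToyDrbg`, `hkdf_sha256` and the `info` label are names made up for this illustration. It only shows that the output is a deterministic function of a fixed-size secret state that never gets "used up"; it is not a vetted generator.

```python
import hashlib
import hmac


def hkdf_sha256(ikm: bytes, info: bytes, length: int, salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand to `length` bytes."""
    prk = hmac.new(salt or b"\x00" * 32, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


class ToyDrbg:
    """Illustrative only: a seed-once, rekey-forever output generator."""

    # Assumption: tiny rekey interval for the demo; the comment above uses 2^64.
    BLOCKS_PER_KEY = 1 << 16

    def __init__(self, seed: bytes) -> None:
        if len(seed) < 32:
            raise ValueError("need at least 256 bits of seed")  # step 1
        self._rekey(seed)

    def _rekey(self, material: bytes) -> None:
        # Step 2: derive a working key plus the material for the next rekey.
        okm = hkdf_sha256(material, b"toy-drbg rekey", 64)
        self._key, self._next_seed = okm[:32], okm[32:]
        self._counter = 0

    def _block(self) -> bytes:
        if self._counter >= self.BLOCKS_PER_KEY:
            self._rekey(self._next_seed)  # step 4: goto 2
        # Step 3 stand-in: a keyed PRF over a counter instead of AES-CTR.
        out = hmac.new(self._key, self._counter.to_bytes(8, "big"), hashlib.sha256).digest()
        self._counter += 1
        return out

    def read(self, n: int) -> bytes:
        buf = bytearray()
        while len(buf) < n:
            buf += self._block()
        return bytes(buf[:n])
```

Reading a gigabyte from `ToyDrbg(seed)` leaves exactly as much secret state as reading one byte; the only way to predict the next block is to recover the key, which is the "all modern symmetric crypto is broken" case.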

And here is a CCC talk about it:
[https://www.youtube.com/watch?v=OSfmtRc4VsE](https://www.youtube.com/watch?v=OSfmtRc4VsE)

1 - [https://software.intel.com/en-us/blogs/2012/11/17/the-differ...](https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed)

EDIT: Thanks to 'dragontamer for pointing out the difference between rdrand and
rdseed.

~~~
dzdt
So after reading this thread my understanding of the "draining is a fallacy"
view is

(1) yes genuine non-programmatic entropy bits accumulate at a slow rate and
can be drained

(2) in practice no one should care about (1) because once you have ~256 bits
to initialise a CSPRNG you can use the output of that CSPRNG until the Earth
is swallowed up by the sun; that's what the S of Cryptographically Secure
Pseudorandom Number Generator promises.

(3) the linked systemd code is silly to even provide a function
genuine_random_bytes that tries to get additional genuine non-programmatic
entropy as every possible use case is covered by (2)

(4) The real fallacy is when people think that it might be possible to
discover the seed of a CSPRNG or predict its next outputs by examining a long
enough run of its previous output. In practice such an attack should not be
possible.

If I've got the summary right, the only point I would disagree with on principle is
(3). I can imagine that someone might rationally want "genuine" entropy
independent of a kernel CSPRNG, for example for seeding their own CSPRNG.

~~~
wolf550e
re your (1), the bits cannot be drained. The only scenario where fresh entropy
matters is this: an attacker with root on your computer sees the CSPRNG inner
state, then loses their access, but can still predict the CSPRNG's output
because it's deterministic. Injecting new entropy is meant to take that ability
away. djb says this is nonsense.
[https://blog.cr.yp.to/20140205-entropy.html](https://blog.cr.yp.to/20140205-entropy.html)

------
oakwhiz
[https://news.ycombinator.com/item?id=6336505](https://news.ycombinator.com/item?id=6336505)

~~~
fpgaminer
For the lazy, this is a link to a comment by Theodore Ts'o, kernel dev, who
says:

> I am so glad I resisted pressure from engineers working at Intel to let
> /dev/random in Linux rely blindly on the output of the RDRAND instruction.
> Relying solely on an implementation sealed inside a chip and which is
> impossible to audit is a BAD idea. Quoting from the article...

Theodore Ts'o is maintainer of the ext filesystems (particularly ext4), as
well as, IIRC, /dev/random and other CSPRNG-related components of the kernel.

Thank you, Theodore Ts'o.

~~~
agwa
Theodore Ts'o is responsible for perpetuating the myth that entropy/randomness
can run out, leading systemd (and other software) to do crazy things, such as
trying to use RDRAND, to avoid "drain[ing] randomness from the kernel pool".
The bugs and security vulnerabilities resulting from this myth probably
neutralize the benefit that came from his resistance to RDRAND in the
kernel.

~~~
BeefySwain
> the myth that entropy/randomness can run out

Can you expand on this, or link to some sources that expand on this idea that
the assumption above is wrong? As a person who has not dealt with crypto
really at all I had heard this explained several times before and assumed it
was generally accepted.

~~~
colmmacc
128-bits of random data is sufficient to securely generate a stream of 100s of
terabytes of random data. It's not /that/ hard to find 128-bits of true
entropy, even during boot phase. Here's one example:

    
    
        1. Seed with any fixed hardware IDs
        2. Mix-in the wall clock time
        3. Spin up a kernel thread and flip a bit on/off in a tight loop. Interrupt it every 100 nanoseconds and take the value of the bit at that time. Do this 256 times. Mix that in too.
        4. Mix-in 256-bits from RDRAND
        5. Mix-in timings from other interrupts as and when they happen.
        6. Repeat steps 4. and 5. ad infinitum.
    

By step 4 we have taken 26 microseconds and we have the kind of entropy I
would be comfortable generating an RSA private key with.

Note that step 3 is effectively a measure of how precise the system clock and
CPU are. Attacks have been demonstrated against step 3, but they require co-
resident processes and don't apply during the boot-phase, if you've got a
dedicated core at least. In theory if system clocks and CPU got super precise
it could become too deterministic, but the point is how unlikely it is for
/both/ that to happen /and/ RDRAND to be broken.
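
A rough userspace sketch of steps 1 to 4, with loud assumptions: Python cannot issue RDRAND or run a kernel thread, so `os.urandom` stands in for the hardware RNG and the low bit of a high-resolution timer stands in for the bit-flipping loop. `MixPool`, `jitter_bits`, `boot_seed` and the labels are hypothetical names. The idea it illustrates is that every source is hashed into one pool, so the result is unpredictable as long as any single input is.

```python
import hashlib
import os
import time
import uuid


class MixPool:
    """Toy entropy pool: every input is hashed into a running SHA-256 state."""

    def __init__(self) -> None:
        self._state = hashlib.sha256(b"toy-pool-v1")

    def mix(self, label: bytes, data: bytes) -> None:
        # Length-prefix each field so ("ab", "c") and ("a", "bc") hash differently.
        for field in (label, data):
            self._state.update(len(field).to_bytes(4, "big") + field)

    def seed(self) -> bytes:
        # Copy so the pool can keep absorbing input afterwards (steps 5 and 6).
        return self._state.copy().digest()


def jitter_bits(samples: int = 256) -> bytes:
    """Stand-in for step 3: keep the low bit of a nanosecond timer per sample."""
    value = 0
    for _ in range(samples):
        value = (value << 1) | (time.perf_counter_ns() & 1)
    return value.to_bytes((samples + 7) // 8, "big")


def boot_seed() -> bytes:
    pool = MixPool()
    pool.mix(b"hw-id", uuid.getnode().to_bytes(6, "big"))       # step 1
    pool.mix(b"wall-clock", time.time_ns().to_bytes(8, "big"))  # step 2
    pool.mix(b"jitter", jitter_bits())                          # step 3
    pool.mix(b"hw-rng", os.urandom(32))                         # step 4 (stand-in for RDRAND)
    return pool.seed()
```

Steps 5 and 6 would be further `pool.mix(...)` calls as interrupt timings and fresh hardware output arrive; mixing in a bad source never makes the pool worse than it was before.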

~~~
ris
> 128-bits of random data is sufficient to securely generate a stream of 100s
> of terabytes of random data.

What you are describing is /dev/urandom. Your argument is basically "urandom
is good enough for anybody". If you want to use that, use it.

~~~
Dylan16807
/dev/urandom is not always sufficiently seeded.

/dev/random makes sure that it's seeded, then pretends it can run out somehow.

getrandom() with default settings is the right behavior almost always, and it
took ages to get implemented.
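
For reference, a small Python sketch of what "getrandom() with default settings" looks like from userspace. `os.getrandom` is only available on Linux, so the fallback branch and the helper name `secure_random_bytes` are assumptions added to keep the example portable.

```python
import os


def secure_random_bytes(n: int) -> bytes:
    """Return n bytes from the kernel CSPRNG, waiting for initial seeding where possible."""
    if hasattr(os, "getrandom"):
        # flags=0 is the default: block until the pool has been seeded once,
        # after which reads do not block on entropy estimates.
        buf = bytearray()
        while len(buf) < n:
            # getrandom may return fewer bytes than requested, so loop.
            buf += os.getrandom(n - len(buf))
        return bytes(buf)
    # Platforms without getrandom(): use the OS's preferred source.
    return os.urandom(n)
```

On recent Linux kernels Python's own `os.urandom` already goes through `getrandom()` in blocking mode, so in Python the plain `os.urandom(n)` or the `secrets` module is usually enough.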

------
zokier
Note that the corresponding kernel bug was reported already in 2014.

[https://bugzilla.kernel.org/show_bug.cgi?id=85911](https://bugzilla.kernel.org/show_bug.cgi?id=85911)

------
ricardobeat
Later in the thread:

> Given that RDRAND is allowed to fail, it seems to me that you should either
> try it only once, or only a few times, before falling back to whatever code
> is used when RDRAND is not implemented.

~~~
JdeBP
... which is what the systemd code actually does. The problem is that there
appears to be a possible AMD processor state, caused by suspend+resume, where
the instruction _succeeds_ but the data returned are not in fact random.
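
To make the distinction concrete, here is a hedged Python sketch of a retry-then-fallback wrapper. The hardware source is passed in as a callable (a hypothetical `HwSource` that returns `None` when the instruction reports failure), which also makes the unhappy paths easy to test. The `looks_stuck` check is an illustrative heuristic and an assumption of this sketch, not a description of what systemd or any CPU actually does.

```python
import os
from typing import Callable, Optional

# Hypothetical interface: returns a 64-bit value, or None on signalled failure.
HwSource = Callable[[], Optional[int]]


def looks_stuck(value: int, bits: int = 64) -> bool:
    """Heuristic only: treat all-zeros and all-ones as a sign of a broken source."""
    return value == 0 or value == (1 << bits) - 1


def random_u64(hw: HwSource, retries: int = 10) -> int:
    for _ in range(retries):
        value = hw()
        if value is None:
            continue  # signalled failure: retrying is the documented response
        if looks_stuck(value):
            break     # "success" with a suspicious value: stop trusting the source
        return value
    # Fallback: the kernel CSPRNG.
    return int.from_bytes(os.urandom(8), "big")
```

A plain retry loop only covers the first branch. The case described above, success reported with non-random data, goes straight through it unless something like the second check exists, and such a check can only catch the patterns it was written for.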

~~~
ricardobeat
Neither this thread nor the one from 2014 contains any mention from the issue
reporters of it returning non-random data, just assumptions from onlookers.

~~~
JdeBP
Wrong.

* [https://news.ycombinator.com/item?id=19851693](https://news.ycombinator.com/item?id=19851693)

------
nneonneo
Oh, this could be really bad. I wonder if any private keys are compromised
this way - would certainly be nice to know what rdrand is returning if it
isn’t random data.

~~~
lelf
Catchy title. Nowhere does it say it returns non-random data. It just fails.

~~~
butterisgood
Sure it does... read all the posts. Also read all the history.

Also read this from 2013 " I am so glad I resisted pressure from Intel
engineers to let /dev/random rely only on the RDRAND instruction... Relying
solely on the hardware random number generator which is using an
implementation sealed inside a chip which is impossible to audit is a BAD
idea. "

And this
[https://www.theregister.co.uk/2013/09/10/torvalds_on_rrrand_...](https://www.theregister.co.uk/2013/09/10/torvalds_on_rrrand_nsa_gchq/)

So, uh, this isn't news, and isn't limited to AMD.

Sure it'd be nice to fix.

~~~
jawnv6
I fail to see how a dev talking about a wholly unrelated implementation from
an entirely different vendor has any bearing on this conversation unless
you're just maliciously spreading FUD around the entire concept of random
numbers.

Do you actually know if the instruction is faulting or delivering back non-
random data ("sure it does...")? Is the non-random data 0's or something with
a pattern like 0x9090? Does that match the Intel implementation's behavior
exactly?

------
fpgaminer
And this is why cryptography has increasingly reduced its reliance on
randomness. Strong CSPRNGs that need only a single seed to be secure, signing
constructions that use deterministic hashes, deterministic derived keys, DAE-
secure ciphers that fail-safe when IVs are re-used, etc.

Randomness is definitely something we took for granted for too long.

------
iamnothere
Suspend/resume seems to be the cause of a whole host of bugs; I run into
obnoxious suspend problems frequently on all platforms except maybe Windows.
It's so common that I've pretty much stopped using suspend on most of my
machines. It's a bit inconvenient to power down / power up things each day,
but I'd rather deal with that than have intermittent wifi issues, display
problems, etc.

~~~
jdironman
I consulted once for a hospital that was experiencing seemingly random network
outages. After a few questions with the staff there, it seemed to happen
whenever a person stepped away from their Windows (Lenovo) workstation for
too long. Say, after lunch or breaks. After capturing the network traffic with
Wireshark, I determined it was a network card driver that was causing a
broadcast flood on the network from multiple points, for stations with the same
driver version. While not the fault of Windows (I don't think, anyway, as an
update of the driver fixed the issue), your comment did remind me of this
experience.

~~~
iamnothere
Bet the vendor made sure the drivers "worked" for suspend/resume, but only
from the user point of view.

I may have just been lucky with Windows on my particular hardware. A quick
search turns up a bunch of different problems. I think suspend/resume may just
be a particularly difficult thing to get right.

------
throwayEngineer
Wow, so when would be the worst time to suspend your computer?

The only time I can imagine is when generating a private key for a
production environment.

~~~
hedora
My reading is that systemd uses the cryptographically secure rng to generate a
unique id for a filename, and doesn't handle collisions properly.

 _sigh_

~~~
rwmj
There shouldn't be collisions. I mean that really: if you see a collision it's
so much more likely that your computer / program / source of randomness is
faulty [as in this case] than that the two random numbers collided that it's
not worth considering the collision case.

~~~
lokedhs
But there can be collisions, at least in theory. And as we can see, in
practice too.

Not checking for collisions because you trust that an RNG returns unique
values is a fallacy and should be avoided. Especially since it's so easy to
deal with the problem.

~~~
codedokode
It would be better to report the collision because most likely it means some
error.

~~~
lokedhs
Yes, of course. As long as you do something. Just ignoring it because it's
assumed it'll never happen is just bad design.
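
A short Python sketch of the "detect it and report it" approach discussed above, assuming the ID is used as a filename. `create_unique_file`, the logger name and the retry limit are hypothetical; the only real mechanism relied on is `O_CREAT | O_EXCL`, which makes the create fail if the name already exists.

```python
import logging
import os
import secrets
from typing import Callable

log = logging.getLogger("unique-id")


def create_unique_file(
    directory: str,
    new_id: Callable[[], str] = lambda: secrets.token_hex(16),
    max_attempts: int = 3,
) -> str:
    """Create an empty file named by a random ID; report collisions loudly."""
    for attempt in range(1, max_attempts + 1):
        path = os.path.join(directory, new_id())
        try:
            # O_EXCL makes creation atomic: it fails if the name already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            # A 128-bit collision is far more likely a broken RNG than bad luck.
            log.error("random ID collision at %s (attempt %d); RNG may be faulty",
                      path, attempt)
            continue
        os.close(fd)
        return path
    raise RuntimeError("repeated ID collisions; refusing to continue with a suspect RNG")
```

Retrying keeps the program working through a one-off collision, while the log line and the final exception make sure a faulty randomness source does not go unnoticed.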

