Hacker News new | past | comments | ask | show | jobs | submit login
Some AMD CPU's RDRAND might not return random data after a suspend/resume (github.com/systemd)
121 points by Aissen on May 7, 2019 | hide | past | favorite | 84 comments

RDRAND is not guaranteed to always succeed (and never was). You’re supposed to retry on failure.

(Although linked in that thread systemd code has fallback anyway, so I’m not sure how it fails at all).

Edit: not to mention that it’s better just to not use it, ever. Quite sane and sensible thing.

The instruction is signalling success, but returning FFFFFFFFFFFFFFFF as the random value.

* https://github.com/systemd/systemd/issues/11810#issuecomment...

H. Peter Anvin's educated guess was that some MSR flag, that has the effect of controlling this instruction, is not being saved and restored in the processor across a suspend, and the result is a processor state where it signals that the instruction is succeeding but it is not actually returning random values.

* https://bugzilla.kernel.org/show_bug.cgi?id=85911#c4

> The instruction is signalling success

Let’s wait for clarifications how that person has done the tests.

However: the bug is about systemd failing to get entropy, not getting nonsense entropy.

How xe was doing this was in fact explained in the comment hyperlinked right at the start of this page, and H. Peter Anvin was thinking about this over 4 years ago. But for the doubters Vladislav has already reiterated the point.

* https://github.com/systemd/systemd/issues/11810#issuecomment...

There is no "however". This bug is about code that is, according to the AMD doco, using the instruction correctly; but that is, because the AMD processor has this possible state after a suspend+resume, getting all-ones as its random data, thereby causing ID collisions in a fairly wide range of possible things from freshly re-generated machine IDs to journal file header block IDs, and including unit invocation IDs.

* https://github.com/systemd/systemd/blob/717e8eda77b93ac396dc...

* https://github.com/systemd/systemd/blob/717e8eda77b93ac396dc...

* https://github.com/systemd/systemd/blob/717e8eda77b93ac396dc...

This indicates that a "should be fine" in another comment is not in fact true. (-:

* https://github.com/systemd/systemd/blob/717e8eda77b93ac396dc...

The Bugzilla report seemed to indicate that AMD was returning an error code, because OpenSSL was failing to generate a key, not generating a bad key?

Why would OpenSSL fail visibly if the API was returning success but with non-random data?


If you look closely at one of the error messages, it says "too many iterations", in what looks like a random generator. I wonder if at some point it has to iterate to get a number different from another one, and it just gets an endless stream of identical values?

I can’t imagine the code is checking the returned value for uniqueness after seeing a SUCCESS code, and iterating in the hopes that becomes less true.

That would be quite something to read the comment on;

// Here we check to be sure the Earth is not flat


// Seeing that the Earth is indeed flat, we will iterate in this absurdity some more in the hopes it rounds out eventually

Suppose you need to generate several unique 32-bit IDs for something?

Sounds like a bit of code which was hard to trigger the negative test and therefore the fallback failed to work properly.

Not sure how the kernel devs generally go about testing the “this virtually never happens” code paths without adding debug switches to every unhappy path.

Certainly I doubt they are using DI/IoC to wrap an interface to RDRAND which allows unit testing the failure modes.

At least the result is a failure to generate a key, not a compromised key.

> Sounds like a bit of code which was hard to trigger the negative test and therefore the fallback failed to work properly.

No; it turns out that's giving systemd too much credit (sadly). See [1].

The problem appears to be that RDRAND was signalling success, but producing a nonrandom value. This is bad and a violation of the specification.

Can't speak to Linux kernel development, and in this particular case, that isn't the problem.

The linked bug involves systemd using the world's worst random number generator. A security engineer goes into more detail on this twitter thread[1]: https://twitter.com/FiloSottile/status/1125840275346198529 (or unrolled: https://threadreaderapp.com/thread/1125840275346198529.html?... ).

> At least the result is a failure to generate a key, not a compromised key.

In fact, the result is a compromised key -- the bug report is due to colliding "globally unique" identifies generated through a flawed random gathering process.


Filippo Valsorda said nothing about keys. Xe did not mention them at all.

The linked bug report showed errors using OpenSSL to generate keys after resume on an AMD platform. It seemed to indicate RDRAND returning an error code instead of entropy.

Sorry for any confusion.

UUIDs and hash table keys are probably what they meant by "keys".

The 'edit-compile-run-debug' cycle is so expensive, chip design is all about simulation.

Chip vendors, and the tools they use to design them, actually do a significant amount of work for error cases. This is important not just for correctness, but for production yield, reliability, temperature and radiation hardness, etc. As chips get larger and more dense this becomes more and more important.

Single Event Upset (where one bit flips) is an example of the type of error. https://en.wikipedia.org/wiki/Single_event_upset

Virtual machines perhaps.

To be fair from that thread I can't infer whether they check if the call succeeded. It might very well be it returns success but still yields -1.

i don't see any reason to not xor rdrand output with a prng if rdrand is available. please enlighten us why you think otherwise as i'm really interested in assumptions that lead to this conclusion.

Do you mean if a PRNG is available? Because the answer is obvious if RDRAND is available: you don't do that because sometimes it fails, failure is almost always catastrophic but sublte, and CSRPNGs are not a bottleneck in almost all cases.

RDRAND has major benefits over any CSPRNG: there is no software-visible state that could possibly leak. With attacks like Spectre, there’s always a concern that your CSPRNG secrets could be leaked.

The scheme we’re discussing XORs the two at the end, so that flaw doesn’t apply. Also: empirically, “kernel leaks CSPRNG” state does not seem to be as much of a problem as “userspace is convinced it knows better than urandom/getrandom”.

So - are they not checking for error in this case, and using a value anyway?

They are checking for an error and there is none, but the value returned is bogus (always the same, causing collisions).

Is there a good reason this doesn't just use urandom/getrandom? This doesn't look like it's in the "we're so early in boot I haven't restored the random seed yet" case, for example.

No. There's a bad reason though. Want to guess what? https://github.com/systemd/systemd/blob/master/src/basic/ran...

Yes, the old "draining entropy" fallacy.

Why do you say draining entropy is a fallacy? It is certainly true that entropy recorded from I/O sources accumulates at a very limited rate.

Because after you gather ~256 bits of entropy from I/O sources (or rdrand^H^H^H rdseed [1]), using it to generate infinite output of /dev/urandom does not drain it. Adding more entropy only helps if the entropy pool inner state leaked, it is not needed to add more entropy because it drained through using it.


1. Take 256 bits of entropy.

2. Use HKDF to generate 128 bit key and 64 bit nonce.

3. Use key and nonce for AES-CTR with all possible 2^64 counter values to produce 2^68 bytes of output.

4. goto 2.

If you can compute the inner state of the RNG (and thus predict future output) by just observing the output (not through side channels), using any amount of output, then all modern symmetric crypto is broken. If you can't, then using the entropy in an RNG does not drain the entropy pool.

And here is a CCC talk about it: https://www.youtube.com/watch?v=OSfmtRc4VsE

1 - https://software.intel.com/en-us/blogs/2012/11/17/the-differ...

EDIT: Thanks for 'dragontamer for pointing out difference between rdrand and rdseed.

So after reading this thread my understanding of the "draining is a fallacy" view is

(1) yes genuine non-programmatic entropy bits accumulate at a slow rate and can be drained

(2) in practise no one should care about (1) because once you have ~256 bits to initialise a CSPRNG you can use the output of that CSPRNG until the Earth is swallowed up by the sun; thats what the S of Cryptographically Secure Programmatic Random Number Generator promises.

(3) the linked systemd code is silly to even provide a function genuine_random_bytes that tries to get additional genuine non-programmatic entropy as every possible use case is covered by (2)

(4) The real fallacy is when people think that it might be possible to discover the seed of a CSRNG or predict its next outputs by examining a long enough run of its previous output. In practise such an attack should not be possible.

If I've got the summary right, the only point I would disagree on principle is (3). I can imagine that someone might rationally want "genuine" entropy independent of a kernel CSPRNG, for example for seeding their own CSPRNG.

re your (1), the bits cannot be drained. The only thing that can happen to them is that there is an attacker with root on your computer, and they see the CSPRNG inner state, and the attacker loses their access but they can still predict output of CSPRNG because it's deterministic, so you want to inject new entropy into it so the attacker will lose ability to predict CSPRNG output. djb says this is nonse. https://blog.cr.yp.to/20140205-entropy.html

> If you can compute the inner state of the RNG (and thus predict future output) by just observing the output (not through side channels), using any amount of output, then all modern symmetric crypto is broken. If you can't, then using the entropy in an RNG does not drain the entropy pool.

This seems wrong to me.

If you can predict #1 (the 256-bits of entropy in step 1), then you can break the system. This is highly likely, as Intel only assures 65-bits of entropy across two rdrand calls. (64-bits of entropy from the first call, +1 bit from the 2nd call, since RDRAND is generated from an internal random number generator. So there's going to be a correlation between the two values).

In effect, if #1 is created with four calls to RDRAND, you only have around 68-bits of entropy, which can be brute-forced faster than a true source of 256-bits of entropy. Therefore, your RNG is broken. https://software.intel.com/en-us/blogs/2012/11/17/the-differ...

RDSEED is the instruction that guarantees 64-bits of independent / multiplicative entropy, but executes slower as a result.


In effect, you're "simplifying" the problem. #1, getting 256 bits of entropy, is the hard part of the problem. You cannot ignore this part of the problem.

In any case, it seems to me that a large number of people don't know how to properly use the RDRAND function, in this thread as well as in the Github thread. RDRAND has a chance of failure, AND it doesn't even hold multiplicative entropy guarantees.

So this seems like a case of EVERYONE hasn't read the docs yet. Please people, do NOT use RDRAND as a source of entropy. Use RDSEED as a source of entropy. RDRAND is only a random number generator with only 64-bits of entropy guaranteed at any given state, and probably additive entropy at that.

I can ignore "how to do #1" when discussing whether entropy is drained by using it or not.

You need to get 256 bits of high quality entropy. We assume there are ways to do that.

On a device that has no way to do that, you can't do crypto (unless you accept the entropy from outside, which is key escrow, but might be ok for an IoT device that communicates only to its mothership).

> I can ignore "how to do #1" when discussing whether entropy is drained by using it or not.

Ehhh... fair point. Still, the overall discussion is about RDRAND, so I feel like its very important to point out how RDSEED must be used to properly generate the 256 bits of entropy you require in step #1.

Each of these steps are tricky, and require thorough analysis to understand.


EDIT: Its not so much that entropy is "used up". Its that RNG-sources of randomness gives "additive" entropy, while true entropy is "multiplicative".

Generating random numbers from a 256-bit RNG will give 256-bits of entropy from the first step, but be 100% predictable (and therefore "only" 256-bits on the 2nd step).

If the "next programmer" wants 512-bits of entropy, they will NEVER get 512-bits of entropy from your methodology, because you only started with 256-bits of entropy. Only by gathering "more" entropy will you be able to reach 512-bits of entropy.

The argument would go "who needs more than 256-bits of entropy", which is a fine point. But... that's how the math checks out.

It all comes back to the RDRAND vs RDSEED question. If the user just wants "unpredictable random numbers", then RDRAND and /dev/urandom is sufficient. But if you're creating independent random number generators (ex: If you're creating a service like random.org, and are guaranteeing certain amount of entropy per call to everyone), then you need to be using RDSEED.

Does 10 calls of your RNG produce 256-bits of randomness, or 2560-bits of randomness? What are your requirements? What are the requirements of the end user? For most people, having a RNG that "only" provides 256-bits of randomness across 10,000 calls is perfectly fine and sufficient. But there are plenty of cryptographic cases where that's not enough (and "true" 256-bits of entropy are needed in every call).

True about no way to deterministically get more than 256 bits of entropy out of 256 bits of initial entropy, but the requirements are to run wireguard, ssh, tls, generate web app session cookies. All those, I believe, would be satisfied by using a urandom which is implemented the way I described. What real use case has a requirement of more than 256 bits of "real" entropy?

Note that the real linux urandom is not as simple as I described, and djb's proposed key erasure RNG [1] is not as simple as I described. My design was just to talk about "draining" of the entropy.

1 - https://blog.cr.yp.to/20170723-random.html

> All those, I believe, would be satisfied by using a urandom which is implemented the way I described.

Do you want all of your AES Keys to be generated from one 256-bit random number generator? Or would you feel more comfortable for each AES Key to be independent sources of entropy? Assume all AES Keys are 256-bits for the rest of this post, to keep all encryption targets at 256-bits of security.

If an attacker breaks urandom, they effectively gain the key-generation for all keys that were based on that entropy pool (at least, until its 'entropy fills back up').

So if you're a server handling say, 100-connections per second... and for round-numbers sake, lets say that /dev/urandom generates 256-bits of entropy every second. If you break /dev/urandom for a 1-second interval (ie: you figure out /dev/urandom's 256-bit key for that 1-second interval), you now know all keys to all 100-connections that happened within that second.

If all 100-connections grabbed their own independent sources of entropy (/dev/random), then it wouldn't be possible. Probably because 100-connections per second would block and your hardware wouldn't support it. :-)

But yeah, its a question of "what do you want your server to do when it runs out of entropy". Do you want to just use pseudo-randomness to "fill in the blanks", which increases your attack surface. (All AES-keys you generate within that time period will have weakened security from an entropy perspective). Or would you rather BLOCK, and simply not support that case? (waiting on /dev/random will cause the server to slow down)

Since 256-bits of entropy is such a huge amount of security (if done properly), I think I'm inclined to agree with you. But still, its an engineering choice that has to be made.

I would be comfortable using AES keys generated by my algorithm, even though I would prefer djb's proposed key erasure RNG I linked above.

If the attacker managed to read the inner state of your RNG, you need to rotate all keys. There is no defense. The key erasure RNG only limits the damage.

There is no "entropy fills back up", because it is not drained. There is "adding/replacing non-deterministic entropy", to mitigate how many keys are known to attacker if attacker had access to RNG inner state for limited amount of time.

The entropy in /dev/random it not any better than the entropy in /dev/urandom, after the entropy pool has been initialized (step 1 in my algo) they are the same. /dev/random blocks until step 1 is complete. Step #1 is not complete in two cases: very early boot before the entropy pool has been read from storage and boot with no entropy pool stored, so need to gather entropy from I/O or hardware RNG like rdseed.

For ssh, tls, wireguard etc, use urandom.

For embedded devices: inject entropy pool value at manufacturing time.

For VMs: inject entropy pool from host at first start / VM clone.

The server never "runs out of entropy".

Please watch the CCC presentation I linked above.

> If the attacker managed to read the inner state of your RNG, you need to rotate all keys.

That's not the attack I'm talking about. The inner-state of your RNG can be brute-forced with an attack of size 2^256 (assuming 256-bits of internal state to your RNG).

If your 256-bit RNG creates 100x 256-bit AES-keys, the "common thread" is to attack the RNG. That's the most efficient way to get all 100x AES keys.

Case in point:

1. Try state X

2. Did it generate the RNG Sequence you're attacking? If so, you're done.

3. If not, X = X+1. Go back to step #1 and loop.

Simple brute-force attack against the state. That is to say, a 256-bit RNG only has "256-bits" of protection. Or to put it another way: the 100x AES Keys you've generated all have 256-bits of protection, max.


Stick a TRUE Hardware random number generator with entropy guarantees (such as RDSEED) as your generator, and you're immune to this brute force attack. In theory, its a non-trivial difference. In practice, 256-bits of entropy is enough for most people, and no one is going to accomplish this brute force attack.

Its not about "draining" entropy. Its about asking yourself how much entropy your application needs. I can IMAGINE people needing more than 256-bits of entropy in higher-security contexts.

What? No, brute forcing 2^256 is not an option. We assume brute forcing 2^128 is not an option. If brute forcing 2^256 were an option, why would anyone use AES-256?

No. Its not an option in practice. But its the starting point for a research problem.

* 100x different AES-256 keys generated by /dev/random would have 25,600 bits of entropy. (All 100x keys are independent, and would require an attack effort per key).

* 100x different AES-256 keys generated by /dev/urandom would "only" have 256-bits of entropy. (/dev/urandom would be attacked, assuming /dev/urandom started with 256-bits of entropy in its pool, assuming /dev/urandom never got extra entropy in the 100x calls)

You can't pretend that the two solutions to the problem are the same. At best, you can suggest that 256-bits of entropy is enough for practical purposes.

No, brute forcing 2^256 is not the starting point of anything.

Again, after seeding with initial entropy, /dev/random and /dev/urandom are the same, /dev/random is not better entropy.

Attacking a single 256 bit key is exactly as equally impossible as attacking a 256 bit RNG pool used to generate 100 keys.

There is no actual scenario in which using hardware RNG to generate 100 keys would keep me safe while using a hardware RNG to initialize a deterministic RNG to generate 100 keys would lead to compromise.

If your attacker model includes brute forcing 2^256, I can't help you. If you're trying to prove that Aleph-one is greater than Aleph-0, perhaps, but the number of atoms in the universe is less than Aleph-0, so it's irrelevant to cryptography or to information security.

People who say "maybe an advance in math will break 2048 bit DHE" are at least theoretically correct. But no advance in math is going to break that RNG. If you can't trust that RNG, you can't send secrets using CTR, GCM, CHACHA, etc. because they will all break using that method.

> Attacking a single 256 bit key is exactly as equally impossible as attacking a 256 bit RNG pool used to generate 100 keys.

But both are strictly easier than attacking 100x independently created 256-bit keys. The /dev/random case.

> If you're trying to prove that Aleph-one is greater than Aleph-0

I mean, yeah, that's basically the argument I have. 256-bits of entropy are probably enough for your application, but you cannot under any circumstances tell me that its the same as 25,600 bits of entropy. That's what /dev/random vs /dev/urandom comes down to. For most people in most situations, its almost inconceivable to imagine an application that needs more than 256-bits of true high quality entropy.

EDIT: A POTENTIAL case, which I admit I haven't though too much about... is a 2048-bit RSA key. I assume the RSA key needs to be generated with 2048-bits of true entropy (please correct me if I'm wrong), but the bits-of-security are far less than that, because RSA doesn't scale like other encryption schemes. So your 2048-bits of entropy RSA-key only scales to 128-bits of security.

If your 2048-bits of entropy RSA-key ended up to only have 256-bits of entropy generating it, you're possibly in bad luck (only 16-bits of entropy assuming a similar scaling factor). I'm not a specialist in the math, but... not all cryptographic primitives scale 1-to-1 with the input entropy source.

> > Attacking a single 256 bit key is exactly as equally impossible as attacking a 256 bit RNG pool used to generate 100 keys.

> But both are strictly easier than attacking 100x independently created 256-bit keys. The /dev/random case.

Both are impossible. Like, 'Solar System turned into thermodynamically optimal computronium fails' impossible. (seriously - '((mass of solar system * (speed of light)^2) / 0.0172 electronvolts ) / (2^256)' for calculations performed at 20C [0], and that evaluates to 5*10^-10. Drop the assumed temperature to 2.75K - the cosmic background temperature - and the energy per computation drops by a factor of several hundred, but that's still not nearly enough)

Or, put another way - if you can bruteforce 2^256, you can do that 100 times. (Even if you encrypt an object 100 times, you'll still only need to do ~2^264 work - not 2^25600)

0: https://en.wikipedia.org/wiki/Landauer%27s_principle

RSA-2048 is only about 112 bits of security because index calculus methods of breaking RSA-2048 (and DHE-2048) have an algorithm with ~2^112 steps to break it, not because of the amount of entropy used to generate the key [4].

An RSA-2048 keypair is generated by taking a CSPRNG like the one I proposed, generating 1024 bit numbers, checking them for being a prime number using Miller Rabin [1] until the chance of the number not being prime is below 2^-128 (generating new candidate primes if the current one turns out to be not prime), and then doing it all over again for the other prime. The two 1024 bit probably-primes "p" and "q" are the private key, their product "n" (with the public exponent e=65537) is the public key. Most systems precalculate some derived values used to speed up operations and store them with p and q, but you don't have to [2][3].

The RNG I proposed is absolutely good enough to generate RSA key pairs. There is no benefit to using "true hardware randomness" for generating candidate primes, after the initial pool of 256 bits of entropy has been generated.

1 - https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality...

2 - https://en.wikipedia.org/wiki/RSA_(cryptosystem)#Key_generat...

3 - https://tools.ietf.org/html/rfc8017#section-3.2

4 - https://crypto.stackexchange.com/a/8692/24949

What are you talking about? This is not an "in practice" detail.

Let's put the time involved in perspective.

``` int main() { for (unsigned long long i = 0; i < (((unsigned long long)1) << 35ull); i++) { __asm__ __volatile__ (""); } } ```

this is large enough to remove the relevance of any start up time, and make noise somewhat meaningless.

On my laptop this takes 9 seconds. That's just counting to 2^35. Now let's imagine we're just counting to 2^256. Not even doing the work to break the rng. How long does it take?

2^221 * 9seconds.

Which is approximately 1.397 * 10 ^ 50 times the age of the universe (I think a billion billion billion billion billion times the age of the universe).

That's just counting from 0 to 2^256, not doing anything else at all.

Understand that if you can break a CSPRNG with 2^256 bits of state, you can break:

* The entire WebPKI system, because all you're doing is brute forcing an very high speed 256 bit hash function.

* TLS connections: you're just brute forcing 256bit keys here as well, in the worst case. Fortunately most are still just aes128. That's 6471949087173059816 times faster, so that's only a billion billion billion times the age of the universe.

Note that the numbers involved mean that "making computers faster" and "adding more computers" doesn't making a meaningful difference, because you run into an energy wall - literally more power than a star emits ends up being necessary.

So, please don't just regurgitate nonsense: brute forcing a current 256 bit CSPRNG is not possible, at all. because physics.

So to break it, you have to break the csprng algorithm. But if you can do that, then you can break any of the places the an RNG is used in practice. Eg. you could have a "true" rng seeding crypto, but it doesn't matter because the attacker can just use your algorithm to break the result of your crypto.

Note also the "true" random you get on a computer is fairly restricted - it has both little variance, and also significant bias, so the output has to be merged (hash function that you've broken) and whitened (again, hash function that you've broken). So even if you were trying to rely on "true" random, you're not getting as much "entropy" as you seem to think you are, and you've invented an algorithm that breaks all the massaging used to make it seem more random.

There's no way to "use up" your entropy; this is a persistent myth that's been hanging around the Linux world for a very long time.

* https://www.2uo.de/myths-about-urandom/#what-about-entropy-r...

* https://www.mail-archive.com/cryptography@randombit.net/msg0...

Which one of these statements is false:

1. When secure your TLS 1.2 connection with a 128 bit AES key, you can transmit a small amount of data (on the order of ~128 bits) before the entropy is drained from your CSPRNG state.

2. When you generate at least ~128 bits of entropy in your kernel's CSPRNG state, you can generate a very large amount of random bits before the entropy is drained from your CSPRNG state.

Quoting from https://github.com/systemd/systemd/issues/11810#issuecomment...

BTW, the reason we use RDRAND in some cases instead of getrandom() [which we use in many others] is that we need to generate uuids early on (since every service we starts gets one passed, the "invocation ID", and for other stuff too), but getrandom complains in dmesg or blocks if we call it before the pool is initialized. Since systemd is one of the earliest programs that runs and thus very likely comes into contact with an uninitialized pool we attempt to avoid that by using RDRAND when generating uuids, since it should be good enough for that, as the usecase needs a "mid-quality" rng source: not crypt quality and not totally guessable either.

Why not use getrandom() with GRND_NONBLOCK?

Worst case, they could run RDRAND in a loop and write() it into /dev/random until getrandom() is unblocked. Then they're still using the ordinary kernel random device, more or less. They wouldn't have these catastrophic collisions due to near-zero entropy.

Because systemd is not x86-64-only. The actual worst case is where there is no RDRAND, because then the systemd people have the chicken-and-egg situation of not being able to seed urandom without running a system service unit, and not being able to run a systemd service unit without being able to assign a random GUID to it as its instance ID.

The design problem is that every service unit has an instance ID, whether meaningful and useful to it or not, the journal and other things have a machine ID, and journal files have header IDs. This requires that re-seeding urandom be pushed right to the start of the entire user-mode boot process (which begins with systemd in the initramfs on some operating systems), that something else (such as RDRAND) be used until urandom is re-seeded, or that the boot process simply stop and block running the very first service unit until urandom is re-seeded by the kernel.

What can systemd do better than GRND_NONBLOCK in the non-x86_64 case?

Or better yet, the getrandom syscall, which has better semantics and avoids the filesystem.

I mentioned getrandom.

I'd love to read a "cryptographic doom principle"-esque latacora blog post on all the failures over the years of not using your OS provided CSRNG.

For the lazy, this is a link to a comment by Theodore Ts'o, kernel dev, who says:

> I am so glad I resisted pressure from engineers working at Intel to let /dev/random in Linux rely blindly on the output of the RDRAND instructure. Relying solely on an implementation sealed inside a chip and which is impossible to audit is a BAD idea. Quoting from the article...

Theodore Ts'o is maintainer of the ext filesystems (particularly ext4), as well as, IIRC, /dev/random and other CSRNG related components of the kernel.

Thank you, Theodore Ts'o.

Theodore Ts'o is responsible for perpetuating the myth that entropy/randomness can run out, leading systemd (and other software) to do crazy things, such as trying to use RDRAND, to avoid "drain[ing] randomness from the kernel pool". The bugs and security vulnerabilities resulting from this myth probably neutralize the benefit that came from from his resistance to RDRAND in the kernel.

> the myth that entropy/randomness can run out

Can you expand on this, or link to some sources that expand on this idea that the assumption above is wrong? As a person who has not dealt with crypto really at all I had heard this explained several times before and assumed it was generally accepted.

128-bits of random data is sufficient to securely generate a stream of 100s of terabytes of random data. It's not /that/ hard to find 128-bits of true entropy, even during boot phase. Here's one example:

    1. Seed with any fixed hardware IDs

    2. Mix-in the wall clock time

    3. Spin up a kernel thread and flip a bit on/off in a tight loop. Interrupt it every 100 nanoseconds and take the value of the bit at that time. Do this 256 times. Mix that in too.

    4. Mix-in 256-bits from RDRAND 

    5. Mix-in timings from other interrupts as and when they happen. 
    6. Repeat steps 4. and 5. ad infinitum. 
By step 4 we have taken 26 microseconds and we have the kind of entropy I would be comfortable generating an RSA private key with.

Note that step 3 is effectively a measure of how precise the system clock and CPU are. Attacks have been demonstrated against step 3, but they require co-resident processes and don't apply during the boot-phase, if you've got a dedicated core at least. In theory if system clocks and CPU got super precise it could become too deterministic, but the point is the likelihood of /both/ that happening /and/ RDRAND being broken.

>By step 4 we have taken 26 microseconds and we have the kind of entropy I would be comfortable generating an RSA private key with.

And yet Truecrypt made me wiggle the mouse around for like 30 seconds?

> 128-bits of random data is sufficient to securely generate a stream of 100s of terabytes of random data.

What you are describing is /dev/urandom. Your argument is basically "urandom is good enough for anybody". If you want to use that, use it.

/dev/urandom is not always sufficiently seeded.

/dev/random makes sure that it's seeded, then pretends it can run out somehow.

getrandom() with default settings is the right behavior almost always, and it took ages to get implemented.

The most pragmatic explanation is that if 128 or 256 bits of random could run out, the cryptography underlying almost everything we do online would be unsound. If you trust those crypto systems to take a sufficient seed and stretch it, why don't you trust the OS RNG to do so?

You could always read the discussion that was already here on this very page. (-:

* https://news.ycombinator.com/item?id=19850938

Note that the corresponding kernel bug was reported already in 2014.


Later in the thread:

> Given that RDRAND is allowed to fail, it seems to me that you should either try it only once, or only a few times, before falling back to whatever code is used when RDRAND is not implemented.

... which is what the systemd code actually does. The problem is that there appears to be a possible AMD processor state, caused by suspend+resume, where the instruction succeeds but the data returned are not in fact random.

There is no mention anywhere, in this thread nor the one from 2014, of it returning non-random data from the issue reporters, just assumptions from onlookers.

The other problem is that systemd's fallback is a non-random PRNG.

Only if a RANDOM_EXTEND_WITH_PSEUDO flag is set. But in fact, that flag is nowhere set in systemd outwith unit testing code. So it is not in fact another problem.

Even if it were used, the calls to rand() would only happen if RDRAND first succeeded and then failed. If RDRAND always failed, systemd's true fallback is actually to getrandom() and thence to /dev/urandom. However, AMD claims in its doco that there is a FIFO of generated data provided for satisfying bursts of multiple RDRAND instructions, so it is supposedly unlikely to succeed and then fail, especially for the amounts of random data in the case at hand.

AMD does not say, though, whether the FIFO is big enough to complete a 128-bit GUID, and this does after all involve a processor that can enter a mode where RDRAND succeeds and gives a constant result. It's not wholly beyond the bounds of possibility that AMD made the FIFO too small as well.

Oh, this could be really bad. I wonder if any private keys are compromised this way - would certainly be nice to know what rdrand is returning if it isn’t random data.

Catchy title. Nowhere it says it returns non-random data. It just fails.

Sure it does... read all the posts. Also read all the history.

Also read this from 2013 " I am so glad I resisted pressure from Intel engineers to let /dev/random rely only on the RDRAND instruction... Relying solely on the hardware random number generator which is using an implementation sealed inside a chip which is impossible to audit is a BAD idea. "

And this https://www.theregister.co.uk/2013/09/10/torvalds_on_rrrand_...

So, uh, this isn't news, and isn't limited to AMD.

Sure it'd be nice to fix.

I fail to see how a dev talking about a wholly unrelated implementation from an entirely different vendor has anything to bear on this conversation unless you're just maliciously spreading FUD around the entire concept of random numbers.

Do you actually know if the instruction is faulting or delivering back non-random data ("sure it does...")? Is the non-random data 0's or something with a pattern like 0x9090? Does that match the Intel implementation's behavior exactly?

And this is why cryptography has increasingly reduced its reliance on randomness. Strong CSPRNG that only need a single seed to be secure, signing constructions that use deterministic hashes, deterministic derived keys, DAE-secure ciphers that fail-safe when IVs are re-used, etc.

Randomness is definitely something we took for granted for too long.

Suspend/resume seems to be the cause of a whole host of bugs; I run into obnoxious suspend problems frequently on all platforms except maybe Windows. It's so common that I've pretty much stopped using suspend on most of my machines. It's a bit inconvenient to power down / power up things each day, but I'd rather deal with that than have intermittent wifi issues, display problems, etc.

I consulted once for a hospital that was experience seemingly random network outages. After a few questions with the staff there it seemed to happen whenever a person stepped away from their windows (Lenovo) workstations for too long. Say, after lunch or breaks. After a wireshark of the network I determined it was a network card driver that was causing a broadcast flood on the network from multiple points for stations with the same driver version. While not the fault of Windows (I dont think anyways, as an update of the driver fixed the issue) your comment did remind me of this experience.

Bet the vendor made sure the drivers "worked" for suspend/resume, but only from the user point of view.

I may have just been lucky with Windows on my particular hardware. A quick search turns up a bunch of different problems. I think suspend/resume may just be a particularly difficult thing to get right.

Wow, so when would be the worst time to suspend your computer?

The only time I imagine this, is when generating a private key for a production environment.

My reading is that systemd uses the cryptographically secure rng to generate a unique id for a filename, and doesn't handle collisions properly.


Well, not handling collisions helps a lot with exposing bad CSPRNGs. If they (and e.g. OpenSSH as mentioned in the original 2014 bug report) did handle collisions, it could've remained unnoticed.

There shouldn't be collisions. I mean that really: if you see a collision it's so much more likely that your computer / program / source of randomness is faulty [as in this case] than that the two random numbers collided that it's not worth considering the collision case.

But there can be collisions, at least in theory. And as we can see, in practice too.

Not checking for collisions because you trust that an RNG returns unique values is a fallacy and should be avoided. Especially since it's so easy to deal with the problem.

It would be better to report the collision because most likely it means some error.

Yes, of course. As long as you do something. Just ignoring it because it's assumed it'll never happen is just bad design.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact