
A months-old AMD microcode bug destroyed my weekend - furcyd
https://arstechnica.com/gadgets/2019/10/how-a-months-old-amd-microcode-bug-destroyed-my-weekend/
======
viraptor
> When there's a bug in the CPU microcode, you're at the mercy of your
> motherboard vendor to release a new system BIOS that will update it for you

That's just... not true. Is the article wrong, or was the rdrand issue
actually bios-specific rather than in CPU microcode?

Edit: Whoever downvoted this - ucode updates have been supported by both Intel
and AMD for quite a while. You certainly can just download it:
[https://github.com/platomav/CPUMicrocodes/tree/master/AMD](https://github.com/platomav/CPUMicrocodes/tree/master/AMD)

~~~
drewg123
I'm not sure why you're being downvoted, I had the same reaction. CPU
microcode is updateable by the kernel at boot.

On Ubuntu, he'd want to use the amd64-microcode package:
[https://launchpad.net/ubuntu/+source/amd64-microcode](https://launchpad.net/ubuntu/+source/amd64-microcode)

~~~
amelius
This assumes that you _can_ boot.

~~~
cremp
well, first thing on _my_ list if I can't boot; good 'ol USB live boot to read
dmesg and syslog.

With a live boot, you can install packages such as microcode updates.

~~~
LeonM
This is a microcode bug which causes a kernel to hang. So simply booting from
a different medium is not going to help you.

You'll need to compile a kernel which does not depend on the RDRAND
instruction as a source of randomness.

~~~
ehvatum
It causes systemd to blow up, but you could try doing your microcode update
from initrd. Or just get rid of systemd, run your microcode update, and make
some Halloween jello shots to take to the no-systemd party (it's going to be
even crazier, this year).

------
BuildTheRobots
> Readers also suggested that if the amd64-microcode package were installed,
> it would fix the issue. This, too, is not the case; the amd64-microcode and
> intel-microcode packages are both installed by default on all Ubuntu 19.10
> systems, including the one I'm experiencing the RDRAND failure on. I
> contacted AMD and asked representatives to check the status of that package
> and see if anything needs to (or can) be done about updating it.

Can anyone comment why the AMD microcode update doesn't seem to actually fix
this bug? I've seen plenty of people commenting that you don't need a bios
update and that it _should_ , however the author seems to have tried that.

~~~
ehmish
It's CPU initialisation microcode, it has to be put in a bios update for your
motherboard. Specifically it's called AGESA
[https://en.m.wikipedia.org/wiki/AGESA](https://en.m.wikipedia.org/wiki/AGESA)

~~~
kregasaurusrex
Previously, AMD shipped out free processors to fix an issue with Ryzen boards
having firmware too old to boot the CPU:
[https://www.extremetech.com/computing/264097-amd-shipping-
fr...](https://www.extremetech.com/computing/264097-amd-shipping-free-
processors-customers-address-apu-firmware-update-issue)

------
gexla
Yup, this killed a weekend of mine as well. I'm surprised this isn't a
footnote of every conversation about AMD here. Luckily for me the fix was
available from my motherboard manufacturer about a week after I bought the
chip.

Also, the Ryzen 3 seems to be picky in dealing with RAM chips. I couldn't boot
my computer with all slots in my motherboard filled. I ended up having to dial
down the speed of my RAM in the BIOS as well as manually set some other
settings.

Also, certain motherboards require an updated BIOS before they can boot off a
Ryzen 3. In my case, I had the store I bought the chip from install the chip
into the mobo so that they could demonstrate that it would boot up and I
wouldn't have to return anything. It took them hours to figure this out and I
had to tell them how to fix the problem after arriving because I had
previously run across mention of the problem. The fix was to install a Ryzen
2, enter the BIOS, install the upgrade and then go ahead with installing the
Ryzen 3 chip.

What a PITA that was. It put me off building my own system. Next time, I think
I'll get a Dell or something.

~~~
eptcyka
I was under the impression that it was widely known that X470 motherboards
would only support the newer Ryzen CPUs with a BIOS update. This is very often
the case for motherboard chipsets that are forward-compatible with newer CPUs.
As far as memory goes, you do have to pay attention to what memory the CPU
supports, it's not like any DDR4 modules will _just work_ with any
motherboard/CPU combo. I just built a ryzen desktop, and everything was pretty
much turn-key, as I selected the parts which with my limited amount of
knowledge had the highest probability of working together nicely.

~~~
gexla
This was in the Philippines. I wasn't surprised they ran into issues and I
knew right away what the issue was.

I understand you can't just get any memory. The memory I purchased was
reasonably within the required specs and a solid brand. The memory wasn't on
the "tested list" but on a relatively new chip that's a small list and in the
Philippines, I had a limited choice. This is also a wide problem as I found
loads of people who had the same issues. I was able to fix it though, and I
have my original RAM installed. If there's a secret sauce to picking RAM other
than being on this tested list, it's not widely known. Just finding a thread
where someone could explain the problem and the fix was difficult.

I guess this is no different from any newly released thing. Better to wait
until the quirks get worked out.

~~~
eptcyka
In your case, I think it might very well just be the fact that this is a
completely new platform and all the kinks haven't been straightened out just
yet. And yeah, being in a smaller market must suck due to the limited choice
:/

Having said that, getting ahold of the 3900x in the UK made me feel like I, too,
was participating in a very limited market, situated somewhere in the middle
of the Pacific ocean.

~~~
gexla
I bet you have to pay a nice premium as well. Anything electronic here is
anywhere from 30% to 100% more than I would pay Amazon in the US.

------
peter_d_sherman
Excerpts:

"My CPU still thought 0xFFFFFFFF was the randomest number ever, always, no
matter what."

"What if two years later, I was still vulnerable to stack-smashing that I
shouldn't have been, due to ASLR that wasn't actually randomizing?"

An excellent point... IF ASLR is dependent on RDRAND, and IF RDRAND always
returns 0xFFFFFFFF, THEN ASLR will not work, and addresses will be laid out in
memory deterministically rather than randomly... which is a big no-no for
security...

A future OS would check the RNG's it uses for any security feature, on
startup, and upon failure would stop, log the problem, advise the user, and
ask them how they want to proceed...
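
A minimal sketch of the kind of startup check described above, assuming a hypothetical `read_u32` callable that stands in for whatever hardware RNG the OS uses. The function and source names are illustrative only; this is not how any particular OS does it, and a real health test would be more statistical than a simple "are all samples identical" comparison.

```python
import secrets
from typing import Callable


def rng_looks_stuck(read_u32: Callable[[], int], samples: int = 16) -> bool:
    """Return True if every sample drawn from read_u32 is identical."""
    if samples < 2:
        raise ValueError("need at least two samples to compare")
    first = read_u32()
    # A working 32-bit RNG returning the same value 16 times in a row is
    # so unlikely that we treat it as a fault.
    return all(read_u32() == first for _ in range(samples - 1))


def stuck_source() -> int:
    # Hypothetical stand-in for the faulty RDRAND described in the article.
    return 0xFFFFFFFF


def healthy_source() -> int:
    # Stand-in for a working source, backed by the OS CSPRNG.
    return secrets.randbits(32)
```

On failure the caller could then log the problem and fall back to another source, rather than silently handing out 0xFFFFFFFF.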

~~~
jzwinck
99% of users would click "Ignore and don't ask me again." If that option is
not available they would tell their friends how stupid their OS was and that
it ruined their weekend.

The OS needs to paper over known defects, if consumers are to be kept sane.

~~~
peter_d_sherman
DOS had abort, retry, fail
([https://en.wikipedia.org/wiki/Abort,_Retry,_Fail%3F](https://en.wikipedia.org/wiki/Abort,_Retry,_Fail%3F))
when it encountered an issue; most BIOSes will stop if they can't detect the
keyboard or if a memory test failed (it typically can be overridden), Windows
has an Advanced Boot Options screen that automatically comes up (such that the
user can select "Safe Mode") if/when the computer crashed previously...

I'm not saying you're wrong; I'm just saying, hey, there's "prior art" for
this... <g>

~~~
jzwinck
The link you provide says that is a famous example of a bad user interface. I
can't tell if you're agreeing with me.

~~~
dfox
Abort, retry, ignore(, fail) is a perfect example of a UI that was good when it
was designed, but very quickly became nothing more than a source of confusion.

------
BruiseLee

      long getRandomNumber() {
          return 0xFFFFFFFF; // chosen by fair 4294967295 sided dice roll.
      }                      // guaranteed to be random.

~~~
rmu09
it should say 4294967296 sided dice.

~~~
nullc
Dice aren't usually 0 indexed.

~~~
saghm
They also usually don't have four billion sides

~~~
Narishma
Sure they do, we just call them marbles instead of dice.

------
marios
I'm surprised WireGuard asks RDRAND directly. Isn't there a facility inside
the Linux kernel to get random numbers?

OpenBSD conveniently provides arc4random() in its libc for applications to
use, and the same function is available for kernel components (obviously one
needs to include different headers).

~~~
zx2c4
WireGuard uses the correct kernel function, get_random_u32(). WireGuard does
not use RdRand directly. WireGuard instead uses the proper kernel function for
asking for an unsigned 32-bit number. The kernel implements get_random_u32()
under the hood using a variety of backends for it. One of them is RdRand.

Other facilities in the kernel, such as ASLR, also use get_random_u32().

Many things in the kernel use get_random_u32(). That's the proper function to
use.

When presented with this bug, the upstream kernel maintainers chose not to fix
get_random_u32(), due to the availability (?) of microcode updates for AMD
chips. That's not my decision. WireGuard is just a mere consumer of
get_random_u32(), like all other modules. This is an upstream kernel bug.

~~~
zaroth
> _That's not my decision._

So says a maintainer of WireGuard. HN is beautiful sometimes.

If RANDOM_TRUST_CPU is disabled, will that stop the kernel function from using
RDRAND and avoid this issue for anyone using the ‘get_random_u32()’ function?

~~~
dfox
No. get_random_u32() simply returns the result of RDRAND if RDRAND is available,
regardless of any runtime configuration. For me that is a pretty significant
issue, because it is documented as being based on a separate kernel-only CSPRNG
with somewhat specific security assumptions (complete with a pretty large
discussion in the comments of random.c as to why anybody would want that weird
thing).
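
As a rough illustration of the control flow described here, the following is a toy Python model, not the kernel's code. The names `get_random_u32_model` and `broken_rdrand` are made up for the example, and `secrets.randbits` merely stands in for the kernel's own CSPRNG.

```python
import secrets
from typing import Callable, Optional


def get_random_u32_model(arch_rng: Optional[Callable[[], Optional[int]]]) -> int:
    """Toy model: prefer the architecture RNG, else use a software CSPRNG."""
    if arch_rng is not None:
        value = arch_rng()
        if value is not None:
            # The hardware reported success, so the value is returned
            # as-is, with no sanity check on its contents.
            return value
    # Stand-in for the kernel's software CSPRNG fallback.
    return secrets.randbits(32)


def broken_rdrand() -> Optional[int]:
    # Hypothetical faulty instruction: reports success, always all-ones.
    return 0xFFFFFFFF
```

In this model every caller gets 0xFFFFFFFF for as long as the hardware source claims to work, which matches the behaviour being complained about.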

------
KingMachiavelli
Is there an upstream source for this article? There's no mention of the actual
versions of the Linux kernel, linux-firmware, microcode (either bios or
applied by the kernel), etc. Which makes it seem less thoroughly investigated
since I could easily see Ubuntu having something outdated.

I just got my 3900x a couple of weeks ago and I don't have this bug according to
the test code provided in the article.

I'm running:

> Archlinux linux-5.3.7

> Linux-firmware: 20191022.2b016af-1

> Microcode: v2.2 (patch_level=0x08701013)

> MB: ASUS PRIME x570-P. bios=1201
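
For anyone wanting to report the same detail, here is a small sketch of reading the microcode revision the running kernel reports. It assumes an x86 Linux system where `/proc/cpuinfo` contains a `microcode : 0x...` line; the helper names are made up for the example, and the sample value in the usage note is simply the patch level quoted above.

```python
import re
from typing import Optional


def microcode_revision(cpuinfo_text: str) -> Optional[int]:
    """Return the first 'microcode' value in /proc/cpuinfo-style text."""
    match = re.search(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return int(match.group(1), 16) if match else None


def read_microcode_revision(path: str = "/proc/cpuinfo") -> Optional[int]:
    """Read the revision from the given file, or None if unavailable."""
    try:
        with open(path) as f:
            return microcode_revision(f.read())
    except OSError:
        return None
```

Feeding it the text `"microcode\t: 0x8701013\n"` yields the integer 0x08701013; on systems without that field it returns `None`.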

~~~
fivefive55
It was patched for almost all systems almost immediately like he said. I
suspect the reason he still had the bug was due to the Asrock Rack X470D4U
motherboard being an older generation board, and also a micro atx server board
of all things. They might have only sold a couple hundred of them total in a
niche like that so it's not too surprising it would be a low priority on the
bios update list.

I also find it kind of funny that he calls Asrock Asus every single time he
mentions them in the article. If he was trying to install an Asus bios on his
Asrock mobo then he's got bigger problems than this bug.

~~~
ActorNightly
So if I was to build a Ryzen System today, would that bug not exist?

~~~
fivefive55
You would definitely be fine as long as you're using an updated motherboard
with at least AGESA 1003ABB, which would be any x570 board and likely the vast
majority of x470s by now.

~~~
notzuck
Last time I checked, this was not fixed for my B450-TOMAHAWK-MAX

~~~
fivefive55
Hmm, I see 1.0.0.3abba was released for that board on 9-18-2019. That should
include the fix, although I'm not an expert, maybe it isn't there for some
reason?

~~~
notzuck
Last time I checked it was only BETA - I will check again to see if anything
has changed.

EDIT: looks like it's out of beta now - awesome! Thanks.

------
pingyong
>If an existing session has the same ID as the new number, WireGuard asks
RDRAND for another "random" number, checks it for uniqueness, and so on. Since
RDRAND on my system—and any non-microcode-updated Ryzen 3000 system—always
returned 0xFFFFFFFF no matter what, that means infinite loop.

I'm pretty surprised that so much software seems to rely on external (be it
from the CPU, or the OS) random number generation directly instead of simply
using a software PRNG and seeding it with external sources (and maybe
additional creative sources if desired). That seems like a way more
predictable system to me, in the sense that at least you get guaranteed
uniformly distributed numbers, even if the external seeds are not uniformly
distributed at all.
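
To make the failure mode in the quoted passage concrete, here is a hedged sketch of a "draw until unique" ID allocator, plus the seeded software PRNG idea. All names are hypothetical, the retry cap is added only so the example terminates, and `random.Random` is not cryptographically secure, so this only illustrates the uniqueness point.

```python
import random
from typing import Callable, Optional, Set


def allocate_id(
    existing: Set[int], next_u32: Callable[[], int], max_tries: int = 1000
) -> Optional[int]:
    """Draw IDs until one is unused; give up after max_tries draws."""
    for _ in range(max_tries):
        candidate = next_u32()
        if candidate not in existing:
            existing.add(candidate)
            return candidate
    return None  # without the cap, a constant source would spin forever


def stuck_rdrand() -> int:
    # Hypothetical faulty source that always returns all-ones.
    return 0xFFFFFFFF


def seeded_prng(seed_source: Callable[[], int]) -> Callable[[], int]:
    """Software PRNG seeded once from an external source."""
    prng = random.Random(seed_source())
    return lambda: prng.getrandbits(32)
```

With `stuck_rdrand` the first allocation succeeds and every later one fails. With `seeded_prng(stuck_rdrand)` the IDs are distinct, but fully predictable given the constant seed, so this helps uniqueness and nothing else.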

~~~
owenmarshall
Because your intuition about the ease of building random number generators is
wrong, especially if they need to be cryptographically secure. I trust my OS’
RNG functions[1] far more than I trust third party code to be right, and I
trust it a million times more than I trust myself to get it right. Trusting a
small core of validated code is far more likely to be correct in the long
term.

Wireguard’s mistake was not using what the OS gave them; I’m not sure why they
didn’t.

[1]:
[https://man.openbsd.org/arc4random.3](https://man.openbsd.org/arc4random.3)

~~~
Tharre
WireGuard is a kernel module, it already uses what the kernel gave it, which
is get_random_u32(). Note that this is NOT used for anything crypto related,
it's just used in the hashtable code.

~~~
owenmarshall
Amazing. Apologies to the Wireguard folks: first for casting aspersions,
second that they have to deal with such a trash heap of an API.

------
willis936
Wow that’s quite the bug. I’m surprised systemd uses RDRAND since I remember a
big scare related to unauditable RNGs and government backdoors around the time
that RDRAND and RDSEED came into existence. The thing is, they’re technically
auditable. You just need an SEM and an astounding amount of time to reverse
engineer a core to the point that you can identify and understand the RNG
circuitry, beginning to end.

Edit: There’s some interesting discussion in the Wikipedia article’s talk
page’s criticism section.

[https://en.wikipedia.org/wiki/Talk:RdRand#Criticism](https://en.wikipedia.org/wiki/Talk:RdRand#Criticism)

~~~
kryptiskt
I always found that worry quite weird, these are processors that already have
things like System Management Mode that takes control away from the kernel and
runs some unknown code at an arbitrary point in the execution. Worrying about
a subverted RNG and trusting the rest of the CPU will just create a load of
busywork that does nothing to keep you more secure.

~~~
zozbot234
System Management Mode is the feature that does things like getting your fans
to spin up when the processor runs hot. Pretty critical when you put it like
that. Yes, that requires "tak[ing] control away from the kernel and running
some unknown code at an arbitrary point in the execution". We could do away
with such things entirely, but then our kernels would look like coreboot code,
and a comparatively trivial kernel bug could permabrick your system. Not a
very sensible tradeoff!

~~~
thenewnewguy
I think they're more worried about the untrusted/unauditable part of it. I
suspect most people that dislike the ME would be perfectly fine with it if
Intel released the code for it and allowed users to build and upload their own
copy of the ME code.

------
xoa
> _When there's a bug in the CPU microcode, you're at the mercy of your
> motherboard vendor to release a new system BIOS that will update it for
> you—you can't just go to some download link at AMD and apply a fix
> yourself._

I think this is the real kicker, and might represent one of the next major
fronts in the security struggle. It's a little different from the debates
happening right now about support periods, that at least has clear economic
implications. It's one thing to argue about whether a product should still be
supported at all. But it's quite another when something is being supported,
and does in fact have a patch available, yet many owners still can not apply
it anyway. That seems like an avoidable failure, and something worth
considering legislation around. The industry could and should have more
standardized methods and requirements to make sure that any patches that are
created do make it out to product owners quickly and universally, there just
hasn't been consistent motivation.

~~~
calcifer
> _When there's a bug in the CPU microcode, you're at the mercy of your
> motherboard vendor to release a new system BIOS that will update it for
> you—you can't just go to some download link at AMD and apply a fix
> yourself._

At least on Linux, that's not true. Intel publishes microcode updates on
Github [1] (and distros package it) and AMD has it upstreamed in linux-
firmware [2], so you don't have to rely on motherboard vendors at all.

[1] [https://github.com/intel/Intel-Linux-Processor-Microcode-
Dat...](https://github.com/intel/Intel-Linux-Processor-Microcode-Data-Files)

[2]
[https://git.kernel.org/pub/scm/linux/kernel/git/firmware/lin...](https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-
firmware.git/tree/amd-ucode)

~~~
jplayer01
Microsoft also distributes microcode updates through Windows Update. Although,
on a second look, it seems like all the microcode updates they've made
available were only security-related (Spectre and co) Intel patches.

------
onlydnaq
Why would anyone use rdrand directly? Seems like user space applications
should use getrandom() or /dev/urandom and the kernel should use rdrand as a
complementary random number source in its random number generator.

No user space program should need to use rdrand directly at all.
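
A short sketch of the user-space approach suggested here: ask the operating system for random bytes instead of touching the CPU instruction. It uses Python's `os.urandom`, which draws from the OS random source; the wrapper name `random_u32` is made up for the example.

```python
import os


def random_u32() -> int:
    """Return a 32-bit unsigned integer from the OS random source."""
    return int.from_bytes(os.urandom(4), "little")
```

No elevated privileges are needed for this, and the application never has to know whether RDRAND exists or works.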

~~~
pmjordan
I believe Wireguard is implemented in the kernel. I'd argue there should be a
kernel-wide wrapper function for this sort of thing, or if that already
exists, Wireguard should probably use it.

That doesn't explain why systemd uses it, of course.

~~~
cyphar
WireGuard does use the right wrapper -- get_random_u32(). The issue is that
the implementation will just use whatever the architecture-provided randomness
source provides if it's available[1]. That's the real bug.

[1]:
[https://elixir.bootlin.com/linux/v5.3.6/source/drivers/char/...](https://elixir.bootlin.com/linux/v5.3.6/source/drivers/char/random.c#L2343)

------
johnklos
Too bad that all the benefits of running an open source OS are lost because
it's too arduous for regular people to recompile their OS and software. At
least the BSDs make it trivial. When you give up freedom for "ease of use",
you get what people give you, and you have no real options.

I'm sorry, but I really don't feel bad for this person.

Also, it's disingenuous to say that only "some" (the author uses that word)
motherboards have updated microcode. "Most" would be more accurate, and
"pretty much all, with few exceptions" is actually closer to the truth. Even
my el cheapo A320 chipset motherboard already has AGESA 1.0.0.3 ABBA.

~~~
CDSlice
> When you give up freedom for "ease of use", you get what people give you,
> and you have no real options.

> I'm sorry, but I really don't feel bad for this person.

And this is why Linux will never take off as a consumer OS.

------
PhantomGremlin
For those interested in the origins of RDRAND, here is some background Intel
stuff about it.

[https://spectrum.ieee.org/computing/hardware/behind-
intels-n...](https://spectrum.ieee.org/computing/hardware/behind-intels-new-
randomnumber-generator)

[https://www.hotchips.org/wp-
content/uploads/hc_archives/hc23...](https://www.hotchips.org/wp-
content/uploads/hc_archives/hc23/HC23.18.2-security/HC23.18.210-Random-
Numbers-Cox-Intel-e.pdf)

NB: RDSEED was added at a later date, some people didn't like that Intel used
AES for conditioning, they wanted the raw bits.

------
stevekemp
The screenshot in the article includes:

    sudo hexdump -C -n 64 /dev/urandom

sudo won't be required to read from /dev/urandom, although it probably is for
/dev/hwrng.

------
mattferderer
It sounds like this article was all about Asus not updating their
motherboards. Then at the bottom we find out it's not an Asus motherboard but
an older AsRock Micro ATX Server board.

------
zachruss92
I don't have any insights to add here but I did just build a new PC this
weekend with an Asus X570 MoBo, a 3900x, and 32GB of 3600MHz memory w/ Ubuntu
19.04 and everything worked as expected. I haven't even updated the BIOS yet.
I did spend all day Sunday getting the Nvidia/CUDA drivers configured for
Tensorflow, but that's a different story.

I just hope people don't think that every system build will deal with these
problems.

------
kevin_thibedeau
Tinfoil hat time: Maybe this was a canary if the system is designed to have an
exploitable RNG that can be weakened on demand by back channels.

------
unethical_ban
My Lenovo E485 with Ryzen 2700U processor had the same issue - the BIOS update
for that was provided in August, I believe, though I only installed it this
past weekend.

It was a very frustrating thing to happen, since I was super excited to be
back on AMD for my laptop.

------
swills
Could WireGuard use libjitterentropy
([https://github.com/smuellerDD/jitterentropy-
library](https://github.com/smuellerDD/jitterentropy-library)) to avoid this?

~~~
zx2c4
WireGuard uses the correct kernel function, get_random_u32(). WireGuard does
not use RdRand directly. WireGuard instead uses the proper kernel function for
asking for an unsigned 32-bit number. The kernel implements get_random_u32()
under the hood using a variety of backends for it. One of them is RdRand.

Other facilities in the kernel, such as ASLR, also use get_random_u32().

Many things in the kernel use get_random_u32(). That's the proper function to
use.

When presented with this bug, the upstream kernel maintainers chose not to fix
get_random_u32(), due to the availability (?) of microcode updates for AMD
chips. That's not my decision. WireGuard is just a mere consumer of
get_random_u32(), like all other modules. This is an upstream kernel bug.

~~~
altfredd
I imagine that ASLR uses get_random_u32() because it is needed during the
earliest phases of boot, before the random pool is initialized. It might not
necessarily be the proper function for making kernel random numbers — rather
such function might not exist in Linux yet.

------
fnordsensei
It's interesting that it relies on thermal noise to generate random numbers.
Would this mean, theoretically, that the more stable the thermals of the
system are, the less entropy, and the less random the numbers generated?

~~~
willis936
Stable thermals are fine. Very low temperatures would be an issue. I wonder if
LN2 affects the quality of the RNG values.

[https://en.wikipedia.org/wiki/Johnson–Nyquist_noise](https://en.wikipedia.org/wiki/Johnson–Nyquist_noise)

~~~
dragontamer
Computers are made from transistors... so there's a good chance that it's
probably shot noise instead.

[https://en.wikipedia.org/wiki/Shot_noise](https://en.wikipedia.org/wiki/Shot_noise)

Johnson Nyquist noise is measured from resistors, and is therefore present in
every circuit. Shot noise however, is more evident in transistors.

\------------

I don't know if the AMD circuit is Shot-noise or Johnson Nyquist noise. But
it's important to remember that there are many sources of noise / randomness in
normal circuits.

The funny thing about true-random number generators: noise sources are all
over the place! A beginner who plays with transistors will almost immediately
"discover" a form of noise, and be forced to minimize it.

With regard to "true random" circuits (of which there are many, many different
kinds), they all isolate a particular noise, and then amplify it. You have to
isolate a particular form of noise if you want to get a good entropy
estimation.

EDIT: It seems both shot noise and Johnson-Nyquist noise are white-noise, so
maybe it doesn't matter. White noise + white noise should result in white
noise, but it's been a long time since I've taken this circuit class...

------
fortran77
> When the RDRAND bug in Ryzen 3000 first surfaced back in June, Linux users
> widely reported that their entire Ryzen 3000-powered systems wouldn't boot.
> The failure to boot was due to systemd's use of RDRAND—and it wasn't
> systemd's first clash with AMD and a buggy random-number generator,
> unfortunately.

Then why are half the posts on Hacker News about how wonderful Ryzen is and
how Intel is all but dead, etc.?

~~~
toast0
HN has a pretty vocal contingent that won't allow systemd on their systems, so
they wouldn't have noticed?

Also, this issue was discussed here when it was noticed shortly after release.
AMD acknowledged the issue and released a fix pretty quickly, with the caveat
that microcode updates don't necessarily make it out to all motherboards very
rapidly.

------
parliament32
>When there's a bug in the CPU microcode, you're at the mercy of your
motherboard vendor to release a new system BIOS that will update it for
you—you can't just go to some download link at AMD and apply a fix yourself.

Since when? On Linux anyway microcode can be installed just fine as a package,
often bundled in kernel updates.

------
person_of_color
I don't get it. Why is RdRand failing?

------
carapace
Everything— _everything_ —about that was awful.

------
sempron64
Turns out the IEEE standard floating point number is ffffffff...

( see the alt text on [https://xkcd.com/221/](https://xkcd.com/221/) )

~~~
paulmd
That's the problem with randomness... you can never be sure!

[https://dilbert.com/strip/2001-10-25](https://dilbert.com/strip/2001-10-25)

------
ncmncm
Any machine that has a CCD image sensor has an excellent source of high-
quality random noise, in the least-significant bit of each pixel.

Each pixel will have a bias, but x-oring lots of them together gets nice,
unbiased random bits to stir into your pool.
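
The debiasing claim can be checked with a little arithmetic. This sketch computes the probability that the XOR of `n` bits is 1, under the assumption (mine, for the example) that the bits are independent and each is 1 with the same probability; real pixels may be correlated, in which case the result is weaker.

```python
def xor_prob_one(p_one: float, n: int) -> float:
    """Probability that the XOR of n independent bits is 1,
    where each bit is 1 with probability p_one."""
    p = 0.0  # the XOR of zero bits is 0
    for _ in range(n):
        # The running XOR becomes 1 if exactly one of
        # (running XOR so far, next bit) is 1.
        p = p * (1.0 - p_one) + (1.0 - p) * p_one
    return p
```

For a single bit that is 1 with probability 0.6 the bias is 0.1; XORing eight such bits leaves a bias of roughly one part in a million.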

If it doesn't have a CCD, maybe it has a microphone or ambient-light sensor.
That yields fewer total bits, but usually enough.

Maybe it has two, or all three. The advantage of mixing randomness from
multiple sources is that you don't need to trust them all. If one starts
feeding you FFFF..., the output has just a little less entropy, not none.
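
A minimal sketch of mixing several sources so that one stuck input cannot flatten the output. It hashes the samples together with SHA-256 from Python's `hashlib`; the function name and the length-prefix framing are choices made for this example, not a description of any kernel's pool.

```python
import hashlib
from typing import Iterable


def mix_sources(samples: Iterable[bytes]) -> bytes:
    """Hash length-prefixed samples together into 32 output bytes."""
    h = hashlib.sha256()
    for sample in samples:
        # The length prefix keeps sample boundaries unambiguous.
        h.update(len(sample).to_bytes(4, "big"))
        h.update(sample)
    return h.digest()
```

If one input is always `b"\xff\xff\xff\xff"`, the digest still changes whenever any other input changes, so the stuck source contributes nothing but also takes nothing away.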

My guess for why AMD produces FFFFF... is that a Spook Mode was activated by
accident. If so, it was supposed to switch to that only on command from the
"management engine", the wired-in exploit every big CPU has. Maybe somebody
deep within AMD wanted to ensure that we would not trust our RDRAND
instruction overmuch.

