
However improbable: The story of a processor bug - dsr12
https://blog.cloudflare.com/however-improbable-the-story-of-a-processor-bug/
======
Normal_gaussian
Maybe I expect too much, but when I started reading I was under the impression
they had discovered a previously unknown bug.

Instead I found they hadn't applied microcode updates and were suffering from
a known bug that causes these crashes. Wouldn't it be devops 101 to monitor
what updates are available to your fleet (and selectively apply them)?

I would expect this of a small company that doesn't have the resources to look
after its fleet, but not from Cloudflare. I also don't get their dig about
'errata'; it is standard practice to include post-release hardware faults (and
their solutions) in errata.

~~~
ASalazarMX
Is devops really meant to keep microcode patches up to date? Isn't hardware
and OS security sysadmin work?

~~~
Normal_gaussian
Yes and no. For somewhere like CloudFlare I would expect devops to manage this
kind of sysadmin work. The roles aren't sharply defined, so it's really a
matter of semantics.

------
jo909
That is a lot of wasted time for an already known and fixed bug. At the scale
of Cloudflare I would have expected them to have a few people intimately
familiar with their server hardware and in steady contact with their vendors,
who know and evaluate every firmware update for every component they use.

At my scale, we install whatever is publicly available at a time of our
convenience, in the hopeful trust that the vendor knew what they were doing.

------
rdtsc
> SIGSEGV is not the only signal that indicates an error in a process and
> causes termination. We also saw process terminations due to SIGABRT and
> SIGILL

Also SIGBUS. For me it happens mostly when playing with mmap and shared
memory, but casting random numbers to pointers and trying to access them can
do it as well. Don't be too surprised if you see it once in a while.
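
The mmap case is easy to reproduce; here's a minimal C sketch (Linux assumed,
the file path is made up): touching a MAP_SHARED page whose file backing has
been truncated away raises SIGBUS.

    /* Provoke SIGBUS by touching an mmap'd page whose backing was truncated away. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/tmp/sigbus-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }
        ftruncate(fd, 4096);              /* give the file one page of backing */
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        p[0] = 'x';                       /* fine: the page is backed by the file */
        ftruncate(fd, 0);                 /* yank the backing out from under it */
        p[0] = 'x';                       /* SIGBUS: no file behind the page now */
        puts("not reached");
        return 0;
    }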

> “BDF76 An Intel® Hyper-Threading Technology Enabled Processor May Exhibit
> Internal Parity Errors or Unpredictable System Behavior”

Lovely.

In this sense I've always liked it when various layers implement their own
checksumming and sanity checking. Don't just rely on the hardware to do it
(ECC, RAID controllers, etc.). Wrote a database which saves a chunk to disk?
Add a checksum with it. Have something that sends data over the wire? Send a
checksum with it. With disks it's even more fun: periodically read the data
back and reverify it, as bit rot will slowly eat away at it. Same with
backups: verify that your backups can be read and that the data there is
consistent.

It's not cheap and you'd pay a performance penalty for it, so it's definitely
a tradeoff. Just make sure to consider it and don't forget about it.
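
Something like this, as a rough sketch (using zlib's crc32; the chunk layout
here is made up):

    /* Store a CRC32 alongside each chunk on write; recompute and compare on read. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>    /* link with -lz */

    struct chunk {       /* made-up on-disk record: checksum + payload */
        uLong crc;
        unsigned char data[64];
    };

    int main(void) {
        struct chunk c = {0};
        memcpy(c.data, "payload that must survive the disk", 34);
        c.crc = crc32(crc32(0L, Z_NULL, 0), c.data, sizeof c.data);

        /* ... write c out, read it back days later ... */

        uLong check = crc32(crc32(0L, Z_NULL, 0), c.data, sizeof c.data);
        if (check != c.crc)
            fprintf(stderr, "corruption or bit rot detected\n");
        else
            puts("chunk verified");
        return 0;
    }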

~~~
jacquesm
SIGBUS tends to be alignment errors and cross-page references gone wrong.

~~~
rdtsc
Indeed, it was cross-page references from my mmap and shared memory setup.

------
PuffinBlue
> There was no obvious pattern to the servers which produced these mystery
> core dumps. We were getting about one a day on average across our fleet of
> servers...The probability that an individual server would get a mystery core
> dump seemed to be very low (about one per ten years of server uptime,
> assuming they were indeed equally likely for all our servers).

Does that tell us that Cloudflare has about 365 * 10 servers in their fleet? I
can never quite work out probabilities but I figured they'd have more.
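
Spelling out the arithmetic I'm attempting: one dump per ten years of uptime
per server and one dump per day fleet-wide gives

    N \times \frac{1}{10 \times 365\ \text{server-days}} = \frac{1}{\text{day}}
    \quad\Longrightarrow\quad N \approx 3650\ \text{servers}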

EDIT:

> all of the mystery core dumps came from servers containing The Intel Xeon
> E5-2650 v4. This model belongs to the generation of Intel processors that
> had the codename “Broadwell”, and it’s the only model of that generation
> that we use in our edge servers, so we simply call these servers Broadwells.
> The Broadwells made up about a third of our fleet at that time...

So at least 3 times the number of servers the above probabilities would
suggest, and that's for edge nodes only. Not sure why I find working out this
info fun, but there you go.

~~~
ikeboy
The probability assumes all servers are equally likely, which turned out to be
wrong; but since that per-server rate was itself derived from the fleet-wide
rate, the 365 * 10 number should still be about right.

~~~
bluedino
I thought I heard '4,000 servers' at one point

------
homero
I never realized my BIOS or Windows was live-patching my processor. Now I'm
wondering if I'm missing updates, since neither tells me.
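
On a Linux box you can at least see which revision is loaded; a quick C sketch
(it just reads /proc/cpuinfo, so Linux/x86 only):

    /* Print the microcode revision Linux reports for each logical CPU. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("fopen"); return 1; }
        char line[256];
        while (fgets(line, sizeof line, f))
            if (strncmp(line, "microcode", 9) == 0)
                fputs(line, stdout);   /* e.g. "microcode : 0xb000021" */
        fclose(f);
        return 0;
    }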

~~~
fulafel
Plus, Intel is badly undercommunicating the CPU bug impacts in their errata.

------
woliveirajr
> The most convenient way for us to apply the microcode update to our
> Broadwell servers at that time was via a BIOS update from the server vendor.

You can read this article from another point of view: there was a bug
somewhere that was leaking customer information. After searching for a while,
they discovered that the CPU had a bug that had already been solved/patched;
all it took was a BIOS update, which wasn't done before because of... well,
whatever.

~~~
jgrahamc
The CPU bug wasn't causing a security problem. We made a conscious decision to
get crashes in production down to zero so that we would be alerted early if a
subsequent security issue of a similar type occurred.

During that investigation we came across mystery crashes which were already
fixed by a microcode update.

~~~
woliveirajr
:) thanks for clarifying

------
paradroid
@cloudflare I took the photo in your lede :)

~~~
dpw
Thank you! We often use Creative Commons-licensed images in our blog posts.
We always include credit, but we owe a big debt of gratitude to the people who
take these photos and make them available.

