
DRAM chip failures reveal surprising hardware vulnerabilities - mud_dauber
http://spectrum.ieee.org/computing/hardware/drams-damning-defects-and-how-they-cripple-computers
======
fivesigma
Thank you Intel for disabling ECC capabilities on consumer grade CPUs.

With DRAM chips getting smaller and denser every year, it takes an ever
smaller disturbance to flip a bit. ECC needs to go mainstream yesterday.

~~~
AnthonyMouse
At least AMD is doing the right thing.

Obviously the problem is that Intel has twice the single thread performance,
but all modern CPUs are so fast that it's almost irrelevant. And on top of
that most of the "actually needs performance" applications are being updated
to use the GPU instead of the CPU anyway.

~~~
Osiris
I have an AMD Bulldozer CPU that was released in 2011 and it supports ECC RAM.
I admit that I haven't used it.

I assume Intel disables ECC in consumer CPUs to help prop up the prices of the
Xeon models which are essentially the same CPUs but at a higher price point.

~~~
gruez
What about xeon e3? They have similar prices to i7s.

~~~
wmf
Except a Xeon motherboard often costs significantly more.

~~~
anoother
Indeed.

It's a little-known fact that Celeron, Pentium and i3 processors also support
ECC, when used in a motherboard with a server/workstation chipset.

It's just i5s and i7s that are crippled. Cunning.

------
PhantomGremlin
They make an interesting claim:

    
    
       Between 12 percent and 45 percent of machines
       at Google experience at least one DRAM error
       per year. This is orders of magnitude more
       frequent than earlier estimates had suggested.
    

We just had a big discussion on this topic a few days ago.
[https://news.ycombinator.com/item?id=10598629](https://news.ycombinator.com/item?id=10598629)

~~~
mud_dauber
I'm not sure if this is the cited source, but the following Google paper is
one of my bookmarks:

[http://static.googleusercontent.com/media/research.google.co...](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf)

------
crististm
Judging from Jeff Atwood's post only three days ago, a lot of people converted
or seem to believe now that bit flips in RAMs are a thing of the past and ECC
is just an "enterprisey" thing.

[https://news.ycombinator.com/item?id=10598629](https://news.ycombinator.com/item?id=10598629)

Even he didn't dare forgo ECC for the database... :)

~~~
mud_dauber
With DRAM prices so dirt cheap, it's difficult to persuade any chip
manufacturer to devote another 10-15% (my estimate, based on previous memory
design projects) of die area to ECC bits & decode circuitry when they won't
get another dime for the capability.

~~~
feld
Why can't they charge 15% more for ECC to recoup the costs?

~~~
mud_dauber
It depends on the end customer.

PC manufacturers have zero incentive to add ECC when 99.99% of consumers don't
care or even know.

Networking: a mixed bag. Some designs manage ECC inside the processor; others
use memories with ECC for specialized functions like packet QoS. Commodity
DRAM errors are a known irritant and are planned for.

Servers: I'm not quite sure of my facts here, but a DRAM manufacturer would
have to persuade a commodity manager at Dell, etc., that ECC is worth it. A
commodity manager won't give a d___ unless his systems guys say that it's
needed.

~~~
feld
I don't think you get the point. Just stop making non-ECC RAM for all future
DDR releases. Let's just say, for example, "DDR6 has ECC baked into the
standard" and have everyone on board with this. The memory manufacturers bump
prices a bit to cover the costs and the entire world moves on.

~~~
PhantomGremlin
_Just stop making non-ECC RAM_

That's what's currently done with flash, but for DRAM that's not smart from a
_system_ point of view.

A long time ago ECC was done with Hamming codes[1]; now there are similar but
more sophisticated variants. These codes share a useful property: the wider
the word, the smaller the fractional overhead of the check bits. So, e.g.

    
    
        8 bits + 5 bits
       16 bits + 6 bits
       32 bits + 7 bits
       64 bits + 8 bits
    

That is enough for a minimum Hamming distance of 4 between valid words, which
gets you SEC (single error correction) plus DED (double error detection). You
can save one bit if you're willing to forgo the latter.
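The table above follows from the Hamming bound: r check bits cover k data bits
whenever 2^r >= k + r + 1, plus one extra overall-parity bit to upgrade SEC to
SEC-DED. A quick sketch (my own illustration, not from the article):

```python
def secded_check_bits(k):
    """Check bits needed for SEC-DED on k data bits: the smallest r
    with 2**r >= k + r + 1 (Hamming SEC), plus one overall parity
    bit for double-error detection (DED)."""
    r = 0
    while 2 ** r < k + r + 1:
        r += 1
    return r + 1

# Reproduces the table: 5, 6, 7, and 8 check bits respectively.
for k in (8, 16, 32, 64):
    print(k, "data bits ->", secded_check_bits(k), "check bits")
```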

Many decades ago (but I couldn't find a reference with a quick google) Micron
made a chip that had on-board ECC. IIRC it was 8+4, which meant that it would
internally correct a 1 bit error, but there was a possibility it would mis-
correct some 2+ bit errors (which were rare).

But that meant that Micron was putting 50% more bits onto a die over a non-ECC
device. They probably did this not for system reliability but to cover up the
failings of their chips at the time. Yes you could make each bit smaller, but
no way would it pay off if you needed to put 50% more bits onto a die.

ECC is best done with wide words. For example, a SIMM that presents a 72-bit
interface can easily be built with 9 chips, each 8-bits wide.

The system (CPU, memory controller, whatever) takes in 72 bits (or some
multiple thereof) from DRAM and does ECC internally. Note that as words get
wider the time to compute ECC goes up. So a system could even speculatively
execute using the uncorrected data, while in parallel checking to make sure
that it was OK. This requires recovery in case of error, but since errors are
infrequent it could be a big win overall.

It doesn't make sense for a DDRx spec to require ECC. It's cheaper and smarter
to do ECC at a system level.
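To make the SEC-DED mechanism concrete, here's a toy bit-level sketch in
Python of the (12,8) Hamming code plus an overall parity bit from the table
above (an illustration of the principle, not how a memory controller does it
in silicon):

```python
def hamming_secded_encode(data_bits):
    """Encode data bits (LSB first) into a SEC-DED codeword.
    Positions 1, 2, 4, 8, ... hold Hamming parity bits; index 0
    holds an overall parity bit enabling double-error detection."""
    k = len(data_bits)
    r = 0
    while 2 ** r < k + r + 1:        # smallest r with 2^r >= k + r + 1
        r += 1
    n = k + r
    code = [0] * (n + 1)             # 1-indexed; code[0] unused here
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):          # not a power of two: data position
            code[pos] = next(it)
    for i in range(r):               # set each Hamming parity bit
        p = 1 << i
        for pos in range(1, n + 1):
            if pos != p and pos & p:
                code[p] ^= code[pos]
    overall = 0
    for b in code[1:]:
        overall ^= b
    return [overall] + code[1:]

def hamming_secded_decode(word):
    """Return (data_bits, status): 'ok', 'corrected', or 'double'."""
    code = [0] + word[1:]            # 1-indexed view of the Hamming part
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos          # positions of 1-bits XOR to 0 if valid
    parity = word[0]
    for b in code[1:]:
        parity ^= b                  # overall parity of the whole word
    if syndrome == 0 and parity == 0:
        status = 'ok'
    elif parity == 1:                # odd number of flips: single error
        if syndrome:
            code[syndrome] ^= 1      # flip the bad bit back
        status = 'corrected'
    else:                            # even flips, nonzero syndrome
        return None, 'double'
    data = [code[pos] for pos in range(1, n + 1) if pos & (pos - 1)]
    return data, status

byte = [1, 0, 1, 1, 0, 0, 1, 0]      # 8 data bits -> 13-bit word (8 + 5)
word = hamming_secded_encode(byte)
word[6] ^= 1                         # flip one bit "in flight"
fixed, status = hamming_secded_decode(word)
assert fixed == byte and status == 'corrected'
```

Any two flipped bits leave a nonzero syndrome with even overall parity, which
is how the decoder detects (but cannot fix) double errors.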

[1]
[https://en.wikipedia.org/wiki/Hamming_code](https://en.wikipedia.org/wiki/Hamming_code)

------
nano_o
The article seems to be based on the following paper, published in 2012:
"Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and
the Implications for System Design". Newer research on the topic has since
been published, for example this year: "Revisiting Memory Errors in Large-
Scale Production Data Centers: Analysis and Modeling of New Trends from the
Field" (from Facebook) and "Memory Errors in Modern Systems: The Good, The
Bad, and The Ugly".

------
sevensor
The idea that cosmic rays are responsible for DRAM failures seems laughable.
I'm surprised anybody ever thought that.

I worked in a DRAM fab; I saw how it was made and tested. It's a month-long
process involving dozens of machines. Particulate contamination, watermarks,
process variation -- there are tons of things that can go wrong. And then
testing. There's just no way to test thoroughly enough to catch all of the
possible issues with timing and crosstalk. Not to mention electromigration and
dielectric breakdown. You can do burn-in, but it's not going to catch
everything. We even know which die are most failure-prone (usually edge die,
due to process uniformity issues.) If they pass test, they get shipped!

 _Of course_ these issues are manufacturing related.

~~~
sufiyan
If so, why don't they increase refresh rates, given that most of the memory
(>95%) can handle over 3x the existing refresh rates? This could potentially
increase the energy savings and also make the system faster because of less
waiting time during refresh and row activation.

~~~
sevensor
Faster refresh will mitigate leakage, so you can run your DRAM hotter, but I
don't see how that helps energy savings. Refresh means current flows out of
the cell, through the sense amplifier, and then current flows back in, through
the word driver. More refresh means more current. More current flow, more
power consumption. This does not seem to argue for energy savings.

~~~
sufiyan
I meant reduce the refresh rate. My bad, I should have proofread what I typed.
Essentially, what I meant was that many DRAM chips needn't be refreshed as
often as they are now.
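A back-of-the-envelope sketch of why fewer refreshes save energy (all numbers
below are assumed, illustrative values, not measurements):

```python
ROWS = 65536          # rows per bank (illustrative value)
E_ROW_NJ = 100.0      # assumed energy per row refresh, in nanojoules

def refresh_power_mw(interval_ms):
    """Average refresh power if every row is refreshed once per interval.
    Power scales inversely with the interval: each refresh discharges
    the cell through the sense amp and recharges it through the driver."""
    refreshes_per_second = ROWS / (interval_ms / 1000.0)
    return refreshes_per_second * E_ROW_NJ * 1e-9 * 1000.0  # W -> mW

base = refresh_power_mw(64)         # standard 64 ms refresh interval
relaxed = refresh_power_mw(64 * 3)  # 3x longer interval -> ~1/3 the power
```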

------
moconnor
"DRAM chips are a little like people: Their faults are not so much in their
stars as in themselves. And like so many people, they can function perfectly
well once they compensate for a few small flaws."

A beautifully whimsical way to end a fascinating technical article!

------
mschuster91
Does anyone know of a consumer-grade laptop with ECC RAM?

~~~
jkot
Skylake in its Xeon version. So far there is the Dell XPS 15 and something
from Lenovo.

~~~
mschuster91
Thanks. Though, the downside: both companies got caught red-handed pre-loading
their computers with spyware, malware and other bullshit :'(

~~~
jacquesm
If you care enough to put ECC in your machine you probably should start with a
clean install as well.

~~~
mschuster91
Doesn't help a single bit (pun intended) against the bullshit in the BIOS data
tables which installs the malware no matter what :(

------
gima
Source paper appears to be:
[http://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf](http://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf)

Based on the list of authors at the beginning of the article and the credited
source given in the figures ("Ioan Stefanovici, Andy Hwang & Bianca Schroeder"
and "Source: Hwang, Stefanovici, and Schroeder, Proceedings of ASPLOS XVII,
2012").

------
mozumder
So how do we enable page retirement in popular operating systems on machines
with ECC?

------
tempodox
Can I pay the publisher to omit those blinking distractions? If I want to
watch a movie while reading an article, I bring one myself.

~~~
acqq
On iOS devices I can tap the book icon by the URL and read just the text
there.

~~~
Drdrdrq
Or in firefox, anywhere.

