DRAM chip failures reveal surprising hardware vulnerabilities (ieee.org)
91 points by mud_dauber on Nov 24, 2015 | 67 comments

Thank you Intel for disabling ECC capabilities on consumer grade CPUs.

With smaller and denser DRAM chips every year, it takes an ever smaller disturbance to flip a bit. ECC needs to go mainstream yesterday.

At least AMD is doing the right thing.

Obviously the problem is that Intel has twice the single thread performance, but all modern CPUs are so fast that it's almost irrelevant. And on top of that most of the "actually needs performance" applications are being updated to use the GPU instead of the CPU anyway.

I have an AMD Bulldozer CPU that was released in 2011 and it supports ECC RAM. I admit that I haven't used it.

I assume Intel disables ECC in consumer CPUs to help prop up the prices of the Xeon models which are essentially the same CPUs but at a higher price point.

> I assume Intel disables ECC in consumer CPUs to help prop up the prices of the Xeon models which are essentially the same CPUs but at a higher price point.


What about xeon e3? They have similar prices to i7s.

Except a Xeon motherboard often costs significantly more.


It's a little-known fact that Celeron, Pentium and i3 processors also support ECC, when used in a motherboard with a server/workstation chipset.

It's just i5s and i7s that are crippled. Cunning.

If the application utilizes all the cores, then AMD is generally superior (performance/$)

are you including the price of electricity in your "$"?

it usually costs more to power a CPU over its lifetime than it does to buy it, and Intel's been aggressively optimizing the Core architecture's performance per watt to prevent ARM from taking over on products where battery life is important.

I was curious so I actually measured it with a watt meter. AMD FX-8350 consumes ~39 watts at idle and ~180 at full load. Intel Core i5 (Ivy Bridge) consumes ~27 watts at idle and ~90 at full load. They both consume ~1 watt in standby.

Also FWIW if you're holding onto an older chip because it's still "good enough" then you might want to think about upgrading to either of the above. Dual socket Intel Xeon 5160 uses 300 watts at the plug at idle.

Thanks. Is this at the power rails on the motherboard or at the AC plug in the wall?

ADDED. the "300 Watts" in your (added) second paragraph makes me strongly suspect that the measurements are from the wall socket / power cord. OK, so did you disconnect the power to any disk drives or graphics cards? Anything special about the power supply? (There are special, "green," extra-efficient power supplies.)

This is at the plug using an AC watt meter. Incidentally FX-8350 has an internal watt meter that measures ~18 watts at idle and 125 at full load. I expect the power supply conversion losses you're after explain the difference between the two sets of numbers.

Both hard drives were in standby mode (it reduces the power consumption by ~5 watts), though things like that should add the same amount to both systems and make no difference to the relative numbers. If anything I'm biasing against the AMD system because the i5 is using its internal GPU whereas the FX has a low end discrete GPU which I assume is adding five or ten watts to its numbers.

Depends on usage I guess, my CPU cost ($50) is ~4 months of the monthly electrical bill for the entire household. And I am doubtful the CPU even equals 5-10% of the electricity usage.

EDIT: back of the envelope calculation

Let's say my CPU runs 12 hours a day and uses 20 watts on average. So, over a 30-day month, it runs for 360 hours at 20 watts = 7.2 kWh, which costs roughly $1.
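
For anyone who wants to plug in their own numbers, here is a minimal Python sketch of the same back-of-the-envelope estimate. The function names and the price of $0.14 per kWh are assumptions chosen for illustration; check your own bill for the real rate.

```python
# Back-of-the-envelope electricity cost for a CPU.
# ASSUMPTION: price_per_kwh=0.14 is an illustrative rate, not a quoted tariff.

def monthly_energy_kwh(watts, hours_per_day, days=30):
    """Energy used in one month, in kilowatt-hours."""
    return watts * hours_per_day * days / 1000.0

def monthly_cost(watts, hours_per_day, price_per_kwh=0.14, days=30):
    """Approximate monthly cost in dollars at the assumed rate."""
    return monthly_energy_kwh(watts, hours_per_day, days) * price_per_kwh

# 20 W average draw, 12 hours a day, 30-day month
kwh = monthly_energy_kwh(20, 12)
cost = monthly_cost(20, 12)
print(f"{kwh:.1f} kWh/month, about ${cost:.2f}")
```

At a 20 watt average the CPU is a rounding error on the bill; the numbers only get interesting for machines that sit near full load around the clock.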

I had two RAM chip failures last year (notebook, PC). It took a while to figure out what the root cause of the weird issues was (only one of several chips was faulty). Windows' built-in RAM check tools failed to recognize the issue. Mem64+ CD found the defective RAMs - interestingly the memory chip inverted the data, which caused weird side effects. The RAM is definitely more expensive than 3 years ago. I would prefer ECC RAM; sadly Intel prefers to deliver that vital function only to server CPUs (Xeon).

Intel sells a bunch of workstation oriented Xeons with built in graphics and of course ECC. This posting of mine is being written on a rock solid Ivy Bridge Xeon E3-1225 V2.

I know; I have a workstation at work with a Xeon CPU and a cheaper i7 at home. Though the Xeon is basically an i7 chip with ECC enabled and a few other enterprise BIOS-level features.

They make an interesting claim:

   Between 12 percent and 45 percent of machines
   at Google experience at least one DRAM error
   per year. This is orders of magnitude more
   frequent than earlier estimates had suggested.

We just had a big discussion on this topic a few days ago. https://news.ycombinator.com/item?id=10598629

I'm not sure if this is the cited source, but the following Google paper is one of my bookmarks:


Judging from Jeff Atwood's post only three days ago, a lot of people now seem to believe that bit flips in RAM are a thing of the past and ECC is just an "enterprisey" thing.


He himself didn't dare go without ECC for the database... :)

His logic is beyond broken. "Why do I need ECC when I can just have more servers in my cluster in case something bad happens"

Umm, I don't know -- perhaps data integrity? Having multiple servers doesn't solve the "I just wrote bad data to disk" problem.

Wasn't his logic more like, "I need lots of single thread performance, but the only way to get that is consumer hardware and anyways my database servers will have ECC"?

If the client calls the cluster nodes individually then yes it does, provided there's enough overlap. Say the client considers a write committed when it's written to 5/7 of the cluster, and a read is only valid when you've heard from 5/7 of the cluster. Then whenever you read you're guaranteed to read at least three saved values, so if they disagree you can take the majority value (and write a correction). And that's the kind of system you already need to be safe against hard drive failures (just with the numbers at 4/7 and 4/7), so it's cheap to enable. In a modern distributed system you're already effectively doing ECC at a cross-machine level, so there's little value in doing it again within each individual machine.
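
The overlap arithmetic in that argument can be sketched in a few lines of Python. This is only an illustration of the claim as stated, not any real datastore's API: `min_overlap`, `resolve_read` and the (version, value) reply format are hypothetical, and the sketch assumes the version number itself is never corrupted.

```python
from collections import Counter

def min_overlap(n, w, r):
    """Fewest nodes guaranteed to sit in both a write quorum (w) and a read quorum (r)."""
    return max(0, w + r - n)

def resolve_read(replies):
    """Pick a value from (version, value) replies.

    Keep only replies carrying the newest version, then take the value a
    strict majority of them agree on. Returns None when there is no majority.
    """
    latest = max(version for version, _ in replies)
    candidates = [value for version, value in replies if version == latest]
    value, count = Counter(candidates).most_common(1)[0]
    return value if count * 2 > len(candidates) else None

print(min_overlap(7, 5, 5))  # 3
print(min_overlap(7, 4, 4))  # 1

# Five replies: three saw the latest write (one of them bit-flipped), two are stale.
replies = [(2, "abc"), (2, "abc"), (2, "abd"), (1, "old"), (1, "old")]
print(resolve_read(replies))  # abc
```

With 4/7 quorums the overlap is a single node, so there is nothing to out-vote a corrupted copy; the 5/7 setting buys the two extra copies needed for a majority.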

I upload a file to one server. Memory flips a bit before it is flushed to disk, and then the file is replicated to other servers.

Now every server has a bad copy of the data.

Or even more fun: I do an OS update and something is slightly corrupted -- enough to harm data being processed, but not enough to crash applications. Fun!

> I upload a file to one server. Memory flips a bit before it is flushed to disk, and then the file is replicated to other servers.

I explicitly excluded this case when I said "if the client calls the cluster nodes individually".

> Or even more fun: I do an OS update and something is slightly corrupted -- enough to harm data being processed, but not enough to crash applications. Fun!

Any such OS corruption (already an unlikely scenario) would almost certainly manifest as the equivalent of frequent memory errors and be handled by the same mechanism.

"almost certainly" means "I've never experienced the joy of this kind of problem"

People see this stuff in the real world. It sucks.

I've seen it. The most fun case I remember was where a copy of ed got somehow corrupted and worked fine as far as I could tell but somehow broke the TeX build process on that machine.

No doubt it sucks, but what's going to manifest that's not going to be handled by that protocol? Either you get consistent corruption, in which case you notice quickly, or you get sporadic corruption, in which case it's handled by the redundancy with the rest of the cluster and eventually you notice that one node's failure rates are higher than the others. What other failure modes even are there?

Depends on how much processing you’re doing. It's not hard to get to the point where 7/7 machines all give different answers.

Do your end clients really write to all nodes?

I know pretty much no application for which this is the case.

Sometimes. Cassandra and similar datastores follow this model. I've been out of the web game a while but it seems like a sensible way to do a "single page application" - indeed it seems like the only way to have true high availability, if the client only makes a single call then whatever initially receives that call (e.g. the load balancer) becomes a single point of failure.

Seeing that DRAM pricing is so dirt cheap, it's difficult to persuade any chip manufacturer to devote another 10-15% (my estimate based on previous memory design projects) of die area to ECC bits & decode circuitry when they won't get another dime for the capability.

You don't have to add anything to the DRAM chips for ECC. It's just another chip on the DIMM to store the extra ECC bits. It's the memory controller that puts the pieces back together.

Why can't they charge 15% more for ECC to recoup the costs?

It depends on the end customer.

PC manufacturers have zero incentive to add ECC when 99.99% of consumers don't care or even know.

Networking: a mixed bag. Some designs manage ECC inside the processor, others use memories with ECC for specialized functions like packet QoS. Commodity DRAM errors are a known irritant and planned for.

Servers: I'm not quite sure of my facts here, but a DRAM manufacturer would have to persuade a commodity manager at Dell, etc, that ECC is worth it. A commodity manager won't give a d* unless his systems guys say that it's needed.

I don't think you get the point. Just stop making non-ECC RAM for all future DDR releases. Let's just say, for example, "DDR6 has ECC baked into the standard" and have everyone on board with this. The memory manufacturers bump prices a bit to cover the costs and the entire world moves on unabashed.

> Just stop making non-ECC RAM

That's what's currently done with flash, but for DRAM that's not smart from a system point of view.

A long time ago ECC was done with Hamming Codes[1]; now there are similar but more sophisticated versions. These codes share a property such that the wider the word, the fewer the additional bits required. So, e.g.

    8 bits + 5 bits
   16 bits + 6 bits
   32 bits + 7 bits
   64 bits + 8 bits

That is enough for a minimum "hamming distance" of 4 between valid words. That gets you SEC (single error correction) plus DED (double error detection). You can save 1 bit if you're willing to forego the latter.
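
Those check-bit counts follow from the standard Hamming bound, and a short Python sketch reproduces them. The function name is made up for illustration: for m data bits it finds the smallest r with 2^r >= m + r + 1 (enough syndromes to name every bit position plus "no error"), then adds one overall parity bit for double error detection.

```python
def secded_check_bits(data_bits):
    """Check bits needed for a Hamming SEC-DED code over `data_bits` data bits."""
    r = 1
    # SEC: r check bits must distinguish data_bits + r positions plus "no error"
    while 2 ** r < data_bits + r + 1:
        r += 1
    # DED: one extra overall parity bit raises the minimum distance from 3 to 4
    return r + 1

for m in (8, 16, 32, 64):
    print(f"{m:2d} bits + {secded_check_bits(m)} bits")
```

Doubling the word width costs only one more check bit each time, which is the whole argument for doing ECC on wide words.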

Many decades ago (but I couldn't find a reference with a quick google) Micron made a chip that had on-board ECC. IIRC it was 8+4, which meant that it would internally correct a 1 bit error, but there was a possibility it would mis-correct some 2+ bit errors (which were rare).

But that meant that Micron was putting 50% more bits onto a die over a non-ECC device. They probably did this not for system reliability but to cover up the failings of their chips at the time. Yes you could make each bit smaller, but no way would it pay off if you needed to put 50% more bits onto a die.

ECC is best done with wide words. For example, a SIMM that presents a 72-bit interface can easily be built with 9 chips, each 8-bits wide.

The system (CPU, memory controller, whatever) takes in 72 bits (or some multiple thereof) from DRAM and does ECC internally. Note that as words get wider the time to compute ECC goes up. So a system could even speculatively execute using the uncorrected data, while in parallel checking to make sure that it was OK. This requires recovery in case of error, but since errors are infrequent it could be a big win overall.
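
To make the 72-bit case concrete, here is a toy Python sketch of a textbook extended Hamming code over a 64-bit word. It is an assumption-laden illustration, not what any particular memory controller implements (real hardware uses other codes from the same family and does this in parallel logic); `encode` and `decode` are hypothetical names.

```python
def encode(data_bits):
    """Encode a list of 0/1 data bits as an extended Hamming (SEC-DED) codeword."""
    m = len(data_bits)
    r = 1
    while 2 ** r < m + r + 1:
        r += 1
    n = m + r
    code = [0] * (n + 1)              # index 0 holds the overall parity bit
    bits = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):           # not a power of two, so a data position
            code[pos] = next(bits)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    for i in range(r):                # choose check bits so the syndrome is 0
        code[1 << i] = (syndrome >> i) & 1
    code[0] = sum(code[1:]) % 2       # make the total number of 1s even
    return code

def decode(code):
    """Return (status, codeword) where status is 'ok', 'corrected' or 'detected'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        return "ok", code
    if not parity_ok and syndrome < len(code):
        code[syndrome] ^= 1           # single flip; syndrome 0 means the parity bit
        return "corrected", code
    return "detected", code           # even number of flips: report, don't guess

word = [1, 0, 1, 1, 0, 0, 1, 0] * 8   # 64 data bits
stored = encode(word)                  # 72 bits, like a 72-bit-wide DIMM
flipped = list(stored)
flipped[17] ^= 1                       # simulate one bit flip
print(len(stored), decode(flipped)[0])
```

Flipping any two bits instead yields "detected", which is the DED half of SEC-DED.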

It doesn't make sense for a DDRx spec to require ECC. It's cheaper and smarter to do ECC at a system level.

[1] https://en.wikipedia.org/wiki/Hamming_code

Ah, now I get it. I'll step back and let economics take over. :)

PC manufacturers are incredibly cost sensitive right now. It is kind of like the airline industry where they strip away a $4 meal and $1 blanket on a $250 ticket to save money.

True for most of the world of consumer electronics. Find a 10-cent saving on a TV and the person who proposes the technique stands to make a quarter's salary as a bonus.

ECC is not just a DRAM thing, you can't put ECC DRAM into a machine that does not support it. So even if there was a demand for it the CPU/Chipset/Motherboard combo would have to allow for the option to use ECC DRAM first before you could make that choice.

The article seems to be based on the following paper, published in 2012: "Cosmic Rays Don't Strike Twice: Understanding the Nature of DRAM Errors and the Implications for System Design". Newer research on the topic has since been published, for example this year: "Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field" (from Facebook) and "Memory Errors in Modern Systems: The Good, The Bad, and The Ugly".

The idea that cosmic rays are responsible for DRAM failures seems laughable. I'm surprised anybody ever thought that.

I worked in a DRAM fab; I saw how it was made and tested. It's a month-long process involving dozens of machines. Particulate contamination, watermarks, process variation -- there are tons of things that can go wrong. And then testing. There's just no way to test thoroughly enough to catch all of the possible issues with timing and crosstalk. Not to mention electromigration and dielectric breakdown. You can do burn-in, but it's not going to catch everything. We even know which die are most failure-prone (usually edge die, due to process uniformity issues.) If they pass test, they get shipped!

Of course these issues are manufacturing related.

This is true, re. manufacturing defects. I used to work in a DRAM fab, too, and observed similar patterns.

Re. cosmic rays, it's surprising, but I think they can cause errors: http://stackoverflow.com/questions/2580933/cosmic-rays-what-...

The risk is significant: JEDEC has a standard testing methodology for them: http://www.jedec.org/sites/default/files/docs/jesd89a.pdf

Interesting. Shows how my point of view colors the facts -- I never once heard one of our test engineers complain about cosmic rays, but then it was their job to blame process when defects came up. We used to try to blame dirty test heads, but that's another story. I'm sure in the real world cosmic rays actually do flip bits sometimes, but I'm equally sure that a significant proportion of the DRAM in production has latent defects. Given our observed defect rates and the limits of test coverage, it's unavoidable.

For manufacturing defects, there are no doubt processes you can do to make them manifest more frequently so you can detect them. I imagine a lot of defects are temperature-dependent, so you're probably testing hot chips to force that. Errors while reading and writing can be made to happen more often by repeatedly reading and writing in a tight cycle. Errors from radio interference could be tested with artificial radio noise. The idea, I imagine, is to try to test the equivalent of X years of real-world use in only Y days of lab time.

But cosmic rays are just a constant background thing. If an errant cosmic ray flips a bit, that doesn't happen any more often if you hammer on the chips or heat them up or expose them to radio waves or whatever. The only way to simulate X years of real-world use is to use them for X years, short of setting up a particle accelerator or something. (Technically, you could simulate X years of real-world cosmic ray exposure in X/N years by testing N sets of RAM in parallel, but then that requires you to have more ram under test than you give to your customers!)

All the same, I'm sure more mundane explanations are behind the vast majority of errors, as you say.

Yeah, we did hot tests as well as cold. The problem isn't whether you could detect defects given all the time and resources in the world, but whether you can maximize the probability of detection given finite resources. If you're in test, you've got to deal with an avalanche of material coming out of the fab, as well as constant pressure to get it out of the door and off the books. Upper management doesn't see you as adding value, but they'll come down on you like a ton of bricks if there's an incident and you don't catch it. It's a tough job, and I'm glad it wasn't mine.

If so, why don't they increase refresh rates, given that most of the memory (>95%) can do refreshes at over 3x existing refresh rates? This could potentially increase the energy savings and also make the system faster because of less waiting time during Refresh and Row Activation.

Faster refresh will mitigate leakage, so you can run your DRAM hotter, but I don't see how that helps energy savings. Refresh means current flows out of the cell, through the sense amplifier, and then current flows back in, through the word driver. More refresh means more current. More current flow, more power consumption. This does not seem to argue for energy savings.

I meant reduce the refresh rate. My bad. Should have proof read what I typed. Essentially, what I meant was that many dram chips needn't be refreshed as often as they are now.

> why don't they increase refresh rates

Apple did exactly this to the EFI of some machines, to mitigate Rowhammer[1]:

   Description: A disturbance error, also known as
   Rowhammer, exists with some DDR3 RAM that could
   have led to memory corruption. This issue was
   mitigated by increasing memory refresh rates.

[1] https://support.apple.com/en-us/HT204934

I read an article that explained that DRAM cannot be read or written to during a refresh cycle, so if you triple the frequency of the refresh, you'll substantially harm the performance and throughput of the RAM.
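
A rough way to put a number on that is the fraction of time a rank is busy refreshing, which is about tRFC / tREFI. The Python sketch below uses assumed, ballpark DDR3-class timings (7.8 microseconds between refresh commands, 350 ns per refresh); actual values depend on the part and its density, and this simple ratio ignores scheduling effects.

```python
def refresh_overhead(t_rfc_ns, t_refi_ns):
    """Approximate fraction of time the DRAM is unavailable due to refresh."""
    return t_rfc_ns / t_refi_ns

# ASSUMPTION: illustrative timings, not taken from any specific datasheet.
T_REFI_NS = 7800.0   # average interval between refresh commands
T_RFC_NS = 350.0     # time one refresh command keeps the rank busy

for multiplier in (1, 2, 3):
    busy = refresh_overhead(T_RFC_NS, T_REFI_NS / multiplier)
    print(f"{multiplier}x refresh rate: ~{busy:.1%} of time spent refreshing")
```

Under this model the overhead scales linearly with the refresh rate, so tripling it triples the time the memory is unavailable.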

I think they already did. The 2x refresh option has existed since DDR2, and was originally designed for very high temperatures BTW.

"DRAM chips are a little like people: Their faults are not so much in their stars as in themselves. And like so many people, they can function perfectly well once they compensate for a few small flaws."

A beautifully whimsical way to end a fascinating technical article!

Does anyone know of a consumer-grade laptop with ECC RAM?

I don't think there are any.

For such laptops to exist, there has to be a memory controller that supports ECC RAM. Intel's chips have had the memory controller built in since Nehalem (2008) [1], and Intel has no mobile CPUs with ECC support, apart from the brand-new mobile Xeons, the first of which were released in September and have a 45W TDP [2][3]. I think the laptops with them are not out yet.

[1] https://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%2...

[2] http://ark.intel.com/products/family/88210/Intel-Xeon-Proces...

[3] https://en.wikipedia.org/wiki/List_of_Intel_Xeon_microproces...

Skylake in Xeon version. So far there is Dell XPS 15 and something from Lenovo.

It's Dell Precision 15 with the Xeon option, but I doubt it can use ECC RAM until someone provides the reliable link.



    8GB (1x8G) 2133MHz DDR4 Memory Non ECC
    16GB (1x16G) 2133MHz DDR4 Memory Non ECC
    16GB (2x8GB) 2133MHz DDR4 ECC Memory ECC
    16GB (2x8G) 2133MHz DDR4 Memory Non ECC
    32GB (4x8GB) 2133MHz DDR4 Memory ECC
    32GB (4x8GB) 2133MHz DDR4 Memory Non ECC
    32GB (2x16GB) 2133MHz DDR4 Memory Non ECC
    64GB (4x16G) 2133MHz DDR4 Memory Non ECC"

Thanks. Though, the downside: both companies got caught red-handed shipping their computers pre-loaded with spyware, malware and other bullshit :'(

If you care enough to put ECC in your machine you probably should start with a clean install as well.

Doesn't help a single bit (pun intended) against the bullshit in the BIOS data tables which installs the malware no matter what :(

The Lenovo discussed here purports to have ECC capability: https://news.ycombinator.com/item?id=10039306

I don't know if you'd call that "consumer grade" or not, though.

Check http://www.intelligentmemory.com/ECC-DRAM/; maybe you can fit these on COTS laptops. Heard it mentioned in relation to Purism Librem laptops (https://puri.sm/librem-15/); better ask them if it's still an option.

Source paper appears to be: http://www.cs.toronto.edu/~bianca/papers/ASPLOS2012.pdf

Based on the list of authors at the beginning of the article and the credited source given in the figures ("Ioan Stefanovici, Andy Hwang & Bianca Schroeder" and "Source: Hwang, Stefanovici, and Schroeder, Proceedings of ASPLOS XVII, 2012")

So how do we enable page retirement in popular operating systems on machines with ECC?

Can I pay the publisher to omit those blinking distractions? If I want to watch a movie while reading an article, I bring one myself.

On the devices with iOS I can click on the book icon by the URL and read just the text there.

Or in firefox, anywhere.
