Flipping Bits in Memory Without Accessing Them [pdf]

userbinator · on Dec 8, 2014

The tl;dr of this is: DDR3 DRAM modules are not as reliable as once thought, and all the ones from 2012 and 2013 that they tested showed errors when exercised with the right access patterns.

What isn't mentioned is whether these would be the types of errors that the usual memory testers like MemTest86 could detect --- but based on the lack of any significant news stories in the past 2 years, I'd guess not. Perhaps this could explain why a lot of people who encountered weird hardware-ish problems could run MemTest with no errors but still crash with the right workload.

Their dismissal of one of the "potential solutions" is a bit of a WTF:

Manufacturers could fix the problem at the chip-level by improving circuit design. However, the problem could resurface when the process technology is upgraded. In addition, this may get worse in the future as cells become smaller and more vulnerable.

As their tests show, modules from 2008 and '09 are basically perfect, and most of the ones from '10 too. Why could this change in process even be called an "upgrade" if it results in memory that doesn't behave anymore like memory should? To me, it's clearly a serious flaw. Their "workaround" proposal of adding more complexity to the memory controller, and which doesn't actually guarantee a solution, just feels... wrong.
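To put a rough number on "doesn't actually guarantee a solution": my understanding is that the controller-side workaround is probabilistic, i.e. on each row activation the controller refreshes the neighboring rows with some small probability p. Under that assumption, the chance that a victim row sits through n aggressor activations without a single refresh is (1 - p)^n, which is tiny but never exactly zero. A minimal sketch; the function name and the values of p and n are illustrative assumptions, not figures from the paper:

```python
import math


def prob_no_refresh(p: float, n: int) -> float:
    """Chance that n consecutive activations all fail to trigger a neighbor refresh.

    p: assumed per-activation probability that the controller refreshes
       the adjacent rows (hypothetical parameter).
    n: number of activations assumed to be needed to disturb a neighbor.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    if p == 1.0:
        # Every activation refreshes the neighbors, so any n > 0 is safe.
        return 0.0 if n > 0 else 1.0
    # (1 - p) ** n, computed in log space to keep precision for small p.
    return math.exp(n * math.log1p(-p))


if __name__ == "__main__":
    n = 100_000  # illustrative hammer count, not a measured threshold
    for p in (1e-5, 1e-4, 1e-3):
        print(f"p={p}: P(no refresh in {n} activations) = {prob_no_refresh(p, n):.3e}")
```

So whether this counts as a fix depends on how small a failure probability you are willing to call "memory".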

I think this is all really quite scary - programmers are used to, and all software depends on, memory as something whose contents should not ever change without being written to! While the majority of access patterns won't trigger this flaw, the one that does could have significant cascading effects. This paper really should get more exposure to the public.

nkurz · on Dec 8, 2014

Nice review. This is a good paper, and I found it worth reading just for the clear explanation of how DDR3 memory actually works. Maybe one of these times I'll finally learn it well enough to remember which is a rank and which is a bank.

Why could this change in process even be called an "upgrade" if it results in memory that doesn't behave anymore like memory should?

I wasn't sure from the paper, but I think they were testing memory built at different process scales (nm). At least, the majority of the chips with no errors were 1 GB, and the majority of the chips with errors were 2 GB. It's an improvement because the smaller scale allows higher densities and more storage per chip.

More explanation about the "row hammer" issue is here:

http://forums.xilinx.com/t5/Xcell-Daily-Blog/Unexplained-mem...

the usual memory testers like MemTest86 could detect

Historically no, but MemTest86 6.0, released a couple of months ago, added the "Hammer Test", citing this paper.

I think this is all really quite scary - programmers are used to, and all software depends on, memory as something whose contents should not ever change without being written to!

I agree. From the paper, the strong implication is that a user running unprivileged code on any modern computer can corrupt memory outside of their process. Perhaps even with asm.js? Comments in the release announcement thread suggest, though, that although the problem is real, the paper is a bit alarmist about the prevalence: http://www.passmark.com/forum/showthread.php?4836-MemTest86-...

userbinator · on Dec 8, 2014

It's an improvement because the smaller scale allows higher densities and more storage per chip.

Yes, that's the usual explanation, but I don't think it makes much sense here since the ostensibly "better" memory can produce visible errors that the older generation didn't. I see the word "tradeoff" being used often in situations like this but I don't agree that this is one, since at some point on the reliability scale it just stops being memory completely and devolves into some weird approximation of it.

From the paper, the strong implication is that a user running unprivileged code on any modern computer can corrupt memory outside of their process

Indeed, that's the big message I get: a tiny and innocuous-looking piece of code can easily corrupt memory. I'm not someone who believes in conspiracy theories much, but this looks like an amazingly good backdoor or constituent of one to me. If memory controllers implement workarounds such as the one described in the paper to reduce these types of errors, they also naturally will have options to turn them off for testing/debugging purposes, etc. For the great majority of the time if they are turned off nothing unusual will be noticeable, but then the system becomes vulnerable to the specific access patterns that trigger the fault. Since the documentation on the latest memory controllers is largely kept secret, a firmware update that silently changes this setting wouldn't raise much concern - memory initialisation code usually uses lots of undocumented registers and values anyway. Then all it takes is a tiny piece of user-level code (possibly obfuscated/concealed in some other mundane application), maybe with some cooperation/knowledge of how the OS's VM mapping works, to enable relatively precise corruption of certain addresses in memory. Although they are largely (publicly) undocumented, the row<>address mappings wouldn't be so difficult to reverse-engineer either. The results could range from DoS to bypassing access controls, depending on what gets targeted.

The subtle nature of this approach is what makes it all the scarier; the access patterns that trigger it aren't so unusual, and it's just reading from memory. I doubt it can be easily triggered (never say never...) from compiled languages like JS but virtualised environments appear vulnerable (unless the hypervisor constantly moves the pages around, incurring a significant performance penalty).
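To make "it's just reading from memory" concrete, here is a toy simulation, not real hammering code. It models one bank where every row activation nudges the two neighboring rows, and a neighbor loses a bit once it has been nudged a certain number of times between refreshes. The `ToyBank` class, the threshold of 1000 and the "flip bit 0" rule are all made-up simplifications; real cells, thresholds and refresh timing are far messier, and on real hardware the reads would also have to bypass the CPU cache.

```python
from collections import defaultdict


class ToyBank:
    """Toy model of one DRAM bank: rows of bits plus per-row disturbance counters."""

    def __init__(self, rows: int, bits_per_row: int, threshold: int):
        self.data = [[1] * bits_per_row for _ in range(rows)]
        self.threshold = threshold        # hypothetical activations a neighbor tolerates
        self.disturb = defaultdict(int)   # row -> neighbor activations since last refresh
        self.open_row = None              # row currently held in the row buffer

    def read(self, row: int, col: int) -> int:
        # Only a read that misses the open row causes an activation,
        # and only activations disturb the neighboring rows.
        if row != self.open_row:
            self.open_row = row
            self.disturb[row] = 0         # activating a row restores its own charge
            for victim in (row - 1, row + 1):
                if 0 <= victim < len(self.data):
                    self.disturb[victim] += 1
                    if self.disturb[victim] == self.threshold:
                        self.data[victim][0] ^= 1   # model a single flipped cell
        return self.data[row][col]

    def refresh_all(self) -> None:
        # A refresh recharges the cells; bits that already flipped stay flipped.
        self.disturb.clear()


def hammer(bank: ToyBank, row_a: int, row_b: int, iterations: int) -> None:
    """Alternate reads between two rows so that every read forces an activation."""
    for _ in range(iterations):
        bank.read(row_a, 0)
        bank.read(row_b, 0)


def count_flips(bank: ToyBank) -> int:
    """Number of bits that no longer hold the initial value of 1."""
    return sum(bit == 0 for row in bank.data for bit in row)


if __name__ == "__main__":
    bank = ToyBank(rows=8, bits_per_row=8, threshold=1000)
    hammer(bank, 2, 4, 600)               # row 3 sits between both aggressors
    print("flipped bits after hammering rows 2 and 4:", count_flips(bank))
```

Note that in this model reading the same row over and over does nothing, because the row stays open and is never re-activated; it takes two rows in the same bank, alternated, which is why the mapping question matters.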

although the problem is real, the paper is a bit alarmist about the prevalence

The paper assumes exact knowledge of the row<>address mappings and hammers the DRAM with that, whereas MemTest's implementation might not know the exact mapping used by a particular controller+configuration. Their estimate is 5-20% (a huge range), under "less optimal" hammering, which is still cause for concern.
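As an illustration of why the mapping matters, here is a sketch that decodes a physical address into (row, bank, column) and finds the addresses that land in the physically adjacent rows of the same bank. The bit layout is entirely hypothetical (13 column bits, then 3 bank bits, then row bits); real controllers may interleave channels and ranks or hash address bits together, and as noted the details are mostly undocumented.

```python
from typing import List, NamedTuple

COL_BITS = 13    # hypothetical: 8 KiB of data per row
BANK_BITS = 3    # hypothetical: 8 banks


class DramLocation(NamedTuple):
    row: int
    bank: int
    col: int


def decode(addr: int) -> DramLocation:
    """Split a physical address into row/bank/column under the assumed layout."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = addr >> (COL_BITS + BANK_BITS)
    return DramLocation(row, bank, col)


def encode(loc: DramLocation) -> int:
    """Inverse of decode() for the same assumed layout."""
    return (loc.row << (COL_BITS + BANK_BITS)) | (loc.bank << COL_BITS) | loc.col


def adjacent_row_addresses(addr: int) -> List[int]:
    """Addresses at the same bank and column in the rows directly above and below."""
    loc = decode(addr)
    return [
        encode(loc._replace(row=r))
        for r in (loc.row - 1, loc.row + 1)
        if r >= 0
    ]


if __name__ == "__main__":
    target = encode(DramLocation(row=5, bank=3, col=100))
    for neighbor in adjacent_row_addresses(target):
        print(hex(neighbor), decode(neighbor))
```

With the right mapping a tester can aim at a specific victim row; with a wrong guess the "adjacent" addresses may fall in a different bank or a distant row, which would be one plausible reason a generic test finds fewer errors than the paper did.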