
Once thought safe, DDR4 memory shown to be vulnerable to “Rowhammer” - theandrewbailey
http://arstechnica.com/security/2016/03/once-thought-safe-ddr4-memory-shown-to-be-vulnerable-to-rowhammer/
======
robert_tweed
Do any tools like, say, Memtest86 use test patterns that will detect Rowhammer
vulnerability? And for that matter, is it something that can be fixed on any
given system by changing the RAS/CAS frequency or the voltage?

Many years ago, I ran a couple of PC shops. When EDO RAM first came out we had
a lot of problems with random crashes being reported by customers, usually
when they were running Word. We would typically run memory tests for a few
hours in QAFE, which was then the best at detecting errors, but it would
consistently fail to find anything wrong with these systems.

It turned out that the most reliable way to reproduce the problem was to run
the shareware version of Duke Nukem. So we had a copy of it on all our
diagnostic disks. If a system could get through the first level without any
on-screen corruption, you knew it wasn't going to come back to the shop, so
that became a standard test on all new PCs.

~~~
speeder
You reminded me of an extremely interesting article about the Guild Wars team.
They were annoyed by the number of weird bug reports, and one of their
programmers found a way to make the game self-analyze its data and determine
whether a RAM hardware issue was to blame. He found that about 1% of all their
players had computers with faulty RAM, and that a good share of the bug
reports came from these computers. The program saved them a lot of money,
since they no longer needed to have devs chase "ghost" bugs, only real ones.
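
A minimal sketch of that kind of self-check, assuming (the article's actual
method is not described here) that the program repeats a deterministic
computation over the same buffer and compares the results. The names
`checksum` and `memory_self_check`, the buffer size, and the round count are
all hypothetical.

```python
import hashlib
import os


def checksum(buf):
    """Deterministic digest of a bytes-like buffer."""
    return hashlib.sha256(buf).hexdigest()


def memory_self_check(size_bytes=1 << 20, rounds=8):
    """Return True if repeated hashes of one unchanged buffer all agree.

    Hypothetical illustration: the data never changes, so on healthy
    hardware every digest must match the first one. A mismatch means the
    same input produced a different result, which points at hardware
    (RAM, CPU, overheating) rather than at program logic.
    """
    buf = bytearray(os.urandom(size_bytes))
    expected = checksum(buf)
    for _ in range(rounds):
        if checksum(buf) != expected:
            return False  # identical data, different digest: suspect hardware
    return True


if __name__ == "__main__":
    print("memory looks healthy" if memory_self_check() else "possible bad RAM")
```

A check like this only catches faults that happen to hit the buffer while it
runs, so a pass is weak evidence; a failure is the informative result.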

~~~
CydeWeys
I had bad RAM on my primary Linux desktop some ~six years ago and it caused an
endless series of problems, data corruption, and hard reboots every few
months. It happened so infrequently that I never quite tracked it down, but
I'd copied hundreds of gigabytes of data between computers over my LAN and
lots of random files still have some corrupted bits in them to this day from
that.

I finally got serious about tracking down the issue, ran memtest, and sure
enough, it discovered a faulty DIMM pretty quickly. I yanked it out and the
problems immediately went away.

------
yuhong
I submitted the original paper this is based on (notice most of it comes from
Micron DDR4 chips):
[https://news.ycombinator.com/item?id=11308525](https://news.ycombinator.com/item?id=11308525)

------
zymhan
As a hardware engineer, I find the idea of security vulnerabilities based on
how close components on a chip are to one another fascinating. Most of the
hardware implementations we still use aren't really designed to prevent these
exploits.
Aside from rowhammer, there's also "sidechannel" attacks on shared CPUs, and
measuring the electromagnetic fields generated by a PC to reverse-engineer
sensitive data.

I'm certain that in the future these concerns will play a greater role in how
we design computer components and systems. Previously these attacks were
infeasible due to the lack of computing power available to sort through what
is seemingly just noise.

------
moreira
Can this be exploited by VPSs, docker containers, and the like? As
throwaway2048 mentioned, this can be executed from regular JS, so it's not a
difficult thing to exploit, and it seems that ECC RAM doesn't help, so any
server with vulnerable RAM could be exploited. Is there anything, from the
software side of things, that can/will be done to mitigate this?

------
smaili
Link to actual paper -
[http://www.thirdio.com/rowhammer.pdf](http://www.thirdio.com/rowhammer.pdf)

------
ori_b
So. Time to move to ECC by default?

~~~
SixSigma
from TFA

The researchers were also able to flip the bits inside DDR3 DIMMs installed on
an enterprise-grade server. The tests succeeded even though all of the DDR3
modules included a protection known as ECC; the servers completely locked up
or spontaneously rebooted, usually within three minutes of the tests
commencing.

~~~
ori_b
Yes; locking up or rebooting seems like a reasonable thing to do if the memory
has multiple unrecoverable errors. It certainly beats silent corruption.

~~~
roywiggins
But the failure mode for ECC is silent corruption, followed (maybe) by a
crash, right? ECC can detect and correct one or two (or a few more?) bit-flips
at a time, but if there are too many it won't notice that something's gone
wrong.

The system then goes merrily along and either crashes (because you've flipped
some bits at random) or recovers (because those bits weren't important) or
hands the attacker root (because you've managed to flip some particular target
bits).

So if the computer crashes or locks up, you can probably conclude that you've
defeated the ECC and flipped some bits, and the system is vulnerable to
Rowhammer. It's (maybe) just a matter of time before someone engineers a way
to flip the specific bits needed to hand the attacker root.

~~~
GauntletWizard
The memory can recover one, detect two, and will throw up loud warning bells
to any competent admin as long as it is doing so. If you see a sudden spike of
ECC errors from a server, the memory is probably bad, and you replace it. If
you see a sudden spike of ECC errors from a large number of servers, something
is seriously wrong, and you investigate - I'd be more likely to blame it on
bad power myself, or an EM field/radiation source.

Even when trying to trigger rowhammer, the graph of bit flips is going to look
like a logarithmic curve - Single-bit flips will be common, two-bits will be
rare, and undetected flips will be real but ultimately not much to talk about
compared to the discovered flips. You know they're there by the unrecoverable
rate, and you replace the ram, discard any replicated data on that machine,
and move on.
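
The "recover one, detect two" behaviour (SECDED) can be sketched with a toy
extended Hamming(8,4) code. This is an illustration only: ECC DIMMs typically
protect 64 data bits with 8 check bits, and the exact code is up to the memory
controller vendor. The function names and the list-of-bits layout below are
assumptions made for the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Layout (index = bit position): 0 = overall parity, 1/2/4 = Hamming
    parity bits, 3/5/6/7 = data bits.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c


def decode(codeword):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    c = list(codeword)
    syndrome = 0  # XOR of the positions of all set bits; 0 for a valid word
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos
    overall = 0  # parity over all 8 bits; 0 for a valid word
    for bit in c:
        overall ^= bit

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and fix it.
        # (syndrome == 0 here means the overall parity bit itself flipped.)
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips: detectable, but not correctable.
        return None, "uncorrectable"

    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return nibble, status
```

Flipping one bit of a codeword yields `"corrected"` with the original value,
and flipping two yields `"uncorrectable"`. Flipping three bits makes the
decoder report `"corrected"` while returning the wrong value, which is the
silent-corruption case discussed upthread.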

~~~
nkurz
_The memory can recover one, detect two, and will throw up loud warning bells
to any competent admin as long as it is doing so._

I don't have personal experience here, but one of the important claims in the
paper is that this warning is not given on all servers:

    
    
      Unfortunately, server vendors routinely use a technique   
      called ECC threshold or the 'leaky bucket' algorithm where 
      they count ECC errors for a period of time and report them 
      only if they reach certain levels of failure. From what we 
      understand, this threshold is commonly above 100 per hour, 
      but this remains a trade secret and varies based on the 
      server vendor. So, to see ECC errors (MCE in Linux or
      WHEA in Windows), there generally needs to be 100 bit flips 
      per hour or greater. This makes “seeing” Rowhammer on 
      server error logs more difficult.
    
      In addition, we have observed some server vendors will 
      NEVER report ECC events back to the OS, although they might
      get logged into IPMI. Typically, users expect to see 
      correctable ECC errors logged directly to the OS or that 
      halt the system when they cannot be corrected. During our 
      investigation into this phenomenon, we even encountered one 
      server that neither reported ECC events to the OS nor
      halted when bit flips were not correctable. The end result 
      was data corruption at the application level.
      This is something, in our opinion, that should never happen 
      on an ECC protected server system.
    

[http://www.thirdio.com/rowhammer.pdf](http://www.thirdio.com/rowhammer.pdf)
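
A rough sketch of how a 'leaky bucket' threshold like the one quoted above
could hide a steady trickle of corrected errors. The paper says the real
thresholds are a trade secret, so the class name, the threshold of 100, and
the drain rate of 100 per hour are assumptions picked to match the "100 per
hour" figure, not any vendor's implementation.

```python
class LeakyBucket:
    """Count ECC errors, draining the count at a fixed rate over time."""

    def __init__(self, threshold=100, leak_per_hour=100.0):
        self.threshold = threshold          # level at which errors get reported
        self.leak_per_hour = leak_per_hour  # how fast the bucket drains
        self.level = 0.0
        self.last_time = 0.0                # hours, on any monotonic clock

    def record_error(self, now_hours):
        """Register one corrected error; return True if it would be reported."""
        elapsed = now_hours - self.last_time
        # Drain the bucket for the time that has passed, never below empty.
        self.level = max(0.0, self.level - elapsed * self.leak_per_hour)
        self.last_time = now_hours
        self.level += 1
        return self.level >= self.threshold


if __name__ == "__main__":
    # 50 errors per hour for 10 hours: the bucket drains faster than it fills.
    steady = LeakyBucket()
    print(any(steady.record_error(i / 50) for i in range(500)))

    # 100 errors at the same instant: the last one crosses the threshold.
    burst = LeakyBucket()
    print([burst.record_error(0.0) for _ in range(100)][-1])
```

Under these assumed parameters, 500 corrected bit flips spread over ten hours
never surface in the OS logs, while a burst of 100 does.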

~~~
BuildTheRobots
I really wish they had expanded on that somewhere and actually stated which
manufacturers/models of server they tested and which ones don't report ECC
errors.

~~~
brainfire
Really, without details like that it's just unverifiable FUD.

------
Animats
Can you execute Rowhammer attacks from WebAssembly?

~~~
throwaway2048
You can execute it from regular JS
[https://github.com/IAIK/rowhammerjs](https://github.com/IAIK/rowhammerjs)

------
mchahn
Memory manufacturers all use heavy testing to both test designs and to weed
out bad chips. This seems to be a weakness in these tests and should have been
caught. Row-hammering is no different than applying test vectors. I'm sure
those tests now include row-hammering but it will take a while for the new
designs to come out.

------
cloudsloth
Is a pragmatic solution more virtualized memory, other software solutions,
hardware shielding, lower data density, or some combination of these?

~~~
rincebrain
A pragmatic solution is probably "go for the DIMMs that are identified as not
apparently vulnerable", if you're in a position where you have to care about
this faster than t(vendors release software updates to mitigate this class of
attack on your platforms).

