Project Zero: Exploiting the DRAM rowhammer bug to gain kernel privileges (googleprojectzero.blogspot.com)
339 points by j_baker on Mar 9, 2015 | 101 comments

Once again, I pine for ECC memory on my laptop. I know you can get ECC SODIMMs; I got 16GB worth for a Supermicro ITX motherboard. And while the paper talks about multi-bit errors getting through ECC (which is certainly possible with enough flips), single flips raising alerts and double flips causing halts would really get your attention that something bad was happening. As opposed to silently sitting there while my memory is shredded.

Intel cripples their "consumer grade" processors by locking out the ECC DRAM interface. This forces server vendors to buy "server grade" processors. There, you get all the good error-correction stuff.[1] The fraction of die space devoted to these features is small; there's no reason they couldn't be provided on all x86 family CPUs. It's purely a market positioning thing.

AMD leaves the ECC hardware enabled on most of their parts.

[1] http://www.intel.com/content/dam/www/public/us/en/documents/...

Not just laptops! SQL Azure doesn't use ECC memory[1], which might suggest the rest of the Azure platform doesn't, either. I haven't found citations for AWS using ECC, so perhaps they don't. Maybe this could be used to break out of VMs on those platforms.

1: https://social.msdn.microsoft.com/Forums/azure/en-us/84000f7... (I remember I asked about Windows Azure, but my posting clearly says SQL Azure, so perhaps it's a different hardware platform.)

That may be just speculation: the memory density in cloud hosts is rarely achievable with non-ECC memory. I've found that when purchasing RAM for systems, it's fairly common for "server" and "multi-rank" to imply ECC, although I've had to look at product sheets to verify that.

Now, I could be wrong, but it would be quite a surprise to find out that any of the cloud services are not using ECC. I suspect they all are, but they don't advertise it.

Well I mean I asked in that thread, and MS replied stating they simply do not need ECC. I quoted a line from Google's study on memory errors, and Azure replied: "In our scenario, we have not seen bit error rates that align with the quote you mention".

I suppose I could just spin up a 56GB instance and let it run a memtest for a week and see, right?

On the other hand, SQL Azure probably needs less RAM than Windows Azure VMs.

If I understand correctly, Intel doesn't even ship a consumer CPU (i.e., a non-Xeon) that supports ECC. (Don't know about AMD.)

> Intel doesn't even ship a consumer CPU (i.e., a non-Xeon) that supports ECC.

Not true - there are some Atoms that do, but they're targeted at NAS-type uses. It is the case that you can't get Core-series processors with ECC.

AMD used to offer very broad support for ECC, but data integrity clearly didn't win market share.

This is no longer true either. Intel has consumer desktop Celeron, Pentium, and i3 series chips that support ECC.

The i3's in particular are fairly popular with FreeNAS users.


data integrity clearly didn't win market share

It requires more expensive, compatible DRAM right? Knowing that, it shouldn't really be super surprising. Enabling it on die is just one piece of the equation.

The problem is that if you say "fast, cheap, reliable, pick two" people will pick "fast" and "cheap". Even people who might ordinarily worry about whether their workstation is silently corrupting their data.

The i3s are dual-core parts that support ECC, aimed at NAS duty and the like (hence the mini-ITX boards with ECC SODIMMs). But you still need an explicit workstation chipset/mobo to support it.

i5 and i7 have no ECC because there are equivalent Xeons. At this level "Xeon" is just branding implying features like ECC, and it doesn't automatically mean expensive - a single socket non-E (LGA1150) Xeon build costs only a little more (~20%) than consumer cpu/mobo/ram (although there's better sales on the consumer stuff).

Heck, if you are buying a non-K series i7, the equivalent Xeon E3 will probably be cheaper by a few dollars.

It's also very possible it's not as fast. Consult benchmarks first, as usual.

It used to be the case that all AMD CPUs supported ECC back in the AM2/AM3 era, though apparently this may no longer be true. Even then, not all motherboards bothered to route the extra traces and add the BIOS support required for it.

All AMD FX and Opteron CPUs support ECC. The APUs do not. For the CPUs that support ECC, it is still up to the motherboard manufacturer to support it on their end as well.

I don't think ECC will help. Where you can flip one memory bit, you can flip two just as well.

It will help: two-bit errors are not corrected, but they are detected. The system will halt or reboot, and an error should be logged.

I don't think laptops have SODIMM memory these days.

I bought one two weeks ago that has two SODIMM slots.

Sadly true, the 'thin is in' crowd is more often than not soldering in the memory.

Depends on the laptop. I have bought two laptops in the past 14 months, a $2000 ThinkPad and a $600 Acer. Both came with 4GB soldered on and a single free SODIMM.

(On a different note: the ThinkPad maxes out at 8GB and the Acer at 12GB, whereas previous generations went up to 16GB at least. Intel intentionally nerfed Haswell and newer core i's, presumably to push their Xeons on more people)

Intel intentionally nerfed Haswell and newer core i's, presumably to push their Xeons on more people

Really? I think the latest Haswell can go to 16GB just fine if there are two SO-DIMM slots.

There are 16GB SODIMMs, so why isn't Intel supporting those? The company that makes them claims this is purely on Intel/OEMs.

1: http://www.intelligentmemory.com/dram-modules/ddr3-so-dimm/

Update: Oh wow, the new Broadwell chips do support them. So maybe the new ThinkPad X250 isn't so useless after all! This is great news if true.

Intelligent Memory's modules are probably too expensive for the normal laptop market. Micron claims to be sampling them now: http://www.micron.com/products/dram-modules/sodimm/DDR3%20SD...

They're supposed to be priced around $350 or so, at least that's what I see from last year. How is that too expensive? An X series ThinkPad is like $2300+ with a good config. Adding another few hundred so I can have a decent amount of RAM sounds like a no-brainer.

(Or, Lenovo could put IBM engineering in charge and figure out how to get 2 slots back on the X series.)

I am thinking of the two 16GB DIMM setup, sorry.

Be careful... the 5th-gen Intel CPUs can NOT run with standard 16GB modules. There is a technical issue which causes instabilities with 16GB modules unless the modules are specially made to work around it. The Intelligent Memory modules do that.

I wonder if this issue is actually in hardware or just in the MRC.

There was an older paper discussing using various methods of fault injection (heat, voltage changes, etc) to attack Java smart cards, essentially destroying the type system guarantees and thus opening up an attack surface: "The Sorcerer’s Apprentice Guide to Fault Attacks", https://eprint.iacr.org/2004/100.pdf

Fault injection is also how older Dish Network and DirecTV smart cards were hacked - there used to be a cottage industry selling "voltage glitchers" to reprogram Dish Network smart cards with the keys for additional programming tiers.

I believe some pay TV smartcard hacks also made use of clock glitching, basically sending a shorter-than-usual clock pulse that means some of the internal signals don't make it to their destinations on time. The pay TV hacking industry had some pretty clever tricks a decade or two ago.

They were quite cool.

From memory, I think one card had some internal startup check that checked to see if its EPROM got marked by the "Black Sunday" countermeasure and then hung itself.

The hackers, having a ROM dump and having knowledge of how many clock cycles each instruction took the CPU, knew that it was at ~clock cycle 525 or so that this internal check happened.

Knowing that the instruction was a "Branch if equals to" (I think), and that instruction took 12 cycles, they figured out which of those 12 caused that branch to happen, figured out the precise time to glitch (whether via voltage or a single rapid clock cycle), and caused the CPU to skip changing the instruction pointer and then continue through its ROM code as if the check had passed.

Within a month or two, hundreds of thousands of receivers had a man-in-the-middle device just to glitch reprogrammed cards every time they were started up.

Apparently the north american provider had tested the same countermeasure in their south american division, so the north americans had advance notice of what they had to do to get back in action.

I recall, for another system, a small memory chip was required for a pre-existing man-in-the-middle card, and every electronics supplier went out of stock overnight. Digikey sold out of 50k units.

Other interesting lessons discovered: 1. You could run a >100-foot, >100 kbps RS-232 link for over a year without issue; proper wiring and RS-232 length limitations be damned. 2. You could wire an RS-232 link (-12V and +12V) directly to a TTL input for over a year without issue.

People in the business of exceeding defined limitations, it turned out, had a good sense of which other defined limitations could safely be exceeded.

Coincidentally, hardware to play with those types of attacks just got commoditized.



Same with the JTAGulator units. 10+ years ago, countermeasures would reprogram the very-difficult-to-desolder TSOP EEPROM on the receiver.

The manufacturers seemed to use an externally accessible JTAG access point to program the receivers in the factory, which was a convenient boon to hackers that didn't even need a screwdriver to reprogram the units through their parallel ports.

The starting research that enabled this security work appeared last year at ISCA, but didn't fully discuss the security implications:


I noticed the security implications of "memory that doesn't always behave like memory" when that paper came out a few months ago and was discussed briefly on HN:


You know, this makes me wonder. If a car manufacturer or a toy company made a product that was found to be unsafe, there would be a recall. If hardware manufacturers make a product that is insecure, will there be a recall? Unfortunately, I suspect that this is a case where the law hasn't caught up with technology.

A few years ago I built a home PC for myself and bought an i5 sandy bridge processor with an appropriate motherboard. A few months later it was found out that a huge batch of the SATA controllers shipped on those types of motherboards were faulty[0]. Back then, Intel made a statement recalling all faulty motherboards and shipping out new ones, I just contacted my retailer where I purchased my board, sent it for RMA and got a new one (different model, but that's another story). All of this for free.

[0] http://www.pcadvisor.co.uk/news/pc-components/3259061/intel-...

Intel has a good history of recalls and replacements of their motherboards and processors. The Pentium FDIV bug comes to mind immediately, as does the recall of motherboards with the faulty 820-series memory translation hub.

Actually, Intel's behavior with the FDIV bug was originally anything but good. They downplayed the bug and refused to recall the chips. Then they started offering replacements, but only if you could prove that the bug affected you.

It wasn't until the whole thing turned into a giant PR disaster that they started a generous exchange program. That whole affair is basically the reason that Intel is much more forthcoming with errata these days.

Of the five vendors that they mentioned, the only one that did not have vulnerable memory was "DRAM vendor D", which also only had one entry on the table. Given the nature of the problem here, odds strike me as near-1 that "DRAM vendor D" has shipped RAM with this problem.

For that matter, the "no"s on that table really only prove that the exact stick they tested with the exact memory locations they tested did not exhibit detectable bit flips. It doesn't prove that those sticks are "safe", let alone that the product line they come from is safe.

So, basically, what's vulnerable? To a first approximation, everything. What would happen if we tried to recall every bit of DRAM produced in the past X years (where X is also unknown)? Well... you'd bankrupt the industry is what you'd do. That's not a very useful outcome.

In fact this sort of thing happens all the time. New safety tech is constantly being developed for cars, but you can't go back and sue the auto companies for not including it before it was invented or before the need for it was discovered [1]. This seems more like that problem than an actual case of negligence or "defects" being produced.

[1]: Well... more or less. I know of cases where this was successfully done, though they tend to get overturned on appeal. Run with me here.

It's not just insecure, this is memory that doesn't work 100% like memory should.

I use MemTest86+ on every stick of DRAM I buy - if there's even a single error, it goes back as defective. The fact that this memory seems to work for most access patterns doesn't excuse the fact that it is completely broken for others, because good memory should be able to store any data and maintain its integrity for any access pattern.

Unfortunately even MemTest86+ is not exhaustive, as I found out while troubleshooting a very strange issue: a specific file in a specific archive would unpack with corrupted bits (and an "archive damaged" message) on a coworker's computer, but on half a dozen other machines would be fine. A hash of the file matched, so HDD-based corruption was ruled out. His machine passed an overnight run of MemTest86+ perfectly and AFAIK unpacking no other archives would yield corruption. He reported never getting any crashes - but yet, that one file in that archive would fail to unpack correctly.

It would always corrupt in the same strange way. On a whim, I decided to swap the RAM out and the problem went away. Even the "bad" stick seemed to work fine in other machines with the same model of CPU and mobo running the same OS and unpacking the same archive, but with his extremely specific combination of hardware and software, would always fail. That experience taught me that bad RAM can be extremely difficult to troubleshoot.

This isn't like other storage technologies, e.g. SSDs, where the finite lifespan and sensitivity to access patterns are well-documented. It's a case of claiming to sell memory while giving consumers a close approximation of it, one that completely breaks in some situations. I think it needs to be treated like the FDIV bug.

In the EU, products have to be fit for purpose. You could then argue that if you bought (for example) a server for hosting virtual machines, then the RAM was not fit for purpose because the flaw made it incapable of isolating separate VMs.

Good luck trying that though!

On the other hand, servers tend to use ECC memory.

Yes. This happened 20 years ago: http://en.wikipedia.org/wiki/Pentium_FDIV_bug

Can you get killed as a result of privilege escalation? The law hasn't caught up in part because the potential consequences aren't nearly as dire.

Modern medical technology relies heavily on computers and software. Take an infusion pump, for example: controlled by a microcontroller and running software. Or insulin pumps; some vendors are actually considering adding Bluetooth to insulin pumps, so that patients using such a pump can check its status on their smartphone (or on the upcoming smartwatches). You can also adjust the infusion rate of an insulin pump to accommodate ingested sugar. Overdosing on insulin can send a person into shock and kill.

If somebody is running their rowhammer exploit on your insulin pump, it's probably a bit late.

> Modern medical technology relies heavily on computers and software.

Which is why medical devices should all have ECC memory. And for that matter physical separation between any processor that might run attacker-controlled code and the processor responsible for That Which Must Not Fail.

Product defects like this are foreseeable. If bad memory can cause a medical device to kill someone, the party at fault is the one who made a medical device without sufficient redundancy and error correction that bad memory could cause it to kill someone.

It's an interesting attack vector, recently covered in a Person of Interest episode, in which an abusive husband got killed by having his insulin pump wirelessly hacked to overdose him on the drug. While fiction, I'm pretty sure this kind of thing will happen (after all, no one writes bug-free software, and even if someone did, you could always steal the keys...) and will initially be very hard to detect because of its uncommon nature.

You can get killed as the result of a race condition[0] so privilege escalation is certainly possible.

[0] https://en.wikipedia.org/wiki/Therac-25

Laptops are particularly at risk for stuff like this: components are more densely packed, may use smaller process sizes, and have less robust power supplies, all of which may work against keeping bits in adjacent rows stable.

That may be the reason why the desktops mentioned are less sensitive, they'll use full size memory modules and will have beefy power supplies.

It'd be interesting to repeat the experiments with the laptops running off their internal battery.

Also, lower DRAM refresh rates mean less power consumption (an easy tweak in the BIOS, independent of the OS, and clearly attractive to laptop makers), but also more exposure to this issue.

Very little information on time scales. In one case they speak about 5 minutes vs. 40 minutes (both might be acceptable for an exploit). There's also no information about how long it took to trigger a bit flip in their per-hardware table.

And why name no hardware vendor? I'm guessing they expect people to use the tool they provided and draw their own conclusions, but I don't understand why they'd treat hardware vendors differently from software vendors.

At a guess, to avoid blaming laptop manufacturers and getting sued if it turns out that something else was at fault? The DRAM itself might be the culprit (and probably is); laptops of a given brand might come with RAM from different manufacturers.

I understood the litigation risk. In an integrated system it's always someone else's fault (DRAM, BIOS, CPU, laptop vendor). IMHO the last integrator (the one selling you the goods) is always the culprit.

Why would they fear hardware manufacturers' litigation more than software vendors'? Especially at a company as big as Google?

They also don't want to say "DellappLenoHP" laptops could not be attacked and turn out to be wrong. Or maybe they're right but only with factory 2GB modules used between May '11 and July '13.

Way too many variables to make any claim that is ethically defensible.

They could specify the detailed system configuration with the CPU, chipset, and DRAM part numbers (including date codes) so others can compare. It's much better than leaving things in the dark completely.

Remember that there is a github repo with code you can use to test your specific hardware. Why not run it & post results?

A vulnerability in the Windows kernel is going to exist in all Windows kernels of the same version. One laptop with bad RAM doesn't mean all similar models have bad RAM.

The rowhammer test program consistently finds one of my systems (i7 3770k, z77) vulnerable in <10s.

This is a system that passed several days of memtest86+.

memtest86 etc. should add tests for this if they haven't already, as this is the best place for such tests.

If they did so... the fallout would be interesting. Does anyone know what proportion of modern memory has this flaw? Would it result in tens of thousands of customers returning stick after stick of DRAM until they were able to get a reliable one?

memtest86 has this feature in beta, and it's already generating some heat.

I would be personally more interested in this test on memtest86+ though.

What's the difference between memtest86 and memtest86+?

OK, from WP [1]: "Memtest86 was developed by Chris Brady. After Memtest86 remained at v3.0 (2002 release) for two years, the Memtest86+ fork was created by Samuel Demeulemeester to add support for newer CPUs and chipsets. As of November 2013 the latest version of Memtest86+ is 5.01."

And the original has become a commercial program by PassMark. So I think at this point if anyone is talking about memtest86, they're likely referring to the still open-source '+' version.

[1] http://en.wikipedia.org/wiki/Memtest86

There is a github repo with a rowhammer test based on memtest86+: https://github.com/CMU-SAFARI/rowhammer

On my desktop (DH87RL / i7-4770 / 2x8GB Crucial DDR3L-1600), rowhammer_test reported errors after ~20 iterations (less than a minute).

I went into the BIOS and tried lowering the tREFI value from 6300 to 3150 (not sure what the units are). So far, it's gone 1000 iterations with no problems detected.

Edit: Actually, the units are probably multiples of the cycle time, just like CAS latency. So, for DDR3-1600, that would mean 6300x1.25ns=7.8μs, and 3150x1.25ns=3.9μs
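Checking the arithmetic above (the assumption that tREFI is counted in memory-clock cycles is the commenter's guess; 6300 ticks comes out to 7.875 µs, which rounds to the standard DDR3 tREFI of 7.8 µs):

```c
/* tREFI ticks to microseconds, assuming ticks are memory-clock cycles.
   DDR3-1600 transfers 1600 MT/s on an 800 MHz clock, so one cycle is
   1 / 800 MHz = 1.25 ns, matching the comment above. */
static double trefi_us(int ticks, double cycle_ns) {
    return ticks * cycle_ns / 1000.0;   /* ns -> us */
}
```

So halving tREFI from 6300 to 3150 halves the refresh interval from ~7.9 µs to ~3.9 µs, at the cost of extra refresh overhead.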


I tried and it reported one error under a second. I had to reboot because gcc started to make bash crash, it seems. Then I saw the README (duh!):

  Be careful not to run this test on machines that contain important
  data.  On machines that are susceptible to the rowhammer problem, this
  test could cause bit flips that crash the machine, or worse, cause bit
  flips in data that gets written back to disc.

  **Warning #2:** If you find that a computer is susceptible to the
  rowhammer problem, you may want to avoid using it as a multi-user
  system.  Bit flips caused by row hammering breach the CPU's memory
  protection.  On a machine that is susceptible to the rowhammer
  problem, one process can corrupt pages used by other processes or by
  the kernel.
(Mine is Kingston Hyper X 2x8GB DDR3 1600MHz)

Single-sided or double-sided hammering?

I used rowhammer_test.cc, which I think is single-sided.

Surprised that the mitigations section did not mention ECC RAM. Wouldn't it be effective mitigation?

Not necessarily; see the original paper.

For example, SECDED (single error-correction, double error-detection) can correct only a single-bit error within a 64-bit word. If a word contains two victims, however, SECDED cannot correct the resulting double-bit error. And for three or more victims, SECDED cannot even detect the multi-bit error, leading to silent data corruption.

Edit: link http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf

Technically, SECDED cannot reliably detect errors involving three or more bits, since they might produce a valid code word; if they don't, they may be misreported as a single- or double-bit error, or possibly something else.

Also, the typical reaction to an uncorrectable ECC error is to halt the system with an NMI.

Yeah, ECC is going to make exploiting this reliably a lot harder - you'd need to flip three or more bits in the right combination, without first hitting a combination of bits that'd be detected as an uncorrectable error. Google's report suggests they haven't even been able to cause uncorrectable two-bit errors yet, let alone undetectable three-bit ones.
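The correct/detect boundary being discussed can be shown with a toy SECDED code: Hamming(7,4) plus an overall parity bit. This is a sketch for illustration only; real DRAM ECC uses a (72,64) code over 64-bit words, but the failure modes are the same.

```c
/* Toy SECDED: Hamming(7,4) plus an overall parity bit, encoding a
   4-bit nibble into 8 bits. One flipped bit is corrected, two are
   detected but not correctable, and three or more can alias into a
   "corrected" word that is silently wrong. */

static unsigned encode(unsigned d) {
    /* data bits go to Hamming positions 3, 5, 6, 7 */
    unsigned b3 = (d >> 0) & 1, b5 = (d >> 1) & 1,
             b6 = (d >> 2) & 1, b7 = (d >> 3) & 1;
    unsigned p1 = b3 ^ b5 ^ b7;   /* covers positions 1,3,5,7 */
    unsigned p2 = b3 ^ b6 ^ b7;   /* covers positions 2,3,6,7 */
    unsigned p4 = b5 ^ b6 ^ b7;   /* covers positions 4,5,6,7 */
    unsigned cw = (p1 << 1) | (p2 << 2) | (b3 << 3) | (p4 << 4)
                | (b5 << 5) | (b6 << 6) | (b7 << 7);
    return cw | __builtin_parity(cw);   /* bit 0: overall parity */
}

/* *status: 0 = clean, 1 = single error corrected, 2 = uncorrectable */
static unsigned decode(unsigned cw, int *status) {
    unsigned s = 0;
    for (int i = 1; i <= 7; i++)
        if ((cw >> i) & 1)
            s ^= i;                        /* syndrome: XOR of set positions */
    if (s == 0 && __builtin_parity(cw) == 0)
        *status = 0;
    else if (__builtin_parity(cw) == 1) {  /* odd flip count: "correctable" */
        *status = 1;
        cw ^= (s == 0) ? 1u : (1u << s);   /* s==0: the parity bit itself */
    } else
        *status = 2;                       /* even flip count, bad syndrome */
    return ((cw >> 3) & 1) | (((cw >> 5) & 1) << 1)
         | (((cw >> 6) & 1) << 2) | (((cw >> 7) & 1) << 3);
}
```

Flipping three bits whose Hamming positions XOR to zero (e.g. positions 1, 2, 3) leaves a zero syndrome with odd parity, so the decoder "corrects" the parity bit and hands back wrong data with status 1: exactly the silent-corruption case the quoted paper describes.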

"We also tested some desktop machines, but did not see any bit flips on those. That could be because they were all relatively high-end machines with ECC memory. The ECC could be hiding bit flips."

Now someone has to come up with a JavaScript version of this exploit and the disaster is complete.

More difficult since you can't execute CLFLUSH from there.

Maybe, though as they say it'd potentially be possible to cause a cache spill and attack it that way. I was looking at the associativity of various CPU caches with a vague eye to trying this in JavaScript a few days back and in theory it shouldn't take many reads to evict a cache line, so long as they're from the right addresses.

Maybe some JS commands trigger a CLFLUSH internally. I don't know, but it'd be "funny" if that exploit worked in JS.
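For reference, the native attack's inner loop is tiny. The sketch below follows the shape of Google's published rowhammer-test (not their exact code): read two addresses in different rows of the same bank, then CLFLUSH both so the next iteration hits DRAM instead of the cache. CLFLUSH is exactly the primitive JavaScript lacks, which is why an eviction-based variant would be needed there.

```c
#include <stdint.h>

/* Hammer two addresses: each iteration performs two reads that, thanks
   to the flushes, become fresh DRAM row activations. On vulnerable
   DIMMs, enough activations between refreshes can flip bits in
   physically adjacent rows. The #if guard lets the sketch compile
   (without flushing, and thus without hammering) on non-x86 targets. */
static void hammer(volatile uint64_t *a, volatile uint64_t *b, long n) {
    while (n-- > 0) {
        (void)*a;
        (void)*b;
#if defined(__x86_64__) || defined(__i386__)
        __asm__ volatile("clflush (%0)" : : "r"(a) : "memory");
        __asm__ volatile("clflush (%0)" : : "r"(b) : "memory");
#endif
    }
}
```

Note that two arbitrary stack variables like these almost certainly share a row, so running this as-is proves nothing; the real test picks physical addresses in distinct rows of one bank.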

Would reducing the speed memory is clocked at prevent this?

Yes, it would, as would overvoltage, and reducing the refresh interval. The latter reduces memory subsystem performance, however.

Row-access counters in the memory controller would solve this problem: too many accesses between refresh cycles -> force a refresh cycle for that particular row and potentially affected rows.

Anything that reduces the number of times memory can be accessed between refreshes can mitigate this, reducing the RAM clock (probably) included.
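The row-access-counter idea above can be modeled in a few lines. This is a toy simulation with made-up row count and threshold; real hardware implementations (later marketed as targeted row refresh) must also bound counter storage, e.g. by tracking rows probabilistically.

```c
#include <string.h>

#define NROWS 16
#define HAMMER_THRESHOLD 4   /* hypothetical activations per refresh window */

static int act_count[NROWS];  /* per-row activation counters */
static int refreshed[NROWS];  /* rows given an early refresh, for inspection */

/* Memory-controller hook, called on every row activation: when a row is
   opened too often within one refresh window, proactively refresh its
   physical neighbours before their cells can leak enough charge. */
static void on_activate(int row) {
    if (++act_count[row] < HAMMER_THRESHOLD)
        return;
    if (row > 0)         refreshed[row - 1] = 1;
    if (row < NROWS - 1) refreshed[row + 1] = 1;
    act_count[row] = 0;
}

/* Called at the end of each normal refresh window. */
static void refresh_window_end(void) {
    memset(act_count, 0, sizeof act_count);
}
```

Hammering row 5 past the threshold would mark rows 4 and 6 for an early refresh while leaving distant rows alone, which is the whole mitigation.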

My first gen Toshiba Chromebook ran the test 130 minutes without an error.

Does anyone know if Macbooks are known to be affected?

Seems like my Macbook Air 2014 is not affected (with a high probability)

here's the test: https://github.com/google/rowhammer-test

Thanks for the link.

I haven't seen anything after 375 iterations (600s). So I may still be exploitable, but that means you'd have to keep something running at 100% CPU for > 600s and somehow have me not notice the laptop fans going crazy.

An exploit tool could always run slower and hide from that.

Also consider that it might work better when your laptop is in lower power mode because of reduced voltages.

You may wish to try both single- and double-sided hammering. If you hit the right row size it is significantly more effective:
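For concreteness, double-sided hammering picks the two rows bracketing the victim, which requires knowing (or guessing) the row stride. The 8 KB value below is purely illustrative; actual row size depends on DIMM geometry and the controller's address mapping, which is why hitting the right row size matters.

```c
#include <stdint.h>

/* With a guessed row stride, the two aggressor rows sit one stride on
   either side of the victim address. ROW_STRIDE is a hypothetical
   value, not a constant from the actual test. */
#define ROW_STRIDE 0x2000u   /* 8 KB, illustrative only */

static void pick_aggressors(uintptr_t victim,
                            uintptr_t *above, uintptr_t *below) {
    *above = victim - ROW_STRIDE;
    *below = victim + ROW_STRIDE;
}
```

Hammering both aggressors stresses the victim row from both physical neighbours at once, which is why double-sided hammering flips bits faster than the single-sided variant when the stride guess is right.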


How long did you test it? The tests they did ran fairly long; possibly you'd have to run this for days to really be able to state that a particular machine/RAM combination is not vulnerable.

> The test should work on Linux or Mac OS X, on x86 only.

No x86_64 support?

They mean no powerpc or other non-intel chips. I've run it on several x86_64/amd64/x64 processors.

Download the tool and try it yourself? It supports Mac OS and there is a mailing list to report affected machines (nothing seems to be posted there yet).

Is a memory error actually an exploit? If so then are the unwanted changes that occur with no deliberate action an example of the computer cracking itself?


Everything is a memory error on some level.

Back to grounding in reality, a way to reliably[1] break security measures is an exploit. Cosmic ray bit flips are anything but reliable.

[1]The threshold of reliability being somewhere below "instant and always" and somewhere above "one in a million if you give it a day to try".

I think there is a useful distinction between a fault/error and an exploit. A fault is a break from the "desired" or "expected" semantics of a system, while an exploit is an algorithm to predictably utilize a fault (or faults) to access unexpected behaviours in that system. I.e., a buffer overflow is a fault in a program (breaking the expectation that a buffer's contents will remain within a certain bound), while an exploit targeting that overflow will likely allow running arbitrary code in a program not designed to do so.

So, I'd put it, the memory error can be leveraged in an exploit.

Errors can be used as part or all of an exploit. Exploiting a system requires that ethereal value of "intent", and I don't think anyone would (currently) argue that computers can have intent. Without that intent, it's just an error.

