AMD leaves the ECC hardware enabled on most of their parts.
1: https://social.msdn.microsoft.com/Forums/azure/en-us/84000f7... (I remember I asked about Windows Azure, but my posting clearly says SQL Azure, so perhaps it's a different hardware platform.)
Now, I could be wrong, but it would be quite a surprise to find out that any of the cloud services are not using ECC. I suspect they all are, but they don't advertise it.
I suppose I could just spin up a 56GB instance and let it run a memtest for a week and see, right?
Not true - there are some Atoms that do support ECC, but they're targeted at NAS-type uses. It is the case, though, that you can't get Core-series processors with ECC.
AMD used to offer very broad support for ECC, but data integrity clearly didn't win market share.
The i3s in particular are fairly popular with FreeNAS users.
It requires more expensive, compatible DRAM, right? Knowing that, it shouldn't really be super surprising - enabling it on the die is just one piece of the equation.
i5 and i7 have no ECC because there are equivalent Xeons. At this level "Xeon" is just branding implying features like ECC, and it doesn't automatically mean expensive - a single-socket non-E (LGA1150) Xeon build costs only a little more (~20%) than a consumer CPU/mobo/RAM combo (although there are better sales on the consumer stuff).
(On a different note: the ThinkPad maxes out at 8GB and the Acer at 12GB, whereas previous generations went up to 16GB at least. Intel intentionally nerfed Haswell and newer core i's, presumably to push their Xeons on more people)
Really? I think the latest Haswell can go to 16GB just fine if there are two SO-DIMM slots.
Update: Oh wow, the new Broadwell chips do support them. So maybe the new ThinkPad X250 isn't so useless after all! This is great news if true.
(Or, Lenovo could put IBM engineering in charge and figure out how to get 2 slots back on the X series.)
From memory, I think one card had an internal startup check that looked at whether its EPROM had been marked by the "Black Sunday" countermeasure, and hung itself if so.
The hackers, having a ROM dump and knowing how many clock cycles each instruction took the CPU, knew that this internal check happened at roughly clock cycle 525.
Knowing that the instruction was a conditional branch ("branch if equal", I think) that took 12 cycles, they figured out which of those 12 cycles triggered the branch, worked out the precise moment to glitch (whether via voltage or a single rapid clock cycle), and made the CPU skip updating the instruction pointer, so it continued through its ROM code as if the check had passed.
Within a month or two, hundreds of thousands of receivers had a man-in-the-middle device just to glitch reprogrammed cards every time they were started up.
Apparently the North American provider had tested the same countermeasure in their South American division, so the North Americans had advance notice of what they had to do to get back in action.
I recall that, for another system, a small memory chip was required for a pre-existing man-in-the-middle card, and overnight every electronics supplier went out of stock - Digikey alone sold through 50k units.
The people exceeding the defined limitations of these things seemed to know them better than the people who defined the limitations.
The manufacturers seemed to use an externally accessible JTAG port to program the receivers in the factory, which was a convenient boon to hackers, who didn't even need a screwdriver to reprogram the units through their parallel ports.
It wasn't until the whole thing turned into a giant PR disaster that they started a generous exchange program. That whole affair is basically the reason that Intel is much more forthcoming with errata these days.
For that matter, the "no"s on that table really only prove that the exact stick they tested with the exact memory locations they tested did not exhibit detectable bit flips. It doesn't prove that those sticks are "safe", let alone that the product line they come from is safe.
So, basically, what's vulnerable? To a first approximation, everything. What would happen if we tried to recall every bit of DRAM produced in the past X years (where X is also unknown)? Well... you'd bankrupt the industry is what you'd do. That's not a very useful outcome.
In fact this sort of thing happens all the time. New safety tech is constantly being developed for cars, but you can't go back and sue the auto companies for not including it before it was invented or the need for it was discovered. This seems more like that situation than an actual case of negligence or "defects" being produced.
Well... more or less. I know of cases where this was successfully done, though they tend to get overturned on appeal. Run with me here.
I use MemTest86+ on every stick of DRAM I buy - if there's even a single error, it goes back as defective. The fact that this memory seems to work for most access patterns doesn't excuse the fact that it is completely broken for others, because good memory should be able to store any data and maintain its integrity for any access pattern.
Unfortunately even MemTest86+ is not exhaustive, as I found out while troubleshooting a very strange issue: a specific file in a specific archive would unpack with corrupted bits (and an "archive damaged" message) on a coworker's computer, but would unpack fine on half a dozen other machines. A hash of the file matched, so HDD-based corruption was ruled out. His machine passed an overnight run of MemTest86+ perfectly, and AFAIK no other archive would yield corruption when unpacked. He reported never getting any crashes - but that one file in that archive would always fail to unpack correctly.
It would always corrupt in the same strange way. On a whim, I decided to swap the RAM out, and the problem went away. Even the "bad" stick seemed to work fine in other machines with the same model of CPU and mobo, running the same OS and unpacking the same archive - but in his extremely specific combination of hardware and software it would always fail. That experience taught me that bad RAM can be extremely difficult to troubleshoot.
This isn't like other storage technologies, e.g. SSDs, where the finite lifespan and sensitivity to access patterns are well-documented. It's a case of claiming to sell memory while giving consumers a close approximation of it - one that completely breaks in some situations. I think it needs to be treated like the FDIV bug.
Good luck trying that though!
Which is why medical devices should all have ECC memory. And for that matter physical separation between any processor that might run attacker-controlled code and the processor responsible for That Which Must Not Fail.
Product defects like this are foreseeable. If bad memory can cause a medical device to kill someone, the party at fault is the one who made a medical device with so little redundancy and error correction that bad memory could cause it to kill someone.
That may be the reason why the desktops mentioned are less sensitive: they'll use full-size memory modules and have beefy power supplies.
It'd be interesting to repeat the experiments with the laptops running off their internal battery.
And why name no hardware vendor? I'm guessing they expect people to use the tool they provided and draw their own conclusions, but I don't understand why they'd treat hardware vendors differently from software vendors.
Why would they fear hardware manufacturers' litigation more than software vendors'? Especially at a company as big as Google?
Way too many variables to make any claim that is ethically defensible.
This is a system that passed several days of memtest86+.
I would personally be more interested in seeing this test in memtest86+, though.
OK, from WP: "Memtest86 was developed by Chris Brady. After Memtest86 remained at v3.0 (2002 release) for two years, the Memtest86+ fork was created by Samuel Demeulemeester to add support for newer CPUs and chipsets. As of November 2013 the latest version of Memtest86+ is 5.01."
And the original has become a commercial program by PassMark. So I think at this point if anyone is talking about memtest86, they're likely referring to the still open-source '+' version.
I went into the BIOS and tried lowering the tREFI value from 6300 to 3150 (not sure what the units are). So far, it's gone 1000 iterations with no problems detected.
Edit: Actually, the units are probably multiples of the cycle time, just like CAS latency. So, for DDR3-1600, that would mean 6300 × 1.25 ns ≈ 7.9 μs, and 3150 × 1.25 ns ≈ 3.9 μs.
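That conversion is easy to sketch in Python, assuming (as the edit above guesses) that the BIOS value is in memory-clock cycles; the function name is mine, not from any BIOS:

```python
def trefi_us(trefi_cycles, transfer_rate_mts):
    """Convert a BIOS tREFI value (in memory clock cycles) to microseconds.

    For DDR memory the clock runs at half the transfer rate, so the
    cycle time in ns is 2000 / (transfer rate in MT/s).
    """
    cycle_ns = 2000.0 / transfer_rate_mts  # DDR3-1600 -> 1.25 ns
    return trefi_cycles * cycle_ns / 1000.0

# The two BIOS settings from the comment above, on DDR3-1600:
print(trefi_us(6300, 1600))  # 7.875 (close to the ~7.8 us JEDEC default)
print(trefi_us(3150, 1600))  # 3.9375
```

Halving tREFI doubles the refresh rate, which is exactly the mitigation the rowhammer paper discusses.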
**Warning #1:** Be careful not to run this test on machines that contain important data. On machines that are susceptible to the rowhammer problem, this test could cause bit flips that crash the machine, or worse, cause bit flips in data that gets written back to disc.
**Warning #2:** If you find that a computer is susceptible to the rowhammer problem, you may want to avoid using it as a multi-user system. Bit flips caused by row hammering breach the CPU's memory protection. On a machine that is susceptible to the rowhammer problem, one process can corrupt pages used by other processes or by the kernel.
For example, SECDED (single error correction, double error detection) can correct only a single-bit error within a 64-bit word. If a word contains two victims, however, SECDED cannot correct the resulting double-bit error. And for three or more victims, SECDED cannot even detect the multi-bit error, leading to silent data corruption.
Edit: link http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf
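The three failure modes in that quote can be demonstrated with a toy SEC-DED code - an extended Hamming (8,4) code over a single nibble, rather than the (72,64) code real ECC DIMMs use. The bit layout and function names here are illustrative, not from the paper:

```python
def secded_encode(nibble):
    """Encode 4 data bits as an 8-bit extended-Hamming (SEC-DED) codeword."""
    b = [0] * 8                      # b[1],b[2],b[4]: Hamming parity; b[0]: overall parity
    b[3], b[5], b[6], b[7] = [(nibble >> i) & 1 for i in range(4)]
    b[1] = b[3] ^ b[5] ^ b[7]
    b[2] = b[3] ^ b[6] ^ b[7]
    b[4] = b[5] ^ b[6] ^ b[7]
    b[0] = b[1] ^ b[2] ^ b[3] ^ b[4] ^ b[5] ^ b[6] ^ b[7]
    return b

def secded_decode(b):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    b = b[:]
    syndrome = ((b[1] ^ b[3] ^ b[5] ^ b[7])
                | (b[2] ^ b[3] ^ b[6] ^ b[7]) << 1
                | (b[4] ^ b[5] ^ b[6] ^ b[7]) << 2)
    overall = b[0] ^ b[1] ^ b[2] ^ b[3] ^ b[4] ^ b[5] ^ b[6] ^ b[7]
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd number of flips: assume one, "correct" it
        status = "corrected"
        if syndrome:
            b[syndrome] ^= 1
    else:                            # even number of flips with nonzero syndrome
        status = "uncorrectable"
    data = b[3] | b[5] << 1 | b[6] << 2 | b[7] << 3
    return data, status

cw = secded_encode(0b1011)
one = cw[:]; one[5] ^= 1                                     # 1 victim
two = cw[:]; two[5] ^= 1; two[6] ^= 1                        # 2 victims
three = cw[:]; three[3] ^= 1; three[5] ^= 1; three[6] ^= 1   # 3 victims
print(secded_decode(one))    # (11, 'corrected')      - fixed
print(secded_decode(two))    # (13, 'uncorrectable')  - detected, not fixed
print(secded_decode(three))  # (12, 'corrected')      - wrong data, reported as fine
```

The three-victim case is the scary one: the decoder returns the wrong nibble while claiming a successful correction, which is exactly the silent data corruption the paper warns about.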
here's the test: https://github.com/google/rowhammer-test
I haven't seen anything after 375 iterations (600s). So I may still be exploitable, but that means you'd have to keep something running at 100% CPU for > 600s and somehow have me not notice the laptop fans going crazy.
Also consider that it might work better when your laptop is in lower power mode because of reduced voltages.
No x86_64 support?
Getting back to reality: an exploit is a way to reliably break security measures. Cosmic-ray bit flips are anything but reliable.
The threshold of reliability is somewhere below "instant and always" and somewhere above "one in a million if you give it a day to try".
So, I'd put it this way: the memory error can be leveraged in an exploit.