With smaller and denser DRAM chips every year, it takes an ever smaller disturbance to flip a bit. ECC needs to go mainstream yesterday.
Obviously the problem is that Intel has twice the single-thread performance, but all modern CPUs are so fast that it's almost irrelevant. And on top of that, most of the applications that "actually need performance" are being updated to use the GPU instead of the CPU anyway.
I assume Intel disables ECC in consumer CPUs to help prop up the prices of the Xeon models which are essentially the same CPUs but at a higher price point.
It's a little-known fact that Celeron, Pentium and i3 processors also support ECC, when used in a motherboard with a server/workstation chipset.
It's just i5s and i7s that are crippled. Cunning.
It usually costs more to power a CPU over its lifetime than it does to buy it, and Intel has been aggressively optimizing the Core architecture's performance per watt to prevent ARM from taking over on products where battery life is important.
Also, FWIW, if you're holding onto an older chip because it's still "good enough", you might want to think about upgrading to either of the above. A dual-socket Intel Xeon 5160 system uses 300 watts at the plug at idle.
ADDED. The "300 watts" in your (added) second paragraph makes me strongly suspect that the measurements are from the wall socket / power cord. OK, so did you disconnect the power to any disk drives or graphics cards? Anything special about the power supply? (There are special "green", extra-efficient power supplies.)
Both hard drives were in standby mode (that reduces power consumption by ~5 watts), though things like that should add the same amount to both systems and make no difference to the relative numbers. If anything I'm biasing against the AMD system, because the i5 is using its internal GPU whereas the FX has a low-end discrete GPU, which I assume adds five or ten watts to its numbers.
EDIT: back-of-the-envelope calculation
Let's say my CPU runs 12 hours a day and uses 20 watts on average. Over a month that's 360 hours at 20 watts = 7.2 kWh, which costs roughly $1.
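For anyone who wants to plug in their own numbers, here's the same estimate as a tiny Python sketch (the electricity rate is my assumption, roughly the US average, not a figure from above):

    # Back-of-envelope CPU energy cost.
    hours_per_day = 12
    avg_watts = 20
    days_per_month = 30
    usd_per_kwh = 0.13  # assumed rate; substitute your local price

    kwh_per_month = hours_per_day * days_per_month * avg_watts / 1000
    print(f"{kwh_per_month:.1f} kWh/month ~= ${kwh_per_month * usd_per_kwh:.2f}")
    # -> 7.2 kWh/month ~= $0.94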
> Between 12 percent and 45 percent of machines at Google experience at least one DRAM error per year. This is orders of magnitude more frequent than earlier estimates had suggested.
He didn't dare to go without ECC for the database... :)
Umm, I don't know -- perhaps data integrity? Having multiple servers doesn't solve the "I just wrote bad data to disk" problem.
Now every server has a bad copy of the data.
Or even more fun: I do an OS update and something is slightly corrupted -- enough to harm data being processed, but not enough to crash applications. Fun!
I explicitly excluded this case when I said "if the client calls the cluster nodes individually".
> Or even more fun: I do an OS update and something is slightly corrupted -- enough to harm data being processed, but not enough to crash applications. Fun!
Any such OS corruption (already an unlikely scenario) would almost certainly manifest as the equivalent of frequent memory errors and be handled by the same mechanism.
People see this stuff in the real world. It sucks.
No doubt it sucks, but what's going to manifest that's not going to be handled by that protocol? Either you get consistent corruption, in which case you notice quickly, or you get sporadic corruption, in which case it's handled by the redundancy with the rest of the cluster and eventually you notice that one node's failure rates are higher than the others. What other failure modes even are there?
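To make that concrete, here's a toy Python sketch of the "client calls the cluster nodes individually" idea: a majority vote masks sporadic corruption, and per-node mismatch counters surface the node whose failure rate is higher. The replica interface is made up for illustration, not from any real system:

    # Read from every replica, take the majority answer, and count
    # disagreements so a flaky node eventually stands out.
    from collections import Counter

    def quorum_read(replicas, key, mismatches):
        answers = {name: read_fn(key) for name, read_fn in replicas.items()}
        value, _ = Counter(answers.values()).most_common(1)[0]
        for name, answer in answers.items():
            if answer != value:
                mismatches[name] = mismatches.get(name, 0) + 1
        return value

    # Example: node "c" sporadically returns a corrupted copy.
    replicas = {"a": lambda k: "v1", "b": lambda k: "v1", "c": lambda k: "vX"}
    mismatches = {}
    assert quorum_read(replicas, "key", mismatches) == "v1"
    assert mismatches == {"c": 1}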
I know of pretty much no application for which this is the case.
PC manufacturers have zero incentive to add ECC when 99.99% of consumers don't care or even know.
Networking: a mixed bag. Some designs manage ECC inside the processor; others use memories with ECC for specialized functions like packet QoS. Commodity DRAM errors are a known irritant and are planned for.
Servers: I'm not quite sure of my facts here, but a DRAM manufacturer would have to persuade a commodity manager at Dell, etc., that ECC is worth it. A commodity manager won't give a d* unless his systems guys say that it's needed.
That's what's currently done with flash, but for DRAM that's not smart from a system point of view.
A long time ago ECC was done with Hamming codes; now there are similar but more sophisticated versions. These codes share the property that the wider the word, the fewer additional bits are required per data bit. So, e.g. (a quick check of these numbers follows the list):
8 bits + 5 bits
16 bits + 6 bits
32 bits + 7 bits
64 bits + 8 bits
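Those numbers are easy to verify: a single-error-correcting Hamming code over m data bits needs the smallest r with 2^r >= m + r + 1, plus one extra parity bit for double-error detection (SECDED). A quick Python sketch of that rule:

    # Check bits for a SECDED (single-error-correct, double-error-detect)
    # code over m data bits.
    def secded_check_bits(m: int) -> int:
        r = 1
        while 2**r < m + r + 1:
            r += 1
        return r + 1  # +1 overall parity bit for double-error detection

    for m in (8, 16, 32, 64):
        print(m, "data bits ->", secded_check_bits(m), "check bits")
    # 8 -> 5, 16 -> 6, 32 -> 7, 64 -> 8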
Many decades ago (but I couldn't find a reference with a quick google) Micron made a chip that had on-board ECC. IIRC it was 8+4, which meant that it would internally correct a 1 bit error, but there was a possibility it would mis-correct some 2+ bit errors (which were rare).
But that meant that Micron was putting 50% more bits onto a die over a non-ECC device. They probably did this not for system reliability but to cover up the failings of their chips at the time. Yes you could make each bit smaller, but no way would it pay off if you needed to put 50% more bits onto a die.
ECC is best done with wide words. For example, a SIMM that presents a 72-bit interface can easily be built with 9 chips, each 8-bits wide.
The system (CPU, memory controller, whatever) takes in 72 bits (or some multiple thereof) from DRAM and does ECC internally. Note that as words get wider the time to compute ECC goes up. So a system could even speculatively execute using the uncorrected data, while in parallel checking to make sure that it was OK. This requires recovery in case of error, but since errors are infrequent it could be a big win overall.
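To make the wide-word scheme concrete, here's a toy SECDED(72,64) codec in Python, matching the 9-chip, 72-bit SIMM layout above. It's an illustration of the coding scheme only; a real memory controller does this in parallel hardware, not software:

    # Hamming SECDED over 64 data bits. Positions 1..71 hold the code
    # word; power-of-two positions hold the Hamming check bits, and
    # position 0 holds an overall parity bit for double-error detection.
    PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
    DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 slots

    def encode(word64):
        code = [0] * 72
        for i, pos in enumerate(DATA_POS):
            code[pos] = (word64 >> i) & 1
        for p in PARITY_POS:
            code[p] = sum(code[pos] for pos in range(1, 72) if pos & p) & 1
        code[0] = sum(code) & 1  # overall parity
        return code

    def decode(code):
        syndrome = 0
        for p in PARITY_POS:
            if sum(code[pos] for pos in range(1, 72) if pos & p) & 1:
                syndrome |= p
        overall = sum(code) & 1
        if syndrome and overall:        # single-bit error: fix it
            code[syndrome] ^= 1
        elif syndrome and not overall:  # two flipped bits: detect only
            raise ValueError("uncorrectable double-bit error")
        word = 0
        for i, pos in enumerate(DATA_POS):
            word |= code[pos] << i
        return word

    cw = encode(0xDEADBEEFCAFEF00D)
    cw[42] ^= 1                         # flip one stored bit
    assert decode(cw) == 0xDEADBEEFCAFEF00D

The speculative-execution idea above amounts to consuming the raw DATA_POS bits immediately and running the decode in parallel, rolling back on the rare nonzero syndrome.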
It doesn't make sense for a DDRx spec to require ECC. It's cheaper and smarter to do ECC at a system level.
I worked in a DRAM fab; I saw how it was made and tested. It's a month-long process involving dozens of machines. Particulate contamination, watermarks, process variation -- there are tons of things that can go wrong. And then testing. There's just no way to test thoroughly enough to catch all of the possible issues with timing and crosstalk. Not to mention electromigration and dielectric breakdown. You can do burn-in, but it's not going to catch everything. We even know which die are most failure-prone (usually edge die, due to process uniformity issues.) If they pass test, they get shipped!
Of course these issues are manufacturing related.
Re. cosmic rays, it's surprising, but I think they can cause errors: http://stackoverflow.com/questions/2580933/cosmic-rays-what-...
But cosmic rays are just a constant background thing. If an errant cosmic ray flips a bit, that doesn't happen any more often if you hammer on the chips or heat them up or expose them to radio waves or whatever. The only way to simulate X years of real-world use is to use them for X years, short of setting up a particle accelerator or something. (Technically, you could simulate X years of real-world cosmic ray exposure in X/N years by testing N sets of RAM in parallel, but then that requires you to have more RAM under test than you give to your customers!)
All the same, I'm sure more mundane explanations are behind the vast majority of errors, as you say.
Apple did exactly this to the EFI of some machines, to mitigate Rowhammer:
> Description: A disturbance error, also known as Rowhammer, exists with some DDR3 RAM that could have led to memory corruption. This issue was mitigated by increasing memory refresh rates.
A beautifully whimsical way to end a fascinating technical article!
For such laptops to exist, there would have to be a mobile memory controller that supports ECC RAM. Intel's chips have had the memory controller built in since Nehalem (2008), and Intel has no mobile CPUs with ECC support, apart from the brand-new mobile Xeons, the first of which were released in September with a 45W TDP; I think laptops with them are not out yet.
8GB (1x8GB) 2133MHz DDR4 Memory, Non-ECC
16GB (1x16GB) 2133MHz DDR4 Memory, Non-ECC
16GB (2x8GB) 2133MHz DDR4 Memory, ECC
16GB (2x8GB) 2133MHz DDR4 Memory, Non-ECC
32GB (4x8GB) 2133MHz DDR4 Memory, ECC
32GB (4x8GB) 2133MHz DDR4 Memory, Non-ECC
32GB (2x16GB) 2133MHz DDR4 Memory, Non-ECC
64GB (4x16GB) 2133MHz DDR4 Memory, Non-ECC
I don't know if you'd call that "consumer grade" or not, though.
Based on the list of authors at the beginning of the article and the credited source given in the figures ("Ioan Stefanovici, Andy Hwang & Bianca Schroeder" and "Source: Hwang, Stefanovici, and Schroeder, Proceedings of ASPLOS XVII, 2012"), the article appears to be written by the same researchers who ran the study.