
To ECC or Not to ECC - shritesh
http://blog.codinghorror.com/to-ecc-or-not-to-ecc/
======
mehrdada
Please note that the main purpose of ECC is not to reduce the RAM error rate
and make memory look more reliable, but to help the system stop the process
when an unrecoverable memory error occurs, as opposed to propagating it and
producing unpredictable outcomes. The change in the effect of failure is what
matters most, not its probability. Without ECC, there's often no clear way to
tell whether the result of a computation is valid or is garbage that should
be discarded.

(Of course, in extreme scenarios, like at Google scale, even ECC can fail to
catch errors due to multi-bit flips, but in almost all non-pathological
scenarios, SECDED[1] is enough to catch all erroneous cases.)

[1]: [http://cr.yp.to/hardware/ecc.html](http://cr.yp.to/hardware/ecc.html)

~~~
sireat
Exactly, you want to know when the error is due to memory.

Intel deciding that consumers (including those buying Haswell-E CPUs) do not
need ECC really irks me. It's textbook market segmentation from a near
monopoly.

Currently you cannot have your cake and eat it too:

You cannot have the best single-thread performance (offered by overclocking
Haswell-E series or Skylake 6700k) and have ECC.

So if one is building the ultimate workstation, there's a hard choice: do you
go with the X99 chipset (no ECC, but you can overclock), or do you go with
server motherboards and the C610 chipset, which are quite limited as far as
consumer interests go?

Also interesting are the Intel mobile Xeons, which now provide a venue for
ECC on a laptop.

~~~
sliken
Textbook market segmentation? Xeons have ECC, a larger thermal envelope, and
some additional testing. Sure, they are identical silicon.

Generally, if you are willing to give up a single clock bin in exchange for
ECC, you end up with a cheaper (and cooler) system that's more reliable. And
if you want the cheapest 4c/8t CPU, it's a Xeon, NOT an i7.

I don't feel particularly artificially segmented. Additionally, the high-end
desktop motherboards tend to be more expensive than the server boards. Often
I find a nice server board at $180 while the nice desktop boards are another
$100 on top of that. Sure, they are marketed to gamers, but I really just
want nice, reliable power and cooling, and it's not clear which of the
cheaper desktop boards are really going to last 24/7 for 5 years.

Today I'd buy the E3-1270 for $339 over the $350 i7-6700k. Keep in mind the K
chips are a premium _AND_ they don't come with a fan like the non-K chips do.
Sure, it's 3.6-4.0 GHz instead of 4.0-4.2 GHz, but that's not a particularly
noticeable difference, especially since both thermally throttle as needed.

I think ECC is well justified because it doesn't just detect DIMM errors, but
also motherboard errors, CPU errors, and socket (DIMM or CPU) errors. If a
node randomly crashes/hangs it's very hard to track down why... unless you
have ECC, which will often help you pin it down. I'd much rather see
something strange show up in mcelog than wait for a hang, or worse, a
corruption.

Most of my "ECC" errors have actually been motherboard, socket, or (in AMD's
case) CPU errors. When I look at larger samples, some DIMMs are WAY less
reliable than others, strongly implying it's not high-energy particles, but
something out of spec.

~~~
mjevans
If it weren't a market segmentation strategy, just like limiting the RAM
capacity, then there'd be equivalent 'server' chips for most current 'desktop'
feature sets and vice versa. However that is clearly not the case, both in my
own shopping experience and in the experience of Jeff Atwood (this is in fact
something he complains about in this very article).

ECC would require running and connecting a few more traces, but that would
/surely/ be offset by not having to create as many layouts or source/stock as
many parts. In the past, AMD had a competitive selling point in /all/ of
their CPUs supporting ECC RAM. Today that is not the case, as they too have
mirrored (colluded with?) Intel's market segmentation strategy.

------
PhantomGremlin
I know that Jeff is a demigod to some people, but I interpret this article as:
"As a software guy, I don't really understand why I need this fancy hardware,
so this can't be important". IMO he's wrong.

The margins between working and non-working DRAM these days are extremely
small. E.g. Rowhammer demonstrated that even user-space programs could readily
obliterate main memory, without even trying very hard to do so.[1]

But, maybe in this case he's right. It's not like "open source Internet forum
software" is anything that's mission critical. If there's an occasional garble
in a character or two, will the latte-swilling hipsters even notice? :-)

Just like the original Google servers he points to. Who cares if they
occasionally screwed up in reporting search results, because they didn't have
ECC memory. Overall the experience was still 100x better than using something
like Altavista.

[1]
[https://en.wikipedia.org/wiki/Row_hammer](https://en.wikipedia.org/wiki/Row_hammer)

~~~
theandrewbailey
What Jeff is trying to say is: if ECC is so desperately needed to prevent
memory errors that are supposedly happening all the time, why isn't ECC in
every computer everywhere?

~~~
PhantomGremlin
That question is very easily answered.

The average consumer knows that more "jigabits" are better and more
"jigahertz" is better (see Intel NetBurst for how badly that can go wrong).

See the link elsewhere in this thread: someone posted a memory error
presentation that talked about FIT, failures in time. But the average
consumer doesn't know what that is.

Hence we get a race to the bottom. PC assemblers are willing to sell their
mothers into slavery if it can save them $0.05 in build cost. ECC doesn't fit
into that narrative.

BTW ECC is "in every computer" nowadays. As yet another poster mentioned,
Intel CPUs use ECC internally to protect their caches.

~~~
bro-stick
There are at least two broad classes of error correction and detection:
at-rest and in-flight.

Each storage hierarchy component (RAM, SSD, CPU caches, etc.) and
interconnection (chip-to-chip, add-on card, cable to another box) needs to be
looked at for risk of nondetection/data loss based on risk consequences of the
intended use.

For example, billing database servers for a successful company probably should
use RAID array/SAN/NAS (say RAID6 or ZFS with RAIDZ3) and Chipkill ECC memory
on an enterprise-class box with decent vendor support.

CDN boxes for serving free, static content can be almost anything.

Larger shops have the economies of scale to ask OEMs and ODMs to build custom
boxes that are more optimized than COTS gear from Dell, HP, or CDW.

When Jeff's venture takes off, they might explore gear customized for running
Ruby, and/or partner with 37signals and the like to have OEM/ODM folks
develop better-performing gear and open source it like Facebook has.

------
teddyh
We once had a new server with all new hardware which had weird problems and
kept crashing mysteriously. Memory tests showed no errors, so we were all
tearing our hair out. We took the server offline and set it to test
continuously – still no errors. After running Memtest86 on _nothing but test
#4, for about a day or so_ – _then_ a few memory errors showed up. We
replaced the memory, the problem was gone, and the server started working.

Memory errors are _especially_ insidious given how common they are. ECC is
worth it.

~~~
beachstartup
i wouldn't even call a machine without ecc a server or workstation. more like
a consumer device that's been given a job it can't do.

~~~
teddyh
This _was_ many years ago, though – maybe more than 10.

------
tzs
I tried to catch soft errors for about a year on a couple of Linux boxes I
had. They were both desktop form factor machines, one being used as a home
server and one as a desktop at work.

I had a background process [1] on each that simply allocated a 128 MB buffer,
filled it with a known data pattern, and then went into an infinite loop that
slept a while, woke up and checked the integrity of the buffer, and if any of
the data had changed logged the change and restored the data pattern.
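
A minimal sketch of such a checker in Python (the linked pastebin has the
real program; the pattern value, function names, and interval here are all
illustrative):

```python
import time

BUF_SIZE = 128 * 1024 * 1024  # 128 MB, matching the description above
PATTERN = 0xA5                # known fill pattern (the value is arbitrary)

def make_buffer(size=BUF_SIZE):
    """Allocate a buffer filled with the known pattern."""
    return bytearray([PATTERN]) * size

def scan(buf):
    """Find bytes that no longer match the pattern, restore them,
    and return the (offset, corrupted_value) pairs seen."""
    flips = []
    for offset, value in enumerate(buf):
        if value != PATTERN:
            flips.append((offset, value))
            buf[offset] = PATTERN  # restore the pattern, keep watching
    return flips

def watch(buf, interval=3600):
    """Sleep a while, wake up, verify -- forever."""
    while True:
        time.sleep(interval)
        for offset, value in scan(buf):
            print(f"bit flip at offset {offset}: read {value:#04x}")
```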

Based on the error rates I'd seen published, I expected to catch a few errors.
For example, using the rate that Tomte's comment [2] cites I think I'd expect
about 6 errors a year.
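
Using the low end of the 700-1200 FIT/Mbit range from the comment cited above
(the exact FIT figure is an assumption), the expectation works out roughly
like this:

```python
FIT_PER_MBIT = 700       # low end of the cited 700-1200 FIT/Mbit range
BUFFER_MBIT = 128 * 8    # the 128 MB buffer, expressed in megabits
HOURS_PER_YEAR = 24 * 365

# FIT = failures per 10^9 device-hours
errors_per_hour = BUFFER_MBIT * FIT_PER_MBIT / 1e9
errors_per_year = errors_per_hour * HOURS_PER_YEAR
print(round(errors_per_year, 1))  # -> 6.3
```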

I never caught an error.

I also have two desktops with ECC (a 2008 Mac Pro and a 2009 Mac Pro). I've
used the 2008 Mac Pro every working day since I bought it in 2008, and the
2009 Mac Pro every day since I bought it in 2009. Neither of them has ever
reported correcting an error.

I have no idea why I have not been able to see an error.

[1] [http://pastebin.com/Bv56kVwC](http://pastebin.com/Bv56kVwC)

[2]
[https://news.ycombinator.com/item?id=10600308](https://news.ycombinator.com/item?id=10600308)

~~~
Ono-Sendai
Did you check the resulting (dis)assembly? If you compile with optimisations,
the reads from (and maybe the writes to) the RAM buffer may be optimised
away.

------
tshtf
Soft errors are fairly common; in fact, they enable problems in DNS
resolution such as bitsquatting:
[https://www.defcon.org/images/defcon-19/dc-19-presentations/...](https://www.defcon.org/images/defcon-19/dc-19-presentations/Dinaburg/DEFCON-19-Dinaburg-Bit-Squatting.pdf)

Anyone who has bought a popular bitsquatted domain name can attest to this.
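
The mechanism is easy to sketch: flip one bit at a time in a target hostname
and keep the variants that are still plausible hostnames (the domain below is
just an example):

```python
import string

# characters legal in a hostname label (hyphen included, dots handled below)
VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain):
    """All distinct single-bit-flip variants of `domain` that are still
    plausible hostnames. Flips inside the dots themselves are another
    class of squat, ignored here for simplicity."""
    variants = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped in VALID and flipped != ch:
                variants.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(variants)

# e.g. 'e' ^ 0x01 == 'd', so bitsquats("example.com") contains "dxample.com"
```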

~~~
baby
Also, errors in packet signatures from TLS handshakes
([http://cryptologie.net/article/294/factoring-rsa-keys-with-t...](http://cryptologie.net/article/294/factoring-rsa-keys-with-tls-perfect-forward-secrecy/)).

And I'm sure there are many other attack vectors exploiting this flaw.

------
Tomte
IEC 61508 documents an estimate of 700 to 1200 FIT/Mbit (FIT = "failures in
time", i.e. failures per 10^9 hours of operation) and gives the following
sources:

a) Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130
nm SRAM. J-L. Autran, P. Roche, C. Sudre et al. Nuclear Science, IEEE
Transactions on Volume 54, Issue 4, Aug. 2007 Page(s):1002 - 1009

b) Radiation-Induced Soft Errors in Advanced Semiconductor Technologies,
Robert C. Baumann, Fellow, IEEE, IEEE TRANSACTIONS ON DEVICE AND MATERIALS
RELIABILITY, VOL. 5, NO. 3, SEPTEMBER 2005

c) Soft errors' impact on system reliability, Ritesh Mastipuram and Edwin C
Wee, Cypress Semiconductor, 2004

d) Trends And Challenges In VLSI Circuit Reliability, C. Costantinescu, Intel,
2003, IEEE Computer Society

e) Basic mechanisms and modeling of single-event upset in digital
microelectronics, P. E. Dodd and L. W. Massengill, IEEE Trans. Nucl. Sci.,
vol. 50, no. 3, pp. 583–602, Jun. 2003.

f) Destructive single-event effects in semiconductor devices and ICs, F. W.
Sexton, IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603–621, Jun. 2003.

g) Coming Challenges in Microarchitecture and Architecture, Ronen, Mendelson,
Proceedings of the IEEE, Volume 89, Issue 3, Mar 2001 Page(s):325 – 340

h) Scaling and Technology Issues for Soft Error Rates, A Johnston, 4th Annual
Research Conference on Reliability Stanford University, October 2000

i) International Technology Roadmap for Semiconductors (ITRS), several papers.

If that's correct, the math is simple: you have bit flips in your PC about
once a day.
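
As a sanity check of that "once a day" figure, for a hypothetical 8 GB
machine (the RAM size is my assumption) at the low end of the cited range:

```python
FIT_PER_MBIT = 700        # low end of the 700-1200 FIT/Mbit estimate
RAM_MBIT = 8 * 1024 * 8   # assume an 8 GB machine, expressed in megabits

# FIT = failures per 10^9 device-hours
errors_per_hour = RAM_MBIT * FIT_PER_MBIT / 1e9
errors_per_day = errors_per_hour * 24
print(round(errors_per_day, 1))  # -> 1.1
```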

It's just that (a) you often won't notice those transient errors (one pixel in
your multi-megapixel photo is one bit off) and (b) a lot of your RAM is
probably unused.

~~~
mehrdada
> _It 's just that (a) you often won't notice those transient errors (one
> pixel in your multi-megapixel photo is one bit off) and (b) a lot of your
> RAM is probably unused._

Also, most modern processors use ECC for their caches (even when the main
memory is non-ECC) and they serve the vast majority of memory requests, so it
is unlikely that intermediate values in a tight computation are affected by
non-ECC RAM. That adds to the "silentness" aspect of the bit flip in consumer
systems.

------
cushychicken
These things do happen with reasonable frequency. I used to work at a
division of a major memory manufacturer that dealt with writing tests to
find the DIMMs that exhibited these sorts of failures - the semiconductor
industry calls them "variable retention time" failures. (Aside: numerous
PhDs in the field of semiconductor physics have built prosperous careers
trying to understand why these soft failures happen. Short answer: we have
some theories, but we don't really know.) It was provably worth millions of
dollars to be able to screen for this sort of phenomenon, because a Google or
an Apple or an IBM would return a whole manufacturing lot of your
bleeding-edge, high-margin DIMMs if they found one bit error in one chip of
one lot. Each lot was shipping for millions and millions of dollars.

------
CrLf
Anyone who has managed even a modest number of servers with ECC RAM for a
reasonable amount of time has surely seen ECC events in the hardware logs.
Most of these are one-time errors that never happen again on the same server,
ever.

Without ECC these errors would have unknown consequences. They could happen
in some unused region of memory, or they could happen in a dirty page in the
filesystem cache. It's not fun to discover that your filesystem has been
silently corrupted an unknown time after the fact.

Maybe Google doesn't need ECC. Their data is duplicated across several
machines and it's extremely unlikely that a few corrupt servers would lead to
any data loss.

However, on a smaller scale (and just like RAID) it's cheaper to have ECC than
add more servers for extra redundancy.

------
wmf
Or he could have waited a few months and gotten ECC anyway:
[http://ark.intel.com/products/88171/Intel-Xeon-Processor-E3-...](http://ark.intel.com/products/88171/Intel-Xeon-Processor-E3-1280-v5-8M-Cache-3_70-GHz)

~~~
yuhong
Interestingly, the only vendor which sells 16GB unbuffered ECC DDR4 DIMMs
seems to be Crucial:
[http://www.crucial.com/usa/en/ct16g4wfd8213](http://www.crucial.com/usa/en/ct16g4wfd8213)

~~~
wmf
E3 DIMMs have always been rare and usually expensive; I wish Intel would
enable regular registered DIMMs.

~~~
yuhong
The funny thing is that they are still so expensive when the x8 chips are so
cheap.

------
sebcat
What he's saying is essentially: "The code I write/the platform I choose
scales poorly over multiple cores. Therefore I decide to blame the hardware
and skip features that are good for me."

People need to adapt to a world where we have more cores instead of faster
execution per core. You can't compare late 90's growth in execution speed per
core with the situation we have today.

Write software for an environment where the number of cores scale, instead of
an environment where the execution speed of a single core is more important.

~~~
ketralnis
> What he's saying is essentially: "The code I write/the platform I choose
> scales poorly over multiple cores. Therefore I decide to blame the
> hardware and skip features that are good for me."

Is that so bad? He's writing and hosting the code, and he's paying the bill to
do it. Seems to me he should be able to pick how to do it.

------
vox_mollis
This cannot possibly be right. There was a DC21 talk regarding DNS request
misfires due to bit flips in non-ECC DRAM, and the researcher was able to
collect a surprisingly large number of requests on this basis.

Edit: found it:
[https://www.youtube.com/watch?v=ZPbyDSvGasw](https://www.youtube.com/watch?v=ZPbyDSvGasw)

~~~
ketralnis
Importantly, those DNS packets go through a number of systems that are not
clients or servers. Wifi, microwave antennae, undersea cables, consumer
routers, unpowered hubs, you name it. It's hard to know whether these bit
flips are actually coming from cosmic rays or EM interference or rare
decompression bugs.

------
Animats
If soft errors are rare, parity checking, without correction, might be more
useful. It's better to have a server fail hard than make errors. In a "cloud"
service, the systems are already in place to handle a hard failure and move
the load to another machine. Unambiguous hardware failure detection is exactly
what you want.

~~~
mehrdada
In practice, you basically get one-bit error correction 'for free' when you
have enough redundancy to detect two-bit soft errors. Simple parity can only
detect a single bit flip, so if you want to catch two-bit errors, you might
as well correct the one-bit errors you find along the way at no extra cost.
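
A toy illustration of that tradeoff with an extended Hamming(8,4) code (real
DRAM ECC uses much wider words, typically 72 bits protecting 64, but the
correct-one/detect-two behavior is the same):

```python
from functools import reduce
from operator import xor

def encode(d):
    """Extended Hamming(8,4): 4 data bits -> 8-bit SECDED codeword.
    Layout: [p0, p1, p2, d1, p4, d2, d3, d4], where p0 is an overall
    parity bit and p1/p2/p4 sit at Hamming positions 1, 2, 4."""
    c = [0, 0, 0, d[0], 0, d[1], d[2], d[3]]
    c[1] = c[3] ^ c[5] ^ c[7]  # parity over positions with bit 0 set
    c[2] = c[3] ^ c[6] ^ c[7]  # parity over positions with bit 1 set
    c[4] = c[5] ^ c[6] ^ c[7]  # parity over positions with bit 2 set
    c[0] = reduce(xor, c)      # make the whole codeword XOR to zero
    return c

def decode(c):
    """Return (data_bits, status); status is 'ok', 'corrected', or
    'uncorrectable' (a detected double-bit error)."""
    c = list(c)
    syndrome = reduce(xor, (pos for pos in range(1, 8) if c[pos]), 0)
    parity_bad = reduce(xor, c)      # 1 iff overall parity is violated
    if syndrome and parity_bad:      # exactly one flip: fix it
        c[syndrome] ^= 1
    elif syndrome:                   # two flips: detect, refuse to guess
        return None, "uncorrectable"
    elif not parity_bad:             # clean codeword
        return [c[3], c[5], c[6], c[7]], "ok"
    # syndrome == 0 but parity bad: the p0 bit itself flipped
    return [c[3], c[5], c[6], c[7]], "corrected"
```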

------
scurvy
I don't think that data corruption was a huge issue for Google back then
(really early on). Corrupt data? Big whoop. Re-index the internet in another X
hours, and it's gone. I doubt they had much persistent storage as most of
their data was transient and well, the Internet.

Also, I still see "fire hazard" when I look at the early Google racks. No idea
how Equinix let them get away with it. Too much ivory tower going on there.
Not enough "you know we're liable if we burn down the colo with that crap,
right?"

~~~
upofadown
There is no extra chance of a short circuit before the power supply. After
the power supply, the power is limited, either by explicit current limiting
or just because they are switching power supplies, where transformer
saturation limits the power.

So you could have a PCB fire, but PCBs are made to be flame retardant. You
could have a wire insulation fire, but the amount of material would be so low
that it wouldn't be able to start a fire anywhere else.

So I am basically saying there isn't really anything there that could sustain
a fire and that there isn't a lot of energy to start ignition in the first
place.

~~~
scurvy
Cardboard breaks down over time. It turns into particulate matter that goes
airborne into really hot server intakes and comes out as tiny little burning
embers.

If it didn't burn down Google's stuff, it could have burned down other
people's gear. I have decades of experience here; I'm not an ivory tower nerd.
Any datacenter/colo provider worth a salt will jump on you immediately for
having cardboard in your environment. DRT makes you unbox everything outside
the various colos and won't even let cardboard enter.

~~~
upofadown
>comes out tiny little burning embers.

The auto-ignition temperature of paper is over 200C. The maximum junction
temperature of most electronics is somewhere around 100C. This literally
could not have ever happened unless the equipment was already on fire.

I'll leave the idea that cardboard breaks down fast enough to be noticed over
the life of a server to someone more knowledgeable. I note that there was no
mention of cardboard in the article.

~~~
scurvy
The motherboards were placed directly on cardboard trays. It says that in the
article.

------
devit
The article is wrong.

The Xeon E3-1270 v5 goes from 3.6 to 4.0 GHz and costs only 10% more than
the i7-6700 (3.4-4.0 GHz).

Also, the Xeon E3-1230 v5 goes from 3.4 to 3.8 GHz (same base clock) and costs
less than the Core i7-6700.

In general, you should never buy non-Xeon CPUs if you have the choice, both
for desktop and for servers, since ECC memory is essential if you don't want
to have a significant chance of having to replace your RAM after discovering
mysterious problems with your system.

------
venomsnake
Isn't it a simple enough calculation?

Will someone die if the data gets corrupted? No? Then no ECC should be
enough. And you should have checksums everywhere anyway.

~~~
jo909
Where do you create that checksum? If it's on a computer without ECC, you
will just checksum the data including the error, then write that data, error
and all, to disk.

What happens to the data after you have read it into memory and successfully
verified the checksum? You probably process it in memory, and afterwards have
no idea whether the changes are due to your code or due to errors.

Of course, you could now propose to also checksum and re-verify the data
while it is in memory. Which is basically what ECC does: in hardware,
cheaply, and requiring no CPU cycles.
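
A trivial demonstration of the first point: if a bit flips in memory before
the checksum is computed, the checksum faithfully covers the corrupted bytes
and verification succeeds anyway (the payload and the flipped bit are, of
course, contrived):

```python
import zlib

payload = bytearray(b"transfer $100 to account 12345")

# A bit flips in RAM *before* the checksum is computed...
payload[10] ^= 0x04   # '1' ^ 0x04 == '5': a silent single-bit error

# ...so the checksum is computed over the already-corrupt data, and any
# later verification succeeds anyway.
checksum = zlib.crc32(payload)
assert zlib.crc32(payload) == checksum  # "verified", corruption and all
print(payload.decode())                 # -> transfer $500 to account 12345
```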

