
Why use ECC? - benkuhn
http://danluu.com/why-ecc/
======
morelikeborelax
I've had ECC on my workstations since 2006, when Intel forced it on DP systems
due to FB-DIMMs. Given that I've had a good number of correctable memory
errors over the past decade, I don't feel I can go back to not having it; it
wouldn't make any sense when the cost is so low. It was also the case that I
couldn't get non-ECC 8GB and 16GB DIMMs when I built some of my systems, so I
had to use it regardless.

Do people need it? Nah, probably not for systems that can handle crashing, but
you'd be nuts not to use it in servers or systems running long jobs - it's
just a single, small insurance payment on your system that protects against a
remote but real risk.

Sadly there are very few studies into it that show how often modern DIMMs
still get correctable errors. Manufacturing processes are much better, but
they haven't eliminated the need for it.

I do enjoy the "why would I need it, I've never had memory errors" attitude,
though, when people likely have no idea why their application or OS crashed.
And the accounts of people who eventually diagnose memory errors after weeks
of random crashes that would have been reported immediately if they had ECC.

~~~
nextos
This is a big dilemma I have. I'm trying to build a workstation similar to
Nvidia's reference design for deep learning:

[https://developer.nvidia.com/devbox](https://developer.nvidia.com/devbox)

I will be doing deep learning and other ML GPU-powered tasks. Plus some long
running high-memory I/O intensive tasks.

Note that Nvidia's build does not use ECC RAM. And it's quite an expensive
machine. Mine will be only a fraction of the cost ($4.5k), with just one Titan
X. I could afford a Xeon, but that comes at the cost of buying slower
hardware. What should I do?

Intel's segmentation of the market, which limits the amount of RAM you can use
with regular CPUs and removes ECC support, sucks.

~~~
dman
AMD has much cheaper points of entry to ECC.

~~~
codinghorror
Yeah, I was noticing this. AMD is much more ECC-friendly, whereas Intel
segments its markets hard by arbitrarily disallowing ECC platform use on
common consumer (read: not-Xeon) CPUs. Unfortunately AMD is still quite a bit
slower than Intel these days, but if you are doing work that is GPU- rather
than CPU-intensive, it might not matter.

~~~
zf00002
> disallowing ECC platform use on common consumer (read: not-Xeon) CPUs

this hasn't really been true since Haswell. Now even low-end Celerons and
Pentiums support it.

~~~
wmf
One may suspect that some Celeron/Pentium/i3 SKUs support ECC just so that
Intel doesn't have to manage low-volume very-low-end E3 SKUs. And then you run
into the fact that the cheapest server motherboard is double the price of a
cheap desktop one.

~~~
makomk
From what I can tell, some Celeron/Pentium/i3 SKUs support it because AMD was
killing Intel in certain NAS applications, thanks to the fact that their
low-end chips had ECC.

------
codinghorror
I just want to be clear that in the original referenced article, I am not
anti-ECC per se, I just found myself caught in the massive cognitive
dissonance between "you must have ECC in all your computers otherwise they
will constantly and silently corrupt your data + crash" and "statistically
speaking, most computers in the world do not use ECC". How can both of these
things be true?

The argument for ECC is credible (I personally think rowhammer is the best
example of this actually mattering, but ironically a) you can rowhammer ECC
memory just fine and b) DDR4 has hardware features to mitigate rowhammer --
which shows how quickly things are changing), but it also seems to hinge on
whether you have hundreds to thousands of computers all working together, e.g.
the positive effects of ECC only seem to matter enough statistically at a
_very_ large scale.

~~~
slavik81
> "you must have ECC in all your computers otherwise they will constantly and
> silently corrupt your data + crash" and "statistically speaking, most
> computers in the world do not use ECC". How can both of these things be
> true?

What makes you say that? Both those things being true simply means that most
computers silently corrupt your data and crash. That matches my experience. My
programs occasionally crash. My pictures and videos are occasionally
corrupted.

Do I know that those events are caused by memory failures? No. Most of them
are probably other sorts of software or hardware failures, but some could be
memory errors.

~~~
codinghorror
I've never had a photo or video corrupted on any computer I've ever owned,
going back to 1985. Crashes? Sure, who hasn't.

"Some could be.." is computing by coincidence, and I'm not a fan of that
logic. You can refer to the 2007, 2009, and 2012 studies for measured data on
server farms.

~~~
darkmighty
A few bit flips in an image are unlikely to be noticeable (it depends on the
format, though; uncompressed is obviously the most resilient, and JPEG should
be reasonably resistant too).

Also, I don't think images stay in RAM for very long. Your HDD definitely has
ECC.

~~~
Moru
Actually, JPEG is not very resilient; there are art forms that make single-bit
changes to JPEG files to end up with very strange images.

~~~
darkmighty
Those are adversarial errors; I think JPEG is pretty good against random
errors. I'll do a test sometime.
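A minimal sketch of such a test (assuming Python; `photo.jpg` is a placeholder
filename, not a real file): flip a single random bit in a file's bytes and see
whether the image still decodes cleanly in a viewer.

```python
import random

def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of `data` with exactly one randomly chosen bit flipped,
    simulating a single-bit memory error."""
    buf = bytearray(data)
    byte_index = rng.randrange(len(buf))  # pick a byte at random
    bit_index = rng.randrange(8)          # pick a bit within that byte
    buf[byte_index] ^= 1 << bit_index
    return bytes(buf)

# To try it on a JPEG (hypothetical filename), write out the corrupted copy
# and open it in a viewer to look for visible artifacts:
#
#   rng = random.Random()
#   original = open("photo.jpg", "rb").read()
#   open("corrupted.jpg", "wb").write(flip_random_bit(original, rng))
```

Whether a flip is visible depends heavily on where it lands: a flipped bit in
a JPEG header or Huffman table can garble the whole image, while one in the
entropy-coded data often only smears a few blocks.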

------
orf
> Alternately, it might be a plan to create literal cloud computing.

Thanks, I just snorted tea onto my keyboard after reading that. Seems like if
you have the money and are maintaining a pretty critical system it would be
silly not to get ECC RAM (and if you're building your own iron the price
difference isn't that much as far as I can tell).

On the EC2 site they say "In our experience, ECC memory is necessary for
server infrastructure, and all the hardware underlying Amazon EC2 uses ECC
memory"[1]. Amazon maintain a _lot_ of servers and if they think it's
necessary I'm inclined to believe them.

1\. [https://aws.amazon.com/ec2/faqs/](https://aws.amazon.com/ec2/faqs/)

~~~
codinghorror
A big part of the value proposition of ECC does seem to hinge on having
thousands, and perhaps _many_ thousands, of machines working together. At a
large enough scale, even small probabilities start to matter.

~~~
Alupis
It's more a value proposition of how long the machine will be running. Given
enough time, it will get corrupt bits. You don't need thousands of machines to
experience memory corruption...

------
melted
I think ECC is inevitable. With 128GB DIMMs being produced now and NV-DIMMs
(DDR4 flash-on-dimm) just around the corner, some kind of hardware error
detection is necessary.

This is similar to high capacity spinning drives. With smaller ones you could
just go with RAID5 and not worry about anything, but when drives are 3-4TB and
up, you have to use RAID6, because the spec error rate becomes too high to
rely on a single parity drive.
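The arithmetic behind that can be sketched as follows (illustrative numbers
only; the 1e-14 unrecoverable-read-error rate per bit is a typical
consumer-drive spec figure, not from any particular datasheet):

```python
import math

def p_rebuild_failure(drive_tb: float, surviving_drives: int,
                      ure_per_bit: float = 1e-14) -> float:
    """Probability of hitting at least one unrecoverable read error (URE)
    while reading every surviving drive during a single-parity rebuild.
    Models UREs as independent events at the given per-bit rate."""
    bits_read = drive_tb * 1e12 * 8 * surviving_drives
    return 1.0 - math.exp(-ure_per_bit * bits_read)

# A RAID 5 of six 4TB drives (five survivors to read on rebuild) has
# roughly an 80% chance of a URE during the rebuild with these numbers,
# which is why a second (or third) parity drive becomes attractive.
p = p_rebuild_failure(4, 5)
```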

Here, too, when your machine has 1-2TB of hybrid RAM/NVM, you HAVE TO have
some way to detect failures, even if it's not particularly good. The
performance characteristics of RAM preclude more robust algorithms such as a
wide (32-bit) CRC (a narrower CRC could still be doable in hardware, though),
but parity is a complete no-brainer as the first step.

~~~
snaky
RAID 6 is not an option with big enough drives because of rebuild time.

[https://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/](https://storagemojo.com/2010/02/27/does-raid-6-stops-working-in-2019/)

~~~
mrb
Nitpick: neither RAID 6 nor RAID 5 is an option. Leventhal's point is that
triple-parity (e.g. ZFS raidz3) becomes necessary.

~~~
snaky
I think the main point is that there is no "RAID is the solution for all our
problems" anymore.

------
mmagin
"If ECC were actually important, it would be used everywhere and not just
servers."

Ha. I wish I could get laptops with ECC RAM.

~~~
snaky
Thinkpad P50 (Xeon E3-1535M v5, ECC DDR4, up to 3 drives including PCIe NVMe)

Thinkpad P70 (Xeon E3-1505M v5, ECC DDR4, up to 4 drives including PCIe NVMe)

[http://www.lenovo.com/psref/pdf/ThinkPad.pdf](http://www.lenovo.com/psref/pdf/ThinkPad.pdf)

~~~
cat-dev-null
"... DDR4, ECC or non-ECC ( _ECC function supported only on Xeon processor_ ),
dual-channel capable, four DDR4 SO-DIMM sockets ..."

------
Spooky23
Atwood's original article was puzzling to me, and the conclusions just didn't
compute.

I can recall at least a half dozen times when I was a DBA in olden times that
ECC either corrected errors or was essential in isolating faults on my
Informix and later Oracle boxes, running mostly on Sun and RS/6000 at the
time.

Sun had a nice habit of shipping defective CPUs and memory in the late 90s.
The details are foggy, but I remember correlating ECC faults to long
transactions that would fail, and getting a bunch of stuff out of Sun.

Then again, that was 15+ years ago, so maybe the newfangled memory we have
these days is more reliable.

~~~
specialist
Here's James Gosling's account of radioactive RAM chips used in the UltraSparc
II...

[http://nighthacks.com/roller/jag/entry/at_the_mercy_of_suppl...](http://nighthacks.com/roller/jag/entry/at_the_mercy_of_suppliers)

 _When Sun folks get together and bullshit about their theories of why Sun
died, the one that comes up most often is another one of these supplier
disasters. Towards the end of the DotCom bubble, we introduced the UltraSPARC-
II. Total killer product for large datacenters. We sold lots. But then reports
started coming in of odd failures. Systems would crash strangely. We'd get
crashes in applications. All applications. Crashes in the kernel. Not very
often, but often enough to be problems for customers. Sun customers were used
to uptimes of years. The US-II was giving uptimes of weeks. We couldn't even
figure out if it was a hardware problem or a software problem - Solaris had to
be updated for the new machine, so it could have been a kernel problem. But
nothing was reproducible. We'd get core dumps and spend hours poring over
them. Some were just crazy, showing values in registers that were simply
impossible given the preceding instructions. We tried everything. Replacing
processor boards. Replacing backplanes. It was deeply random. Its very
randomness suggested that maybe it was a physics problem: maybe it was alpha
particles or cosmic rays. Maybe it was machines close to nuclear power plants.
One site experiencing problems was near Fermilab. We actually mapped out
failures geographically to see if they correlated to such particle sources.
Nope. In desperation, a bright hardware engineer decided to measure the
radioactivity of the systems themselves. Bingo! Particles! But from where?
Much detailed scanning and it turned out that the packaging of the cache ram
chips we were using was noticeably radioactive. We switched suppliers and the
problem totally went away. After two years of tearing our hair out, we had a
solution.

But it was too late. We had spent billions of dollars keeping our customers
running. Swapping out all of that hardware was cripplingly expensive. But even
worse, it severely damaged our customers' trust in our products. Our biggest
customers had been burned and were reluctant to buy again. It took quite a few
years to rebuild that trust. At about the time that it felt like we had
rebuilt trust and put the debacle behind us, the Financial Crisis hit..._

~~~
gh02t
We use high-spec systems for storing data coming off radiation detectors in
experiments (which can be in the multiple GB/s of data with high end
digitizers). You can bet we use ECC for that; we made sure to after one
experiment got ruined by memory corruption...

------
rando289
I looked at the Atwood article, which really didn't give useful numbers,
except the one citation. I opened that, found the FIT number, converted it to
~0.5 errors per year, and thought, eh, skipping ECC is fine.

I didn't notice this: "From the graph above, we can see that a fault can
easily cause hundreds or thousands of errors per month." Now I want ECC again.
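For reference, the FIT conversion is simple (sketch; the 57,000 FIT figure
below is just an illustrative value that happens to work out to about half an
error per year, not a number taken from the cited studies):

```python
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours

def fit_to_errors_per_year(fit: float, devices: int = 1) -> float:
    """Convert a FIT rate (failures per 10^9 device-hours) to the
    expected number of errors per year across `devices` devices."""
    return fit * devices * HOURS_PER_YEAR / 1e9

# An illustrative ~57,000 FIT part works out to about 0.5 errors/year
# on one machine, but a fleet of 1,000 such machines sees ~500/year,
# which is why scale changes the calculus:
one_machine = fit_to_errors_per_year(57_000)
fleet = fit_to_errors_per_year(57_000, devices=1_000)
```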

~~~
codinghorror
Drill into the 3 referenced studies, they are all reasonably recent (2007,
2009, 2012) and contain specifics/data.

------
hrez
Speaking of Google's shipping containers. Sun had that too-
[https://en.wikipedia.org/wiki/Sun_Modular_Datacenter](https://en.wikipedia.org/wiki/Sun_Modular_Datacenter)

"A data center of up to 280 servers can be rapidly deployed by shipping the
container"

------
aidenn0
The last time I built a system with ECC, it was impossible to tell which
CPU/motherboard combos supported ECC; the time before that was when AMD had
super-affordable systems with ECC support, but I gave up trying to figure out
if AMD even supported ECC on their workstation parts.

------
caycep
The problem is, there are also additional hidden costs to using ECC, mostly
due to what is available in the ecosystem. Namely: I want ECC. Great, then I
need to get an X99 board with Xeon chips running at higher TDPs. Depending on
the case I use, I would then need to upgrade the PSU and fans/coolers.

Or (especially in the mini-ITX world) I can choose one of the server boards
or "workstation" C236 boards. Then I would lose official desktop Windows
support, or an M.2 SSD slot, or onboard sound, or other "desktop workstation"
features.

It is still not easy to do ECC in this day and age...

------
brandon272
From my point of view, it boils down to how critical stability and uptime are
within any given environment. I'm not going to wring my hands over whether or
not the hardware powering my todo-list startup includes ECC RAM. I would
wring my hands over it if I were building a system for military, healthcare,
or critical infrastructure applications, to use a few examples.

~~~
teddyh
Saying that only “ _military, healthcare, or critical infrastructure
applications_ ” should bother using ECC RAM is akin to claiming that only
banks should use HTTPS.

~~~
brandon272
Well, I didn't say that. Additionally, I don't consider error correction to be
analogous to SSL.

------
santaclaus
Is it possible to get some of the benefits of ECC without ECC by serializing
out the entire state of a program at some set rate? For example, one could
read in the n'th serialized checkpoint, run the program to the n+1'th
checkpoint, and compare the original n+1'th checkpoint to the new n+1'th
checkpoint. If these differ, cosmic rays flipped a bit in the interim. This
would, of course, break if the code itself doesn't guarantee bit compatible
results over multiple runs (due to the use of certain parallel algorithms,
etc). I suppose this would double the runtime, however...
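A minimal sketch of that idea (assuming Python, a deterministic step function,
and pickle-able state; the `step` function below is a hypothetical stand-in
for one unit of work):

```python
import hashlib
import pickle

def checkpoint(state) -> bytes:
    """Serialize program state into a comparable blob."""
    return pickle.dumps(state)

def verify_step(prev_blob: bytes, step, expected_blob: bytes) -> bool:
    """Re-run one deterministic step from checkpoint n and compare the
    result against the stored checkpoint n+1. A mismatch suggests a
    transient error (e.g. a bit flip) occurred somewhere in between."""
    recomputed = checkpoint(step(pickle.loads(prev_blob)))
    digest = hashlib.sha256(recomputed).digest()
    return digest == hashlib.sha256(expected_blob).digest()

# Hypothetical deterministic unit of work:
def step(state):
    return {"i": state["i"] + 1, "acc": state["acc"] * 2}
```

As noted above, this doubles the runtime, and it only works if the step is
bit-for-bit reproducible; any nondeterminism (thread scheduling, parallel
floating-point reduction order) would produce false positives.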

~~~
wmf
2x time overhead vs. 10% cost overhead?

~~~
awqrre
the cost is not everything... sometimes it is impossible to get the features
that you need in the form factor that you are seeking

------
shin_lao
ECC is a requirement for servers when we validate installations of our
database product.

Basically, the takeaway is that without ECC you can expect a memory error
every two days on machines with a lot of RAM.

------
Confiks
Is there anyone else who consistently reads "ECC" as "Elliptic Curve
Cryptography" and is disappointed by the subject of the article?

~~~
Perseids
I indeed do and was. But then again, not being able to have ECC RAM on my
laptop has been a pet peeve of mine for quite a long time, so after my initial
disappointment I could also enjoy the surprise article :)

