
Should I buy ECC memory? (2015) - colinprince
https://danluu.com/why-ecc/
======
nostrademons
While I was at Google, someone asked one of the very early Googlers (I think
it was Craig Silverstein, but it may've been Jeff Dean) what was the biggest
mistake in their Google career, and they said "Not using ECC memory on early
servers." If you look through the source code & postmortems from that era of
Google, there are all sorts of nasty hacks and system design constraints that
arose from the fact that you couldn't trust the bits that your RAM gave back
to you.

It saved a few bucks in a time period when Google's hardware costs were
rising rapidly, but the knock-on effects on system design cost much more than
that in lost engineer time. Data integrity is one engineering constraint that
should be pushed as low down in the stack as is reasonably possible, because
as you get higher up the stack, the potential causes of corrupted data
multiply exponentially.

~~~
sytelus
Google has done extensive studies[1]. There is roughly a 3% chance of error
per DIMM per year. That doesn't justify buying ECC if you have just one
personal computer to worry about. However, if you are in a data center with
100K machines, each with 8 DIMMs, you are looking at about 6K machines
experiencing RAM errors _each day_. Now, if data is being replicated, these
errors can propagate corrupted data in unpredictable, unexplainable ways even
when there are no bugs in your code! For example, you might find your logs
containing bad line items which get aggregated into a report showing bizarre
numbers, because 0x1 turned into 0x10000001. You can imagine that debugging
this every day would be a huge nightmare, and developers would eventually end
up inserting lots of asserts for data consistency all over the place. So ECC
becomes important if you have a large-scale distributed system.

1:
[http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf](http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf)

~~~
loeg
> There is roughly 3% chance of error in RAM per DIMM per year. That doesn't
> justify buying ECC if you have just one personal computer to worry about.

How do you make that leap?

~~~
user5994461
It's an inappropriate leap. Consumers should have ECC memory too.

However the consumer market has long decided to settle for ECC nowhere and
cheap everywhere.

ECC hardware comes at a premium that can easily be +100%. You need support in
the memory, the motherboard, and the CPU.

Given the price difference, personal computers will have to live with the
memory errors. People will not pay double for their computers. Manufacturers
will not sacrifice their margin while they can segment the market and make a
ton of money off ECC.

~~~
michaelmrose
AMD has modestly priced hardware that supports ECC.

~~~
rfrank
Was that the case before Ryzen? I know their new CPUs support ECC, but I'm not
sure for earlier generations.

~~~
yuhong
I think it was common for AM3 for example too.

~~~
qb45
ECC is officially supported by all AM2/3(+) CPUs and AFAIK all corresponding
motherboards from ASUS. As in, you have it guaranteed on the spec sheet.

There are also reports of BIOS support in some boards which don't have ECC
advertised. And you can try to enable it in the OS even without BIOS support,
though some level of _hardware_ support is still necessary. As Linux
documentation puts it: "may cause unknown side effects" :)

------
olavgg
Can people here please stop posting that ZFS needs ECC memory? Every
filesystem, whether FAT, NTFS, or ext4, runs more safely with ECC memory.
ZFS is actually one of the few that can still be safer if you don't run with
ECC memory. Source: Matthew Ahrens himself:
[https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=...](https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271)

~~~
lomnakkus
Indeed. It's _true_ that the data _may_ be corrupted before hitting any
disk[1], but once it _has_ hit the disks (>1), it's extremely unlikely that
you'll ever hit a similar bit error where it'll mistakenly choose the wrong
disk block to recover from.

The main point of e.g. ZFS or Btrfs checksumming is that a) _at least it
isn't getting worse_, and b) I can _tell_ if it's getting worse.

[1] ... but if the bits are not generated by the machine that is actually
saving them to disk, how do you know they weren't corrupted along the way?
The number of people who religiously check PGP signatures/SHA256sums or
whatever is minuscule.

~~~
derefr
> The number of people who religiously check PGP signatures/SHA256sums or
> whatever is minuscule.

• If you transfer things around using BitTorrent, it'll ensure you always end
up with a file that hashes correctly to the sum it originally had when the
.torrent file was constructed.

• Many archive formats (zip, rar, and 7z, at least) contain checksums, and
archival utilities validate those checksums during extraction, refusing to
extract broken files. "Self-extracting archive" executables that use these
formats inherit this property.

• Some common disk-image formats (dmg, wim) embed a checksum that checks the
whole disk-image during mount, and will refuse to mount a bad one. (I believe
you can then try to "repair" the disk image with your OS's disk-repair
utility, if you have no other copies.)

• Web pages increasingly use Sub-Resource Integrity attributes on things like
.css and .js files, protecting _them_ (though not _the page itself_ ) from
errors.

• ISO files don't embed checks, but all the common package formats (Windows
.cab and .msi; Linux .deb and .rpm; macOS .pkg) on _installer_ ISOs embed
their own checksums and often signatures.

• git repos are 'protected' insofar as you won't be able to sync mis-hashed
objects from a remote, so they won't spread.

Really, looking over all that, it's only 1. plain binary executables, and 2.
"media files" (images, audio, video)—and only when retrieved over a "dumb"
protocol, rather than a pre-baked-manifest protocol like BitTorrent or
zsync—that are "risky" and in need of explicit checksum comparison.
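For those remaining risky cases, the "religious check" is easy to script; as a
minimal sketch, here is the streaming hash that tools like `sha256sum` compute:

```python
import hashlib

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in chunks, like the `sha256sum` tool."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing the result against a digest published by the origin catches
corruption accumulated anywhere along the way, not just on one hop.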

~~~
comex
Both macOS/iOS and Windows use code signing for executables, which should
guard against most types of corruption.

Web pages (and anything else) transmitted over HTTPS are protected from
corruption in transit by TLS's hashing (which is vastly stronger than the
checksums at lower levels of the network stack), though that doesn't help if
the server has faulty memory or storage.

PNG has built-in checksums, though other common image formats (e.g. JPEG)
don't. Not sure about video.
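To illustrate PNG's built-in checksums: each chunk stores a CRC-32 computed
over its type and data, so corruption can be detected with a short scan over
the file (a sketch, not a full parser; it assumes a well-formed chunk layout):

```python
# Sketch: walk a PNG's chunks and verify each chunk's CRC-32.
# zlib.crc32 uses the same polynomial that the PNG spec mandates.
import struct
import zlib

PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def verify_png(data):
    """Return (ok, first_bad_chunk_type) for a PNG byte string."""
    assert data[:8] == PNG_MAGIC, "not a PNG"
    pos = 8
    while pos < len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        chunk = data[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack(">I", data[pos + 8 + length:pos + 12 + length])
        if zlib.crc32(ctype + chunk) & 0xFFFFFFFF != crc:
            return False, ctype.decode("ascii", "replace")
        pos += 12 + length
    return True, None
```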

~~~
derefr
> Web pages (and anything else) transmitted over HTTPS are protected from
> corruption in transit by TLS's hashing (which is vastly stronger than the
> checksums at lower levels of the network stack), though that doesn't help if
> the server has faulty memory or storage.

I didn't bring this one (or any other transport-level checksums) up, because
we were talking about whether you can trust something "across the whole
process"—from its origin developer's disk (where it might get an initial
explicit checksum generated), to origin memory, across the network to a
server's memory, to that server's disk, over the network again to a CDN
reverse-proxy's memory, maybe its disk, then the network again to you, then
_your_ memory, _your_ disk, and finally your memory again as you verify it.
Oh, and a bunch of routers and switches in between, of course.

Static checksums that are baked into file formats or manifest files protect
the file across that _whole_ chain. Transport-level checksums only ensure that
the one part they're involved in happened correctly.

------
blackflame7000
Altitude also plays a factor in random memory corruption.

From the wikipedia article on ECC Ram, "Hence, the error rates increase
rapidly with rising altitude; for example, compared to the sea level, the rate
of neutron flux is 3.5 times higher at 1.5 km and 300 times higher at 10–12 km
(the cruising altitude of commercial airplanes).[3] As a result, systems
operating at high altitudes require special provision for reliability."

~~~
gizmodo59
I wonder if this has anything to do with Microsoft's plans to build an
underwater data center.

~~~
rorosaurus
If I remember correctly, that research venture was mostly due to the potential
of easy heat exchange and "free" energy via geothermal/tidal. Now that you
mention this, though, it's clear that such a datacenter would also be
naturally shielded from many things!

------
spullara
I reproduced this by bit-squatting cloudfront.net after reading about it. So
many memory errors!

[http://dinaburg.org/bitsquatting.html](http://dinaburg.org/bitsquatting.html)

Loved the variety as well. Sometimes, even though the requests came to me,
the Host header was correct!

~~~
codinghorror
Wait so when someone typoes cnn.com as con.com, that is ipso facto a memory
error? I guess I could see that if the characters are radically far apart on
the keyboard? But doesn't a simpler explanation like "one person out of
billions with Internet access typed the wrong thing" seem a lot more likely?

~~~
duskwuff
Domains that are only used by CDNs, like cloudfront.net, are almost never
typed into an address bar. Errors in the domain name are more frequently the
result of a bit-flip error.

~~~
spullara
Also, these typos are typically not easy ones, since flipping a bit changes
the letter in ways that are unlikely typos. With cloudfront.net, a negligible
number of people would be typing it at all. Close to 100% of the errors that
I saw were loads of images, CSS, or JavaScript files that some other page
depended on.
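The mechanics are easy to demonstrate: flip each bit of each character in a
domain and keep the results that are still valid hostname characters. This is
essentially how bitsquat candidate domains are generated (a sketch of the
idea, not the linked article's exact code):

```python
# Sketch: generate single-bit-flip "bitsquat" candidates for a domain name,
# keeping only flips that still yield valid hostname characters.
import string

VALID = set(string.ascii_lowercase + string.digits + "-")

def bitsquats(domain):
    out = set()
    for i, ch in enumerate(domain):
        if ch == ".":
            continue  # don't flip the label separators
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped != ch and flipped in VALID:
                out.add(domain[:i] + flipped + domain[i + 1:])
    return sorted(out)
```

Running it on cloudfront.net yields names like bloudfront.net, which nobody
would plausibly type but which a single flipped bit in a client's memory
produces naturally.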

------
veidr
Yes. Everybody reading this should use ECC RAM, and non-ECC RAM should be
called "error-propagating RAM".

Random bit flips aren't cool, and they happen regularly. Most computers that
have ECC RAM can report when errors happen. I see them at least once a year
or so. For instance, here are two ECC-correctable memory errors that occurred
just last month.[1][2]

Cosmic rays? Fukushima phantom? Who knows. You'll never know why they happen
(unless it's like a bad RAM module and they happen a lot), but if you don't
rock ECC you will never know they happened at all. You'll be left guessing
when, years later, some encrypted file can no longer decrypt, and all the
backups show the same corruption...

[1]:
[https://www.dropbox.com/s/zndvy3nkv1jipri/2017-03-20%20FUCK%...](https://www.dropbox.com/s/zndvy3nkv1jipri/2017-03-20%20FUCK%20memory%20errors.png?dl=0)

[2]:
[https://www.dropbox.com/s/6yeoedc7ajzq4u9/2017-03-20%20FUCK%...](https://www.dropbox.com/s/6yeoedc7ajzq4u9/2017-03-20%20FUCK%20memory%20errors%20detail.png?dl=0)

~~~
jandrese
I remember the one time I bought ECC memory, for a PII-400. It was only 512MB
or so, I think, but in the 12 years that server ran I saw a grand total of 1
corrected error in the logs. Given what a premium that ECC memory commanded,
it felt like a waste.

------
ReligiousFlames
An old article from DJB worth perusal:
[http://cr.yp.to/hardware/ecc.html](http://cr.yp.to/hardware/ecc.html)

It's also worth noting that not all ECC (SECDED) is created equal: ChipKill™
and similar might not survive physical damage, since shorts across the data
bus are likely, but recovering from a single malfunctioning chip that
produces or experiences a higher hard-error rate is possible.
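For intuition about how SECDED works: a Hamming code locates any single-bit
error via its syndrome, and one extra overall-parity bit distinguishes single
errors (correctable) from double errors (detectable only). A toy sketch over
one data byte; real DIMMs do this in hardware over 64-bit words with 8 check
bits:

```python
# Toy SECDED: Hamming code over positions 1..12 (parity at powers of two)
# plus an overall parity bit at position 0 for double-error detection.

DATA_POSITIONS = [p for p in range(1, 13) if p & (p - 1)]  # non powers of two

def encode(data_bits):
    """Encode 8 data bits (list of 0/1) into a 13-bit SECDED codeword."""
    code = [0] * 13
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        code[pos] = bit
    for p in (1, 2, 4, 8):  # each parity bit covers positions with bit p set
        code[p] = sum(code[i] for i in range(1, 13) if i & p) % 2
    code[0] = sum(code) % 2  # overall parity: total number of 1s is even
    return code

def decode(code):
    """Correct a single-bit error, or raise on a double-bit error."""
    syndrome = 0
    for p in (1, 2, 4, 8):
        if sum(code[i] for i in range(1, 13) if i & p) % 2:
            syndrome |= p  # syndrome bits spell out the error position
    overall = sum(code) % 2
    if syndrome and overall:        # one flipped bit: fix it in place
        code = list(code)
        code[syndrome] ^= 1
    elif syndrome:                  # syndrome set but parity even: two flips
        raise ValueError("uncorrectable double-bit error")
    return [code[p] for p in DATA_POSITIONS]
```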

Also, it'd be really cool if some shop a la Backblaze blogged about large-
scale monitoring for soft and hard RAM errors across chip/module combinations
(plus motherboards & CPUs). Without collecting and revealing years of data
from real use, the conversation devolves into opinion and conjecture.

Finally, not all use cases can benefit from ECC (e.g. Angry Birds), but there
are some obvious/non-obvious ones that can (e.g. DNS bit-squatting against
non-ECC routers, or processing bank transactions).

~~~
ReligiousFlames
PS: Random crazy thought: it's curious, with the reduction of costs via
Moore's-law improvements, that there aren't yet formally verified,
zero-knowledge systems which can prove end-to-end that they performed
computation/real-world side effects and/or continue to safely store data. Why
blindly trust anyone or any company with data that can be seized, lost, or
misused when distributed computation, communication, and storage can be A2E
with only limited participants knowing the operations/plaintext? Perhaps:
homomorphic encryption, a blockchain-like ledger or proof-of-work, and
periodic, authenticated hash-challenge queries. Mix in relaying and other
idle phony traffic to make triangulation more difficult. I think that in
order to assure sufficient distributed system resources are made available,
micropayments a la AWS, but just covering costs, would make it possible to
have a persistent, anonymous computation and storage collective that would
survive outages, FBI raids, single nodes going offline, etc.

~~~
indolering
Storage, [yes]([https://storj.io/](https://storj.io/)). Computation ... sure
if you don't mind the server viewing the contents of your computation and can
verify the results. Sadly, fully homomorphic systems incur waaaay too much
overhead so you are constrained in what you can do (i.e. specialized DBs,
zkSNARKs, etc).

Then, of course, there is the problem of network latency and bandwidth costs
vs just keeping it all on one datacenter.

------
mjevans
A better question is why /shouldn't/ you use ECC memory?

Generally the answer to this is any context where you legitimately do NOT
care about your data at all but still care about costs. This predominantly
devolves into consumption-only gaming systems.

In all other cases everyone would be better served (in the long run) by buying
ECC RAM.

~~~
blackflame7000
A common network topology is to have a load balancer distribute load to a
number of cheap HTTP servers which internally connect to a centralized and
powerful database server. In this case only the database server really needs
ECC RAM. The system is designed to be fault-tolerant to any individual HTTP
server node, so the increased cost versus the problem it solves doesn't make
sense.

I guess you could argue that a random bit flip could somehow make the HTTP
server vulnerable and compromise the network, but that risk is vanishingly
small. If we take IBM's estimate that bit flips occur at an approximate rate
of 3.7 × 10^-9 per byte per month and consider the number of bytes in the
system, you can see that the odds of randomly corrupting a byte in memory
that triggers a vulnerability are too small.

~~~
fulafel
What about memory-error corrupted application data (or application logic)
where the corruption occurred on load balancers or web application servers?
There's more to data integrity than security holes.

~~~
blackflame7000
If you write code that detects stack smashing and illegal dereferences, then
you can terminate the web service and either have a watchdog restart it or,
if it crashes multiple times, have it taken out of service by the load
balancer. There are plenty of ways to handle hardware errors without throwing
out the hardware and getting "better" hardware. Technically, you could have a
faulty component somewhere between the RAM and CPU, and then what is your
expensive RAM going to do? What if the CPU cache has errors? For many small
businesses, the difference between success and failure is often their ability
to make things work without throwing cash at the problem.

~~~
fulafel
Even with your proposed checks, there remains a high probability of just
getting silent application data corruption, not crashes.

Regarding faulty components, that is one part of ECC's job, but the other part
is correcting the regular bit flips that happen with nominally operating DRAM.

Flagging faulty components is more useful than you propose. There are not that
many places where this corruption can occur, so being able to rule out RAM is
very useful. The example you used, CPU caches, is actually already covered by
ECC in most CPUs, including reasonably recent x86/amd64.

The tradeoff would be more worthy of thought if ECC were much more expensive.

~~~
blackflame7000
ECC RAM is not RAID. If corrosion on a trace causes a bit flip from an
adjacent line, then the RAM will receive the corrupt data as valid. There is
no parity RAM stick to recover from. I never said ECC RAM doesn't have a
purpose. I'm saying you are wasting your money if you think it's essential to
running a web server. Let's be real here: something like 80% of computers on
the internet stream porn. They don't need ECC RAM.

------
lucb1e
This article is gold in so many ways. It contains interesting bits of
information on ECC, company history that I didn't know (Sun's and Google's
namely), filesystem reliability (I never knew!), the physics of RAM (50
electrons per capacitor)...

It's a must read, even if only to get you thinking about some of these things.

------
VA3FXP
Depends on what you are doing.

* ZFS storage servers: Hell yes
* High-value data in my DB: Hell yes
* Email server: Nope
* Super cool gaming rig: Nope
* Cluster: Hell yes
* General office workstation: Maybe

I don't have the budget for 20 redundant copies. I do have the budget for
slightly more expensive RAM, especially on my ZFS storage arrays.

ECC memory is like insurance: you hope you never need it. One real downside
that I have found is finding out _when_ that memory correction has saved your
ass. RAID arrays can alert you when a disk is dead. SMART mostly tells you
when disks are failing. I haven't found a reliable tool to notify me when I
am getting ECC errors/corrections.
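On Linux, the kernel's EDAC subsystem does expose corrected/uncorrected error
counts in sysfs (tools like edac-util and rasdaemon build on this), so a
small poller can serve as that notification tool. A hypothetical sketch; the
paths assume a kernel with an EDAC driver loaded for your memory controller:

```python
# Sketch of an ECC-error reporter: EDAC exposes per-memory-controller counts
# of corrected (ce_count) and uncorrected (ue_count) errors in sysfs.
import glob
import os

def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {"mc0": {"ce": int, "ue": int}, ...} from EDAC sysfs files."""
    counts = {}
    for mc in glob.glob(os.path.join(edac_root, "mc*")):
        entry = {}
        for kind in ("ce", "ue"):
            path = os.path.join(mc, kind + "_count")
            if os.path.exists(path):
                with open(path) as f:
                    entry[kind] = int(f.read().strip())
        if entry:
            counts[os.path.basename(mc)] = entry
    return counts

if __name__ == "__main__":
    for mc, c in sorted(read_edac_counts().items()):
        print(f"{mc}: corrected={c.get('ce', 0)} uncorrected={c.get('ue', 0)}")
```

Run it from cron and alert whenever a count increases.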

~~~
rocqua
There is a hidden cost of ECC with regards to the chipset. None of the cheap
chipsets support it, so on any home build, it's going to be expensive.

~~~
binarycrusader
Not true with Ryzen, as long as you find unregistered ECC acceptable.

Somewhat not true with Intel, as some of the lower end Xeons now support it.

~~~
paulmd
Ryzen ECC support is a mess, no AM4 motherboard currently on the market has
implemented ECC support fully and properly (not even Asrock). It's better than
nothing but you would be a fool to rely on it.

[http://www.hardwarecanucks.com/forum/hardware-canucks-
review...](http://www.hardwarecanucks.com/forum/hardware-canucks-
reviews/75030-ecc-memory-amds-ryzen-deep-dive-5.html)

"Kinda sorta works but the manufacturer won't stand behind it" is a bunch of
bullshit. If your data is worth using ECC in the first place - it's worth
using a platform that has fully-implemented support, that has passed
validation, that you know is going to work properly when you need it.

Until that happens - this is an application where Ryzen is simply not
appropriate.

All of the modern i3s and Pentiums support ECC, but you do need the server
chipset instead of the cheap consumer stuff. Good news though - those
"expensive server boards" are roughly the same price as say, an AM4
motherboard with an X370 chipset.

Heck, you can buy a basic off-lease ThinkServer TS140 for only about $300.
You'll only have about 4 GB of RAM but it's a shell to start building out
(which is cheaper than having an OEM assemble it for you anyway).

~~~
binarycrusader
_Ryzen ECC support is a mess, no AM4 motherboard currently on the market has
implemented ECC support fully and properly (not even Asrock). It's better
than nothing but you would be a fool to rely on it._

Ryzen motherboard support is what is arguably a "mess", not the processor
itself, but at least it's functional on ASRock and select Gigabyte boards. As
for "a fool to rely on it", I'm not sure what you mean by that. The error
correction itself is done by the hardware. Other than calling the
initialization routines and providing logging/halt, the BIOS/UEFI isn't
responsible for anything, AFAIK.

I'm well aware that this isn't the full grade of ECC support offered by
higher-end Xeons and chipset combos, but it's better than nothing and it's
affordable.

Also, no offense, but I'm not going to rely on hardwarecanucks as an authority
on this subject.

 _All of the modern i3s and Pentiums support ECC, but you do need the server
chipset instead of the cheap consumer stuff. Good news though - those
"expensive server boards" are roughly the same price as say, an AM4
motherboard with an X370 chipset._

The goal isn't ECC alone, at least not for me, the goal is an 8-core system
with good single-threaded performance and ECC at a reasonable price. As far as
I know, only Ryzen offers that.

So for me, I'm looking at the possibility of getting a single system that can
give me decent gaming performance, good development performance, ECC support,
and more, all at a price that leaves me with money for other components.

~~~
paulmd
> Also, no offense, but I'm not going to rely on hardwarecanucks as an
> authority on this subject.

Fine then. AMD says it's unvalidated and unsupported, is that good enough for
you?

> I'm well aware that this isn't the full grade of ECC support offered by
> higher-end Xeons and chipset combos, but it's better than nothing and it's
> affordable.

So would you be OK with running Xeon engineering samples then? After all -
they certainly pass the same "best effort" test. Personally since these are
_server_ ES hardware - I'd tend to trust it more than consumer hardware like
Ryzen, especially given their comparative age/maturity.

I just picked up a 10-core Haswell Xeon engineering sample for $140 last week.
40% more multi-threaded performance than a Ryzen 1700. The X99 mobo I picked
up from Microcenter for $60 doesn't have ECC support but a bunch of them do.

Or if you want something that's official and you know works, there are surplus
Sandy Bridge Xeons very cheap nowadays. A decent bit more multithreaded
performance than a Ryzen 1700 - but you'll be giving up single-threaded
performance. [http://natex.us/intel-s2600cp2j-motherboard-
dual-e5-2670-sr0...](http://natex.us/intel-s2600cp2j-motherboard-
dual-e5-2670-sr0kx/)

Or really - a full retail E5-2630 v3 is under $500 now on eBay. That's not
really that bad if you just have to have everything in one box.

> So for me, I'm looking at the possibility of getting a single system that
> can give me decent gaming performance, good development performance, ECC
> support, and more, all at a price that leaves me with money for other
> components.

What it comes down to: if you want everything in one box then be prepared to
shell out. Everyone has this market segmented out, _including AMD_ (after all
they won't stand behind Ryzen's ECC either). If you feel you need ECC, that's
really not a valid solution.

If a Xeon doesn't cut it for you - sounds like you might be in the market for
two boxes here. A server/workstation with ECC and good multi-thread
performance, and a gaming machine that you can overclock and get the best
single-thread performance out of.

(Also - in general, overclocking seems kind of counterproductive to the aim
of running ECC RAM - although I guess I haven't looked into that.)

~~~
binarycrusader
_Fine then. AMD says it's unvalidated and unsupported, is that good enough
for you?_

No, that's not what AMD said, they said it isn't validated by motherboard
partners. The functionality is there, it's up to their partners to use it.

 _So would you be OK with running Xeon engineering samples then? After all -
they certainly pass the same "best effort" test. Personally since these are
server ES hardware - I'd tend to trust it more than consumer hardware like
Ryzen, especially given their comparative age/maturity._

That's not even a remotely accurate comparison.

 _What it comes down to: if you want everything in one box then be prepared
to shell out. Everyone has this market segmented out, including AMD (after
all they won't stand behind Ryzen's ECC either). If you feel you need ECC,
that's really not a valid solution._

Sorry, but so far all of your proposed "solutions" are summed up as: "If you
give up significant performance, functionality, buy second-hand, or completely
ignore official support statements, X competitor is the better deal!"

 _If a Xeon doesn't cut it for you - sounds like you might be in the market
for two boxes here. A server/workstation with ECC and good multi-thread
performance, and a gaming machine that you can overclock and get the best
single-thread performance out of._

No, the goal is to have one system, and at this point, Ryzen looks like the
best option. If a competitor decides to release something equivalent, I'll
consider them too.

------
ddingus
Yes.

Bit errors are uncommon, and their effects range from benign to outright crashes.

Your storage has them, memory has them, network has them.

Non-error-correcting memory very significantly increases that risk.

And this is the kind of risk you don't notice, until you do and when you do,
it's often subtle, insidious, impossible to track down.

Servers, absolutely. It's debatable on desktops, but we have huge RAM now;
might as well error-correct. The per-bit error risk is small, but bigger RAM
only adds to the overall possibility.

~~~
hashhar
I won't bother on a desktop. I've been using 4 machines over the last 17
years, with storage varying from 10GB to 2TB and RAM varying from 128MB to
16GB, and I haven't personally seen any kind of data corruption in motion (or
at rest, for that matter). Only had 2 mechanical drives fail (though
predictably).

ECC is costly: the memory modules themselves and the board required to
support them properly.

~~~
xenadu02
The only reason ECC is costly is because Intel has a monopoly on the
desktop/server chip market and they refuse to deploy ECC to consumer chips.
The hardware is there, it's just fused disabled.

If ECC carried only the cost of the extra storage (an ECC DIMM has 9 memory
chips for every 8 on a non-ECC DIMM: 72 bits per 64-bit word), and we assumed
a linear relationship, then it should cost about 1/8 more than non-ECC DRAM.
Unfortunately, Intel's decisions have knock-on effects that ripple through
the rest of the market.

IIRC I saw somewhere that JEDEC expects a future standard will require ECC to
get acceptable error rates for all memory. At that point Intel won't have any
choice.

------
anonymous_iam
The article makes no mention of single-event upsets (SEUs). These occur
randomly when cosmic rays cause a bit flip anywhere in the chip. ECC is a
good way to mitigate SEU effects.

~~~
jbmorgado
Sorry for nitpicking, but it's not the cosmic rays themselves, it's the
cascade shower of cosmic-ray secondaries (produced high up in the atmosphere
when a cosmic ray interacts with a particle there).

------
intrasight
I am typing this (finally!) on my new desktop build. I did mull over the
decision for a while but finally went with Xeon and ECC. So the memory cost
more - perhaps even twice as much - so what? I use my computer pretty heavily
for my work - with several VMs running at a time. If ECC saves me a headache
once a year, it will have paid for itself. If it never provides ANY benefit I
will still not regret the peace of mind.

~~~
usrusr
The parameters of the desktop ECC decision have changed massively with today's
glacial replacement cycles. Today you make a one time payment for many years
of avoided headaches and peace of mind, whereas back then any sign of
unreliability would have been a welcome excuse for a cheap upgrade.

------
tsukikage
No-one's mentioned it yet, but we're in a post-Rowhammer world and ISTM this
is relevant to the discussion: while not all non-ECC DIMMs are susceptible,
the cheaper ranges generally are, and if your purchasing decisions are driven
by hardware cost, that's probably what you'll end up with. Corruption due to
malice is a rather different beast to corruption due to random cosmic rays...

------
Tomte
Rehashing an old comment:

IEC 61508 documents an estimate of 700 to 1200 FIT/Mbit (FIT = "failures in
time", i.e. failures per 10^9 hours of operation) and gives the following
sources:

a) Altitude SEE Test European Platform (ASTEP) and First Results in CMOS 130
nm SRAM. J-L. Autran, P. Roche, C. Sudre et al. Nuclear Science, IEEE
Transactions on Volume 54, Issue 4, Aug. 2007 Page(s):1002 - 1009

b) Radiation-Induced Soft Errors in Advanced Semiconductor Technologies,
Robert C. Baumann, Fellow, IEEE, IEEE TRANSACTIONS ON DEVICE AND MATERIALS
RELIABILITY, VOL. 5, NO. 3, SEPTEMBER 2005

c) Soft errors' impact on system reliability, Ritesh Mastipuram and Edwin C
Wee, Cypress Semiconductor, 2004

d) Trends And Challenges In VLSI Circuit Reliability, C. Costantinescu, Intel,
2003, IEEE Computer Society

e) Basic mechanisms and modeling of single-event upset in digital
microelectronics, P. E. Dodd and L. W. Massengill, IEEE Trans. Nucl. Sci.,
vol. 50, no. 3, pp. 583–602, Jun. 2003.

f) Destructive single-event effects in semiconductor devices and ICs, F. W.
Sexton, IEEE Trans. Nucl. Sci., vol. 50, no. 3, pp. 603–621, Jun. 2003.

g) Coming Challenges in Microarchitecture and Architecture, Ronen, Mendelson,
Proceedings of the IEEE, Volume 89, Issue 3, Mar 2001 Page(s):325 – 340

h) Scaling and Technology Issues for Soft Error Rates, A Johnston, 4th Annual
Research Conference on Reliability Stanford University, October 2000

i) International Technology Roadmap for Semiconductors (ITRS), several papers.

If that's correct, the math is simple: you have bit flips in your PC about
once a day.

It's just that (a) you often won't notice those transient errors (one pixel in
your multi-megapixel photo is one bit off) and (b) a lot of your RAM is
probably unused.
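The arithmetic behind "about once a day" can be checked directly (a sketch
assuming 8 GiB of RAM, a figure not stated above):

```python
# 1 FIT = 1 failure per 10^9 device-hours; the IEC 61508 figures are per Mbit.
fit_low, fit_high = 700, 1200
mbit = 8 * 1024 * 8  # 8 GiB of RAM expressed in Mbit (8192 MiB * 8 bits/byte)
for fit in (fit_low, fit_high):
    flips_per_day = fit * mbit / 1e9 * 24
    print(f"{fit} FIT/Mbit -> {flips_per_day:.1f} bit flips per day")
```

which lands at roughly one to two flips per day, consistent with the estimate.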

------
notacoward
Same topic, same conclusion, even more hard facts.

[http://perspectives.mvdirona.com/2009/10/you-really-do-
need-...](http://perspectives.mvdirona.com/2009/10/you-really-do-need-ecc-
memory/)

------
justin66
In the late nineties, the Intel desktop chipsets such as 440LX and 440BX
offered ECC functionality, all you had to do was spend ten or fifteen bucks
extra on the memory. Great hardware.

I'm unhappy that Intel made things more expensive and complicated with their
market differentiation, but from their POV it was logical. PC users were
screwing up the reliability of their systems in so many ways via overclocking,
and were habituated to accept crappy reliability via pre-NT Windows. PC users
could have demanded ECC and they didn't. I'm sure that even when the chipsets
made it easy, only a tiny fraction bothered to use ECC.

------
alkonaut
For servers this is more or less a no brainer: it's not a huge extra cost and
a failure will cost you more than the extra cost.

For a regular desktop system for personal use it's not so easy. The data
volumes are much smaller, the temperature environments are usually better,
they aren't running (other than maybe idling) 24/7, and most of the stuff in
RAM isn't mission-critical (i.e. you don't have 32 GB of RAM filled with
customer database records; you have it filled with read-only FPS textures,
compiler caches, etc.).

Unlike a business that has tons of data being mutated, my data is mostly
immutable, such as photos. It's not a continuously changing dataset where a
bit flip in memory is likely to find its way into my data and then into my
backups, which would be the case e.g. for databases or big creative works
(movie editing, etc.).

------
danielfaust
I've spent the last two weeks looking at Memtest86+ trying to figure out if
either one of my memory modules is damaged, or if it is the motherboard. These
tests take a long time, and yield different results from day to day.

I've decided never again to buy non-ECC memory, at least not for 24/7 servers
or workstations.

In a gaming machine / visual typewriter? Sure, non-ECC memory is ok.

------
epx
I think that, given the personal importance of computing devices and storage,
no filesystem should exist w/o checksum of metadata+data, and no RAM should be
without ECC. The slight increase in cost does not justify the risk.

------
mixmastamyk
I searched recently for a good Linux laptop with ECC but didn't find much, so
I settled on a Kaby Lake i5. Does anyone make them?

~~~
jpalomaki
For example Lenovo P51 seems to support ECC (if equipped with Xeon processor).
About the Linux support I don't know, but I've understood at least some other
Lenovo models work ok with Linux.

[http://psref.lenovo.com/Product/ThinkPad_P51](http://psref.lenovo.com/Product/ThinkPad_P51)

------
ori_b
If you can afford it, sure. That's one reason why I'm so happy Ryzen supports
it on consumer processors: It makes ECC cheap.

------
omash
What are the odds of memory errors causing hard disk corruption / boot
failure?

------
exabrial
Yes. Everyone does.

------
JakiesKonto
Someone already mentioned Rowhammer, so ECC: yes :)

------
myrandomcomment
Yes. Are we done :)

------
saganus
"<pubDate>Fri, 27 Nov 2015 00:00:00 +0000</pubDate>"

Needs (2015) added to the Title I think.

~~~
sctb
Thanks! Updated.

------
sitkack
I want to thank Jeff for assisting Dan in writing this article.

------
Splendor
Relevant post by Jeff Atwood: [https://blog.codinghorror.com/to-ecc-or-not-to-
ecc/](https://blog.codinghorror.com/to-ecc-or-not-to-ecc/)

~~~
sbierwagen
That post is linked in the _first sentence_ of the submission.

------
godzillabrennus
Do you use ZFS? If yes then you should use ECC memory.

Do you have a use case where you would want your computer to alert you when
the ram is failing? If yes then you should use ECC memory.

Otherwise it's a nicety and probably not worth the money.

~~~
static_noise
Do you use ZFS? If no then you should use ECC memory.

Now the half-truth becomes a full truth.

~~~
mikeash
I don't get this association between ZFS and ECC. The recommendation to use
ECC with ZFS basically comes down to "all that fancy data integrity checking
that ZFS does won't protect you from memory errors, so you'll effectively lose
that feature."

Are you OK with silent data corruption? If so, don't bother with ECC. If not,
use it.

~~~
__jal
History. The ZFS folks, back when, were the only folks making much noise about
the association between non-ECC RAM and corrupt data landing on disk.

The truth is, if you care about the notion that your disk should return the
same data that software thought it was writing, you should use ECC with any
file system. But The ZFS folks made noise about the issue, I think lots of
people assumed the reason was that there was something special about ZFS that
needed it, and now you have something sort of like an urban legend.

~~~
dom0
> should return the same data that software thought it was writing

Hint: In an OS using a page cache (= _every OS_ ) I/O errors are not reliably
propagated to applications unless they explicitly sync their dirty pages.
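A minimal illustration of that point: a buffered write can return success
while the data only sits in the page cache, and a failure during the later
flush is only reported to the application if it calls fsync (a sketch; the
helper name is illustrative):

```python
# Sketch: write() typically just dirties pages in the page cache, so a later
# writeback failure is invisible to the application unless it calls fsync().
import os

def write_durably(path, data):
    """Write and flush; disk I/O errors surface as OSError at fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)  # success here may only mean "cached in RAM"
        os.fsync(fd)        # forces writeback; this is where errors appear
    finally:
        os.close(fd)
```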

~~~
__jal
I'm aware of that, but I'm not sure what I'm supposed to take away from it in
this context.

~~~
dom0
That it's difficult to accurately define "what the application thought it
wrote" when considering corruption at various abstraction layers; somewhat
similar to calculating checksums over already corrupted data.

