
ECC Memory and AMD's Ryzen – A Deep Dive - nacc
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/75030-ecc-memory-amds-ryzen-deep-dive.html
======
xorblurb
The article's conclusion is mostly false, at least for Linux: there is no
reason for a UE (uncorrectable error) to panic the machine in every case, and
by default Linux simply kills the affected processes. Of course, if the error
hits kernel memory, you will panic, but the probability of that is low
(amount of kernel memory / total memory...). This has been pointed out in the
comments (not by me), but unfortunately the article has not been updated to
reflect it. Also, there is no reason the policy can't be managed in software:
as long as the error is detected, the kernel is free to do what it wants with
a UE, and everything is fine.
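
To put a rough number on that ratio, here is a minimal sketch that estimates
the kernel's share of RAM from /proc/meminfo-style text. Treating "kernel
memory" as Slab + KernelStack + PageTables is my own simplification (it
ignores kernel text, vmalloc areas and so on), the sample figures are made up,
and it assumes an error is equally likely to land anywhere in RAM.

```python
def kernel_fraction(meminfo_text,
                    kernel_fields=("Slab", "KernelStack", "PageTables")):
    """Estimate the share of RAM used by the kernel from meminfo-style text.

    The choice of fields is an assumption for illustration, not an exact
    accounting of kernel memory.
    """
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])  # value in kB
    total = values["MemTotal"]
    kernel = sum(values.get(field, 0) for field in kernel_fields)
    return kernel / total


# Hypothetical numbers, not taken from a real machine.
SAMPLE = """\
MemTotal:       16000000 kB
Slab:             700000 kB
KernelStack:       20000 kB
PageTables:        80000 kB
"""

if __name__ == "__main__":
    print(f"approx. kernel share of RAM: {kernel_fraction(SAMPLE):.2%}")
```

On a real Linux box you could pass in the contents of /proc/meminfo instead of
SAMPLE; the result is only as good as the field selection.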

Also, I have no proof that nothing crazy can happen, but there is no reason
for single-bit errors not to be corrected regardless of the OS. The worst
that should happen is that they go unreported.
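
If you want to check whether corrected errors are actually being reported on
Linux, the EDAC subsystem exposes per-memory-controller counters in sysfs. A
small sketch, assuming an EDAC driver for your platform is loaded (otherwise
the directory is simply missing and you get an empty result); the function
name and the `root` parameter are mine, added so it can be pointed at a test
directory:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}} from EDAC sysfs.

    ce_count = corrected errors, ue_count = uncorrected errors.
    A counter that cannot be read or parsed is reported as None.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("no EDAC memory controllers found (driver not loaded?)")
    for controller, entry in result.items():
        print(controller, entry)
```

An empty result does not mean ECC is off, only that nothing is being reported
through this interface.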

IMO if you have the opportunity (the category of HW you want supports it) you
would be crazy not to use ECC RAM. Non-ECC RAM is basically the only component
in a PC that is not protected. Obvious weak point. I've been bitten at least
twice (two defective components, and far more than two errors before I figured
out what was happening), and that's only on computers I _directly_ owned or
used at work (about a dozen in total). I don't want to lose my time anymore,
so I always use ECC memory when possible (I'm not going to pay twice the price
for a computer just for that, so it is a "little" difficult with laptops,
which also have a plethora of other selection criteria, but it is very easy
to get affordable desktop workstations with ECC).

No modern digital communication bus would be designed without some form of
protection, so it makes little sense to have computers without ECC RAM. I
would even like to have it on smartphones, but unfortunately I doubt this will
happen soon.

~~~
yuhong
The fun thing is that things like rowhammer can cause errors in unrelated
memory.

~~~
poizan42
At least ECC turns an RCE into a DoS, even if it's not the best situation.

~~~
baq
Nah you just flip three bits at a time instead of two.

~~~
poizan42
But can you do that in a controlled way, and fast enough that the hardware
doesn't notice? As far as I know, no one has managed to perform a successful
rowhammer-based attack against a system with ECC RAM.
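
For anyone wondering why three flips behave differently from two, here is a
toy sketch of a SECDED (single-error-correct, double-error-detect) code. It is
an extended Hamming (8,4) code over one nibble, chosen only because it is
small; real ECC DIMMs use much wider codes, and the function names here are
made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    code = [0] * 8                          # index 0 is the overall parity bit
    # Data bits sit at positions 3, 5, 6, 7 (classic Hamming(7,4) layout).
    code[3], code[5], code[6], code[7] = ((nibble >> i) & 1 for i in range(4))
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    odd_parity = sum(code) & 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder ASSUMES exactly one and repairs it.
        # Syndrome 0 here means the overall parity bit (index 0) was hit.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: detected, not fixable.
        return None, "uncorrectable"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    word = encode(0b0000)
    print(decode(word))                     # no flips
    print(decode(flip(word, 5)))            # one flip
    print(decode(flip(word, 5, 6)))         # two flips
    print(decode(flip(word, 1, 2, 4)))      # three flips
```

One flip is repaired and two flips are flagged as uncorrectable (the DoS
case), but the three-flip call reports "corrected" while handing back 8
instead of 0: the decoder mistakes it for a single-bit error and "fixes" the
wrong bit. Whether an attacker can actually produce such a pattern on real
hardware is a separate question.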

------
mjevans
I would actually prefer if the uncorrected memory exception were handled by
the operating system.

I would much prefer that the affected program(s) have a chance to react,
or be killed as a subset of the system. If the error occurred in a filesystem
context there may be other ways of correcting the issue (particularly if it's
merely in read cache instead of write cache).

Obviously unhandled exceptions should cascade until they are either contained
or until the entire system halts.

~~~
throwaway2048
A problem arises when the memory error occurs in the subsystem designed to
handle memory errors...

~~~
wmf
I think that just hangs the whole machine, which is no worse than what happens
when the OS does not handle uncorrectable errors.

~~~
sqeaky
I think we would actually be in undefined-behavior territory. If that demon
is kind, the machine hangs; if not, it could start sending an endless stream
of gibberish down the SATA bus.

~~~
Dylan16807
Other kinds of faults are handled just fine. There's a fault handler, a double
fault handler for when that has a problem, and triple fault is a reset.

There's no reason a fault while in the ECC error handler shouldn't have the
same progression.
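
As a toy model of that progression (plain Python exceptions standing in for
hardware faults; the names are invented for the example, and this is not how
a CPU implements it):

```python
class Fault(Exception):
    """Stand-in for a hardware fault in this toy model."""


def run_with_escalation(task, fault_handler, double_fault_handler):
    """Run task; escalate fault -> double fault -> reset if handlers fail too."""
    try:
        return "ok", task()
    except Fault as first:
        try:
            return "handled", fault_handler(first)
        except Fault as second:
            try:
                return "double fault handled", double_fault_handler(second)
            except Fault:
                # Third failure in a row: nothing left to call, so "reset".
                return "reset", None


def failing(*_args):
    raise Fault("memory error")


def recovering(fault):
    return f"recovered from: {fault}"


if __name__ == "__main__":
    print(run_with_escalation(lambda: 42, recovering, recovering))
    print(run_with_escalation(failing, recovering, recovering))
    print(run_with_escalation(failing, failing, recovering))
    print(run_with_escalation(failing, failing, failing))
```

The point is only that each level has a defined fallback, so an error inside
the error handler still ends in a known state (at worst a reset) rather than
anything-goes.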

------
zkms
> Since we don't have our own particle accelerator to bombard the memory
> modules with in order to cause radiation-based errors

I really want to see someone get some radioisotopes and place them next to
both ECC and non-ECC RAM (while forcing reads and writes to the affected
memory) to see what sort of soft errors / SEUs happen.

~~~
_ihaque
The folks at Los Alamos have done that one better and actually measured GPUs
placed in neutron beamlines:

[https://www.cs.utexas.edu/users/skeckler/pubs/SELSE_2014_Rel...](https://www.cs.utexas.edu/users/skeckler/pubs/SELSE_2014_Reliability.pdf)

[http://users.nccs.gov/~vazhkuda/hpca.pdf](http://users.nccs.gov/~vazhkuda/hpca.pdf)
(section 6)

~~~
zkms
Lovely!

------
angry_octet
It's a bit terrible that the author implies ZFS is more susceptible to bit
errors _because_ it scrubs data, and any errors will make it go haywire. As
opposed to other systems like NTFS/ext4 which presumably cope fine with
undetected bit errors...

