
What's the difference between a "hard" and "soft" error that the article mentions?

I've heard the terms used in two different ways, and I'm not sure which way is the 'correct' one:

In one usage, "soft" errors are ones that are 'caught' and transparently fixed by ECC, and thus have no effect (on a system that has ECC memory). "Hard" errors, by contrast, are ones that affect multiple bits and aren't corrected by ECC.

In the other usage, which I think is the more technically correct one, a "soft" error is a transient condition (bit flipped by cosmic ray, etc.) and the memory cell continues to operate normally on the next cycle. A "hard" error is where the cell is basically stuck in one state or another, and indicates that it's probably time to replace the module. I think you detect a "hard" error by looking for a series of "soft" errors, although maybe some architectures/chipsets detect the difference and report them in different ways...?

If anyone can substantiate either set of definitions, I'd be interested as well.
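The "series of soft errors" heuristic from the second usage can be sketched roughly as below. This is only an illustration: `classify_addresses`, the `threshold` of 3, and the sample log are made-up assumptions, not how any particular chipset or OS reports errors.

```python
from collections import Counter

def classify_addresses(error_log, threshold=3):
    """Label each address by how often it appears in a corrected-error log.

    error_log: iterable of addresses, one entry per corrected error (hypothetical format).
    threshold: assumed cutoff; repeated errors at one address hint at a stuck cell.
    """
    counts = Counter(error_log)
    return {
        addr: "suspected hard" if n >= threshold else "likely soft"
        for addr, n in counts.items()
    }

# Hypothetical log: address 0x1000 keeps failing, the others fail once.
log = [0x1000, 0x2040, 0x1000, 0x1000, 0x3FC0]
result = classify_addresses(log)
```

Here `0x1000` would be flagged as a suspected hard error and the other two addresses as likely soft. A real system would probably also weigh the time window and which bit failed.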

Soft errors are temporary (e.g. a bit flip caused by cosmic rays), so rewriting the affected memory, which a reboot does, will eliminate them.

Hard errors are permanent (e.g. a bit is always bad) and that's when you throw the DIMM away.
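A toy model of that distinction, as a sketch only: `SimulatedCell`, `upset`, and `is_hard_fault` are hypothetical names for a one-bit simulation, not a real memory-test API. The idea is that a rewrite clears a transient flip, while a stuck cell fails the write/read-back test.

```python
class SimulatedCell:
    """One-bit memory cell; stuck_at=0 or 1 models a permanent (hard) fault."""

    def __init__(self, stuck_at=None):
        self.stuck_at = stuck_at
        self.value = 0 if stuck_at is None else stuck_at

    def write(self, bit):
        # A stuck cell ignores writes and keeps its stuck value.
        self.value = bit if self.stuck_at is None else self.stuck_at

    def read(self):
        return self.value

    def upset(self):
        # Transient (soft) flip of the stored bit, e.g. from a particle strike.
        if self.stuck_at is None:
            self.value ^= 1


def is_hard_fault(cell):
    """Write both bit values and read each back; any mismatch means a hard fault."""
    for bit in (0, 1):
        cell.write(bit)
        if cell.read() != bit:
            return True
    return False


healthy = SimulatedCell()
healthy.write(1)
healthy.upset()                      # soft error: stored bit is now wrong
corrupted_read = healthy.read()      # reads 0 instead of 1
healthy_is_hard = is_hard_fault(healthy)   # rewriting works, so not a hard fault

stuck = SimulatedCell(stuck_at=0)
stuck_is_hard = is_hard_fault(stuck)       # writing 1 never sticks
```

The healthy cell returns a wrong value once but passes the rewrite test; the stuck cell fails it every time, which is the "throw the DIMM away" case.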

By definition you can recover from a soft error, e.g. by automatically retrying a calculation.
