Persistent-memory error handling (lwn.net)
16 points by bootload on May 1, 2016 | 4 comments



As “clugstj” wrote in a comment, how is this different from the problem that mmap()ed files have always had?

If this really is a problem that must be solved, why can’t it be solved the same way RAM solves it, i.e. with ECC? Or with something like Forward Error Correction (support for which was added to dm-verity in Linux 4.5)?


The real issue is that a memory error reboots your entire machine.

That probably isn't the right granularity anymore.

A memory error should probably kick off a signal to the application. If the error isn't caught, then it should probably kill the process.

Memory errors should probably not cause a reboot unless they actually hit a kernel page.
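
To make the "signal to the application" idea concrete, here is a rough sketch in C of catching SIGBUS around a single guarded read and turning it into an error return. Linux reports hardware memory failures to user space as SIGBUS with si_code BUS_MCEERR_AR or BUS_MCEERR_AO; everything else here (the names install_sigbus_handler, guarded_copy, last_fault_was_memory_error, demo_truncated_mapping) is made up for illustration. Since a real memory error cannot be triggered on demand, the demo provokes the SIGBUS that a shared file mapping raises after the file is truncated, which is the classic mmap() version of the same problem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;            /* where a guarded access resumes */
static volatile sig_atomic_t armed;         /* nonzero only inside guarded_copy */
static volatile sig_atomic_t last_si_code;  /* si_code of the last caught SIGBUS */

static void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    if (!armed) {                /* fault outside a guarded region: */
        signal(SIGBUS, SIG_DFL); /* restore the default (fatal) action */
        return;
    }
    last_si_code = info->si_code;
    siglongjmp(recover_point, 1);
}

/* Hypothetical helper: install the handler once at startup. */
int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Copy len bytes; returns 0 on success, -1 if the read raised SIGBUS. */
int guarded_copy(void *dst, const void *src, size_t len)
{
    if (sigsetjmp(recover_point, 1) != 0) {
        armed = 0;
        return -1;               /* arrived here from the signal handler */
    }
    armed = 1;
    const volatile unsigned char *s = src;  /* volatile: reads must happen */
    unsigned char *d = dst;
    for (size_t i = 0; i < len; i++)
        d[i] = s[i];
    armed = 0;
    return 0;
}

/* True if the last caught SIGBUS was reported as a hardware memory error. */
int last_fault_was_memory_error(void)
{
#ifdef BUS_MCEERR_AR
    return last_si_code == BUS_MCEERR_AR || last_si_code == BUS_MCEERR_AO;
#else
    return 0;
#endif
}

/* Stand-in demo: SIGBUS from reading a mapping whose file was truncated. */
int demo_truncated_mapping(void)
{
    char buf[16];
    FILE *f = tmpfile();
    assert(f != NULL);
    int fd = fileno(f);
    long page = sysconf(_SC_PAGESIZE);
    int rc = ftruncate(fd, page);
    assert(rc == 0);
    void *p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fd, 0);
    assert(p != MAP_FAILED);
    rc = guarded_copy(buf, p, sizeof buf);  /* file still backs the page */
    assert(rc == 0);
    rc = ftruncate(fd, 0);                  /* page now lies past end of file */
    assert(rc == 0);
    int result = guarded_copy(buf, p, sizeof buf);
    munmap(p, (size_t)page);
    fclose(f);
    return result;
}
```

After install_sigbus_handler(), demo_truncated_mapping() should return -1 instead of the process dying. A SIGBUS that arrives outside guarded_copy falls back to the default action and kills the process, which matches the "if the error isn't caught" case above.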


This is already possible for ECC RAM:

http://www.intel.com/content/dam/www/public/us/en/documents/...

So even an application can recover from uncorrectable ECC errors.

But it's not feasible to solve this in applications, since there are many scenarios where such storage should simply be transparent.


ECC corrects some types of errors (single-bit errors), but detects others as uncorrectable; how should those errors be handled? Probably not via an unhandled machine check.

Hardware support for Forward Error Correction would certainly be nice.

Software FEC support from the kernel wouldn't help applications that directly access persistent memory from userspace. Software FEC support in userspace would still need the kernel to pass through errors rather than treating them as fatal. (For instance, the kernel could send a signal when it detects an error, but provide a way for the application to explicitly read the data including any errors, so that the application can apply any error-correcting codes it has.)
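
To illustrate the corrected-versus-uncorrectable distinction, and the kind of code an application could apply itself if the kernel handed it the raw data, here is a toy sketch in C of a single-error-correcting, double-error-detecting (SECDED) extended Hamming(8,4) code. It is not what any particular ECC DIMM or persistent-memory device does; real hardware uses much wider codewords, and the names secded_encode and secded_decode are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum secded_status {
    SECDED_OK = 0,            /* no error seen */
    SECDED_CORRECTED = 1,     /* one flipped bit, repaired */
    SECDED_UNCORRECTABLE = 2  /* two flipped bits, detected only */
};

/* XOR of the positions (1..7) of all set bits; zero for a valid codeword. */
static unsigned syndrome(uint8_t w)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos <= 7; pos++)
        if (w & (1u << pos))
            s ^= pos;
    return s;
}

/* 1 if the byte has an odd number of set bits, else 0. */
static unsigned parity8(uint8_t w)
{
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

/* Encode the low 4 bits of nibble into an 8-bit codeword. */
uint8_t secded_encode(uint8_t nibble)
{
    uint8_t w = 0;
    /* Data bits live at positions 3, 5, 6 and 7. */
    if (nibble & 1) w |= 1u << 3;
    if (nibble & 2) w |= 1u << 5;
    if (nibble & 4) w |= 1u << 6;
    if (nibble & 8) w |= 1u << 7;
    /* Check bits at positions 1, 2 and 4 drive the syndrome to zero. */
    unsigned s = syndrome(w);
    if (s & 1) w |= 1u << 1;
    if (s & 2) w |= 1u << 2;
    if (s & 4) w |= 1u << 4;
    /* Bit 0 is an overall parity bit: total number of set bits is even. */
    w |= (uint8_t)parity8(w);
    return w;
}

/* Decode a codeword; *nibble is written unless the error is uncorrectable. */
enum secded_status secded_decode(uint8_t w, uint8_t *nibble)
{
    assert(nibble != NULL);
    unsigned s = syndrome(w);
    unsigned odd = parity8(w);
    enum secded_status st = SECDED_OK;

    if (s != 0 && !odd)
        return SECDED_UNCORRECTABLE;  /* even parity, bad syndrome: 2 flips */
    if (odd) {
        /* Single flip at position s (s == 0 means the parity bit itself). */
        w ^= (uint8_t)(1u << s);
        st = SECDED_CORRECTED;
    }
    *nibble = (uint8_t)(((w >> 3) & 1) | ((w >> 4) & 2) |
                        ((w >> 4) & 4) | ((w >> 4) & 8));
    return st;
}
```

Flipping any one bit of secded_encode(n) still decodes to n with SECDED_CORRECTED, while flipping any two bits yields SECDED_UNCORRECTABLE. Three or more flips can be silently miscorrected, which is one reason the uncorrectable case still needs a sane reporting path rather than a machine check.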



