Verifying your output on another system requires another system, and the cost of that handily outweighs the cost of having ECC, provided the CPU/chipset support it.
As for: "I guess it makes sense to have ECC RAM on the machine building your releases"
That's a very narrow use case. There are many more use cases than that one, and for a lot of them ECC makes good sense: inputs to long-running processes, computations that have some kind of real-world value (bookkeeping, your thesis, experimental data subject to later verification, legal documents, and so on).
> but for your dev machine does it really matter?
Maybe not to you.
> I mean, at this point it's really about a rather subjective perception of risk and particular use cases.
No, it's about a feature that, if adopted widely, would allow us to check off one possible source of errors. It would not meaningfully increase the cost of the average machine, and it would still be optional; nobody would be forced to use anything.
> In my situation I find that memory issues are very low on my list of "things that can go catastrophically wrong".
Good for you.
> Really the only thing I can think about is building a corrupt release on my non-ECC build server.
You are still thinking about just your own use-cases.
> But from experience I'm not exactly in the minority to do that either and yet I don't observe many such issues in the wild.
Likely you also have somewhere between 8 and 32 GB of RAM in your machine.
If I look at my servers, which have been operating for years on end, they do tend to accumulate corrected ECC errors. The only reason I know about it is that there is ECC in there to begin with. If those machines were running without ECC, I'd likely not even be aware of any issues. But maybe the machines, or some application on them, would have crashed (the best possible outcome), or maybe some innocent bits of data would have been corrupted (second best). And at the far end of the spectrum, maybe we'd have had to re-install a machine from a backup (not so good: downtime, extra work), or maybe it would have led to silent data corruption (the worst case).
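As an aside, on Linux those corrected/uncorrected counts are what the EDAC driver exposes under sysfs, one directory per memory controller. Here's a minimal sketch of how one might tally them up; the sysfs layout (`mc*/ce_count`, `mc*/ue_count`) is the standard EDAC interface, but the exact set of controllers present obviously depends on your hardware:

```python
import glob
import os

def ecc_error_counts(root="/sys/devices/system/edac/mc"):
    """Sum corrected (ce_count) and uncorrected (ue_count) ECC events
    across all memory controllers found under the EDAC sysfs tree.
    Returns a (corrected, uncorrected) tuple."""
    corrected = uncorrected = 0
    for mc in glob.glob(os.path.join(root, "mc*")):
        for name in ("ce_count", "ue_count"):
            path = os.path.join(mc, name)
            if not os.path.exists(path):
                continue
            with open(path) as f:
                count = int(f.read().strip())
            if name == "ce_count":
                corrected += count
            else:
                uncorrected += count
    return corrected, uncorrected

if __name__ == "__main__":
    ce, ue = ecc_error_counts()
    print(f"corrected: {ce}, uncorrected: {ue}")
```

On a non-ECC machine the `edac` tree is simply absent, which is exactly the point: without ECC there is nothing to count, so you never find out.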
Now, servers are not workstations, but my workstation has exactly as much RAM as my servers and no ECC, which is highly annoying. Single-threaded performance of the various Intel CPUs is much better on consumer systems than on the Xeons, though, unless you want to be subject to highway-robbery prices.
So for me having the ECC option on consumer hardware would be quite nice, and I suspect anybody else doing real work on their PCs would love that option too.
Yeah, I see where you're coming from. I guess I have a different perspective because I never really considered using consumer-grade hardware for "pro" server use, but I suppose it makes sense if you don't want to pay the premium for a Xeon-type build. I wouldn't be comfortable hosting a critical production database on non-ECC RAM, for instance.
Going off on a tangent, this discussion made me wonder whether ECC memory is common on GPUs (after all, with GPGPU becoming more and more mainstream, what good is ECC system RAM if your VRAM isn't protected?).
Unsurprisingly, it turns out that consumer-grade GPUs don't have ECC. However, I stumbled upon this 2014 paper: "An investigation of the effects of hard and soft errors on graphics processing unit-accelerated molecular dynamics simulations"[0].
Now, obviously it's a rather specific use case, but I thought their conclusions were interesting:
>The size of the system that may be simulated by GPU-accelerated AMBER is limited by the amount of available GPU memory. As such, enabling ECC reduces the size of systems that may be simulated by approximately 10%. Enabling ECC also reduces simulation speed, resulting in greater opportunity for other sources of error such as disk failures in large filesystems, power glitches, and unexplained node failures to occur during the timeframe of a calculation.
>Finally, ECC events in RAM are exceedingly rare, requiring over 1000 testing hours to observe [7, 8]. The GPU-corrected error rate has not been successfully quantified by any study—previous attempts conducted over 10,000 h of testing without seeing a single ECC error event. Testing of GPUs for any form of soft error found that the error rate was primarily determined by the memory controller in the GPU and that the newer cards based on the GT200 chipset had a mean error rate of zero. However, the baseline value for the rate of ECC events in GPUs is unknown.