A broken memory module hid in plain sight

sllabres · on Feb 28, 2020

It can be the other hand round: We once had multiple enterprise class server in a data center that suffered from a strange problem. After a couple of days, sometimes weeks, they stopped working with memory nearly always reported as the culprit. It took some time to find the true reason, which we discovered with the vendor after many hours of analyzing and swapping hardware. The error only occurred when the server was lightly loaded and could be mitigated by disabling the CPU C states entirely in the server configuration. As memory, CPU test or test jobs we had written produced high system load, they never showed any sign of error during test runs because they prevented the CPU from entering higher C states.

rsecora · on Feb 28, 2020

That shall be the one of the issues with the Xeon E5-26xx[1]. Some state changes produced either reset or unpredictable system behavior. The workaround was disable c-state or patch the microcode when the patch was available.

[1] https://www.intel.co.uk/content/www/uk/en/processors/xeon/xe...

nickcw · on Feb 28, 2020

A long time ago I wrote a program for stress testing hard disks. You can download the modern version of it here: https://github.com/ncw/stressdisk

It got used a lot to soak tests disk for servers that were going into production or being shipped out to far away places.

However it quite soon became apparent that a lot of the errors weren't caused by bad disks at all, but rather by bad RAM. It even discovered a set of bad RAM for 200 computers which passed all the manufacturers tests but failed when in use.

Moral of the story: if you care about data integrity check your RAM as well as your disks! I run memtest86 then stressdisk on any new systems I'm building.

Also, testing RAM is hard! Memtest86 does an excellent job, much better than the manufacturers tests.

mjevans · on Feb 28, 2020

This is why if I care about data on a system, I want that system to have ECC RAM.

Proper burn-in testing helps too; but ECC will help me later after things have aged. Maybe I should add a yearly re-validation to systems.

anonsivalley652 · on Feb 28, 2020

Mostly yeap. Every new memory stick I buy at home gets 30 minutes or so on memtest86. Professionally, it would be a couple of hours. Memory sticks might go "bad" over time, but I would bet it's likely a BGA ball failure. Possibly some directed heat or oven could fix it without having a BGA rework station. The failures-over-time graphs I've seen for memory are similar to those for hard drives: bathtub graphs.

My current build is a borosilicate water-cooled dual Rome EPYC with ½ a TiB of ECC RAM. I already had it planned before the media went gaga over AMD. I just hope prices don't spike too much before I drop what's already going to be $13k to get this done.

PS: 72-bit (64,8) SECDED ECC RDRAM typically uses 4-bit-wide chips in multiples of 18 chips (# = ranks, 36 for 2R), whereas consumer 64-bit DDR4 uses 8-bit chips, in multiples of 8 chips. Also, does anyone remember IBM's Chipkill?

londons_explore · on Feb 28, 2020

RAM does go bad over time. It all is factory tested good before you receive it, so any failures you see are during-lifetime failures.

drudru · on Feb 28, 2020

How quiet is the EPYC build?

tokamak-teapot · on Feb 28, 2020

I followed a similar route where I eventually got to blaming bad (ECC) RAM, even sending it back to Crucial. In the end though, it wasn’t bad RAM. It was a bad RAM socket. This is one reason I’m now allergic to hardware and prefer to have someone else look after it.

tbyehl · on Feb 28, 2020

Hardware is evil, this is why we migrate everything to The Cloud!

And get angry every time The Cloud serves us a painful reminded that it's just Someone Else's Hardware.

My favorite Faulty RAM story involves working for a hosting company that at some point homebrewed some Linux-based TCP/IP load balancers to offer customers instead of a proper F5 or whatever. This particular load balancer is restarting itself every time one of the Windows servers behind it fails a health check. Obviously not a desirable situation.

A tenacious young tech not yet battle-hardened to the fallibility of hardware had been working on the load balancer for hours. Eventually throws his hands up and calls for a Windows guy 'cause obviously Windows must be doing something weird to make the Linux crash. Linux never crashes, amiright?

Grizzled veteran that I am, I'm telling him to replace the RAM before he can finish explaining.

If you can't fathom what the problem could be, if it really isn't DNS[0], it's probably going to be RAM. If it's not RAM... update your resume and burn everything to the ground.

[0] https://www.cyberciti.biz/humour/a-haiku-about-dns/

temac · on Feb 28, 2020

You had a bad ram socket that silently corrupted data while you used ecc?

tokamak-teapot · on Feb 29, 2020

No, it crashed the machine when I hit certain addresses. Or rather, when /something/ did. Hitting this part of RAM wasn’t predictable, so it was only when I considered that it seemed my crashes happened when the OS might be allocating more RAM than usual that I looked at RAM. I saw errors in memtest86 and blamed the RAM. But then the replacement had a similar problem. And then I switched modules around slots. And then I realised.