
A broken memory module hid in plain sight - zdw
https://chollinger.com/blog/2020/02/how-a-broken-memory-module-hid-in-plain-sight-and-how-i-blamed-the-linux-kernel-and-two-innocent-hard-drives/
======
sllabres
It can be the other way round: we once had multiple enterprise-class servers
in a data center that suffered from a strange problem. After a couple of days,
sometimes weeks, they stopped working, with memory nearly always reported as
the culprit. It took some time to find the true cause, which we discovered
with the vendor after many hours of analyzing and swapping hardware. The error
only occurred when the server was lightly loaded and could be mitigated by
disabling the CPU C-states entirely in the server configuration. Because the
memory tests, CPU tests and test jobs we had written all produced high system
load, they never showed any sign of error during test runs: they prevented the
CPU from entering higher C-states.
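
To see which idle states a Linux machine exposes and how often they are entered, the kernel's cpuidle sysfs directory can be read directly. The following is a minimal sketch, assuming the usual layout under `/sys/devices/system/cpu/cpu0/cpuidle`; the function name `list_cpuidle_states` is made up for illustration, and it only inspects states rather than disabling them.

```python
from pathlib import Path


def _read(path, default):
    """Read a small sysfs attribute file, or return default if it is absent."""
    return path.read_text().strip() if path.is_file() else default


def list_cpuidle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return a list of (name, disabled, usage) tuples, one per idle state.

    Hypothetical helper: returns an empty list when the directory does not
    exist (no cpuidle driver loaded, or not running on Linux).
    """
    root = Path(cpu_dir)
    if not root.is_dir():
        return []
    # Sort numerically so that state10 comes after state9.
    state_dirs = sorted(root.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state in state_dirs:
        name = _read(state / "name", "")
        disabled = _read(state / "disable", "0") == "1"
        usage = int(_read(state / "usage", "0"))  # times this state was entered
        states.append((name, disabled, usage))
    return states


if __name__ == "__main__":
    for name, disabled, usage in list_cpuidle_states():
        print(f"{name:10s} disabled={disabled} entered {usage} times")
```

On a machine under constant test load, the usage counters of the deeper states would be expected to grow very slowly, which fits the observation that the load tests never triggered the fault.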

~~~
rsecora
That would be one of the issues with the Xeon E5-26xx [1]. Some state
changes produced either a reset or unpredictable system behavior. The
workaround was to disable C-states or to patch the microcode once the patch
was available.

[1]
[https://www.intel.co.uk/content/www/uk/en/processors/xeon/xe...](https://www.intel.co.uk/content/www/uk/en/processors/xeon/xeon-e5-v4-spec-update.html)

------
nickcw
A long time ago I wrote a program for stress testing hard disks. You can
download the modern version of it here:
[https://github.com/ncw/stressdisk](https://github.com/ncw/stressdisk)

It got used a lot to soak-test disks for servers that were going into
production or being shipped out to faraway places.

However, it quite soon became apparent that a lot of the errors weren't caused
by bad disks at all, but rather by bad RAM. It even discovered a set of bad
RAM for 200 computers which passed all the manufacturer's tests but failed when
in use.

Moral of the story: if you care about data integrity, check your RAM as well as
your disks! I run memtest86 then stressdisk on any new systems I'm building.

Also, testing RAM is hard! Memtest86 does an excellent job, much better than
the manufacturers' tests.
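
The general write-then-verify idea behind a disk soak test can be sketched in a few lines. This is not stressdisk's actual algorithm, just a minimal illustration under assumptions: the names `pattern`, `write_blocks` and `verify_blocks` are made up, and the data is a deterministic pseudo-random stream derived from a seed so it can be regenerated for comparison.

```python
import hashlib
import os

BLOCK = 1 << 20  # 1 MiB per block


def pattern(seed: int, index: int, size: int = BLOCK) -> bytes:
    """Deterministic pseudo-random block derived from (seed, index)."""
    return hashlib.shake_256(f"{seed}:{index}".encode()).digest(size)


def write_blocks(path, seed, count):
    """Write `count` pattern blocks to `path` and flush them to the device."""
    with open(path, "wb") as f:
        for i in range(count):
            f.write(pattern(seed, i))
        f.flush()
        os.fsync(f.fileno())


def verify_blocks(path, seed, count):
    """Return the indices of blocks that differ from what was written."""
    bad = []
    with open(path, "rb") as f:
        for i in range(count):
            if f.read(BLOCK) != pattern(seed, i):
                bad.append(i)
    return bad
```

A mismatch here only says that the data changed somewhere between generation and read-back. The data passes through RAM, the page cache, the controller and the cabling as well as the disk, which is how a disk test can end up flagging bad memory.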

------
mjevans
This is why if I care about data on a system, I want that system to have ECC
RAM.

Proper burn-in testing helps too; but ECC will help me later after things have
aged. Maybe I should add a yearly re-validation to systems.

~~~
anonsivalley652
Mostly, yep. Every new memory stick I buy at home gets 30 minutes or so on
memtest86. Professionally, it would be a couple of hours. Memory sticks might
go "bad" over time, but I would bet it's likely a BGA ball failure. Possibly
some directed heat or an oven could fix it without a BGA rework station.
The failures-over-time graphs I've seen for memory are similar to those for
hard drives: bathtub curves.

My current build is a borosilicate water-cooled dual Rome EPYC with ½ a TiB of
ECC RAM. I already had it planned before the media went gaga over AMD. I just
hope prices don't spike too much before I drop what's already going to be $13k
to get this done.

PS: 72-bit (64 data + 8 check bits) SECDED ECC RDIMMs typically use
4-bit-wide chips in multiples of 18 chips per rank (36 for 2R), whereas
consumer 64-bit DDR4 uses 8-bit chips, in multiples of 8 chips. Also, does
anyone remember IBM's Chipkill?
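
The chip counts follow from the widths: 18 chips × 4 bits = 72 bits, and 8 chips × 8 bits = 64 bits. Why 64 data bits need 8 check bits for SECDED can be shown with a textbook extended Hamming code. The sketch below is only an illustration (the function names are made up, and real memory controllers implement their own codes in hardware): 7 Hamming check bits locate a single flipped bit, and one overall parity bit tells single errors from double errors.

```python
def secded_encode(data_bits):
    """Extended Hamming encode a list of 0/1 data bits.

    Index 0 of the codeword holds overall parity; indices that are powers
    of two hold Hamming check bits; all other indices hold data.
    """
    m = len(data_bits)
    r = 0
    while (1 << r) < m + r + 1:  # smallest r with 2**r >= m + r + 1
        r += 1
    n = m + r
    code = [0] * (n + 1)
    it = iter(data_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: data position
            code[pos] = next(it)
    syn = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syn ^= pos
    for i in range(r):  # choose check bits so the syndrome becomes zero
        code[1 << i] = (syn >> i) & 1
    code[0] = sum(code) & 1  # make the total number of ones even
    return code


def secded_decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syn = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syn ^= pos
    parity_odd = sum(code) & 1
    if syn == 0 and not parity_odd:
        status = "ok"
    elif parity_odd and syn < len(code):
        code[syn] ^= 1  # syn is the flipped position (0 = the parity bit)
        status = "corrected"
    else:
        return "uncorrectable", None  # e.g. two flipped bits
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return status, data
```

Encoding 64 data bits with this scheme yields a 72-bit codeword, matching the 72-bit module width.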

~~~
londons_explore
RAM does go bad over time. It is all factory-tested as good before you receive
it, so any failures you see are during-lifetime failures.

------
tokamak-teapot
I followed a similar route where I eventually got to blaming bad (ECC) RAM,
even sending it back to Crucial. In the end though, it wasn’t bad RAM. It was
a bad RAM socket. This is one reason I’m now allergic to hardware and prefer
to have someone else look after it.

~~~
temac
You had a bad RAM socket that silently corrupted data while you used ECC?

~~~
tokamak-teapot
No, it crashed the machine when I hit certain addresses. Or rather, when
/something/ did. Hitting this part of RAM wasn’t predictable, so it was only
when I noticed that my crashes seemed to happen when the OS might be
allocating more RAM than usual that I looked at the RAM. I saw errors in
memtest86 and blamed the RAM. But then the replacement had a similar problem.
And then I swapped the modules between slots. And then I realised.

