It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.
And that they were sold by HP or Dell, and manufactured by SanDisk.
Do I win a prize?
(None of us win prizes on this one).
Unbelievable. Thank you for sharing your experience!
Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.
But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.
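To put rough numbers on that, here is a minimal sketch of the arithmetic. The power-on timestamps are made up for illustration, and it assumes both drives stayed powered continuously from first boot, which may not match any real server. The only figure taken from the thread is the 40,000 hour threshold.

```python
from datetime import datetime, timedelta

# Threshold reported in the HPE advisories linked in this thread.
FAILURE_HOURS = 40_000

def failure_time(first_power_on: datetime) -> datetime:
    """Estimate when a drive reaches the threshold, assuming it has
    been powered on continuously since first_power_on."""
    return first_power_on + timedelta(hours=FAILURE_HOURS)

# Hypothetical power-on times for two servers racked the same day.
primary = datetime(2018, 1, 10, 9, 0)
standby = datetime(2018, 1, 10, 13, 30)

gap = failure_time(standby) - failure_time(primary)
years = FAILURE_HOURS / 24 / 365.25

print(f"threshold is about {years:.2f} years of uptime")
print(f"primary dies around {failure_time(primary)}")
print(f"standby follows {gap} later")
```

Under those assumptions the gap between the two failures is exactly the gap between the two first boots, regardless of how many writes each disk took. That is why a power-on-hours counter fits "failed within hours of each other" better than wear does.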
Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.
This thread is making me feel a lot less crazy.
This one is just ... maddening.
HPE releases urgent fix to stop enterprise SSDs conking out at 40K hours - https://news.ycombinator.com/item?id=22706968 - March 2020 (0 comments)
HPE SSD flaw will brick hardware after 40k hours - https://news.ycombinator.com/item?id=22697758 - March 2020 (0 comments)
Some HP Enterprise SSD will brick after 40000 hours without update - https://news.ycombinator.com/item?id=22697001 - March 2020 (1 comment)
HPE Warns of New Firmware Flaw That Bricks SSDs After 40k Hours of Use - https://news.ycombinator.com/item?id=22692611 - March 2020 (0 comments)
HPE Warns of New Bug That Kills SSD Drives After 40k Hours - https://news.ycombinator.com/item?id=22680420 - March 2020 (0 comments)
(there's also https://news.ycombinator.com/item?id=32035934, but that was submitted today)
It makes you lose data and forces you to buy new hardware. Where I come from, that's usually referred to as "planned" or "convenient" obsolescence.
Both planned and convenient obsolescence are beneficial to device manufacturers. Without proper accountability, it just becomes normal practice.
The manufacturer, obviously. Who else would it be?
Could be an innocent mistake or a deliberate decision. Further action should be predicated on the root cause, which includes intent.
Of course there's no law that says SSD firmware writers can't be rookies.
Then the people under them who do give a shit, because they depend on those servers, aren’t allowed to register with HP etc. for updates, or to apply firmware updates, because “separation of duties”.
Basically, IT is cancer from the head down.
The lesson I learned stuck: the three replacements went to different arrays, and we never again let drives from the same batch be part of the same array.
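If you want to check an inventory for that, something like the sketch below would do. The array names, serials, and batch IDs are all invented, and how you actually get a batch or lot identifier for a drive depends on the vendor, so treat the input format as an assumption.

```python
from collections import Counter

def shared_batches(drives):
    """Return batch IDs that appear more than once in one array.

    drives is a list of (serial, batch_id) pairs; both fields are
    hypothetical placeholders for whatever your inventory records.
    """
    counts = Counter(batch for _serial, batch in drives)
    return sorted(batch for batch, n in counts.items() if n > 1)

# Made-up inventory: array name -> drives in that array.
arrays = {
    "array-a": [("SN001", "B17"), ("SN002", "B17"), ("SN003", "B22")],
    "array-b": [("SN004", "B17"), ("SN005", "B22"), ("SN006", "B31")],
}

for name, drives in arrays.items():
    dupes = shared_batches(drives)
    if dupes:
        print(f"{name}: more than one drive from batch {', '.join(dupes)}")
```

Here only array-a gets flagged, since two of its drives share batch B17.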