Boeing didn't even test their 787 aircraft for integer overflows, and that's in a safety-critical environment, so I'm not sure I'd expect SSD vendors to be any better.
This same problem also led to the loss of the Deep Impact spacecraft on its extended mission:
"On September 20, 2013, NASA abandoned further attempts to contact the craft.[77] According to chief scientist A'Hearn,[78] the reason for the software malfunction was a Y2K-like problem. August 11, 2013, 00:38:49, was 232 tenth-seconds from January 1, 2000, leading to speculation that a system on the craft tracked time in one-tenth second increments since January 1, 2000, and stored it in an unsigned 32-bit integer, which then overflowed at this time, similar to the Year 2038 problem"
The 787 case is most fascinating in that while the bug is dead simple, a fix is not.
>Most importantly, the company's already working on an update that will patch the software vulnerability -- though there's no word on when its jets will receive it.
My search of DDG turned up nothing about a resolution. Anyone know?
I know what I would recommend, but marketing would not like it ;-)
This exact thing happened to me. I went crazy for a week straight testing every other component of my PC. I was convinced it was the graphics cards drawing too much power. Then it was clear that my OS was corrupted and needed to be reinstalled.
Finally, I found an obscure forum post telling me about a firmware bug happening at ~5K hours of disk usage. I updated the firmware and haven't had an issue since.
I thought the coding/design pattern was to set the initial value of any counter one minute (or whatever eon makes sense in your application) before the roll-over, so you'd see it 'right away' if the rollover was badly handled. It's almost as if you should use specific types with default values...
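A minimal sketch of that pattern in C, assuming a hypothetical 16-bit uptime counter (the names and the ROLLOVER_TEST flag are illustrative, not from any real firmware):

    #include <stdint.h>
    #include <stdio.h>

    /* In test builds, start the counter just short of the wrap point so a
     * badly handled rollover shows up within minutes instead of years. */
    #ifdef ROLLOVER_TEST
    static uint16_t uptime_ticks = UINT16_MAX - 1;
    #else
    static uint16_t uptime_ticks = 0;
    #endif

    static void tick(void)
    {
        uptime_ticks++;   /* wraps to 0 after 65535 -- by design for unsigned */
        printf("uptime_ticks = %u\n", (unsigned)uptime_ticks);
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++)
            tick();       /* in a ROLLOVER_TEST build this crosses the wrap */
        return 0;
    }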
Oh boy. We had something like 5 out of 8 drives fail all at the same time. All of them were affected models bought at the beginning of summer, and they failed a couple of months later.
It was a pretty maddening thing to debug and figure out where the issue was (servers, rack, drives, RAID controllers): 2 different machines, with 2 and 3 failed drives respectively. A week later we found the Intel bulletin about the issue.
They released a fix for existing devices, replaced affected devices that had been bricked, and included the fix as part of the manufacturing process for new devices being built.
It was certainly very inconvenient having to reboot systems while we waited for a fix to exist. Fortunately we didn't lose too many disks before the fault was identified, which reduced the man-hours involved in the DCs.
I'm not sure if any part of this is a scam. A bug, certainly.
If you look elsewhere on the Internet, you'll find people with very old and working HDDs that have rolled over, so I suspect this bug is limited to a small number of drives.
(What that page says about not being able to reset it is... not true.)
Likewise, I'm skeptical of "neither the SSD nor the data can be recovered" --- they just want you to buy a new one.
Tangentially related, I wonder how many modern cars will stop working once the odometer rolls over.
$500 ~ $1k for a JTAG flash. I'm sure they have plenty of other drives that come in and they price at $1k but that take days longer than expected, so it probably all balances itself out eventually.
Anecdotal so you don't have to look elsewhere: I can confirm that at least two of my NAS HD drives have rolled over once. The drives usually do nothing and spin up once every two weeks or so. No problems. Though SMART nominally reports 16 bits, I also have one drive with more than 2^16 hours of operation reported and still counting, so a 16-bit counter is not universal.
Since I run a SMART test every month it is easy to track the hourly progression (and thus rollover) in the event log as the events are reported in POH timing.
> Likewise, I'm skeptical of "neither the SSD nor the data can be recovered" --- they just want you to buy a new one.
Regarding recovery: the FTL (flash translation layer) is likely toast, in which case the data is probably unharmed and still there, but it's basically a giant block-sized jigsaw puzzle. With enough effort, and if all the stars align, sure, you might be able to recover some or all of it.
Regarding un-bricking/reset: you potentially no longer have access to the wear-level data recorded up to that point, so the drive's future integrity/reliability is somewhat dubious.
I used to develop SSD firmware. Remember that these things need to handle power loss at any point in time. We store lots of redundant copies of information on the NAND, so it's just a matter of running the code that rebuilds everything.
This is why I never use drives from the same batch, ideally never the same model, and usually not the same manufacturer. It happens way too regularly that drives start failing around the same time.
I don't see the value in that in most cases, honestly.
If you have, say, a 10-drive wide RAID6 you would need to source drives from 5 manufacturers/batches/models in order to be resilient to that kind of failure. Even if that was feasible that seems horrible to maintain long-term.
Doing a red/blue setup where your red systems use one type of drive and your blue systems use another type of drive seems like it could be reasonably accomplished.
> If you have, say, a 10-drive wide RAID6 you would need to source drives from 5 manufacturers/batches/models in order to be resilient to that kind of failure. Even if that was feasible that seems horrible to maintain long-term.
If anything, it's easier to maintain: all you need to ensure when replacing a drive is that you don't unintentionally end up with too many of one type in the array. In practice, it means you regularly cycle which model you buy for your spares, instead of the often totally counter-productive practice of making extra effort to find a supply of the exact same model.
In effect, most places I've done this, it has simply translated into refilling our spares from the currently most cost-effective model or two, and cycling manufacturers, instead of continuing to buy the same model.
The point is not to religiously prevent any kind of potentially unfortunate mixing, because these errors are fairly rare, but to reduce a very real risk using very simple means.
Over the 20+ years I've been doing this, I've seen at least 4-5 cases where homogeneous RAID arrays have been a major liability. The first one, which taught me to avoid this, was the infamous IBM "Deathstar", where the film on the platters was almost totally scraped off. We had an array that we thankfully didn't lose data from, thanks to backups and careful management once the drives started failing once a week - only for the array to take 4-5 days to rebuild... We didn't lose data, but we lost a lot of time babysitting that system and working around our dependency on it as a precaution.
I started mixing manufacturers after having had near-misses with several arrays with OCZ drives, where it appears to have been firmware problems across drive models.
> Doing a red/blue setup where your red systems use one type of drive and your blue systems use another type of drive seems like it could be reasonably accomplished.
You need multiple systems too, but the point is that every hour a system is down because of an easily avoidable problem is an hour where your system has reduced resilience and capacity. It's trivial to prevent these kinds of errors from taking down a RAID array, so it's pretty pointless not to.
Good luck doing that at scale though. You can mix things up to some degree (and probably should) but if you need thousands of drives you're going to end up with lots from the same batch.
Instead of ordering 5k drives of a single model, you order from 2-3 manufacturers, split the order between 2-3 different models from each, and build each array with one drive of each distinct type.
That's true. But it's easier to sell people on avoiding mixing batches, and it catches the most common reliability issues. I'd never personally trust my own files to drives from a single manufacturer, though - I've seen too many problems with that.
The problem with SSDs is that they are too reliable, and when they fail, they fail deterministically. The usual cause is an intrinsic flaw in the hardware design or firmware, which all SSDs of the same model share equally.
Amen. Bought a pair of brand new disks some years ago, which failed days into the deployment...apparently from a submarine batch. Luckily the array had another, older disk, which kept it up until a replacement arrived.
Brand new disks are particularly troublesome - worth doing a burn-in of hammering them for a few days (or longer if you can take the time) to weed out the worst ones.
Using different SSD vendors is impossible with HP servers and controllers; they only talk to their own expensive gear. So the disk-diversity option is off the table for HP customers.
We have a cluster of four nodes that were all set up and brought online within hours of each other. The entire cluster could blow up within a couple of hours if not patched.
> By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.
This seems incredibly rich. If you have a bunch of this kit, and you don't immediately shut it down to apply firmware updates, then HPE wash their hands of the consequences.
One of the general best practices is to have diversity in the drives of an array. It's not primarily for bugs like this, though it does help here; it's to ensure that not all disks fail at the same time.
If you use disks from the same batch in a RAID, they would all begin to fail around the same time, because all of them have the same lifetime more or less.
> HPE was notified by a Solid State Drive (SSD) manufacturer [...]
That's a curious bit of context. It seems to imply they're shifting some of the blame onto their manufacturer. It makes me wonder if this firmware is 100% HPE-specific, or if there's a 2^16-hours bug about to bite a bunch of other pipelines.
Of course HPE doesn't write their own firmware from scratch. It's likely just whitelabeled by the drive manufacturer. HPE is just a reseller of existing OEM drives, like all other enterprise server manufacturers are.
There might have been some small changes, but the bulk of the code is likely copy pasted from a common codebase that is shared across models or even families. There's no way they're rewriting the entire controller code just for one customer, it doesn't make sense technically or from a business perspective.
This is a 2^15 hours bug, not a 2^16 hours bug. Odds are, somewhere in the SSD firmware source code, there's a missing "unsigned". And in the Makefile, probably a missing "-Wextra".
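As a toy illustration of that guess (hypothetical code, not the actual firmware): a 16-bit hours counter declared signed goes negative at 32,768, while its unsigned counterpart keeps counting to 65,535.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical power-on-hours counter, once declared signed by
         * mistake and once unsigned. */
        int16_t  hours_signed   = 32767;   /* INT16_MAX */
        uint16_t hours_unsigned = 32767;

        hours_signed++;    /* promoted to int, converted back: in practice wraps
                              to -32768 (for plain int the overflow itself would
                              be undefined behavior) */
        hours_unsigned++;  /* well-defined: 32768, keeps counting to 65535 */

        printf("signed:   %d\n", (int)hours_signed);        /* typically -32768 */
        printf("unsigned: %u\n", (unsigned)hours_unsigned); /* 32768 */
        return 0;
    }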
Using signed integers for values that are always positive isn't necessarily a mistake.
Most notably because for signed integers (in C and C++) overflow is undefined behavior. This allows for more aggressive optimizations by the compiler.
Some advice I've read is to only use unsigned integers if you want to explicitly opt-in to having overflow be defined behavior.
> Some advice I've read is to only use unsigned integers if you want to explicitly opt-in to having overflow be defined behavior.
I read somewhere that the true reason for that advice is that it allows the compiler to silently store an "int" loop counter in a 64-bit register, without having to care about 32-bit overflow. If you use size_t for the loop counter, that's no longer an issue.
It commonly factors in loop analysis around unrolling and vectorization, e.g. the loop will run exactly 16*n times OR an integer will overflow and it'll run some other number of times.
Use of signed precludes the overflow and the exact bound enables efficient vectorization.
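A small sketch of that point (made-up function names; the actual effect depends on the compiler and target):

    /* With a signed 'int' counter the compiler may assume i never overflows,
     * so the loop runs exactly n times and i can be kept/widened in a 64-bit
     * register for free -- friendly to unrolling and vectorization. */
    void scale_signed(float *a, long long n, float k)
    {
        for (int i = 0; i < n; i++)
            a[i] *= k;
    }

    /* With 'unsigned int', wrap-around at 2^32 is defined behavior, so the
     * optimizer must also consider the case where i wraps before reaching n. */
    void scale_unsigned(float *a, long long n, float k)
    {
        for (unsigned int i = 0; i < n; i++)
            a[i] *= k;
    }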
Undefined behavior is platform-specific behavior. In this case it means that rollover’s effect on the value depends on how the register stores the integer and on how the particular instruction is documented to behave.
No, undefined behavior is specified by the standard. There is also 'implementation defined' which tends to be both architecture and compiler specific.
The difference is that a program which invokes 'implementation defined' behavior can be well defined, whereas a program that invokes undefined behavior is literally free to do anything.
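A quick example of the distinction, for what it's worth:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* Implementation-defined: the result is whatever the implementation
         * documents, e.g. right-shifting a negative value. The program is
         * still well defined. */
        int shifted = -8 >> 1;   /* -4 on common two's-complement targets */
        printf("shifted = %d\n", shifted);

        /* Undefined: if the next line ran, the standard would place no
         * requirements on the program at all, and the optimizer is allowed
         * to assume it never happens:
         *
         *     int x = INT_MAX; x += 1;   // undefined behavior
         */
        return 0;
    }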
> Odds are, somewhere in the SSD firmware source code, there's a missing "unsigned". And in the Makefile, probably a missing "-Wextra".
Our industry really sucks. We need languages where this can't happen, and we need testing procedures where these things are caught. I wonder if software is the industry with the lowest quality:importance ratio.
Presumably the bug would still be considered a bug if it occurred at 65,536 hours? The incorrectly-signed bit only makes it appear sooner, but it's not the bug.
If you purchase a Carepaq, can you actually extend the service beyond that?
3-, 4-, and 5-year 24x7 Carepaqs are available on purchase for the _whole_ chassis and parts... but I wonder if I can keep purchasing 1-year Carepaq extensions beyond the 6th year.
I'm always baffled when companies try to pass these issues off onto one of their sub-vendors - especially for critical issues like this.
I didn't choose YOUR sub-vendor. You did. It's your responsibility to ensure that sub-vendor is operating at your standards. Passing blame to a sub-vendor indicates an unwillingness to take accountability.
I probably wouldn't even call it blame. Certainly if HPE isn't doing a full firmware audit (which I don't expect them to do), there's no way to run into this issue until it shows up in life testing. The manufacturer/supplier would be in the best position to encounter these types of issues first.
>The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.
Looks like some sort of run time stored in a signed 2 byte integer. Oops.
But how does that brick the device? I guess the hour counter overflows, goes negative, and that screws up a calculation later on, causing the firmware to crash (over and over again...)
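Purely speculative, with made-up names (the real failure mode isn't public), but here's one way a wrapped counter could turn into a crash on every boot:

    #include <stdint.h>
    #include <stdio.h>

    #define HOURS_PER_YEAR 8760

    /* Hypothetical: wear statistics bucketed by year of operation. Once a
     * signed 16-bit hours counter wraps to -32768, the bucket index goes
     * negative, and an unchecked table access with it would fault on every
     * boot, re-reading the same bad counter from NAND each time. */
    static void log_wear(int16_t power_on_hours)
    {
        int bucket = power_on_hours / HOURS_PER_YEAR;
        printf("hours=%d -> bucket %d%s\n", (int)power_on_hours, bucket,
               bucket < 0 ? "  (out of bounds: firmware would fault here)" : "");
    }

    int main(void)
    {
        log_wear(32767);    /* bucket 3: fine */
        log_wear(-32768);   /* bucket -3: the kind of index that bricks the drive */
        return 0;
    }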
If only SSD vendors would do the usual cost-cutting measure of loading firmware from the host computer, this could be trivially fixed.
This so much. Even if you hate uefi with the fire of a thousand suns, there are some good things there. Like GPT, and the UEFI capsule thing that fwupd uses.
HP actually has a working firmware update mechanism for all their gear. It's a bootable Linux live DVD that starts into a browser talking to a local Tomcat instance, which applies the necessary patches. In many cases it's also possible to invoke patching from your normal Linux installation. However, a reboot is usually still necessary, e.g. for disk firmware, which the controller applies after its own new firmware has been loaded (sometimes it takes more than one reboot).
The system is quite a lot older than fwupd and usually less flaky. Google for hpsum or HP SPP.
> It looks like such a bug isn't necessarily SSD specific if it completely bricks the drive.
Maybe. OTOH, plenty of people have been running spinning-rust drives with way more than 4 years of power-on operation - if this bricking bug were common there, I'm pretty sure we would've noticed. SSDs are a newer tech, and it's more common to replace them anyway as specs improve.
There was actually a similar drive-bricking bug in the SMART implementation of some of Seagate's hard drives about a decade ago. Not quite as deterministic as this one, but essentially uptime-dependent. After denying it for a while, they finally fessed up once a member of the public figured out that affected drives could be unbricked by connecting over the serial debug interface and wiping the SMART log. They ended up offering firmware updates and free unbricking of affected drives at their cost, including shipping to their facility. (I don't think the unbricking process was terribly easy either - the publicly-known version required booting the drive with the heads disconnected from the controller to stop it from reading the data, or it'd get stuck in an infinite loop and not respond on serial.)
I would think the typical HPE customer (e.g. us) buys servers and uses them for between 3-5 years, before buying new servers. The old (out of warranty) servers might be discarded, or might be reused as test hardware.
Non-tech Fortune 500 IT shops regularly see their refresh budget cut in favor of new projects. Seeing some amount of 5-, 7-, or 10+ year-old hardware still in service isn't unusual.
Oh I'm not implying it's a common bug. I was just saying that there's nothing in the description that makes it look SSD exclusive. To completely brick the drive sounds more like a controller failure. So such a bug could just as easily kill an HDD controller.
SSDs have insanely complex firmware when compared to HDDs so letting this kind of bug slip through is an easier mistake to make in their case.
I read OP's comment as "if this bug happened to a non-enterprise (regular) user and their drive". A regular user has a lower chance of hitting 4 years of non-continuous operation before discarding the drive due to obsolescence. On a 50% duty cycle (12 h/day every day) it would take about 7.5 years. That's close to how long many people would hang on to their drives. But even a regular user might have a NAS, and that accelerates the process.
Unfortunately, even if you operate these drives continuously from the day you buy them, they will take 3 years, 270 days and 8 hours to fail (as someone else kindly calculated), so a 3-year warranty won't help you in this case...
Whatever the counter is, the fact that it's 32,768 instead of 65,536 suggests they used a signed int for something that presumably starts at zero and increases monotonically... Avoiding just that mistake would've given them twice as much time - nearly 7.5 years - which seems like it'd be longer than these drives would typically last anyway.
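For reference, the arithmetic behind those figures:

    #include <stdio.h>

    int main(void)
    {
        const double hours_per_year = 24 * 365.25;
        printf("32768 h = %.2f years\n", 32768 / hours_per_year);  /* ~3.74 */
        printf("65536 h = %.2f years\n", 65536 / hours_per_year);  /* ~7.48 */
        return 0;
    }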
Yup. Those were power-hungry times, without deep sleep modes, or even useful hibernation. Heck, most machines then didn't even bother to reduce clock speed when idle.
Linux wasn't so great in 1995 either. We regularly rebooted for various kernel, ip stack, etc, bugs that would crop up after a fairly short amount of uptime.
Beats me, but I happen to have a fleet of HP SATA (not SAS) drives and they just crossed this boundary - 32,813 power-on hours is typical. I guess if their SATA firmware had this bug I'd be having a bad week.
>By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.
How does this work legally? For one, how would HPE prove that the customer read the bulletin? I don't imagine they're sending these out via certified mail.
This is for HPE (not HP, which is now a separate company). I haven't heard anything about HP (which makes desktops, not storage arrays) experiencing this problem.
All of the impacted devices in this issue are SAS-connected enterprise-class SSDs. While it's not _impossible_ that someone installed a SAS controller and used SSDs that can cost upwards of 10x the price of an equivalent-capacity desktop model using SATA... it's probably pretty unlikely.
Do they still not test these things with artificially incremented counters?