HPE drives fail at 32,768 hours without firmware update (hpe.com)
229 points by abarringer on Nov 26, 2019 | 114 comments



Those who forget history are doomed to repeat it. Just seven years ago Crucial sold tens of thousands of their "M4" SSDs with a firmware bug that made them fail after 5184 hours: https://www.anandtech.com/show/5424/crucial-provides-a-firmw...

Do they still not test these things with artificially incremented counters?


Boeing didn't even test their 787 aircraft for integer overflows, and that's in a safety-critical environment, so I'm not sure I'd expect SSD vendors to be any better.

https://www.engadget.com/2015/05/01/boeing-787-dreamliner-so...

Not that throwing an exception on integer overflow is any better, unless you catch the exception. The classic example here is the Ariane 5 failure:

http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html


This same problem also led to the loss of the Deep Impact spacecraft on its extended mission:

"On September 20, 2013, NASA abandoned further attempts to contact the craft.[77] According to chief scientist A'Hearn,[78] the reason for the software malfunction was a Y2K-like problem. August 11, 2013, 00:38:49, was 232 tenth-seconds from January 1, 2000, leading to speculation that a system on the craft tracked time in one-tenth second increments since January 1, 2000, and stored it in an unsigned 32-bit integer, which then overflowed at this time, similar to the Year 2038 problem"

https://en.wikipedia.org/wiki/Deep_Impact_(spacecraft)#Conta...


Your superscript got eaten. 2^32


The 787 case is most fascinating in that while the bug is dead simple, a fix is not.

>Most importantly, the company's already working on an update that will patch the software vulnerability -- though there's no word on when its jets will receive it.

My search of DDG turned up nothing about a resolution. Anyone know?

I know what I would recommend, but marketing would not like it ;-)


This exact thing happened to me. I went crazy for a week straight testing every other component of my PC. First I was convinced it was the graphics cards drawing too much power. Then I was sure my OS was corrupted and needed to be reinstalled.

Finally, I found an obscure forum post telling me about a firmware bug happening at ~5K hours of disk usage. I updated the firmware and haven't had an issue since.


I thought the coding/design pattern was to set the initial value of any counter one minute (or whatever eon makes sense in your application) from the roll-over, so you'd see it 'right away' if it was badly handled. The idea being that you use dedicated counter types whose default values expose rollover early...
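
A rough sketch of that pattern in C (the macro, the counter width, and the starting value are all made up for illustration; real firmware would use whatever counter and offset it actually has):

    #include <stdint.h>
    #include <stdio.h>

    #define ROLLOVER_TEST 1   /* hypothetical build flag for burn-in/test images only */

    static uint16_t power_on_hours_init(void)
    {
    #if ROLLOVER_TEST
        /* Start one tick before the 16-bit wrap, so mishandled rollover
         * shows up in the first hour of testing, not after ~7.5 years. */
        return UINT16_MAX;
    #else
        return 0;
    #endif
    }

    int main(void)
    {
        uint16_t poh = power_on_hours_init();
        poh++;   /* unsigned wrap is well defined: 65535 -> 0 */
        printf("POH after one tick: %u\n", (unsigned)poh);
        return 0;
    }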


Some early Intel SSDs did the same thing, prior to the M4s... haha


Do you have a link for the old Intel bug? Here's one for a new Intel bug after just 1700 power-on hours on some enterprise-class SSDs that are still being sold today: https://www.intel.com/content/www/us/en/support/articles/000...

That's just 71 days of uptime and they hang. There are tens of thousands of these drives deployed as well.


Oh boy. We had something like 5 out of 8 drives fail all at the same time. All of them were affected models bought at the beginning of summer that failed a couple of months later.

It was a pretty maddening thing to debug and figure out where the issue was (servers, rack, drives, RAID controllers). Two different machines, with 2 and 3 failed drives respectively. A week later we found the Intel bulletin about the issue.

Thank God for pgbackrest backups.


Why are they allowed to still sell these broken products? That's a scam as far as I'm concerned.


They released a fix for existing devices, replaced affected devices that had been bricked, and included the fix as part of the manufacturing process for new devices being built.

It was certainly very inconvenient having to reboot systems while we waited for a fix to exist. Fortunately we didn't lose too many disks before the fault was identified, which reduced the man-hours involved in the DCs.

I'm not sure if any part of this is a scam. A bug, certainly.


New drives already had the updated firmware back in May when I bought some.


According to this page, the SMART hour counter is only 16 bits, and rollover should be harmless:

http://www.stbsuite.com/support/virtual-training-center/powe...

If you look elsewhere on the Internet, you'll find people with very old and working HDDs that have rolled over, so I suspect this bug is limited to a small number of drives.

(What that page says about not being able to reset it is... not true.)

Likewise, I'm skeptical of "neither the SSD nor the data can be recovered" --- they just want you to buy a new one.

Tangentially related, I wonder how many modern cars will stop working once the odometer rolls over.


>Likewise, I'm skeptical of "neither the SSD nor the data can be recovered"

If the firmware crashes during boot with a negative hour counter, it could probably only be fixed by manually flashing new firmware over JTAG.


...and likely some of the data recovery companies already know about and are prepared for this.


$500 ~ $1k for a JTAG flash. I'm sure they have plenty of other drives that come in and they price at $1k but that take days longer than expected, so it probably all balances itself out eventually.


Or they swap boards and you eventually end up with the same problem...


In an SSD the board is the drive.


Anecdotal, so you don't have to look elsewhere: I can confirm that at least two of my NAS HDDs have rolled over once. The drives usually do nothing and spin up once every two weeks or so. No problems. Though SMART nominally only gives the counter 16 bits, I also have one drive reporting more than 2^16 hours of operation and still counting, so the 16-bit limit is not universal.

Since I run a SMART test every month it is easy to track the hourly progression (and thus rollover) in the event log as the events are reported in POH timing.


> Likewise, I'm skeptical of "neither the SSD nor the data can be recovered" --- they just want you to buy a new one.

Regarding recovery: the FTL is likely toast, in which case, while the data is probably unharmed and still there, it's basically a giant block-sized jigsaw puzzle. With enough effort, and if all the stars align, sure, you might be able to recover some or all of it.

Regarding un-bricking/reset: potentially there's no longer any access to the wear-level data from before the failure, so future integrity/reliability is somewhat dubious.


I used to develop SSD firmware. Remember that these things need to handle power loss at any point in time. We store lots of redundant copies of information on the NAND, so it's just a matter of running the code that rebuilds everything.


Since most drives in an array are started and used concurrently, this bug would blow up any RAID setup. There's a dark day coming for some sysadmins.


This is why I never use drives from the same batch, ideally never the same model, and usually not the same manufacturer. It happens way too regularly that drives start failing around the same time.


I don't see the value in that in most cases, honestly.

If you have, say, a 10-drive wide RAID6 you would need to source drives from 5 manufacturers/batches/models in order to be resilient to that kind of failure. Even if that was feasible that seems horrible to maintain long-term.

Doing a red/blue setup where your red systems use one type of drive and your blue systems use another type of drive seems like it could be reasonably accomplished.


> If you have, say, a 10-drive wide RAID6 you would need to source drives from 5 manufacturers/batches/models in order to be resilient to that kind of failure. Even if that was feasible that seems horrible to maintain long-term.

If anything, it's easier to maintain, as all you need to ensure on replacing a drive is to not unintentionally make the array have too many of one type of drive. In practice, it means you just regularly cycle what model you buy for your spares instead of the often totally counter-productive practice of making extra effort to find a supply of the exact same model.

In effect, most places I've done this, it has simply translated into refilling our spares from the currently most cost-effective model or two, and cycling manufacturers, instead of continuing to buy the same model.

The point is not to religiously prevent any kind of potentially unfortunate mixing, because these errors are fairly rare, but to reduce a very real chance using very simple means.

Over the 20+ years I've been doing this, I've seen at least 4-5 cases where homogeneous RAID arrays have been a major liability. The first one, the one that taught me to avoid this, was the infamous IBM Death Star, where the film on the platters was almost totally scraped off; we had an array that we thankfully didn't lose data from, thanks to backups and careful management once the drives started failing once a week - only for the array to take 4-5 days to rebuild each time... we didn't lose data, but we lost a lot of time babysitting that system and working around our dependency on it as a precaution.

I started mixing manufacturers after having had near-misses with several arrays with OCZ drives, where it appears to have been firmware problems across drive models.

> Doing a red/blue setup where your red systems use one type of drive and your blue systems use another type of drive seems like it could be reasonably accomplished.

You need multiple systems too, but the point is that every hour a system is down because of an easily avoidable problem is an hour where your system has reduced resilience and capacity. It's trivial to prevent these kinds of errors from taking down a raid array, so it's pretty pointless not to.


Good luck doing that at scale though. You can mix things up to some degree (and probably should) but if you need thousands of drives you're going to end up with lots from the same batch.


Instead of ordering 5k drives of a single model, you order from 2-3 manufacturers, split the order between 2-3 different models from each, and build each array with one drive of each distinct type.


> I never use drives from the same batch

Note that it wouldn't help in this instance, as the bug is triggered by the amount of time a drive has been running. Different manufacturers would work, yes.


That's true. But it's easier to sell people on avoiding mixing batches, and it catches the most common reliability issues. I'd never personally trust my own files to drives from a single manufacturer, though - I've seen too many problems with that.


The problem with SSDs is that they are too reliable and when they fail they fail reliably. The only reason why they fail is usually an intrinsic flaw in the hardware design or firmware which all SSDs of the same model share equally.


Need to age a few of the drives by a hundred hours before putting them in the set.


Amen. Bought a pair of brand new disks some years ago, which failed days into the deployment...apparently from a submarine batch. Luckily the array had another, older disk, which kept it up until a replacement arrived.


Brand new disks are particularly troublesome - worth doing a burn-in of hammering them for a few days (or longer if you can take the time) to weed out the worst ones.


Mixing SSD vendors is impossible with HP servers and controllers; they only talk to their own expensive gear. So the disk-diversity option is off the table for HP customers.


That would be a deal-breaker for me in choosing HP servers then, as that just seems like begging for trouble.


That's only if the sysadmin was trusting a single server with the data, instead of a pair of redundant servers.

Which were probably installed and started up at nearly the same time. Oops.

This bug has the potential of simultaneously damaging whole sets of servers, if they were bought and installed in bulk. Dark day indeed.


We have a cluster of four nodes that were all set up and brought online within hours of each other. The entire cluster could blow up within a couple of hours if not patched.


I guess cluster nodes should be scheduled to be taken down for random amounts of time so that they fail in sequence more gracefully.

Have an “off on weekends” node. And 24/7 nodes.


> By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.

This seems incredibly rich. If you have a bunch of this kit, and you don't immediately shut it down to apply firmware updates, then HPE wash their hands of the consequences.


One of the general best practices is to have diversity in the array of drives. It's not primarily for bugs like this, though it does help with them; it's to ensure that not all disks fail at the same time.

If you use disks from the same batch in a RAID, they will all begin to fail around the same time, because all of them have more or less the same lifetime.


> HPE was notified by a Solid State Drive (SSD) manufacturer [...]

That's a curious bit of context. It seems to imply they're shifting some of the blame onto their manufacturer? It makes me wonder if this firmware is 100% HPE-specific, or if there's a 2^16-hour bug about to bite a bunch of other pipelines.


Of course HPE doesn't write their own firmware from scratch. It's likely just whitelabeled by the drive manufacturer. HPE is just a reseller of existing OEM drives, like all other enterprise server manufacturers are.


It is possible, though, that this firmware was written specifically for HPE by the OEM.


There might have been some small changes, but the bulk of the code is likely copy pasted from a common codebase that is shared across models or even families. There's no way they're rewriting the entire controller code just for one customer, it doesn't make sense technically or from a business perspective.


The HP specific part is just to make sure that the HP disk vendor lockin works: HP controllers only talk to HP branded disks...


This is a 2^15 hours bug, not a 2^16 hours bug. Odds are, somewhere in the SSD firmware source code, there's a missing "unsigned". And in the Makefile, probably a missing "-Wextra".
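
For concreteness, a minimal sketch of that suspected bug class in C - the actual firmware isn't public, so the variable name and width here are guesses:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int16_t poh = 32767;        /* INT16_MAX: 3 years, 270 days, 7 hours */
        poh = (int16_t)(poh + 1);   /* wraps to -32768 on typical two's-complement targets */
        printf("power-on hours: %d\n", poh);

        /* Anything downstream that assumes poh >= 0 (indexing, wear-level math,
         * log timestamps) now misbehaves; if that code runs at boot, the drive
         * never comes back up. */
        return 0;
    }

With an unsigned counter, and with -Wextra's sign-comparison warnings enabled, this class of mistake at least has a chance of being caught at compile time.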


Using signed integers for values that are always positive isn't necessarily a mistake. Most notably because for signed integers (in C and C++) overflow is undefined behavior. This allows for more aggressive optimizations by the compiler.

Some advice I've read is to only use unsigned integers if you want to explicitly opt-in to having overflow be defined behavior.
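
A minimal illustration of the distinction, assuming C:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int u = UINT_MAX;
        u = u + 1;                  /* defined: wraps to 0 (arithmetic modulo 2^N) */
        printf("unsigned wrap: %u\n", u);

        int s = INT_MAX;
        /* s = s + 1;  undefined behavior: the compiler may assume it never
         * happens, which is exactly what enables the optimizations above. */
        (void)s;
        return 0;
    }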


> Some advice I've read is to only use unsigned integers if you want to explicitly opt-in to having overflow be defined behavior.

I read somewhere that the true reason for that advice is that it allows the compiler to silently store an "int" loop counter in a 64-bit register, without having to care about 32-bit overflow. If you use size_t for the loop counter, that's no longer an issue.


I haven't seen an example where that's the case.

It commonly factors in loop analysis around unrolling and vectorization, e.g. the loop will run exactly 16*n times OR an integer will overflow and it'll run some other number of times.

Use of signed precludes the overflow and the exact bound enables efficient vectorization.
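
A hedged sketch of the kind of loop that argument is about (the function and array are made up):

    #include <stdio.h>

    /* With a signed index the compiler may assume "i += 4" never overflows,
     * so for n > 0 the loop runs exactly (n + 3) / 4 times and can be
     * unrolled/vectorized on that exact bound.  With an unsigned index,
     * wraparound is defined (e.g. if n were near UINT_MAX), so the exact
     * trip-count analysis is blocked. */
    static void scale(float *a, int n)
    {
        for (int i = 0; i < n; i += 4)
            a[i] *= 2.0f;
    }

    int main(void)
    {
        float a[8] = {1, 1, 1, 1, 1, 1, 1, 1};
        scale(a, 8);
        printf("%.1f %.1f\n", a[0], a[1]);   /* 2.0 1.0: every 4th element scaled */
        return 0;
    }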


At least on x86 that truncation behavior is free: 32-bit EAX is just the low 32 bits of RAX.


Undefined behavior is platform-specific behavior. In this case it means that rollover’s effect on the value depends on how the register stores the integer and on how the particular instruction is documented to behave.


No, undefined behavior is specified by the standard. There is also 'implementation defined' which tends to be both architecture and compiler specific.

The difference is that a program which invokes 'implementation defined' behavior can be well defined, whereas a program that invokes undefined behavior is literally free to do anything.


> This allows for more aggressive optimizations by the compiler.

And also more useful warnings from static analysis, since if the analysis can prove that the value will overflow this is guaranteed to be an error.


> Odds are, somewhere in the SSD firmware source code, there's a missing "unsigned". And in the Makefile, probably a missing "-Wextra".

Our industry really sucks. We need languages where this can't happen, and we need testing procedures where these things are caught. I wonder if software is the industry with the lowest quality:importance ratio.


Presumably the bug would still be considered a bug if it occurred at 65,536 hours? The incorrect signedness only makes it appear sooner; it's not the bug itself.


Someone says below that a rollover to zero may work OK, but negative times do not, for whatever reason.


Oh! So in that case owners just need to wait 32,768 hours and they'll be fine. /s


That's 7.4 years, and well outside warranty.


If you purchase a Carepaq, can you actually extend the service beyond that?

3-, 4-, and 5-year 24x7 Carepaqs are available at purchase for the _whole_ chassis and parts... but I wonder if I can keep purchasing 1-year Carepaq extensions beyond the 6th year.


Yes, it's possible. However, prices go up with age.


2^16 would still be rather too small; 65,536 hours is not an impossible duration.


I'm always baffled when companies try to pass these issues off onto one of their sub-vendors - especially for critical issues like this.

I didn't choose YOUR sub-vendor. You did. It's your responsibility to ensure that sub-vendor is operating at your standards. Passing blame to a sub-vendor indicates an unwillingness to take accountability.


> an unwillingness to take accountability.

I mean, that's no shock coming from HP.


I probably wouldn't even call it blame. Certainly if HPE isn't doing a full firmware audit (which I don't expect them to do), there's no way to run into this issue until it shows up in life testing. The manufacturer/supplier would be in the best position to encounter these types of issues first.


>The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.

Looks like some sort of running-time counter stored in a signed 2-byte integer. Oops.


It's probably the SMART "hours of operation" field. I see no reason for anything else to be stored in units of hours instead of seconds.

Yes, this means that a field meant for diagnosing failures was responsible for a failure. Oops.


But how does that brick the device? I guess the hour counter overflows, goes negative and that screws up a calculation later on, causing the firmware to crash (over and over again..)

If only SSD vendors would do the usual cost-cutting measure of loading firmware from the host computer, this could be trivially fixed.


Please no. I may actually want to boot from one of those devices.


Somewhere in the UEFI kitchen sink there certainly is firmware-loading support already.


What do you mean? The fix is to load a firmware update from the host computer.


Not once you are past 32768 hours, apparently.


Usually the SMART counter just wraps around back to 0. In this case it becomes negative because it was read as a signed short.


Would be nice if the standard firmware update mechanism on Linux (fwupd/LVFS) could be used for HPE products.

https://fwupd.org/lvfs/vendors/ https://fwupd.org/lvfs/devices/


This, so much. Even if you hate UEFI with the fire of a thousand suns, there are some good things there, like GPT and the UEFI capsule mechanism that fwupd uses.


HP actually has a working firmware update mechanism for all their gear. It's a bootable Linux live DVD that starts into a browser talking to a local Tomcat instance, which applies the necessary patches. In many cases it's also possible to invoke patching from your normal Linux installation. However, a reboot is usually still necessary, e.g. for disk firmware, which the controller applies after its own new firmware has been loaded (sometimes this takes more than one reboot).

The system is quite a lot older than fwupd and usually less flaky. Google for hpsum or HP SPP.


Yes, hpsum / SPP is what we use now. Not happy about it.


Ouch. I wonder how many non-enterprise SSDs come with similar bugs, and zero support from the firmware vendor.


> neither the SSD nor the data can be recovered

It looks like such a bug isn't necessarily SSD specific if it completely bricks the drive.

And while 32,768 hours may seem like a long time for a drive, it's under 4 years of continuous operation. Not unheard of if used in a NAS.


> It looks like such a bug isn't necessarily SSD specific if it completely bricks the drive.

Maybe. OTOH, plenty of people have been running spinning-rust drives with way more than 4 years of power-on operation - if this bricking bug were common there, I'm pretty sure we would've noticed. SSDs are a newer tech and it's more common to replace them anyway as specs improve.


There was actually a similar drive-bricking bug in the SMART implementation of some of Seagate's hard drives about a decade ago. Not quite as deterministic as this one, but essentially uptime-dependent. After denying it for a while, they finally fessed up once a member of the public figured out that affected drives could be unbricked by connecting over the serial debug interface and wiping the SMART log. They ended up offering firmware updates and free unbricking of affected drives at their cost, including shipping to their facility. (I don't think the unbricking process was terribly easy either - the publicly known version required booting the drive with the heads disconnected from the controller to stop it from reading the data, or it'd get stuck in an infinite loop and not respond on serial.)


I have one of these drives. Apparently a model that the known procedure doesn't work on. I'll never buy a shitty Seagate product again.


I would think the typical HPE customer (e.g. us) buys servers and uses them for between 3-5 years, before buying new servers. The old (out of warranty) servers might be discarded, or might be reused as test hardware.


Non tech Fortune 500 IT shops regularly see their refresh budget cut in favor of new projects. Seeing some amount of 5,7,10+ year old hardware still in service isn't unusual.


Oh I'm not implying it's a common bug. I was just saying that there's nothing in the description that makes it look SSD exclusive. To completely brick the drive sounds more like a controller failure. So such a bug could just as easily kill an HDD controller.

SSDs have insanely complex firmware when compared to HDDs so letting this kind of bug slip through is an easier mistake to make in their case.


It's the drive firmware. Drive firmware bricks the disk because the disk is soldered into the drive.

It doesn't need to be continuous. Total operation is the metric.


I read OP's comment as "if this bug happened to a non-enterprise (regular) user and their drive". A regular user has a lower chance of hitting 4 years of accumulated power-on time before discarding the drive due to obsolescence. On a 50% duty cycle (12h/day, every day), 32,768 hours takes about 7.5 years to reach. That's close to how long many people would hang on to their drives. But even a regular user might have a NAS, and that accelerates the process.


Outside of the deep-pocket money-doesn't-matter sized enterprises, 4 years can be less than half the expected lifetime for IT kit.


I'm guessing it's not a particularly productive way to store timestamps.


That's a very coincidental reason to go for a 3 year warranty.


Unfortunately, even if you operate these drives continuously from the day you buy them, they will take 3 years, 270 days and 8 hours to fail (as someone else kindly calculated), so a 3-year warranty won't help you in this case...


Whatever the counter is, the fact that it's 32,768 instead of 65,536 suggests they used a signed int for something that presumably starts at zero and increases monotonically... Avoiding just that mistake would've given them twice as much time - nearly 7.5 years - which seems like it'd be longer than these drives would typically last anyway.


It would have avoided the problem in the first place because the SMART counter is allowed to roll over back to 0.


Maybe they're running on a 15-bit architecture where a signed int would be 16384?

/s


Amazing. A repeat of the "Windows 95 crashes after 49.7 days of uptime" and other timer rollover bugs.


I’ve always appreciated the humor of the fact that Win95 was so unstable that no one noticed this bug until years later.


It also used to be very common for people to turn off their computers when they were done using them.


Yup. Those were power-hungry times, without deep sleep modes, or even useful hibernation. Heck, most machines then didn't even bother to reduce clock speed when idle.


"It is now safe to turn off this computer"

I remember when I saw my first ATX computer, and it turned itself off. That was cool.


The CPUs used hardly any power so there wasn't much point throttling them. The rest of the computer and the CRT used loads though.


Linux wasn't so great in 1995 either. We regularly rebooted for various kernel, ip stack, etc, bugs that would crop up after a fairly short amount of uptime.

Our Sun workstations were very stable though.


I must have had great luck. My sister and I had an HP Win95 machine for gaming growing up. It never crashed. But it also never got weird software etc.


I just want to know how many of these failed at 32768 hours before they had their oh sh*t moment.


Beats me, but I happen to have a fleet of HP SATA (not SAS) drives and they just crossed this boundary, at around 32,813 power-on hours typically. I guess if their SATA firmware had this bug, I'd be having a bad week.


>By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.

How does this work legally? For one, how would HPE prove that the customer read the bulletin? I don't imagine they're sending these out via certified mail.


At the hospital where I work, almost all HP desktops crashed within a few months...


This is for HPE (not HP, which is now a separate company). I haven't heard anything about HP (who makes desktops not storage arrays) experiencing this problem.


Considering 900 drives out of 1800 crashed in HP computers at a Swedish hospital in the last few months I suspect there is a connection.

Maybe HP and HPE were more tightly connected 3 years, 270 days 8 hours ago.


All of the impacted devices in this issue are SAS-connected enterprise-class SSDs. While it's not _impossible_ that someone installed a SAS controller and used SSDs that can cost upwards of 10x the price of an equivalent-capacity desktop model using SATA... it's probably pretty unlikely.


Are they sure that's hardware-related, and not some poorly written ransomware?



I did some recon on eBay looking for used units w/ the affected SKUs for sale and they appear to be Samsung units.


Whew, dodged that bullet, looks like I'm not using any of the affected drives. Lucky me, for now.


Yikes! This is why when I built my home NAS I used five different drives and manufacturers.



