
HPE drives fail at 32,768 hours without firmware update - abarringer
https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us
======
jzwinck
Those who forget history are doomed to repeat it. Just seven years ago Crucial
sold tens of thousands of their "M4" SSDs with a firmware bug that made them
fail after 5184 hours:
[https://www.anandtech.com/show/5424/crucial-provides-a-firmware-update-for-m4-to-fix-the-bsod-issue](https://www.anandtech.com/show/5424/crucial-provides-a-firmware-update-for-m4-to-fix-the-bsod-issue)

Do they still not test these things with artificially incremented counters?
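
For the record, this is cheap to test. A minimal sketch in C of such a
harness, with hypothetical names (poh_t, firmware_hourly_tick - nobody's
actual firmware): inject a value just below a power-of-two boundary and
step across it, instead of burning years of wall-clock time.

    #include <assert.h>
    #include <stdint.h>

    typedef int16_t poh_t;          /* the suspected bug: signed 16 bits */
    static poh_t poh;               /* power-on hours */

    static void firmware_hourly_tick(void) { poh++; }

    int main(void) {
        poh = 32760;                /* inject: just below 2^15 */
        for (int i = 0; i < 16; i++) {
            poh_t before = poh;
            firmware_hourly_tick();
            /* At hour 32768 the value no longer fits in int16_t; on
               two's-complement targets it comes back as -32768, and
               this monotonicity check fires. */
            assert(poh > before);
        }
        return 0;
    }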

~~~
mhandley
Boeing didn't even test their 787 aircraft for integer overflows, and that's
in a safety-critical environment, so I'm not sure I'd expect SSD vendors to be
any better.

[https://www.engadget.com/2015/05/01/boeing-787-dreamliner-software-bug/](https://www.engadget.com/2015/05/01/boeing-787-dreamliner-software-bug/)

Not that throwing an exception on integer overflow is any better, unless you
catch the exception. The classic example here is the Ariane 5 failure:

[http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html](http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html)

~~~
planteen
This same problem also led to the loss of the Deep Impact spacecraft on its
extended mission:

"On September 20, 2013, NASA abandoned further attempts to contact the
craft.[77] According to chief scientist A'Hearn,[78] the reason for the
software malfunction was a Y2K-like problem. August 11, 2013, 00:38:49, was
232 tenth-seconds from January 1, 2000, leading to speculation that a system
on the craft tracked time in one-tenth second increments since January 1,
2000, and stored it in an unsigned 32-bit integer, which then overflowed at
this time, similar to the Year 2038 problem"

[https://en.wikipedia.org/wiki/Deep_Impact_(spacecraft)#Contact_lost_and_end_of_mission](https://en.wikipedia.org/wiki/Deep_Impact_\(spacecraft\)#Contact_lost_and_end_of_mission)

------
userbinator
According to this page, the SMART hour counter is only 16 bits, and rollover
should be harmless:

[http://www.stbsuite.com/support/virtual-training-center/power-on-hours-rollover](http://www.stbsuite.com/support/virtual-training-center/power-on-hours-rollover)

If you look elsewhere on the Internet, you'll find people with very old and
working HDDs that have rolled over, so I suspect this bug is limited to a
small number of drives.

(What that page says about not being able to reset it is... not true.)

Likewise, I'm skeptical of "neither the SSD nor the data can be
recovered" - they just want you to buy a new one.

Tangentially related, I wonder how many modern cars will stop working once the
odometer rolls over.

~~~
garaetjjte
>Likewise, I'm skeptical of "neither the SSD nor the data can be recovered"

If the firmware crashes during boot with a negative hour counter, it could
probably only be fixed by manually flashing new firmware over JTAG.

~~~
userbinator
...and likely some of the data recovery companies already know about this
and are prepared for it.

~~~
djsumdog
$500 ~ $1k for a JTAG flash. I'm sure they have plenty of other drives that
come in priced at $1k but that take days longer than expected, so it
probably all balances out eventually.

~~~
Scoundreller
Or they swap boards and you eventually end up with the same problem...

~~~
simcop2387
In an SSD the board is the drive.

------
abarringer
Since most drives are started and used concurrently, this bug would blow
out any RAID setup. There's a dark day coming for some sysadmins.

~~~
vidarh
This is why I never use drives from the same batch, ideally never the same
model, and usually not the same manufacturer. It happens way too regularly
that drives start failing around the same time.

~~~
GrayShade
> I never use drives from the same batch

Note that it wouldn't help in this instance, as the bug is triggered by how
long a drive has been running. Different manufacturers would work, yes.

~~~
vidarh
That's true. But it's easier to sell people on avoiding mixing batches, and it
catches the most common reliability issues. I'd never personally trust my own
files to drives from a single manufacturer, though - I've seen too many
problems with that.

------
verytrivial
> HPE was notified by a Solid State Drive (SSD) manufacturer [...]

That's a curious bit of context. It seems to imply they're shifting some of
the blame onto their manufacturer? It makes me wonder whether this firmware
is 100% HPE specific, or whether there's a 2^16 hours bug about to bite a
bunch of other product lines.

~~~
wtallis
This is a _2^15_ hours bug, not a 2^16 hours bug. Odds are, somewhere in the
SSD firmware source code, there's a missing "unsigned". And in the Makefile,
probably a missing "-Wextra".
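
The difference is easy to demonstrate. A minimal sketch of the 2^15
arithmetic in C - nobody outside HPE has seen the real firmware source, so
this is purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int16_t  s = 32767;   /* INT16_MAX */
        uint16_t u = 32767;
        s++;  /* 32768 doesn't fit in int16_t; on two's-complement
                 targets it comes back as -32768 */
        u++;  /* unsigned 16-bit happily counts on to 65535 */
        printf("signed: %d  unsigned: %u\n", s, (unsigned)u);
        return 0;
    }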

~~~
rocqua
Using signed integers for values that are always positive isn't necessarily
a mistake, most notably because signed integer overflow (in C and C++) is
undefined behavior, which allows more aggressive optimizations by the
compiler.

Some advice I've read is to only use unsigned integers if you want to
explicitly opt-in to having overflow be defined behavior.
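
A small illustration of the kind of optimization this enables: with signed
x, GCC and Clang at -O2 fold the first function to "return 1", while the
unsigned version keeps a real comparison because x + 1 may wrap to 0.

    /* compile with -O2 and inspect the assembly */
    int always_true(int x)      { return x + 1 > x; }  /* folded to 1 */
    int maybe_true(unsigned x)  { return x + 1 > x; }  /* real compare */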

~~~
cesarb
> Some advice I've read is to only use unsigned integers if you want to
> explicitly opt-in to having overflow be defined behavior.

I read somewhere that the true reason for that advice is that it allows the
compiler to silently store an "int" loop counter in a 64-bit register, without
having to care about 32-bit overflow. If you use size_t for the loop counter,
that's no longer an issue.
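
A sketch of that effect, assuming a 64-bit target: with an int index,
overflow is UB, so the compiler may widen i to 64 bits once and
strength-reduce the addressing to a moving pointer. With a 32-bit unsigned
index, 4 * i legitimately wraps mod 2^32, so that rewrite would be invalid.

    long sum_signed(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[4 * i];   /* index freely widened to 64 bits */
        return s;
    }

    long sum_unsigned(const long *a, unsigned n) {
        long s = 0;
        for (unsigned i = 0; i < n; i++)
            s += a[4 * i];   /* must compute (4*i) mod 2^32 each time */
        return s;
    }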

~~~
nullc
I haven't seen an example where that's the case.

It commonly factors in loop analysis around unrolling and vectorization, e.g.
the loop will run exactly 16*n times OR an integer will overflow and it'll run
some other number of times.

Use of signed precludes the overflow, and the exact bound enables efficient
vectorization.
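
Sketched in C, with the same caveat that this is just an illustration:

    /* With int, a trip count of exactly 16*n is guaranteed (overflow
       of 16*n would be UB), and 16*n is divisible by 16, so the
       vectorizer can emit whole vector iterations with no scalar
       epilogue. Make i and n unsigned and 16*n may have wrapped mod
       2^32, so the exact trip count and its divisibility are lost. */
    void scale(float *a, int n) {
        for (int i = 0; i < 16 * n; i++)
            a[i] *= 2.0f;
    }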

------
voiper1
>The issue affects SSDs with an HPE firmware version prior to HPD8 that
results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8
hours). After the SSD failure occurs, neither the SSD nor the data can be
recovered. In addition, SSDs which were put into service at the same time will
likely fail nearly simultaneously.

Looks like some sort of run-time counter stored in a signed 2-byte integer.
Oops.

~~~
stefan_
But how does that brick the device? I guess the hour counter overflows,
goes negative, and that screws up a calculation later on, causing the
firmware to crash (over and over again...).
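
Pure speculation about the failure mode, with every name below made up: if
the negative count feeds an array index or similar somewhere in the boot
path, the firmware would fault on every start.

    #include <stdint.h>

    #define LOG_BUCKETS 128
    static uint32_t wear_log[LOG_BUCKETS];

    void record_wear(int16_t power_on_hours) {
        /* once the counter goes negative, hours/256 is -128..-1:
           an out-of-bounds write, and a crash on every boot */
        wear_log[power_on_hours / 256]++;
    }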

If only SSD vendors would do the usual cost-cutting measure of loading
firmware from the host computer, this could be trivially fixed.

~~~
mnw21cam
Please no. I may actually want to boot from one of those devices.

~~~
rini17
Somewhere in the UEFI kitchen sink there is certainly firmware loading
support already.

------
pabs3
Would be nice if the standard firmware update mechanism on Linux (fwupd/LVFS)
could be used for HPE products.

[https://fwupd.org/lvfs/vendors/](https://fwupd.org/lvfs/vendors/)
[https://fwupd.org/lvfs/devices/](https://fwupd.org/lvfs/devices/)

~~~
jabl
This so much. Even if you hate UEFI with the fire of a thousand suns, there
are some good things there. Like GPT, and the UEFI capsule mechanism that
fwupd uses.

~~~
HorstG
HP actually has a working firmware update mechanism for all their gear.
It's a bootable Linux live DVD that starts into a browser talking to a
local Tomcat instance which applies the necessary patches. In many cases
it's also possible to invoke patching from your normal Linux installation.
However, a reboot is usually still necessary, e.g. for disk firmware, which
the controller applies after its own new firmware has been loaded
(sometimes it takes more than one reboot).

The system is quite a lot older than fwupd and usually less flaky. Google
for hpsum or HP SPP.

~~~
jabl
Yes, hpsum / SPP is what we use now. Not happy about it.

------
zozbot234
Ouch. I wonder how many non-enterprise SSDs come with similar bugs, _and_
zero support from the firmware vendor.

~~~
close04
> neither the SSD nor the data can be recovered

It looks like such a bug isn't necessarily SSD specific if it completely
bricks the drive.

And while 32,768 hours may seem like a long time for a drive, it's under 4
years of continuous operation. Not unheard of if used in a NAS.

~~~
zozbot234
> It looks like such a bug isn't necessarily SSD specific if it completely
> bricks the drive.

Maybe. OTOH, plenty of people have been running spinning-rust drives with
way more than 4 years of power-on operation - if this bricking bug were
common there, I'm pretty sure we would've noticed. SSDs are a newer tech
and it's more common to replace them anyway as specs improve.

~~~
makomk
There was actually a similar drive-bricking bug in the SMART implementation of
some of Seagate's hard drives about a decade ago. Not quite as deterministic
as this one, but essentially uptime-dependent. After denying it for a while,
they finally fessed up once a member of the public figured out that affected
drives could be unbricked by connecting over the serial debug interface and
wiping the SMART log. They ended up offering firmware updates and free
unbricking of affected drives at their cost including shipping to their
facility. (I don't think the unbricking process was terribly easy either - the
publicly-known version required booting the drive with the heads disconnected
from the controller to stop it from reading the data, or it'd get stuck in an
infinite loop and not respond on serial.)

~~~
kevin_thibedeau
I have one of these drives. Apparently a model that the known procedure
doesn't work on. I'll never buy a shitty Seagate product again.

------
_bxg1
Whatever the counter is, the fact that it's 32,768 instead of 65,536 suggests
they used a _signed int_ for something that presumably starts at zero and
increases monotonically... Avoiding just that mistake would've given them
twice as much time - nearly 7.5 years - which seems like it'd be longer than
these drives would typically last anyway.

~~~
imtringued
It would have avoided the problem in the first place because the SMART counter
is allowed to roll over back to 0.

------
pjc50
Amazing. A repeat of the "Windows 95 crashes after 49.7 days of uptime" and
other timer rollover bugs.

~~~
macintux
I’ve always appreciated the humor of the fact that Win95 was so unstable that
no one noticed this bug until years later.

~~~
kube-system
It also used to be very common for people to turn off their computers when
they were done using them.

~~~
Piskvorrr
Yup. Those were power-hungry times, without deep sleep modes, or even _useful_
hibernation. Heck, most machines then didn't even bother to reduce clock speed
when idle.

~~~
olyjohn
"It is now safe to turn off this computer"

I remember when I saw my first ATX computer, and it turned itself off. That
was cool.

------
S_A_P
I just want to know how many of these failed at 32768 hours before they had
their oh sh*t moment.

~~~
retrovm
Beats me, but I happen to have a fleet of HP SATA (not SAS) drives and they
just crossed this boundary, 32,813 power-on hours typical. I guess if their
SATA firmware had this bug I'd be having a bad week.

------
gruez
>By disregarding this notification and not performing the recommended
resolution, the customer accepts the risk of incurring future related errors.

How does this work legally? For one, how would HPE prove that the customer
read the bulletin? I don't imagine they're sending these out via certified
mail.

------
bobowzki
At the hospital where I work, almost all HP desktops crashed within a few
months...

~~~
luma
This is about HPE (not HP, which is now a separate company). I haven't
heard anything about HP (who makes desktops, not storage arrays)
experiencing this problem.

~~~
cotillion
Considering 900 drives out of 1800 crashed in HP computers at a Swedish
hospital in the last few months, I suspect there is a connection.

Maybe HP and HPE were more tightly connected 3 years, 270 days 8 hours ago.

~~~
luma
All of the impacted devices in this issue are SAS-connected enterprise-class
SSDs. While it's not _impossible_ that someone installed a SAS controller and
used SSDs that can cost upwards of 10x the price of an equivalent-capacity
desktop model using SATA... it's probably pretty unlikely.

------
iveqy
Probably related to
[https://news.ycombinator.com/item?id=21471997](https://news.ycombinator.com/item?id=21471997)

------
EvanAnderson
I did some recon on eBay looking for used units w/ the affected SKUs for sale
and they appear to be Samsung units.

------
annoyingnoob
Whew, dodged that bullet, looks like I'm not using any of the affected drives.
Lucky me, for now.

------
paggle
Yikes! This is why when I built my home NAS I used five drives, all
different models and manufacturers.

