This is very standard in industry (as opposed to academia) and highlights a lot ...

causalmodels · on Jan 21, 2021

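This is not the correct modeling approach. All hard drives will fail given enough time, so labeling only the drives that have already failed as the positive class will bias your results: the drives still running are not true negatives, just drives whose failure has not been observed yet (right-censored).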

Stuff like this really should be handled using survival analysis.
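
To make the censoring point concrete, here is a minimal sketch of a Kaplan-Meier survival estimate in plain Python. The function name `kaplan_meier` and the toy data are made up for illustration and are not from the article. The key idea is that a drive still running at the end of observation stays in the risk set until its last observed day, instead of being counted as a "never fails" negative.

```python
from collections import Counter


def kaplan_meier(durations, failed):
    """Return [(t, S(t))] at each observed failure time.

    durations: days each drive was observed
    failed:    True if the drive failed on that day,
               False if it was still running (right-censored)
    """
    deaths = Counter(t for t, f in zip(durations, failed) if f)
    exits = Counter(durations)  # failures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: fraction of at-risk drives surviving day t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


if __name__ == "__main__":
    # Hypothetical toy data: five drives, two observed failures
    days = [100, 200, 200, 300, 400]
    has_failed = [True, False, True, False, False]
    for t, s in kaplan_meier(days, has_failed):
        print(f"day {t}: estimated survival {s:.2f}")
```

In practice a library such as lifelines would do this and more; the sketch is only meant to show how censored drives enter the estimate.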

mlthoughts2018 · on Jan 21, 2021

In fact it screams Bayesian survival analysis, since there is so much prior knowledge both of general hard drive failure rates and SMART stats.

astrophysician · on Jan 21, 2021

I agree and have been doing some similar analysis on the Backblaze dataset. I suppose you can use this for prediction, but I'm personally just interested in post-hoc analysis: (1) getting better AFR estimates when failure counts are low and (2) exploring time-dependence of hazard functions with different priors (GP priors, etc.). This post and your comment have motivated me to make a post this weekend! Thanks!
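
As a rough illustration of point (1), here is a sketch of a conjugate Gamma-Poisson update for an annualized failure rate. It assumes a constant hazard (so failures are Poisson in drive-years), and the prior parameters are hypothetical placeholders, not values from any real analysis. The function name `afr_posterior` is also made up.

```python
def afr_posterior(failures, drive_days, prior_shape=1.0, prior_rate=50.0):
    """Gamma-Poisson update for an annualized failure rate (AFR).

    Prior: Gamma(prior_shape, prior_rate), with the rate in drive-years,
    so the prior mean AFR is prior_shape / prior_rate. The defaults
    (a 2% prior mean) are an assumption for illustration only.

    Returns (posterior_shape, posterior_rate, posterior_mean_afr).
    """
    drive_years = drive_days / 365.0
    shape = prior_shape + failures   # add the observed failure count
    rate = prior_rate + drive_years  # add the observed exposure
    return shape, rate, shape / rate


if __name__ == "__main__":
    # Hypothetical model with zero failures over 36,500 drive-days
    shape, rate, mean = afr_posterior(failures=0, drive_days=36_500)
    print(f"posterior Gamma({shape}, {rate}), mean AFR {mean:.4f}")
```

With zero observed failures the raw estimate (failures divided by drive-years) is exactly 0, whereas the posterior mean stays positive and is pulled toward the prior, which is the behavior you want when counts are low.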

causalmodels · on Jan 27, 2021

You should check out this time-to-event neural network [1].

[1] https://github.com/ragulpr/wtte-rnn

R0b0t1 · on Jan 21, 2021

SMART seems to be close to useless in practice. Manufacturers don't seem to expose actual failure statistics through it, likely for fear of making their product look bad.

jgalentine007 · on Jan 21, 2021

SMART is a liar sometimes. I have firsthand experience with faulty Seagate firmware and EqualLogic SANs, where errant statistics caused disks to be ejected from the volumes before you could finish a rebuild. Nothing like watching 40TB of data disappear on multiple installations over the course of a few weeks!