This is very standard in industry (as opposed to academia) and highlights a lot of the reasons why you need ML specialists operating these models, and why you will get garbage in, garbage out if you have non-specialist engineers build, evaluate & operate these types of systems.
For example, in this post the engineer does not adequately handle class imbalance at all, and reaches for a hugely complicated model framework (XGBoost) without even trying something like bias-corrected logistic regression first.
I think this post is just an example of an infra engineer thinking “ML is easy” and, when their quick attempt doesn’t work, just bailing and saying “ML is overhyped and doesn’t work for my problem.”
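To make the "bias-corrected" suggestion concrete: one possible reading (an assumption here, since the comment doesn't say which correction is meant) is the prior correction from the rare-events logistic regression literature (King and Zeng). You fit on a subsample where failures are over-represented, then shift the intercept back to the true base rate. A minimal sketch; the function names and all the numbers are made up for illustration:

```python
import math

def prior_corrected_intercept(b0_sample, sample_pos_frac, population_pos_frac):
    """Shift an intercept fitted on a rebalanced sample back to the population.

    b0_sample:           intercept from a logistic regression fitted on the subsample
    sample_pos_frac:     fraction of positives (failures) in that subsample
    population_pos_frac: fraction of positives in the full population
    """
    y_bar, tau = sample_pos_frac, population_pos_frac
    return b0_sample - math.log(((1 - tau) / tau) * (y_bar / (1 - y_bar)))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical numbers: model fitted on a 50/50 subsample, true failure rate 0.1%.
b0_fitted = 0.0  # on the balanced sample, a "typical" drive scores p = 0.5
b0_corrected = prior_corrected_intercept(b0_fitted, 0.5, 0.001)
p_corrected = sigmoid(b0_corrected)  # probability for that same drive at the true base rate
```

Only the intercept moves; the slope coefficients are left alone. Without the shift, every predicted probability from the rebalanced model is inflated.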
This is not the correct modeling approach. All hard drives will fail given enough time, so a drive labeled "not failed" has really just not failed yet (it is right-censored), and treating the failed hard drives as the positive class will bias your results.
Stuff like this really should be handled using survival analysis.
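For anyone who hasn't seen it, the simplest survival-analysis tool is the Kaplan-Meier estimator, which keeps still-running drives in the risk set until their observation ends instead of calling them negatives. A rough pure-Python sketch follows; the function name and the toy data are hypothetical, and for real work a library such as lifelines would be the more sensible choice:

```python
from collections import Counter

def kaplan_meier(durations, failed):
    """Return [(t, S(t))] at each distinct failure time.

    durations: days each drive was observed
    failed:    True if the drive failed at that time, False if it was still
               running when observation stopped (right-censored)
    """
    deaths = Counter(t for t, f in zip(durations, failed) if f)
    exits = Counter(durations)  # failures and censorings both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Censored drives observed up to t still count as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Hypothetical toy data: six drives, three failures, three still running.
durations = [100, 200, 200, 300, 400, 500]
failed = [True, False, True, True, False, False]
curve = kaplan_meier(durations, failed)
```

The censored drives never count as "survivors forever"; they just stop contributing once they leave the observation window.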
I agree and have been doing some similar analysis on the Backblaze dataset. I suppose you can use this for prediction, but I'm personally just interested in post-hoc analysis: (1) getting better AFR estimates when failure counts are low, and (2) exploring time-dependence of hazard functions with different priors (GP priors, etc.). This post and your comment have motivated me to make a post this weekend! Thanks!
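As a rough illustration of point (1), and not the analysis described above: if you treat AFR as a Poisson failure rate per drive-year, a Gamma prior gives a closed-form posterior that pulls low-count estimates toward the prior instead of reporting 0% or something wildly high. The prior values, counts, and function name below are assumptions made up for the sketch:

```python
import random

def afr_posterior(failures, drive_days, prior_shape=1.0, prior_rate=50.0,
                  n_samples=20000, seed=0):
    """Gamma-Poisson posterior for annualized failure rate.

    Prior: Gamma(prior_shape, rate=prior_rate), i.e. prior mean 1/50 = 2% AFR
    (an arbitrary illustrative choice).
    Returns the posterior mean and a Monte Carlo 95% credible interval.
    """
    drive_years = drive_days / 365.0
    shape = prior_shape + failures
    rate = prior_rate + drive_years
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    samples = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(n_samples))
    lo = samples[int(0.025 * n_samples)]
    hi = samples[int(0.975 * n_samples)]
    return shape / rate, (lo, hi)

# Hypothetical drive model: 1 failure over 36,500 drive-days (100 drive-years).
post_mean, (ci_lo, ci_hi) = afr_posterior(failures=1, drive_days=36500)
raw_afr = 1 / (36500 / 365.0)  # naive estimate: failures per drive-year
```

With this little data the posterior mean sits between the naive estimate and the prior mean, and the interval is wide, which is the honest answer.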
SMART seems to be close to useless in practice. Manufacturers don't seem to expose actual failure statistics through it, likely for fear of making their product look bad.
SMART is a liar sometimes. I have first-hand experience with faulty Seagate firmware and EqualLogic SANs, where errant statistics caused disks to be ejected from the volumes before you could finish a rebuild. Nothing like watching 40TB of data disappear on multiple installations over the course of a few weeks!