Great post, glad to see something written end to end. One thing that surprised me, though, is that not a single word was spent on feature selection (say, by looking at feature importance for correct predictions) or on feature engineering, i.e. combining features or otherwise applying domain knowledge. For example, trying to figure out whether you can put different attributes in relation to each other to get a better feature (e.g. number-of-spinups / operating-time-in-hours). Personally that would be the first thing I'd try, and coincidentally it's also the most fun part of machine learning for me ;)
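The spinups-per-operating-hour idea is a few lines of pandas. A minimal sketch, assuming Backblaze-style SMART column names (`smart_4_raw` for start/stop count, `smart_9_raw` for power-on hours); the values here are made up for illustration:

```python
import pandas as pd

# Toy stand-in for a SMART dataset; column names follow the common
# Backblaze convention but the numbers are illustrative only.
df = pd.DataFrame({
    "smart_4_raw": [120, 3500, 80],       # start/stop count (spin-ups)
    "smart_9_raw": [12000, 7000, 40000],  # power-on hours
})

# Ratio feature: spin-ups per operating hour.
# clip(lower=1) guards against division by zero for brand-new drives.
df["spinups_per_hour"] = df["smart_4_raw"] / df["smart_9_raw"].clip(lower=1)
print(df)
```

A drive that spins up constantly relative to its age looks very different from one that runs continuously, which is exactly the kind of domain signal a raw attribute dump hides.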
I think it’s not actually a very good ML case study for this reason. Feature engineering would be a huge part of a problem like this. Additionally, jumping straight to XGBoost is a pretty amateurish move, and the very first problem to attack here is class imbalance.
In the bio blurb they self-describe as an infra engineer who also enjoys data science. In some sense I really don’t like to see that. The quality of the data science / ML work in this is actually quite bad, but people use these blog posts as resume padders to try to jump into ML jobs without ever having any real experience or training.
I think it’s a bad thing because it devalues the importance of real statistical computing skills, which take many years of education and experience to develop, absolutely not the sort of thing you can pick up by dabbling in some Python packages on the weekend for a little project like this.
The amount of waste I see from companies trying to avoid paying higher wages and avoid team structures that make statistics experts productive is staggering: hacked-up scripts and notebooks stitched together without proper statistical understanding, ML engineers forced to manage their own devops, and basic statistical questions simply ignored.
For this drive problem for example, I expect to see a progression from simple models, each of which should address class imbalance as a first order concern. I expect to see how Bayesian modeling can help and how simple lifetime survivorship models can help. I expect to see a lot of feature engineering.
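One of the "simple lifetime survivorship models" mentioned above is the Kaplan-Meier estimator, which handles the fact that most drives in a dataset haven't failed yet (they're censored, not negatives). A hand-rolled sketch on toy lifetimes (the numbers are invented):

```python
import numpy as np

# Toy drive lifetimes in hours; event=1 means an observed failure,
# event=0 means censored (still healthy when observation ended).
time = np.array([100, 250, 250, 400, 500, 500, 650, 800])
event = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# Kaplan-Meier: at each distinct failure time t, multiply the survival
# probability by (1 - failures_at_t / drives_still_at_risk_at_t).
surv = 1.0
curve = {}
for t in np.unique(time[event == 1]):
    at_risk = np.sum(time >= t)
    failures = np.sum((time == t) & (event == 1))
    surv *= 1 - failures / at_risk
    curve[int(t)] = surv

for t, s in curve.items():
    print(f"S({t}h) = {s:.3f}")
```

Unlike a plain classifier, this framing uses the censored drives as information instead of mislabeling them as "won't fail", which is one reason survival models are the natural starting point here.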
Instead I see an infra engineer playing around with data, trying one off-the-shelf framework, and then claiming the whole premise can’t work in production.
You would probably spend a lot more time and money and arrive at the same conclusion. You can have the best background in statistics and all the experience in the world, but it doesn’t matter if you don’t have the right data. This post just kind of confirms what many people have already shown about SMART data: it’s not predictive of drive failure.
That sounds like an overly dismissive attitude. It doesn’t make much sense to say, “we tried this intern-grade strawman approach with huge flaws, and since it didn’t work, it would just be a waste of money to try a more principled solution.”
This article is itself an example of wasting time and money. It’s amazing to me how anti-machine-learning sentiment causes people to do a complete 180 from common sense just to avoid investing even minimal resources or effort in studying and understanding problems where statistics can help.
A big observation in my career is that statistics makes non-statisticians go crazy in the head. People panic that statistics will be used to usurp their authority and then try to steamroll statistics with politics about what is or isn’t over-hyped and what would or wouldn’t be a “waste” of time or resources to try out.