It’s striking that so many stories about the triumph of metrics over intuition involve games. I’m convinced that you should put statisticians in charge if your goal is to win a game. I’m not convinced that related behaviors, like constructing a game adjacent to the problem you’re actually trying to solve and treating game performance as a proxy (like LeetCode interviews! Or KPIs!) are helpful.
It’s also interesting that he brings up the blind audition here in the appendix. Blinding can of course remove biases about the person from the process, but it is still subjective. To make the “can’t manage what you don’t measure” point we would need to show that orchestra performance got better after we developed an algorithmic way of deciding how good an audition performance is. After all, millions of dollars are on the line; are you really going to listen to some credentialed old man about which one he liked better? So backwards!
>I’m not convinced that related behaviors, like constructing a game adjacent to the problem you’re actually trying to solve and treating game performance as a proxy (like LeetCode interviews! Or KPIs!) are helpful.
The devil is as always in the details. At least as far as KPIs go, they can be very useful if they are chosen appropriately and if the senior team members have all bought in to the concept. Example:
I used to work in military search and rescue, and the pilots get tasked by a complex bureaucracy indirectly descended from NORAD, and in most areas they have a requirement to be airborne within 30 minutes when called. Thus, our maintenance flight chose as its first and foremost KPI "percentage of the time where maintenance can't get an aircraft ready to be flying missions in less than 30 minutes".
That's a fantastic KPI because the maintainers that worked for us were all on board. Getting that KPI as close to 0 as possible (and we did hit zero every few months) was a goal that everyone had bought into, because of course if you can't put an aircraft in the sky then it potentially means people are dying unnecessarily.
Now, any AME or military aircraft tech can tell you the paperwork requirements for aircraft maintenance are very strict and at times quite onerous, but they're required by federal law. So the same maintenance organization also had, broadly, "paperwork error rates" as a tertiary KPI, because of course the maintenance paperwork gets audited by an AS9100 or similar quality assurance shop, usually in-house.
That one is a shitty KPI because the wrench turners who work on complex jobs are more likely to make errors in the paperwork, so there's a perverse incentive wherein your best maintainers make about the same number of mistakes as your journeymen right out of school, simply because it's easier to fill out the paperwork properly for topping up an oil reservoir than for changing a propeller or doing a weight-and-balance.
All that to say KPIs themselves are not the problem. Shitty leaders who choose KPIs improperly are the problem.
I agree with everything you say except the bit where the failure rate is a fantastic KPI. It has room for improvement.
> they have a requirement in most areas to be airborne within 30 minutes when called. [...] "percentage of the time where maintenance can't get an aircraft ready to be flying missions in less than 30 minutes".
The problem with this KPI is that it's completely arbitrary. Why does it have to be 30 minutes, and not 25, or 35? Changing this threshold might change your failure percentage significantly. This KPI is not actually a ruler that measures your performance – it's using your performance as a ruler to measure the threshold, because the easiest way to change the KPI number is by changing the threshold, not the performance.
After all, wouldn't 25 be better than 30? The only reason it's 30 is that someone at some point said it has to be 30.
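A toy sketch of what I mean (the response times below are invented; only the shape of the argument matters):

```python
# Same invented response times, scored against three different cutoffs:
# the "failure rate" swings wildly even though the underlying process is identical.
times = [22, 24, 26, 27, 28, 29, 31, 33, 26, 25, 29, 34, 23, 28, 32]

for threshold in (25, 30, 35):
    failures = sum(t > threshold for t in times)
    print(f"cutoff {threshold} min: {100 * failures / len(times):.0f}% failure rate")
```

With these numbers the "failure rate" goes from roughly 73% to 27% to 0% as the cutoff moves from 25 to 30 to 35 minutes, without the crews doing anything differently.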
An arbitrary cutoff limit like 30 minutes (what Deming would call a numerical goal without a method) has one of three effects:
- Either it's really hard to accomplish, in which case the large failure rate is demotivating, or
- It's just about what you can do, in which case it's meaningless anyway, or
- It's easier than what you could technically do, which might inspire dilatoriness: "You can't complain; we're meeting our goals!"
Instead of comparing yourself to arbitrary thresholds, use the raw metric as a KPI instead. Use "minutes until airborne" as the KPI. That you can optimise indefinitely, and it is an honest representation of your current process, not how it compares to a number someone threw out at some time.
I know, I know, the 30 minute number is not entirely arbitrary, it's probably related to how easy it is to keep a person alive. But it's arbitrary in the sense of being a cutoff – for keeping people alive, shorter is always better. There's nothing magical happening at exactly the 30 minute mark.
> Instead of comparing yourself to arbitrary thresholds, use the raw metric as a KPI instead. Use "minutes until airborne" as the KPI. That you can optimise indefinitely, and it is an honest representation of your current process, not how it compares to a number someone threw out at some time.
This is an example of a more general phenomenon - when you have outcome buckets, it's generally better to do your modeling on raw outcomes ["it took us 32 minutes"] than to try to do it on the bucketed outcomes ["we didn't make it in 30 minutes"]. Bucketing throws most of your information away.
Andrew Gelman talks about this all the time in the context of election modeling, where it's better to predict how many votes something will get, which is a continuous variable (and then, based on the predicted vote, predict whether the thing will win or not), rather than trying to predict whether the thing will win, which is a discrete variable.
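A rough simulation sketch of the same idea, assuming (purely for illustration) that response times are normally distributed around 27 minutes:

```python
# Estimate P(time > 30 minutes) two ways from the same simulated samples:
#   (a) fit mean/sd to the raw minutes and take the normal tail probability
#   (b) use only the bucketed pass/fail outcomes (the failure fraction)
# Repeating this many times shows (a) varies less than (b): bucketing
# throws information away.
import random
import statistics
from math import erf, sqrt

def normal_tail(x, mu, sigma):
    # P(X > x) for a Normal(mu, sigma) variable
    return 0.5 * (1 - erf((x - mu) / (sigma * sqrt(2))))

TRUE_MU, TRUE_SD, THRESHOLD, N = 27.0, 3.0, 30.0, 50  # invented parameters
random.seed(0)
raw_estimates, bucketed_estimates = [], []
for _ in range(2000):
    sample = [random.gauss(TRUE_MU, TRUE_SD) for _ in range(N)]
    mu, sd = statistics.mean(sample), statistics.stdev(sample)
    raw_estimates.append(normal_tail(THRESHOLD, mu, sd))
    bucketed_estimates.append(sum(t > THRESHOLD for t in sample) / N)

print("spread of raw-data estimates:", statistics.stdev(raw_estimates))
print("spread of bucketed estimates:", statistics.stdev(bucketed_estimates))
```

The raw-data estimate is only more precise to the extent the distributional assumption holds, but the general point stands: the binary outcome is a lossy compression of the raw one.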
Average minutes to airborne? Median? Mode? 90th percentile? 99th percentile?
Optimizing for each of those would result in a different process, different outcomes, and quite possibly worse outcomes for the stakeholders than "proportion under 30 minutes" which is just a sixth way to slice the same data.
> Average minutes to airborne? Median? Mode? 90th percentile? 99th percentile?
Facetious answer: yes.
More useful answer: you want to record the full timeseries of individual data points. From this you can derive mean, median, mode, 90th percentile, 99th percentile and any other statistic that is shaped like a hole in your cost–benefit calculations. Including trends and changing variance.
For reporting purposes, the upper process behaviour limit is probably a good one as "the" numeric value of the KPI, but the true value is in the timeseries.
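Something like this minimal sketch, with invented times; the upper limit here uses the common XmR-chart formula (mean plus 2.66 times the average moving range):

```python
# One list of "minutes until airborne" observations (invented), plus the summary
# statistics you can derive from it on demand.
import statistics
from math import ceil

times = [24.5, 27.0, 31.2, 22.8, 29.9, 26.4, 33.1, 25.7, 28.3, 27.9]

def percentile(data, p):
    # simple nearest-rank percentile, p in (0, 100]
    ordered = sorted(data)
    return ordered[max(0, min(len(ordered) - 1, ceil(p / 100 * len(ordered)) - 1))]

mean = statistics.mean(times)
# upper process behaviour limit, XmR-chart style: mean + 2.66 * average moving range
moving_ranges = [abs(b - a) for a, b in zip(times, times[1:])]
upper_limit = mean + 2.66 * statistics.mean(moving_ranges)

print("mean:", round(mean, 1), "median:", statistics.median(times))
print("90th pct:", percentile(times, 90), "99th pct:", percentile(times, 99))
print("upper process behaviour limit:", round(upper_limit, 1))
```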
30 minutes is what the 4-Star General who runs NORAD decided s/he wants to see as a response time.
It's also just about the fastest you can get a Hercules airborne and still include time for flight planning. It takes almost exactly 22 minutes to start a Hercules by the book, and that leaves 8 minutes for taxi and takeoff.
30 minutes is not arbitrary in the context of search and rescue. When someone has fallen overboard on a Naval vessel in most parts of the world's oceans their survival time is as short as 30 minutes: https://ussartf.org/cold_water_survival.htm
In rare circumstances the water can be much colder than average and this drops below 30 minutes, but for the most part the military operates in waters with survival time no lower than 30 minutes.
On the one hand, choosing an arbitrary but good-enough target avoids wasting effort on overoptimization and keeps the focus on the worst case, not the best case. On the other hand, improvements are not elastic: maybe rearranging parts and supplies to make them more accessible is enough to make 28 minutes as likely as 30 was before, but getting to 26 minutes could require expensive tools, and 24 minutes would require different, easier-to-service aircraft.
Yup. Optimisation is always a trade-off problem that needs to be approached intelligently, not mindlessly. The raw metric of minutes to airborne leads naturally to the intelligent points you raise about what each minute is worth.
The arbitrary target implies a cost function that is zero below 30 minutes, and infinite above it – this, if anything, suppresses intelligent discussion and leads to wasted effort.
> Instead of comparing yourself to arbitrary thresholds, use the raw metric as a KPI instead. Use "minutes until airborne" as the KPI. That you can optimise indefinitely, and it is an honest representation of your current process, not how it compares to a number someone threw out at some time.
You are completely right, but this is much harder to explain to a regular person who doesn't have a math/stats background.
"Get the aircraft in the air in 30 minutes" is much easier to understand, even if the underlying assumption is that the number can be anything – especially when, as you pointed out, there are some reasonable biology-based reasons for 30 minutes.
"Get the aircraft in the air as fast as possible, but then we'll try to optimize that number" is inviting confusion ("Wait, optimize? But we're already working as fast as we can!).
Of course, the tradeoff in the hard threshold is that people might stop improving after hitting the 30 minute threshold. In my opinion this is at least partially based on cultural factors – you should be working in search and rescue at least partly because you want to help people, and should be willing to optimize further to accomplish that goal.
>All that to say KPIs themselves are not the problem. Shitty leaders who choose KPIs improperly are the problem.
Deciding to govern by KPIs presupposes that we have KPIs that are chosen well enough to beat subject-matter-expert judgement. I think if we are talking about anything other than a game with very clear rules and observable outcomes, that's an extraordinary claim.
You don't need an algorithmic way to decide how good an orchestra is, because subjective reception from critics and audiences is exactly the point.
And there's a fair amount of innovation in that. Audiences want novel-but-not-too-novel interpretations.
So not only is there no way to define a definitive objective metric for musical interpretation, it's even harder to conjure up an objective metric space for "possible but unusual good" interpretations.
As go orchestras, so go individual players. You absolutely are going to listen to the opinion of an experienced conductor, because being able to tell good from bad and having some insight into compatibility with existing players in the rest of the team is exactly what the job involves.
To make it worse, it's culturally dependent, and it shifts. Good today is not the same as good fifty years ago.
This actually makes it easier than software, because it very much is just about informed opinion, with external subjective feedback from audiences and critics.
>subjective reception from critics and audiences is exactly the point
Clearly software is aimed at subjective reception by its customers. We also have the adage "programs are meant to be read by humans and only incidentally for computers to execute" -- to the extent you believe this, subjective reception of the code by maintainers also matters. For these reasons I think subjective evaluations and "taste" for software / software engineers are undersold.
> It’s striking that so many stories about the triumph of metrics over intuition involve games. I’m convinced that you should put statisticians in charge if your goal is to win a game.
I enjoy playing around with baseball data. About a year before the pandemic, I went out and presented some research at a Sabermetrics conference hosted by the Boston Red Sox. 90% of the pro teams had their data science teams present, and I spoke with many of them (FYI - working in baseball is like working in video games: long hours and low pay).
Anyway... The Red Sox (at that time, at least) divide their analytics department into 3 subject matters, and I spoke to the lead of one of those 3. I asked how different their analysis is compared to all the old baseball knowledge developed over the last 150 years.
He said that it's not different! All the intuition about the game built up by the old-timers spending decades in the industry still holds! The analytics is all about tweaking things and making slight improvements in the odds. If they can improve the odds of a beneficial outcome by 2-3%, they consider it a success, and if they can do that consistently over the course of a season they will win 5-10 extra games. That's it! (SPOILER ALERT) Remember that in the movie Moneyball, the Oakland A's didn't win the World Series. They didn't even make it to the World Series. They were beaten in the first round of the playoffs by another small market team (Minnesota) with a similarly small payroll that wasn't doing the moneyball thing ;)
It’s interesting that your anecdote lined up so well with the example from the OP about people claiming that the metrics didn’t change much, though those people then gave ‘supporting’ examples that actually argue the opposite of their statement. It may be that the person you talked to worked on, e.g., play strategy, which could have been affected totally differently from scouting.
I was intentionally vague about who I spoke to, to keep some amount of anonymity for him ;) The question was about baseball in general and not a specific aspect of it.
As far as play strategy goes, it's actually really interesting that baseball does not allow for real-time analytics in the game. You can print out some cards and give them to players to keep in their pocket (baseball uniforms have pockets!) and they can refer to them during the game. You can give a binder full of information to the coach and he can keep it in the dugout. But once the game begins, you cannot communicate strategy to the players and coaches. That's quite different than most sports.
What is interesting about scouting/player development vs play strategy is that the "revolution" in scouting started 100 years ago! Branch Rickey, most famous for signing and playing Jackie Robinson with the Brooklyn Dodgers, began creating the modern farm system for St. Louis in 1919 - with a lot of derision from other teams. But funny thing - from that point on (until he left for Brooklyn) nobody but the Yankees won more games than the Cardinals. The Chicago Cubs and Philadelphia Phillies at various times hired people to collect data about young players to use for skill development and practice recommendations, as well as personnel decisions. The shape of a guy's butt and the attractiveness of his girlfriend are actually data-driven heuristics!!!
> I’m convinced that you should put statisticians in charge if your goal is to win a game.
Not just you - this is the concept of a casino, bookmakers, etc. You can make consistent profit by having good stat/prob models of games that people play for money but are irrational about.
More generally having accurate models gives you an advantage in almost any important situation.
Statisticians are not going to win you the game outright. They may win more than average, but once everyone has good statisticians, the mean shifts and money becomes the deciding factor again.