Funny how such a basic capability didn't appear as a batteries-included thing until 40+ years after Unix did. Lots of futzing around with uptime, ps, top, iostat, vmstat, ntop, sar, lsof, free, etc. All of that replaced with a simpler tool to get the high-level answer.
That first-level filter is the key to narrowing down your troubleshooting. Bravo. I bet Adrian Cockcroft (Sun, Netflix, now AMZN) approves.
64kB of RAM in 1980: $405. $1,335 if you adjust for inflation.
64GB of RAM in 2020: $175.
In my mind, that has the most to do with it.
Is there anyone on HN who knows why it didn’t appear earlier?
Lack of need? This doesn’t seem like a hard thing to implement - am I missing something in terms of implementation complexity?
* A flash of insight to take the specific number of threads out of the picture. (And it's become particularly irrelevant since systems now have a mix of blocking and non-blocking programs, and ever higher concurrency, so radically different thread counts for the same work. Maybe that's part of your answer; that range has expanded over time.)
* The belief I could / willingness to get something like this merged into the kernel. Besides the inherent difficulty of kernel programming (significant but surmountable), the Linux kernel community is not regarded as welcoming... Maybe this is changing. We'll see if Linus's new attitude lasts.
Vendors like Google and Facebook have for years created their own userspace OOM killers that attempt to stay one step ahead of the kernel's OOM killer in order to improve consistency and reliability. PSI is just the latest in a long series of features over the years that companies have contributed for the benefit of userspace OOM killers.
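The hook PSI gives those userspace OOM killers is its trigger interface: write a stall threshold and time window to /proc/pressure/memory, then poll() the fd for POLLPRI wakeups. A minimal sketch of that flow (the threshold numbers are illustrative, and this assumes a PSI-enabled kernel):

```python
import os
import select

# PSI trigger format (per the kernel's psi documentation):
#   "<some|full> <stall threshold in us> <window in us>"
# The 150ms-per-1s numbers below are illustrative, not a recommendation.
def psi_trigger(stall_us=150_000, window_us=1_000_000, kind="some"):
    return f"{kind} {stall_us} {window_us}"

def watch_memory_pressure(path="/proc/pressure/memory"):
    # Open read-write, register the trigger, then poll for POLLPRI.
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    os.write(fd, psi_trigger().encode())
    poller = select.poll()
    poller.register(fd, select.POLLPRI)
    # poller.poll() now blocks until tasks stalled on memory for
    # >= 150ms within any 1s window; a userspace OOM killer would
    # react here, ahead of the kernel's own OOM killer.
    return fd, poller
```

A daemon like oomd layers its kill policy on top of exactly this kind of wakeup.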
There's a similar issue with Linux's aggressive buffer cache. Even if you disable overcommit, and even if there's technically uncommitted (free) memory remaining, you can easily bring a Linux system to a halt with a combination of I/O and memory contention (see, e.g., https://lkml.org/lkml/2019/8/4/15). The hanging issue can even lead to the OOM killer kicking in when Linux's heuristics determine that memory eviction isn't progressing quickly enough. And, yes, even if you use memory cgroups, you can still hit these situations. At scale and under load you will see these issues regularly. It's a nightmare.
Outside Linux, the enterprise solution to OOM situations was to write robust applications that could handle malloc and mmap failure, naturally responding to memory pressure and allowing them to reliably and deterministically maintain state and QoS. Solaris doesn't do overcommit, for example. Neither does Windows, at least not by default--overcommit-style behavior is opt-in there--so robust applications don't need to fear being shot down by devil-may-care memory hogs.
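In that style, allocation failure is a normal, recoverable event rather than a death sentence. A rough Python analogue of checking malloc's return value (load_into_cache is a hypothetical cache-fill helper, not from the article):

```python
# Sketch of the "robust application" style: on a no-overcommit system,
# allocation failure surfaces here as an error you can handle, instead
# of as a later surprise visit from the OOM killer.
def load_into_cache(nbytes):
    try:
        return bytearray(nbytes)   # allocation may fail under pressure
    except MemoryError:
        return None                # degrade: serve uncached instead

entry = load_into_cache(1 << 20)   # 1 MiB; succeeds on a healthy system
```

The point is the shape of the code: every allocation site has a defined degraded path, so memory pressure changes QoS deterministically instead of killing the process.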
To reiterate, even if you disable overcommit in Linux, the fact of the matter is that many aspects of the Linux kernel were designed and implemented with the assumption of overcommit and loose accounting.
There are various ways to deterministically handle the related buffer cache and I/O contention issue. The simplest is to keep the buffer cache and anonymous memory separate so one doesn't intrude on the other. But of course that's less than ideal from an efficiency and performance perspective. Another is to have a [very] sophisticated I/O scheduler that integrates memory and I/O resource accounting and prioritization, so you can get deterministic worst-case behavior, at least for your most important processes. A partial solution might be to strictly provision memory for disk-mapped executables. But in any event Linux doesn't provide any of these. Any particular application could theoretically implement these itself, but unless all applications do, the kernel can still get wedged.
PSI is just another band-aid. It can improve things dramatically, at least if you have the time and capability to write your own userspace OOM killer and tune it to your particular workloads. (There are no generic solutions--if there were, the existing OOM killer would be good enough.) But when it comes to resource accounting Linux is basically broken by design. It's the price to be paid for its performance and rapid evolution.
In this case, beyond the patches, we built oomd and ran the whole contraption in production on hundreds of thousands of servers. When it meaningfully improved metrics there, it became an easier thing to sell.
Wasn't aware of PSI - thanks to the author. The load-average alerts are mostly too late for our liking, but tooling around this should really help.
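For anyone wiring up alerts: the files under /proc/pressure/ are trivially parseable. A sketch (assumes a PSI-enabled kernel; the sample string mirrors the documented line format):

```python
# Each line of /proc/pressure/{cpu,memory,io} looks like:
#   some avg10=0.31 avg60=0.12 avg300=0.00 total=123456
# "some" = share of time at least one task was stalled;
# "full" = share of time all non-idle tasks were stalled.
def parse_psi(text):
    out = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

sample = ("some avg10=0.31 avg60=0.12 avg300=0.00 total=123456\n"
          "full avg10=0.00 avg60=0.01 avg300=0.00 total=9876")
pressure = parse_psi(sample)
# On a live system: parse_psi(open("/proc/pressure/memory").read())
```

Alerting on avg10 for "full" memory pressure is a much earlier and more direct signal than a climbing load average.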
It's been around since kernel 4.20 according to the article. Do any of the mainstream container orchestration tools use this data, I wonder...
For historical resource utilization, the sar/sadc tools are still a go-to.
I'm sure someone is already working on integrating information made available in PSI to Kubernetes so that the automation bit is taken care of.