Linux Pressure Stall Information (PSI) by Example (unixism.net)
132 points by shuss on Aug 31, 2019 | 18 comments

Wow. This would have totally changed my job of being a Unix sysadmin in the 90's.

Funny how such a basic capability doesn't appear as a batteries included thing until 40+ years after Unix appears. Lots of futzing around with uptime, ps, top, iostat, vmstat, ntop, sar, lsof, free, etc. All of that replaced with a simpler tool to get the high level answer.

That first-level filter is the key to narrowing down your troubleshooting. Bravo. I bet Adrian Cockcroft (Sun, Netflix, now AMZN) approves.

> doesn't appear as a batteries included thing until 40+ years after Unix appears

64kB of RAM in 1980: $405. $1,335 if you adjust for inflation.

64GB of RAM in 2020: $175.

In my mind, that has the most to do with it.

Interesting observation, though the most expensive part (me) hasn't changed much. They'd get much better use of me with better tools.

>> Funny how such a basic capability doesn't appear as a batteries included thing until 40+ years after Unix appears.

Is there anyone on HN who knows why it didn’t appear earlier?

Lack of need? This doesn’t seem like a hard thing to implement - am I missing something in terms of implementation complexity?

I've complained about Linux's load average calculation (the numbers uptime reports) many times (combining cpu and io blocking into one measure makes it impossible to interpret!). I think I was missing two things to come up with this solution:

* A flash of insight to take the specific number of threads out of the picture. (And it's become particularly irrelevant since systems now have a mix of blocking and non-blocking programs, and ever higher concurrency, so radically different thread counts for the same work. Maybe that's part of your answer; that range has expanded over time.)

* The belief I could / willingness to get something like this merged into the kernel. Besides the inherent difficulty of kernel programming (significant but surmountable), the Linux kernel community is not regarded as welcoming... Maybe this is changing. We'll see if Linus's new attitude lasts.

It came about because of Linux' intrinsically unpredictable and unreliable memory management. Even when you disable overcommit, Linux can still easily get into situations where the OOM killer is triggered. Similarly, OOM scores won't reliably and consistently result in the desired process being killed.

Vendors like Google and Facebook have for years created their own userspace OOM killers that attempt to stay one step ahead of the kernel's OOM killer in order to improve consistency and reliability. PSI is just the latest in a long series of features over the years that companies have contributed for the benefit of userspace OOM killers.

There's a similar issue with Linux' aggressive buffer cache. Even if you disable overcommit, and even if there's technically uncommitted (free) memory remaining, you can easily bring a Linux system to a halt with a combination of I/O and memory contention. See, e.g., https://lkml.org/lkml/2019/8/4/15 The hanging issue can even lead to the OOM killer kicking in when Linux' heuristics determine that memory eviction isn't progressing quickly enough. And, yes, even if you use memory cgroups, you can still hit these situations. At scale and under load you will see these issues regularly. It's a nightmare.

Outside Linux, the enterprise solution to OOM situations was to write robust applications that could handle malloc and mmap failure, naturally responding to memory pressure and allowing them to reliably and deterministically maintain state and QoS. Solaris doesn't do overcommit, for example. Neither does Windows, at least not by default--it's opt-in when you do, so robust applications don't need to fear being shot down by devil-may-care memory hogs.

To reiterate, even if you disable overcommit in Linux, the fact of the matter is that many aspects of the Linux kernel were designed and implemented with the assumption of overcommit and loose accounting.

There are various ways to deterministically handle the related buffer cache and I/O contention issue. The simplest is to keep buffer cache and anonymous memory separate so one doesn't intrude onto the other. But of course that's less than ideal from an efficiency and performance perspective. Another is to have a [very] sophisticated I/O scheduler that can integrate memory and I/O resource accounting and prioritization together so you can get deterministic worst-case behavior, at least for your most important processes. A partial solution might be to strictly provision memory for disk-mapped executables. But in any event Linux doesn't provide any of these. Any particular application could theoretically implement these itself, but unless all applications do, the kernel can still get wedged.

PSI is just another band-aid. It can improve things dramatically, at least if you have the time and capability to write your own userspace OOM killer and tune it to your particular workloads. (There are no generic solutions--if there were the existing OOM killer would be good enough). But when it comes to resource accounting Linux is basically broken by design. It's the price to be paid for its performance and rapid evolution.

Beyond implementation, how do you prove that a new statistic is correct and meaningful? Especially when it’s going to become part of the ABI forever?

In this case, beyond the patches, we built oomd and ran the whole contraption in production on hundreds of thousands of servers. When it meaningfully improved metrics there, it became an easier thing to sell.


Wasn't aware of PSI - thanks to the author. The load avg alerts mostly fire too late for our liking, but tooling around this should really help.

Another advantage of PSI is that there also are per-cgroup monitors. Unlike load indicators which are system-global and thus pointless when your cgroup is limited by quotas.
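For anyone who wants to poke at the per-cgroup files, here is a minimal sketch. It assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup and a kernel built with PSI enabled, where each cgroup directory exposes cpu.pressure, memory.pressure and io.pressure. The function names and the example cgroup path are hypothetical.

```python
from pathlib import Path

RESOURCES = ("cpu", "memory", "io")


def cgroup_pressure_files(cgroup, root="/sys/fs/cgroup"):
    """Map each resource to the pressure file of one cgroup (paths only, no I/O)."""
    base = Path(root) / cgroup.strip("/")
    return {res: base / f"{res}.pressure" for res in RESOURCES}


def read_cgroup_pressure(cgroup, root="/sys/fs/cgroup"):
    """Return the raw text of every pressure file that exists for the cgroup."""
    readings = {}
    for res, path in cgroup_pressure_files(cgroup, root).items():
        if path.exists():  # older kernels or cgroup v1 setups will not have these
            readings[res] = path.read_text().strip()
    return readings


# Example call (hypothetical cgroup name):
#   read_cgroup_pressure("system.slice/example.service")
```

The raw text has the same layout as the system-wide files under /proc/pressure, so whatever parsing you already do for those should carry over.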

This is going to come in very useful indeed.

It's been around since kernel 4.20 according to the article. Do any of the mainstream container orchestration tools use this data, I wonder...

In-kernel resource-specific counters are great, but those are limited to 10/60/300-second averages.

For historic resource utilization, sar/sadc tool is still a go-to.

kees99, you should check the "total" fields. They have information in microsecond resolution.
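To make the "total" suggestion concrete, here is a rough sketch of deriving a stall percentage over any window you like from two samples. It assumes the usual file layout (lines starting with "some" or "full", followed by avg10=, avg60=, avg300= and total= fields, with total in microseconds). The function names are made up for illustration.

```python
def parse_psi(text):
    """Parse PSI file contents into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # total is a cumulative microsecond counter; the avg* fields are percentages
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


def stall_percent(before, after, elapsed_seconds, kind="some"):
    """Percentage of wall-clock time spent stalled between two parsed samples."""
    delta_us = after[kind]["total"] - before[kind]["total"]
    return 100.0 * delta_us / (elapsed_seconds * 1_000_000)
```

Sample the file twice, note the elapsed time yourself, and the difference in total gives the stall time for exactly that interval rather than the kernel's fixed 10/60/300-second windows.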

If I have a cron job that reads the pressure file and simply stores its contents with a timestamp for later analysis, would that be enough to determine historic resource utilization?
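A sketch of what such a cron job could look like, in case it helps the discussion. It assumes the files live under /proc/pressure and are named cpu, memory and io; the log path and function names are placeholders. Since every stored line keeps the cumulative total counter, the stall time between any two stored samples should be recoverable later by subtraction.

```python
import time
from pathlib import Path

RESOURCES = ("cpu", "memory", "io")


def snapshot(pressure_dir="/proc/pressure", now=None):
    """Return one 'timestamp resource <raw PSI line>' string per PSI line found."""
    now = time.time() if now is None else now
    lines = []
    for resource in RESOURCES:
        path = Path(pressure_dir) / resource
        if not path.exists():  # PSI disabled or kernel too old
            continue
        for row in path.read_text().splitlines():
            if row.strip():
                lines.append(f"{now:.3f} {resource} {row.strip()}")
    return lines


def append_snapshot(log_path, pressure_dir="/proc/pressure", now=None):
    """Append the current snapshot to log_path; return the number of lines written."""
    lines = snapshot(pressure_dir, now)
    with open(log_path, "a") as log:
        for line in lines:
            log.write(line + "\n")
    return len(lines)


# From cron, something like (placeholder path):
#   append_snapshot("/var/log/psi.log")
```

Cron's one-minute granularity limits how fine the history can be, so the avg10 values between runs are lost, but the totals still account for all stall time in each interval.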

Existing tools could have sufficed for that, and PSI should be good for it too, but a very important capability that PSI gives system administrators is its high resolution. Using PSI with specialized tooling, system administrators can automate sensitive workload scheduling.

I'm sure someone is already working on integrating information made available in PSI to Kubernetes so that the automation bit is taken care of.

why not dump it into a time series database - even something as simple as a set of whisper files on a local disk can be a lot more useful than a flat file. go-carbon is my tool of choice for this kind of job.
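A minimal sketch of that approach, assuming a Carbon-compatible receiver that accepts the Graphite plaintext protocol ("metric.path value timestamp", one metric per line) over TCP. The host, the port 2003 default and the "psi" metric prefix are assumptions to adjust for your own setup, and the function names are made up.

```python
import socket


def carbon_lines(resource, psi_text, timestamp, prefix="psi"):
    """Turn raw PSI file text into Graphite plaintext lines, one per field."""
    lines = []
    for row in psi_text.strip().splitlines():
        kind, *fields = row.split()
        for field in fields:
            key, _, value = field.partition("=")
            lines.append(f"{prefix}.{resource}.{kind}.{key} {value} {int(timestamp)}")
    return lines


def send_to_carbon(lines, host="127.0.0.1", port=2003):
    """Send the lines to a plaintext Carbon listener (assumed host and port)."""
    payload = "".join(line + "\n" for line in lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload)
```

Storing total as well as the averages keeps the door open for computing rates over arbitrary windows in the graphing layer.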

Major improvement over loadavg for troubleshooting, given the difficulty in identifying when a system is io-bound.

This is an excellent write-up. The code example for polling events is going to keep me entertained long enough to get in trouble. Well done shuss!

This is pretty useful and cool.
