
Linux Pressure Stall Information (PSI) by Example - shuss
https://unixism.net/2019/08/linux-pressure-stall-information-psi-by-example/
======
tyingq
Wow. This would have totally changed my job of being a Unix sysadmin in the
90's.

Funny how such a basic capability doesn't appear as a batteries included thing
until 40+ years after Unix appears. Lots of futzing around with uptime, ps,
top, iostat, vmstat, ntop, sar, lsof, free, etc. All of that replaced with a
simpler tool to get the high level answer.

That first-level filter is the key to narrowing down your troubleshooting.
Bravo. I bet Adrian Cockcroft (Sun, Netflix, now AMZN) approves.

~~~
heavenlyblue
>> Funny how such a basic capability doesn't appear as a batteries included
thing until 40+ years after Unix appears.

Is there anyone on HN who knows why it didn’t appear earlier?

Lack of need? This doesn’t seem like a hard thing to implement - am I missing
something in terms of implementation complexity?

~~~
scottlamb
I've complained about Linux's load average calculation (as reported by uptime)
many times (combining cpu and io blocking into one measure makes it impossible
to interpret!). I think I was missing two things to come up with this solution:

* A flash of insight to take the specific number of threads out of the picture. (And it's become particularly irrelevant since systems now have a mix of blocking and non-blocking programs, and ever higher concurrency, so radically different thread counts for the same work. Maybe that's part of your answer; that range has expanded over time.)

* The belief I could / willingness to get something like this merged into the kernel. Besides the inherent difficulty of kernel programming (significant but surmountable), the Linux kernel community is not regarded as welcoming... Maybe this is changing. We'll see if Linus's new attitude lasts.

------
navinsylvester
Ubercool.

Wasn't aware of PSI - thanks to the author. The load avg alerts mostly arrive
later than we'd like, but tooling built around this should really help.

------
the8472
Another advantage of PSI is that there are also per-cgroup monitors, unlike
load indicators, which are system-global and thus useless when your cgroup is
limited by quotas.
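For the curious: on cgroup v2 each cgroup gets its own cpu.pressure, memory.pressure, and io.pressure files, in the same format as the system-wide /proc/pressure/* files. A minimal parsing sketch (the cgroup path below is a made-up example):

```python
# Sketch: reading per-cgroup pressure on cgroup v2. The cgroup name is
# hypothetical; the file format matches the system-wide /proc/pressure files.
import os

def parse_pressure(text):
    """Parse PSI file contents like
        some avg10=0.12 avg60=0.05 avg300=0.01 total=123456
        full avg10=0.00 avg60=0.00 avg300=0.00 total=7890
    into {'some': {'avg10': 0.12, ...}, 'full': {...}}."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        out[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return out

sample = ("some avg10=0.12 avg60=0.05 avg300=0.01 total=123456\n"
          "full avg10=0.00 avg60=0.00 avg300=0.00 total=7890\n")
print(parse_pressure(sample)["some"]["avg10"])  # -> 0.12

path = "/sys/fs/cgroup/mygroup/memory.pressure"  # hypothetical cgroup
if os.path.exists(path):
    with open(path) as f:
        print(parse_pressure(f.read()))
```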

~~~
danw1979
This is going to come in very useful indeed.

It's been around since kernel 4.20 according to the article. Do any of the
mainstream container orchestration tools use this data, I wonder...

------
kees99
In-kernel resource-specific counters are great, but those are limited to
10-, 60-, and 300-second averages.

For historic resource utilization, sar/sadc tool is still a go-to.

~~~
kccqzy
If I have a cron job that reads the pressure file and simply stores its
contents with a timestamp for later analysis, would that be enough to
determine historic resource utilization?
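Something along these lines, as a rough sketch (the install path and log file in the comment are made-up examples; cron handles the scheduling and the output redirect):

```python
# Sketch of the cron-job idea: print timestamped CSV rows built from
# /proc/pressure/*. A crontab entry can redirect stdout to a log, e.g.
#   * * * * * /usr/local/bin/psi_log.py >> /var/log/psi.csv   (hypothetical paths)
import os
import time

def pressure_rows(ts, resource, text):
    """Turn one pressure file's contents into CSV rows of the form
    ts,resource,kind,avg10,avg60,avg300,total."""
    rows = []
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        vals = dict(f.split("=") for f in fields)
        rows.append(",".join([str(ts), resource, kind,
                              vals["avg10"], vals["avg60"],
                              vals["avg300"], vals["total"]]))
    return rows

if __name__ == "__main__":
    ts = int(time.time())
    for res in ("cpu", "memory", "io"):
        path = "/proc/pressure/" + res
        if os.path.exists(path):  # skip kernels without PSI
            with open(path) as f:
                for row in pressure_rows(ts, res, f.read()):
                    print(row)
```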

~~~
shuss
While already-available tools should suffice, and PSI works well for this too,
a very important capability PSI gives system administrators is its high
resolution. Using PSI with specialized tooling, system administrators can
automate the scheduling of sensitive workloads.

I'm sure someone is already working on integrating the information PSI makes
available into Kubernetes, so that the automation bit is taken care of.

------
magoon
Major improvement over loadavg for troubleshooting, given the difficulty in
identifying when a system is io-bound.

------
rwha
This is an excellent write-up. The code example for polling events is going to
keep me entertained long enough to get in trouble. Well done shuss!
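For anyone who'd rather poke at it without C: here's a rough Python sketch of the same trigger/poll flow, assuming a kernel with PSI triggers (4.20+ upstream). The "150 ms of stall within a 1 s window" threshold is just an example value:

```python
# Sketch: register a PSI trigger on /proc/pressure/memory and poll for
# POLLPRI events, mirroring the C polling example from the article.
import os
import select

def make_trigger(kind, stall_us, window_us):
    """Build the trigger string the kernel expects, e.g. 'some 150000 1000000'
    (report when 'some' stall time exceeds stall_us within window_us)."""
    return "%s %d %d" % (kind, stall_us, window_us)

print(make_trigger("some", 150000, 1000000))  # -> some 150000 1000000

PSI_PATH = "/proc/pressure/memory"
if os.path.exists(PSI_PATH):  # skip kernels without PSI
    fd = os.open(PSI_PATH, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, (make_trigger("some", 150000, 1000000) + "\0").encode())
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        # wait up to one second for a single pressure event
        for _, event in poller.poll(1000):
            if event & select.POLLPRI:
                print("memory pressure event fired")
    except OSError as e:
        print("trigger registration failed:", e)
    finally:
        os.close(fd)
```

Closing the fd unregisters the trigger, so a real monitor would keep it open and loop on poll().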

------
markandrewj
This is pretty useful and cool.

