The many load averages of Unix (utoronto.ca)
80 points by jsnell on Feb 17, 2016 | hide | past | favorite | 10 comments

Well, that was probably as frustrating to write as it was to read... The author does a nice job of forensics but can't really pin down the meaning for any particular OS.

My preference would be a well-defined, even if of limited use, load average rather than a poorly defined but maybe-more-useful one. From the article, it sounds as if kernel authors tried to implement the former, found it difficult as machines grew, and then everyone headed off in different directions and implemented lots of the latter...

Best I can tell after a quick glance at the history of the metric, the value of including IOWait depends more on the hardware than the software.

If your IO is performed by the CPU rather than by DMA, counting those processes makes sense (the IO will slow down anything else). But if the IO is done via DMA (or otherwise offloaded from the CPU), it makes less sense.

Generally, processes in IOWait are not running; they are sitting in a queue waiting for their data.

Anyone waiting on the process is feeling pain, but the process isn't eating any CPU, so it comes down to what you want the metric to mean: should it reflect user discomfort or consumption of CPU cycles?

Yes, the process is sleeping, but the system (OS and hardware) is still actively doing whatever IO was requested.

The reason load average typically includes processes stuck in IOwait is that several systems decide whether or not to start more processes based on the load average. One of these is make, and people also look at the load average manually to work out whether a machine can handle more work.
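GNU make's -l/--load-average option is exactly this kind of consumer; a quick demo (the throwaway Makefile here is just for illustration):

```shell
# GNU make refuses to start additional jobs while the 1-minute load
# average is above the -l limit. Write a trivial one-target Makefile:
printf 'all:\n\t@echo built\n' > /tmp/loadavg-demo.mk
# Up to 8 parallel jobs, but no new ones started while load avg > 4
make -f /tmp/loadavg-demo.mk -j8 -l4
```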

When a machine runs out of RAM and starts thrashing, you most definitely do not want more stuff running on it. You want the load average to be high in this case, to discourage people or programs from starting more work. Luckily, when a system starts thrashing it typically has lots of processes in IOwait, so the load average goes through the roof.
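On Linux you can watch this happen: /proc/loadavg holds the averages, and processes in uninterruptible sleep show up as state "D" in ps output:

```shell
# First three fields are the 1-, 5- and 15-minute load averages
cat /proc/loadavg
# Count processes in uninterruptible ("D") sleep, which Linux folds
# into the load average alongside runnable ("R") processes
ps -eo stat= | awk '$1 ~ /^D/ {n++} END {print n+0}'
```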

I would say I prefer the latter. When I look at the load average, I don't really care exactly what the number means, as long as it goes up when the system is more heavily loaded, in a way that reasonably correlates with the magnitude of the increase.

Over time you learn what a "normal" value is, so checking the load average is a nice quick health check. There are many other, better-defined metrics to look at when there's actually a problem. If there is a problem, even a well-defined load average is too coarse a metric to give particularly meaningful insights.
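For what it's worth, the number is an exponentially damped moving average of the run-queue length, sampled every few seconds. A small Python sketch of the update (the 5-second interval matches Linux's LOAD_FREQ; the rest of the names are mine):

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples (Linux's LOAD_FREQ)

def update(load, nrun, period=60.0):
    """One exponentially damped update of a load average:
    'load' decays toward the current run-queue length 'nrun',
    with a time constant of 'period' seconds (60 for the 1-min avg)."""
    e = math.exp(-SAMPLE_INTERVAL / period)
    return load * e + nrun * (1.0 - e)

# A machine that suddenly has 2 runnable processes: the 1-minute
# average climbs toward 2 but lags well behind.
load = 0.0
for _ in range(12):          # one minute of 5-second samples
    load = update(load, 2.0)  # load ends up around 1.26, not 2
```

This damping is why the number "goes up when the system is more heavily loaded" but only catches up to a sustained load over several minutes.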

Brendan Gregg has a nice video on how Solaris 10's iowait works. It starts a little slow and may even sound imprecise or inaccurate, but it gets better around 6 minutes in.


As the article said, NFS has a nasty habit of rocketing your load average when the server fails, and NFS can be very unstable if everything aligns correctly (or badly).

When the NFS server decides to take a rest, I've seen loads of 1000+ on production Linux (3.13+) systems.

Or kernel panic


Or cats and dogs living together

I appreciate the effort, but that article told me almost nothing.

This one from Linux Journal[0], however, did.

[0] - http://www.linuxjournal.com/article/9001

I think a good metric would be directly related to the amount of progress a single thread with default priority makes on the CPU per unit of time.
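One hypothetical way to measure that (the names and the workload here are mine, purely to illustrate the idea): time a fixed CPU-bound chunk of work and compare it against its unloaded baseline.

```python
import time

def spin(n=200_000):
    """Fixed CPU-bound workload: sum the first n integers in a loop."""
    total = 0
    for i in range(n):
        total += i
    return total

def progress_ratio(baseline_seconds):
    """Hypothetical metric: the unloaded time for the workload divided
    by the time it takes now. Near 1.0 on an idle machine, smaller
    when this thread is getting less of the CPU."""
    start = time.perf_counter()
    spin()
    return baseline_seconds / (time.perf_counter() - start)
```

Unlike a load average, this directly answers "how much slower is my work going right now?", at the cost of burning some CPU to find out.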

