Linux Load Averages: Solving the Mystery (brendangregg.com)
623 points by dmit 130 days ago | 86 comments



Awesome analysis, I have added it to my favorites list. Around 1990 or so I was in the kernel group at Sun, and a team had just embarked on the multi-processor kernel work that would later result in the 'interrupts as threads'[1] paper. During that time there was an epic email thread along the lines of "What the F*ck does load average mean on an MP system?" (no doubt I have a copy on an unreadable quarter inch tape somewhere :-(). If it helps, the exact same pivot point was identified: does 'load average' mean the load on the CPU or the load on the system? While there were supporters in the 'system' camp, the traditionalists carried the day with "We can't change the definition on existing customers, all of their shell scripts would break!" or something to that effect. Basically, the response was that if we were to change it, we would have to call it something different to maintain a commitment to the principle of least surprise. This has never been an issue for Linux :-).

As a "systems" guy I am always interested in how balanced the system is, which is to say that I am always trying to figure out what the slowest part of my system is and ensuring that it is within some small epsilon of the other parts. If you do that, then system load is linear with workload almost regardless of task composition. So disk heavy processes load the "system" as much as "compute heavy", "memory heavy" or "network heavy" ones. In an imaginary world you could decompose a system into 'resource units' and then optimize it for a particular workload.

[1] http://dl.acm.org/citation.cfm?id=202217


Uh.. complete but relevant aside:

All you old farts (TM) need to get these freaking quarter inch tapes pushed up to some glacier S3 bucket or some such bucket before you kick said bucket...

I'm serious. C'mon, don't steal from the future what you actually did in the past to make the present the reality of today!!


It's all about the context though :-). One of the things the Computer History Museum has done a good job of is capturing a lot of the historical underpinnings. But while it's interesting, in a "wow, isn't that interesting" sort of way, that railroad track gauge is the same as cart ruts which are the same as Roman chariot widths (not exactly: http://www.snopes.com/history/american/gauge.asp), much of this stuff was fairly constrained by the choices of the time, and as such the information generally ages poorly.


The usefulness of the info may age poorly.. but the historical context won't

Else we end up in a tech situation far in the future where the world looks in the mirror on acid and says "how the fuck did we get here?" - and the tapes provide no answers

Btw, just last weekend I was harvesting 100 year old railroad spikes from a western timber company rail in Sonora because of the context of historical significance - not because I plan to lay a new track...


... like on usenet that was archived by Deja, bought by Google, and so well organized and surfaced for eternal posterity.


What samstave said... but, I'll buy the beer.


I wonder if Illumos / SmartOS and OpenIndiana continued with the principle of least surprise via their ancestry chain, or whether they also moved to a "system" load view like Linux.

It's great that design decisions and thinking from decades past can be dug out and examined by complete strangers.


It's called POLA in the FreeBSD world. Principle of least astonishment. So this exists outside Solaris too. :)


I've always seen POLA as "principle of least authority" in an OS research context, just fyi.


I think the phrase has probably been mutated many times by various groups.


Of course, at this point, most people will be astonished when their Solaris (or Solaris-derived) or FreeBSD system shows CPU load instead of system load, given that most *ix exposure now happens via Linux.


Anecdotally, people whose first *ix exposure is via Linux are still astonished that the load average doesn't mean CPU load.


This is great work in general and excellent historical research.

As an additional historical note: in Unix, load averages were introduced in 3BSD, and at that time they included processes in disk IO wait and other theoretically short-term waits that weren't interruptible. This definition was carried through the BSD series and onward into Unixes derived from them, such as the initial versions of SunOS and Ultrix. At some point (perhaps SunOS 3 to SunOS 4, perhaps later), the SunOS/Solaris definition changed to be purely runnable processes.

(I'm not sure what System V derived Unixes such as Irix, HP-UX, and so on did, and their kernel source is not readily available online for spelunking.)

As of early 2016 when I last looked at this, the situation on FreeBSD, OpenBSD, and NetBSD was somewhat tangled. FreeBSD load average only included runnable processes, but NetBSD and OpenBSD counted some sleeping or waiting processes as well.


When details of a piece of "open" software are so easily lost, I shudder to think about the vast quantity of "closed" software that has had its history lost.

I also kept thinking about how the term "software archaeology" (which I first saw in the 1999 Vernor Vinge novel "A Deepness In the Sky") becomes more and more mainstream each day.


I always thought "programmer-at-arms" was a brilliant job title and well describes what some of us do.


Thanks, I didn't know those Unix versions included disk I/O.


I think the reason for Sun changing it was due to the advent of NFS, which made measures of process based on I/O blocking inaccurate (and bound to network latency). I believe Cantrill mentioned this factoid at some point, but I don't recall the reference.


As it happens, this isn't the case. Sun introduced NFS in SunOS 2, which unsurprisingly had the BSD view of load averages and did count NFS IO as short-term waits. A NFS server with problems could send client load averages up to monstrous levels (a load average of several hundred was not unusual, all of it from processes waiting for the NFS server) and many sysadmins got to see this first hand at the time.


My point was that the reason for changing it is because of NFS and that it caused sysadmins to become confused. Not that it was changed along with the NFS release.


The numbers are definitely not comparable between Linux and some other Unix variants. OpenBSD "idle" load averages are about 1, for example.


Several years back the company I worked for ended up picking up some work for a client. Every quarter we'd download a huge trove of TIFFs from some source, and then do some image conversion work before transferring them to the customer's infrastructure.

There was a Java application that powered the logic side of things, calling out to ImageMagick to do the actual processing and conversion. For whatever reason, after careful benchmarking we settled on a Java thread count that happened to get us the peak throughput, but also caused system load average to hit around 400 and keep steady at around that level.

The day that happened, and I could show that no application on the server took a performance hit, was the day that I finally persuaded my boss that load average is an interesting stat, but it's not the be-all and end-all, and that a high load average doesn't necessarily correlate to an actual problem.


I had something similar happen a long time ago on an x86 Solaris 10 mail server. An employee thought it was a good idea to share best quality/full resolution JPEG pictures of his new baby with the whole company. This swamped the mail server (load average was well over 700) while it chugged through delivering a 50 MB email to 200+ employees. I forgot what process was the culprit (I think GNU Mailman) but after a couple of hours it finally settled down. I was amazed that I could still SSH into it and figure out what happened.


One source of high load average spikes that I've seen in my job is when a process crashes and generates a core dump. While the core dump is being written, all threads in the process are in the TASK_UNINTERRUPTIBLE state even though they are doing absolutely nothing, and as such they all count towards the load average as if they were spinning on a CPU core. If the total virtual memory of the process is large, say in the multi-GB range, coredumping can take on the order of a minute, and Linux will report an unreasonably high load average if that process had a lot of running threads.
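For a rough sense of scale, here is a minimal sketch of how such a spike builds up. It assumes the exponentially damped average that Linux uses (a sample roughly every 5 seconds, a 60-second time constant for the 1-minute figure) and works in floating point rather than the kernel's fixed-point arithmetic. The function names and the 64-thread scenario are made up for illustration.

```python
import math

SAMPLE_INTERVAL_S = 5  # assumed sampling period (Linux samples about every 5 s)


def damped_average(previous, active, window_s, interval_s=SAMPLE_INTERVAL_S):
    """One update step of an exponentially damped moving average."""
    decay = math.exp(-interval_s / window_s)
    return previous * decay + active * (1.0 - decay)


def simulate(active_tasks, duration_s, window_s=60, start=0.0):
    """Load reported after `active_tasks` stay counted for `duration_s` seconds."""
    load = start
    for _ in range(duration_s // SAMPLE_INTERVAL_S):
        load = damped_average(load, active_tasks, window_s)
    return load
```

Under these assumptions, `simulate(64, 60)` comes out at about 40: one minute of 64 threads sitting in TASK_UNINTERRUPTIBLE pushes the 1-minute figure to roughly 63% of the thread count, starting from an idle system.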

Things like the above scenario make me treat the load average metric with a lot of skepticism. I would much rather use other metrics to infer load.


I rarely recommend alerting, monitoring, or any kind of action based on load averages or more generally any metric derived from queue lengths. It's trends in high-quantile queue latencies your users (and therefore you) should care about.
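As a minimal sketch of what tracking a high-quantile latency could look like, assuming you already collect per-request latencies in fixed time windows: the nearest-rank percentile below is one common definition among several, and the function names are hypothetical.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def p99_trend(previous_window, current_window):
    """Change in p99 latency between two windows (positive means getting worse)."""
    return percentile(current_window, 99) - percentile(previous_window, 99)
```

Alerting on `p99_trend` staying positive across several windows follows the latency your slowest requests actually see, rather than a queue length.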


Kind of ironic, given that the whole article is about the divergence of system load from being a queue length metric.


If it was bothering anyone else: yes, the parentheses in the patch in the email are unbalanced, and the code was checked in as:

                if (*p && ((*p)->state == TASK_RUNNING ||
                           (*p)->state == TASK_UNINTERRUPTIBLE ||
                           (*p)->state == TASK_SWAPPING))


They're... not unbalanced.


Under Better Metrics the author discusses ways of drilling down to find the source of a high load average. I feel like this section should mention `atop`, which is imo a really underrated single-pane-of-glass view into everything your system is doing, now and historically.

If you haven't tried `atop`, give it a go.

The historical analysis in this article is great though, because while Load Average has been an oft-discussed and well-understood topic for a long time, the decisions that got us there are not.


I sometimes hear about “atop”, wonder why I haven’t installed it, install it, discover that it starts (and requires) two additional daemon processes, at which point I remember, and promptly uninstall it again.


Good article. However, it is missing the reason why load averages include tasks waiting for disc/swap.

One of the things that the load average is sometimes used for is to work out whether it is appropriate to start some more processes running on a system. For example, make has a "-l" option, which prevents more parallel jobs being run while the load is above a supplied number. When a system is thrashing due to insufficient RAM, then the load average will be high, and this option will appropriately prevent more tasks being started which would make the thrashing worse. If the load average was just based on CPU, then it would be low while thrashing, and using that make option could lead to complete system collapse.
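The gating idea can be sketched in a few lines. This is an illustration of the principle only, not how make implements `-l`; the function name and parameters are made up, and `os.getloadavg()` is used as the source of the 1-minute figure.

```python
import os


def may_start_job(running_jobs, max_load, load_1min=None):
    """Decide whether another parallel job may start, in the spirit of `make -l`."""
    if running_jobs == 0:
        return True  # always allow one job so the build keeps making progress
    if load_1min is None:
        load_1min = os.getloadavg()[0]  # 1-minute load average reported by the OS
    return load_1min < max_load
```

On a system whose load average includes tasks waiting on swap, this check holds back new jobs during thrashing; with a CPU-only load average the same check would keep starting them.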



> As a set of three, you can tell if load is increasing or decreasing

That could be accomplished with a set of two.

A set of three could in theory give you acceleration.


This comment makes perfect sense if load is a smooth function. But it is not. It tends to be a step function.

The most recent 2 data points tell you whether the problem is currently getting worse, getting better or steady. The third gives you a sense of whether it has been going on a while.


FWIW, here's a quote from the Bobrow 1972 TENEX paper that I did not include (as the post was getting too long):

"Three figures are better than just the last one, because from these the user can predict the trend as well as note local variation."


> This comment makes perfect sense if load is a smooth function. But it is not. It tends to be a step function.

I think that depends on the sampling frequency, doesn't it? (given a modern OS with lots and lots of threads and processes)


No. Check out this video by Zach Tellman. He talks about queues and how they break down under load. One of the least intuitive things he points out is that when you have more processors, the breakdown tends to be more of a step function: everything is running smoothly till the moment that it isn't.

https://www.youtube.com/watch?v=1bNOO3xxMc0

The point he makes arises from basic queue theory and is applicable to all kinds of systems, and how those systems react to load. It's got little to do with particular hardware and everything to do with basic math.
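The step-function behaviour can be illustrated with the textbook Erlang C formula for an M/M/c queue. This is a minimal sketch under idealized assumptions (Poisson arrivals, exponential service times, identical servers) and is not taken from the talk; the function names are made up for illustration.

```python
import math


def erlang_c_wait_probability(servers, arrival_rate, service_rate):
    """Probability that an arriving job has to queue in an M/M/c system."""
    offered = arrival_rate / service_rate   # offered load in "servers' worth" of work
    utilisation = offered / servers
    if utilisation >= 1:
        return 1.0  # saturated: every arrival waits
    top = offered ** servers / math.factorial(servers) / (1 - utilisation)
    bottom = sum(offered ** k / math.factorial(k) for k in range(servers)) + top
    return top / bottom


def mean_queue_wait(servers, arrival_rate, service_rate):
    """Mean time a job spends waiting in the queue (excluding service time)."""
    if arrival_rate >= servers * service_rate:
        return math.inf  # the queue grows without bound
    wait_probability = erlang_c_wait_probability(servers, arrival_rate, service_rate)
    return wait_probability / (servers * service_rate - arrival_rate)
```

Comparing `mean_queue_wait(16, 12.0, 1.0)` with `mean_queue_wait(16, 15.9, 1.0)` shows the wait staying small at 75% utilisation and becoming far larger just short of saturation.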


That was a great talk and really sound recommendations he makes.


No. It depends on the fact that when something decides to go wrong, it frequently goes south fast. So, for example, a busy lock in a database goes from 99% of capacity (using very little resources) to 101%, processes start backing up, and the system goes haywire.

Think of it as being like traffic. Analytically it is easy to think of smoothly varying speeds. Reality is that there is a car accident, then a sudden traffic jam. We are poking around to figure out where and when that traffic jam happened. And sometimes the cars get cleared off the road and by the time we begin looking the jam is already evaporating.

So comparing the 1 min and 5 min load averages tells us whether the jam is getting worse, holding steady, or improving on its own. Looking at the 15 minute one tells us whether this happened recently.
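That reading of the three numbers can be written down as a small sketch. The 10% tolerance and the function name are arbitrary choices for illustration, not an established rule.

```python
def describe_load(one, five, fifteen, tolerance=0.1):
    """Return (short_term, long_term) trends from the 1, 5 and 15 minute load averages."""
    def compare(recent, older):
        if recent > older * (1 + tolerance):
            return "rising"
        if recent < older * (1 - tolerance):
            return "falling"
        return "steady"

    # 1 min vs 5 min: is the jam getting worse right now?
    # 5 min vs 15 min: did it start recently, or has it been going on a while?
    return compare(one, five), compare(five, fifteen)
```

For example, `describe_load(8.0, 3.0, 1.0)` gives `("rising", "rising")`, a jam that started recently and is still getting worse, while `describe_load(1.0, 3.0, 3.1)` gives `("falling", "steady")`, a longer-running jam that is now clearing.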


See also: swap

Performance tends to degrade rather...rapidly when you start to actually meaningfully swap actual working memory. With modern quantities of RAM I'd almost prefer to just run swapless and let the system OOM so it can just be rebooted and get on with it...


Linux doesn't handle this case well. You'll eventually get the OOM, but the thrashing is actually worse than you get via swap (it arrives more suddenly and causes more severe slowdown than swap thrashing, making it difficult to manually fix the problem). I think this is because the disk cache gets severely squeezed before the OOM killer actually gets invoked, but I'm not sure.


Never go to sea with two chronometers; take one or three.


That's to measure error, not acceleration ;)


This analysis cleared up a mystery for me. I've noticed that when a server app is under heavy load in Linux, the load average goes high if the bottleneck is the CPU or the disk, but the load average goes low if the bottleneck is network resources (like databases or microservice calls). I know why that happens, but it's very unintuitive and it confused me when I was new to Linux. I thought load average would measure the CPU load only. Now I know the historical reasons for measuring system load instead of CPU load.

I kind of like it the way it is since it's handy to be able to distinguish network load from CPU+disk load just by looking at the load average. However, since the load average includes other stuff as well, sometimes I still don't know what the load average really means.


Holy crap, Brendan Gregg's site went down. Proof he is human I guess?


Yes, sorry. I guess proof this is a hobby on some personal hosting that can get overloaded. Try refreshing. Although its load averages (couldn't resist) aren't that high:

    10:36:09 up 34 days, 20:05,  1 user,  load average: 2.39, 2.34, 2.08


Load averages should probably get even higher and include network load - right now, a saturated ethernet card doesn't show up in the load

(network card load is one of the metrics I check next if load average and wait%/user% etc aren't telling me what's wrong)


Very good.

And yes, I'd noticed on many *nix systems that it didn't seem to be pure CPU (I think I once had a single-CPU SunOS 4.1.1 NFS server report into the tens or hundreds because disc was slow), but I've mainly been treating it as if it were CPU for ~30 years. Goodness knows what I might have tuned better!

Thank you!


Very much appreciate what you give away for free!


Thank you for the superlative spelunking, that was a great trip down memory lane! :)


The cobbler's children have no shoes :)

Just because we can deploy services that can take a million RPS doesn't mean we have our side projects / hobby sites in order, hah. I worked in hosting for a long time and I had a personal WordPress site which would get hacked every other month. I literally fixed that problem daily at $JOB, but couldn't be arsed to do something better for myself. It worked, and it was quick and easy. The point was the content.

These days, I'd just use something like Medium or Tumblr. Let someone else worry about hosting it :)


I still managed to read the whole thing. Quite fascinating, really, considering the lengths he went to in tracking down the ancient (1993) patch that turned CPU load averages into whole system load averages.


I am beginning to suspect that Brendan Gregg is not a real person, but a collection of extremely talented individuals publishing under a pseudonym. ( https://en.wikipedia.org/wiki/Nicolas_Bourbaki )

PS: yes, this is meant to be a compliment on the prolific output on performance related topics that Gregg puts up on his page.


Thanks. It's clones.

http://www.brendangregg.com/Images/brendan_clones2006.jpg

(I made that in response to a similar comment back then...)


This made me laugh out loud. Thanks.

Also wow, what a time capsule that picture is in terms of office style and computer hardware.


That's amazing!


I wonder what a good solution for aggregating all of this data and making it more easily searchable next time would look like, and if there's anyone working on it.

It seems like such a waste to have it scattered all over the place, and for all the author's hard work in tracking it down to go to waste.


Why isn't there one for RAM in i3? I read something about how it's hard to gauge RAM usage, even though htop and inxi display it. On Windows you just look at Task Manager and there is memory usage.


Do you want RAM use, or virtual memory use?

Here's an article on gathering this data on Windows with Powershell:

https://www.petri.com/display-memory-usage-powershell


Thanks I primarily use Linux with i3-wm

Not sure which is which, RAM or VRAM, though I have heard of both. Probably not VRAM, my computers are generally garbage.

Thanks for the link.


Virtual RAM is what your OS gives out to your programs. Hardware devices will also have addresses mapped in this space.

Physical RAM is just that - the physical RAM in your PC. Virtual RAM uses this entirely, and then some. A page table maps virtual address locations to physical address locations. The addresses which are in use by programs, but not frequently used, are "paged" (written) to memory addresses in the swap file on the storage device. In this way, programs get the safety net of having every version of every possible library ever written in any permutation of the universe loaded into memory, while the OS can conserve fast storage for other active programs.

This is why 32bit OSes present a variable amount of RAM (less than 4GB) on systems with 4GB of physical RAM. They can only address 4GB of virtual memory, and each device has to use a few of those addresses for its hardware mapping. So 32bit OSes with more devices actually had slightly less RAM available to programs.

This only scratches the surface. If you want real fun, delve into the Windows 32bit 3GB user mode checkbox.


It's incredibly detailed, including references and historical investigation. Mind blowing. Kudos, Brendan Gregg.


Worth remembering that essentially the same issue exists at a lower level: the “%Cpu” number as shown by top includes not just the share of time spent actually executing your instructions, but also the share of time waiting on memory access.

As explained by the same author: http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-...


When I'd asked Brendan via Twitter for an article on Load Averages in Linux, I hadn't expected such a detailed response. I've worked on a few projects where I've had to show that even though the "load" on the Linux system was low, it was really the steal% and the iowait that were killing performance. I'm sure that from now on, so many system and support engineers will have a good article to reference. Thanks, Brendan!
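For anyone wanting to make the same case, here is a minimal sketch of pulling steal% and iowait% out of two samples of the aggregate `cpu` line in `/proc/stat`. The field order follows the documented layout (user, nice, system, idle, iowait, irq, softirq, steal); the helper names are hypothetical.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def share_between(before, after, field):
    """Percentage of elapsed CPU time spent in `field` between two samples."""
    deltas = {name: after.get(name, 0) - before.get(name, 0) for name in FIELDS}
    total = sum(deltas.values())
    return 100.0 * deltas[field] / total if total else 0.0


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()
```

Taking `parse_cpu_line(read_cpu_line())` twice, a few seconds apart, and passing both samples to `share_between(..., "steal")` or `share_between(..., "iowait")` gives the percentages over that interval.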


Yes, thanks for the question, it reminded me that I'd never got to the bottom of uninterruptible before, so I was really determined to do so this time.


My company took over production support of an ESB from another company for a client a couple of years ago. Each worker node had about 100 JVMs running on it, and its resting Load Avg was around 30. This on a 2 CPU RHEL VM.

Out of morbid curiosity, I restarted one of the test servers and ran top. Load Avg was in the order of 2200 for about 3 hours.

The worst part was that the guys we took it over from didn't even think it was a problem.


What was the business impact of the high load? When reducing the load improves the company's bottom line, it should absolutely be pursued.


Page swapping seems like it makes a lot of sense to include in the load average. Disk I/O seems like something more towards the opposite end of the spectrum, though TASK_KILLABLE (https://lwn.net/Articles/288056/) presumably mitigates this where used.


What we need is a systems model that allows us to assess the overall health of a server in a single metric. Signs of something under strain would be reflected in the metric and draw our attention for further drilldown and analysis. "Load Average" is the metric we (the systems community) have generally been using for this. Unfortunately it appears that the model it is based on may be rather dated and may have flaws which mean we will miss or misinterpret system health status by relying on that number. So the million dollar question is: starting from scratch, how can we design a model of our system that yields a reliable system health indicator metric?


OT: what could cause a system to have a load of 1 when idle?

I have one (unimportant) Linux system that idles with a load of exactly 1. The issue persists through reboots. It is a KVM virtual machine and qemu confirms nothing is going on in the background.

Any ideas how to find out what's causing it?


I have the same problem on my laptop after loading the nvidia kernel module -- so something in the kernel is always uninterruptible, but it's not a process that I can see in userspace.

    $ uptime; ps -L aux | awk '{ print $10 }' | sort -u
     15:57:37 up 21 days, 22:49, 16 users,  load average: 2.23, 1.61, 1.61
    R+
    Rl
    S
    S+
    S<
    SLs
    SN
    STAT
    Sl
    Ss
    Ss+
    Ssl
    T
    $ ps -L aux|awk '($10 ~ /^R/){ print }'
    rkeene    1025  1025  0.0    1  0.0  17980  2344 pts/10   R+   15:57   0:00 ps -L aux
    $


Any processes stuck in D state?
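One way to check, as a minimal sketch: walk `/proc/<pid>/task/<tid>/stat` and report every thread whose state field is `D`. This assumes a Linux procfs mounted at `/proc`; the helper names are made up for illustration.

```python
import os


def task_state(stat_text):
    """Extract (comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Yield (pid, tid, comm) for every thread currently in state D."""
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as handle:
                    comm, state = task_state(handle.read())
            except (OSError, ValueError, IndexError):
                continue  # thread exited or the file was unreadable
            if state == "D":
                yield int(pid), int(tid), comm
```

Running `list(uninterruptible_tasks())` a few times in a row helps separate tasks that are permanently parked in D from ones that are just briefly waiting on IO.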


There is one, a [hwrng] process, but I don't think that's it. It's also in D state on other virtual machines on the same host without this symptom.

(The process is probably from virtio-rng.)


I am assuming the guest (and host) kernels are sufficiently recent? I remember older kernels having a bug with load calculation. Does disabling Virtio-rng help? D state processes will cause rise in load average depending on NRCPUs.


I thought that including disk wait in the load average was a common Unix feature. Sadly I can't go spelunking through the archives right now, but it would be interesting to see what Solaris and BSD do, for comparison with systems a little bit closer to Linux than TENEX :-)


Solaris and BSD load averages are based on CPU only. As for avoiding TENEX, here's the comment from the freebsd src:

    /*
     * Compute a tenex style load average of a quantity on
     * 1, 5 and 15 minute intervals.
     */
    static void
    loadav(void *arg)
    {
    [...]
:)


It’s indeed different, Solaris doesn’t count time waiting for the disk in the load average.


there's a very good (and old) article about linux load averages here: http://www.linuxjournal.com/article/9001?page=0,0


It's been years but I really remember that Solaris load avg used to similarly be affected by I/O, particularly NFS.


Great article. Interesting, insightful, and actionable.

Cheers


brendan, you could consider adding an option to offcputime that merges all kworker stacks together, since they're really just separate workers in the same thread pool.
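A rough sketch of what such a merge could look like as post-processing, assuming folded output of the form `comm;frame;frame count` (one stack per line, process name first). The regular expression and the `merge_kworkers` name are illustrative and not part of offcputime.

```python
import re
from collections import Counter

# Matches a leading kworker thread name such as "kworker/0:1" or "kworker/u8:2".
KWORKER = re.compile(r"^kworker/[^;]*")


def merge_kworkers(folded_lines):
    """Sum folded-stack counts after collapsing all kworker names into one."""
    totals = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, value = line.rpartition(" ")  # the count is the last field
        totals[KWORKER.sub("kworker", stack)] += int(value)
    return totals
```

Stacks from `kworker/0:1` and `kworker/u8:2` that share the same frames then end up as a single `kworker;...` entry with their counts added together.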


good idea, thanks.


Brilliant. Time to patch it back to CPU loads.


No, it's long past time to discard it. Measure what you actually care about, not some synthetic, poorly understood "load".


This is an incredible analysis! Well done!


netdata is a good tool if you are looking for precise data on where the bottlenecks are on your server.


"Do we want to measure demand on the system in terms of threads, or just demand for physical resources?"

The intent behind load averages is to measure how (over)loaded the hardware is; if you now try to re-define that intent, it will just be yet another Linux atrocity where Linux will be "special" and behave differently from how every other UNIX-like OS behaves (exempli gratia: ss versus netstat). I argue that this will help the momentum against Linux already fueled by, and well underway with, the systemd fiasco. You would break the rule of least surprise. It's bad enough that Linux measures load differently from all other UNIX-like operating systems and this would make the situation even worse.



