And that is how I made payroll for BitKeeper in the early days.
I think it's cool that this stuff is far more common knowledge these days. All I was doing was figuring out whether it was CPU, network, disks, file system, VM system, or memory. 99% of the time I knew what the problem was in a few minutes; all the real effort was tracking down what program (or programs) were causing it. Funnily enough, most people didn't care what the cause was, they just wanted to know what to buy to make it not an issue. Crazy amounts of money being thrown at Sun equipment.
I stopped because I had enough money for payroll and wanted to work on my startup.
I had written some follow-on commands at the end of the article, but trimmed them to make it shorter. They were:
perf record -F 99 -a -g -- sleep 10; perf report -n --stdio # and flamegraphs
execsnoop # from perf-tools
iosnoop # from perf-tools
tcpretrans # from perf-tools
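The flame graph step uses the scripts from my FlameGraph repo (https://github.com/brendangregg/FlameGraph); with those on your PATH, it's roughly:

    perf script | stackcollapse-perf.pl > out.folded   # fold the recorded stacks
    flamegraph.pl out.folded > flame.svg               # render an interactive SVG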
* `iftop` that allows me to quickly check which network streams are hogging the machine (this is a little like sar -n DEV 1 but much more detailed!)
* `tcpdump -c 100 -vv` - poor man's alternative to iftop or systat if they're not available locally
* tail those logfiles; with systemd it's even easier, since journald lets you check all logfiles at once: `journalctl -xf -p notice | ccze -m ansi` (ccze colors the output, purely optional)
* probably more I'm forgetting now
The great takeaways in the article for me are:
* pidstat: great to have an average, beats top for sure
* back to basics: hit dmesg, vmstat and iostat first, as you learned in the beginning! :)
dstat I already knew about; somehow I forgot it in the listing. Definitely a must as well.
edit: old version is (PDF) http://www.brendangregg.com/Articles/Netflix_Linux_Perf_Anal...
I've had a couple of programs over the years where I wondered if we were just hitting memory bandwidth limitations, but I couldn't find any way to prove that, or even particularly gather evidence.
perf can record hardware counter events - use "perf list" to see what's available - including counters for frontend and backend stalls (stalls in the instruction-decode and execution stages, respectively). A high stall ratio / low instructions-per-cycle throughput is an indicator that you're running into memory bandwidth limitations.
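A quick sketch (event names vary by CPU and kernel; check perf list for what your box actually exposes):

    perf list | grep -i stall    # discover the stall events available here
    perf stat -a -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -- sleep 10
    # perf prints insn per cycle and the % of cycles stalled; heavy backend
    # stalls combined with low IPC point at the memory subsystem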
They are also a PITA to work with.
As another person said, it shows up as CPU utilization. So I check that along with IPC (instructions per cycle), and if IPC is low (what counts as "low" depends, but say < 1.0), then that's a good clue you're blocked on memory.
... but of course, I want actual throughput (usage), bandwidth (maximum), and utilization (ratio), which is more digging with the PMCs.
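On some Intel systems the memory controller's uncore counters give that usage directly; whether these exact event names exist depends on the CPU model and kernel version:

    perf stat -a -e uncore_imc/data_reads/,uncore_imc/data_writes/ -- sleep 10
    # counts are reported in MiB; divide by the 10s interval for MB/s and
    # compare against the platform's rated memory bandwidth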
After that, I compiled and ran a tiny executable called Stream and got the numbers I needed in order to explain why one machine was twice as slow as another.
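In case it saves anyone a search: that's John McCalpin's STREAM benchmark. A typical build and run looks something like this (the array size here is my choice; pick one several times larger than your last-level cache):

    gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
    ./stream    # compare the Copy/Scale/Add/Triad MB/s figures between machines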
top has a batch mode: "top -b" will print all processes periodically to the terminal, without starting the curses-style UI, providing the same "rolling summary".
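For example, three snapshots five seconds apart, captured to a file (standard procps flags):

    top -b -d 5 -n 3 > top.log    # -b batch mode, -d delay between updates, -n iterations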
Best advice given in the whole article. Many times I go to check something where someone "can't figure out what is wrong." dmesg | tail, and then the swearing begins on their end.
It's amazing how much time one useful error message can save.
Also, I'm a little disappointed that the author didn't drill down into what the actual problem was in his example. That java process is of course suspicious, but that might just be the video server and expected behavior.
And a likely source of misgivings regarding recent developments in the ecosystem...
One caveat: the times printed by dmesg -T will be incorrect if the system has suspended to RAM or disk.
Check dmesg(1) for more info! :D
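For reference, the two variants side by side:

    dmesg | tail       # last kernel messages, raw seconds-since-boot timestamps
    dmesg -T | tail    # human-readable times (may drift after suspend, as noted)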
For a massive server the CPU reading might not be unusual. Maybe it has 32+ CPU cores and a multi-threaded java app is spinning most of them.
Also remember that on a heavily loaded system the task that is reading CPU use is itself competing for time. Timing issues and other monitoring vagaries can make such readings noticeably imprecise (though for CPU time, usually in the downward direction, by missing tasks that started, worked, and ended between readings).
The 233GB of memory seems high, but there are possible explanations. A server with 32+ cores is not unlikely to have a lot of RAM too; the motherboard in my home server supports up to 128GB, so perhaps that large java process genuinely does use more than 200GB. Also, all that memory might not really be in use: it could have been allocated but never accessed, so it isn't yet holding pages in RAM or swap for all of it.
The RES column shows how much memory is resident (i.e., currently backed by physical memory), and that is a much more reasonable 12GB.
> For a massive server the CPU reading might not be unusual. Maybe it has 32+ CPU cores and a multi-threaded java app is spinning most of them.
The %CPU column is the total across all CPUs; 1591% shows that the java process is consuming almost 16 CPUs.
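To see which threads inside that process are actually burning the CPUs, top can list them individually (standard procps flags; <java_pid> is whatever PID top showed):

    top -H -p <java_pid>    # -H shows threads; %CPU is then per thread, so ~16
                            # threads near 100% explains a 1591% process total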
pip install glances.
I like how there are configurable alerts for each datapoint, that's a start for analyzing what's wrong.
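Getting started really is just this (the alert thresholds live in glances.conf; the exact keys depend on your version):

    pip install glances
    glances    # curses UI; warning/critical levels per metric are configurable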
I'd love to be able to identify the root cause of the instance locking up. It seems like this article is more about the commands you'd run to assess the health of an active/working EC2 instance and not one that you're unable to SSH into. Any idea on how to identify the problem with my EC2 instance?
For things like databases, you can gather more specific information as well, for example the number of queries and connections, failed transactions, and the like.
If you have a hunch what it might be, you could try to counteract it, for example by limiting memory or the number of processes, or by using cgroups to limit IO rate (see for example http://unix.stackexchange.com/questions/48138/how-to-throttl...).
Also leave an ssh connection open with something like htop or atop running, and look at it once it freezes up.
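A minimal sketch of the cgroup IO throttling idea, using the v1 blkio controller (the 8:0 major:minor is /dev/sda; check ls -l /dev/ for yours, and mount paths vary by distro):

    mkdir /sys/fs/cgroup/blkio/throttled
    echo "8:0 10485760" > /sys/fs/cgroup/blkio/throttled/blkio.throttle.read_bps_device  # ~10MB/s read cap
    echo <pid> > /sys/fs/cgroup/blkio/throttled/tasks    # move the suspect process in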
> and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).
This isn't necessarily true. If there is any sort of credit scheduler interaction ( http://wiki.xen.org/wiki/Credit_Scheduler ) resulting in the CPU being throttled, it will show as steal. Steal actually just means the CPU was not in a runnable state, which can be caused by multiple things, but predominantly throttling by the CPU scheduler.
- system and stolen CPU, to show virtualization overhead and VMs' starvation for CPU (vmstat)
- interrupts and context switches, which might indicate that VMs are running non-optimized OSes or non-paravirtualized drivers (vmstat)
- resource-hogging VMs/containers (platform specific), e.g. for KVM: virsh vcpuinfo, virsh dominfo
- socket summary and dig from there: ss -s
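Concretely, that checklist maps to something like this (the virsh commands assume KVM/libvirt; <domain> is your guest's name):

    vmstat 1 5                 # "st" = steal, "sy" = system, "in"/"cs" = interrupts/context switches
    ss -s                      # socket summary
    virsh vcpuinfo <domain>    # per-vCPU state and pinning
    virsh dominfo <domain>     # overall domain resource info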
Just curious though, anyone know of a similar type of writeup or post regarding Windows?
I have an MSSQL box I want to kill with fire - but I would like to be able to measure its perf with as much insight as I might on a comparable Linux box.
What is the best way to measure disk IO perf on a windows box, specifically?
Start a trace in wprui, let it run for a minute, click Stop, then open up the generated ETL in WPA.
From there it's hard to tell you what to do in WPA. Documentation isn't the greatest, but this link is close to what you're trying to do: http://blogs.msdn.com/b/sql_pfe_blog/archive/2013/03/19/trou...
Drag the relevant graphs into your current view, and understand how the yellow bar aggregates data. MSSQL probably has event providers - you might want to look those up and add them to wprui when you generate your ETL.
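If you'd rather script it than click through wprui, the recorder also has a command line; GeneralProfile and DiskIO are built-in profiles as far as I know (check the wpr help output for the full list):

    wpr -start GeneralProfile -start DiskIO
    REM ...reproduce the slow query...
    wpr -stop C:\traces\mssql.etl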
Here is another quick performance view.
Install atop, a better top replacement, and watch for anything showing up in red.
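It reads well at a short interval:

    atop 2    # refresh every 2 seconds; saturated resources are highlighted in red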