Linux Performance Analysis (netflix.com)
451 points by anand-s on Dec 1, 2015 | 82 comments

This is pretty cool because back in the dot com days I got paid huge amounts of money to do what they did in the first 60 seconds. I started at $1000/day and the phone rang off the hook. Talked to a friend and he said "double your rates" and I said "WTF? I'm not worth $2000/day, that's crazy". He said "double your rates". So I did, phone still rang like crazy. I got up to $4000/day and had 5 days/week work. Went to $8000/day and got about 3-4 more days and then it dried up.

And that is how I made payroll for BitKeeper in the early days.

I think it's cool that this stuff is far more common knowledge these days. All I was doing was figuring out if it was CPU, network, disks, file system, vm system, memory. 99% of the time I knew what the problem was in a few minutes; all the real effort was tracking down what program (or programs) were causing it. Funny enough, most people didn't care what the cause was, they just wanted to know what to buy to make it not an issue. Crazy amounts of money being thrown at Sun equipment.

Did the money dry up because you got too expensive or because the dot com bubble burst?

$8K / day in 1998 or 1999 was too much. I suspect I could have gone on until the bubble burst at $4-$6K.

I stopped because I had enough money for payroll and wanted to work on my startup.

Good to see people sharing extra commands!

I had written some follow on commands at the end of the article, but trimmed them to make it shorter. They were:

  # CPU
  perf record -F 99 -a -g -- sleep 10; perf report -n --stdio   # and flamegraphs
  execsnoop       # from perf-tools

  # Memory
  cat /proc/meminfo

  # Disk
  df -h
  iosnoop         # from perf-tools
  pidstat -d

  # Networking
  netstat -s
  tcpretrans      # from perf-tools
Where perf-tools is https://github.com/brendangregg/perf-tools.

that's a pretty awesome summary. to that i would add:

* `iftop` that allows me to quickly check which network streams are hogging the machine (this is a little like sar -n DEV 1 but much more detailed!)

* `tcpdump -c 100 -vv` - poor man's alternative to iftop or systat if they're not available locally

* tail those logfiles, with systemd it's even easier with journald as all logfiles can be checked at once: `journalctl -xf -p notice | ccze -m ansi` (ccze colors the output, purely optional)

* probably more i am forgetting now

the great takeaways in the article for me are:

* pidstat: great to have an average, beats top for sure

* back to basics: hit dmesg, vmstat and iostat first, as you learned in the beginning! :)

I'll add htop to your list. It's a little nicer than plain ole top.

And `glances` is pretty great for big-picture stuff, all-in-one. It's not as precise on CPU/memory as htop, but you get network bandwidth, disk I/O, disk usage and more.

No love for dstat? Glances looks cool though, thanks for that one.

glances (which i just discovered recently) is pretty awesome, definitely something i'll try out more next time i need something similar.

dstat i already knew about, somehow forgot it in the listing, definitely a must as well.

I'd also like to add atop. I love the interface, and it's usually the first command I launch.

The problem with (h)top is that it doesn't catch short-lived processes, and it displays only the most basic stats. For more detailed perf analysis you will likely be better off with something like atop.

htop is useful, but i find it doesn't add enough on top of top to justify changing my hand-wired reflexes to add that "h". it's often not installed everywhere either, so the top habit remains.

If you like iftop, iptraf is similar and goes deeper in the ability to dig into TCP details or confuse yourself with scads of BPF rules.

iptraf doesn't work on FreeBSD, unfortunately, but it is a great tool.

The link redirects to the blog frontpage for me, but https://media.netflix.com/en/tech-blog/linux-performance-ana... works.

Odd, that URL messes up the pre tags when I load it in Chrome or Firefox (desktop). http://techblog.netflix.com/2015/11/linux-performance-analys... looks right.

A blog redesign was launched today, which has messed up the pre tags. Hopefully fixed shortly...

edit: old version is (PDF) http://www.brendangregg.com/Articles/Netflix_Linux_Perf_Anal...

I think Netflix may have just rolled out a blog redesign. (The original link worked for me at first, but now redirects me to media.netflix.com.) Too bad they broke old links though. Hopefully that's temporary.

It seems that they also removed the RSS feed, which is a shame.

Can confirm that your link works, but the current link in the HN submission does not. MODS, CAN YOU FIX THIS? :)

The real gem for me in that article (particularly since I already knew about those linux commands) was the USE method [1] and then consequently finding the TSA method [2] on the same linked site.

[1]: http://www.brendangregg.com/usemethod.html

[2]: http://www.brendangregg.com/tsamethod.html

If you want more from Brendan Gregg on analyzing performance, be sure to check out his book, Systems Performance: http://smile.amazon.com/dp/0133390098

There was an interesting interview with him earlier in the year on Software Engineering Radio where he talked about his book: http://www.se-radio.net/2015/04/se-radio-episode-225-brendan...

Direct link to the blog post (the OP is redirecting to the list of blog posts for me):


Is there any way to analyze memory bandwidth usage?

I've had a couple of programs over the years that I wondered if we were just hitting memory bandwidth limitations but I couldn't find any way to prove that, or even particularly gather evidence.

It will show up like CPU usage - the hyperthread that is waiting for memory will appear busy executing.

perf can record hardware counter events - use "perf list" to see the list - including those that count frontend and backend stalls (stalls in the instruction decoding and execution stages respectively). A high stall ratio / low instructions per cycle throughput is an indicator that you're running into memory bandwidth limitations.

You've also reminded me that while they do show up as CPU time, they wouldn't be 'user' time (for example in the output of the command 'time')

If it's user-mode loads and stores, which for applications is pretty common, it is user time. Easy to test.

Thanks for that, I'll take another look at it and see what I've been missing then.

Yes, PMCs (CPU performance monitoring counters; also known by many other terms, such as PMU counters, PICs, CPCs, etc). In the past I've written tools that print usage of memory busses (really, CPU interconnect ports) via the PMCs.

They are also a PITA to work with.

As another person said, it shows up as CPU utilization. So I check that along with IPC (instructions per cycle), and if IPC is low (what "low" is depends, but say, < 1.0), then that's a good clue you're blocked on memory.

... but of course, I want actual throughput (usage), bandwidth (maximum), and utilization (ratio), which is more digging with the PMCs.
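
To make the IPC check and the throughput/bandwidth/utilization distinction above concrete, here is a minimal sketch that is not from the thread itself. It assumes you have already collected cycle and instruction counts, for example with `perf stat -e cycles,instructions -a -- sleep 10`. The function names `ipc`, `ipc_hint` and `utilization_pct` are made up for illustration, and the 1.0 cutoff is only the rough figure quoted above, not a hard rule.

```shell
# Hypothetical helpers; the names are illustrative and not part of perf.

# ipc CYCLES INSTRUCTIONS -> instructions per cycle, two decimals
ipc() {
  awk -v c="$1" -v i="$2" 'BEGIN { printf "%.2f\n", i / c }'
}

# ipc_hint CYCLES INSTRUCTIONS -> "low" if IPC < 1.0 (possible memory stalls), else "ok"
ipc_hint() {
  awk -v c="$1" -v i="$2" 'BEGIN { if (i / c < 1.0) print "low"; else print "ok" }'
}

# utilization_pct THROUGHPUT BANDWIDTH -> usage as a percentage of the maximum
utilization_pct() {
  awk -v t="$1" -v b="$2" 'BEGIN { printf "%.1f\n", 100 * t / b }'
}
```

A low IPC is only a clue. Getting real throughput and bandwidth numbers to feed into `utilization_pct` still takes the PMC digging described above, and both arguments must use the same unit.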

I've identified a memory bandwidth issue in the past by keeping an eye on truss/strace output and "counting" mem operations.

After that, I compiled and ran a tiny executable called Stream[1] and got the numbers I needed in order to explain why one machine was twice as slow as another.

[1] http://www.cs.virginia.edu/stream/

Using truss/strace to figure out memory bandwidth issues sounds pretty unreliable. I would not have guessed there was much correlation between memory syscalls that truss/strace can observe (mmap/munmap, brk), and the CPU load/stores that consume memory bandwidth.

> Pidstat is a little like top’s per-process summary, but prints a rolling summary instead of clearing the screen. This can be useful for watching patterns over time, and also recording what you saw (copy-n-paste) into a record of your investigation.

top has a batch mode: "top -b" will print all processes periodically to the terminal, without starting the curses-style UI, providing the same "rolling summary".

"pidstat 1" prints active processes, making it much less noisy than top -b, and therefore highlights malevolent processes. I think the top equivalent would be "top -b -i", though that has more output and is a bit noisier.

> Don’t miss this step! dmesg is always worth checking.

Best advice given in the whole article. Many times I go to check something where someone "can't figure out what is wrong." One dmesg | tail and then the swearing begins on their end.

Can you imagine a world where the Windows System Event Viewer was this useful?

It's amazing how much time one useful error message can save.

Also, I'm a little disappointed that the author didn't drill down into what the actual problem was in his example. That java process is of course suspicious, but that might just be the video server and expected behavior.

dmesg and being able to run most things in a terminal to check what it's doing (grumble, skype, grumble) is likely what has drawn many to Linux in the first place.

And a likely source of misgivings regarding recent developments in the ecosystem...

journalctl is a good tool. I really like SystemD but wish we still had plain text logs.

One improvement (which may or may not be related to systemd, but is in newer kernels) is that dmesg has timestamps enabled by default which makes it much easier and `dmesg -H` is really nice as well.

  dmesg -T
Has been available for years to print out human-readable timestamps. Check to see if dmesg is an alias to that on your system. Also, -IIRC- dmesg has printed system-uptime timestamps with every line since at least the early 2000's.

One caveat: the times printed by dmesg -T will be incorrect if the system has suspended to RAM or disk.

Check dmesg(1) for more info! :D

Hmm, that java process has 0.227 TB of mapped virtual memory and 3090% CPU..

You sound like you doubt the readings. This is netflix, I assume they have some pretty hefty kit backing up their service.

For a massive server the CPU reading might not be unusual. Maybe it has 32+ CPU cores and a multi-threaded java app is spinning most of them.

Also remember that on a heavily loaded system the task that is reading CPU use is itself competing for time. Timing issues and other monitoring vagaries can make such readings noticeably imprecise (though for CPU time usually in the downwards direction by missing tasks that started+worked+ended between readings).

The 233GB of memory seems high but there are possible explanations for this. A server with 32+ cores is not unlikely to have a lot of RAM too; the motherboard in my home server supports up to 128GB, so perhaps that large java process genuinely does use more than 200. Also all that memory might not really be in use: it could have been allocated but never accessed so it isn't yet holding pages in RAM or swap for all of it.

The 233GB is VIRT, meaning virtual memory. It need not all be backed by physical memory. For example, if you mmap a file, and then access only a small portion.

The RES column shows how much memory is resident (eg, currently backed by physical memory), and that is a much more reasonable 12GB.
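
For anyone who wants to see the two numbers side by side without top, here is a small sketch. It assumes a Linux `/proc/PID/status` file, where `VmSize` corresponds to what top shows as VIRT and `VmRSS` to RES. The function name `vm_summary` is made up for this example.

```shell
# vm_summary: reads /proc/PID/status-formatted text on stdin and prints the
# virtual size (top's VIRT) and resident set size (top's RES), both in kB.
# The function name is hypothetical, chosen for this example.
vm_summary() {
  awk '/^VmSize:/ { virt = $2 }
       /^VmRSS:/  { res = $2 }
       END { printf "virt_kb=%d res_kb=%d\n", virt, res }'
}

# Usage on a live Linux system (replace PID with a real process ID):
#   vm_summary < /proc/PID/status
```

A large gap between the two values is normal for processes that reserve address space or mmap files they only partly touch.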

The r3.8xlarge instance type has 32 vCPUs and 244GiB RAM

  For a massive server the CPU reading might not be unusual. Maybe it has 
  32+ CPU cores and a multi-threaded java app is spinning most of them.
Yes, the article states that

  The %CPU column is the total across all CPUs;
  1591% shows that that java processes is consuming almost 16 CPUs.
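
The arithmetic behind that quote is just a division by 100, since top reports 100% per fully busy CPU. A tiny sketch, with a made-up function name `cpus_busy`:

```shell
# cpus_busy PCT -> how many CPUs' worth of time a top %CPU value represents
# (hypothetical helper name; assumes %CPU is summed across all CPUs)
cpus_busy() {
  awk -v p="$1" 'BEGIN { printf "%.2f\n", p / 100 }'
}

cpus_busy 1591   # the figure quoted from the article
cpus_busy 3090   # the figure mentioned earlier in this thread
```

This prints 15.91 and 30.90, so the second reading would be close to saturating a 32-vCPU instance.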

Probably r3.8xlarge.


I'm guessing that the java process is their video server. It's probably having to DRM and bandwidth regulate hundreds of HD streams. Unfortunately, this might mean that all of the stuff he talked about in here didn't actually help him solve the problem because none of it had any visibility inside of their Java VM.

Some of their recommendation algorithms run on large boxes in java.

Note that's virtual memory. For determining real RAM usage for programs, I find https://github.com/pixelb/ps_mem extremely useful

and they run it as root

It could be running in a container?

I'm wondering if there's a way to automate this kind of analysis to give you condensed, interesting points of a system performance status. À la powertop for power management.

Collectd + Graphite / InfluxDB + Grafana. You shouldn't need to ssh into a host to see those metrics.

Stackdriver, boundary, signalFX

glances? https://pypi.python.org/pypi/Glances

pip install glances.

Agreed, considering they've built Vector (mentioned by bgregg in the second sentence - https://github.com/Netflix/vector) on top of Performance Co-Pilot. While PCP doesn't yet have all the wrappers to mimic each sysstat output in a fully compatible manner, the underlying mechanisms to remotely fetch that data (using the tools Vector is already built on) are already there.

glances. Mentioned elsewhere in the thread.

Very nice. https://github.com/nicolargo/glances/

I like how there are configurable alerts for each datapoint, that's a start for analyzing what's wrong.

There is - software like Splunk, SysTrack, etc. get towards that in various ways.

There is, look up Conky

Conky doesn't tell you what's wrong with your system (à la powertop 2.0). It just presents the information differently.

Conky is for your desktop. We are talking about a server, which you connect to via ssh.


maybe monit or something similar (inspeqtor etc)

I had one of my EC2 instances (Ubuntu 14.04) lock up on Monday morning. I use it to run a Ruby on Rails app and to do a handful of ETL jobs. The website wasn't able to load and I wasn't able to SSH into the box. So, I went into the AWS console and rebooted the instance. This has happened 3 times in the last 6 months. It looks like the CPU spikes up to almost 100% around the same time when the instance locks up.

I'd love to be able to identify the root cause of the instance locking up. It seems like this article is more about the commands you'd run to assess the health of an active/working EC2 instance and not one that you're unable to SSH into. Any idea on how to identify the problem with my EC2 instance?

I can recommend atop[1] for that. It runs every ten minutes by default and writes lots of information to /var/log/atop/atop_YYYYMMDD. With that you can examine what happened before the crash: just open a file with atop -r /var/log/atop/atop_YYYYMMDD.

[1] http://linux.die.net/man/1/atop
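
A small sketch of picking the right log file for a given day, assuming the /var/log/atop/atop_YYYYMMDD naming mentioned above (the path may differ on your distribution). The helper name `atop_log_path` is made up for this example.

```shell
# atop_log_path [YYYYMMDD] -> path of the atop log for that day (default: today)
# Hypothetical helper; assumes the default log location named in the comment above.
atop_log_path() {
  printf '/var/log/atop/atop_%s\n' "${1:-$(date +%Y%m%d)}"
}

# Usage: replay the log for the day of the lockup
#   atop -r "$(atop_log_path 20151130)"
```

Once the replay is open, pressing `t` should step forward through the recorded samples and `T` back, so you can walk up to the moment the box stopped responding.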

A nice atop tutorial: https://lwn.net/Articles/387202/

One thing we do at $work is collect each server's statistics with https://collectd.org/ and often the graphs are pretty revealing. Sometimes it's the disk activity that kills the server, sometimes it's the memory usage, or number of processes, or a disk running full.

For things like databases, you can gather more specific information as well, for example number of queries and connections, failed transactions and the likes.

If you have a hunch what it might be, you could try to counteract it, for example by limiting memory or the number of processes, or by using cgroups to limit IO rate (see for example http://unix.stackexchange.com/questions/48138/how-to-throttl...).

Also leave a ssh connection with something like htop or atop open, and look at it once it freezes up.

Check the log files that got written before you killed the instance?

Awesome article! One point of contention, though:

>and stolen time (by other guests, or with Xen, the guest's own isolated driver domain).

This isn't necessarily true. If there is any sort of credit scheduler interaction ( http://wiki.xen.org/wiki/Credit_Scheduler ) resulting in the CPU being throttled, it will show as steal. Steal actually just means the CPU was not in a runnable state, which can be caused by multiple things, but predominantly throttling by the CPU scheduler.
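
If you want to see the raw steal counter rather than the vmstat column, here is a minimal sketch. It assumes the Linux `/proc/stat` layout, where the aggregate `cpu` line lists user, nice, system, idle, iowait, irq, softirq and steal in that order. The function name `steal_pct` is made up, and the result is an average since boot, not a current rate.

```shell
# steal_pct: reads /proc/stat-formatted text on stdin and prints the share of
# CPU time since boot that was accounted as steal, as a percentage.
# Hypothetical helper; assumes the aggregate "cpu" line has at least 8 counters.
steal_pct() {
  awk '$1 == "cpu" {
         total = 0
         for (i = 2; i <= 9; i++) total += $i   # user through steal
         printf "%.2f\n", 100 * $9 / total
         exit
       }'
}

# Usage on a live Linux system:
#   steal_pct < /proc/stat
```

For a current figure you would sample twice and work from the differences, which is roughly what vmstat's `st` column gives you for each interval.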

Great list with "standard" tools that you'd find on most systems by default. From my experience for virtualized servers you might also look at:

- system and stolen cpu to show virtualization overhead and VM starvation for CPU (in vmstat)

- interrupts and context switches which might indicate that VMs might be running non-optimized OSs or non-paravirtualized drivers (vmstat)

- abusing VMs/Containers (platform specific), e.g. for KVM: virsh vcpuinfo, virsh dominfo

- socket summary and dig from there: ss -s

This is yet another awesome contribution from Netflix.

Just curious though, anyone know of a similar type of writeup or post regarding windows?

I have an MSSQL box I want to kill with fire - but I would like to be able to measure its perf with as much insight as I might on a comparable linux box.

What is the best way to measure disk IO perf on a windows box, specifically?

xperf/wpr. Launch wprui, select First Level Triage and Disk I/O and File I/O, make sure logging mode is set to Memory, then click Start.

Let it run for a minute, click Stop, then open up the generated ETL in WPA.

From there it's hard to tell you what to do in WPA. Documentation isn't the greatest, but this link is close to what you're trying to do: http://blogs.msdn.com/b/sql_pfe_blog/archive/2013/03/19/trou...

Drag in the relevant graphs into your current view, and understand how the yellow bar aggregates data. MSSQL probably has event providers - you might want to look those up and add them to wprui when you generate your ETL.

Thank you!!

If you have any questions, send me an email and I can help. See profile for email.

Seems like the link opens a different page from the original? Original link: https://media.netflix.com/en/tech-blog/linux-performance-ana...

Do you use an application monitoring solution in production? An APM records all such data and you can drill down (starting from pretty graphs down to call stacks and long lasting SQL snips).

Good page.

Here is another quick performance view: install atop, a better top replacement, and watch for anything showing up in red.

Anybody got a different source? Apparently I'm not allowed to watch netflix at work (or view their blog).

My go-tos are htop, and for network: nload.
