
Saving 13M Computational Minutes per Day with Flame Graphs - mspier
http://techblog.netflix.com/2016/04/saving-13-million-computational-minutes.html
======
azinman2
What I think is interesting about this is that they weren't able to easily
measure or find these hotspots using existing tools -- they needed a
combination of visualization and data munging to do so.

Visualization is an often overlooked tool in CS -- for example, IDEs do little
to no visualization... only LightTable is starting to break out of the
traditional text document. It also shows that, depending on the problem,
visualization & data can be morphed and stretched to provide new insights
where others might have walked away.

So why isn't this something that's a part of job interviewing or a bigger part
of our normal toolbox as engineers?

~~~
sdesol
> Visualization is an often overlooked tool in CS

It's often overlooked because generating meaningful data that can provide
visual insight is usually very difficult. Right now I'm working on a blog
post that goes over how you can use motion bubble charts to track code
changes, and I use GitLab as an example. You can find a draft of the blog at:

[http://gitsense.github.io/blog/motion-bubble-charts.html](http://gitsense.github.io/blog/motion-bubble-charts.html)

Note the blog post is still in DRAFT state, so there are broken links,
grammatical errors, and whatnot.

Capturing meaningful data at the enterprise scale requires a lot of effort.
There is a reason why I ended up creating my own real-time process monitoring
system:

[http://gitsense.github.io/blog/realtime-process-monitoring.html](http://gitsense.github.io/blog/realtime-process-monitoring.html)

What I'm ultimately hoping to do with the metrics is create a new way to
visualize Git logs and improve how we approach complex code reviews and diffs.

------
alexc05
I saw a YouTube talk on this one ... I think it was this one:
[https://www.youtube.com/watch?v=O1YP8QP9gLA](https://www.youtube.com/watch?v=O1YP8QP9gLA)

Really great stuff. The spot where he gets to a pretty good description of how
he uses his flame graph is roughly here:
[https://youtu.be/O1YP8QP9gLA?t=611](https://youtu.be/O1YP8QP9gLA?t=611)

With respect to that blog post, the bit about the truncated towers is a bit
of a red herring if you're 100% new to flame graphs.

The real meaty bits are the _wide_ sections.

~~~
MikeTheJoker
Generally you're right that the wide sections are where you want to focus your
attention when looking for optimizations. The point I was trying to make in
the blog post is that we had to take the flame graph visualization a step
further to eliminate the noise obscuring a major hot spot. The large number of
broken stacks was one of the first hurdles we had to cross to improve the
clarity of the visualization.

BTW, this is a different flame graph and optimization than the one discussed
in the YouTube video. We use flame graphs extensively throughout Netflix.
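
For a rough idea of what that kind of cleanup looks like mechanically, here's
a minimal sketch that folds stacks clipped at perf's default 127-frame depth
limit into one bucket so they stop drowning out the rest of the graph (not
our actual pipeline; it assumes input in Brendan's collapsed stack format,
one "frame;frame;...;frame count" record per line):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;

    public class FoldTruncated {
        static final int FRAME_LIMIT = 127;  // perf's default max stack depth

        public static void main(String[] args) throws IOException {
            Map<String, Long> folded = new HashMap<>();
            for (String line : Files.readAllLines(Paths.get(args[0]))) {
                int space = line.lastIndexOf(' ');
                String stack = line.substring(0, space);
                long count = Long.parseLong(line.substring(space + 1));
                if (stack.split(";").length >= FRAME_LIMIT) {
                    stack = "[truncated]";  // group all clipped stacks together
                }
                folded.merge(stack, count, Long::sum);
            }
            folded.forEach((s, c) -> System.out.println(s + " " + c));
        }
    }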

~~~
m4dc4pXXX
Can you write up how you fixed the broken call stacks? I've used Brendan's
tools (with java-perf-map, also an awesome tool) to generate flame graphs for
Scala code and had no idea I could only see 127 frames.

~~~
brendangregg
We ultimately should be fixing this with BPF, which we'll certainly post
instructions for.

------
surrealvortex
I'm currently using flame graphs at work. If your application hasn't been
profiled recently, you'll usually get lots of improvement for very little
effort.

Some 15 minutes of work improved CPU usage of my team's biggest fleet by ~40%.
Considering we scaled up to 1500 c3.4xlarge hosts at peak in NA alone on that
fleet, those 15 minutes kinda made my month :)

One thing to note, once you've eliminated the easy pickings, is that as you
go higher up the call graph, the profiler visualization is often misleading.
There may be sections of code without safe points, and stuff that appears
wide on the flame graph may just be getting blamed for adjacent code that
doesn't have safe points.
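
Here's a minimal sketch of the effect, assuming classic HotSpot behavior
(counted int loops typically carry no safepoint poll, so safepoint-based
samplers can only observe them at method boundaries; all names here are made
up):

    public class SafepointBiasDemo {
        // HotSpot typically emits no safepoint poll inside a counted int loop,
        // so a safepoint-based sampler can only observe this method at its
        // boundaries and may attribute its time to whatever runs next.
        static long hotCountedLoop(int n) {
            long sum = 0;
            for (int i = 0; i < n; i++) {  // counted loop: no poll until it exits
                sum += i * 31L;
            }
            return sum;
        }

        // cheap neighbor that can end up "blamed" for the loop above
        static long cheapNeighbor() {
            return System.nanoTime();
        }

        public static void main(String[] args) {
            long total = 0;
            for (int i = 0; i < 10_000; i++) {
                total += hotCountedLoop(1_000_000);
                total += cheapNeighbor();
            }
            System.out.println(total);  // keep the work observable
        }
    }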

~~~
tracker1
Profiling in general is a really good thing when you're seeing odd
load/timing/performance issues... I once found a project that was storing its
configuration settings (loaded/cached from DB) in a really badly performing
way: an in-memory datatable with text queries instead of a hashtable (not my
design).

A single call wasn't so bad, but the lookup was happening many hundreds of
times per request, adding seconds to some requests. It's wild how much
difference a relatively small thing can make.
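
For illustration, a minimal sketch of that kind of fix (hypothetical names,
nothing from the actual project): the same lookup done by scanning rows with
string matching versus a hash index built once at load time:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ConfigLookup {
        // "in-memory datatable": {key, value} rows scanned with string matching
        static final List<String[]> ROWS = new ArrayList<>();
        static final Map<String, String> INDEX = new HashMap<>();

        // O(n) per call -- roughly what a text query against the table does
        static String slowGet(String key) {
            for (String[] row : ROWS) {
                if (row[0].equals(key)) return row[1];
            }
            return null;
        }

        // O(1) per call after the one-time index build
        static String fastGet(String key) {
            return INDEX.get(key);
        }

        public static void main(String[] args) {
            for (int i = 0; i < 5_000; i++) ROWS.add(new String[] { "key" + i, "v" + i });
            for (String[] row : ROWS) INDEX.put(row[0], row[1]);

            // hundreds of lookups per request, as described above
            long t0 = System.nanoTime();
            for (int i = 0; i < 500; i++) slowGet("key4999");
            long t1 = System.nanoTime();
            for (int i = 0; i < 500; i++) fastGet("key4999");
            long t2 = System.nanoTime();
            System.out.printf("scan: %d us, hash: %d us%n",
                    (t1 - t0) / 1_000, (t2 - t1) / 1_000);
        }
    }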

~~~
surrealvortex
That brings up another distinction -- profilers don't distinguish between a
method that takes very little time to run but is called very often and
another method that is pretty expensive but is not called very often.

Ultimately, we do care about the total time taken, but the approaches
necessary for the two cases above are very different. In many cases, the
method that is simply called very often will call for some type of caching
solution in the caller, while the more expensive method will require retooling
within the method itself.
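
For the first case, here's a minimal sketch of "caching in the caller" -- a
generic memoizer (the names and the stand-in method are hypothetical):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    public class Memo {
        // Wrap any pure single-argument method so repeat calls hit a cache.
        static <K, V> Function<K, V> memoize(Function<K, V> f) {
            Map<K, V> cache = new ConcurrentHashMap<>();
            return k -> cache.computeIfAbsent(k, f);
        }

        public static void main(String[] args) {
            // stand-in for the cheap-but-hot method discussed above
            Function<String, Integer> lookup = memoize(String::length);
            System.out.println(lookup.apply("config.key"));  // computed
            System.out.println(lookup.apply("config.key"));  // served from cache
        }
    }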

------
asragab
"Middle-Out" approach...wonder where they got that from?

~~~
mrgriscom
That's when I checked to make sure it wasn't an April Fools joke.

------
f_
Very interesting indeed; but somehow I was even more baffled by the package
names they seem to be using:

    
    
    com.netflix.vulturemonkey.cow.iguana.MacawSquirrel
    com.netflix.ape.serpent.vulture.ApeVultureMantis
    com.netflix.iguanas.monkey.insect.IguanaRabbit
    

Any idea what's up with that?

~~~
mspier
We really love animals! :-) JK. Just obfuscating class names with animal names
before publishing the blog post.

~~~
f_
Hehe! Thanks for the clarification -- I wondered if everyone was going bananas
over at your company! This explains it (:

~~~
mspier
Still better than some unpronounceable Old Norse names we had on a few
projects. :-)

------
Illniyar
Wouldn't such an insanely big call stack be a performance issue in itself?

~~~
geodel
These call stacks are normal for a typical Java enterprise application.

~~~
topspin
Indeed, this is normal. Apache Camel produced such huge stack traces that
they refactored the routing system specifically to reduce AsyncCallback usage
and shorten stack traces; at one time Camel would dump traces thousands of
lines long. However, pointing this out doesn't actually address the question:
is there a performance issue indicated by these huge call stacks?

I've wondered about the question myself when encountering incredibly long
stack traces while troubleshooting Java systems. I've also wondered if there
is some more general dysfunction indicated. I've seen impressive stack traces
in C and C++, but nothing quite like what I've found in Java. What is the
experience of C# programmers?
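
One cost that's easy to demonstrate (a micro-benchmark sketch of my own, not
something from the article): filling in an exception's stack trace scales
with the current stack depth, so deep stacks make every traced exception more
expensive:

    public class DeepStackCost {
        static volatile int sink;

        // Average cost of creating a Throwable (and materializing its trace)
        // at the current stack depth.
        static long timeFill() {
            long t0 = System.nanoTime();
            for (int i = 0; i < 1_000; i++) {
                sink = new Throwable().getStackTrace().length;
            }
            return (System.nanoTime() - t0) / 1_000;  // ns per Throwable
        }

        static long recurse(int depth) {
            return depth == 0 ? timeFill() : recurse(depth - 1);
        }

        public static void main(String[] args) {
            System.out.println("shallow: " + recurse(10) + " ns/Throwable");
            System.out.println("deep:    " + recurse(2_000) + " ns/Throwable");
        }
    }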

------
chadlavi
just think of the millions of dollars they could save if they stopped doing
double spaces after sentences

