
A Microscope on Microservices - trickz
http://techblog.netflix.com/2015/02/a-microscope-on-microservices.html
======
twic
I'm interested in the application of Little's law as a tool for distilling a
particular slice of performance down to a single number:

[http://en.wikipedia.org/wiki/Little%27s_law](http://en.wikipedia.org/wiki/Little%27s_law)

You have a request rate in requests per chronometer-second, and a response
time in stopwatch-seconds per request, and you multiply them to get a "demand"
figure in stopwatch-seconds per chronometer-second. It's sort of
dimensionless, and sort of not, because the seconds on either side of the
division operator are sort of orthogonal (very vaguely like how joules per
newton-metre is not dimensionless).
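As a concrete sketch of the multiplication described above (the rate and response time here are made-up numbers, not from the article):

```python
def demand(request_rate_per_s, mean_response_time_s):
    """Little's law: demand (offered load, or mean concurrency) is the
    arrival rate multiplied by the mean time each request spends in
    the system."""
    return request_rate_per_s * mean_response_time_s

# e.g. 200 requests per chronometer-second, 50 ms of stopwatch time each:
print(demand(200, 0.050))  # 10.0 stopwatch-seconds per chronometer-second
```

One way to read the result: on average, ten requests are in flight at any instant.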

How do i use a number like this? Does it make sense to compare the numbers
from two different instances of the same system? From instances of two
different systems? Should i worry if it goes up? If it goes down? What can i
do about it, either way? Is it meaningful to calculate it for component parts
of my system, and is there a way to critically relate the values in parts to
the whole? Is there a way to relate it to other quantities in my system?

~~~
cpwatson
From my experience, Little's law is typically used to quantify the number of
users in the system, a measure of concurrency. For the purpose of our tools we
leverage the calculation to provide insight into "offered load", the time
spent within the service for a given interval. We do have a challenge in that
many of our downstream dependencies are called concurrently. At the current
time this prevents us from easily decomposing the demand in a service cleanly
among its dependencies. Some of this has to do with our transaction tracing
framework and the granularity at which we require call behavior to be easily
time-ordered. We believe we can solve this over time with an improved
framework. In the case of Mogul we leverage the demand calculation to
understand who the largest contributor is, pointing us in the direction of
possible optimization. If we are using the utility to triage an issue we
typically find that an increase in the demand or offered load within the
problematic dependency tends to correlate with the demand of the calling
service. I think we are just at the beginning of leveraging this data in a
more effective manner, and getting away from having eyeballs look at a
dashboard is definitely a goal.
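A toy sketch of why concurrent downstream calls make the decomposition hard (rates and timings are invented for illustration): with sequential calls, per-dependency demands add up exactly to the service's demand, but with concurrent calls the wall-clock time is the maximum rather than the sum, so the per-dependency figures overcount.

```python
rate = 100.0             # hypothetical requests/second into the service
dep_times = [0.25, 0.5]  # hypothetical per-request time in each dependency (s)

# Sequential calls: service time is the sum of the dependency times,
# so per-dependency demands decompose the service demand cleanly.
seq_service_demand = rate * sum(dep_times)        # 75.0
seq_dep_demands = [rate * t for t in dep_times]   # [25.0, 50.0]
assert sum(seq_dep_demands) == seq_service_demand

# Concurrent calls: service time is the max of the dependency times,
# so summing per-dependency demands (75.0) overstates the service's
# own demand (50.0).
conc_service_demand = rate * max(dep_times)       # 50.0
print(seq_service_demand, conc_service_demand)
```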

------
adeptus
That CPU flame graph is pretty cool. Haven't seen it displayed like that
before. Shows process path/name contribution to volume of CPU spike, all in 1
graph. Neato.

~~~
brendangregg
Thanks, I summarized them at
[http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html](http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html).
They are really helpful for seeing the big picture of CPU usage, and for
quantifying the contribution of different code paths.

The flamegraph code is on github
([https://github.com/brendangregg/FlameGraph](https://github.com/brendangregg/FlameGraph)).
There are other implementations too (see
[http://www.brendangregg.com/flamegraphs.html#Updates](http://www.brendangregg.com/flamegraphs.html#Updates)).

We're using them primarily to analyze CPU usage of the Linux and FreeBSD
kernels, Java, and Node.js. We had an earlier post about the Node.js ones:
[http://techblog.netflix.com/2014/11/nodejs-in-
flames.html](http://techblog.netflix.com/2014/11/nodejs-in-flames.html)
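The FlameGraph scripts linked above consume "folded" stacks: one line per unique stack, frames joined by semicolons, followed by a space and a sample count. A few lines of code can do the same quantification a flame graph's widths show; the stacks below are toy data, not real profiler output:

```python
from collections import Counter

# Toy input in the folded-stack format (frame;frame;frame count):
folded = """\
java;start;run;compute 70
java;start;run;gc 20
node;main;parse 10"""

totals = Counter()
grand_total = 0
for line in folded.splitlines():
    stack, count = line.rsplit(" ", 1)
    root = stack.split(";")[0]   # attribute samples to the root frame
    totals[root] += int(count)
    grand_total += int(count)

for root, count in totals.most_common():
    print(f"{root}: {100 * count / grand_total:.0f}% of samples")
```

This is the same aggregation a flame graph does visually: the width of each frame is its share of the total samples.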

------
twic
> At Netflix we pioneer new cloud architectures and technologies to operate at
> massive scale - a scale which breaks most monitoring and analysis tools.

Do we have any idea how massive Netflix's scale is, in terms of end-user
requests per second, or some other metric?

And, probably more relevantly for me, how big a scale can one get to while
using most monitoring and analysis tools?

------
cpwatson
We don't share our actual requests-per-second numbers for the front door. We
have mentioned that we run tens of thousands of instances across three AWS
regions. Per the Atlas techblog post, these instances can generate in
aggregate upwards of 1.2B time series, which are exposed at the minute level.
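As a back-of-envelope check on what that figure implies (assuming one datapoint per series per minute, which is all the comment states):

```python
series = 1.2e9        # time series reported in aggregate
points_per_min = 1    # exposed at minute-level granularity
writes_per_sec = series * points_per_min / 60
print(f"{writes_per_sec:,.0f} datapoints/second")  # 20,000,000
```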

~~~
twic
Part of why i ask is to get an idea of what those tens of thousands of
instances are doing. How much of your leviathan scale is about the sheer mass
of requests, how much of it is about the depth and sophistication of what you
do to serve every request, and how much is about providing an environment
which supports deployment and operation of the code which serves those
requests?

I used carbon-relay in one job, and if you're using that, i'd guess you have
1000 machines serving users, and 30000 collecting metrics!

~~~
cpwatson
Our architecture and the sheer number of microservices contribute to much of
the scale. In order to achieve our engineering velocity and reliability goals,
we felt the explosion of instances with this architecture was worth it. If you
consider the number of microservice instances serving user traffic (and
include persistence tiers such as memcache and Cassandra), you would still be
in the tens of thousands.

