
Introducing Atlas: Netflix's Primary Telemetry Platform - trickz
http://techblog.netflix.com/2014/12/introducing-atlas-netflixs-primary.html
======
brendangregg
I work at Netflix and use Atlas every day. It's our go-to performance
monitoring tool, and has solved countless performance and reliability issues.
It's exciting to have it open source!

I summarized it in a talk recently at Surge 2014, where I showed its role in
a performance investigation, and how it is central to everything:

[http://youtu.be/H-E0MQTID0g?t=22m](http://youtu.be/H-E0MQTID0g?t=22m)
[http://www.slideshare.net/brendangregg/netflix-from-
clouds-t...](http://www.slideshare.net/brendangregg/netflix-from-clouds-to-
roots/33)

There are more features to Atlas than I mentioned; check out the Overview on
GitHub linked in the blog post.

~~~
crescentfresh
What are the data sources for Atlas vs. Suro at Netflix?

(Suro: [http://techblog.netflix.com/2013/12/announcing-suro-
backbone...](http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-
netflixs.html)). Suro was/is used to collect "more than 1.5 million events per
second during peak hours, or around 80 billion events per day" from ec2
instances.

~~~
copperlight
There are several different data sources for Atlas:

* There is a poller cluster that gathers SNMP and HTTP healthcheck metrics and forwards them to the Atlas backend.

* There are on-instance log parsers written in Perl and Python that count events in Apache HTTPd and Tomcat logs and send data to the Atlas backend.

* The Servo library [0] is used to instrument Java code with counters, timers, and gauges. There is a separate client implementation that handles forwarding metrics to the Atlas backend. The client also polls and reports JMX metrics from the JVM that it runs inside. Spectator [1] is a new library that provides cleaner abstractions of the Servo concepts (see the sketch after the links below).

* The Prana sidecar [2] was extended to provide REST endpoints for Servo and the client, so that metrics can be delivered from non-Java code.

[0] [https://github.com/Netflix/servo](https://github.com/Netflix/servo)

[1] [https://github.com/Netflix/spectator](https://github.com/Netflix/spectator)

[2] [https://github.com/Netflix/prana](https://github.com/Netflix/prana)
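
To make the Servo/Spectator item above concrete, here is a minimal sketch using the Spectator API. The metric and tag names are made up for illustration, and a plain DefaultRegistry only records in-process; in a real deployment the separate Atlas client would handle forwarding to the backend, which isn't shown here.

    import com.netflix.spectator.api.Counter;
    import com.netflix.spectator.api.DefaultRegistry;
    import com.netflix.spectator.api.Registry;
    import com.netflix.spectator.api.Timer;

    public class RequestHandler {
        private final Counter requests;
        private final Timer latency;

        public RequestHandler(Registry registry) {
            // Tags add dimensions, so Atlas queries can group and
            // filter by "status".
            this.requests = registry.counter("server.requestCount", "status", "2xx");
            this.latency = registry.timer("server.requestLatency");
        }

        public void handle() {
            requests.increment();
            latency.record(() -> {
                // ... actual request handling would go here ...
            });
        }

        public static void main(String[] args) {
            new RequestHandler(new DefaultRegistry()).handle();
        }
    }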

~~~
lifeisstillgood
What kind of ratio of metadata traffic (telemetry) to total traffic did you
see? How does this divide between "system level" and "application level"?

My client is looking at these telemetry problems now; is there any possibility
of commercial high-level consultancy coming out of Netflix / colleagues? Ping
me via the details in my profile if you can help?

~~~
copperlight
Telemetry traffic is a small fraction of the total traffic running through
a region, partially due to the use of the Smile data format (binary JSON) for
delivering metrics from the client to the Atlas backend.
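
For a sense of why the format matters: Smile is a binary encoding of the JSON data model, and Jackson (whose project defined the format) ships a module for it. A minimal size comparison, with a made-up metric payload:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.dataformat.smile.SmileFactory;

    import java.util.Map;

    public class SmileDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical metric payload, just for the comparison.
            Map<String, Object> metric = Map.of(
                "name", "server.requestCount",
                "nf.cluster", "api-prod",
                "value", 42.0);

            ObjectMapper json = new ObjectMapper();
            ObjectMapper smile = new ObjectMapper(new SmileFactory());

            // Same data model, different encodings; Smile is smaller and
            // cheaper to parse, which adds up at millions of datapoints
            // per second.
            System.out.println("JSON:  " + json.writeValueAsBytes(metric).length + " bytes");
            System.out.println("Smile: " + smile.writeValueAsBytes(metric).length + " bytes");
        }
    }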

When you give developers tools for creating and aggregating highly dimensional
metrics, they tend to create lots of metrics so that they can answer
interesting business questions about the use of their applications. We have
some developers who have written code that produces up to 150,000 metrics per
instance, and the vast majority of these metrics are application-level.
Typically only 3-5% of the metrics delivered from an instance are system-level
performance metrics.

~~~
CrankyFool
And this, of course, doesn't account for cases where a minor developer error
results in code that, say, creates a new metric for every source IP address
from which we see a request. Dynamic metric names FTW.
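
The failure mode is easy to reproduce. A hypothetical sketch of the bug, and the bounded alternative, using the Spectator API from the earlier example:

    import com.netflix.spectator.api.DefaultRegistry;
    import com.netflix.spectator.api.Registry;

    public class CardinalityPitfall {
        public static void main(String[] args) {
            Registry registry = new DefaultRegistry();
            String sourceIp = "10.1.2.3"; // unbounded value from the request

            // The bug: the IP is baked into the metric name, so every
            // distinct client mints a brand-new time series.
            registry.counter("requests.from." + sourceIp).increment();

            // Bounded alternative: fixed name, tag values drawn from a
            // small, known set.
            registry.counter("requests", "clientType", "mobile").increment();
        }
    }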

~~~
lifeisstillgood
Thank you both for your insights

------
tzakrajs
One of the most useful applications of Atlas (while working on Netflix
Reliability) was creating alert conditions that would scale dynamically with
changes in volume via the use of double-exponential smoothing (DES). It was
very easy to create alerts that compared and combined multiple signals using
the features in Atlas Stack Language. I am so excited to see it finally open-
sourced!
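
For readers who haven't met DES: it maintains a smoothed level plus a trend estimate, so the predicted baseline follows gradual changes in volume while a sudden divergence stands out. A minimal sketch of the idea; the alpha/beta values here are illustrative, not Atlas's defaults, and the real :des operator is documented in the Atlas Stack Language reference.

    public class Des {
        private final double alpha; // weight on the newest observation
        private final double beta;  // weight on the newest trend estimate
        private double level = Double.NaN;
        private double trend = 0.0;

        Des(double alpha, double beta) {
            this.alpha = alpha;
            this.beta = beta;
        }

        /** Returns the prediction for x, then folds x into the state. */
        double update(double x) {
            if (Double.isNaN(level)) {
                level = x;
                return x;
            }
            double prediction = level + trend;
            double prevLevel = level;
            level = alpha * x + (1 - alpha) * (level + trend);
            trend = beta * (level - prevLevel) + (1 - beta) * trend;
            return prediction;
        }

        public static void main(String[] args) {
            Des des = new Des(0.1, 0.02);
            double[] signal = {100, 110, 120, 135, 150, 170, 400}; // final point is a spike
            for (double x : signal) {
                // An alert fires when the observed value diverges far
                // from the smoothed prediction.
                System.out.printf("observed=%.0f predicted=%.1f%n", x, des.update(x));
            }
        }
    }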

------
dkhenry
Sounds like a great platform. I wonder how many instances of everything they
need for peak time. I have a system that is rated for 5M time-series data
points per second, and it takes about 150 physical servers, so I am curious
what a Netflix-sized system (20M time-series data points per second) would
look like.

~~~
Thaxll
In the video they say that they're close to 1B/min.
[http://youtu.be/tHrT6kQR7vw?t=36m](http://youtu.be/tHrT6kQR7vw?t=36m)
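(That's roughly 16-17M data points per second, so the 20M/s estimate above is
in the right ballpark.)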

~~~
CrankyFool
Pshaw. That was last year :)

(The presenter in the video)

------
malingo
I really enjoy learning about real-world actor-model systems like this. Having
access to the source code is even better.

Are there other similar examples?

~~~
malingo
Can't edit my earlier post so I'll reply.

I found this page on other companies & projects using Akka:
[http://doc.akka.io/docs/akka/2.0.1/additional/companies-
usin...](http://doc.akka.io/docs/akka/2.0.1/additional/companies-using-
akka.html)

I just learned about Akka recently (by way of Akka.net), and I'm encouraged by
these frameworks that look like legitimate alternatives to Erlang. Erlang has
always seemed like an extreme measure for implementing a distributed
concurrent system, so it's good to know there are Java/Scala/C#/F# options.

------
skinofstars
Wait, didn't Hashicorp release Atlas a couple of days ago?

~~~
ShonM
That's why I've taken to calling out full names now, like "Hashicorp's Atlas",
because yes, it's a mess.

~~~
CrankyFool
Really, we should have opensourced it as com.netflix.atlas :)

-roy rapoport

------
Rapzid
Very interesting... I'm guessing you can do less than a minute of granularity?
That's a lifetime to me.

I'm working on my own little system based on Riak for storage, using their new
search functionality to index the streams (what I call a unique metric). I
have a goal of Graphite API compatibility, though, so it can be a drop-in
replacement for me. I've worked out a schema for mapping Graphite metric names
to the stream dimensions, written a Graphite function parser/lexer, etc. Atlas
should definitely be worth a look for some fresh ideas :)

~~~
CrankyFool
We CAN do <1m, but very rarely do (at least in terms of persisting and showing
it). We have a feature called 'Critical Metrics' that is a separate publishing
pipeline into Atlas that is shorter, simpler, hardier, and supports 10s
granularity, though only for a trivial minority of metrics -- our current
limit is on the order of 400K metrics per cluster, IIRC, which means that if
you've got, say, 1,000 nodes, we're going to limit you to 400 metrics per node
flowing through the Critical Metrics pipeline.

(We haven't opensourced much of the pipeline ... yet)

------
paulasmuth
Sounds similar to FnordMetric
([http://fnordmetric.io/chartsql](http://fnordmetric.io/chartsql)), which also
supports dimensional timeseries data. Major differences between Atlas and
FnordMetric at first sight:

* SQL-based query and charting frontend (ChartSQL), so you don't have to learn yet another DSL

* ships with a wire-compatible StatsD API

* supports pluggable backends

* renders charts to SVG

* will probably not scale to petabytes of data out of the box

* single C++ binary, deploy it in 5 minutes using Docker

* includes an interactive query editor

~~~
copperlight
Interesting - I had not heard of FnordMetric before. There are not many
monitoring systems that implement dimensionality for metrics tagging. It looks
like FnordMetric goes about it slightly differently, but it seems to achieve
the same end goal of arbitrary grouping by like characteristics.

In the Atlas ecosystem, Servo and the Atlas client eliminate the need for
StatsD. The combination of these products allows for code-level
instrumentation and delivery of metrics. The Prana sidecar is available for
non-Java applications to deliver metrics to the Atlas backend as formatted
JSON objects.
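
Roughly, a publish payload looks like the following. This is a sketch based on
the open-source Atlas publish API; the exact field names may differ between
versions:

    {
      "tags": {
        "nf.app": "example",
        "nf.cluster": "example-prod"
      },
      "metrics": [
        {
          "tags": {"name": "server.requestCount"},
          "timestamp": 1418870400000,
          "value": 42.0
        }
      ]
    }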

Atlas supports multiple backend storage systems, although this is not easily
pluggable just yet. Earlier iterations of Atlas had support for MongoDB and
Cassandra as storage backends, but there were issues obtaining enough IOPS to
satisfy the read and write performance requirements at scale, so storage was
switched to in-memory.

Atlas can return data in JSON format (?format=json), which is suitable for JS-
or SVG-based rendering systems such as Highcharts. There is also a streaming
API that trades response time for increased data payloads.

It takes some time to learn the Atlas Stack Language, but the fact that it is
URL-based means that the browser is the interactive query editor. Using tools
like Chrome's Edit URL to help handle long strings, you can make small changes
to queries iteratively and see the results in less than 2 seconds in many
cases. Average PNG render time is typically less than 10 seconds; slow
rendering can take around 30 seconds.
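
As a hypothetical example against a local instance (sps is the metric name
used in the Atlas demo docs):

    http://localhost:7101/api/v1/graph?q=name,sps,:eq,:sum&format=json

Tweak the stack expression in the URL, reload, and that's the iteration loop.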

------
lifeisstillgood
20 million different time series. I mean, that is a lot.

If you have, say, 20,000 servers running, that is still 1,000 different time
series per server. Memory, CPU, logins, logouts, customer selections; I
struggle to get to those numbers.

~~~
misframer
1,000 metrics per server is quite reasonable. I work for a performance
management company and we handle thousands of time series metrics per
monitored server at one-second resolution.

~~~
lifeisstillgood
What's the rough ratio for system-level, process-level, and app-level (i.e.
total MB, MB per process, and "a customer just signed up")?

How much traffic does that add up to? It seems like a lot.

~~~
CrankyFool
Above, copperlight quotes about 3%-5% of metrics being system-level. A pretty
small number would be process-level, I'm guessing, with the vast majority
being app-level.

At Netflix scale, the answer to pretty much any question is "a lot." :)

------
misframer
I see "anomaly detection" is listed under real-time analytics. Do they use
Holt-Winters for that or something else?

~~~
CrankyFool
"Anomaly detection" is one of those vague terms that can mean anything from
"it's gone above the pre-set limit, and that's anomalous" to "the system has
studied the signal to learn what the accepted limits should be, and it's
exceeded these limits." We mostly mean the latter for anomaly detection.

The Insight Engineering team at Netflix is largely composed of four kinds of
engineers: platform/back-end engineers, UI engineers, Site Reliability
Engineers, and Real-Time Analytics (RTA) engineers; it's the latter group of
engineers who are looking into ways to quickly (and efficiently) detect
anomalies in a truly absurd amount of data.

The RTA group is about 6 months old; I have high hopes that we'll see some
public presentations from them soon that will be helpful to other people
outside Netflix.

