
Loupe: Etsy's New Monitoring Stack - theyCallMeSwift
http://codeascraft.com/2013/06/11/introducing-loupe/
======
noelwelsh
Interesting stuff! I've actually been working on the same idea recently,
starting with reading about anomaly detection. In particular, this survey:
[http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php](http://www-users.cs.umn.edu/~kumar/papers/anomaly-survey.php)

I would like to know more about the performance of Skyline in practice:

- what are the accuracy and recall like?

- what is CPU consumption like?

Regarding the latter, I had a quick look at the implemented algorithms and
they seemed very inefficient - basically recomputing over the entire series at
every change. I think with a bit of work most of the algorithms could be
reimplemented incrementally. I also wouldn't use Python for something that is
going to be CPU bound. (I await the "We rewrote it in Go and it's 10x faster!"
blog post ;-)

~~~
aba_sababa
Author here. Accuracy is okay - we err on the side of noise, but it does
routinely pick up anomalies. It doesn't currently account for seasonal trends,
though.

We aim for 100% CPU consumption. Analyzing is a very CPU-intensive process,
and two parts in particular are expensive: decoding the Redis string from
MessagePack to Python, and running the algorithms.

As for the algorithm inefficiencies, pull requests encouraged :)

Rewriting it in Go is a plan for a rainy weekend :) The problem with Go is
that its statistics support isn't nearly as good as Python's.

~~~
noelwelsh
Thanks for replying! A little while ago I watched part of your Bacon Conf talk
([http://devslovebacon.com/conferences/bacon-2013/talks/bring-the-noise-continuously-deploying-under-a-hailstorm-of-metrics](http://devslovebacon.com/conferences/bacon-2013/talks/bring-the-noise-continuously-deploying-under-a-hailstorm-of-metrics))
and read the slides.

There were a few things I thought were a bit odd about the architecture, as I
recall it.

IIRC you poll Graphite for metrics. Why not push them from StatsD directly
into Skyline? That would probably be more efficient. If you used incremental /
online / streaming algorithms, you'd have a compact summary at each time step,
so you could throw away the raw data. 250K metrics would fit in memory quite
easily (we're just talking approximately a number and a string each, right?),
and you have 4000+ cycles per second to process them, which should be
sufficient.
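To make the incremental/streaming idea concrete, here is a minimal sketch (my own illustration, not Skyline's code) of keeping a running mean and variance per metric with Welford's algorithm, so each new datapoint is checked in O(1) instead of recomputing over the whole series:

```python
# Streaming anomaly check: maintain running mean/variance per metric
# (Welford's algorithm) so each update is O(1) and raw history can be
# discarded. Illustrative sketch only.

class OnlineStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        return (self.m2 / (self.n - 1)) ** 0.5 if self.n > 1 else 0.0

    def is_anomaly(self, x, sigmas=3.0):
        # Flag points more than `sigmas` standard deviations from the mean.
        return self.n > 1 and abs(x - self.mean) > sigmas * self.stddev


stats = OnlineStats()
for v in [10, 11, 9, 10, 12, 10, 11]:
    stats.update(v)
print(stats.is_anomaly(100))  # True: a wild spike is flagged
print(stats.is_anomaly(10))   # False: a normal value is not
```

The summary per metric is just three numbers, which is why 250K metrics fit in memory easily under this scheme.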

Python's lack of good threading would possibly be a problem. I would use the
JVM (Scala in my case). Apache Commons Math is pretty good
([http://commons.apache.org/proper/commons-math/](http://commons.apache.org/proper/commons-math/)).
Java's verbose interfaces are a bit annoying, but the JVM is damn efficient,
and you can wrap the cruft in something more aesthetic. It's a solid choice no
matter what the hipsters say. ;-)

~~~
aba_sababa
Ah! We could, but StatsD provides support for more complex metrics like
aggregated sums over time. Not something that lends itself easily to a
discrete datapoint.

It is. But we use multiprocessing, which is basically the same API. Still, you
can't beat the awesome Python stats libraries: Numpy, SciPy, Statsmodels,
Pandas..
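For readers unfamiliar with the trade-off being discussed: `multiprocessing` sidesteps the GIL for CPU-bound work by using processes instead of threads, with a very similar API. A hypothetical sketch (not Etsy's actual code) of fanning analysis out over metrics:

```python
# Illustrative only: CPU-bound per-metric analysis distributed across
# worker processes with multiprocessing.Pool, which mirrors the
# threading-style API while avoiding the GIL.
from multiprocessing import Pool

def analyze(series):
    # Stand-in for a CPU-bound check over one metric's timeseries:
    # here, the largest absolute deviation from the mean.
    mean = sum(series) / len(series)
    return max(abs(x - mean) for x in series)

if __name__ == "__main__":
    metrics = [[1, 2, 3], [10, 10, 50], [5, 5, 5]]
    with Pool(processes=4) as pool:
        # One metric's series per worker task.
        results = pool.map(analyze, metrics)
    print(results)
```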

------
tel
DTW is quadratic. While in grad school I worked for a bit with a team
interested in doing massive speech recognition using DTW, so they did some
work speeding up the algorithm using a technique called Locality Sensitive
Hashing [1], [2], [3]. It might be worth a look as a way of speeding up your
algorithms.

[1] [http://www.academia.edu/2600658/Indexing_Raw_Acoustic_Features_for_Scalable_Zero_Resource_Search](http://www.academia.edu/2600658/Indexing_Raw_Acoustic_Features_for_Scalable_Zero_Resource_Search)

[2] [http://old-site.clsp.jhu.edu/~ajansen/papers/IS2012a.pdf](http://old-site.clsp.jhu.edu/~ajansen/papers/IS2012a.pdf)

[3] [http://www.cs.jhu.edu/~vandurme/papers/JansenVanDurmeASRU11.pdf](http://www.cs.jhu.edu/~vandurme/papers/JansenVanDurmeASRU11.pdf)
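To see where the quadratic cost comes from, here is textbook DTW (an illustration, not the code from the papers above): the nested loops over both series give O(len(a) * len(b)), which is exactly what LSH-based approaches try to avoid.

```python
# Textbook dynamic time warping. The nested loops fill an
# (n+1) x (m+1) cost table, hence the quadratic running time.
def dtw(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0: same shape, different timing
```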

------
jqueryin
I was very intrigued until I poked around the github repos and noticed the
server specs Etsy was using to analyze the 250k metrics.

Oculus recommended setup (found at
[https://github.com/etsy/oculus](https://github.com/etsy/oculus)):

    
    
      * ElasticSearch
        * At least 8GB RAM
        * Quad Core Xeon 5620 CPU or comparable
        * 1GB disk space
        * Two ElasticSearch servers in separate clusters
      * A cluster of worker boxes running Resque
        * Worker master runs redis
        * Additional resque worker boxes (and potentially slaves)
        * At least 12GB RAM
        * Quad Core Xeon 5620 CPU or comparable
        * 1GB disk space
    

It'd be nice if there was a more established baseline set of server specs to
get up and running. While many of us aspire to be at Etsy level monitoring,
we're just not there.

~~~
jonlives
I'll definitely have a look at doing that - the initial specs were designed
around the metric volumes we use the tools for, but I realise that might not
be practical for smaller workloads :)

------
laichzeit0
I'm not really convinced that this is very useful. I've been in the
application monitoring space for a few years now and I'm not sure that
watching graphs is something Ops people should be doing.

There should be rules which notify them when something is anomalous - by
email, SMS, or by logging a problem in an incident management tool. E.g.
"Java request foo.bar() on Managed Server 1 is throwing exceptions for 50% of
invocations (20 requests, 10 exceptions) in the last 10 minutes. This affects
the following services: Customer Login page on foo.bar." - possibly even
attaching some of the exception messages to the email, sampled through
instrumentation or correlated back to the log files automatically.

This type of monitoring is actually useful because Ops understand what is
broken and what it affects, and it gives them enough detail to either fix it
or pass the problem to someone else; they're not wasting their time looking at
graphs waiting for a problem to appear.

~~~
thibaut_barrere
The thing is that beyond a certain scale (and it arrives early, actually),
pushes do not scale.

I still use pushes for clear-cut things that require paging, but having
graphs of a lot of things and just noticing changes or anomalies in the
overall patterns will help spot a lot of issues, including things you haven't
yet planned paging for :-)

~~~
sokoloff
Totally agree. We have a mix of algorithmic/automated monitors and visual
monitors. I can tell you a lot more about how healthy our site is from looking
at two screens in our NOC than I can from all the pages sent over the last
<pick your time range>.

Computers are great at executing repetitive, specified tasks. Use them for
that.

Humans are great at pattern recognition and flexibly adapting. Use them for
that, IMO.

------
jlgreco
Very nifty. The automatic selection is a great innovation.

I built a somewhat similar system a while ago on top of statsd/graphite. Mine
was not designed for production deployment though, just as a test platform (I
was basically using graphite to store and query metric data - not optimal, but
that problem was out of scope and it was easy to abuse like that). The tool
allowed a user to manually select a set of metrics and create fault
classifiers from those metrics.

These classifiers were able not only to detect the presence of faults but also
to classify what type of faults they were, provided sufficient training data.
Of course, you could train new classifiers with data collected in production,
so training new classifiers becomes an ongoing activity. We were only testing
geometric classification, but using any sort of classifier to identify complex
fault types seems to be an idea with promise.
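As a toy sketch of what "geometric classification" of fault types can look like (my own illustration, not the system described above, and the fault labels and feature names are made up): each fault type is summarized by the centroid of its training vectors, and a new observation gets the label of the nearest centroid.

```python
# Toy nearest-centroid fault classifier. Each fault type is represented
# by the mean ("centroid") of its training vectors; a new observation is
# labeled with the closest centroid by squared Euclidean distance.
def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(x, centroids):
    def dist2(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))

# Hypothetical training data: feature vectors of (io_wait, cpu) per fault.
training = {
    "disk_saturation": [[0.9, 0.1], [0.8, 0.2]],
    "cpu_runaway":     [[0.1, 0.95], [0.2, 0.9]],
}
centroids = {label: centroid(vs) for label, vs in training.items()}
print(classify([0.85, 0.15], centroids))  # disk_saturation
```

Data collected in production just becomes new rows in `training`, which is why retraining can be an ongoing activity.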

------
cilo
Always fun to read these Etsy ops posts. I'm very curious to know what their
practical architecture looks like that allows them to capture 250k unique
metrics and also run skyline against them all. It seems like each new
algorithm would add a ton of processing requirements when you're at that
scale.

Also, it seems like this would be really useful with the addition of metrics
grouping and group specific algorithms as right now it looks like their 250k
metrics all pop up in the same anomalous bucket with all metrics getting the
same algorithms applied to them.

------
oscilloscope
Abe gave a talk at OpenVisConf about Skyline and how to make use of 250k
metrics
[http://www.youtube.com/watch?v=Rij604NBXqk](http://www.youtube.com/watch?v=Rij604NBXqk)

------
plasma
Anyone recommend an easy way to get started with StatsD?

I've tried to configure/install/setup StatsD etc in the past but hit so many
problems with dependencies, undocumented software needing to be installed,
etc.

Any tutorial or something to get stats being tracked and graphed beautifully
would be awesome.

~~~
russgray
This blog post helped me when I took a look last year:
[http://geek.michaelgrace.org/2011/09/installing-statsd-on-ubuntu-server-10-04/](http://geek.michaelgrace.org/2011/09/installing-statsd-on-ubuntu-server-10-04/)

~~~
kawsper
It's just too bad there are a lot of moving parts in this setup.

Graphite consists of three parts:

- carbon: a daemon that listens for time-series data.
- whisper: a simple database library for storing time-series data.
- webapp: a (Django) webapp that renders graphs on demand.

And statsd is its own daemon.

That means three daemons need to run just to aggregate stats.
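The statsd wire format itself is tiny, though: plain-text `bucket:value|type` packets over UDP. A minimal sketch of emitting metrics without any client library, assuming statsd is listening on its default port 8125 on localhost:

```python
# Send statsd metrics as raw UDP packets of the form "bucket:value|type".
# Assumes a statsd daemon on localhost:8125 (the default); UDP is
# fire-and-forget, so this works even if nothing is listening.
import socket

def statsd_send(bucket, value, metric_type="c", host="127.0.0.1", port=8125):
    payload = "%s:%s|%s" % (bucket, value, metric_type)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("ascii"), (host, port))
    sock.close()
    return payload  # returned here just for illustration

print(statsd_send("app.logins", 1))           # counter: app.logins:1|c
print(statsd_send("app.latency", 320, "ms"))  # timer:   app.latency:320|ms
```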

~~~
ghotli
Just use the existing chef recipes out there for setting it up. Why would you
try to get all the moving parts integrated properly when the work has already
been done? Stand on the shoulders of giants.

~~~
kawsper
> when the work has already been done?

I am very wary about introducing new software into our stack if I don't
understand it. Badly configured software can cause problems down the road.

Last time I tried a recipe that installed Redis, it didn't version-lock the
Redis server, which meant the daemon couldn't start because some configs had
been deprecated.

The recent DNS DDoS attacks were possible because people had set up badly
configured DNS servers.

------
jmcqk6
They might want to reconsider the name:

[http://www.gibraltarsoftware.com/](http://www.gibraltarsoftware.com/)

Their monitoring solution is also called Loupe.

------
contingencies
Skyline and oculus both look interesting and this is definitely a solid
direction to be heading in.

However, I wonder if some form of topology knowledge, operations dependency
tree or similar could further inform this type of root cause analytics.

Without a declarative style "here is how thing should be" model of adequate
accuracy, it seems like the analytics will be stuck at the "these things are
strange and happened at once, what does human think?" level of sophistication.

------
philsnow
I'm confused on one point: does the anomaly correlation find other metrics
that look "similar" or other metrics that also have anomalies in the same time
span ? The latter seems like it would be very useful.

You mention elsewhere that statsd lets you do complicated aggregations over
time. If you have a moving average of errors over 10 minutes or something,
that's potentially not going to show up when you do anomaly correlation, since
a spike is smeared across 20 minutes. Do you account for that? It would
require knowing which metrics are aggregated across time and by how much, etc,
I guess.
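The smearing effect being described is easy to demonstrate (this is my own illustration of the general point, not Skyline's behavior): a one-sample spike shrinks dramatically once it passes through a moving-average aggregation, so a detector looking only at the aggregate sees a much smaller bump.

```python
# A single spike of height 110 in a series of 10s is diluted to 20 by a
# 10-sample moving average: the anomaly is an order of magnitude smaller
# in the aggregated metric than in the raw one.
def moving_average(series, window):
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

raw = [10.0] * 20
raw[10] = 110.0  # one large spike

smoothed = moving_average(raw, 10)
print(max(raw))       # 110.0 - obvious anomaly in the raw series
print(max(smoothed))  # 20.0  - spike diluted by the averaging window
```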

~~~
jonlives
Oculus author here - Oculus detects other metrics that look similar, ie that
have a similar anomaly or shape in the same time span. It doesn't pick up
metrics that have other, "dissimilar" anomalies - that part is left to
Skyline.

Oculus treats all metrics that it gets from Skyline equally at the moment, ie
it doesn't know if what it's looking at is an aggregation, or a single set of
data points. It just takes the data as it's presented. It would be totally
possible, however, to add 10 and 20 minute averages (for example) for the same
metric into Skyline so that Oculus would treat them separately.

------
josh2600
This is really interesting. I can't even imagine measuring 250k different
metrics without a tool like this. It's just so much data to assess.

Granted it would be extremely useful for post-mortems, but looking at it real
time is a bit like the library of Babel [1].

[1] [http://en.wikipedia.org/wiki/The_Library_of_Babel](http://en.wikipedia.org/wiki/The_Library_of_Babel)

------
methehack
> That’s far too many graphs for a team of 150 engineers to watch all day
> long!

Does etsy have 150 engineers? Is that even possible?

~~~
methehack
I'll take the downvote as a "yes" :)

It's true the "Is that even possible?" was out of line and I should have
tempered it -- but, I am truly surprised.

------
misiti3780
I love etsy's open source contributions, but am I the only one here who thinks
they're due for a blog redesign?

~~~
aba_sababa
It's in the works :)

