

A Practical guide to StatsD/Graphite monitoring - mattetti
http://matt.aimonetti.net/posts/2013/06/26/practical-guide-to-graphite-monitoring/

======
another
In addition to this helpful guide, note that statsd / graphite both spring
some unfortunate surprises on new users, e.g., graphite changing your data
across retention rates and time scales [0], graphite changing your data at
different plot widths (?!) [1], statsd believing that only count and time data
deserve to be aggregated [2], etc.

I have no alternative to suggest, however. Perhaps Cube [3], but unclear if it
has any user community.

[0] [http://stackoverflow.com/questions/10820119/graphite-is-not-graphing-anything-for-ranges-bigger-than-7-hours](http://stackoverflow.com/questions/10820119/graphite-is-not-graphing-anything-for-ranges-bigger-than-7-hours)
[1] [http://graphite.readthedocs.org/en/1.0/functions.html#graphite.render.functions.cumulative](http://graphite.readthedocs.org/en/1.0/functions.html#graphite.render.functions.cumulative)
[2] [https://github.com/etsy/statsd/issues/98](https://github.com/etsy/statsd/issues/98)
[3] [https://github.com/square/cube](https://github.com/square/cube)

~~~
fourk
Re [0]: If you never want your data downsampled, keep data at a single
resolution which is equal to the flush interval used to push data to Graphite.
Carbon will never "change your data" under such a configuration.
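Concretely, if statsd flushes every 10 seconds, a single-archive schema like the following keeps Carbon from ever rolling data up (the pattern and retention period are illustrative, not a recommendation for every workload):

```ini
# storage-schemas.conf -- hypothetical example
# One archive whose precision matches the statsd flush interval
# (10s here), kept for 30 days. With no coarser archive defined,
# Carbon has nothing to downsample into.
[stats]
pattern = ^stats\.
retentions = 10s:30d
```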

Re [1]: How would you expect the presentation layer to present >n data points
using n pixels?

Graphite doesn't "change your data". Presentation of data != the data itself,
just as a map of a city != the city itself.

~~~
another
> If you never want your data downsampled, keep data at a single resolution...

Sure, and many people do exactly that. The point is that a new user to
graphite is likely to be surprised by this behavior. (I would further bet that
a reasonable fraction of statsd+graphite users end up viewing incorrect data
without realizing it, especially given the statsd focus on count data, for
which the default aggregationMethod setting is exactly the wrong choice.)

(And even awareness of this behavior isn't quite enough, since every user
needs to also remember their server's exact storage configuration, lest they
inadvertently expand their plot across a retention boundary.)
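For the count case specifically: Carbon's default aggregationMethod is average, so when statsd counters cross a retention boundary their totals shrink silently. A hedged sketch of a storage-aggregation.conf that sums them instead (the patterns assume statsd's default stats_counts namespace; adjust to your naming):

```ini
# storage-aggregation.conf -- illustrative, match patterns to your metrics
[counters]
pattern = ^stats_counts\.
xFilesFactor = 0
aggregationMethod = sum

[everything_else]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```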

> How would you expect the presentation layer to present >n data points using
> n pixels?

The same way that most plotting tools do so: by overdrawing. Yes, one ends up
with a solid block of pixels if the data are noisy and the plot is small, but
that outcome is easily understood and has the easily understood solution of
explicitly aggregating appropriately. Graphite instead takes the approach of
implicitly aggregating based on how wide the plot is rendered in a given
interface. That behavior is, at the very least, surprising.

------
sciurus
This is a very nice article.

One nitpick: you don't need to use statsd as an intermediary in order for your
application to send metrics via UDP; just set ENABLE_UDP_LISTENER to True in
carbon.conf and Graphite will accept metrics on UDP itself. Other options are
TCP (obviously) and AMQP.

I love how simple Graphite's plaintext protocol is; it's nothing more than a
line of text with <metric path> <metric value> <metric timestamp>. This has
led lots of software to integrate Graphite support and makes it easy to do
yourself. In a pinch I've even set up a cron job reading a value from /proc and
sending it to Graphite via netcat.
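That protocol fits in a few lines of any language. A minimal Python sketch (the hostname and metric path are made-up placeholders; 2003 is Carbon's default plaintext port, and the UDP path assumes ENABLE_UDP_LISTENER is enabled):

```python
import socket
import time

def carbon_line(path, value, timestamp=None):
    # Graphite plaintext protocol: "<metric path> <metric value> <metric timestamp>"
    # terminated by a newline.
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_udp(line, host="graphite.example.com", port=2003):
    # Fire-and-forget, like the netcat trick above.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()

line = carbon_line("servers.web01.load.shortterm", 0.42, 1372204800)
```

The same line piped to `nc graphite.example.com 2003` works over plain TCP with no carbon.conf changes.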

Graphite shines at generating graphs, but its ability to return JSON is also
very useful. For example, I've written a script
([https://github.com/sciurus/grallect](https://github.com/sciurus/grallect))
that plugs into Nagios and generates alerts based on system metrics sent by
Collectd to Graphite.

My two frustrations with Graphite:

You have to choose a single aggregation method. I'd like to be able to store
the average, minimum, and maximum values.

Sometimes I find it hard to query for the data I want. E.g., to check the
percentage of space used on each filesystem I have to fetch
example.com.df-*.df_complex-used and example.com.df-*.df_complex-free
separately and calculate the percentages myself, because
asPercent(example.com.df-*.df_complex-{used,free}) would combine all the
filesystems.
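One workaround for that last point is to do the division client-side: fetch both wildcards with format=json and pair each used series with its free sibling. A sketch, where the target names follow the collectd naming above and the input is the shape of Graphite's JSON render response (a list of objects with "target" and "datapoints" as [[value, timestamp], ...]):

```python
def percent_used(series):
    # series: parsed JSON from /render?target=...&format=json
    latest = {}
    for s in series:
        points = [v for v, _ in s["datapoints"] if v is not None]
        if points:
            latest[s["target"]] = points[-1]

    result = {}
    for target, used in latest.items():
        if "df_complex-used" not in target:
            continue
        # Pair e.g. example.com.df-root.df_complex-used with its -free sibling.
        free_target = target.replace("df_complex-used", "df_complex-free")
        if free_target in latest:
            free = latest[free_target]
            result[target] = 100.0 * used / (used + free)
    return result
```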

------
pearkes
I use a statsd compatible alternative called statsite.[1]

It's written in pure C and behaves like you would expect statsd to, with some
additional improvements. I'm definitely more comfortable deploying it as
opposed to installing and managing a Node.js application.

[1] [https://github.com/armon/statsite](https://github.com/armon/statsite)

~~~
mattetti
Interesting, 37signals released their own Go-based version of statsd:
[https://github.com/noahhl/go-batsd](https://github.com/noahhl/go-batsd)
Probably for the same reasons statsite was rewritten in pure C.

~~~
geetarista
The main reason being that StatsD will max out at about 10K OPS (unless
they've improved it recently) whereas Statsite will reach 10 MM. Also, look at
the difference between the implementation of sets. StatsD uses a JS object[1]
versus statsite using a C implementation of HyperLogLog[2][3]. If you're doing
anything significant, you should not be using the node.js version of StatsD.

[1]
[https://github.com/etsy/statsd/blob/master/lib/set.js](https://github.com/etsy/statsd/blob/master/lib/set.js)
[2]
[https://github.com/armon/statsite/blob/master/src/set.c](https://github.com/armon/statsite/blob/master/src/set.c)
[3]
[http://research.google.com/pubs/pub40671.html](http://research.google.com/pubs/pub40671.html)
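For intuition about that trade-off: a JS object keeps every member in memory to count uniques exactly, while HyperLogLog keeps only a small array of registers and estimates the count to within a few percent. A toy Python sketch of the algorithm follows; it is not statsite's actual code, and SHA-1 merely stands in for a fast hash:

```python
import hashlib
import math

def _hash64(item):
    # 64 uniform bits from SHA-1 (illustrative; real implementations
    # use a much faster hash function).
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b                        # 2^b registers
        self.m = 1 << b
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        x = _hash64(item)
        j = x & (self.m - 1)              # low b bits choose a register
        w = x >> self.b                   # remaining 64-b bits
        # rank of the leftmost 1-bit in w (1-indexed; 64-b+1 if w == 0)
        rho = (64 - self.b) - w.bit_length() + 1
        if rho > self.registers[j]:
            self.registers[j] = rho

    def count(self):
        e = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if e <= 2.5 * self.m and zeros:   # small-range (linear counting) fix
            e = self.m * math.log(self.m / zeros)
        return int(round(e))

hll = HyperLogLog(b=10)                   # 1024 registers, ~3% typical error
for i in range(10000):
    hll.add("user-%d" % i)
estimate = hll.count()
```

1024 one-byte registers bound the memory at about a kilobyte no matter how many distinct items are added, which is the whole point versus an exact set.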

------
SEJeff
Graphite committer / co-maintainer here...

Feel free to point out useful things graphite could do better (constructively
only) and/or some of your favorite posts or tools used with graphite. We
aren't too far off from two quite massive releases (0.9.11 / 0.10) and are
thinking about departing from some of the legacy bits moving forward. I'm
looking at you, Python 2.4.

~~~
m0th87
The process of getting graphite web up and running on Mac seemed pretty
involved since it's broken up into 3 packages and depends on Cairo which can
be finicky.

When I last checked there didn't seem to be solid documentation on how to get
it all set up. Searching now, this looks promising:
[http://amin.bitbucket.org/posts/graphite-mac-homebrew.html](http://amin.bitbucket.org/posts/graphite-mac-homebrew.html)

~~~
SEJeff
Would it be helpful if that was rolled up in the official docs?

~~~
m0th87
Yeah I think so. Or maybe a Homebrew formula. I don't know how
straightforward that is with Cairo though.

Thank you for graphite and thanks for being so receptive!

------
threeseed
Everyone considering using Graphite needs to take a look at Librato Metrics
([https://librato.com](https://librato.com)).

It's very affordable and dramatically simplifies management of your stats. I
am so glad to put Graphite behind me.

------
tbh
Excellent guide. We (at
[http://www.hostedgraphite.com](http://www.hostedgraphite.com)) will be
pointing new users towards it and I'm sure many will benefit. Thanks!

------
mattetti
37Signals wrote a piece on how to instrument your Rails app:
[http://37signals.com/svn/posts/3091-pssst-your-rails-application-has-a-secret-to-tell-you](http://37signals.com/svn/posts/3091-pssst-your-rails-application-has-a-secret-to-tell-you)

------
jpadilla_
I created a Vagrant VM for StatsD and Graphite. Maybe it'll be of help to
someone: [https://github.com/jpadilla/statsd-graphite-vm](https://github.com/jpadilla/statsd-graphite-vm)

------
boundlessdreamz
Thank you. This is awesome.

If you have the same metrics coming in from multiple app servers, how can they
be viewed in aggregate and separately?

~~~
mattetti
To do that, you can change your stats name to include each machine's name, for
instance: <host name>.accounts.*.http.post.authenticate.response_time

When you display your metrics you can query for:
*.accounts.*.http.post.authenticate.response_time

to get a breakdown per machine (and per client), or you can sum the metrics
still using the wildcard.
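On the sending side, statsd's wire format is just `<name>:<value>|<type>` over UDP ("ms" for timers). A sketch of recording the per-host metric above; the host name, client id, and statsd address are all made-up placeholders:

```python
import socket

def statsd_packet(metric, value, metric_type="ms"):
    # statsd wire format: "<name>:<value>|<type>" ("ms" = timer, "c" = counter)
    return "%s:%s|%s" % (metric, value, metric_type)

def record_response_time(host_name, client_id, ms,
                         statsd_addr=("127.0.0.1", 8125)):
    # Namespace the metric with the machine's name, as suggested above,
    # so per-host and aggregate views are both one wildcard away.
    metric = "%s.accounts.%s.http.post.authenticate.response_time" % (
        host_name, client_id)
    packet = statsd_packet(metric, ms, "ms")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), statsd_addr)
    finally:
        sock.close()
    return packet
```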

------
thesis
Graphite & Carbon were an absolute nightmare to get installed on CentOS -- we
had to settle for installing them on Ubuntu instead.

~~~
mhurron
CentOS isn't right for everyone, oddly enough because of its enterprise focus
on stability: the system libraries on CentOS end up being quite old from the
viewpoint of a lot of developers.

Sometimes the best solution is to ignore the system provided libraries and
build your own environment.

