A Practical guide to StatsD/Graphite monitoring (aimonetti.net)
136 points by mattetti 1394 days ago | 29 comments

In addition to this helpful guide, note that statsd / graphite both spring some unfortunate surprises on new users, e.g., graphite changing your data across retention rates and time scales [0], graphite changing your data at different plot widths (?!) [1], statsd believing that only count and time data deserve to be aggregated [2], etc.

I have no alternative to suggest, however. Perhaps Cube [3], but unclear if it has any user community.

[0] http://stackoverflow.com/questions/10820119/graphite-is-not-...
[1] http://graphite.readthedocs.org/en/1.0/functions.html#graphi...
[2] https://github.com/etsy/statsd/issues/98
[3] https://github.com/square/cube

Re [0]: If you never want your data downsampled, keep data at a single resolution which is equal to the flush interval used to push data to Graphite. Carbon will never "change your data" under such a configuration.
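For concreteness, a minimal storage-schemas.conf sketch under that assumption (the pattern, the 10-second flush interval, and the retention length are all illustrative, not a recommendation):

```ini
# storage-schemas.conf (illustrative): a single retention level whose
# resolution matches a 10-second statsd flush interval, so carbon
# never has to downsample.
[stats]
pattern = ^stats\.
retentions = 10s:30d
```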

Re [1]: How would you expect the presentation layer to present >n data points using n pixels?

Graphite doesn't "change your data". Presentation of data != the data itself, just as a map of a city != the city itself.

> If you never want your data downsampled, keep data at a single resolution...

Sure, and many people do exactly that. The point is that a new user to graphite is likely to be surprised by this behavior. (I would further bet that a reasonable fraction of statsd+graphite users end up viewing incorrect data without realizing it, especially given the statsd focus on count data, for which the default aggregationMethod setting is exactly the wrong choice.)

(And even awareness of this behavior isn't quite enough, since every user needs to also remember their server's exact storage configuration, lest they inadvertently expand their plot across a retention boundary.)

> How would you expect the presentation layer to present >n data points using n pixels?

The same way that most plotting tools do so: by overdrawing. Yes, one ends up with a solid block of pixels if the data are noisy and the plot is small, but that outcome is easily understood and has the easily understood solution of explicitly aggregating appropriately. Graphite instead takes the approach of implicitly aggregating based on how wide the plot is rendered in a given interface. That behavior is, at the very least, surprising.

Changing data in undesired ways is something I'd call a bug, not a surprise.

It's not a bug; carbon was behaving exactly the way it was configured to behave. This wouldn't be surprising to anyone who is familiar with RRDTool. However, since one of the reasons graphite uses its own file format (whisper) instead of RRD is to better handle intermittent values, I could see the argument that the default xFilesFactor should be higher.

"xFilesFactor should be a floating point number between 0 and 1, and specifies what fraction of the previous retention level’s slots must have non-null values in order to aggregate to a non-null value. The default is 0.5." - http://graphite.readthedocs.org/en/1.0/config-carbon.html
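For anyone hunting for these knobs: they live in storage-aggregation.conf. A sketch (the patterns are illustrative) that pairs a zero xFilesFactor with sum aggregation for statsd-style counters, which addresses both complaints upthread:

```ini
# storage-aggregation.conf (illustrative patterns)

# statsd counters: sum rather than average when downsampling, and
# keep sparse data instead of nulling it out (xFilesFactor = 0).
[counters]
pattern = ^stats_counts\.
xFilesFactor = 0
aggregationMethod = sum

# everything else keeps graphite's defaults
[default]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
```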

Fixing statsd is easy, at least; that program is trivial.

Oh, yeah, it's weird but not even a serious limitation: it's being fixed, other statsd clones have more features, and you can always pretend your data are time intervals. statsd is limited but nicely simple.

This is a very nice article.

One nitpick: you don't need to use statsd as an intermediary in order for your application to send metrics via UDP; just set ENABLE_UDP_LISTENER to True in carbon.conf and graphite will accept metrics on UDP itself. Other options are TCP (obviously) and AMQP.
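For reference, the relevant carbon.conf settings look roughly like this (the interface value is illustrative; 2003 is carbon's default receiver port):

```ini
# carbon.conf, [cache] section: accept plaintext metrics over UDP
# in addition to the TCP line receiver.
[cache]
ENABLE_UDP_LISTENER = True
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
```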

I love how simple Graphite's plaintext protocol is; it's nothing more than a line of text with <metric path> <metric value> <metric timestamp>. This has led lots of software to integrate graphite support and makes it easy to do yourself. In a pinch I've even set up a cronjob reading a value from /proc and sending it to graphite via netcat.
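To illustrate just how little there is to the protocol, here's a minimal Python sketch (the metric name is made up, and 2003 is carbon's default plaintext port):

```python
import socket
import time

def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext protocol:
    '<metric path> <metric value> <metric timestamp>' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_metric(path, value, host="127.0.0.1", port=2003):
    """Send one metric to carbon's plaintext listener over TCP."""
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(graphite_line(path, value).encode("ascii"))
```

That's the whole protocol; the netcat cronjob trick works because there's no framing or handshake beyond the line itself.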

Graphite shines at generating graphs, but its ability to return JSON is also very useful. For example, I've written a script (https://github.com/sciurus/grallect) that plugs into Nagios and generates alerts based on system metrics sent by Collectd to Graphite.
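As a sketch of what consuming that JSON looks like (the series name and datapoints are invented, but the shape follows the render endpoint's format=json output: a list of series, each with "target" and "datapoints" as [value, timestamp] pairs, with null for empty slots):

```python
import json

# A typical request looks like:
#   http://graphite.example.com/render?target=stats.foo&format=json
sample = json.loads("""
[{"target": "stats.foo",
  "datapoints": [[1.5, 1370846820], [null, 1370846880], [2.0, 1370846940]]}]
""")

def latest_value(series):
    """Return the most recent non-null datapoint's value, or None."""
    for value, _ts in reversed(series["datapoints"]):
        if value is not None:
            return value
    return None

print(latest_value(sample[0]))  # -> 2.0
```

Skipping the nulls matters in practice, since the newest slot is often still empty when you poll.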

My two frustrations with graphite:

You have to choose a single aggregation method. I'd like to be able to store the average, minimum, and maximum values.

Sometimes I find it hard to query for the data I want. E.g., to check the percentage of space used on each filesystem I have to fetch example.com.df-*.df_complex-used and example.com.df-*.df_complex-free separately and calculate the percentages myself, because asPercent(example.com.df-*.df_complex-{used,free}) would combine all the filesystems.

Graphite committer / co-maintainer here...

Feel free to point out useful things graphite could do better (constructively only) and/or some of your favorite posts or tools used with graphite. We aren't too far off from two quite massive releases (0.9.11 / 0.10) and are thinking about departing from some of the legacy bits moving forward. I'm looking at you, Python 2.4.

The process of getting graphite-web up and running on a Mac seemed pretty involved, since it's broken up into 3 packages and depends on Cairo, which can be finicky.

When I last checked there didn't seem to be solid documentation on how to get it all setup. Searching now, this looks promising: http://amin.bitbucket.org/posts/graphite-mac-homebrew.html

Here is another tutorial on how to get setup on OS X: http://steveakers.com/2013/03/12/installing-graphite-statsd-...

Would it be helpful if that was rolled up in the official docs?

Yeah, I think so. Or maybe a Homebrew formula. I don't know how straightforward that is with Cairo though.

Thank you for graphite and thanks for being so receptive!

Yes please, it looks like amazing software, I just can't figure out how to actually use it.

First of all, thanks a lot for such an awesome tool!

I think graphite would greatly improve by having the "web" part split into different apps/packages (API, graphing and frontend/dashboard).

That way, people could install whatever they wanted. Imagine "only" having to improve the backend while other people create amazing dashboards (which, right now, is already happening anyway... there's such a fragmentation in the available frontends...)

I'd love to help with the split if you deem it worthy & if you need a hand :) I will create a GH issue anyway :D

Again, THANKS!

I use a statsd compatible alternative called statsite.[1]

It's written in pure C and behaves like you would expect statsd to, with some additional improvements. I'm definitely more comfortable deploying it than installing and managing a node.js application.

[1] https://github.com/armon/statsite

Interesting: 37signals released their own Go-based version of statsd: https://github.com/noahhl/go-batsd Probably for the same reasons you prefer the pure C implementation.

The main reason being that StatsD will max out at about 10K ops (unless they've improved it recently), whereas statsite will reach 10MM. Also, look at the difference between the implementations of sets: StatsD uses a JS object [1] versus statsite's C implementation of HyperLogLog [2][3]. If you're doing anything significant, you should not be using the node.js version of StatsD.

[1] https://github.com/etsy/statsd/blob/master/lib/set.js
[2] https://github.com/armon/statsite/blob/master/src/set.c
[3] http://research.google.com/pubs/pub40671.html

Everyone considering using Graphite needs to take a look at Librato Metrics (https://librato.com).

It's very affordable and dramatically simplifies management of your stats. I am so glad to put Graphite behind me.

Excellent guide. We (at http://www.hostedgraphite.com) will be pointing new users towards it and I'm sure many will benefit. Thanks!

37Signals wrote a piece on how to instrument your Rails app: http://37signals.com/svn/posts/3091-pssst-your-rails-applica...

I created a Vagrant VM for StatsD and Graphite. Maybe it'll be of help to someone: https://github.com/jpadilla/statsd-graphite-vm

Thank you. This is awesome.

If you have the same metrics coming in from multiple app servers, how can they be viewed in aggregate and separately?

To do that, you can change your stats name to include each machine's name, for instance: <host name>.accounts.<client name>.http.post.authenticate.response_time

When you display your metrics you can query for: *.accounts.*.http.post.authenticate.response_time

to get a breakdown per machine (and per client), or you can sum the metrics still using the wildcard.

You can also have carbon aggregate them for more efficient queries https://graphite.readthedocs.org/en/0.9.10/config-carbon.htm...
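A sketch of what that looks like in carbon's aggregation-rules.conf, following the naming scheme above (the 60-second frequency and the "all" node are illustrative choices, not defaults):

```ini
# aggregation-rules.conf (illustrative): carbon-aggregator rolls the
# per-host series up into a single "all" series every 60 seconds,
# so the cross-host view is one cheap fetch instead of a wildcard.
all.accounts.<client>.http.post.authenticate.response_time (60) = avg *.accounts.<client>.http.post.authenticate.response_time
```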

Graphite & Carbon were an absolute nightmare to get installed on CentOS; we had to settle for installing them on Ubuntu instead.

CentOS isn't right for everyone, oddly enough because of its enterprise focus on stability. The system libraries on CentOS end up being quite old from the viewpoint of a lot of developers.

Sometimes the best solution is to ignore the system provided libraries and build your own environment.

I only ran into one problem (and they quickly accepted my pull request to fix it) building and using RPMs from the spec files at https://github.com/dcarley/graphite-rpms

So for 0.10, one of my personal goals is to get it all working with EPEL5 so you can just install all of the components.
