Hacker News

Does anyone here use this?

Yes, at IFTTT we use this. I've been pleasantly surprised at the amount of traffic statsd can handle, even without sampling.

We also believe in measuring everything you can. We're interacting with many APIs across many boxes. Statsd + graphite are the tools we use to understand what's happening at runtime.

Graphite has a lot of warts, but it's really powerful once you get used to it. There are plenty of pretty interfaces you can put over graphite, but nothing really matches it for ease of ad-hoc queries.

Typically I'll use graphite to view ad-hoc metrics and build reports. When I find I'm repeatedly viewing a particular graphite report then I'll "hard-code" it in gdash [1] for the rest of the team.

We use this combo to track thousands of separate metrics and we've been pretty happy with it so far.

[1] https://github.com/ripienaar/gdash

We use StatsD at SeatGeek (and at my previous job as well) to track as much as possible. In general we try to time each call to an external service and use counters for any exceptions with those services.

When I was independently contracting as a systems engineer I convinced clients to invest some time into this whenever I could. For example, I had one client with a set of content aggregators (read: web crawlers) that grew and shrank dynamically with no centralized logging. When I eventually convinced them it was worth the time, I fixed a variety of their logging issues and also introduced statsd and graphite.

Implementation was easy. statsd is pretty simple to deploy and graphite wasn't too difficult either. To add statsd reporting to your code, it's essentially one line to create the statsd socket, another line of code to declare each timer or counter, and another one to increment. I think more time was spent deciding what name to give each metric than implementing it in this project.
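
To give a rough idea of how little code that is, here is a minimal sketch in Python using only the standard library. It assumes a statsd daemon listening on UDP at 127.0.0.1:8125 (the usual default); the class name and the metric names are made up for illustration and are not from any of the projects mentioned here.

```python
import socket


def format_counter(name, value=1):
    # statsd counter line: <name>:<value>|c
    return f"{name}:{value}|c"


def format_timer(name, millis):
    # statsd timer line: <name>:<milliseconds>|ms
    return f"{name}:{millis}|ms"


class StatsdClient:
    """Hypothetical fire-and-forget client; not a real library's API."""

    def __init__(self, host="127.0.0.1", port=8125):
        # Assumed address of the statsd daemon.
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, payload):
        try:
            self.sock.sendto(payload.encode("ascii"), self.addr)
        except OSError:
            # Metrics are best effort: never let them break the caller.
            pass

    def incr(self, name, value=1):
        self.send(format_counter(name, value))

    def timing(self, name, millis):
        self.send(format_timer(name, int(millis)))


statsd = StatsdClient()
statsd.incr("crawler.pages_fetched")        # hypothetical metric name
statsd.timing("crawler.fetch_time", 87)     # hypothetical metric name
```

Because this goes over UDP, nothing blocks or fails if the daemon is down, which is a large part of why it is cheap to sprinkle through existing code.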

Now that I'm at dotCloud, I'm working with a much larger distributed system and we use it here also. We liked it enough to build some statsd hooks onto our RPC layer we use for just about everything. Now every time a component makes a remote procedure call, a counter for that call is incremented and the response time is sent to statsd. It's been very useful for troubleshooting odd behaviors and correlating events across the platform.
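
A hook like that could look roughly like the sketch below. This is not dotCloud's actual code: the decorator, the `rpc` prefix, and the `get_user` function are all hypothetical, and `emit` stands in for whatever sends a line to statsd (here it just appends to a list so the example runs on its own).

```python
import functools
import time


def instrumented(emit, prefix="rpc"):
    """Wrap a call so each invocation reports a counter and a timer.

    `emit` is any callable that accepts one statsd-formatted line.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            name = f"{prefix}.{func.__name__}"
            emit(f"{name}.calls:1|c")           # count every call
            start = time.monotonic()
            try:
                return func(*args, **kwargs)
            finally:
                # Report the response time even if the call raised.
                elapsed_ms = int((time.monotonic() - start) * 1000)
                emit(f"{name}.response_time:{elapsed_ms}|ms")
        return wrapper
    return decorator


sent = []  # stand-in for a real statsd socket


@instrumented(sent.append)
def get_user(user_id):
    # Hypothetical remote procedure; a real one would go over the network.
    return {"id": user_id}


result = get_user(42)
```

Putting the hook at the RPC layer means individual components get counters and timings without any per-call instrumentation code.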

As people who work with complex distributed systems, we can't know exactly what they're doing. We'll think we know, and sometimes we'll be close. Other times we'll think we know, and then we'll wake up at 2AM because something failed horribly. By being able to monitor the system's behavior (sometimes in gross detail), we can get a little closer to knowing what's really going on.

We use StatsD and Graphite a ton at lonelyplanet.com: http://bit.ly/lp-perf-and-metrics
