
Brubeck – A statsd-compatible metrics aggregator - samlambert
http://githubengineering.com/brubeck/
======
mcfunley
I was at Etsy when we made statsd, and I'm currently working on fixing the
same problems at Stripe. This is super interesting and relevant to me
personally, thanks for writing it all down!

There are some details of how Etsy uses statsd that are not well-communicated.
Etsy samples metrics aggressively to limit the amount of total traffic. And
they monitor the packet error rates on the statsd boxes like hawks to keep the
loss rate in check. Back when I was working there, if you added a high-volume
counter without sampling it, the alarms would sound and you'd have an ops
person tapping your shoulder pretty quickly. If you use statsd and skip either
of these steps, the 40% loss that GitHub experienced is what you get.
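
For anyone who hasn't used sampling, here is a rough sketch of what a sampled counter looks like on the wire. It relies on the usual statsd line format (`name:value|c|@rate`); the function names, the host/port defaults, and the injectable `rand` parameter are my own placeholders for illustration, not anything from Etsy's client code.

```python
import random
import socket


def format_counter(name, value=1, sample_rate=1.0, rand=random.random):
    """Return a statsd counter line, or None if this event is sampled out.

    With sample_rate < 1.0 only that fraction of events is sent, and the
    "@rate" suffix tells the server to scale the count back up.
    """
    if sample_rate < 1.0:
        if rand() >= sample_rate:
            return None  # dropped on the client, no packet sent
        return f"{name}:{value}|c|@{sample_rate}"
    return f"{name}:{value}|c"


def send_counter(name, value=1, sample_rate=1.0, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port here are assumed defaults."""
    line = format_counter(name, value, sample_rate)
    if line is None:
        return False
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
    return True
```

At a rate of 0.1, roughly one event in ten produces a packet, which is the whole point: the traffic drops by about 10x and the server multiplies the count back up.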

AIUI Etsy's moved to a consistent hashing scheme that's at least vaguely
similar to this.
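
To be clear about what I mean by consistent hashing, here is a minimal sketch of a hash ring that picks an aggregator node per metric name. This is a generic textbook version, not Etsy's or Brubeck's actual scheme; the class name, the MD5-based hash, and the `replicas` count are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key):
    # First 8 bytes of MD5 as an integer; any stable hash would do here.
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping metric names to nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, metric):
        # Walk clockwise to the first point after the metric's hash,
        # wrapping around to the start of the ring if needed.
        idx = bisect.bisect(self._points, _hash(metric)) % len(self._points)
        return self._ring[idx][1]
```

The useful property is that every sample for a given metric lands on the same node, and removing a node only remaps the metrics that lived on it.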

Node was not then, nor is it now, Etsy's area of expertise. We were going through
an adolescent "let's just use every language" phase when we built statsd. I
think the problems outlined here are solid supporting evidence that you should
use a smallish set of tools and master them (a point of view which is very
on-brand for Etsy engineering as it exists today).

~~~
bbrazil
Have you considered pushing the aggregation into the applications, rather than
doing it across the network?

If whatever library is sending data to statsd instead kept counters/gauges in
memory and exposed them on a regular basis, it would greatly reduce the data
volumes involved, as the cost is O(timeseries*frequency) rather than
O(events).

This is the approach we take with Prometheus, and based on the statsd setups
of some people who've come talking to us there's scope for a reduction in
network load of at least an order of magnitude without having to do any
downsampling.
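
A minimal sketch of the idea, assuming a single long-lived process: events only touch an in-memory dict, and one line per timeseries goes out per flush. The `LocalAggregator` class and its methods are hypothetical names for illustration; this is not the Prometheus client API (which exposes cumulative values for scraping rather than flushing statsd lines).

```python
import threading
from collections import defaultdict


class LocalAggregator:
    """Hypothetical in-process aggregator: O(timeseries) output per flush."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(int)
        self._gauges = {}

    def incr(self, name, value=1):
        # No network traffic here, just a locked in-memory add.
        with self._lock:
            self._counters[name] += value

    def gauge(self, name, value):
        with self._lock:
            self._gauges[name] = value

    def flush(self):
        """Return one statsd-style line per timeseries and reset counters."""
        with self._lock:
            counters, self._counters = self._counters, defaultdict(int)
            gauges = dict(self._gauges)  # gauges keep their last value
        lines = [f"{n}:{v}|c" for n, v in sorted(counters.items())]
        lines += [f"{n}:{v}|g" for n, v in sorted(gauges.items())]
        return lines
```

A thousand calls to `incr("hits")` between flushes turn into a single `hits:1000|c` line, where a plain statsd client would have sent a thousand packets.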

~~~
mcfunley
Yeah this has come up and it's reasonable, although it's a tricky/laborious
migration in practice given a wide variety of things emitting stats.

The statsd design choices here are mostly explained by the fact that Etsy uses
it to collect from PHP. PHP doesn't afford a great way to aggregate in the
client. (These are design choices that serve PHP well systemically, although
it's limiting here.)

~~~
bbrazil
I hadn't realised the PHP link, things make more sense now.

You're getting into IPC then, which is a fun topic alright e.g.
[https://github.com/prometheus/client_ruby/issues/9](https://github.com/prometheus/client_ruby/issues/9)
and
[https://github.com/prometheus/client_python/issues/30](https://github.com/prometheus/client_python/issues/30)

------
jameskilton
This is awesome, and more choice is definitely good, but I'm curious if GitHub
tried out statsite[1] at all in this process? My guess is that these two tools
were built at the exact same time solving the same problems.

[1] [https://github.com/armon/statsite](https://github.com/armon/statsite)

~~~
lpgauth
statsite is also based on an event loop and is limited to a single core (no
SMP).

~~~
armon
To clarify, statsite does spawn threads to do some of the aggregation and
flushing at the end of the collection interval, but yes it is possible to
saturate the main loop at a very high ingest rate.

------
jmsdnns
I once built a web framework named Brubeck
([https://github.com/j2labs/brubeck](https://github.com/j2labs/brubeck)). I
stopped developing it a while ago, though.

I am really happy to see another project, one that will probably reach more
people than my web framework, honor the same wonderful musician I was trying
to honor.

~~~
dberg
Funny, I think I remember going to a Brubeck meetup here in NY once. My old
coworker Patrick mentioned it and we ended up seeing a talk on it.

~~~
jmsdnns
That was me. BrubeckNYC. :)

------
anton_gogolev
Can't help but plug our MIT-licensed port of Graphite/StatsD to .NET: [1].

[1]: [https://bitbucket.org/aeroclub-it/statsify](https://bitbucket.org/aeroclub-it/statsify)

------
piotrp
There are already many statsd implementations, including two in C and four
in Go. Why the need for another one?

[https://github.com/etsy/statsd/wiki#server-implementations](https://github.com/etsy/statsd/wiki#server-implementations)

~~~
simonw
Presumably because they didn't exist (or weren't production-ready) three years
ago:

"After three years running in our production servers (with very good results
both in performance and reliability), today we're finally releasing Brubeck as
open-source."

------
moogly
Perchance named after Dave Brubeck?

~~~
jssjr
Indeed!

~~~
robhanlon
Complete with a subtle joke about odd meter!

------
rodionos
>A Turing-complete computing device running a modern version of the Linux
kernel... I liked that!

Any plans for pluggable backends in addition to BRUBECK_BACKEND_CARBON? Please
consider making it backend-agnostic, i.e. using a wire protocol.

------
leetrout
I don't really "know" golang but this seems like an area where it would have
been useful.

Am I thinking about this the right way? Would channels have helped handle some of
the load effectively?

Any gophers care to comment? :D

~~~
sagichmal
Choosing C for this project is definitely... strange. Keith Rarick calls it
"downright irresponsible" [1] — I'm not sure I disagree.

[1]
[https://twitter.com/krarick/status/610502413007503360](https://twitter.com/krarick/status/610502413007503360)

~~~
DasIch
What is the alternative? Rust isn't there yet when it comes to IO and there
are certainly no other languages that qualify and allow a similar level of
control.

~~~
rakoo
What about Nim? There are facilities to run an evented loop and to listen on a
UDP socket, it can be tuned for more realtime capabilities, it compiles to C...

~~~
DasIch
Nim doesn't have the same popularity. Not many people know the language, which
makes hiring people who do difficult, and it means there may be nobody at
GitHub who knows it either. Due to its lack of popularity, people will also
probably be less willing to learn it than Rust or Go.

------
marceldegraaf
Looks great! Question for the author(s) of Brubeck: are you planning to add an
InfluxDB backend, or is that something you'd want the community to pick up?

------
whost49
Is there a reason why you didn't just use collectd, which is written in C and
has lots of plugins, including statsd, InfluxDB, Graphite, etc.?

~~~
jssjr
We use collectd extensively and it is wonderful software. Brubeck and collectd
do very different jobs.

~~~
whost49
What does Brubeck do that the collectd statsd plugin can't?

[https://collectd.org/wiki/index.php/Plugin:StatsD](https://collectd.org/wiki/index.php/Plugin:StatsD)

