
I ended up rolling my own replacement. My biggest problem with Graphite was that it managed to grind an expensive, large RAID array into the ground with what was, in my eyes, a relatively small number of metrics. We realised we'd either waste a tremendous amount of hardware or have to cut back drastically on our data collection if we rolled out Graphite across the board.

(And yes, we had crashes too)

The reason for the disk grinding was simple: the Whisper storage system is ridiculously inefficient, as it does tiny writes all over the place and makes an excessive number of system calls to boot.

In our case, I decided we don't care if we lose some data when a metrics server crashes (if it becomes an issue we'll run two or more VMs on separate hardware and feed half our samples into each). So the first step was to write a simple statsd replacement that shovels 10-second intervals of data into Redis with disk snapshots turned off, coupled with a small daemon that rolls the data up. I've hardcoded the roll-up intervals, as that made it easy to encode the period in the key names and use "keys <timestamp for start of each interval to roll up>-<postfix for type of period, e.g. we use 10 seconds, then 5 minutes, then hourly>-*" to retrieve the keys of the objects to process at each step.
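
Roughly, the key-naming trick looks like this (a minimal Python/redis-py sketch, not our actual code; the key format, period suffixes and metric names are just placeholders):

    import time
    import redis

    r = redis.Redis(decode_responses=True)

    # Ingest: append each 10-second sample under a key that encodes the
    # interval start and a period suffix, e.g. "1700000000-10s-cpu.load".
    def record(metric, value, period=10, suffix="10s"):
        bucket = int(time.time()) // period * period
        r.rpush(f"{bucket}-{suffix}-{metric}", value)

    # Roll-up: because the interval start and period are in the key, a single
    # KEYS (or, better, SCAN) pattern per finished interval fetches everything
    # that needs averaging into the next, coarser period.
    def roll_up(bucket, src="10s", dst="5m", dst_period=300):
        for key in r.keys(f"{bucket}-{src}-*"):
            metric = key.split("-", 2)[2]
            values = [float(v) for v in r.lrange(key, 0, -1)]
            dst_bucket = bucket // dst_period * dst_period
            r.rpush(f"{dst_bucket}-{dst}-{metric}", sum(values) / len(values))
            r.delete(key)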

We could've easily beaten Carbon/Graphite on the same system just by doing more efficient disk writes, but since we were replacing it anyway, I figured I might as well keep things in memory.

Then a tiny replacement for the subset of the Graphite HTTP API we used for our graphing (if we'd relied on Graphite itself for our dashboards I'd have thought twice about this...).
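
To give an idea of the scale, a shim for the JSON flavour of Graphite's /render call can be sketched like this (toy Flask code reading the placeholder key layout from the sketch above; an illustration, not our real implementation):

    from flask import Flask, jsonify, request
    import redis

    app = Flask(__name__)
    r = redis.Redis(decode_responses=True)

    # Tiny subset of Graphite's render API: a single literal target name,
    # JSON output only, read straight from the 5-minute roll-up keys.
    @app.route("/render")
    def render():
        target = request.args.get("target", "")
        datapoints = []
        for key in sorted(r.keys(f"*-5m-{target}")):
            bucket = int(key.split("-", 1)[0])
            values = [float(v) for v in r.lrange(key, 0, -1)]
            datapoints.append([sum(values) / len(values), bucket])
        return jsonify([{"target": target, "datapoints": datapoints}])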

Lastly, a tiny process that archives a final roll-up of data older than 48 hours (currently) to CouchDB, in case we ever need to do longer-term historical trending.
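
Again just to illustrate how small the moving parts are, an archiver along those lines could look something like this (the database name, key suffix and document shape are invented, and CouchDB is driven through its plain HTTP API via requests):

    import time
    import requests
    import redis

    r = redis.Redis(decode_responses=True)
    COUCH_DB = "http://localhost:5984/metrics_archive"  # hypothetical database

    # Move hourly roll-ups older than 48 hours out of Redis into CouchDB by
    # POSTing each one as a JSON document.
    def archive(cutoff_hours=48):
        cutoff = int(time.time()) - cutoff_hours * 3600
        for key in r.keys("*-1h-*"):
            bucket, _, metric = key.split("-", 2)
            if int(bucket) < cutoff:
                doc = {
                    "metric": metric,
                    "timestamp": int(bucket),
                    "values": [float(v) for v in r.lrange(key, 0, -1)],
                }
                requests.post(COUCH_DB, json=doc).raise_for_status()
                r.delete(key)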

I keep wanting to talk to our commercial director about letting me release some of this code, though a lot of it is probably too specific to our needs to be all that useful to others (e.g. as mentioned, we only support a tiny subset of the functionality of the Graphite HTTP API, as I've only cared about being able to do the averaging and filtering etc. that we actually use). In general, though, if you don't use Graphite for the actual dashboard, replacing it is surprisingly little work.




I ended up setting up one machine with an 80GB tmpfs mount for the Graphite data, which I then rsync to disk every hour. That allows carbon-cache to keep up, but I'm not happy with the setup.

Whisper is terrible for spinning disks.


If you use collectd to feed values into Graphite, you get the advantage of its bulk writes. This article describes how it's solved for RRD-only collectd installations: https://collectd.org/wiki/index.php/Inside_the_RRDtool_plugi... but the effect also shows up when you use it to send values to Graphite.

It's also good at reducing the amount of data you need to send to Graphite.


How do you figure? Unless recent versions of Whisper have been totally rewritten, Whisper writes each metric to a separate file. Submit hundreds of metrics per VM/server every 10 seconds and you get a ridiculous number of tiny writes (e.g. 4-byte writes) fenced by redundant seek()s and a number of other syscalls, no matter how much you batch things up before sending them to statsd.
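
To make the access pattern concrete, a whisper-style "one preallocated file per metric, one fixed-size slot per interval" update can be sketched like this (illustrative Python, not Whisper's actual code; the paths, slot count and 12-byte point size are my own assumptions):

    import os
    import struct

    # Illustrative only: every datapoint costs an open, a seek and a tiny
    # write into that metric's own file, which is roughly the I/O pattern
    # described above (Whisper's real on-disk format is more involved).
    POINT = struct.Struct("!Ld")   # timestamp + value: 12 bytes per slot
    SLOTS = 8640                   # e.g. 24h of 10-second slots
    STEP = 10

    def update(metric, timestamp, value):
        os.makedirs("metrics", exist_ok=True)
        path = os.path.join("metrics", f"{metric}.dat")  # one file per metric
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.truncate(POINT.size * SLOTS)           # pre-allocate the archive
        slot = (timestamp // STEP) % SLOTS
        with open(path, "r+b") as f:
            f.seek(slot * POINT.size)                    # one seek per datapoint
            f.write(POINT.pack(timestamp, value))        # one 12-byte write per datapoint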


I came up with the same workaround myself, then switched to InfluxDB, and my monitoring server's io-wait is still at 0% even with twice as many metrics coming in :)



