

Graph everything with Graphite - pmoriarty
http://www.xkyle.com/graph-everything-with-graphite/

======
eloycoto
This post is quite old and the Graphite ecosystem has improved since; some examples:

Grafana from my POV is the best dashboard at the moment:
[http://grafana.org/](http://grafana.org/)
[http://grafana.org/blog/2014/05/25/monitorama-video-and-upda...](http://grafana.org/blog/2014/05/25/monitorama-video-and-update.html)
[http://play.grafana.org/](http://play.grafana.org/)

For alerting I'm using Cabot:
[http://cabotapp.com/](http://cabotapp.com/)

For system metrics, at the moment I'm using Diamond:
[https://github.com/BrightcoveOS/Diamond](https://github.com/BrightcoveOS/Diamond)

On the other hand, InfluxDB is growing; maybe I'll switch to it next year. Some
of its features are better than Graphite's, and it's statsd-compatible ;-)
[http://influxdb.com/](http://influxdb.com/)

Regards ;-)

~~~
Shish2k
FWIW I've found influxdb considerably easier to install and manage than
graphite (graphite doesn't play well with virtualenv, which makes dependency
management horrible, compared to influxdb's single static binary)

Also, I can see logging dictionaries being much more efficient and useful than
logging single values -- with graphite if you want to track page hits per
section of your site (of which you have 10) per user (100) per browser (5),
you end up with 5000 individual metrics, and you need to have thought of them
in advance. With influxdb you can log {"section": "front page", "user": "bob",
"browser": "firefox", "hits": 1} as a single metric and then use an SQL-like
query to filter by section / user / browser (or any combination of those) as
and when you want to.
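The combinatorial blow-up can be sketched in Python (the metric names here are made up for illustration, not from either system's actual API):

```python
from itertools import product

# Graphite needs one pre-created metric name per combination of dimensions:
sections = [f"section{i}" for i in range(10)]
users = [f"user{i}" for i in range(100)]
browsers = ["firefox", "chrome", "safari", "ie", "opera"]

graphite_metrics = [
    f"hits.{s}.{u}.{b}" for s, u, b in product(sections, users, browsers)
]
print(len(graphite_metrics))  # 5000 individual series, named up front

# InfluxDB-style: each event is one record with its dimensions attached,
# and you slice along any of them at query time instead of naming time.
point = {"section": "front page", "user": "bob",
         "browser": "firefox", "hits": 1}
```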

TBH the only thing I miss going from graphite to influxdb+grafana is the tree
of metrics (grafana has autocomplete once you start typing, but you can't just
browse) and a few of the rendering functions (moving average).

~~~
pauldix
I think the tagging features in the upcoming 0.9.0 release [1] will help with
the navigation of metrics. With that we're adding new types of queries to help
in discovery. [2]

[1] - [http://influxdb.com/blog/2014/12/08/clustering_tags_and_enha...](http://influxdb.com/blog/2014/12/08/clustering_tags_and_enhancements_in_0_9_0.html)
[2] - [https://github.com/influxdb/influxdb/blob/master/QUERIES.md#...](https://github.com/influxdb/influxdb/blob/master/QUERIES.md#list)

------
sztanko
"If you can't measure it, you can't prove you made it better" is a core value
in our organisation's tech culture. We have written an interface in our
framework for automatically creating new metrics, and it is very easy for a
developer to set up a new graph that monitors their code.

Now we have another problem. There are over 2 million metrics in our
monitoring system and no one knows what most of them mean. Some graphs were
set up for features that don't exist anymore, others were set up by
developers who have since quit, there are lots of duplicated metrics, and in
general it is a mess. So we are currently working on this problem. I'd still
point out that this is a better problem than not having metrics at all, but
it is a problem nonetheless.

If you've ever faced a similar situation, I'd be thankful if you could share
how you solved it.

~~~
yummyfajitas
[https://codeascraft.com/2013/06/11/introducing-kale/](https://codeascraft.com/2013/06/11/introducing-kale/)

~~~
sztanko
Looks interesting, thanks!

------
noelwelsh
I haven't had the best experience with Graphite. Namely, our main systems
practically never crash but Graphite does fall over every few months.
Seriously, Graphite is less reliable than the systems we use it to monitor.
Furthermore, there hasn't been a release in about 2 years, which makes me
think the project is dead.

~~~
vidarh
I ended up rolling my own replacement. My biggest problem with Graphite was
that it managed to grind an expensive large RAID array into the ground with a
relatively small number (in my eyes) of metrics. We had the realisation that
we'd waste a tremendous amount of hardware or have to cut down drastically on
our data collection if we were to roll out Graphite across the board.

(And yes, we had crashes too)

The reason for the disk grinding was simple: the whisper storage system is
_ridiculously_ inefficient, as it does tiny writes all over the place, plus
an excessive number of system calls to boot.

In our case, I decided we don't care if we lose some data when a metric
server crashes (if it becomes an issue we'll run two or more VMs on separate
hardware and feed half our samples into each). So the first step was to write
a simple statsd replacement that shovels 10-second intervals of data into
Redis with disk snapshots turned off, coupled with a small daemon that does
the roll-ups. (I've hardcoded the roll-up intervals, as that made it easy to
name the keys so that a "KEYS <timestamp for start of each interval to roll
up>-<postfix for type of period, e.g. we use 10 seconds, then 5 minutes, then
hourly>-*" query retrieves the keys of the objects to process at each step.)
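A minimal sketch of that kind of interval-keyed scheme (the key format and period names here are my guesses at what the comment describes, not the actual code):

```python
# Hypothetical key scheme mirroring the comment: samples are bucketed into
# fixed intervals, and the interval start plus a period postfix forms the
# key, so a roll-up pass can fetch "<start>-<postfix>-*" with one KEYS call.
PERIODS = {"10s": 10, "5m": 300, "1h": 3600}

def bucket_key(metric: str, ts: int, period: str) -> str:
    step = PERIODS[period]
    start = ts - (ts % step)          # align to the start of the interval
    return f"{start}-{period}-{metric}"

def rollup_pattern(ts: int, period: str) -> str:
    step = PERIODS[period]
    start = ts - (ts % step)
    return f"{start}-{period}-*"      # pattern handed to Redis KEYS

print(bucket_key("web1.cpu", 1418913287, "10s"))
```

Note that Redis's KEYS scans the whole keyspace, which is fine for a small roll-up set but worth replacing with SCAN at scale.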

We could've easily beaten Carbon/Graphite on the same system just by doing
more efficient disk writes, but since we were replacing it anyway, I figured
I might as well keep things in memory.

Then a tiny replacement for the subset of the Graphite HTTP API we used for
our graphing (if we'd relied on Graphite itself for our dashboards I'd have
thought twice about this...).

Lastly a tiny process that archives a final roll-up of data past 48 hours
(currently) to CouchDB for if/when we need to do longer term historical
trending.

I keep wanting to talk to our commercial director about letting me release
some of this code, though a lot of it is probably too specific to our needs to
be all that useful to others (e.g. as mentioned, we only support a tiny subset
of the functionality of the Graphite HTTP API, as I've only cared about being
able to do the averaging and filtering etc. that we actually use). In general,
though, if you don't use Graphite for the actual dashboard, replacing it is
surprisingly little work.

~~~
alxnlssn
I ended up setting up one machine with an 80GB tmpfs mount for the graphite
data, and then rsyncing it to disk every hour. That allows carbon-cache to
keep up, but I'm not happy with the setup.

Whisper is terrible for spinning disks.

~~~
dothebart
If you use collectd to feed values into graphite, you get the advantage of
its bulk writes. This article describes how it's solved for rrd-only collectd
installations:
[https://collectd.org/wiki/index.php/Inside_the_RRDtool_plugi...](https://collectd.org/wiki/index.php/Inside_the_RRDtool_plugin)
but the effect is also visible if you use it to send values to graphite.

It's also good at reducing the amount of data you need to send to graphite.

~~~
vidarh
How do you figure? Unless recent versions of Whisper have been totally
rewritten, whisper writes each metric to a separate file. Submit hundreds of
metrics per vm/server every 10 seconds, and you get ridiculous amounts of
tiny writes (e.g. 4-byte writes) fenced by redundant seek()s and a number of
other syscalls, no matter how much you batch things up before sending them to
statsd.
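Rough arithmetic, with made-up fleet numbers, showing why the one-file-per-metric pattern hurts:

```python
# Back-of-the-envelope for whisper's write pattern: one file per metric,
# one small seek+write per metric per flush interval. Fleet sizes below
# are illustrative, not from the comment.
servers = 50
metrics_per_server = 200        # "hundreds of metrics per vm/server"
interval_s = 10                 # samples submitted every 10 seconds

writes_per_sec = servers * metrics_per_server / interval_s
print(writes_per_sec)           # 1000 tiny writes/sec, each landing in a
                                # different file, so no sequential batching
```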

------
maguireb
Graphite is a great tool, but the graphs can be a bit ugly and changing time
can be a bit annoying. We use Grafana
([http://grafana.org/](http://grafana.org/)) for a nicer frontend to Graphite.

~~~
aequitas
Grafana is surely the way to go when composing dashboards for your Graphite
data.

Be sure to check out the screencasts to get a nice overview and quickstart of
the features:
[http://grafana.org/docs/screencasts/](http://grafana.org/docs/screencasts/)

Also, if you are no longer using the Graphite web frontend, consider
switching to graphite-api:
[http://graphite-api.readthedocs.org/en/latest/](http://graphite-api.readthedocs.org/en/latest/)

------
anton_gogolev
Being a Windows shop, we had no interest in running a Linux box with
Graphite/StatsD, so we went ahead and essentially ported
Graphite/StatsD/CollectD to .NET/C#. We'll be open-sourcing this toolset
soonish.

For those interested (not much there yet): [https://bitbucket.org/aeroclub-it/statsify](https://bitbucket.org/aeroclub-it/statsify)

------
Htsthbjig
I would love it if software developers stopped using names of real things for
their software apps. It is so confusing.

It is like they have material world envy or something.

Graphite is already something very common. It makes it hard for people to
search in a search engine, and in headlines like this it confuses the hell out
of normal people.

I graph lots of things using graphite powder.

~~~
nodata
Could you give us a few examples of a better name?

~~~
monista
Graphit, Graf(f)it, Grapheme, Graphol (graph-all), Graspit(e), Graphograph (as
in tool to graph a graph :)

------
KaiserPro
Graphite is _awesome_. We graph lots and lots of things: we have two
datastore servers, each with a static write load of about 60 megs a second
(bear in mind that each update is less than 100 bytes), so we have many
thousands of updates a second.

But why is it awesome? Because it almost eliminates the need for log
shipping. 90% of the time we can diagnose most problems with just graphs.
Something is running slow? Well, we can see which server it is by looking at
load, response time and queue size.

Because we are not doing silly things like parsing logs to gain metrics, we
don't need a Hadoop system. We plumb metrics collection directly into cgroups
(the primitive that Docker uses) so we can get per-process metrics (disk,
memory, CPU, etc.).
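A minimal sketch of that idea - reading a cgroup counter and emitting it in Graphite's plaintext format. The cgroup path assumes the v1 hierarchy, and the group and metric names are illustrative, not from the commenter's setup:

```python
import time

def graphite_line(path: str, value, ts=None) -> str:
    """Format one sample in Graphite's plaintext protocol: name value timestamp."""
    ts = int(ts if ts is not None else time.time())
    return f"{path} {value} {ts}"

def read_cgroup_memory(group: str) -> int:
    # cgroup v1 layout; under cgroup v2 the file is memory.current instead
    with open(f"/sys/fs/cgroup/memory/{group}/memory.usage_in_bytes") as f:
        return int(f.read().strip())

# e.g. send graphite_line("containers.web1.memory.usage",
#                         read_cgroup_memory("docker/abc123"))
# over TCP to carbon on port 2003.
print(graphite_line("containers.web1.memory.usage", 104857600, ts=1418913280))
```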

The only time we need logs is when we are really stuck, or diagnosing a
specific issue like "why" it went wrong, not what has gone wrong.

------
jcr
The graphite project docs and github repo are also helpful:

[http://graphite.readthedocs.org/en/latest/](http://graphite.readthedocs.org/en/latest/)

[https://github.com/graphite-project](https://github.com/graphite-project)

------
pmoriarty
Does anyone know of any good solutions for feature extraction from logs, for
the purpose of making graphs out of these features?

Has anyone integrated a graphing system like this in to Graylog/Logstash and
care to share some lessons or advice on how you did it?

Finally, monitoring suites like Nagios and Zabbix have their own
data-collection and graphing features. When tools like Graphite are used in
conjunction with these, would you bypass those features entirely? Or would
you leverage their data-collection features to first grab the data and then
somehow funnel it into Graphite?

If it's the former, what do you use instead of all those built-in data
collection tools? If it's the latter, how do you do it?

~~~
grosskur
Ryan Smith's "Building Metrics From Log Data" is an interesting talk about
this:

[http://vimeo.com/68183624](http://vimeo.com/68183624)

Also check out Heroku's Lumbermill project, which handles extracting router
metrics on their platform:

[https://github.com/heroku/lumbermill](https://github.com/heroku/lumbermill)

If you're on AWS, CloudWatch is capable of ingesting log data and extracting
metrics via pattern-matching:

[http://docs.aws.amazon.com/AmazonCloudWatch/latest/Developer...](http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchLogs.html)

When I used Graphite with Nagios, I bypassed all the Nagios data-collection
and graphing features. Instead, I funneled all the data into Graphite and used
check-graphite to alert on it:

[https://github.com/pyr/check-graphite](https://github.com/pyr/check-graphite)
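A minimal sketch of that kind of check, assuming Graphite's render API with format=json (which returns a list of `{"target": ..., "datapoints": [[value, ts], ...]}`); the threshold logic here is illustrative, not check-graphite's actual implementation:

```python
# Nagios-style exit codes
NAGIOS_OK, NAGIOS_WARN, NAGIOS_CRIT = 0, 1, 2

def check_latest(series: list, warn: float, crit: float) -> int:
    """Alert on the most recent non-null value of each returned target."""
    for target in series:
        values = [v for v, _ts in target["datapoints"] if v is not None]
        if not values:
            continue                  # target had no data in the window
        latest = values[-1]
        if latest >= crit:
            return NAGIOS_CRIT
        if latest >= warn:
            return NAGIOS_WARN
    return NAGIOS_OK

# Sample payload shaped like /render?target=web1.load&format=json output
sample = [{"target": "web1.load",
           "datapoints": [[0.4, 1], [None, 2], [1.7, 3]]}]
print(check_latest(sample, warn=1.5, crit=4.0))  # 1 (WARNING)
```

In practice you would fetch the JSON from the Graphite HTTP endpoint and feed the parsed body to the check, letting Nagios interpret the exit code.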

------
paulasmuth
You might also be interested in FnordMetric ChartSQL
([http://fnordmetric.io](http://fnordmetric.io)), which is a SQL-based
Graphite competitor.

~~~
mmsimanga
Thanks for this link. Most of the data I deal with is in relational SQL
databases, so this looks very promising for my use cases. Maybe it's just me,
but most of the graphing libraries and APIs are geared towards JSON data.

------
dothebart
I very much like the way metrics 2.0 enhances the duo of collectd and
graphite ( [http://metrics20.org/](http://metrics20.org/) ). See the amazing
video showing how you can select across the data in the cluster at dieterbe's
employer.

------
MattHodge
If you are on Windows and want to send metrics to graphite, I created a set of
PowerShell functions to do it: [https://github.com/MattHodge/Graphite-PowerShell-Functions](https://github.com/MattHodge/Graphite-PowerShell-Functions)

------
NoCowLevel
Reminds me of an older post about how Graphite + StatsD can be a powerful tool
to measure everything. [https://codeascraft.com/2011/02/15/measure-anything-measure-...](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/)

~~~
dothebart
It should be mentioned that there is a huge number of different
implementations of the statsd pattern; depending on one's existing
infrastructure, one may prefer one or the other. Here's a comprehensive list
of them:
[http://www.joemiller.me/2011/09/21/list-of-statsd-server-imp...](http://www.joemiller.me/2011/09/21/list-of-statsd-server-implementations/)

------
pseudometa
How awesome would it be if they hired a great visual designer to add some
polish to the overall look of the graphs and software?

------
tschellenbach
:) I gave a presentation about this as well. Brilliant approach; we do the
same for getstream.io.

