
Riemann – A network monitoring system - jonbaer
http://riemann.io/
======
falcolas
Evaluated it, and ultimately rejected it for a couple of reasons:

\- You must pick up Clojure to understand and configure Riemann (we're not a
Clojure shop, so this is a non-trivial requirement)

\- Config file isn't a config file, it's an executed bit of Clojure code

\- Riemann is not a replacement for an alerting mechanism, it's another signal
for alerting mechanisms (though since it's Clojure and the configuration file
is a Clojure script, you can absolutely hack it into becoming an alerting
system)

\- Riemann is not a replacement for a trend graphing mechanism.

\- There are other solutions which can be piped together to get the 80% of the
functionality we wanted from Riemann (Graphite + Skyline) in much less
invested time

Skyline link:
[https://github.com/etsy/skyline](https://github.com/etsy/skyline)

~~~
dozzie
> Config file isn't a config file, it's an executed bit of Clojure code

For stream processing engines, configuration _will_ be code. Unfortunate, but
unavoidable.

> Riemann is not a replacement for an alerting mechanism

> Riemann is not a replacement for a trend graphing mechanism.

Indeed it is not. It's misadvertised as a monitoring solution, while it's a
stream processing engine.

What I think of it is that you're supposed to build a monitoring system on top
of a stream processing engine. It's a pity Riemann doesn't allow subscribing to
its streams from the outside, so to add any message destination you need to
update its config.
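
To make "configuration will be code" concrete, here is a minimal Python sketch of the stream-processing idea. It is not Riemann's actual API: a stream is simply a function that takes an event and hands it to its child streams. The names `where` and `pipeline`, and the event fields `host` and `metric`, are assumptions made up for illustration.

```python
def where(predicate, *children):
    """Build a stream that forwards an event to its children only if predicate(event) is true."""
    def stream(event):
        if predicate(event):
            for child in children:
                child(event)
    return stream


# A destination is just another function taking an event; here it appends to a list.
alerts = []

# "Configuration" is the act of composing streams, which is why it ends up being code.
pipeline = where(lambda e: e["metric"] > 0.9, alerts.append)

events = [
    {"host": "web1", "metric": 0.42},
    {"host": "db1", "metric": 0.95},
]
for event in events:
    pipeline(event)

print(alerts)
```

Adding a new destination means composing another child into `pipeline`, which mirrors the point about having to update the config to add a message destination.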

~~~
AdamN
It seems like having a small tool to turn yaml files into basic clojure code
for easy rulesets would be an easy extension. It might encourage bad behavior
and of course it couldn't do everything ... just an idea.

~~~
retrogradeorbit
That would actually be quite easy to do in clojure without the need for an
external tool by writing a clojure macro that reads the other format and emits
the s-expressions that represent it.
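
A rough sketch of the "read another format, emit s-expressions" idea, written in Python as an external generator rather than as a Clojure macro. It assumes the rules have already been parsed (for example from YAML) into nested lists. The rule names `when-above` and `notify` are placeholders, not functions from Riemann, and the quoting convention (first list item is a symbol, strings starting with `:` are keywords) is an assumption of this sketch.

```python
import json


def to_sexp(node):
    """Render a nested list structure as an s-expression string."""
    if isinstance(node, (list, tuple)):
        if not node:
            return "()"
        head, *rest = node
        # The first element is treated as a symbol and emitted unquoted.
        parts = [str(head)] + [to_sexp(child) for child in rest]
        return "(" + " ".join(parts) + ")"
    if isinstance(node, bool):  # must be checked before int, since bool is an int subclass
        return "true" if node else "false"
    if isinstance(node, (int, float)):
        return str(node)
    if isinstance(node, str):
        if node.startswith(":"):
            return node  # keyword, emitted as-is
        return json.dumps(node)  # plain string literal with basic escaping
    raise TypeError(f"unsupported node: {node!r}")


# Hypothetical rule, as it might look after loading a YAML file.
rule = ["when-above", ":metric", 0.9, ["notify", "ops@example.com"]]
print(to_sexp(rule))
```

As noted above, a generator like this only covers whatever the rule format can express; anything beyond that still needs hand-written Clojure.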

------
foolano
Riemann is great. We use it at work.

I like it so much that I did an experiment to implement it in C++

[https://github.com/juruen/cavalieri](https://github.com/juruen/cavalieri)

My implementation sucks, but I had a lot of fun working on it and I got to
learn how Riemann works better.

~~~
Cyph0n
Good job on the README!

~~~
junto
Went to check it out, expecting something funny.

Was surprised to see the most comprehensive and well-written documentation
I've ever seen on GitHub!

------
jamtur01
Love Riemann. So much that I'm writing a book about monitoring with Riemann as
the core routing engine:

[http://artofmonitoring.com/](http://artofmonitoring.com/)

There's a sample chapter available, which covers the initial Riemann
implementation and a Clojure "getting started" guide which should help anyone
- even if you're not interested in the rest of the book! :)

[http://artofmonitoring.com/TheArtOfMonitoring_sample.pdf](http://artofmonitoring.com/TheArtOfMonitoring_sample.pdf)

------
shizcakes
We've been using Riemann since 2013 and love it!

If you're coming from Nagios (or not), and you'd like something that will
schedule Nagios event scripts (and others) and send them to Riemann, I have
been using this in production since mid-2013:
[https://github.com/bmhatfield/riemann-sumd](https://github.com/bmhatfield/riemann-sumd)

It allows you to tap into the huge ecosystem that is Nagios monitors, without
requiring any other Nagios component at all. It just translates the output
into a Riemann event.

~~~
charlieflowers
We also use Riemann in prod and love it. It is pretty much the perfect
switchboard/aggregator for stat streams.

It's only part of the stack, but it's great for routing some stats to this
TSDB and other stats to that TSDB. It's also great for detecting anomalies and
sending updates to wherever you want them to go.

For us, this happens in 150 lines of clojure plus 150 lines of unit tests. I
know that's fairly meaningless without knowing more about our system -- but
the point is, it's very expressive so you get a lot done in a few lines of
code. And therefore, don't worry too much about it being clojure.

------
donaldguy
Since the theme of this thread seems to be non-clojure alternatives, I'll
point in the direction of InfluxData (formerly InfluxDB)'s new-ish Kapacitor
project.

While I'm unconvinced their custom JavaScripty DSL (TICKscript) is actually
preferable to Clojure or even can be read without careful, quite LISP-like
indentation, it is pretty similar in basic functionality to Riemann and is
definitely not-Clojure*

see: [https://influxdata.com/blog/announcing-kapacitor-an-open-sou...](https://influxdata.com/blog/announcing-kapacitor-an-open-source-streaming-and-batch-time-series-processor/)

[https://docs.influxdata.com/kapacitor/v0.2/tick/](https://docs.influxdata.com/kapacitor/v0.2/tick/)

[https://github.com/influxdata/kapacitor](https://github.com/influxdata/kapacitor)

*at worst, it's a mangled subset of clojure with extraneous dots and the parentheses in the wrong places :-)

------
moomin
Reminder that the author of Riemann is available for consulting and is a
scary-smart human being based around SF.

~~~
romanhn
Specifically, it was written by Kyle Kingsbury, aka aphyr of Jepsen fame.

~~~
VonGuard
He consults on Riemann? I was under the impression he wanted nothing to do
with helping people use it, and is just a general consultant. Like, he's not
trying to support Riemann, he built it because he needed it. Maybe he's
changed his mind?

------
dorfsmay
Note that this isn't a better Nagios... It's more like Apache Storm. Think of
it as grep on steroids. I've heard of several thousand events per second on a
single server. The other advantage is that it uses a well-known programming
language rather than a DSL.

~~~
dozzie
It's actually a disadvantage, since it's Clojure, and Riemann requires that
its operator _actually knows_ Clojure (knowledge of a dozen languages, even when
some of them are functional, is not enough).

~~~
Anderkent
Is requiring someone to learn a programming language worse than requiring
someone to learn a custom DSL? Seems a strange assertion.

~~~
dozzie
Clojure is a bigger language than a small custom DSL.

~~~
mtrimpe
The parts of Clojure required to write a config file are probably roughly the
same size as a custom DSL though.

Maps and lists are gonna be maps and lists...

~~~
dozzie
Except for Clojure's data model. You'll still need to understand what
apostrophe ("'") does and what keyword (":foo") is.

~~~
outworlder
You can do that in five minutes. You can easily spend days trying to
understand some arcane, ad hoc configuration syntax that someone came up with.
And still get the non-trivial cases wrong.

It's a good thing that it's Clojure and not some freakish Turing-complete
XML-based configuration file.

------
pyritschard
I'm very happy to see riemann featured here. I've been using it in production
since early 2012 and contributed to it (intensively, at one point) as well.

It's been a breeze, rather worry free and its very good collectd support has
enabled us to cover very interesting use cases at Exoscale.

~~~
anth1y
I'm just getting started with riemann and I'm also learning clojure at the
same time. How did you get buy-in from your company to use riemann?

~~~
pyritschard
I co-founded it :-)

------
FlorianOver
Used in production without a hassle for 2 years. For our setup and scenario it
was a very fitting and good solution. Easy setup and easy usage. Recommended.

~~~
panchicore2
Can you give the top 3 use cases and how it solves the problem?

------
pkd
Clojure. I think I should stop thinking and really learn that language now. It
seems very tempting as a first functional language.

~~~
yogthos
It's an excellent beginner FP language. Clojure is fairly minimal and requires
learning only a few concepts to become productive. At the same time it
continues to see more use in the industry, so it's one of the easier FP
languages to actually get a job with as well.

I wrote a starter guide that might be helpful
[https://yogthos.github.io/ClojureDistilled.html](https://yogthos.github.io/ClojureDistilled.html)

There's also a free book that's very good
[http://www.braveclojure.com/](http://www.braveclojure.com/)

~~~
pkd
Thanks for compiling such a great resource. I have had a look at it, and will
go through it this weekend. I have also been doing some 4clojure in my spare
time.

Although Brave and True looks like a great book, it's not for me. I would love
a book which blazes me through things rather than hand holding me through
every concept. I have been looking at Living Clojure and will probably get
that.

------
lafay
"Network monitoring system" makes me think of routers, switches, Mbps / Gbps,
etc. This seems more like a "server monitoring system."

~~~
dozzie
Nope as well. It's more a stream processing engine. Icinga/Nagios or Zabbix
are (inflexible) specializations of such an engine.

------
adam-_-
The presentation video on the homepage is amazingly engaging. It's well worth
a watch.

------
elementai
Evaluated it and a couple of others. Chose Bosun, never looked back, it's
probably the only system with a flexible and concise DSL for evaluating alerts
(somewhat similar to R in spirit).

Riemann rocks, just not as a monitoring system.

Bosun link: [http://bosun.org/](http://bosun.org/)

~~~
mVChr
We evaluated both Bosun and Riemann for our use case and chose Riemann. Use
case is over 10,000 servers with over 2 million metrics incoming every XX
seconds. Riemann simply performs better, probably because it uses streams
instead of a poller. The other part of our use case is to use it primarily for
alerting, and it's fantastically responsive and robust for that, including
simple integration with other systems (email, Slack, nagios, PagerDuty, etc).
Bosun seems like a good tool for other uses however.

------
hendry
Anyone use [http://prometheus.io/](http://prometheus.io/) ? wdyt

~~~
MayMuncher
I don't like it since the master basically does a curl on each server to get
the info, so it doesn't work behind a firewall without tons of issues

~~~
zimpenfish
That's the reason I haven't bothered with Prometheus - pull systems make no
sense to me since you have to configure a single place with perfect* knowledge
of your system rather than just pushing local knowledge to a collector.

(Ok, you do need some knowledge at the parent if you want to raise alerts but
you'd need that anyway.)

~~~
bbrazil
You need a place with all that information anyway, otherwise how do you alert
on something being missing?

If you can't easily do that with your existing infrastructure, you should fix
that first. I've written about this at
[http://www.robustperception.io/you-look-good-have-you-lost-m...](http://www.robustperception.io/you-look-good-have-you-lost-machines/)

~~~
zimpenfish
If a stream of data stops appearing, then you can alert. You don't need to
pre-configure the existence of that stream (although if you might never get
data from an object, this obviously is a failure you won't catch.)

Your article is a good one but in my experience, many companies are still many
years from being able to implement that kind of database:machine knowledge
consistency.

~~~
bbrazil
You can only alert in that case if the data starts, and there are no transient
issues preventing your monitoring from working around the time it stops.

Alerting based on state changes is fragile; it's better to compare against
what you expect to see.
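
A minimal Python sketch of the difference, under made-up assumptions: `last_seen` maps each host to the timestamp of its latest event, `expected` is the inventory of hosts that should be reporting, and the function names and the 30-second threshold are hypothetical.

```python
def stale_hosts(last_seen, now, max_age):
    """Hosts whose stream stopped: they reported at least once, but not recently."""
    return sorted(h for h, t in last_seen.items() if now - t > max_age)


def missing_hosts(expected, last_seen, now, max_age):
    """Hosts from the expected inventory that are stale or have never reported."""
    return sorted(
        h for h in expected
        if h not in last_seen or now - last_seen[h] > max_age
    )


expected = {"web1", "web2", "db1"}
last_seen = {"web1": 100.0, "web2": 40.0}  # db1 never sent a single event

print(stale_hosts(last_seen, now=105.0, max_age=30.0))
print(missing_hosts(expected, last_seen, now=105.0, max_age=30.0))
```

Only the second check can flag `db1`, since a host that never started sending data has no stream to go stale; the price is having to maintain the `expected` inventory somewhere.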

------
zgohr
Monitoring request - dynamic configuration. Using Nagios requires
configuration file changes and service reload. Would prefer dynamic, on the fly
configuration. Even better, configuration that can be adjusted externally,
possibly from a web API endpoint. Better still, configuration that IS
external, where the monitoring service queries an external service every X
minutes to determine what to monitor.

Performant - In the realm of 6 OID monitoring of 50,000+ devices in 5 minute
intervals

~~~
cliffmoon
I'm working on a startup that does exactly this. It's AWS only and in private
beta right now: [https://opsee.com/](https://opsee.com/)

------
MayMuncher
I've been using Riemann at work for the past 6 months or so. There is a
learning curve if you don't know Clojure; that has been the biggest hurdle for
me. But I love what it does, and it's fantastic along with the dashboard it
comes with.

------
ultramancool
How does a system like this differ from say, the ElasticSearch+Logstash
system?

~~~
bbrazil
Logs and metrics serve different needs, I wrote recently on this:
[https://blog.raintank.io/logs-and-metrics-and-graphs-oh-my/](https://blog.raintank.io/logs-and-metrics-and-graphs-oh-my/)

------
porker
Anyone able to give a quick overview of why they use Riemann vs SaaS
offerings, and the distinguishing features (other than cost/control)?

I like the look of the config structure, being Clojure.

------
bluedonuts
We use Riemann to track all requests on APIs amongst other things. Have become
rather fond of it. It happily handles 40k events p/s on rather modest
hardware. The dashboard is not the prettiest but it is functional. Overall it
has proved invaluable for spotting anomalies in our metrics.

------
jordigh
Hm, why the name? I don't get it. Riemann wasn't known for being a good
monitor or sentry.

~~~
devonkim
Perhaps the idea is related to Riemann sums being an approximation of areas
under curves where the curve is supposed to represent reality? Either that or
that the Riemann hypothesis is an important problem that lots of people have
tried to solve but have really failed.

------
avodonosov
What is the purpose of such a system? Can anyone describe at least one solid
use case?

------
acd
Very interesting project in monitoring! Thanks for creating it! Seems like new
fresh ideas :).

------
herbst
Why Java and Ruby?

------
wuliwong
What is the connection between the name Riemann and the product?

------
la6470
All new monitoring systems should be named YAMF-X (Yet Another Monitoring
Framework)

Jokes aside... while the stream processing on events seems powerful, there was
something similar in Graphite, but probably not as advanced or easy to use.
However, the push approach brings its own limitations, especially on existing
setups.

------
peterwwillis
A Java version of Ganglia?

------
uberneo
[https://nodequery.com/](https://nodequery.com/) \-- this is the lightweight
version, not as extensive as Riemann but good for normal monitoring and
notifications.

~~~
bru
No, that's really different.

\- nodequery source (and ecosystem) is closed

\- "periodically store various system data" "minute" VS "low-latency"

\- email notifications only

\- no query language, but a JSON API → overhead

Riemann looks more free (libre as well as liberty of usage, you can just
monitor anything), more in-depth (does more with lower latency objectives) and
better designed (DSL for rules, protobuf over TCP and UDP instead of JSON
over HTTP over TCP).

