Riemann – A network monitoring system (riemann.io)
340 points by jonbaer on Jan 19, 2016 | 104 comments



Evaluated it, and ultimately rejected it for a couple of reasons:

- You must pick up Clojure to understand and configure Riemann (we're not a Clojure shop, so this is a non-trivial requirement)

- Config file isn't a config file, it's an executed bit of Clojure code

- Riemann is not a replacement for an alerting mechanism, it's another signal for alerting mechanisms (though since it's Clojure and the configuration file is a Clojure script, you can absolutely hack it into becoming an alerting system)

- Riemann is not a replacement for a trend graphing mechanism.

- There are other solutions which can be piped together to get the 80% of the functionality we wanted from Riemann (Graphite + Skyline) in much less invested time

Skyline link: https://github.com/etsy/skyline


We use Riemann at Two Sigma to monitor/alert/heal our Mesos cluster [1], precisely because of the above reasons for rejecting it.

>- You must pick up Clojure to understand and configure Riemann (we're not a Clojure shop, so this is a non-trivial requirement)

>- Config file isn't a config file, it's an executed bit of Clojure code

This is actually great -- static files quickly become their own franken-languages, with code generating config files.

>- Riemann is not a replacement for an alerting mechanism, it's another signal for alerting mechanisms (though since it's Clojure and the configuration file is a Clojure script, you can absolutely hack it into becoming an alerting system)

>- Riemann is not a replacement for a trend graphing mechanism.

You probably don't want another alerting mechanism; you probably already have pagerduty or something else -- what you want is a rich way to create the alert.

[1] https://github.com/twosigma/satellite


> You probably don't want another alerting mechanism; you probably already have pagerduty or something else -- what you want is a rich way to create the alert

This is the heart of why we use Riemann. When we first started using it 2 years ago, we had thousands of different types of error emails per day (due to monitoring thousands of retail stores, all with their quirks). Because Riemann config is just code, we were able to build systems and abstractions on top of it for describing the various error types and their semantics. E.g., if 500s are being returned from service A, only alert us if > 1% of those requests failed in the last 2 minutes. You can get these kinds of rules in something like Nagios, but if you want customization, you have to deal with plugins. Here, it's just code. If we don't like it, we change it. The result is that there's no excuse to set up Gmail filters. You can ensure that all errors are actionable.
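A hedged sketch of what that kind of rule can look like in a Riemann config (the service name, `:status` field, threshold, and address are all invented; `email` would be defined elsewhere with `mailer`):

```clojure
(streams
  (where (service "service-a.requests")
    ; Collect 2 minutes of request events, then compute the error ratio.
    (fixed-time-window 120
      (smap (fn [events]
              (let [total  (max 1 (count events))
                    errors (count (filter #(= "500" (:status %)) events))]
                {:service "service-a.error-rate"
                 :state   "critical"
                 :metric  (double (/ errors total))}))
            ; Only page when more than 1% of requests in the window failed.
            (where (> metric 0.01)
              (email "oncall@example.com"))))))
```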


> Config file isn't a config file, it's an executed bit of Clojure code

For stream processing engines, configuration will be code. Unfortunate, but unavoidable.

> Riemann is not a replacement for an alerting mechanism

> Riemann is not a replacement for a trend graphing mechanism.

Indeed it is not. It's misadvertised as a monitoring solution when it's really a stream processing engine.

The way I think of it, you're supposed to build a monitoring system on top of a stream processing engine. It's a pity Riemann doesn't let you subscribe to its streams from the outside, so to add any message destination you need to update its config.


It seems like having a small tool to turn YAML files into basic Clojure code for easy rulesets would be an easy extension. It might encourage bad behavior, and of course it couldn't do everything ... just an idea.


That would actually be quite easy to do in Clojure without the need for an external tool, by writing a Clojure macro that reads the other format and emits the s-expressions that represent it.
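A hedged sketch of the same idea without even needing a macro: since the config file is ordinary Clojure, a plain function can expand declarative rule data (which could just as well be parsed out of a YAML file with something like clj-yaml) into streams at load time. The rule schema and addresses here are invented:

```clojure
; Rules as plain data -- this vector could come from a YAML or EDN file.
(def rules
  [{:service "api.latency" :threshold 0.5  :notify "ops@example.com"}
   {:service "api.errors"  :threshold 0.01 :notify "oncall@example.com"}])

; Build one threshold stream per rule and register them all.
(apply streams
  (for [{:keys [service threshold notify]} rules]
    (where* (fn [e] (and (= (:service e) service)
                         (> (:metric e 0) threshold)))
      (email notify))))
```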


> For stream processing engines, configuration will be code. Unfortunate, but unavoidable.

I honestly don't think it's unavoidable, so long as you separate the configuration (i.e. hosts, thresholds, outputs, etc) from your processing logic. Of course, this requires additional development work from within the "configuration" file.


But for a stream engine the processing logic is configuration.

There aren't many examples of code being a configuration parameter for a service (a generic RPC server for sysadmins being another example I've encountered), but there are some.


> For stream processing engines, configuration will be code. Unfortunate, but unavoidable.

How so?

Kafka is a stream processing engine that uses plain old Zookeeper data structures for config.

Edit: Kafka also seems to have the missing features you mentioned if Riemann should be taken seriously as a general-purpose stream processing engine.


I would argue Kafka isn't a stream processing engine so much as a stream shipping engine. Kafka barely looks at the content of your messages.

I'd also argue that Zookeeper nodes are anything but "plain" :)


That was my problem with Riemann. I love its core but I really want something built on top of it. Basically a Jenkins of monitoring (since Clojure is JVM). I contemplated building it (i.e. taking Jenkins' plugin system as inspiration) but it was just way too much work.


Go any further than that and you have Yahoo Pipes or IFTTT


IME configs often end up being Turing-complete; if so, it's better to have them in a real programming language where you at least have tools available to manage the complexity.


The problem with configs (and this is with my operations hat on), is that they are rarely as well secured (or reviewed) as regular code, so code based configuration files pose a significant privilege escalation threat on production servers.

With my programmer's hat on, they're also harder to populate programmatically, so I have a hard time justifying their use.


> The problem with configs (and this is with my operations hat on), is that they are rarely as well secured (or reviewed) as regular code, so code based configuration files pose a significant privilege escalation threat on production servers.

Any complex config file runs that kind of risk though, whether it's in a well-known programming language or an ad-hoc DSL. My preferred approach is to include most of the config in the regular code (subject to the normal review/release process), with the only thing on the server being a one-line "which config to use" setting (e.g. dev/stag/prod). Of course that has its own problems.

> With my programmer's hat on, they're also harder to populate programmatically, so I have a hard time justifying their use.

Not at all true in the case of Clojure - it's just S-expressions, very easy to write, parse or modify programmatically. I agree that a config structure should have good programmatic access, but to my mind that's an argument for using a language with a good metamodel rather than anything else.
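A small illustration of the point (threshold values invented): because the config is plain s-expressions, the standard reader and clojure.walk can rewrite it without any custom parser.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.walk :as walk])

; Read a config file as data, not as code.
(def config
  (edn/read-string
    "(streams (where (> metric 0.5) (email \"ops@example.com\")))"))

; Programmatically bump the threshold from 0.5 to 0.7.
(walk/postwalk #(if (= % 0.5) 0.7 %) config)
;; => (streams (where (> metric 0.7) (email "ops@example.com")))
```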


> Any complex config file runs that kind of risk though, whether it's in a well-known programming language or an ad-hoc DSL.

The major difference is that Clojure (Python, Lua, Perl, et al) gives you all the tools right out of the box, whereas with a DSL you should be severely restricted from doing things like reading/writing to disk, making network calls, or executing other binaries.

Granted, there are possibly ways to break out of the sandbox, but it's the difference between giving the thief a set of master keys and $50 for a U-Haul and making them work to enter every safe you have on the premises.

/me takes off the tinfoil hat


That sounds like a security-through-obscurity approach to me. (No doubt others would call it defense in depth).


How is "don't give a config file an arbitrary writable open() call" security by obscurity? What is being hidden? That's not really how that term works. I also don't understand your invocation of defense in depth or the (wrong) comparison you are trying to make. Can you reframe your rebuttal without loaded security terms that don't fit what you're saying?

The point GP is making, and with which I agree, is that executable configurations can be dangerous if not sandboxed and even then still carry an elevated risk versus a parser. We are speaking relatively; it is absolutely still a risk to parse user input as a config, but less so than a full programming environment being immediately available to a malicious config writer.

Stepping back and identifying the malicious vector is worth it here, though, as there's a case to be made that configurations are the domain of administrators and should be secured accordingly via external means. Then the problem is recentered.


If it's code, the only way to evaluate it is to run it. That makes it very hard to reason about at scale. ("Which URLs go to load balancer x with SSL".) Not a great idea, in my opinion.


I'd be really hesitant recommending Skyline, since Etsy has declared it and the rest of Kale a failure.

https://vimeo.com/131581331


We use it (well, a derivative of it) to great success. The trend monitoring has proven invaluable at early detection of problems, at a level where pure thresholds would produce much more noise than signal.


What derivative are you using?


The falcolas derivative. Working on getting it open sourced and released... gotta love bureaucracy.


We rejected Riemann as well because there was just too much overlap with other tools and the user interface was not very good.

I actually think Clojure is a huge selling point.. seriously you should see the crap that Rackspace has https://www.rackspace.com/knowledge_center/article/alarm-lan... which I'm ashamed to say we use (the Lua monitors are cool, though, and it's free monitoring infrastructure).. and yes it's not the same as Riemann, as Riemann is not exactly just an alerting tool.

And that is sort of the problem.. Riemann is a tool that does one thing really well but doesn't have that good of a UI.. sadly we want prettier graphs and a less granular tool.. a better Nagios.


I spent over a day wrestling to get Skyline to work, but it's so outdated, and all of its dependencies have moved on since it was launched, that it's an absolute nightmare to even get running.

To be fair, though, it does say it's no longer maintained.


> Config file isn't a config file, it's an executed bit of Clojure code

What exactly here is the problem for you? That config is not a config, or that it's specifically Clojure?


skyline is abandoned though lol


Riemann is great. We use it at work.

I like it so much that I did an experiment to implement it in C++

https://github.com/juruen/cavalieri

My implementation sucks, but I had a lot of fun working on it and it helped me learn how Riemann works.


Good job on the README!


Went to check it out, expecting something funny.

Was surprised to see the most comprehensive and well-written documentation I've ever seen on GitHub!


No joke.


Love Riemann. So much that I'm writing a book about monitoring with Riemann as the core routing engine:

http://artofmonitoring.com/

There's a sample chapter available, which covers the initial Riemann implementation and a Clojure "getting started" guide which should help anyone - even if you're not interested in the rest of the book! :)

http://artofmonitoring.com/TheArtOfMonitoring_sample.pdf


We've been using Riemann since 2013 and love it!

If you're coming from Nagios (or not), and you'd like something that will schedule Nagios event scripts (and others) and send them to Riemann, I have been using this in production since mid-2013: https://github.com/bmhatfield/riemann-sumd

It allows you to tap into the huge ecosystem that is Nagios monitors, without requiring any other Nagios component at all. It just translates the output into a Riemann event.
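For context, a Riemann event is just a map; a Nagios-style check result such as "OK - load average: 0.10" with exit code 0 would translate to something shaped roughly like this (field values are illustrative):

```clojure
{:host    "web-01"        ; where the check ran
 :service "check_load"    ; the Nagios check name
 :state   "ok"            ; exit code 0 -> ok, 1 -> warning, 2 -> critical
 :metric  0.10            ; parsed from the check's performance data
 :ttl     120             ; seconds before Riemann considers it stale
 :tags    ["nagios"]}
```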


We also use Riemann in prod and love it. It is pretty much the perfect switchboard/aggregator for stat streams.

It's only part of the stack, but it's great for routing some stats to this TSDB and other stats to that TSDB. It's also great for detecting anomalies and sending updates to wherever you want them to go.

For us, this happens in 150 lines of clojure plus 150 lines of unit tests. I know that's fairly meaningless without knowing more about our system -- but the point is, it's very expressive so you get a lot done in a few lines of code. And therefore, don't worry too much about it being clojure.
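A hedged sketch of that switchboard pattern (hosts and service prefixes invented; `graphite` comes from riemann.graphite):

```clojure
(def tsdb-a (graphite {:host "tsdb-a.internal"}))
(def tsdb-b (graphite {:host "tsdb-b.internal"}))

(streams
  ; Route by service name: app metrics to one TSDB, infra to another.
  (where (service #"^app\.")   tsdb-a)
  (where (service #"^infra\.") tsdb-b))
```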


Since the theme of this thread seems to be non-clojure alternatives, I'll point in the direction of InfluxData (formerly InfluxDB)'s new-ish Kapacitor project.

While I'm unconvinced their custom JavaScripty DSL (TICKscript) is actually preferable to Clojure or even can be read without careful, quite LISP-like indentation, it is pretty similar in basic functionality to Riemann and is definitely not-Clojure*

see: https://influxdata.com/blog/announcing-kapacitor-an-open-sou...

https://docs.influxdata.com/kapacitor/v0.2/tick/

https://github.com/influxdata/kapacitor

*at worst, it's a mangled subset of clojure with extraneous dots and the parentheses in the wrong places :-)


Reminder that the author of Riemann is available for consulting and is a scary-smart human being based around SF.


Specifically, it was written by Kyle Kingsbury, aka aphyr of Jepsen fame.


He consults on Riemann? I was under the impression he wanted nothing to do with helping people use it, and is just a general consultant. Like, he's not trying to support Riemann, he built it because he needed it. Maybe he's changed his mind?


Note that this isn't a better Nagios... It's more like Apache Storm. Think of it as grep on steroids. I've heard of several thousand events per second on a single server. The other advantage is that it uses a well-known programming language rather than a DSL.


It's actually a disadvantage, since it's Clojure, and Riemann requires that its operator actually knows Clojure (knowledge of a dozen languages, even when some of them are functional, is not enough).


Is requiring someone to learn a programming language worse than requiring someone to learn a custom DSL? Seems a strange assertion.


A full programming language is likely more effort to learn than a custom DSL. If you're in a position where you can reuse the full programming language then it's probably better, but if you're never going to use that programming language for anything other than this one product then the DSL is probably better.


The problem with immature DSLs is that you spend a lot of time figuring out whether the bug is in the DSL or in your use of it, plus there's a lesser network effect: because they are less used, they are more buggy, there is less information and help about them out there, and there are fewer tools you can use.


Clojure is a bigger language than small custom DSL.


The parts of Clojure required to write a config file are probably roughly the same size as a custom DSL though.

Maps and lists are gonna be maps and lists...


Except for Clojure's data model. You'll still need to understand what the apostrophe ("'") does and what a keyword (":foo") is.
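Both fit in a couple of lines, for what it's worth:

```clojure
'(1 2 3)            ; quote: treat the form as data instead of calling 1
:foo                ; keyword: a self-evaluating name, usually a map key
(:foo {:foo "bar"}) ; keywords also act as lookup functions => "bar"
```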


You can do that in five minutes. You can easily spend days trying to understand some arcane, adhoc configuration syntax that someone came up with. And still get the non-trivial cases wrong.

It's a good thing that it's Clojure and not some freakish turing-complete XML-based configuration file.


Custom DSLs eventually pick up those things, anyway.

Greenspun's 10th rule? http://c2.com/cgi/wiki?GreenspunsTenthRuleOfProgramming


I had forgotten about this... Thanks!


I'm very happy to see riemann featured here. I've been using it in production since early 2012 and contributed to it (intensively, at one point) as well.

It's been a breeze, rather worry free and its very good collectd support has enabled us to cover very interesting use cases at Exoscale.


I'm just getting started with Riemann and I'm also learning Clojure at the same time. How did you get buy-in from your company to use Riemann?


I co-founded it :-)


Used in production without a hassle for 2 years. For our setup and scenario it was a very fitting and good solution. Easy Setup and easy usage. Recommended.


can you give the top 3 use cases and how it solves the problem?


Clojure. I think I should stop thinking and really learn that language now. It seems very tempting as a first functional language.


It's an excellent beginner FP language. Clojure is fairly minimal and requires learning only a few concepts to become productive. At the same time it continues to see more use in the industry, so it's one of the easier FP languages to actually get a job with as well.

I wrote a starter guide that might be helpful https://yogthos.github.io/ClojureDistilled.html

There's also a free book that's very good http://www.braveclojure.com/


Thanks for compiling such a great resource. I have had a look at it, and will go through it this weekend. I have also been doing some 4clojure in my spare time.

Although Brave and True looks like a great book, it's not for me. I would love a book which blazes me through things rather than hand-holding me through every concept. I have been looking at Living Clojure and will probably get that.


If you want to learn foundational FP topics but don't have the time to commit to learning a whole new ecosystem, I really recommend Fogus's Functional Javascript: http://www.amazon.com/Functional-JavaScript-Introducing-Prog...

I know that JS isn't the same thing as Clojure but the ideas in this book work with really any language. After reading this book I'm a better Python programmer.


Functional means different things to different people. By all means learn Clojure, but also learn a functional language with a first-class type system (e.g. Haskell).


Ah, Haskell. I find it equal parts fascinating and scary. I definitely won't start at Haskell, although I would surely want to come to it at some point of time. Static typing looks nice.


OCaml might be a friendlier option (though you'll want HKTs eventually, and the syntax is a bit ugly), or maybe Ceylon if you're willing to go a bit less mainstream. (I'm a Scala programmer myself, but I can't really recommend it unless you're already familiar with the JVM and its oddities, there are a lot of warts that have to be there for Java compatibility).

But yeah, there's the Lisp tradition and the typed tradition and they're almost entirely separate, but through accidents of history we call them both "functional". So in the same way that you'd learn an OO language and a functional language, I'd say it's worth learning one of each.


I would look at it as an OK language for lightweight scripting of Java, but some areas, like how bad the tracebacks are, how well many of the Clojure-specific libraries are fleshed out, or the quality of docs, leave me with major concerns about the language.


Do it. At minimum, you'll have some very useful new perspectives.


"Network monitoring system" makes me think of routers, switches, Mbps / Gbps, etc. This seems more like a "server monitoring system."


Nope as well. It's more a stream processing engine. Icinga/Nagios or Zabbix are (inflexible) specializations of such an engine.


Servers are on the network as well... and the phrasing is more towards the "network" as an environment.


The presentation video on the homepage is amazingly engaging. It's well worth a watch.


Evaluated it and a couple of others. Chose Bosun, never looked back, it's probably the only system with flexible and concise DSL for evaluating alerts (somewhat similar to R in spirit).

Riemann rocks, just not as monitoring system.

Bosun link: http://bosun.org/


We evaluated both Bosun and Riemann for our use case and chose Riemann. Use case is over 10,000 servers with over 2 million metrics incoming every XX seconds. Riemann simply performs better, probably because it uses streams instead of a poller. The other part of our use case is to use it primarily for alerting, and it's fantastically responsive and robust for that, including simple integration with other systems (email, Slack, nagios, PagerDuty, etc). Bosun seems like a good tool for other uses however.


I've been using Riemann at work for the past 6 months or so. There is a learning curve if you don't know Clojure; that has been the biggest hurdle for me. But I love what it does, and it's fantastic along with the dashboard it comes with.


Anyone use http://prometheus.io/ ? wdyt


We switched from using Riemann to Prometheus, and we're very happy with the decision. Prometheus has been much easier to work with.


I don't like it since the master basically does a curl on each server to get the info, so it doesn't work behind a firewall without tons of issues


That's the reason I haven't bothered with Prometheus - pull systems make no sense to me since you have to configure a single place with perfect* knowledge of your system rather than just pushing local knowledge to a collector.

(Ok, you do need some knowledge at the parent if you want to raise alerts but you'd need that anyway.)


You need a place with all that information anyway, otherwise how do you alert on something being missing?

If you can't easily do that with your existing infrastructure, you should fix that first. I've written about this at http://www.robustperception.io/you-look-good-have-you-lost-m...


If a stream of data stops appearing, then you can alert. You don't need to pre-configure the existence of that stream (although if you might never get data from an object, this obviously is a failure you won't catch.)

Your article is a good one but in my experience, many companies are still many years from being able to implement that kind of database:machine knowledge consistency.
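For what it's worth, Riemann covers the "stream went quiet" case with TTLs and index expiry; a hedged sketch (TTL values and address invented):

```clojure
(periodically-expire 10)        ; scan the index every 10s for stale events

(streams
  (default :ttl 60 (index))     ; nothing seen for 60s -> the event expires
  (expired                      ; expired events flow back through streams
    (email "oncall@example.com")))
```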


You can only alert in that case if the data starts, and there's no transient issues preventing your monitoring working around the time it stops.

Alerting based on state changes is fragile, it's better to compare against what you expect to see.


You can leverage service discovery to make your configuration dynamic: http://prometheus.io/blog/2015/06/01/advanced-service-discov...


That's what the Prometheus push gateway is for. While prom itself is poll-based, you can extract data from servers behind a firewall by pushing metrics from an agent into the gateway. No crazy issues.


The general advice is to run Prometheus behind the firewall, you want as few things between your monitoring and what you're monitoring as possible. This makes it more resilient.


It supports push, though.


We're using Prometheus. Very happy, though the collection side of things isn't as smooth as we'd like.

The lack of a plugin system, and the reliance on HTTP to collect, means writing small collectors is a pain. You can't run 23 different daemons, each on their own port, to collect stats from things like PostgreSQL stats or RabbitMQ.

We opted for the "text directory" way, where we populate a directory with .prom files that node_exporter automatically picks up. It's not ideal: It means a whole bunch of collectors run via Cron jobs, which themselves need to be monitored; it means if we remove a collector, we also have to clean up its .prom files; and in the end it meant we had to invent our own little plugin system in order to avoid writing a lot of boilerplate code needed to manage the different collectors we use. We'd love to share our collectors, but due to that last point, our collectors are less reusable than we'd like.

Prometheus itself has been quite stable, but it still has some rough edges:

* If anything goes wrong with its database files, it tends to just crash, and the only way out is to wipe the entire database (e.g., see [1]).

* There's no way to do snapshots of the database.

* The team is rather cavalier about backwards compatibility. We've experienced at least one version upgrade where they changed the database format and didn't provide any upgrade tools, so people were forced to start their metrics history from scratch. I know that it's pre-1.0, but still, they knew perfectly well that people were running it in production. The alert manager was also written from scratch recently, with a whole new config format. With several releases, every tool has had its command line flags changed ever so subtly, too.

* The lack of packages (Debian/Ubuntu in our case) is also problematic. Fortunately I've got a script now that grabs a release and bundles a .deb from it, but I'd vastly prefer real releases.

* No syslog support is not acceptable in this day and age. Our Upstart scripts spawn a "logger" subprocess to catch stderr. Not everyone is running under Docker.

[1] https://github.com/prometheus/prometheus/issues/877


> The lack of a plugin system

We have many ways to plugin to Prometheus across the ecosystem, the textfile collector you're using is one of them.

> You can't run 23 different daemons, each on their own port, to collect stats from things like PostgreSQL stats or RabbitMQ.

There's no fundamental challenge with this approach. If you've got good basic infrastructure, particularly configuration management, the rollout of each should be a small operational task. If it's a major challenge, then your problem probably isn't with the Prometheus architecture.

> it means if we remove a collector, we also have to clean up its .prom files

There are several problems arising from this approach; this is one of them. You can also expect odd artifacts in graphs.

The textfile collector is only intended for machine-level metrics, by putting service level metrics in there you're missing out on a big win of Prometheus by thinking in terms of machines rather than services.

Fighting against the architecture means you're not getting the maximum benefits from Prometheus, this would be easier with exporters and service discovery.

> which themselves need to be monitored

Are you aware that the node exporter exports the mtime of all the textfile collector files? That's there to make monitoring of them easier.

> If anything goes wrong with its database files, it tends to just crash, and the only way out is to wipe the entire database

As far as we're aware, the only way that happens is if you run out of disk space. If you've evidence otherwise please let us know, so we can prioritize accordingly.

> We've experienced at least one version upgrade where they changed the database format and didn't provide any upgrade tools, so people were forced to start their metrics history from scratch. I know that it's pre-1.0, but still, they knew perfectly well that people were running it in production.

We broke backwards compatibility in the storage format once, and there are no plans to do so again. The core developers, who were all running it in production, didn't see it as worthwhile to write a converter, and no one else stepped up.

> The alert manager was also written from scratch recently, with a whole new config format.

The old alertmanager has always been flagged as very experimental, as it was a functioning PoC. The rewrite was always on the cards, and this came up regularly.

This is all part of evolving the system to be better for everyone. If we tried to keep perfect backwards compatibility then we couldn't remove warts, bugs and misfeatures. We aren't afraid to deprecate where it makes sense to do so, and have transition plans where practical.

> The lack of packages (Debian/Ubuntu in our case) is also problematic.

There are packages in Debian proper, and nightlies at http://deb.robustperception.io/

> No syslog support is not acceptable in this day and age.

That's in the latest versions.

The high-level problem is that there are so many different ways to do logging that we can't sanely support them all. For every X there is someone who thinks it's essential.


> If you've got good basic infrastructure [...]

We do have good basic infrastructure, thanks. We use Puppet and have a decent deploy system that performs atomic deploys from Git.

We also do think in terms of services. But the exporter has to run somewhere. About half of our exporters are machine-specific (reads local stats from files or proc or whatever), about half run on the Prometheus node itself and talk to services like ElasticSearch or Postgres.

The problem is the operational overhead of maintaining a dozen daemons per box, each with its own allocated port. It's not rocket science, just annoying. What could be a small script becomes something unnecessarily big. That's why we're sticking to "textfile" for now. When I have time, my plan is to write a small HTTP server that spawns plugins as subprocesses that emit their metrics via stdout, which seems like a much more reasonable, low-maintenance solution, and something node_exporter ought to support in the first place.


> (reads local stats from files or proc or whatever)

If there's useful stats from /proc we're missing, we accept PRs.

> about half run on the Prometheus node itself and talk to services like ElasticSearch or Postgres.

That doesn't sound right, those two should run on the ES/Postgres nodes.

Only things like the blackbox exporter and snmp exporter should be considered for running beside Prometheus.

> When I have time, my plan is to write a small HTTP server that spawns plugin as subprocesses that emit their metrics via stdout, which seems like a much more reasonable, low-maintenance solution, and something node_exporter ought to support in the first place.

We went with the textfile collector approach, rather than reinventing cron.


Sorry, I misspoke. Only non-machine-specific collectors run on the Prometheus machine.

> rather than reinventing cron.

But Prometheus already has a job interval setting. With text files emitted with cron jobs, you get the silly effect where metrics are produced at intervals which don't correspond with the job interval.



Monitoring request - dynamic configuration. Using Nagios requires configuration file changes and a service reload. Would prefer dynamic, on-the-fly configuration. Even better, configuration that can be adjusted externally, possibly from a web API endpoint. Better still, configuration that IS external, where the monitoring service queries an external service every X minutes to determine what to monitor.

Performant - in the realm of monitoring 6 OIDs on 50,000+ devices at 5-minute intervals.


I'm working on a startup that does exactly this. It's AWS only and in private beta right now: https://opsee.com/


How does a system like this differ from say, the ElasticSearch+Logstash system?


Logs and metrics serve different needs; I wrote recently on this: https://blog.raintank.io/logs-and-metrics-and-graphs-oh-my/


Anyone able to give a quick overview of why they use Riemann vs SaaS offerings, and the distinguishing features (other than cost/control)?

I like the look of the config structure, being Clojure.


We use Riemann to track all requests on APIs, amongst other things. Have become rather fond of it. It happily handles 40k events per second on rather modest hardware. The dashboard is not the prettiest, but it is functional. Overall it has proved invaluable for spotting anomalies in our metrics.


Hm, why the name? I don't get it. Riemann wasn't known for being a good monitor or sentry.


Perhaps the idea is related to Riemann sums being an approximation of areas under curves where the curve is supposed to represent reality? Either that or that the Riemann hypothesis is an important problem that lots of people have tried to solve but have really failed.


What is the purpose of such a system? Can anyone describe at least one solid use case?


Very interesting project in monitoring! Thanks for creating it! Seems like new fresh ideas :).


Why Java and Ruby?


What is the connection between the name Riemann and the product?


All new monitoring systems should be named YAMF-X (Yet Another Monitoring Framework)

Jokes aside.. while the stream processing on events seems powerful, there was something similar in Graphite, though probably not as advanced or easy to use. However, the push approach brings its own limitations, especially on existing setups.


A Java version of Ganglia?


https://nodequery.com/ -- this is the lightweight version, not as extensive as Riemann but good for normal monitoring and notifications.


No, that's really different.

- nodequery source (and ecosystem) is closed

- "periodically store various system data" "minute" VS "low-latency"

- email notifications only

- no query language, but a JSON API → overhead

Riemann looks more free (libre as well as liberty of usage: you can just monitor anything), more in-depth (does more, with lower latency objectives) and better designed (a DSL for rules, protobuf over TCP and UDP instead of JSON over HTTP over TCP).


Ah, unfortunately, it looks like you need to sign up to even see what features it has. Do you know of an alternative feature list or have some things off the top of your head you could say about it?



