Ask HN: Best monitoring system?
135 points by mspaulding06 on July 22, 2016 | 121 comments
At my company I'm considering switching us from Nagios to another monitoring system, and I'm starting to do some research. What's the best monitoring solution out there today? I'm pretty impressed by Prometheus, but I'd like to get some more opinions.

Prometheus.io is a modern, fresh monitoring system that I would check out if I were replacing a legacy system.

Also take a look at Riemann, which is a system monitoring tool written in Clojure. Riemann should be good for monitoring the latency of the system.

If it helps, here is a slide deck from Spotify on how they do their monitoring: https://www.netways.de/fileadmin/images/Events_Trainings/Eve...

I like Riemann and actually read The Art of Monitoring (https://artofmonitoring.com/), which was a great book. There are two big downsides for me on this one. First, you MUST build your monitoring solution from scratch and you MUST learn Clojure (which can be hard to get a whole team to agree to). Second, there's no alerting dashboard, which makes it difficult to see the overall state of the clusters you're dealing with. The only way you know there's a problem is if you get an email. Maybe there's a better way to handle that, but I wasn't able to find one.

You can alert via a number of mechanisms: email, PagerDuty, Slack, etc. (I talk about most of those in Chapter 9).

I have never been very keen on alerting dashboards, I find they are rarely actually reviewed and flash red for days or weeks. :) So I only covered metrics/graphing as a console rather than a status console. If you want to add such a console it'd be easy to output Riemann events via an API to such a console.

Glad you enjoyed the book!

> you MUST learn Clojure (which can be hard to get a whole team to agree to)

For any sufficiently advanced monitoring system, you're going to have to learn some form of DSL/language to take advantage of it.

I wouldn't consider this to be a major issue, more part of the irreducible complexity of the problem. The Riemann examples I've seen all seemed pretty readable, and routing alerts is not a world away from what you'd be doing in, say, a Prometheus Alertmanager config; just with S-expressions instead of YAML.

It's a major issue for me because of the time investment required to learn the language (functional programming isn't exactly second nature to most programmers) and the need to design the monitoring system from scratch. Instead of an out-of-the-box system you're getting more of a framework which allows you to build one. I actually like Clojure and agree that it is readable as well as fun to learn. But I think with something like Prometheus you do more simple configuration than monitoring system design.

I can only second Prometheus. It's modeled after Google Borgmon (and possibly Monarch) and excels in any flexible workload. Really great piece of engineering.

Riemann is a generic event processor. You can use it to generate alerts or aggregate metrics, but you still need something like collectd or telegraf to collect system/app stats from each machine and send it to Riemann.

Riemann comes with collection clients as well

It should be noted that Riemann is different from Nagios and Prometheus in that it's push-based. The other two are primarily pull-based.

Yep, I think that is Riemann's greatest advantage. I don't like the fact that with Prometheus I have to set up some sort of service discovery mechanism. Right now we're using Foreman to manage our clusters, so most likely I would have to use the file service discovery type and query the Foreman API to produce a YAML or JSON file to provide a list of static hosts.
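For anyone weighing the file-based route described above, here is a minimal sketch of the file-writing half. It assumes the host list has already been fetched from the inventory system (the Foreman query is not shown, and the host names, the `env` label, and the helper names are hypothetical). The output is a JSON list of target groups with `targets` and `labels` keys, which is the layout file-based discovery reads; port 9100 is just the usual node exporter default.

```python
import json
import os
import tempfile


def build_file_sd(hosts, port=9100, labels=None):
    """Build one target group in the JSON layout that file-based discovery reads."""
    targets = sorted(f"{host}:{port}" for host in hosts)
    return [{"targets": targets, "labels": dict(labels or {})}]


def write_file_sd(path, groups):
    """Write to a temp file, then rename, so a reader never sees a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)


# Hypothetical host list; in practice it would come from the inventory API.
groups = build_file_sd(["db2.example.com", "db1.example.com"], labels={"env": "prod"})
```

Running something like this from cron and pointing the discovery config at the resulting file is one way to avoid a dedicated discovery service, at the cost of the list being only as fresh as the last run.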

If your monitoring system doesn't have knowledge of what's supposed to be out there (like via service discovery), how do you know when something is broken/down/missing?

Riemann injects expire events into the event stream when it doesn't hear from a service for a while. Each event has a TTL that it uses to figure this out. Then of course you can alert on the expired events or take other actions.
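To make the TTL idea concrete, here is a rough Python sketch of the mechanism. This is not Riemann's code (Riemann is configured in Clojure); the class and method names are made up for illustration. Each event refreshes a deadline for its (host, service) pair, and a periodic sweep emits an "expired" event for anything whose deadline has passed.

```python
class ExpiryIndex:
    """Tracks the latest deadline per (host, service) and reports expirations."""

    def __init__(self):
        self._deadline = {}  # (host, service) -> time after which it is expired

    def update(self, host, service, ttl, now):
        # Every incoming event pushes the deadline forward by its TTL.
        self._deadline[(host, service)] = now + ttl

    def expire(self, now):
        # Collect keys whose deadline has passed, drop them, and emit events.
        expired = sorted(key for key, deadline in self._deadline.items() if deadline < now)
        for key in expired:
            del self._deadline[key]
        return [{"host": h, "service": s, "state": "expired"} for h, s in expired]


index = ExpiryIndex()
index.update("web1", "http", ttl=10, now=0)
index.update("web2", "http", ttl=60, now=0)
events = index.expire(now=30)  # web1 has gone quiet, web2 has not
```

The expired events can then be routed to the same alerting paths as any other event.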

How does this tell you that something is missing if it never reported, or that something is no longer meant to exist?

It's true, you can't tell if something is missing if it never reported. I guess it depends on your requirements. In a situation where your workloads are ephemeral you may not care as much if all of your services have reported as long as most have. In the case that something should no longer exist you could write that functionality into your Riemann configuration. As an example, I could write a "dead service" stream processor that is used to inform Riemann that a service should no longer exist. When a "dead service" event is injected into the event stream for a particular service, Riemann will note the dead service and ignore future events from it.
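The "dead service" idea could look roughly like the following, written in Python rather than Clojure purely for illustration. The `"dead"` state and the class name are hypothetical conventions, not something Riemann ships with. A marker event records the service as retired, and later events for that service are dropped.

```python
class DeadServiceFilter:
    """Drops events for services that were explicitly marked as retired."""

    def __init__(self):
        self._dead = set()

    def process(self, event):
        key = (event["host"], event["service"])
        if event.get("state") == "dead":  # hypothetical marker state
            self._dead.add(key)
            return None  # the marker itself is not forwarded
        if key in self._dead:
            return None  # ignore anything further from a retired service
        return event


stream = DeadServiceFilter()
first = stream.process({"host": "web1", "service": "http", "state": "ok"})
stream.process({"host": "web1", "service": "http", "state": "dead"})
later = stream.process({"host": "web1", "service": "http", "state": "critical"})
```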

If you want to hand off all the hard stuff about monitoring and get some easy to use, core functionality (graphing, alerts, dashboards) then my company https://www.serverdensity.com has been going 7+ years now.

For highly sophisticated environments then https://www.datadog.com is a very advanced product.

Both are based off my original agent: https://github.com/serverdensity/sd-agent (DataDog forked it in 2010 and we forked them back last year).

We're also behind http://www.humanops.com trying to build monitoring that also helps you run on-call and ops teams generally in a way that considers fatigue, stress and the realities of the humans running IT systems! E.g. https://blog.serverdensity.com/introducing-alert-costs/ and https://opzzz.sh

dmytton, you have done a FANTASTIC job on your pricing page (https://www.serverdensity.com/pricing/)! I just got screwed by DataDog's confusing pricing page (http://sagemath.blogspot.com/2016/07/datadogs-pricing-dont-m...). Yours is the ultimate example of what to do right. WOW.

Thanks! We just recently released this new pricing page as a test. We've done a few iterations over the years and patio11 did a great writeup at http://www.kalzumeus.com/2012/08/13/doubling-saas-revenue/

"generally in a way that considers fatigue, stress and the realities of the humans running IT systems!"

I've been pushing for that to be included in more scheduling systems for production workers. Very awesome to see it considered in your work! :)

I've used Nagios and Icinga2, and I've become a huge proponent of check_mk. The documentation isn't great, but the product rocks: with very little time, you can start monitoring a slew of services (disk, hardware, logs, ntp) with almost no tweaking required. You can easily create custom checks, but you also have all the Nagios plug-ins it's compatible with. No daemon listens on the hosts being monitored. You get graphing for free (no setup time) for almost all your checks.

It uses Nagios under the hood; it's basically an automation system that generates those Nagios configs. The GUI is amazing, because it uses a plug-in so you don't have to edit files on disk to group your hosts or tweak the alerts. Those configs are snapshotted automatically at every change, and you can replicate that configuration automatically to remote servers. Download it from the upstream site instead of relying on distro package repositories.

Caveat, the documentation sucks, the GUI can be nonintuitive and it's hard to Google problems. It takes time to fully tune. Out of the box you'll probably still be impressed though.

Yes, check_mk is great. I rolled out a 30000+ checks installation on a single server today. I've looked at most monitoring products on the market and check_mk easily wins. It's really popular in Germany, many huge companies are using it.

The English documentation isn't that great, the German one is better. That being said, it's mostly self-explanatory and all checks are very well documented in man pages.

My favorite features:

- Auto discovery for literally everything, including SNMP interfaces.

- Fine grained rule system for customizing check threshold and parameters

- All configuration is automatically versioned and you can integrate it with Git - this includes the changes you make in the web interface.

- It's very easy to set up a distributed monitoring system (multisite) with a central node which aggregates all states and replicated configuration changes, yet each site is fully autonomous.

- The agent takes zero network input, so no attack surface.

- Even though it's extremely featureful, its architecture is very simple and it's easy to contribute code and write custom checks.

Their Git is public: http://git.mathias-kettner.de/git/?p=check_mk.git;a=shortlog...

It works well with Naemon and Nagios 4. Been using it for a number of projects, ask me anything!

Monitoring systems not to use:

  - Shinken (zero security awareness and dishonest PR, when I tried it, it has so many bugs that I wouldn't ever trust it to keep my data safe)

  - Zabbix (it has some brilliant features, but the architecture is a mess, configuration and time series stored in a MySQL database which is hard to manage and automate, I found it cumbersome to debug, written in PHP)

Best at what? What is driving you to want to switch?

I like Icinga a lot. I won't bother reviewing it; it is very well known. Professionally, my last two gigs have used Zabbix.

Zabbix, architecturally, is a nightmare. Uses an RDBMS for storing time-series data, so it wastes a ton of space on historic data while managing to be far slower than it needs to be when querying larger ranges. Uses an agent. Has a proxy-agent that, while handy, encourages all sorts of sketchy, error-prone monitoring topologies. With 3.0, the UI has crawled out of the awful range, and is now merely annoying. Takes the all-singing, all-dancing monolithic approach for the main app, including features for drawing maps on big-screens.

For all that, it works well. Give it the hardware it wants, be sane in setting it up[1], ignore the goofy features (maps, inventory, screens - I guess someone must have requested those), and it is very solid and very powerful.

[1] The template system, pseudo-language for triggers, naming convention for variables and method of creating custom monitors take some getting used to. Expect to take the time to actually read the docs, and most likely to throw out your templates the first time you model your systems.

It depends on your needs and budget.

Can you afford time but not money? Try Sensu or Nagios.

Do you have money and not time? Try datadog.

Like someone else mentioned here, if you're looking to alert off of logs from ELK, try Elastalert.

Agree with this.

If you have money and no time: DataDog

If you don't mind putting in a little time: Sensu

Sensu is straightforward to deploy if you use Chef/Ansible/Puppet. It also supports running Nagios plugins which is pretty useful.

Datadog does WAY more than Sensu does. Sensu doesn't handle metrics with more than 1 dimension, which should be a standard feature of any modern metrics platform.

I also disagree that setting up Sensu takes a "little" time. What is a "little" to an inexperienced Sensu administrator? A day? A week? Several weeks? Quantifying it would be valuable to the reader.

Prometheus is absolutely the way you should be going. All of the other systems I'm seeing mentioned here — Nagios, Icinga, check_mk, Zabbix, Sensu — are host-centric and are very awkward when you try to bend them to fit modern (containerized, etc.) workloads.

There's always a server. Regardless of how far away you've abstracted it away, there's always a server which should probably be monitored (even if to know when it's about to fail and should have work shunted off it prior to its failure). Icinga and others make it easy to programmatically add and remove servers as they enter and leave your environment.

Even if you don't have access to the server so you can monitor it, you can use the "host" concept as containers for your services: "api.mycorp.com", "tasks.mycorp.com", "backups.mycorp.com" are great starting points.

> There's always a server.

Actually, no.

When you monitor state of a cluster (e.g. node count), you don't have a server, you have plenty of servers and a cluster (completely different thing).

When you monitor temperature in your server room, you don't have a server, you have a server room.

When you monitor exchange rate, you don't have a server.

When you monitor a website, you still don't have a server.

And now add all the AWS Lambdas and other serverless rage.

The notion that everything works on a (single!) server was never valid, and today it's even more visible than it was twenty years ago, when Nagios was state of the art.

True but in practice it doesn't really matter. With sensu, you have offbox checks, and you just pick some internal server (there's always some "misc" server hanging around).

What matters is that the alert about the issue is raised and relayed to the proper notification channels. Since sensu doesn't concern itself with a fancy dashboard, it doesn't really matter if the alert pertains to the host or not.

Any decent monitoring will have customized the alert handling based on what's alerting, so there's some amount of post-processing possible.

> True but in practice it doesn't really matter.

In practice it doesn't matter if you name a file handle "juju" and a database query "peach" in your code.

It's a matter of calling things what they are instead of forcing them into a mismatched data scheme by creating artificial hosts.

Yeah, containerization is a good point. We've been starting to deploy apps on Docker within the last 6 months. What does Prometheus offer that makes it better for containerization?

Mostly, it has great support for service discovery for many major cloud and container platforms (Kubernetes, Marathon, EC2, Azure, Zookeeper, Consul, ...). So it can not only go out and pull metrics data directly from instances as they float around your dynamic cluster scheduler, but also attach identity metadata (as provided by the service discovery) to the time series collected from each instance. For example, you may map EC2 tags or Kubernetes labels into your Prometheus time series labels to give you more useful metrics. There's also a way to plug in your own custom service discovery. Also, Kubernetes has exported native Prometheus metrics for quite a while already.
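As an illustration of the tag-to-label mapping mentioned above, here is a small Python sketch of the idea. It is not Prometheus code (the real thing is done with relabeling rules in the config); the function name and sample values are hypothetical. It assumes discovery metadata arrives as labels with a common prefix such as `__meta_ec2_tag_`, and copies each one onto the target with the prefix stripped.

```python
def map_meta_labels(meta, prefix):
    """Copy metadata labels that start with `prefix` into plain label names."""
    mapped = {}
    for name, value in meta.items():
        if name.startswith(prefix) and len(name) > len(prefix):
            mapped[name[len(prefix):]] = value
    return mapped


# Hypothetical metadata for one discovered instance.
meta = {
    "__meta_ec2_tag_role": "api",
    "__meta_ec2_tag_env": "prod",
    "__meta_ec2_instance_id": "i-0abc",
}
labels = map_meta_labels(meta, "__meta_ec2_tag_")
```

Every series scraped from that instance would then carry `role` and `env`, so you can aggregate by them without the instance knowing anything about its own identity.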

Borg inspired Kubernetes. Borgmon inspired Prometheus. So naturally it works well together with a dynamically scheduled world.

If you can have a monitoring system in the cloud Datadog is a great choice.

Good documentation, UI, many, many plugins and fair pricing (IMO).


(I'm not affiliated with them in any way other than using their product on a pet project with many moving parts.)

Datadog was down so often when I had to use it. It felt so unreliable. We used it to monitor hosts, and it got to the point where checking if Datadog was down was part of troubleshooting.

How long ago was that? We just switched in January and it's been pretty reliable.

As far as Datadog goes, it's the most team friendly dashboard system we've used. We had a specialty monitoring system for one application stack previously, and no one made custom dashboards there or even just looked at the data. Now we've got custom dashboards out the nose and we're gradually consolidating to a "best of" dashboard for each service.

Datadog may be okay if you're doing really simple stuff and not sending much data. Once you get to scale, you will need a system like Wavefront. Wavefront can take millions of data points per second, query on them super fast, and they don't go down. Every other monitoring system downsamples, or throws away your data after a certain amount of time.

We throw hundreds of thousands of metrics at Datadog per minute from thousands of hosts; it hasn't broken a sweat yet.

Are you affiliated with Wavefront?

I'm a happy customer of Wavefront. I completely believe that Datadog can handle hundreds of thousands per minute -- especially if most of them are pre-canned, non-custom metrics grabbed by their agent. Hundreds of thousands of metrics/minute is a few thousand a second only. Wavefront does millions of custom metrics per second, which can be sent with different dimensions and tags. That's much harder.
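For scale, the unit conversion behind that comparison can be written out. The 300,000 per minute figure is an assumed stand-in for "hundreds of thousands", not a number anyone in the thread gave.

```python
def per_second(per_minute):
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


assumed_per_minute = 300_000            # stand-in for "hundreds of thousands"
rate = per_second(assumed_per_minute)   # 5000.0 data points per second
ratio = 1_000_000 / rate                # how far that is from a million per second
```

Under that assumption the two claims differ by a factor of about 200, which is why the per-minute and per-second figures should not be compared directly.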

Yeah I second this. Hundreds of thousands per minute is a very small number. Wavefront can go over a million per second.

Per tenant, or globally? And does it have an SLA around this?

(I've learned not to trust numbers that seem too good to be true unless they're contractually obligated.)

Prometheus can do 800k/s on a single machine. Handling a million per second sounds perfectly plausible to me if you design it properly. The question is more how much it's going to cost you.

Wavefront doesn't publish pricing, but if we take Librato's pricing as a general indication you're talking several million dollars a month.

The UI is pretty nice for sure, especially compared to some other things I'm using.

I used it up until two months ago when I left that job. There were ~2000 servers monitored, I think.

Just make sure you understand their pricing (I didn't): http://sagemath.blogspot.com/2016/07/datadogs-pricing-dont-m...

How could their pricing page be clearer? It says per host in fairly large letters underneath it.

I'm asking because I will be designing a similar page soon (that's also billed per host) and I'd like to avoid the same mistakes.

[EDIT: This pricing page by the top poster in this thread is way better than I suggest below -- https://www.serverdensity.com/pricing/]

1. VERY clearly state that when you sign up for the service, then you are on the hook for up to $18*500 = $9000 + tax in charges for any month. Even Google compute engine (and Amazon) don't create such a trap, and have a clear explicit quota increase process.

2. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per host" all on the same line. It would easily fit. I don't even know how the $15/host datadog discount could ever really work, given that the number of hosts might constantly change and there is no prepayment.

3. Inform users clearly in the UI at any time how much they are going to owe for that month (so far), rather than surprising them at the end. Again, Google Cloud Platform has a very clear running total in their billing section, and any time you create a new VM it gives the exact amount that VM will cost per month.

4. If one works with a team, 3 is especially important. The reason that I had monitors on 50+ machines is that another person working on the project, who never looked at pricing or anything, just thought -- hey, I'll just set this up everywhere. He had no idea there was a per-machine fee.
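The exposure described in the points above is simple per-host arithmetic. A minimal sketch, using the $18 per host and 500-host figures from point 1 and the 50 machines from point 4 (tax excluded; the function name is made up):

```python
def monthly_cost(hosts, price_per_host=18):
    """Monthly charge in dollars for a flat per-host price, before tax."""
    return hosts * price_per_host


worst_case = monthly_cost(500)  # the ceiling mentioned in point 1
fifty_hosts = monthly_cost(50)  # the "set it up everywhere" scenario
```

A running total like this, shown in the UI as hosts are added, is essentially what point 3 asks for.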

It depends on your architecture and scale. There is no "best", just "best we've found for this" and "best given other constraints".

This is yet another point where DevOps is not "devs doing ops" but "operations building and deploying with all the tools of modern software development". You need a subject matter expert.

What are you monitoring? Do you care about availability or performance or both? Scale? Do you have services or servers? Do you manage the underlying hardware? Do you need to track which hardware boxes have which VMs or containers?

There are a million questions to answer. One big set of them: what do you dislike about Nagios? Make sure that you don't get those problems with the next one, but also make sure you get something that does what you need as well as what you want.

Thanks for the comprehensive reply. Briefly I would say that we care about availability more than performance, though both are important. We're running somewhere on the order of 2000 VMs w/ ESX with some bare metal systems running database clusters. We have a separate team that manages the hardware infrastructure and they have their own monitoring and alerting system. I'm mostly concerned about preventing downtime for the application cluster, alerting the right people via the right means (chat, email, pagerduty) when something does go down, and getting some high resolution graphs for analysis.

I still don't know anything much about your systems, but I do know this: find out what your hardware team uses and see if it is right for you, too. (Or find out that they are unhappy with it, and perhaps go in together on a new system.)

Benefits: shared expertise. Common language. Propagation of alerts up from hardware and down from services. Better root cause analysis. If you have a good culture, faster resolution time and better understanding.

> You need a subject matter expert

Spot on. Too many people think that monitoring is about slapping a piece of code on some hosts. Monitoring is data science.

Just my opinion, but I won't use Prometheus, because of the active polling model. It won't scale without a number of workarounds.

My preferred method is Icinga2 (a Nagios clone with better configuration and clustering built-in) with reports coming in via passive NSCA. Toss in Graphite (or I'm warming up to Grafana on Influx) with some ability to write alerts against those reported metrics, and you're as close to ideal as I can come up with.

Of course, that requires a fair bit of up-front knowledge to stand up and operate, but they're so rock solid (and scale like mad) I have a hard time not recommending them.

Active polling scales _better_ than the alternative. You directly manage the ingest rate at the server side, rather than DDOSing your monitoring infrastructure when you scale out your services to deal with load.

> You directly manage the ingest rate at the server side

Why in the world would you want to manage (which I'm reading as throttle) the ingest rate of a monitoring service? That strikes me as a recipe for missing important events.

Monitoring should be a fairly stable load rate. If you're setting up monitoring, and your machine can't handle the load of all the data points coming in, you need to shard out, not drop data points.

> Active polling scales _better_ than the alternative

Exceptional claims require exceptional evidence. Personally, I have never heard of anything that is actively pinging outside services performing better than receiving and processing data passively.

Prometheus will scale better than Nagios running active checks since it won't be using subprocesses, but it is still going to require more overhead than a service receiving passive reports.

The pulling itself has never been a scaling/bottleneck concern in Prometheus, especially since no individual events are transferred, but just the current state of each metric. The bottleneck is usually the storage's sustained ingestion speed (disk IO or similar).

A single big Prometheus server can easily handle millions of time series, and an older record run stored 800,000 samples per second. With that you could monitor, e.g., 10,000 hosts in quite some detail, and the bottleneck is still not the pull aspect.
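To put "quite some detail" into numbers, here is the back-of-the-envelope calculation. The 800,000 samples per second and 10,000 hosts come from the comment above; the 15-second scrape interval is an assumption for illustration.

```python
def series_per_host(samples_per_second, hosts, scrape_interval_s):
    """How many time series each host can expose within the ingestion budget."""
    per_host_rate = samples_per_second / hosts  # samples per second per host
    return per_host_rate * scrape_interval_s    # one sample per series per scrape


budget = series_per_host(800_000, 10_000, scrape_interval_s=15)
```

That works out to 80 samples per second per host, or 1,200 series per host at the assumed interval.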

I want to understand more about why you think that pull doesn't scale. Borgmon at Google (the biggest scale ever) worked fine with a pull model, and as far as I know, the successor still pulls, too. A push model also has the problem that now the identity of each instance needs to be known by each instance itself, instead of only by the monitoring system via service discovery (which needs to know about what instances should exist and their identity anyways, otherwise how is it going to let you know that something is wrong?).

We also have an FAQ in Prometheus about why we prefer pull: https://prometheus.io/docs/introduction/faq/#why-do-you-pull...?

In my experience, pulling is operationally much nicer than pushing, and I've worked with both. It also gives you somewhat less of an accidental DDoS exposure.

> but I won't use Prometheus, because of the active polling model.

> [Graphite+Icinga] they're so rock solid (and scale like mad) I have a hard time not recommending them.

As a Prometheus developer I have seen a significant number of users who moved from Graphite because they found it doesn't scale and was far from rock solid for them, requiring regular manual care and feeding. By contrast Prometheus seems to be working pretty well for them at what we would consider to be a moderate load. I've heard similar about Nagios/Icinga.

Push vs. pull is largely not relevant for scaling (pull is slightly better in this regard, but only slightly). I've been involved with some extremely large scale monitoring systems, and the fact they were pull was never relevant to scaling them.

May I ask what you consider to be a high level of scale?

> It won't scale without a number of workarounds.

There are very, very few systems that will scale without workarounds. That's the nature of scaling a non-trivial system.

You are wrong about that. There is nothing inherent in a polling model that limits its ability to scale. You simply shard based on service or pod or row (whatever your unit of scale is), and then you can always use consistent hashing/federation. Prometheus is a single binary. How many moving parts are involved in an Icinga 2 cluster?
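A rough sketch of the sharding idea, in Python for illustration. It uses plain hash-modulo assignment rather than true consistent hashing (so changing the shard count reshuffles most targets), and the function names and target names are hypothetical. Each poller keeps only the targets whose hash lands on its own shard number, so no coordination between pollers is needed.

```python
import hashlib


def shard_for(target, shard_count):
    """Map a target to a shard using a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    return int.from_bytes(digest[-8:], "big") % shard_count


def assign(targets, shard_count):
    """Group targets by the shard that should poll them."""
    shards = {index: [] for index in range(shard_count)}
    for target in sorted(targets):
        shards[shard_for(target, shard_count)].append(target)
    return shards


targets = [f"host{i}.example.com:9100" for i in range(100)]
shards = assign(targets, shard_count=4)
```

A consistent-hashing ring would be the next step if shards are added and removed often, since it moves only a fraction of the targets on each change.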

Two. Icinga2 and nsca-ng. You can bring it down to one if you use Icinga2 as the test runner on the individual machines.

If you need to shard, no other binaries are involved, just a couple of changed configuration files and some SSL certs. Passive reports can go to any node; a simple load balancer will work fine without any form of hashing or federation required.

> There is nothing inherent in a polling model that limits its ability to scale

Aside from the additional processing requirements, such as SSH, NRPE, subprocesses, etc, you are also limited in that a polling process must be told in advance of new systems or services, whereas it's fairly easy to just have a new service or system start reporting and be immediately monitored.

> Passive reports can go to any node; a simple load balancer will work fine without any form of hashing or federation required.

How do you re-combine and alert on data for one service that got reported to multiple nodes? This would imply to me a database of some form, which is going to need hashing or federation or other distributed systems approaches to be scaled out.

> Aside from the additional processing requirements

Push or pull, there's additional processing requirements when you scale. It's the same bytes on the wire and broadly the same amount of processing power required.

> you are also limited in that a polling process must be told in advance of new systems or services

That's not a scaling limit, that's a more fundamental issue that isn't different between push and pull.

For a push system you need to have a list of all systems and services in order to be able to alert on systems that never reported, or are no longer reporting.
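Put as code, the check is a set difference between what should exist and what has actually reported. A minimal sketch, with hypothetical host names; where the expected list comes from (service discovery, an inventory, a static file) is left open.

```python
def find_gaps(expected, reported):
    """Compare the inventory of expected instances with those that reported."""
    expected, reported = set(expected), set(reported)
    return {
        "missing": sorted(expected - reported),     # should exist, never reported
        "unexpected": sorted(reported - expected),  # reporting, but not in inventory
    }


gaps = find_gaps(
    expected=["web1", "web2", "db1"],
    reported=["web1", "db1", "batch7"],
)
```

Without the `expected` side, `web2` here is invisible: a push-only system simply never hears about it.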

A more thorough investigation of pull vs. push.


Not a terribly thorough investigation; simply a series of unsubstantiated assertions.

"Pull works fine at 2 million machines" is a great statement to make, but it would be much stronger with more details. How many machines are doing the pulling? How often? Are they using subprocesses or threads or greenthreads? How do they handle timeouts? How many metrics per machine? How many pulls per metric grab?

At Stack Overflow we use a homebuilt Go solution called bosun: http://bosun.org/ -- it runs on pretty much anything and lets us incorporate data from windows machines / linux machines in one place.

I recently tried out Bosun and liked it a lot. The documentation is a little light, and the dependency on hbase and hadoop (since opentsdb uses that) is a bit of a pain. Maintaining those isn't particularly straightforward or fun.

I'm also interested in prometheus but haven't gotten to try it out yet. Anyone reading this have experience with both? How do they compare?

We've been dogfooding the new Stack Overflow Documentation system for Bosun, so you may find some better examples at http://stackoverflow.com/documentation/bosun/topics which just opened yesterday. If you see anything missing you can request a topic or flag an example as needs improvement.

The one you use. I have sold and implemented these types of tools for the past ten years. Biggest problem is companies not actually fully implementing and using the tools they already own, and letting teams splinter off into their own tool sets.

I think it depends on your needs and software, how much time you want to invest, what you want to monitor, and whether you want to maintain it yourself or want SaaS.

You want metrics from counters you build in your app? (see statsd?)

You want to aggregate and do analysis on logs? (see ELK stack?)

You want to monitor cloud infrastructure? (see stackdriver?)

You want to run end to end tests on your application to ensure it's behaving? (see runscope?)

As your application grows, you probably want a blend of tools to see inside your app.
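For the first case in the list above (counters built into your app), the statsd wire format is simple enough to show. This is a minimal sketch with hypothetical metric names, assuming a statsd-compatible daemon listening on UDP port 8125 (the usual default): a counter is sent as `name:value|c`, optionally followed by a sample rate.

```python
import socket


def format_counter(name, value=1, sample_rate=None):
    """Encode a counter increment in the statsd line format."""
    payload = f"{name}:{value}|c"
    if sample_rate is not None:
        payload += f"|@{sample_rate}"
    return payload.encode("ascii")


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP; the app never blocks on the metrics daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (host, port))


packet = format_counter("app.logins")
```

In an application you would call something like `send_counter("app.logins")` at the point where the event happens and let the daemon do the aggregation.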

Just use Prometheus, nothing else comes close to it. It also just hit 1.0


Should have mentioned how well it pairs with Grafana ;)

Why not start with AWS CloudWatch: https://aws.amazon.com/cloudwatch/details/ - a simple, scalable, out-of-the-box solution. It's much simpler than building similar functionality yourself.

Hi all, I'm surely biased as I work at Instana (https://www.instana.com), but here's my opinion about monitoring.

Applications are changing dramatically and rapidly. With continuous delivery, the microservice approach, containers and orchestration tools, things are all over the place, and you might have a component spun up and down within a few minutes. Humans cannot keep up with the data, and it doesn't make any sense to stare at a big screen full of data all day, looking at charts and trying to visually correlate them. The correlation of data is becoming harder and harder as systems become more and more resilient. There is, therefore, no unique root cause anymore (https://www.instana.com/blog/no-root-cause-microservice-appl...).

At Instana we're re-defining what monitoring means. We're moving the bar from visualizing data to providing a plain English explanation of what's going on, together with suggestions for remediation. Instana's 3 main values are:

- Automatic Discovery: dynamically models the architecture of infrastructure, middleware and services

- Automatic QoS Analysis: continuously derives KPIs of all components and services and alerts on incidents

- Integrated Investigation: visualizes in real-time physical and logical architecture, compares over time, suggests fixes and optimizations.

Happy to get feedback and provide more info. Enrico

Can you compare Instana to Datadog, SignalFX and Wavefront?

Let's first say that I am the co-founder an CEO of Instana, but I am trying to give a generic answer so that I don't "attack" competitors.

Most of the mentioned tools in this thread, including Datadog, SignalFX etc are using a simple agent to collect data - see Datadog agent on GitHub: https://github.com/DataDog/dd-agent or statsD (https://github.com/etsy/statsd) that is mostly recommended by SignalFX who have no own agent. Tools like Prometheus work similar.

On the backend side you can see two approaches for data store technology: A time series based approach like DataDog or Prometheus and a Streaming based approach like SignalFX - stream are the superior approach in my point of view as they allow for realtime approaches and stream (window based) analytics. There is a third category which is similar to time series but more "log" centric like the ELK stack or tool like Splunk.

On top of the data store these tools give you the ability to build your own dashboards (and provide standard dashboards for standard technology) and a alerting based on thresholds. They also allow to add you own metrics via API which can be used to add application specific data. They also give you a query API to query and combine the data in the store. So overall this is a Lambda architecture for monitoring data.

I would say that SignalFX is the most sophisticated one, but the framework for working on streams is much more complicated than DataDog's time series approach, so people go the easier way.

The problem with all of these tools is that they rely on the user to build dashboards, thresholds and in case of a problem do the correlation to find the root cause of the problem.

To correlate you need to understand the dependencies of the system components. As an easy example, if service A has a performance issue because it calls service B, which has a CPU problem, you need to know that A calls B and correlate the latency of A with the latency and CPU of B to find the root cause. You can discover/model dependencies with tools like Zipkin (https://github.com/openzipkin) or Spring Cloud Sleuth (https://cloud.spring.io/spring-cloud-sleuth/), which are based on the Google Dapper paper. You could even add or log the Span ID to the metrics/logs so that you can correlate them automatically.
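To make the A-calls-B example concrete, here is a minimal sketch in Python of walking a dependency graph from the service that shows the symptom to the dependencies that look unhealthy. The call graph, the service names and the anomaly findings are all made up for illustration; in practice they would come from a tracing tool and from metric checks, and this is not how any of the products mentioned here do it.

```python
from collections import deque

# Hypothetical call graph: service -> services it calls.
CALLS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": [],
    "D": [],
}

# Hypothetical findings from metric checks, keyed by service.
ANOMALIES = {
    "A": ["high latency"],
    "B": ["high latency", "high CPU"],
}


def suspect_dependencies(service, calls, anomalies):
    """Breadth-first walk over everything `service` calls, directly or
    transitively, returning the anomalous dependencies with their findings."""
    seen = {service}
    queue = deque(calls.get(service, []))
    suspects = []
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue
        seen.add(dep)
        if dep in anomalies:
            suspects.append((dep, anomalies[dep]))
        queue.extend(calls.get(dep, []))
    return suspects


print(suspect_dependencies("A", CALLS, ANOMALIES))
```

The point of the sketch is only that the correlation step needs the graph as input: without knowing that A calls B, the two latency findings are just two unrelated alerts.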

Typically if you do so manually it is a disaster for change. All your correlations (and even dashboards) will not work if the topology of your services changes. Which is quite normal in the microservice world.

Instana uses a stream based approach similar to SignalFX BUT we combine this with a graph database that holds the dependencies of all physical and application components. Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime - including containers etc.

We then use the Google Four golden signals + Capacity (that was added by Netflix as the fifth one) to analyze the KPIs of the services and apply machine learning on it. That way we don't need manual thresholds which are also hard to maintain when things change a lot. If we see e.g. slow response times or sudden drops in requests or high error rates, then we analyze the dependency tree of that service to find the issues that are related to the problem and generate an incident for that - as we also discover changes, we add them to the incident as most often a change is the reason for a problem. I've written a blog entry on the Dynamic Graph: https://www.instana.com/blog/monitoring-microservice-applica...
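As a rough illustration of what "no manual thresholds" can mean, here is a minimal Python sketch that flags a KPI value when it deviates strongly from its own recent history. This is a plain z-score check chosen for brevity, not Instana's method; the function name, the sample values and the limit of 3 standard deviations are assumptions.

```python
from statistics import mean, pstdev


def is_anomalous(history, latest, z_limit=3.0):
    """Return True if `latest` is more than `z_limit` standard deviations
    away from the mean of `history` (a list of recent values)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        return latest != mu  # perfectly flat baseline: any change stands out
    return abs(latest - mu) / sigma > z_limit


# Hypothetical response times in milliseconds.
recent = [100, 102, 98, 101, 99]
print(is_anomalous(recent, 150))
print(is_anomalous(recent, 101))
```

The baseline moves with the data, so nobody has to revisit a hard-coded limit when the service changes. A real system would also have to deal with seasonality and trends, which this sketch ignores.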

Hope this answers your question.


> stream are the superior approach in my point of view as they allow for realtime approaches and stream (window based) analytics.

I'd see them as slightly different approaches to providing fundamentally the same solution. One builds up time series and then operates on them, the other operates on the time series as they come in.

Taking Prometheus as an example we're a time series database, and you can do both realtime and window-based analysis. In fact that's how it is usually used.
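A small Python sketch of the two approaches side by side, computing the same trailing-window average once over samples as they arrive and once as a query against an already stored series. The function names and sample data are made up, and neither function reflects how Prometheus or SignalFx is actually implemented.

```python
from collections import deque


def streaming_window_avg(samples, window):
    """Consume (timestamp, value) pairs as they arrive and yield the
    average over the trailing `window` seconds after each sample."""
    buf = deque()
    total = 0.0
    for ts, value in samples:
        buf.append((ts, value))
        total += value
        # Drop samples that fell out of the half-open window (ts - window, ts].
        while buf[0][0] <= ts - window:
            _, old = buf.popleft()
            total -= old
        yield ts, total / len(buf)


def stored_window_avg(series, at, window):
    """Query a stored series for the average over (at - window, at]."""
    values = [v for ts, v in series if at - window < ts <= at]
    return sum(values) / len(values) if values else None


# Hypothetical samples: (seconds, value).
series = [(0, 10), (10, 20), (20, 30), (30, 100)]
print(list(streaming_window_avg(series, 25))[-1])
print(stored_window_avg(series, 30, 25))
```

Both calls cover the same three samples and produce the same average, which is the sense in which the two approaches solve fundamentally the same problem. They differ in when the work happens and in what has to be kept in memory.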

> I would say that SignalFX is the most sophisticated

Do you have an example of something that you can do with your streaming approach that's not possible with other tools?

It's hard to get a proper understanding of the myriad of monitoring systems out there, so I'm always looking for insights.

> Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime.

That sounds interesting, how do you do that for network dependencies? Do you have something like Zipkin?

I agree that streaming and timeseries queries/scans are two different approaches which can solve the problem in the same way. With instant vectors of Prometheus queries you can operate very similarly to windows, and if you write the right queries and take care that they work in-memory you should also get similar performance and throughput.

My point was more about the framework you get and how easy it is to apply analytics to streams/queries. SignalFx seems to have a nice workbench for this with direct visual feedback in the UI, so that you can work on existing data to get the right result.

As said, we at Instana think that most people will not be able to build a sophisticated monitoring solution with these types of frameworks as they don't have the time to do it and maybe not even the analytical domain knowledge. You can see that SignalFx is adding specific knowledge for some technologies. I'll give you two simple examples to show that it is not easy:

- How would you predict if a file system is running out of disk space?

- How would you predict if you should add a node to a Cassandra cluster because it is running out of capacity (and it can take some serious time to add a node, so you should know in advance)?

Already the disk space problem is hard to solve - linear regression and basic algorithms will not work.
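For reference, this is roughly what the naive linear-regression baseline looks like in Python: a least-squares line through (time, bytes used) samples, extrapolated to the disk capacity. The function name and the sample numbers are made up. The second call shows one way it goes wrong, since a single cleanup (log rotation, say) flips the fitted slope and the predictor reports nothing.

```python
def seconds_until_full(samples, capacity):
    """Fit used = slope * t + intercept by least squares and extrapolate
    to `capacity`. Returns the time remaining after the last sample, or
    None if the fitted usage is not growing."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    last_t = samples[-1][0]
    return (capacity - intercept) / slope - last_t


# Hypothetical steady growth: 10 units per time step on a 100-unit disk.
print(seconds_until_full([(0, 10), (1, 20), (2, 30)], 100))

# Same growth followed by one cleanup: the fitted slope turns negative.
print(seconds_until_full([(0, 10), (1, 20), (2, 30), (3, 5)], 100))
```

Real disk usage tends to look more like the second case (sawtooth patterns, bursts, step changes) than the first, which is one reading of why a straight line is not enough.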

Now think of hundreds (or thousands) of services running on a dynamic container platform and new services released on a daily or even minute basis - with lots of different technologies involved...

No question that you can build a good monitoring solution with Prometheus, SignalFX, DataDog etc - but it will take a serious amount of time, consulting and dev teams involved in adding the right instrumentation, metrics etc. And you need a lot of analytical knowledge. I can even imagine that there are situations where tools like Prometheus are a better choice - especially if you have a very strict set of technologies and communication frameworks and really good people to write a very specific set of "rules" for this environment.

We've added a domain model to our product (all the mentioned products have a generic metric model, but no semantics that describe servers, containers, processes, services and their communication, which is the domain of system and application monitoring): Our Dynamic Graph.

And yes, we are using something very similar to Zipkin to get the dependencies between services. Here are two blog entries describing the approach:

- About distributed tracing: https://www.instana.com/blog/evolution-tracing-application-p...

- How we safely instrument code: https://www.instana.com/blog/how-instana-safely-instruments-...


> SignalFx seems to have a nice workbench for this with direct visual feedback in the UI, so that you can work on existing data to get the right result.

Wavefront does as well; I'd recommend you compare it for competitive analysis.

So would you say your product is in direct competition with these offerings, or do you see it more as a complement to them?

Yes, I didn't compare to Wavefront as I have only basic insights and therefore cannot make a valid statement.

Competition depends on the use case - if you are using a tool like SignalFX for custom metric analytics, then we are no competition as our focus is monitoring of applications and their underlying infrastructure.

We are an Application Performance Management (APM) solution and therefore compete more with tools like New Relic or AppDynamics. These tools are sadly only used for troubleshooting in 90% of the cases and not for management or monitoring. They also do not work in highly dynamic and scaled environments as their "model" is too static (which they try to fix with their analytics offerings).

This is what we want to change, and where we add the whole stack to the game to analyze all the dependencies, help find root causes quickly, and monitor and predict the KPIs of your applications, services, clusters and components.

We integrate with solutions like SignalFX if needed, but I have had really good experiences doing "dashboarding" with more business related tools like Tableau or QlikView - this also offers application owners an easier way to aggregate the monitoring data and metrics on a higher (business) level, where tools like Instana offer the instrumentation data as an input.

Hynek Schlawack gave a talk at PyCon this year about using Prometheus and Grafana to unify monitoring metrics. Honestly the talk goes beyond my own understanding, but you may find it helpful. He's quite knowledgeable.

> To get real time insight into your running applications you need to instrument them and collect metrics: count events, measure times, expose numbers. Sadly this important aspect of development was a patchwork of half-integrated solutions for years. Prometheus changed that and this talk will walk you through instrumenting your apps and servers, building dashboards, and monitoring using metrics.

Abstract - https://us.pycon.org/2016/schedule/presentation/1601/

Slides - https://speakerdeck.com/hynek/get-instrumented-how-prometheu...

Video - https://www.youtube.com/watch?v=b-qLOY5ChnQ

In the past we used Icinga at Zalando and it scaled for us to 40k checks; after that we got huge latency problems. We now use ZMON https://github.com/zalando/zmon/ which is really great because it scales the checks. The graph database is KairosDB on top of Cassandra, which also scales. Even creating alerts can be automated, development teams can add alerts themselves, and you can easily build team dashboards, reuse checks/alerts and filter them to your entities. InfluxDB was a nice try, but clustering was very unstable in the beginning (tried with 0.7 and 0.8). If you don't want to be the monitoring configurator for your organization (application monitoring also has to be created and maintained), I highly recommend ZMON (maybe Prometheus can also help). There is also a check to query Prometheus in ZMON.

I'm a big fan of Zabbix for general server monitoring and alerting. Sensible defaults, built in graphing, multi-step web scenarios. Good times.

What do you have to monitor? For network hardware you can use Observium (or the fork LibreNMS). It's simple and works fine.

Yeah, should have made it more clear. Monitoring a linux-based application cluster of a couple thousand machines.

Kentik (www.kentik.com) is a great choice for monitoring network traffic details. Terabit-scale aggregates down to individual conversations. (Disclaimer: I work at Kentik)

Most people here are recommending Prometheus. What is the best monitoring system to monitor good old infrastructure software like DNS servers, IMAP/SMTP servers etc? Is Prometheus a reasonable choice for those as well?

Yes, Prometheus is a great choice for that as well. It's pretty easy to write integrations (we call them exporters) to get metrics out of existing third-party systems that you cannot easily instrument directly. Here's a list of exporters we already know about, but it's usually easy to write one of your own if it doesn't exist yet: https://prometheus.io/docs/instrumenting/exporters/
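To give a feel for how small an exporter can be, here is a minimal Python sketch using only the standard library: it asks a third-party system for some numbers and serves them as plain text on /metrics in the Prometheus text format. The `collect` function, the metric names and the port are made up, and in practice you would normally reach for the official client library for your language rather than formatting the output by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect():
    """Hypothetical stand-in for querying the monitored system's stats."""
    return {"smtp_queue_length": 12, "smtp_connections_open": 3}


def render(metrics):
    """Render a {name: value} dict as gauges in the text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Blocks forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render(collect()), end="")
```

Calling `serve()` starts the HTTP endpoint; Prometheus would then be configured to scrape that host and port. The fresh `collect()` call on every scrape is the pull model in miniature.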

Prometheus is the clear winner

We have had very good success with sensu. We like it better than nagios, but I haven't used many others so can't really say that sensu is better than everything.

So we just recently switched over to Wavefront from an aging Zabbix monitoring. We had tested and reviewed a few time series based monitoring systems and felt Wavefront was what we needed for Enterprise level monitoring.

Some of the key items we liked were:

* Able to consume millions of metrics per second. This is pretty huge. While we're not even close to that much (11k/s at the moment), we expect that number to triple or quadruple in the next year.

* Fast. Wavefront renders graphs quickly. The ability to manipulate the data in real time has been impressive.

* Feature requests. Wavefront has been receptive to ideas from their customer base. They even have a voting system in their community page if other customers like a certain request.

* Support has been great. Questions on issues or general technical guidance has been handled quickly, within the hour.

* Docker ready. Already using Wavefront with our emerging docker infrastructure.

* Engineers are self sufficient. Before, Tech Ops had to do all the monitoring for new services. With technologies such as docker, our engineers are capable of setting up monitoring within the application to directly send to Wavefront. This offloads quite a bit of work from Tech Ops.

No, I'm not affiliated with Wavefront. We just use their monitoring service.

We use Icinga 2 at work which serves our needs well enough.

The configuration was a bit of an initial hurdle when coming from icinga 1 / nagios - the config syntax is essentially an EDSL for programming your monitoring requirements - but the flexibility is worth it. Adding new hosts and services is pretty cheap (programmer-time-wise), and I can use whatever programming constructs and conditions I want to decide what services to apply to which hosts in which measure.

That said, it's still in a bit of a young state and some parts are very rough around the edges - for example, icinga 2's dependency model is a bit naive. You can configure email notifications to ignore notifications for services that depend on a different failed host/service, but this only applies if icinga already knows about the dependency having failed. So when a parent service dies, an extra e-mail notification could be generated for each of its children before icinga realizes the parent has also died and stops sending notifications for them.
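The ordering problem can be shown with a toy model in Python. This is not Icinga's code or configuration; the function, host and service names are made up, and the only rule modelled is "suppress a failure if its parent is already known to be down".

```python
def notifications(events, parent_of):
    """Process (name, state) check results in arrival order and return the
    names that would trigger a notification. A failure is suppressed only
    if its parent is already known to be down at that moment."""
    down = set()
    sent = []
    for name, state in events:
        if state == "ok":
            down.discard(name)
            continue
        down.add(name)
        if parent_of.get(name) in down:
            continue  # parent already known to be down: stay quiet
        sent.append(name)
    return sent


# Hypothetical topology: two services depend on one host.
parent_of = {"web": "host1", "db": "host1"}

parent_first = [("host1", "down"), ("web", "down"), ("db", "down")]
children_first = [("web", "down"), ("db", "down"), ("host1", "down")]

print(notifications(parent_first, parent_of))
print(notifications(children_first, parent_of))
```

With the parent's result arriving first there is one notification; with the children's results arriving first there are three, which is the extra mail described above.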

tl;dr I had fun setting it up and it works well for us, but expect some quirks

We've been using Riemann and it's wonderful. There's a bit of a learning curve as the configurations are just Clojure code, but since it's all code you can build whatever you want on top of it if you know some Clojure. The DSL is well thought out and we ended up writing a REST API on top of Riemann to make our monitoring stack self-service for all the internal users.

Hey, there's a new metric system called monsoon. It's a framework and it's pretty impressive. It can do collectd and any JSON, and it is both a push and a pull based system. Check it out: https://github.com/groupon/monsoon (soon to support Wavefront and PagerDuty).

We use a combination of metric monitors, with Wavefront being the leading monitoring solution - integration is smooth, the querying language is simple and powerful, the graphs render fast and their support is very helpful - even after the contract is signed :)

Cray Advance Cluster Engine EMS ( http://www.cray.com/products/computing/cs-series?tab=cs_seri... ). Formerly Appro Cluster Engine.

Complete control and monitoring of the cluster with either a CLI or GUI. Scalable monitoring with negligible impact on running workloads, including global synchronization of metric collection times to minimize jitter. Ganglia front-end, but without the overhead of gmond/gmetric running on nodes. Validated as scaling well on an 8,000-node cluster.

Full disclosure: I designed and implemented the monitoring system.


Sorry - accidental down vote while trying to upvote on mobile.

There's now an 'undown' link you can click to undo accidental downs.

I don't see an 'undown' link anywhere, even when clicking on the timestamp of the comment I replied to. shrugs #PoorUI

click the timestamp, and then click [undown].


Hey there, Librato here (https://www.librato.com/)

Welp, nobody can blame you for wanting to get away from Nagios. It’s certainly a tool from a different, simpler era and hasn’t aged well in our opinion.

As a push-based metrics solution, Librato is probably a lot different than what you're used to. But don’t worry: we're super easy to get up and running with, and obviously you no longer need to worry about maintaining or scaling infrastructure. Also, unlike with some other solutions, you can use us with your existing toolchain (it’s easy to plug us into your existing Nagios infrastructure to try us - the trial is free & full-featured).

We’re a hosted metrics platform, meaning you can send metrics of any type and amount you want. We’re functionally similar to Graphite+Grafana, except we do all the work of scaling and management for you so you can focus on the metrics themselves. We provide alerting and other useful bits out of the box (things that are not trivial to setup yourself, e.g., bolting together collectd+Graphite+Grafana+statsd+flapjack+kitchen sink and hoping it scales and doesn’t fall over). We’ve got an agent that comes with a bunch of turn-key integrations too, to make it super easy for you to monitor what you care about.

As to pricing, we're the only hosted monitoring system that will just charge you for what you actually USE. You pay pennies per metric metered by the hour, instead of a per-node model, which gets crazy expensive and inefficient for modern ephemeral infrastructure. For example, if all you're doing is integrating us with AWS CloudWatch to monitor some EC2 instances and an RDS instance, we can do that for effectively a $1-$2 an instance. We also have an agent you can install on your servers if you want more detailed metrics, which adds $5-10 per instance depending on how many metrics you enable. Our customer success team (email support@librato.com, or the Help chat window if you already have a Librato account) will be more than happy to walk you through any permutation of our pricing and the details of the model to help you better understand it.

As mentioned, you can try us out for free--no credit card required: https://www.librato.com/

We are using Wavefront at DoorDash and have been very happy with it. Setting up is super easy, the UI is easy to use, and they have never had a major outage. Definitely something you can try out.

I used to use nagios and migrated to sensu for system checks. I was using graphite/seyren for time series and alerting, but doing a YoY or week over week was very slow especially if it's a lot of metrics. You should look at http://wavefront.com

You can do some nice math functions for your alerts.

+1 to this as long as you are ok with an external vendor.

A couple of caveats. If you are coming from Nagios, this is a different worldview on monitoring. Like many other solutions commented here this is all based around metrics and their associated time series, and then you need to alert on those metrics. You ask the system questions with a time series query language.

Wavefront doesn't yet have a great solution for poll-based monitoring (i.e. hitting host X's /healthcheck endpoint) so I still use terrible ol' Nagios for that in my environment. However the rest of my work is all done in Wavefront - I'd say easily the high 90% of all my material alerts are done in Wavefront with a small subset of work done in Nagios.

The killer feature here is the query language. I don't think there is anything else on the market that has its level of sophistication. I've had ex-Googlers on my team who "grew up" with Borgmon, which is in some sense the Ur-time series monitoring system and they loved it.

All this said, there are a lot of options out there. I have a strong bias against supporting my own complicated monitoring infrastructure. I want to focus on my own product. If you don't share that opinion or are on a super duper tight cash budget (but you do have time) then disregard the above ;)

> The killer feature here is the query language. I don't think there is anything else on the market that has its level of sophistication. I've had ex-Googlers on my team who "grew up" with Borgmon, which is in some sense the Ur-time series monitoring system and they loved it.

Prometheus is inspired by Borgmon, and has a query language that is unmatched by almost everything else I'm aware of.

Are there public docs on the semantics and features of the WaveFront language so I can compare?

Has anyone tried Vector [1] by Netflix?

[1] http://vectoross.io/

Give opsee a try https://opsee.com/

AppSignal is also a cool product, although mostly focused on Rails applications. And, a big plus, they are working on an Elixir integration :-) https://appsignal.com/elixir

OpenNMS. It is a java memory pig but is used by some of the twenty largest ASNs (by CAIDA ASRANK) in north america. Truly open source and free. Very extensible. Large development community behind it and many constant updates.

If you want monitoring plus automation and remote management check out http://www.kaseya.com/

It depends on what you want to achieve, but Nagios works well and Graphite for trending is also quite useful.

If you are okay with something which you don't have to run yourself-

The winner IMO is dataloop.io [0].

Dataloop is a SaaS monitoring solution that is super easy to get up and running and has tons of fantastic features and capabilities. The team behind it is stellar and their pricing is reasonable.

10/10, will continue to use again and again :)

[0] https://dataloop.io/

If you care about correctness of data, solid data retention and good analytics (prediction, forecasting, etc.) then you should take a look at Circonus.


500 metrics accounts are free for life.

Built by SREs for SREs.

postwait is the founder of Circonus.

Does anyone have thoughts on ITRS Geneos?

for small setups munin and (m)monit are still my goto place.

dead simple, easy to configure and very reliable

disclaimer: I evaluated most of these tools and wrote a blog post here. https://thehftguy.wordpress.com/2016/04/18/monitoring-in-the...

It's a bit old and I'll update it later, but here is the short summary with all the latest tools:

### Free (as in open-source) shitty options: icinga, nagios, riemann

They suck so much they're not even worthy of having their names written.


The other open-source option is prometheus.

I didn't try it personally but I've had candidates interviewing at my company who talked at length about their experience with it and they were satisfied.

I read the whole documentation and it's better than the old shitty tools but it's still not great. Be aware that it has many limitations by design, they skipped all the hard stuff (single node only, no HA, pull-mode only for metrics).


The new SaaS tools (ordered by maturity), all $10-20 per host, they're mostly copycats:

Datadog, BMC TrueSight Pulse (Boundary), SignalFX, Wavefront, Server Density.

Datadog is the best option. It's older (about 5 years) and more mature. It has the most features and integrations. It's really the next generation of monitoring.

BMC truesight pulse is the historic competitor. It was a startup called "Boundary" that was bought by BMC, and BMC rebranded the product. That's about the same thing. Not sure what the acquisition may or may not have changed.

SignalFX is a direct copy-cat of datadog (and BMC). But it came later so it's lacking in features and integrations.

Wavefront is an even later copy-cat of datadog and signalfx. Except it has no public price nor public trial. You have to contact them and go through sales for anything. (Honestly: just ignore wavefront. There are 3 direct competitors who are better and more accessible).

ServerDensity: Don't bother trying. The website is buggy, it fails to load pages very often. The product is not even finished and lacks 80% of the competitors' features. The company will probably die soon. (sorry for their employees who are commenting here and reading that :\ )

[Google] StackDriver: It was another company that was acquired by Google 2 years ago. Currently, it's dead and it's being integrated to Google offerings. That might be great when it comes back (probably this year, there seem to be some closed beta given by Google at the moment).

### Current status-quo:

Datadog beats everything by a long margin. More mature, more features, more integrations. It has the advantage and it's evolving faster. That's the horse you have to put your money on (I did).

You can try the competitors (either BMC or signalfx) if you wanna play around or just tickle datadog sales team to get a better price (I did) :D

### Far future:

There might be a market rupture within 1-2 years when Google finally releases StackDriver. It had some quite advanced stuff and great reviews when it was acquired. It's the only one that might be able to catch up with datadog and provide the very advanced stuff that doesn't currently exist (e.g. outlier detection done right).

If and when Google finally offers GCE (cheaper & faster than AWS) + kubernetes (docker and infrastructure on steroids) + StackDriver (complete monitoring AND logging solutions), they will be the best IaaS provider on the planet by a wide margin. The evolutions brought by these tools will allow me to do the work of 3 infra/SRE guys all by myself.

> they skipped all the hard stuff (single node only, no HA

We skipped the hard stuff on purpose, as hard stuff is extremely tricky to get right and liable to fail right when it's most needed. See http://www.robustperception.io/monitoring-without-consensus/

Per the above, there is HA.


Not icinga and not boost

this isn't the proper way to learn that.

grafana ??


monitoring system too.

Plenty of folks building large scale custom monitoring solutions with InfluxDB (plus Grafana, Collectd, Telegraf etc) - https://influxdata.com/testimonials/

I know I might come off as trolling, but try to get out of the business of managing servers. I've done it, it sucks, and I don't ever want to do it again.

Get everything containerized and use a container runtime like ECS, unless you're operating in analytics, adtech, or something else with extreme storage/compute/network requirements.
