
Ask HN: Best monitoring system? - mspaulding06
At my company I'm considering switching us from Nagios to another monitoring system and am starting to do some research.  What's the best monitoring solution out there today?  I'm pretty impressed by Prometheus, but I'd just like to get some more opinions.
======
acd
Prometheus.io is a modern, fresh monitoring system that I would check out
if replacing a legacy system.

Also take a look at Riemann which is system monitoring written in Clojure.
Riemann should be good for monitoring latency of the system.

If it helps, here is a slide deck from Spotify on how they do their monitoring:
[https://www.netways.de/fileadmin/images/Events_Trainings/Eve...](https://www.netways.de/fileadmin/images/Events_Trainings/Events/OSMC/2015/Slides_2015/Monitoring_at_Spotify_When_things_go_ping_in_the_night-
Martin_Parm.pdf)

~~~
mspaulding06
I like Riemann and actually read The Art of Monitoring
([https://artofmonitoring.com/](https://artofmonitoring.com/)) which was a
great book. There are two big downsides for me on this one. First, you MUST
build your monitoring solution from scratch and you MUST learn Clojure (which
can be hard to get a whole team to agree to). Second, there's no alerting
dashboard, which makes it difficult to see the overall state of the clusters
you're dealing with. The only way you know there's a problem is if you get an
email. Maybe there's a better way to handle that but I wasn't able to find
one.

~~~
bbrazil
> you MUST learn Clojure (which can be hard to get a whole team to agree to)

For any sufficiently advanced monitoring system, you're going to have to learn
some form of DSL/language to take advantage of it.

I wouldn't consider this to be a major issue, more part of the irreducible
complexity of the problem. The Riemann examples I've seen all seemed pretty
readable, and routing alerts is not a world away from what you'd be doing in say
a Prometheus Alertmanager config; just with S-Expressions against YAML.

~~~
mspaulding06
It's a major issue for me because of the time investment required to learn the
language (functional programming isn't exactly second nature to most
programmers) and the need to design the monitoring system from scratch.
Instead of an out-of-the-box system you're getting more of a framework which
allows you to build one. I actually like Clojure and agree that it is readable
as well as fun to learn. But I think with something like Prometheus you do
more simple configuration than monitoring system design.

------
dmytton
If you want to hand off all the hard stuff about monitoring and get some easy
to use, core functionality (graphing, alerts, dashboards) then my company
[https://www.serverdensity.com](https://www.serverdensity.com) has been going
7+ years now.

For highly sophisticated environments then
[https://www.datadog.com](https://www.datadog.com) is a very advanced product.

Both are based off my original agent: [https://github.com/serverdensity/sd-
agent](https://github.com/serverdensity/sd-agent) (DataDog forked it in 2010
and we forked them back last year).

We're also behind [http://www.humanops.com](http://www.humanops.com) trying to
build monitoring that also helps you run on-call and ops teams generally in a
way that considers fatigue, stress and the realities of the humans running IT
systems! E.g. [https://blog.serverdensity.com/introducing-alert-
costs/](https://blog.serverdensity.com/introducing-alert-costs/) and
[https://opzzz.sh](https://opzzz.sh)

~~~
williamstein
dmytton you have done a FANTASTIC job on your pricing page
([https://www.serverdensity.com/pricing/](https://www.serverdensity.com/pricing/))!
I just got screwed by DataDog's confusing pricing page
([http://sagemath.blogspot.com/2016/07/datadogs-pricing-
dont-m...](http://sagemath.blogspot.com/2016/07/datadogs-pricing-dont-make-
same-mistake.html)). Yours is the ultimate example of what to do right. WOW.

~~~
dmytton
Thanks! We just recently released this new pricing page as a test. We've done
a few iterations over the years and patio11 did a great writeup at
[http://www.kalzumeus.com/2012/08/13/doubling-saas-
revenue/](http://www.kalzumeus.com/2012/08/13/doubling-saas-revenue/)

------
mancerayder
I've used Nagios and Icinga2, and I've become a huge proponent of check_mk.
The documentation isn't great, but the product rocks: with very little time,
you can start monitoring a slew of services (disk, hardware, logs, ntp) with
almost no tweaking required. You can easily create custom checks, but you also
have all the Nagios plug-ins it's compatible with. No daemon listens on the
hosts being monitored. You get graphing for free (no setup time) for almost
all your checks.

It uses Nagios under the hood, it's basically an automation system that
generates those Nagios systems. The GUI is amazing, because it uses a plug-in
so you don't have to edit files on disk to group your hosts or tweak the
alerts. Those configs are snapshotted automatically at every change, and you
can replicate that configuration automatically to remote servers. Download it
from the upstream site instead of relying on distro package repositories.

Caveat, the documentation sucks, the GUI can be nonintuitive and it's hard to
Google problems. It takes time to fully tune. Out of the box you'll probably
still be impressed though.

~~~
eeZi
Yes, check_mk is great. I rolled out a 30000+ checks installation on a single
server today. I've looked at most monitoring products on the market and
check_mk easily wins. It's really popular in Germany, many huge companies are
using it.

The English documentation isn't that great, the German one is better. That
being said, it's mostly self-explanatory and all checks are very well
documented in man pages.

My favorite features:

\- Auto discovery for literally everything, including SNMP interfaces.

\- Fine grained rule system for customizing check threshold and parameters

\- All configuration is automatically versioned and you can integrate it with
Git - this includes the changes you make in the web interface.

\- It's very easy to set up a distributed monitoring system (multisite) with a
central node which aggregates all states and replicated configuration changes,
yet each site is fully autonomous.

\- The agent takes zero network input, so no attack surface.

\- Even though it's extremely featureful, its architecture is very simple and
it's easy to contribute code and write custom checks.

Their Git is public: [http://git.mathias-
kettner.de/git/?p=check_mk.git;a=shortlog...](http://git.mathias-
kettner.de/git/?p=check_mk.git;a=shortlog;h=refs/heads/master)

It works well with Naemon and Nagios 4. Been using it for a number of
projects, ask me anything!

Monitoring systems not to use:

    
    
\- Shinken (zero security awareness and dishonest PR; when I tried it, it had
so many bugs that I wouldn't ever trust it to keep my data safe)

\- Zabbix (it has some brilliant features, but the architecture is a mess:
configuration and time series are stored in a MySQL database which is hard to
manage and automate, I found it cumbersome to debug, and it is written in PHP)

------
__jal
Best at what? What is driving you to want to switch?

I like Icinga a lot. I won't bother reviewing it; it is very well known.
Professionally, my last two gigs have used Zabbix.

Zabbix, architecturally, is a nightmare. Uses an RDBMS for storing time-series
data, so it wastes a ton of space on historic data while managing to be far
slower than it needs to be when querying larger ranges. Uses an agent. Has a
proxy-agent that, while handy, encourages all sorts of sketchy, error-prone
monitoring topologies. With 3.0, the UI has crawled out of the awful range,
and is now merely annoying. Takes the all-singing, all-dancing monolithic
approach for the main app, including features for drawing maps on big-screens.

For all that, it works well. Give it the hardware it wants, be sane in setting
it up[1], ignore the goofy features (maps, inventory, screens - I guess
someone must have requested those), and it is very solid and very powerful.

[1] The template system, pseudo-language for triggers, naming convention for
variables and method of creating custom monitors take some getting used to.
Expect to take the time to actually read the docs, and most likely to throw
out your templates the first time you model your systems.

------
atom_enger
It depends on your needs and budget.

Can you afford time but not money? Try Sensu or Nagios.

Do you have money and not time? Try datadog.

Like someone else mentioned here, if you're looking to alert off of logs from
ELK, try Elastalert.

~~~
pwelch
Agree with this.

If you have money and no time: DataDog

If you don't mind putting in a little time: Sensu

Sensu is straightforward to deploy if you use Chef/Ansible/Puppet. It also
supports running Nagios plugins which is pretty useful.

~~~
otterley
Datadog does WAY more than Sensu does. Sensu doesn't handle metrics with more
than 1 dimension, which should be a standard feature of any modern metrics
platform.

I also disagree that setting up Sensu takes a "little" time. What is a
"little" to an inexperienced Sensu administrator? A day? A week? Several
weeks? Quantifying it would be valuable to the reader.

------
sagichmal
Prometheus is absolutely the way you should be going. All of the other systems
I'm seeing mentioned here — Nagios, Icinga, check_mk, Zabbix, Sensu — are
host-centric and are very awkward when you try to bend them to fit modern
(containerized, etc.) workloads.

~~~
falcolas
There's always a server. Regardless of how far away you've abstracted it away,
there's always a server which should probably be monitored (even if to know
when it's about to fail and should have work shunted off it prior to its
failure). Icinga and others make it easy to programmatically add and remove
servers as they enter and leave your environment.

Even if you don't have access to the server so you can monitor it, you can use
the "host" concept as containers for your services: "api.mycorp.com",
"tasks.mycorp.com", "backups.mycorp.com" are great starting points.

~~~
dozzie
> There's always a server.

Actually, no.

When you monitor state of a cluster (e.g. node count), you don't have _a_
server, you have _plenty of servers_ and _a cluster_ (completely different
thing).

When you monitor temperature in your server room, you don't have a server, you
have a server room.

When you monitor exchange rate, you don't have a server.

When you monitor a website, you still don't have a server.

And now add all the AWS Lambdas and other serverless rage.

Notion that everything works on a (single!) server was never valid, and today
it's even more visible than it was twenty years ago, when Nagios was state of
the art.

~~~
athenot
True but in practice it doesn't really matter. With sensu, you have offbox
checks, and you just pick some internal server (there's always some "misc"
server hanging around).

What matters is that the alert about the issue is raised and relayed to the
proper notification channels. Since sensu doesn't concern itself with a fancy
dashboard, it doesn't really matter if the alert pertains to the host or not.

Any decent monitoring will have customized the alert handling based on what's
alerting, so there's some amount of post-processing possible.

~~~
dozzie
> True but in practice it doesn't really matter.

In practice it doesn't matter if you name a file handle "juju" and a database
query "peach" in your code.

It's a matter of calling things what they are instead of forcing them into a
mismatched data scheme by creating artificial hosts.

------
Zenfinch
If you can have a monitoring system in the cloud Datadog is a great choice.

Good documentation, UI, many, many plugins and fair pricing (IMO).

[https://www.datadoghq.com/](https://www.datadoghq.com/)

(I'm not affiliated with them in any way other than using their product on a pet
project with many moving parts).

~~~
williamstein
Just make sure you understand their pricing (I didn't):
[http://sagemath.blogspot.com/2016/07/datadogs-pricing-
dont-m...](http://sagemath.blogspot.com/2016/07/datadogs-pricing-dont-make-
same-mistake.html)

~~~
bqe
How could their pricing page be clearer? It says per host in fairly large
letters underneath it.

I'm asking because I will be designing a similar page soon (that's also billed
per host) and I'd like to avoid the same mistakes.

~~~
williamstein
[EDIT: This pricing page by the top poster in this thread is way better than I
suggest below --
[https://www.serverdensity.com/pricing/](https://www.serverdensity.com/pricing/)]

1\. _VERY clearly_ state that when you sign up for the service, then you are
on the hook for up to $18*500 = $9000 + tax in charges for any month. Even
Google compute engine (and Amazon) don't create such a trap, and have a clear
explicit quota increase process.

2\. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per
host" all on the same line. It would easily fit. I don't even know how the
$15/host datadog discount could ever really work, given that the number of
hosts might constantly change and there is no prepayment.

3\. Inform users clearly in the UI at any time how much they are going to owe
for that month (so far), rather than surprising them at the end. Again, Google
Cloud Platform has a very clear running total in their billing section, and
any time you create a new VM it gives the exact amount that VM will cost per
month.

4\. If one works with a team, 3 is especially important. The reason that I had
monitors on 50+ machines is that another person working on the project, who
never looked at pricing or anything, just thought -- hey, I'll just set this up
everywhere. He had no idea there was a per-machine fee.
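
The arithmetic behind point 1 can be spelled out in a few lines. This is only a sketch: the $18 price and the 500-host figure come from the comment above, while the function name and the 50-host example are illustrative and not taken from any vendor's billing API.

```python
# Figures quoted in the comment above; treat them as examples, not current pricing.
PRICE_PER_HOST_USD = 18   # monthly per-host price mentioned in the comment
HOST_CAP = 500            # host count used for the worst-case figure


def monthly_cost(hosts: int, price_per_host: int = PRICE_PER_HOST_USD) -> int:
    """Monthly charge before tax when billing is a flat fee per monitored host."""
    return hosts * price_per_host


worst_case = monthly_cost(HOST_CAP)   # the exposure described in point 1
team_rollout = monthly_cost(50)       # hypothetical: agent installed on 50 machines
```

A running total like this, shown in the UI as hosts are added, is what point 3 asks for.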

------
dsr_
It depends on your architecture and scale. There is no "best", just "best
we've found for this" and "best given other constraints".

This is yet another point where DevOps is not "devs doing ops" but "operations
building and deploying with all the tools of modern software development". You
need a subject matter expert.

What are you monitoring? Do you care about availability or performance or
both? Scale? Do you have services or servers? Do you manage the underlying
hardware? Do you need to track which hardware boxes have which VMs or
containers?

There are a million questions to answer. One big set of them: what do you
dislike about Nagios? Make sure that you don't get those problems with the
next one, but also make sure you get something that does what you need as well
as what you want.

~~~
mspaulding06
Thanks for the comprehensive reply. Briefly I would say that we care about
availability more than performance, though both are important. We're running
somewhere on the order of 2000 VMs w/ ESX with some bare metal systems running
database clusters. We have a separate team that manages the hardware
infrastructure and they have their own monitoring and alerting system. I'm
mostly concerned about preventing downtime for the application cluster,
alerting the right people via the right means (chat, email, pagerduty) when
something does go down, and getting some high resolution graphs for analysis.

~~~
dsr_
I still don't know anything much about your systems, but I do know this: find
out what your hardware team uses and see if it is right for you, too. (Or find
out that they are unhappy with it, and perhaps go in together on a new
system.)

Benefits: shared expertise. Common language. Propagation of alerts up from
hardware and down from services. Better root cause analysis. If you have a
good culture, faster resolution time and better understanding.

------
falcolas
Just my opinion, but I won't use Prometheus, because of the active polling
model. It won't scale without a number of workarounds.

My preferred method is Icinga2 (a Nagios clone with better configuration and
clustering built-in) with reports coming in via passive NSCA. Toss in Graphite
(or I'm warming up to Grafana on Influx) with some ability to write alerts
against those reported metrics, and you're as close to ideal as I can come up
with.

Of course, that requires a fair bit of up-front knowledge to stand up and
operate, but they're so rock solid (and scale like mad) I have a hard time not
recommending them.

~~~
sagichmal
Active polling scales _better_ than the alternative. You directly manage the
ingest rate at the server side, rather than DDOSing your monitoring
infrastructure when you scale out your services to deal with load.

~~~
falcolas
> You directly manage the ingest rate at the server side

Why in the world would you want to manage (which I'm reading as throttle) the
ingest rate of a monitoring service? That strikes me as a recipe for missing
important events.

Monitoring should be a fairly stable load rate. If you're setting up
monitoring, and your machine can't handle the load of all the data points
coming in, you need to shard out, not drop data points.

> Active polling scales _better_ than the alternative

Exceptional claims require exceptional evidence. Personally, I have never
heard of anything that is actively pinging outside services performing better
than receiving and processing data passively.

Prometheus _will_ scale better than Nagios running active checks since it
won't be using subprocesses, but it is still going to require more overhead
than a service receiving passive reports.

~~~
jrv
The pulling itself has never been a scaling/bottleneck concern in Prometheus,
especially since no individual events are transferred, but just the current
state of each metric. The bottleneck is usually the storage's sustained
ingestion speed (disk IO or similar).

A single big Prometheus server can easily do millions of time series, and in
an older record, 800,000 samples stored per second. You could monitor e.g.
10,000 hosts with that with quite some detail, and the bottleneck is still not
the pull aspect.
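
As a rough sanity check of those figures, the sketch below works out how many time series each host could expose. The 15-second scrape interval is an assumption made for illustration; the comment does not state one.

```python
# Back-of-the-envelope check of the figures quoted above.
SAMPLES_PER_SECOND = 800_000   # ingestion rate quoted in the comment
HOSTS = 10_000                 # host count quoted in the comment
SCRAPE_INTERVAL_S = 15         # assumed scrape interval, not from the thread


def series_per_host(samples_per_second: int, hosts: int, interval_s: int) -> float:
    """Series each host could expose if every series is sampled once per interval."""
    samples_per_host_per_second = samples_per_second / hosts
    return samples_per_host_per_second * interval_s


budget = series_per_host(SAMPLES_PER_SECOND, HOSTS, SCRAPE_INTERVAL_S)
```

Under that assumed interval, each host has a budget of roughly a thousand series, which fits the "quite some detail" remark.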

------
jc4p
At Stack Overflow we use a homebuilt Go solution called bosun:
[http://bosun.org/](http://bosun.org/) \-- it runs on pretty much anything and
lets us incorporate data from windows machines / linux machines in one place.

~~~
tokenizerrr
I recently tried out Bosun and liked it a lot. The documentation is a little
light, and the dependency on hbase and hadoop (since opentsdb uses that) is a
bit of a pain. Maintaining those isn't particularly straight forward or fun.

I'm also interested in prometheus but haven't gotten to try it out yet. Anyone
reading this have experience with both? How do they compare?

~~~
gbrayut
We've been dogfooding the new Stack Overflow Documentation system for Bosun,
so you may find some better examples at
[http://stackoverflow.com/documentation/bosun/topics](http://stackoverflow.com/documentation/bosun/topics)
which just opened yesterday. If you see anything missing you can request a
topic or flag an example as needs improvement.

------
poulsbohemian
The one you use. I have sold and implemented these types of tools for the past
ten years. Biggest problem is companies not actually fully implementing and
using the tools they already own, and letting teams splinter off into their
own tool sets.

------
ninjakeyboard
I think it depends on your needs and software, how much time you want to
invest, what you want to monitor, do you want to maintain it or you want saas?

You want metrics from counters you build in your app? (see statsd?)

You want to aggregate and do analysis on logs? (see ELK stack?)

You want to monitor cloud infrastructure (see stackdriver?)

You want to run end to end tests on your application to ensure it's behaving?
(see runscope?)

As your application grows, you probably want a blend of tools to see inside
your app.
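
For the first case (counters built into your app), a minimal sketch of what a StatsD counter looks like on the wire is shown below. The metric name, host and port are placeholders; 8125 is the port StatsD conventionally listens on, but your deployment may differ.

```python
import socket


def statsd_counter(name: str, value: int = 1) -> bytes:
    """Format a counter increment in the StatsD line protocol: <name>:<value>|c"""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget a counter over UDP; host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_counter(name, value), (host, port))
```

Calling `send_counter("app.logins")` from a request handler would emit one increment per login; because it is UDP, the app does not block or fail if the collector is down.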

------
rvanniekerk
Just use Prometheus, nothing else comes close to it. It also just hit 1.0

 _EDIT_

Should have mentioned how well it pairs with Grafana ;)

------
rmykhajliw
Why not start with AWS CloudWatch:
[https://aws.amazon.com/cloudwatch/details/](https://aws.amazon.com/cloudwatch/details/)
\- simple, scalable, out of the box solution. It's much simpler than building
similar functionality yourself.

------
enricobruschini
Hi all, I'm surely biased as I work at Instana
([https://www.instana.com](https://www.instana.com)), but here's my opinion
about monitoring.

Applications are dramatically and rapidly changing, with continuous delivery,
microservice approach, containers and orchestration tools, things are all over
and you might have a component spun up and down within a few minutes. Humans
cannot keep up with data and it doesn't make any sense to stare at a big
screen full of data, just looking all day at charts trying to visually
correlate data. The correlation of data is becoming harder and harder as
systems are more and more resilient. There's, therefore, no unique root cause
anymore ([https://www.instana.com/blog/no-root-cause-microservice-
appl...](https://www.instana.com/blog/no-root-cause-microservice-
applications/)).

At Instana we're re-defining what monitoring means. We're moving the bar from
visualizing data to providing a plain English explanation of what's going on,
together with suggestions for remediation. Instana's 3 main values are:

\- Automatic Discovery: dynamically models the architecture of infrastructure,
middleware and services

\- Automatic QoS Analysis: continuously derives KPIs of all components and
services and alerts on incidents

\- Integrated Investigation: visualizes in real-time physical and logical
architecture, compares over time, suggests fixes and optimizations.

Happy to get feedback and provide more info. Enrico

~~~
otterley
Can you compare Instana to Datadog, SignalFX and Wavefront?

~~~
de107549
Let's first say that I am the co-founder and CEO of Instana, but I am trying to
give a generic answer so that I don't "attack" competitors.

Most of the mentioned tools in this thread, including Datadog, SignalFX etc
are using a simple agent to collect data - see Datadog agent on GitHub:
[https://github.com/DataDog/dd-agent](https://github.com/DataDog/dd-agent) or
statsD ([https://github.com/etsy/statsd](https://github.com/etsy/statsd)) that
is mostly recommended by SignalFX, who have no agent of their own. Tools like
Prometheus work similarly.

On the backend side you can see two approaches for data store technology: A
time series based approach like DataDog or Prometheus and a Streaming based
approach like SignalFX - stream are the superior approach in my point of view
as they allow for realtime approaches and stream (window based) analytics.
There is a third category which is similar to time series but more "log"
centric like the ELK stack or tool like Splunk.

On top of the data store these tools give you the ability to build your own
dashboards (and provide standard dashboards for standard technology) and
alerting based on thresholds. They also allow you to add your own metrics via API
which can be used to add application specific data. They also give you a query
API to query and combine the data in the store. So overall this is a Lambda
architecture for monitoring data.

I would say that SignalFX is the most sophisticated one but the framework to
work on streams is much more complicated than DataDog's time series approach so
people go the easier way.

The problem with all of these tools is that they rely on the user to build
dashboards, thresholds and in case of a problem do the correlation to find the
root cause of the problem.

To correlate you need to understand the dependencies of the system components.
As an easy example if service A has a performance issue because it calls
service B that has a CPU problem, you need to know that A calls B and
correlate the latency of A with the latency and CPU of B to find the root
cause. You can discover/model dependencies with tools like Zipkin
([https://github.com/openzipkin](https://github.com/openzipkin)) or Spring
Cloud Sleuth ([https://cloud.spring.io/spring-cloud-
sleuth/](https://cloud.spring.io/spring-cloud-sleuth/)) which are based on the
Google Dapper paper. You could even add or log the Span ID to the metrics/logs
so that you can correlate them automatically.
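
A minimal sketch of that correlation step, assuming you already have a call graph (for example derived from traces) and a set of services with active resource alerts. The graph, the service names and the function name are hypothetical, not part of Zipkin or Sleuth.

```python
from collections import deque

# Hypothetical call graph: service -> services it calls.
CALLS = {"A": ["B"], "B": ["C"], "C": []}
# Hypothetical set of services that currently have a resource alert (e.g. high CPU).
UNHEALTHY = {"B"}


def root_cause_candidates(service, calls, unhealthy):
    """Breadth-first walk of downstream dependencies of `service`,
    returning the unhealthy ones in the order they are reached."""
    seen = {service}
    queue = deque(calls.get(service, []))
    found = []
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue          # guards against cycles in the call graph
        seen.add(dep)
        if dep in unhealthy:
            found.append(dep)
        queue.extend(calls.get(dep, []))
    return found


candidates = root_cause_candidates("A", CALLS, UNHEALTHY)
```

For the example in the text, a latency alert on A leads to B as the candidate. The walk is only as good as the graph, so it goes stale as soon as the topology changes.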

Typically if you do so manually it is a disaster for change. All your
correlations (and even dashboards) will not work if the topology of your
services changes. Which is quite normal in the microservice world.

Instana uses a stream based approach similar to SignalFX BUT we combine this
with a graph database that holds the dependencies of all physical and
application dependencies. Our agent automatically discovers all the components
and dependencies and adds them to the graph in realtime - including containers
etc.

We then use the Google Four golden signals + Capacity (that was added by
Netflix as the fifth one) to analyze the KPIs of the services and apply
machine learning on it. That way we don't need manual thresholds which are
also hard to maintain when things change a lot. If we see e.g. slow response
times or sudden drops in requests or high error rates, then we analyze the
dependency tree of that service to find the issues that are related to the
problem and generate an incident for that - as we also discover changes, we
add them to the incident as most often a change is the reason for a problem.
I've written a blog entry on the Dynamic Graph:
[https://www.instana.com/blog/monitoring-microservice-
applica...](https://www.instana.com/blog/monitoring-microservice-applications-
introducing-dynamic-graph/)

Hope this answers your question.

Mirko

~~~
bbrazil
> stream are the superior approach in my point of view as they allow for
> realtime approaches and stream (window based) analytics.

I'd see them as slightly different approaches to providing fundamentally the
same solution. One builds up time series and then operates on them, the other
operates on the time series as they come in.

Taking Prometheus as an example we're a time series database, and you can do
both realtime and window-based analysis. In fact that's how it is usually
used.

> I would say that SignalFX is the most sophisticated

Do you have an example of something that you can do with your streaming
approach that's not possible with other tools?

It's hard to get a proper understanding of the myriad of monitoring systems
out there, so I'm always looking for insights.

> Our agent automatically discovers all the components and dependencies and
> adds them to the graph in realtime.

That sounds interesting, how do you do that for network dependencies? Do you
have something like Zipkin?

~~~
de107549
I agree that streaming and timeseries queries/scans are two different
approaches which can solve the problem in the same way. With instant vectors
of Prometheus queries you can operate very similar to windows and if you do
the right queries and take care that it works in-memory you also should get
similar performance and throughput.

My point was more about the framework you get and how easy it is to apply
analytics to streams/queries. SignalFx seems to have a nice workbench for this
with direct visual feedback in the UI, so that you can work on existing data
to get the right result.

As said we at Instana think that most people will not be able to build a
sophisticated monitoring solution with these types of frameworks as they don't
have the time to do it and maybe even not the analytical domain knowledge. You
can see that SignalFx is adding specific knowledge for some technologies. I
give you two simple examples to show that it is not easy:

\- How would you predict if a file system is running out of disk space?

\- How would you predict if you should add a node to a Cassandra cluster
because it is running out of capacity (and it can take some serious time to
add a node, so you should know in advance)?

Already the disk space problem is hard to solve - linear regression and basic
algorithms will not work.
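
For reference, the naive baseline being dismissed here looks roughly like the sketch below: fit a least-squares line through recent usage samples and extrapolate to capacity. The function name and sample format are made up for illustration. It shows why the simple approach is tempting, and it can mislead when usage is not close to linear, for example when files are periodically rotated or purged.

```python
def hours_until_full(samples, capacity_bytes):
    """Naive estimate: fit a least-squares line through (hour, used_bytes)
    samples and extrapolate to capacity. Returns hours after the last sample,
    or None when there is too little data or usage is not growing."""
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None           # all samples share one timestamp
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None           # flat or shrinking usage: no predicted fill time
    intercept = mean_u - slope * mean_t
    t_full = (capacity_bytes - intercept) / slope
    return t_full - samples[-1][0]
```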

Now think of hundreds (or thousands) of services running on a dynamic
container platform and new services released on a daily or even minute basis -
with lots of different technologies involved...

No question that you can build a good monitoring solution with Prometheus,
SignalFX, DataDog etc - but it will take a serious amount of time, consulting
and dev teams involved adding the right instrumentation, metrics etc. And you
need a lot of analytical knowledge. I can even imagine that there are
situations where tools like Prometheus are a better choice - especially if you
have a very strict set of technologies and communication framework and really
good people to do a very specific set of "rules" for this environment.

We've added a domain model to our product (all the mentioned products have a
generic metric model, but no semantics that describe servers, containers,
processes, services and their communication which is the domain of system and
application monitoring): Our Dynamic Graph.

And yes, we are using something very similar to Zipkin to get the dependencies
between services. Here are two blog entries describing the approach:

\- About distributed tracing: [https://www.instana.com/blog/evolution-tracing-
application-p...](https://www.instana.com/blog/evolution-tracing-application-
performance-management/)

\- How we safely instrument code: [https://www.instana.com/blog/how-instana-
safely-instruments-...](https://www.instana.com/blog/how-instana-safely-
instruments-applications-for-monitoring/)

Mirko

~~~
otterley
> SignalFx seems to have a nice workbench for this with direct visual feedback
> in the UI, so that you can work on existing data to get the right result.

Wavefront does as well; I'd recommend you compare it for competitive analysis.

So would you say your product is in direct competition with these offerings,
or do you see it more as a complement to them?

~~~
de107549
Yes, I didn't compare to Wavefront as I have only basic insights and therefore
cannot make a valid statement.

Competition depends on the use case - if you are using a tool like SignalFX
for custom metric analytics, then we are no competition as our focus is
monitoring of applications and its underlying infrastructure.

We are an Application Performance Management (APM) solution and therefore
compete more with tools like New Relic or AppDynamics. These tools are
sadly only used for troubleshooting in 90% of the cases and not for management
or monitoring. They also do not work in highly dynamic and scaled environments
as their "model" is too static. (which they try to fix with their analytics
offerings)

This is what we want to change and where we add the whole stack to the game to
analyze all the dependencies and help finding root causes quickly and monitor
and predict the KPIs of your applications, services, clusters and components.

We integrate with solutions like SignalFX if needed but I have really good
experience to do "dashboarding" with more business related tools like Tableau
or QlikView - this also offers application owners an easier way to aggregate
the monitoring data and metrics on a higher (business) level, where tools like
Instana offer the instrumentation data as an input.

------
tedmiston
Hynek Schlawack gave a talk at PyCon this year about using Prometheus and
Grafana to unify monitoring metrics. Honestly the talk goes beyond my own
understanding, but you may find it helpful. He's quite knowledgeable.

> To get real time insight into your running applications you need to
> instrument them and collect metrics: count events, measure times, expose
> numbers. Sadly this important aspect of development was a patchwork of half-
> integrated solutions for years. Prometheus changed that and this talk will
> walk you through instrumenting your apps and servers, building dashboards,
> and monitoring using metrics.

Abstract -
[https://us.pycon.org/2016/schedule/presentation/1601/](https://us.pycon.org/2016/schedule/presentation/1601/)

Slides - [https://speakerdeck.com/hynek/get-instrumented-how-
prometheu...](https://speakerdeck.com/hynek/get-instrumented-how-prometheus-
can-unify-your-metrics)

Video -
[https://www.youtube.com/watch?v=b-qLOY5ChnQ](https://www.youtube.com/watch?v=b-qLOY5ChnQ)

------
sszuecs
In the past we used icinga at Zalando and it scaled for us to 40k checks;
after that we got huge latency problems. We now use zmon
[https://github.com/zalando/zmon/](https://github.com/zalando/zmon/), which is
really great because it scales the checks. The graph database is kairosdb on
top of Cassandra, which also scales. Even creating alerts can be automated,
and alerts can be added by development teams themselves. You can easily build
team dashboards, reuse checks/alerts, and filter to your entities. Influxdb
was a nice try, but clustering was very unstable in the beginning (tried with
0.7 and 0.8). If you don't want to be the monitoring configurator for your
organization (application monitoring should also be created and maintained), I
highly recommend using zmon (maybe Prometheus can also help). There is also
a check to query Prometheus in zmon.

------
ilikejam
I'm a big fan of Zabbix for general server monitoring and alerting. Sensible
defaults, built in graphing, multi-step web scenarios. Good times.

------
lormayna
What do you have to monitor? For network hardware you can use Observium (or
the fork LibreNMS). It's simple and works fine.

~~~
mspaulding06
Yeah, should have made it more clear. Monitoring a linux-based application
cluster of a couple thousand machines.

------
moehm
Most people here are recommending Prometheus. What is the best monitoring
system to monitor good old infrastructure software like DNS servers, IMAP/SMTP
server etc? Is Prometheus a reasonable choice for those as well?

~~~
jrv
Yes, Prometheus is a great choice for that as well. It's pretty easy to write
integrations (we call them exporters) to get metrics out of existing third-
party systems that you cannot easily instrument directly. Here's a list of
exporters we already know about, but it's usually easy to write one of your
own if it doesn't exist yet:
[https://prometheus.io/docs/instrumenting/exporters/](https://prometheus.io/docs/instrumenting/exporters/)
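
To make the exporter idea a bit more concrete, here is a minimal sketch of the
text format an exporter serves on its metrics endpoint. It is not code from the
Prometheus project: the function names and the `smtp_queue_length` metric are
made up for illustration, and in practice you would more likely use one of the
official client libraries than format the lines by hand.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metric(name, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs. All names here are
    hypothetical; only the line layout follows the exposition format.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Pretend we asked a mail server for its queue sizes.
    print(render_metric(
        "smtp_queue_length",
        "Messages waiting in the mail queue.",
        "gauge",
        [({"queue": "deferred"}, 12), ({"queue": "active"}, 3)],
    ), end="")
```

An exporter is then roughly this rendering step plus a small HTTP server that
returns the text whenever Prometheus scrapes it.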

------
shofetim
Prometheus is the clear winner

------
mordocai
We have had very good success with sensu. We like it better than nagios, but I
haven't used many others so can't really say that sensu is better than
everything.

------
msims
So we just recently switched over to Wavefront from an aging Zabbix
monitoring. We had tested and reviewed a few time series based monitoring
systems and felt Wavefront was what we needed for Enterprise level monitoring.

Some of the key items we liked were:

* Able to consume millions of metrics per second. This is pretty huge. While we're not even close to that much (11k/s at the moment), we expect that number to triple or quadruple in the next year.

* Fast. Wavefront renders graphs quickly. The ability to manipulate the data in real time has been impressive.

* Feature requests. Wavefront has been receptive to ideas from their customer base. They even have a voting system in their community page if other customers like a certain request.

* Support has been great. Questions on issues or general technical guidance has been handled quickly, within the hour.

* Docker ready. Already using Wavefront with our emerging docker infrastructure.

* Engineers are self sufficient. Before, Tech Ops had to do all the monitoring for new services. With technologies such as docker, our engineers are capable of setting up monitoring within the application to directly send to Wavefront. This offloads quite a bit of work from Tech Ops.

No, I'm not affiliated with Wavefront. We just use their monitoring service.

------
haasn
We use Icinga 2 at work which serves our needs well enough.

The configuration was a bit of an initial hurdle when coming from icinga 1 /
nagios - the config syntax is essentially an EDSL for programming your
monitoring requirements - but the flexibility is worth it. Adding new hosts
and services is pretty cheap (programmer-time-wise), and I can use whatever
programming constructs and conditions I want to decide what services to apply
to which hosts in which measure.

That said, it's still in a bit of a young state and some parts are very rough
around the edges - for example, icinga 2's dependency model is a bit naive.
You can configure email notifications to ignore notifications for services
that depend on a different failed host/service, but this only applies if
icinga already knows about the dependency having failed. So when a parent
service dies, an extra e-mail notification could be generated for each of its
children before icinga realizes the parent has also died and stops sending
notifications for them.
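
The ordering problem can be shown with a toy model. This is only a sketch of
the behaviour described above, not how icinga 2 is implemented: the function
name, the service names and the "checks arrive one at a time" assumption are
all mine.

```python
def notifications(check_order, failed, parent_of):
    """Return the services that trigger a notification, in order.

    check_order: order in which check results arrive
    failed:      set of services/hosts that are actually down
    parent_of:   child -> parent dependency map
    A child's notification is suppressed only if the parent is already
    known to be down when the child's result arrives.
    """
    known_down = set()
    sent = []
    for service in check_order:
        if service not in failed:
            continue
        known_down.add(service)
        if parent_of.get(service) in known_down:
            continue  # parent failure already known: suppress
        sent.append(service)
    return sent


if __name__ == "__main__":
    deps = {"http": "host", "ssh": "host"}
    down = {"host", "http", "ssh"}
    # Children checked first: one mail per child plus one for the parent.
    print(notifications(["http", "ssh", "host"], down, deps))
    # Parent checked first: only the parent notifies.
    print(notifications(["host", "http", "ssh"], down, deps))
```

Same outage, same dependency config, but the number of e-mails depends purely
on which check result happens to come in first.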

tl;dr I had fun setting it up and it works well for us, but expect some quirks

------
zphds
We've been using riemann and it's wonderful. There's a bit of a learning
curve as the configurations are just clojure code, but since it's all code you
can build whatever you want on top of it if you know some Clojure. The DSL is
well thought out and we ended up writing a REST API on top of riemann to make
our monitoring stack self-service for all the internal users.

------
nolofsson1
Hey, there's a new metric system called monsoon. It's a framework and it's
pretty impressive. It can do collectd, any json, and is both a push and a pull
based system. Check it out:
[https://github.com/groupon/monsoon](https://github.com/groupon/monsoon). Soon
to support wavefront and pagerduty.

------
ycdk
We use a combination of metric monitors, with Wavefront being the leading
monitoring solution - integration is smooth, the querying language is simple
and powerful, the graphs render fast and their support is very helpful - even
after the contract is signed :)

------
pcvarmint
Cray Advance Cluster Engine EMS ( [http://www.cray.com/products/computing/cs-
series?tab=cs_seri...](http://www.cray.com/products/computing/cs-
series?tab=cs_series) ). Formerly Appro Cluster Engine.

Complete control and monitoring of cluster with either a CLI or GUI. Scalable
monitoring with negligible impact on running workloads, including global
synchronization of metric collection times, to minimize jitter. Ganglia front-
end, but without the overhead of gmond/gmetric running on nodes. Validated as
scaling well on an 8,000 node cluster.

Full disclosure: I designed and implemented the monitoring system.

------
random55643
Sensu!

~~~
click170
Sorry - accidental down vote while trying to upvote on mobile.

~~~
bryanlarsen
There's now an 'undown' link you can click to undo accidental downs.

~~~
click170
I don't see an 'undown' link anywhere, even when clicking on the timestamp of
the comment I replied to. _shrugs_ #PoorUI

------
librato
Hey there, Librato here ([https://www.librato.com/](https://www.librato.com/))

Welp, nobody can blame you for wanting to get away from Nagios. It’s certainly
a tool from a different, simpler era and hasn’t aged well in our opinion.

As a push-based metrics solution, Librato is probably a lot different than
what you're used to. But don’t worry: we're super easy to get up and running
with, and obviously you no longer need to worry about maintaining or scaling
infrastructure. Also, unlike with some other solutions, you can use us with
your existing toolchain (it’s easy to plug us into your existing Nagios
infrastructure to try us - the trial is free & full-featured).

We’re a hosted metrics platform, meaning you can send metrics of any type and
amount you want. We’re functionally similar to Graphite+Grafana, except we do
all the work of scaling and management for you so you can focus on the metrics
themselves. We provide alerting and other useful bits out of the box (things
that are not trivial to setup yourself, e.g., bolting together
collectd+Graphite+Grafana+statsd+flapjack+kitchen sink and hoping it scales
and doesn’t fall over). We’ve got an agent that comes with a bunch of turn-key
integrations too, to make it super easy for you to monitor what you care
about.

As to pricing, we're the only hosted monitoring system that will just charge
you for what you actually USE. You pay pennies per metric metered by the hour,
instead of a per-node model, which gets crazy expensive and inefficient for
modern ephemeral infrastructure. For example, if all you're doing is
integrating us with AWS CloudWatch to monitor some EC2 instances and an RDS
instance, we can do that for effectively a $1-$2 an instance. We also have an
agent you can install on your servers if you want more detailed metrics, which
adds $5-10 per instance depending on how many metrics you enable. Our customer
success team (email support@librato.com, or the Help chat window if you
already have a Librato account) will be more than happy to walk you through
any permutation of our pricing and the details of the model to help you better
understand it.

As mentioned, you can try us out for free--no credit card required:
[https://www.librato.com/](https://www.librato.com/)

------
hendratj
We are using Wavefront at Doordash and have been very happy with it. Setting
up is super easy, the UI is easy to use, and they have never had a major
outage. Definitely something you can try out.

------
calvinx408
I used to use nagios and migrated to sensu for system checks. I was using
graphite/seyren for time series and alerting, but doing a YoY or week-over-week
comparison was very slow, especially across a lot of metrics. You should look at
[http://wavefront.com](http://wavefront.com)

You can do some nice math functions for your alerts.
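
For anyone unsure what a week-over-week comparison means in an alerting
context, here is a rough sketch in plain Python. It is not Wavefront or
graphite syntax; the function name, the data layout (timestamp -> value) and
the 25% threshold are assumptions for illustration only.

```python
WEEK_SECONDS = 7 * 24 * 3600


def week_over_week(series):
    """Relative change of each sample versus the sample one week earlier.

    `series` maps a unix timestamp to a value. Samples with no baseline
    (or a zero baseline) exactly one week earlier are skipped.
    """
    changes = {}
    for ts, value in series.items():
        previous = series.get(ts - WEEK_SECONDS)
        if previous:
            changes[ts] = (value - previous) / previous
    return changes


if __name__ == "__main__":
    requests_per_min = {0: 100.0, WEEK_SECONDS: 150.0}
    changes = week_over_week(requests_per_min)
    # Flag timestamps whose value moved more than 25% against last week.
    alerts = [ts for ts, change in changes.items() if abs(change) > 0.25]
    print(changes, alerts)
```

The slow part in practice is doing this lookup across many series and long
time ranges, which is where the backing store matters.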

~~~
smoodles
+1 to this as long as you are ok with an external vendor.

A couple of caveats. If you are coming from Nagios, this is a different
worldview on monitoring. Like many other solutions commented here this is all
based around metrics and their associated time series, and then you need to
alert on those metrics. You ask the system questions with a time series query
language.

Wavefront doesn't yet have a great solution for poll-based monitoring (i.e.
hitting host X's /healthcheck endpoint) so I still use terrible ol' Nagios for
that in my environment. However the rest of my work is all done in Wavefront -
I'd say easily the high 90s percent of all my material alerts are done in
Wavefront, with a small subset of work done in Nagios.
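
By poll-based I mean something along these lines. This is just a sketch of the
idea using the Python standard library, not a Nagios plugin or anything
Wavefront ships; the function name and the "any 2xx means healthy" rule are my
own assumptions.

```python
import http.client
import urllib.error
import urllib.request

# Ignore proxy environment variables so the check hits the host directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_health(url, timeout=2.0):
    """Return True if `url` answers with a 2xx status within `timeout`."""
    try:
        with _OPENER.open(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or a 4xx/5xx response.
        return False


if __name__ == "__main__":
    # Hypothetical host name; replace with a real endpoint to try it.
    print(check_health("http://host-x.example/healthcheck"))
```

A poller would run this on a schedule for every host and either alert directly
or push the result into the metrics system as a 0/1 series.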

The killer feature here is the query language. I don't think there is anything
else on the market that has its level of sophistication. I've had ex-Googlers
on my team who "grew up" with Borgmon, which is in some sense the Ur-time
series monitoring system and they loved it.

All this said, there are a lot of options out there. I have a strong bias
against supporting my own complicated monitoring infrastructure. I want to
focus on my own product. If you don't share that opinion or are on a super
duper tight cash budget (but you do have time) then disregard the above ;)

~~~
bbrazil
> The killer feature here is the query language. I don't think there is
> anything else on the market that has its level of sophistication. I've had
> ex-Googlers on my team who "grew up" with Borgmon, which is in some sense
> the Ur-time series monitoring system and they loved it.

Prometheus is inspired by Borgmon, and has a query language that is unmatched
by almost everything else I'm aware of.

Are there public docs on the semantics and features of the Wavefront language
so I can compare?

------
andrenth
Has anyone tried Vector [1] by Netflix?

[1] [http://vectoross.io/](http://vectoross.io/)

------
cliffmoon
Give opsee a try [https://opsee.com/](https://opsee.com/)

------
arjan_sch
AppSignal is also a cool product, although mostly focused on Rails
applications. And, a big plus, they are working on an Elixir integration :-)
[https://appsignal.com/elixir](https://appsignal.com/elixir)

------
walrus01
OpenNMS. It is a Java memory pig but is used by some of the twenty largest
ASNs (by CAIDA ASRANK) in North America. Truly open source and free. Very
extensible. Large development community behind it and many constant updates.

------
walkingolof
If you want monitoring plus automation and remote management check out
[http://www.kaseya.com/](http://www.kaseya.com/)

------
louwrentius
It depends on what you want to achieve, but Nagios works well and Graphite for
trending is also quite useful.

------
jaytaylor
If you are okay with something which you don't have to run yourself-

The winner IMO is dataloop.io [0].

Dataloop is a SaaS monitoring solution that is super easy to get up and
running and has tons of fantastic features and capabilities. The team behind
it is stellar and their pricing is reasonable.

10/10, will continue to use again and again :)

[0] [https://dataloop.io/](https://dataloop.io/)

------
imperialdrive
PRTG

[https://www.paessler.com/prtg](https://www.paessler.com/prtg)

------
postwait
If you care about correctness of data, solid data retention and good analytics
(prediction, forecasting, etc.) then you should take a look at Circonus.

[http://www.circonus.com/](http://www.circonus.com/)

500 metrics accounts are free for life.

Built by SREs for SREs.

~~~
misframer
postwait is the founder of Circonus.

------
Daviey
Does anyone have thoughts on ITRS Geneos?

------
bmaeser
for small setups munin and (m)monit are still my go-to tools.

dead simple, easy to configure and very reliable

------
user5994461
disclaimer: I evaluated most of these tools and wrote a blog post here.
[https://thehftguy.wordpress.com/2016/04/18/monitoring-in-
the...](https://thehftguy.wordpress.com/2016/04/18/monitoring-in-the-cloud-
datadog-vs-server-density-vs-stackdriver-vs-bmc-boundary-vs-newrelic/)

It's a bit old and I'll update it later, but here is the short summary with all
the latest tools:

### Free (as in open-source) shitty options: icinga, nagios, riemann

They suck so much they're not even worthy of having their names written.

###

The other open-source option is prometheus.

I didn't try it personally but I've had candidates interviewing at my company
who talked at length about their experience with it and they were satisfied.

I read the whole documentation and it's better than the old shitty tools but
it's still not great. Be aware that it has many limitations by design, they
skipped all the hard stuff (single node only, no HA, pull-mode only for
metrics).

\----

The new SaaS tools (ordered by maturity), all $10-20 per host, they're mostly
copy-cats:

Datadog, BMC truesight pulse (Boundary), signalfx, wavefront, server density.

Datadog is the best option. It's older (about 5 years) and more mature. It has
the most features and integrations. It's really the next generation of
monitoring.

BMC truesight pulse is the historic competitor. It was a startup called
"Boundary" that was bought by BMC, and BMC rebranded the product. That's about
the same thing. Not sure what the acquisition may or may not have changed.

SignalFX is a direct copy-cat of datadog (and BMC). But it came later so it's
lacking in features and integrations.

Wavefront is an even later copy-cat of datadog and signalfx. Except it has no
public price nor public trial. You have to contact them and go through sales
for anything. (Honestly: just ignore wavefront. There are 3 direct
competitors who are better and more accessible.)

ServerDensity: Don't bother trying. The website is buggy, it fails to load
pages very often. The product is not even finished and lacks 80% of the
competitors' features. The company will probably die soon. (sorry for their
employees who are commenting here and reading that :\ )

[Google] StackDriver: It was another company that was acquired by Google 2
years ago. Currently, it's dead and it's being integrated into Google's
offerings. It might be great when it comes back (probably this year, there
seems to be a closed beta run by Google at the moment).

### Current status-quo:

Datadog beats everything by a wide margin. More mature, more features, more
integrations. It has the advantage and it's evolving faster. That's the
horse you have to put your money on (I did).

You can try the competitors (either BMC or signalfx) if you wanna play around
or just tickle datadog sales team to get a better price (I did) :D

### Far future:

There might be a market rupture within 1-2 years when Google finally releases
StackDriver. It had some quite advanced stuff and great reviews when it was
acquired. It's the only one that might be able to catch up with datadog and
provide the very advanced stuff that doesn't currently exist (e.g. outlier
detection done right).

If and when Google finally offers GCE (cheaper & faster than AWS) + kubernetes
(docker and infrastructure on steroids) + StackDriver (complete monitoring AND
logging solutions), they will be the best IaaS provider on the planet by a
wide margin. The evolutions brought by these tools will allow me to do the
work of 3 infra/SRE guys all by myself.

~~~
bbrazil
> they skipped all the hard stuff (single node only, no HA

We skipped the hard stuff on purpose, as hard stuff is extremely tricky to get
right and liable to fail right when it's most needed. See
[http://www.robustperception.io/monitoring-without-
consensus/](http://www.robustperception.io/monitoring-without-consensus/)

Per the above, there is HA.

------
homero
Newrelic

------
0xdeadbeefbabe
Not icinga and not boost

------
syngrog66
this isn't the proper way to learn that.

------
thdn
grafana ??

------
tankfeeder
netxms

~~~
tankfeeder
monitoring system too.

------
sickeythecat
Plenty of folks building large scale custom monitoring solutions with InfluxDB
(plus Grafana, Collectd, Telegraf etc) -
[https://influxdata.com/testimonials/](https://influxdata.com/testimonials/)

------
eldavido
I know I might come off as trolling, but try to get out of the business of
managing servers. I've done it, it sucks, and I don't ever want to do it
again.

Get everything containerized and use a container runtime like ECS, unless
you're operating in analytics, adtech, or something else with extreme
storage/compute/network requirements.

