Also take a look at Riemann, which is a system monitoring tool written in Clojure. Riemann should be good for monitoring the latency of the system.
If it helps, here is a slide deck from Spotify on how they do their monitoring.
I have never been very keen on alerting dashboards, I find they are rarely actually reviewed and flash red for days or weeks. :) So I only covered metrics/graphing as a console rather than a status console. If you want to add such a console it'd be easy to output Riemann events via an API to such a console.
Glad you enjoyed the book!
For any sufficiently advanced monitoring system, you're going to have to learn some form of DSL/language to take advantage of it.
I wouldn't consider this to be a major issue, more part of the irreducible complexity of the problem. The Riemann examples I've seen all seemed pretty readable, and routing alerts is not a world away from what you'd be doing in, say, a Prometheus Alertmanager config; just with S-expressions rather than YAML.
For highly sophisticated environments, https://www.datadog.com is a very advanced product.
Both are based off my original agent: https://github.com/serverdensity/sd-agent (DataDog forked it in 2010 and we forked them back last year).
We're also behind http://www.humanops.com trying to build monitoring that also helps you run on-call and ops teams generally in a way that considers fatigue, stress and the realities of the humans running IT systems! E.g. https://blog.serverdensity.com/introducing-alert-costs/ and https://opzzz.sh
I've been pushing for that to be included in more scheduling systems for production workers. Very awesome to see it considered in your work! :)
It uses Nagios under the hood, it's basically an automation system that generates those Nagios systems. The GUI is amazing, because it uses a plug-in so you don't have to edit files on disk to group your hosts or tweak the alerts. Those configs are snapshotted automatically at every change, and you can replicate that configuration automatically to remote servers. Download it from the upstream site instead of relying on distro package repositories.
Caveat, the documentation sucks, the GUI can be nonintuitive and it's hard to Google problems. It takes time to fully tune. Out of the box you'll probably still be impressed though.
The English documentation isn't that great, the German one is better. That being said, it's mostly self-explanatory and all checks are very well documented in man pages.
My favorite features:
- Auto discovery for literally everything, including SNMP interfaces.
- Fine-grained rule system for customizing check thresholds and parameters
- All configuration is automatically versioned and you can integrate it with Git - this includes the changes you make in the web interface.
- It's very easy to set up a distributed monitoring system (multisite) with a central node which aggregates all states and replicated configuration changes, yet each site is fully autonomous.
- The agent takes zero network input, so no attack surface.
- Even though it's extremely featureful, its architecture is very simple and it's easy to contribute code and write custom checks.
Their Git is public: http://git.mathias-kettner.de/git/?p=check_mk.git;a=shortlog...
It works well with Naemon and Nagios 4. Been using it for a number of projects, ask me anything!
Monitoring systems not to use:
- Shinken (zero security awareness and dishonest PR; when I tried it, it had so many bugs that I wouldn't ever trust it to keep my data safe)
- Zabbix (it has some brilliant features, but the architecture is a mess, configuration and time series stored in a MySQL database which is hard to manage and automate, I found it cumbersome to debug, written in PHP)
I like Icinga a lot. I won't bother reviewing it; it is very well known. Professionally, my last two gigs have used Zabbix.
Zabbix, architecturally, is a nightmare. Uses an RDBMS for storing time-series data, so it wastes a ton of space on historic data while managing to be far slower than it needs to be when querying larger ranges. Uses an agent. Has a proxy-agent that, while handy, encourages all sorts of sketchy, error-prone monitoring topologies. With 3.0, the UI has crawled out of the awful range, and is now merely annoying. Takes the all-singing, all-dancing monolithic approach for the main app, including features for drawing maps on big-screens.
For all that, it works well. Give it the hardware it wants, be sane in setting it up, ignore the goofy features (maps, inventory, screens - I guess someone must have requested those), and it is very solid and very powerful.
The template system, pseudo-language for triggers, naming convention for variables and method of creating custom monitors take some getting used to. Expect to take the time to actually read the docs, and most likely to throw out your templates the first time you model your systems.
Can you afford time but not money? Try Sensu or Nagios.
Do you have money and not time? Try datadog.
Like someone else mentioned here, if you're looking to alert off of logs from ELK, try Elastalert.
If you have money and no time: DataDog
If you don't mind putting in a little time: Sensu
Sensu is straightforward to deploy if you use Chef/Ansible/Puppet. It also supports running Nagios plugins which is pretty useful.
I also disagree that setting up Sensu takes a "little" time. What is a "little" to an inexperienced Sensu administrator? A day? A week? Several weeks? Quantifying it would be valuable to the reader.
Even if you don't have access to the server so you can monitor it, you can use the "host" concept as containers for your services: "api.mycorp.com", "tasks.mycorp.com", "backups.mycorp.com" are great starting points.
When you monitor state of a cluster (e.g. node count), you don't have a
server, you have plenty of servers and a cluster (completely different
When you monitor temperature in your server room, you don't have a server,
you have a server room.
When you monitor exchange rate, you don't have a server.
When you monitor a website, you still don't have a server.
And now add all the AWS Lambdas and other serverless rage.
The notion that everything works on a (single!) server was never valid, and today
it's even more visible than it was twenty years ago, when Nagios was state of
the art.
What matters is that the alert about the issue is raised and relayed to the proper notification channels. Since Sensu doesn't concern itself with a fancy dashboard, it doesn't really matter if the alert pertains to the host or not.
Any decent monitoring will have customized the alert handling based on what's alerting, so there's some amount of post-processing possible.
In practice it doesn't matter if you name a file handle "juju" and a database
query "peach" in your code.
It's a matter of calling things what they are instead of forcing them into
a mismatched data scheme by creating artificial hosts.
Borg inspired Kubernetes. Borgmon inspired Prometheus. So naturally it works well together with a dynamically scheduled world.
Good documentation, UI, many, many plugins and fair pricing (IMO).
(I'm not affiliated with them in any way other than using their product on a pet project with many moving parts.)
As far as Datadog goes, it's the most team friendly dashboard system we've used. We had a specialty monitoring system for one application stack previously, and no one made custom dashboards there or even just looked at the data. Now we've got custom dashboards out the nose and we're gradually consolidating to a "best of" dashboard for each service.
Are you affiliated with Wavefront?
(I've learned not to trust numbers that seem too good to be true unless they're contractually obligated.)
Wavefront doesn't publish pricing, but if we take Librato's pricing as a general indication you're talking several million dollars a month.
I used it up until two months ago when I left that job. There were ~2000 servers monitored, I think.
I'm asking because I will be designing a similar page soon (that's also billed per host) and I'd like to avoid the same mistakes.
1. VERY clearly state that when you sign up for the service, then you are on the hook for up to $18*500 = $9000 + tax in charges for any month. Even Google compute engine (and Amazon) don't create such a trap, and have a clear explicit quota increase process.
2. Instead of "HUGE $15" newline "(small light) per host", put "HUGE $18 per host" all on the same line. It would easily fit. I don't even know how the $15/host datadog discount could ever really work, given that the number of hosts might constantly change and there is no prepayment.
3. Inform users clearly in the UI at any time how much they are going to owe for that month (so far), rather than surprising them at the end. Again, Google Cloud Platform has a very clear running total in their billing section, and any time you create a new VM it gives the exact amount that VM will cost per month.
4. If one works with a team, 3 is especially important. The reason that I had monitors on 50+ machines is that another person working on the project, who never looked at pricing or anything, just thought -- hey, I'll just set this up everywhere. He had no idea there was a per-machine fee.
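The arithmetic behind points 1 and 3 is small enough to sketch. This is only an illustration: the function names are made up, and the billing rule in `charge_so_far` (every distinct host seen in the month is billed at the full per-host price) is an assumption for the example, not a statement of how any vendor actually bills.

```python
def max_monthly_charge(price_per_host, host_cap):
    """Worst-case pre-tax bill if every host allowed by the cap is billed."""
    return price_per_host * host_cap


def charge_so_far(price_per_host, hosts_seen):
    """Running total for the month.

    Assumption (hypothetical billing rule): each distinct host that has
    reported at least once this month is billed at the full per-host price.
    """
    return price_per_host * len(set(hosts_seen))


if __name__ == "__main__":
    # The exposure from point 1: $18 per host with a 500-host cap.
    print(max_monthly_charge(18, 500))
    # A running total like the one point 3 asks the UI to show.
    print(charge_so_far(18, ["web-1", "web-2", "web-1"]))
```

Showing the second number somewhere a teammate will see it before installing the agent everywhere is the whole point of 3 and 4.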
This is yet another point where DevOps is not "devs doing ops" but "operations building and deploying with all the tools of modern software development". You need a subject matter expert.
What are you monitoring? Do you care about availability or performance or both? Scale? Do you have services or servers? Do you manage the underlying hardware? Do you need to track which hardware boxes have which VMs or containers?
There are a million questions to answer. One big set of them: what do you dislike about Nagios? Make sure that you don't get those problems with the next one, but also make sure you get something that does what you need as well as what you want.
Benefits: shared expertise. Common language. Propagation of alerts up from hardware and down from services. Better root cause analysis. If you have a good culture, faster resolution time and better understanding.
Spot on. Too many people think that monitoring is about slapping a piece of code on some hosts.
Monitoring is data science.
My preferred method is Icinga2 (a Nagios clone with better configuration and clustering built-in) with reports coming in via passive NSCA. Toss in Graphite (or I'm warming up to Grafana on Influx) with some ability to write alerts against those reported metrics, and you're as close to ideal as I can come up with.
Of course, that requires a fair bit of up-front knowledge to stand up and operate, but they're so rock solid (and scale like mad) I have a hard time not recommending them.
Why in the world would you want to manage (which I'm reading as throttle) the ingest rate of a monitoring service? That strikes me as a recipe for missing important events.
Monitoring should be a fairly stable load rate. If you're setting up monitoring, and your machine can't handle the load of all the data points coming in, you need to shard out, not drop data points.
> Active polling scales _better_ than the alternative
Exceptional claims require exceptional evidence. Personally, I have never heard of anything that is actively pinging outside services performing better than receiving and processing data passively.
Prometheus will scale better than Nagios running active checks since it won't be using subprocesses, but it is still going to require more overhead than a service receiving passive reports.
A single big Prometheus server can easily do millions of time series, and in an older record, 800,000 samples stored per second. You could monitor e.g. 10,000 hosts with that with quite some detail, and the bottleneck is still not the pull aspect.
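To put those numbers in per-host terms, here is a back-of-the-envelope calculation. The 15-second scrape interval is an assumption picked for illustration; it is not a figure from the record mentioned above.

```python
def series_per_host(samples_per_second, hosts, scrape_interval_s):
    """Time series each host can expose if the ingest budget is split evenly.

    Each series yields one sample per scrape, so a host scraped every
    `scrape_interval_s` seconds contributes series / interval samples per second.
    """
    samples_per_host_per_second = samples_per_second / hosts
    return samples_per_host_per_second * scrape_interval_s


if __name__ == "__main__":
    # 800,000 samples/s across 10,000 hosts, scraped every 15 s (assumed).
    print(series_per_host(800_000, 10_000, 15))  # 1200.0
```

Under that assumption each of the 10,000 hosts could expose around 1,200 series, which is the "quite some detail" part.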
We also have an FAQ in Prometheus about why we prefer pull: https://prometheus.io/docs/introduction/faq/#why-do-you-pull...?
In my experience, pulling is operationally much nicer than pushing, and I've worked with both. It also gives you somewhat less of an accidental DDoS exposure.
> [Graphite+Icinga] they're so rock solid (and scale like mad) I have a hard time not recommending them.
As a Prometheus developer I have seen a significant number of users who moved from Graphite because they found it doesn't scale and was far from rock solid for them, requiring regular manual care and feeding. By contrast Prometheus seems to be working pretty well for them at what we would consider to be a moderate load. I've heard similar about Nagios/Icinga.
Push vs. pull is largely not relevant for scaling (pull is slightly better in this regard, but only slightly). I've been involved with some extremely large scale monitoring systems, and the fact they were pull was never relevant to scaling them.
May I ask what you consider to be a high level of scale?
> It won't scale without a number of workarounds.
There are very, very few systems that will scale without workarounds. That's the nature of scaling a non-trivial system.
If you need to shard, no other binaries are involved, just a couple of changed configuration files and some SSL certs. Passive reports can go to any node; a simple load balancer will work fine without any form of hashing or federation required.
> There is nothing inherent in a polling model that limits its ability to scale
Aside from the additional processing requirements, such as SSH, NRPE, subprocesses, etc, you are also limited in that a polling process must be told in advance of new systems or services, whereas it's fairly easy to just have a new service or system start reporting and be immediately monitored.
How do you re-combine and alert on data for one service that got reported to multiple nodes? This would imply to me a database of some form, which is going to need hashing or federation or other distributed systems approaches to be scaled out.
> Aside from the additional processing requirements
Push or pull, there's additional processing requirements when you scale. It's the same bytes on the wire and broadly the same amount of processing power required.
> you are also limited in that a polling process must be told in advance of new systems or services
That's not a scaling limit, that's a more fundamental issue that isn't different between push and pull.
For a push system you need to have a list of all systems and services in order to be able to alert on systems that never reported, or are no longer reporting.
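A minimal sketch of that point: in a push setup, noticing a target that never reported (or went quiet) still needs an inventory of what is expected. The names and the data shapes here are hypothetical, chosen only for the example.

```python
def silent_targets(expected, last_seen, now, max_age_s):
    """Return expected targets that never reported or have gone stale.

    expected:  iterable of target names that should be reporting
    last_seen: dict mapping target name -> timestamp of its last push
    now:       current timestamp, in the same units as last_seen
    max_age_s: how old the last report may be before we alert
    """
    silent = set()
    for target in expected:
        seen = last_seen.get(target)
        # Never reported at all, or last report is older than allowed.
        if seen is None or now - seen > max_age_s:
            silent.add(target)
    return silent


if __name__ == "__main__":
    expected = {"api", "db", "cache"}
    last_seen = {"api": 100, "db": 10}  # "cache" has never pushed anything
    print(sorted(silent_targets(expected, last_seen, now=110, max_age_s=60)))
```

Without the `expected` set there is nothing to compare against, so a host that never started reporting is simply invisible.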
"Pull works fine at 2 million machines" is a great statement to make, but it would be much stronger with more details. How many machines are doing the pulling? How often? Are they using subprocesses or threads or greenthreads? How do they handle timeouts? How many metrics per machine? How many pulls per metric grab?
I'm also interested in Prometheus but haven't gotten to try it out yet. Anyone reading this have experience with both? How do they compare?
You want metrics from counters you build in your app? (see statsd?)
You want to aggregate and do analysis on logs? (see ELK stack?)
You want to monitor cloud infrastructure? (see stackdriver?)
You want to run end to end tests on your application to ensure it's behaving? (see runscope?)
As your application grows, you probably want a blend of tools to see inside your app.
Should have mentioned how well it pairs with Grafana ;)
Applications are changing dramatically and rapidly. With continuous delivery, the microservice approach, containers and orchestration tools, things are all over the place, and you might have a component spun up and down within a few minutes.
Humans cannot keep up with the data, and it doesn't make any sense to stare at a big screen full of data, looking at charts all day and trying to visually correlate them.
The correlation of data is becoming harder and harder as systems are more and more resilient. There's, therefore, no unique root cause anymore (https://www.instana.com/blog/no-root-cause-microservice-appl...).
At Instana we're re-defining what monitoring means. We're moving the bar from visualizing data to providing a plain-English explanation of what's going on, together with suggestions for remediation.
Instana's 3 main values are:
- Automatic Discovery: dynamically models the architecture of infrastructure, middleware and services
- Automatic QoS Analysis: continuously derives KPIs of all components and services and alerts on incidents
- Integrated Investigation: visualizes in real-time physical and logical architecture, compares over time, suggests fixes and optimizations.
Happy to get feedback and provide more info.
Most of the tools mentioned in this thread, including Datadog, SignalFX etc., use a simple agent to collect data - see the Datadog agent on GitHub: https://github.com/DataDog/dd-agent or statsD (https://github.com/etsy/statsd), which is mostly recommended by SignalFX, who have no agent of their own. Tools like Prometheus work similarly.
On the backend side you can see two approaches to data store technology: a time-series-based approach like DataDog or Prometheus, and a streaming-based approach like SignalFX - streams are the superior approach in my view, as they allow for realtime approaches and stream (window-based) analytics. There is a third category, which is similar to time series but more "log"-centric, like the ELK stack or tools like Splunk.
On top of the data store these tools give you the ability to build your own dashboards (and provide standard dashboards for standard technology) and alerting based on thresholds. They also allow you to add your own metrics via an API, which can be used to add application-specific data. They also give you a query API to query and combine the data in the store. So overall this is a Lambda architecture for monitoring data.
I would say that SignalFX is the most sophisticated one, but the framework for working on streams is much more complicated than DataDog's time series approach, so people go the easier way.
The problem with all of these tools is that they rely on the user to build dashboards, thresholds and in case of a problem do the correlation to find the root cause of the problem.
To correlate you need to understand the dependencies of the system components. As an easy example, if service A has a performance issue because it calls service B, which has a CPU problem, you need to know that A calls B and correlate the latency of A with the latency and CPU of B to find the root cause. You can discover/model dependencies with tools like Zipkin (https://github.com/openzipkin) or Spring Cloud Sleuth (https://cloud.spring.io/spring-cloud-sleuth/), which are based on the Google Dapper paper. You could even add or log the Span ID to the metrics/logs so that you can correlate them automatically.
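The A-calls-B example can be sketched as a walk over a call graph. This is a toy illustration, not how Zipkin, Sleuth or any product does it: the graph is a hand-written dict (in practice it would come from tracing data), and "unhealthy" is just a set of service names.

```python
def root_cause_candidates(calls, unhealthy, start):
    """Walk downstream from `start` and return the unhealthy services that
    have no unhealthy dependency of their own.

    calls:     dict mapping service -> list of services it calls
    unhealthy: set of services currently showing a problem
    start:     the service where the symptom was noticed
    """
    seen, stack, candidates = set(), [start], []
    while stack:
        svc = stack.pop()
        if svc in seen:  # guard against cycles in the call graph
            continue
        seen.add(svc)
        deps = calls.get(svc, [])
        # An unhealthy service whose dependencies all look fine is a
        # plausible origin; otherwise keep following the unhealthy chain.
        if svc in unhealthy and not any(d in unhealthy for d in deps):
            candidates.append(svc)
        stack.extend(deps)
    return candidates


if __name__ == "__main__":
    calls = {"A": ["B", "C"], "B": [], "C": []}
    # A is slow and B has a CPU problem; C is fine.
    print(root_cause_candidates(calls, unhealthy={"A", "B"}, start="A"))
```

Here the walk skips A (its dependency B is also unhealthy) and reports B, which matches the reasoning in the example above.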
Typically if you do so manually it is a disaster for change. All your correlations (and even dashboards) will not work if the topology of your services changes. Which is quite normal in the microservice world.
Instana uses a stream-based approach similar to SignalFX BUT we combine this with a graph database that holds the dependencies of all physical and application components. Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime - including containers etc.
We then use the Google Four golden signals + Capacity (that was added by Netflix as the fifth one) to analyze the KPIs of the services and apply machine learning on it. That way we don't need manual thresholds which are also hard to maintain when things change a lot. If we see e.g. slow response times or sudden drops in requests or high error rates, then we analyze the dependency tree of that service to find the issues that are related to the problem and generate an incident for that - as we also discover changes, we add them to the incident as most often a change is the reason for a problem. I've written a blog entry on the Dynamic Graph: https://www.instana.com/blog/monitoring-microservice-applica...
Hope this answers your question.
I'd see them as slightly different approaches to providing fundamentally the same solution. One builds up time series and then operates on them, the other operates on the time series as they come in.
Taking Prometheus as an example we're a time series database, and you can do both realtime and window-based analysis. In fact that's how it is usually used.
> I would say that SignalFX is the most sophisticated
Do you have an example of something that you can do with your streaming approach that's not possible with other tools?
It's hard to get a proper understanding of the myriad of monitoring systems out there, so I'm always looking for insights.
> Our agent automatically discovers all the components and dependencies and adds them to the graph in realtime.
That sounds interesting, how do you do that for network dependencies? Do you have something like Zipkin?
My point was more about the framework you get and how easy it is to apply analytics to streams/queries. SignalFx seems to have a nice workbench for this with direct visual feedback in the UI, so that you can work on existing data to get the right result.
As said, we at Instana think that most people will not be able to build a sophisticated monitoring solution with these types of frameworks, as they don't have the time to do it and maybe not even the analytical domain knowledge. You can see that SignalFx is adding specific knowledge for some technologies. I'll give you two simple examples to show that it is not easy:
- How would you predict if a file system is running out of disk space?
- How would you predict if you should add a node to a Cassandra cluster because it is running out of capacity (and it can take some serious time to add a node, so you should know in advance)?
Already the disk space problem is hard to solve - linear regression and basic algorithms will not work.
Now think of hundreds (or thousands) of services running on a dynamic container platform and new services released on a daily or even minute basis - with lots of different technologies involved...
No question that you can build a good monitoring solution with Prometheus, SignalFX, DataDog etc - but it will take a serious amount of time, consulting and dev teams involved adding the right instrumentation, metrics etc. And you need a lot of analytical knowledge. I can even imagine that there are situations where tools like Prometheus are a better choice - especially if you have a very strict set of technologies and communication framework and really good people to do a very specific set of "rules" for this environment.
We've added a domain model to our product (all the mentioned products have a generic metric model, but no semantics that describe servers, containers, processes, services and their communication, which is the domain of system and application monitoring): our Dynamic Graph.
And yes, we are using something very similar to Zipkin to get the dependencies between services. Here are two blog entries describing the approach:
- About distributed tracing: https://www.instana.com/blog/evolution-tracing-application-p...
- How we safely instrument code: https://www.instana.com/blog/how-instana-safely-instruments-...
Wavefront does as well; I'd recommend you compare it for competitive analysis.
So would you say your product is in direct competition with these offerings, or do you see it more as a complement to them?
Competition depends on the use case - if you are using a tool like SignalFX for custom metric analytics, then we are no competition, as our focus is monitoring of applications and their underlying infrastructure.
We are an Application Performance Management (APM) solution and therefore compete more with tools like New Relic or AppDynamics. These tools are sadly only used for troubleshooting in 90% of the cases and not for management or monitoring. They also do not work in highly dynamic and scaled environments, as their "model" is too static (which they try to fix with their analytics offerings).
This is what we want to change, and where we add the whole stack to the game to analyze all the dependencies, help find root causes quickly, and monitor and predict the KPIs of your applications, services, clusters and components.
We integrate with solutions like SignalFX if needed, but I have had really good experiences doing "dashboarding" with more business-related tools like Tableau or QlikView - this also offers application owners an easier way to aggregate the monitoring data and metrics on a higher (business) level, where tools like Instana offer the instrumentation data as an input.
> To get real time insight into your running applications you need to instrument them and collect metrics: count events, measure times, expose numbers. Sadly this important aspect of development was a patchwork of half-integrated solutions for years. Prometheus changed that and this talk will walk you through instrumenting your apps and servers, building dashboards, and monitoring using metrics.
Abstract - https://us.pycon.org/2016/schedule/presentation/1601/
Slides - https://speakerdeck.com/hynek/get-instrumented-how-prometheu...
Video - https://www.youtube.com/watch?v=b-qLOY5ChnQ
Some of the key items we liked were:
* Able to consume millions of metrics per second. This is pretty huge. While we're not even close to that much (11k/s at the moment), we expect that number to triple or quadruple in the next year.
* Fast. Wavefront renders graphs quickly. The ability to manipulate the data in real time has been impressive.
* Feature requests. Wavefront has been receptive to ideas from their customer base. They even have a voting system in their community page if other customers like a certain request.
* Support has been great. Questions on issues or general technical guidance have been handled quickly, within the hour.
* Docker ready. Already using Wavefront with our emerging docker infrastructure.
* Engineers are self sufficient. Before, Tech Ops had to do all the monitoring for new services. With technologies such as docker, our engineers are capable of setting up monitoring within the application to directly send to Wavefront. This offloads quite a bit of work from Tech Ops.
No, I'm not affiliated with Wavefront. We just use their monitoring service.
The configuration was a bit of an initial hurdle when coming from icinga 1 / nagios - the config syntax is essentially an EDSL for programming your monitoring requirements - but the flexibility is worth it. Adding new hosts and services is pretty cheap (programmer-time-wise), and I can use whatever programming constructs and conditions I want to decide what services to apply to which hosts in which measure.
That said, it's still in a bit of a young state and some parts are very rough around the edges - for example, icinga 2's dependency model is a bit naive. You can configure email notifications to ignore notifications for services that depend on a different failed host/service, but this only applies if icinga already knows about the dependency having failed. So when a parent service dies, an extra e-mail notification could be generated for each of its children before icinga realizes the parent has also died and stops sending notifications for them.
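The ordering problem described above can be shown with a toy model. To be clear, this is not Icinga 2's implementation or configuration syntax; it is a hypothetical sketch in which a child's notification is suppressed only if its parent is already known to be down when the child's check result arrives.

```python
def notifications(events, parent_of):
    """Return the names that would trigger a notification, in order.

    events:    ordered list of (name, state) check results, state "up"/"down"
    parent_of: dict mapping a child name -> the name it depends on
    """
    known_down = set()
    sent = []
    for name, state in events:
        if state != "down":
            known_down.discard(name)
            continue
        known_down.add(name)
        # Suppress only if the parent's failure has already been observed.
        if parent_of.get(name) in known_down:
            continue
        sent.append(name)
    return sent


if __name__ == "__main__":
    parent_of = {"web": "db", "api": "db"}
    # Child result arrives before the parent's failure is known.
    print(notifications([("web", "down"), ("db", "down"), ("api", "down")], parent_of))
    # Parent failure is known first, so both children are suppressed.
    print(notifications([("db", "down"), ("web", "down"), ("api", "down")], parent_of))
```

In the first ordering "web" still notifies because nothing yet says "db" is down, which is the extra e-mail the comment describes.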
tl;dr I had fun setting it up and it works well for us, but expect some quirks
Complete control and monitoring of cluster with either a CLI or GUI. Scalable monitoring with negligible impact on running workloads, including global synchronization of metric collection times, to minimize jitter. Ganglia front-end, but without the overhead of gmond/gmetric running on nodes. Validated as scaling well on a 8,000 node cluster.
Full disclosure: I designed and implemented the monitoring system.
Welp, nobody can blame you for wanting to get away from Nagios. It’s certainly a tool from a different, simpler era and hasn’t aged well in our opinion.
As a push-based metrics solution, Librato is probably a lot different than what you're used to. But don’t worry: we're super easy to get up and running with, and obviously you no longer need to worry about maintaining or scaling infrastructure. Also, unlike with some other solutions, you can use us with your existing toolchain (it’s easy to plug us into your existing Nagios infrastructure to try us - the trial is free & full-featured).
We’re a hosted metrics platform, meaning you can send metrics of any type and amount you want. We’re functionally similar to Graphite+Grafana, except we do all the work of scaling and management for you so you can focus on the metrics themselves. We provide alerting and other useful bits out of the box (things that are not trivial to setup yourself, e.g., bolting together collectd+Graphite+Grafana+statsd+flapjack+kitchen sink and hoping it scales and doesn’t fall over). We’ve got an agent that comes with a bunch of turn-key integrations too, to make it super easy for you to monitor what you care about.
As to pricing, we're the only hosted monitoring system that will just charge you for what you actually USE. You pay pennies per metric metered by the hour, instead of a per-node model, which gets crazy expensive and inefficient for modern ephemeral infrastructure. For example, if all you're doing is integrating us with AWS CloudWatch to monitor some EC2 instances and an RDS instance, we can do that for effectively $1-$2 an instance. We also have an agent you can install on your servers if you want more detailed metrics, which adds $5-10 per instance depending on how many metrics you enable. Our customer success team (email firstname.lastname@example.org, or the Help chat window if you already have a Librato account) will be more than happy to walk you through any permutation of our pricing and the details of the model to help you better understand it.
As mentioned, you can try us out for free--no credit card required: https://www.librato.com/
You can do some nice math functions for your alerts.
A couple of caveats. If you are coming from Nagios, this is a different worldview on monitoring. Like many other solutions commented here this is all based around metrics and their associated time series, and then you need to alert on those metrics. You ask the system questions with a time series query language.
Wavefront doesn't yet have a great solution for poll-based monitoring (i.e. hitting host X's /healthcheck endpoint) so I still use terrible ol' Nagios for that in my environment. However the rest of my work is all done in Wavefront - I'd say easily the high 90% of all my material alerts are done in Wavefront with a small subset of work done in Nagios.
The killer feature here is the query language. I don't think there is anything else on the market that has its level of sophistication. I've had ex-Googlers on my team who "grew up" with Borgmon, which is in some sense the Ur-time series monitoring system and they loved it.
All this said, there are a lot of options out there. I have a strong bias against supporting my own complicated monitoring infrastructure. I want to focus on my own product. If you don't share that opinion or are on a super duper tight cash budget (but you do have time) then disregard the above ;)
Prometheus is inspired by Borgmon, and has a query language that is unmatched by almost everything else I'm aware of.
Are there public docs on the semantics and features of the WaveFront language so I can compare?
The winner IMO is dataloop.io .
Dataloop is a SaaS monitoring solution that is super easy to get up and running and has tons of fantastic features and capabilities. The team behind it is stellar and their pricing is reasonable.
10/10, will continue to use again and again :)
500 metrics accounts are free for life.
Built by SREs for SREs.
dead simple, easy to configure and very reliable
It's a bit old and I'll update it later, but here is the short summary with all the latest tools:
Free (as in open-source) shitty options:
icinga, nagios, riemann
They suck so much they're not even worthy of having their names written.
The other open-source option is prometheus.
I didn't try it personally but I've had candidates interviewing at my company who talked at length about their experience with it and they were satisfied.
I read the whole documentation and it's better than the old shitty tools but it's still not great. Be aware that it has many limitations by design; they skipped all the hard stuff (single node only, no HA, pull-mode only for metrics).
The new SaaS tools (ordered by maturity), all $10-20 per host, they're mostly copycats:
Datadog, BMC TrueSight Pulse (Boundary), SignalFX, Wavefront, Server Density.
Datadog is the best option. It's older (about 5 years) and more mature. It has the most features and integrations. It's really the next generation of monitoring.
BMC TrueSight Pulse is the historic competitor. It was a startup called "Boundary" that was bought by BMC, and BMC rebranded the product. That's about the same thing. Not sure what the acquisition may or may not have changed.
SignalFX is a direct copy-cat of datadog (and BMC). But it came later so it's lacking in features and integrations.
Wavefront is an even later copy-cat of datadog and signalfx. Except it has no public price nor public trial. You have to contact them and go through sales for anything. (Honestly: just ignore wavefront. There are 3 direct competitors who are better and more accessible).
ServerDensity: Don't bother trying. The website is buggy, it fails to load pages very often. The product is not even finished and lacks 80% of the competitors' features. The company will probably die soon. (sorry for their employees who are commenting here and reading that :\ )
[Google] StackDriver: It was another company that was acquired by Google 2 years ago. Currently, it's dead and it's being integrated to Google offerings. That might be great when it comes back (probably this year, there seem to be some closed beta given by Google at the moment).
Datadog beats everything by a long margin. More mature, more features, more integrations. It has the advantage and it's evolving faster. That's the horse you have to put your money on (I did).
You can try the competitors (either BMC or signalfx) if you wanna play around or just tickle datadog sales team to get a better price (I did) :D
There might be a market rupture within 1-2 years when Google finally releases StackDriver. It had some quite advanced stuff and great reviews when it was acquired. It's the only one that might be able to catch up with datadog and provide the very advanced stuff that doesn't currently exist (e.g. outlier detection done right).
If and when Google finally offers GCE (cheaper & faster than AWS) + Kubernetes (Docker and infrastructure on steroids) + StackDriver (complete monitoring AND logging solutions), they will be the best IaaS provider on the planet by a wide margin. The evolutions brought by these tools will allow me to do the work of 3 infra/SRE guys all by myself.
We skipped the hard stuff on purpose, as hard stuff is extremely tricky to get right and liable to fail right when it's most needed. See http://www.robustperception.io/monitoring-without-consensus/
Per the above, there is HA.
Get everything containerized and use a container runtime like ECS, unless you're operating in analytics, adtech, or something else with extreme storage/compute/network requirements.