
Bosun – open-source monitoring and alerting system by Stack Exchange - aps-sids
http://bosun.org/
======
KyleBrandt
I did a presentation on Bosun at the most recent Monitorama conference:
[https://vimeo.com/131581326](https://vimeo.com/131581326)

The first ~13 minutes cover some of the design thinking, the why, etc. Then I
start a demo with some screencasts.

~~~
gbrayut
And if you are a developer who likes working on these sorts of things you
should come work with us!

Details at [https://careers.stackoverflow.com/jobs/92395/site-
reliabilit...](https://careers.stackoverflow.com/jobs/92395/site-reliability-
team-developer-stack-exchange)

------
spo81rty
3 years ago I started a company (Stackify) with the hopes of building a better
nagios. But it doesn't seem like developers really wanted it. They wanted and
needed a lot more as basic server metrics didn't really tell much of a story
about application health. The shift to cloud based services also makes a lot
of basic monitoring tools unnecessary. Cloud based apps don't really need to
monitor servers or infrastructure beyond simple CPU and memory measurement.
Developers need to monitor the app itself. Which can really only be done by
code profiling, custom metrics, and analyzing errors and log statements.

So we have since pivoted a little and focused heavily on true application
monitoring via basic server metrics, custom app metrics, error tracking, log
management, and true APM code profiling. All of this together provides a lot
of power when it comes to monitoring and finding application problems.

A lot of companies we talk to barely monitor anything about their apps. So
many IT teams work in such a reactive mode they aren't very proactive when it
comes to monitoring application health and behavior.

Would love to get anyone's feedback about this topic. Do you just use basic
server monitoring? In how much detail do you monitor the actual behavior and
health of your application? How do you do it?

If you're curious you can check out our product.
[http://stackify.com](http://stackify.com)

~~~
cbaleanu
Just curious, how is this related to the article on which you're commenting?

~~~
spo81rty
Most developers in most IT departments don't even have access to monitoring
tools and very little of developer/application importance is monitored beyond
server up/down, cpu, and memory usage.

A lot of the most important monitoring I do for my own application is looking
at page load times, slow DB queries, custom app metrics, error rates, and
looking for specific log statements. Many of these things can't even be done
in basic monitoring systems.

Bosun's alerting rules look awesome. But most people don't even know what to
monitor, let alone figure out how to write javascript expressions to do so.

~~~
KyleBrandt
Bosun uses a custom DSL, not JavaScript. What you are saying is fair: Bosun
is currently targeted at an advanced audience.

There is active discussion about making alerts GUI creatable to make it more
accessible. Can hopefully be less vague about that in a few weeks.

------
INTPenis
I tried it for a while and it sure has potential as one of those modern
monitoring systems that are replacing nagios right now.

However, in production I would not want to run it in a Docker container. I
would want to set up my own server with the option to scale it to remote
pollers.

In my org we ended up choosing another nagios replacement, but not because of
any flaw in bosun.

I love iterating over the main points that we look for in a monitoring
solution.

Self-hosted. Scalable, with remote pollers that can plug in to the central
servers. Locations: remote pollers can add locations to monitor from. A
collector agent that runs periodically on the monitored servers instead of the
nrpe model that listens for connections. The collector OS agent is Windows
compatible and backwards compatible with nagios scripts. Monitoring focuses on
sending metrics first and foremost, so you can set thresholds for metrics, just
like bosun does. And of course, with those metrics the web GUI draws fancy
graphs for everything.

And last but not least, all of this (monitoring agent, pollers) uses a
standard API like REST or XML-RPC.
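
As a rough illustration of the push model described above (an agent that runs periodically and sends metrics over a REST-style API, rather than listening for connections), here is a minimal Python sketch. It is not taken from any particular product: the endpoint URL, the metric name `disk.used_percent`, and the payload layout are all assumptions made up for the example.

```python
import json
import shutil
import socket
import time
import urllib.request


def collect_sample(path="/"):
    """Collect one disk-usage sample as a plain dict (hypothetical layout)."""
    usage = shutil.disk_usage(path)
    return {
        "metric": "disk.used_percent",  # made-up metric name
        "timestamp": int(time.time()),
        "value": round(100 * usage.used / usage.total, 2),
        "tags": {"host": socket.gethostname(), "path": path},
    }


def build_request(url, samples):
    """Wrap a list of samples in a JSON POST request without sending it."""
    body = json.dumps(samples).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def push(url, samples, timeout=5.0):
    """Send the samples to the collector and return the HTTP status code."""
    with urllib.request.urlopen(build_request(url, samples), timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical endpoint; a real agent would read this from its config
    # and be started periodically by cron or a scheduled task.
    push("http://collector.example/api/metrics", [collect_sample()])
```

Because the agent initiates every connection, nothing has to listen on the monitored host, which is the main difference from the nrpe model.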

~~~
newman314
Which nagios replacement did you choose?

~~~
falcolas
Not the OP, but we went with Icinga2. Aside from some crappy pre- and post-
upgrade scripts, it works remarkably well, and is compatible with the Nagios
monitoring plugin ecosystem.

Add in some Skyline (based off the Etsy project), graphite, and collectd, and
it makes for a flexible and extensible monitoring solution.

------
smegel
God it's time someone came up with a good, modern monitoring system. I used
Nagios for years but it never evolved past a bunch of CGI scripts written in
C(!). I tried Sensu, and was moderately impressed until a major update broke
everything and it never worked again.

~~~
xorcist
I've used Nagios in quite big installations. There are plenty of things that
can be improved, and have been in the surrounding ecosystem, but that's just
painting a false picture. Nobody runs CGI anymore, and even if they did,
they're not in C and never have been. Some of the checks are, of course.

~~~
smegel
Hmmm
[https://github.com/NagiosEnterprises/nagioscore/tree/master/...](https://github.com/NagiosEnterprises/nagioscore/tree/master/cgi)

------
euroclydon
I've had an intern working to set up Bosun and OpenTSDB on an Ubuntu server,
from source, for a few weeks now. He's close, but today is his last day.

I'd need to pay someone to professionally set this up for us (so we can easily
distribute it with our enterprise software), preferably with just bash
scripts. I also need consulting. Like, is it realistic to use a single server
for our logging load?

I work for a large multi-national. If you're qualified, and interested, we can
engage you to help us out. Contact is in my profile.

~~~
bbrazil
How many servers and metrics are you expecting?

I know for prometheus.io we can handle at least 1M metrics per server.

~~~
euroclydon
I had not heard of Prometheus. Thanks. Any reason it would not run on Windows?

~~~
bbrazil
We'd like it to run on Windows, however none of the core developers use
Windows.

[https://github.com/prometheus/prometheus/issues/505](https://github.com/prometheus/prometheus/issues/505)

We haven't heard back recently, so it's possible it's all working now.

~~~
euroclydon
Cool. If/when we get to it, I'll let you know.

------
xorcist
It seems to be a DSL to describe alerts over whole clusters. That's probably
what a monitoring system for the cloud age should be. It can monitor Logstash and
Graphite, which are proven ways to collect data in a disparate environment.

But many in the comments compare it with Nagios which I think isn't really
fair. You could probably easily plug this into Nagios and its dependency
rulework can figure out who to page when. Because that's what Nagios is, not
the default checks it ships with.

~~~
KyleBrandt
Bosun really is different in many ways from Nagios. There are 4 major
components that are at the root of Bosun:

* Time Series Database: Lets you do forecasting and anomalous alerts. Basically lets your alerts have context

* Expression Language so you can manipulate that data: Makes the data you collect and how you alert more orthogonal

* Templating for alerts (Built on Go templates, can include graphs, tables, links, etc)

* IDE interface for developing alerts, plus an alert handling workflow. It also lets you test alerts against history, which allows for rapid alert development.

So other than "it does alerting" it is a _very_ different beast from Nagios.
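
To make the forecasting point concrete, here is a minimal Python sketch of the general idea. It is not Bosun's expression language or its implementation: it just fits a straight line to recent samples and estimates how long until the series crosses a limit. The function names (`fit_line`, `seconds_until_limit`) and the sample data are made up for illustration.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]  # (unix_seconds, value)


def fit_line(points: Sequence[Point]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    if var_t == 0:
        raise ValueError("all timestamps are identical")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def seconds_until_limit(points: Sequence[Point], limit: float) -> Optional[float]:
    """Estimate seconds from the last sample until the trend reaches `limit`.

    Returns None when the trend is flat or falling, i.e. the limit is never
    reached under a linear model.
    """
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    crossing_time = (limit - intercept) / slope
    return max(0.0, crossing_time - points[-1][0])


if __name__ == "__main__":
    # Disk usage in percent, sampled hourly: 50, 60, 70.
    history = [(0, 50.0), (3600, 60.0), (7200, 70.0)]
    # Growing 10 points per hour from 70, so about 3 hours (~10800 s) to 100.
    print(seconds_until_limit(history, 100.0))
```

An alert built on this kind of estimate can fire when the projected time drops below some horizon, instead of waiting for a fixed threshold to be crossed. That is the sort of context a stored time series makes possible.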

------
Gideonnn
I browsed through the website, but couldn't find if it had SNMP support. If it
has, it can maybe replace Zabbix for our company.

~~~
jrgnsd
It looks like, since you define both the queries and the inputs, you can set it
up to support SNMP. You can use Logstash to listen for SNMP traps and push them
to your OpenTSDB or Elasticsearch instance, from where Bosun will pull them.

~~~
Gideonnn
That sounds pretty good. I'll have a more in depth look later. Thanks!

