
The Art of Monitoring - manojlds
https://www.artofmonitoring.com/
======
thinkersilver
First off, I'd like to say that I think this is great and we need more books
in this space.

I hope this can be seen as constructive criticism but I do have a few comments
on the layout and content.

1. Free chapter: perhaps chapter one or two may have been a better choice.
I'd like to know the philosophy behind monitoring expressed in the book before
diving into the details.

2. Capacity planning: a chapter on this would have been great. Most teams
I've worked with have struggled with sizing, planning required resources, and
archiving strategies for their monitoring solutions.

3. Monitoring strategies for different levels in the stack: where do I
start, what should my short-term goals be, and so on.

4. The naming of some chapters is too focused on the technology. For
example, the chapter on Logstash could have been renamed to something about
application logging or log scraping.

5. Visualisation and communication of results: there could have been a
chapter on dashboards and reporting. This is a common issue for teams trying
to understand how to do this.

This was written in a bit of a hurry but I hope my points came through.

~~~
jamtur01
Thanks for your feedback.

1. I decided not to do this because my experience is that people like to do
something practical first. I've had a huge response to that chapter - lots of
folks who had previously been stuck have gotten into Riemann. That alone is a
solid +++ for me.

2. Each chapter contains some discussion of capacity planning for specific
tools, where relevant.

3. The capstone chapters (11-13) discuss this, as do the chapters covering
logging and application instrumentation.

4. Thanks - I'll consider that.

5. I discuss visualization in various chapters, but I've found that most
folks have very different needs and desires. So I focused on discussing what
to show in small segments, along with some visual design discussion, rather
than a specific chapter on dashboarding/reporting. Hard choice, but a 750-page
book needs to stop somewhere. :)

Thanks for taking the time to comment - it's awesome when folks share their
thoughts!

------
graycat
The _monitoring_ considered in the OP is for server farms and networks where
the main challenges are rates of false alarms and rates of missed detections.

The challenge is to find means of monitoring that permit selecting the rate of
false alarms one is willing to tolerate and then, for that rate, getting the
lowest possible rate of missed detections.
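
To make the Neyman-Pearson idea concrete, here's a toy sketch (with hypothetical distributions; in real monitoring we rarely know either one, which is the point that follows): with healthy and anomalous behavior both known Gaussians, we select a false alarm rate and read off the optimal threshold and the detection rate it buys.

```python
from statistics import NormalDist

# Toy setup: a healthy metric ~ N(0, 1), the anomaly shifts it to N(2, 1).
# With both distributions fully known, the likelihood ratio is monotone
# in x, so the Neyman-Pearson optimal test is a simple threshold on x.
healthy = NormalDist(0.0, 1.0)
anomalous = NormalDist(2.0, 1.0)

alpha = 0.01                             # selected false alarm rate
threshold = healthy.inv_cdf(1 - alpha)   # alarm when x > threshold

# Detection rate (power) this threshold achieves against the anomaly.
detection_rate = 1 - anomalous.cdf(threshold)
print(f"threshold={threshold:.3f}, detection rate={detection_rate:.3f}")
```

In practice the anomalous distribution is exactly what we lack data on, so this computation is unavailable, which motivates the distribution-free approach below.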

Thus, one would like to use the Neyman-Pearson result. Usually, however, in
this context we do not have enough data for that. E.g., we are typically quite
short on data on the _anomalies_ we are trying to detect, and shorter still as
the systems become more reliable.

From the above, we see that necessarily and inescapably such monitoring is a
continually applied statistical hypothesis test.

Apparently in common practice, the false alarm rate is not known, and it is
not reasonably possible to select or even adjust it.

Then we see that we need tests that are both multi-dimensional and
distribution-free.

A special case of high interest is _zero-day_ problems, that is, detecting
problems never seen before. So, this is _behavioral_ monitoring -- any
_behavior_ sufficiently unusual is regarded as an anomaly, that is, evidence
of something wrong.

From all I can see, so far the monitoring community has yet to take these
points to heart.

The OP's remarks on thresholds are on target: Thresholds have been the old,
lame, weak workhorse of monitoring far too long.

If anyone is actually seriously interested in this subject, let me know. Some
years ago I concluded that no one was interested!

~~~
jamtur01
It's a spectrum to me. We're way behind the curve on monitoring, and the
"state of the art" anywhere but cutting-edge shops is woeful. I'd love folks
to be able to do anomaly detection easily and simply, but the technology and
tools aren't quite there yet. I am just hoping to get folks to advance their
environments a little way forward.

~~~
graycat
Yes, the situation is horrible. I'm reluctant to believe that the "cutting
edge shops" are doing very well.

For good "tools", I have a good paper on the subject, but from all I can see
there is essentially no interest. People would prefer not to be bothered. The
attitude seems to be, if there is a problem, then we will detect it,
eventually if not soon, and then we will fix it.

~~~
ch
I could be bothered. When you say you have a paper, is that something yet to
be published? Or is it just sitting in some dusty corner of the Internet?

~~~
graycat
It was published, in the Elsevier journal 'Information Sciences' in 1999.

It appears to be the first, and the first large, collection of statistical
hypothesis tests that are both multi-dimensional and distribution-free.

I try to be anonymous here at HN, but I'm willing enough to send a PDF of the
paper to anyone who wants a copy. E.g., ask for a copy and leave your e-mail
address on your HN profile, at least temporarily.

The main point of the paper is that we do get an _hypothesis test_. In
particular, we get to select false alarm rate and then get that rate
essentially exactly in practice.

It's _behavioral_ monitoring -- it assumes that the past and future of
_healthy_ performance are, to be simple, _statistically_ the same. So, right,
it's for a server farm or network that is _statistically_ relatively _stable_,
that is, statistically unchanging in what it is doing. The site can be wild
and crazy, but it has to continue to be wild and crazy in statistically the
same way.

In particular, the work is for detecting _zero day_ problems, that is,
problems never seen before. Maybe the _philosophy_ here is that when we get a
new problem and detect it, then we fix the cause of the problem and never see
it again and, then, again are left looking for _zero day_ problems.

Then the work uses past data -- hypothesis tests have done that since Karl
Pearson 100+ years ago, and now parts of computer science do something similar
and call it _training_ data in _unsupervised learning_ or some such. The
approaches of just statistical hypothesis testing make more sense to me.

The key, core mathematical argument is a finite algebraic group of measure
preserving transformations on the data. I believe that there are connections
with U-statistics, e.g., as in an advanced statistics book by Serfling.

This stuff with groups and measure preserving is a little like some classic
arguments in ergodic theory. On the page, the math looks awful, but actually
it is conceptually quite simple.

But, you don't need to dig into the math too much.

For the actual calculations, those are based on nearest neighbors (although
other options also work with the basic math). At least since the paper, others
have thought of using nearest neighbors, but they didn't have an hypothesis
test because they didn't know how to calculate and adjust the false alarm
rate. So they have an heuristic instead of an hypothesis test. So, again, the
main contribution of the paper is that it really is an hypothesis test, that
is, one knows and gets to adjust the false alarm rate (conditioned on the old
data, and also true in long-run expectation over the conditioning -- a
standard result in conditional expectation from the Radon-Nikodym theorem in
measure theory).
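
A minimal sketch of the nearest-neighbor idea (an illustrative reconstruction, not the paper's actual algorithm): calibrate the alarm threshold as an empirical quantile of leave-one-out nearest-neighbor distances over healthy training data, so the false alarm rate is selected without any distributional assumptions.

```python
import random

def nn_distance(x, points):
    """Euclidean distance from x to its nearest neighbor in points."""
    return min(sum((a - b) ** 2 for a, b in zip(x, p)) ** 0.5 for p in points)

def fit_threshold(train, alpha):
    """(1 - alpha) empirical quantile of leave-one-out NN distances."""
    dists = sorted(nn_distance(p, train[:i] + train[i + 1:])
                   for i, p in enumerate(train))
    return dists[min(len(dists) - 1, int((1 - alpha) * len(dists)))]

def is_anomaly(x, train, threshold):
    return nn_distance(x, train) > threshold

random.seed(0)
# Hypothetical healthy training data: two correlated metrics,
# e.g. load on two components that normally track each other.
train = []
for _ in range(500):
    u = random.gauss(50, 10)
    train.append((u, u + random.gauss(0, 2)))

threshold = fit_threshold(train, alpha=0.01)  # selected false alarm rate

ok = is_anomaly((52, 53), train, threshold)   # deep inside healthy behavior
bad = is_anomaly((90, 10), train, threshold)  # unlike anything in training
print(ok, bad)
```

With real metrics the training set would be recent healthy history, and a k-D tree (as mentioned later in this comment) would replace the brute-force nearest-neighbor scan.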

For detection rate, there is some good news, not as good as from the classic
Neyman-Pearson result (in practice in the context we don't have enough data to
do much with Neyman-Pearson), but nice: in a useful sense, for the selected
false alarm rate, the work gives the highest possible detection rate. Really
the mathematical key here is just Fubini's theorem (the measure theory version
of interchange of order of integration). Intuitively, the technique has the
largest area where alarms are raised consistent with the selected false alarm
rate.

For the practical application, one does need some help with some
_computational geometry_. For that, I dreamed up some work. Soon I found that
part of what I dreamed up was k-D trees, e.g., as in Sedgewick's book on
algorithms. But there is more -- need some _cutting planes_. I programmed most
of it 20+ years ago in PL/I but finally dropped it due to lack of interest.

I have some ideas for more results of interest and more papers, but after 20+
years of no interest I just gave up.

More can be said, but I stopped the research when I discovered, about 20 years
ago, that no one was interested. The paper was published in 1999, and since
then interest has been quieter than the tombs of ancient Egypt. So, I'm doing
a startup that is quite different.

I dreamed up the work when I saw the need, or at least as I regarded the need,
way back in about 1990 when I was in an AI group at the IBM Watson lab doing
work on monitoring and management of large server farms and networks. The AI
work was trying to build on essentially just threshold detectors. There was no
attention to false alarm rate or a _best_ detector -- highest detection rate
for given false alarm rate. The classic Neyman-Pearson result was ignored. I
was our guy with GM Research, and we gave a paper at the Stanford AAAI IAAI
conference. But I was outraged by the lack of concern for false alarm rate,
by the ignoring of hypothesis tests and distribution-free hypothesis tests
(long common in the social sciences), and by the complete lack of attention to
multi-dimensional data.

The real world context is just awash in multi-dimensional data. Treating the
data components separately in effect says that the geometrical region of
_healthy_ behavior is just a box. Bummer. Box too small -- get false alarm
rate too large. Box too big, get too many missed detections. Problem: A box is
a poor fit to reality. Simple stuff.

How to see this? Monitor CPU busy and page faults per second and look for
anomalies, e.g., thrashing, a program allocating infinite memory, etc. Then
the normal behavior is just a 2-D box? I don't think so! But, sure, one needs
to automate picking the shape of the region of _healthy_ behavior.
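
A contrived illustration of why the box fails (hypothetical numbers): if healthy behavior ties page faults to CPU load, a point can sit inside every per-metric threshold and yet be far from anything healthy ever observed.

```python
# Hypothetical healthy observations: (cpu_busy_pct, page_faults_per_sec),
# where page faults roughly track CPU load.
healthy = [(cpu, 2 * cpu + 5) for cpu in range(10, 91, 2)]

# Per-metric "box" thresholds derived from the same healthy data.
cpu_lo, cpu_hi = min(c for c, _ in healthy), max(c for c, _ in healthy)
pf_lo, pf_hi = min(p for _, p in healthy), max(p for _, p in healthy)

suspect = (85, 30)  # high CPU but few page faults: nothing healthy did this

in_box = cpu_lo <= suspect[0] <= cpu_hi and pf_lo <= suspect[1] <= pf_hi

# Joint view: distance to the nearest healthy observation.
nn = min(((suspect[0] - c) ** 2 + (suspect[1] - p) ** 2) ** 0.5
         for c, p in healthy)

print(in_box)  # True: each component alone looks fine
print(nn)      # large: jointly, this point is far from all healthy behavior
```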

For the distribution-free stuff, that is where we make no assumptions about
probability distributions. I got a kick in the back side on that sitting one
day in the office of Ulf Grenander, one of the world's best ever
statisticians, at Brown (I got accepted to grad school there; was considering
going; went elsewhere instead). Grenander had been looking at computer
performance data and was shocked at how different it was from the data, e.g.,
biomedical, he had been used to. So, right, Gaussian assumptions and more go
out the window!

So, really, one just wants to make no assumptions about distributions, wants
to be _distribution-free_ (a.k.a. _non-parametric_, although I believe
distribution-free is the more appropriate terminology).

For multi-dimensional, at IBM I got a slap in the face: There was a _cluster_
of computers doing transaction processing. There was some front end load
leveling that sent the next transaction to the least busy computer in the
cluster. Okay. But one day one of the computers got sick, just a little sick
in the head, and was doing a very silly thing -- it was throwing all its
incoming transaction work into the bit bucket! Thus, this computer looked to
the load leveling as not very busy and, thus, was getting nearly all the
transactions. Thus, nearly all the transactions for the whole cluster were
going into the bit bucket. Bummer.

So, I thought, to detect this _anomaly_, one wants somehow to look at all of
the computers in the cluster at the same time and compare them with each
other, that is, have all the data in some appropriate region in some space of
several dimensions, a region that works whether the cluster is busy or not.
So, one wants to be multi-dimensional, that is, doesn't want just threshold
detectors on variables one at a time.
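
The cluster story can be sketched with a deliberately crude peer-comparison rule (hypothetical numbers and a toy rule of my own, not the paper's test): look at each machine's traffic share jointly with the busy it reports, relative to its peers. No per-machine threshold catches the sick node, because its metrics look great in isolation.

```python
# Hypothetical per-machine stats for a 4-node cluster:
# (transactions per second routed to it, CPU busy % it reports).
cluster = {
    "node1": (120, 55),
    "node2": (115, 52),
    "node3": (130, 58),
    "node4": (900, 4),   # sick: discarding its work, so it looks idle
}

total_tx = sum(tx for tx, _ in cluster.values())

def suspicious(name):
    tx, busy = cluster[name]
    share = tx / total_tx
    # Peers' busy per unit of traffic share; a healthy node should be similar.
    peers = [(t / total_tx, b) for n, (t, b) in cluster.items() if n != name]
    expected_busy = share * sum(b / s for s, b in peers) / len(peers)
    return busy < expected_busy / 5   # crude joint rule, for the sketch only

flagged = [n for n in cluster if suspicious(n)]
print(flagged)
```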

There are more war stories where the importance of being multi-dimensional is
crucial. Really, commonly separating multi-dimensional data into its
components and treating the components separately can be throwing away a lot
of crucial _information_ which stands to give a poor combination of false
alarm rate and detection rate.

Heck, in principle the region of healthy performance can be a fractal, say,
like the Mandelbrot set, and, so, somehow we need to approximate that. Can we
do that? Basically with nearest neighbor, or k-nearest neighbors (which also
works), yes.

There is now a good opportunity for my work: My work can use a LOT of
_training_ data, and the near real-time detection work may want to do a lot
with that data. So, it could use fast access to a lot of data which doesn't
change very fast. So, sure, use some big solid state disks (SSDs)! A few of
the Samsung 14 TB drives should do wonders for my paper!

My view is, anyone doing monitoring of a large server farm or network and not
using what is in my paper is not being fully serious. And, since one gets to
adjust the false alarm rate, say, to one a month, one can't say the extra
false alarms are unaffordable.

Uh, I left out: for each alarm, one is told the lowest false alarm rate at
which the real-time input data would still raise an alarm -- so one gets an
indication of the alarm's _seriousness_.

More is possible, but at least one has to be using what I cooked up in 1990,
wrote prototype software for in the early 1990s, and published in 1999.

I did the work a long time ago, guys! And since then, there have been various
serious consequences from anomalies, intrusions, etc. Maybe in some of those
cases, my work would have done good, early detection. My work looks a heck of
a lot better than anything else!

~~~
laichzeit0
So I think the mathematics to make this work is not the problem. How do you
engineer it, though?

If I were to try to build a platform that could do this in real time for,
let's say, a million metrics per minute, could you engineer something that
would scale horizontally? Could it be done by cobbling together various
open-source tools/libraries currently out there? And how would you present
the results in a way that someone who's not necessarily "mathematically
inclined" (say, your typical operational support person) could meaningfully
interpret whatever your system is spitting out?

That, for me, is the hard part: getting those two components working well.
Make it scale, make it idiot-friendly. If you can't get those parts right, it
doesn't matter what you're trying to do.

I say this because I've spent the last 6 years in the application performance
management space, and "the best" way to handle alarms at the moment is to put
down a team, literally a team, of people and have them hand-tune thresholds by
looking at a combination of history, incidents/outages and root-cause
outcomes, and domain specialist inputs (like DBAs or application server
specialists). You send a false or noisy alarm to an ops guy too many times
and they become desensitized. If you don't put enough context in your alarm
messages, they won't use them (logging into a tool is asking too much; the
email must contain everything they need or they complain).

Any form of dynamic baselining is just too noisy. The simplest example is
trying to "baseline" CPU usage. CPU usage without something trivial like
comparing it to the run queue is stupid. It's actually even more stupid
because you should be looking at things top-down: so what if the CPU is 100%
and the run queue is 100 -- are any user-facing transactions slowing down?
I.e., is there customer impact? It could be some batch job kicking off. So in
short, anything that looks at a metric in isolation is stupid; dynamic
baselines with time of day, day of month, etc. are all rubbish, and you're
wasting your time with this approach. This is the sad state that current
"cutting edge" third-generation APM tools offer, though.
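
The top-down point can be sketched as a simple rule (hypothetical metric names and thresholds): page only on user-facing latency, and attach the resource metrics as context instead of alarming on them in isolation.

```python
def evaluate(sample, latency_slo_ms=500):
    """Top-down check: only customer impact pages; resources are context."""
    impacted = sample["p95_latency_ms"] > latency_slo_ms
    if not impacted:
        return None  # 100% CPU during a batch job is not, by itself, a page
    context = {k: v for k, v in sample.items() if k != "p95_latency_ms"}
    return f"p95 latency {sample['p95_latency_ms']}ms over SLO; context: {context}"

# Busy but healthy: no alert despite saturated-looking resource metrics.
print(evaluate({"p95_latency_ms": 180, "cpu_pct": 100, "run_queue": 100}))
# Customer impact: the alert carries the resource context an ops person needs.
print(evaluate({"p95_latency_ms": 2300, "cpu_pct": 97, "run_queue": 85}))
```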

~~~
twoodall
Hello graycat - I would be interested in a copy of your paper and/or the
article name/publication date/etc. Regards

~~~
graycat
The same as for several others here at HN, leave your e-mail address in your
HN profile, and I will try to remember to return, use your e-mail, and send
you a PDF of the paper, 1999 in _Information Sciences_.

------
ginkgotree
Another thought. Two really:

- Lead us into more about the _why_ before showing the price. Really tell the
story of the pain that so many of us can identify with.
- Use a standard three-tier pricing box style, with the values of each above
the price. Something like:
[https://planscope.io/pricing/](https://planscope.io/pricing/)

~~~
jamtur01
Thanks - I added a pricing panel.

~~~
vinchuco
IMHO switching Buy The Book and Table of Contents sections is a more
reasonable strategy.

~~~
jamtur01
Thanks - good idea - will do that.

------
bogomipz
If you haven't read one of the author's books before (he's released titles on
both Logstash and Docker), he puts out really high-quality material and he
seems to update his books when new releases of the subject come out. This
looks like another great release. Kudos, James.

~~~
coredog64
Seconded. My satisfaction with the Docker book has had me on the edge of my
seat for this book since I first saw the "coming soon" message.

------
AndyNemmity
Just bought the book, and I didn't even realize it was James Turnbull; that
guy is extremely advanced in monitoring. I've seen many of his talks online,
and no one I've seen speaks about monitoring in a more rational way.

This is Hacker News at its best, in my view.

------
guidedlight
This looks to be more about infrastructure monitoring... less about
application monitoring (i.e. New Relic / AppDynamics, and synthetics).

~~~
jamtur01
That's mostly correct. I have a chapter on adding instrumentation to
applications with examples in Ruby/Rails and Clojure that can easily be
adapted to other languages and frameworks. I also cover adding structured
logging to your applications.

------
SonicSoul
Love your site design. It's clean and easy to read - great presentation. Do
you whip up a new design from scratch each time you create a book, or did you
hire someone or purchase one off the shelf? Curious, as these things take me
forever...

~~~
jamtur01
I usually find a template I like and modify it. I really should start paying
someone. It's not my core skillset. :)

------
reptation
check_mk is a very useful monitoring system which doesn't seem to be included:
[http://mathias-kettner.com/check_mk.html](http://mathias-kettner.com/check_mk.html)

~~~
bogomipz
I would argue that it's not, and that it's a mess, starting with that URL.
This is yet another fork of an outdated monitoring system (how many Nagios
forks are there?). The architecture diagram made my head hurt - how many
moving parts is that? I counted 8 separate components. There are much more
modern monitoring systems these days, such as Prometheus, Bosun and Riemann.

~~~
reader_1000
There is also Icinga 2. So how does one decide which one to use? For a mixed
environment (Windows, Linux, etc.), which monitoring tool is better when both
server and client are considered?

~~~
bogomipz
IMHO Icinga 2 also has way too many moving parts. For mixed environments, you
should be able to run both Prometheus and Bosun:

[https://prometheus.io/download/](https://prometheus.io/download/)
[https://bosun.org/downloads](https://bosun.org/downloads)

~~~
reader_1000
Thanks for the information. Currently we are using the icinga + check_mk
combination and are mostly happy with it. However, since icinga and check_mk
are going their separate ways, it is no longer possible to use them together.
I also want to check out some modern alternatives like the ones you suggested.

------
alamaison
Is this going to be published as a physical book too?

~~~
jamtur01
Not at this stage - my experience with how quickly physical books date hasn't
been good.

------
Omnipresent
Does the book provide example applications that are monitored using the
information provided, or does it just go through the tools that can be used?

~~~
jamtur01
It provides several example applications and goes through the tools.

------
ginkgotree
Wouldn't be a bad idea to put the free sample chapter behind an email list
signup - worked well for my book!

~~~
jamtur01
Thanks - I considered that but I don't like to put barriers in people's way -
especially for free content. I don't like to be too marketing-esque. Just not
in my nature. :) A couple of thousand folks signed up to the mailing list
using the current approach, which I'm pretty happy with.

