At Devopsdays I listened to a lot of smart people saying smart things. And to some people saying things that sounded smart, but really weren’t. It was especially confusing when you heard both of these kinds of things from the same person.
Experience has taught me that if someone who is clearly smart, who clearly says smart things, says something that sounds dumb to me, it is worthwhile for me to not just dismiss that. Instead I need to examine my preconceptions for why I disagree, and why they came to the conclusion that they did. Until I have satisfied myself that I understand both why they thought as they did, and why I disagree, there is a good chance that I'm missing a valuable lesson.
This example is a case in point. Clearly it is from the point of view of a web company. The advice offered is not for all environments - there is a world of difference between a case where downtime means someone doesn't see a picture for 15 minutes and one where someone dies.
Now, about unit testing and monitoring, let me give an example. I know a company (which I can't name) that releases multiple times per day, and releases every change as an A/B test. This is important: if they release a change that works exactly as designed but hurts conversion by 10%, THEY WILL KNOW. (You need significant traffic to follow this strategy; they have it.) There are a lot of seemingly trivial changes that could move the needle 10% without your realizing it, and you don't want to move it in the wrong direction.
In fact, if you look at the dollar values, a bug that causes 1% of pages to crash (which unit testing could catch) is simply not as important as a bug that hurts revenue (which the A/B test could catch).
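To make the A/B angle concrete, here is a minimal sketch (my own illustration, not the unnamed company's actual setup) of deciding whether an observed conversion difference between control and variant is real or just noise, using a plain two-proportion z-test and only the Python standard library. The traffic and conversion numbers are invented.

```python
from math import sqrt
from statistics import NormalDist

def conversion_drop_significant(control_visits, control_conversions,
                                variant_visits, variant_conversions,
                                alpha=0.05):
    """Two-proportion z-test: is the variant's conversion rate
    significantly different from the control's?"""
    p1 = control_conversions / control_visits
    p2 = variant_conversions / variant_visits
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (control_conversions + variant_conversions) / (control_visits + variant_visits)
    se = sqrt(pooled * (1 - pooled) * (1 / control_visits + 1 / variant_visits))
    z = (p2 - p1) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha, p1, p2, p_value

# Made-up numbers: a 10% relative drop (5.0% -> 4.5%) on 50k visits per arm.
significant, p1, p2, p_value = conversion_drop_significant(50_000, 2_500, 50_000, 2_250)
print(f"control={p1:.3%} variant={p2:.3%} p={p_value:.4f} significant={significant}")
```

With traffic at that scale the drop shows up as clearly significant, which is the whole point of "THEY WILL KNOW."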
But it gets better. If you have zero tolerance for web pages crashing and have monitoring in place to catch it (I know people who have every crash emailed to key developers), then you'll catch a lot of the bugs that you would have caught with unit tests, AND you'll catch the bugs that you SHOULD have caught with unit tests but messed up the test for. Which, then, provides more value: the unit test or the monitoring?
You want the unit test. You don't want to be catching stuff after you roll out. And one of the automatic questions when you do catch it should be, "How could we have caught that with an automatic test?" But having smart monitoring is more valuable to you than the unit test.
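As a rough illustration of the "all crashes email key developers" style of monitoring (this is a sketch of my own, not what those teams actually run), Python's standard logging module can do it with an SMTPHandler hooked into the uncaught-exception handler; the SMTP host and addresses are placeholders.

```python
import logging
import logging.handlers
import sys

# Placeholder SMTP settings -- swap in your real mail host and addresses.
crash_mailer = logging.handlers.SMTPHandler(
    mailhost=("smtp.example.com", 25),
    fromaddr="crashes@example.com",
    toaddrs=["dev-team@example.com"],
    subject="Unhandled exception in production",
)
crash_mailer.setLevel(logging.ERROR)

logger = logging.getLogger("crash-reporter")
logger.setLevel(logging.ERROR)
logger.addHandler(crash_mailer)

def email_uncaught_exceptions(exc_type, exc_value, exc_traceback):
    """Log (and therefore email) any exception that would otherwise crash the process."""
    logger.error("Uncaught exception", exc_info=(exc_type, exc_value, exc_traceback))
    # Fall through to the default handler so the traceback still reaches stderr.
    sys.__excepthook__(exc_type, exc_value, exc_traceback)

sys.excepthook = email_uncaught_exceptions
```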
The A/B test or phased-rollout approach works if you have great telemetry (New Relic, etc.) and enough traffic that the results are statistically sound.
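For a back-of-the-envelope sense of what "enough traffic" means, here is a standard two-proportion sample-size estimate (standard library only; the baseline conversion rate and effect size are assumptions for illustration).

```python
from math import ceil
from statistics import NormalDist

def visitors_per_arm(baseline_rate, relative_drop, alpha=0.05, power=0.8):
    """Approximate visitors needed in each arm of an A/B test to detect
    a given relative drop in conversion rate (two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 - relative_drop)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return ceil(n)

# Assumed 5% baseline conversion: detecting a 10% relative drop (5.0% -> 4.5%)
# takes on the order of tens of thousands of visitors per arm.
print(visitors_per_arm(0.05, 0.10))
```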
A low-traffic site working on MVP releases can synthesize both the OP's article and the comment above by "unit testing" the essential functions of the web site from the point of view of the user's browser. Think Selenium, Watir, PhantomJS, and the like.
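A browser-level "unit test" of those essential functions might look something like this Selenium sketch (the URL, selectors, and search term are invented, and it assumes a local Chrome/chromedriver install):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Headless Chrome so the check can run from a cron job or CI server.
options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)

try:
    # Hypothetical essential user journey: the home page loads and search works.
    driver.get("https://www.example.com/")
    assert "Example" in driver.title, "home page title looks wrong"

    search_box = driver.find_element(By.NAME, "q")
    search_box.send_keys("blue widgets")
    search_box.submit()

    results = driver.find_elements(By.CSS_SELECTOR, ".search-result")
    assert results, "search returned no results"
    print("essential user journeys OK")
finally:
    driver.quit()
```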
In an air traffic control system you can never afford an error, so you need to take every possible precaution up front to avoid bugs.
In a lot of situations, though, you might be prepared to slightly increase the risk of introducing bugs in order to move towards continuous, or at least much more frequent, delivery.
I'd argue that in most applications and businesses the latter scenario is true - it's just that where you draw the line varies from project to project.
Once you've accepted that, proactive monitoring is your second line of defence.
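As a minimal sketch of what proactive monitoring can mean in practice, here is a probe that hits a health-check URL on a schedule and raises an alert when it fails; the URL and the alert mechanism are placeholders for whatever you actually use (email, PagerDuty, New Relic, and so on).

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://www.example.com/health"   # placeholder endpoint
CHECK_INTERVAL_SECONDS = 60

def site_is_healthy(url, timeout=5):
    """Return True if the health endpoint answers with HTTP 200 in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def alert(message):
    # Placeholder: wire this to email, PagerDuty, Slack, etc.
    print(f"ALERT: {message}")

while True:
    if not site_is_healthy(HEALTH_URL):
        alert(f"health check failed for {HEALTH_URL}")
    time.sleep(CHECK_INTERVAL_SECONDS)
```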
I think the root of what this guy is ranting about is that many web people seem to confuse rapid release iteration with rapid engineering iteration. The folks who do this don't understand that the frameworks that they are using are taking care of their lack of forethought -- but will only do so for a limited time.
Design, implementation and testing are different disciplines that are essential to reliable systems. You may not need to formally embrace all of them at a given point in time (or product cycle), but claiming that "testing" is a fraud or that "roll back" is mythical is just a demonstration that the speaker doesn't have a mature engineering background.
I'm pretty sure that I didn't say that testing is a fraud. Although I did say that I think production monitoring should be set up on day one of a new project, while unit testing can wait until the project has matured a bit. I also said that unit testing isn't going to be nearly as helpful without having production monitoring in place. And I said that with production monitoring in place it's possible to get by with a lot less unit testing than has been historically prescribed -- which is a good thing since production monitoring systems tend to be less expensive to maintain than unit tests.
I did at one point say "there's no such thing as rolling back." This is an idea to which I'm pretty dedicated. I've seen firsthand, many times, that using an SCM revert command to attempt to restore the "last good state" is a risky endeavor, especially when a large changeset is involved.
Talking about not testing at all is both stupid and nonsensical, as there will always be some informal integration testing.
Not doing any integration testing would translate to the developer making a change in the code and checking it into the VCS without running it. That does happen from time to time, but it's obviously a bad thing, right?
If you build and run a smoke test before checking in, that's integration testing! So clearly everyone does at least a tiny amount of integration testing; how far you take it, and how much effort you put into automating it, is a tradeoff. From experience, the more effort you put into testing, the fewer bugs your customers will see, though you do run into diminishing returns.
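For instance, the pre-check-in smoke test might be nothing more than a script like this, run against a locally started dev server (the port and paths are assumptions; it just confirms the key pages still respond):

```python
import sys
import urllib.error
import urllib.request

# Assumed local dev server and a handful of pages that must keep working.
BASE_URL = "http://localhost:8000"
SMOKE_PATHS = ["/", "/login", "/search?q=test"]

failures = []
for path in SMOKE_PATHS:
    url = BASE_URL + path
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            if response.status != 200:
                failures.append(f"{url} -> HTTP {response.status}")
    except (urllib.error.URLError, TimeoutError) as exc:
        failures.append(f"{url} -> {exc}")

if failures:
    print("Smoke test FAILED:")
    for failure in failures:
        print(" ", failure)
    sys.exit(1)   # block the check-in
print("Smoke test passed.")
```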
Monitoring has the advantage of reporting just bugs that are seen by customers.
QA I think of more as having professionals who pretend to be customers, which, done right, lets you catch the bugs that are egregious enough to make you look bad had actual customers seen them.