
Ask HN: Write tests for your infrastructure automation - yeukhon
I use Ansible to automate infrastructure configuration and deployment. Take Graphite as an example: Graphite is made up of a UI (in Django), Carbon (the metrics-processing daemon), and Whisper (the database).

Ansible created the instance, installed the necessary packages, templated the configuration files, created the appropriate directories, added Nginx (for access to the UI), configured supervisord to manage the Django web app and the Carbon daemon, and a bunch of other things.

Each task/step in Ansible should tell me the state of the deployment (has config file A been changed? Have we reloaded Carbon? Is Carbon running?). I can also use Ansible to tell me whether or not http://graphite.internal.company is accessible and whether or not I can see data coming through. I can also write a test in Ansible by sending a dummy metric to Graphite over a socket (to verify that I can actually send things to Carbon and that the data is then saved into Whisper).

I also want to know whether or not we rotate our logs (yes, sure, we deployed the logrotate config file yesterday and my policy is to rotate daily, so the next day I want to check that rotation actually happened).

See, Ansible can do all of that. I can write plugins/modules to handle all of the above.

But the question is: to what extent do people write tests for their infrastructure? There are probably a lot of things I question myself about after a deployment. Another thing is, sure, I can write everything in Ansible, but would my tests be better off written in my favorite language, and perhaps run using Ansible (especially if I need SSH access)?

Because my end goal is to reassure myself, and eventually to produce my own Chaos Monkey. How can I run Chaos Monkey if I only have tests for applications, but no tests for infrastructure?

Thoughts?
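
To make the kind of checks I mean concrete, here is a rough sketch in Ansible (the Whisper path is a guess, bash is assumed on the target, and Carbon's plaintext listener is assumed to be on its default port 2003):

```yaml
# Rough sketch only -- hostname from above; port and paths are assumptions.
- hosts: graphite
  tasks:
    - name: Carbon should be listening on its plaintext port
      wait_for:
        host: graphite.internal.company
        port: 2003
        timeout: 5

    - name: The Graphite UI should respond
      uri:
        url: http://graphite.internal.company
        status_code: 200

    - name: Send a dummy metric to Carbon over a plain TCP socket
      shell: echo "deploy.test.heartbeat 1 $(date +%s)" > /dev/tcp/graphite.internal.company/2003
      args:
        executable: /bin/bash

    - name: The dummy metric should end up in a Whisper file
      # Carbon flushes to disk asynchronously, so in practice this needs a
      # retry/delay; the storage path is only a guess.
      stat:
        path: /opt/graphite/storage/whisper/deploy/test/heartbeat.wsp
      register: wsp
      failed_when: not wsp.stat.exists
```
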
======
aelsabbahy
I like to think to myself, "If I had to verify this server manually, what would
I check?" and write automated tests for those. Usually it's a few high-level/important
checks for critical services, ports, packages, and users. For configuration files,
I check for one setting and, if it's there, assume the rest are correct. Ultimately,
it's mostly going by gut and balancing the effort of maintaining a test against the
increased confidence in the deployment that having the test gives you.

Some tools out there:

* [https://github.com/aelsabbahy/goss](https://github.com/aelsabbahy/goss) - YAML, simple, self-contained binary, extremely fast.

* [https://github.com/indusbox/goss-ansible](https://github.com/indusbox/goss-ansible) - Ansible module for goss; never used this, but you might find it useful.

* [http://serverspec.org/](http://serverspec.org/) - Ruby, the most popular infra testing tool.

* [https://github.com/chef/inspec](https://github.com/chef/inspec) - Ruby, looks like an improved serverspec, almost the same syntax, made by the Chef guys.

* [https://github.com/philpep/testinfra](https://github.com/philpep/testinfra) - Python; don't know much about it, but mentioning it since Ansible is Python.

Spend a little bit of time experimenting with all of them and see which one you
like.

Full disclosure: I'm the author of goss.
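
As a starting point, a minimal goss file covering the kinds of checks above might look something like this (the service, package, and file names are just placeholders):

```yaml
# goss.yaml -- placeholder names; swap in the services you actually care about.
service:
  nginx:
    enabled: true
    running: true
port:
  tcp:80:
    listening: true
package:
  nginx:
    installed: true
user:
  nginx:
    exists: true
file:
  /etc/nginx/nginx.conf:
    exists: true
    contains:
      - "worker_processes"
```

`goss validate` runs the checks, and `goss add`/`goss autoadd` can generate most of them from the current state of a server.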

------
pjungwir
I have similar thoughts. In the new SRE book[0], there is a history of
Google's infrastructure automation, and in a way it started with tests:

First they had Python scripts to do things on various machines. (It actually
sounds a lot like Ansible: lots of little scripts that all ran over ssh.) But
because of high configurability, these didn't always work right.

So they wrote a bunch of tests, e.g. ClusterExistsInMachineDatabase,
DNSTestHasBeenAssignedMachines, so they could find out what wasn't right when
a new machine had been provisioned.

Then they realized that fixing the tests could usually be automated, so they
wrote code for each test, to correct the issue if it was failing.

It seems like they sort of backed into a declarative idempotent configuration
management solution like Chef or Puppet, where you say what you want the
machine to look like, and the config management is responsible for getting you
there.
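
In Ansible terms (a trivial sketch of the idea, not something from the book), the declarative style is essentially:

```yaml
# Declare the desired end state; the tool decides whether anything
# needs to change and reports "changed" only when it actually did.
- hosts: web
  tasks:
    - name: nginx is installed
      package:
        name: nginx
        state: present

    - name: nginx is running and starts on boot
      service:
        name: nginx
        state: started
        enabled: true
```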

As I think you are feeling, in config management, the redundancy of tests and
automation code is a bit more ... redundant ... than with automated tests for
development.

I think monitoring/alerting is another kind of test: Is the database up? Is
the web site responding?

Another good story from that book is how one internal database never went
down, so teams became lax about designing systems that would still work
without that component. So Google decided they'd just take the database down
for a bit. :-) It sounds a bit like their version of Chaos Monkey.

[0] [http://www.amazon.com/Site-Reliability-Engineering-Productio...](http://www.amazon.com/Site-Reliability-Engineering-Production-Systems/dp/149192912X/)

------
FlopV
It depends on how critical it is to get this right without errors, how often
this deploy is going to be run, and how much time there is to deliver it.

I'm not sure what Chaos Monkey is, but I come from an operations background,
moved into a devops role, and now automate infrastructure builds. Testing is
essential. There is no need to differentiate best practices based on whether
you're coding for the application or for the infrastructure.

As you said, each state of the deployment should be checked. I'll do
pre-checks to make sure the filesystems have the needed space for the MW
component, etc. I'll check each phase, and break the phases down into functions
much like a developer would for application development. This keeps my
code reusable and easy to read for other admins. It's much easier to
rerun/correct one function that fails during a deployment than to go through
the entire setup again.
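
As a sketch of such a pre-check in Ansible (the OP's tool; the mount point and threshold here are made up):

```yaml
# Hypothetical pre-check: fail early if /opt lacks the space the install needs.
- name: Ensure /opt has at least 2 GB free before deploying
  assert:
    that:
      - item.size_available > 2 * 1024 * 1024 * 1024
    fail_msg: "Not enough free space on {{ item.mount }}"
  loop: "{{ ansible_mounts | selectattr('mount', 'equalto', '/opt') | list }}"
```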

I'd write the test in whatever language is doing the deploy, as I test while
it's deploying.

From there, it's good to have some type of audit you can run against your
infrastructure to check versions, mount points, and changes. It can be a mess
when someone updates one server and the rest end up slightly different. You'd be
surprised how often this happens; eventually the development environment looks
different from prod, and people wonder why the application is behaving
differently.
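
For instance, a small drift audit could be a short Ansible play run periodically (the package name and pinned version are only examples):

```yaml
# Example audit: flag hosts whose nginx version has drifted from the pin.
- hosts: all
  tasks:
    - name: Gather installed package facts
      package_facts:

    - name: nginx should match the pinned version
      assert:
        that:
          - "'nginx' in ansible_facts.packages"
          - "ansible_facts.packages['nginx'][0].version is version('1.18.0', '==')"
        fail_msg: "Version drift on {{ inventory_hostname }}"
```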

I can go into more detail later on, but hope that gives you a feel for it!

BTW, I'm at a large enterprise so these deployments and installs are on an
enterprise scale, which makes it worth getting it right in the automation
piece, as the person running it isn't always an expert on the automation, or
the infrastructure itself. Getting it right the first time, or logging what
broke, will save a lot of headaches later on.

------
stevekemp
I write tests for my own servers via the Ruby gem Serverspec. It's generally
something I let lapse over time, but I always make sure to update the tests
before applying a distribution update.

Being able to validate that a Debian Wheezy installation still has all the
services I expect running after I upgrade it to Jessie is very useful and
reassuring.

I wrote a brief introduction to the process here:

[https://debian-administration.org/article/703/A_brief_introd...](https://debian-administration.org/article/703/A_brief_introduction_to_server-testing_with_serverspec)

------
afarrell
I wrote this tutorial on infrastructure automation and it uses py.test for
things, but honestly I don't think the tests are very good...

[https://amfarrell.com/saltstack-from-scratch/](https://amfarrell.com/saltstack-from-scratch/)

