

The Limoncelli Test (Joel test for Sysadmin Teams) - progga
http://everythingsysadmin.com/the-test.html

======
tptacek
It's not bad, but it needs editing. The Joel Test is a hugely important
document (for a blog post) because it's incisive. Spolsky could have added,
"does your team ban sprintf, strcpy, and strcat" and then written a graf on
buffer overflows. He didn't, because that's not one of the very few questions
in the Joel Test.

The Joel Test isn't "if Joel Spolsky was designing a new dev team from
scratch, here's his whole checklist". But this sysadmin test seems that way.

So in that spirit, here's my first wave of things I think you should cut:

(3) Plenty of excellent ops teams don't keep internal team metrics (other
than availability stats). It's also too fuzzy.

(8) Prioritizing stability over features actually contradicts Joel on Software
("some bugs aren't worth fixing", to paraphrase). Saying that you prioritize
one over the other is also a platitude. Axe this.

(9) Virtually every great dev team uses source control, has bug tracking, &c.
Not every great ops team has "design docs" for every (or even any) project
they undertake.

(11) Similarly, an "opsdoc" for every service (the mini website with "how to
rebuild this") is a nice-to-have, not a must-have. How I know that: I've
never once met an ops team that actually has this.

(14) Dev/QA/Prod environments: I don't think you can axe this, but I think you
got too aspirational, on two axes: first, most teams don't have dev AND QA AND
prod (though every good ops team has at least a prod and a "something else"
environment); and second, there are services that don't need this much rigor.

(22) Refresh policy for hardware? If it ain't broke, &c. Why does a good
sysadmin team refresh hardware just for the hell of it? I remember when
network admins used to be proud of keeping highly utilized networks running on
the old ugly Cisco AGS+ boxes and made fun of the kids who bragged about their
7500s. This is too fuzzy to be part of a "test".

(23) Why do I care if servers stay up when one hard drive dies, as long as my
_service_ stays up even when a whole rack catches fire?

(28) Anti-malware? Really? In 2011? I'm sure you have a whole blog post to
write about this, but if you have to justify it, maybe leave it out of your
"test".

A suggestion for your document that also adds another acid test to things you
should keep on the list: what's an ops team that everyone knows kicks ass
as a result of doing _all these things_? When Joel Spolsky wrote The Joel
Test, he got to use Microsoft as a "12 out of 12" case. Who does all these
things? Amazon? (Did I miss that in your document? I'm tipsy, sorry).

Hope that's constructive.

~~~
mechanical_fish
I kind of liked the hardware refresh policy. It's not necessarily because the
hardware goes stale. It's to keep the process and the personnel from going too
stale.

By constantly mixing in new hardware one piece at a time, you compel code to
be runnable on multiple generations of hardware at once, avoiding flag days.
You continuously shake the bugs out of new code, preventing it from growing a
hardware dependency in year N that only gets discovered in year N+2. You
periodically drill the team (especially the newer folks) in the procedure for
bringing up new boxes, while accomplishing real work (gradual upgrade of the
server farm) in the process. You end up running hardware with a continuous
range of model numbers and batches, perhaps mitigating flaws that strike
entire batches at once. And when disaster strikes and you have to replace a
box ASAP, odds are better that you've set up a similar box in recent memory
and know exactly what to get, how to set it up, and what any pitfalls might be
- and if there _are_ pitfalls, you discovered them during working hours on
spare hardware, rolled back, and spent a few weeks fixing them instead of
discovering them at 5 AM on a Saturday and having to fix them on the fly.

(Of course, my devops team does everything in AWS, so what do I know about
managing hardware?)

I agree that this checklist is way too long to vie with the Joel Test, though.

~~~
rhizome
What you describe belongs to what is called "platform spread" and is generally
something to be avoided.

~~~
mechanical_fish
That's interesting. Perhaps the fact that I've learned about ops entirely in
the era of virtual machines running atop disposable, generic, and (in the case
of AWS) entirely invisible hardware has distorted my thinking on this
matter...

~~~
TomLimoncelli
"Sysadmins and devs shouldn't really care if they have five generations of
hardware."

I wish it were true. Sadly, there are some services where scale and latency
are so carefully measured that individual software releases are rejected if
performance gets worse (or unacceptably worse, etc.). In these situations you
need to test on all hardware platforms. It is much better to have fewer
platforms -- optimally, the one you are migrating off of and the one you are
moving to.
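
To make that concrete, here is a rough sketch of such a gate (the result files
and the 2% tolerance are invented for illustration): compare per-platform p99
latency between the baseline and the candidate build, and reject the release
if any platform regresses.

    #!/usr/bin/env python3
    # Sketch of a per-platform performance gate: reject a release if p99 latency
    # regresses beyond a tolerance on any hardware platform still in service.
    # The JSON result files and the 2% tolerance are made up for illustration.
    import json
    import sys

    TOLERANCE = 1.02  # allow up to 2% regression before rejecting

    def load(path):
        with open(path) as f:
            return json.load(f)  # e.g. {"platform-a": 41.3, "platform-b": 57.9}

    def main(baseline_path, candidate_path):
        baseline, candidate = load(baseline_path), load(candidate_path)
        failures = []
        for platform, old_p99 in baseline.items():
            new_p99 = candidate.get(platform)
            if new_p99 is None:
                failures.append("%s: no benchmark result for the candidate" % platform)
            elif new_p99 > old_p99 * TOLERANCE:
                failures.append("%s: p99 %.1fms -> %.1fms" % (platform, old_p99, new_p99))
        if failures:
            print("REJECT release:")
            for failure in failures:
                print("  " + failure)
            sys.exit(1)
        print("OK: no platform regressed beyond tolerance")

    if __name__ == "__main__":
        main(sys.argv[1], sys.argv[2])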

For desktops... have you ever tried to maintain a Windows or Linux desktop
environment with more than 4 "standard desktop configurations"? It becomes a
nightmare. If you have a single "gold image" you blast to all machines, it
makes the task harder; if you stay with the vendor's OS and try to maintain it
"forever" it is even worse.

One thing that makes virtualization a "win" is that the virtual box looks like
a single hardware platform. It reduces testing, etc. However, then you still
need to test the virtualization software on all hardware platforms... so
you've made things easier for everyone but that team.

------
antoncohen
This is a good list. I think a key part is "The score doesn't matter as much
as attitude." Not every company will have everything on the list; you can't
expect small startups to have all of this. But if they balk at the ideas, it
shows that there is a problem.

It turns out that some companies have a management team that is against
automation, written policies, and fixing security and stability issues. Here
are questions I wish I had asked in job interviews:

4. Do you have a "policy and procedure" wiki?

8. In your bugs/tickets, does stability have a higher priority than new
features?

16. Do you use configuration management tools like cfengine/puppet/chef? (See
the sketch after this list.)

20. Is OS installation automated?

28. Do desktops/laptops/servers run self-updating, silent, anti-malware
software?

29. Do you have a written security policy?
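
To make question 16 concrete, here is a toy sketch of the desired-state idea
behind those tools; it is not how cfengine/puppet/chef are implemented, and
the file, contents, and owner are placeholders rather than a real policy:

    #!/usr/bin/env python3
    # Toy illustration of the desired-state idea behind cfengine/puppet/chef:
    # declare what a resource should look like, then only change what differs,
    # so running it twice changes nothing. The path, contents, and owner are
    # placeholders, and writing to /etc obviously needs appropriate privileges.
    import os
    import pwd
    import subprocess

    DESIRED = {
        "path": "/etc/motd",
        "content": "Managed by configuration management. Do not edit by hand.\n",
        "owner": "root",
        "mode": 0o644,
    }

    def apply(resource):
        path = resource["path"]
        changed = False
        current = None
        if os.path.exists(path):
            with open(path) as f:
                current = f.read()
        if current != resource["content"]:
            with open(path, "w") as f:
                f.write(resource["content"])
            changed = True
        st = os.stat(path)
        if (st.st_mode & 0o777) != resource["mode"]:
            os.chmod(path, resource["mode"])
            changed = True
        if pwd.getpwuid(st.st_uid).pw_name != resource["owner"]:
            subprocess.run(["chown", resource["owner"], path], check=True)
            changed = True
        return changed

    if __name__ == "__main__":
        print("changed" if apply(DESIRED) else "already in desired state")

The point isn't the code, it's that the desired state is written down and
applied the same way on every machine instead of living in someone's head.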

------
rhizome
I give it a C, maybe a B if you're entry level.

~~~
rhizome
Fine, how's this for question 18: how do you know when your notification
system goes down?

~~~
antoncohen
> 18. Do automated processes that generate email only do so when they have
> something to say?

> for question 18: how do you know when your notification system goes down?

You monitor your monitoring system. I think 18 is important: noise from
automated processes will hide real problems. I worked with a manager who would
consistently write cron jobs that ran as root (17) and sent out useless emails
every day (18). One of those cron jobs sent 500KB-10MB of text every day; no
one will read 10MB of text, so if there is an error no one will see it. Write
your scripts correctly: use --quiet flags and redirect stdout to /dev/null.
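
A cheap pattern that gets you most of the way there is a wrapper that stays
silent on success, so cron only mails when the job actually failed. A minimal
sketch (the backup command is a placeholder):

    #!/usr/bin/env python3
    # Cron wrapper in the spirit of question 18: run a job, stay silent on
    # success, and only produce output (which cron turns into mail) when the
    # job fails. The backup command below is a placeholder for your own job.
    import subprocess
    import sys

    CMD = ["/usr/local/bin/nightly-backup", "--quiet"]  # hypothetical job

    def main():
        proc = subprocess.run(CMD, capture_output=True, text=True)
        if proc.returncode == 0:
            return 0  # success: print nothing, so cron sends no mail
        # failure: say what happened, and keep it short enough to actually read
        print("%s exited %d" % (" ".join(CMD), proc.returncode), file=sys.stderr)
        sys.stderr.write(proc.stderr[-4000:])  # the last few KB, not 10MB
        return proc.returncode

    if __name__ == "__main__":
        sys.exit(main())

Cron's default behavior of mailing any output then works in your favor, once
scripts are quiet by default.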

~~~
rhizome
What monitors the monitor-monitor?

~~~
antoncohen
The monitoring servers monitor each other (and themselves); they should be in
different data centers. You can also use a third-party service to monitor
parts of your infrastructure, including the monitoring server. Depending on
your needs, a simple service like Pingdom could be used.
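
As a rough sketch of the cross-check (the status URL and the alerting are
placeholders for whatever you actually run), each monitoring box can poll its
peer from cron and page if the peer stops answering:

    #!/usr/bin/env python3
    # Sketch of "monitor the monitoring": this runs from cron on one monitoring
    # box and pages if the peer box in the other data center stops answering.
    # The status URL and the alert mechanism are placeholders.
    import subprocess
    import sys
    import urllib.request

    PEER_STATUS_URL = "http://monitor-b.example.com/status"  # hypothetical peer
    TIMEOUT_SECONDS = 10

    def peer_is_healthy(url):
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
                return resp.status == 200
        except OSError:
            return False

    def alert(message):
        # stand-in for real paging: mail, an SMS gateway, a Pingdom check, etc.
        subprocess.run(["logger", "-p", "user.crit", message], check=False)
        print(message, file=sys.stderr)

    if __name__ == "__main__":
        if not peer_is_healthy(PEER_STATUS_URL):
            alert("peer monitoring server is unreachable: " + PEER_STATUS_URL)
            sys.exit(1)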

If you are wondering how to monitor whether both data centers go down at the same
time, I'd say for most companies you don't worry about it. 1) The odds are
extremely low. 2) You will notice if two DCs go down. 3) That nightly email
that says "I'm up" isn't going to help here. 4) Even the free version of
Pingdom will alert you when your whole datacenter is down.

There are all sorts of other things to consider with redundant monitoring, but
that's the job of a sysadmin -- identifying failure points, assessing risk,
etc.

~~~
TomLimoncelli
I'm not sure how email would help. I guess you mean the system would email you
once a day saying that the monitoring system is working, and if you don't see
the email you know to check into it. The monitoring system I use has a tighter
SLA than 24 hours.

Usually folks divide the monitoring work among two servers and each server
monitors the other. Or, you "meta monitor"... a monitoring system that just
monitors the monitoring system. Then you get a third-party to monitor that.
Then it is turtles all the way down.
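
One cheap way to do the meta-monitoring is a dead-man's switch: the monitoring
system touches a heartbeat file every cycle, and a separate job alerts if that
file goes stale. A minimal sketch, with the path and threshold invented for
illustration:

    #!/usr/bin/env python3
    # Dead-man's-switch sketch: alert if the monitoring system hasn't updated
    # its heartbeat file recently. The path and threshold are placeholders;
    # run this from somewhere that can see the file (or fetch the timestamp
    # remotely), not from the monitoring box it is watching.
    import os
    import subprocess
    import sys
    import time

    HEARTBEAT_FILE = "/var/run/monitoring/heartbeat"  # hypothetical
    MAX_AGE_SECONDS = 15 * 60

    def main():
        try:
            age = time.time() - os.stat(HEARTBEAT_FILE).st_mtime
        except FileNotFoundError:
            age = float("inf")
        if age > MAX_AGE_SECONDS:
            msg = "monitoring looks dead: %s last touched %.0f seconds ago" % (
                HEARTBEAT_FILE, age)
            subprocess.run(["logger", "-p", "user.crit", msg], check=False)
            print(msg, file=sys.stderr)
            return 1
        return 0

    if __name__ == "__main__":
        sys.exit(main())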

~~~
rhizome
It sounds like both of you are talking about cloud stuff when thinking about
datacenters. I can see how that many layers and cross-checks would be
necessary when all you really control is running memory and some pieces of
storage, but a lot of that is due to the platform. When you control the actual
metal, third-party monitoring services are much less necessary.

For a real DC, when it goes down I get a phone call from a human. I don't have
to reinvent that process. If it's my own server room, I use a landline and a
modem for OOB "dude you gotta come down here" notifications, a WAV of Woody
Woodpecker or something. If the phone lines are down, I look at the newspaper
headlines to see what happened.

There's no reason not to set up a standalone monitoring regime. Whether or not
you use heartbeat notifications to tell you all is well is a matter of taste,
but there is definitely more to maintaining your nines on a daily basis than
simply adding more layers of monitoring.

