

Programmers: how to make the systems guy love you - mocko
http://mocko.org.uk/b/2010/10/17/programmers-how-to-make-the-systems-guy-love-you/

======
vilya
One thing to add to the points about documentation: if you as a developer
either don't write it or do a half-assed job of it, then you won't be able to
just hand it off to the systems guy - you're going to end up doing
2nd/3rd/whatever line support for it because the documentation is YOU. And
that, of course, will mean less time for coding.

------
megablast
In my limited experience, you can never make the system guys love you, nobody
can.

Simply put, system guys have a lot of downtime, where they do stuff that they
enjoy, rather than work. So anytime anyone interrupts them from this, they are
going to be annoyed, it is only natural. You are taking them away from fun, to
something boring, and annoying to fix.

Regarding the article, I am not sure why hosting considerations aren't talked
about at the beginning of the project. Even working with non-technical people,
this has always been one of the first issues. Are we going to host it here,
are you going to host it, can it use a shared platform, does it need a
database. This is all thought about at the start of the project, and if need
be a systems guy is brought in.

~~~
logicalstack
If the system guys you have worked with have lots of downtime you have only
worked with crap sysadmins.

~~~
scalyweb
I'm guessing in this case downtime does not refer to systems being in a down
state but rather the sysadmins have time to work on whatever interests them.

~~~
9oliYQjP
logicalstack's comment still makes sense if you interpret downtime to mean
what you've just explained. Here are some ways that good systems guys fill
their time in between deployments and triaging problems:

1. Reading security bulletins and proactively trying to determine if systems
are affected and what to do about it.

2. Reading about upcoming hardware and software so that they can plan the
best platforms to deploy applications to.

3. Auditing applications for various problems.

4. Doing routine testing of the validity of backups and how easily they might
be restored.

I upvoted logicalstack back to 0 because he might have implied that if your
systems guys are reading Hacker News all day and commenting on threads about
TechCrunch articles, they might have better places to spend their reading
time. Believe it or not, there really are systems guys that have their hands
full doing real work and don't just sit back in their chair surfing the net
and playing video games.

------
logicalstack
One missing item is instrumentation and metrics. Understanding and debugging a
complex application is made much easier by having an abundance of easily
collectable metrics that describe the running or cumulative state of the
system.

~~~
gnubardt
Totally. As important as logging is to profiling and debugging a system,
collecting metrics is invaluable for keeping a system running. Being alerted
to potential (or actual) problems can allow an admin to respond to them
effectively, rather than trying to patch things together after it's hit the
fan.

------
Goladus
Another one: usually one of the first places a sysadmin is going to look when
there is a problem is the log file. A good sysadmin will be able to read
compiler-generated errors but still may not be able to do anything if the
messages reference cryptic variable names and methods from code they've never
seen. Spend a little time thinking about how an end-user, rather than a
developer, would read the log file.

Obviously don't short-change yourself if you're the one who will be writing
bug fixes and need detail, but a little can go a long way here, especially for
problems a sysadmin may be able to diagnose and fix (connection issues,
resources issues, etc.)

~~~
thirdstation
It can also save you a 2a.m. phone call.

Some simple, "If you see message X, it means Y, so you should do Z. If that
doesn't work _then_ call me."

Your debug-level messages can be esoteric, but make your error-level messages
reflect what's broken from the perspective of the SA, who hasn't seen your
source code.

------
CWIZO
Great read, but I was hoping for some secret advice on how to be loved by sys
guys/gals that can't be known to us programmers unless somebody tells us :)

~~~
thwarted
Okay, here are some:

Be aware of how the init/startup system works on the machines your software is
going to be deployed to. If you're writing a long running network service,
then this means providing an init.d script that provides at least start, stop,
restart, and status actions. Learn enough of shell scripting to write this
script in sh or bash, not in some more heavyweight language like python or
perl or php or ruby.

Write services that don't need to be running as root and/or know how to cease
being root after their setup stage. Don't hardcode any settings (like port
being listened on, or user to run as). Make a config file. And don't make it
XML. Make it easy to build a package, deb or rpm, by providing the necessary
debian/ dir contents or .spec file.

When the systems team says there's a bug in your code, fix it. If you don't,
you risk having the notifications for the failed service being sent to _your_
phone waking you up. When code is changing all the time, there's more likely
to be a bug in your code than there is to have a random system configuration
or hardware issue. In many cases, machines with hardware problems are most
likely already removed from service before you even know about it.

Don't _make_ the systems team debug your code and provide a patch. I know a
lot of developers who think systems people don't/can't program or they only
know how to "script"; that's not true, being able to program is one thing
that, I think, is required to be on the systems team. We'll help you debug and
provide the tools to get things resolved, but we shouldn't have to patch your
code for you.

Don't write services that keep a lot of internal, in-memory state that isn't
saved to disk periodically such that it can't pick up where it left off if it
is restarted.

Do write services that allow easy introspection into its state. The memcached
"stats" command is an example. This allows us to easily hook your code into
things like ganglia. Some programs, when sent a signal, dump info to a log
file.
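
A minimal sketch of the signal-driven dump mentioned above, assuming a POSIX system. The `STATS` counters, the function name and the choice of SIGUSR1 are illustrative assumptions, not something any particular service uses:

```python
import json
import signal

# Hypothetical counters that the service would update as it runs.
STATS = {"requests": 0, "errors": 0}


def install_stats_dump(path, signum=signal.SIGUSR1):
    """On `signum`, append the current counters to `path` as one JSON line."""
    def dump(_signum, _frame):
        with open(path, "a") as out:
            out.write(json.dumps(STATS, sort_keys=True) + "\n")

    # Must be called from the main thread of the process.
    signal.signal(signum, dump)
```

With this installed, something like `kill -USR1 <pid>` makes the process append a snapshot of its counters, which a monitoring script can then pick up.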

Speaking of log files, use techniques to detect if your log file has changed
and reopen it if it's rotated (stat(logfilename) != fstat(logfilefd)).
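
The stat/fstat comparison can be sketched roughly like this, assuming a POSIX filesystem where device and inode numbers identify a file. The helper names are made up for illustration:

```python
import os


def log_was_rotated(path, fd):
    """Return True if `path` no longer refers to the file open on `fd`."""
    try:
        on_disk = os.stat(path)
    except FileNotFoundError:
        # The old file was moved away and nothing has replaced it yet.
        return True
    opened = os.fstat(fd)
    return (on_disk.st_dev, on_disk.st_ino) != (opened.st_dev, opened.st_ino)


def reopen_if_rotated(path, log_file):
    """Return a file object that points at whatever `path` is right now."""
    if log_was_rotated(path, log_file.fileno()):
        log_file.close()
        log_file = open(path, "a")
    return log_file
```

Calling `reopen_if_rotated` before each write (or every few seconds) keeps the process from writing into a file that logrotate has already renamed.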

Those who don't learn /bin and /usr/bin are doomed to reinvent them, poorly.
Don't waste time reimplementing things that already exist. Like xargs. Or
head. Or wc. Or yes. Or host. Or watch.

Learn how to use the stuff in binutils, like objdump, nm and strings.

Don't write scripts in python or perl or ruby or php that do some weird setup
(like parsing command line arguments) and then just invoke something via the
shell with system(). On the off chance you do need to do this, use exec
instead.
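
A rough sketch of the exec approach in Python. `build_argv` stands in for whatever setup the wrapper does, and the rsync command line is only an example:

```python
import os


def build_argv(args):
    """Hypothetical setup step: turn wrapper arguments into the real command."""
    return ["rsync", "--archive"] + list(args)


def run(args):
    argv = build_argv(args)
    # os.execvp replaces the current process image and does not return on
    # success. No shell is involved, so the arguments need no quoting, and
    # signals and the exit status belong to the real program directly.
    os.execvp(argv[0], argv)
```

A wrapper script would typically end with `run(sys.argv[1:])`; nothing after that call executes unless the exec itself fails.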

Don't name scripts that might be integrated with other tools or batch jobs
with a language specific extension (unless this is required by your platform,
_cough_ ahem) or it's a language specific library. Reserve .py for python
modules, for example. The reason here is that your perl script might be
rewritten as ruby one day or optimized by porting it to C, and there might be
stuff calling it, and then you end up with a ruby script named .pl. We don't
name C binaries ending with a .c extension, your "production" scripts are
logically "binaries" if not physically. (I've actually run across this).

Use or create proper library routines to access structured data files. As a
contrived example, use getpwent, don't parse /etc/passwd by yourself.
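
In Python the equivalent of the getpw* family is the standard `pwd` module. A small sketch (the function name is an assumption for illustration):

```python
import pwd


def home_directory(username):
    """Look up a home directory through the system account database."""
    try:
        return pwd.getpwnam(username).pw_dir
    except KeyError:
        # The account database has no such user.
        return None
```

Unlike a hand-rolled parser for /etc/passwd, this goes through the C library lookup, so it also sees accounts that come from sources such as LDAP or NIS when the system is configured that way.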

Put a password on your ssh private key.

Don't architect distributed things that require passwordless ssh keys to run.
If you need to distribute files to a bunch of machines, they should be pulled
from an rsync or http server, rather than via scp or remotely invoking rsync
over ssh.

Learn about sudo, because you're going to have to use it.

Learn the difference between tmpnam, mktemp and mkstemp, and why you should
use mkstemp.
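
The short version: tmpnam and mktemp only hand back a name, which leaves a window between choosing the name and creating the file in which someone else can create it first. mkstemp creates and opens the file in one step. A sketch using Python's wrapper; the `myapp-` prefix is just an example:

```python
import os
import tempfile


def write_scratch_file(data):
    """Create a private temporary file, write `data` to it, return its path."""
    # mkstemp returns an already-open descriptor plus the path, so there is
    # no gap between picking the name and creating the file.
    fd, path = tempfile.mkstemp(prefix="myapp-", suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

The caller is responsible for removing the file when it is done with it.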

Honor the TMPDIR environment variable.

Don't fill up the disks. When you get automated quota emails, do something
about it.

Don't work around the limits that have been put in place on systems, they are
most likely there for a reason; if you are hitting a limit, we're open to
having them changed. For example, if there's a 4 gig address space limit on a
machine for user xyz, don't work around that by changing the user your code
runs as.

Write runbook documentation. This means somewhat of a Q&A style "How do I..."
or "What to do when X, Y, Z". Make them easy to visually scan, make keywords
standout. Hopefully, your systems team has provided a place for these to be
stored that is easily accessible and easy to update. A lot of people get
bogged down when writing documentation, trying to figure out what's
appropriate to document. A runbook, which is a living document, makes this
easier, because it's obvious that you add to it as problems or corner cases
crop up.

I'm sure I could come up with some more.

~~~
FooBarWidget
> init script

Doesn't seem very sensible to me. Why should the programmer write this, and
not the packager or sysadmin? As a programmer I already have to deal with
2949832 platforms, isn't it the packager's/sysadmin's job to make sure it
integrates into your specific system provided that the programmer writes
general usage instructions?

> Honor the TMPDIR environment variable.

Funny. My software used to honor $TMPDIR until a sysadmin complained to me
that it _shouldn't_ , and should honor some other my-app-specific env
var/config option instead. And he's a competent sysadmin.

> Don't work around the limits that have been put in place on systems, they
> are most likely there for a reason

My experience is that the limits are very, very often _not_ there for a reason
and the administrator didn't even know about the limits until my software hits
those limits.

One good example is the file descriptor limit. If too many clients connect to
my server then the fd limit will be hit, but instead of raising the limit it
a considerable number of people seem to go to my support forum to ask
what's going on. I'm considering increasing the limit automatically on the
next version just to get rid of the support emails.

Another example is the stack size. A major user accidentally set his system
stack size to 80 MB. My software creates a bunch of threads but after a while
it runs out of virtual address space because of the huge stack size. We found
out about the 80 MB stack size after a while and we had to explain to them why
that was a mistake. They fixed it afterwards but we wanted to avoid support
questions like this again so from that point on we hardcoded our thread stack
sizes to 64 KB.

~~~
thwarted
This was a list of things that get your systems guy to love you. If you want
your systems team to love you, make their job easier. If your systems guy
creates the init script for you or just increases limits when you say, great:
your systems guy _already_ loves you and you need to do nothing.

 _[init script] Doesn't seem very sensible to me. Why should the programmer
write this, and not the packager or sysadmin? As a programmer I already have
to deal with 2949832 platforms, isn't it the packager's/sysadmin's job to make
sure it integrates into your specific system provided that the programmer
writes general usage instructions?_

I assume that anyone who has a systems guy/team is working towards deploying
to an internally controlled production environment. Production environments
for the kind of software the audience of hacker news creates usually do not
include 2.95 million different target systems. If they do, then your systems
team is doing it wrong.

The programmer should write this because the programmer knows how their
software should be started and stopped.

 _My software used to honor $TMPDIR until a sysadmin complained to me that it
shouldn't, and should honor some other my-app-specific env var/config option
instead. And he's a competent sysadmin._

The way to figure out where to write temporary files is whichever one of the
following succeeds first: 1) a configuration-specific location that is
writable, 2) the value of TMPDIR, if writable, 3) /tmp. The point of this
tip was to not just assume /tmp is the best place to put temporary files, and
not to read some other environment variable when TMPDIR is the de facto place to
specify the temporary directory via the environment. Obviously, you should
favor what your local systems guy says, since you're trying to win his love,
not mine. But if he's anything like me, he can appreciate this list.
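
That lookup order could be sketched like this. The function name and the `configured` parameter are assumptions standing in for however an application reads its own config:

```python
import os


def choose_temp_dir(configured=None, environ=None):
    """Return the first usable directory: config value, then TMPDIR, then /tmp."""
    environ = os.environ if environ is None else environ
    for candidate in (configured, environ.get("TMPDIR"), "/tmp"):
        if (candidate
                and os.path.isdir(candidate)
                and os.access(candidate, os.W_OK | os.X_OK)):
            return candidate
    raise RuntimeError("no writable temporary directory found")
```

Python's own `tempfile.gettempdir()` already consults TMPDIR before falling back to /tmp, so in many programs only the configuration-specific step needs to be added on top.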

 _My experience is that the limits are very, very often not there for a reason
and the administrator didn't even know about the limits until my software hits
those limits._

This is a function of different distributions, of any operating system, having
different defaults. A good systems guy will be aware of defaults and be able
to almost immediately know that a limit is being reached after a few questions
about the problem.

 _One good example is the file descriptor limit. If too many clients connect
to my server then the fd limit will be hit, but instead of raising the limit
a considerable number of people seem to go to my support forum to ask
what's going on. I'm considering increasing the limit automatically on the
next version just to get rid of the support emails._

You should be using getrlimit to determine how many file descriptors can be
used and then issuing a _meaningful error message_ ("out of file descriptors,
try increasing per-process file descriptor limits, see ulimit" or something of
the sort) if that limit is reached. Even better, use getrlimit to get the
soft limit and the hard limit and use setrlimit to expand to the hard limit. If
you reach that, then also issue a meaningful error message. This would save a
lot of support time because your software's error messages indicate what to
do, and any competent systems person will know how to increase that (or will
know how to find out).
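
A minimal sketch of that approach using Python's `resource` module, which wraps getrlimit/setrlimit on Unix. The function names and the exact wording of the message are illustrative assumptions:

```python
import errno
import resource


def raise_fd_limit():
    """Raise the soft RLIMIT_NOFILE to the hard limit; return the limits in effect."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != hard:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            # Some platforms refuse certain values; keep the existing limit.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def explain(exc):
    """Turn an EMFILE error into a message that says what to do about it."""
    if isinstance(exc, OSError) and exc.errno == errno.EMFILE:
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        return ("out of file descriptors (soft limit %s, hard limit %s); "
                "try increasing the per-process file descriptor limit, "
                "see ulimit -n" % (soft, hard))
    return str(exc)
```

A server would call `raise_fd_limit()` once at startup and pass any `OSError` from accept() or open() through `explain()` before logging it. An unprivileged process can only raise its soft limit up to the hard limit; going beyond that still needs the admin.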

 _Another example is the stack size. A major user accidentally set his system
stack size to 80 MB. My software creates a bunch of threads but after a while
it runs out of virtual address space because of the huge stack size. We found
out about the 80 MB stack size after a while and we had to explain to them why
that was a mistake. They fixed it afterwards but we wanted to avoid support
questions like this again so from that point on we hardcoded our thread stack
sizes to 64 KB._

This is the perfect example of what happens when one just randomly increases
limits. "Mmh.. if 16k is good, then 80meg will be even better!" Someone who
sets an 80meg stack size _obviously_ doesn't know what they are doing and is
just blindly increasing limits in an attempt to get rid of some problem they
don't understand. This is an attempt at a bandaid solution, not a real
solution.

------
jedwhite
This is one of the reasons for programmers making startups to use "platforms
as a service" like Google App Engine, Heroku, Rackspace Cloud et al. They
really do simplify deployment and managing systems, and let you focus on code,
particularly for platforms that traditionally are hard to scale without a lot
of knowledge and work.

------
v21
Can I just indulge in a bit of patriotism here? A few paragraphs in, I knew I
was reading a British dude, and was confirmed right by the URL. Sure, no doubt
there are many shibboleths to tell, but the main clue was the joy and easy
humour of the writing. Just an incredibly British style, an utter lack of fear
in mingling jokes and serious points.

The biggest thing the author won't dare admit - that he's a nice guy. That he
cares about making it easier for others. I guess it's self-effacing, and I can
see how others may prefer actual open honesty, but this is the way I work, and
as such, this prose is like a warm bath to me.

"What are the five things most likely to break it? If somebody is trying to
fix a problem and you left them a solution on page 5 of the manual they’re
going to be really, really grateful. Once it’s working again they’ll buy you a
beer to say thanks for all the trouble you saved them. Seriously, they will."

------
danac
Most of those excuses mentioned really come down to one of two things:
arrogance (I know best so you don't need to know) or laziness (why bother
because see former), both of which are hardly unique to programmers.

------
moron4hire
It's much simpler than this. Buy him/her a case of beer. That's all you have
to do. Works wonders. People bend over backwards for you after you buy them a
case of beer. It doesn't even have to be good beer; you could get some shitty
$40/case beer and they will love you. Soma.

~~~
oomkiller
$40/case what? That's shitty range? Where do you live, around here you can get
shitty beer for $12/case!

~~~
Figs
How big is a case?

~~~
oomkiller
24? 30 for stones.

