
System administration wisdom - makecheck
I encounter more and more examples of people with scary habits when it comes to administering systems or, to a lesser extent, any project.

Occasionally I can forgive someone for not knowing that tools exist to help them do a job (while it may be painful to watch them work inefficiently for a while, I can fix that). But some habits are really hard to train them out of, and some make me wonder how the team functions at all.

I'll start with examples, of course, and counter with what I consider acceptable. But I'm also interested in your own experiences. With any luck, this thread will become interesting to some, and educational to others. :)

[I've moved my list into a comment below.]

I'd be happy to see other mistakes people have encountered, along with how you'd avoid them, to educate the admins out there.
======
spudlyo
One thing I often see that annoys me has to do with the logical volume
manager. Many people allocate their logical volumes to use the entire volume
group. This doesn't leave any room for snapshots, which are the #1 reason I
use LVM in the first place. I've had to work around this in all kinds of
crazy ways in the past, like disabling the swap partition, reformatting it as
a physical volume, and adding it to the volume group. After I take my
snapshot and I'm through with it, I put it back as swap.
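
The dance looks something like this (a rough sketch; device and volume names
such as /dev/sda2 and vg0 are stand-ins):

    swapoff /dev/sda2                             # stop using the swap partition
    pvcreate /dev/sda2                            # reformat it as an LVM physical volume
    vgextend vg0 /dev/sda2                        # add it to the volume group
    lvcreate -s -L 4G -n rootsnap /dev/vg0/root   # finally, room for a snapshot
    # ...use the snapshot, then undo the whole dance:
    lvremove /dev/vg0/rootsnap
    vgreduce vg0 /dev/sda2
    pvremove /dev/sda2
    mkswap /dev/sda2 && swapon /dev/sda2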

~~~
CyberFonic
I'm with you on that one, especially with the current crop of huge disks.
Volume groups should be resources that you carve up into logical volumes.
Personally, I prefer to assign a separate LV to each application and DB
tablespace. I also size LVs to current requirements and grow them as needed;
I have never seen a system grow the way you expect it to.

With LVs you can also move data around without disruption. I often move
running databases from JBOD to SAN using LV mirroring.
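
The mirroring trick goes roughly like this (vg0/dblv and the PV paths below
are stand-in names):

    vgextend vg0 /dev/san/lun0               # bring the SAN disk into the volume group
    lvconvert -m1 vg0/dblv /dev/san/lun0     # grow a mirror leg onto the SAN, online
    # watch Cpy%Sync in `lvs` until the mirror is fully in sync, then:
    lvconvert -m0 vg0/dblv /dev/jbod/disk3   # drop the JBOD leg
    vgreduce vg0 /dev/jbod/disk3             # and retire the old disk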

------
CyberFonic
A sysadmin is like a member of the pit crew at a Formula One race. Don't
admit anyone who is under-qualified or lacking practical experience and
education.

I carry a notebook (dead-tree version) and take copious notes of before and
after states, note each step taken, and ALWAYS back up configuration files
before changing them. Where applicable, I snapshot configurations, listings,
and so on, before and after, to a directory for that session.
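
The backup habit can be as small as one line before every edit (the path here
is hypothetical):

    # Copy first, edit second; the timestamp makes the session reconstructible.
    cp -p /etc/httpd/conf/httpd.conf \
          /etc/httpd/conf/httpd.conf.$(date +%Y%m%d-%H%M%S)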

If you are using service/consulting firms, watch out! Over time they will
send ever less experienced staff in their attempt to rack up profits. It
never ends well: eventually a simple mistake takes out a critical part of
your infrastructure, and they'll assign blame to some utterly implausible and
unrelated factor.

------
yan
I'd add not using a configuration management tool.

------
allan_
Doing everything by hand and not automating anything.

------
makecheck
(Rather than keep a long initial post, I'm moving my list to a separate
comment here.)

1. _Lack of revision control._

I've seen this too many times. Someone messes with a file without even using a
rudimentary revision control mechanism like RCS, then something goes wrong and
no one knows why. What should require a simple "revert" suddenly takes hours
to recover from. What should be obvious from a "diff" is instead limited by
the accuracy of someone's memory. The sheer number of tools available for this
is astounding, and the fact that many are free and simple to use makes it
inexcusable to just "wing it".
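
Even plain RCS is enough. A minimal sketch of the habit, using /etc/hosts as
the example file:

    cd /etc
    ci -l -t-"hosts file" -m"checkpoint" hosts   # initial check-in; keep a working copy
    vi hosts                                     # ...make the change...
    rcsdiff hosts                                # the "diff": exactly what changed
    ci -l -m"point at the new DNS server" hosts  # record the change
    co -f -r1.1 hosts                            # the "revert": back to the checkpoint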

2. _Not using "sudo"._

The worst offenders I've seen just give "root" to everyone imaginable, who
then use it all the time (often from a "#" shell that's been open for ages, in
which they run all manner of random commands without paying close attention).
I once saw someone use "rm -rf" _as root_ to delete a single symbolic link; I
was cringing, hoping they would not type a space in the wrong place.

The "sudo" mechanism is a great friend to administrators. It forces a team to
clearly acknowledge what privileges are required to achieve objectives, and it
protects fallible humans from themselves.
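
A sketch of what that acknowledgment looks like as a sudoers policy (the
group and command choices are made up for illustration):

    # Grant the named privilege, not a root shell:
    %webadmins  ALL = (root) /usr/sbin/apachectl
    %operators  ALL = (root) /sbin/shutdown -r now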

3. _Not using "visudo" and other helper tools._

I've actually been in situations where I was not the sysadmin, but had to
clue in a sysadmin that "visudo" existed. I'd asked for changes to a "sudo"
configuration relied upon by some Very Important Scripts run by Many Many
People. Sure enough, the person made a syntax error in the file, which was
promptly mirrored to a network of machines; this caused every single script
to fail for some time. Had "visudo" been used, nothing would have gone wrong.

There are entire classes of errors like this that really are 100% avoidable:
as long as you use the correct tools, nothing ever goes wrong. Administrators
should be well aware of these.
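
For the record (the staging path below is hypothetical):

    visudo                        # edit /etc/sudoers with locking and a syntax check
    visudo -c                     # just check the syntax of the current file
    visudo -c -f /stage/sudoers   # vet a candidate file BEFORE mirroring it anywhere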

4. _Constantly messing with the core OS._

When did people decide that it was OK to be adding to or overwriting what's in
/usr/bin and /usr/lib? The whole idea is that you do your meddling in
/usr/local, or /somewhere/else/entirely, and leave the rest of /usr alone.
This makes it clear whether a problem is in the vendor's supported code, it
simplifies OS upgrades and reversions, and nothing is ever damaged
accidentally. I strongly recommend that Unix people read
<http://www.pathname.com/fhs/> in its entirety.
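
For a typical autoconf-style package, staying out of /usr is a single flag:

    ./configure --prefix=/usr/local
    make
    sudo make install    # lands under /usr/local/{bin,lib,share,...}, not /usr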

5. _Being change-averse._

I am routinely frustrated by administrators who become paralyzed, unwilling to
change anything except on a ridiculously long-term schedule. What's worse,
they then decide to change everything that needs changing _at the same time_.
Ironically, what some see as avoiding disaster is actually a recipe for it;
something breaks, and one now has to debug one of 45 possible causes instead
of one.

An isolated environment for testing is essential (at least, to the extent
that's feasible with available hardware money). Rolling out changes should be
something that's easily done with confidence, due to positive experiences on a
test platform. But too often, change becomes a source of paranoia and CYA
instead of being a positive and regular way to boost productivity in an
organization.

