

Operations magic cure: nightly server restarts - samaparicio
http://blog.aparicio.org/2009/11/24/operations-magic-cure-nightly-server-restarts/

======
donw
This could be better titled 'Operations Magic Cure: Test Like A Madman'.

It's not the act of restarting the servers that solves problems. Rather, their
nightly reboot policy forces them to have all of the infrastructure in place
to be able to rotate servers in and out of service quickly. It forces them to
centrally manage configuration. It forces them to architect their software and
networks to handle restarts.

They use server restarts as a way to force all of these things to be in place,
which isn't such a bad idea. More importantly, it doesn't sound like they use
the nightly restarts as a substitute for actual troubleshooting (e.g.,
rebooting a server to 'fix' a problem, rather than figuring out what's really
going on).

~~~
noss
I think you are being very insightful in seeing this as a means of forcing
things to be in place. It inspired me to actually attempt something like it.

Being an Erlang fan, I also recognize some other aspects of high-availability
strategies such as:

* <http://en.wikipedia.org/wiki/Crash-only_software>

* <http://en.wikipedia.org/wiki/Microreboot>

------
djcapelis
I do this with webservers running trac all the time, except I just restart the
webserver; I don't bounce the entire machine.

I also find it fairly unbelievable that the linked post claims that restarting
servers requires a 24/7 operations team. He trusts his servers to do a bunch
of other complex operations unwatched, but doesn't trust them to do a restart?
Obviously you need to make sure the next server in the schedule waits until
the one before it comes back up before it goes down itself, and that something
sends you an alert if one dies and doesn't come back, but otherwise this isn't
something someone should have to run by hand.
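
A minimal sketch of that kind of unattended rolling restart (the hostnames,
health-check URL, and timeout are all made up for illustration):

    #!/usr/bin/env python3
    # Hypothetical rolling-restart driver: reboot servers one at a time,
    # waiting for each to come back healthy before touching the next.
    import subprocess
    import time
    import urllib.request

    SERVERS = ["web1.example.com", "web2.example.com", "web3.example.com"]
    HEALTH_URL = "http://%s/healthz"   # assumed health-check endpoint
    WAIT_SECS = 600                    # made-up patience before alerting

    def healthy(host):
        try:
            urllib.request.urlopen(HEALTH_URL % host, timeout=5).read()
            return True
        except Exception:
            return False

    def alert(msg):
        print("ALERT:", msg)   # stand-in: page a human for real here

    for host in SERVERS:
        subprocess.run(["ssh", host, "sudo", "reboot"])
        deadline = time.time() + WAIT_SECS
        while time.time() < deadline:
            time.sleep(15)
            if healthy(host):
                break
        else:
            alert("%s did not come back after reboot; stopping" % host)
            break   # don't take the next server down on top of a dead one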

These aren't mainframes; you don't need an operator anymore.

Have one if you want, but you sure shouldn't need one.

------
rm-rf
Of course, re-booting servers or re-starting processes forces app servers to
re-cache data and database servers to re-parse, re-cache and re-optimize
queries; it throws away good query plans and replaces them with new versions,
some of which will be worse than what you had.

On apps that cache database objects in app server memory, we see a noticeable
degradation of response time while the app server re-caches database objects.
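
One common mitigation is to warm the caches before a restarted server takes
live traffic again; a rough sketch, where the hot paths and the load-balancer
step are assumptions:

    # Hypothetical cache-warming pass: replay a few hot requests against a
    # freshly restarted app server before putting it back in rotation.
    import urllib.request

    HOT_PATHS = ["/", "/login", "/search?q=popular"]   # assumed hot endpoints

    def warm(host):
        for path in HOT_PATHS:
            try:
                urllib.request.urlopen("http://%s%s" % (host, path),
                                       timeout=10).read()
            except Exception:
                pass   # a failed warm-up request just means a colder cache

    warm("web1.example.com")
    # ...then tell the load balancer to send web1 live traffic again.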

------
junklight
Sometimes it _is_ ok to have less-than-optimal solutions to get you past
sticky patches. Is your goal to build a perfect, beautiful engineering
solution, or to deliver a service to end users? However, you _must_ recognise
that it's a kludge and that it needs to be sorted out as soon as possible.

The opposite of this is the "server that must not be rebooted EVER" (and, if
it ever has to be, will require the whole engineering team on hand; I've seen
one or two of these in my time). Likewise, machines that are rebooted every
night soon turn into "that's the way we've always done it".

So - it can be ok to fix things like this sometimes, but watch that it doesn't
become a cargo-cult fix.

~~~
rm-rf
Well said. The key is to recognize that daily/weekly re-boots are a temporary
band-aid to hold you over until the application vendors/developers can roll
out an app that's not quite so badly broken.

------
ghshephard
Anyone who hasn't restarted their server:

* has old versions of firmware (hard disks, motherboards, BIOS, SCSI
controllers, etc...)

* has old kernels

* is in for a surprise when their data center eventually loses power. :-)

~~~
jerf
I've also observed, at multiple places I've worked, numerous problems with the
runtime configuration being different from the stored configuration across
various systems; and of course when a system goes down and comes back up, it
comes back with the stored configuration. If your infrastructure goes down and
comes up all at once (think power outage), you're in for a very bad week.
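
A cheap way to catch that drift before a power outage does it for you is to
periodically diff the stored config against what's actually running. A sketch
using sysctl as the example (the config path is an assumption):

    # Hypothetical drift check: compare kernel settings stored on disk with
    # the values the running kernel actually has.
    import subprocess

    def running_value(key):
        return subprocess.check_output(["sysctl", "-n", key]).decode().strip()

    with open("/etc/sysctl.conf") as f:        # assumed stored config
        for line in f:
            line = line.split("#")[0].strip()  # drop comments and blanks
            if "=" not in line:
                continue
            key, want = (part.strip() for part in line.split("=", 1))
            have = running_value(key)
            if have != want:
                print("DRIFT %s: stored=%r running=%r" % (key, want, have))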

------
dazzawazza
Nope, sorry. I just can't accept this. I've thought about it for 10 minutes,
trying to convince myself that this is a pragmatic approach to a difficult
problem, but I just can't see it.

It's just poor management of servers. If you have a process that routinely
leaks memory or stalls, then manage that by restarting it and rebalancing
around it; but rebooting the entire machine? This implies that you don't trust
your OS NOT to leak resources (pipes, files, sockets, whatever), and thus that
you need a new OS.

The 'excuse' that it forces you to build a scalable, redundant system is
bogus; it's folly.
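
Managing a known leaker at the process level, as suggested above, might look
something like this sketch (the service name and threshold are made up):

    # Hypothetical per-process babysitter: restart one known-leaky service
    # when its RSS crosses a threshold, instead of rebooting the machine.
    import subprocess
    import time

    SERVICE = "leakyapp"                # assumed service/process name
    RSS_LIMIT_KB = 2 * 1024 * 1024      # 2 GB; made-up threshold

    def rss_kb(pid):
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])   # value is in kB
        return 0

    while True:
        time.sleep(60)
        try:
            pid = int(subprocess.check_output(
                ["pgrep", "-o", SERVICE]).split()[0])
            if rss_kb(pid) > RSS_LIMIT_KB:
                # in real life, drain it from the load balancer first
                subprocess.run(["service", SERVICE, "restart"])
        except (subprocess.CalledProcessError, OSError):
            pass   # not running or already gone; a real tool would alert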

~~~
rm-rf
And - every time you re-boot, you lose valuable data that could be used to
determine the root cause of whatever is breaking the server OS.

I've got a mantra that I have to repeat pretty often:

"Re-booting is not a valid troubleshooting technique."

------
ojbyrne
logrotate and "apachectl graceful." at 4am. Don't push code without a restart
during the day, then you don't have to have someone awake for the restart in
the middle of the night.
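
The usual way to wire those two together is logrotate's postrotate hook; a
sketch, with the paths assumed:

    /var/log/apache2/*.log {
        daily
        rotate 14
        compress
        delaycompress
        postrotate
            /usr/sbin/apachectl graceful > /dev/null 2>&1
        endscript
    }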

~~~
bigiain
Hmmm, yeah... Until the customer is running old (2.6.9-78.something) kernels
and you end up with a few new unkillable D-state zombie apache processes
holding open deleted multi-gig logfiles every day...

Grrr... Till we sort _that_ one out, it's gonna be regular reboots...
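
Finding which processes are pinning those deleted logfiles is at least
scriptable: on Linux, a deleted-but-open file shows up as a /proc fd symlink
whose target ends in " (deleted)". A rough sketch:

    # Hypothetical scan for processes holding deleted files open (Linux).
    import os

    for pid in filter(str.isdigit, os.listdir("/proc")):
        fd_dir = "/proc/%s/fd" % pid
        try:
            for fd in os.listdir(fd_dir):
                target = os.readlink(os.path.join(fd_dir, fd))
                if target.endswith(" (deleted)"):
                    print(pid, fd, target)
        except OSError:
            pass   # process exited, or we lack permission; skip it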

------
kristianp
Nightly restarts: Implies a Windows server platform. And/or bad software.

~~~
jordyhoyt
Definitely not necessarily. Think about machines with many different apps
running on them, where pumping out new features or major bug fixes is more
important than chasing some minor issue like a memory leak (in who knows which
of your apps). A reboot of the machine fixes the leak and keeps the box from
getting into a bad state. (There could be lots of causes; a memory leak is
just what I have seen most lately with the C++ I've been working with.)

------
plaes
Windows - Reboot, Unix - Be root!

