
Hire HN: Apache is crashing every hour. Seeking Drupal/Nginx-friendly sysadmin - brandnewlow
I run a social news site called Windy Citizen in Chicago. For some time now, it has been majorly and embarrassingly screwed up. It crashes every hour, like clockwork.

I first reached out to my network of developers I've hired in the past for contract work to fix the problem. Three of them either a) couldn't figure it out or b) got bored working on it and said they didn't want to spend any more time on it.

I then asked a few gracious dev friends for advice. This got me some theories and ideas that I've spent a great deal of time trying to prove out. I have not been successful though. We keep crashing hourly.

So I turn to you, HN, to see if there's anyone on here who'd be interested in fixing this problem for our company. We're a two-man business, ad-supported and ramen-profitable, that gives people in Chicago a central place to share and talk about their favorite local stories.

This has been going on for close to two months though, and our users are getting sick of it. I'm doing a bad job taking care of them by not getting this fixed, but I am running out of options and people to ask about working on this.

I realize debugging the hosting setup on a complex Drupal site isn't the sexiest type of freelance gig to take on, but if there's interest in this, I will provide more detail in the comments and also explain why this is in fact a sexy job to take on.
======
whalesalad
Get rid of mod_php, replace it with a nice build of php-fpm (included in 5.3
actually) tied to nginx via fastcgi. Bam. Nginx will handle all static content
SUPER fast while keeping an incredibly small memory footprint, while passing
dynamic requests to the PHP side of things via fastcgi.

Example of one of my nginx configuration files, for a Wordpress site:
<http://paste.pocoo.org/show/256982/>

This will only do so much, but I suspect it will solve a great deal of your
problems. The final bit of advice that I can give is to look into any caching
modules for Drupal. Simple disk caching will do the trick, just something to
prevent lots of DB reads. I'm unfamiliar with Drupal, but I'd imagine there
are loads of modules out there that can assist with caching.

~~~
pilif
I don't know about php-fpm, but the old fastcgi implementation was extremely
buggy when we tried that with lighttpd back when there was no nginx yet.

Burned like this, I'm currently using Apache/mod_php for all dynamic content
but let nginx serve the static files (using a reverse proxy configuration.
Turn off KeepAlive in Apache!).

This works extremely well and is totally stable.

Even though php/fastcgi is now a more common configuration than it was back
then, mod_php is still more common and hence better tested.

~~~
bartman
I have had php-fpm running behind nginx in a production environment for over
a year now, and it's been very stable. The graceful restarts and statistics
(slow execution log) far exceed what other fastcgi or mod_php implementations
provide.

------
po
Here's what your problem is:

One day, your software had a resource leak and was eating up too much memory,
too many file handles, etc. One of your developers tried restarting the system
and like magic, it fixed it. After a few days of having to do this, he got
tired and simply put the restart command into an hourly cron job or a script
somewhere on the system.

I've seen it a thousand times. My consulting rate is $200 an hour. If this
turns out to be the problem, I'll expect $6.67 in the mail from you. :-)

~~~
mrtron
Very close. From what he later posts about supervisord monitoring - he is
basically hitting a server bottleneck of number of threads, which causes him
to run out of memory and get into swap space. Then everything goes to hell and
he (supervisord) has to restart.

Proof: You will have 0 memory free and start using swap when the restart
happens - type 'top' and watch this go down.

Fix: Use fewer threads. The number of requests per second you can serve is the
number of pages each thread can serve per second times the number of threads
your server can handle. So caching can be a huge improvement. Cache pages,
blocks, hell, maybe your whole site. Cache hits will serve 10-100 times more
responses per second than generating the entire page each time.

I consulted on a project that was having horrible server performance on a huge
server with relatively static data. The way they produced pages was really
poor and took 3 or 4 seconds. There is no way you can serve up a large number
of concurrent requests with the turnaround time over a second. I added caching
as a quick fix and they ended up just being happy that 99.9% of their pages
served up in 1/100th of a second.
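
A rough back-of-the-envelope version of that arithmetic, as a sketch only. The worker count is a made-up number, and the two page times simply echo the anecdote above (about 3 seconds uncached, about 1/100th of a second cached); none of this was measured on the site in question.

```python
def max_requests_per_second(workers: int, seconds_per_page: float) -> float:
    """Upper bound on throughput: each worker finishes 1/seconds_per_page pages per second."""
    if workers <= 0 or seconds_per_page <= 0:
        raise ValueError("workers and seconds_per_page must be positive")
    return workers / seconds_per_page


# Hypothetical inputs: 50 workers, slow uncached pages vs. fast cache hits.
uncached = max_requests_per_second(workers=50, seconds_per_page=3.0)
cached = max_requests_per_second(workers=50, seconds_per_page=0.01)
print(f"uncached: {uncached:.1f} req/s, cached: {cached:.0f} req/s")
```

With the same number of workers, the only thing that changes the ceiling is how long each page takes, which is why caching tends to help more than adding threads.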

~~~
kranner
Instead of watching it in 'top', I recommend installing munin:
<http://munin-monitoring.org/>

The blurb from that site:

Munin is a networked resource monitoring tool that can help analyze resource
trends and "what just happened to kill our performance?" problems. It is
designed to be very plug and play. A default installation provides a lot of
graphs with almost no work.

------
buro9
This sounds like a simple mis-configuration of Apache and the
maxrequestsperchild.

If Apache is spawning 200 processes, each of which has PHP loaded, then you
can easily run out of memory and take down both Apache and the system.

It looks like, from your hardware specs, that you've had this problem for a
while and chose to solve it by throwing hardware at the problem... gradually
increasing the RAM of the VM until you have a 4GB web server.

My bet is that this 4GB isn't needed, and that if you had got someone with
some understanding of Apache to look at it sooner, they could've configured
Apache to have fewer child processes, which would have avoided the heavy
memory requirements.

Go check serverfault.com, possibly hire someone to look at the problem for you
if you don't understand what you're doing here. You will need to provide real
information such as your apache2.conf and php.ini files, as well as info on
whether you're using things like XCache, Memcache (if so, where is this
installed?), Varnish, etc.

For your traffic and hardware, the numbers look low. You're most likely just
running a badly configured apache.
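
To make the memory argument concrete, here is a minimal sketch of the usual sizing arithmetic. The per-child size and the amount of RAM reserved for the OS and nginx are assumptions picked for illustration; the real figures would have to be read off the server (for example from the resident size of the Apache children in `top`).

```python
def estimate_max_children(total_ram_mb: int, reserved_mb: int, per_child_mb: int) -> int:
    """How many Apache children fit in RAM before the box starts swapping."""
    if per_child_mb <= 0:
        raise ValueError("per_child_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    return max(usable_mb // per_child_mb, 0)


# Hypothetical inputs: 4 GB slice, 512 MB kept back, 60 MB per child with PHP loaded.
limit = estimate_max_children(total_ram_mb=4096, reserved_mb=512, per_child_mb=60)
print(f"children that fit: {limit}")
print(f"RAM needed for 200 children: {200 * 60} MB")
```

If the configured ceiling is well above the number that fits, the server will swap long before Apache stops accepting work.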

~~~
matthewphiong
I second this. I had a similar problem before, and yes I mis-configured my
apache. In my case it was random crashes instead of hourly.

I would suggest posting your problem to your host's community forum with the
related info (logs, conf files, server setup info, etc). In my case, that was
the Linode forum. I'm sure the community will be more than happy to assist you.

Good luck!

------
modeless
To solve this specific problem, perhaps try asking at Server Fault:
<http://serverfault.com/>

------
ritonlajoie
I'm a little embarrassed to see so many guys telling brandnewlow to change his
configuration in such a situation. If I understand correctly, he is in an
urgent situation and you are basically telling him to take the risk of moving
to a completely new setup? (Removing Apache/mod_php and putting in nginx and
stuff.)

Seriously guys... brandnewlow, please don't follow that advice. Maybe down the
line you can think about changing your setup. But right now, listen to those
who are trying to fix your _current_ setup!

------
dotcoma
I think a posting like this should go under a special section called "other
jobs", be charged $10, and of course be subject to up/downvoting just like
every other posting.

------
gojomo
Good news! Crashing hourly is better than a more erratic schedule; it means
you can observe the crash nearly on demand.

I would disable the auto-restart for a couple cycles to get a closer look at
the unresponsive condition: total threads, any swapping, any busy threads. It
means a few more minutes downtime but more information for you.

It sounds like some connections are hanging indefinitely, eventually using up
'all' of some capped resource (threads, RAM, etc.).

Make sure you have a 'Timeout' setting in the applicable apache conf, and make
it really small (10-20 seconds) to see if that helps. (You may get complaints
that other long-running requests that used to finish now fail, but at least
service will remain available for usual requests, and you can adjust the value
back up later.)

Check the end of the error_log just after a crash/freeze for hints. Consider
adjusting the conf value MaxClients up, as long as there was no swapping
evident. Consider adjusting the conf value MaxRequestsPerChild down, but
nonzero, so that children are recycled sooner before they grow problematic
from memory leaks.
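
A small sketch of how one might take a first look at those three directives. It only scans plain text: it ignores `Include` files and `<IfModule>` scoping, so it can miss or misreport values, and the sample config is a made-up excerpt rather than the site's real file.

```python
from typing import Dict, Iterable, List

WATCHED = ("Timeout", "MaxClients", "MaxRequestsPerChild")


def read_directives(conf_text: str, names: Iterable[str] = WATCHED) -> Dict[str, str]:
    """Return the last value seen for each watched directive (names are case-insensitive)."""
    wanted = {name.lower(): name for name in names}
    found: Dict[str, str] = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0].lower() in wanted:
            found[wanted[parts[0].lower()]] = parts[1].strip()
    return found


def review(directives: Dict[str, str], max_timeout: int = 20) -> List[str]:
    """Flag the settings discussed above; max_timeout is an arbitrary debugging threshold."""
    notes = []
    if "Timeout" not in directives:
        notes.append("Timeout is not set explicitly")
    elif int(directives["Timeout"]) > max_timeout:
        notes.append(f"Timeout is {directives['Timeout']}s; try {max_timeout}s or less while debugging")
    if directives.get("MaxRequestsPerChild") == "0":
        notes.append("MaxRequestsPerChild is 0, so children are never recycled")
    return notes


SAMPLE = """
# hypothetical excerpt, not the site's real config
Timeout 300
MaxClients 150
MaxRequestsPerChild 0
"""
for note in review(read_directives(SAMPLE)):
    print(note)
```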

------
lazyant
(raises hand) my turn! (I fix these problems and other stuff for a living).

There are short answers and long answers to this problem.

The short answers try to quickly fix whatever the most likely cause of the
problem is. If a server is unresponsive, it is most likely because the server
has run out of memory or is incurring heavy disk I/O. For the memory problem,
find which processes are eating up the memory. For apache, lowering
MaxClients/MaxServers/KeepAlive helps. For I/O, you'll have to look at the
database (see the error log, log slow queries, etc).

But that's speculation.

The proper way to fix the issue is to understand what's really causing it. A
way to start is getting meaningful server utilization data and reviewing logs.

Spend a few minutes installing a graphical monitoring tool like munin or
google's quicklook (I like these because they are lightweight, there's a ton
of them like cacti etc). In a couple of hours you'll get a better picture of
what's going on.

Review the logs, especially error logs.

I use a little script that runs every 15 mins and logs the most important
stuff that is going on in the server, you can run something similar every 5
mins or whatever to find what's really going on:
<http://pastebin.com/Qv0J3WHY>
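
For illustration, here is a minimal sketch of that kind of periodic snapshot. It is not the script from the pastebin link; the function names, the log path, and the choice of fields are all assumptions. It reads the Linux `/proc/meminfo` and `/proc/loadavg` files and appends one line per run.

```python
from datetime import datetime
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Turn /proc/meminfo text into {field: value in kB}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def snapshot_line(meminfo_text: str, loadavg_text: str, now: datetime) -> str:
    """Format one log line: timestamp, 1-minute load, free memory, swap in use."""
    mem = parse_meminfo(meminfo_text)
    swap_used = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    load1 = loadavg_text.split()[0]
    stamp = now.isoformat(timespec="seconds")
    return f"{stamp} load1={load1} mem_free_kb={mem.get('MemFree', 0)} swap_used_kb={swap_used}"


def log_snapshot(log_path: str = "/var/log/server-snapshot.log") -> None:
    """Append one snapshot; log_path is a hypothetical location."""
    with open("/proc/meminfo") as mem_file, open("/proc/loadavg") as load_file:
        line = snapshot_line(mem_file.read(), load_file.read(), datetime.now())
    with open(log_path, "a") as log_file:
        log_file.write(line + "\n")
```

Calling `log_snapshot()` from cron every few minutes would leave a trail showing whether free memory drops and swap climbs in the run-up to each freeze.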

Another tip is to set up a mirror test/stage server and benchmark the heck out
of it; you don't want to make a lot of changes live in a production server.

I guess you can also ask the support people at Slicehost.

I'll gladly take a look at your server for up to 30 mins at no cost.

~~~
nl
Good advice, but one small nitpick: Quicklook isn't _Google's_ - it's just
hosted at code.google.com.

------
chopsueyar
If there is anything Viaweb has taught me, it's that nothing beats static HTML
pages for performance.

Can you publish static html files and simply have an ajax component for the
commenting and any other dynamic stuff handled by node.js?

Just throwing that idea out there. Criticisms?

Yeah, I know he is just looking to get his existing setup stable, and he
shouldn't change it.

------
Jlambert
Um... I own one of the Drupal shops here in the US. Drop me an email:
j@workhabit.com... I'm sure I can help you figure it out. Sounds like a config
problem.

------
christefano
Is Drupal's cron.php being hit every hour? If yes, my guess is that it's
related to an errant Drupal module. You can try isolating this by grepping for
hook_cron in your contrib folder. When you have a list of likely culprits, you
can check their corresponding issue queues on Drupal.org or just dig into the
code.

Either way, it's likely a bad Apache configuration. Try posting in the
Drupal.org support forums. You're likely to get the help you need without
being charged for it. The Drupal community likes to help its own, so I suggest
posting in the Chicago group at groups.drupal.org/chicago, too.
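
The grep step could look something like the sketch below. It relies on the Drupal convention that a module named `foo` implements `hook_cron` as a PHP function called `foo_cron`; the directory passed in (for example `sites/all/modules`) is an assumption about where the contrib modules live on this particular install.

```python
import os
import re
from typing import List, Tuple

# Matches PHP declarations such as "function mymodule_cron(" at the start of a line.
CRON_HOOK = re.compile(r"^\s*function\s+(\w+)_cron\s*\(", re.MULTILINE)


def find_cron_hooks(root: str) -> List[Tuple[str, str]]:
    """Return (file path, function name) for every hook_cron implementation under root."""
    hits: List[Tuple[str, str]] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".module", ".inc")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for match in CRON_HOOK.finditer(handle.read()):
                    hits.append((path, match.group(1) + "_cron"))
    return sorted(hits)
```

Each module in the resulting list is a candidate to disable temporarily, or to look up in its issue queue.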

------
spooneybarger
I'm not a drupal person. Have very little experience with it, but I take a
perverse joy in tackling problems like this. Go figure.

I assume from the description that apache is running mod_php and that is how
drupal comes into the apache crashing part? how exactly is it crashing? can
you give a few details about any errors in the logs, a more detailed
description of what a 'crash' is etc?

~~~
brandnewlow
The last time I looked under the hood, I observed this behavior.

1. After a restart of apache: Apache starts out running. He spins up a few
threads, 10 or so and then kills a few as he goes on his merry way.

2. Eventually, the number of threads starts rising. 20 threads. 40 threads.
60 threads. 90 threads. 130 threads.

3. 45-55 minutes after the previous restart, apache is now frozen.
SupervisorD, which has been checking every 5 seconds to make sure apache is
ok, spots this, and restarts apache. The site shows a fail-whale image for
5-10 seconds as SupervisorD does its thing. Then apache comes back online and
we repeat the cycle.

4. About every 3 days, the semaphores pile up, and SupervisorD is unable to
restart apache. Apache goes into FATAL mode and I have to run a funky command
line command a friend showed me to clear out the semaphores so Apache can be
restarted and taken out of its tailspin.

~~~
spooneybarger
Is there a reason you are using apache in a threaded mode rather than
pre-fork? Basically... is threaded a requirement or a choice? If a choice,
have you tried running in pre-fork and if yes, do you get the same issues?

~~~
brandnewlow
I did not set up our hosting environment, a developer friend of mine did, so I
can't answer that first question. I don't know if Drupal requires threaded or
not or how to try running it in pre-fork mode.

~~~
brandnewlow
My hosting environment:

The site is hosted at Slicehost, on two slices.

Webserver: 4GB slice. DB server: 2GB slice.

We do about 3-5k uniques/day.

I've tried scaling the DB server up to 4GB to see if that solved the problem,
but we continued crashing every hour.

Part of the problem though with my own efforts is that I don't have much
experience with sysadmin stuff to be able to say for sure that any of my own
observations and attempts to solve were done properly. For instance, I scaled
up the slice, but maybe I didn't restart Apache in the right way afterward to
know if that made a difference or not.

~~~
spooneybarger
An HN thread isn't the best way to ask all these questions. I sent you an
email at the contact address listed in your profile.

------
forkqueue
I run a company that provides sysadmin consultancy, and we deal with problems
like this on a pretty regular basis. We're not on your continent, but would
still be happy to help.

<http://bashton.com/>

I very much doubt it would be possible to resolve the problem without actually
logging onto the server and looking at what's happening.

------
dpcan
Check to see if the apache error log file is exceeding 2GB.

------
gaius
By "clockwork" you mean, every hour, on the hour?

That points to something running out of cron...

~~~
dpcan
Like log file rotations of some kind.

~~~
brandnewlow
Had 16 Apache freezes today, each about 80-90 minutes apart.

They happen late at night when we get hardly any visitors and during the day
when we get lots. Due to nginx though, I don't think most of our visitors are
being passed through to apache.

~~~
gaius
That's not what you said in your problem specification!

The key to solving any issue is to precisely and unambiguously define exactly
what is (or is not) happening and how this deviates from the norm.
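
One way to pin the pattern down is to take the freeze times from whatever log records the restarts and check them directly. The sketch below uses made-up timestamps spaced 85 minutes apart; the function names and the two-minute tolerance are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Sequence


def gaps_in_minutes(times: Sequence[datetime]) -> List[float]:
    """Minutes between consecutive events, after sorting them."""
    ordered = sorted(times)
    return [(later - earlier).total_seconds() / 60 for earlier, later in zip(ordered, ordered[1:])]


def same_minute_each_hour(times: Sequence[datetime], tolerance_minutes: float = 2.0) -> bool:
    """True if every event lands at roughly the same minute past the hour (cron-like)."""
    if len(times) < 2:
        return False
    marks = [t.minute + t.second / 60 for t in times]
    reference = marks[0]
    # Compare on a 60-minute circle so that :59 and :01 count as close.
    return all(abs((mark - reference + 30) % 60 - 30) <= tolerance_minutes for mark in marks)


# Hypothetical freeze times, 85 minutes apart.
start = datetime(2010, 8, 30, 0, 5)
freezes = [start + timedelta(minutes=85 * i) for i in range(4)]
print(gaps_in_minutes(freezes))
print(same_minute_each_hour(freezes))
```

Freezes that drift around the clock like this point away from an hourly cron job and toward something that builds up after each restart.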

~~~
brandnewlow
Very fair. Duly noted.

------
mkramlich
I wish you the best of luck in solving this problem but I also want to say I
really don't want this kind of post to become a trend on HN. Really bad venue
for it, so many better places. We don't want this to turn into the software
developer's equivalent of, "Could you guys do my homework for me?" like so
many other sites and forums have become.

~~~
paulbaumgart
He's a contributing member of the community and clearly has put a lot of
effort into solving the problem himself. I think as long as you've reached
some adequately high threshold of desperation, a post like this is perfectly
fine.

~~~
SkyMarshal
Agreed. He's also clearly not asking for a free solution to his homework, but
to hire someone to fix the problem as freelance contract, via his last
paragraph.

------
c00p3r
The obvious idea (for admins) is to convert the setup to nginx + php 5.3.3 in
fpm-mode + eaccelerator. (The fpm patches are already part of the php
distribution as of 5.3.3.)

There are millions of step-by-step tutorials for these keywords.

If you have a test installation, you can do it yourself there and then change
the settings on your production server. If not, you should set up another copy
on the same server, with nginx listening on a different port and an empty
drupal instance, and test it. Then you can stop apache and start nginx on the
correct port with the proper drupal instance.

~~~
samuel
I wouldn't recommend blindly changing his setup. This is the sysadmin
equivalent of changing lines of code almost randomly to fix a bug.

When there is a problem the very first step is to find out what's happening.
Apache is rock solid software so there must be something in his setup or
software that isn't right.

My line of action would be:

    
    
      - check the logs (Apache and system ones), looking for errors or limits reached.
      - if no clue, try to reproduce it: same OS, same software versions. Make scripts to simulate the load.
      - strace the processes when they lock up (dtrace 'em on Solaris).
      - get a core dump (kill -SIGABRT), compile Apache and its modules with symbols (-g CFLAG), and gdb it.
    

I'm not a sysadmin anymore, but that's what I would have done when I was one.

~~~
c00p3r
Good luck debugging mpm_worker with all the loaded modules and their
dependencies. btw, who will pay for hours and hours of such 'practicing'?

Has it never crossed your mind that the solution I recommended is a
battle-tested one?

~~~
samuel
That's the last resort: you almost never need to get that far
(stracing/trussing usually does the trick), and it doesn't have to take
"hours and hours" if you have done it before, especially if it's a locking
issue. I don't remember if I have had to gdb apache (I think I have), but
sendmail + milters for sure.

I don't doubt that your solution has been "battle-tested", but I assure you
that apache + mod_php has been too. Now if he has a different problem with
that new setup, what will you recommend then? Changing again to cherokee or
lighttpd? Or do you really think that nginx + fpm is a silver bullet?

I accept that spending some time taking cheap shots at random (raising OS
limits, changing the threading model, upgrading to a new version...) may be
worth it, though. But completely changing your middleware isn't a cheap shot
in my book.

~~~
c00p3r
Re-read the OP - the problem is with threads. Switching apache to the prefork
mpm is a good solution, but replacing apache with a simpler and more efficient
setup is a much better one.

nginx + fpm isn't a silver bullet, but it is a much more efficient and
flexible solution. And avoiding threads with such a terrible mess as PHP is
also a good solution, which will save you a lot of time and effort.

Just think about how many things will be dynamically loaded, and how much of
this code is really thread-safe and properly tested.

btw, I bet it's a mere SIGSEGV.

------
oomkiller
You need to trash Apache/mod_php. The little PHP I have done has been with
Nginx (which it seems you are already using), and FastCGI. It may not be as
automagic as mod_php, but since I'm a Rails developer/deployer, I'm used to
it. Get a couple of FCGI processes running, set up an nginx upstream to talk
to them, and set nginx to direct the traffic to them. Also, you can use
something like God or Bluepill (I use the latter) to monitor these processes
and reboot them if they lock up. It seems monit would also work well for you.
The key thing here is the monitoring, as you can restart processes if they
lock, and nginx will just skip over that port and use another process that is
not locked/restarting. Finally, if you don't understand any of this, my info
is on my profile, and we might be able to help you with Nginx or migrate your
site to Rails ;)

