
Too many open files: Tracking down a bug in production - pigs
http://techblog.roomkey.com/posts/too-many-files.html
======
lkrubner
For those who are interested, here is some background reading on Roomkey:

60,000% growth in 7 months using Clojure and AWS

[http://www.colinsteele.org/post/27929539434/60000-growth-in-...](http://www.colinsteele.org/post/27929539434/60000-growth-in-7-months-using-clojure-and-aws)

Against the Grain: How We Built the Next Generation Online Travel Agency using
Amazon, Clojure, and a Comically Small Team

[http://www.colinsteele.org/post/23103789647/against-the-grai...](http://www.colinsteele.org/post/23103789647/against-the-grain-aws-clojure-startup)

------
potatosareok2
I wonder why they approached the problem the way they did -- that is to
say, monitoring the effect of the problem but not the cause (the open fds
were caused by stuck threads). Why not monitor the number of JVM threads?

I'm somewhat familiar with a tool we use at work for monitoring the JVM, Wily
Introscope, but I'm sure there are other options available (the New Relic Java
agent? ps can report nlwp on Linux? is prstat available on Linux? other JMX
solutions?). Again, I'm not as familiar with Tomcat as with WebSphere/WebLogic,
but I see Tomcat has this option
[https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#S...](https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#Stuck_Thread_Detection_Valve)
which I think you can use in conjunction with other monitoring to get alerts
for situations like this (your app threads getting stuck). I'm not sure whether
Tomcat will write a warning log message or actually mark the thread as "STUCK"
in a thread dump.

The author did mention Java Mission Control but I wanted to point out that's
not the only option for JVM monitoring.
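
For what it's worth, both numbers discussed here (JVM thread count and open
fds) can be read from inside the process with the standard platform MXBeans,
without any external agent. A minimal sketch, not what the article's author
did: the class and method names are made up for illustration, and the fd
counters assume a HotSpot-style JVM on a Unix-like OS, where the OS bean
implements com.sun.management.UnixOperatingSystemMXBean. Elsewhere the
methods return -1.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: samples thread and file-descriptor counts for this JVM.
public class JvmResourceSnapshot {

    /** Live threads (daemon and non-daemon) in this JVM. */
    public static int liveThreadCount() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        return threads.getThreadCount();
    }

    /** Open file descriptors for this process, or -1 if the Unix bean is unavailable. */
    public static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1L;
    }

    /** Descriptor limit for this process (the ulimit), or -1 if unavailable. */
    public static long maxFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getMaxFileDescriptorCount();
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.printf("threads=%d open_fds=%d max_fds=%d%n",
                liveThreadCount(), openFileDescriptors(), maxFileDescriptors());
    }
}
```

Logging open_fds against max_fds on a timer would show a slow leak well
before the limit is hit, which a thread-count alarm alone might not.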

~~~
pigs
There were only two threads stuck, so monitoring the threads wouldn't have set
off any alarms. One was a Clojure agent thread, which basically allowed all
the clients marked for retirement to queue up.

I used both YourKit and Java Mission Control for monitoring the JVM. Both
have some deadlock-detection functionality, but neither flagged these two
threads. YourKit identified these threads as waiting, not blocked, but I'm not
sure how it makes that distinction.
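
On the waiting-vs-blocked question: the JVM itself draws that line in
java.lang.Thread.State. BLOCKED means the thread is trying to enter a
synchronized monitor someone else holds; WAITING means it is parked in
Object.wait, Thread.join or LockSupport.park (which is what the
java.util.concurrent classes use). I'm assuming, not asserting, that the
profilers report the same states. A rough sketch with made-up class and
thread names that puts one thread in each state and then asks the standard
ThreadMXBean deadlock check, which only reports cycles of threads waiting on
each other:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

// Hypothetical demo: one thread BLOCKED on a monitor, one WAITING on a latch.
public class WaitingVsBlocked {

    public static final class Result {
        public final Thread.State blocked;
        public final Thread.State waiting;
        public final boolean deadlockReported;
        Result(Thread.State blocked, Thread.State waiting, boolean deadlockReported) {
            this.blocked = blocked;
            this.waiting = waiting;
            this.deadlockReported = deadlockReported;
        }
    }

    // Polls for up to ~5 seconds until the thread reaches the expected state.
    private static Thread.State awaitState(Thread t, Thread.State expected) {
        long deadline = System.nanoTime() + 5_000_000_000L;
        while (t.getState() != expected && System.nanoTime() < deadline) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return t.getState();
    }

    public static Result demo() {
        final Object monitor = new Object();
        final CountDownLatch latch = new CountDownLatch(1);

        // Tries to enter a monitor the caller holds: shows up as BLOCKED.
        Thread blocked = new Thread(() -> { synchronized (monitor) { } }, "blocked-demo");
        // Parks on a latch nobody has counted down yet: shows up as WAITING.
        Thread waiting = new Thread(() -> {
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "waiting-demo");
        blocked.setDaemon(true);
        waiting.setDaemon(true);

        Thread.State b;
        Thread.State w;
        boolean deadlock;
        synchronized (monitor) {
            blocked.start();
            waiting.start();
            b = awaitState(blocked, Thread.State.BLOCKED);
            w = awaitState(waiting, Thread.State.WAITING);
            // Returns null when no cycle of mutually waiting threads exists.
            deadlock = ManagementFactory.getThreadMXBean().findDeadlockedThreads() != null;
        }
        latch.countDown(); // release the waiting thread so the demo cleans up
        return new Result(b, w, deadlock);
    }

    public static void main(String[] args) {
        Result r = demo();
        System.out.println(r.blocked + " " + r.waiting + " deadlock=" + r.deadlockReported);
    }
}
```

Neither thread is part of a lock cycle, so the check stays quiet even though
both are stuck for as long as nothing releases them. That would be consistent
with a tool showing "waiting" threads without raising a deadlock warning.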

------
anjanb
"You can use the free Java Mission Control instead to do this investigation,
but I do not have our production JVMs configured to accept remote connections"
The free JMC is only for development use, not for production, as far as I
know -- not sure if Oracle has a way to determine what is dev and what is
production.

------
sriram_iyengar
Excellent! An eye-opener!

The last time I faced this problem, on a million-request site, I just raised
the ulimit.

~~~
pigs
Yes, significantly upping the ulimit would probably have been sufficient for
all practical purposes, although we might still run out of JVM heap space if
we stay up long enough. Plus, it really bothered me that I didn't know what
was going on.

