I wonder why they approached the problem the way they did, that is, monitoring the effect of the problem (the number of open fds) rather than the cause (stuck threads). Why not monitor the number of JVM threads?
I'm somewhat familiar with a tool we use at work for monitoring JVMs, Wily Introscope, but I'm sure there are other options available (New Relic Java agent? ps can report nlwp on Linux? is prstat available on Linux? other JMX solutions?). Again, I'm not as familiar with Tomcat as with WebSphere/WebLogic, but I see Tomcat has this option https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#S... which I think you can use in conjunction with other monitoring to get alerts for situations like this (your app threads getting stuck). I'm not sure whether Tomcat writes a warning log message or actually marks the thread as "STUCK" in a thread dump.
The author did mention Java Mission Control but I wanted to point out that's not the only option for JVM monitoring.
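For what it's worth, the "other JMX solutions" idea can be sketched with nothing but the JDK's own management beans. This is a minimal illustration, not what the author ran: the class and method names (`JvmCauseAndEffect`, `liveThreads`, `openFds`) are made up for the example, and the fd count relies on `com.sun.management.UnixOperatingSystemMXBean`, which I'm assuming is available (HotSpot-style JVMs on Unix-like systems); elsewhere the sketch just returns -1.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class JvmCauseAndEffect {
    /** Live thread count (daemon and non-daemon) as reported by the JVM. */
    public static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    /** Open file descriptor count, or -1 if the platform bean does not expose it. */
    public static long openFds() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Assumption: a Unix-like JVM that implements the com.sun.management extension.
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            return ((com.sun.management.UnixOperatingSystemMXBean) os)
                    .getOpenFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) {
        // Print both the suspected cause (threads) and the effect (fds) side by side.
        System.out.println("threads=" + liveThreads() + " openFds=" + openFds());
    }
}
```

Polling both numbers from the same place at least lets you see whether they move together, whichever one you end up alerting on.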
There were only two threads stuck, so monitoring the threads wouldn't have set off any alarms. One was a Clojure agent thread, which basically allowed all the clients marked for retirement to queue up.
I used both YourKit and Java Mission Control for monitoring the JVM. Both have some deadlock detection functionality, but neither flagged these two threads. YourKit identified these threads as waiting, not blocked, but I'm not sure how it makes that distinction.
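On the waiting-versus-blocked question, my guess is that profilers surface the JVM's own `Thread.State` values, though I can't confirm that is what YourKit does. In the JVM's terms, BLOCKED means a thread is trying to enter a `synchronized` monitor that another thread holds, while WAITING means it is parked in something like `Object.wait()`, `Thread.join()`, or `LockSupport.park()`. A small sketch (the class name `WaitingVsBlocked` and its helpers are hypothetical) shows both states, and also that `ThreadMXBean.findDeadlockedThreads()` reports nothing when threads are stuck without forming a lock cycle:

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.CountDownLatch;

public class WaitingVsBlocked {
    /** Polls until the thread reaches the wanted state or roughly 5 seconds pass. */
    private static void awaitState(Thread t, Thread.State wanted) {
        for (int i = 0; i < 500 && t.getState() != wanted; i++) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    /** Starts one parked thread and one monitor-blocked thread and reports what the JVM sees. */
    public static String observe() {
        final Object lock = new Object();
        final CountDownLatch never = new CountDownLatch(1); // never counted down

        Thread waiter = new Thread(() -> {
            try { never.await(); } catch (InterruptedException e) { /* exit quietly */ }
        }, "waiter");
        Thread blocked = new Thread(() -> {
            synchronized (lock) { /* entered only after the caller releases the lock */ }
        }, "blocked");
        waiter.setDaemon(true);
        blocked.setDaemon(true);

        String summary;
        synchronized (lock) { // hold the monitor so "blocked" cannot enter
            waiter.start();
            blocked.start();
            awaitState(waiter, Thread.State.WAITING);
            awaitState(blocked, Thread.State.BLOCKED);
            // Deadlock detection looks for cycles of lock owners; there is none here.
            boolean deadlocked =
                    ManagementFactory.getThreadMXBean().findDeadlockedThreads() != null;
            summary = "waiter=" + waiter.getState()
                    + " blocked=" + blocked.getState()
                    + " deadlocked=" + deadlocked;
        }
        waiter.interrupt(); // let the parked thread finish
        return summary;
    }

    public static void main(String[] args) {
        System.out.println(observe());
    }
}
```

That would be consistent with what was described above: a thread waiting forever on something that never arrives is stuck, but it is not part of a deadlock cycle, so a deadlock detector has nothing to flag.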
"You can use the free Java Mission Control instead to do this investigation, but I do not have our production JVMs configured to accept remote connections"
As far as I know, the free JMC is only for development use, not production, though I'm not sure whether Oracle has a way to determine what is dev and what is production.
Yes, significantly upping the ulimit would probably have been sufficient for all practical purposes, although we might still run out of JVM heap space if we stay up long enough. Plus, it really bothered me that I didn't know what was going on.
60,000% growth in 7 months using Clojure and AWS
http://www.colinsteele.org/post/27929539434/60000-growth-in-...
Against the Grain: How We Built the Next Generation Online Travel Agency using Amazon, Clojure, and a Comically Small Team
http://www.colinsteele.org/post/23103789647/against-the-grai...