Too many open files: Tracking down a bug in production (roomkey.com)
65 points by pigs on Dec 17, 2015 | 7 comments



For those who are interested, here is some background reading on Roomkey:

60,000% growth in 7 months using Clojure and AWS

http://www.colinsteele.org/post/27929539434/60000-growth-in-...

Against the Grain: How We Built the Next Generation Online Travel Agency using Amazon, Clojure, and a Comically Small Team

http://www.colinsteele.org/post/23103789647/against-the-grai...


I wonder why they approached the problem the way they did, that is, monitoring the effect of the problem (the number of open fds) rather than the cause (the stuck threads). Why not monitor the number of JVM threads?

I'm somewhat familiar with a tool we use at work for monitoring the JVM, Wily Introscope, but I'm sure there are other options available (the New Relic Java agent? ps can report nlwp on Linux? is prstat available on Linux? other JMX solutions?). Again, I'm not as familiar with Tomcat as I am with WebSphere/WebLogic, but I see Tomcat has this option, https://tomcat.apache.org/tomcat-7.0-doc/config/valve.html#S... , which I think you can use in conjunction with other monitoring to get alerts for situations like this (your app threads getting stuck). I'm not sure whether Tomcat writes a warning log message or actually marks the thread as "STUCK" in a thread dump.
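
For the "ps can report nlwp" idea, here is a minimal sketch of reading the same per-process thread count on Linux from /proc/<pid>/status. It assumes a Linux host with /proc mounted; the function names are made up for illustration, and in practice you would pass the JVM's pid rather than the script's own.

```python
import os
import re


def parse_thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)\s*$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' field found")
    return int(match.group(1))


def thread_count(pid):
    """Read the kernel's thread count for a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_thread_count(f.read())


if __name__ == "__main__":
    # Hypothetical usage: inspects this script's own process.
    # Substitute the pid of the JVM you want to watch.
    print(thread_count(os.getpid()))
```

A raw count like this only helps if stuck threads pile up; it says nothing about what any individual thread is doing.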

The author did mention Java Mission Control but I wanted to point out that's not the only option for JVM monitoring.


There were only two threads stuck, so monitoring the threads wouldn't have set off any alarms. One was a Clojure agent thread, which basically allowed all the clients marked for retirement to queue up.

I used both YourKit and Java Mission Control for monitoring the JVM. They both have some deadlock detection functionality, but neither flagged these two threads. YourKit identified these threads as waiting, not blocked, but I'm not sure how it makes that distinction.
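
For reference, the JVM itself distinguishes the two states: java.lang.Thread.State reports BLOCKED for a thread waiting to enter a monitor, and WAITING for a thread parked in Object.wait(), Thread.join(), or LockSupport.park(). Whether YourKit uses exactly these states is an assumption here. A rough sketch of tallying the states from a plain-text thread dump (for example jstack output) follows; the function name is made up, and it assumes the usual "java.lang.Thread.State: ..." lines are present in the dump.

```python
import re
import sys
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)".
STATE_LINE = re.compile(r"java\.lang\.Thread\.State:\s+([A-Z_]+)")


def count_thread_states(dump_text):
    """Tally thread states found in a textual JVM thread dump."""
    return Counter(STATE_LINE.findall(dump_text))


if __name__ == "__main__":
    # Hypothetical usage: jstack <pid> | python thread_states.py
    for state, count in count_thread_states(sys.stdin.read()).most_common():
        print(f"{state}: {count}")
```

With only two stuck threads, the totals would look unremarkable, so a tally like this is more useful for comparing dumps over time than for alerting.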


Jolokia is a nice solution which integrates well with Nagios & clones


"You can use the free Java Mission Control instead to do this investigation, but I do not have our production JVMs configured to accept remote connections." As far as I know, the free JMC is only for development use, not production use. I'm not sure whether Oracle has a way to determine what is dev and what is production.


Excellent! Eye opener!

The last time I faced this problem, on a million-request site, I just fixed the ulimit.


Yes, significantly upping the ulimit would probably have been sufficient for all practical purposes, although we might run out of JVM heap space if we stay up long enough. Plus, it really bothered me that I didn't know what was going on.
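
Whatever the limit is set to, a slow leak can be watched from outside the process. The following is a minimal sketch, assuming Linux with /proc available and permission to read the target process's entries; the function names and the 0.8 threshold are arbitrary choices for illustration, not anything from the original post.

```python
import os
import re


def parse_soft_nofile(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None when the limit is reported as 'unlimited'.
    """
    match = re.search(r"^Max open files\s+(\S+)\s+\S+", limits_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Max open files' row found")
    soft = match.group(1)
    return None if soft == "unlimited" else int(soft)


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the soft limit in use; 0.0 if there is no usable limit."""
    if soft_limit is None or soft_limit <= 0:
        return 0.0
    return open_fds / soft_limit


def check_pid(pid, threshold=0.8):
    """Return (open_fds, soft_limit, over_threshold) for a process."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as f:
        soft_limit = parse_soft_nofile(f.read())
    return open_fds, soft_limit, fd_usage_ratio(open_fds, soft_limit) >= threshold


if __name__ == "__main__":
    # Hypothetical usage: checks this script's own process.
    print(check_pid(os.getpid()))
```

Alerting on the ratio rather than an absolute count keeps the check meaningful after the ulimit is raised.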



