If your Clojure pods are getting OOMKilled, you have a misconfigured JVM. The code (e.g. whether it uses eval or not) mostly doesn't matter.
If you have an actual memory leak in a JVM app, what you want to see is a java.lang.OutOfMemoryError. That means the heap is full and has no space for new objects even after a GC run.
OOMKilled means the JVM tried to allocate memory from the OS, but the OS (or the cgroup) had none left, so the kernel immediately killed the process. The problem is that at that moment the JVM thinks it _should be able to allocate memory_ - i.e. it's not trying to garbage collect old objects, it's just calling malloc for some unrelated reason. It never gets a chance to say "man, I should clear up some space because I'm running out". The JVM doesn't treat the cgroup memory limit as a hard budget for its total footprint.
So how do you convince the JVM that it really shouldn't be using that much memory? It's...complicated. The big answer is -Xmx, but there are a ton of other flags that matter (-Xss, -XX:MaxMetaspaceSize, etc.). Folks think that -XX:+UseContainerSupport fixes this whole thing, but it doesn't; there's no magic bullet. See https://ihor-mutel.medium.com/tracking-jvm-memory-issues-on-... for a good discussion.
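To make that concrete, here's the general shape of a container-friendly flag set. This is a sketch only - the values are placeholders, and the right numbers depend on the workload and on how much headroom the pod limit needs for non-heap memory:

    # Sketch: cap each major JVM memory pool so their sum stays below the pod limit.
    #   -Xmx (or -XX:MaxRAMPercentage)  heap
    #   -Xss                            per-thread stack size
    #   -XX:MaxMetaspaceSize            class metadata (grows with dynamically generated classes)
    #   -XX:MaxDirectMemorySize         direct byte buffers
    java -Xmx1g -Xss512k \
         -XX:MaxMetaspaceSize=256m -XX:MaxDirectMemorySize=128m \
         -jar app.jar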
> It never gets a chance to say "man I should clear up some space cause I'm running out".
To add to everything you said: depending on the framework you are using, sometimes you don't even want it to do that. The JVM will try increasingly desperate measures - looped GC scans, reference processing, sleeps with backoffs. With a huge heap, that can easily take hundreds to thousands of milliseconds.
At scale, it's often better to just kill the JVM right away if the heap fills up. That way your open connections don't have all that extra latency added before the clients figure out something went wrong. Even if the JVM could recover this time, usually it will keep limping along and repeating this cycle. Obviously monitor, collect data, and determine the root cause immediately when that happens.
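As a concrete example of that strategy (standard HotSpot flags, not something from the article):

    # Exit immediately on the first OutOfMemoryError and let the orchestrator
    # restart the pod, instead of limping through repeated GC storms.
    -XX:+ExitOnOutOfMemoryError
    # Alternative: also produce a fatal error log / core dump for debugging.
    -XX:+CrashOnOutOfMemoryError
    # Optionally capture a heap dump first, for the root-cause analysis step.
    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps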
Of course you’re right, and you really want to avoid getting into GC thrashing. IMO people still miss the old -XX:+UseGCOverheadLimit on the new GCs.
That said, trying to enforce overhead limits with RSS limits also won’t end well. Java doesn’t make guarantees around maximum allocated-but-unused heap space. You need something like this: https://github.com/bazelbuild/bazel/blob/10060cd638027975480... - but I have rarely seen something like that in production.
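For reference, a minimal sketch of that idea in Clojure - my own, only loosely inspired by the Bazel code linked above - is a watchdog that samples cumulative GC time from the GC MXBeans and exits when too much of a window went to GC:

    (import '[java.lang.management ManagementFactory GarbageCollectorMXBean])

    (defn- total-gc-millis []
      ;; cumulative GC time across all collectors; getCollectionTime may return -1
      (reduce + (map #(max 0 (.getCollectionTime ^GarbageCollectorMXBean %))
                     (ManagementFactory/getGarbageCollectorMXBeans))))

    (defn start-gc-overhead-watchdog!
      "Exit the JVM if more than max-fraction of a window-ms slice of wall time
      was spent in GC. Thresholds here are illustrative, not recommendations."
      [max-fraction window-ms]
      (doto (Thread.
              (fn []
                (loop [prev (total-gc-millis)]
                  (Thread/sleep (long window-ms))
                  (let [cur (total-gc-millis)]
                    (when (> (- cur prev) (* max-fraction window-ms))
                      (System/exit 42))
                    (recur cur)))))
        (.setDaemon true)
        (.start)))

    ;; e.g. die when more than half of any 10-second window is spent in GC:
    ;; (start-gc-overhead-watchdog! 0.5 10000)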
This is one of the areas where OpenJ9 does things a lot better than HotSpot. OpenJ9 uses one memory pool for _everything_, while HotSpot has a dozen different memory pools for different purposes. This makes it much harder to tune HotSpot in containers.
This article showcases 2 harder-to-articulate features of Clojure:
1) Digging into Clojure library source code is unsettlingly easy. Clojure's core implementation has 2 layers - a pure Clojure layer (which is remarkably terse, readable and interesting) and a Java layer (which is more verbose). RT (Runtime) happens to be one of the main parts of the Java layer. Looking into a clojure.core function and finding a 2-10 line implementation is the normal experience (see the REPL sketch after this list).
2) Code maintenance is generally pretty easy. In this case the answer was "don't use eval" and I've had a lot of good experiences where the answer to a performance problem is similarly basic. The language tends to be responsible about using resources.
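To illustrate the first point, here's a quick REPL sketch (my own; the choice of functions is arbitrary):

    ;; clojure.repl/source prints the Clojure-layer definition; many core fns
    ;; bottom out in short Java methods on clojure.lang.RT or clojure.lang.Compiler.
    (require '[clojure.repl :refer [source]])
    (source second)   ; a couple of lines built on first/next
    (source eval)     ; a one-liner delegating to clojure.lang.Compiler/eval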
While both of those things are true, I’d be hesitant to call out this:
> Once we moved to other tasks, we started seeing the pods go OOMKilled. We took turns looking into the issue, but we couldn’t determine the exact cause.
As a particular “yay clojure” kind of moment.
This was an obscure bug/“feature” in the Clojure standard library. That’s not normal, and having to dig into the Clojure standard library, even if it is only a line or two, is certainly not something I’d particularly call out as standard practice or “easy” maintenance.
The standard library is for the most part enormously reliable.
1. I’m having a bit of trouble parsing this paragraph:
> The reason eval loads a new classloader every time is justified as dynamically generated classes cannot be garbage collected as long as the classloader is referencing to them. In this case, single classloader evaluating all the forms and generating new classes can lead to the generated class not being garbage collected.
> To avoid this, a new classloader is being created every time, this way once the evaluation is done. The classloader will no longer be reachable and all it’s dynamically loaded class.
It sounds like the solution they adopted was to instantiate a brand new classloader each time a dynamic class is evaluated, rather than use a singleton classloader for the app’s lifetime.
Dynamic classes cannot be GC'd without the classloader being dereferenced. In this case, if eval reused an existing classloader we would end up exhausting metaspace and hitting an OutOfMemoryError (Metaspace, or PermGen on older JVMs).
The initial Clojure implementation checked for an already created classloader and tried to reuse it; that code was later commented out.
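Here's a quick REPL sketch of my own (not from the article) of what that per-eval classloader behaviour looks like; the distinct-loader result is an assumption about current Clojure behaviour, which is the behaviour the article describes:

    ;; Each top-level eval compiles its form with a fresh
    ;; clojure.lang.DynamicClassLoader, so once the evaluated result becomes
    ;; unreachable, the generated class and its loader can be collected together.
    (def f (eval '(fn [] 1)))
    (def g (eval '(fn [] 1)))
    (identical? (.getClassLoader (class f))
                (.getClassLoader (class g)))
    ;; => false  (each eval got its own DynamicClassLoader)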
These days there is some machinery in the JVM for creating truly anonymous classes that can be garbage collected (https://docs.oracle.com/en/java/javase/22/docs/api/java.base...), but they are trickier to generate as they can’t have static fields (you don’t have a class name so have no way to refer to them) etc.
Side note: You’re using the term “dereference” incorrectly (also in the article). It doesn’t mean “drop references”. It means “going from the reference to the thing being referenced”, or (in other words) “accessing the thing that is being referenced” [0]. It doesn’t mean the reference is going away.
> It sounds like the solution they adopted was to instantiate a brand new classloader each time a dynamic class is evaluated ...
I was confused too (and I may still be), but that's not how I understood their solution.
Their solution, IIUC from reading TFA, is that they simply didn't use eval at all anymore. So the whole "eval loads a new classloader" thingy (so that it can be GC'ed later on) is totally moot.
The article makes it sound like the system was using eval (probably on a per-request basis, not just on start-up), and also like ceasing to use eval was pretty trivial once they realized eval was the problem. I'd be curious why they were using eval and what they were able to do instead.
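The article doesn't say, but a common shape for that kind of change (purely a guess on my part; the rule names and data below are made up) is to stop eval'ing a form per request and instead build the function once and reuse it:

    ;; Hypothetical sketch; rule-form and the fraud-rule shape are invented.
    (def rule-form '(fn [tx] (> (:amount tx) 1000)))

    ;; Before: every request evals the form, compiling a fresh class + classloader.
    (defn flagged-slow? [tx]
      ((eval rule-form) tx))

    ;; After: eval the form once (or skip eval entirely and use plain functions),
    ;; then reuse the compiled fn on the hot path.
    (def rule-fn (eval rule-form))
    (defn flagged? [tx]
      (rule-fn tx))

    (flagged? {:amount 2500})   ; => true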
If you can go from ~60ms p99 response times to ~45 from reduced garbage collection, that means GC has a major impact on user-perceptible performance on your application and proves that it is an extremely expensive operation that should be carefully managed. If you have a modern microservices Kubernetes blah blah bullshit setup, this fraud detection service is probably only one part of a chain of service calls that occurs during common user operations at this company. How much of the time users wait for a few hundred bytes of actual text to load on screen is spent waiting for multiple cloud instances to GC?
The only way to eliminate its massive cost is to code the way game programmers in managed languages do and not generate any garbage, in which case GC doesn't even help you very much.
What should be hard about app scalability and performance is scaling up the database and dealing with the fundamental difficulties of distributed systems. What is actually hard in practice is dealing with the infinite tower of janky bullshit the Clean Code Uncle Bob people have forced us to build, which creates things like massive GC overhead that is impossible to eliminate without totally rewriting or redesigning the app.
They only guess that the performance difference is because of GC; generating code on the fly and compiling it to classes in your hot path is probably not cheap either.
Memory allocation/deallocation overhead is always present; just look at different allocators, fragmentation issues, and so on. Using a GC is not intrinsically much different performance-wise.