Before anyone panics, please actually read the article. For some context, Peter Lawrey is pretty famous for being the OpenHFT guy, so he (rightfully) cares about performance at scales that have very different concerns than general application code.
He's working at the scale of 67 million events per second. Yeah, for him a single allocation looks like this: 67.8 M events/s, 472 ns vs. 50.1 M events/s, 638 ns.
As always with perf: scale and context matter! An allocation in these tests at these volumes was still only 160 ns, and at lower volumes was less than 50 ns. These are closer to the scale of cache misses than a C malloc call.
Even so, I wonder what advantages Java has that make using it in that environment so overwhelmingly compelling over other languages. Enough that it overcomes the fact that you have to write very non-idiomatic Java if you don't want to incur unacceptable performance costs.
It would be interesting to see what other languages were considered for the use case, and which cost/benefit considerations made Java come out on top.
It's not so much the language as it is the virtual machine / runtime environment. The JVM is one of the most solid and heavily invested systems in the history of business software.
Notable companies who started off with PHP or Python or what have you, reached a stage in their growth where they had to either migrate, or else write their own in-house compilers to fork those languages and make them scalable. No one's ever had the need to do this with Java (although different vendors do produce their own tweaked builds in order to compete in the support space).
People generally assume that a virtual machine is a disadvantage over AOT compilation. And for the "To Do List" apps that so many students and hobbyists on these forums are writing, they are correct. But JIT compilation is vastly superior for long-running server side business processes. Which is the niche that Java owns (facing some competition from .NET, which works in the same manner). It's not sexy or beginner-friendly, so the level of discussion that it gets here is disproportionately low.
> JIT compilation is vastly superior for long-running server side business processes.
How is that possible? Assuming all things are equal AOT should always be better. The advantage of a virtual machine is being able to distribute the same files to different processors and operating systems. The advantage of a JIT is that it makes your virtual machine faster.... oh how I wish Python would have an official JIT, sigh.
But back to my question, if a JIT compiler takes bytecode and converts it to native code at runtime, and an AOT compiler takes source code and converts it to native code before being distributed, then how could JIT be faster? Is it that many people use AOT compilers in a generalized way instead of taking advantage of hardware specific optimizations? Or is the Java JIT compiler just so good at optimizing java byte code that it doesn't matter if it's compiled ahead of time or not?
Edit: I got too many answers, but thank you all. I didn't consider the extra optimizations possible with runtime data. Please keep replying if you think of other reasons.
One example of an optimization that cannot be performed AOT is anything that can be done in response to "noticing" that every time you get a List, you actually have an ArrayList. AOT can't make the assumption, but a JIT can start doing things like specializing/inlining code for the implementation in question.
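To make that concrete, here is a minimal sketch (class and variable names invented) of the kind of call site being described: the static type is the List interface, but in practice only one implementation ever shows up:

    import java.util.ArrayList;
    import java.util.List;

    public class MonomorphicDemo {
        // The static type is the List interface, so an AOT compiler has to assume
        // any implementation could show up here and emit a virtual call.
        static long sum(List<Integer> values) {
            long total = 0;
            for (int i = 0; i < values.size(); i++) {   // size() and get() are interface calls
                total += values.get(i);
            }
            return total;
        }

        public static void main(String[] args) {
            List<Integer> data = new ArrayList<>();
            for (int i = 0; i < 1_000_000; i++) data.add(i);
            long result = 0;
            // After enough iterations the JIT can observe that 'values' is always an
            // ArrayList and speculatively inline ArrayList.size()/get(), guarded by a
            // cheap type check plus a deoptimization path in case the assumption breaks.
            for (int i = 0; i < 100; i++) result += sum(data);
            System.out.println(result);
        }
    }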
Let's consider the context here. HFT for advanced programmers. A sufficiently smart programmer working on the hot path of some HFT trading program in an AOT language can change the parameter type to ArrayList.
You can't if the method is in the standard library, or really any third-party code you don't control.
That's one of the biggest advantages of profile-guided optimization that works at the bytecode/IR level. It can recognize that generic libraries are only ever called with a single concrete type, and then specialize them to that type, and then inline the data representations of the concrete data types you actually pass, and then inline all the accessors that access that data.
To do that AOT you need to rewrite the library, which usually defeats the purpose of having libraries in the first place.
Or as one other possible instance of the design space you use the approach of cpp templates or any of Zig's comptime features -- when linking against a library you explicitly instantiate code corresponding to the things you know at compile time, apply optimizations, and throw away anything that's unused. Recognizing that generic libraries are only called with a single concrete type is child's play in those contexts.
> Recognizing that generic libraries are only called with a single concrete type is child's play in those contexts.
It's pretty trivial to prove this wrong. Return a different concrete type based on a config value that is loaded at startup. AOT can't save you here. We can keep moving goalposts, but it shouldn't stretch the imagination that a program's runtime characteristics can't be exhaustively predicted at compile time.
This is sort of like claiming that branch prediction isn't useful because can't you just throw away what's unused? No, obviously not, so if the branch exists, runtime analysis can find out which one is more likely to be called over time (and generics are just a branch at the type level).
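To illustrate the config-value point above with a hedged sketch (the property key is hypothetical): the concrete type is only known at startup, which is exactly the information a profile-guided JIT still gets to use:

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;

    public class ConfigDrivenList {
        // The concrete type depends on a value only known at startup, so an AOT
        // compiler cannot prove which implementation flows into the hot loop below.
        static List<Long> makeList(String impl) {
            if ("linked".equals(impl)) return new LinkedList<Long>();
            return new ArrayList<Long>();
        }

        public static void main(String[] args) {
            // Hypothetical config knob; in a real program this might come from a file.
            List<Long> orders = makeList(System.getProperty("list.impl", "array"));
            for (long i = 0; i < 1_000_000; i++) orders.add(i);
            long sum = 0;
            for (long v : orders) sum += v;   // the JIT's profile still sees a single
                                              // concrete receiver type at runtime
            System.out.println(sum);
        }
    }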
The person I was responding to said that AOT was a non-starter because the tech I mentioned doesn't exist, and that itself was in the context of a domain where you often wouldn't want the perf hit of even having a poor program binary layout and thus probably wouldn't want such things to be config options anyway.
In any event, I was just saying that AOT can definitely cross library boundaries, not anything near as strong as what you seem to be arguing against (it seems you believe I'm asserting something like AOT being superior to JIT in all cases).
Sure. There's nothing Java can do that can't be done by hand.
The question is, how much effort are you going to put into the profiling to determine that that's an optimization worth making? The larger the program, the more difficult such optimizations are to find -- the example given here was a trivial one.
You could do all of this by writing machine code, or constructing a programmable logic array for it. But would you actually get more trading done that way, or would you get more profit turning your programmers onto other tasks rather than one that can be handled by a machine?
> Sure. There's nothing Java can do that can't be done by hand.
in practice there is. For example, the JVM can inline virtual calls to code that was loaded at runtime (even if that’s third party code for which you don’t have the source code).
To do that by hand, you more or less would have to write something like the JVM yourself.
Sure, it's possible to do it by hand, but it's not possible for the AOT to do it for you, and this then violates the parent's "all things considered equal" claim.
There are certainly going to be tradeoffs that runtime analysis can make that AOT cannot make, even if the example I wrote is possible to fix at compiletime.
> How is that possible? Assuming all things are equal AOT should always be better.
The primary thing here is that the hotter the code path, the more optimized your code will be with a JIT (albeit compiled by a compiler that is slower), which is impossible with AOT (since we have a static binary compiled with -O2 or -O3 and that's it). Also, Java can take away the virtual dispatch if it finds a single implementation of an interface or a single concrete class of an abstract class, which is not possible with C++ (where we'll always go through the vtable, which almost always resolves to a cache miss). So C++ gives you the control to choose if you want to pay the cost, and if you choose to pay the cost you always pay it, but in Java the runtime can be smart about it.
Essentially it boils down to runtime vs compile time optimizations - runtime definitely has a richer set of profiles & patterns to make a decision and hence can be faster by quite a bit.
> Java can take away the virtual dispatch if it finds a single implementation of interface or single concrete class of an abstract class
It can even do that if there are multiple implementations, by adding checks to detect when it is necessary to recompile the code with looser assumptions, or by keeping around multiple versions of the compiled code.
It needs the ability to recompile code as assumptions are violated anyways, as “there’s only one implementation of this interface” can change when new JARs are loaded.
Mostly due to profiling, the JIT gets informed on actual program execution.
I wouldn't word it as vastly superior, because if you have a JIT you're probably making some kind of tradeoff (most of the time, memory usage?), but the JIT can optimize based on the runtime profile, where an AOT'd program cannot.
I know you're getting bombarded with a dozen answers, but I don't think anyone's mentioned speculative optimization/deoptimization yet. The JIT can and often does assume things that can't be proven, like that a specific virtual call site can be inlined because in practice it's always the same concrete type. It can back out the optimization later if that proves to be untrue. Or if a function is always called with the same value because it was read from a config: it can assume that it won't change.
1. AOT compilers have less information about the program than JITs (they only see the program as code; JITs see the code and the current execution profile)
2. AOT compilers can only make optimisations that they can prove don't change semantics, while JITs can be more aggressive, making optimisations that are speculative, deoptimising if wrong.
A JIT can see the actual values that occur in hot loops, among other things, which aren't accessible at compile time but are at runtime. There is simply more information available to a JIT.
A JIT compiler can elide a virtual method to a simple static call based on how many implementing classes are loaded at runtime, or can remove conditional branches from code (e.g. think of a setting whose boolean value is checked in a hot loop (for example something graphics related) - a JIT compiler will remove the conditional based on the setting used).
Fun fact, the linux kernel actually has something similar with self-modifying code, that will remove a conditional at runtime.
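A rough sketch of the setting-in-a-hot-loop case (the flag name is made up); the comments describe what the JIT can typically do with the profile, not a guarantee:

    public class SettingCheck {
        // Loaded once at startup and not changed afterwards (deliberately not final,
        // so a plain AOT compiler cannot fold it away).
        static boolean antialias = Boolean.getBoolean("render.antialias");

        static long render(int[] pixels) {
            long acc = 0;
            for (int p : pixels) {
                if (antialias) {          // branch inside the hot loop
                    acc += p * 2L;        // "expensive" path
                } else {
                    acc += p;             // cheap path
                }
            }
            return acc;
        }

        public static void main(String[] args) {
            int[] pixels = new int[1 << 20];
            long acc = 0;
            for (int i = 0; i < 200; i++) acc += render(pixels);
            // If the profile only ever sees one side of the branch, the JIT can lay out
            // (or speculatively prune) the untaken side, at the price of a deoptimization
            // if the setting ever flips.
            System.out.println(acc);
        }
    }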
The JIT compiler has live runtime data on how the program is actually used and it optimizes around this instead of what the source code says. For instance, the JIT compiler can see that a function has never been passed a null object and optimize out null checks and other branches that are never taken at runtime. An AOT compiler doesn't have this information unless it's provided PGO data at compile time.
> JIT compiler can see that a function has never been passed a null object
It cannot. It can only see that the function has never been passed a null object so far.
And the cost of such optimization might be that edge cases are way slower than anticipated. But what if those edge cases are actually the only thing that provide value for your business?
Imagine you're building a monitoring system that alerts very rarely, but reaction time is really critical for you. Do you really want the JIT compiler to optimize the alerting branch out because it hasn't seen any alerts yet?
My bet is that JIT compiler rarely does those optimizations by default, because it cannot know which branches in your code contain business value. I don't need my code to be twice as fast on average but twice as slow on critical branches.
Now, what would be great is if the profiler made suggestions about how to modify your code for performance based on usage. But those should be approved and commented by humans, because humans know whether those are valuable optimizations or a fad.
> Imagine you're building a monitoring system that alerts very rarely, but reaction time is really critical for you. Do you really want the JIT compiler to optimize the alerting branch out because it hasn't seen any alerts yet?
Yes I do, because once the constraints are invalidated the runtime system will instruct the JIT to recompile the code with the now uncommon branch [0].
>> JIT compiler can see that a function has never been passed a null object
>>... so far.
>This is truly a nit pick.
It's not even a nitpick, it's just wrong - your original statement is true: you said "has never been". The "so far" reply you received mistakenly implies you said "will never be" (given that HotSpot will deoptimise+reoptimise if its assumptions are invalidated)
That said, I think the remainder of their point is reasonable in the extremely specific scenario they mentioned (I think to the point of being unrealistic, given the hand-optimised-assembly level of perf importance they allude to) - the JIT optimiser can't know that the user wants an extremely rare branch optimised for performance to the point where the overwhelmingly common case should be degraded.
Another useful jumping off point to explore for HotSpot's performance techniques is [1]
Can you show me how exactly the optimization/deoptimization of an unused branch looks? You need to detect that you got into an unused branch, and that requires... a branch!
In order to be able to optimize presumably dead branches you need a primitive that:
1. Can detect entering "dead" branches
2. Is faster than a branch.
I'm not saying JIT optimizations are not possible, JIT compiler can totally choose to inline function call based on frequency or loop length. It can make "better" time/space complexity decisions (though "better" is still going to be controversial).
But optimizing branches away because they were not taken at runtime seems like a common myth. Like the previous commenter who confused startup optimization with code optimization.
1) You catch the SEGV signal you get when failing to read the address; then it gets complicated, but it is similar to the mechanism used to reach safepoints (which also does not use jumps).
2) If there are no nulls, it is faster not to do a branch.
> Java specification says that NullPointerException would be thrown when we access the null object fields. Does this mean the JVM has to always employ runtime checks for nullity?
Once again, we're talking about how JIT can be faster than AOT, not how to reduce JIT overhead. Those are two different types of optimization. You will never be faster than AOT just by reducing JIT overhead. Both links that were provided talk about JIT overhead alone.
This one talks about internal JVM optimizations. Not about optimizations JIT is capable of doing for your code.
What exactly are we even comparing when talking about NullPointerException? To my knowledge, most AOT-compiled languages don't even have NPE (not in the Java sense at least). It's an apples-to-oranges comparison.
If we have the code "if (x != null) { x.foo = 42; }", then it can be rewritten as x.foo = 42 (with some extra magic in the runtime). If x is not null it will be faster, as long as the cost of a branch is higher than zero. If x is null, the (slow) trick with SEGV can be used.
The trick of catching the SEGV can also be used by an AOT compiler (https://llvm.org/docs/FaultMaps.html), but the AOT compiler would need profiling data to do this optimization, which is more expensive if the variable x happens to be null often. Even if you have profile data for your application, and that profile data corresponds (on average) to this particular run, you cannot handle the case where x != null for five hours and then is null for five hours. If you used an AOT compiler you would have an exponential blow-up of code generated for the combinations of variables that can be null (if you tried to compile all combinations), and you would basically reinvent a JIT compiler --- badly.
Theoretically, a JIT can do anything an AOT can do, but the reverse is not true.
The article is talking about optimizations the JIT is capable of doing for your code. It is well written, and although it is talking about code throwing a NullPointerException, I think you can see that the same optimization can be done for my example at the top (that is applicable to c++ as well). So my comparison is not apples to oranges.
But your observation that most compiled languages do not have NullPointerExceptions is interesting. It could very well be because it is too expensive with an AOT compiler; have you thought about that?
Also, in an AOT world you couldn't really play the same SEGFAULT trick multiple times without some expensive tracking of where it came from, where to return to, etc., at which point you are writing a borderline JIT compiler with precached method bodies.
Well I guess the point is that with a JIT you can do a lot of tricks based on the fact that a function always gets passed a null (or anything relatively constant) in practice, even if that wouldn't be statically provable.
If that assumption gets violated a JIT can deopt and adjust later. In AOT you can only make the assumptions based on what the code can statically tell you.
Feels like we're going in circles. I guess that's what you get for nitpicking :)
JIT can indeed do a lot of tricks to reduce JIT overhead. But you can't present that as a benefit of JIT over AOT: AOT doesn't have JIT overhead at all.
JIT can definitely trade space for time. I bet that JIT will inline/memoize certain calls based on call frequency.
The only thing I'm arguing against is that low call frequency (like zero branch executions) somehow provides room for optimization compared to AOT. The only optimizations you can do in this case are optimizations for JIT overhead itself.
Anything beyond that is simply not possible: you can't eliminate a branch and detect the fact that you were supposed to enter the eliminated branch at the same time; the simplest mechanism that allows you to do that is branching!
One trick that OpenJDK uses is optimization of null checks. Since in Java memory has to be initialized, null pointers will have the value 0. If a given function using a null check gets called many times, and all those times it was called with a non-null object, the JVM can compile the branch away entirely. If the function is finally called with 'null', the compiled code will try to load the memory at address zero, causing a page fault and the OS to send a segfault signal. This can be handled safely by the JVM, which will interpret it in accordance with the above and will deoptimize the code (with all possible side effects reverted) and either run it in interpreted mode or in an optimized mode that does contain the null check "properly".
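For illustration, a minimal, hypothetical example of the kind of code this applies to; the comments just restate the mechanism described above:

    public class ImplicitNullCheck {
        static class Node { int value; }

        // Java semantics require a NullPointerException if 'n' is null, i.e.
        // conceptually: if (n == null) throw new NullPointerException();
        static int read(Node n) {
            return n.value;
        }

        public static void main(String[] args) {
            Node node = new Node();
            node.value = 42;
            long sum = 0;
            for (int i = 0; i < 10_000_000; i++) sum += read(node);
            // If the profile never sees null here, the JIT can emit the field load with
            // no explicit test; a null would fault near address 0, and the signal handler
            // turns that into deoptimization plus a proper NullPointerException.
            System.out.println(sum);
        }
    }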
How will the runtime system know to invalidate and recompile the code, if the alert check isn't being done on every call?
At some level, the check for the alert has to be done on every call. If it isn't being done, the one time there should be an alert it will be missed. If it is being done, where is it being done if not in the optimised JITted code, and how is that code even more optimised than the JITted version would be? Why not just put that optimisation in the JITted code?
The terms you'll want to read up on are JIT speculation and deoptimisation. There's a whole range of things that can be done, like trapping the null dereference instead of an explicit check; once hot code is inlined, the check could be lifted out several call frames, outside of loops. Then you can ignore the check later.
We're talking about optimizations that JIT can do compared to AOT.
The link you shared talks about skipping compilation for certain branches to speed up JIT compilation. Such an optimization makes startup faster, but it doesn't make code execution faster. AOT-compiled languages don't have a slow startup problem in the first place.
It's like saying paper books are worse than e-ink readers because you cannot charge paper books.
Eliding a branch makes not only compilation faster, but the resulting code as well. That’s how for example some logging statements become zero-cost when the debug level is set lower than their corresponding one. JIT compiled languages don’t care too deeply about startup, they shine with longer runtimes.
Are you sure you're not confusing zero-cost with almost zero-cost? Even zerolog talks about negligible costs after most aggressive optimizations, not zero: https://github.com/obsidiandynamics/zerolog
If you're just using constants in your code to set logging level, AOT compilation can do exactly the same optimizations.
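Both of those points can hold at once. A hedged sketch (names and property keys invented): a genuine compile-time constant that any compiler can fold away, next to a level only known at startup, where the JIT leans on profiling and branch layout rather than deleting the check outright:

    public class LoggingCost {
        // Case 1: a compile-time constant. Both an AOT compiler and the JIT can fold
        // this and drop the guarded statement entirely.
        static final boolean DEBUG = false;

        // Case 2: only known at startup (hypothetical property name). A plain AOT
        // binary keeps the branch; the JIT can profile it and keep the cold path
        // out of the hot code, which is "negligible" rather than literally zero.
        static int level = Integer.getInteger("log.level", 1);

        static long work(long x) {
            if (DEBUG) {
                System.out.println("debug: " + x);   // dead code when DEBUG is false
            }
            if (level >= 3) {
                System.out.println("trace: " + x);   // cold branch, moved out of line
            }
            return x * 31 + 7;
        }

        public static void main(String[] args) {
            long acc = 1;
            for (int i = 0; i < 10_000_000; i++) acc = work(acc);
            System.out.println(acc);
        }
    }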
JIT compilers do exactly that. Not only do they have more information, they are also allowed to -- and very frequently do -- perform speculative optimisations. If proven wrong, they incur "deoptimisations" and then recompile with more information.
I don't believe this is right. The optimization as described does not preserve correct program behavior unless you have a guard to do the null check anyway and fall back to the slow path.
edit: just read kaba0's answer below, and in fact it's possible via a page fault trap, really cool
Something not generally mentioned is that Java can inline what would be virtual methods in C++, it can also remove them as behaviour changes or code is unloaded.
> Even so, I wonder what advantages Java has, that makes using it in that environment so overwhelmingly compelling over other languages.
Presumably not the whole thing needs to be that fast, only the part that actually handles the individual events. Java has the advantage of being an industry standard, or at least being very common in the financial sector.
So you can work in a single language while only having to write some un-idiomatic Java in places where performance is critical.
It is pretty much unparalleled in terms of the product of ecosystem size and performance. The other two languages in the top 3 are JavaScript and Python, and both have serious deficits regarding parallelism. As seen, it is a pretty strong contender in performance alone as well, with perhaps the best observability, and it has well-defined execution semantics, even regarding erroneous execution - while any sort of memory error in a lower-level language may silently corrupt the heap, after which no assumption can hold regarding program state.
Also, primitive-only java programming is probably no worse than similarly low-level C/C++/etc.
Firstly: memory safety. You might not be allocating but you're benefiting from optimized bounds checks and type safety. If you're trading at HFT speeds then you need your program to throw an exception if something goes wrong, and not simply make a trade with a random bit of the heap sent to the exchange in the 'quantity' field. And you need it to be easily diagnosed when that happens, without taking your whole service offline.
Secondly: these programs do actually allocate sometimes. They just don't do it (much) in the hot path in prod. For instance they'll happily write normal Java code that runs at startup to load data files, and they'll happily write normal code on cold paths that aren't hit all that often. They may well do allocations in test mode for logging, test verification etc. Some HFT shops do even allocate in the hot paths but they just size the heap so it doesn't collect during the trading day, and then they run a GC once markets close.
Thirdly: HotSpot is a very easy way to get PGO which is normally considered to be a 10%-25% performance win. Getting it from C++ toolchains is possible, but hard, and Java has the advantage that if market conditions suddenly change and you start going down different codepaths it'll re-compile on the fly, whereas for C++ you'd have to wait for the next rebuild to re-establish an optimal hot path.
Fourthly: not allocating isn't actually all that weird or non-idiomatic. Quite a lot of apps use this "allocate up front, but not in the hot path" approach. Check out Mindustry, it's a game written in Java that hits a solid 60fps at minimum and can easily hit hundreds of fps. It's just silky smooth even on a non-gaming laptop. How? Doesn't allocate in the rendering loop. Object pooling is hardly radical, most apps with real time latency requirements do that even in C++.
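As a rough sketch of the pre-allocate-then-reuse pattern (class and field names invented, not from any of the projects mentioned):

    import java.util.ArrayDeque;

    // Pre-allocate up front, reuse in the hot path, return when done.
    // Not thread-safe as written; real pools are usually per-thread.
    final class EventPool {
        static final class Event {
            long timestamp;
            double price;
            void clear() { timestamp = 0; price = 0; }
        }

        private final ArrayDeque<Event> free = new ArrayDeque<>();

        EventPool(int size) {
            for (int i = 0; i < size; i++) free.push(new Event());   // all allocation happens here
        }

        Event acquire() {
            Event e = free.poll();
            return e != null ? e : new Event();   // fall back to allocating if the pool runs dry
        }

        void release(Event e) {
            e.clear();        // reset state before the next reuse
            free.push(e);
        }
    }

In steady state the hot path only calls acquire()/release(), so nothing new ever reaches the collector.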
Unacceptable performance costs are pretty contextual though. Plenty of code becomes unidiomatic if you have to avoid a malloc() or new at all costs, for example.
HFT folks in any language tend to break idioms to wring every cycle they can out of a machine. I'd bet you'd be faster in well tuned C++ code, but why not asm then?
One factor I've run into is the cost of offshore development for non-Java technologies. It's so much more difficult to find .NET Core developers in India than a Java developer for example. Forget about Rust or Go. Many enterprise development operations view developers as fungible which I disagree with, but it's much easier to pretend they are fungible if you're using something like Java rather than Rust.
I'd contend that non-allocating superfast code in any language ends up looking kind of odd. High Frequency Trade programs have a lot in common with embedded microcontroller apps. Except they're also doing business transactions.
>have to write very non-idiomatic Java if you don't want to incur unacceptable performance costs.
You would also, I guess, have to choose any library dependencies very carefully as well. Or implement the functionality yourself, negating any advantage that might bring in speeding up development.
It's quite a paradox; Java is really fast if you keep allocations down, but Java "best practices" often involve tons of object creation.
It can also be quite hard to find out if your code will be fast or not, because there are a whole bunch of rules that determine if and when your Java code will be compiled to machine code or keep running in the interpreter. Not having a way to find out if the compiler is smart enough to cache certain values or optimize some boilerplate code away, like you can with native compilers, is quite annoying sometimes.
Java is really fast if you do plenty of allocations as well, it is cheaper to allocate on the JVM than most malloc implementations. Also, you are probably better served by writing “idiomatic” code, and only try to optimize allocation pattern if that part of the code turned up in the profiler.
It would be different. A large number of real world applications aren't especially performance sensitive and it makes sense to have tools that prioritize other things than performance. We also see how languages that do prioritize performance can suffer in other aspects (C++ is a clear example).
I think that this is fundamental in some way. Consider arithmetic. Fixed-width integers suck from an ergonomics perspective, but variable-size integers are necessarily slower. Even designing something as basic as how arithmetic will work in your language forces you to weigh performance against usability.
I'm gonna bet that vast majority of enterprise Java apps aren't so sensitive to performance. What they have is tons of business-specific logic leading to different code paths.
Accumulated business logic can be quite a heavy beast, as I learned in experience, but hundreds of `if`-s can rarely be significantly optimized with clever algos.
You only need to take care of object allocation count for hot paths. You can still use java "best practices" for everything around that (if they are what you want)
Interestingly this is an area where Java shows its Lisp roots. Avoiding “consing” (allocation to the heap) is a key technique for writing high performance code.
This is where using a profiler like YourKit or Flight Recorder is your friend. It will show you the objects that can't be optimised away and where code bottlenecks are. It's hard to determine this in advance for Java.
> The cost of object creation can be far higher than cleaning them up if they are very short-lived.
I thought this problem was solved decades ago? Doesn't Java have a well-tuned generational garbage collector that can handle this?
.Net was hyper-optimized for this over a decade ago with its 3-generation system, where the 0th generation allowed for extremely rapid allocation and collections. I thought Java had similar capabilities?
(Summary, which BTW is probably outdated)
0 generation: Pre-allocate a continuous chunk of RAM. Every allocation goes to the next bytes of the chunk. When there's no more free RAM left, find all reachable objects in the 0 generation and copy them to the 1 generation.
1 generation: Works the same as the 0 generation, except it's larger and objects are copied to the 2 generation.
2 generation: Garbage collections here are traditional "mark and sweep" collections.
The problem is that even if the allocation cost (slab allocator) is zero, and even if the GC cost is zero, a high allocation rate on modern hardware effectively flushes the cache lines at the rate of allocation, effectively reducing your L1 + L2 + L3 cache to 0MB total if your allocation rate is high enough. A slab allocator will be almost guaranteed to be allocating from a non-cached line, and thus the object init will evict a hot cache line every time.
But on top of that, neither the slab allocator nor the GC are free. They're fast, and they're very good, but they also make heavy use of the memory bus, thus competing heavily with the cache-thrashing already being caused by the objects being allocated.
This is why in Java, on a heavily threaded and heavily loaded process, you can see that the CPU cores are far from 100% utilization, but yet there's no blocked threads (and more threads than cores). In other words, once the memory bus saturates, the effective throughput of the CPU drops off and the system cannot make full use of the processing power available.
The relative cost of memory access has gone up 2 orders of magnitude in the past 25 years, and that's before the bus becomes saturated. When the bus is saturated, it can go up dramatically from there (see: queue theory).
Errr. If you read the article, his testing was on the allocation side. The collection side was explicitly called out as a non-issue.
That said, otherwise you're right that the JVM generally makes use of the generational hypothesis depending on the collector. The Gen 0 is referred to as Eden in Java-land and uses thread-local allocation buffers that just bump pointers to allocate. They're basically just like arena allocation.
.NET has added multiple C# language versions, a rewrite of the dominant language surface (Span instead of arrays/IEnumerable) and years of investment of a low level stack (http server, postgres driver and socket layers) to overcome heap allocations in favor of performance. It executed in scale what the article suggests.
The multi generation heap is irrelevant at that scale mentioned in the article.
Java uses something called TLABs (thread local allocation buffer) which are basically arena allocators where a new object creation is a pointer bump (no need to even synchronize!). Later the GC will move longer-lived objects to a more permanent location, reusing the whole buffer for basically free.
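This isn't the JVM's actual code, just a toy analogy for why the bump-pointer scheme is so cheap:

    // The fast path is one add and one bounds check, and reclaiming everything is a reset.
    final class BumpArena {
        private final long[] slab;
        private int next;                 // next free slot, analogous to the TLAB's top pointer

        BumpArena(int slots) { slab = new long[slots]; }

        // Returns the index of a freshly "allocated" slot, or -1 when the arena is full
        // (a real TLAB would then request a new buffer from the shared heap).
        int allocate() {
            if (next >= slab.length) return -1;
            return next++;                // the whole fast path: bump a pointer
        }

        void set(int slot, long value) { slab[slot] = value; }
        long get(int slot)             { return slab[slot]; }

        void resetAll() { next = 0; }     // bulk reclamation costs nothing
    }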
The GC isn't involved with allocation, the expensive bit here. The GC handles the clean up.
The allocations create memory pressure that can easily saturate the bus to DDR if you are not careful.
the article seems a bit fuzzy on what it's really calling out as responsible, but the hint seems to be that even with super fast allocation and garbage collection you still thrash your L1/L2/L3 caches, which leads to lower performance compared to re-using the same object pinned in memory.
> Java can be very fast, however, it can be well worth avoiding object creation.
I would be extremely hesitant about taking this as general advice for your program. Java object creation should be treated as “fast enough until it isn’t”, at which point you can try applying an optimization technique like this.
I'd take "premature optimization is the root of all evil" type advice with a big grain of salt. IMO it's sensible for procedures, since they can often be optimized rather quickly. It's NOT sensible if you're designing the application's basic data structures AND have an idea about the scale you want to achieve, which for data-heavy applications is often the case (e.g. you know the rough file sizes that go into your importer beforehand).
From personal experience, once #records times #fields goes into the millions, it starts to make sense to think about a data centric design. Functional interfaces are perfect to encapsulate e.g. a bag of arrays and not expose it to all the rest of your application.
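A hedged sketch of that data-centric shape (all names invented): one object owning parallel primitive arrays, with a small functional interface as the only thing the rest of the application sees:

    final class TradeTable {
        private final long[] timestamps;
        private final double[] prices;
        private int size;

        TradeTable(int capacity) {
            timestamps = new long[capacity];
            prices = new double[capacity];
        }

        void add(long timestamp, double price) {
            timestamps[size] = timestamp;
            prices[size] = price;
            size++;
        }

        // The rest of the application only ever sees this interface, never the arrays.
        interface TradeVisitor { void visit(long timestamp, double price); }

        void forEach(TradeVisitor visitor) {
            for (int i = 0; i < size; i++) visitor.visit(timestamps[i], prices[i]);
        }
    }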
If you are designing your application's basic data structures and have an idea about the scale you want to achieve, the optimization is not premature. "Premature" isn't a decorative word, it's doing real work in that phrase.
There are plenty of people reading this article right now and going yeah, yeah, this is exactly what I need! while their task is writing an API that will receive 3 requests per minute that run in less than ten milliseconds each. We don't warn against "failing to optimize when it is needed" not because it isn't a problem, but because historically the bigger problem has been premature optimization.
(I phrase that carefully, because I'm not sure the balance isn't shifting. I feel like I'm seeing more failure to do basic optimizations lately than people going crazy prematurely optimizing. I've started lightly banging on this drum and in 10 years we may need a luminary to say something like "FFS, people, adding up a few thousand integers shouldn't take three seconds! How did you even make your code that slow??", only, you know, pithily and quotable. But I can vouch for the fact that historically, going all the way back to when Knuth said that, premature and excessive optimization has been the bigger problem.)
There are definitely problems on both ends of the scale, but I'm definitely seeing more of the former - I think it depends a lot on what kind of environment you're in. I more often encounter people using "premature optimisation" to not do any optimisation at all. It results in applications that have a much lower performance ceiling and more often require a rewrite relatively soon.
Edit: I also want to add that in this specific case with Java, using functional interfaces with primitive datastructures often doesn’t really have any downside. If you know you need to filter the data to a lower scale first before continuing (e.g. to the amount you can display on one page), it’s just not sensible to object box your entire data first, just so you can avoid having arrays anywhere in your code. Just have a class for your data and expose some rich interfaces to it, but implement paging on top of primitive data in this case.
> I'd take "premature optimization is root of all evil" type of advice with a big grain of salt.
Instead of responding by making this a conversation about “premature optimization”, we might do better by remembering that the Java runtime’s HotSpot compiler does apply automatic optimizations around instance creation.
Instead, we should make sure that (a) we haven't accidentally taken a bad time measurement because we haven't given the runtime enough time to see the slow parts of our code and apply its optimizations, and (b) that we haven't accidentally written code that the runtime can't optimize.
Edit: Perhaps after all of that we still have slow code, but at least we’ve made sure we’re keeping an eye on all of the JVM’s moving parts as we craft our solution.
Indeed. Looking at the benchmark results, the conclusion should be that the overhead of object creation can be neglected for most types of applications (until it can't, as you wrote). Edit: Based on my experience, object creation is ridiculously fast in the OpenJDK JVM and trying to avoid it will, for most applications, just result in code that is harder to maintain.
When I worked in HFT with Java code, creating caches/pools of commonly used objects (generally serialized/deserialized real time parameter updates) was pretty common. It significantly reduced memory allocation in hotspots. But the better benefit, generally, was substantially reduced GC pause times.
Java was used back in the 2000s because it was seen as the cool kid. More productive than writing C++ and easier to hire and train for.
Yeah, afaik HFT folks also disable GC and just restart the app, or even the machine, at the end of the day.
Java game devs are also fond of object pools, to my knowledge: they do as much allocation as possible at the start of a game level, and then avoid it until the level ends.
Yep that was what the critical loop trading components did (the ones w/ the models for buy/sell hooked up to the exchange feed). Completely disable GC and restart it every day. Every line of code in those critical loops was pored over to make it as fast as possible. These days it is a lot of FPGA based stuff but I like to believe the JVM stuff is still running somewhere.
Depends on the type of HFT. On one end, not even C++ cuts it and FPGAs are used, while on the other hand I heard of plenty of Java deployments. Sure though, HFT itself is a niche.
Really wish that when people benchmark the GC they would include the GC settings they used or at the very least experiment across the several GC’s the VM has to offer.
In each case the default settings were used (i.e. no options were provided), in both Java 8 (Parallel GC) and Java 11+ (G1). As the GC was a small portion of the time, changing the GC wasn't expected to make a difference anyway.
I used an approach similar to Peter's here to both an old streaming parsing library as well as a GUI (event handling) library -- effectively, re-use the same instance of the event shell with new data every time an event fired and copy the data out if you need to retain it for a long-lived operation or persistence, otherwise calculate and move on... the reduction in object creation overhead was significant and performance increase was around +30% for doing this in the two individual use-cases... BUT, the API was like a loaded gun pointed at your face.
I knew what I was doing (with it) so it wasn't a problem, but if I ever open sourced the API and provided it as a library, I would envision a large portion of the population trying to handle the events in a multi-threaded context or throwing them into a List, only to find the values changing on them during use (while the parser was still running on another thread).
Performance was so tempting, but usability-face-shot-gun was the greater evil.
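For anyone who hasn't seen the pattern, a rough sketch of the reused event shell idea (all names invented, not the actual library's API); the caveat about copying data out lives in the comments:

    final class ParseEvent {
        int type;       // event kind
        long offset;    // position in the input
        char value;     // payload; only valid until the next event overwrites it
    }

    interface EventHandler { void onEvent(ParseEvent event); }

    final class Parser {
        private final ParseEvent shell = new ParseEvent();   // one instance, reused for every event

        void parse(CharSequence input, EventHandler handler) {
            for (int i = 0; i < input.length(); i++) {
                char c = input.charAt(i);
                shell.type = Character.isDigit(c) ? 1 : 0;
                shell.offset = i;
                shell.value = c;
                handler.onEvent(shell);   // the handler must copy anything it wants to keep;
                                          // storing the event in a List would see it mutate later
            }
        }
    }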
Reusing your objects can be confusing for other developers. You need to be careful with anything exposed via an API; however, internally you can be more optimised.
It's a short article. I was looking for any distinction between heap allocated and stack allocated objects. The compiler does escape analysis and will stack allocate anything that it can to a large degree.
It then goes on to say
> While allocation is as efficient as possible, it doesn’t avoid the memory pressure on the L1/L2 caches of your CPUs and when many cores are busy, they are contending for memory in the shared L3 cache.
So I suspect that it's not all about gc, but also the overhead of object memory and especially object packing for caching and use of objects between threads. That's why we have things like LMAX Disruptor[0].
Until we get Project Valhalla value objects, you're likely better off using Go than Java for the object packing efficiency.
Complex programs will likely need objects (not ‘value types’)+, and Go’s GC is nowhere near as performant as Java’s. Benchmarkgame’s binary tree test is the one that is specifically made to stress test the GC and Java beats out every single managed language by a huge margin there.
+ Even in Rust/c++ one often has to go with dynamic life times like (A)RC, shared pointers, etc.
There's good advice there but you definitely need to not go crazy and get into premature optimization land.
If you've got a tight section of critical code like he does this is a good technique, however reuse of the same objects through the critical section implies:
- Mutable objects
- Harder to verify correctness
- Need to be very careful to avoid some very very serious bugs and security issues.
E.g. in his HFT example, improper reuse of an object without correctly resetting/setting all the fields could create bugs as serious as executing a trade against the wrong user or leaking data between users.
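A tiny, hypothetical illustration of that failure mode: if the reset step misses a field, data from the previous use leaks into the next one.

    final class OrderRequest {
        long accountId;
        double quantity;
        String clientTag;

        void clear() {
            accountId = 0;
            quantity = 0;
            clientTag = null;   // easy to forget when a new field is added later;
                                // then the previous user's tag goes out on the next order
        }
    }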
None of this is really rocket science or java-specific. It's all basic old school optimization & computer science.
The pre-allocation is a key feature. If you can lower GC pressure you get all the benefits of HotSpot with much less drawback. Whether or not you actually need this is another question. Anything outside of real-time trading is highly suspect.
Big Peter Lawrey fan - I use Chronicle Wire (the library mentioned in the article) often; it's a really nice serialization/deserialization library when performance/allocation is a focus. Combining Chronicle tools + Real Logic tools you can build some extremely performant Java applications.
There are system languages and application languages. And application languages need a non-operating-system underpinning. So Java and .NET need their high performance foundations.
Ah, but java seems to go out of its way to make you make so very many objects. It's almost as though it tries to rub your face in the fact that you don't have to remember to delete objects, by needlessly making you make them. Iterators for example. Why (why!) should it be necessary to incur memory thrash, just to traverse a collection? Why not allow a class that has a collection also have a reusable iterator? Just a simple .reset() method would work wonders for many of these "disposable" objects. And if .reset() offended anyone, they could always ignore it and continue to blithely throw candy wrappers out the car window, as it were.
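Nothing stops you from building that shape yourself, though. A hedged sketch of a hand-rolled reusable cursor with a reset() (not part of the JDK collections API, and sharing one cursor obviously assumes single-threaded use):

    final class LongBag {
        private long[] values = new long[16];
        private int size;

        void add(long v) {
            if (size == values.length) values = java.util.Arrays.copyOf(values, size * 2);
            values[size++] = v;
        }

        final class Cursor {
            private int index;
            boolean hasNext() { return index < size; }
            long next()       { return values[index++]; }
            void reset()      { index = 0; }              // reuse instead of re-allocating
        }

        private final Cursor cursor = new Cursor();       // one long-lived cursor per bag

        Cursor cursor() {
            cursor.reset();
            return cursor;
        }
    }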
A bigger issue I think is Java's need to box values, primitive or otherwise. For example, even an ArrayList of integers creates an object for each one. IMHO, the Java runtime has many of the design hallmarks of a dynamic language, with strong (generic) types somewhat bolted on.
> the Java runtime has many of the design hallmarks of a dynamic language, with strong (generic) types somewhat bolted on.
I'd say it's the opposite. Dynamic languages largely use boxing as the indirection mechanism for their highly dynamic type systems, since it works well there. But boxing isnt synonymous with type flexibility, and Java uses it for other reasons (mostly).
The Java runtime provides runtime type information for all types, and extensive reflection is available, including for example invoking arbitrary code. Very dynamic and a significant problem for static analysis, optimisation and security.
It’s not always obvious what’s actually boxed and what isn’t however. The JVM tries to do clever optimisation. For example, I’m fairly certain that when you manipulate an array of primitive numerical values, they are not actually unboxed before each operation before being boxed again.
Hmmm. As much as I hate having to fix boxing problems, I usually find it pretty obvious when boxing can occur. Pretty much any time a variable/parameter type or a function return type uses capitalized Integer, Long, Boolean, etc. instead of lowercase int, long, boolean, etc. boxing is probably occurring. Those ... and assigning those to Object or Number.
I think that's exhaustive. You can pretty much grep through and remove them from 90% of code with a little refactoring and replacing HashMaps with specialized ones. Only if there are null returns or ConcurrentHashMaps involved does it get tricky.
An array of primitives like byte[] or long[] isn't Byte[] or Long[]. The former don't cause boxing. Maybe you're thinking of ArrayList<Byte>?
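A quick sketch of the distinction, since it trips people up: arrays of primitives store the values inline, while Long[] and ArrayList<Long> hold references to boxed objects:

    import java.util.ArrayList;
    import java.util.List;

    public class BoxingDemo {
        public static void main(String[] args) {
            long[] primitives = new long[3];       // one flat array, no per-element objects
            primitives[0] = 1;                     // plain store

            Long[] boxed = new Long[3];            // array of references
            boxed[0] = 1L;                         // autoboxing: may allocate a Long object

            List<Long> list = new ArrayList<>();   // generic collections only hold objects
            list.add(1L);                          // boxed again

            long total = primitives[0] + boxed[0] + list.get(0);   // unboxing on the reads
            System.out.println(total);
        }
    }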
> Pretty much any time a variable/parameter type or a function return type uses capitalized Integer, Long, Boolean, etc. instead of lowercase int, long, boolean, etc. boxing is probably occurring.
That "probably occurring" hides a lot of complexity. That's my original point. Optimisations are happening, especially when Integer and Long are involved. There is sometimes actually less boxing than you would expect from reading the code.
1. There's no point in optimizing memory usage by hunting iterators. If the collection overhead is becoming visible, then you need an array, not a reset method.
2. You cannot easily add "reset()" to all iterators now, since it will be a breaking change.
3. Having reset() on all iterators 25 years ago would reduce the number of use cases for it, if this method had to be reliably supported.
4. Throwing UnsupportedOperationException in a default method to maintain compatibility of libraries with existing user code would require the new client code using that method to catch the exception and handle it, which ruins the whole idea of the optimization (you will get more overhead from that).
If you are concerned about performance, you have to design your code with it in mind. Building your application the same way as if performance wasn't a big problem, using the same libraries and APIs, is just a bad idea that won't be salvaged by such "optimizations".
Most Iterators should be placed on the stack anyway. If they are not and you have a random access collection like ArrayList you can use an index, but this is rarely needed.
If you are going to change code, perhaps Stream(s) are a better choice, these are also placed on the stack most of the time.
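A small sketch of both styles mentioned above; whether the for-each iterator actually gets scalar-replaced depends on inlining, so read the comments as the typical case rather than a promise:

    import java.util.ArrayList;
    import java.util.List;

    public class IterationStyles {
        public static void main(String[] args) {
            List<Integer> data = new ArrayList<>(List.of(1, 2, 3, 4, 5));

            // The for-each form conceptually allocates an Iterator, but after inlining,
            // escape analysis can often scalar-replace it so nothing reaches the heap.
            long a = 0;
            for (int v : data) a += v;

            // The index form never creates an Iterator at all; a safe fallback for
            // random-access lists when the profiler says it matters.
            long b = 0;
            for (int i = 0; i < data.size(); i++) b += data.get(i);

            System.out.println(a + " " + b);
        }
    }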
Sometimes arrays are not an option, because objects may not be in memory at all times. A fun trick we use is to have the iterator be attached to the object lifecycle, so that repeated (parallel) iteration does not require multiple iterators. I can't remember the exact percentage, but this change alone increased our throughput by 20-25%.