JVM Anatomy Quarks

jnordwick · on Jan 31, 2020

I have been reading Aleksey Shipilev for a decade now, and his insight in high performance Java was been a well worth the time. He is also the maintainer of JMH, Java Microbenchmarking Harness, a tool I miss in about every other language especially Go. Because writing microbenchmarks is really difficult especially on a JIT that wants to rewritw and optimize things behind your back a couple times. Hotspot can be both brilliant and maddening.

Thanks to much of his writings and the rest of the Java performance community (pretty large in the fintech sector), I can write faster Java than most people can with C++. It just takes some effort, but the control and performance you can get from Java is really impressive.

I've had to deep dive into Go a little more lately, and I really miss some of the Java support. I've found Go to be much slower when you have to do anything interesting. In high performance Java you often rewrite a lot of the base libraries in a very different style that gives you tight control over escape analysis, GC, call site inlining, etc. You actually have a decent amount of control for such a high-level language.

In Go, I haven't been able to find that control. The Go team seems to have taken an opposite approach and removed your control (I often joke about Go just being short for "Go Fuck Yourself" because of its attitude against developer control and the teams's "if we don't need it you don't need it" attitude).

It is resources like this that really make Java shine in its pro high performance developer attitude. (Current Go issue, getting select and channels to operate anywhere remotely efficiently and trying to find a way to keep high CPU goroutines on different OS threads - so far not much luck).

pjmlp · on Jan 31, 2020

If you happen to do .NET, you can make use of BenchmarkDotNet.

As usual, one platform kind of mirrors the other. :)

closeparen · on Jan 31, 2020

Benchmarking and profiling are both built into the language, which is pretty nice. But you’re right, Go is designed for safety over control. High performance Go comes down to avoiding allocations; there’s not too much else you can do at the language level.

jnordwick · on Jan 31, 2020

Not sure if you can even control GCs that well some times. In Java I've worked on projects that never GC once they reach a steady state. I'm not sure I could do the same in Go, or maybe the techniques are just more involved or different?

closeparen · on Jan 31, 2020

The technique is basically to preallocate your structs and reuse them via *sync.Pool, never allocate on a hot path. I don't know about "never" GC but it at least minimizes GC, and the pauses are very short.

andreygrehov · on Jan 31, 2020

Or you can turn off GC altogether by setting an environment variable, GOGC=off.

mknyszek · on Jan 31, 2020

> He is also the maintainer of JMH, Java Microbenchmarking Harness, a tool I miss in about every other language especially Go. Because writing microbenchmarks is really difficult especially on a JIT that wants to rewritw and optimize things behind your back a couple times. Hotspot can be both brilliant and maddening.

I'm assuming you're aware of the microbenchmarking framework built into Go's testing framework? If so, can you elaborate on where it falls short? I have my own gripes, but it would be nice to understand where others are coming from on this.

> In high performance Java you often rewrite a lot of the base libraries in a very different style that gives you tight control over escape analysis, GC, call site inlining, etc. You actually have a decent amount of control for such a high-level language.

In case you're not aware, the standard Go toolchain does expose some of the optimizations the compiler does, notably escape analysis and bounds check elimination. In the latest release you'll have the option to even export this information as JSON to inspect it programmatically.

Other things, like inlining, are not there (AFAIK), but that's partially because the toolchain is still maturing. For example mid-stack inlining (i.e. inlining any calls to non-leaf functions) is actually a relatively recent addition. Go will also let you choose to not inline something. There's still a lot of work to be done around inlining in general, and visibility into the process would be a nice improvement.

I'd also like to point out that part of the reason the GC has so few knobs has to do with maintainability. Every new knob means expanding the space of configurations significantly, and making sure they all continue to work is a big task. For example, V8 exposes a lot of knobs, but (IIUC) aside from a few default configurations shipped in Chrome, any deviation from those and you're considered "on your own", mainly because of this maintainability problem.

With that being said, I'm not really sure how OpenJDK deals with this issue; maybe there's just enough people out there and enough resources behind the project that it's fine?

> In Go, I haven't been able to find that control. The Go team seems to have taken an opposite approach and removed your control (I often joke about Go just being short for "Go Fuck Yourself" because of its attitude against developer control and the teams's "if we don't need it you don't need it" attitude).

I don't think this is the intended messaging from the Go team, and there's been efforts on their part to shed this image. I think part of it is maturity of the toolchain; Java has nearly 15 years on Go and in some cases there honestly isn't all that much to give visibility into or control over yet. Another part of it is an overall conservative approach toward evolution of the language and of APIs, primarily for long-term maintainability and compatibility. Expanding the API (including performance knobs) usually needs to show a clear net win (see SetMaxHeap, which gives you more control but never really made it in; it still exists as a patch).

> It is resources like this that really make Java shine in its pro high performance developer attitude. (Current Go issue, getting select and channels to operate anywhere remotely efficiently and trying to find a way to keep high CPU goroutines on different OS threads - so far not much luck).

You should definitely file a bug if you have the bandwidth to do so and haven't already. Channels and scheduling should be efficient and smart by default, and finding situations where they make poor decisions is how the runtime improves. The team is fairly responsive to such bugs and even if they don't get resolved immediately, having it on their radar will only help the team make better design decisions going forward.

ajkjk · on Jan 31, 2020

This is good stuff. I wish it existed for a lot of other things -- like python and v8. Or maybe it does? Where can I find deep technical knowledge in bite-size form?

shortercode · on Jan 31, 2020

While not quite the same the V8 team has a blog that discusses various improvements they are making. Such as the Orinoco GC project. Which typically contains some details on the implementation.

It's a fairly fascinating window into the inner workings, and quite detailed.

https://v8.dev/blog

jashmatthews · on Jan 31, 2020

The WebKit blog also has really good info on JSC E.g. https://webkit.org/blog/7122/introducing-riptide-webkits-ret...

Filip Pizlo’s site has a bunch of his presentations of how their JIT compiler works http://www.filpizlo.com/papers.html

simula67 · on Jan 31, 2020

For Python, there is a code walkthrough by Philip Guo : https://www.youtube.com/playlist?list=PLzV58Zm8FuBL6OAv1Yu6A...

emmanueloga_ · on Jan 31, 2020

https://mrale.ph/ had quite a few deep dives on JS/v8, not sure how up to date it is now

xxs · on Jan 31, 2020

Most of the stuff is quite old, 1.6/1.7 times, though. It should be appreciated, it has been gathered in a single place and backed up with micro benchmarks and assembly.

fctorial · on Jan 31, 2020

There doesn't need to be. These VMs aren't designed to run anything other than their flagship languages.

ryeguy · on Jan 31, 2020

I'm not sure where you're getting that idea. Are you thinking the target audience of this content is only for people developing JVM languages? It's not, it's how the vm and compiler work at a low level and optimize your code.

wiradikusuma · on Jan 31, 2020

Man, I always use String.intern() for synchronized(theString). -- https://shipilev.net/jvm/anatomy-quarks/10-string-intern/

xxs · on Jan 31, 2020

Sync on a string is a bad to boot.

It's hard to explain how terrible the idea to synchronize/lock objects you don't control.

As for interning (not for strings only and not guaranteed) I have a lock free table (not CHM) that keeps most used objects (with possible random eviction) to provide a good trade off between unnecessary memory waste, fast access and low memory footprint.

diroussel · on Jan 31, 2020

If need to synchronise() and there is not an obvious choice then I create a new Object() for that purpose and make sure it’s available wherever locking is needed (ideally encapsulated in one class)

sorokod · on Jan 31, 2020

The ReentrantLock class and friends have been around for a while and typically provide a better alternative.

xxs · on Jan 31, 2020

Nowadays (java 7/8), "synchronize" is likely better than most uses of reentrant lock, though. It's especially good in (common) cases where the lock is not contended.

even CHM uses synchronized nowadays.

Flip note: don't touch ReadWriteLocks

diroussel · on Feb 1, 2020

In the past I have followed this approach myself. But that is changing. If you want to future proof your code, don't use classic locks. Project Loom doesn't work with classic locks, and the java.nio package is has been re-written to remove classic locks:

> In Project Loom, there will be support for efficiently switching between fibers that use Java 5 locks (that is, the java.util.concurrent.lock package) but not native C locks. As a result, it is necessary to migrate all blocking code in the JDK over to Java 5 locks. So the legacy Socket API required reimplementation to achieve better compatibility with Project Loom.

See: https://blogs.oracle.com/javamagazine/inside-java-13s-switch...

sorokod · on Jan 31, 2020

> don't touch ReadWriteLocks

I am interested in references that back this statement

MrBuddyCasino · on Jan 31, 2020

I was interested in this as well, and yes ReadWriteLocks are horrible[0]. Just using synchronized is a good default that performs well in most scenarios, StampedLocks are good too.

[0] https://blog.overops.com/java-8-stampedlocks-vs-readwriteloc...

xxs · on Jan 31, 2020

About the blog - testing with more threads than cores could be quite misleading, also testing on dual/quad socket vs single one exhibits the effects on coherency traffic a lot more (compared to L3 talk)

xxs · on Jan 31, 2020

Briefly:

- the lock has write a CAS on the =fast= read path, causing coherency traffic and a contention point between the readers. That's it the readers don't scale

- it's quite hard to use correctly, i.e. after read, determining the exclusive/write lock has to be acquired, the read lock has to be released 1st, the write lock acquired and the conditions that causes the grab to be rechecked

- Copy-On-Write should be a preferred solution for most cases, easy to understand and reason about. If not StampedLock is a better alternative.

jabba_d_hut · on Jan 31, 2020

StampedLocks performs much better than R/W locks.

hyperpape · on Jan 31, 2020

Aside from String.intern() being dubious, if you synchronized on an interned string, anything else that happens to work with an interned copy of that string is now contending with your lock. In the worst case, you could deadlock.

If you really need this sort of dynamic locking, you can use a ConcurrentHashMap<String, Object> to achieve it (lock on the object). I'm not sure whether it's ever the _best_ design, but it avoids interning the string, and it keeps an anonymous lock object that you know won't be shared.

chrisseaton · on Jan 31, 2020

> anything else that happens to work with an interned copy of that string is now contending with your lock

Isn't that the point?

People use it to create symbolic locks in situations where they don't want to use any more formal link of linkage provided by the JVM. In your case you need some way to get a handle to that concurrent hash map, so some kind of formal JVM linkage. Sometimes that's hard.

Not saying how it's how I'd chose to design an application, but I presume people doing this have their own good reasons.

hyperpape · on Jan 31, 2020

I think there's a subtle difference: you want to create mutual exclusion around specific operations using that string. However, it's at least conceivable that something else uses the interned instance of that string, and thereby creates contention.

Now, if your string is sufficiently unique, the chances are relatively low, as long as you don't leak a reference to it (an object, or explicit Lock only has one purpose, so that's less likely).

Still, it's basically mysterious action at a distance. Whereas using the ConcurrentHashMap, it's very explicit action at a distance. Granted, it does require explicit JVM linkage, which is a cost.

sk5t · on Jan 31, 2020

Yeah--mucking about with .intern for no good reason, or locking on strings in general, shouldn't pass a code review, this reeks of sticking one's fingers in all sorts of places they don't belong. ConcurrentHashMap.computeIfAbsent (and computing a value that is _not_ subject to mysterious external forces) is so much better.

xxs · on Jan 31, 2020

>ConcurrentHashMap.computeIfAbsent

It's leak prone as it'd hold the references forever. It's a cheap and easy way to do it but far from ideal. I'd not recommend it.

sk5t · on Feb 1, 2020

This seems a strange criticism--of course entries in a map don't disappear for no reason at all. Consider using an atomic cache if you are going to stick so many keys into the map that it becomes an issue. Or else bucket the keys into a known set, if you can tolerate the occasional collision.

xxs · on Feb 1, 2020

String.intern does remove references (it used not to and it was considered a major issue). Imagine parsing documents and keeping the words - it works well as long as the input is not malicious. Otherwise words would pile up with no one cleaning them, give it enough time/input to slow the application due to memory pressure, crashing with OOM.

aardvark179 · on Jan 31, 2020

I hesitate to ask what exactly you’re trying to achieve locking On a string which you have to intern like that. There may well be a different design that would avoid having to do it.

vbezhenar · on Jan 31, 2020

Avoiding this pattern is surprisingly hard to get right.

MaxBarraclough · on Jan 31, 2020

I'm not getting #17: Trust Nonstatic Final Fields.

> what happens if someone changes that field?

The field is final. The static is final. How can the field be changed? Reflection?

UncleMeat · on Feb 1, 2020

Final applies to the field, not the object. You can still mutate the object pointed at by a final field.

MaxBarraclough · on Feb 1, 2020

Sure, but the instance's only field, is final.

The static reference KNOWN_M is final, and the only field of the pointed-to instance, x, is also final.