JVM Anatomy Quark #10: String.intern (2019) (shipilev.net)
119 points by hyperpape 12 days ago | 41 comments

(Note: the article is from 2019, but it is still relevant and interesting.)

If all you're doing is deduplicating string objects in order to save memory, the article makes a good case to use `HashMap` or `ConcurrentHashMap` instead of `String.intern`.
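A minimal sketch of what such a hand-rolled deduplicator could look like, assuming a plain `ConcurrentHashMap` that maps each string to its canonical instance. The class and method names (`StringDeduplicator`, `dedup`) are made up for illustration and are not from the article.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (names are illustrative): a hand-rolled string deduplicator.
final class StringDeduplicator {
    // Maps each distinct string value to the first instance seen with that value.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    /** Returns the canonical instance equal to s, registering s if none exists yet. */
    String dedup(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    /** Number of distinct string values currently held. */
    int size() {
        return pool.size();
    }
}
```

Note that this pool holds strong references, so its entries live until the deduplicator itself is dropped. That is usually fine for a pool scoped to one parsing or loading job.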

The reason to use `String.intern` is in the fairly narrow case where you want to use `==` to compare strings, and some of these strings originated as string literals in Java source code. Java specifies that string literals for equal strings are always the same Java object, even if they occur in different class files that are compiled separately. This is accomplished by interning the strings, and `String.intern` provides user-level access to this same intern pool.
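A small sketch of the identity behavior described above. The class and method names are illustrative only; each method returns `true` on a conforming JVM.

```java
// Illustrative demo of literal identity and String.intern (names are hypothetical).
final class InternIdentityDemo {
    // Equal string literals refer to the same interned object.
    static boolean literalsShareIdentity() {
        String a = "hello";
        String b = "hello";
        return a == b;
    }

    // new String(...) always creates a distinct object with equal contents.
    static boolean freshCopyIsDistinct() {
        String literal = "hello";
        String copy = new String(literal);
        return copy != literal && copy.equals(literal);
    }

    // intern() maps the copy back to the pooled instance used by the literal.
    static boolean internRestoresIdentity() {
        String literal = "hello";
        String copy = new String(literal);
        return copy.intern() == literal;
    }
}
```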

The Guava library has an interner that does exactly what you'd want for a cache of strings, without blowing up the JVM's native string table (which, IIRC, is what String.intern() uses). https://guava.dev/releases/22.0/api/docs/com/google/common/c...

Isn’t this concept originally from C#? (Where interning is the default behavior)

Unlikely, as Java’s string interning behavior was specified in the first edition of the JLS (1996). This predates C# as far as I know.


See section 3.10.5.

The JVM interns string literals under the hood, yes.

I've been writing Java for about 15 years. There hasn't been a single instance where I've used String.intern(). I'm starting to think that there is never an appropriate place for it. What do you get from it? Anything else other than the ability to reliably use == for String comparison instead of .equals()?

Nevermind the architecture astronauts saying that the string intern pool is actually global state and global state is always problematic.

You'll need to go back earlier than 15 years. ConcurrentHashMap and friends were added in 1.5, but String.intern has been in there since the beginning. Since Java shipped with threads in the standard library (but no memory model), it would have been very difficult to do concurrent String deduplication yourself. If you agree string deduplication is needed, then String.intern() was a good implementation for a long while.

String.intern() also offers some other benefits for certain use cases. Earlier versions of Java did not cache the String hash code, so using Strings as hash table keys meant hashing a lot more. But an interned string can be used in an IdentityHashMap, which was faster for a long portion of Java's early life.
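A rough sketch of the IdentityHashMap pattern, assuming every key is interned before it goes into the map. The class and method names are hypothetical.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical counter keyed by interned strings (names are illustrative).
final class InternedKeyLookup {
    // IdentityHashMap compares keys with == and hashes by identity,
    // so it never calls String.hashCode() or String.equals().
    private final Map<String, Integer> counts = new IdentityHashMap<>();

    /** Interns the key so that equal strings land on the same map entry. */
    void increment(String key) {
        counts.merge(key.intern(), 1, Integer::sum);
    }

    /** Finds an entry only when given the canonical (interned) instance. */
    Integer getByIdentity(String internedKey) {
        return counts.get(internedKey);
    }
}
```

The catch is visible in `getByIdentity`: a lookup with a non-interned but equal string silently misses, which is why this only works when every caller interns consistently.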

(I worked on a moderately popular Java library that targeted Java 1.5 as the minimum version. It does occasionally come up useful, but only in specific, and increasingly rare circumstances)

That would explain it - I think I first touched Java in 2009 or 2010 and this is the first I've heard of String.intern in all that time.

I used it a few times to great effect, many years ago, when dealing with very data-heavy Swing applications. We dealt with millions of rows of data, and some keys were strings with very low cardinality. Interning some of those columns led to major reductions in memory usage: tens of millions of copies of the same string aren't all that great for a 2005-era laptop, as headroom between garbage collections, especially back in the day, could make a big difference. As for the strings being comparable with == instead of .equals, I never bothered trying to play those games.

Now, could we have solved this in ways other than interning? Absolutely: just make the deserialization code do mostly the very same thing, but done by hand and on the heap.

In today's JVM, if you are using the G1 garbage collector, you can get 99% of the memory savings almost completely for free with a VM parameter: You still pay for an extra object, just pointed to the same character array underneath. I'd argue it's one of the most underutilized flags out there: It's not uncommon to end up with half the memory footprint in many use cases, and memory savings are often runtime savings too. Now that we have this, maybe intern always loses, but it was definitely useful in the right circumstances.

I never understood it, either. It would be useful (at least historically) in cases where you’re willing to give up performance in exchange for lower memory usage, and then, you’d still have to make good choices on what strings to intern and what strings not to intern. (I think the standard example is an XML parser that may choose to intern all node and attribute names of an XML schema. If you’re parsing a large file using a DOM parser, that can significantly decrease memory usage.)

Problem 1: it can easily depend on your input data whether interning is necessary to prevent out of memory exceptions.

Problem 2: there often are better solutions to memory usage than interning strings (for example, by replacing that DOM parser by a streaming one)

Problem 3: if you’re writing a library, you can’t know anything about the application that will use your code, so you have to guess whether interning some of the strings you create will help the user.

And of course, that’s historically. (Some) modern JVMs already deduplicate strings.

I don't think using String::intern makes sense, especially now that Java's garbage collector is capable of deduplicating strings (https://openjdk.org/jeps/192). In the past it could have been used to reduce memory usage when a given string was used a lot, but now there are better ways of dealing with that issue.

G1's deduplication is nice, but note that G1's deduplication is a lot weaker than what String.intern does. G1 deduplicates the underlying byte array, but leaves separate strings (so s1 == s2 will evaluate false). So you still have an extra object header.

If you have (like one of our applications did) millions of copies of the string "USA" in memory, that's many megabytes of memory that explicit deduplication can save that the garbage collector can't.

String.intern isn't the way, for all the reasons this post outlines, but just using G1 isn't the right approach either.

IIRC all of the concurrent GCs can dedupe now. Not just G1.

Hopefully soon object headers will be negligible with progress from Lilliput though.

Strings have an object header, an int for the hashcode and a pointer to the array. Assuming a < 32 GB heap (so 4 byte pointers), that's 24 bytes for the string, even once the array is deduped. Lilliput is awesome, but an 8 byte header would only reduce that to 16 bytes.
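To put a number on that, a back-of-the-envelope sketch using the per-String figures quoted above (24 bytes today, 16 bytes with an 8-byte header). The count of 10 million copies is a made-up example value, and the class name is hypothetical.

```java
// Back-of-the-envelope arithmetic; per-object sizes are the figures from the comment above.
final class StringOverheadEstimate {
    static final long BYTES_PER_STRING = 24;               // header + hash + array reference, padded
    static final long BYTES_PER_STRING_SMALL_HEADER = 16;  // same fields with an 8-byte header

    /** Total bytes spent on String wrapper objects once the backing arrays are shared. */
    static long overheadBytes(long copies, long bytesPerString) {
        return copies * bytesPerString;
    }
}
```

With 10 million copies, that is 240,000,000 bytes of wrapper objects at 24 bytes each, or 160,000,000 bytes at 16 bytes each, none of which array-level deduplication can reclaim.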

One common thing I've seen is an attempt to de-duplicate a poorly-crafted cache.

Let's say you have a cache with Map<Language, BusinessInfo>. After a while, the cache is quite large, and you want to reduce its memory usage. You realize within BusinessInfo, you have a MailingAddress, but it's not actually localized. You're duplicating millions of address lines for no reason.

You could split this out of your cache, but pulling that thread is a bit tricky. Instead, you decide to store addressLine.intern().

It allows you to do this:

    synchronized (bob.intern()) {
        // This block executes in parallel for requests with different values of bob,
        // but sequentially for requests that have the same value of bob.
    }
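For comparison, a sketch of a different technique that gives the same per-key exclusion without locking on objects from the shared intern pool, which any other code in the JVM could also lock on. This swaps in a private `ConcurrentHashMap` of lock objects; the class and method names are hypothetical, and the map is never cleaned up in this sketch.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical per-key lock registry (an alternative to synchronizing on interned strings).
final class KeyedLocks {
    // One private lock object per distinct key value; entries are never removed here.
    private final ConcurrentHashMap<String, Object> locks = new ConcurrentHashMap<>();

    /** Returns the same lock object for all keys that are equal by value. */
    Object lockFor(String key) {
        return locks.computeIfAbsent(key, k -> new Object());
    }

    /** Runs the action while holding the lock for the given key. */
    void runExclusively(String key, Runnable action) {
        synchronized (lockFor(key)) {
            action.run();
        }
    }
}
```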

If I didn't have to integrate with a weird 3rd party API where this was necessary, I wouldn't know about String.intern()...

I think using == for strings is a mistake even if they are interned, except of course in some very controlled circumstances internally in a library if it gives a performance benefit. I don't want to look at my code and need to know if a string is interned to know if it is correct.

I've seen it used in an old code base so that a specialized Map implementation could be used with String keys.

I'm not even sure of the name of the library it came from or what it was called - maybe "ObjectHashMap" or something like that - but its semantics were that it used reference equality instead of the structural equality used by a regular HashMap. As a result map operations depended on base Object.hashCode and reference equality which are constant and presumably very quick operations.

Using a regular HashMap, the map operations would depend on String.hashCode() and String.equals() which are O(n) in the length of the String - although the return value from hashCode is cached.

On reflection, and given that this was an old code base, I'm somewhat sceptical that this was a win, since String-keyed HashMaps are so common that I'd imagine a lot of JVM optimisation effort has gone into this area. I guess it would depend on the pattern of use. If the universe of String keys was known in advance, then I guess it operated like a poor man's perfect hash. On the other hand, this is the kind of optimization in Java which I've found to be unstable, with performance varying significantly between JVM versions and platforms.

> Anything else other than the ability to reliably use == for String comparison instead of .equals()?

Sometimes that's key.

Why not use an int or some other token? Well then you have to garbage collect yourself. Maybe use your own class for the token though instead of a string? But then you still have to do half the garbage collection, as you need a weak map for lookup and possibly some cleaning infrastructure.
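A minimal sketch of the "weak map for lookup" idea, assuming a `WeakHashMap` whose values are weak references back to the canonical instance, so an entry disappears once nothing else holds the token. The class name `WeakInterner` is hypothetical, and the coarse `synchronized` is a simplification rather than a tuned design.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical weak interner: canonical instances are collectable once unreferenced.
final class WeakInterner<T> {
    // Key and value both refer weakly to the canonical instance,
    // so the pool itself never keeps a token alive.
    private final Map<T, WeakReference<T>> pool = new WeakHashMap<>();

    /** Returns the canonical instance equal to value, registering value if needed. */
    synchronized T intern(T value) {
        WeakReference<T> ref = pool.get(value);
        T canonical = (ref != null) ? ref.get() : null;
        if (canonical == null) {
            pool.put(value, new WeakReference<>(value));
            canonical = value;
        }
        return canonical;
    }
}
```

The WeakHashMap handles the cleaning of stale entries here, which is the "cleaning infrastructure" part you would otherwise have to write yourself.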

The appropriate way to think of it is as an optimization that you will almost certainly never need. If you do, you'd typically confirm that by profiling your code. And it would merely be one of several other strategies that you could employ to save a bit of memory and cpu. I'm sure there are some niche domains where every ms of performance matters enough that you'd go through a lot of trouble for very marginal gains in performance. But otherwise, this should not be something you'd ever need.

I've used String.intern() in genetic data analysis to reduce memory usage by a large factor. Lots of repeated Strings like "AA", "AB", "BB", etc.

I used it a long time ago in a program that read large structured text files. What was 100s of MB of text in memory became a fraction in size after interning the tokens.

Once upon a time I found string interning useful in conjunction with IdentityHashMap for quick lookup of lots of strings.


> In almost every project we were taking care of, removing String.intern() from the hotpaths, or optionally replacing it with a handrolled deduplicator, was the very profitable performance optimization.

> Do not use String.intern() without thinking very hard about it, okay?

Overall fantastic article covering intern() in depth!

The only thing I'd like to see added would be visual graphs instead of numeric tables for the benchmark results.

I've long felt that String.intern is a primordial vestige of a really old Java landscape that doesn't quite make sense anymore. It can't be removed because of backwards compatibility, but I don't understand when or where you would want to use it in today's landscape.

Having worked in the JVM ecosystem building low latency systems for over a decade, I always aspire to have the minute knowledge Aleksey has. I wish there was a book/MOOC for more senior engineers, but I've found myself content and happy with the Anatomy Quark series.

I share the feeling that I wish there were more resources out there. I've collected a motley set at https://justinblank.com/notebooks/jvmarchitecture.html, but I haven't found anything systematic.

This is the one that got me started in low latency Java a decade ago: http://blog.vanillajava.blog/. I've spoken with Peter a couple of times; he's a phenomenal technologist.

Vanilla Java is a great resource. The work that Chronicle does got me into low latency Java as well (not doing java now though).

Some other great resources I used:

* http://psy-lob-saw.blogspot.com/ - Not updated since 2018, but still great stuff in there

* https://richardstartin.github.io/ - A few posts per year. Richard does a lot of interesting things.

https://shipilev.net/jvm-anatomy-park yields 404. Maybe that should link to the TFA now?

Thanks, will update later. https://shipilev.net/jvm/anatomy-quarks/ is the address.

Why is Java used in low latency systems? Isn't C++ a better choice there?

It works 24/7, I guess. Reliability is the key -- it stays running.

It's got a vast array of third-party libraries available. It performs pretty well even in a low-latency environment.

Developers can spend time solving actual business problems, rather than waiting to compile, fooling around with memory allocation, templating problems & inefficient STL data structures, or badly handrolling data structures.

There is a little extra work to stay within a latency envelope, but there are definitely productivity boosts from just forgetting all the old C legacy and being able to work at a higher level of thinking.

Even better, use the JVM but try Kotlin as a language.

C++ is actually an esoteric language, not unlike Haskell or Prolog. It is just a widely used one, for historical reasons.

Meanwhile, you may just get things done in Java without giving much thought.

When there are orders of magnitude more people working with C++ compared to Haskell or Prolog, I would hesitate to call it "esoteric." The problem with C++ is everyone uses their own little dialect, ranging from little more than C with classes / nicer syntax, to every new "modern" feature under the sun.

Not core to the article's ultimate point and conclusion, but in the premise it has:

> The performance is at the mercy of the native HashTable implementation, which may lag behind what is available in high-performance Java world, especially under concurrent access.

Does anyone have any idea what this is referring to, if anything? Java's HashMap performance is pretty mediocre and without value types seems un-improvable. Certainly nowhere close to the SIMD-accelerated, cache-line-optimized hash maps you can find in C++, like the one in abseil or F14.

It's a good article otherwise; this note just comes across more as wishful thinking than as a premise with any truth behind it.

The "native HashTable" referred to here is the string hash table that's inside the JVM. The JVM has to intern strings from class files' string literals here, so it's all implemented in native code. The comment about performance probably refers to Java constantly improving in performance because of improvements to the JIT compiler, whereas that native HashTable in the JVM is pretty static in performance unless somebody rewrites it or if the C++ optimization gets significantly better. Essentially Shipilëv is saying that Java code gets faster more quickly than the C++ code in the JVM.

JITs don't improve a hashtable's performance, as it is dominated by memory access patterns (e.g., open vs closed addressing), load factor, and hash quality.

Of those 3 factors, the first is borderline impossible to optimize in Java at all. Regardless, the runtime isn't helping with any of those 3, it requires changes to the algorithms used. Which doesn't seem to be happening in Java land broadly?

It's possible that the JVM just ignores its c++ code, sure, but that's not commentary on Java or the Java ecosystem.

Also offline compiler optimization has so far consistently outpaced (or at least kept up with) JIT optimizations. So it's incorrect to consider the native code as static anyway unless you're just never upgrading the compiler.

For anyone that uses XML stuff in Java, you are probably aware of String.intern memory and performance issues. I don't know if that stuff persists today, but a ton of XML parsing goes through that method.

I would also like a comparison with the String deduplication option (`-XX:+UseStringDeduplication`).

From what I understand it doesn't use String.intern but makes different instances share the same underlying char array so I am wondering how it behaves.

In my tests it gives nice memory savings "for free", but I never had the chance to test with a production workload

Huh. I assumed that all languages with immutable strings did this sort of thing transparently

Edit: according to another comment, it sort of can

> In today's JVM, if you are using the G1 garbage collector, you can get 99% of the memory savings almost completely for free with a VM parameter: You still pay for an extra object, just pointed to the same character array underneath.
