If all you're doing is deduplicating string objects in order to save memory, the article makes a good case for using `HashMap` or `ConcurrentHashMap` instead of `String.intern`.
The reason to use `String.intern` is in the fairly narrow case where you want to use `==` to compare strings, and some of these strings originated as string literals in Java source code. Java specifies that string literals for equal strings are always the same Java object, even if they occur in different class files that are compiled separately. This is accomplished by interning the strings, and `String.intern` provides user-level access to this same intern pool.
See section 3.10.5 of the Java Language Specification.
Never mind the architecture astronauts saying that the string intern pool is actually global state and global state is always problematic.
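To make the literal rule concrete, here is a minimal sketch. The class and method names are made up for illustration; only `String.intern` and `==` come from the comment above.

```java
class LiteralIdentityDemo {
    // Builds a fresh heap String with the same contents as s.
    // new String(char[]) always allocates a new object, so the result
    // is never the pooled literal itself.
    static String freshCopy(String s) {
        return new String(s.toCharArray());
    }

    public static void main(String[] args) {
        String literal = "hello";
        String sameLiteral = "hello";   // equal literals refer to one object
        String copy = freshCopy(literal);

        System.out.println(literal == sameLiteral);    // true
        System.out.println(literal == copy);           // false: different objects
        System.out.println(literal.equals(copy));      // true: same contents
        System.out.println(literal == copy.intern());  // true: intern returns the pooled instance
    }
}
```

So `==` only becomes safe once every string that did not start life as a literal has been passed through `intern()`.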
String.intern() also offers some other benefits for certain use cases. Earlier versions of Java did not cache the String hash code, so using Strings as hash table keys meant a lot more hashing. But an interned string can be used in an IdentityHashMap, which was faster for a long portion of Java's early life.
(I worked on a moderately popular Java library that targeted Java 1.5 as the minimum version. It does occasionally come in useful, but only in specific and increasingly rare circumstances.)
Now, could we have solved this in ways other than interning? Absolutely: Just make deserialization code do mostly the very same thing, but done by hand and on the heap.
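A rough sketch of the IdentityHashMap trick, for anyone who hasn't seen it. `InternedKeyLookup` and its methods are hypothetical names, not from any particular library. The catch it tries to show is that every key has to be interned on the way in and on the way out, otherwise lookups silently miss.

```java
import java.util.IdentityHashMap;
import java.util.Map;

class InternedKeyLookup {
    // IdentityHashMap compares keys with == and hashes with
    // System.identityHashCode, so String.hashCode/equals are never called.
    private final Map<String, Integer> counts = new IdentityHashMap<>();

    void increment(String key) {
        counts.merge(key.intern(), 1, Integer::sum);
    }

    // Correct lookup: intern first so we hold the canonical instance.
    int get(String key) {
        return counts.getOrDefault(key.intern(), 0);
    }

    // Pitfall: a non-interned but equal string is a different object,
    // so the identity lookup does not find it.
    int getWithoutInterning(String key) {
        return counts.getOrDefault(key, 0);
    }

    public static void main(String[] args) {
        InternedKeyLookup lookup = new InternedKeyLookup();
        lookup.increment("USA");
        String copy = new String("USA".toCharArray());
        System.out.println(lookup.get(copy));                 // 1
        System.out.println(lookup.getWithoutInterning(copy)); // 0
    }
}
```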
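For reference, "by hand and on the heap" can be as small as the following. This is a minimal sketch, assuming a `ConcurrentHashMap`-backed pool; `StringDeduplicator` and `dedup` are names I made up.

```java
import java.util.concurrent.ConcurrentHashMap;

class StringDeduplicator {
    // Maps each distinct string to the first instance we saw of it.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    // Returns the canonical instance for s, registering s if it is new.
    String dedup(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing == null ? s : existing;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        StringDeduplicator d = new StringDeduplicator();
        String a = d.dedup(new String("USA".toCharArray()));
        String b = d.dedup(new String("USA".toCharArray()));
        System.out.println(a == b);    // true: the second copy can now be collected
        System.out.println(d.size());  // 1
    }
}
```

Note that this pool holds strong references, so nothing in it is collected until the pool itself is dropped. Scoping one instance to a single deserialization pass is one way to keep that under control.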
In today's JVM, if you are using the G1 garbage collector, you can get 99% of the memory savings almost completely for free with a VM parameter: You still pay for an extra object, just pointed to the same character array underneath. I'd argue it's one of the most underutilized flags out there: It's not uncommon to end up with half the memory footprint in many use cases, and memory savings are often runtime savings too. Now that we have this, maybe intern always loses, but it was definitely useful in the right circumstances.
Problem 1: it can easily depend on your input data whether interning is necessary to prevent out of memory exceptions.
Problem 2: there often are better solutions to memory usage than interning strings (for example, by replacing that DOM parser by a streaming one)
Problem 3: if you’re writing a library, you can’t know anything about the application that will use your code, so you have to guess whether interning some of the strings you create will help the user.
And of course, that’s historically. (Some) modern JVMs already deduplicate strings.
If you have (like one of our applications did) millions of copies of the string "USA" in memory, that's many megabytes of memory that explicit deduplication can save that the garbage collector can't.
String.intern isn't the way, for all the reasons this post outlines, but just using G1 isn't the right approach either.
Hopefully soon object headers will be negligible with progress from Lilliput though.
Let's say you have a cache with Map<Language, BusinessInfo>. After a while, the cache is quite large, and you want to reduce its memory usage. You realize within BusinessInfo, you have a MailingAddress, but it's not actually localized. You're duplicating millions of address lines for no reason.
You could split this out of your cache, but pulling that thread is a bit tricky. Instead, you decide to store addressLine.intern().
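Roughly what that looks like, as a sketch. The class shapes here are guesses at the scenario: `Language`, `BusinessInfo` and `MailingAddress` come from the description above, while the fields, `buildCache` and the sample data are invented for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

class AddressInterningSketch {
    enum Language { EN, DE, FR }

    static final class MailingAddress {
        final String addressLine;

        MailingAddress(String addressLine) {
            // The one-line change: store the pooled instance, not the caller's copy.
            this.addressLine = addressLine.intern();
        }
    }

    static final class BusinessInfo {
        final String localizedName;
        final MailingAddress address;

        BusinessInfo(String localizedName, String addressLine) {
            this.localizedName = localizedName;
            this.address = new MailingAddress(addressLine);
        }
    }

    // Simulates each localized record arriving with its own copy of the address.
    static Map<Language, BusinessInfo> buildCache(String rawAddress) {
        Map<Language, BusinessInfo> cache = new EnumMap<>(Language.class);
        for (Language lang : Language.values()) {
            String copy = new String(rawAddress.toCharArray());
            cache.put(lang, new BusinessInfo("name-" + lang, copy));
        }
        return cache;
    }

    public static void main(String[] args) {
        Map<Language, BusinessInfo> cache = buildCache("1 Main Street");
        // All languages now share one address string.
        System.out.println(cache.get(Language.EN).address.addressLine
                == cache.get(Language.DE).address.addressLine); // true
    }
}
```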
//This block executes in parallel for requests with different values of bob, but sequentially for each request that has the same value of bob.
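The comment above describes what you get from locking on an interned string. A sketch under that assumption (the original snippet is not shown here, so `PerKeyLocking` and `withKeyLock` are hypothetical names, and only `bob` comes from the comment):

```java
import java.util.function.Supplier;

class PerKeyLocking {
    // Runs body while holding the monitor of the interned form of bob.
    // Equal strings intern to the same object, so requests with equal
    // values of bob serialize, while different values use different monitors.
    static <T> T withKeyLock(String bob, Supplier<T> body) {
        synchronized (bob.intern()) {
            return body.get();
        }
    }

    public static void main(String[] args) {
        String fromRequest = new String("bob-42".toCharArray());
        boolean held = withKeyLock(fromRequest, () -> Thread.holdsLock("bob-42"));
        System.out.println(held); // true: the literal is the same pooled object
    }
}
```

The caveat is that the intern pool is JVM-wide, so any unrelated code that synchronizes on an equal interned string contends on the same monitor.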
If I didn't have to integrate with a weird 3rd party API where this was necessary, I wouldn't know about String.intern()...
I'm not even sure of the name of the library it came from or what it was called - maybe "ObjectHashMap" or something like that - but its semantics were that it used reference equality instead of the structural equality used by a regular HashMap. As a result map operations depended on base Object.hashCode and reference equality which are constant and presumably very quick operations.
Using a regular HashMap, the map operations would depend on String.hashCode() and String.equals() which are O(n) in the length of the String - although the return value from hashCode is cached.
On reflection (and this was an old code base), I'm somewhat sceptical that this was a win, given that String-keyed HashMaps are so common that I'd imagine a lot of JVM optimisation effort has gone into this area. I guess it would depend on the pattern of use. If the universe of String keys was known in advance, then I guess it operated like a poor man's perfect hash. On the other hand, this is the kind of optimization in Java which I've found to be unstable, with performance varying significantly between JVM versions and platforms.
Sometimes that's key.
Why not use an int or some other token? Well then you have to garbage collect yourself. Maybe use your own class for the token though instead of a string? But then you still have to do half the garbage collection, as you need a weak map for lookup and possibly some cleaning infrastructure.
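To give a feel for the "weak map for lookup" half, here is a minimal sketch of a weak canonicalizing pool for strings. `WeakInterner` and `canonical` are made-up names, and this is one possible shape, not a recommendation. A custom token class would additionally need the cleaning infrastructure mentioned above.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

class WeakInterner {
    // Key and value refer to the same string, both weakly, so an entry
    // disappears once nothing else references the canonical instance.
    private final Map<String, WeakReference<String>> pool = new WeakHashMap<>();

    synchronized String canonical(String s) {
        WeakReference<String> ref = pool.get(s);
        String existing = (ref == null) ? null : ref.get();
        if (existing != null) {
            return existing;
        }
        pool.put(s, new WeakReference<>(s));
        return s;
    }

    public static void main(String[] args) {
        WeakInterner interner = new WeakInterner();
        String a = interner.canonical(new String("token".toCharArray()));
        String b = interner.canonical(new String("token".toCharArray()));
        System.out.println(a == b); // true while a is still strongly reachable
    }
}
```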
> In almost every project we were taking care of, removing String.intern() from the hotpaths, or optionally replacing it with a handrolled deduplicator, was the very profitable performance optimization.
> Do not use String.intern() without thinking very hard about it, okay?
Overall fantastic article covering intern() in depth!
The only thing I'd like to see added would be visual graphs instead of numeric tables for the benchmark results.
Some other great resources I used:
* http://psy-lob-saw.blogspot.com/ - Not updated since 2018, but still great stuff in there
* https://richardstartin.github.io/ - A few posts per year. Richard does a lot of interesting things.
It's got a vast array of third-party libraries available. It performs pretty well even in a low-latency environment.
Developers can spend time solving actual business problems, rather than waiting to compile, fooling around with memory allocation, templating problems & inefficient STL data structures, or badly handrolling data structures.
There is a little extra work to stay within a latency envelope, but there are definitely productivity boosts from just forgetting all the old C legacy and being able to work at a higher level of thinking.
Even better, use the JVM but try Kotlin as a language.
Meanwhile, you may just get things done in Java without giving much thought.
> The performance is at the mercy of the native HashTable implementation, which may lag behind what is available in high-performance Java world, especially under concurrent access.
Does anyone have any idea what this is referring to, if anything? Java's HashMap performance is pretty mediocre and without value types seems un-improvable. Certainly nowhere close to the SIMD-accelerated, cache-line-optimized hash maps you can find in C++, like the one in abseil or F14.
It's a good article otherwise, just this note comes across more as wishful thinking than as a premise with any truth behind it.
Of those 3 factors, the first is borderline impossible to optimize in Java at all. Regardless, the runtime isn't helping with any of those 3, it requires changes to the algorithms used. Which doesn't seem to be happening in Java land broadly?
It's possible that the JVM just ignores its C++ code, sure, but that's not commentary on Java or the Java ecosystem.
Also offline compiler optimization has so far consistently outpaced (or at least kept up with) JIT optimizations. So it's incorrect to consider the native code as static anyway unless you're just never upgrading the compiler.
From what I understand it doesn't use String.intern but makes different instances share the same underlying char array so I am wondering how it behaves.
In my tests it gives nice memory savings "for free", but I never had the chance to test with a production workload.
Edit: according to another comment, it sort of can
> In today's JVM, if you are using the G1 garbage collector, you can get 99% of the memory savings almost completely for free with a VM parameter: You still pay for an extra object, just pointed to the same character array underneath.