
String.intern() - dmit
https://shipilev.net/jvm-anatomy-park/10-string-intern/
======
Terr_
Every time I see String.intern() my mind leaps to the problem of new Java
programmers who are misled into this:

    
    
        String a = "hello";
        String b = "world";
        assert a != b;
        b = "hello";
        assert a == b;
        // OH NICE I'LL USE == FOR STRING COMPARISONS NOW
    

It works because identical string literals in source code are interned into a
single object, but that's a special case that won't apply to strings created
at runtime.
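
A minimal sketch of that pitfall, assuming a standard JVM. The class name `LiteralVsRuntime` and the helper `buildAtRuntime` are made up for illustration; the helper assembles the string at run time so it is not a compile-time constant:

```java
class LiteralVsRuntime {
    // Builds "hello" at run time, so the result is a fresh object
    // rather than the pooled literal.
    static String buildAtRuntime() {
        return new StringBuilder().append("hel").append("lo").toString();
    }

    public static void main(String[] args) {
        String literal = "hello";
        String built = buildAtRuntime();

        System.out.println(literal == "hello");        // true: same pooled literal
        System.out.println(literal == built);          // false: distinct object
        System.out.println(literal.equals(built));     // true: same contents
        System.out.println(literal == built.intern()); // true: canonical copy
    }
}
```

The `==` comparison only happens to work in the first case; `equals()` is the comparison that holds for any string.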

~~~
userbinator
_new Java programmers_

Or perhaps _interns_? ;-P

In fact, I've wondered about the origins of the term "intern" since I first
heard of it; the technique was already known to me and likely others as
"tokenising" before that, however, since compilers would often have to compare
identifiers and doing this to them made it much faster.

~~~
wtetzner
> In fact, I've wondered about the origins of the term "intern" since I first
> heard of it

I always assumed it came from internalize, or something like that.

~~~
derefr
I was curious, but a good amount of Googling didn't turn up any etymology.

I'll take a guess, though, that it's from Common Lisp, because CL's intern
function
([http://clhs.lisp.se/Body/f_intern.htm](http://clhs.lisp.se/Body/f_intern.htm))
is clearly part of a suite of names (intern, unintern, :internal, :external)
where, in context, "intern" makes complete sense as the name for the function.

~~~
smarks
Usage of "intern" in this sense is much older than Common Lisp. The earliest
reference I could find in a few minutes of searching is from 1974, from the
MACLISP Reference Manual:

[http://www.softwarepreservation.org/projects/LISP/MIT/Moon-M...](http://www.softwarepreservation.org/projects/LISP/MIT/Moon-MACLISP_Reference_Manual-Apr_08_1974.pdf)

See sections 6.3 and 6.4 and the glossary for definitions and uses of the
word.

Unfortunately they don't shed any light on where the term "intern" comes from.
Typical usage is that a string is «"interned" or registered in an obarray» but
it doesn't explain much more than that. Interning occurs most often at read
time, when a string read from the input is turned into an atom, so I'd guess
that "intern" is short for "internalize" or similar.

~~~
derefr
Well, I was about to say, the Common Lisp version makes enough sense (and the
MacLisp version might be equivalent, for all I know):

In CL, you don't really have a global "symbol table" or "atom table"; instead,
interning is about the runtime lexical scopes under which code compilation
happens, which consist of stacks of imported namespaces ("packages"), where
each namespace exists as a mutable object holding identifier "slots".

(intern foo) is a mutation on the current namespace that creates a local
variable "slot" for the identifier foo, shadowing any other foo that might
have been in lexical scope from an import. That is, (intern foo) makes foo
resolve _internally_ to the current namespace.

(unintern foo) is the complementary mutation; it drops the foo identifier from
the current namespace, allowing other symbols (slots in packages) that were
previously shadowed by it to re-appear.

------
derriz
This is really an unscientific claim but I ran a hand crafted/hacked benchmark
just to get a feeling for the numbers. For 5 to 35 character Strings, == is 20
to 40 times faster than String.equals().

Given that s1.equals(s2) if and only if s1.intern() == s2.intern() (assuming
you haven't filled the string table), then this looks like an opportunity for
a significant optimization.
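
The equivalence itself is stated in the String.intern() Javadoc. A small sketch that checks it over a handful of sample strings (the class and method names are hypothetical, and this says nothing about speed, only about correctness):

```java
class InternEquivalence {
    // Compares the canonical (interned) copies by reference.
    static boolean sameByIntern(String s1, String s2) {
        return s1.intern() == s2.intern();
    }

    // Returns true if equals() and the intern-based comparison agree on every pair.
    static boolean agreesOnAllPairs(String[] samples) {
        for (String a : samples) {
            for (String b : samples) {
                if (a.equals(b) != sameByIntern(a, b)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "alpha",
            new String("alpha"),                      // equal contents, different object
            String.valueOf(new char[] {'b', 'e', 't', 'a'}),
            "beta",
            ""
        };
        System.out.println(agreesOnAllPairs(samples));
    }
}
```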

Before doing this, I had hoped that String.equals might check if both were
"interned" and shortcut the character by character comparison if this was the
case by just comparing references. But interpreting the results of my rough
benchmark would suggest this isn't what is happening which would agree with
the source provided for the String.equals method.

Java String comparison is absolutely ubiquitous so I would have expected that
an optimization like this might have been considered?

Having said that, the supplied rt.jar source also suggests that the
String.hashCode() computation isn't cached/memoized. This strikes me as odd
given that Strings are immutable and are one of the most common key
types for Maps.

~~~
etatoby
You should compare the cost of one (or two) .intern() plus '==' against a
single .equals()

For example, if your code needs 10 .intern() and then performs 100 '==', that
is _possibly_ going to be faster than performing the equivalent 100 .equals()
(or maybe not)

But if you are comparing a single couple of strings, .equals() is always going
to be faster.

Not to mention that if you .intern() random strings, you risk running out of
PermGen space. Even in the case of 100 comparisons among 10 strings, you may
be better off using a custom HashMap.

You can find all this information in the article.

~~~
derriz
I'm not suggesting that .equals() should be implemented as two intern()
operations followed by a reference comparison, but rather that the
implementation could check whether both Strings happen to be already interned
and, if so, short-circuit to a reference comparison.

A programmer could then decide to intern _some_ non-constant Strings in their
application if it made sense (i.e. they were frequently used in comparisons or
as hash keys).

And I don't think you can run out of PermGen, as the String pool is a fixed
size - unless I am misreading that section of the article.

However, now that I've thought about it a bit more, I think the flaw in the
idea is unrelated to your points: currently it's not cheap to determine
whether a particular String is interned. The StringTable implementation would
have to be changed; otherwise it would require adding storage overhead to
every String, which would not be acceptable. This is also probably why
.hashCode() doesn't memoize its results.

------
filereaper
Total aside from main topic, I love shipilev's posts.

If there are other core JVM developers that have similar blogs, I'd love to
hear about them here.

------
lorenzosnap
We built an in-memory map and we were using String.intern for both keys and
values. We could see that we were saving lots of memory, but we had the
problems described in the article. We then built our own 'String.intern' by
using yet another static HashMap. It worked. It was the simplest alternative
and it just did the job. Thanks Aleksey for the nice article.
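
A rough sketch of what such a hand-rolled interner might look like. The class name `MapInterner` is hypothetical, it uses an instance map rather than a static one, and it assumes single-threaded use; entries are never evicted, so it only makes sense when the set of distinct strings is bounded:

```java
import java.util.HashMap;
import java.util.Map;

class MapInterner {
    // Canonical instances keyed by content. Plain HashMap: not thread-safe.
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first instance seen with these contents.
    String intern(String s) {
        String existing = pool.putIfAbsent(s, s);
        return existing == null ? s : existing;
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        MapInterner interner = new MapInterner();
        String first = interner.intern(new String("key"));
        String second = interner.intern(new String("key"));
        System.out.println(first == second); // both calls return the same object
        System.out.println(interner.size());
    }
}
```

For concurrent access, swapping the HashMap for a ConcurrentHashMap keeps the same `putIfAbsent` logic.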

------
emmelaich
I'd never seen the @Benchmark annotation before so I looked it up.

The blog author is also one of JMH's authors.

[http://openjdk.java.net/projects/code-tools/jmh/](http://openjdk.java.net/projects/code-tools/jmh/)

------
deepsun
Have been doing Java for 14 years so far, never ever needed .intern(). I
can imagine its use case, but it does seem like a pretty rare one.

~~~
35bge57dtjku
Deserializing tons of records all having one of a few values for a particular
field is an easy and probably fairly common use case.
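
A sketch of that use case, under made-up assumptions: records arrive as lines like "17,ACTIVE", and the `Row` type and `RecordDedup` class are invented for illustration. Each parsed field is a fresh String, so interning the low-cardinality field collapses the duplicates:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

class RecordDedup {
    // Hypothetical record: an id plus a status drawn from a few values.
    static final class Row {
        final int id;
        final String status;

        Row(int id, String status) {
            this.id = id;
            this.status = status;
        }
    }

    // split() allocates a new String per field; intern() maps equal
    // statuses onto one shared instance.
    static List<Row> parse(List<String> lines) {
        List<Row> rows = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            rows.add(new Row(Integer.parseInt(parts[0]), parts[1].intern()));
        }
        return rows;
    }

    // Counts distinct String objects (by identity, not contents) in the status field.
    static int distinctStatusObjects(List<Row> rows) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (Row r : rows) {
            seen.put(r.status, Boolean.TRUE);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Row> rows = parse(Arrays.asList("1,ACTIVE", "2,INACTIVE", "3,ACTIVE", "4,ACTIVE"));
        System.out.println(distinctStatusObjects(rows)); // 2 objects for 4 rows
    }
}
```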

------
Robotbeat
Is "Anatomy Park" a Rick and Morty reference?
[http://rickandmorty.wikia.com/wiki/Anatomy_Park_(episode)](http://rickandmorty.wikia.com/wiki/Anatomy_Park_\(episode\))

------
jwilk
Please consider adding "JVM Anatomy Park" to the title.

------
TheGuyWhoCodes
The code interns unique strings, which most likely isn't what
would happen in a real-world application (unless, you know... code without
thought); you'd usually intern strings with low variance. Not saying that it
won't be slower, but the memory usage might be lower.

~~~
wtetzner
Yeah, blindly interning strings isn't a good idea, but it can be useful for
certain use cases. For example, keys of maps where you know the keys won't
vary much. It can reduce memory usage, and improve lookup performance (since
doing a lookup requires a string comparison).

------
gravypod
This, and the few other articles up, are a great series. Having done Java
development now for 30% of my life these are some amazing pointers.

I'd love to buy a hard copy of these if they ever get up to a few dozen
articles. Would be good to give to middle-experience devs (like myself) in the
future.

~~~
AKluge
Some good content in the same vein, with 16 years' worth of articles:
[http://www.javaspecialists.eu/](http://www.javaspecialists.eu/)

------
relics443
"The performance is at the mercy of the native HashTable implementation, which
may lag behind what is available in high-performance Java world, especially
under concurrent access."

What native HashTable is used? Shouldn't the JVM be using an optimized one?

~~~
QuercusMax
I think by "native" they mean "implemented-in-C++-by-the-JVM", which
can potentially vary; not java.util.Hashtable, which should be pretty standard
across JVM implementations.

~~~
electrum
Yes, although Hashtable is legacy from JDK 1.0. New code should use HashMap or
ConcurrentHashMap.

~~~
Tharkun
Please be careful with blanket statements like this. HashMap, Hashtable and
ConcurrentHashMap behave differently in certain (subtle?) ways.
ConcurrentHashMap doesn't like NULL but is thread-safe. Hashtable is
synchronized, slow and thread-safe, but doesn't mind NULL. HashMap is not
thread safe and doesn't mind NULL.
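
These claims are easy to probe directly. Note that the java.util.Hashtable Javadoc says only non-null objects can be used as keys or values, so the sketch below should report that Hashtable rejects null just as ConcurrentHashMap does. The class name `NullKeyProbe` is made up for illustration:

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NullKeyProbe {
    // True if the map accepts a null key, false if put() throws NullPointerException.
    static boolean acceptsNullKey(Map<String, String> map) {
        try {
            map.put(null, "value");
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    // True if the map accepts a null value, false if put() throws NullPointerException.
    static boolean acceptsNullValue(Map<String, String> map) {
        try {
            map.put("key", null);
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("HashMap:           " + acceptsNullKey(new HashMap<>())
                + " / " + acceptsNullValue(new HashMap<>()));
        System.out.println("Hashtable:         " + acceptsNullKey(new Hashtable<>())
                + " / " + acceptsNullValue(new Hashtable<>()));
        System.out.println("ConcurrentHashMap: " + acceptsNullKey(new ConcurrentHashMap<>())
                + " / " + acceptsNullValue(new ConcurrentHashMap<>()));
    }
}
```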

Edit: pardon the dupes, on unreliable mobile link :-(

~~~
josefx
> Hashtable is synchronized, slow and thread-safe, but doesn't mind NULL.

The issue with Hashtable or rather all legacy collection classes is that the
API is rarely useful without additional synchronization. So you might as well
use a HashMap wrapped by Collections.synchronizedMap or use a
ConcurrentHashMap with a placeholder object instead of null.
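
A minimal sketch of the placeholder approach, assuming String values. The wrapper class `NullTolerantCache` and its sentinel are hypothetical names, not anything from the JDK:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class NullTolerantCache {
    // Sentinel stored in place of null, since ConcurrentHashMap rejects null values.
    private static final Object NULL = new Object();

    private final Map<String, Object> map = new ConcurrentHashMap<>();

    void put(String key, String value) {
        map.put(key, value == null ? NULL : value);
    }

    // Returns the stored value, which may be null; use containsKey()
    // to tell "mapped to null" apart from "absent".
    String get(String key) {
        Object v = map.get(key);
        return (v == null || v == NULL) ? null : (String) v;
    }

    boolean containsKey(String key) {
        return map.containsKey(key);
    }

    public static void main(String[] args) {
        NullTolerantCache cache = new NullTolerantCache();
        cache.put("a", null);
        System.out.println(cache.containsKey("a")); // key is present
        System.out.println(cache.get("a"));         // but maps to null
        System.out.println(cache.containsKey("b")); // never inserted
    }
}
```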

------
zde
String.intern() would suck much less if strings had an "IS_INTERNED" flag
which would prevent hashtable lookups for already interned strings. Really sad
given the insane overhead Java strings have.

~~~
leventov
No, it would suck more, because currently you have an option to avoid
string.intern() altogether (and that is what you should do), and pay nothing
for that in runtime. Another boolean flag may cost extra 4-8 bytes on the heap
for each String object, whether you use String.intern() or not.

------
pimlottc
> in OpenJDK, String.intern() is native, and it actually calls into JVM, to
> intern the String in the native JVM String pool.

How much of this also applies when using the standard Oracle JDK?

------
kristianp
It would be interesting to know where it's used, was it used in the JDK for
example?

------
kazinator
Also see:

XInternAtom (XWindow function)

RegisterClass (Windows)

------
zaroth
The instrumentation here is impressive. The amount of data inspection done
with just a few simple commands is a bit overwhelming. Frankly, I rarely hope
to find myself looking at this level of metrics.

There's a lot down there I like to take for granted. But more likely I try to
use methods like String.intern() exactly never.

Use code you know and understand. Frankly, use code you can trust. And who
would trust a method called String.intern() to do... exactly, what?

If you are writing a function to _do something_ the name of the function must
be the thing being done. What the heck is a 'static internalize'? The explicit
HashMap was a few lines of code, and it's the most basic and obvious, and
surprisingly performant approach. So definitely I agree you must use your own
HashMap and not a static internalizer.

