

String Deduplication – A new feature in Java 8 Update 20 - thescrewdriver
https://blog.codecentric.de/en/2014/08/string-deduplication-new-feature-java-8-update-20-2/

======
TheLoneWolfling
Related: I'm still frustrated by Java switching to copy-on-substring (in a
minor release, no less!). If I, a random nobody, had something whose running
time increased by a factor of ~50 (a simple recursive-descent-ish parser for
coercing tabulated data from one format to another - it just called substring
to trim off the first token repeatedly), how many dev hours were required
overall to fix the results of the change? And there's no simple alternative
or way to preserve the old behavior, either - the simplest one, rolling your
own String class or wrapper, ends up being relatively slow and annoying.

And "all" of this would be solved by having a proper way of doing array
slicing - for things like substring's previous worst case (something like a
single character being referenced in a substring holding up a gigabyte+-sized
string) Java's garbage collector could recognize that the array was only
referenced through slices of part of the array, copy that section into another
array (updating references to it), and free the large one.

Also, Java's lack of a way to specify that a class is immutable (and that all
children classes thereof must also be) is frustrating. Because optimizations
like this can and should apply to more than just strings!
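For reference, the workaround I ended up with - tracking an offset into the
original string instead of calling substring on the remainder - looks
something like this (an illustrative sketch, not my actual parser; the class
and method names are made up):

```java
// Sketch of a workaround for post-7u6 substring behavior: instead of
// repeatedly calling substring() on the remainder (which now copies the
// whole tail each time, making token-by-token trimming O(n^2)), keep an
// index into the original string and only copy out each short token.
public class OffsetTokenizer {
    private final String input;
    private int pos = 0;

    public OffsetTokenizer(String input) {
        this.input = input;
    }

    /** Returns the next space-delimited token, or null at end of input. */
    public String next() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++;
        if (pos >= input.length()) return null;
        int start = pos;
        while (pos < input.length() && input.charAt(pos) != ' ') pos++;
        return input.substring(start, pos); // one short copy per token
    }
}
```

It works, but it's exactly the kind of bookkeeping that proper array slicing
would make unnecessary.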

~~~
pjmlp
Those changes are only specific to Oracle's implementation, there are lots of
others to choose from.

Most of those changes don't have any relation whatsoever with the Java
Language Specification or Java Virtual Machine Specification.

~~~
thescrewdriver
In theory yes. In practice 99.99% of mainstream Java developers will run their
code on Oracle's JVM.

~~~
pjmlp
I doubt only 0.01% accounts for:

- Real-time deployment scenarios

- J2EE containers, which tend to work best with the respective vendor JVM

- factory control JVMs

- car infotainment JVMs

- smart card JVMs

- Compilation to native code

- Android

- Embedded deployments like MicroEJ

- Commercial UNIX systems

- Mainframes

- ...

------
phunge
Here's a counterargument for this, just to be a fuddy-duddy:

One of the key activities in programming is reasoning about time and space
cost, and this is a space optimization that's _nondeterministic_. It kicks in
sometimes, or sometimes not at all, and happens behind the scenes at garbage
collection time when it's nearly invisible. If you're sloppy, your program may
have a huge asymptotic space usage, and this may paper over it. But the impl
has heuristics, it may not work all the time -- even their example program
needed Thread.sleep() calls! Unpredictable semantics help nobody. So I always
liked explicit string interning (whatup, Lisp!).

All the same, faster is better and I'm sure this makes things faster.

Oh and can we talk about how broken it was that Java 6 and under had a fixed
size pool for .intern()'d strings?
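For anyone who hasn't used it: explicit interning is a one-liner and fully
deterministic - you pay the lookup cost exactly where you call it, and the
result is observable:

```java
// Minimal sketch of explicit interning, the deterministic alternative:
// String.intern() returns one canonical instance per distinct contents,
// so equal strings collapse at the point you choose, not at some future
// GC cycle.
public class InternDemo {
    public static void main(String[] args) {
        String a = new String("hello"); // distinct heap instance
        String b = new String("hello"); // another distinct instance
        System.out.println(a == b);                   // false: two objects
        System.out.println(a.intern() == b.intern()); // true: one canonical object
    }
}
```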

~~~
brandonbloom
Your applications already sit upon a mountain of non-determinism.

A typical web app runs against a database with a genetic query optimizer, on
top of a VM with a concurrent generational GC, sharing virtual memory with a
dozen other processes, arranged in a massive pyramid of caches, which are
competing for CPU time from a multi-core monstrosity of a data-flow engine.

The sooner we as programmers embrace stochastic methods, the better.

------
hibikir
This sounds extremely valuable, if only because it requires far less tuning
than interning strings.

Not so long ago I had to do maintenance on a pretty large Swing application,
still stuck on Java 6, that was built around displaying huge amounts of data,
holding it all in memory, after retrieving it from some rather slow web
services. The poor thing easily ended up using about a couple of gigs, mostly
due to the many Strings it held, many of them pretty repetitive. While I tried
to reduce the memory footprint, I couldn't just intern everything coming from
the service: In Java 6, interned strings come from PermGen.

Using a hashmap to allocate everything would have been better than nothing,
but the dataset had a whole lot of strings that didn't repeat themselves, so
the hashmap would have been far bigger than we needed for the application.

What I ended up having to do was figure out where the data was the most
repetitive, and then intern only those strings.

I cut memory use by over 30%, but it took days of profiling, evaluating the
data and making simple code changes to get there, as opposed to just a runtime
flag. Now, I wonder how much better, or worse, it performs than the hashmap
solution in cases like the one I faced: a few hundred strings repeated tens of
thousands of times, and hundreds of thousands of strings with almost no
repetition.
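The hashmap approach, for reference, is just a map from each string to its
canonical copy - something like this sketch (not the actual application code):

```java
import java.util.HashMap;
import java.util.Map;

// Hand-rolled deduplication: a map from string contents to a canonical
// instance. Unlike intern(), this lives on the ordinary heap (no Java 6
// PermGen pressure) and can be dropped once loading is done. The catch,
// as above: for mostly-unique data the map costs more than it saves.
public class Deduper {
    private final Map<String, String> canonical = new HashMap<>();

    /** Returns the canonical instance for s, registering s if it's new. */
    public String dedup(String s) {
        String existing = canonical.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }
}
```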

~~~
blinkingled
Wow, that sounds very familiar, including the Swing app part! Except in my
case the data came from a database and the code used StringBuffers, leading to
massive amounts of duplication. I had to build a COW wrapper and profile the
app to find that 95% of the data was only ever read in normal use cases!

Thankfully Java tooling is fairly mature - things like MAT and OQL were of
great use in finding the memory hogs and leaks.

------
gioele
For those interested in the same feature in plain old C, have a look at the
DSO Howto, section 2.4.2 "Forever const" and the ELF section flags SHF_MERGE
and SHF_STRINGS.

With a little bit of magic, C compilers and linkers are even allowed to turn
(simplified example)

    const char *s1 = "some string";
    const char *s2 = "string";

into

    const char *s1 = "some string";
    const char *s2 = s1 + 5;

and place these constants in a read-only section shared between multiple
loaded instances of the same library or program.

~~~
riffraff
I think it's standard java behaviour to .intern string constants appearing in
code (i.e. since they are immutable, just share them). The new thing is that
the JVM is going to do this automatically for strings that _don't_ exist at
compile time, IIUC.
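e.g. (the first two results are guaranteed by the JLS, which requires string
constants to be interned):

```java
// Compile-time string constants are interned per the JLS, so equal
// literals share one object; equal strings built at runtime do not -
// although as of 8u20 the GC may quietly share their backing char[]s.
public class LiteralDemo {
    public static void main(String[] args) {
        String lit1 = "abc";
        String lit2 = "abc";
        String runtime = new StringBuilder("ab").append("c").toString();
        System.out.println(lit1 == lit2);         // true: same interned constant
        System.out.println(lit1 == runtime);      // false: fresh object
        System.out.println(lit1.equals(runtime)); // true: same contents
    }
}
```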

------
dj-wonk
From [http://openjdk.java.net/jeps/192](http://openjdk.java.net/jeps/192)

> Taking the above into account, the actual expected benefit ends up at around
> 10% heap reduction. Note that this number is a calculated average based on a
> wide range of applications. The heap reduction for a specific application
> could vary significantly both up and down.

~~~
tetha
I don't think an average is useful in any way here. It will reduce heap use by
some amount, since it reduces the number of char arrays backing N strings to
some number less than or equal to N, and no new arrays are allocated.

But in my team alone, we have services which deal with many equal but
transient strings, some services with long-living, mostly equal strings, some
services with long-living, radically different strings. In some of them, I
expect rather massive reductions, in others, interning probably has most of
the work done already and in others, there's no application or the GC will
handle the issue already.

~~~
asdfaoeu
The garbage collector needs to store its internal arrays.

------
nly
Boost flyweight is pretty useful for doing this in C++

[http://www.boost.org/doc/libs/1_56_0/libs/flyweight/doc/inde...](http://www.boost.org/doc/libs/1_56_0/libs/flyweight/doc/index.html)

------
karthikkolli
We once maintained a HashMap with the same String instance as both key and
value, to avoid duplication in a search application. Wouldn't that be more
beneficial than leaving it to the GC if the application uses more strings?

Edit: changed avoid deduplication to avoid duplication

~~~
chrisseaton
Why did you want to avoid deduplication? You can't even tell it's happened, as
it only works on the char[], which is internal to the String. Did you find it
didn't work as expected?

~~~
karthikkolli
It was in a typeahead search application built on 20GB of names. These names
have common first names and last names which were stored as different strings.
With deduplication, string memory was reduced to 20%

Will benchmark that application with -XX:+UseStringDeduplication
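For anyone else trying this: as documented for 8u20, deduplication requires
G1, so the full invocation is something like `java -XX:+UseG1GC
-XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics ...`. A
toy driver to run under those flags (illustrative, not my application):

```java
import java.util.ArrayList;
import java.util.List;

// Toy driver for observing string deduplication. Run with:
//   java -XX:+UseG1GC -XX:+UseStringDeduplication \
//        -XX:+PrintStringDeduplicationStatistics DedupDemo
// Deduplication only touches strings that survive a few GC cycles,
// hence the gc() call and the sleep before exiting.
public class DedupDemo {
    // Builds n runtime-constructed strings: many duplicates, all
    // distinct String objects with distinct char[]s until the GC
    // deduplicates them.
    static List<String> build(int n) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            names.add(new String("John" + (i % 100)));
        }
        return names;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> names = build(1_000_000);
        System.gc();
        Thread.sleep(1000); // give the concurrent dedup thread time to run
        System.out.println("held " + names.size() + " strings");
    }
}
```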

~~~
chrisseaton
So what was the downside of deduplication? Why did you want to avoid it?

~~~
karthikkolli
Sorry deduplication was a typo. Corrected

~~~
chrisseaton
Ah right. The reason they're doing it in the GC rather than in the mutator
threads is that it only has an impact on strings long lived enough to be
evacuated. Short lived strings don't get deduplicated, and probably don't need
to be. Without the GC I don't know how you'd automatically determine that it
was a good idea to deduplicate.

------
mbq
As a curiosity, global string cache is an old feature of R -- however strings
are matched upon creation rather than detected by the GC.

~~~
chrisseaton
The problem with that is every time you create a string you have to do the
work to look it up in the cache. The benefit of the JVM's approach here is
that it only bothers to deduplicate it if it is long-lived enough to be
evacuated.

~~~
mbq
Sure; in R strings are copied way more frequently than created, so it pays
off.

------
cr4zy
Does this mean "" == ""?

~~~
jontro
No, string deduplication takes place in the internal String char array. Each
String will still have its own object.

~~~
zackangelo
Not according to the article:

> In fact the String Deduplication is almost like interning with the exception
> that interning reuses the whole String instance, not just the char array.

~~~
chrisseaton
I think that might mean the opposite of what you think it does.

Interning reuses whole String instances. Deduplication is like interning with
the exception that it does not reuse the whole String instance, it just reuses
the char array. Therefore surely deduplication does not reuse String
instances.

------
pjmlp
This is only specific to the Oracle JVM.

