
Oracle Tunes Java's Internal String Representation - devBiscuit
http://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String
======
throwawaykf03
Reddit comment from the author of this change discussing the reasons behind
it:

[http://www.reddit.com/r/programming/comments/1qw73v/til_orac...](http://www.reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string/cdhb77f)

------
justinsb
Here is the patch (I believe):
[http://hg.openjdk.java.net/jdk8/tl-gate/jdk/rev/2c773daa825d](http://hg.openjdk.java.net/jdk8/tl-gate/jdk/rev/2c773daa825d)

I was quite surprised that we _always_ copy the array. I would have guessed
that .substring(1) (all but the first character) is very common, and that it
would be a win to share the array in some circumstances. A straw-man heuristic
would be "share the array if we're using at least half of it".

But they didn't do that. Anyone know if there's a public discussion?
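
For illustration only, here is a minimal sketch of that straw-man heuristic. The `Slice` class and its method names are hypothetical; this is not the JDK's `String`, just a toy type with the old offset/count layout that shares the backing array when the result covers at least half of it and copies otherwise.

```java
// Hypothetical toy type, not java.lang.String.
final class Slice {
    private final char[] value;  // backing array, possibly shared
    private final int offset;    // index of the first character in value
    private final int count;     // number of characters in this slice

    Slice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private Slice(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    Slice substring(int begin) {
        if (begin < 0 || begin > count) {
            throw new StringIndexOutOfBoundsException(begin);
        }
        int newCount = count - begin;
        // Straw-man heuristic: share if the result uses at least half
        // of the backing array, otherwise copy just the needed range.
        if (newCount >= value.length - newCount) {
            return new Slice(value, offset + begin, newCount);
        }
        char[] copy = java.util.Arrays.copyOfRange(
                value, offset + begin, offset + count);
        return new Slice(copy, 0, newCount);
    }

    // Exposed only so the sharing decision can be observed.
    boolean sharesArrayWith(Slice other) {
        return value == other.value;
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }
}
```

With this sketch, `new Slice("hello world").substring(1)` keeps the original array, while `.substring(9)` copies two characters and lets the original array be collected.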

~~~
hnriot
Does it matter with copy-on-write?

~~~
dietrichepp
What?

Java's strings are immutable, so copy-on-write is nonsense.

Performance tuning for mutable string types has been moving away from the
copy-on-write paradigm (it _used_ to be common in C++, but not any more). It's
a big mess there because the design of the C++ string class was truly botched.
It's much less of a mess in other languages, where you have different types
for mutable and immutable strings (Java, C#, Python, etc.).

------
TheLoneWolfling
So, I have to wonder. If I as a random nobody have been affected adversely by
this change (Parser that now takes >10x the time), what business applications
have been affected by this?

And I have a question: why could it not be done like this:

Java currently has weak references, soft references, and phantom references.
If there was an additional type of weak-like reference, that had a callback
with an object _before_ it was garbage collected, this conundrum would be
simple. Have each substring hold a weak-like reference to the character array.
If the parent string is garbage-collected, then do the copying like the new
behavior, but until then don't: use the old behavior.

~~~
MaulingMonkey
> If there was an additional type of weak-like reference, that had a callback
> with an object before it was garbage collected, this conundrum would be
> simple.

You've basically described weak references to objects with finalizers.
Unfortunately, I'd wager this is even more expensive, not less.

~~~
twic
No, TheLoneWolfling's idea can't be implemented with weak references with
finalizers. Firstly, finalizers only run after weak references are cleared (I
think - please correct me if I'm wrong!). Secondly, he needs GC to trigger
code that mutates the no-longer-referring object; the finalizer runs in the
no-longer-referred-to object.

He's right that a new VM-level hook would be required to implement his idea.
The benefits don't clearly justify making a change of that magnitude.

~~~
MaulingMonkey
Depends. I'm imagining e.g. your large parent string invasively giving its
sub-slices copies on finalization through its own list, in which case the
weak references clearing isn't much of an issue. Not easy to make fast and
thread safe however.

In C#, WeakReference can take a boolean parameter during construction to
specify if it tracks post-finalize or not (defaulting to the same behavior as
Java, which Google tells me is as you say: not), and you can re-register your
finalizer, but even so this kind of thing isn't what the GC is primarily
designed for and it will probably show. I'm not sure if you can re-register
finalizers in Java.

------
kenshiro_o
Back in the days when I used to work in low latency FX trading, we identified
this "feature" after noticing that the memory footprint of our process just
kept growing and growing, but we couldn't quite pin down the root cause.

A bit of profiling and then code-digging helped find this oddity. That was
probably an optimization where the side effects were not carefully considered.
Even the Javadoc is vague
([http://docs.oracle.com/javase/6/docs/api/java/lang/String.ht...](http://docs.oracle.com/javase/6/docs/api/java/lang/String.html#substring\(int\))):

 _Returns a new string that is a substring of this string. The substring
begins with the character at the specified index and extends to the end of
this string._

Now I see the benefit of the optimization so I am almost against removing it
(we coded our own substring at the time). How about Oracle provide a static
method to return a substring in O(1)?

I believe most Java devs don't even know about the current substring
implementation and most programs are not adversely impacted by it either.

~~~
brazzy
> How about Oracle provide a static method to return a substring in O(1)?

Not possible anymore, since they used the occasion to eliminate the offset and
length fields.
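
To make that concrete, here is a rough sketch of the two layouts as described in this thread. Both classes are simplified, hypothetical models (the real `java.lang.String` has more fields, checks, and constructors); the point is only that without `offset` and `count` a string can no longer describe a window into someone else's array.

```java
// Simplified, hypothetical models; not the actual JDK source.

// Old layout: a window (offset, count) over a possibly shared array.
final class OldLayoutString {
    private final char[] value;
    private final int offset;
    private final int count;

    OldLayoutString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    int length() {
        return count;
    }

    char charAt(int index) {
        return value[offset + index];
    }

    // O(1): the new object points into the same array.
    OldLayoutString substring(int begin) {
        if (begin < 0 || begin > count) {
            throw new StringIndexOutOfBoundsException(begin);
        }
        return new OldLayoutString(value, offset + begin, count - begin);
    }
}

// New layout: the string always owns exactly its own characters.
final class NewLayoutString {
    private final char[] value;

    NewLayoutString(char[] value) {
        this.value = value;
    }

    // Still O(1): an array's length is stored, not computed.
    int length() {
        return value.length;
    }

    char charAt(int index) {
        return value[index];
    }

    // O(n): with no offset field, the only option is to copy.
    NewLayoutString substring(int begin) {
        if (begin < 0 || begin > value.length) {
            throw new StringIndexOutOfBoundsException(begin);
        }
        return new NewLayoutString(
                java.util.Arrays.copyOfRange(value, begin, value.length));
    }
}
```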

------
exabrial
I wish Oracle would put back in compressed strings... where strings are
represented internally as byte[] rather than char[]. Huge memory and
performance advantage if your application only handles ASCII-like characters.

See discussions here:
[http://stackoverflow.com/questions/8833385/is-support-for-co...](http://stackoverflow.com/questions/8833385/is-support-for-compressed-strings-being-dropped)

~~~
meddlepal
Can't you implement a custom CharSequence for this where and when you need it?
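
Something along these lines, as a rough sketch. `AsciiSequence` is a made-up name, and it assumes the input really is 7-bit ASCII (it rejects anything else). Note that `subSequence` here shares the byte array, so it has the same retention trade-off as the old `substring`.

```java
// Hypothetical byte[]-backed CharSequence for ASCII-only text.
final class AsciiSequence implements CharSequence {
    private final byte[] bytes;  // one byte per character
    private final int start;     // inclusive
    private final int end;       // exclusive

    AsciiSequence(String s) {
        byte[] b = new byte[s.length()];
        for (int i = 0; i < b.length; i++) {
            char c = s.charAt(i);
            if (c > 127) {
                throw new IllegalArgumentException(
                        "non-ASCII character at index " + i);
            }
            b[i] = (byte) c;
        }
        this.bytes = b;
        this.start = 0;
        this.end = b.length;
    }

    private AsciiSequence(byte[] bytes, int start, int end) {
        this.bytes = bytes;
        this.start = start;
        this.end = end;
    }

    @Override
    public int length() {
        return end - start;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length()) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return (char) bytes[start + index];
    }

    // Shares the backing array instead of copying it.
    @Override
    public CharSequence subSequence(int from, int to) {
        if (from < 0 || to > length() || from > to) {
            throw new IndexOutOfBoundsException(from + ".." + to);
        }
        return new AsciiSequence(bytes, start + from, start + to);
    }

    @Override
    public String toString() {
        return new String(bytes, start, end - start,
                java.nio.charset.StandardCharsets.US_ASCII);
    }
}
```

The catch is that most APIs take `String`, not `CharSequence`, so every call to `toString()` at a boundary inflates the data back to `char[]`.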

~~~
exabrial
like... everywhere? :D

------
idunno246
We did the whole `new String(substring)` trick on a sparse CSV. Turned out we
had over a hundred MB of empty strings. Lots and lots of duplicates. So the
empty string is now special-cased to return a constant.
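
Roughly like this, as a sketch. The `Substrings` class and `compactSubstring` name are invented for the example. On JDKs with the old sharing `substring`, the `new String(...)` copy is what drops the reference to the big parent array; on JDKs with the new behavior `substring` already copies, so the extra `new String` is redundant there.

```java
// Hypothetical helper; names are illustrative only.
final class Substrings {
    private Substrings() {}

    static String compactSubstring(String s, int begin, int end) {
        // substring performs the bounds checks for us.
        String sub = s.substring(begin, end);
        if (sub.isEmpty()) {
            // Every empty field maps to the same interned constant.
            return "";
        }
        // Forces a fresh String that does not share the parent's array
        // on JDKs where substring shares it.
        return new String(sub);
    }
}
```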

------
mytummyhertz
so does this mean .length() is also now O(n) ?

~~~
jzwinck
Of course not: it would simply pass through to the underlying char[].length,
which is a field.

