
Java 7 changed the structure of String - antoaravinth
http://www.javaspecialists.eu/archive/Issue230.html
======
jaawn
Based on all of the coding "best practices" books/blogs I've read, it is
unwise to develop something which heavily depends on the internal
implementation of the String class (or any library class). Many other
commenters here are bemoaning the impacts the specific implementation of
String will have on their existing applications. However, we should be
designing based on the interface/contract presented, not specific
implementation details of interfaces/libraries. Any optimization based on
specific implementation details is "at your own risk."

If changes to the implementation of String could cause issues for you, you
really should be using a custom solution anyway. A quick and dirty option
would be to write a wrapper class which uses all of your preferred String
hacks in a central location, and using that in place of String throughout your
code. When String's implementation changes, you can update your hacks all in
one place.
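
A minimal sketch of such a wrapper (all names hypothetical): it funnels every substring call through one method, so if the cost model changes again there is a single place to adjust.

```java
// Hypothetical wrapper class; centralizes substring behaviour in one place.
public final class Str {
    private final String value;

    public Str(String value) { this.value = value; }

    // If String.substring's cost model changes again, only this method needs
    // to change (e.g. to force a copy, or to return a shared view instead).
    public Str substring(int begin, int end) {
        return new Str(value.substring(begin, end));
    }

    @Override public String toString() { return value; }
}
```

Calling code would then use Str wherever it previously relied on a particular String behaviour.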

Here is some relevant advice from The Pragmatic Programmer:
[https://pragprog.com/the-pragmatic-programmer/extracts/coincidence](https://pragprog.com/the-pragmatic-programmer/extracts/coincidence)

~~~
scott_s
Algorithmic complexity should be considered part of an API, because it is not
mere "internal implementation". It has observable effects from the outside.
When a function in an API goes from O(1) to O(n), that should be considered a
breaking change.

For the record, C++'s standard library has algorithmic complexity as a part of
the API. See the analogous std::basic_string::substr:
[http://en.cppreference.com/w/cpp/string/basic_string/substr](http://en.cppreference.com/w/cpp/string/basic_string/substr)

I've been bitten by std::list::size in pre-C++11 code, as it was allowed to be
O(n). But they were clear about it; the fault was mine. Now, it is guaranteed
to be O(1):
[http://en.cppreference.com/w/cpp/container/list/size](http://en.cppreference.com/w/cpp/container/list/size)

I think it's likely that the change to String.substring was actually a good
one. But I also agree with others that it was rolled out poorly, and should
probably not have been in a fixpack release, since from my perspective they
changed the API.

~~~
spiralpolitik
substring() was always defined in the 6 and 7 JavaDoc as:

"Returns a new string that is a substring of this string..."

Both implementations satisfy the API description so the API didn't change, the
implementation did.

Optimizing your code based on the internals of a supposedly opaque data
structure is a bad practice and if you get burned you only have yourself to
blame.

~~~
x0x0
In reality, performance often matters. Then there's no way to avoid needing to
understand what the implementation does. The Java 6 implementation was a big,
somewhat well-known gotcha. Read MB of data into a String, substring a couple
KB out of it, and tada! you're still holding all those MB.

You inescapably get one of two behaviours:

1 - substring returns a new string, creating lots of garbage since it's an
incredibly common operation,

or

2 - substring saves garbage in many cases by returning sub-chunks of the full
char array, but you will be unable to GC the full string as long as _any_
substring remains reachable

~~~
jaawn
or 3 - stop trusting the internal String implementation to be as efficient as
possible, write a wrapper you can update with the best method for the current
version, never look back :-)

~~~
x0x0
Why not reimplement the entire standard library so you understand and control
all the performance characteristics? This can be just like writing C in the
90s: start by writing a hashmap, a string, an arraylist or equivalent, etc.,
because your code is sensitive to every common data structure you use.

~~~
jaawn
Because, most of the time, performance doesn't matter enough to justify it.
Custom solutions are only necessary if you know performance is critical to
your application's purpose, and that you need to squeeze as much out as
possible. If that isn't the case, then you don't need to worry about the
implementation details, because, in general, any performance changes won't
matter.

------
stygiansonic
Here is an explanation of the change to `substring()`[0] and why it was done.
The change was done in Java 7u6.

In short, the previous way of keeping the same underlying character array and
just updating the {offset, count} indexes has a drawback in that if the
original string is large, it is prevented from being GC'd if one keeps a
reference to even a single substring generated from it.

So, it's a trade-off between the original and new behaviour; the original way
more or less caps the memory usage at the size of the original string, but at
the expense of not being able to GC it if even a single substring exists,
while the new way increases memory usage for each substring generated but does
not prevent any of the strings from being GC'd.

This is why the article's code example yields such a huge difference in memory
usage in Java 6 vs Java 7; it is effectively a sort of "anti-pattern" when
used against the new `substring()` method. (i.e. iterating through a large
string and generating lots of sub-strings)
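
The pattern in question can be sketched roughly as follows (sizes and names are illustrative): on Java 6 every chunk shared one char[]; on Java 7+ each call copies its own range.

```java
// Illustrative only: iterate a large string, generating many substrings.
// Java 6: each substring shares the parent's char[] (cheap, but pins it in memory).
// Java 7+: each substring copies its range (more allocation, but the parent can be GC'd).
public class SubstringLoop {
    static int totalLength(String large, int chunk) {
        int total = 0;
        for (int i = 0; i + chunk <= large.length(); i += chunk) {
            total += large.substring(i, i + chunk).length();
        }
        return total;
    }

    public static void main(String[] args) {
        String big = new String(new char[100_000]).replace('\0', 'x');
        System.out.println(totalLength(big, 100)); // 100000
    }
}
```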

The article I linked to, which came out in late 2012, basically had the same
advice:

 _"If you are writing parsers and such, you can not rely any more on the
implicit caching provided by String. You will need to implement a similar
mechanism based on buffering and a custom implementation of CharSequence"_

0\. [http://www.javaadvent.com/2012/12/changes-to-stringsubstring-in-java-7.html](http://www.javaadvent.com/2012/12/changes-to-stringsubstring-in-java-7.html)

~~~
TheLoneWolfling
The thing I mind is that there is now no way to get the old behavior. String
is a final class, so you cannot override it and add a field, even. You can
roll your own - if there is no code you do not control that takes a string.
(And if you don't mind having to write your own string class!)

And it being done in a "bugfix" release? That's unacceptable.

~~~
tuckermi
Given the pain that would be associated with rolling your own, why not make
the case that a new method be added to the String API that provides the old
behavior? Legacy applications would still need to change, but it would be a
relatively straightforward mechanical replacement.

~~~
mhaymo
That requires storing an extra two fields per String (int length, offset;),
which is costly. Users who need constant-time substrings can simply implement
their own class `Subsequence extends CharSequence` with a constructor taking a
CharSequence and two ints. Users who need to pass a substring to a foreign
function which only accepts String do need to copy, but that's not a major
enough use-case to justify upping the memory usage of most applications.
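
A sketch of the `Subsequence` class described above: a constant-time view over a backing CharSequence. Note that it keeps the whole backing sequence reachable, which is exactly the trade-off the old Java 6 substring had.

```java
// A constant-time substring view; holds the backing sequence alive.
public final class Subsequence implements CharSequence {
    private final CharSequence base;
    private final int offset;
    private final int length;

    public Subsequence(CharSequence base, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > base.length()) {
            throw new IndexOutOfBoundsException();
        }
        this.base = base;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) { return base.charAt(offset + index); }

    @Override public CharSequence subSequence(int start, int end) {
        return new Subsequence(base, offset + start, end - start);
    }

    @Override public String toString() { // copies only when a String is demanded
        return new StringBuilder(length).append(base, offset, offset + length).toString();
    }
}
```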

------
stevoski
I recently started to write an article describing how java.lang.String was one
of the classes that the Java creators got mostly right in Java 1.0. This
contrasts with other Java 1.0 classes such as java.util.Date, which, as the
usage of Java spread and good practices emerged, revealed itself to be poorly
implemented.

I wanted to show in my article how one of String's strengths is that String
efficiently shares memory with substrings. But then when I looked into Java
8's String source code, I was surprised to find that this was not the case. It
seemed I had been working with false assumptions for years.

Now, thanks to Heinz's article (OP), I discover that this was a change in the
String class.

Long ago, a man who had been programming for 30 years told me about his theory
of the half-life of programming knowledge being roughly 18 months. Today, Java
has made me believe his theory a little bit more.

~~~
ygra
> Today, Java has made me believe his theory a little bit more.

I think Java is sort of an exceptional case here in that the language stays
relatively stable and evolves very little over time. Also they tend to not
change internal details of the standard library without reason, so even lots
of things programmers _shouldn't_ rely on, such as the behaviour of
substring(), will stay unchanged for quite a long time.

Now JavaScript, on the other hand, probably has a programming language and
framework knowledge half-life of considerably less than 18 months ...

~~~
nkassis
JS now is moving fast but for the longest time it was stuck in stasis with
various incompatible implementations floating around. And almost any knowledge
on it was probably invalid in one of those implementations.

------
midko
The article is an example of using an anecdote ("During our course, the
customer remarked that they had a real issue with this new Java 7 and 8
approach to substrings") and a skewed microbenchmark to extrapolate to general
advice.

The substring change was done to address common real scenarios, as the author
of the change described here:
[http://www.reddit.com/r/programming/comments/1qw73v/til_orac...](http://www.reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string/cdhb77f)

~~~
rsynnott
It does address common real scenarios. However, it also causes issues for
other common real scenarios. For instance, say you have a CSV parser where you
generally only look at a few columns. Under the old system, it wasn't crazy to
implement this by, for each field, checking if any unescaping is necessary,
and if not (generally the common case) doing a substring. This produces some
garbage, of course, but manageable amounts. This approach abruptly gets _far_
slower under Java 7.
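
That approach might look roughly like this (helper names are hypothetical and the unescaping rule is simplified):

```java
public class CsvField {
    // Return the field [start, end) of a CSV line, unescaping only when needed.
    // On Java 6 the substring in the common case was an O(1) shared view;
    // on Java 7+ it copies, which is what made this pattern abruptly costlier.
    static String field(String line, int start, int end) {
        String raw = line.substring(start, end);
        return raw.indexOf('"') >= 0 ? unescape(raw) : raw;
    }

    // Simplified: real CSV unescaping would also strip the outer quotes.
    static String unescape(String s) {
        return s.replace("\"\"", "\"");
    }
}
```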

I can see why they made the change, but it absolutely did cause problems for
real-world use-cases.

~~~
midko
The way the change was distributed (in a bugfix release) and (mis)communicated
was completely wrong, I definitely agree with that. As for the parsing
example, yes, that's one way to implement this but definitely not the only
right one -- if you care only about a couple of fields from a massive string,
it depends on your application domain whether you can tolerate the additional
memory bloat. What I like about the current behaviour is that the method is
less surprising and more GC-friendly.

~~~
rsynnott
Oh, there are definitely other approaches to the problem I mentioned (and in
fact approaches which are better than the naive split even under Java 6).
Changing it out from under people should have been done a lot more carefully,
though.

------
mcherm
I would not have objected to releasing this change in Java 1.7. But releasing
it in a BUG FIX RELEASE causes me to lose confidence in the maintainers of the
JVM. One of the reasons that large companies like mine build in Java is
because of Sun's long history of extremely careful attention to backward
compatibility. Oracle is no Sun.

~~~
easytiger
100% right. Sun were psychotically obsessed with maintaining a completely
stable, reliable and predictable platform for both the JVM and Solaris. This
went out the window with Oracle for the sake of "innovation". The number of
major incidents caused by cavalier changes in 1.6 and 1.7 is ridiculous. You'd
be hard pushed to even find anything worse than a documentation formatting bug
in 1.5.

~~~
stevoski
I've been programming in Java since Java 1.1.8 and I hazily recall several
regression problems under Sun's stewardship. My gut feeling is that the rate
of regressions in Java updates has remained steady.

If the rate has increased significantly under Oracle's stewardship, this is
something I'd like to write about in my Java newsletter.

Do you have some statistics to demonstrate that regressions have become
substantially worse since Oracle took over Java? I'd happily acknowledge you
as a source!

~~~
alblue
The only real one was System.getenv() which was stubbed out to a no-op in Java
1.3 and then brought back in 1.4.

------
bcg1
Can't understand why people in this thread seem so ticked about this.

The author's observation is interesting and useful to know, but this is an
edge case that is easily fixable and actually kind of the fault of user code,
not the implementation or the spec.

It's not so simple to just say "I can't believe they changed the performance
of xyz"... the author's article is about just one usage pattern, and there is
no evidence presented to show it is very common compared to others. Developers
needing specific behavior for this pattern could easily have used
java.nio.CharBuffer, which makes allocation/copying/access to the underlying
array/etc. explicit, and just happens to implement CharSequence. No
reimplementation of String necessary.
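
For instance, `CharBuffer.subSequence` gives a view over the backing data without a hidden copy:

```java
import java.nio.CharBuffer;

// CharBuffer implements CharSequence, and subSequence returns a view
// sharing the backing storage - no copying unless you ask for it.
public class CharBufferView {
    public static void main(String[] args) {
        CharBuffer whole = CharBuffer.wrap("one,two,three");
        CharSequence middle = whole.subSequence(4, 7); // a view, not a copy
        System.out.println(middle); // prints "two"
    }
}
```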

In general, Java is not C or C++, and has never been billed as a platform
where you could rely on the internal representation of anything unless it is
specified as such. Even the same bytecode running on the same platform on the
same machine can run differently if the hotspot compiler says Make It So. That
of course assumes you are even using Oracle's JVM or OpenJDK... there are in
fact other implementations of both the JVM and the standard library that are
free to implement things however they want.

Even within the Oracle/OpenJDK sphere - there are 4 different garbage
collectors which all could affect this pattern differently. There is even a
new String dedup feature that would have an impact on the internal behavior of
Strings:

[https://blog.codecentric.de/en/2014/08/string-deduplication-new-feature-java-8-update-20-2/](https://blog.codecentric.de/en/2014/08/string-deduplication-new-feature-java-8-update-20-2/)

tl;dr - don't assume and pick the right tool for the job

------
jeb_douche
And there is a good chance it will change again with compact Strings in
OpenJDK 9, see
[http://openjdk.java.net/jeps/254](http://openjdk.java.net/jeps/254)

~~~
stevoski
I'm dubious as to whether this change will occur. It's been tried before, and
given up due to complexity.

I think it is an excellent idea, but in practice too awkward to implement
seamlessly.

~~~
ptx
Python 3.3 did it:
[https://www.python.org/dev/peps/pep-0393/](https://www.python.org/dev/peps/pep-0393/)

------
epmatsw
The author of the change participated in a Reddit thread discussing it when
the update came out. Thought he made a pretty good case for it.

[https://reddit.com/r/programming/comments/1qw73v/til_oracle_...](https://reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string/cdhb77f)

~~~
e40
I don't know, this

[https://www.reddit.com/r/programming/comments/1qw73v/til_ora...](https://www.reddit.com/r/programming/comments/1qw73v/til_oracle_changed_the_internal_string/cdhupx4)

seems to bring up persuasive arguments on the other side.

~~~
threeseed
His suggestion is pretty ludicrous though.

You can't just deprecate substring(), a function that is used in what, 90% of
the millions of Java applications around the world? All because life is
difficult for a tiny few edge cases (and for which workarounds exist, i.e.
checking Java version numbers). Sure, it wasn't great that it was done during
a bug fix, but we need to be mindful that Java releases do span multiple
years.

------
strictfp
Stop being so dismissive of this being a problem. I've had lots of problems
with runaway allocation rates in parsers due to this issue. They could at
least have introduced an alternative method which returns a view into the
original String IMO. Not everyone is silly enough to not read the doc and get
confused when memory blows up.

~~~
the8472
> and get confused when memory blows up

this is fairly easy to spot with a profiler though

~~~
strictfp
Yes that is what I mean. Is it really more common that people do substring and
expect the rest to be thrown away than the other way around? I would like to
argue that this is an optimization aimed at sloppy code while it's penalizing
well-written code.

~~~
jaawn
I would say: definitely yes. I think it is far more common for a programmer to
expect that if they make a substring, and never reference the original string
again, only the substring would remain in memory after GC. It is more
intuitive. Without prior knowledge of the implementation, why would you expect
a reference to a substring to hold the entire original string in memory? I
don't think it is necessarily sloppy code to interpret the method this way
when you do not know the underlying implementation.

The description in the Java 7 API for substring is "Returns a new string that
is a substring of this string." That does not suggest any sharing with the
parent string; it says "new string".

~~~
strictfp
OK I agree that it's more logical from a GC point of view. But why would you
end up with a gigantic string? Are you not doing something wrong then?

Anyway, I have moderately sized strings and am trying to analyze parts of
them. And I don't want to make copies for the sake of analysis. This has to be
quite common as well.

------
brandonbloom
This reminds me: Generic Persistent RRB vectors should be part of every
standard library. There's a great master's thesis out there waiting to be
written about a similar data structure for variable length encodings of
elements (such as UTF-8 strings).

~~~
agumonkey
Interesting, another Phil Bagwell paper
[http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf](http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf)

------
jkot
String on its own is an inefficient class. There is an extra pointer, an int
field... ASCII strings are not downgraded from char[] to byte[]... You get a
big performance bonus just by replacing String with a raw byte[].

I wrote Map<String,String> which does this transparently, and it consumes
about 5x less memory compared to HashMap<String,String>.
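
A sketch of the idea (not the commenter's actual code): storing known-ASCII text as a byte[] halves the character data, at the cost of converting back whenever a String is needed.

```java
import java.nio.charset.StandardCharsets;

// Stores ASCII-only text in a byte[] - half the memory of a char[].
public final class AsciiText {
    private final byte[] bytes;

    public AsciiText(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) throw new IllegalArgumentException("not ASCII");
        }
        this.bytes = s.getBytes(StandardCharsets.US_ASCII);
    }

    @Override public String toString() { // converts back only on demand
        return new String(bytes, StandardCharsets.US_ASCII);
    }
}
```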

~~~
joajoa
It's not that simple. See [1] and [2]

[1]
[https://bugs.openjdk.java.net/browse/JDK-8054307](https://bugs.openjdk.java.net/browse/JDK-8054307)
[2] [http://shipilev.net/blog/2015/black-magic-method-
dispatch/](http://shipilev.net/blog/2015/black-magic-method-dispatch/)

~~~
hyperpape
It's been a little while since I read the method dispatch post...can you
explain what you're driving at?

------
ExpiredLink
> _Java 7 quietly changed the structure of String. Instead of an offset and a
> count, the String now only contained a char[]. This had some harmful effects
> for those expecting substring() would always share the underlying char[]._

In theory they could have implemented both: substring optimization for small
but not for larger strings.

~~~
aembleton
That would have been terrible and unexpected. The behaviour would differ based
upon the length of the input String.

~~~
david-given
It shouldn't. The behaviour is specified by the external interface, and if the
internal implementation is changing the behaviour, then the implementation is
broken.

For example, back when I was at a JVM company, we had two string
implementations, one for ASCII strings and one for UCS-2 strings; the JVM uses
a lot of ASCII strings, and frequently you could figure this out at code load
time. Having an implementation based on an array of bytes saved quite a lot of
space, and was completely transparent.

------
pron
OTOH, Java 8u20 added string deduplication in the GC:
[https://blog.codecentric.de/en/2014/08/string-deduplication-new-feature-java-8-update-20-2/](https://blog.codecentric.de/en/2014/08/string-deduplication-new-feature-java-8-update-20-2/)

------
blueplanet
Can someone explain to me why using a cached value for hashCode in an
immutable string is a bad thing?

~~~
abollaert
I don't think he refers to the caching being a bad thing, but that it probably
does not affect performance that much, since most Strings are likely to be
small (and the number of times hashCode is called is also likely to be small).

There's also a catch: if the hashCode of the string is 0, the hash will be
recalculated every time (since the code assumes it has not been cached yet).
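
The caching pattern and the zero-value catch look roughly like this (a simplified sketch of the technique, not String's actual source):

```java
// Lazily caches the hash; 0 doubles as the "not yet computed" sentinel,
// so an input whose true hash is 0 is recomputed on every call.
public final class CachedHash {
    private final char[] value;
    private int hash; // defaults to 0, meaning "not cached"

    public CachedHash(String s) { this.value = s.toCharArray(); }

    @Override public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            for (char c : value) h = 31 * h + c;
            hash = h; // if h happens to be 0, this caches nothing
        }
        return h;
    }
}
```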

~~~
moonchrome
>There's also a catch: if the hashCode of the string is 0, the hash will be
recalculated every time (since the code assumes it has not been cached yet).

At least that part should be easy to fix by defining a hash function that
returns numbers != 0 - even the article says they did it for JVM 7, but it's
gone in JVM 8 - with no explanation?

~~~
Ironlink
The hash32 implementation in Java 7 was not intended to fix the case where the
actual hash value was zero, nor did it have an impact on that case as the
hash32 value was stored in a separate field.

It was made to decrease the number of hash value collisions in large data
structures (hash maps and such). Its replacement is described here:
[http://openjdk.java.net/jeps/180](http://openjdk.java.net/jeps/180) . Given
that most Strings never end up in a large collection, allocating an extra 4
bytes for every one of them was a waste.

This post has been edited once for factual correctness.

~~~
mtdewcmu
It's hard to imagine a situation where the hash function would be so slow as
to justify adding 4 bytes to every string.

~~~
joosters
If your strings are large, then calculating the hash will take time AND the
extra 4 bytes for hash storage will be minimal extra overhead.

You can easily find good & bad cases for all of these string implementations.
They all have tradeoffs.

------
MichaelGG
Why does he say the SubbableString example is not thread-safe? The private
fields are only written during object creation.

~~~
rlmw
Because he takes a reference to a char[] in the constructor without copying
it. char[]s are mutable so this means that the internal state of his String
can be mutated by something else without any synchronisation guarantees.

~~~
MichaelGG
But that's nothing to do with thread safety - even with a single thread, if
the char[] changes then stuff blows up, right?

~~~
sgustard
Correct, and Java's String(char[]) makes a copy of the array for that reason.
This article's benchmark gets a boost by avoiding that.
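
A quick illustration of that defensive copy:

```java
// String(char[]) copies the array, so later mutation of the source
// cannot change the String.
public class CopyDemo {
    public static void main(String[] args) {
        char[] chars = {'j', 'a', 'v', 'a'};
        String s = new String(chars);
        chars[0] = 'l';
        System.out.println(s); // prints "java", not "lava"
    }
}
```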

------
hobarrera
Obligatory: [https://xkcd.com/1172/](https://xkcd.com/1172/)

