Based on all of the coding "best practices" books/blogs I've read, it is unwise to develop something which heavily depends on the internal implementation of the String class (or any library class). Many other commenters here are bemoaning the impacts the specific implementation of String will have on their existing applications. However, we should be designing based on the interface/contract presented, not specific implementation details of interfaces/libraries. Any optimization based on specific implementation details is "at your own risk."
If changes to the implementation of String could cause issues for you, you really should be using a custom solution anyway. A quick and dirty option would be to write a wrapper class which uses all of your preferred String hacks in a central location, and using that in place of String throughout your code. When String's implementation changes, you can update your hacks all in one place.
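For example, a minimal sketch of such a wrapper, with a made-up name (`MyString`) and only the handful of operations an application might actually rely on:

```java
/**
 * Illustrative wrapper (hypothetical name): all "clever" string handling the
 * application relies on lives here, so an implementation change in
 * java.lang.String only has to be absorbed in this one class.
 */
public final class MyString implements CharSequence {
    private final String value;

    public MyString(String value) {
        this.value = value;
    }

    // Centralize the behaviour you depend on. If String.substring()'s
    // characteristics change again, swap in a different strategy here
    // (e.g. a view-based CharSequence) without touching the rest of the code.
    public MyString substring(int begin, int end) {
        return new MyString(value.substring(begin, end));
    }

    @Override public int length() { return value.length(); }
    @Override public char charAt(int index) { return value.charAt(index); }
    @Override public CharSequence subSequence(int start, int end) {
        return value.subSequence(start, end);
    }
    @Override public String toString() { return value; }
}
```

The point is only that the substring strategy lives in one place; swapping in a different approach later means editing this class, not the whole codebase.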
Algorithmic complexity should be considered part of an API, because it is not mere "internal implementation". It has observable effects from the outside. When a function in an API goes from O(1) to O(n), that should be considered a breaking change.
I've been bitten by std::list::size in pre-C++11 code, as it was allowed to be O(n). But the standard was clear about it; the fault was mine. Now it is guaranteed to be O(1): http://en.cppreference.com/w/cpp/container/list/size
I think it's likely that the change to String.substring was actually a good one. But I also agree with others that it was rolled out poorly and should probably not have been in a fixpack release, since, as I see it, they changed the API.
substring() was always defined in the 6 and 7 JavaDoc as:
"Returns a new string that is a substring of this string..."
Both implementations satisfy the API description so the API didn't change, the implementation did.
Optimizing your code based on the internals of a supposedly opaque data structure is a bad practice and if you get burned you only have yourself to blame.
In reality, performance often matters, and then there's no way to avoid understanding what the implementation does. The Java 6 implementation was a big, somewhat well-known gotcha: read megabytes of data into a String, substring a couple of KB out of it, and tada! you're still holding all those megabytes.
You inescapably end up with one of:
1 - substring creates a new string each time, producing lots of garbage, since it's an incredibly common operation;
2 - substring saves garbage in many cases by returning sub-chunks of the full char array, but you will be unable to GC the full string as long as any substring remains reachable;
3 - stop trusting the internal String implementation to be as efficient as possible, write a wrapper you can update with the best method for the current version, and never look back :-)
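For the record, option 2 is exactly the Java 6 gotcha described above. A minimal, hedged sketch of it, with roughly 10 MB of throwaway data standing in for a real input:

```java
import java.nio.charset.StandardCharsets;

public class SubstringRetention {
    public static void main(String[] args) {
        // Build a large string (~10 MB) to stand in for file contents.
        String huge = new String(new byte[10_000_000], StandardCharsets.US_ASCII);

        // Keep only the first couple of KB.
        String header = huge.substring(0, 2048);
        huge = null; // drop the big string...

        // ...but on Java 6, header still shares the original 10 MB char[],
        // so that array cannot be garbage collected. On Java 7u6+ the
        // substring call copies, and only ~2 KB stay reachable.
        System.out.println("header length: " + header.length());
    }
}
```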
Why not reimplement the entire standard library so you understand and control all the performance characteristics? It can be just like writing C in the 90s: start by writing a hashmap, a string, an arraylist or equivalent, and so on, because your code is sensitive to every common data structure you use.
Because, most of the time, performance doesn't matter enough to justify it. Custom solutions are only necessary if you know performance is critical to your application's purpose, and that you need to squeeze as much out as possible. If that isn't the case, then you don't need to worry about the implementation details, because, in general, any performance changes won't matter.
It's true that they didn't violate the API documentation, but it's also true that (as your parent suggests) it probably would have been better if the API documentation had explicitly stated that `substring()` is O(1), such that this change would have been breaking. At the very least it would have made it more obvious, and at most it would have forced them to make a new method for the new behavior.
This is a fair argument to make, but to my knowledge, this is not currently true of the Java API. It probably would make sense for some future version of the Java API to include O notation as part of the contract, at least for certain elements such as String. Since the performance expectations of Java have increased dramatically, it really would be beneficial to maintain stable-or-better performance (edit: for all use cases) across versions.
I guess my point is that, regardless of your opinions of what the API contract should include, you have to work with the current contract and plan/design/develop accordingly to avoid issues with library implementation changes.
I think it's unreasonable to assume programmers can cope with an API that is free to change its algorithmic complexity, particularly when that API is part of the core library for a language. In other words, I do not fault the Java programmers who assumed that the algorithmic complexity would not change in a minor update.
The end-game for what you're suggesting is that all Java developers who care about performance, and do heavy string manipulation in their performance critical code, should re-implement the String class. I do not find that realistic.
Like I said, this is a really good point, and it supports including algorithmic complexity as part of the Java API in the future, but unless I am mistaken, it currently is not formally included.
The thing is, you end up depending on algorithmic complexity whether or not it's guaranteed. There's simply no way around it. If this API changed to an exponential or factorial running time, almost all users of it would be completely broken. Since the running time isn't guaranteed then it hasn't technically broken any API contract, but pretty much the only way not to be broken by this change is not to use the API.
That means that if there's no explicit running time guarantee given, your alternatives are to either not use that API at all, or depend on the running time you can observe or guess at.
Since people write APIs to be used, the former is obviously not what's intended. So it's pretty reasonable to consider any API without an explicit running time guarantee to have an implicit one along the lines of "this won't get vastly worse" because your only other choice is not to use it at all.
I think you're alluding to a generic, implicit promise of reasonable performance. I definitely think this is a valid assumption to make, so you are right to point it out. Any API function needs to have reasonable performance, or else there is no point in using it and it is not serving its purpose. It is very true that people write APIs to be used, and if their performance is unreasonably slow, they won't be used. This makes complete sense.
If your argument is that the String implementation change constitutes "unreasonably poor performance" for most normal use cases, then that makes sense. I disagree with it, but it makes sense to argue that if you think that is what's happening. That goes beyond the implementation vs. contract discussion, and maybe that is the real issue many people have: perhaps some feel that the performance hit affects so many common String use cases that it amounts to an unreasonable change.
I have no opinion about this particular change, as I haven't looked into it much and I haven't done any Java in a while. I'm just pointing out that there is something of an implicit time complexity guarantee with any API, so the lack of an explicit API contract isn't enough to say that code which breaks due to a change is at fault.
>> However, we should be designing based on the interface/contract presented, not specific implementation details of interfaces/libraries. Any optimization based on specific implementation details is "at your own risk."
That's a bit naive. Say you want to code something which heavily depends on a String class. You test the standard one, it's fast enough, so you decide you don't need to roll your own.
Then it changes, it's no longer fast, and swapping it out at this point might be a big pain. It's hard to predict that a fundamental class in a language is going to change its performance characteristics.
This is exactly the kind of "programming by coincidence" discussed at the link I provided.
If you are designing something that makes heavy use of String and String operations, and you know it needs to be as optimized as possible, you should consider whether or not you need a custom String solution as part of the design process. Of course there is a hypothetical scenario where a developer might run into this even when "doing everything right", but generally it can be avoided by being aware of these potential interactions and programming deliberately.
I am not saying there will never be times when you need to "just make it work" and reasonably can't account for everything like this. The real world does not always provide the opportunity to do everything "the right way." My point is just that, when you run into this issue, it should be understood that any problems caused by implementation changes aren't really the fault of the maintainers if they haven't broken the contract/interface and the class still performs well for the use cases it is designed for.
> If you are designing something that makes heavy use of String and String operations, and you know it needs to be as optimized as possible, you should consider whether or not you need a custom String solution as part of the design process.
And how do you do that? By writing a test case and seeing what the standard string does. Clearly, we expect things to change over time, so Oracle didn't necessarily do something underhanded here. But Oracle did change the behavior of String enough that people need to hear about it, and perhaps reconsider whether the standard string still does what they need to do.
One method would be to have an informal heuristic you use when you're about to call a library: are the performance characteristics I need guaranteed by this library? If yes: proceed, if no: use a different solution or proceed with caution.
So, if you are implementing something that deals with a large input string, and creates a large number of substrings, you know there could potentially be a performance impact depending on the way the implementation works with substrings. The API makes no performance or memory guarantees about substrings. If your application has strict performance requirements, you know you either need to find another solution, or "proceed with caution."
In this case, performance shouldn't have changed much -- at least in a big-O sense -- but memory use did. And the JDK guys have been swearing for years that the garbage collector handles that for you, so I'm not sure alarm bells would have gone off.
(We'll ignore the fact that every article I see about high performance Java talks about using memory buffers or some kind of "off heap" scheme specifically to get around the garbage collector).
I need a random number. I see that Java has a class for this ( http://docs.oracle.com/javase/8/docs/api/java/util/Random.ht... ). It doesn't guarantee anything about performance. It doesn't even really guarantee anything about the algorithm it will use in the future, but does say that it currently uses a linear congruential algorithm. Should I (1) write my own random number generator? (2) "proceed with caution"? or (3) write a test case, see if the thing is fast enough for what I want, and pay attention to any changes to java.util.Random in future releases? Personally, I generally pick (3).
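A rough sketch of what option (3) can look like in practice: a quick-and-dirty timing check rather than a proper benchmark (a serious measurement would use JMH and control for JIT warm-up, which this only gestures at):

```java
import java.util.Random;

public class RandomSpeedCheck {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        long sink = 0;

        // Warm up so the JIT has a chance to compile the hot loop.
        for (int i = 0; i < 1_000_000; i++) {
            sink += rnd.nextInt();
        }

        long start = System.nanoTime();
        for (int i = 0; i < 10_000_000; i++) {
            sink += rnd.nextInt();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.println("10M nextInt() calls took " + elapsedMs + " ms (sink=" + sink + ")");
        // If this is fast enough for the use case, option (3) says: ship it,
        // and re-run the check when upgrading the JDK.
    }
}
```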
I would usually choose option 2 or 3, but that is because usually small performance discrepancies are unimportant. If performance is paramount at the point in the code where the library is used, I would opt for (1) or a fourth option you omitted: find a library which has already implemented a performance-driven solution.
It is similar to sorting implementations. If you have a small list of items in a collection, you might not care about how fast they can be sorted as long as it is "reasonably fast," so you can just use a built-in sort() function. If you know your collection will likely be larger than most, or you need them sorted as fast as possible, you might choose to implement your own sorting function, or use a different library which specifies a given performance level.
You could just use this hypothetical, built-in sort() for now, and deal with it later if any issues arise, or you could plan ahead if it is important and future proof the code now. The former would be "proceeding with caution" while the latter would be the more maintainable, "best practices" approach when you know performance is higher priority than normal.
If you "proceed with caution" you might run into a situation where the maintainers switch from insertion sort to heap sort, because in many (most?) use cases, heap sort is faster. However, your collection is often already sorted or mostly sorted, which is a case where heap sort performs much worse than insertion sort. Now your specific application has worse performance, but other applications have improved performance. If your collection needs to be sorted at a specific performance level, it is better (but not required) for you to implement a custom sort function, or use a library that is designed for your use case. This way you can have appropriate control over the performance.
Here is an explanation of the change to `substring()`[0] and why it was done. The change was done in Java 7u6.
In short, the previous way of keeping the same underlying character array and just updating the {offset, count} indexes has a drawback in that if the original string is large, it is prevented from being GC'd if one keeps a reference to even a single substring generated from it.
So, it's a trade-off between the original and new behaviour; the original way more or less caps the memory usage at the size of the original string, but at the expense of not being able to GC it if even a single substring exists, while the new way increases memory usage for each substring generated but does not prevent any of the strings from being GC'd.
This is why the article's code example yields such a huge difference in memory usage between Java 6 and Java 7: iterating through a large string and generating lots of substrings is effectively an "anti-pattern" when used against the new `substring()` method.
The article I linked to, which came out in late 2012, basically had the same advice:
"If you are writing parsers and such, you can not rely any more on the implicit caching provided by String. You will need to implement a similar mechanism based on buffering and a custom implementation of CharSequence"
The thing I mind is that there is now no way to get the old behavior. String is a final class, so you cannot override it and add a field, even. You can roll your own - if there is no code you do not control that takes a string. (And if you don't mind having to write your own string class!)
And it being done in a "bugfix" release? That's unacceptable.
I agree - echoing the sentiments of another commenter here, I feel like one of the tenets of Java is backwards compatibility. While the change doesn't affect functionality, it can turn code that previously had a space complexity of O(1) into code that is O(n). This is probably a Bad Thing.
Conversely, for people new or somewhat new to the language, the change probably makes sense from a principle of least surprise. From the start, you're taught that Strings are immutable objects, so you probably understand that `.substring()` produces a new object. Not having the original memory freed when you remove all references to the original string would likely be puzzling at first.
In this respect, the Java/Oracle folks likely decided that the "parsing/tokenization" use case (where you make lots of substrings from a large original string, so sharing the underlying character array pays off) was more specialized and less frequent than the use case of "just pulling a small substring from a much larger one and then discarding the large one."
> You can roll your own - if there is no code you do not control that takes a string.
You can roll your own for the standard String too, through the bootstrap class loader.
Although I don't know how many assumptions about the internals of the String class are baked into the JVM, I think you could replace substring() relatively safely.
Given the pain that would be associated with rolling your own, why not make the case that a new method be added to the String interface that provides the old behavior? Legacy applications would still need to change, but it would be a relatively straightforward mechanical replacement.
That requires storing an extra two fields per String (an int offset and an int count), which is costly. Users who need constant-time substrings can simply implement their own `Subsequence` class implementing CharSequence, with a constructor taking a CharSequence and two ints. Users who need to pass a substring to a foreign function which only accepts String do need to copy, but that's not a major enough use case to justify upping the memory usage of most applications.
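A minimal sketch of such a class (the name `Subsequence` is purely illustrative): it restores the old view-style behaviour explicitly, and only copies when a real String is requested via toString():

```java
/** Hypothetical constant-time substring view over an existing CharSequence. */
public final class Subsequence implements CharSequence {
    private final CharSequence base;
    private final int offset;
    private final int length;

    public Subsequence(CharSequence base, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > base.length()) {
            throw new IndexOutOfBoundsException();
        }
        this.base = base;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        return base.charAt(offset + index);
    }

    // O(1): just narrows the window, no copying. Note the view keeps the whole
    // base sequence reachable -- exactly the pre-7u6 trade-off.
    @Override public CharSequence subSequence(int start, int end) {
        return new Subsequence(base, offset + start, end - start);
    }

    // Copy only when a real String is actually needed.
    @Override public String toString() {
        return new StringBuilder(length).append(base, offset, offset + length).toString();
    }
}
```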
I recently started to write an article describing how java.lang.String was one of the classes that the Java creators got mostly right in Java 1.0. This contrasts to other Java 1.0 classes such as java.util.Date, which, as the usage of Java spread, and some good practices emerged, revealed itself to be poorly implemented.
I wanted to show in my article how one of String's strengths is that String efficiently shares memory with substrings. But then when I looked into Java 8's String source code, I was surprised to find that this was not the case. It seemed I had been working with false assumptions for years.
Now, thanks to Heinz's article (OP), I discover that this was a change in the String class.
Long ago, a man who had been programming for 30 years told me about his theory of the half-life of programming knowledge being roughly 18 months. Today, Java has made me believe his theory a little bit more.
> Today, Java has made me believe his theory a little bit more.
I think Java is sort of an exceptional case here in that the language stays relatively stable and evolves very little over time. Also, they tend not to change internal details of the standard library without reason, so even lots of things programmers shouldn't rely on, such as the behaviour of substring(), will stay unchanged for quite a long time.
Now JavaScript, on the other hand, probably has a programming language and framework knowledge half-life of considerably less than 18 months ...
JS now is moving fast but for the longest time it was stuck in stasis with various incompatible implementations floating around. And almost any knowledge on it was probably invalid in one of those implementations.
The article is an example of using an anecdote ("During our course, the customer remarked that they had a real issue with this new Java 7 and 8 approach to substrings") and a skewed microbenchmark to extrapolate to general advice.
It does address common real scenarios. However, it also causes issues for other common real scenarios. For instance, say you have a CSV parser where you generally only look at a few columns. Under the old system, it wasn't crazy to implement this by, for each field, checking if any unescaping is necessary, and if not (generally the common case) doing a substring. This produces some garbage, of course, but manageable amounts. This approach abruptly gets _far_ slower under Java 7.
I can see why they made the change, but it absolutely did cause problems for real-world use-cases.
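A hedged sketch of the kind of parser being described (a toy, not the commenter's actual code): the fast path is a plain substring per field, which was an O(1) view on Java 6 and is a copy on 7u6 and later:

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveCsvFields {
    // Extract the fields of one CSV line, copying only when a field actually
    // contains escaping. The fast path is a plain substring, which was an
    // O(1) view on Java 6 and is an O(n) copy on 7u6+.
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            if (i == line.length() || line.charAt(i) == ',') {
                String raw = line.substring(start, i);
                out.add(raw.indexOf('"') >= 0 ? unescape(raw) : raw);
                start = i + 1;
            }
        }
        return out;
    }

    // Hypothetical slow path; a real parser would handle quoting properly,
    // including commas inside quoted fields.
    static String unescape(String raw) {
        return raw.replace("\"\"", "\"").replace("\"", "");
    }

    public static void main(String[] args) {
        System.out.println(fields("id,name,\"quoted\",42")); // [id, name, quoted, 42]
    }
}
```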
The way the change was distributed (in a bugfix release) and (mis)communicated was completely wrong, I definitely agree with that.
As for the parsing example, yes, that's one way to implement this but definitely not the only right one -- if you care only about a couple of fields from a massive string, it depends on your application domain whether you can tolerate the additional memory bloat. What I like about the current behaviour is that the method is less surprising and more GC friendly
Oh, there are definitely other approaches to the problem I mentioned (and in fact approaches which are better than the naive split even under Java 6). Changing it out from under people should have been done a lot more carefully, though.
I would not have objected to releasing this change in Java 1.7. But releasing it in a BUG FIX RELEASE causes me to lose confidence in the maintainers of the JVM. One of the reasons that large companies like mine build in Java is because of Sun's long history of extremely careful attention to backward compatibility. Oracle is no Sun.
100% right. Sun were psychotically obsessed with maintaining a completely stable, reliable and predictable platform for both the JVM and Solaris. This went out the window with Oracle for the sake of "innovation". The number of major incidents caused by cavalier changes in 1.6 and 1.7 is ridiculous. You'd be hard pushed to find anything worse than a documentation formatting bug in 1.5.
Tbh, though, it's a "damned if you do" scenario: pre-acquisition, all talk was about the glacial pace of evolution in the Java world and how things were just not progressing. Oracle started releasing hard and fast, and now people complain it's too much and too buggy. Can't please everyone...
You must be kidding. Up until 1.6 I had regressions on even the minor releases. Since oracle took over I update without even thinking about it. And I've been delivering desktop apps since 1.1.8
I've been programming in Java since Java 1.1.8 and I hazily recall several regression problems under Sun's stewardship. My gut feeling is that the rate of regressions in Java updates has remained steady.
If the rate has increased significantly under Oracle's stewardship, this is something I'd like to write about in my Java newsletter.
Do you have some statistics to demonstrate that regressions have become substantially worse since Oracle took over Java? I'd happily acknowledge you as a source!
I don't think that is fair; 1.5 introduced the enum keyword.
That was not a fun change in a codebase I worked on. Java 6 and later are much stricter about what the API contracts allow, and about not breaking them, than in the Sun days.
Can't understand why people in this thread seem so ticked about this.
The author's observation is interesting and useful to know, but this is an edge case that is easily fixable and actually kind of the fault of user code, not the implementation or the spec.
It's not so simple to just say "I can't believe they changed the performance of xyz"... the author's article is about just one usage pattern, and there is no evidence presented to show it is very common compared to others. Developers needing specific behavior for this pattern could easily have used java.nio.CharBuffer, which makes allocation/copying/access to the underlying array explicit, and which happens to implement CharSequence. No reimplementation of String necessary.
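For example (a minimal sketch, assuming the char[] is under your own control): CharBuffer.wrap gives a CharSequence view without copying, and any copies you do make are explicit in the code:

```java
import java.nio.CharBuffer;

public class CharBufferView {
    public static void main(String[] args) {
        char[] data = "a rather long run of characters we do not want to copy".toCharArray();

        // A CharSequence view over part of the array: no copying happens here,
        // and the sharing/lifetime trade-off is explicit in the code.
        CharBuffer view = CharBuffer.wrap(data, 2, 6); // the word "rather"

        System.out.println(view);           // prints the viewed region
        System.out.println(view.length());  // 6
        System.out.println(view.charAt(0)); // 'r'

        // Copy only when a real String is required.
        String copy = view.toString();
        System.out.println(copy);
    }
}
```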
In general, Java is not C or C++, and has never been billed as a platform where you could rely on the internal representation of anything unless it is specified as such. Even the same bytecode running on the same platform on the same machine can run differently if the hotspot compiler says Make It So. That of course assumes you are even using Oracle's JVM or OpenJDK... there are in fact other implementations of both the JVM and the standard library that are free to implement things however they want.
Even within the Oracle/OpenJDK sphere - there are 4 different garbage collectors which all could affect this pattern differently. There is even a new String dedup feature that would have an impact on the internal behavior of Strings:
You can't just deprecate substring(), a function that is used in what, 90% of the millions of Java applications around the world, all because life is difficult for a small number of edge cases (and for which workarounds exist, i.e. checking Java version numbers). Sure, it wasn't great that it was done during a bug fix, but we need to be mindful that Java releases do span multiple years.
Stop being so dismissive of this being a problem. I've had lots of problems with runaway allocation rates in parsers due to this issue. They could at least have introduced an alternative method which returns a view into the original String IMO. Not everyone is silly enough to not read the doc and get confused when memory blows up.
Yes that is what I mean. Is it really more common that people do substring and expect the rest to be thrown away than the other way around? I would like to argue that this is an optimization aimed at sloppy code while it's penalizing well-written code.
I would say: definitely yes. I think it is more common for a programmer to expect that if they make a substring, and never reference the original string again, only the substring will remain in memory after GC. It is more intuitive. Without prior knowledge of the implementation, why would you expect a reference to a substring to hold the entire original string in memory? I don't think it is necessarily sloppy code to interpret the method this way when you do not know the underlying implementation.
The description in the Java 7 API for substring is "Returns a new string that is a substring of this string." That does not suggest any sharing with the parent string; it says "new string".
OK I agree that it's more logical from a GC point of view. But why would you end up with a gigantic string? Are you not doing something wrong then?
Anyway, I have moderately sized strings and am trying to analyze parts of them. And I don't want to make copies for the sake of analysis. This has to be quite common as well.
This reminds me: Generic Persistent RRB vectors should be part of every standard library. There's a great master's thesis out there waiting to be written about a similar data structure for variable length encodings of elements (such as UTF-8 strings).
String on its own is an inefficient class. There is an extra pointer and an int field... ASCII strings are not downgraded from char[] to byte[]... You get a big performance bonus just by replacing String with a raw byte[].
I wrote a Map<String,String> which does this transparently, and it consumes about 5x less memory than HashMap<String,String>.
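Not that commenter's actual implementation, but one way such a map can be sketched: store ASCII keys and values as byte[] internally and convert at the boundaries, so each entry avoids the char[] and String-object overhead:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a String-to-String map that stores ASCII data as byte[]. */
public class CompactStringMap {
    // Wrapper giving byte[] value-based equals/hashCode so it can be a map key.
    private static final class Key {
        final byte[] bytes;
        Key(byte[] bytes) { this.bytes = bytes; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && Arrays.equals(bytes, ((Key) o).bytes);
        }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
    }

    private final Map<Key, byte[]> map = new HashMap<>();

    private static byte[] enc(String s) { return s.getBytes(StandardCharsets.US_ASCII); }
    private static String dec(byte[] b) { return new String(b, StandardCharsets.US_ASCII); }

    public void put(String key, String value) {
        map.put(new Key(enc(key)), enc(value));
    }

    public String get(String key) {
        byte[] v = map.get(new Key(enc(key)));
        return v == null ? null : dec(v);
    }

    public static void main(String[] args) {
        CompactStringMap m = new CompactStringMap();
        m.put("greeting", "hello");
        System.out.println(m.get("greeting")); // hello
    }
}
```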
> Java 7 quietly changed the structure of String. Instead of an offset and a count, the String now only contained a char[]. This had some harmful effects for those expecting substring() would always share the underlying char[].
In theory they could have implemented both: copy for small strings, and keep the shared-view substring optimization for larger ones.
It shouldn't. The behaviour is specified by the external interface, and if the internal implementation is changing the behaviour, then the implementation is broken.
For example, back when I was at a JVM company, we had two string implementations, one for ASCII strings and one for UCS-2 strings; the JVM uses a lot of ASCII strings, and frequently you could figure this out at code load time. Having an implementation based on an array of bytes saved quite a lot of space, and was completely transparent.
The behavior would differ, but it would be the better behavior for each String length.
For large strings, it would work exactly like it did before with fast views. And since the existing code already knew about memory sharing this wouldn't be a new problem for any code.
For small strings, it would now be faster and have smaller memory footprint. Existing programs would be faster and better performing.
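A hedged sketch of that hybrid idea as a static helper (it could not go into String itself, since String is final, as the next comment notes); the threshold is arbitrary and purely illustrative:

```java
import java.nio.CharBuffer;

public final class HybridSubstring {
    // Arbitrary illustration threshold: below this, copying is cheap enough.
    private static final int COPY_THRESHOLD = 1024;

    /**
     * Copy for small parents (so the parent can be GC'd quickly), return a
     * zero-copy view for large parents (old Java 6-style behaviour).
     */
    public static CharSequence substring(String parent, int begin, int end) {
        if (parent.length() <= COPY_THRESHOLD) {
            return parent.substring(begin, end);       // copies on 7u6+
        }
        return CharBuffer.wrap(parent, begin, end);     // view, no copy
    }

    public static void main(String[] args) {
        System.out.println(substring("hello world", 6, 11)); // small parent: copied
    }
}
```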
I think the main reason this is a bad idea is that String is final. The way to do this properly is with two classes, SmallString and LargeString, and then using hotspot to optimize the program for how they are used. But that is not possible without changing the API.
I don't think he refers to the caching being a bad thing, but that it probably does not affect performance that much, since most Strings are likely to be small (and the number of times hashcode is called is also likely to be small).
There's also a catch: if the hashcode of the string is 0, the hashcode will be recalculated every time (since the code assumes it has not been cached yet).
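A hedged sketch of that caching pattern (not the exact JDK source): the cache field starts at 0 and 0 doubles as the "not yet computed" sentinel, so a string whose real hash is 0 recomputes on every call:

```java
import java.util.Arrays;

/** Illustrative only: the zero-as-sentinel hash caching pattern. */
public final class CachedHashExample {
    private final char[] value;
    private int hash; // 0 doubles as "not computed yet"

    public CachedHashExample(String s) { this.value = s.toCharArray(); }

    @Override public int hashCode() {
        int h = hash;
        if (h == 0) {                  // also true on every call for the rare
            for (char c : value) {     // strings whose real hash is exactly 0,
                h = 31 * h + c;        // so those never benefit from the cache
            }
            hash = h;
        }
        return h;
    }

    @Override public boolean equals(Object o) {
        return o instanceof CachedHashExample
                && Arrays.equals(value, ((CachedHashExample) o).value);
    }
}
```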
>There's also a catch : if the hashcode of the string is 0, the hashcode will be recalculated every time (since the code assumes it has not been cached yet).
At least that part should be easy to fix by defining a hash function that returns numbers != 0 - even the article says they did it for JVM 7, but it's gone in JVM 8, with no explanation?
The hash32 implementation in Java 7 was not intended to fix the case where the actual hash value was zero, nor did it have an impact on that case as the hash32 value was stored in a separate field.
It was made to decrease the number of hash value collisions in large data structures (hash maps and such). Its replacement is described here: http://openjdk.java.net/jeps/180 . Given that most Strings never end up in a large collection, allocating an extra 4 bytes for every one of them was a waste.
This post has been edited once for factual correctness.
I assume it's because most of the standard Java collections (HashMap, HashSet) actually cache the hash codes themselves. So caching it again inside the object is a bit redundant.
However, other data structures (notably the ones in Guava) don't do this caching, which does hurt performance for any sort of heavy writes and/or complex objects.
It can be a bad thing if all programmers assume that the caching works, and it doesn't. That's where the bit in the article mentions strings that produce a hash code of zero. Zero is used as a sentinel for "hash code not yet calculated", so for those particular strings the code is recalculated every single time.
That's not worse than not caching at all. The hashing implementation is usually a black box. Actually, a simple `if (h == 0) h = 0xdeadbeef;` would probably prevent that issue from happening (if it should be considered a real issue at all).
Because he takes a reference to a char[] in the constructor without copying it. char[]s are mutable so this means that the internal state of his String can be mutated by something else without any synchronisation guarantees.
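A hedged illustration of that pitfall and the usual defensive-copy fix (both class names are hypothetical):

```java
import java.util.Arrays;

public class DefensiveCopyDemo {
    // Unsafe: keeps the caller's array, so later writes to it leak in.
    static final class LeakyString {
        private final char[] value;
        LeakyString(char[] value) { this.value = value; }
        @Override public String toString() { return new String(value); }
    }

    // Safe: copies on the way in, so the instance is truly immutable.
    static final class SafeString {
        private final char[] value;
        SafeString(char[] value) { this.value = Arrays.copyOf(value, value.length); }
        @Override public String toString() { return new String(value); }
    }

    public static void main(String[] args) {
        char[] chars = "immutable?".toCharArray();
        LeakyString leaky = new LeakyString(chars);
        SafeString safe = new SafeString(chars);

        chars[0] = 'X'; // mutate the shared array after construction

        System.out.println(leaky); // "Xmmutable?" -- state changed underneath it
        System.out.println(safe);  // "immutable?" -- unaffected
    }
}
```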
Here is some relevant advice from The Pragmatic Programmer: https://pragprog.com/the-pragmatic-programmer/extracts/coinc...