

JEP254: proposal to represent Java Strings as ISO-8859-1 - alblue
http://openjdk.java.net/jeps/254
According to the proposed Java enhancement, "most strings" fall in the Latin-1 character set. Instead of storing all character arrays as 16-bit elements (in UTF-16), the proposal is to have a boolean flag indicating whether the string is in UTF-16 or Latin-1/ISO-8859-1 encoding, thus saving memory overall.
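
A rough sketch of the shape such a flag might take (the field and constant names here are my guesses, not necessarily what the JDK will use):

    // Hypothetical sketch of the proposed layout, not actual JDK code.
    final class CompactString {
        static final byte LATIN1 = 0, UTF16 = 1;
        final byte coder;    // which encoding the bytes below use
        final byte[] value;  // 1 byte/char (Latin-1) or 2 bytes/char (UTF-16)
    }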
======
SiVal
This sounds like a ridiculous idea. Java treats text as all modern systems
should: as a sequence of the open-ended Universal Character Set (UCS). This
proposal doesn't change that, nor could it: Java strings are sequences of
Unicode code points, logically at least. This proposal is just about how to
encode such a sequence internally.

There are numerous alternatives for encoding sequences of Unicode code points,
each with its own pros and cons. UTF-8, UTF-16, UTF-32, and various
compression schemes emphasize backward compatibility, or space savings, or
rapid, random access of characters, or whatever. Different designs optimized
for different benefits, with different associated costs. The UCS character
sequence is the universal thing, but how it is represented should be optimized
for the application.

It sounds as though this proposal is only about minimizing the number of bytes
needed to represent a sequence of UCS characters, but ISO 8859-1 was never
designed as a minimal-space representation of UCS characters. Its design
reflects entirely different goals. If all you care about is an internal (not
externally visible) representation that minimizes string size, you should do a
proper statistical analysis first. You will find, for example, that the curly
quotes and dashes that are nearly ubiquitous in serious English text and the
Euro character so fundamental to international economics are far more commonly
included in Java strings than are most control characters. Yet control
characters are a part of ISO 8859-1, and those vastly more useful characters
are not. Why, if your goal is to optimize for space, would you privilege so
many obscure control characters in your default, internal representation, yet
force a switch to a different encoding for any string on a blog containing an
em dash or on an e-commerce site containing a Euro sign?

Instead, if it's all about space, and it is an internal, hidden
representation, you do a proper statistical analysis of exactly what it is you
most need to represent, and either assign bytes in inverse proportion to
commonness, or you use a proper information-theoretic compression "encoding"
scheme based on the results of your analysis and your weighted goals. Or you
stick with the existing, standard, run-anywhere representation Java has used
since the beginning.

Either way, switching to ISO 8859-1 for this makes no sense.

~~~
pilif
_> Java treats text as all modern systems should: as a sequence of the open-
ended Universal Character Set (UCS)_

unfortunately, Java treats text as data encoded in UCS-2, which uses a fixed
two bytes to represent a character, and all the Java string APIs assume the
UCS-2 encoding.

The problem with UCS-2 is that it doesn't allow encoding characters outside
of the Unicode BMP.

In order to solve this, Java actually uses the UTF-16 encoding for characters
outside of the BMP, but none of the APIs actually know about this and still
assume UCS-2.

This leads to the API reporting wrong character lengths, and to regular
expressions matching certain strings incorrectly and sometimes destroying data
when you apply them to replace substrings.

Case in point is this little test class here:
[https://gist.github.com/5745601abbfbf7068fcd](https://gist.github.com/5745601abbfbf7068fcd)

which prints 2 on the console. I've also given a talk about this at a Swiss JS
conference in 2012: [http://pilif.me/unicode.pdf](http://pilif.me/unicode.pdf)
(JS has/had the same issues as Java in that regard).
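
A minimal sketch of the behavior in question (my own example, not the linked gist):

    // Prints 2, then 1: length() counts UTF-16 code units, not code points.
    public class SurrogateDemo {
        public static void main(String[] args) {
            // U+1D400 MATHEMATICAL BOLD CAPITAL A lies outside the BMP, so
            // Java stores it as a UTF-16 surrogate pair: two char values.
            String s = "\uD835\uDC00";
            System.out.println(s.length());                       // 2
            System.out.println(s.codePointCount(0, s.length()));  // 1
        }
    }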

The only well-known contemporary languages that get this (somewhat) right are
Perl (since forever), Python 3, Ruby 1.9, and Swift, though ES6 is in the
process of getting up to speed too.

Python 3 and Swift have the issue of being totally tied to Unicode, so if the
politics and issues around Han Unification matter to you, then those two are
not usable for you either.

~~~
Hinrik
> The only well-known contemporary languages that get this (somewhat) right
> are Python 3, Ruby 1.9, and Swift, though ES6 is in the process of getting
> up to speed too.

And Perl.

~~~
pilif
You are right, of course (I knew this, as you can see from slide 50 of the PDF
I linked). I have edited my original post.

------
guard-of-terra
Most software developers don't think about encodings all day. This will give
them enough rope to hang half of the world.

Memory is dirt cheap these days anyway, so why now? And if you have mounds of
text, compress it.

~~~
Skinney
Reducing memory usage also reduces time spent in garbage collection.

~~~
Skinney
Why the downvote?

The smaller the strings, the more strings you can allocate before triggering a
garbage collection. Also, the smaller the objects, the less copying is needed
when promoting survivors.

~~~
aardvark179
And the more strings you can allocate in a thread-local allocation buffer
before your bump allocator has to request a new TLAB.

------
TheLoneWolfling
An interesting alternate idea, though I fear the constant factor may be too
high for a language like Java, especially given Java's whole lack-of-value-
types thing. (And Java 8's "value types" don't work for this, due to their
immutability. Yay.)

You store a string as a self-balancing tree, preferably one that supports
constant-time appends and prepends. (For example, a skew-binary random access
list or a finger tree.) Each node of the tree has an encoding enum, an array
(plus the length of said array), and the length including children
(alternatively, the length of the left child). A node's characters are all the
same length, given by the encoding of the node, and a node stores characters
by grapheme clusters.

This allows efficient string building, among other things. About the only
problems are that lookup/replacement in the middle of strings is now O(log n),
and the constant factor.
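
A rough Java sketch of the node layout being described (all names here are mine, and the self-balancing logic is omitted entirely):

    // Hedged sketch of the per-node layout described above; the field and
    // type names are hypothetical, not from any real library.
    enum NodeEncoding { ONE_BYTE, TWO_BYTE, GRAPHEME_CLUSTER_UTF8 }

    final class RopeNode {
        NodeEncoding encoding;  // fixed element width for this node's data
        byte[] data;            // this node's own characters/grapheme clusters
        int dataLength;         // number of elements encoded in data
        int leftLength;         // length of left subtree, for O(log n) lookup
        RopeNode left, right;   // children of the self-balancing tree
    }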

So, for example, if you have the string "This is a t̴̟̟͙̞̑ͩ͌͝est", it'd
(probably) be stored in three nodes: one that stores "This is a t" with an
encoding of one byte per character (a direct lookup into the first 256 Unicode
characters), one that stores "̴̴̟̟͙̞̟̟͙̞̑ͩ͌̑ͩ͌͝͝e" with an encoding of, what,
20 bytes per character (UTF-8 within a character), and one that stores "st",
same as the start.

Unfortunately, Java has too much overhead to make this practical.

------
PythonicAlpha
Python has used alternative representations for strings since Python 3.3, as
far as I know. When Latin-1 (ISO-8859-1) is enough, one byte per code point is
used; when UCS-2 is sufficient, two bytes; and in all other cases, four bytes.
So the whole range of Unicode is supported, yet the representation is
space-efficient, and because every code point inside one string has the same
size, the speed is also acceptable.

------
MrBuddyCasino
I have always wondered why they didn't do that; it seems like such a simple
optimization. With UTF-8 you would give up some constant-time operations like
character lookups, while ISO-8859-1 shouldn't cause any performance
regressions I can think of: once you add a non-Latin-1 character to a String
and the encoding has to change, the data has to be copied anyway due to String
immutability.
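
A toy sketch of that point (illustrative only, not the real JDK code): since strings are immutable, appending the first non-Latin-1 character already forces a copy, so the encoding switch can piggyback on it.

    // Appending a non-Latin-1 char to an immutable Latin-1 string forces
    // a copy anyway, so widening to two bytes per char at that moment
    // adds no extra copy. Big-endian UTF-16 is assumed here.
    static byte[] inflateAndAppend(byte[] latin1, char c) {
        byte[] utf16 = new byte[(latin1.length + 1) * 2];
        for (int i = 0; i < latin1.length; i++) {
            char w = (char) (latin1[i] & 0xFF);  // widen the Latin-1 byte
            utf16[2 * i] = (byte) (w >> 8);
            utf16[2 * i + 1] = (byte) w;
        }
        utf16[2 * latin1.length] = (byte) (c >> 8);
        utf16[2 * latin1.length + 1] = (byte) c;
        return utf16;  // fresh copy, now UTF-16 encoded
    }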

~~~
chrisseaton
I guess at the time the JVM was less aggressively optimised, and UCS-2 was a
single simple approach that represented all required code points. Anything
else would probably have looked like a micro-optimisation.

~~~
desdiv
At the time UCS-2 was the best thing they could choose from. Remember, Java
predates UTF-8.

------
moru0011
Pointless. It's the number of objects, not the size of primitive fields (e.g.
the char array), that hurts GC and consumes memory. This proposal will save
<10% on an average short string instance but will probably cost performance.

~~~
needusername
I assume this is motivated by heap analysis done by Oracle on their
cloud/SaaS/... applications. They may have quite a few large strings in the
old gen. To give you an example: for every application deployed, Tomcat builds
and retains a 200 KB String. Other candidates are SQL queries, manifests, or
in-heap caches.

But I agree with you on the performance side. My impression is that a lot of
Java applications are simple data pumps. They read data from a database and
send it to a client. It's hard to see how this JEP helps in that case:

- read bytes from the network (probably UTF-8)

- convert bytes to Java Strings

- compress Java String (ASCII, Latin-1, UTF-8) _new_

- "render" String to Writer (HTML, XML, JSON, ...)

- decompress Java String for Writer _new_

- encode again to UTF-8 for the OutputStream/browser

In this case this would increase allocation rates and increase CPU load for a
potentially smaller old gen.
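
In code, that round trip might look roughly like this (an illustrative sketch; with the JEP, the compress and decompress steps would happen invisibly inside String):

    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    // Sketch of the "data pump" path above; the raw bytes stand in for
    // whatever arrives from the network or database.
    static void pump(byte[] raw, Writer out) throws java.io.IOException {
        // decode to a String -- with this JEP, also compact it (new)
        String s = new String(raw, StandardCharsets.UTF_8);
        // "render" to the Writer -- the chars must be decompressed again (new)
        out.write(s);
        // the stream below the Writer then re-encodes everything to UTF-8
    }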

The only real way to optimize this would be to redesign the String class to be
encoding-aware and update the Writer classes accordingly. This is unlikely to
happen and would hurt other use cases.

~~~
TheLoneWolfling
Can this particular case not be solved by adding a constructor that doesn't
compress the string?

Edit: "There are no plans to add any new public APIs or other interfaces.". :(

~~~
needusername
> Can this particular case not be solved by adding a constructor that doesn't
> compress the string?

Presumably, if you use one of the byte[] constructors and the encoding is
already in the compressed format or something compatible, then yes.

Whether you'll be able to do that depends very much on how you implemented
your IO. We're still seeing way too many String#substring calls in our traces
after it became slow in 1.7.0_06. Some of them can be fixed easily, others not
so much.

~~~
TheLoneWolfling
Agreed. The copy-on-substring behavior is a real pain, and I don't know if
there's any workaround.

~~~
needusername
Not using String, e.g. using CharBuffer (and #slice) or building your own.
It's annoying and not always an option.
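
A small sketch of the CharBuffer approach (my own example):

    import java.nio.CharBuffer;

    // slice() returns a view over the same backing characters -- no copy,
    // unlike String#substring, which copies since 1.7.0_06.
    public class SliceDemo {
        public static void main(String[] args) {
            CharBuffer buf = CharBuffer.wrap("SELECT * FROM users WHERE id = ?");
            buf.position(14);
            buf.limit(19);
            CharBuffer slice = buf.slice();  // shares the underlying chars
            System.out.println(slice);       // prints "users"
        }
    }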

~~~
TheLoneWolfling
Yep, as you said, not always an option. No way to pass them to external things
expecting strings, for one. Wouldn't be an issue, except that you can't extend
strings.

------
outworlder
This can give rise to pointless optimizations, as clueless 'architects' from
BigCo learn about this and start forbidding non-ISO-8859-1 strings everywhere
on unsubstantiated performance claims.

------
based2
[http://en.wikipedia.org/wiki/ISO/IEC_8859-15](http://en.wikipedia.org/wiki/ISO/IEC_8859-15)

------
ape4
Once we have strings that are made of 8-bit chars we can put UTF-8 in there ;)

