
Java May Use UTF-8 as Its Default Charset - ingve
https://marxsoftware.blogspot.com/2018/02/java-utf-8-default-charset.html
======
emergie
The article is about the default charset picked up at compile time and
runtime. Internally Java/the JVM uses only UTF-16, and that is not going to
change for backward-compatibility reasons.

`char` is a 16-bit type, so Unicode 'characters' outside the Basic
Multilingual Plane are stored as high/low surrogate pairs; the API represents
such a pair as a single 32-bit int `codePoint`. Example of a problem:
String#length returns the number of chars, not the number of code points.
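
A minimal sketch of that length mismatch, using only standard `String` methods. The class name `CodePointLength` and the sample string are made up for illustration:

```java
final class CodePointLength {
    // Number of UTF-16 code units, which is what String#length reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' followed by U+1F600, which needs a high/low surrogate pair.
        String s = "a\uD83D\uDE00";
        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
        // codePointAt joins the pair at index 1 into one int.
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600
    }
}
```

So any code that indexes or truncates by `length()` can cut a pair in half unless it walks code points instead.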

The String/Character API is one big record of UTF-16 madness:
[http://grepcode.com/file/repository.grepcode.com/java/root/j...](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/lang/String.java#736)
[http://grepcode.com/file/repository.grepcode.com/java/root/j...](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8u40-b25/java/lang/Character.java#5180)

Most contemporary technologies use UTF-8 for string encoding and as an
exchange format. This back-and-forth UTF-8<>UTF-16 conversion in Java doesn't
come without a computational and memory price.
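
To make the round trip concrete, here is a rough sketch of the two conversions that happen at every I/O boundary. It only shows the steps and the size difference and does not measure the cost; `TranscodeDemo` and the sample string are illustrative names, not anything from the article:

```java
import java.nio.charset.StandardCharsets;

final class TranscodeDemo {
    // Encode a String (UTF-16 code units at the API level) into UTF-8 bytes.
    static byte[] toUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes back into a String.
    static String fromUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "h\u00E9llo"; // "héllo", written with an escape
        byte[] utf8 = toUtf8(s);
        // UTF_16BE is used so that no BOM is prepended to the output.
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(utf8.length);              // 6
        System.out.println(utf16.length);             // 10
        System.out.println(fromUtf8(utf8).equals(s)); // true
    }
}
```

Each call allocates a new array and copies the text, which is the price being described.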

------
bhauer
Yes, please! This aligns with my rants on the subject [1] because I want all
software to use UTF-8 by default. In my experience as an application developer
nearly all of my character encoding grief has been caused by an application or
library somewhere in my overall system architecture not using UTF-8 by
default. This is followed by me needing to find how to configure it. I am by
no means a character encoding expert, but I want everything to use UTF-8
unless there's a very particular need for something else (I'm still waiting to
find out what that might be).

[1]
[http://tiamat.tsotech.com/utf8-everything](http://tiamat.tsotech.com/utf8-everything)

~~~
lmm
Software that needs to display Japanese names (particularly if it may also
need to display Chinese names) tends to use traditional encodings to avoid
Unicode's Han unification.

(Some characters are considered by Unicode to be the same character despite
the Japanese and Chinese ways of writing that character being visually
distinguishable; Unicode regards this as the same character being written in
different fonts (just as in our Latin alphabet there are multiple visually
distinct ways to write the lower-case "a" or the number 4, one can choose to
cross "7"s or not, etc.) but customers sometimes don't find that acceptable;
in particular many Japanese people are very unhappy to see their names written
with the Chinese versions of those characters, which tend to be how
international fonts render them)

~~~
mey
So it is the same symbol, recognized as that symbol, meaning the same thing
but showing up as what some people consider comic sans?

So the problem is the font?

~~~
lmm
Yes except that there's no way to choose a font that all users will be happy
with - if you switch your application to a font that uses the Japanese way of
writing the character, some Chinese people with that character in their name
will be equally unhappy. (It's hard to create an exact analogy because there's
no country that the US has the same kind of tense relationship with, and we
don't attach quite the same personal meaning to the exact reading of our name
- we're used to seeing our names written in capitals and lower-case and being
the same name, we don't tell people the meanings of the characters used in our
names as we spell them, but I guess imagine having a 7 in your name that you
care about quite personally and then a program you use renders it as a French
7 even though you hate the French - you can tell that it's the same character,
but it's still wrong and feels very rude. It's the sort of thing that seems
trivial in the abstract but ends up being felt surprisingly personally - I'm
always a lot more pissed off than I expected to be when people call me "Imm").

You could do something like switching the font based on the system locale, but
at that point you're no better off than when you were using the system's
character encoding, and following a less well-trodden path.

~~~
unscaled
Well, there is really no way to solve this other than using a language-
specific font or markup to indicate the _language_ of the text. The first
solution is probably the more common one, since there are very few
international fonts that actually do CJK languages well to begin with.

The number of characters you need to cover is hard enough to make it a
daunting task, and if Han Unification didn't exist it could have been 3 times
harder...

The TRON Multilingual System (which was championed by the Han Unification
opponents in the 90s) basically does the second solution. It did avoid character
unification between Chinese, Japanese and Korean, but it still chose to
specify the language of the text to support different overall processing based
on the language (e.g. different ligatures may be activated for different
European languages).

If you're talking about Japanese users being unhappy that the characters in
their names change when they're rendered _within Chinese text_ - well, that's how
it's always been in newspapers, books and official documents. Signs all over
Japan (made by Japanese for Chinese tourists) show the Chinese rendition of
Japanese place names such as 涩谷 (Shibuya) and nobody seems to bother. It goes
the other way too, of course.

~~~
lmm
> Well, there is really no way to solve this other than using a language-
> specific font or markup to indicate the language of the text. The first
> solution is probably the more common one, since there are very few
> international fonts that actually do CJK languages well to begin with.

Indeed, but - more by accident than design - traditional codepages end up
making language-specific fonts easy, since you'll naturally use different
fonts for different encodings (and if you don't, it's easy to switch to doing
so, because you were already carrying around the encoding for each piece of
text so you know which texts are Chinese and which are Japanese). I mean,
using UTF-8 you could carry around that metadata manually and achieve the same
result, but that's a lot less fail-fast; you usually only realise you needed
that information after you've got a substantial amount of text of unknown
language stored in your system. Whereas if you fail to store which encoding
text is in then you (hopefully) notice sooner.
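
As a rough sketch of what "carrying that metadata manually" could look like with UTF-8/Unicode strings: keep a language tag next to every piece of text and pick the font from the tag. Everything here is hypothetical, including the `TaggedText` class and the placeholder font names:

```java
import java.util.Locale;

final class TaggedText {
    final String text;      // the Unicode text itself
    final Locale language;  // the language it was written in

    TaggedText(String text, Locale language) {
        this.text = text;
        this.language = language;
    }

    // Hypothetical font choice driven by the language tag, not the characters.
    String fontFamily() {
        switch (language.getLanguage()) {
            case "ja": return "example-japanese-font";
            case "zh": return "example-chinese-font";
            default:   return "example-default-font";
        }
    }

    public static void main(String[] args) {
        // Same code point (U+76F4), different language tags.
        TaggedText jp = new TaggedText("\u76F4", Locale.JAPANESE);
        TaggedText cn = new TaggedText("\u76F4", Locale.CHINESE);
        System.out.println(jp.fontFamily()); // example-japanese-font
        System.out.println(cn.fontFamily()); // example-chinese-font
    }
}
```

Nothing forces callers to supply a correct tag, which is the less fail-fast part.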

> Signs all over Japan (made by Japanese for Chinese tourists) show the
> Chinese rendition of Japanese place names such as 涩谷 (Shibuya) and nobody
> seems to bother.

Sure, which I think is what misled the Unicode Consortium - for ordinary text
no-one cares, and most people's names aren't affected. But the people whose
names are affected can take it quite personally.

------
codeulike
Is that "UTF-8 with BOM" or "UTF-8 without BOM"? - because I've had lots of
trouble with things supporting only one of those two options.

~~~
foobarrio
Wait, there's a UTF-8 with BOM? Why do you need a byte order mark for UTF-8?
The stream "unit" is a byte. I thought only UTF-16 and above had an optional
BOM?

~~~
grzm
They're likely referring to this behavior on Windows:

> _"Many Windows programs (including Windows Notepad) add the bytes 0xEF,
> 0xBB, 0xBF at the start of any document saved as UTF-8. This is the UTF-8
> encoding of the Unicode byte order mark (BOM), and is commonly referred to
> as a UTF-8 BOM, even though it is not relevant to byte order. A BOM can also
> appear if another encoding with a BOM is translated to UTF-8 without
> stripping it. Software that is not aware of multibyte encodings will display
> the BOM as three garbage characters at the start of the document, e.g. "ï»¿"
> in software interpreting the document as ISO 8859-1 or Windows-1252 or "∩╗┐"
> if interpreted as code page 437, a default for certain older Windows console
> applications."_

> _"The Unicode Standard neither requires nor recommends the use of the BOM
> for UTF-8, but warns that it may be encountered at the start of a file as a
> transcoding artifact. The presence of the UTF-8 BOM may cause problems with
> existing software that can handle UTF-8..."_

[https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark](https://en.wikipedia.org/wiki/UTF-8#Byte_order_mark)
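
Since this thread is about Java: as far as I know the JDK's UTF-8 decoder does not drop those three bytes and instead hands them back as a leading U+FEFF character. A minimal sketch of stripping it by hand follows; `Utf8Bom` and `decodeStrippingBom` are made-up names:

```java
import java.nio.charset.StandardCharsets;

final class Utf8Bom {
    // Decode UTF-8 bytes and drop a leading BOM (U+FEFF) if one is present.
    static String decodeStrippingBom(byte[] bytes) {
        String text = new String(bytes, StandardCharsets.UTF_8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        // 0xEF 0xBB 0xBF followed by "hi", as Notepad would save it.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decodeStrippingBom(withBom).length()); // 2
    }
}
```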

~~~
foobarrio
I should have searched. This is actually quite useful to be aware of. Thanks.

~~~
merb
You should remember it when creating CSVs, especially if your consumer uses
Windows. It saves a ton of "what is a charset" questions.
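
A minimal sketch of one way to do that in Java, on the assumption that the Windows consumer relies on the BOM to detect UTF-8 (check what your consumer actually expects). `CsvWithBom` and `encodeWithBom` are made-up names:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class CsvWithBom {
    // Encode CSV text as UTF-8 with the three BOM bytes in front.
    static byte[] encodeWithBom(String csv) {
        byte[] body = csv.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(body.length + 3);
        out.write(0xEF);
        out.write(0xBB);
        out.write(0xBF);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] bytes = encodeWithBom("name,city\n");
        System.out.println(bytes.length); // 13
    }
}
```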

