I know, and I was replying to the comment saying that UTF-16 is something that’s very rarely needed.
Personally, when working with strings in RAM, I have a slight preference towards UTF-16, for 2 reasons:
1. When handling non-Western languages in UTF-8, branch prediction fails all the time: spaces and punctuation use 1 byte/character, everything else 2-3 bytes/character. With UTF-16 it's 99% 2 bytes/character and surrogate pairs are very rare, i.e. simple sequential non-vectorized code is likely to be faster for UTF-16 (see the sketch after this list).
2. When handling East Asian languages, UTF-16 uses less RAM: these languages take 3 bytes/character in UTF-8 but only 2 bytes/character in UTF-16.
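To illustrate point 1, here's a rough sketch (plain C, mine, assuming well-formed input) of counting code points sequentially. The UTF-8 loop branches on the lead byte of every character, and on mixed non-Western text that branch keeps flipping between the 1-byte and 2/3-byte cases; the UTF-16 loop has a single branch that is almost never taken:

    #include <stdint.h>
    #include <stddef.h>

    /* Count code points in a well-formed UTF-8 buffer.
       The branch on the lead byte goes a different way for ASCII
       (spaces, punctuation) vs. 2/3/4-byte sequences. */
    size_t count_utf8(const uint8_t *s, size_t len) {
        size_t i = 0, n = 0;
        while (i < len) {
            uint8_t b = s[i];
            if      (b < 0x80) i += 1;   /* ASCII */
            else if (b < 0xE0) i += 2;   /* 2-byte sequence */
            else if (b < 0xF0) i += 3;   /* 3-byte: most CJK */
            else               i += 4;   /* 4-byte: supplementary planes */
            n++;
        }
        return n;
    }

    /* Same count over UTF-16: one branch, rarely taken,
       because surrogate pairs are rare in practice. */
    size_t count_utf16(const uint16_t *s, size_t len) {
        size_t i = 0, n = 0;
        while (i < len) {
            /* high surrogate -> skip the pair, otherwise one unit */
            i += (s[i] >= 0xD800 && s[i] < 0xDC00) ? 2 : 1;
            n++;
        }
        return n;
    }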
But that’s only a slight preference. In 99% of cases I use whatever strings are native on the platform, or whatever requires the minimum amount of work to integrate. When doing native Linux development that often means UTF-8; on Windows it’s UTF-16.
I can't remember if I ever ran into an issue with Java because it used UTF-16.
If you look at the example code in the OP link where it reads a line from a file, you only see UTF-16 mentioned in a comment.
At first glance, you just see a UChar* being filled.
https://begriffs.com/posts/2019-05-23-unicode-icu.html#readi...
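For reference, the calls in question look roughly like this (my own sketch of ICU's ustdio API, not a copy of the article's code; the UTF-16 only shows up via the UChar type):

    #include <unicode/ustdio.h>   /* ICU's stdio-style wrappers */

    int main(void) {
        /* Open a UTF-8 file; ICU transcodes it to UTF-16 internally. */
        UFILE *in = u_fopen("input.txt", "r", NULL, "UTF-8");
        if (!in) return 1;

        UChar line[1024];                 /* filled with UTF-16 code units */
        while (u_fgets(line, 1024, in) != NULL) {
            /* process the line as UChar* ... */
        }
        u_fclose(in);
        return 0;
    }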