Most of that is abstracted away by "use what your library uses".

I can't remember if I ever ran into an issue with Java because it used UTF-16.

If you look at the example code in the OP's link, where it reads a line from a file, UTF-16 is only mentioned in a comment.

At first glance, you only see a UChar* being filled.

https://begriffs.com/posts/2019-05-23-unicode-icu.html#readi...
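
For reference, here is a minimal sketch of that pattern using ICU's ustdio API (the filename and buffer size are made up, error handling is omitted). The conversion to UTF-16 happens inside u_fgets, and the encoding only surfaces through the UChar type:

    #include <unicode/ustdio.h>

    int main(void) {
        /* open a file, transcoding from UTF-8 to ICU's internal UTF-16 */
        UFILE *f = u_fopen("input.txt", "r", NULL, "UTF-8");
        if (!f) return 1;

        UChar line[1024];  /* UChar holds UTF-16 code units */
        while (u_fgets(line, 1024, f) != NULL) {
            /* process the line; it arrives already transcoded */
        }

        u_fclose(f);
        return 0;
    }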

I know, and I was replying to the comment saying that UTF-16 is something that’s very rarely needed.

Personally, when working with strings in RAM, I have a slight preference for UTF-16, for two reasons:

1. When handling non-Western languages in UTF-8, branch prediction fails all the time: spaces and punctuation use 1 byte/character, everything else 2-3 bytes/character. With UTF-16 it’s 99% 2 bytes/character and surrogate pairs are very rare, i.e. simple sequential non-vectorized code is likely to be faster for UTF-16 (see the sketch after this list).

2. When handling East Asian languages, UTF-16 also uses less RAM: these languages need 3 bytes/character in UTF-8 but only 2 bytes/character in UTF-16, so e.g. "日本語" takes 9 bytes in UTF-8 versus 6 in UTF-16.
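
To illustrate point 1, a minimal non-validating sketch of sequential decoders (hypothetical helper names; real code needs bounds and validity checks). On mixed-script text the UTF-8 length branch flips between the 1-byte and multi-byte cases constantly, while the UTF-16 surrogate branch is almost never taken, so it predicts well:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point from UTF-8. For non-Western text the
       branch on the lead byte alternates unpredictably between the
       1-byte (spaces, punctuation) and 2-3 byte cases. */
    static size_t utf8_decode(const uint8_t *s, uint32_t *cp) {
        if (s[0] < 0x80) {                        /* 1 byte: ASCII */
            *cp = s[0];
            return 1;
        } else if ((s[0] & 0xE0) == 0xC0) {       /* 2 bytes */
            *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        } else if ((s[0] & 0xF0) == 0xE0) {       /* 3 bytes */
            *cp = ((uint32_t)(s[0] & 0x0F) << 12)
                | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        } else {                                  /* 4 bytes */
            *cp = ((uint32_t)(s[0] & 0x07) << 18)
                | ((uint32_t)(s[1] & 0x3F) << 12)
                | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
            return 4;
        }
    }

    /* Decode one code point from UTF-16. The surrogate branch is
       taken only for characters outside the BMP, which are rare,
       so the branch predictor nearly always guesses right. */
    static size_t utf16_decode(const uint16_t *s, uint32_t *cp) {
        if (s[0] < 0xD800 || s[0] > 0xDBFF) {     /* not a surrogate pair */
            *cp = s[0];
            return 1;
        }
        *cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                       | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }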

But that’s only a slight preference. In 99% of cases I use whatever strings are native on the platform, or whatever requires the minimum amount of work to integrate. When doing native Linux development this often means UTF-8; on Windows it’s UTF-16.


Point 1 sounds interesting. Do you have numbers from an example?
