Most of that is abstracted away by "use what your library uses".

I can't remember if I ever ran into an issue with Java because it used UTF-16.

If you look at the example code in the OP's link, where it reads a line from a file, UTF-16 is only mentioned in a comment.

At first glance, you only see a UChar* being filled.

https://begriffs.com/posts/2019-05-23-unicode-icu.html#readi...
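
For reference, here is a minimal sketch of that pattern using ICU's ustdio API (the filename and buffer size are made up, error handling is omitted). The conversion to UTF-16 happens inside u_fgets, and the encoding only surfaces through the UChar type:

    #include <unicode/ustdio.h>

    int main(void) {
        /* open a file, transcoding from UTF-8 to ICU's internal UTF-16 */
        UFILE *f = u_fopen("input.txt", "r", NULL, "UTF-8");
        if (!f) return 1;

        UChar line[1024];  /* UChar holds UTF-16 code units */
        while (u_fgets(line, 1024, f) != NULL) {
            /* process the line; it arrives already transcoded */
        }

        u_fclose(f);
        return 0;
    }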

I know, and I was replying to the comment saying that UTF-16 is something that’s very rarely needed.

Personally, when working with strings in RAM, I have a slight preference for UTF-16, for two reasons:

1. When handling non-Western languages in UTF-8, branch prediction fails all the time: spaces and punctuation use 1 byte/character, everything else 2-3 bytes/character. With UTF-16 it’s 99% 2 bytes/character and surrogate pairs are very rare, i.e. simple sequential non-vectorized code is likely to be faster for UTF-16 (see the sketch after this list).

2. When handling East Asian languages, UTF-16 also uses less RAM: these languages need 3 bytes/character in UTF-8 but only 2 bytes/character in UTF-16, so e.g. "日本語" takes 9 bytes in UTF-8 versus 6 in UTF-16.
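
To illustrate point 1, a minimal non-validating sketch of sequential decoders (hypothetical helper names; real code needs bounds and validity checks). On mixed-script text the UTF-8 length branch flips between the 1-byte and multi-byte cases constantly, while the UTF-16 surrogate branch is almost never taken, so it predicts well:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point from UTF-8. For non-Western text the
       branch on the lead byte alternates unpredictably between the
       1-byte (spaces, punctuation) and 2-3 byte cases. */
    static size_t utf8_decode(const uint8_t *s, uint32_t *cp) {
        if (s[0] < 0x80) {                        /* 1 byte: ASCII */
            *cp = s[0];
            return 1;
        } else if ((s[0] & 0xE0) == 0xC0) {       /* 2 bytes */
            *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        } else if ((s[0] & 0xF0) == 0xE0) {       /* 3 bytes */
            *cp = ((uint32_t)(s[0] & 0x0F) << 12)
                | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        } else {                                  /* 4 bytes */
            *cp = ((uint32_t)(s[0] & 0x07) << 18)
                | ((uint32_t)(s[1] & 0x3F) << 12)
                | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
            return 4;
        }
    }

    /* Decode one code point from UTF-16. The surrogate branch is
       taken only for characters outside the BMP, which are rare,
       so the branch predictor nearly always guesses right. */
    static size_t utf16_decode(const uint16_t *s, uint32_t *cp) {
        if (s[0] < 0xD800 || s[0] > 0xDBFF) {     /* not a surrogate pair */
            *cp = s[0];
            return 1;
        }
        *cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                       | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }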

But that’s only a slight preference. In 99% of cases I use whatever strings are native on the platform, or whatever requires the minimum amount of work to integrate. When doing native Linux development this often means UTF-8; on Windows it’s UTF-16.


Point 1 sounds interesting. Do you have numbers from an example?
