> Unless you have very special needs, forget about UTF-16. I don’t think program...

Asooka · on June 11, 2019

Even on Windows it's best to keep your text in UTF-8 and convert it to and from UTF-16 when interacting with win32 APIs. Java, dotNet and JavaScript are the worst of all worlds because you're both stuck with wide characters (in their native string types) and have the intricacies of UTF-16 to consider. I guess the advice might have been better phrased as "Unless you're forced to, or have very special needs, stay away from UTF-16".

Const-me · on June 11, 2019

> it's best to keep your text in UTF-8 and convert it to and from UTF-16 when interacting with win32 APIs

It’s extra source code to write and then support, extra machine code to execute, and likely extra memory to malloc/free. Too slow, in my book automatically means “not best”.

> Java, dotNet and JavaScript are the worst of all worlds because you're both stuck with wide characters (in their native string types) and have the intricacies of UTF-16 to consider.

Just a normal UTF-16, like in WinAPI and many other popular languages, frameworks and libraries. E.g. QT is used a lot in the wild.

> the advice might have been better phrased as

It says exactly the opposite, “Use the encoding preferred by your library and convert to/from UTF-8 at the edges of the program.”

vardump · on June 11, 2019

Spot on. When coding against "raw" win32 API (or NT kernel APIs and perhaps rare native usermode NT API), using UTF-16 is the only way to keep your sanity. Converting strings back and forth between UTF-8 and UTF-16 in that kind of case is just senseless waste of CPU cycles.

One API call might take multiple strings and each conversion often means memory allocation and freeing — something you usually try to avoid as much as possible if it's something that's going to run most of the time the system is powered on.

The situation can be different in cross-platform code. In those cases, UTF-8 is a preferable abstraction.

Just don't use it for filenames. Filenames are just bags of bytes on at least on Windows (well, 16-bit WCHARs, but the idea is same) and Linux, and considering them anything else is not a great idea.

fwip · on June 11, 2019

"Too" slow depends on a lot of factors.

Const-me · on June 11, 2019

When you’re writing code that you 100% sure won’t ever become a performance bottleneck, you still care about time of development. Very often, unless it’s a throwaway code, also about cost of support.

Writing any code at all when that code is not needed is always too slow, this is regardless of any technical factors.

fwip · on June 11, 2019

Very little code in this world is needed. Much of it is, however, useful.

The person you replied to obviously isn't advocating for something they find useless.

Perhaps you could have instead asked "Why do you recommend doing this? I don't understand the benefit." But instead, you decided that they're advocating to do something useless for no reason.

Const-me · on June 11, 2019

> you decided that they're advocating to do something useless for no reason.

No, I decided they’re advocating to do something harmful for no reason.

They're advocating to waste hardware resources (as a developer I don’t like doing that), waste development time (as a manager I don’t like when developers do that). But the worst of all, UTF8 on Windows and converting to/from UTF16 at WinAPI boundary is a source of bugs, the kernel doesn’t guarantee the bytes you get from these APIs are valid UTF16, quite the opposite, it guarantees to treat them as opaque chunk of words.

UTF-8 has it’s place even on Windows, e.g. it makes sense for some network services, and even for RAM data when you know it’ll be 99% English so it saves resources, and that data never hits WinAPI. But as soon as you’re consuming WinAPI, COM, UWP, windows shell, any other native stuff, UTF-8 is just not good.

ChrisSD · on June 11, 2019

That very much depends on what you're doing. Constantly reencoding between UTF-16 and UTF-8 would be pointless. Not to mention that "UTF-16" on Windows usually means UCS-2, so you risk losing information if you reencode.

But if your application's strings are mostly independent of the WinAPI then sure, use UTF-8 and only convert when absolutely necessary.

Const-me · on June 11, 2019

> "UTF-16" on Windows usually means UCS-2

Wikipedia says it's UTF-16 since Windows 2000: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows

Supporting Windows NT4 or Windows 95 in 2019 is what I would call "very special needs".

burntsushi · on June 11, 2019

UTF-16 is only a convention for file names on Windows. This is why we have things like WTF-8: https://simonsapin.github.io/wtf-8/

Const-me · on June 11, 2019

It’s works exactly the same way on Linux. Neither kernel nor file system changes the bytes passed from userspace to kernel, regardless whether they are valid UTF-8 or not.

Pass invalid UTF-8 file name, and these exact bytes will be written to the drive. https://www.kernel.org/doc/html/latest/admin-guide/ext4.html says “the file name provided by userspace is a byte-per-byte match to what is actually written in the disk”

Also try this test: https://gist.github.com/Const-me/dcdc40b206fe41ba200fa46b2e1... Runs just fine on my system.

kazinator · on June 11, 2019

It's not exactly the same on Linux, because Linux doesn't have duplicate pairs of system calls for one-byte character strings and wide strings. Linux system calls are all char strings: null-terminated arrays of bytes. It's a very clear model. Any interpretation of path-names as multi-byte character set data is up to user space.

Const-me · on June 11, 2019

> Linux doesn't have duplicate pairs of system calls for one-byte character strings and wide strings.

Neither is Windows. These “DoSomethingA” APIs aren’t system calls, they’re translated into Unicode-only NtDoSomething system calls, implemented in the kernel by OS or kernel mode drivers as ZwDoSomething.

Windows system calls all operate on null-terminated arrays of 16-bit integers. It's a very clear model. Any interpretation of path names as characters is up to user space.

burntsushi · on June 11, 2019

I don't know what you're talking about. You said Windows uses UTF-16 and pointed to Wikipedia. I'm only pointing out that that's only true by convention. Windows, even today, does not require that its file names be UTF-16.

Whether Linux analogously does the same or not (indeed it does) isn't something I was contesting.

ygra · on June 12, 2019

The file system / object manager is only one part of the whole, though. Object names and namespaces in general will have that restriction, but in user-space there's a lot of Unicode that's treated as text, not a bag of code units. And those things are UTF-16.

ChrisSD · on June 11, 2019

Wikipedia is wrong on the technical details.

E.g. the filesystem accepts any sequence of WCHARs, whether or not they're valid UTF-16: https://docs.microsoft.com/en-us/windows/desktop/FileIO/nami...

> the file system treats path and file names as an opaque sequence of WCHARs.

The same is true more generally, there's no validation so anything goes.

Const-me · on June 11, 2019

From your comment I was replying to:

> "UTF-16" on Windows usually means UCS-2, so you risk losing information if you reencode.

On Windows, you normally call this API to convert UTF-8 to UTF-16: https://docs.microsoft.com/en-us/windows/desktop/api/stringa... As you see, the documentation says it converts to UTF-16, not UCS-2, so no information is lost re-encoding.

And the article you’ve linked says “file system treats path and file names as an opaque sequence of WCHARs.” This means no information is lost in the kernel, either.

Indeed, kernel doesn’t validate nor normalize these WCHARs, but should it? I would be very surprised if I ask an OS kernel to create a file, and it silently changed the name doing some Unicode normalization.

Linux kernel doesn’t do that either, https://www.kernel.org/doc/html/latest/admin-guide/ext4.html says “the file name provided by userspace is a byte-per-byte match to what is actually written in the disk”

ChrisSD · on June 11, 2019

I'm sorry if I was unclear but my point was that when you receive a string from the Windows API you cannot make any assumptions about it being valid UTF-16. Therefore converting it to UTF-8 is potentially lossy. So if you then convert it back from UTF-8 to UTF-16 and feed it to the WinAPI you'll get unexpected results. Which is why I feel converting back and forth all the time is risky.

This is one reason why the WTF-8[0] encoding was created as a UTF-8 like encoding that supports invalid unicode.

[0] https://simonsapin.github.io/wtf-8/

ygra · on June 11, 2019

> I would be very surprised if I ask an OS kernel to create a file, and it silently changed the name doing some Unicode normalization.

Doesn't OS X do that? AFAIK files names are in NFD there.

loeg · on June 11, 2019

Yes, Mac normalizes and decomposes. It's weird.

bhaak · on June 11, 2019

Most of that is abstracted away by "use what your library uses".

I can't remember if I ever ran into an issue with Java because it used UTF-16.

If you look at the example code of the OP link where it reads a line from a file, you only see UTF-16 mentioned in a comment.

At a first glance, you only see a UChar* being filled.

https://begriffs.com/posts/2019-05-23-unicode-icu.html#readi...

Const-me · on June 11, 2019

I know, and I was replying to the comment saying that UTF-16 is something that’s very rarely needed.

Personally, when working with strings in RAM, I have slight preference towards UTF-16, 2 reasons:

1. When handling non-Western languages in UTF-8, branch prediction fails all the time. Spaces and punctuations use 1 byte/character, everything else 2-3 bytes/character in UTF-8. With UTF-16 it’s 99% 2 bytes/character, surrogate pairs are very rare, i.e. simple sequential non-vectorized code is likely to be faster for UTF-16.

2. When handling east Asian languages, UTF-16 uses less RAM, these languages use 3 bytes/character in UTF-8, 2 bytes/character in UTF-16.

But that’s only slight preference. In 99% cases I use whatever strings are native on the platform, or will require minimum amount of work to integrate. When doing native Linux development this often means UTF-8, on Windows it’s UTF-16.

detaro · on June 11, 2019

1. sounds interesting. Do you have numbers on an example?