UTF-16 is only a convention for file names on Windows. This is why we have thing...

Const-me · on June 11, 2019

It works exactly the same way on Linux. Neither the kernel nor the file system changes the bytes passed from userspace to the kernel, regardless of whether they are valid UTF-8 or not.

Pass an invalid UTF-8 file name, and these exact bytes will be written to the drive. https://www.kernel.org/doc/html/latest/admin-guide/ext4.html says: “the file name provided by userspace is a byte-per-byte match to what is actually written in the disk”.

Also try this test: https://gist.github.com/Const-me/dcdc40b206fe41ba200fa46b2e1... Runs just fine on my system.
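For readers who can't open the link, here is a minimal Python sketch in the same spirit. It is not the contents of the gist. The helper name `roundtrip_raw_name` and the sample file name are made up for illustration, and it assumes a POSIX system whose filesystem accepts arbitrary bytes in names; where the filesystem rejects the name, the helper returns None instead.

```python
import os
import tempfile


def roundtrip_raw_name(raw_name: bytes):
    """Create a file named raw_name and return the name the OS lists back.

    Returns None if the filesystem refuses to create the file.
    """
    with tempfile.TemporaryDirectory() as tmp:
        # Using a bytes path makes os.listdir return raw bytes as well.
        directory = os.fsencode(tmp)
        path = os.path.join(directory, raw_name)
        try:
            with open(path, "wb"):
                pass
        except OSError:
            return None
        names = os.listdir(directory)
        return names[0] if names else None


# The bytes 0xFF and 0xFE never occur in well-formed UTF-8.
bad_name = b"not-utf8-\xff\xfe.txt"
listed = roundtrip_raw_name(bad_name)
print(listed == bad_name)
```

On a filesystem such as ext4 (without case folding enabled), the listed name should be byte-for-byte identical to the one passed in.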

kazinator · on June 11, 2019

It's not exactly the same on Linux, because Linux doesn't have duplicate pairs of system calls for one-byte character strings and wide strings. Linux system calls are all char strings: null-terminated arrays of bytes. It's a very clear model. Any interpretation of path-names as multi-byte character set data is up to user space.
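To illustrate "interpretation is up to user space", here is a small Python sketch. The byte string and the `interpret` helper are hypothetical; the point is only that the same bytes the kernel would store unchanged can be a valid name under one encoding and invalid under another.

```python
# Bytes exactly as a path-taking system call would receive them.
raw = b"caf\xe9.txt"


def interpret(name: bytes, encoding: str):
    """Decode a file name in user space; return None if it is not valid."""
    try:
        return name.decode(encoding)
    except UnicodeDecodeError:
        return None


as_latin1 = interpret(raw, "latin-1")  # 0xE9 is 'é' in Latin-1
as_utf8 = interpret(raw, "utf-8")      # 0xE9 followed by '.' is not valid UTF-8

print(repr(as_latin1), repr(as_utf8))
```

The kernel never makes this choice; each program (shell, file manager, `ls`) applies its own decoding when it needs to display the name.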

Const-me · on June 11, 2019

> Linux doesn't have duplicate pairs of system calls for one-byte character strings and wide strings.

Neither does Windows. These “DoSomethingA” APIs aren’t system calls; they’re translated into Unicode-only NtDoSomething system calls, implemented in the kernel by the OS or kernel-mode drivers as ZwDoSomething.

Windows system calls all operate on null-terminated arrays of 16-bit integers. It's a very clear model. Any interpretation of path names as characters is up to user space.

burntsushi · on June 11, 2019

I don't know what you're talking about. You said Windows uses UTF-16 and pointed to Wikipedia. I'm only pointing out that that's only true by convention. Windows, even today, does not require that its file names be UTF-16.

Whether Linux does the analogous thing (indeed it does) isn't something I was contesting.
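The "only by convention" point can be sketched in Python: a sequence of 16-bit code units is not necessarily well-formed UTF-16, because a surrogate can appear without its partner. The helper `is_well_formed_utf16` and the sample sequences are made up for illustration; this only checks the encoding and does not call any Windows API.

```python
import struct


def is_well_formed_utf16(units):
    """Return True if the 16-bit code units form valid UTF-16."""
    # Pack the code units as little-endian 16-bit integers.
    raw = struct.pack("<%dH" % len(units), *units)
    try:
        raw.decode("utf-16-le")  # strict decoding rejects lone surrogates
        return True
    except UnicodeDecodeError:
        return False


paired = [0xD83D, 0xDE00]        # surrogate pair encoding U+1F600
lone = [0x0061, 0xD800, 0x0062]  # 'a', unpaired high surrogate, 'b'

print(is_well_formed_utf16(paired), is_well_formed_utf16(lone))
```

A name like `lone` is a perfectly good array of 16-bit integers, which is all the claim above requires, yet a strict UTF-16 decoder refuses it.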

ygra · on June 12, 2019

The file system / object manager is only one part of the whole, though. Object names and namespaces in general will have that restriction, but in user-space there's a lot of Unicode that's treated as text, not a bag of code units. And those things are UTF-16.