What it takes to pass a file path to a Windows API in C++ (gamedev.place)
153 points by AshleysBrain on Sept 24, 2023 | 168 comments



Alternative sane approach:

- Forget about long paths unless you really need to care. You need to care if you're accepting a file path from another app, for example. You don't need to care if you're using your own files in your install directory; too much of Windows doesn't support long paths (including Explorer, as mentioned, not to mention third-party apps), so it's very unlikely that you'll need to deal with them if you're just handling your own data.

- Work in UTF-16 when dealing with file paths. Convert back and forth at the point of entry / exit to / from your app. It's unsurprising that you need to call the conversion function twice in order to find out how much memory to allocate; that's just how that works (see the sketch at the end of this comment).

Don't change random registry settings or code pages on the user's computer, that's mad.
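
For reference, here's roughly what that boundary conversion looks like with the usual two-call MultiByteToWideChar pattern (a minimal sketch, not production code; the helper name is made up and error handling is reduced to a throw):

    #include <windows.h>
    #include <stdexcept>
    #include <string>

    // UTF-8 in, UTF-16 out, done once at the point where a path enters the app.
    // MB_ERR_INVALID_CHARS makes invalid UTF-8 fail instead of being silently replaced.
    std::wstring utf8ToUtf16(const std::string& utf8)
    {
        if (utf8.empty()) return std::wstring();
        // First call: ask how many UTF-16 code units are needed.
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), (int)utf8.size(), nullptr, 0);
        if (len <= 0) throw std::runtime_error("invalid UTF-8 in path");
        // Second call: do the conversion into the sized buffer.
        std::wstring out(len, L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8.data(), (int)utf8.size(), &out[0], len);
        return out;
    }

Then the wide API gets called as e.g. CreateFileW(utf8ToUtf16(path).c_str(), ...), and the reverse (WideCharToMultiByte) happens on the way out.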


> Don't change random registry settings or code pages on the user's computer, that's mad.

A million times this.

Multiple Valve games set the microphone volume in Windows to 100% whenever you launch the game, which, for my setup, makes me sound unbearably loud. Unless your app has a valid reason to change my settings (say, because I've set it up to do that on a schedule or under some condition), don't mess with them.


Story time

When the USB audio device class was created [1], it had support for AVR-like audio controls (volume, balance, tone etc.), and, as would be reasonable, made all that work with decibels. But, as far as I know, that part of the standard got ignored by basically everyone, including some of those who wrote it (Microsoft), by presenting e.g. the on-device volume control as a 0-100 scale, like Windows and ALSA still do. Except... some hardware devices actually implemented the spec, including the PCM29xx family of audio codecs. These, and clones of them, have been used in quite a few USB audio cards and some very low-end audio interfaces over the decades. And for those, "100" in Windows means something like "+30 dBFS" [2] - so you'll get extreme distortion with basically any signal.

[1] 1.0 is actually impressively complex and basically models the entire stack of audio devices someone could possibly have had in their TV cabinet in 1995, including stuff like Dolby and surround decoders.

[2] This is defined around 5.2.2.2.3 and/or 5.2.2.4.3.2 in https://www.usb.org/sites/default/files/audio10.pdf


> Work in utf-16 when dealing with file paths. Convert back and forth at the point of entry / exit to / from your app.

But that's not compatible with other OSes, so you can't write native multiplatform C++ code, just because Windows keeps insisting on its worst-of-both-worlds UTF-16.

(worst of both worlds because it's not single value per unicode character like UTF-32, and not ASCII compatible like UTF-8)


Not to be that guy, but UTF-32 is fixed-length per code point, not fixed-length per character.

Characters can consist of multiple code points.

I agree that utf-8 is preferable to utf-16 though in most cases where the root script is English/Latin.


> Characters can consist of multiple code points.

I enjoy being "that guy", and so, to be unfathomably pedantic, I will point out that the Unicode standard actually uses the word "character" to refer to _encoded characters_, i.e. the mapping from a code point to an abstract character [0]. So 1 code point = 1 character for Unicode. Of course, in real life, character is a silly made-up word that means whatever you think it means and using it to refer to an extended grapheme cluster or glyph is probably closer to how people really think of them anyway.

[0]: See section 3.4 of https://www.unicode.org/versions/Unicode15.0.0/UnicodeStanda...


Your position is obviously wrong both with regard to the Unicode standard and with regard to the meaning of the word "character."

Unicode does not define "character" because that term is far too imprecise for a deterministic standard. And the Unicode standard is extremely clear that an abstract character may be represented by multiple code points.

"A single abstract character may also be represented by a sequence of code points—for example, latin capital letter g with acute may be represented by the sequence <U+0047 latin capital letter g, U+0301 combining acute accent>, rather than being mapped to a single code point."

> Of course, in real life, character is a silly made-up word that means whatever you think it means and using it to refer to an extended grapheme cluster or glyph is probably closer to how people really think of them anyway.

Human language and writing are the defining inventions of the known universe. Unicode just happens to be one way of representing them. Referring to them as "silly and made up" in comparison to Unicode is nothing short of extreme mental illness.


Unicode Standard section 3.4:

> Unless specified otherwise for clarity, in the text of the Unicode Standard the term character alone designates an encoded character.

Reading is free. I even gave you the section number.

> A single abstract character may also be represented by a sequence of code points

An "abstract character" is not the same thing as a character. The whole point of my comment was that the general word "character" is a general class of definitions but means nothing by itself. Much like the word "number" - you will notice that mathematicians define many kinds of numbers, but never the word itself.

> Human language and writing are the defining inventions of the known universe. Unicode just happens to be one way of representing them. Referring to them as "silly and made up" in comparison to Unicode is nothing short of extreme mental illness.

All words are made up. I'm sorry you had to find out this way.


> All words are made up. I'm sorry you had to find out this way.

It's so, so interesting that you believe this to be true.

Do you believe all things made by humans are made up? Is HN "made up"? Are other people "made up"?


Makes me wonder how bad utf-8 in off-latin environments really is: what's the fraction of strings that is "language content", and how much is colons, quotes, decimal numbers and the like? What fraction of coding cultures embraces home locale script for identifiers, and how many continue on the trajectory set in motion by pre-unicode programming languages? My guess would be that utf-8 is still quite bad, but I wouldn't be entirely surprised if actual numbers turned out considerably less bad than one might expect. And how much of the "utf-8 tax" remains if the actual format of most data at rest and in transit isn't json but gzipped json?


Experiments I recall from a while back from browsers and the JDK indicated the vast majority of string memory is consumed by Latin-1 strings (even for apps in Asia). Of course, those experiments are something like 20 years old and likely biased (websites were smaller, content was still Latin-1 dominated). It would be interesting to know the results from Chinese apps and websites today. For websites, the markup used the most space and that stuff remains Latin-1. For apps I don't know.

But also, text size in memory is typically a joke because the expensive bit is all the multimedia that accompanies text these days. Remember, 100 MiB is 25 million characters even in UTF-32. That's an hour or so of high quality audio, or a few seconds of video.

In terms of compression, the smaller initial file will typically compress better, so I'm not sure what you mean by the "utf-8 tax". UTF-8 is not slower to process than UTF-32 as far as I'm aware, except for one small corner case of random codepoint access (but then, in UTF text processing you really shouldn't be doing that anyway, I think - though I'm not a UTF expert).


Speaking as someone who did what GP suggested, it can be cross-platform; I just abstracted the string to path and path to string conversions, as well as any path operations that are different across platforms.

The first thing to realize to make it work is that paths are not strings, so you actually don't want to treat them like they are.
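
As an interface sketch of what that kind of abstraction can look like (all names here are made up, conversions omitted; on Windows they'd call MultiByteToWideChar/WideCharToMultiByte, elsewhere they'd be pass-through):

    #include <string>

    class Path
    {
    public:
    #ifdef _WIN32
        using native_string = std::wstring;   // UTF-16 for the Win32 *W APIs
    #else
        using native_string = std::string;    // raw bytes on POSIX systems
    #endif

        // Strings only cross the boundary through explicit conversions.
        static Path fromUtf8(const std::string& utf8);
        std::string toUtf8() const;

        // Path operations live on the path type, not on strings.
        Path parent() const;
        Path operator/(const Path& component) const;

        const native_string& native() const { return value_; }

    private:
        native_string value_;
    };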


Our native multiplatform c++ code uses wstring (win, mac, android, wasm). It’s not the worst option. AFAIK There is no 100% method to have single codebase tackle the differences in platforms. My recipe for keeping sane is to have single implementation for storage (wstring), and then very explicit points of converting paths from ‘platform’ to program internal.

Platform differences unfortunately go way deeper than just using UTF-16. Lots of gnarly details. No platform is worse than another - they just have different kinds of good and bad in different places.


If you're calling Win32 functions it's already not compatible with other OSes... If you want to use cross platform code with files, use the C or C++ standard library's file handling functions.

C++ also supports wstring, but its implementation is dependent on the platform for god knows what reason.

utf-16 is also supported in C11:

https://en.wikipedia.org/wiki/C11_(C_standard_revision)#Chan...

> Improved Unicode support based on the C Unicode Technical Report ISO/IEC TR 19769:2004 (char16_t and char32_t types for storing UTF-16/UTF-32 encoded data, including conversion functions in <uchar.h> and the corresponding u and U string literal prefixes, as well as the u8 prefix for UTF-8 encoded literals).

You can use uchar.h for converting instead of the Win32 functions.
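
For instance, something along these lines with mbrtoc16 (a sketch; the helper name is made up, and note the C conversion functions work from the current C locale's multibyte encoding, so this only does UTF-8 to UTF-16 if that locale encoding is UTF-8, e.g. after setlocale(LC_ALL, ".UTF-8") on a recent UCRT):

    #include <cuchar>    // std::mbrtoc16 (C11's uchar.h)
    #include <cwchar>    // std::mbstate_t
    #include <string>

    bool mbToU16(const std::string& in, std::u16string& out)
    {
        std::mbstate_t state{};
        const char* p = in.data();
        size_t remaining = in.size();
        out.clear();

        while (remaining > 0)
        {
            char16_t c16;
            size_t rc = std::mbrtoc16(&c16, p, remaining, &state);
            if (rc == (size_t)-1 || rc == (size_t)-2)
                return false;          // invalid or truncated sequence
            if (rc == (size_t)-3)
            {
                out.push_back(c16);    // second half of a surrogate pair, no input consumed
                continue;
            }
            if (rc == 0)
                rc = 1;                // an embedded NUL consumed one byte
            out.push_back(c16);
            p += rc;
            remaining -= rc;
        }
        return true;
    }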

Here's a helper I wrote to wrap these string types in std::string and std::wstring:

    bool mbStrToWideChar(string const& in, wstring& out)
    {
        // First call: ask how many wide characters are needed
        // (passing -1 as the length includes the null terminator).
        int bufferSize = MultiByteToWideChar(
            CP_ACP, 0, in.c_str(), -1, nullptr, 0
        );
        if (!bufferSize)
        {
            cerr << "Failed to get buffer size for ANSI string [" << in << "]!" << endl;
            return false;
        }

        // Second call: do the actual conversion into the sized buffer.
        wstring buffer(bufferSize, L'\0');
        int conversionResult = MultiByteToWideChar(
            CP_ACP, 0,
            in.c_str(), -1,
            &buffer[0], bufferSize
        );
        if (!conversionResult)
        {
            cerr << "Failed to convert ANSI string [" << in << "] to Wide string!" << endl;
            return false;
        }

        out = buffer.c_str();   // drops the extra embedded terminator
        return true;
    }
and

    bool wideStrToMbStr(wstring const& in, string& out)
    {
        // First call: get the required buffer size in bytes.
        int bufferSize = WideCharToMultiByte(
            CP_UTF8, 0,
            in.c_str(), (int)in.size(),
            nullptr, 0,
            nullptr, nullptr
        );
        if (!bufferSize)
        {
            wcerr << L"Failed to get buffer size for Wide string [" << in << L"]!" << endl;
            return false;
        }

        // Resize the output string to the required size.
        out = string(bufferSize, '\0');

        // Second call: do the actual conversion.
        int res = WideCharToMultiByte(
            CP_UTF8, 0,
            in.c_str(), (int)in.size(),
            &out[0], (int)out.size(),
            nullptr, nullptr
        );
        if (!res)
        {
            wcerr << L"Failed to convert Wide string [" << in << L"] to UTF-8 string!" << endl;
            return false;
        }
        return true;
    }


Have a few functions taking UTF-8 paths for the filesystem operations you need, have separate versions for Windows and for other systems, and enable one or the other version with the preprocessor?
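
A minimal sketch of that shape (the function name is made up; on Windows it converts the UTF-8 path to UTF-16 and uses the wide CRT open, elsewhere it passes the bytes straight through):

    #include <cstdio>
    #include <string>
    #ifdef _WIN32
    #include <windows.h>
    #endif

    FILE* openForReadUtf8(const std::string& utf8Path)
    {
    #ifdef _WIN32
        // UTF-8 -> UTF-16, then the wide CRT open.
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, nullptr, 0);
        if (len <= 0) return nullptr;
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8Path.c_str(), -1, &wide[0], len);
        return _wfopen(wide.c_str(), L"rb");
    #else
        // POSIX systems take the path bytes as-is.
        return std::fopen(utf8Path.c_str(), "rb");
    #endif
    }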


This is how 99% of cross-platform C++ works. Very few C++ is actually "cross platform" in that it compiles the same functions for different platforms


Linux uses UTF-8 for display of filenames, but the paths themselves allow non-UTF-8 byte sequences.


It's an oversimplification to say Linux uses UTF-8 for display. Linux just stores bags of bytes and leaves interpretation to userspace. You could store paths in ISO-8859-1 if you wanted. The only special bytes are '\0' and '/'.


Not only could you, this actually happens in practice. Not necessarily ISO-8859-1, but specifically SHIFT-JIS, a Japanese encoding that you will run into if you run old Japanese software. To make things even worse, SHIFT-JIS is almost entirely incompatible with any form of UTF based encoding, and depending on the attempted normalisation you can quickly end up with paths that have been messed up multiple times in a row.

I forget what Japanese emulator I tried to run when I found all of this out, but suffice it to say I didn't enjoy the experience.


I buy digital Japanese Doujin music on sites like booth.pm, and their provided zip files extract "beautifully" on Linux if you simply `unzip` them.


Lots of Japanese products are switching or have switched to UTF-8, so I have no doubt that modern ZIP files will extract without a problem.


Don't you mean `unzip -O shift-jis` them?


Except that Linux does support several filesystems that do claim to store the filenames in a specific encoding and therefore the kernel must do conversion. Mostly Windows FSes, but nowadays case-insensitive ext4 also applies.


These are exceptions, not the norm. The VFS layer does not care.


Linux doesn't display them; the shell (or terminal emulator) does. Linux just sends the bytes back to userland and lets the shell interpret them into something presentable to a human. And even then, tons of distros default the global LANG to C for some reason, so UTF-8 display isn't even working by default.


Even better: Each user can have his own locale and charset, and may even change that per program/shell/session. One may save filenames as UTF-8, one as ASCII, one as ISO8859-13, one as EBCDIC.

However, the common denominator nowadays is UTF-8, which has been a blessing overall getting rid of most of the aforementioned mess for international multi-user systems. And there is the C.UTF-8 locale which is slowly gaining traction.


Store it as std::filesystem::path


How do you write code where you can't have one function that opens a file?


If you need to consider slightly older Windows versions, it's WTF-16, not UTF-16.


Isn't it UCS-2 rather than any form of UTF-16?


Not after Windows 2000.


There’s no such thing as WTF-16—or if you want to use that label for “potentially-ill-formed UTF-16”, then everything that says it does UTF-16 is actually WTF-16, because no one validates UTF-16.


Yes and no - some programs or systems of programs regularly transform between UTF-8 and UTF-16. A file path in Windows can be "valid" (or whatever we wanna call it) and still not convert cleanly to UTF-8.

Edit to clarify: many systems which deal with non-path UTF-16 strings, nowadays are so intertwined with the UTF-8 world, that they get the validation "for free" so to speak, here and there, wherever they interact with the rest of the world.

Paths though, remain a common little corner case, working fine until it suddenly doesn't, for some weird combination.


That’s just potentially-ill-formed UTF-16, which is what all UTF-16 is in practice. The novel thing about UTF-8 environments is that they (or most of ’em, anyway) finally actually validate stuff.


This is exactly the approach I took in my own cross-platform "standard" library. It works.

In particular, paths must be a different type than strings.

As a bonus: it becomes easier to implement niceties like path concatenation and realpath.


You can also just over-allocate to avoid having to call it twice, or do some really basic heuristics to fast-path the call (only very specific UTF-8 sequences, the 4-byte ones for characters outside the BMP, need more than 2 bytes in UTF-16).
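
A sketch of the over-allocation variant: N bytes of UTF-8 can never produce more than N UTF-16 code units, so sizing the buffer to the input length always suffices and the conversion becomes a single call (the helper name is made up):

    #include <windows.h>
    #include <string>

    std::wstring utf8ToUtf16OneCall(const std::string& in)
    {
        if (in.empty()) return std::wstring();
        std::wstring out(in.size(), L'\0');                 // worst-case size
        int written = MultiByteToWideChar(CP_UTF8, 0,
                                          in.data(), (int)in.size(),
                                          &out[0], (int)out.size());
        out.resize(written > 0 ? written : 0);              // shrink to what was written
        return out;
    }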


Speaking of how strange the current state of the Windows API is, there are actually 3 different APIs that can move the mouse cursor: SetCursorPos [1], SendInput [2], and mouse_event [3].

Applications might react differently to input events generated by these APIs. For example, in Windows 11's display settings, the position of screens cannot be dragged if input events are coming from SetCursorPos, but it works fine when using SendInput. Microsoft's own PowerToys uses a mixture of both under certain (complex) conditions, but I never found out the actual difference between them.

I was writing an application that sends mouse input from a Linux machine to a Windows one (similar to Synergy). I originally received mouse movement events that were already accelerated (with user- or system-defined acceleration factors), and I found that none of the three Windows APIs accepts relative movements without applying further acceleration (i.e. Windows will always accelerate them again, making the mouse hard to use). I ended up hooking directly into evdev to get raw mouse movements and letting Windows accelerate them.

[1] https://learn.microsoft.com/en-us/windows/win32/api/winuser/... [2] https://learn.microsoft.com/en-us/windows/win32/api/winuser/... [3] https://learn.microsoft.com/en-us/windows/win32/api/winuser/...
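
For anyone curious, injecting a relative movement via SendInput looks roughly like this (a minimal sketch; dx/dy still go through the system's pointer acceleration, which is exactly the problem described above):

    #include <windows.h>

    void moveMouseRelative(LONG dx, LONG dy)
    {
        INPUT input = {};
        input.type = INPUT_MOUSE;
        input.mi.dx = dx;
        input.mi.dy = dy;
        input.mi.dwFlags = MOUSEEVENTF_MOVE;   // relative move, unlike SetCursorPos
        SendInput(1, &input, sizeof(INPUT));
    }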


Only three ways to move the mouse seems relatively tame compared to the forest of Linux API calls when it comes to input (depends on X11/Wayland among other things). Also, your last link clearly states that the API call has been superseded, so if you follow the documentation you only have two options.

I don't think the difference between the two is all that strange. One sets the position of the cursor, the other interacts with the system like a normal mouse. The mouse and the cursor are separate things, and they're handled at different levels in the API stack, like XSendEvent and sending data to libinput.


I guess it depends on what level you're generating the events at. On Linux, it would be completely reasonable to inject the input events at the input device level.

https://www.kernel.org/doc/html/latest/input/event-codes.htm...

This is very straightforward (EV_REL) and requires a very small amount of code. There can be different problems to deal with when working at this level, but in my experience, everything works as expected with keyboards, mice, and gamepads.


That's the thing, it really depends on what you're trying to accomplish. If you're trying to move the mouse as if some remote program was a mouse attached to your computer, generating inputs makes sense. If you provide some kind of remote support application that just needs to make the mouse appear at the place the remote tech indicates, changing the raw cursor makes sense.

Both approaches are reasonable and both are implemented in desktop operating systems for this reason.


What do remote desktop tools do? TeamViewer allows mouse movement from even a phone touchscreen just fine, and unless they're computing the inverse of Windows's mouse acceleration, they must have another solution.


Some at least don't even bother to send relative mouse movements; what matters is where you click. So they disable the guest mouse pointer and rely on the host instead, and then only send positions every now and then and when you are actually doing something interactive.

Synergy-type applications don't have that freedom, because the host mouse cursor doesn't extend to the other device's display.


On desktop clients, they fetch the cursor bitmap and render it locally, then send absolute movements. For mobile clients, it is possible (and better) to send relative coordinates and let Windows accelerate that motion, since touch events are not accelerated. Doing this acceleration actually makes the client feel more like a laptop touchpad if you are using relative mode.


Actually, you're missing ClipCursor as another way of moving the cursor.

Also, mouse_event is just a wrapper around what's basically SendInput(mouse)


As explained in https://utf8everywhere.org/#windows , you can write simple wrapper functions `narrow`/`widen` used when you are about to call Windows API functions.

    ::SetWindowTextW(hwnd, widen(someStdString).c_str());
Implementation is straightforward, relying on `WideCharToMultiByte`/`MultiByteToWideChar` to do the conversion:

https://github.com/neacsum/utf8/blob/master/src/utf8.cpp


I have good experiences with this approach. It isolates the weirdness of Windows and lets the rest of your code do things idiomatically.


Microsoft added a setting in Windows 10 to switch the code page over to utf-8 and then in Windows 11 they made it on by default. Individual applications can turn it on for themselves, so they don't need to rely on the system setting being checked.

I haven't tried it yet but with that you can just use the -A variants of the winapi with utf-8 strings. No conversion necessary.


Do you have any references about it being enabled by default in Windows 11? I've seen conflicting reports and often seems to vary depending on the system locale whether it gets enabled or disabled by default.


You are missing OP's point - this still costs you 2 extra calls.

If this cost really matters (and practically speaking it never does), then, as the other commenter said, the correct solution is to just use OS-native encoding for all file system paths and names used by the program, hidden behind an abstraction layer if needs be. UTF16 for Windows, UTF8 elsewhere.


The above manifesto makes the argument to use UTF-8 *everywhere*, even on Windows where the internal representation is not natively UTF-8.

The conversion overhead is really negligible: https://utf8everywhere.org/#faq.cvt.perf

(note: the two api calls per conversion is because how those specific functions work, first call to get the size to allocate, second to do the actual conversion, but you can always use another library in the implementation for the utf8<->utf16 conversion that might be more optimized than those windows api functions)


Especially negligible versus the trip to the file system you are setting up for.



Not all API calls are for filesystem access.


Sure, but basically everything having to do with file paths on Windows, the topic here, relates to the file system.


"2 extra calls" is a weird metric here. Some calls are vastly more expensive than others. Syscalls come with a significant cost, encoding conversion of short strings (esp. filenames) does not. Hiding just the syscalls behind an abstraction layer is vastly simpler than doing that and additionally hiding the string representation, so "UTF-8 everywhere" is IMHO the right solution.


I thought the OP's point is there are too many considerations when doing this?

Someone is suggesting a way of making it less tedious, and your response is "performance?!" even though in both scenarios you're running the same code and it is likely the compiler in release would remove the intermediary.


> your response is "performance?!"

No, that's not what I said.


It gets worse. Windows doesn't use UTF-16 for file paths; it uses UCS-2. Example: Windows paths can contain unpaired UTF-16 surrogates, which are illegal in UTF-16. Unix paths are no better; they are "bags of bytes" and not necessarily UTF-8. File paths are not guaranteed to be Unicode normalized. Even if they are canonically equivalent (equal graphemes) it doesn't mean they are binary equivalent (equal code points). You need to treat file paths as their own special thing and not as strings.


Windows uses potentially-ill-formed UTF-16.

No language or environment that uses UTF-16 validates it, ever. Not a single one that I know of. But that doesn’t make it UCS-2, it just makes it potentially-ill-formed UTF-16.


I just have the impression that they s/UCS-2/UTF-16/g without changing anything. Was that not the case? Genuinely curious whether there were UTF-16-specific changes made before they declared that Windows uses/supports UTF-16.


Windows validates UTF-16 when ANSI functions are called: they convert string arguments to UTF-16 and call the Unicode function, then convert results from UTF-16 back to ANSI.


That’s nothing to do with validating UTF-16. What you’re describing is converting from code pages to UTF-16, which incidentally should produce well-formed UTF-16.


And Linux paths can contain anything that isn't a nul character or a /.

Your points about normalization only matter insofar as Windows is generally set up to be case insensitive (which can be disabled).

All UTF strings face normalization issues for a whole host of reasons.


> (which can be disabled)

Wouldn't that break spectacularly? The whole "the fs is case insensitive" assumption has kind of been a given on Windows for about 40 years now.

I remember accidentally creating two file names differing only in case on an NTFS flash drive in Linux. Windows programs sure didn't like that.


It was added in 2018 as a folder attribute. There isn't a global option but you can annotate something as case sensitive.


Wrong. Windows uses UTF-16. It used to use UCS-2 a long time ago.


Unpaired surrogates are not allowed in UTF-16, so whatever Windows uses is not quite it.


Windows doesn't bother checking every string after every modification, but neither does Linux with UTF-8. You can pass invalid UTF-8 to tons of APIs and almost all of them will just work, just like you can with UTF-16 in Windows.

Doing a string validation check in every single API call would waste cycles for no good reason.


Linux (the kernel) doesn't claim (AFAIK) to use UTF-8. It takes NUL-terminated strings in any charset you choose (or none) and either compares them with other NUL-terminated strings, or spits them back out again later.

Interpreting a string as encoding text in particular character set is, as far as the kernel is concerned, a problem for userspace.

Windows does make claims of "supporting" UTF-16.


Linux supports UTF-8 the same way Windows does: by trusting you that the encoding is correct, unless you access APIs that are encoding dependent somehow. The Windows API is a lot more complete than the Linux API, so comparing them is rather pointless.

Windows supports UTF-16, it doesn't guarantee UTF-16 correctness. The native methods annotated with the W suffix all take UTF-16, so unless you want to render your own fonts, you're going to need to provide it with either that or ASCII.


It's possible to get a file into a Windows filesystem whose name contains "*" or ":" characters. Windows reacts quite poorly when it finds those in file names, but only in certain APIs; others pass them through without concern. Makes for fun explosions.

The "260 character limit" is a good one; mostly seen that one in "I can do anything to this file but delete it!" complaints.


It's also possible to create files whose names start (or contain) a NUL character. Doing so will trigger every single antivirus, but it's doable nonetheless and messes a lot of things very thoroughly.


You can also create a file (or even better, a folder) named CON or COM4. Good luck doing anything with it using Explorer. (15 years ago I saw some malware using that to make removal difficult.)


Also possible in the Windows Registry, with entries that no normal UI tools (including Regedit) can remove.


A similar quirk of the registry API is that it has a simplified set of methods that don't specify the data type of the value, so you can have an application that expects to read a null terminated string value, but substitute a binary value with no null termination. If the app doesn't handle that right, it's buffer overrun time.

Usually that's foot-shooting, but sometimes you can do that in HKCU as a low privileged user to a value that gets read by a higher privileged process and causes it to "misbehave".
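
The defensive counterpart, for reference, is something like RegGetValue with a type filter, which rejects mismatched types and guarantees the returned string is null-terminated (a sketch; the key and value names here are made up):

    #include <windows.h>
    #include <string>

    bool readStringValue(std::wstring& out)
    {
        wchar_t buffer[MAX_PATH];
        DWORD size = sizeof(buffer);   // in bytes, including the terminator
        LSTATUS rc = RegGetValueW(
            HKEY_CURRENT_USER, L"Software\\ExampleVendor\\ExampleApp", L"InstallDir",
            RRF_RT_REG_SZ,             // reject REG_BINARY and friends
            nullptr, buffer, &size
        );
        if (rc != ERROR_SUCCESS)
            return false;
        out = buffer;                  // guaranteed null-terminated by RegGetValue
        return true;
    }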


you can also get "files into a Linux filesystem" that contain the "/" character, which also leads to fantastic kernel crashes...

Admittedly, it's (probably) harder to pull off than on Windows, but still...


I had the can't-delete-a-file problem the other day. Got around it by moving its parent folder to a less nested location.


C++'s filesystem package (part of the standard library) should take care of all that nonsense for you. Its whole point is so that you don't have to pay attention to the format of the underlying filenames, directories, etc.


> C++'s filesystem package (part of the standard library) should take care of all that nonsense for you. Its whole point is so that you don't have to pay attention to the format of the underlying filenames, directories, etc.

Unfortunately by default it doesn't, because the constructors which take `const char *`/`std::string` will default to the local codepage, instead of just using UTF-8. So you have to use `u8path` or C++20's `u8` string literals to get sane portable behavior.

So yes, you technically don't have to pay attention to the format, but as it is typical of many features in C++ the default is wrong, and you have to manually make sure you don't write code that's subtly broken.

(Or alternatively use a linter which would warn about this, but does such linter exist?)
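
To make that concrete, a small sketch of the three spellings (the filename is just an example):

    #include <filesystem>
    #include <string>

    std::string utf8Name = "caf\xC3\xA9.txt";   // the UTF-8 bytes for "café.txt"

    // Interpreted in the local codepage on Windows - can silently mangle the name:
    std::filesystem::path wrong(utf8Name);

    // Explicitly UTF-8 (u8path is deprecated in C++20, but still works):
    std::filesystem::path ok = std::filesystem::u8path(utf8Name);

    // C++20: u8 literals are char8_t, which path always treats as UTF-8:
    std::filesystem::path ok20 = std::filesystem::path(u8"café.txt");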


I think some people might be wary of the std::filesystem APIs because the standard allows implementations to completely disregard TOCTOU issues internally, to the point of breaking memory safety [0]:

> A file system race is the condition that occurs when multiple threads, processes, or computers interleave access and modification of the same object within a file system. Behavior is undefined if calls to functions provided by subclause [filesystems] introduce a file system race.

It's not just implementation-defined behavior, but full UB! You're utterly at the mercy of your implementation to do something reasonable when it encounters a TOCTOU issue, or, for that matter, any kind of concurrent modification to a file or directory. And C++ has a long history of implementations being unreliable in their behavior when UB is encountered.

[0] https://wg21.link/fs.race.behavior#1


The more limited C API (system calls) is also full of race opportunities. The windows file APIs are better, but not a lot better.

Anyway, the problem described in the post (scanning the string twice when converting) is as race prone, or not, as the filesystem API.


> The more limited C API (system calls) is also full of race opportunities. The windows file APIs are better, but not a lot better.

Sure, there are opportunities for races, but POSIX and the Windows API limit the possible outcomes of concurrent modification, and they also provide tools (fixed file handles, etc.) for well-written programs to prevent TOCTOU bugs. Meanwhile, the std::filesystem API just throws its hands in the air and says that filesystem races are UB, period, making it unusable if the program isn't in complete control of the directory tree it operates on.

> Anyway, the problem described in the post (scanning the string twice when converting) is as race prone, or not, as the the filesystem API.

I don't see what you mean? The typical scenario here is, you receive some absolute or relative path from an external source, encoded in ASCII or UTF-8 or some other non-UTF-16 encoding, and you want to operate on the file or directory at that path. Your internal path string is never going to change, only the filesystem can change. So there aren't any races from scanning your internal string twice to re-encode it; races can only come from using the re-encoded path in multiple calls to the filesystem API.


> most modern software uses UTF-8

[citation needed]

UTF-8 is used for storage and serialization, yes, but most mainstream programming languages store unicode strings in memory as UTF-16. The notable exception is Rust, which does indeed use UTF-8 even in memory.


> most mainstream programming languages store unicode strings in memory as UTF-16

This is increasingly frequently false. For stupid historical reasons, far too many languages adopted potentially-ill-formed UTF-16 semantics (never well-formed, sadly), but this makes for such a bad representation that they’ve mostly abandoned it as a sole representation, in favour of more complicated arrangements. For example, Java strings can internally be single-byte Latin-1 or two-byte UTF-16. JavaScript engines do similar tricks these days, and although Python 3 just never did UTF-16 at all (it uses code point semantics, which is almost worse than UTF-16 semantics) it does a 1-byte/2-byte/4-byte hybrid representation thing too.

For one of the most extreme examples of becoming unglued from UTF-16, the Servo browser engine manages to use WTF-8 (UTF-8 plus lone surrogates, to allow representing ill-formed UTF-16) despite being forced by these historical reasons to expose UTF-16 semantics, and although the performance of some operations are harmed by it, others are improved, and in the balance it was a pretty convincing win when they tested it (which many people did not expect).

PyPy is also a great example of changing your encoding to one that has mismatched semantics: it applies UTF-8 (or I suppose it must actually be UTF-8 plus surrogates) to Python, and likewise they found it surprisingly good for performance.

These things give me hope that even environments like the JVM and .NET CLR might eventually manage to switch their internal string representations to UTF-8 or almost-UTF-8.

But also I emphasise that these are historical languages that are stuck with the horrendous 16-bit decisions of the early-to-mid-’90s. When you look at newer languages, they almost always choose something more sensible, and UTF-8 is almost always at the heart of it.


> it uses code point semantics, which is almost worse than UTF-16 semantics

What's so bad about storing a unicode string as a series of codepoints? UTF-8 is also a series of codepoints in essence.

> When you look at newer languages, they almost always choose something more sensible, and UTF-8 is almost always at the heart of it.

I still think that UTF-8 for memory is a bad idea. For example, it wastes a lot of bits to make sure that a string can be decoded starting from an arbitrary byte offset. This is generally a desirable property for files and some networking applications, but it's an absolute waste of space when used in memory.


Code point semantics, not scalar value semantics which is what Unicode strings are supposed to be. That is, a Python string can contain surrogate code points, which cannot be encoded in any UTF-*:

  >>> '\udead'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
(UTF-8 is a sequence of 8-bit code units, representing a sequence of Unicode scalar values.)

—⁂—

> For example, it wastes a lot of bits to make sure that a string can be decoded starting from an arbitrary byte offset.

If you ditch self-synchronisability and ASCII purity of extension by ditching the first two bits (“1-0”) on continuation bytes, you can encode some more briefly: roughly, U+0800–U+1FFF go from three-byte to two-byte, and U+10000–U+FFFFF go from four-byte to three-byte, leaving just U+100000–10FFFF as four-byte. But honestly the difference isn’t often that much (unlike UTF-16 → UTF-8, which roughly halves memory usage most of the time, commonly even saving when dealing with text that makes heavy use of other scripts that are more compact in UTF-16 than in UTF-8 since you commonly mix things in with ASCII markup), and if you’re really trying to shave memory, other techniques like compression are far more effective.

I love microoptimisation (and spent a large number of hours on my Casio GFX-9850GB PLUS, where having only 30KB of memory encouraged learning to shave single bytes here and there!), but there’s also a lot to be said for consistency. Yes, some purposes could have an encoding that is more efficient than UTF-8. But doing this complicates matters quite a bit, and I think we’re better off with just UTF-8.

Also I will note that the self-synchronising nature of UTF-8 is useful for some operations: you can scan through strings much more quickly that way. So it’s not pure waste as a memory representation.


That's arguably because those languages are not modern. Nobody would design a modern language today that uses UTF-16 (unless they really need compatibility with Java or JavaScript).

Go also uses UTF-8 btw.

I think you're being needlessly pedantic (I know it's unusual for HN).


> most mainstream programming languages store unicode strings in memory as UTF-16

Uh. What languages do you have in mind? C/C++ don't have any preference for UTF-16. Python3 doesn't use UTF-16. As you mentioned, Rust doesn't.


Java, C# and the rest of .net, JavaScript, Swift/ObjC.

C/C++ don't have a preference but, for example, if you're writing for Windows, you'd probably want to use wide characters. I'm not sure what kind of strings Linux GUIs use but I suspect it's wide characters as well.



Qt uses UTF-16, Gtk uses UTF-8.


In addition to the examples already given, JavaScript as well (although modern implementations do wacky stuff with the internal representations, the APIs are required to expose only UTF-16).


java, .net, apple ecosystem (i.e. swift/objc),...



Which is interesting because for example C#/.NET base class library provides very high quality APIs

They even wrote a book about API design "Framework Design Guidelines: Conventions, Idioms, and Patterns for Reusable .Net Libraries"


C# was effectively built from scratch with the benefit of hindsight from both C++ and Java. Decent practices should be considered the bare minimum, surely.


Why are people obsessing about the overhead of two function calls? What application are you writing that needs to optimize passing millions of long file names in real time? Is this some kind of file path based video game?


Git, for example, used by millions of developers every day.


Crazy idea... but what if you stored the files in a fast purpose-built database instead of the normal file system? You would have to mount it, I suppose, to be able to edit the files. But I think it could be very performant.


https://en.wikipedia.org/wiki/WinFS

I was actually secretly cheering for that project, because I thought it could be actually innovative

But, as I look back on that with my modern understanding of Microsoft's backward compatibility focus, I realize such a thing would never, ever fly


The filesystem is already a database for storing files. There is no particular reason a different approach would be faster in principle.


Git actually does have a separate disk format that's way more database-like, but it's only used for older data that doesn't need to be shuffled around as often. Git was written assuming the filesystem is decently fast at metadata mutation, which is a good assumption on Linux and less so on Windows.


Git needs some queries that filesystems aren't very good at natively, like "get all modified files". There are workarounds, but yeah.


  .WAD
  .PAK
  .PK3


It's still going to be a tiny fraction of what git does, so speeding it up will have a negligible impact on performance.


Git reads and writes a lot of files in the working tree as well as the object store. It's a large part of what it does


I'm aware that passing a file name is a part of working with a file. I'm just skeptical that it's such a large fraction of the computation that optimizing it would provide a noticeable performance boost.


Windows kernel uses UTF-16, so reencoding happens anyway, even if you don't know about it.


One time I installed a buggy version of Arduino and it installed recursively, creating file paths far longer than 260 characters. I was unable to delete the files using Explorer, the command line, or PowerShell. I had to use Cygwin. That was the most disappointed I've ever been with Windows.


You can remove these files using the command line by using the \\?\ syntax to bypass all normalisation and compatibility tricks Windows employs. Definitely works in CMD, never checked in Powershell (but I assume it works).

I've had a similar issue with a recursive folder and transferring it to another host in Linux. A megabyte on my machine filled up the server pretty quick.
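
The same escape hatch works from code, for what it's worth; a minimal sketch (the path is a made-up example, and with the \\?\ prefix it has to be fully qualified and use backslashes):

    #include <windows.h>

    // The \\?\ prefix bypasses path normalisation and the MAX_PATH limit.
    BOOL ok = DeleteFileW(L"\\\\?\\C:\\some\\very\\deeply\\nested\\folder\\file.txt");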


Half a decade ago anything that needed Node packages - even just a Gulp build system - would wind up constructing extremely deep directory trees that Explorer would choke on. So you would npm install the project, realize you needed a different version of something, try to delete node_modules, and fail horribly with a bunch of long path errors.


The built-in "robocopy" tool (normally used for making backups / mirrors) is capable of deleting any messed up directories like that, by mirroring an empty directory over it.


In addition to maxpath (which can be ignored in most regular apps), I remember short filenames used to have some severe performance implications. For instance, with Picasa we tried never to stat a file over Samba that didn't exist, because it would make Windows enumerate all the other files in the folder (presumably to see if they were a SFN match). That was a while ago (XP), but I just strace'd and it still seems to do it.


> Alternatively you can set the process code page to UTF-8 and call the 'A' variant API directly, but only sometimes, and only with Windows 10 v1903+, and you might still have to change the system locale setting and reboot

There is no need to do this; you can just call the A variant by name. Wide char and ascii functions are distinguished with W or A at the end, such as CreateFileA or CreateFileW. CreateFile is a macro.


I always directly call the A or W versions of these functions. It makes things unambiguous.

Additionally I have my own Windows header file which undefines all of the macros so my function names do not get mangled with A and W suffixes.
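
Roughly like this, for anyone who hasn't seen the pattern (a small sketch):

    #include <windows.h>

    // Undo the generic-name macros so our own identifiers named CreateFile,
    // GetMessage, etc. don't get silently rewritten to the A/W variants.
    #undef CreateFile
    #undef GetMessage

    // Call the wide variant explicitly.
    HANDLE h = CreateFileW(L"example.txt", GENERIC_READ, FILE_SHARE_READ,
                           nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);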


It really caught me by surprise. Sure, I'm not a dev but in my mind A variant was for 9x compatibility.


Ah yes, the summoner of outages if you have production systems with windows server and allow user input in paths.


The article should be called "What it takes to do anything with the Windows API in C++". For example, this Unicode issue is applicable to almost every API there. And it is not as simple as "call MultiByteToWideChar twice". Microsoft did not support UTF-16; they supported UCS-2, which they called "Unicode", but in fact back then it was just a wider ASCII. The worst thing about UCS-2 is that it does not round-trip.

For example, imagine that you wrote your Enterprise MS Tech Contoso Ltd.(R) authentication system, where a user registers in UTF-8 on a webpage, then some layer checks that a user with the same name does not exist (in the modern UTF-8 DB), then it writes the user's basic information into the UCS-2 encoded legacy "Enterprise DB". Aaand... voila, Evil User overrides another user's data, because UTF-8 can't losslessly represent arbitrary sequences of 16-bit code units (you should have used https://simonsapin.github.io/wtf-8/ instead, but your Enterprise tech was destroyed).


They called it “Unicode” because, at the time, thanks to Han Unification, UCS-2 was widely believed to be the one and only encoding that anyone would ever need. Two bytes per codepoint. No more, and no less.

C#/.NET even inherit this misnomer and call UTF-16LE “Unicode”: System.Text.Encoding.UnicodeEncoding


Now try doing this for Windows 98 or older, especially the non-ASCII file paths part (which was a common use case outside of the US).



> The MSLU was announced in March 2001, and was first made available as a compatibility layer for Unicode-supporting code written for the then-new Windows XP RC1 in the July 2001 edition of Microsoft's Platform SDK.

People have had to deal with local-language file names since Windows 3.1 at least, and it became very common with Windows 95. Good luck if you wanted to deal with files named in more than one non-ASCII language (very common in Europe or the Middle East).


So, a lot of what people think of when they think of an OS is UI-based (GUI/TUI/CLI), but kernel APIs are the real bread and butter, and this is where the UNIX philosophy really shines.

Passing a string as a file name in C++ with macOS or with Linux in my experience was simple. The permitted length of using ASCII characters is about 4 times as long (may god have mercy on your soul).

I am not here to shit on Windows but the Windows devs clearly have a very different set of priorities (e.g. backwards compatibility) than the other breeds of modern OS devs.

I guess to a large extent we all expect to be in the browser (gross) in some number of years but Windows seems so much harder from the perspective of someone that has programmed for Unixen and studied Windows as an OS.


  Passing a string as a file name in C++ with macOS or with Linux in
  my experience was simple. The permitted length of using ASCII characters
  is about 4 times as long (may god have mercy on your soul).
Macs are simple enough too if you ignore the quirks. HFS (which was never seen on a modern MacOS) usually stores no information about what encoding was used for filenames. It's entirely dependent upon how the OS was configured when the file was named (although some code I've seen suggests that something in System 7 would save encoding info in the finderinfo blobs). So non-latin stuff gets mangled pretty easily if you're not careful. Filenames are pretty short (32 bytes) minus the one byte because (except for the volume name) they're Pascal strings with the length at the front.

HFS+ (which is what you'll find on OSX volumes) uses UTF-16 but then mandates its own quirky normalization and either Unicode 2.1 or 3.2 decomposition depending… which can create headaches because most HFS+ volumes are case-insensitive. It's been so long since I've touched anything Cocoa, but I assume the file APIs will do the UTF-16 dance for you and the POSIX stuff is obviously OK with ASCII.

And, of course, let's not forget the heavily leveraged resource forks. Of course NTFS has forks but nobody seems to use them.

APFS standardized on Unicode 9 w/ UTF-8.

CDs? Microsoft's long filenames (Joliet) use big endian UTF-16 (via ISO escape sequences that theoretically could be used to offer UTF-8 support). Which sounds crazy until you realize their relative simplicity (a duplicate directory structure) compared to the alternative Rockridge extensions which store LFNs in the file's metadata with no defined or enforced encoding. UDF? Yeah that's more or less UTF-16 as well.

I think we're perhaps forgetting just how young UTF-8 is.


Thanks for the comment. HFS/HFS+ is a fascinating bit of history.

It strikes me how developer ergonomics have improved as computers have become cheaper/increased in power.

As to UTF-8, we may say it’s young but in 14 months it will be old enough to purchase and consume alcohol in the United States. From other comments it seems like Microsoft don’t think the tech debt is too great so long as they have good libraries in C#


In fairness, not that many people (including Microsoft) write native C++ apps for Windows anymore, certainly not without tried and tested libraries.

You can write C# code dealing with reading/writing files once and compile it on Linux/Windows/Mac and it'll work pretty much the exact same.


Microsoft does write native C++ apps for Windows all the time.

First of all, games are apps; second, even if the apps unit keeps mostly ignoring WinUI/UWP (itself written in C++), whatever they do with Web widgets is mostly backed by C++ code, not C#.

One of the reasons why VSCode is mostly usable despite being Electron is exactly the amount of external processes written in C++.

Applications being written in .NET is mostly on the Azure side.


“Applications being written in .NET is mostly on the Azure side.”

You are of course, wrong about this. Most .Net/C# code is not Azure (yet anyway) -related; it is the billions of lines of enterprise application code across businesses around the world (for me, since 2001)…


You are not Microsoft apps unit, the subject of what is being discussed here.


Microsoft has literal teams with budgets of several millions USD just for the file open/save in Office which is written in C++.


But despite that they cannot fix it. Consistently they make perhaps the worst APIs of any major tech company.


Maybe for file handling in C++, but DirectX/HLSL is the best Graphics API I've worked with and C# is easily my favorite language to develop in. It's easy for us to talk shit about Win32 today, 30 years after it was initially developed, but there are myriad historical reasons why UTF-16 is used by Java, Windows, and other languages/runtime environments and why it's not simple to just break compatibility with decades of software running at hospitals and financial trading firms because the 32 year old armchair experts at HN said so.

According to wikipedia:

https://en.wikipedia.org/wiki/Universal_Coded_Character_Set

> The UCS has over 1.1 million possible code points available for use/allocation, but only the first 65,536, which is the Basic Multilingual Plane (BMP), had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2006 that all software sold in its jurisdiction would have to support GB 18030. This required software intended for sale in the PRC to move beyond the BMP.


I will take Win32 over anything related to X Windows, OpenGL and Vulkan, with pleasure.


True. They broke the basic Windows search functionality some time in 2007 and broke Outlook search around 2013 and neither of which have been fixed since.


Those file/save dialogs are an application of their own, and with multiple versions across all supported platforms.


It's not all backwards compatibility. I'm willing to bet that some (a large part?) is just sloppy software development.

SQL Server (2017?) breaks if you update it on a UTF-8 Windows because it runs a T-SQL script that doesn't work with that code page. That script is a mess. Some of it is indented using tabs, some with spaces. Trailing whitespace. Yuck.


My hot take: Code quality is not measured by formatting issues, but by error resilience and number of actual bugs.

Much of modern linting and commit hooking is dedicated to checking whitespace placement, variable naming and function lengths but the well-formatted newly rewritten code is still buggy as hell - it just looks pretty


Formatting doesn’t remove bugs, but it’ll help you detect them. Linted code helps you scan the code faster and provides valuable pattern recognition, allowing us to detect common mistakes.

There have been numerous bugs caused by incorrect code formatting, most notably an SSL security bug from 2014: https://dwheeler.com/essays/apple-goto-fail.html

Another reason for formatting is the “minimal diff” paradigm. If a formatting rule would not be followed, in the next commit hitting this code, the format would also be affected, causing a larger diff than necessary.

There are other reasons for simple format linting, but the reasons above are the most profound.

Lastly, formatting is part of a range of static code analysis tools. Generally, formatting inconsistencies are the easiest to detect and resolve, as opposed to more sophisticated tools.


True, but personally for me at least it is easier to find bugs in neatly formatted consistent code than something written in multitude of styles.

It is kind of like "pattern matching" on the error patterns.


I've often found that people that don't care about white-space also don't care that much about other aspects of code quality

The inverse may not have the same correlation, since it can be automated.


I never understood what backward compatibility is preserved by the Windows API not supporting >260-character file paths. It would work just the same if you pass any short path, and no old application expects a long path anyway.


Decades of binaries are in use doing something like

    wchar_t filename[MAX_PATH];
    CreateFileW(...)

in both first-party and third-party Windows code, often in deep call stacks passing file names around. Changing the length requires fixing them all.

See comments in https://archives.miloush.net/michkap/archive/2006/12/13/1275...


Your example isn't problematic API-wise, because CreateFileW doesn't need to care if you pass in 16 characters or 1600 - if it does, that is mostly a matter of refactoring and not inherent to how the function works. The real problem is APIs that inherently assume you pass in a buffer of at least MAX_PATH characters, because you provide a pointer but no size, and the API is expected to write to that pointer. This affects many shell32 getter functions, e.g. SHGetFolderPath (its newer replacement, SHGetKnownFolderPath, sidesteps this by allocating the string for you).

But for functions outside of Windows itself, this is the exact reason why the long path feature is hidden behind an opt-in flag.


MAX_PATH is a #define. So its value is baked in to old binaries.

In the RAM-constrained world of the past, you would stack-allocate `char buff[MAX_PATH]` and do all your strcpy/strspn in there with no problems.

Now, if that app receives a long path into a too-short buffer, it will overflow that stack buffer and may cause exploitable problems.


They have API calls that fill user-supplied buffers that have room for MAX_PATH characters.

See for example https://learn.microsoft.com/en-us/windows/win32/api/fileapi/..., which also shows how they gradually made the input argument more flexible:

- “By default, the name is limited to MAX_PATH characters. To extend this limit to 32,767 wide characters, prepend "\\?\" to the path”

- “Starting with Windows 10, Version 1607, you can opt-in to remove the MAX_PATH limitation without prepending "\\?\"”

I also guess there’s lots of code that sees those paths (anti-virus software, device drivers)


This article doesn't mention std::filesystem::path once, which is odd for a C++ article. Just set your executable code page to UTF-8, and let the filesystem::path constructor convert for you. It's pretty easy actually.


1607 has reached End of Servicing. So unless you need to support software (= $$$) on older versions that shouldn't be a concern.

Second, I don't see a risk in setting `LongPathsEnabled` on a machine as it's only 1 piece of the puzzle. You still need to opt-in at the application level:

https://learn.microsoft.com/en-us/windows/win32/fileio/maxim...


Most Windows software uses UTF-16. This guy is used to web and *nix software, where UTF-8 is popular. All your strings should default to UTF-16 on Windows so you don't have to deal with this when calling specific APIs.

The long-path thing shouldn't be something you have to deal with each time you call an API. You should do what Git for Windows and other software do, which is enable long paths (assuming you actually need them) during setup/install.


A lot of people would disagree with you:

>This document also recommends choosing UTF-8 for internal string representation in Windows applications, despite the fact that this standard is less popular there, both due to historical reasons and the lack of native UTF-8 support by the API. We believe that, even on this platform, the following arguments outweigh the lack of native support.

https://utf8everywhere.org/


Linux and MacOS use UTF-8. Qt, Windows itself, Java, Javascript, and (on Windows) C# will all use UTF-16.

Most computers run Windows, by a rather big margin. Even more computers use Javascript. Are you running KDE? Don't be surprised if half the text in your screen is secretly UTF-16!

UTF-8 is definitely better, but if you're on Windows, you're making your own life harder by using UTF-8. Microsoft is migrating APIs to UTF-8 but it'll take years before that's finished. Just look at all the conversion code you need according to the page you linked.

Now, if you're writing a cross platform library, you'll have to decide what encoding(s) you use. I personally don't really see why you wouldn't make your library encoding agnostic, but if you want to stick to one single encoding, you'll have to decide.

If you're writing a program for Windows, technical superiority is a mediocre argument for making your own life that much harder. There are tons of formats and design decisions that may be technologically superior, but just end up wasting programmers' time, and this is one of them.


Windows is not open source. I don't prefer either encoding; I am just telling you that, as that document acknowledges, UTF-16 is the default whether "everyone" (the tech bubble) likes it or not. If you write Windows software, assume that default, not your favorite UTF-8. Just like how you assume endianness on a system because of the default.


When I first ran into the UTF-8/UTF-16 business, a lot of Windows stuff was "recommending" UTF-16. I see quite a few Windows text editors default to it.

After a day or so of testing and trying it with various different SDKs across different platforms I settled on UTF-8. It did everything needed, is compatible with ASCII, and has all the functionality of UTF-16 with none of the _absolute insanity_.

Looking back, these kinds of issues are why it's now some 20 years since I wrote any Windows software (rather than porting Linux/Mac stuff or just using cross-platform VMs like JavaScript, Lua or Java).

MFC..... Oh my.... I actually feel sorry for the poor souls that still have that in their life.


What's even worse: trying to figure out if the current user/process is allowed to write to a file in a specific path.


Because you shouldn't. Either try to write, and see if it fails, or if need be ask for an exclusive lock on the file and see if the OS gives it to you. That isn't just permissions either; it actually picks up a lot of potential problems (e.g. device disconnected, lock conflict, et al).
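
In other words, something along these lines (a sketch; the share mode of 0 asks for exclusive access, and GetLastError tells you why an open failed):

    #include <windows.h>

    HANDLE openForExclusiveWrite(const wchar_t* path)
    {
        // Just attempt the open; don't try to predict permissions up front.
        HANDLE h = CreateFileW(path, GENERIC_WRITE, /*share*/ 0, nullptr,
                               OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE)
        {
            DWORD err = GetLastError();   // e.g. ERROR_ACCESS_DENIED, ERROR_SHARING_VIOLATION
            (void)err;                    // report it in whatever way fits the app
        }
        return h;   // caller keeps using this handle, avoiding a check-then-use gap
    }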


This. Seeing if you can do something before actually doing it falls victim to TOCTOU[0] bugs. Until one actually tries to perform the security-sensitive operation, there's no guaranteeing that you will succeed.

[0]: https://en.wikipedia.org/wiki/Time-of-check_to_time-of-use


For game development this should only matter on the game developers' machine, where there is more interaction with loose files on disk.

For any shipped game being played on a player's machine the data files should be some form of packed files, which are close to the executable and only a handful need to be opened.


Every time I created software for Windows in recent years I used Qt, WxWidgets or even Delphi, which all have wrappers to handle system-specific quirks. Is it common these days to code directly to the Windows API? It feels like it would lead to code that's harder to maintain and certainly less portable.


TL;DR: there are many different ways to send file paths to system library functions. That's mostly because of Windows' long legacy support history. Don't worry about it - just pick one and go.


No. Because the API is fragmented into old, not-so-old and new parts, some higher-, some lower-level[0]. You need things from all of them, the new API isn't complete enough and has warts in places where using one of the older ones is better/easier/safer. So in the end you will have to use all the ways, and know when to use which one.

All this can of course be fixed with another layer of indirection. But if you ever wanted to know why Windows file APIs and filesystems are so dog-slow, here is (part of) your answer.

[0] e.g. there are lower level APIs that can create and access files and filenames that higher level APIs will filter out or disallow.


I feel inclined to bypass all that Win32 gubbins and talk to the Kernel directly.

Back in my day we just walked into the office of the man in charge and shook hands; we struck deals like real men. If we ask the NT kernel politely in a language it understands, we will get our filepath. Right?

http://undocumented.ntinternals.net/index.html?page=UserMode...


NT API is now documented by Microsoft. The coverage is not complete, but enough to be useful. E.g. https://learn.microsoft.com/en-us/windows/win32/api/winternl....


> new API isn't complete enough and has warts in places where using one of the older ones is better/easier/safer

Describes Windows Forms vs everything else MS has been trying to do in the past 15 years.


I don't know why but Microsoft programmers seem to make bizarre code choices. I worked with a lot of principal developers in my life but the ones from Microsoft wrote the most convoluted code I have ever seen. Even though it's anecdotal I wonder what is going on in the culture that Microsoft keeps producing these shabby programs all over the place.

Casey has also spoken out about the horrible state of affairs in Microsoft in the past https://news.ycombinator.com/item?id=27728177 so I'm sure it's not a one-off phenomenon.


I think I read on another Hacker News post that some Windows APIs were intentionally made convoluted/hard to work with AND poorly documented in the official documentation, so that the API authors could go off and write a "Win32 explained" book and become fabulously wealthy.

Allegedly.


Yes! I remember that too. I wouldn't be surprised.


Ehhhhhh. Passing ASCII file paths to a Windows API is really straightforward and trivial. No conversion needed.

MAX_PATH is real though. Exceeding it is not recommended.


Windows uses Unicode because NTFS uses Unicode instead of efficient ASCII.


When it was designed there was no UTF-8. You had to know the code page, and file systems like FAT did not record the code page, so file names would be interpreted by whatever the current code page was.

So having a file system and other APIs support Unicode natively made a lot of sense at the time; it simplifies many things, including deploying into organizations that have global users. This is now over 30 years ago. And the A-suffixed APIs (vs the W-suffixed ones) were not simple ASCII; they were multi-byte, so to work with strings you still had to know the code page. Obviously UTF-8 is much simpler for that now, but that wasn't the world when this was designed.



