
As someone who has worked on Windows for a long time, the title was entirely unsurprising. Widening or narrowing always uses the current codepage.
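To make the codepage dependence concrete, here is a minimal Python sketch. Python's codecs only stand in for the Windows conversion routines here, and the file name and the two codepages are made up for illustration.

```python
# Illustration only: Python codecs stand in for the Windows narrowing and
# widening conversions; the name and codepages are hypothetical examples.
name = "caf\u00e9.dll"  # "café.dll"

# Narrowing under a Western European codepage: the accented letter is one byte.
narrow = name.encode("cp1252")

# Widening the same bytes under a Cyrillic codepage yields a different name.
widened_elsewhere = narrow.decode("cp1251")

# Narrowing under a codepage that lacks the letter loses it entirely.
lossy = name.encode("cp1251", errors="replace")

# ascii() keeps the output printable on any console encoding.
print(narrow)
print(ascii(widened_elsewhere))
print(lossy)
```

The bytes never change; only the codepage used to interpret them does, which is why the same file can appear under different names on differently configured systems.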

> If a name contains values beyond ASCII — technically out of spec

I'm not sure what spec it's referring to, but this is normal and expected for files in non-English systems.

> Such tools often incorrectly assume UTF-8, which is what motivated this article.

Those tools are likely to be from the *nix world, where UTF-8 is far more common as the multibyte encoding, but even there you can have different codepages; I have worked on Linux systems using CP1252 and CP932 before.
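A rough sketch of what a more defensive tool could do instead of assuming UTF-8 outright. The helper `decode_name` and its default fallback codepage are hypothetical; in practice the fallback has to come from the system configuration, since it cannot be detected reliably from the bytes.

```python
def decode_name(raw: bytes, fallback_codepage: str = "cp1252") -> str:
    """Try UTF-8 first, then fall back to an assumed legacy codepage.

    Hypothetical helper: the fallback is a guess supplied by the caller,
    not something that can be inferred from the bytes alone.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback_codepage)


utf8_bytes = "caf\u00e9.dll".encode("utf-8")     # two bytes for the accent
legacy_bytes = "caf\u00e9.dll".encode("cp1252")  # one byte for the accent

# Both byte strings recover the same name once the right decoder is used.
print(ascii(decode_name(utf8_bytes)))
print(ascii(decode_name(legacy_bytes)))
```

Trying UTF-8 first works reasonably well because text in a legacy codepage rarely happens to be valid UTF-8, but that is a heuristic, not a guarantee.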




> I'm not sure what spec [the prohibition on non-ASCII names is] referring to, but this is normal and expected for files in non-English systems.

The description of import directories[1,2] in the PE/COFF spec explicitly (if somewhat glibly) restricts imported DLLs to being referenced using ASCII only:

> Name RVA - The address of an ASCII string that contains the name of the DLL.

[1] https://learn.microsoft.com/en-us/windows/win32/debug/pe-for... (current, unversioned)

[2] https://github.com/tpn/pdfs/blob/master/Microsoft%20Portable... §6.4.1 (version 6.0, 1999)
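For anyone who wants to poke at this, a minimal sketch of reading the Name RVA field and checking whether the name is strict 7-bit ASCII. It assumes the import directory entry layout as I understand it from the spec (five little-endian 32-bit fields, Name RVA fourth), and it works on a toy buffer where RVAs equal buffer offsets, which is not true for a real file on disk. All function names are hypothetical.

```python
import struct

# One import directory entry: five little-endian 32-bit fields (20 bytes).
# Order assumed here: lookup table RVA, timestamp, forwarder chain,
# Name RVA, address table RVA.
IMPORT_DESCRIPTOR = struct.Struct("<5I")


def read_dll_name(image: bytes, descriptor_offset: int) -> bytes:
    """Return the raw NUL-terminated DLL name referenced by one entry.

    Assumption for this sketch: RVAs equal offsets into `image`. That holds
    only for the synthetic buffer below, not for a real PE file.
    """
    fields = IMPORT_DESCRIPTOR.unpack_from(image, descriptor_offset)
    name_rva = fields[3]
    end = image.index(b"\x00", name_rva)
    return image[name_rva:end]


def is_strict_ascii(raw: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range."""
    return all(b < 0x80 for b in raw)


def build_image(name: bytes) -> bytes:
    """Build a toy buffer: one descriptor followed by the name string."""
    name_rva = IMPORT_DESCRIPTOR.size
    return IMPORT_DESCRIPTOR.pack(0, 0, 0, name_rva, 0) + name + b"\x00"


plain = read_dll_name(build_image(b"kernel32.dll"), 0)
odd = read_dll_name(build_image("caf\u00e9.dll".encode("cp1252")), 0)
print(plain, is_strict_ascii(plain))
print(odd, is_strict_ascii(odd))
```

Note that the field itself is just bytes up to a NUL; nothing in the layout records which encoding was used, so anything beyond 0x7F is open to interpretation.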


That is, for lack of a better term, a "Microsoft-ism"; in MS documentation, "ASCII" consistently means "single-byte or MBCS" and is interpreted relative to the current codepage, as opposed to "Unicode" which means "UCS-2 or UTF-16". You can also see examples of "Unicode" in the docs you link to.


> That is, for lack of a better term, a "Microsoft-ism"; in MS documentation, "ASCII" consistently means "single-byte or MBCS" and is interpreted relative to the current codepage

I've read my share of MS docs and I do not recall ever seeing this to be the case. Like the parent says, I've seen "ANSI" used to refer to that, not ASCII. Do you have any examples of where they say "ASCII" where the intention is obviously something broader than 0-127? It makes me wonder how I've missed this if that's the case.


Not sure if I've ever seen a Microsoft doc do it, but many other places, including articles by MS "MVP"s, use ASCII and ANSI interchangeably.

In MS output, in my experience, it consistently means standard 7-bit ASCII.

Things they routinely do oddly are using ANSI to refer specifically to the WIN1252 code page (a superset of ISO 8859-1, otherwise referred to as CP1252), when the institute of that name neither defined nor dictated use of the codepage, and including (or requiring for correct interpretation) the BOM sequence in UTF-8 encodings, when the standard allows but recommends against a BOM in this context.
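As a small illustration of the BOM point, here is how a reader can tolerate the BOM without requiring it. This uses Python's standard `utf-8-sig` codec; the sample bytes are made up.

```python
import codecs

# Sample input as a BOM-writing tool might produce it (made-up content).
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain UTF-8 decoder keeps the BOM as a leading U+FEFF character.
plain = with_bom.decode("utf-8")

# The "utf-8-sig" decoder drops a leading BOM if one is present...
stripped = with_bom.decode("utf-8-sig")

# ...and accepts input without a BOM just as well.
no_bom = b"hello".decode("utf-8-sig")

print(ascii(plain))
print(stripped)
print(no_bom)
```

Decoding with `utf-8-sig` and writing with plain `utf-8` is one way to accept both conventions on input without producing a BOM on output.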


“ASCII”, not “ANSI”?


I suspect it could be to avoid making a distinction between ANSI and OEM codepages.
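For readers unfamiliar with the ANSI/OEM split, a quick sketch of how the two can disagree on the same machine. It assumes the pairing commonly found on US English systems (ANSI 1252, OEM 437); other locales use other pairs.

```python
# Assumed pairing for illustration: ANSI = cp1252, OEM = cp437.
ANSI, OEM = "cp1252", "cp437"

name = "caf\u00e9"  # "café"

# The same text narrows to different bytes under each codepage.
ansi_bytes = name.encode(ANSI)
oem_bytes = name.encode(OEM)

# Bytes written under one codepage and read under the other come out wrong.
oem_read_as_ansi = oem_bytes.decode(ANSI)
ansi_read_as_oem = ansi_bytes.decode(OEM)

print(ansi_bytes, oem_bytes)
print(ascii(oem_read_as_ansi), ascii(ansi_read_as_oem))
```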


It's the import table of an EXE/DLL where non-ASCII is out of spec. Meanwhile LoadLibraryW is happy to load any filename. (But don't you dare try to call LoadLibraryW from within DllMain, that's under loader lock)


Sadly, it's not unusual at all to see a Windows app crash and burn when paths contain non-ASCII characters. That's just how it is for non-English computer users.


I couldn't build some files in Android Studio on Windows because the path to Gradle contained non-English characters. It's a common problem for a lot of Linux-based tools. Sometimes even spaces in directory names are enough to break those tools.
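The space problem usually comes from building a command line as one string and splitting it on whitespace. A generic sketch of that failure mode follows; the path is hypothetical and this is not a claim about what Gradle or Android Studio actually do internally.

```python
import shlex

tool = "/opt/my tools/gradle"  # hypothetical path containing a space

# Naive approach: build one string, split on whitespace. The path breaks apart.
naive_argv = f"{tool} --version".split()

# Quoting before joining survives a POSIX-style split.
quoted_argv = shlex.split(f"{shlex.quote(tool)} --version")

# Simplest of all: never join in the first place, keep arguments as a list.
list_argv = [tool, "--version"]

print(naive_argv)
print(quoted_argv)
print(list_argv)
```

Keeping arguments as a list all the way to process creation sidesteps quoting rules altogether, which matters because `shlex` follows POSIX shell conventions rather than those of cmd.exe.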


Oh yeah. You have to re-create the user account with an ASCII username if that happens.

And clearly those things are why there has been a C:\ProgramData\ (that's under 8+3 characters!) since the late XP era. Even "C:\Program\ Files\" must have been too much sometimes, and having that folder is useful harm reduction: the developer response to an app not working under Program Files is to make random top-level directories under C:\ rather than taking the time to clear tech debt.


C:\Progra~1


I’m pretty sure there are non-ASCII characters on English keyboards too.



