The section titled "How to do text on Windows" on http://utf8everywhere.org/#windows covers the insanity in more detail.
The Right Thing To Do at this point is to make UTF-8 a multi-byte code page in Windows and build a UTF-8 implementation into the MSVC libc. The milquetoast excuse I hear from Microsoft people is that some Win32 APIs can't handle MBCS encodings with more than 3 bytes per character. Which sort of sounds like a problem for developers to fix; perhaps Microsoft could hire some?
The linux user-space ABI is extremely stable. I think the one thing that would frustrate the development of long service lifetime software on linux would just be library availability on various distros, but the actual operating environment presented by the kernel and the image loader for user-space software on linux is very, very stable.
And all future Win32 implementations must be 100% binary compatible with previous ones
And no one is asking Microsoft to break the Windows user-space ABI. Adding a new, sufficiently tested MBCS codepage would have no impact at all on existing Windows software. None whatsoever. Other than to make localization a whole lot easier. And compared to the cost of building in a whole new csrss-level subsystem like the one Microsoft just built into Windows last month (Linux), it's probably a lot safer and easier to test.
Working in this world myself, I find Windows developers become philosophical when discussing localization: "Someday, we'll turn on UNICODE." "We really should be using TCHAR" (as if that silly thing would fix anything at all). "Shouldn't we really be using wstring?"
OSX, Linux, and mobile developers just do it. It's mostly a solved problem on those platforms.
The decision to stick with 16-bit Unicode is an engineering tradeoff.
It's a cost tradeoff - Microsoft doesn't want to spend the developer and testing time adding a UTF-8 codepage.
Currently we do "support" CP65001 in the console, but things break if you enable it. One of the problems, for example, is that .NET sees 65001 and starts outputting the UTF-8 BOM everywhere, breaking applications that don't even care about the character encoding. I suspect that's going to be difficult to fix without breaking compatibility.
Having said that, I think it's apparent that we are investing heavily in the console for the first time in a long while, so I'm more hopeful than ever that we can get this fixed.
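The failure mode described above is easy to reproduce outside .NET; a minimal sketch in Python, using the `utf-8-sig` codec as a stand-in for any encoder that unconditionally emits a BOM:

```python
# A BOM-emitting encoder prepends EF BB BF; any consumer expecting
# plain UTF-8 sees three junk bytes at the start of the stream.
text = "chcp 65001"
with_bom = text.encode("utf-8-sig")   # stand-in for a BOM-happy encoder
plain = text.encode("utf-8")

print(with_bom[:3].hex())             # efbbbf -- the BOM
print(with_bom[3:] == plain)          # True: payload unchanged after it
```

A tool that concatenates, diffs, or greps such output sees those three bytes as data, which is exactly how BOM-unaware applications break.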
The guarantees that Windows makes about backwards-compatibility go far beyond just the image-loader and syscalls. For comparison, the glibc backwards-compatibility story is... messier. https://www.kernel.org/pub/software/libs/glibc/hjl/compat/
This was logged as an issue on Connect in 2008, and Microsoft's response was:
"Thanks for your suggestion. We are considering adding support for UTF8 in the next version of SQL Server. It is not clear at this point if it will be a new type or integrate it with existing types. We understand the pain in terms of integrating with UTF8 data and we are looking at ways to effectively resolve it."
Edit: By the way, the worst handling of Unicode paths on Windows I have found is Ruby's, which is still partially broken (last time I checked).
It never worked quite right; since ancient DOS times there have been several bugs with this.
What surprises me is that it STILL doesn't work right.
I have Windows set to English, my keyboard to Brazilian, and I had to set my "locale" to Japanese to play some Japanese games that outright crash otherwise (they don't even render wrong, they just crash).
I lost count of how many times programs, instead of using Unicode when they could, tried to figure out a codepage from my location (ending up with Brazilian), language (English), or "locale" (Japanese), and got it all mixed up and wrong.
Stuff I saw:
Important installers (for example, for drivers and expensive software) that render the interface in Japanese and the EULA in Portuguese, but with the US codepage.
Japanese font + Portuguese text (ending in utter nonsense).
English text + Brazilian font... and many others.
Still, to me this is merely an "annoyance", but a similar issue made my mother panic completely:
My dad wrote software for our family business to handle some mandatory tax reporting in Brazil (a sort of tax report for every single transaction, required if you do any business at all). He wrote it in PHP, for Linux, but running on Windows; so far so good...
Then one day he had to fix a bug, and the only machine available was a WinXP box... he fixed the bug, and suddenly the program started dumping lots of corrupted data to government servers (as you can imagine, that is really bad).
We went to look, and for some reason it is now sending codepage 437 formatted data. We have no idea why; we can't find what changed, and we didn't even use a Windows-based editor (we used Geany on WinXP).
In honesty, I think these edge cases are sufficiently rare as to get little priority in testing.
You can do a "dir.listFiles()" and iterate it, and find that some of the entries are impossible to open because there is no way to represent the ISO-8859-1 bytes that make up the filename in a String object, and therefore no way to give the correct file name to the java io classes for opening.
That gives more weight to the argument that character encodings should be entirely a "presentation-layer" concern, and anything below that should treat strings as opaque byte sequences. There should also be a way to bypass that "presentation layer" transformation upon input.
Which can result in files with seemingly identical names when one UTF-8-encoded file name uses combining characters and the other one uses precomposed ones. Not to mention the fact that e.g. Å is contained in Unicode as both "latin capital letter A with ring above" and "Angstrom sign".
After you've solved encoding, next comes Unicode normalization ;-)
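The Å example above is reproducible with Python's `unicodedata` module; a quick sketch:

```python
import unicodedata

# Three different spellings of "Å" that compare unequal until normalized.
ring_a   = "\u00c5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
combined = "A\u030a"   # 'A' followed by COMBINING RING ABOVE
angstrom = "\u212b"    # ANGSTROM SIGN

print(ring_a == combined, ring_a == angstrom)   # False False
nfc = [unicodedata.normalize("NFC", s) for s in (ring_a, combined, angstrom)]
print(nfc[0] == nfc[1] == nfc[2])               # True
```

Filesystems disagree on who normalizes: HFS+ stores (a variant of) NFD, most Linux filesystems store whatever bytes you hand them, so both spellings can coexist as "different" files.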
Unicode normalization is not the only problem here, e.g. Latin 'a', 'e', 'T' are exactly the same as Cyrillic 'а', 'е', 'Т' in most fonts which makes it possible for two files to have seemingly same names even in some 8-bit encodings.
codepoint -> encoding in UTF-8
U+022F -> C8 AF
U+042F -> D0 AF
U+062F -> D8 AF
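The table is easy to verify in a couple of lines of Python:

```python
# Three unrelated codepoints whose UTF-8 encodings share the trailing
# byte AF and differ only in the lead byte.
for cp in (0x022F, 0x042F, 0x062F):
    print(f"U+{cp:04X} -> {chr(cp).encode('utf-8').hex(' ').upper()}")
```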
* char being 16-bit and thus unable to hold any Unicode character outside the BMP
* lots and lots of string and byte stream methods that have variants without explicit charset parameters, making it very easy to accidentally have a hidden dependency on the implicit default charset (aka -Dfile.encoding)
* Using strings instead of bytes for filename values
* Having to deal with UnsupportedEncodingException as a checked exception whenever referencing the UTF-8 encoding
If only it were that simple. POSIX paths are encoding-less bags of bytes, you can literally find anything in them aside from NUL, LC_CTYPE isn't even a consideration.
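A sketch of what that means in practice (POSIX only; in Python 3 the str-based view relies on PEP 383 surrogate escapes):

```python
import os
import tempfile

# Create a file whose name is valid Latin-1 but illegal as UTF-8.
with tempfile.TemporaryDirectory() as d:
    raw = b"caf\xe9.txt"                 # a lone 0xE9 is invalid UTF-8
    open(os.path.join(d.encode(), raw), "w").close()
    print(os.listdir(d.encode()))        # bytes in, bytes out: untouched
    print(os.listdir(d))                 # str in: 0xE9 surfaces as the
                                         # surrogate escape '\udce9'
```

The kernel never decoded anything; both listings are views of the same eight-bit name.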
And the funny thing is: it's not that complex. [Edit: let me rephrase: it is, but it doesn't have to be. The C family is a nightmare; Python et al. are delightful.] Unless you don't know how it works; then it's the mysterious Gordian knot you describe.
The irony is that encoding is a worry precisely for those who try and stay away from it.
Don't shy away from encodings; embrace them. Then you will learn to love them.
(Another irony: with UTF8 gaining more and more mind share, encoding issues actually become harder to find and debug: they don't show up, and when they do, fewer and fewer people know how to deal with them. Everyone switching to UTF8 just hides the bugs, until it doesn't.)
Unless you do multiplatform development, then the language has a hard time saving you (and Python definitely does not)
> (Another irony: with UTF8 gaining more and more mind share, encoding issues actually become harder to find and debug: they don't show up, and when they do, fewer and fewer people know how to deal with them. Everyone switching to UTF8 just hides the bugs, until it doesn't.)
That's not true at all. A ton of byte sequences (and 13 standalone bytes) are outright illegal in UTF8, there's a fair amount of error handling in a validating UTF8 decoder, whereas there usually isn't any invalid byte (let alone byte sequence) in 8-bit codepages or character sets. When you decode random bytes in Windows-1256 or ISO-8859-9 and re-encode them as UTF8, the UTF8 encoder isn't the one at fault for your garbage output, as far as I could see it got perfectly valid unicode data.
People passing through unvalidated data (possibly assuming it's UTF8) isn't a problem with UTF8 either, by the way.
 whether that's used in strict mode or in replacement mode is a different concern and not a blemish on UTF8 itself
 which will always succeed, you can try it at home, just get a bunch of bytes from urandom and feed them to various decoders, chances are low that you'll generate anything the UTF8 decoder will accept, chances are also low that you'll generate anything an ISO-8859 character set will reject.
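The asymmetry is easy to check empirically; a small experiment (the exact count varies per run, but UTF-8 acceptances of random bytes are vanishingly rare):

```python
import os

# Random bytes almost never form valid UTF-8, while every byte string
# is "valid" ISO-8859-1, since all 256 byte values map to a character.
utf8_ok = 0
trials = 1000
for _ in range(trials):
    chunk = os.urandom(32)
    chunk.decode("iso-8859-1")      # never raises
    try:
        chunk.decode("utf-8")
        utf8_ok += 1
    except UnicodeDecodeError:
        pass
print(f"{utf8_ok}/{trials} random 32-byte chunks passed UTF-8 validation")
```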
No. In Python you have the absurd behaviour of a program running fine on the command line but crashing if you redirect the output to a file.
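That complaint refers to Python 2 (and Python 3 under a C/ASCII locale): sys.stdout takes its encoding from the attached terminal, and a redirected file falls back to ASCII. A simulation of the failure without actually redirecting:

```python
import io

# Stand-in for stdout redirected to a file under an ASCII locale: the
# text layer encodes with 'ascii' and dies on the first non-ASCII char.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
try:
    print("caf\u00e9", file=fake_stdout)
except UnicodeEncodeError as exc:
    print("crashed on redirect:", exc)
```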
It's sad that a "modern" OS that had a mostly ground-up rewrite (NT) after UTF-8 was invented doesn't have better support for it. I get in-memory storage being UTF-16, and I'd even accept modern OSes storing files in UTF-16, but UTF-8 is so elegantly backwards compatible with ASCII that it's brain-dead not to fix -everything- to work with it.
Perhaps this is the biggest difference between a closed OS like Windows and its more open counterparts; if Windows were open, then the community could have fixed this issue long ago.
Windows NT was released in July 1993, but development started in 1989 (https://en.m.wikipedia.org/wiki/Windows_NT#Development)
Also, even after UTF-8 was accepted as a good idea, it wasn't considered a good idea for in-memory usage; indexing UCS-2 encoded strings is way easier, and UTF-16 didn't exist yet. It arrived with Unicode 2.0 in July 1996 (https://en.m.wikipedia.org/wiki/UTF-16#History)
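The indexing appeal of UCS-2, and what UTF-16 later broke, in two lines of Python:

```python
# In UCS-2 every character was one 16-bit unit, so indexing was plain
# arithmetic. UTF-16 broke that: astral characters take a surrogate pair.
bmp    = "\u042f"          # Я, inside the Basic Multilingual Plane
astral = "\U0001f600"      # an emoji, outside the BMP
print(len(bmp.encode("utf-16-le")))     # 2 bytes: one 16-bit code unit
print(len(astral.encode("utf-16-le")))  # 4 bytes: a surrogate pair
```

Any API that hands out 16-bit-unit indices (Win32, Java, JavaScript) can therefore split a character in half.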
> The ASCII standard has nothing to say about backspace overstriking
3.2 Diacritical Signs
(Positions: 2/2, 2/7, 2/12, 5/14, 6/0, 7/14)
In the 7-bit character set, some printing symbols may be
designed to permit their use for the composition of accented
letters when necessary for general interchange of
information. A sequence of three characters, comprising
a letter, BACKSPACE and one of these symbols, is needed
for this composition; the symbol is then regarded as a
diacritical sign. It should be noted that these symbols take
on their diacritical significance only when they precede or
follow the character BACKSPACE; for example, the symbol
corresponding to the code combination 2/7 normally has the
significance of APOSTROPHE, but becomes the diacritical
sign ACUTE ACCENT when preceded or followed by the character
BACKSPACE.
ANSI made this optional in the 1986 revision, in §2.1.2 — “The use of BS for forming composite characters is not required.” — with a note that it would likely be removed from a future revision (but there never was another one).
> In a more common example, ASCII has nothing to say about how you move the cursor to the start of a new line.
CR Carriage Return
A format effector which moves the active position
to the first character position *on the same line*.
LF Line Feed
A format effector which advances the active position
to the *same character position* of the next line.
The Format Effectors are intended for equipment in
which horizontal and vertical movements are effected
separately. If equipment requires the action of
CARRIAGE RETURN to be combined with a vertical movement,
the Format Effector for that vertical movement
may be used to effect the combined movement. For example,
if NEW LINE (symbol NL, equivalent to CR + LF)
is required, FE2 shall be used to represent it. This
substitution requires agreement between the sender and
the recipient of the data.
The use of these combined functions may be restricted
for international transmission on general switched telecommunication
networks (telegraph and telephone networks).
C is much better, because at least it doesn't pretend to support encodings. So you're forced to use a 3rd party library anyway.
C is better because it _doesn't_ provide some kind of interface into encoding. A byte is a byte is a byte; whether it's some slice of a binary blob or the first byte of a multi-byte Unicode string is completely irrelevant. The most infuriating thing I've found with higher-level languages (Ruby/Python/etc.) is their abstractions on top of string encodings, which fail to cover the hundreds of thousands of edge cases (which, rightfully, they shouldn't cover either). That means the only time you do end up having to wrangle encodings is when things go south and you have to fight the language's defaults just to get some stupid email address, which most likely had a bit flip in transit somewhere, to write properly to a CSV.
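For what it's worth, Python does offer one escape hatch for exactly this situation; a sketch using the `surrogateescape` error handler (the email address is a made-up stand-in for the bit-flipped data):

```python
# Bytes that are not valid UTF-8 still round-trip through str when the
# surrogateescape handler smuggles each bad byte as a lone surrogate.
raw = b"user@ex\xffmple.com"            # 0xFF cannot appear in UTF-8
s = raw.decode("utf-8", errors="surrogateescape")
back = s.encode("utf-8", errors="surrogateescape")
print(back == raw)                       # True: nothing lost, no exception
```

So you can treat the value as an opaque byte sequence end to end, and only the code that actually renders it has to care that it was garbage.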
Excel CSVs default to the system codepage when reading and writing, which makes some sense given that the files are plain text. You can actually control what codepage is used when importing data but you have to use the import option.
The bigger problem is when Excel tries to be helpful and reformats the data upon opening the file. I can't remember the intricacies, but it used to behave differently whether you opened it from a local path or directly from a browser. I seem to recall using an "incorrect" MIME type helped with that.
I found many of the protocols to be both incredibly clever and incredibly annoying at the same time.
What is a fixed delimited file? A 'mainframe' file is generally fixed length or delimited.
I gave up. This article describes why.
The problem here is Microsoft clinging to legacy character encodings that have no place on a modern operating system as default for anything.
It's not the "enabling" of Unicode that's problematic, it's the features of more exotic Unicode codepoints that aren't correctly supported by many programs in all major operating systems, because they weren't designed with these features in mind.
Tell me, can you copy/paste the "𖥶" character with the mouse after double-clicking it in your browser? It's a letter (http://unicode-table.com/en/16976/), so IMO it should be treated like a quoted word (correct me if I'm wrong). Safari on OS X selects either the quote and the letter or just the right quote, depending on where exactly on the letter I double-click. Safari/WebKit is relatively modern and well-maintained, too. Unicode "supported"? Sure. Working well? Nope, just for a few common use cases.
No problem (Linux/Firefox).
The Unicode standard is not static (unlike those old codepages mentioned by OP); it is being changed and improved continuously. Support for characters outside of the BMP (such as the one you mention) is mostly there, but it may not be at the level of the basic scripts supported by, say, Unicode 5.0. That is fine. New features in standards take time to implement (the same thing happens with HTML and CSS).
For developers Unicode support is there. Has been for years.
On OS X/Safari, it renders fine but copy-paste is bugged like you mention (can't select with double click).
That seems more like a comment on how terminals suck, not Unicode. Everything is painful in the terminal - I mean, good luck making your text red or green too!
> Using UTF-8 typically means having to think long and hard about filtering input/data, normalization
Typically? I really don't think so. I almost never worry about such things. It's a rare occasion when I have to deal with those kinds of things.
It's really a mistake to think that because your program can handle a particular character set, it can therefore handle any language which uses those characters. You're always going to need to add support for specific languages, even when using ASCII: the sorting order of names is different in German and English, French typographical rules are different w.r.t. spacing, Russian typesetting tradition formats math rather differently, and line breaking in Thai requires a dictionary because the language has no spaces.
Any software will run into problems eventually, usually with something as simple as German. Decide which languages you're going to support, not which character codes!
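A concrete instance of the sorting point, sketched in Python (the words are arbitrary examples):

```python
# Naive codepoint ordering is nobody's collation: 'Ä' (U+00C4) sorts
# after every ASCII letter, which no German dictionary would accept.
words = ["Zebra", "\u00c4pfel"]          # "Äpfel"
print(sorted(words))                     # ['Zebra', 'Äpfel'] by codepoint
# Getting German (or any) order right needs locale-aware collation,
# e.g. locale.strxfrm under a de_DE locale, or an ICU binding.
```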
Oh Christ on a cracker. Why doesn't Oracle just default everything to UTF-8 in the next version of Java?
Sane platforms default to UTF-8 and output exactly what you would expect: €Σ