Code Page 437 Refuses to Die (horstmann.com)
197 points by ingve 583 days ago | 89 comments



Trying to handle character encoding on Windows in multi-platform programs is a nightmare. In C++ you can almost always get away with treating C strings as UTF-8 for input/output, and you only need special consideration for the encoding if you want to do language-based tasks like converting to lowercase or measuring the "display width" of a string. Not on Windows. Whether or not you define the magical UNICODE macro, Windows will fail to open UTF-8-encoded filenames using standard C library functions. You have to use non-standard wchar_t overloads or use the Windows API. That is to say, there is no standard-conformant, internationalization-friendly way to open a file by name on Windows in C or C++. I really wish Microsoft would at least support UTF-8, even if they want to stick with UTF-16 internally.

The section titled "How to do text on Windows" on http://utf8everywhere.org/#windows covers the insanity in more detail.
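
To make the gap concrete, here is a minimal sketch of the detour you end up writing (the helper name open_utf8 is made up; MultiByteToWideChar and the non-standard _wfopen are real):

    // Convert a UTF-8 path to UTF-16 and open it with the wide CRT function,
    // because std::fopen on Windows interprets the bytes in the ANSI code page.
    #include <windows.h>
    #include <cstdio>
    #include <string>

    std::FILE* open_utf8(const char* utf8_path, const wchar_t* mode) {
        int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8_path, -1, nullptr, 0);   // count includes the terminator
        if (len == 0) return nullptr;                               // invalid UTF-8 or other failure
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS, utf8_path, -1, &wide[0], len);
        return _wfopen(wide.c_str(), mode);
    }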


For a company that claims to be so supportive of "developers, developers, developers", Microsoft's stubborn and developer-hostile approach to internationalization and their dogged loyalty to the awful UTF-16 encoding is ironic.

The Right Thing To Do at this point is to make UTF-8 a multi-byte code page in Windows and build a UTF-8 implementation in the msvc libc. The milquetoast excuse I hear from Microsoft people is that some win32 APIs can't handle MBCS encodings with more than 3 bytes per character. Which sort of sounds like a problem for developers to fix; perhaps Microsoft could hire some?


Before "developers, developers, developers" comes "backward compatibility, backward compatibility, backward compatibility". Windows is perhaps the first commercial platform to commit to Unicode; they made that commitment when UTF-8 was still some notes scribbled on Brian Kernighan's napkin. And all future Win32 implementations must be 100% binary compatible with previous ones. That creates inertia for UTF-16 (or UCS-2), true, but the backwards compatibility guarantees make Windows an absolute joy compared to Linux if you want to write software with a long service lifetime. The decision to stick with 16-bit Unicode is an engineering tradeoff.


but the backwards compatibility guarantees make Windows an absolute joy compared to Linux if you want to write software with a long service lifetime

The linux user-space ABI is extremely stable. I think the one thing that would frustrate the development of long service lifetime software on linux would just be library availability on various distros, but the actual operating environment presented by the kernel and the image loader for user-space software on linux is very, very stable.

And all future Win32 implementations must be 100% binary compatible with previous ones

And no one is asking microsoft to break the windows user-space ABI. Adding a new, sufficiently tested MBCS code page would have no impact at all on existing windows software. None whatsoever. Other than to make localization a whole lot easier. And compared to the cost of building in a whole new csrss-level subsystem like the one Microsoft just built into windows last month (the Linux subsystem), it's probably a lot safer and easier to test.

Working in this world myself, I find windows developers become philosophical when discussing localization: "Someday, we'll turn on UNICODE," "We really should be using TCHAR" (as if that silly thing would fix anything at all), "Shouldn't we really be using wstring?"

OSX, Linux, and mobile developers just do it. It's mostly a solved problem on those platforms.

The decision to stick with 16-bit Unicode is an engineering tradeoff.

It's a cost tradeoff - Microsoft doesn't want to spend the developer and testing time adding a UTF-8 codepage.


Vote here: https://wpdev.uservoice.com/forums/266908-command-prompt-con....

Currently we do "support" CP65001 in the console, but things break if you enable it. One of the problems, for example, is that .NET sees 65001 and starts outputting the UTF-8 BOM everywhere, breaking applications that don't even care about the character encoding. I suspect that's going to be difficult to fix without breaking compatibility.

Having said that, I think it's apparent that we are investing heavily in the console for the first time in a long while, so I'm more hopeful than ever that we can get this fixed.


The BOM is an aBOMination. If you simply assume that everything is UTF-8 until proven otherwise, you can get pretty far - the legacy code pages produce text that's not usually valid UTF-8 unless they stick to ASCII.
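
For reference, "proven otherwise" can be a strict structural check like the sketch below (the function name is made up); legacy code-page text that isn't pure ASCII almost never passes it:

    // Returns true when the bytes form structurally valid UTF-8 (rejects overlong
    // forms, surrogates, out-of-range code points and truncated sequences).
    #include <cstddef>
    #include <cstdint>

    bool looks_like_utf8(const unsigned char* s, std::size_t n) {
        std::size_t i = 0;
        while (i < n) {
            unsigned char b = s[i];
            if (b < 0x80) { ++i; continue; }                 // plain ASCII byte
            std::size_t extra;                               // continuation bytes to follow
            std::uint32_t cp, min;                           // decoded value, smallest legal value
            if      ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80;    }
            else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800;   }
            else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
            else return false;                               // stray continuation byte or 0xF8..0xFF
            if (i + extra >= n) return false;                // sequence runs past the end
            for (std::size_t k = 1; k <= extra; ++k) {
                if ((s[i + k] & 0xC0) != 0x80) return false; // continuation bytes must be 10xxxxxx
                cp = (cp << 6) | (s[i + k] & 0x3F);
            }
            if (cp < min || cp > 0x10FFFF) return false;     // overlong or beyond Unicode
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogates are not characters
            i += extra + 1;
        }
        return true;
    }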


Yes, you can assume that, but existing applications, maybe written in the 1990s or even the 1980s, won't. And millions of computers, maybe in important industrial companies, are still running them.


If that's the case they won't handle a BOM either.


Well, if I could use UTF-8 with CreateFileA() et al. and never have to use wchar_t again, that would just be like Christmas. I don't know whether this survey is for blanket Win32 UTF-8 compatibility or whether it just means making the admittedly hugely improved cmd.exe work with UTF-8 command parsing, but anything that gives us better UTF-8 support is a step in the right direction.


the actual operating environment presented by the kernel and the image loader for user-space software on linux is very, very stable.

The guarantees that Windows makes about backwards-compatibility go far beyond just the image-loader and syscalls. For comparison, the glibc backwards-compatibility story is... messier. https://www.kernel.org/pub/software/libs/glibc/hjl/compat/


Linux had the luxury of waiting until UTF-8 was available before settling on a Unicode strategy. Given that UTF-8 was extremely compatible both forwards and backwards, it was a fortuitous choice. I weep a little whenever I think about how Microsoft messed up by being an early adopter of Unicode - being an early adopter should have made life better, not worse.


True story: Office maintained its API to stay compatible with a plugin whose developer has long since gone bankrupt and whose source code is lost, because millions of US Government computers are still using it.


Especially hilarious how SQL Server nchar is widechar, so almost all string data is twice as large as it needs to be. But there isn't a utf8 option to enable! This might have been addressed in one of the later versions, but I wouldn't be surprised if it was limited to Enterprise edition or other crap.


Bloody hell... I looked this up as I was certain it couldn't still be the case. But yes, it is!

This was logged as an issue on Connect in 2008, [1] and Microsoft's response was:

"Thanks for your suggestion. We are considering adding support for UTF8 in the next version of SQL Server. It is not clear at this point if it will be a new type or integrate it with existing types. We understand the pain in terms of integrating with UTF8 data and we are looking at ways to effectively resolve it."

1. https://connect.microsoft.com/SQLServer/feedback/details/362...


It's a bit annoying and lame but not really a huge deal. For example, you can just open an fstream like so:

  std::ofstream file(convertUTF8ToFStreamPath(pathname).c_str());
Where convertUTF8ToFStreamPath converts from UTF-8 to the Windows wide encoding (UTF-16).
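
A possible shape for that helper, for the curious (the name comes from the comment above; the wide-path fstream constructor is an MSVC extension, not standard C++):

    #include <string>
    #ifdef _WIN32
    #include <windows.h>
    std::wstring convertUTF8ToFStreamPath(const std::string& utf8) {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.c_str(), -1, &wide[0], len);
        return wide;          // yes, the "wide encoding" is UTF-16
    }
    #else
    std::string convertUTF8ToFStreamPath(const std::string& utf8) {
        return utf8;          // POSIX file APIs take the UTF-8 bytes as-is
    }
    #endif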


It gets even more fun when you have no choice but to use a poorly maintained closed-source third-party library (par for the course on Windows). If the library authors didn't use this trick themselves you can get stuck in situations where you have no way to make the library open the file without resorting to the DOS 8.3 name. And if the user disabled 8.3 names system-wide... then the only consolation is that something even more important on the system will probably break first. A sad state of affairs in 2016.


The third party library situation is indeed trickier. I've never found a library that I couldn't get to open Unicode paths eventually though. Sometimes you have to open a file handle yourself and pass it to the library.
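
For anyone wondering what the handle trick looks like: open with the wide API yourself, then wrap the HANDLE as a FILE* for a library that only takes FILE*/fd. A sketch with minimal error handling (the function name is made up; the Win32/CRT calls are real):

    #include <windows.h>
    #include <io.h>
    #include <fcntl.h>
    #include <cstdio>

    std::FILE* open_for_library(const wchar_t* wide_path) {
        HANDLE h = CreateFileW(wide_path, GENERIC_READ, FILE_SHARE_READ,
                               nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (h == INVALID_HANDLE_VALUE) return nullptr;
        int fd = _open_osfhandle(reinterpret_cast<intptr_t>(h), _O_RDONLY);
        if (fd == -1) { CloseHandle(h); return nullptr; }
        return _fdopen(fd, "rb");   // closing the FILE* later also closes the HANDLE
    }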

Edit: By the way, the worst handling of Unicode paths on Windows I have found is by Ruby, which is still partially broken (last time I checked).


What is broken? Have you reported it? I've reported a related bug this week (File.truncate called CreateFileA() with a UTF-8 string). I even had a working patch but forgot to attach it. Anyway it was almost immediately fixed including various additional tests for other File class methods (which handled Unicode without any problems).



I am from Brazil, meaning we needed a custom code page here for our language's characters (including the Ç mentioned in the article).

It never worked quite right; since ancient DOS times there have been several bugs with this.

What surprises me is that it STILL doesn't work right.

I have Windows set to English, the keyboard to Brazilian, and had to set the "locale" to Japanese to play some Japanese games that outright crash otherwise (they don't even render wrong, they just crash).

I lost count of how many times programs, instead of using Unicode when they could, tried to figure out a code page from my location (ending up with Brazilian), language (English) or "locale" (Japanese), and got it all mixed up and wrong.

Stuff I saw:

Important installers (for example drivers and expensive software) that render the interface in Japanese and the EULA in Portuguese, but with the US code page.

Japanese font + Portuguese text (ending up as utter nonsense).

English text + Brazilian font... and many others.

Still, to me this is merely an "annoyance", but a similar issue made my mother panic completely:

My dad coded, for our family business, some software to handle tax reporting that is mandatory in Brazil if you do any business at all (a sort of tax report for every single transaction). He did it in PHP, for Linux, but running on Windows. So far so good...

Then one day he had to fix a bug, and the only machine available was a WinXP one... he fixed the bug, and suddenly the program started dumping lots of corrupted data to government servers (as you can imagine, that is really bad).

We went to look, and for some reason it is now sending code page 437 formatted data. We have no idea why, we can't find what changed, and we didn't even use a Windows-based editor (we used Geany on WinXP).


Mixing locales confuses a lot of poorly implemented programs. My situation is notably simpler: OS in French, keyboard set to QWERTY. The number of games whose keyboard layout assumes AZERTY, consequently breaking WASD, is frustratingly high.

In all honesty, I think these edge cases are sufficiently rare as to get little priority in testing.


Rather than using Japanese locale all the time, you can try Locale Emulator (https://xupefei.github.io/Locale-Emulator/), which allows you to run individual programs in a given locale.


From personal experience I can say that Locale Emulator doesn't work reliably for many games.


The most hilarious issue with -Dfile.encoding is that if it is set to, for example, UTF-8, then it is literally impossible to open files with names encoded in, say, ISO-8859-1, which can happen on a Linux system where users use different LC_CTYPE.

You can do a "dir.listFiles()" and iterate it, and find that some of the entries are impossible to open because there is no way to represent the ISO-8859-1 bytes that make up the filename in a String object, and therefore no way to give the correct file name to the java io classes for opening.


You can do a "dir.listFiles()" and iterate it, and find that some of the entries are impossible to open because there is no way to represent the ISO-8859-1 bytes that make up the filename in a String object, and therefore no way to give the correct file name to the java io classes for opening.

That gives more weight to the argument that character encodings should be entirely a "presentation-layer" concern, and anything below that should treat strings as opaque byte sequences. There should also be a way to bypass that "presentation layer" transformation upon input.


Indeed, this is exactly what the Linux kernel does; it treats filenames like a binary blob, with the exception that the bytes 0x2f (ASCII '/') and 0x00 (ASCII NUL) are not accepted (regardless of where they are in the byte string).
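
A tiny POSIX sketch of that behaviour: the name below is valid Latin-1 but not valid UTF-8, and the kernel creates it without complaint.

    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        const char name[] = "caf\xE9.txt";          // 0xE9 is Latin-1 'é', invalid as UTF-8
        int fd = open(name, O_CREAT | O_WRONLY, 0644);
        if (fd != -1) close(fd);                    // the file exists; a UTF-8 terminal renders it as caf?.txt
        return 0;
    }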


> it treats filenames like a binary blob

Which can result in files with seemingly identical names when one UTF-8-encoded file name uses combining characters and the other one uses precomposed ones. Not to mention the fact that e.g. Å is contained in Unicode as both "latin capital letter A with ring above" and "Angstrom sign".

After you've solved encoding, next comes Unicode normalization ;-)
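
For instance, all three byte strings below render as Å yet name three different files (the byte values are just the standard UTF-8 encodings):

    #include <cassert>
    #include <string>

    int main() {
        std::string precomposed = "\xC3\x85";        // U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
        std::string decomposed  = "A\xCC\x8A";       // U+0041 'A' + U+030A COMBINING RING ABOVE
        std::string angstrom    = "\xE2\x84\xAB";    // U+212B ANGSTROM SIGN
        assert(precomposed != decomposed && precomposed != angstrom && decomposed != angstrom);
        return 0;
    }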


It is not a big deal as long as seemingly identical file names are treated by the OS as different.

Unicode normalization is not the only problem here; e.g. Latin 'a', 'e', 'T' are exactly the same as Cyrillic 'а', 'е', 'Т' in most fonts, which makes it possible for two files to have seemingly the same names even in some 8-bit encodings.


Even Latin 'I' and 'l' and the digit '1' are visually indistinguishable in some fonts! So are trailing spaces and different numbers of spaces. This is such a pervasive problem that maybe we can just give up and expect users to get used to it.


Wow, so characters like U+022F (ȯ), U+042F (Cyrillic letter Я) and U+062F (Arabic letter د) are not allowed but nearly everything else is? Some of those are letters used in actual languages. That's sure to make people scratch their heads.


    codepoint -> encoding in UTF-8

    U+022F    -> C8 AF  
    U+042F    -> D0 AF  
    U+062F    -> D8 AF
Remember, UTF-8 is self-synchronising: when you pick up at a random point within a stream, there is no ambiguity as to whether you are in the middle of a sequence or not. Valid lower codepoints appearing in the encoding of higher codepoints would violate this property.


When encoding those characters in UTF-8, you will never end up with 0x2F as a byte. One of the properties of UTF-8 is that bytes with the high bit not set (i.e. 0x00 to 0x7F) never appear unless they are representing the code points 0-127.


If you encode them in UTF-8, they are allowed. In UTF-8, a byte in the range 0x00 to 0x7F only ever appears as a complete one-byte sequence, never inside a longer one. So if you're reading UTF-8 content and find a byte with value 0x00-0x7F, you can be sure it is a one-byte code unit sequence representing one of the code points U+0000 to U+007F.
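
A minimal encoder makes the property easy to see (a sketch only, with no validation of the input code point):

    // Every byte after the lead byte is 10xxxxxx (0x80-0xBF), so 0x2F ('/')
    // can only ever mean U+002F.
    #include <cstdint>
    #include <string>

    std::string encode_utf8(std::uint32_t cp) {
        std::string out;
        if      (cp < 0x80)    out += char(cp);
        else if (cp < 0x800)  { out += char(0xC0 | (cp >> 6));
                                out += char(0x80 | (cp & 0x3F)); }
        else if (cp < 0x10000){ out += char(0xE0 | (cp >> 12));
                                out += char(0x80 | ((cp >> 6) & 0x3F));
                                out += char(0x80 | (cp & 0x3F)); }
        else                  { out += char(0xF0 | (cp >> 18));
                                out += char(0x80 | ((cp >> 12) & 0x3F));
                                out += char(0x80 | ((cp >> 6) & 0x3F));
                                out += char(0x80 | (cp & 0x3F)); }
        return out;   // encode_utf8(0x042F) == "\xD0\xAF" -- no 0x2F byte anywhere
    }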


Absolutely, there's quite a few gaffes in java (other languages have similar issues too):

* char being 16-bit and thus unable to hold a modern Unicode character

* lots and lots of string and byte stream methods that have variants without explicit charset parameters, making it very easy to accidentally have a hidden dependency on the implicit default charset (aka -Dfile.encoding)

* Using strings instead of byte[] for filename values

* Having to deal with UnsupportedEncodingExceptions as checked exceptions whenever referencing the UTF8 encoding


The reason for char being 16 bits is that the JLS predates Unicode 2.0 (1996), which added code points beyond the 16-bit range. At that time, UCS-2 was widely believed to be sufficient to represent anything "international."


> which can happen on a Linux system where users use different LC_CTYPE.

If only it were that simple. POSIX paths are encoding-less bags of bytes, you can literally find anything in them aside from NUL, LC_CTYPE isn't even a consideration.


My heart goes out to anyone that ever has to work with encoding issues. I feel like I know 5% of this stuff and there looms a vast mass of chaos like a Gordian knot whenever I need to dip my toes in it.


Every programmer has to deal with encoding; it's a fact of life. A list of bytes without an encoding is just that: a list of bytes. Handling bytes as text (i.e. doing I/O)? You are now working with encodings.

And the funny thing is: it's not that complex. [edit: let me rephrase: it is, but it doesn't have to be. C family is a nightmare, Python et al are delightful]. Unless you don't know how it works; then it's the mystery Gordian knot, as you describe it.

The irony is that encoding is a worry precisely for those who try and stay away from it.

Don't shy away from encodings; embrace them. Then you will learn to love them.

(Another irony: with UTF8 gaining more and more mind share, encoding issues actually become harder to find and debug: they don't show up, and when they do, fewer and fewer people know how to deal with them. Everyone switching to UTF8 just hides the bugs, until it doesn't.)


> let me rephrase: it is, but it doesn't have to be. C family is a nightmare, Python et al are delightful

Unless you do multiplatform development, then the language has a hard time saving you (and Python definitely does not)

> (Another irony: with UTF8 gaining more and more mind share, encoding issues actually become harder to find and debug: they don't show up, and when they do, fewer and fewer people know how to deal with them. Everyone switching to UTF8 just hides the bugs, until it doesn't.)

That's not true at all. A ton of byte sequences (and 13 standalone bytes) are outright illegal in UTF8, and there's a fair amount of error handling in a validating UTF8 decoder[0], whereas there usually isn't any invalid byte (let alone byte sequence) in 8-bit codepages or character sets. When you decode random bytes as Windows-1256 or ISO-8859-9[1] and re-encode them as UTF8, the UTF8 encoder isn't the one at fault for your garbage output; as far as it could see, it got perfectly valid Unicode data.

People passing through unvalidated data (possibly assuming it's UTF8) isn't a problem with UTF8 either, by the way.

[0] whether that's used in strict mode or in replacement mode is a different concern and not a blemish on UTF8 itself

[1] which will always succeed, you can try it at home, just get a bunch of bytes from urandom and feed them to various decoders, chances are low that you'll generate anything the UTF8 decoder will accept, chances are also low that you'll generate anything an ISO-8859 character set will reject.


Python is delightful? Many of the few times I've had encoding issues were in Python.


Python3 is delightful.


Yes, the danger is when people think 'Encodings are too complicated! Surely my framework/platform/library should be abstracting all this stuff for me?' and then proceed to assume that because it should be dealt with at a lower level, it is dealt with at a lower level. All I/O encoding abstractions leak sooner or later. For which reason, all programmers need to stop pretending encodings don't affect them and read this: http://www.joelonsoftware.com/articles/Unicode.html


> let me rephrase: it is, but it doesn't have to be. C family is a nightmare, Python et al are delightful

No. In python you have the absolutely retarded behaviour of a program running fine in the command line but crashing if you redirect the output to a file.


This article is trying to describe how to work on Windows specifically. I don't think it's saying to ignore encodings, though honestly I don't know why -everything- isn't just encoded as UTF-8.

It's sad that a "modern" OS that had a mostly ground-up rewrite (NT) after UTF-8 was invented doesn't have better support for it. I get in-memory storage being UTF-16, and I'd even accept modern OSes storing files in UTF-16, but UTF-8 is so elegantly backwards compatible with ASCII that it's brain dead not to fix -everything- to work with it.

Perhaps this is the biggest difference between a closed OS like Windows and its more open counterparts: if Windows were open, the community could have fixed this issue long ago.


Realistically, there is no way NT could have used UTF-8 in-memory. UTF-8 was scribbled on a placemat on September 2, 1992 and officially presented in January 1993 (https://en.m.wikipedia.org/wiki/UTF-8#History)

Windows NT was released in July 1993, but development started in 1989 (https://en.m.wikipedia.org/wiki/Windows_NT#Development)

Also, even after UTF-8 was accepted to be a good idea, it wasn't considered a good idea for in-memory usage; indexing UCS-2 encoded strings is way easier, and UTF-16 didn't exist yet. It arrived with Unicode 2.0 in July 1996 (https://en.m.wikipedia.org/wiki/UTF-16#History)


The problem with elegant backwards compatibility with ASCII is that few systems ever stuck to ASCII - everybody used different, incompatible extensions to ASCII, like the ISO 8859 family and codepage 1252. So by being elegantly backwards compatible with ASCII, UTF-8 also manages to be subtly incompatible with the majority of almost-ASCII data that exists, in ways that sometimes don't matter. Until they do.


ISO 8859 (or big brother ISO 10367) with ISO 4873 is fairly sane, backwards compatible with ASCII, distinguishable from UTF-8, and historically supported by X terminals… but not much else. ‘Plain ASCII’ is not often fully supported either; how many people even know that the standard included composing accented characters by overstriking (e.g. a BS " → ä)?


The ASCII standard has nothing to say about backspace overstriking - that would be something defined by a terminal, file format, or wire specification. In a more common example, ASCII has nothing to say about how you move the cursor to the start of a new line. Some terminal standards will do so on an LF, others just move the cursor down on LF; file and wire specs need to take a position on how they want to represent a line break; and now we get to the point where CRLF is a magic sequence needed to trigger a line break in some formats and it is completely unrelated to how a particular terminal behaves.


  > The ASCII standard has nothing to say about backspace overstriking
It seems impossible to get ANSI copies of old standards (even for money) but the ECMA (1973) printing says:

    3.2 Diacritical Signs
    (Positions: 2/2, 2/7, 2/12, 5/14, 6/0, 7/14)
    In the 7-bit character set, some printing symbols may be
    designed to permit their use for the composition of
    accented letters when necessary for general interchange
    of information. A sequence of three characters, comprising
    a letter, BACKSPACE and one of these symbols, is needed
    for this composition; the symbol is then regarded as a
    diacritical sign. It should be noted that these symbols
    take on their diacritical significance only when they
    precede or follow the character BACKSPACE; for example,
    the symbol corresponding to the code combination 2/7
    normally has the significance of APOSTROPHE, but becomes
    the diacritical sign ACUTE ACCENT when preceded or
    followed by the character BACKSPACE.
This is precisely the reason ASCII 1967 replaced ← with _ and ↑ with ^ (explicitly still “circumflex accent” in Unicode) and added ` (“grave accent”, likewise).

ANSI made this optional in the 1986 revision, in §2.1.2 —​ “The use of BS for forming composite characters is not required.” — with a note that it would likely be removed from a future revision (but there never was another one).

  > In a more common example, ASCII has nothing to say about how you move the cursor to the start of a new line.
It does; it just says something slightly unfortunate about code 0x0A.

  CR  Carriage Return
      A format effector which moves the active position
      to the first character position *on the same line*.

  LF  Line Feed
      A format effector which advances the active position
      to the *same character position* of the next line.
[Italics added] But then it says:

  The Format Effectors are intended for equipment in
  which horizontal and vertical movements are effected
  separately. If equipment requires the action of
  CARRIAGE RETURN to be combined with a vertical movement,
  the Format Effector for that vertical movement
  may be used to effect the combined movement. For example,
  if NEW LINE (symbol NL, equivalent to CR + LF)
  is required, FE2 shall be used to represent it. This
  substitution requires agreement between the sender and
  the recipient of the data.

  The use of these combined functions may be restricted
  for international transmission on general switched telecommunication
  networks (telegraph and telephone networks).
So CR LF will unambiguously get you the first position on the next line. The code for LF is allowed to be replaced by NL by “agreement”, but CR can't move to the next line.


That's fair, and thanks for digging up those actual standards - they are indeed not easy to find these days. I think, though, that this tends to be the standard speaking to the intended usage of the ASCII codes more than an attempt to standardize the behavior of all input and output devices. It's more a suggestion that some equipment might choose to use backspace for diacritics, and some might separately effect X and Y movement, but others might not. Certainly in practice terminals have always exposed their different capabilities by giving special terminal-code meanings to ASCII sequences, and, when terminal capability negotiation was not an option (e.g. in specifying a file format or defining how to separate SMTP headers), ASCII users have made their own standards for these things (I guess this is the sort of 'agreement' the standard alludes to).


Yes, the ‘dual use’ characters were defined at the request of European delegations, and the language makes it clear that it was expected that English-language terminals would continue to have " look like " and not ¨, and so on. ASCII was defined before video terminals were developed, so no one thought overstriking was remarkable. As it turned out, non-English European language versions ended up not generally using overstrikes anyway, but redefined the ‘national use’ characters #$@[\]{|} instead, which unfortunately did lead to ambiguity.


NT was initially designed (1989-1990) before UTF-8 came along. UTF-16 instead of UCS-2 was shoehorned in later with Win2k.


Python (the 2.x branch, which 90% of programs still use) is horrible. It makes the cardinal mistake of mixing binary data and encoded data in one string type.

C is much better, because at least it doesn't pretend to support encodings. So you're forced to use a 3rd party library anyway.


> C is much better, because at least it doesn't pretend to support encodings. So you're forced to use a 3rd party library anyway.

C is better because it _doesn't_ provide some kind of interface into encoding. A byte is a byte is a byte. Whether that's some slice of a binary blob or the first byte in a multi-byte Unicode string is completely irrelevant. The most infuriating thing I've found with higher-level languages (Ruby/Python/etc.) is their abstractions on top of string encodings that fail to cover the hundreds of thousands of edge cases (which, rightfully, they shouldn't cover either). That means the only time you do end up having to wrangle encodings is when things go south and you have to fight the language's defaults just to get some stupid email address, which most likely had a bit flip in transit somewhere, to properly write to a CSV.


Python delightful? In Python 2, the behavior of Unicode strings (UTF-16 vs. UTF-32) depends on what options the interpreter was compiled with. Then they made a backward-incompatible change to "fix" Unicode but did the exact wrong thing (UTF-32, UCS-2 or ISO-8859-1 depending on string content) instead of the right thing (UTF-8 storage with UTF-32 iterators).


I got the impression that the internal encoding Python chooses for a string is an implementation detail you shouldn't (and cannot) care about. All the public ways of accessing string contents operate on code points, unless you convert the string to a byte array first.


I've noticed that Excel seems to assume that CSVs are not in UTF-8. I know the standard is actually ASCII, but nobody on the team seemed willing to understand why their smart quotes came out as "weird characters" (or even understand what smart quotes are). It came out correctly in LibreOffice.


Excel predates Unicode by a few years. The first true Excel version on Mac was released in 1985. The first generally recognized use of Unicode dates back to 1988.

Excel CSVs default to the system codepage when reading and writing, which makes some sense given that the files are plain text. You can actually control what codepage is used when importing data but you have to use the import option.


This is why I always write a BOM in CSV files (yes, even though it's UTF-8 and it shouldn't have a BOM). Excel will actually respect it.

The bigger problem is when Excel tries to be helpful and reformats the data upon opening the file. I can't remember the intricacies, but it used to behave differently whether you opened it from a local path or directly from a browser. I seem to recall using an "incorrect" MIME type helped with that.
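
For what it's worth, the BOM trick is just three bytes at the front of the file; a sketch (file name and contents made up). As the reply below notes, some CSV readers then treat those bytes as part of the first header name.

    #include <fstream>

    int main() {
        std::ofstream csv("report.csv", std::ios::binary);
        csv << "\xEF\xBB\xBF";                                     // UTF-8 BOM: Excel now assumes UTF-8
        csv << "name,city\n" "Jos\xC3\xA9,S\xC3\xA3o Paulo\n";     // UTF-8 payload
        return 0;
    }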


Adding a BOM to a CSV file then causes many libraries to add the BOM to the first column's heading name.


It's an option that just defaults to the system ANSI codepage. After clicking Data > From Text, in the "Text Import Wizard - Step 1 of 3" screen, change the "File origin" to "65001: Unicode (UTF-8)"


Some years ago I had to work with text encoded with 3GPP TS 23.038 [1], which is a 7-bit packed encoding with a few quirks. It wasn't particularly fun.

[1] https://en.m.wikipedia.org/wiki/GSM_03.38
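
For the curious, the "packed" part means eight 7-bit characters squeezed into seven octets, LSB first. A sketch of just that step (mapping text to the GSM 03.38 alphabet and its escape table is a separate exercise):

    #include <cstdint>
    #include <vector>

    std::vector<std::uint8_t> pack_septets(const std::vector<std::uint8_t>& septets) {
        std::vector<std::uint8_t> out;
        int shift = 0;
        for (std::size_t i = 0; i < septets.size(); ++i) {
            if (shift == 7) { shift = 0; continue; }   // every 8th septet already rode in the previous byte
            std::uint8_t cur  = septets[i] >> shift;
            std::uint8_t next = (i + 1 < septets.size()) ? septets[i + 1] : 0;
            out.push_back(cur | std::uint8_t(next << (7 - shift)));   // low bits of this char, high bits of the next
            ++shift;
        }
        return out;
    }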


The telephony standards are all kinds of bizarre. It's a testament as to how well the specs are written that it actually works.


The specs are incredibly dense and hard to read, but they are quite detailed.

I found many of the protocols to be both incredibly clever and incredibly annoying at the same time.


Agreed completely. I recently had to work with some EBCDIC output files from old mainframe systems: packed string data in a fixed delimited file. Nail-bitingly frustrating stuff.


Having built countless similar conversions, I can say EBCDIC[1] to ASCII can be as simple as a lookup table, and "packed string data" is just binary-coded decimal [2] with a sign.

What is a fixed delimited file? A 'mainframe' file is generally fixed-length or delimited.

[1]: https://en.wikipedia.org/wiki/EBCDIC

[2]: https://en.wikipedia.org/wiki/Binary-coded_decimal
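
For anyone staring at such a file, the packed ("COMP-3") fields decode in a few lines; a sketch assuming the usual sign nibbles (0xD negative, 0xC/0xF positive), a non-empty field, and no error handling:

    #include <cstddef>
    #include <cstdint>

    long long unpack_comp3(const std::uint8_t* data, std::size_t len) {
        long long value = 0;
        for (std::size_t i = 0; i < len; ++i) {
            value = value * 10 + (data[i] >> 4);            // high nibble is always a digit
            if (i + 1 < len)
                value = value * 10 + (data[i] & 0x0F);      // low nibble is a digit...
        }
        std::uint8_t sign = data[len - 1] & 0x0F;           // ...except in the last byte, where it is the sign
        return (sign == 0x0D) ? -value : value;
    }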


I have used delimiters within a fixed-width file so that I can fit more data into a field in a fixed-width file layout. This is usually when I have to work with a mainframe and can't define the layout, of course.



It is the basis for the "graphics" of Dwarf Fortress. The game now uses SDL and a tileset, but still includes a text mode (though not on Windows).


I tried to make it work on Windows.

I gave up. This article describes why.


Things would probably be slightly better if Unicode weren't so difficult to support properly. For example, good luck getting your UTF-8 text output to render correctly in a terminal window under all circumstances (try combining diacritics or funny letters like "𖥶" - can you copy/paste that using the mouse?), or in a text editor like vi or Emacs with a terminal UI. Using UTF-8 typically means having to think long and hard about filtering input/data, normalization and stuff like http://www.unicode.org/reports/tr36/ . If you don't, you'll run into such issues eventually, even if you think your code only needs to handle a few "western" languages.


A decade ago that would be a valid argument, and processing Unicode text still isn't trivial, but nowadays Unicode support (UTF-8 in particular) is so well embedded in all operating systems and programming languages that this is not the problem. And on most operating systems the problem simply doesn't exist for most people because Unicode is the default.

The problem here is Microsoft clinging to legacy character encodings that have no place on a modern operating system as default for anything.


> so well embedded in all operating systems and programming languages that this is not the problem.

It's not the "enabling" of Unicode that's problematic, it's the features of more exotic Unicode codepoints that aren't correctly supported by many programs in all major operating systems, because they weren't designed with these features in mind.

Tell me, can you copy/paste the "𖥶" character with the mouse after double-clicking it in your browser? It's a letter - http://unicode-table.com/en/16976/, so it should IMO be treated like a quoted word (correct me if I'm wrong). Safari on OS X selects either the quote and the letter or just the right quote depending on where exactly on the letter I double click. Safari/Webkit is relatively modern and well-maintained too. Unicode "supported"? Sure. Working well? Nope, just for a few common use cases.


> Tell me, can you copy/paste the "𖥶" character with the mouse after double-clicking it in your browser?

No problem (Linux/Firefox).

The Unicode standard is not static (unlike those old codepages mentioned by OP); it is being changed and improved continuously. Support for characters outside of the BMP (such as the one you mention) is mostly there, but it may not be at the level of the basic scripts supported by, say, Unicode 5.0. That is fine. New features in standards take time to implement (the same thing happens with HTML and CSS).

For developers Unicode support is there. Has been for years.


Your example works fine on Chrome and Internet Explorer on Windows, by the way.


On OS X/Google Chrome, it fails to render the character (all I see is a square), but copy-paste works as expected, and I can view the character correctly after pasting it into Sublime.

On OS X/Safari, it renders fine but copy-paste is bugged like you mention (can't select with double click).


> good luck getting your UTF-8 text output to render correctly

That seems more like a comment on how terminals suck, not Unicode. Everything is painful in the terminal - I mean, good luck making your text red or green too!

> Using UTF-8 typically means having to think long and hard about filtering input/data, normalization

Typically? I really don't think so. I almost never worry about such things. It's a rare occasion when I have to deal with those kinds of things.

It's really a mistake to think that because your program can handle a particular character set it can therefore handle any language which uses those characters. You're always going to need to add support for specific languages, even when using ASCII, e.g. the sorting order of names is different in German and English. French typographical rules are different w.r.t. spacing. Russian typesetting tradition formats math rather differently. Line breaking in Thai requires a dictionary because the language has no spaces.

Any software will run into problems eventually, usually with something as simple as German. Decide which languages you're going to support, not which character codes!


On 16-bit vs 32-bit Unicode/ISO 10646, I think it basically boils down to ISO 10646 wanting 32-bit but having no software folks on that committee, while the Unicode people were basically software folks but thought that 16 bits was enough.


"Disclaimer: The file.encoding property is undocumented and not officially supported, and it has been reported to act inconsistently across Java versions and platforms."

Oh Christ on a cracker. Why doesn't Oracle just default everything to UTF-8 in the next version of Java?


Too late for that kind of change, really. 20 years of legacy systems are not going to enjoy that.


If their legacy system doesn't support UTF-8, I doubt they'd update their system at all.


They probably want to update the JVM for security purposes, though. There's plenty that will absolutely kill you in any public-facing JVM from more than a few years back. The first release of 7 has at least two very bad denial-of-service attacks that allow a remote user to saturate as many CPUs on your server as they like; all they need to do is send magic HTTP headers to your Tomcat or whatever.


People say that about operating systems, yet you have businesses still running Windows XP simply because they don't want to update. Yes, there are some that don't for compatibility reasons, but others don't upgrade simply because they don't want to.


Now imagine you are debugging the console output after it went through a web server, your browser, your local dev environment, your local console, and was then pasted into your text editor.


However, if you use WriteConsoleW you can actually output both of the characters in "€∑" under ANY code page. The NT console is a matrix of UTF-16 code points, and there is an API to write to it. So the question becomes why Java (and many other platforms, like Python) did not use this mechanism. Notably, Java's internal encoding of strings is UTF-16 as well, so this shouldn't be a problem.
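
A minimal sketch of that call (a real Win32 API; note it only works when stdout really is a console, which ties into the redirection point below):

    #include <windows.h>

    int main() {
        const wchar_t text[] = L"\u20AC\u2211\n";           // '€' and '∑' as UTF-16 code units
        DWORD written = 0;
        WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), text, 3, &written, nullptr);
        return 0;                                           // when output is redirected to a file, WriteFile is needed instead
    }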


One of the great benefits of a console interface is that it's a lowest common denominator between different OS. That interface is based on a byte-level I/O model. Things like I/O redirection depend on it.


They can implement an alternative driver to convert I/O streams into API calls. And that's what libuv did.


Notably this is a Windows problem.

Sane platforms default to UTF-8 and output exactly what you would expect: €Σ


lol. I just failed to mount an SD card on my OpenWrt modem yesterday. The error was that the exFAT partition was using that code page and my modem didn't have it.



