
Code Page 437 Refuses to Die - ingve
http://horstmann.com/unblog/2016-05-06/index.html
======
alexbock
Trying to handle character encoding on Windows in multi-platform programs is a
nightmare. In C++ you can almost always get away with treating C strings as
UTF-8 for input/output; you only need to treat the encoding specially for
language-aware tasks like converting to lowercase or measuring the "display
width" of a string. Not on Windows. Whether or not
you define the magical UNICODE macro, Windows will fail to open UTF-8 encoded
filenames using standard C library functions. You have to use non-standard
wchar overloads or use the Windows API. That is to say, _there is no standard-
conformant internationalization-friendly way to open a file by name on Windows
in C or C++_. I really wish Microsoft would at least support UTF-8, even if
they want to stick with UTF-16 internally.
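
The usual workaround looks something like the sketch below - convert the
name yourself and call the non-standard _wfopen (fopen_utf8 is just a
hypothetical helper name, and error handling is mostly omitted):

    #include <stdio.h>
    #include <windows.h>

    /* Open a file whose name is UTF-8 encoded, on Windows. */
    FILE *fopen_utf8(const char *utf8_name, const wchar_t *mode)
    {
        wchar_t wname[MAX_PATH];
        /* Convert the UTF-8 name to UTF-16 for the wchar_t overload. */
        if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                utf8_name, -1, wname, MAX_PATH) == 0)
            return NULL;
        return _wfopen(wname, mode);   /* non-standard, Windows-only */
    }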

The section titled "How to do text on Windows" on
[http://utf8everywhere.org/#windows](http://utf8everywhere.org/#windows)
covers the insanity in more detail.

~~~
CountSessine
For a company that claims to be so supportive of "developers, developers,
developers", Microsoft's stubborn and developer-hostile approach to
internationalization and its dogged loyalty to the awful UTF-16 encoding are
ironic.

The Right Thing To Do at this point is to make UTF-8 a multi-byte code page in
Windows and build a UTF-8 implementation in the msvc libc. The milquetoast
excuse I hear from Microsoft people is that some win32 APIs can't handle MBCS
encodings with more than 3 bytes per character. Which sort of sounds like a
problem for developers to fix; perhaps Microsoft could hire some?

~~~
bitwize
Before "developers, developers, developers" comes "backward compatibility,
backward compatibility, backward compatibility". Windows is perhaps the first
commercial platform to commit to Unicode; they made that commitment when UTF-8
was still some notes scribbled on Brian Kernighan's napkin. And all future
Win32 implementations must be 100% binary compatible with previous ones. That
creates inertia for UTF-16 (or UCS-2), true, but the backwards compatibility
guarantees make Windows an absolute joy compared to Linux if you want to write
software with a long service lifetime. The decision to stick with 16-bit
Unicode is an engineering tradeoff.

~~~
CountSessine
_but the backwards compatibility guarantees make Windows an absolute joy
compared to Linux if you want to write software with a long service lifetime_

The Linux user-space ABI is extremely stable. I think the one thing that would
frustrate the development of long-service-lifetime software on Linux is
library availability across the various distros, but the actual operating
environment presented by the kernel and the image loader to user-space
software on Linux is very, very stable.

 _And all future Win32 implementations must be 100% binary compatible with
previous ones_

And no one is asking Microsoft to break the Windows user-space ABI. Adding a
new, sufficiently tested MBCS code page would have no impact at all on
existing Windows software. None whatsoever. Other than to make localization a
whole lot easier. And compared to the cost of building in a whole new csrss-
level subsystem like the one Microsoft just built into Windows last month (the
Linux subsystem), it's probably a lot safer and easier to test.

Working in this world myself, I find that Windows developers become
philosophical when discussing localization: "Someday, we'll turn on UNICODE,"
"We really should be using TCHAR" (as if that silly thing would fix anything
at all), "Shouldn't we really be using wstring?"

OSX, Linux, and mobile developers just do it. It's mostly a solved problem on
those platforms.

 _The decision to stick with 16-bit Unicode is an engineering tradeoff._

It's a cost tradeoff - Microsoft doesn't want to spend the developer and
testing time adding a UTF-8 codepage.

~~~
jstarks
Vote here: [https://wpdev.uservoice.com/forums/266908-command-prompt-
con...](https://wpdev.uservoice.com/forums/266908-command-prompt-console-bash-
on-ubuntu-on-windo/suggestions/6514191-utf-8).

Currently we do "support" CP65001 in the console, but things break if you
enable it. One of the problems, for example, is that .NET sees 65001 and
starts outputting the UTF-8 BOM everywhere, breaking applications that don't
even care about the character encoding. I suspect that's going to be difficult
to fix without breaking compatibility.
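
(For concreteness, "enabling it" means switching the console to code page
65001, either with chcp 65001 in the shell or from code - a minimal
sketch:)

    #include <stdio.h>
    #include <windows.h>

    int main(void)
    {
        /* Switch the console output code page to UTF-8 (65001). */
        SetConsoleOutputCP(CP_UTF8);
        /* The raw UTF-8 bytes for "€∑". */
        fputs("\xE2\x82\xAC\xE2\x88\x91\n", stdout);
        return 0;
    }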

Having said that, I think it's apparent that we are investing heavily in the
console for the first time in a long while, so I'm more hopeful than ever that
we can get this fixed.

~~~
mark-r
The BOM is an aBOMination. If you simply assume that everything is UTF-8 until
proven otherwise, you can get pretty far - the legacy code pages produce text
that's not usually valid UTF-8 unless they stick to ASCII.
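
The check is cheap, too. A minimal structural validator - just a sketch; a
real one must also reject overlong forms and surrogates - fits in a few
lines of C:

    #include <stddef.h>

    /* Return 1 if buf[0..len) is structurally valid UTF-8. */
    int looks_like_utf8(const unsigned char *buf, size_t len)
    {
        size_t i = 0;
        while (i < len) {
            unsigned char b = buf[i++];
            size_t n;                           /* continuation bytes expected */
            if (b < 0x80)                n = 0; /* plain ASCII */
            else if ((b & 0xE0) == 0xC0) n = 1; /* 2-byte sequence */
            else if ((b & 0xF0) == 0xE0) n = 2; /* 3-byte sequence */
            else if ((b & 0xF8) == 0xF0) n = 3; /* 4-byte sequence */
            else return 0;                      /* stray continuation byte */
            while (n--)
                if (i >= len || (buf[i++] & 0xC0) != 0x80)
                    return 0;                   /* truncated sequence */
        }
        return 1;
    }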

~~~
be5invis
Yes, you can assume that, but existing applications, some written in the 1990s
or even the 1980s, won't. And millions of computers, some at important
industrial companies, are still running them.

~~~
mark-r
If that's the case they won't handle a BOM either.

------
speeder
I am from Brazil, meaning we needed a custom code page here for our language's
characters (including the Ç mentioned in the article).

It never worked quite right; since ancient DOS times there have been several
bugs with it.

What surprises me is that it STILL doesn't work right.

I have Windows set to English, the keyboard to Brazilian, and had to set the
"locale" to Japanese to play some Japanese games that outright crash otherwise
(they don't even render wrong, they just crash).

I lost count of how many times programs, instead of using Unicode when they
could, tried to figure out a code page from my location (ending up with
Brazilian), language (English), or "locale" (Japanese), and got it all mixed
up and wrong.

Stuff I saw:

Important installers (for example drivers and expensive software) that render
the interface in Japanese and the EULA in Portuguese but with the US code
page.

Japanese font + Portuguese text (ending up as utter nonsense).

English text + Brazilian font... and many others.

Still, to me this is merely an "annoyance", but a similar issue made my mother
panic completely:

My dad coded some software for our family business to do some tax stuff that
is mandatory in Brazil if you do any business at all (a sort of tax report for
every single transaction). He did it in PHP, for Linux, but running on
Windows. So far so good...

Then one day he had to fix a bug, and the only machine available ran WinXP...
he fixed the bug, and suddenly the program started dumping lots of corrupted
data to government servers (as you can imagine, that is really bad).

We went to look, and for some reason it is now sending code page 437 formatted
data. We have no idea why and can't find what changed; we didn't even use a
Windows-based editor (we used Geany on WinXP).

~~~
Latty
Rather than using Japanese locale all the time, you can try Locale Emulator
([https://xupefei.github.io/Locale-
Emulator/](https://xupefei.github.io/Locale-Emulator/)), which allows you to
run individual programs in a given locale.

~~~
ferongr
From personal experience I can say that Locale Emulator doesn't work reliably
for many games.

------
0x0
The most hilarious issue with -Dfile.encoding is that if it is set to, for
example, UTF-8, then it is literally impossible to open files with names
encoded in, say, ISO-8859-1, which can happen on a Linux system where users
use different LC_CTYPE settings.

You can do a "dir.listFiles()" and iterate it, and find that some of the
entries are impossible to open because there is no way to represent the
ISO-8859-1 bytes that make up the filename in a String object, and therefore
no way to give the correct file name to the java io classes for opening.

~~~
userbinator
_You can do a "dir.listFiles()" and iterate it, and find that some of the
entries are impossible to open because there is no way to represent the
ISO-8859-1 bytes that make up the filename in a String object, and therefore
no way to give the correct file name to the java io classes for opening._

That gives more weight to the argument that character encodings should be
entirely a "presentation-layer" concern, and anything below that should treat
strings as opaque byte sequences. There should also be a way to bypass that
"presentation layer" transformation upon input.

~~~
tankenmate
Indeed, this is exactly what the Linux kernel does; it treats filenames as
binary blobs, with the exception that the bytes 0x2F (ASCII '/') and 0x00
(ASCII NUL) are not accepted (regardless of where they appear in the byte
string).
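
A hypothetical example in C makes the point - any other byte is a legal
filename byte, whether or not it is valid UTF-8:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* 0xE9 is 'é' in ISO-8859-1 and is not valid UTF-8 by itself,
           but the kernel accepts it: only 0x2F and 0x00 are special. */
        const char name[] = { 'c', 'a', 'f', (char)0xE9, '\0' };
        int fd = open(name, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0)
            close(fd);
        return 0;
    }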

~~~
Hondor
Wow, so characters like U+022F (ȯ), U+042F (Cyrillic letter Я), and U+062F
(Arabic letter د) are not allowed but nearly everything else is? Some of those
are letters used in actual languages. That's sure to make people scratch their
heads.

~~~
rdancer

        codepoint -> encoding in UTF-8
    
        U+022F    -> C8 AF  
        U+042F    -> D0 AF  
        U+062F    -> D8 AF
    

Remember, UTF-8 is self-synchronising: when you pick up at a random point
within a stream, there is no ambiguity as to whether you are in the middle of
a sequence or not. Valid lower codepoints appearing in the encoding of higher
codepoints would violate this property.
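
That property is also what makes resynchronisation trivial in code; a
sketch:

    #include <stddef.h>

    /* Continuation bytes always match 10xxxxxx, so from any offset we
       can back up to the lead byte of the current code point. */
    size_t utf8_sync_back(const unsigned char *buf, size_t pos)
    {
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;
        return pos;
    }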

------
lubujackson
My heart goes out to anyone who ever has to work with encoding issues. I feel
like I know 5% of this stuff, and a vast mass of chaos looms like a Gordian
knot whenever I need to dip my toes into it.

~~~
nothrabannosir
Every programmer has to deal with encoding; it's a fact of life. A list of
bytes without an encoding is just that: a list of bytes. Handling bytes as
text (i.e. "I/O")? You are now working with encodings.

And the funny thing is: it's not that complex. [Edit: let me rephrase: it is,
but it doesn't have to be. The C family is a nightmare; Python et al. are
delightful.] Unless you don't know how it works---then it's the mystery
Gordian knot, as you describe it.

The irony is that encoding is a worry precisely for those who try and stay
away from it.

Don't shy away from encodings; embrace them. Then you will learn to love them.

(Another irony: with UTF8 gaining more and more mind share, encoding issues
actually become _harder_ to find and debug: they don't show up, and when they
do, fewer and fewer people know how to deal with them. Everyone switching to
UTF8 just hides the bugs, until it doesn't.)

~~~
bluejekyll
This article is trying to describe how to work on Windows specifically. I
don't think it's saying to ignore encodings, though honestly I don't know why
-everything- isn't just encoded as UTF-8.

It's sad that a "modern" OS that had a mostly ground-up rewrite (NT) after
UTF-8 was invented doesn't have better support for it. I get in-memory storage
being UTF-16, and I'd even accept modern OSes storing files in UTF-16, but
UTF-8 is so elegantly backwards compatible with ASCII that it's brain-dead not
to fix -everything- to work with it.

Perhaps this is the biggest difference between a closed OS like Windows and
its more open counterparts; if Windows were open, the community could have
fixed this issue long ago.

~~~
jameshart
The problem with elegant backwards compatibility with ASCII is that few
systems ever stuck to ASCII - everybody used different, incompatible
extensions to ASCII, like the ISO 8859 family and codepage 1252. So by being
elegantly backwards compatible with ASCII, UTF-8 also manages to be subtly
incompatible with the majority of almost-ASCII data that exists, in ways that
_sometimes_ don't matter. Until they do.

~~~
kps
ISO 8859 (or big brother ISO 10367) with ISO 4873 is fairly sane, backwards
compatible with ASCII, distinguishable from UTF-8, and historically supported
by X terminals… but not much else. ‘Plain ASCII’ is not often fully supported
either; how many people even know that the standard included composing
accented characters by overstriking (e.g. a BS " → ä)?

~~~
jameshart
The ASCII standard has nothing to say about backspace overstriking - that
would be something defined by a terminal, file format, or wire specification.
In a more common example, ASCII has nothing to say about how you move the
cursor to the start of a new line. Some terminal standards will do so on an
LF, others just move the cursor down on LF; file and wire specs need to take a
position on how they want to represent a line break; and now we get to the
point where CRLF is a magic sequence needed to trigger a line break in some
formats and it is completely unrelated to how a particular terminal behaves.

~~~
kps

      > The ASCII standard has nothing to say about backspace overstriking
    

It seems impossible to get ANSI copies of old standards (even for money) but
the ECMA (1973) printing says:

    
    
        3.2 Diacritical Signs
        (Positions: 2/2, 2/7, 2/12, 5/14, 6/0, 7/14)
        In the 7-bit character set, some printing symbols may be
        designed to permit their use for the composition of
        accented letters when necessary for general interchange
        of information. A sequence of three characters, comprising
        a letter, BACKSPACE and one of these symbols, is needed
        for this composition; the symbol is then regarded as a
        diacritical sign. It should be noted that these symbols
        take on their diacritical significance only when they
        precede or follow the character BACKSPACE; for example,
        the symbol corresponding to the code combination 2/7
        normally has the significance of APOSTROPHE, but becomes
        the diacritical sign ACUTE ACCENT when preceded or
        followed by the character BACKSPACE.
    

This is precisely the reason ASCII 1967 replaced ← with _ and ↑ with ^
(explicitly still “circumflex _accent_ ” in Unicode) and added ` (“grave
accent”, likewise).

ANSI made this _optional_ in the 1986 revision, in §2.1.2 —​ “The use of BS
for forming composite characters is not required.” — with a note that it would
likely be removed from a future revision (but there never was another one).

    
    
      > In a more common example, ASCII has nothing to say about how you move the cursor to the start of a new line.
    

It does; it just says something slightly unfortunate about code 0x0A.

    
    
      CR  Carriage Return
          A format effector which moves the active position
          to the first character position *on the same line*.
    
      LF  Line Feed
          A format effector which advances the active position
          to the *same character position* of the next line.
    

[Italics added] But then it says:

    
    
      The Format Effectors are intended for equipment in
      which horizontal and vertical movements are effected
      separately. If equipment requires the action of
      CARRIAGE RETURN to be combined with a vertical movement,
      the Format Effector for that vertical movement
      may be used to effect the combined movement. For example,
      if NEW LINE (symbol NL, equivalent to CR + LF)
      is required, FE2 shall be used to represent it. This
      substitution requires agreement between the sender and
      the recipient of the data.
    
      The use of these combined functions may be restricted
      for international transmission on general switched telecommunication
      networks (telegraph and telephone networks).
    

So CR LF will unambiguously get you the first position on the next line. The
code for LF is allowed to be replaced by NL by “agreement”, but CR can't move
to the next line.

~~~
jameshart
That's fair, and thanks for digging up those actual standards - they are
indeed not easy to find these days. I think, though, that this tends to be the
standard speaking to the intended usage of the ASCII codes more than an
attempt to standardize the behavior of all input and output devices. It's more
a suggestion that some equipment might choose to use backspace for diacritics,
and some might effect X and Y movement separately, but others might not.
Certainly in practice terminals have always exposed their different
capabilities by giving special terminal-code meanings to ASCII sequences, and,
when terminal capability negotiation was not an option (e.g. in specifying a
file format or defining how to separate SMTP headers), ASCII users have made
their own standards for these things (I guess this is the sort of 'agreement'
the standard alludes to).

~~~
kps
Yes, the ‘dual use’ characters were defined at the request of European
delegations, and the language makes it clear that English-language terminals
were expected to continue to have " look like " and not ¨, and so on. ASCII
was defined before video terminals were developed, so no one thought
overstriking was remarkable. As it turned out, non-English European language
versions ended up not generally using overstrikes anyway, but redefined the
‘national use’ characters #$@[\\]{|} instead, which unfortunately did lead to
ambiguity.

------
userbinator
437 is well-known for its use of line-drawing and other artistic characters:

[https://en.wikipedia.org/wiki/File:Norton_Commander_5.51.png](https://en.wikipedia.org/wiki/File:Norton_Commander_5.51.png)

[https://en.wikipedia.org/wiki/ASCII_art#.22Block_ASCII.22_.2...](https://en.wikipedia.org/wiki/ASCII_art#.22Block_ASCII.22_.2F_.22High_ASCII.22_style_ASCII_art_on_the_IBM_PC)

~~~
stepvhen
It is the basis for the "graphics" of Dwarf Fortress. The game now uses SDL
and a tileset, but it still includes a text mode (though not on Windows).

~~~
Filligree
I tried to make it work on Windows.

I gave up. This article describes why.

------
lazyjones
Things would probably be slightly better if Unicode weren't so difficult to
support properly. For example, good luck getting your UTF-8 text output to
render correctly in a terminal window under all circumstances (try combining
diacritics, or funny letters like "𖥶" - can you copy/paste that using the
mouse?), or in a text editor like vi or Emacs with a terminal UI. Using UTF-8
typically means having to think long and hard about filtering input/data,
normalization, and stuff like
[http://www.unicode.org/reports/tr36/](http://www.unicode.org/reports/tr36/).
If you don't, you'll run into such issues eventually, even if you think your
code only needs to handle a few "western" languages.

~~~
Freak_NL
A decade ago that would be a valid argument, and processing Unicode text still
isn't trivial, but nowadays Unicode support (UTF-8 in particular) is so well
embedded in all operating systems and programming languages that this is not
the problem. And on most operating systems the problem simply doesn't exist
for most people because Unicode is the default.

The problem here is Microsoft clinging to legacy character encodings that have
no place on a modern operating system as default for anything.

~~~
lazyjones
> _so well embedded in all operating systems and programming languages that
> this is not the problem._

It's not the "enabling" of Unicode that's problematic, it's the features of
more exotic Unicode codepoints that aren't correctly supported by many
programs in all major operating systems, because they weren't designed with
these features in mind.

Tell me, can you copy/paste the "𖥶" character with the mouse after double-
clicking it in your browser? It's a letter - [http://unicode-
table.com/en/16976/](http://unicode-table.com/en/16976/), so it should IMO be
treated like a quoted word (correct me if I'm wrong). Safari on OS X selects
either the quote and the letter or just the right quote depending on where
exactly on the letter I double click. Safari/Webkit is relatively modern and
well-maintained too. Unicode "supported"? Sure. Working well? Nope, just for a
few common use cases.

~~~
Freak_NL
> Tell me, can you copy/paste the "𖥶" character with the mouse after double-
> clicking it in your browser?

No problem (Linux/Firefox).

The Unicode standard is not static (unlike those old codepages mentioned by
OP); it is being changed and improved continuously. Support for characters
outside of the BMP (such as the one you mention) is mostly there, but it may
not be at the level of the basic scripts supported by, say, Unicode 5.0. That
is fine. New features in standards take time to implement (the same thing
happens with HTML and CSS).

For developers Unicode support is there. Has been for years.

------
yuhong
On 16-bit vs 32-bit Unicode/ISO 10646, I think it basically boils down to
this: ISO 10646 wanted 32-bit but had no software folks on its committee,
while the Unicode people were basically software folks but thought that 16-bit
was enough.

------
phyzome
"Disclaimer: The file.encoding property is undocumented and not officially
supported, and it has been reported to act inconsistently across Java versions
and platforms."

Oh Christ on a cracker. Why doesn't Oracle just default everything to UTF-8 in
the next version of Java?

~~~
chillydawg
Too late for that kind of change, really. 20 years of legacy systems are not
going to enjoy that.

~~~
colejohnson66
If their legacy system doesn't support UTF-8, I doubt they'd update their
system at all.

~~~
chillydawg
They probably want to update the JVM for security purposes, though. There's
plenty that will absolutely kill you in any public-facing JVM from more than a
few years back. The first release of 7 has at least two very bad denial-of-
service vulnerabilities that let a remote user saturate as many CPUs on your
server as they like; all they need to do is send magic HTTP headers to your
Tomcat or whatever.

~~~
colejohnson66
People say that about operating systems, yet businesses are still running
Windows XP simply because they don't want to update. Yes, there are some that
stay for compatibility reasons, but there are others that don't upgrade simply
because they don't want to.

------
michaeledwards
Now imagine you are debugging the console output after it went through a web
server, your browser, your local dev environment, your local console, and then
got pasted into your text editor.

------
be5invis
However, if you use WriteConsoleW, you can actually output both characters of
"€∑" in ANY code page. The NT console is a matrix of UTF-16 code units, and
there is an API to write to it. So the question becomes why Java (and many
other platforms, like Python) did not use this mechanism. Notably, Java's
internal encoding of strings is UTF-16 as well, so this shouldn't have been a
problem.
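
A minimal sketch of that call (it only helps while stdout is a real
console; redirected output still goes through the byte-stream path):

    #include <windows.h>

    int main(void)
    {
        /* Write "€∑" to the console directly as UTF-16,
           bypassing the current code page entirely. */
        const wchar_t text[] = L"\u20AC\u2211\n";
        DWORD written;
        WriteConsoleW(GetStdHandle(STD_OUTPUT_HANDLE), text,
                      (DWORD)(sizeof(text) / sizeof(wchar_t) - 1),
                      &written, NULL);
        return 0;
    }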

~~~
mark-r
One of the great benefits of a console interface is that it's a lowest common
denominator between different OSes. That interface is based on a byte-level
I/O model, and things like I/O redirection depend on it.

~~~
be5invis
They could implement an alternative driver that converts I/O streams into API
calls. That's what libuv did.

------
hapless
Notably this is a Windows problem.

Sane platforms default to UTF-8 and output exactly what you would expect: €∑

------
gcb0
lol. I just failed to mount an SD card on my OpenWrt modem yesterday. The
error was that the exFAT partition was using that code page and my modem
didn't have it.

