
When monospace fonts aren't: The Unicode character width nightmare - chx
http://denisbider.blogspot.com/2015/09/when-monospace-fonts-arent-unicode.html
======
jhallenworld
I recently changed how JOE dealt with this. Originally it used Markus Kuhn's
wcwidth function
([http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c](http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c)),
but I've changed it to use the data in EastAsianWidth.txt:
[http://sourceforge.net/p/joe-editor/mercurial/ci/default/tre...](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicode.c#l510)

JOE uses 4-level radix trees for character classes. These work well because
the leaf nodes are highly redundant and can be merged together. The resulting
structure is often smaller than a binary tree. Character classes are also used
for regular expressions, so there is code to build them on the fly from a list
of ranges (it's tricky to do this efficiently).
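
A rough sketch of the merged-leaf idea, in Python for brevity. This is not JOE's code: it uses two levels instead of four, and the sample ranges, `build_class`, and `in_class` are hypothetical names chosen for illustration. It only shows why identical leaf blocks can be stored once.

```python
# Two-level lookup table with shared leaves: a simplified sketch of the
# merged-leaf idea, not JOE's actual 4-level radix tree.
LEAF_BITS = 8
LEAF_SIZE = 1 << LEAF_BITS
MAX_CODE_POINT = 0x110000


def build_class(ranges):
    """Build (index, leaves) from inclusive (first, last) code point ranges."""
    leaves = []     # unique leaf blocks, each a tuple of LEAF_SIZE booleans
    leaf_ids = {}   # leaf block -> its position in `leaves`
    index = []      # one entry per block of LEAF_SIZE code points
    for base in range(0, MAX_CODE_POINT, LEAF_SIZE):
        block = [False] * LEAF_SIZE
        for first, last in ranges:
            lo = max(first, base)
            hi = min(last, base + LEAF_SIZE - 1)
            for cp in range(lo, hi + 1):
                block[cp - base] = True
        key = tuple(block)
        if key not in leaf_ids:          # identical blocks are stored once
            leaf_ids[key] = len(leaves)
            leaves.append(key)
        index.append(leaf_ids[key])
    return index, leaves


def in_class(table, ch):
    """Test membership with two array lookups."""
    index, leaves = table
    cp = ord(ch)
    return leaves[index[cp >> LEAF_BITS]][cp & (LEAF_SIZE - 1)]


# Hypothetical sample: a few well-known wide ranges (Hangul Jamo,
# CJK Unified Ideographs, Hangul Syllables), not the full data file.
WIDE = build_class([(0x1100, 0x115F), (0x4E00, 0x9FFF), (0xAC00, 0xD7A3)])
print(len(WIDE[0]), "index entries share", len(WIDE[1]), "leaf blocks")
print(in_class(WIDE, "中"), in_class(WIDE, "a"))
```

Most blocks are either entirely inside or entirely outside a range, so thousands of index entries end up pointing at a handful of leaves.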

Anyway, I'm surprised that emoji are not double-wide characters.

JOE is still missing Unicode normalization for string searches.
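
To illustrate why normalization matters for search, here is a minimal Python sketch using the standard `unicodedata` module. It is not how JOE would do it; `normalized_contains` is a hypothetical helper, and it only answers yes or no because offsets shift once the text is normalized.

```python
import unicodedata


def normalized_contains(haystack, needle, form="NFC"):
    """Return True if needle occurs in haystack once both are normalized."""
    return unicodedata.normalize(form, needle) in unicodedata.normalize(form, haystack)


composed = "caf\u00e9"      # é as a single code point
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings look identical on screen but differ code point by code point.
print(composed in "un " + decomposed)                      # plain search misses it
print(normalized_contains("un " + decomposed, composed))   # normalized search finds it
```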

------
waqf
In most languages there's not really much need to display source code in
monospace, except to the extent that a previous programmer formatted code with
that assumption.

When you're trying to align lines of a block, you only need that a string of
_n_ initial spaces (or tabs) is the same width every time: you don't care if
it is the same as a string of _n_ arbitrary characters. This suffices for the
compiler-enforced indentation rule in Python (though not, I think, for the
indentation rule in Haskell).

(I code in a variable-width font in Emacs and I work on shared codebases. The
codebase style does have a couple alignment rules that don't make sense for
variable-width fonts, but I just let Emacs enforce them and I'm happy that
overall the code as I see it is easier on the eyes than it would be with
fixed-width formatting.)

~~~
blahedo
The other main case where monospace (as such) is important is when multi-line
expressions should be lined up in some semantically relevant way, e.g. to
reflect boolean and/or grouping, or to line up function arguments. (This is
especially true in Lisp/Scheme indent styles, but I use it fairly frequently
in the C-style-syntax languages as well.)

Separately from the "lining things up" argument, though, there's an argument
to be made for characters in _any_ editing situation to be _wider_ , even if
they're not all the same width. For naturally-narrow letters like I and l, and
for periods and commas, and of course spaces, the non-monospace fonts often
make them so narrow that they become harder to target with a mouse, harder to
distinguish from each other, and in some cases hard to even see at a glance
whether the insertion point is to their left or right. This is a UI problem
with _a lot_ of editors (e.g. text entry for data or comments on the web) that
doesn't get enough attention; if the content and presentation can be separate
(i.e. it's not a WYSI-more-or-less-WYG editor) then the editor should use a
font that doesn't have super-narrow _anything_. Monospace is a convenient way
to achieve that.

~~~
Stratoscope
You don't need a monospaced font to line up multiline expressions or function
arguments. All you have to do is use _indentation_ instead of _column
alignment_. In other words, follow the same rules for expressions that you
most likely already use for statements.

For example, instead of this:

    
    
      someObject.someMethod(oneArgument,
                            anotherArgument,
                            oneMoreArgumentForTheRoad);
    

Format it exactly as you would if the parentheses were curly braces:

    
    
      someObject.someMethod(
          oneArgument,
          anotherArgument,
          oneMoreArgumentForTheRoad
      );
    

Or instead of this:

    
    
      string someVariable = oneThing +
                            anotherThing +
                            yetAnotherThing;
    

Do this:

    
    
      string someVariable =
          oneThing +
          anotherThing +
          yetAnotherThing;
    

The code is just as readable this way, and you gain many benefits:

* Your code looks exactly the same in a proportional or monospaced font, so people viewing the code are free to choose either.

* You no longer need to fiddle with adding or removing spaces when you change the length of one of your variables or function names.

* The diffs in your revision history no longer show spurious changes that result from that column fiddling.

* Your code lines become much shorter.

* Instead of having one formatting rule for curly braces and a completely different rule for parentheses, you use the same formatting rule for everything.

* It doesn't matter if the code is indented with spaces or tabs. With no column alignment, rules like "tabs for indentation, spaces for alignment" are no longer needed.

* If you need to change the code to a new indentation style (e.g. change 2 or 4 spaces to tabs, or vice versa), you can do that reliably with a simple regex search and replace, with no damage to the code formatting and no manual cleanup required.
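
The last point can be sketched in Python. This is an illustration under the assumption that the code uses indentation only (no column alignment) and a fixed indent width; the function names are hypothetical.

```python
import re


def spaces_to_tabs(source, width=4):
    """Replace each group of `width` leading spaces with one tab."""
    def convert(match):
        return "\t" * (len(match.group(0)) // width)
    return re.sub(r"^(?: {%d})+" % width, convert, source, flags=re.MULTILINE)


def tabs_to_spaces(source, width=4):
    """Replace each leading tab with `width` spaces."""
    def convert(match):
        return " " * (width * len(match.group(0)))
    return re.sub(r"^\t+", convert, source, flags=re.MULTILINE)


code = "f(\n    a,\n        b\n);\n"
tabbed = spaces_to_tabs(code)
assert tabs_to_spaces(tabbed) == code   # the round trip loses nothing
```

With column-aligned code the same substitution would turn alignment padding into tabs as well, which is where the manual cleanup usually comes from.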

I adopted this practice long before I switched to coding in proportional fonts
- and once I started formatting code this way I realized that it didn't matter
any more what kind of font I used. I was free to choose any font that pleased
my eyes, with no impact on the readability of the code for those who prefer
monospaced fonts.

A good place to see the problems that column alignment causes is the Servo
source code, whose coding standard mandates column alignment. I posted a few
examples here:

[https://news.ycombinator.com/item?id=9469713](https://news.ycombinator.com/item?id=9469713)

I've often wondered why column alignment is so popular given its drawbacks. I
have a theory that I think explains some of it, at least in the case of
function calls and parenthesized expressions.

It's because of the very common objection to putting spaces inside the
parentheses. For example, PEP8 and many coding standards explicitly forbid
this.

Here's what happens. Take my first example above, written as one line without
spaces inside the parentheses:

    
    
      someObject.someMethod(oneArgument, anotherArgument, oneMoreArgumentForTheRoad);
    

That line is too long, so let's fix it. The natural starting place is to
change every space to a newline:

    
    
      someObject.someMethod(oneArgument,
      anotherArgument,
      oneMoreArgumentForTheRoad);
    

That's a bit messed up, so what can we do? Indent the extra lines?

    
    
      someObject.someMethod(oneArgument,
          anotherArgument,
          oneMoreArgumentForTheRoad);
    

Ugh. Now the arguments don't line up at all. It's really ugly. The only cure
is to align the columns:

    
    
      someObject.someMethod(oneArgument,
                            anotherArgument,
                            oneMoreArgumentForTheRoad);
    

But what if we adopted the practice of putting spaces inside the parentheses?

    
    
      someObject.someMethod( oneArgument, anotherArgument, oneMoreArgumentForTheRoad );
    

If we make the same substitution, changing each space to a newline, we get
this:

    
    
      someObject.someMethod(
      oneArgument,
      anotherArgument,
      oneMoreArgumentForTheRoad
      );
    

And now it makes perfect sense to indent the arguments:

    
    
      someObject.someMethod(
          oneArgument,
          anotherArgument,
          oneMoreArgumentForTheRoad
      );
    

This is the same thing we intuitively do with curly braces. After all, hardly
anybody codes like this:

    
    
      while(true) {oneStatement;
                   anotherStatement;
                   oneMoreStatement;}
    

I think one reason is that we pretty much tend to put spaces inside the curly
braces, even in a one-liner. It's not too common to write this:

    
    
      while(true) {oneStatement; anotherStatement; oneMoreStatement;}
    

Instead, this seems more typical (if you use a one-liner at all, which I'm not
arguing for or against):

    
    
      while(true) { oneStatement; anotherStatement; oneMoreStatement; }
    

No one seems to mind spaces inside the braces, but I've had developers totally
freak out over the idea of putting spaces inside the parentheses. It Simply Is
Not Done.

I don't understand why there's such an objection to that, especially when it
seems to lead directly to the difficulties of column alignment.

I tend to think that it's because spaces never go inside the parentheses in
English text. But this is code, not English, and we are free to choose
conventions that benefit us, even if they differ from how we'd write prose.

Edit: I added a number of thoughts after first posting this. If you upvoted
an earlier version of this comment and now think I'm insane for advocating
spaces inside the parentheses, let me know and I'll post another comment that
you can downvote. ;-)

~~~
djent
I believe the comment you're replying to is implying multiple instances of
indentation on the same line, such as:

    
    
        if		( $test[0] )	func1()
        elseif	( $test[1] )	func2()
        elseif	( $test[2] )	func3()

~~~
Stratoscope
That's a good point. Well, it was just an excuse for me to go off on a
tangent! :-)

For the example you posted, I would classify that as column alignment rather
than indentation. Just to clarify my terminology, what I call indentation is
something that happens at the beginning of a line only. Any additional spacing
after the first nonblank character is column alignment.

So the spaces before the if and elseif are indentation, and the extra spaces
within the lines are column alignment (in my nomenclature).

The column alignment in your example is pretty appealing, but it does have
some of the same problems as other forms of column alignment. When we get to
ten or eleven tests, things have to get juggled around again. Do you do this:

    
    
        if		( $test[0] )	func1()
        elseif	( $test[1] )	func2()
        ...
        elseif	( $test[10] )	func3()
    

or this?

    
    
        if		( $test[0]  )	func1()
        elseif	( $test[1]  )	func2()
        ...
        elseif	( $test[10] )	func3()
    

or go all-out with something like this?

    
    
        if		( $test[ 0] )	func1()
        elseif	( $test[ 1] )	func2()
        ...
        elseif	( $test[10] )	func3()
    

I think basically I am just lazy and found all of this alignment so tedious
that I looked for ways to avoid it. :-)

------
rspeer
A project I work on, "ftfy", deals with various Unicode issues. On the master
branch, it's recently gained a module for aligning monospaced Unicode in a
terminal:

[https://github.com/LuminosoInsight/python-ftfy/blob/master/f...](https://github.com/LuminosoInsight/python-ftfy/blob/master/ftfy/formatting.py)

The result works pretty well in my gnome-terminal and Mac OS Terminal. You can
probably see that GitHub in your web browser doesn't even come close to lining
them up, though.

And the problem can't be completely solved, because the standards have gaps,
and there are some scripts where monospacing just isn't a thing.
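
For readers who want to see the general idea, here is a minimal sketch of width-aware padding using only Python's standard `unicodedata` module. It is not ftfy's implementation; `display_width` and `ljust_cells` are hypothetical names, and the sketch ignores control characters, zero-width joiners, and the ambiguous-width category.

```python
import unicodedata


def display_width(text):
    """Approximate terminal cells: 2 for Wide/Fullwidth, 0 for combining marks, else 1."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue   # combining marks stack on the previous character
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def ljust_cells(text, cells, fill=" "):
    """Pad on the right so the text occupies `cells` terminal cells."""
    return text + fill * max(0, cells - display_width(text))


for word in ["abc", "日本語", "cafe\u0301"]:
    print(ljust_cells(word, 8) + "|")
```

In a terminal that follows the East Asian Width data, the `|` characters should line up; in a web page they often will not, as noted above.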

------
breadbox
This stuff is a nightmare if you're trying to write a nice-looking terminal
application. AFAICT there is no reliable way to determine how many cells an
arbitrary Unicode glyph will occupy when output to a terminal. None. You can
use various wcwidth() functions as a first approximation, but you have to give
up on things working in the general case, because there's no guarantee (in
theory or in practice) that the terminal's font will actually honor the width
defined by the standards. Hopefully this situation will slowly improve with
the years, but given the level of neglect the terminal environment gets from
standardization processes these days, I'm not entirely optimistic.

~~~
mark-r
You could always place each glyph individually, but that's likely to have
subtle bugs too, while not being very performant.

~~~
breadbox
I have used such a solution at times. But then you usually have the problem of
overwriting half of a wide character, which is worse.

~~~
jquast
There is a method: use the cursor position report query to read the cursor's
current location, print the characters in question, and read the location
again. The difference tells you how many cells the characters advanced the
cursor.
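
A hedged sketch of that approach in Python, assuming a Unix terminal that answers the standard `ESC [ 6 n` query with `ESC [ row ; col R`. The function names are hypothetical, and the measurement is wrong if the text wraps at the right margin or if the user types while the reply is being read.

```python
import os
import re
import sys

CPR_QUERY = "\x1b[6n"                           # ask the terminal where the cursor is
CPR_REPLY = re.compile(r"\x1b\[(\d+);(\d+)R")   # reply format: ESC [ row ; col R


def parse_cpr(reply):
    """Extract (row, col) from a cursor position report, or None."""
    match = CPR_REPLY.search(reply)
    return (int(match.group(1)), int(match.group(2))) if match else None


def cursor_position(fd_in, out):
    """Send the query on `out` and read the reply from file descriptor `fd_in`."""
    out.write(CPR_QUERY)
    out.flush()
    reply = ""
    while not reply.endswith("R"):
        chunk = os.read(fd_in, 1)
        if not chunk:
            break
        reply += chunk.decode("ascii", "replace")
    position = parse_cpr(reply)
    if position is None:
        raise RuntimeError("terminal did not answer the cursor position query")
    return position


def measure_cells(text):
    """Print text on a real terminal and return how many columns the cursor moved."""
    import termios   # Unix only, so imported here rather than at module level
    import tty
    fd = sys.stdin.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)          # read the reply unbuffered and without echo
        sys.stdout.write("\r")     # start from column 1 to reduce the risk of wrapping
        _, before = cursor_position(fd, sys.stdout)
        sys.stdout.write(text)
        _, after = cursor_position(fd, sys.stdout)
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return after - before
```

Calling `measure_cells("中")` in an interactive terminal reports what that terminal actually did, which is the point: it measures the terminal and font in use rather than trusting a table.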

------
jquast
If anybody here has serious expertise on this subject, please consider
reviewing my python implementation of Markus Kuhn's wcwidth, updated for the
latest Unicode Specification (programmatically by "python setup.py update").

[https://github.com/jquast/wcwidth](https://github.com/jquast/wcwidth)

------
glandium
Relatedly, on GNU/Linux, depending on your locale, fontconfig configuration
and characters in the strings displayed (no, I'm not making that up), your
monospace font might end up not being a monospace font at all. Look at this
nice ghex screenshot:
[http://i.imgur.com/z0Dp60H.png](http://i.imgur.com/z0Dp60H.png)

------
fred256
Related, it always annoyed me in programming books when program listings using
monospaced fonts used an "fi" ligature crammed into one character width.

~~~
toothbrush
Wow, that is an atrocity I have luckily never stumbled into. Do you have
examples?

~~~
kylebgorman
Don't quote me on this but I think there may have been an example of this in
the most recent Stroustrup C++ book?

------
kevin_thibedeau
There is no way to standardize this. Just because legacy encodings provided
for single cell and double width characters doesn't mean that all monospace
fonts should have to conform to that scheme to support Asian characters. It is
up to the font designer to decide how much advance to use and some fonts are
designed with Latin characters the same width as the Chinese.

~~~
arm
Many Chinese characters are completely unreadable at small sizes if you try to
fit them in the width of a halfwidth character. It’s just not practical.

~~~
masklinn
The monospace font designer could integrate CJK logographs as double the
standard width (that's essentially what CJK fonts do in reverse with halfwidth
latin characters). CJK is monospace to start with though, I'd think cursive &
highly ligatured scripts like arabic or brahmic scripts would be a bigger
issue when designing monospace fonts (if you want something which doesn't look
like utter shite) (though I guess Kufic — the very blocky arabic script found
on the flags of iraq and iran — might be better source material than the more
modern and cursive Naskh variations, it's commonly manipulated into tilings
and brickwork already:
[https://en.wikipedia.org/wiki/Kufic#/media/File:Alijlas_kufi...](https://en.wikipedia.org/wiki/Kufic#/media/File:Alijlas_kufi.png))

------
microcolonel
This seems more like a problem with GDI and/or DirectWrite (as well as how
each browser is making use of them), and less to do with chrome vs. firefox
vs. IE vs. VS.

Chrome on Chrome OS (using FreeType 2) properly aligns the text in that <pre>,
as does firefox on GNU/Linux (also using FreeType 2, in addition to graphite).
On FreeType with Chrome or Graphite, the full-width latin characters are also
rendered with the correct weight and face, something that Chrome and IE on
Windows seem to get wrong in your screenshots.

------
nsajko
Isn't the best general answer to assume mono is mono and put the font
configuration responsibility on the user?

------
mpweiher
Seems to work fine in OS X Terminal.

~~~
kalleboo
But not in Safari, TextEdit, Xcode or BBEdit. Seems like the Terminal has
special monospace handling.

------
gcb0
gotta love how things catch up faster than you can complain about.

the rendered text for me on firefox (actually, not even proper firefox, but
iceweasel, on debian stable, which is almost a year behind firefox) aligns
perfectly fine. but the picture labeled "firefox" is all out of whack.

------
douche
I was originally a history major, who had a hell of a professor that taught
the history of pre-modern imperial China, so I ended up focusing in that area.
One thing that baffled me was that characters never really fell out of usage
in favor of a syllabary or alphabet the way similar ideographic systems have
tended to elsewhere. Other countries in the Chinese cultural sphere, that
originally imported Chinese script, developed alpha/syllabic replacements,
like the Korean hangul and Japan's two kana systems. And it is not as though
China was not exposed to alphabets - Buddhism, Tibet, and the various Turkic,
Mongol and Manchurian tribes that bordered on (and not infrequently ruled over
significant portions of) China all used variations and descendants of Semitic
alphabets. It seemed like such a huge inefficiency and barrier to wide-spread
literacy, to have to memorize at least a thousand or more characters, compared
to 26-odd letters, 10 digits, and the 50 or so more common phonetic spelling
rules, that could get you to an equivalent level of literacy in a Latin-based
language. Once you've learned all the characters, and digested all the
Confucian classics that make up the shared context of classic Chinese, you can
be wonderfully succinct and expressive, but the learning curve to get there
can be measured in decades, judging by accounts of prospective scholar-
officials studying for the civil service examinations.

It's perhaps the difference between becoming a proficient Vim or Emacs user,
and popping open Notepad.

To get back to Unicode, most of the hairiness could have been avoided if
Chinese (and related regional variations) used a syllabic or alphabetic script
- 100-200 characters vs ~80000 would make 2-byte wide chars sufficient to
express nearly every non-dead script, with plenty of space for any number of
poo or banana emoticons (if such a thing is really necessary, about which I
have my doubts).

------
Kenji
That reminds me of how I wanted to start a project with a name containing the
German 'ö' character. Of course, as customary, I save my projects into folders
with the name of the project. And, also of course, GCC issues instantly ensued
when I tried to compile the C++ source of that project. What did I learn? Stay
with ASCII for your code and names. It might be 2015 but apparently it's more
important to put poop into unicode than to actually implement the thing in all
the important software.

~~~
jepler
Have you tried this lately? On a Debian Jessie machine using the en_US.UTF-8
locale, I encountered no problems doing this, though I picked œ instead of ö.
I did have a little bit of pain from git (git status showed \305\223 for œ)
until I did "git config core.quotepath false".
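
Those octal escapes are just the UTF-8 bytes of the character. A small Python sketch makes the correspondence visible; `unquote_git_path` is a hypothetical helper that handles only three-digit octal escapes, not the surrounding quotes or the other escapes git can emit.

```python
import re


def unquote_git_path(quoted):
    """Decode octal-escaped bytes such as \\305\\223 back into text."""
    raw = re.sub(
        rb"\\([0-7]{3})",
        lambda m: bytes([int(m.group(1), 8)]),   # three octal digits -> one byte
        quoted.encode("ascii"),
    )
    return raw.decode("utf-8")


print("œ".encode("utf-8"))                   # the two bytes 0xC5 0x93
print(oct(0xC5), oct(0x93))                  # the same bytes written in octal
print(unquote_git_path(r"\305\223uvre"))
```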

Not sure how this'll come through on ycombinator, so it's also on a pastebin
for 24 hours:
[https://paste.debian.net/311370/](https://paste.debian.net/311370/)

      ~$ cd src
      ~/src$ mkdir œuvre
      ~/src$ cd œuvre
      ~/src/œuvre$ printf '#include <stdio.h>\nint main() {puts("hello œuvre");}\n' > œuvre.c
      ~/src/œuvre$ g++ œuvre.c -o œuvre
      ~/src/œuvre$ ./œuvre
      hello œuvre

~~~
eridius
That's a bit different. Presumably you're running this in a terminal emulator.
Terminal emulators (and programmers' editors), if they're set to a monospace
font, operate on a grid. For characters in the selected monospace font this is
just like normal font rendering, but when fallback fonts are used, it still
uses the layout from the original monospace font (i.e. the grid) instead of
the layout from the fallback font. This means that the fallback font rendering
doesn't screw with the columns.

Naturally there's still the issue of properly identifying double-wide
characters. But as long as the software can identify them correctly
(ambiguous-width characters are generally all treated the same way, either as
narrow or as wide, depending on the software in question), it can simply
render them with 2 cells instead of 1, and everything remains lined up.

But the article here was showing monospaced text in a web browser, and a web
browser doesn't do any of this (nor a rich text editor). Font fallback usually
attempts to maintain properties like monospacing, but the fallback font may
still have different metrics, meaning it won't line up with the monospaced
text from the source font (and depending on the fonts available, if you have
no monospaced font with asian glyphs it would have to fall back to a non-
monospaced font).

I suspect what's going on with the locale stuff here is simply that the locale
affects the fallback list, and it ends up picking a different font that
happens to have the right metrics to line up. But I can't say for certain that
this is the explanation.

