
Myanmar Prepares to Migrate from Zawgyi to Unicode - danso
https://globalvoices.org/2019/09/04/unified-under-one-font-system-as-myanmar-prepares-to-migrate-from-zawgyi-to-unicode/
======
9nGQluzmnq3M
This article conflates the _encoding_ used to store Burmese text with the
_font_ used to render it. Here's a chart showing the discrepancies between the
two encodings:

[https://confluence.dimagi.com/display/commcarepublic/Myanmar...](https://confluence.dimagi.com/display/commcarepublic/Myanmar%3A+Android+Zawgyi+and+Unicode)

And here's an FAQ from Unicode itself:

[https://www.unicode.org/faq/myanmar.html](https://www.unicode.org/faq/myanmar.html)

Old-timers (particularly outside the US) may remember the ISO 8859 debacle,
where there were various encodings for primarily European languages using the
same codepoints, causing tons of confusion:

[https://en.wikipedia.org/wiki/ISO/IEC_8859#The_parts_of_ISO/...](https://en.wikipedia.org/wiki/ISO/IEC_8859#The_parts_of_ISO/IEC_8859)
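A practical consequence of the repurposed code points: you can guess which encoding a blob of Burmese text uses by scanning for code points that standard Unicode assigns to other scripts' extensions. A toy sketch — the range tested below is a commonly cited heuristic, not a complete rule, and production detectors (e.g. Google's myanmar-tools) use trained models instead:

```python
import re

# Toy heuristic (illustrative assumption, not a complete rule): Zawgyi
# repurposes code points around U+1060-U+1097, which standard Unicode
# assigns to Mon / S'gaw Karen / Shan extensions, for Burmese glyph
# variants. Text containing them is therefore likely Zawgyi-encoded.
ZAWGYI_HINTS = re.compile('[\u1060-\u1097]')

def looks_like_zawgyi(text: str) -> bool:
    """Very rough guess at whether Burmese text is Zawgyi-encoded."""
    return ZAWGYI_HINTS.search(text) is not None
```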

~~~
xeeeeeeeeeeenu
No, the article doesn't conflate anything, the situation isn't comparable to
ISO-8859-* at all. Zawgyi is merely a font that renders certain glyphs
differently than specified by Unicode.

It isn't a real encoding, it's a nasty hack and that's what makes the
transition difficult.

~~~
9nGQluzmnq3M
Once more: the encoding is what's used to store the data, the font is what's
used to render it. If your data is not _encoded_ correctly, Zawgyi won't show
it either. Of course, the fact that Zawgyi isn't properly standardized doesn't
help.

Also, most of the technical content of the article is gibberish. Exhibit A: "
_It made use of the visual typing and encoding method as one would write it on
paper, rather than using logical linguistics and computer encoding conventions
of Unicode._ "

~~~
msla
The reporting here is so bad I'm honestly confused about whether _anything at
all_ is changing.

I wonder why this news outlet tried to report this specific piece of news at
all, instead of leaving it to the specialist press.

------
grenoire
What is the scale and potential difficulties involved for a migration like
this? I hope some HN readers can inform us on the technical challenges.

~~~
Ayesh
I'm a native Sinhalese speaker, and we have many characters that were not
included in Unicode until 10 or so years ago. To "fix" this, we created
different font files, so that when you type "w" on an En-US keyboard, the
glyph shown is "අ", which is pronounced as an "A" sound. This worked out OK
wherever we had control over the font, so in rich text documents it was less
of a problem.

On web sites, however, this text becomes a mess, because they use regular
fonts with the correct glyph/code-point match. We were simply abusing the
fonts to get the characters we needed.

When Sinhalese characters made it into Unicode, we couldn't immediately
convert old text, because you first need to check whether it was written with
one of these botched fonts, and then do some serious replacing. That's
difficult because, even with Unicode, we have diacritics, and certain glyphs
need more than one Unicode code point to represent them.

One glorious regex replace, in theory, can perform a similar migration for
Burmese as well; you just have to write it.
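That "one glorious regex replace" might look something like this sketch — the mapping table is hypothetical apart from the w → අ pair described above, and a real converter needs hundreds of entries, some of them context-sensitive:

```python
import re

# Hypothetical legacy-font-to-Unicode table; only the 'w' -> 'අ' pair
# comes from the comment above, the rest is elided.
LEGACY_TO_UNICODE = {
    'w': '\u0D85',   # අ (SINHALA LETTER AYANNA)
    # ... hundreds more entries in a real table ...
}

# Longest keys first, so multi-character sequences win over their prefixes.
_pattern = re.compile('|'.join(
    re.escape(k) for k in sorted(LEGACY_TO_UNICODE, key=len, reverse=True)
))

def migrate(text: str) -> str:
    """Replace every legacy sequence with its Unicode equivalent in one pass."""
    return _pattern.sub(lambda m: LEGACY_TO_UNICODE[m.group(0)], text)
```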

------
yla92
As a developer who's dealt with those issues briefly in the past: hopefully
it really kicks off this time!

------
wangweij
Sorry, but I don't quite understand what the problem here is. Is it about a
different character set? Or encoding? Or glyph rendering? The article keeps
using "font".

~~~
Someone
[https://frontiermyanmar.net/en/features/battle-of-the-fonts](https://frontiermyanmar.net/en/features/battle-of-the-fonts)
(mentioned in the article being discussed) is much clearer.

Short version: in Burmese, the form a character takes depends on context.
Zawgyi ‘solves’ that by having separate code points for the different forms,
requiring the user to pick the right variant. The Unicode way is to make the
(font + font renderer) pair smarter, just as a smart renderer shows a single
“é” when given the two code points “e” + combining acute.

Zawgyi also, necessarily, uses Unicode code points assigned for other
characters to encode the variants.
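The “é” analogy can be made concrete: Unicode stores the accented letter either as one precomposed code point or as a base letter plus a combining mark, and a smart renderer (or normalization) treats them as the same character. A small illustration:

```python
import unicodedata

precomposed = '\u00E9'   # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
combining = 'e\u0301'    # e followed by COMBINING ACUTE ACCENT

# They render identically, but compare unequal as raw code point sequences.
print(precomposed == combining)           # False
print(len(precomposed), len(combining))   # 1 2

# Normalization Form C composes the pair into the single code point.
print(unicodedata.normalize('NFC', combining) == precomposed)   # True
```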

~~~
allard
The shape of "lowercase sigma" depends on whether it's in the middle of a word
or at the end, and the two forms are adjacent in Unicode's code space:

ς and σ. I won't shout out their names. Is this the case in modern Greek too?
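A quick check confirms the two sigmas sit side by side (this is general Unicode data, nothing Burmese-specific):

```python
import unicodedata

final_sigma, medial_sigma = '\u03C2', '\u03C3'   # ς and σ, adjacent code points

print(unicodedata.name(final_sigma))    # GREEK SMALL LETTER FINAL SIGMA
print(unicodedata.name(medial_sigma))   # GREEK SMALL LETTER SIGMA

# Python's lower() applies Unicode's full case mapping, including the
# position rule: a trailing capital sigma becomes the final form.
print('ΟΣ'.lower())    # 'ος' -- the last letter is U+03C2, not U+03C3
```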

~~~
Someone
Many such warts in Unicode exist to allow round-tripping with 8-bit
character encodings. I suspect that’s the case here, too.
[https://en.wikipedia.org/wiki/ISO/IEC_8859-7](https://en.wikipedia.org/wiki/ISO/IEC_8859-7)
has them, too.

That doesn’t explain why Unicode seems to have 27 (!) different “sigma” code
points, though
([https://en.wikipedia.org/wiki/Sigma#Character_encoding](https://en.wikipedia.org/wiki/Sigma#Character_encoding))

------
Grue3
And Japan is still using Shift-JIS...

~~~
shpx
[https://en.m.wikipedia.org/wiki/Han_unification](https://en.m.wikipedia.org/wiki/Han_unification)

Imagiŋe if all the "n"s became "ŋ" if you accideŋtally used the Helvetica
British foŋt instead of Helvetica Americaŋ oŋ your website.

~~~
zozbot234
Han unification is a bit of a mess, but you can fix it with lang attributes or
the equivalent ("Language" selection in office document formats, for example).
You don't _need_ to fake things with a custom font.

~~~
yorwba
Except those are all application-specific and if _you_'re the one writing the
application, rendering different languages differently still seems to require
messing around with fonts. (Or embedding an HTML renderer that does the
messing for you.) I'm not sure whether there's any solution for OS-level
strings that don't support changing the font (e.g. window titles, application
names) beyond requiring the user to pick one language for their system and
ignore all others.

~~~
geofft
Do variation selectors solve this problem? i.e., is there some way for me to
include a variation selector in a filename or terminal output or something and
have things render right?

~~~
yorwba
I think variation selectors are intended for the case where even knowing the
language is not enough to select the correct glyph, e.g. because they have
been unified in a national standard despite different variants remaining in
use. If one of those variants happens to be used in another country, you could
of course use it to fix at least some cases, but I don't think all unified
characters have a selector for each of their country-specific variants.

Even if they did, the Ideographic Variation Database [1] doesn't exactly make
it easy to use variation selectors for that purpose, because you only get an
example demonstrating what the glyphs should look like. To find out which
glyph (and hence variation selector) to use for a given language, you'd need
an additional database.

[1]
[https://unicode.org/ivd/data/2017-12-12/](https://unicode.org/ivd/data/2017-12-12/)
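Mechanically, an ideographic variation sequence is just a base ideograph followed by a selector code point (U+E0100 and up); whether a given font honors it is the hard part, and as noted above, which selector maps to which regional variant has to be looked up in the IVD. A sketch, with the specific base/selector pairing chosen for illustration:

```python
# Ideographic variation sequence: base CJK ideograph + variation selector.
base = '\u8FBB'            # 辻, a character with registered glyph variants
vs17 = '\U000E0100'        # VARIATION SELECTOR-17, start of the IVD range

sequence = base + vs17
print(len(sequence))                 # 2 code points, rendered as one glyph
print(f'U+{ord(sequence[1]):04X}')   # U+E0100
```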

------
PokemonNoGo
Interesting to see this from a neighbor of India, the land of legacy fonts
still regularly in use across the board, such as Krutidev with its Devanagari.

------
bl00djack
Yes, finally! I had been waiting for this moment.

