
Banish missing glyphs with Unifont - edent
https://shkspr.mobi/blog/2019/04/banish-the-%ef%bf%bd-with-unifont/
======
segfaultbuserr
Indeed, GNU Unifont looks awful to many people, and even if we only talk about
bitmap fonts, Unifont is not the best-looking one. However, it's a font
designed for practicability, not aesthetics. It's one of those few fonts that
covers the entire Basic Multilingual Plane, and is actively maintained. It's a
perfect choice as a fallback font.

And in fact, the font actually looks pretty good under a CLI console. I once
tried a Linux kernel patch that embedded the 10 MiB+ Unifont to the native
Linux VT console, and I got a perfect multilingual console (without using a
3rd-party framebuffer-based or KMS-based console), including all the CJK
characters. It's also a good choice for a dot-matrix LCD/LED display.

~~~
jfk13
> I got a perfect multilingual console

No, you didn't. If it was using Unifont, there are many scripts and languages
that it was unable to display correctly because they require "shaping" that
Unifont cannot support; it lacks the necessary glyphs, let alone the OpenType
(or equivalent) tables to control the rendering.

Displaying one nominal glyph per Unicode character does not result in readable
text for many languages. Just because it seems adequate for Western alphabets
or for CJK doesn't mean it is sufficient for true multilingual support.
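
To illustrate the shaping point, here is a small sketch using only Python's standard `unicodedata` module. It takes the Arabic letter BEH (one code point, U+0628) and lists the four legacy "presentation form" code points that correspond to its joining contexts. The choice of BEH and the helper name `contextual_forms` are just for illustration; a real shaping engine selects these glyphs through font tables rather than through presentation-form code points.

```python
import unicodedata

# Arabic letter BEH is a single code point in running text.
BEH = "\u0628"

# Legacy presentation-form code points for the same letter,
# one per joining context (isolated, final, initial, medial).
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def contextual_forms(forms):
    """Map each context tag (e.g. 'initial') to the base letter it decomposes to."""
    result = {}
    for ch in forms:
        # decomposition() returns a string such as "<initial> 0628"
        tag, base_hex = unicodedata.decomposition(ch).split()
        result[tag.strip("<>")] = chr(int(base_hex, 16))
    return result


forms = contextual_forms(PRESENTATION_FORMS)
for context, base in forms.items():
    print(context, "->", unicodedata.name(base))
```

A font limited to one glyph per character can only ever draw one of these four shapes for every occurrence of the letter.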

~~~
usrusr
> Displaying one nominal glyph per Unicode character does not result in
> readable text for many languages

Completely unreadable, even to people with a basic understanding of the inner
workings of Unicode and some practice with the font, or just not matching the
norms of correctness?

Decipherable but clearly not correct would mark the sweet spot between being
just as bad as placeholders and being so good that it undermines efforts for
having a proper font for a given language.

~~~
chipuni
Unreadable.

Arabic has a strong difference between letters of the same word, and different
words. There is no exact equivalent in English...

b&u&t&i&t&m&i&g&h&t&b&e&s&o&m&e&t&h&i&n&g&l&i&k&e&t&h&i&s

~~~
usrusr
If that example is representative, I'd definitely put it in the "readable"
bin. Leaps and bounds from what you'd _want_ to read, but infinitely more
useful than a sequence of identical placeholders. Even just being able to tell
whether two strings might be identical or not would be an improvement over
blank placeholders.

------
jfk13
I tend to think this is a bad idea. Most of the characters that can't already
be displayed using fonts included in modern operating systems are likely to be
characters from "obscure" scripts that may need complex shaping for
correct/readable display, which won't work with Unifont because of its limited
1:1 character/glyph mapping, and/or they're characters (whether from historic
scripts or newly-encoded emoji) in higher Unicode planes, meaning you'll need
to provide not just the basic (plane-0) Unifont but also the extra resource
for higher planes. It adds up to a lot of bloat, for the sake of inadequate
rendering of characters that the user probably can't make sense of anyway.

The one case where there might be a worthwhile benefit would be for recent
emoji additions. But that would be better addressed by a more limited effort
to provide an up-to-date emoji-only font, not a resource that attempts (in
vain!) to cover the whole of Unicode.

~~~
lelf
Not to mention that many emoji are emoji sequences. So again, not a 1:1.

------
scrollaway
[https://www.google.com/get/noto/](https://www.google.com/get/noto/)

Google's Noto font family is of far higher quality than Unifont and serves the
same purpose. Better, too, since there's a lot of details in various languages
and scripts that go far beyond "put a glyph here". I've heard only good things
about Noto in that respect.

~~~
edent
It's also 1.1GB!

And hasn't been updated since 2017
[https://www.google.com/get/noto/updates/](https://www.google.com/get/noto/updates/)

But, other than that...

~~~
sophacles
> And hasn't been updated since 2017

This is only a useful data point if significant changes to writing systems
have been released since then.

~~~
Sharlin
Unicode 11 was released in 2018 and Unicode 12 in 2019. A "universal" font
that doesn't keep up with Unicode isn't very universal.

------
choeger
I am not an http expert, but could we put that file on one location on the web
so that I do not download it for every fricking site I open? Or even better,
could not the browser bundle it?

~~~
pferde
Why not just have it installed on your system, and stop websites from
dictating which fonts to use?

Edit to make the comment (hopefully) more valuable: I mean, unless certain
choice of fonts is somehow intrinsic to the content or purpose a given website
serves, the browser, and by extension the user, probably know better which
fonts are preferable for them to comfortably read the textual content.

~~~
scrollaway
Web fonts have their place. For example they're extremely commonly used to
efficiently serve icons that seamlessly blend into text across many websites.

In my previous app I used web fonts to be able to render Hearthstone cards
using the correct font the game uses.

------
devit
There's also
[https://www.google.com/get/noto/](https://www.google.com/get/noto/) which is
what Ubuntu uses for emoji and is proportional and colored.

~~~
dingallero
I do use noto for this, but absolutely _hate_ the use of color in glyphs such
as the emojis.

~~~
chronogram
The black and white Noto emoji have not been updated in years sadly, but you
can still list it before Noto Color Emoji. Alternatively, prefer Symbola,
which should be more up to date.

~~~
ygra
Couldn't you just strip the tables for the color map and the additional glyphs
for the other color layers? Or does Noto use bitmaps or SVGs for the colored
glyphs?

~~~
chronogram
It's a scalable bitmap glyph as far as my understanding goes. "It's scalable
even though it's a bitmap font." is what I remember reading last year, but I
cannot find it for you, sorry.

~~~
ygra
From looking at the Github repository, it seems like the source images are SVG
and the font might use either SVG or bitmaps, but not COLR/CPAL which would
degrade nicely to black and white.

------
aboutruby
> If your app or website uses a Unicode character which isn't supported on a
> device, the user will usually see � - a replacement character. If you
> include Unifont, they'll see the correct character.

Neat idea. I think the transition to UTF-8 is practically done, I'm not seeing
� anymore these days (used to be extremely common a while back).

~~~
tialaramex
This line is largely wrong.

Most systems, when called to display a character which they're unable to
render, will render a placeholder. This is most often a dotted box of some
sort, roughly the size of a large character. In some systems the dotted box
(assuming it's large enough for them to be readable) contains the Unicode
codepoint number that the system couldn't render. In a few the box contains
some representative symbol that gives you a hint what sort of thing is
missing, e.g. maybe it's a Han glyph to suggest that you should look for a
Chinese font.

I haven't seen any (they may exist of course) where they render U+FFFD the
replacement character �.

The most common reason to see U+FFFD is the reason it was created: something
was encoded or decoded in a way that is gibberish, and the best option in that
case is to replace the minimum chunk of gibberish with U+FFFD and then keep
trying. On the Web you'd often see pages which claimed to be UTF-8 but were
actually ISO-8859-1 or Windows codepage 1252, neither of which is UTF-8,
though they share the most common Latin characters. These days most browsers
will auto-detect this goof, and besides, most web pages really are UTF-8, but
when browsers were less good at guessing and more pages were wrong you'd see
it more often.
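
The mislabelled-encoding case is easy to reproduce. The sketch below, using only built-in `str`/`bytes` methods, encodes a short string as ISO-8859-1 and then decodes it as if it were UTF-8; the sample word is an arbitrary choice.

```python
text = "caf\u00e9"  # "café", with a precomposed e-acute

# Bytes as they would be stored if the page was saved as ISO-8859-1.
latin1_bytes = text.encode("iso-8859-1")

# A decoder that trusts a wrong "UTF-8" label hits an invalid byte
# sequence and substitutes U+FFFD for it, then carries on.
misdecoded = latin1_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = latin1_bytes.decode("iso-8859-1")

print(repr(misdecoded))
print(repr(recovered))
```

No font can fix the first result: the U+FFFD is in the decoded text itself, so the original character is already lost by the time anything is rendered.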

~~~
edent
Yup, I screwed up with that title! See the discussion at
[https://twitter.com/FakeUnicode/status/1113774985116434433](https://twitter.com/FakeUnicode/status/1113774985116434433)

------
amelius
Question: how do you specify in CSS that you want font X, except if a specific
glyph is not present, then you want font Y?

~~~
woodrowbarlow
when you specify your font-stack, like:

    font-family: Helvetica, Arial, Sans-Serif;

it will fall through. so if a user has Helvetica installed, but Helvetica
doesn't provide glyph X, then it will check whether Arial has glyph X.

so if you want all non-ascii glyphs to fall through to an alternative font,
you need to serve a version of your primary font that only includes ascii.
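
As a rough mental model of that fall-through (not how any browser is actually implemented), the sketch below picks a font per character from an ordered stack. The font names and coverage sets are hypothetical; real engines read coverage from each font's character map and also have to deal with multi-character clusters.

```python
def pick_fonts(text, font_stack, coverage):
    """Return (character, font name) pairs, choosing the first font in the
    stack that covers each character, or None if no font covers it."""
    chosen = []
    for ch in text:
        font = next(
            (name for name in font_stack if ord(ch) in coverage[name]),
            None,
        )
        chosen.append((ch, font))
    return chosen


# Hypothetical coverage: an ASCII-only subset of the primary font,
# and a fallback font that also covers the Latin-1 range.
coverage = {
    "PrimaryAsciiSubset": set(range(0x20, 0x7F)),
    "Fallback": set(range(0x20, 0x100)),
}
stack = ["PrimaryAsciiSubset", "Fallback"]

result = pick_fonts("n\u00e9\u4e2d", stack, coverage)
print(result)
```

Here the ASCII letter comes from the primary subset, the e-acute falls through to the fallback, and the CJK character matches nothing, which is the case where a missing-glyph placeholder would be drawn.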

~~~
amelius
Ok, I wasn't aware that stacking fonts works at the glyph level.

------
lelf
See also:
[https://github.com/rolandwalker/unicode-fonts](https://github.com/rolandwalker/unicode-fonts)
(a different direction, starts from _removing_ Unifont)

------
moftz
I can understand the utility of a basic fallback font that just works but if
you are already building a large web client, why not just include the fonts
you actually want to use? That way you get the look you want at whatever
resolution without some horrible bitmap font popping up now and then.

~~~
edent
Because you don't always know what kind of content you'll be displaying.
Especially if you allow user-generated content.

Here's an example I found a few years ago:
[https://shkspr.mobi/blog/2015/11/premature-subsetting-of-web...](https://shkspr.mobi/blog/2015/11/premature-subsetting-of-web-fonts/).
An English language website never expected their authors to use the é
(e-acute) character. So they removed it from their webfont.

------
TheRealPomax
I'd love to know how they apparently fit more glyphs in a font than can fit
in a font. The OpenType specification only allows up to a USHORT worth of
glyph ids, and 65535 ids is nowhere near enough to index even just all 137993
currently assigned code points.
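
The arithmetic behind that limit can be sketched as follows. It assumes one glyph per code point (which understates what complex scripts need) and takes the 137993 figure from this comment as given; `min_font_files` is a made-up helper name.

```python
import math

MAX_GLYPHS_PER_FONT = 0xFFFF   # largest value a 16-bit glyph count can hold
BMP_CODE_POINTS = 0x10000      # code points in plane 0, U+0000..U+FFFF
ASSIGNED_CODE_POINTS = 137_993 # figure quoted in the comment, not verified here


def min_font_files(glyphs_needed, per_font=MAX_GLYPHS_PER_FONT):
    """Lower bound on the number of font files, at one glyph per code point."""
    return math.ceil(glyphs_needed / per_font)


files_needed = min_font_files(ASSIGNED_CODE_POINTS)
print(MAX_GLYPHS_PER_FONT, BMP_CODE_POINTS, files_needed)
```

Under those assumptions, even a single plane of 65536 code points is one more than a single font can index, and the full assigned repertoire needs at least three files.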

~~~
jfk13
They don't. There are multiple truetype fonts involved:

* The Standard Unifont TTF Download: unifont-12.0.01.ttf (12 Mbytes)

* Glyphs above the Unicode Basic Multilingual Plane: unifont_upper-12.0.01.ttf (1 Mbyte)

* Unicode ConScript Unicode Registry (CSUR) PUA Glyphs: unifont_csur-12.0.01.ttf (1 Mbyte)

(from
[http://unifoundry.com/unifont/index.html](http://unifoundry.com/unifont/index.html)).
And note that unifont_upper only seems to cover plane 1 and plane 14 stuff;
they haven't attempted to tackle the plane 2 CJK repertoire.

~~~
TheRealPomax
That is really not what this article claims, though. It claims "It contains
/every/ Unicode glyph in one single file!".

~~~
jfk13
So it does - though later under "Use on the web", it turns out that it's
primarily talking about the BMP, and higher-plane characters will require
another file. (It doesn't seem to notice the lack of support for plane-2 CJK
at all.)

The article title "Banish the � with Unifont" is also misleading, actually. �
is U+FFFD, the REPLACEMENT CHARACTER that typically indicates an encoding
error or binary garbage; it's not the same thing as the missing-glyph symbol
(often a simple box, though it may vary) that generally appears when font
support for a valid character is lacking.

------
vortico
What's the point of rendering characters that can't be understood by the
reader? I can't read Chinese, Cyrillic, Native American scripts, etc., so it's
not worth the 12MB in apps and definitely websites.

~~~
syrrim
If it renders in a facsimile of the correct script, this allows you to:

\- see that it's foreign text, rather than pictographs or weird english text.

\- see when the shapes of other languages are used to make pictures
(¯\\_(ツ)_/¯)

\- if it's a foreign language, you can guess which language

\- you can guess at the complexity of what was written, for example by looking
for repeated substrings.
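
The "guess which script" step can even be approximated in code. This is a crude sketch that uses the first word of each letter's Unicode character name as a script hint; it relies only on the standard `unicodedata` module, the function name `script_hint` is invented here, and the heuristic is an assumption rather than a proper script-detection method.

```python
import unicodedata
from collections import Counter


def script_hint(text):
    """Return the most common leading word of the character names of the
    letters in `text` (e.g. 'CYRILLIC'), or None if there are no letters."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Names look like "CYRILLIC CAPITAL LETTER PE" or "KATAKANA LETTER TU".
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


print(script_hint("Привет"))
print(script_hint("¯\\_(ツ)_/¯"))
```

The second call shows the pictograph case from the list: the only letter in the shrug is a katakana character used purely for its shape.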

------
throwaway_391
As a resident of a western country, I don't fully understand how to implement
or test Unicode compatibility on my browser, website, terminal, etc. Is there
a test suite of characters I can use to validate etc?

~~~
Arnt
The question doesn't really make sense, because that's not what unicode is.

Unicode encodes what's necessary for printing books since about 1900 (and a
bit more, but that's a fair one-sentence summary). What you want to validate
isn't that you'd be able to print every kind of book printed since 1900.
You're only interested in some of the alphabets, and you may be interested in
more functionality than just printing. For example you may need sorting, or
character input with the right sort of interactive appearance changes, or
equality testing.

If you decide what you want to work, then googling usually finds a suitable
test quickly.

~~~
michaelt
Right, but if you're making a word processor or a web forum or a registration
form what you want might be "Well, I don't speak languages that need complex
scripts, but I'd be happy to support other people's scripts if it's easy"

~~~
WorldMaker
The easiest test for "does my software handle Unicode somewhat better than
dumbly" is emoji. If your users aren't already deluging you with emoji in
their content in 2019, grab the emoji keyboard from your Operating System,
often easy to find on most "soft keyboard overlays" such as mobile platforms.
(In Windows 10 for the last year or so there are two keyboard shortcuts that
work everywhere: Windows Key+. and Windows Key+;)

Many emoji these days are quite complex Unicode sequences. A number of them
live in the so-called "Astral Planes", meaning they need more than 16 bits to
encode (proving you aren't treating UTF-8 or UTF-16 as if it was UCS-2). As
sequences they include a lot of fun non-visible codepoints ("characters") such
as the Zero-Width Joiner, and are very susceptible to breaking if those are
accidentally dropped, reordered, or otherwise spliced (possibly proving you
aren't doing bad string math or manipulation at the codepoint level rather
than the glyph/sequence/combined-character level).

[ETA: Useful sequences to test are any that support the skin-tone and gender
modifiers. On Windows, the various "cat occupation" emoji are also interesting
sequences such as ninja cat and astro cat. Other platforms have similar unique
"fun" sequences that are noticeable at a glance when right/wrong.]

It's not entirely true that if you support emoji well you support any Unicode
user's script well, but if you support emoji well you probably don't do
anything particularly stupid to make other Unicode users unhappy.
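
A minimal sketch of why such sequences make good test data, using only built-in string methods: it builds a family emoji from MAN, WOMAN and GIRL joined by Zero-Width Joiners and counts it three different ways. The particular sequence is just one convenient example.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# MAN + ZWJ + WOMAN + ZWJ + GIRL: drawn as a single family emoji where supported.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

code_points = len(family)                           # Python counts code points
utf16_units = len(family.encode("utf-16-le")) // 2  # what UTF-16 based APIs count
utf8_bytes = len(family.encode("utf-8"))

# Code points outside the BMP, which cannot be represented in UCS-2.
astral = [ch for ch in family if ord(ch) > 0xFFFF]

# Slicing at a code-point boundary still cuts the visible emoji apart:
# this leaves MAN followed by a dangling joiner.
truncated = family[:2]

print(code_points, utf16_units, utf8_bytes, len(astral))
```

One visible symbol is five code points, eight UTF-16 code units, or eighteen UTF-8 bytes, so any length limit, truncation, or reversal done in those units can split it.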

------
jes5199
interesting that the author used images of the font rather than actually
embedding the font into this webpage

