
What does Unicode support actually mean? - boyter
https://boyter.org/posts/unicode-support-what-does-that-actually-mean/
======
kergonath
> Because cote, coté, côte and côté are all different words with the same
> upper-case representation COTE.

The last bit is wrong, at least in French. The upper case representation of
“côté” is “CÔTÉ”. It infuriates me to no end that Microsoft does not accept
that letters in caps keep their accents. I can believe that it might be
language-dependent, though.

Also, a search engine that does not find “œuf” when you type “oe” is broken.

Very interesting article, though. I wish more developers were aware of this.

~~~
AlanYx
> It infuriates me to no end that Microsoft does not accept that letters in
> caps keep their accents.

That behaviour is actually locale-dependent in Microsoft apps. Setting
Microsoft Word to "French (Canada)" uses accented capitals, while "French
(France)" does not.

I imagine there's a lively internal debate about the latter. France's Académie
Française actually says the behaviour you'd prefer is the proper one, but the
local typographic culture is heavily influenced by a history of typewriter use
where such accents weren't available.

~~~
an_d_rew
Hooray! Somebody who knows about the different capitalization rules between
Québec and France!

There are dozens of us! Dozens! ;-)

~~~
VeejayRampay
there are no differences in this case though, you're supposed to accentuate
when you capitalize in France as well

------
hibbelig
Another thing OP does not mention is normalization: the German character ä can
be represented as a single code point in Unicode, but it can also be
represented as the letter a followed by a combining diacritical mark. Both
will be displayed the same way.

I guess most Germans are not aware of these differences and would expect that
Ctrl+F treats them the same.
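
To see the two representations concretely, here's a quick Python sketch using
only the standard unicodedata module:

    import unicodedata
    
    nfc = "\u00e4"     # 'ä' as one precomposed code point
    nfd = "a\u0308"    # 'a' followed by U+0308 COMBINING DIAERESIS
    
    print(nfc == nfd)             # False: different code point sequences
    print(len(nfc), len(nfd))     # 1 2
    print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing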

~~~
EForEndeavour
Any self-respecting Ctrl+F _should_ match "ä" when the user searches for "a."
Chrome and Firefox do.

~~~
karatinversion
On my firefox (68.9.0esr), a Ctrl+F for "The German character a" finds no
results.

Also, as a Finnish user, I would not be impressed by a search for "talli"
(stables) matching "tälli" (blow), or "länteen" (westwards) matching "lanteen"
('of the hip').

~~~
scbrg
In contrast, as a Swedish user, I'm spectacularly impressed when, say, a
flight or train booking site lets me search for Goteborg rather than Göteborg.
Because otherwise, you know, I probably _can't get the fuck home_ from
wherever I happen to be with whatever keyboard the particular machine I happen
to be using has.

The number of times, in actual real life, that the inconvenience of the
ambiguity outweighs the convenience of the overlap is small.

YMMV.

~~~
cyxxon
I would assume that sites like this know these words as synonyms and don't
simply do lax string matching, because in this specific case "Gothenburg"
would probably also work - it did when I was planning a route through Sweden
with a German app that defaults to local spellings in foreign countries.

~~~
HelloNurse
Flight and train booking sites need to support search for a very small set of
place names: if they care about selling tickets they can maintain a synonym
list for each destination, sidestepping language and spelling norms. Fuzzy
completion in search forms as one types, as in Google Maps for example, is
also effective.

------
js2
Just last week I noticed the description for the movie "Us" in the HBO Max app
on iOS and AppleTV displayed all of the left-leaning and right-leaning quotes
and double-quotes as question marks, but displayed ® correctly. My guess is
something in their pipeline is using latin1 encoding[1].

Which is to say, just getting everything onto UTF-8 (or UTF-16 if you must)
would be a huge win.
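
A plausible reproduction of the failure mode in Python (a sketch, not their
actual pipeline): encoding curly quotes to latin1 with replacement turns them
into question marks, while ® (which latin1 does cover) survives:

    text = "\u201cUs\u201d \u00ae"                    # “Us” ®
    print(text.encode("latin-1", errors="replace"))   # b'?Us? \xae'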

[1]
[https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Quotation_marks](https://en.wikipedia.org/wiki/ISO/IEC_8859-1#Quotation_marks)

------
skissane
Unicode supports 150+ scripts which can be used to write thousands of
languages. No matter how good your Unicode support, there is probably some
obscure script for which you do the wrong thing. But, in the real world, most
software (whether proprietary or open source) supports a limited set of
languages, and you test it works correctly for all of them, and every now and
again you might add support for a new language (due to business requirements,
or, in the case of open source, because someone volunteers to do it) -- at
which point any issues specific to that language and its script are found and
fixed.

~~~
gumby
The unicode consortium maintains open source software to handle a lot of hard
cases (not just case folding but canonicalization, formatting, line
breaking...). Doesn't magically make all your problems go away but takes away
a lot of hard work that you're most likely to get wrong.

[http://site.icu-project.org/home](http://site.icu-project.org/home)
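
For instance (a sketch assuming the PyICU bindings to ICU are installed),
locale-aware collation at primary strength ignores both case and accents,
which covers the cote/côté case from upthread:

    import icu  # PyICU
    
    collator = icu.Collator.createInstance(icu.Locale("fr_FR"))
    collator.setStrength(icu.Collator.PRIMARY)
    print(collator.compare("cote", "c\u00f4t\u00e9"))  # 0: equal at primary strength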

------
cryptonector
> Case folding can be done in one or two ways, simple and full

Oh, it's worse than that. There can be locale specific tailorings of case
folding rules.

By the way, do not confuse case-folding mappings with case mappings for up-
and down-casing. These are different in Unicode.

Also, don't think this is Unicode's fault. This is the fault of humans for
creating such rich scripts and languages and rules.
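
Python's str methods show the difference between folding and the case
mappings, though they only implement the default, untailored rules:

    print("\u00df".casefold())  # 'ss': full case folding of ß
    print("\u00df".upper())     # 'SS': the up-casing map, a different table
    print("\u00df".lower())     # 'ß'
    # locale-specific tailorings (e.g. Turkish, where "I".lower()
    # should be 'ı') need a library such as ICU:
    print("I".lower())          # 'i' under the default rules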

------
NullPrefix
> characters that display as white space but actually are not because they
> represent a nordic rune carved on the side of the stone not visible from the
> front

Can't tell if serious or not

~~~
GlitchMr
I don't recognize that one, but there are in fact characters that render as
whitespace but aren't, for example U+2800 (BRAILLE PATTERN BLANK).
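
Easy to verify with Python's unicodedata: U+2800 is classified as a symbol,
not as whitespace:

    import unicodedata
    
    print(unicodedata.category("\u2800"))  # 'So': symbol, not a space
    print("\u2800".isspace())              # False
    print("\u2003".isspace())              # True: EM SPACE, for contrast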

~~~
a1369209993
Which is particularly egregious, since in actual braille, that _is_ a
whitespace character (specifically, a space with no dots in it represents the
character... space! 0x20).

~~~
throwaway_pdp09
Individual glyphs are considered to carry semantics. So IIRC there is a
unicode minus sign and a unicode dash, and even if they are visually
indistinguishable you're not supposed to mix them. I'm no unicode expert
though. I doubt the unicode consortium is composed of fools and nincompoops.
Best assume they know what they're doing.

~~~
a1369209993
I'm not claiming that BRAILLE PATTERN BLANK is the same character as SPACE (I
don't particularly disagree, but that's not my point); I'm claiming that
BRAILLE PATTERN BLANK _is a whitespace character_ , just like SPACE or NEWLINE
or U+2003 EM SPACE (pretty much any of the U+200xs, really), regardless of
whether it's the same character.

~~~
throwaway_pdp09
I exactly get your point. I don't know why either, but I'm assuming there's a
reason.

~~~
a1369209993
Fair enough, but I'm assuming the reason, whatever it is, is at least as
stupid as the reason why U+01F1 exists and is not encoded as 44 5A.

~~~
dhosek
It's a backwards compatibility issue with an older character set where the
symbol is used in latin transcription of Macedonian.

[https://en.wikipedia.org/wiki/Dz_(digraph)#Unicode](https://en.wikipedia.org/wiki/Dz_(digraph)#Unicode)

~~~
a1369209993
Yes, and I'm assuming the reason BRAILLE PATTERN BLANK is not correctly
classified is at least as stupid as that, and can therefore be safely ignored
for the purposes of having legitimate reasons for things.

~~~
throwaway_pdp09
Finally you've got to the point: that braille whitespace is not classified as
whitespace (which on checking is true). Why didn't you say that at the start?

~~~
a1369209993
> that braille whitespace is not classified [correctly]

I did! That was the first thing I said.

> > characters that render as whitespace but aren't, for example U+2800
> > (BRAILLE PATTERN BLANK).

> in actual braille, that _is_ a whitespace character

~~~
throwaway_pdp09
Indeed you did. My apologies.

------
sdflhasjd
> Unicode support in my experience seems to be one of those hand wavy things
> where most people respond to the question of “Do you support unicode?” with

> > Yeah we support emojis, so yes we support unicode!

And sometimes it's not even that simple, because things that say they support
UTF-8 can't encode the full range.

Looking at you, MySQL and "utf8mb3"
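
The issue is that utf8mb3 stores at most three bytes per character, which
rules out everything beyond the BMP. A quick check of encoded lengths (Python
sketch):

    for ch in ("e", "\u00e9", "\u6f22", "\U0001f600"):  # e, é, 漢, 😀
        print(repr(ch), len(ch.encode("utf-8")), "bytes")
    # 1, 2 and 3 bytes fit in utf8mb3; the 4-byte 😀 needs utf8mb4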

~~~
doubleunplussed
Emoji have been a boon for unicode support. In order to support emoji, a lot
of software ends up supporting unicode in general. I think I read that this
was intentional on the unicode consortium's part: it is arguable that emoji
don't belong in unicode, but they were included to enhance adoption.

~~~
sdflhasjd
Unfortunately, I think control of the Emoji code blocks by the Unicode
Consortium stifles its potential.

~~~
tom_mellior
Could you elaborate?

~~~
sdflhasjd
It has more limited usefulness outside of North America and Japan when it
comes to culturally or geographically limited characters, especially food,
drink and clothing. The Consortium does recognise this, but with a measly
50+100 new emoji per year, it will take forever before we see this improve
significantly.

~~~
microtherion
Without that control, you would be free to call custom emoji from the vasty
deep, but without vendor support, that would not do you much good. I find it
preferable to have a limited pipeline of additions, but with considerable
vendor support.

------
hibbelig
My take on this is that it's difficult to communicate the exact details of
“Unicode support”, so perhaps it's alright to say that one supports Unicode,
generally speaking, but Unicode is complicated and there are corners that
aren't supported fully.

~~~
cryptonector
My take is that it is important to understand the basics:

    
    
      - UTFs
      - [canonical] equivalence (form, normalization)
         - [canonical] decomposition
         - [canonical] precomposition
      - case mapping
         - case-folding
      - localization
      - the reason for all this complexity
        (hint: it's not the UC's fault)
    

You don't have to really grok the details, much less memorize them. It does
not take much more than TFA's ~1,500 words to explain the concepts.
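
For the first item on the list, a small sketch of the same character under
different UTFs (Python):

    s = "\u00e9"  # é
    print(s.encode("utf-8"))      # b'\xc3\xa9': two 1-byte code units
    print(s.encode("utf-16-le"))  # b'\xe9\x00': one 2-byte code unit
    print(s.encode("utf-32-le"))  # b'\xe9\x00\x00\x00': one 4-byte code unit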

~~~
a1369209993

      > - the reason for all this complexity
      >   (hint: it's not the UC's fault)
    

This is a common fallacy: just because some of the complexity is unavoidable
and irreducible doesn't mean the UC hasn't made things much worse than they
needed to be, with things like code points (which are neither full characters
(see eg U+0308, combining umlaut) nor single characters (see eg U+01F1, the
two letters D and Z)) or emoji.

But, yes, with the possible exception of normalization, all of those are
things you'd need to deal with in some fashion regardless of Unicode.

~~~
cryptonector
Even combining codepoints are a thing that predates Unicode and simplifies a
number of things, so they were a very useful thing to have in Unicode.

First off, diacritical marks have generally been just that: marks that get
added to other characters. Notionally that is a great deal like combining
codepoints. For example, typewriter fonts, and later ASCII, were designed to
have useful overstrike combinations. So you could type á as a<BS>', both on
typewriters and in ASCII! Yes, ASCII was a variable-length, multi-byte
internationalized encoding for Latin scripts. For example, this is why Spanish
used to drop diacritical marks when up-casing: because there was no room in
those typewriter fonts for accents on upper-case letters, though nowadays we
have much more luxurious fonts, and so you see marks kept on up-cased Spanish
a lot more than one used to.

Second, diacritical marks as combining marks add a great deal of flexibility
and a fair degree of future-proofing to the system. This is a restatement of
the first point, I know. Read on before telling me that flexibility is nice
but expensive.

Third, it is much simpler to close canonical composition to new compositions
by having canonical decomposition than it is to keep adding pre-compositions
for every sensible and new character. Part of this is that we have a fairly
limited codepoint codespace while pre-compositions are essentially a cartesian
explosion, and cartesian explosion means consuming codepoint codespace very
quickly. Cartesian explosions complicate software as well (more on this
below). This really gets back to the point about flexibility.

Fourth, scripts and languages tend to have decomposition strongly built into
their rules. It would complicate things to _only_ let them have pre-composed
characters! For example, Hangul is really a phonetic script, but is arranged
to look syllabic. Now, clearly Hangul is a lot more like Latin in terms of
semantics than like Hiragana, say, so you should want to process Hangul text
as phonetic (letters) rather than syllabic, but the syllabic aspect of it
can't be ignored completely. So Hangul gets precompositions (lots, thanks to
cartesian explosion) but even in NFC Hangul is decomposed because that is much
more useful for handling in software.
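
The Hangul point is easy to see in Python: one precomposed syllable
decomposes into its three phonetic jamo under NFD:

    import unicodedata
    
    syllable = "\ud55c"  # 한
    jamo = unicodedata.normalize("NFD", syllable)
    print(len(syllable), len(jamo))      # 1 3
    print([hex(ord(c)) for c in jamo])   # ['0x1112', '0x1161', '0x11ab']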

Emoji are mostly like ideographic CJK glyphs, except with color. The color
aspect is what truly makes emoji a new thing in human scripts, but even color
can be a combining mark. Emoji clearly illustrate the fact that our scripts
are still evolving. This is a critical thing to understand: our languages and
scripts are not set in stone! You could say that there was no need to add
emoji all you want, but they were getting added in much software, and users
were using them -- eventually the UC was going to have to acknowledge their
existence.

Aside from emoji, which are truly an innovation (though, again, not the UC's),
conceptual decomposition of characters had long been a part of scripts. It was
there because it was useful. It's entirely natural, and a very good thing too,
that it's there in Unicode as well. Decomposition, and therefore combining
codepoints, was probably unavoidable.

Regarding digraphs and ligatures, these too are part of our scripts, and we
can't very well ignore them completely.

This is not to say that the UC didn't make mistakes. Han unification was a
mistake that resulted from not inventing UTF-8 sooner. Indeed, the UC did not
even invent UTF-8. Some mistakes are going to be unavoidable, but
decomposition and combining codepoints are absolutely not a mistake.

Finally, we're not likely to ever replace Unicode -- not within the next few
decades anyways. However much you might think Unicode is broken, however
widespread, right, or wrong that perception might be, Unicode is an
unavoidable fact of developers' lives. One might as well accept it as it is,
understand it
as it is, and move on.

~~~
a1369209993
> combining codepoints

Combining _diacritics_ are a good thing. I'm talking about Unicode's
conflation^W _deliberate lies_ that a code point is a meaningful unit of data
in its own right rather than an artifact of using a multi-level encoding from
characters like "ä" to bytes like "61 CC 88".

> For example, this is why Spanish used to drop diacritical marks when up-
> casing: because there was no room in those typewriter fonts for accents on
> upper-case letters

Huh, I remembered that, but thought it was some badly-designed computer
typesetting system.

> Third, it is much simpler to close canonical composition to new compositions
> by having canonical decomposition than it is to keep adding pre-compositions
> for every sensible and new character.

Yeah, this is what convinced me that base+diacritic was a better design than
precomposed characters.

> Emoji are [...] with color.

That - that they are not actually text - is one of several problems with them,
yes.

> but decomposition and combining codepoints are absolutely not a mistake.

Decomposition and combining _diacritics_ are not a mistake. The mistake is
treating the decomposed elements as first-class entities.

> Finally, we're not likely to ever replace Unicode -- not [soon] anyways.

That has been said about literally every bad status quo in the history of
problems that existed long enough for people to complain about them.

"The dragon is bad!"[0], and no amount of status-quo bias is going to change
that.

0:
[https://www.nickbostrom.com/fable/dragon.html](https://www.nickbostrom.com/fable/dragon.html)

~~~
cryptonector
> I'm talking about Unicode's conflation^W deliberate lies that a code point
> is a meaningful unit of data in its own right rather than an artifact of
> using a multi-level encoding from characters

What on Earth are you talking about? Where does the UC "lie" like this?
Remember, the base codepoint is generally a character in its own right when
it's not combined, and combining codepoints are mostly not characters in their
own right.

> Yeah, this is what convinced me that base+diacritic was a better design
> than precomposed characters.

If anything, precompositions probably exist primarily to make transcoding from
other previously-existing codesets (e.g., ISO-8859-*) easier.

> > Emoji are [...] with color.

> That - that they are not actually text - is one of several problems with
> them, yes.

How are Kanji text and emoji not? Both are ideographic. What is the
distinction? This is not rhetorical -- I'm genuinely curious what you think is
the distinction.

> > but decomposition and combining codepoints are absolutely not a mistake.

> Decomposition and combining diacritics are not a mistake. The mistake is
> treating the decomposed elements as first-class entities.

The base codepoints are generally characters in their own rights ("first-class
entities") while the combining ones are generally not. In what way am I
getting that wrong?

> > Finally, we're not likely to ever replace Unicode -- not [soon] anyways.

> That has been said about literally every bad status quo in the history of
> problems that existed long enough for people to complain about them.

You are making arguments for why Unicode should be replaced, though I think
they are unfounded, but you'll need more than that to get it replaced. Even
granting you are right, how would it be made to happen?

~~~
a1369209993
> and combining codepoints are mostly not characters in their own right.

The Unicode standard originally defined "character" as a code point (not vice
versa), and until less than five years ago I could not have a discussion about
the differences between characters and code points (and why the latter are
bad) without some idiot showing up to claim there was no difference, since
Unicode defined them to be same. However on looking for a citation, it seems
that [http://www.unicode.org/glossary/](http://www.unicode.org/glossary/) does
not actually support that claim (although bits such as "Character [...] (3)
The basic unit of encoding for the Unicode character encoding" do little to
oppose it). So I can't actually prove that the lies of said idiots were
deliberate on the part of the UC. Which is to say I was mistaken in assuming
that equivalence to be a deliberate lie rather than a miscommunication
(probably).

> How are Kanji text and emoji not?

Kanji are monochrome and can be written with physical writing implements such
as pencils. Emoji aren't and can't; a chunk of graphite can't make green and
purple marks for an image (not a character) of an eggplant.

> I'm genuinely curious what you think is the distinction.

Apologies, I should have been a bit more explicit there.

> The base codepoints are generally characters in their own rights

And to the extent that that's the case, we should, where possible, be talking
about the characters, not the implementation details of representing them.

> Even granting you are right, how would it be made to happen?

No idea, but I'm not inclined to pretend something isn't bad just because I
can't do anything about it.

~~~
cryptonector
> So I can't actually prove that the lies of said idiots were deliberate on
> the part of the UC.

I started having to deal with Unicode back around 2001. Even back then the
distinctions between code unit, codepoint, character, and glyph, were all
well-delineated. Very little has changed since then, really: just more
scripts, some bug fixes, and -yes- the addition of emoji.

> Kanji are monochrome and can be written with physical writing implements
> such as pencils. Emoji aren't and can't; a chunk of graphite can't make
> green and purple marks for an image (not a character) of an eggplant.

I mean, cave paintings were in color and kinda ideographic, and you do know that
you can use colored crayons/pencils/pens/paint/printers to make colored text,
yes? :-)

I'm sure you can make a monochrome emoji font. In a few thousand years we
might evolve today's emoji to look like today's Han/Kanji -- that's what
happened to those, after all!

Seriously, I can't see how color is enough to make emoji not text. Kids
nowadays write sentences in Romance languages and English using mostly emoji
sometimes. It can be like a word game. In fact, writing in Japanese can be a
bit like a word game.

Animation begins to really push it though, because that can't be printed.

> And to the extent that that's the case, we should, where possible, be
> talking about the characters, not the implementation details of representing
> them.

And we do. We talk about LATIN SMALL LETTER A and so on. Yes, U+0061 is easier
sometimes.

> No idea, but I'm not inclined to pretend something isn't bad just because I
> can't do anything about it.

Fair. Myself I think of Unicode as... mostly faithful to the actual features
of human written language scripts, with a few unfortunate things, mostly just
Han unification, UCS-2, and UTF-16. Honestly, considering how long the project
has been ongoing, I think it's surprisingly well-done -- surprisingly
not-fucked-up. So I'm inclined to see the glass as half-full.

Let's put it this way: there are a lot of mistakes to fix once we get the time
machine working. Mistakes in Unicode are very, very far down the priority
list.

~~~
a1369209993
> I started having to deal with Unicode back around 2001.

I don't remember the date, but I first had to deal with unicode when it was a
16-bit encoding (pre-surrogate pairs).

> I mean, cave paintings were color and kinda ideographic

Sure, but they weren't text, which is what we were actually talking about.

> you can use colored [whatever]

That's a property of the rendering, not of the characters; eg <font
color=red>X</font> is a colored font tag containing a non-colored character
(er, a character without color information), not vice-versa.

> Let's put it this way: there are a lot of mistakes to fix once we get the
> time machine working. Mistakes in Unicode are very, very far down the
> priority list.

That's fair; I have a very long and very angry priority list, but I was mainly
just objecting to

> all [rather than some] this complexity [is] not the UC's fault

and grabbing a couple of specific examples off the top of my head.

(Although now I realize I never got around to complaining about zero-width
joiner, or 'skin tone' modifiers, or parity sensitive flag characters, or...)

------
hibbelig
I understand that Han unification in Unicode has resulted in the same
character encoding distinct glyphs, depending on the language in use.

So there would be a Japanese Kanji that looks similar to, but distinct from, a
Chinese character. And both would be encoded as the same character in Unicode.
And that character would look one way when included in a Japanese document and
another way when included in a Chinese document.

If the document contains both Japanese and Chinese, and the character appears
in both parts, would a Japanese user expect to find both occurrences when
entering one of them?

~~~
TorKlingberg
Ideally, the document format should specify which parts are in what language.
Both MS Word and HTML can do that.

~~~
enriquto
> Ideally, the document format should specify which parts are in what
> language.

But this is out-of-band information and it kind of defeats the purpose of
unicode. Moreover it is extremely annoying in practice: imagine that you write
a text (e.g., a comment on HN) where you want to explain the character
differences between japanese and chinese. How do you do that? There are no
"language selector characters" in unicode!

~~~
hibbelig
It would be nice if it was possible to write mixed-language Chinese and
Japanese here in HN comments. It would also be nice if it was possible to
write math here. But neither are possible...

~~~
enriquto
Yes, it's a bit surprising that things like α, β, γ can be written, but not
other letters. Yet, you can go a long way using unicode math symbols.

EDIT: just to try, it seems that it works 平仮名 ひらがな 漢字 汉字 한자

------
SAI_Peregrinus
The idea that "case insensitivity" is a generally meaningful concept is a
falsehood. It applies to a small subset of writing systems, and horribly
breaks elsewhere.

~~~
g-b-r
I don't think it's such a small subset...

~~~
mcswell
If you count writing systems, a small subset has an upper/lower distinction:
Greek, Latin, and Cyrillic, and maybe a handful of others. Not Arabic,
Chinese, Japanese, Devanagari, Tamil, Thai, Thaana, Bengali, Hebrew, or a
bunch of other scripts.

If on the other hand you count the languages that use writing systems with an
upper/lower case distinction, then most use Latin, a much smaller set use
Cyrillic; and the rest of the writing systems are used by one or two languages
each, with the notable exception of the Arabic script.

So most scripts lack an upper/lower case distinction, but probably most
languages use a script that does have such a distinction.

~~~
SAI_Peregrinus
Exactly, and among those languages that have a case distinction quite a few
have at least one edge case (like German ß vs ss vs ẞ) where the
transformation isn't bijective. Or it's position-dependent.

If you want case insensitivity, and you also want internationalization, you'd
better hope you have a good library that handles the edge cases, or you'll get
a bunch of errors. Case-sensitive is far easier to program.
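
The ß round trip shows the non-bijectivity directly (Python sketch):

    print("\u00df".upper())  # 'SS'
    print("SS".lower())      # 'ss', not 'ß': the round trip loses information
    print("ss".upper())      # 'SS' too, so upper() is not invertible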

~~~
g-b-r
When dealing with case you rarely (if ever) need bijective transformations...

Case-insensitive search/comparison requires a temporary mapping, not a
permanent transformation; the German case is easily handled in Unicode's
CaseFolding.txt by just mapping all three of them either to ss or to ß
(depending on whether the implementer prefers a length-invariant mapping or
not).

Position-dependent? I read about it, but the Unicode case folding algorithm
doesn't support it, so I imagine it's not considered useful for case-
insensitive comparison

You probably _do_ want a good library when dealing with Unicode, but for
this case thing it actually doesn't seem to be that complex to implement by
hand...

I don't doubt case-sensitive is far easier to program, but you usually
program for some user's sake, not for your own pleasure :)

And in most cases a normal user expects case-insensitive search

------
rurban
He forgot about stricter identifier rules, which almost no product gets
right. Normalization. And the complete lack of unicode support in the most
basic libraries or programs, like the libc, libstdc++ or coreutils.

As long as you cannot compare strings, you cannot search for them.

As long as identifiers are treated as binary (e.g. filesystem paths, names,
...), there are tons of security holes in the most basic things, like kernel
drivers. Adding unicode support to any product does not make much sense then.
Rather restrict it to Latin-1 or the current locale to be on the safe side.
Nobody knows the security recommendation for identifiers.
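
A sketch of the identifier problem, with a hypothetical file name in two
canonically equivalent spellings:

    import unicodedata
    
    name_nfc = "\u00e4.txt"                            # 'ä.txt', precomposed
    name_nfd = unicodedata.normalize("NFD", name_nfc)  # same text, decomposed
    print(name_nfc == name_nfd)  # False: distinct byte sequences
    # a byte-wise filesystem treats these as two different files that
    # render identically -- a classic spoofing vector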

------
z3t4
And the article didnt even get started on emojis, there are for example two
different ways you can define skin color of an emoji. Most programming
language dont have unicode support so its up to the developer how to handle
them.
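
One of those ways, the Fitzpatrick skin tone modifiers, as a Python sketch:

    thumbs = "\U0001f44d"          # 👍 THUMBS UP SIGN
    toned = thumbs + "\U0001f3fd"  # plus EMOJI MODIFIER FITZPATRICK TYPE-4
    print(toned)       # renders as one medium-skin-tone emoji where supported
    print(len(toned))  # 2: two code points behind a single visible glyph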

------
rini17
Is it turing complete yet?

(as in, whether the primitive of comparing/collating two unicode strings can
be used for arbitrary computation)

------
HelloNurse
TLDR: OMG text is difficult!

In this article, unusually, text is difficult because of naive expectations,
not because of incompetence. For example, the treatment of case-insensitive
comparisons is quite good, but the author thinks that counting "characters"
should be simple and well-defined, that conversion between uppercase and
lowercase characters should be reversible, and that one can "obviously use
well supported libraries that solve this problem for you".

------
amelius
Is anyone working on a decentralized version of Unicode?

Currently, one organization decides what symbols/emoji we can use, and I think
that's stifling our expression.

~~~
toast0
There's a ton of recorded history of decentralized character encodings. Some
people long for the days when a character was a byte and a byte was a
character, but that model doesn't fit the world's languages.

The unicode consortium has been historically inclusive enough to avoid
alienating enough people to cause a revolt.

They have defined ranges where they will never assign code points (the
private use areas), and you can use those for whatever symbols you need, but
of course there's no mechanism to resolve disputes over use, you would need
your own method to determine which font is appropriate, and it's likely to be
challenging to use outside of applications you control.
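
The private use areas are easy to poke at (Python sketch):

    import unicodedata
    
    pua = "\ue000"  # first code point of the BMP private use area
    print(unicodedata.category(pua))  # 'Co': private use
    # how it renders is entirely up to whatever font you bundle with it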

~~~
amelius
I think the available extension mechanisms in Unicode are quite limiting and
in fact close to useless because you can run into trouble when you use them.

What I want is to refer to new symbols through some unique ID, a decentralized
way to serve those symbols, including all relevant information (such as what
those symbols mean, what they are based on, what glyphs are
preferred/available, how the use of these symbols evolved over time,
references to related symbols, etc. etc.).

If I want to invent a symbol for a Coke can, a link to an external website
[1], a new mathematical operator, or even Covid19, I want to be able to do it
now, not wait for Unicode to stamp the proposal.

[1]
[https://news.ycombinator.com/item?id=23016832](https://news.ycombinator.com/item?id=23016832)

~~~
yoz-y
Use images or svgs then. Or define your custom svg font. I don't see how a
decentralized way of defining whatever could even remotely work, since all of
this absolutely must work while offline and while users are inputting
arbitrary text.

~~~
g-b-r
Actually, ever since the first Unicode Emoji TR
([https://www.unicode.org/reports/tr51/tr51-1-archive.html#Lon...](https://www.unicode.org/reports/tr51/tr51-1-archive.html#Longer_Term))
there's been mention of a "longer-term goal" to support not only embedded
graphics, but also "a reference to an image on a server".

I'm not sure where it comes from or what's come of it; seeing as the next
sentence is "server could track usage", I have the feeling it's from the
editor from Google... ;) (Mark Davis)

