
Unicode Is Awesome - jagracey
https://wisdom.engineering/awesome-unicode/
======
jrochkind1
Unicode is pretty amazing.

People REALLY like to complain about unicode, but where it's complicated, it's
because the _problem space_ is complicated. Which it is. What people are
actually complaining about is that handling global text is complicated at all:
that humans weren't simpler and more limited in their invention of alphabets
and in how those alphabets were used in typesetting and printing and what not,
and that legacy digital text encodings hadn't happened the way they did.
They're not really complaining about unicode at all, which had to play the
hand of cards it was dealt.

That unicode worked out as nice a solution for storing global text as it did
is pretty amazing; there were some really smart and competent people working
on it. When you dig into the details, you will be continually amazed at how
nice it is. And how well documented.

One real testament to this is how _well adopted_ Unicode is. There is no
actual guarantee that just because you make a standard anyone will use it.
Nobody forced anyone to move from whatever they were doing to Unicode (and in
fact most internet standards, e.g., don't require Unicode and are technically
agnostic as to text encoding). That it has become so universal is because it
was so well designed: it solved real problems, with a feasible migration path
for developers whose cost was justified by its benefits. (When people complain
about aspects of UTF-8 required by its multi-faceted compatibility with
ASCII, they are missing that this is what led to Unicode actually winning.)

The OP, despite the title, doesn't actually serve as a great
argument/explanation for how Unicode is awesome. But I'd read some of the
Unicode "annex" docs -- they are also great docs!

~~~
cryptonector
If we could go back in time to Unicode's beginning and start over but with all
that we know today... Unicode would still look a lot like what it looks like
today, except that:

    
    
      - UTF-8 would have been specified first
      - we'd not have had UCS-2, nor UTF-16
      - we'd have more than 21 bits of codespace
      - CJK unification would not have been attempted
      - we might or might not have pre-composed codepoints[0]
      - a few character-specific mistakes would have gone unmade
    

which is to say, again, that Unicode would mostly come out the same as it is
today.

Everything to do with normalization, graphemes, and all the things that make
Unicode complex and painful would still have to be there because they are
necessary and not mistakes. Unicode's complexity derives from the complexity
of human scripts.

[0] Going back further to create Unicode before there was a crushing need for
it would be impossible -- try convincing computer scientists in the 60s to use
Unicode... Or IBM in the 30s. For this reason, pre-composed codepoints would
still have proven to be very useful, so we'd probably still have them if we
started over, and we'd still end up with NFC/NFKC being closed to new
additions, which would leave NFD as the better NF just as it is today.

~~~
WorldMaker
> Or IBM in the 30s

Love an interesting sci-fi scenario. UTF-8 was a really neat technical trick,
and a lot of the early UTF-8 technical documentation was already on IBM
letterhead. I think if you showed up with the right documents at various
points in history IBM would have been ecstatic to have an idea like UTF-8, at
least. UTF-8 would have sidestepped a lot of mistakes with code pages and
CCSIDs (IBM's attempts at 16-bit characters, encoding both code page and
character), and IBM developers likely would have enjoyed that. Also, they
might have been delightfully confused about how the memo was coming from
inside the house by coworkers not currently on payroll.

Possibly that even extends as far back as the 1930s and the formation of the
company, because even then IBM aspired to be a truly International company,
given the I in its own name.

I'm not sure how much of the rest of Unicode you could have convinced them of,
but it's funny imagining explaining say Emoji to IBM suits at various points
in history.

~~~
cryptonector
I... agree. After all, even before ASCII people already used the English
character set and punctuation to type out many non-English Latin characters on
typewriters (did typesetters ever do that with movable type? I dunno, but I
imagine so). Just as ASCII was intended to work with overstrike for
internationalization, I can imagine combining codepoints having been a thing
even earlier.

OTOH, it wouldn't have been UTF-8 -- it would have been an EBCDIC-8 thing, and
probably not good :)

~~~
WorldMaker
There actually is a rarely used (but standardized, because it needed to be)
variant, UTF-EBCDIC, if you need nightmares about it. [0]

In some small decisions EBCDIC makes as much or more sense than ASCII; the
decades of problems have been that ASCII and EBCDIC coexisted from basically
the beginning. (IBM could have delayed the System/360 while ASCII was
standardized and likely have saved decades of programmer grief.) The reasons
that UTF-EBCDIC is so bad (such as that it is always a 16-bit encoding) could
likely have been avoided if IBM had known about UTF-8 ahead of time.

Maybe if IBM had something like UTF-8 as far back as the 1930s, AT&T needing
backward compatibility with their teleprinters might not have been as big of a
deal and ASCII might have been more IBM dominated. Or since this is a sci-fi
scenario, you just impress on IBM that they need to build a telegraph
compatible teleprinter model or three in addition to all their focus on punch
cards, and maybe they'd get all of it interoperable themselves ahead of ASCII.

Though that raises the question of what happens if you give UTF-8 to early
Baudot code developers in the telegraph world. You might have a hard time
convincing them they need more than 5 bits, but if you could accomplish that,
imagine where telegraphy could have gone. Winking face emoji full stop

[0] [https://en.wikipedia.org/wiki/UTF-
EBCDIC](https://en.wikipedia.org/wiki/UTF-EBCDIC)

~~~
jrochkind1
> Maybe if IBM had something like UTF-8 as far back as the 1930s, AT&T needing
> backward compatibility with their teleprinters might not have been as big of
> a deal

I think this points to why the science fiction scenario really is a science
fiction scenario -- I think decoding and interpreting UTF-8, and using it to
control, say, a teleprinter, is probably enough more expensive than ASCII that
it would have been a no-go: too hard/expensive, or entirely implausible, to
implement in a teleprinter using even 1960s digital technology.

~~~
WorldMaker
Yeah, and I was thinking about it this morning: the reliance on FF-FD for
surrogate pairs would shred punch cards (too many holes) and probably be a big
reason for IBM to dismiss it back when they were hugely punch-card dependent
and hadn't yet made advances like the smaller square-hole punches that could
pack holes more densely with better surrounding integrity.

"Sigh, another swiss cheese card jammed the reader from all these emojis."

------
RcouF1uZ4gsC
> Unicode is simply a 16-bit code - Some people are under the misconception
> that Unicode is simply a 16-bit code where each character takes 16 bits and
> therefore there are 65,536 possible characters. This is not, actually,
> correct. It is the single most common myth about Unicode, so if you thought
> that, don't feel bad.

Verity Stob has a great column

[https://www.theregister.co.uk/2013/10/04/verity_stob_unicode...](https://www.theregister.co.uk/2013/10/04/verity_stob_unicode/)

where she says that it is wrong to call this a myth, since that was how it was
originally designed. It is better characterized as being obsolete, rather than
a myth.

~~~
msla
It's a myth about the current version of Unicode.

Whether it's true about some obsolete version is hair-splitting at this point.

------
flohofwoe
IMHO the article should mention that UTF-16 was (more or less) a hack to fix
Windows and some other systems which didn't see the light and use UTF-8 from
the start. UTF-16 has all the disadvantages of UTF-8 (variable length) and of
UTF-32 (endianness), but none of the advantages (an endian-agnostic,
7-bit-ASCII-compatible byte stream like UTF-8, or a fixed-width encoding like
UTF-32). UTF-16 should really be considered a hack for talking to (mainly)
Windows APIs.

Also, obligatory link to:
[https://utf8everywhere.org/](https://utf8everywhere.org/)
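
A minimal Python sketch of those trade-offs (my own illustration, not from the
article): anything outside the BMP is still variable-length in UTF-16, and
UTF-16/UTF-32 need a byte order, while UTF-8 is a plain byte stream.

    # 'G' plus U+1D11E MUSICAL SYMBOL G CLEF, which lies outside the BMP
    s = "G\U0001D11E"
    print(s.encode("utf-8"))      # 5 bytes, no endianness to worry about
    print(s.encode("utf-16-le"))  # 6 bytes: U+1D11E becomes a surrogate pair
    print(s.encode("utf-16-be"))  # same data, different byte order
    print(s.encode("utf-32-le"))  # 8 bytes: fixed width, but still endian-dependent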

~~~
wvenable
Windows and many other operating systems and languages (e.g. Java) got on
board with Unicode back when the character set would fit in 16 bits. The
character set originally used was UCS-2 (not UTF-16). UTF-16 came later, to
extend the Unicode character set beyond 65,536 code points.

UTF-8 wasn't even invented until well after all these operating systems and
languages deployed Unicode.

They didn't "see the light" and use UTF-8 because they didn't have a time
machine to make that possible.

~~~
flohofwoe
I actually checked a while ago when UTF-8 was created, and it was right around
the time Windows NT was being developed with 16-bit "early" Unicode support.
UTF-8 was created in September 1992 [1], and Windows NT came out in mid-1993,
but I guess it was too late for Windows to change to UTF-8 (and I guess the
advantages of UTF-8 weren't as clear back then).

But IMHO there's no excuse to not use UTF-8 after around 1995 ;)

[1]
[https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt](https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt)

~~~
gpvos
Also, UTF-16 was only published in July 1996 (although the need for more than
16 bits was probably apparent a bit earlier). So before that, Unicode was only
a 16-bit encoding, and UCS-2 was enough. UTF-8 was initially just a nice trick
to keep using ASCII characters for things like directory separators (/) and
single-byte NUL terminators. By 1995 its superiority certainly wasn't apparent
yet.

Also, Windows internals were completely 16-bit-character based, including e.g.
the NTFS disk format, so by 1992 that was already quite hard to change.

That said, it is crazy that NT didn't have full UTF-8 support, including in
console windows, by about 2000.

------
euske
I mean, awesome for whom? It might be awesome for end users, since you can
type or copy/paste things without caring which language you're using. But for
programmers, Unicode is a bloated monstrosity and a source of endless
nightmare. Eventually it's not going to be awesome for end users either,
because it will be plagued by a lot of (subtle) inconsistencies. Unicode looks
a lot like a leaky abstraction to me (because of poor foresight), and it's
getting worse each year.

~~~
jcranmer
If you think Unicode is a "bloated monstrosity and a source of endless
nightmare," what would you remove from Unicode?

And if you're going to respond "emoji", I'll point out that removing emoji
doesn't actually remove anything that makes text processing with Unicode
difficult, just makes it more likely that people will assume that what works
for English works for everybody.

(Side note: it is not possible to accurately represent modern English text
solely with ASCII, as English does contain several words with accented
characters, such as façade and résumé).

~~~
kps
‘Remove’ is too strong, since Unicode is entrenched. But there are things that
should have been done differently. For instance, combining characters and
operators should have been placed before the base character rather than after,
so that (a) it would be possible to know when you've reached the end of a
character^W glyph^W grapheme cluster without reading ahead, and (b) dead keys
would be identical to the corresponding characters.

> _façade and résumé_

ASCII (1967) allowed for them: c BS , _or_ , BS c ↦ ç and e BS ' _or_ ' BS e ↦
é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08
2C.
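
For the curious, a rough Python check of those byte sequences (my own sketch;
bytes.hex with a separator needs Python 3.8+):

    combining = "c\u0327"                        # 'c' + U+0327 COMBINING CEDILLA
    print(combining.encode("utf-8").hex(" "))    # 63 cc a7
    overstrike = "c\x08,"                        # 'c' BS ',' in 1967 ASCII
    print(overstrike.encode("ascii").hex(" "))   # 63 08 2c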

~~~
jcranmer
> ASCII (1967) allowed for them: c BS , or , BS c ↦ ç and e BS ' or ' BS e ↦
> é. Encoding ç as 63 CC A7 is not manifestly better than encoding it as 63 08
> 2C.

Doesn't work for ñ, since the ASCII ~ is often typeset in the middle of the
box instead of in a position to appear above an 'n' character. " is a pretty
poor substitute for ◌̈ though, especially when you're trying to write ï as in
naïve. And then there's the æ of archæology, which doesn't work with
overwriting.

I'll also point out that ç is U+00E7 in Unicode and C3 A7 in UTF-8, not 63 CC
A7, since it's a precomposed character (and NFC form is usually understood to
be the preferred way to normalize Unicode unless there's a reason to do
something else).
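
A small Python sketch of that point (illustration only), showing how NFC maps
the decomposed sequence to the precomposed character:

    import unicodedata

    decomposed = "c\u0327"                          # 'c' + COMBINING CEDILLA
    precomposed = unicodedata.normalize("NFC", decomposed)
    print(hex(ord(precomposed)))                    # 0xe7, i.e. U+00E7 ç
    print(precomposed.encode("utf-8").hex(" "))     # c3 a7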

~~~
kps
Tilde exists in ASCII _because_ of its use as an accent. (In 1967 the non-
diacritic interpretation was an overline.) The use in programming languages,
and lowering to fit other mathematical operators, came later.

There was never any requirement that ‘n BS ~’ have the same appearance as ‘n’
overprinted with ‘~’, although terminals capable of making the distinction
didn't appear until the 70s.

Precomposed characters aren't relevant to illustrating composition mechanisms.

------
deathanatos
I swear there should be some rule or law about how Unicode articles will
inevitably muddle code units, code points, grapheme clusters, and bytes
together.

> _String length is typically determined by counting codepoints._

> _This means that surrogate pairs would count as two characters._

If you were counting code points, a surrogate pair would be 1. If it's two,
you're counting code units.

> _Combining multiple diacritics may be stacked over the same character. a + ̈
> == ̈a, increasing length, while only producing a single character._

Not if you're counting code points _or_ code units, which would both produce
an answer of "2", and that's a great example of why you shouldn't count with
either.
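
A quick Python illustration of the different counts being conflated (my own
sketch, not from the article):

    s = "\U0001F600"                        # 😀, a single code point outside the BMP
    print(len(s))                           # 1 code point
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
    print(len(s.encode("utf-8")))           # 4 UTF-8 bytes

    a = "a\u0308"                           # 'a' + COMBINING DIAERESIS, renders as ä
    print(len(a))                           # 2 code points, but one grapheme cluster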

The dark blue on black in tables is next to invisible. And then to put _that_
on white on the alternate rows is just eyeball murder.

> _Since there are over 1.1 million UTF-8 glphys_ (sic)

"UTF-8 glyphs" _twitch_; aside from that, I'm really curious how they got that
number. In some ways, a font has it easy; my understanding is that modern font
formats can have one glyph for an acute accent and one glyph for each
vowel/letter, and then compose those glyphs into the combined arrangements.
(IDK if those compositions are also "glyphs" to the font or not.) But it's
less drawing, at least. OTOH, some characters have >1 appearance/"image", AIUI.
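
As for the number itself: my guess is that "over 1.1 million" is simply the
size of the Unicode codespace (17 planes of 2^16 code points), which has
nothing to do with glyphs:

    print(17 * 2**16)    # 1114112 code points, U+0000 through U+10FFFF
    print(0x110000)      # the same number, written as the familiar hex bound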

~~~
paulddraper
Also a law that the author is thinking of one and only one programming
language.

> String length is typically determined by counting codepoints.

That depends _entirely_ on what "strings" you are talking about.

In C/Go/Rust/Ruby, char*/string/std::string::String/String is bytes.

In Java/JavaScript, java.lang.String/String is UTF-16 code units.

In Python 3, str is code points.

In Swift, String is extended grapheme clusters.

In Haskell, there are various different "string" types in common use.

And in C++, std::basic_string is a generic container for whatever element type
you want. (std::string specialization being for bytes.)

EDIT: Clarified that I don't disagree with parent comment; merely pointing out
additional less-than-precise language.

~~~
happytoexplain
I'm in love with Swift's approach, where the default representation is a
well-defined thing that both users and developers think of as "characters",
but all the other representations are trivially accessible.
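
For comparison, a rough sketch of the same idea from Python, using the
third-party `regex` package (an assumption on my part; it isn't in the
standard library): `\X` matches one extended grapheme cluster, roughly what
Swift calls a Character.

    import regex  # pip install regex

    s = "e\u0301"                         # 'e' + COMBINING ACUTE ACCENT, renders as é
    print(len(s))                         # 2 code points
    print(len(regex.findall(r"\X", s)))   # 1 grapheme cluster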

~~~
lifthrasiir
I disagree. Grapheme clusters are locale-dependent, much like string collation
is locale-dependent. What Unicode gives you by default, the (extended)
grapheme cluster, is about as useful as the DUCET (Default Unicode Collation
Element Table); you can live with it, but you would be unsatisfied. In fact
there are tons of Unicode bugs that can't be corrected for compatibility
reasons, and can only be fixed via tailored locale-dependent schemes.

I would like to avoid locales in the language core. It would be great to have
locale stuff in the standard library, but without locale information you
can't treat strings as (human) text.

~~~
tomp
Can you give examples of locale-dependent things, or issues with extended
grapheme clusters?

~~~
lifthrasiir
Hangul normalization and collation are broken in Unicode, albeit for slightly
different reasons. The Unicode Collation Algorithm explicitly devotes two
sections to Hangul; the first, on "trailing weights" [1], is recommended for a
detailed explanation.

The Unicode Text Segmentation standard [2] explicitly mentions that Indic
_aksaras_ [3] require tailoring of grapheme clusters. Depending on your view,
you can also consider orthographic digraphs as examples (Dutch "ij" is
sometimes considered a single character, for example).

[1]
[https://www.unicode.org/reports/tr10/#Trailing_Weights](https://www.unicode.org/reports/tr10/#Trailing_Weights)

[2] [https://unicode.org/reports/tr29/](https://unicode.org/reports/tr29/)

[3]
[https://en.wikipedia.org/wiki/Aksara#Grammatical_tradition](https://en.wikipedia.org/wiki/Aksara#Grammatical_tradition)

------
grantmnz
If you'd like to explore Unicode characters, you can use the Unicode Character
finder, a web app I built some years ago:
[https://www.mclean.net.nz/ucf/](https://www.mclean.net.nz/ucf/)

The app allows you to paste in a character to find out more about it, or to
search the database of character descriptions to find what you're after.

You can link to a specific character to share with your friends and family:
[https://www.mclean.net.nz/ucf/?c=U+130BA](https://www.mclean.net.nz/ucf/?c=U+130BA)

~~~
jagracey
For exploration, I'd additionally recommend
[http://shapecatcher.com/](http://shapecatcher.com/). It lets you draw the
shape you are looking for and, with some form of ML, sorts results by
similarity. It has come in handy a few times for finding characters I'm unable
to describe.

------
FisDugthop
Page doesn't render without JS enabled. Enabling JS causes questionable CSP
requests.

~~~
jagracey
Just a quick and scrappy "Ghost" blog running on a $5 Digital Ocean droplet
with the usual analytics.

------
ken
> data to be transmitted in a byte, word or double word oriented format (i.e.
> in 8, 16 or 32-bits per code unit)

I don't think I've heard "word" mean "16 bits" since the 1980's ... and
apparently neither has Wikipedia:
[https://en.wikipedia.org/wiki/Word_(computer_architecture)#T...](https://en.wikipedia.org/wiki/Word_\(computer_architecture\)#Table_of_word_sizes)

~~~
jontro
In Windows world DWORD is 32 bits large: [https://docs.microsoft.com/en-
us/openspecs/windows_protocols...](https://docs.microsoft.com/en-
us/openspecs/windows_protocols/ms-dtyp/262627d8-3418-4627-9218-4ffe110850b2)

~~~
pjtr
And WORD 16 bits: [https://docs.microsoft.com/en-
us/openspecs/windows_protocols...](https://docs.microsoft.com/en-
us/openspecs/windows_protocols/ms-dtyp/f8573df3-a44a-4a50-b070-ac4c3aa78e3c)

(QWORD 64 bits)

------
akdor1154
Unicode is great, emoji are a (technically impressive) monstrosity.

~~~
WorldMaker
Emoji meet an almost critical Unicode democratizing need. They aren't doing
anything that other languages encoded in Unicode don't already do (and haven't
done since the beginning of Unicode). The article itself points out several
key existing relatives: Arabic, one of the most common and important written
languages in the world, and the very important CJK family of written languages
used ZWJ and ZWNJ well before emoji made them "cool" to other parts of the
world, most especially the English-writing contingent that has long thought of
Unicode as simply "ASCII plus a bunch of other stuff I might never use".
Suddenly a lot of English documents have embedded emoji that matter deeply to
their writers, and there are fewer excuses to treat Unicode as "ASCII+" and
more cases where doing so is not only wrong (broken surrogate pairs, incorrect
codepoint analysis for ZWJ, etc.), but very visibly wrong in a way that users
care about and complain about.

------
nabla9
Unicode has two really great features.

* It names and defines things and sets a standard. This seems trivial but is incredibly useful.

* Unicode encodings, mainly UTF-8, are a good storage format for text (not so much as a data structure for editing text, if you want to be universal).

Unicode has one really horrible failing.

The 'user-perceived character' (Unicode terminology) is arguably the most
important unit of text. Unicode approximates user-perceived characters with a
set of general rules that define grapheme clusters. A grapheme cluster is a
sequence of adjacent code points that should be treated as a unit by
applications. Unfortunately the ruleset and definition are inadequate:
sometimes you need two grapheme clusters to define one unit.

If you get a UTF-8 encoded, normalized string from some unspecified time and
era, and you don't know what application wrote it, with what version of the
Unicode standard, and under what locale, you may lose some information.

Unicode should have added explicit encoding of user-perceived character
boundaries (either a fixed grapheme-cluster encoding or a completely different
encoding) and let the writing software define them explicitly. That would have
been future-proof (new software can understand old strings) and past-proof
(ancient software can understand and edit strings written in the future).

------
squaresmile
Unicode definitely has flaws but that doesn't mean we should throw the baby
out with the bathwater and go back to "ASCII and other character sets."
There's a reason we moved on from that world. However, I bet we will see
another encoding coming up eventually (within 30 years) which solves the
problems Unicode currently has and introduces a new set of problems as well. I
saw this comment [0] about how that encoding should get started.

> Greek, for example, has a lot of special-casing in Unicode. Korean is
> devilishly hard to render correctly the way Unicode handles it. And once you
> get into the right-to-left scripts, scripts that sort-of-sometimes omit
> vowels, or Devanagari (the script used to write a bunch of widely-spoken
> languages in India), you start needing very different capabilities than
> what's involved in Western European writing. _The better approach probably
> would have been to start with those, and work back to the European scripts_

[0]
[https://www.reddit.com/r/programming/comments/b09c0j/when_zo...](https://www.reddit.com/r/programming/comments/b09c0j/when_zo%C3%AB_zo%C3%AB_or_why_you_need_to_normalize_unicode/eiekg1x/)

Funnily enough, URLs still can't do actual Unicode.

~~~
BurningFrog
Unicode URLs have serious security problems.

The canonical example is google.com vs gооgle.com.

~~~
riquito
That was solved years ago by IDN/Punycode (implemented by any browser worth
its salt).
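
Roughly how that works, sketched with Python's built-in `idna` codec (which
implements the older IDNA 2003 rules): non-ASCII labels are converted to
Punycode "xn--" labels instead of being rendered as look-alikes.

    print("bücher.example".encode("idna"))         # b'xn--bcher-kva.example'
    print(b"xn--bcher-kva.example".decode("idna")) # bücher.example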

------
lifthrasiir
Years ago I've posted support material [1] for Hangul filler mentioned in the
article, reproduced below:

\---

U+3164 HANGUL FILLER is one of the stupidest choices made by character sets.
Hangul is noted for its algorithmic construction, and Hangul charsets should
ideally follow that. Unfortunately, the predominant methods for multibyte
encoding were ISO 2022 and EUC, and both allowed a rather small repertoire of
94 × 94 = 8,836 characters [0], which is far fewer than the required 19 × 21 ×
28 = 11,172 modern syllables.

The initial KS X 1001 charset therefore only contained 2,350 frequent
syllables (plus 4,888 Chinese characters with some duplicates, themselves
another Unicode headache). Besides leaving the remaining syllables
unsupported, this imposed a significant complexity burden on _every_ piece of
Hangul-supporting software, and there was confusion and contention between KS
X 1001 and less interoperable "compositional" (johab) encodings before
Unicode. The standardization committee later acknowledged the charset's
shortcoming, but only by adding four-letter (thus eight-byte) ad-hoc
combinations for all remaining syllables! The Hangul filler is a designator
for such combinations, e.g. `<filler>ㄱㅏ<filler>` denotes `가` and `<filler>ㅂㅞㄺ`
denotes `뷁` (not in KS X 1001 per se).

The Hangul filler arrived so late on the scene that it had virtually no
support from the software industry. Sadly, the filler was there and Unicode
had to accept it; technically it can be used to designate a letter (even
though Unicode does not support the combinations), so the filler itself has to
be considered a letter as well. What, the, hell.

[0] It is technically possible to encode 94 × 94 × 94 = 830,584 characters
with a three-byte encoding, but as far as I know no such charset was ever
designed (and thus there is no real support either).

\---

I should also mention that early Mozilla (and thus Firefox) once supported the
ad-hoc combinations for KS X 1001, ran into interoperability problems, and
dropped the support later. Nowadays we treat KS X 1001 as an alias of Windows
code page 949 for the sake of compatibility [2].

[1] [https://github.com/Wisdom/Awesome-
Unicode/issues/4](https://github.com/Wisdom/Awesome-Unicode/issues/4)

[2] [https://encoding.spec.whatwg.org/#index-euc-
kr](https://encoding.spec.whatwg.org/#index-euc-kr)

------
akersten
Unicode is an inspirational standard. We started with so many different
character encodings and wound up pretty universally using Unicode. I wouldn't
be surprised to see browsers start to drop support for other encodings - who
even uses them at this point?

Are there any scenarios where you _wouldn't_ use Unicode, other than an
every-byte-matters embedded system?

~~~
TheDong
There's a not-insignificant number of Japanese websites that can only display
correctly using EUC-JP or Shift-JIS.

It seems very Latin/ASCII-centric to push for disabling non-UTF-8 encodings,
especially since the only reason UTF-8 works so well on ASCII websites is its
backwards compatibility.

If it were the reverse, and UTF-8 were backwards compatible with EUC-JP/CJK
but not ASCII, I doubt you'd be pushing for eschewing other formats, since it
would break so many English websites.

~~~
jcranmer
There is no character in EUC-JP or Shift-JIS that is not in Unicode--the
explicit goal of Unicode in its original formulation was to be able to
losslessly round-trip any other charset through Unicode, and the initial
version of Unicode incorporated the source kanji lists for the EUC-JP/Shift-
JIS charsets in their entirety.
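
A tiny Python sanity check of the round-trip claim (my own sketch, using the
standard library codecs): Japanese text moved from a legacy charset into
Unicode and back comes out byte-identical.

    text = "日本語のテキスト"
    for charset in ("shift_jis", "euc_jp"):
        legacy = text.encode(charset)
        assert legacy.decode(charset).encode(charset) == legacy
    print("round-trips cleanly")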

~~~
TheDong
That's true, but you misunderstood what I meant.

The parent comment seemed to be implying that we should drop support for
non-UTF-8 charsets.

To me, that reads like saying a website with 'charset=EUC-JP' (such as
[http://www.os2.jp/](http://www.os2.jp/)) should be broken, as in browsers
should error out or display a large quantity of black boxes because it uses a
non-UTF-8 encoding.

I'm claiming the only reason the author thinks that's really viable is that in
our Western-centric world we see mostly ASCII and UTF-8 -- things that still
look fine if you switch to UTF-8 only.

CJK websites, on the other hand, that are using the equivalent of ASCII will
have to be manually upgraded to display correctly if browsers drop their
support.

Sure, all their characters can be represented in UTF-8, but there are large
swathes of websites that will never be updated to a new charset, and it's only
a Western-centric view that can so blithely suggest breaking them all.

~~~
jcranmer
Windows-1252/ISO-8859-1 (the two charsets are so commonly conflated that it's
often best to treat them as one) was the dominant [non-ASCII] charset of the
web until around 2007 or 2008, and its prevalence more recently is only about
5%.

A collection of Usenet messages gathered in 2014 (see
[http://quetzalcoatal.blogspot.com/2014/03/understanding-
emai...](http://quetzalcoatal.blogspot.com/2014/03/understanding-email-
charsets.html) for full details) showed that out of 1,000,000 messages, about
530,000 were actually ASCII; 270,000 were ISO-8859-1 or Windows-1252; and only
75,000 were UTF-8. More modern numbers would probably show higher UTF-8
counts, although Usenet is notoriously conservative in terms of technology.

What I'm trying to elucidate here is that the rise of UTF-8 isn't because most
text is ASCII, but because there's been a rather more concerted effort to
default content generation to UTF-8 and treat other charsets only as legacy
inputs. Well, with the exception of the Japanese, who tend to be strongly
averse to UTF-8. (I've been told that Japanese email users would rather have
their text get silently mangled than silently converted to UTF-8 because
you're quoting an email with a smart quote [not present in any of the 3
Japanese charsets], whereas every other locale was happy changing the default
charset for writing to UTF-8).

------
jakeogh
What's the code point for uppercase superscript Z?

~~~
kps
There isn't one. Unicode considers superscripting a matter of presentation,
which Unicode doesn't cover, except when it does.

~~~
tialaramex
More particularly: Presentation variant is not a justification for inclusion
in Unicode BUT prior encoding in another character set is.

Unicode sets a high priority on roundtripping. The idea is that if you take
some data in any one character set X and convert it to Unicode, you should
preserve all the meaning by doing this, such that you could losslessly convert
it back to encoding X.

It's like the wordprocessor problem where users say they only want 10% of the
features of a popular wordprocessor but it turns out each user wants a
different 10% and so the only way to deliver what they all want is to deliver
100% of the features. Likewise, Unicode has all the weird features of every
legacy character set which was embraced BUT it doesn't arbitrarily add new
weird features, although you could argue that some of the work done for
Unicode has that effect e.g. the way flags work or the Fitzpatrick modifiers.

If Unicode had insisted upon never encoding anything that might be a
presentation feature, it'd be a long forgotten academic project that never
went anywhere and we'd all be using some (probably Microsoft designed) 16-bit
ASCII superset today.

~~~
jakeogh
Is there a relatively easy way to find the character set that uppercase
superscript W (ᵂ) was included for?

~~~
tialaramex
In the case of U+1D42 Modifier Letter Capital W I was wrong about the cause:
it was in fact specifically added on the rationale that for this purpose
(phonetics) the presentation was semantic in nature, and so plain text (thus
Unicode) needed to preserve these symbols, which could otherwise have been
handled by a presentation layer.

U+1D42 Modifier Letter Capital W was added in Unicode 4.0 as part of the
Phonetic Extensions and Wikipedia provides a long list of Unicode committee
paperwork regarding this:
[https://en.wikipedia.org/wiki/Phonetic_Extensions](https://en.wikipedia.org/wiki/Phonetic_Extensions)

You can see that initially it would have been numbered differently, and then
over the course of several drafts the proposal evolved until it was assigned
U+1D42.

~~~
jakeogh
Does that mean there is hope to complete the [a-Z] sub/super set?

------
kevmoo1
[https://youtu.be/dMnPM6z6z40](https://youtu.be/dMnPM6z6z40)

------
jagracey
The emoji modifiers are pretty cool:

* Skin color modifiers

* Character combiners: man [ZWJ] woman [ZWJ] boy [ZWJ] girl === family of 4
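
A little Python sketch of the combiner mechanism (using the RGI ordering man,
woman, girl, boy, which is the sequence most platforms ship a glyph for):

    ZWJ = "\u200d"
    family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])
    print(family)       # renders as a single family glyph on supporting platforms
    print(len(family))  # 7 code points behind that one visible "character"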

~~~
johannes1234321
"How to shrink a family using MySQL" or: "Unicode Emojis, Code Points, and
Grepheme Clusters"

[https://twitter.com/johannescode/status/1183716981612208128?...](https://twitter.com/johannescode/status/1183716981612208128?s=19)

------
jagracey
Just posted another Unicode article to HN. "Hacking GitHub with Unicode"

[https://news.ycombinator.com/item?id=21693550](https://news.ycombinator.com/item?id=21693550)

------
jagracey
Unicode reverse character:

'hello \u{202e} world'; 'hello dlrow' // Visual equivalent
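
The effect is purely visual; a quick Python check (my own illustration) shows
the underlying code points are untouched:

    s = "hello \u202eworld"
    print(s)          # bidi-aware terminals render this roughly as "hello dlrow"
    print(ascii(s))   # 'hello \u202eworld' -- the stored data is not reversed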

