
Emoji.length == 2 - stanzheng
http://blog.jonnew.com/posts/poo-dot-length-equals-two
======
danbruc
The Unicode standard describes in Annex 29 [1] how to properly split strings
into grapheme clusters. And here [2] is a JavaScript implementation. This is a
solved problem.

[1]
[http://www.unicode.org/reports/tr29/](http://www.unicode.org/reports/tr29/)

[2] [https://github.com/orling/grapheme-
splitter](https://github.com/orling/grapheme-splitter)

~~~
Joeri
This is most definitely not a solved problem, because graphemes (visual
symbols) are a poor way to deal with unicode in the real world. Pretty much
all systems either deal with the length in bytes (if they're old-style C), in
code units / byte pairs (if they're UTF-16 based, like windows, java and
javascript), or in unicode code points (if they're UTF-8 based, like every
proper system should be). Dealing with the length in visual symbols is
actually pretty much impossible in practice because databases won't let you
define field lengths in graphemes.

The way things compose: bytes combine into code points (unicode numbers), and
code points combine into graphemes (visual symbols). In UTF-16 for legacy
compatibility reasons with UCS-2, code points decompose into code units (byte
pairs), and high code points, which need a lot of bits to represent their
number, need two code units (4 bytes) instead of one.

Java and JavaScript are UTF-16 based, so they measure length in code units and
not code points. An emoji code point can be a low or high number depending on
when it was added. Low numbers can be stored in two bytes, high numbers need
four bytes. So an emoji can have length 1 or 2 in UTF-16. However, when moving
to the database it will typically be stored in UTF-8, and the field length
will be code points, not code units. So, that emoji will have a length of 1
regardless of whether it is low or high. You don't notice this as a problem
because app-level field length checks will return a bigger number than what
the database perceives, so no field length limits are exceeded.
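
For instance, in JavaScript (a quick sketch; the specific code points are only examples, and `.length` counts UTF-16 code units):

    "\u263A".length                  // 1: U+263A lives in the BMP, one code unit
    "\u{1F4A9}".length               // 2: U+1F4A9 is astral, stored as a surrogate pair
    [..."\u{1F4A9}"].length          // 1: iterating a string walks code points, not code units
    "\u{1F4A9}".codePointAt(0).toString(16)   // "1f4a9"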

There isn't any such thing as "characters" in code. In documentation when they
say "characters" usually they mean bytes, code units or code points. Almost
never do they mean graphemes, which is intuitively what people think they
mean. The bottom line is two-fold: (A) always understand what is meant in
documentation by "length in characters", because it almost never means the
intuitive thing, and (B) don't try to use graphemes as your unit of length, it
won't work in practice.

~~~
deathanatos
A "code unit" exists in UTF-8 and UTF-32; they are not unique to UTF-16.[1]
UTF-8's relationship with code points is approximately the same as UTF-16's,
except that UTF-8 systems tend to understand code points better because if
they didn't, things break a _lot_ sooner, whereas they mostly work in UTF-16.

Your entire argument that graphemes are a poor way to deal with unicode seems
to be that current programming languages don't use graphemes, instead dealing
in a mix of code units or points. But the article here shows a number of cases
where that approach breaks down, and the person you're responding to clearly
points out that, for the cases covered in the article, graphemes are the way
to go (and he's correct).

Graphemes aren't _always_ the correct method (and I don't think your parent
was advocating that), just like code units or code points aren't always the
right way to count. It's highly dependent on the problem at hand. The bigger
issue is that programming languages make the default something that's _often_
wrong, when they probably ought to force the programmer to choose, and so,
most code ends up buggy. Worse, some languages, like JavaScript, provide no
tooling within their standard library for some of the various common ways of
needing to deal with Unicode, such as code points.

[1]:
[http://unicode.org/glossary/#code_unit](http://unicode.org/glossary/#code_unit)

~~~
Dylan16807
> A "code unit" exists in UTF-8 and UTF-32; they are not unique to UTF-16.

Technically yes. But they are only "exposed" in UTF-16. In UTF-32 code points
and code units are the same size, so you only have to deal with code points.
In UTF-8 you only have to deal with code points and bytes. UTF-16 is unique in
having something that is neither code point nor byte but sits in between.

~~~
danbruc
That is certainly true if you only look at the word sizes at different layers.
But any implementation will at least logically start with a sequence of bytes,
then turn them into code units according to the encoding scheme, group code
units into minimal well-formed code unit subsequences according to the
encoding form, and finally turn them into code points.

While different layers may use words of the same size, there are still
differences, for example what is valid and what is not. While for example
U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800
is not a valid UTF-32 code unit. 0xC0 0xA0 is a perfectly fine pair of bytes,
both are valid UTF-8 code units, and they could become the code point U+000020
if only 0xC0 0xA0 were not an invalid code unit subsequence.

So yes, while I agree that UTF-16 is special in the sense that one has to
deal with 8, 16 and 32 bit words, I don't think one should dismiss the
concept of code units for all encoding forms but UTF-16. There are enough
subtle details between the different layers that the distinction is
warranted. And that is actually something I really like about the Unicode
standard: it is really precise and doesn't mix up things that are
superficially the same.

~~~
Dylan16807
> But any implementation will at least logically start with a sequence of
> bytes, then turn them into code units according to the encoding scheme,
> group code units into minimal well-formed code unit subsequences according
> to the encoding form, and finally turn them into code points.

Not at all. I've never seen people using UTF-8 deal with a code unit stage.
They parse directly from bytes to code points.

> While for example U+00D800 is a perfectly fine code point, the first high-
> surrogate, 0x0000D800 is not a valid UTF-32 code unit.

I thought that _was_ an invalid code point. Where would I look to see the
difference? Nevertheless I would expect most code to make no distinction
between the invalidity of 0x0000D800 and 0x44444444, except perhaps to give a
better error message.

> 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code
> units, and they could become the code point U+000020 if only 0xC0 0xA0 were
> not an invalid code unit subsequence.

If you say that they're correct code units then at what point do you
distinguish bytes and code units? In practice almost nobody decodes UTF-8 with
an understanding of code units, neither by that name nor any other name. They
simply see bytes that correctly encode code points, and bytes that don't.

Especially if you say that C0 is a valid code unit despite it not appearing in
any valid UTF-8 sequences.

~~~
netvl
> I've never seen people using UTF-8 deal with a code unit stage. They parse
> directly from bytes to code points.

Well, that's probably because in UTF-8 a code unit is a byte :)

Quoting
[https://en.wikipedia.org/wiki/UTF-8](https://en.wikipedia.org/wiki/UTF-8):

> The encoding is variable-length and uses 8-bit code units.

By definition, a code unit is a bit sequence of a fixed size from which code
points are formed. In UTF-8 you form code points using 8-bit bytes, therefore
in UTF-8 a code unit is a byte. In UTF-16 it is a sequence of two bytes. In
UTF-32 it is a sequence of four bytes.

~~~
Dylan16807
I said as much in my first comment, yes. I'm not sure if I'm missing something
in your comment?

Code units may 'exist' on all three through the fiat of their definition, but
they only have a visible function and require you to process an additional
layer in UTF-16.

------
darkengine
The thing that frustrates me the most about Unicode emoji is the astounding
number of combining characters. For combining characters in written languages,
you can do an NFC normalization and, with moderate success, get a 1 codepoint
= 1 grapheme mapping, but "Emoji 2.0" introduced some ridiculous emoji
compositions with the ZWJ character.

To use the author's example:

woman - 1 codepoint

black woman - 2 codepoints, woman + dark Fitzpatrick modifier

woman kissing woman - 7 codepoints, woman + ZWJ + heart + ZWJ + kissy lips
+ ZWJ + woman

It's like composing Mayan pictographs, except you have to include an invisible
character in between each component.

Here's another fun one: country flags. Unicode has special characters 🇱 🇮 🇰 🇪
🇹 🇭 🇮 🇸 that you can combine into country codes to create a flag. 🇰+🇷 = 🇰🇷

edit: looks like HN strips emoji? Changed the emoji in the example into
English words. They are all supposed to render as a single "character".
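
A sketch of the flag example in JavaScript, written with escapes since the emoji themselves get stripped here (whether the pair actually renders as one flag depends on the platform's emoji font):

    const KR = "\u{1F1F0}\u{1F1F7}";   // REGIONAL INDICATOR SYMBOL LETTER K + R
    KR.length                          // 4 UTF-16 code units
    [...KR].length                     // 2 code points, but (usually) 1 rendered flag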

~~~
melloclello
Man, imagine if you could compose chinese characters out of radicals like
this.

I'm not sure if that would be a good thing or a bad thing.

~~~
duskwuff
The one that still surprises me is Hangul (Korean script). Hangul characters
are made of 24 basic characters (jamo) which represent consonant and vowel
sounds, which are composed into Hangul characters representing syllables.

Unicode has a block for Hangul jamo, but they aren't used in typical text.
Instead, Hangul are presented using a massive 11K-codepoint block of every
possible precomposed syllable. ¯\\_(ツ)_/¯

~~~
yongjik
I believe that was a necessary compromise to use Hangul on any software not
authored by Koreans.

"These are characters from a country you've never been to. Each three-byte
sequence (assuming UTF-8) corresponds to a square-shaped character." \--> Easy
for everyone to understand, and less chance of screwup (as long as the
software supports any Unicode at all).

"These should be decomposed into sequences of two or three characters, each
three bytes long, and then you need a special algorithm to combine them into a
square block." \--> This pretty much means the software must be developed with
Korean users in mind (or someone must heroically go through every part of the
code dealing with displaying text), otherwise we might as well assume that
it's English-only.

Well, _now_ the equation might be different, as more and more software are
developed by global companies and there are more customers using scripts with
complicated combining diacritics, but that wasn't the case when Hangul was
added to Unicode.

For example: if NFD works properly, the first two characters below should look
identical, and the third should show a "defective" character that looks like
the first two except without the circle (ㅇ). It doesn't work in gvim (it fails
to consider the second/third example as a single character), Chrome in Linux,
or Firefox in Linux.

은 은 ᅟᅳᆫ

Of course, if it were the _only_ method of encoding Korean, then the support
would have been better, but it would've still required a lot of work by
everyone.
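
For what it's worth, the precomposed/decomposed split is easy to see from JavaScript (a sketch using the same syllable 은, U+C740, as above):

    "\uC740".length                      // 1: precomposed syllable
    "\uC740".normalize("NFD").length     // 3: conjoining jamo (ieung + eu + final nieun)
    "\uC740" === "\uC740".normalize("NFD")                   // false
    "\uC740".normalize("NFD").normalize("NFC") === "\uC740"  // true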

~~~
innocenat
My Linux Chrome shows your example perfectly though. Note that I also have CJK
language pack installed.

~~~
Sean1708
My Linux Firefox also behaves correctly, but I don't have any language packs
installed AFAIA.

------
Animats
Before emoji, fonts and colors were independent. Combining the two creates a
mess. Try using emoji in an editor with syntax coloring. We got into this
because some people thought that single-color emoji were racist.[1] So now
there are five skin tone options. The no-option case is usually rendered as
bright yellow, which comes from the old AOL client. They got it from the
happy-face icon of the 1970s.

Here's the current list of valid emoji, including upcoming ones being added in
the next revision.[2]

A reasonable test for passwords is to run them through an IDNA checker, which
checks whether a string is acceptable as a domain name component. This catches
most weird stuff, such as mixed left-to-right and right-to-left symbols, zero-
width markers, homoglyphs, and emoji.

[1] [https://www.washingtonpost.com/news/the-intersect/wp/2015/02...](https://www.washingtonpost.com/news/the-intersect/wp/2015/02/24/are-apples-new-yellow-face-emoji-racist/?utm_term=.ec65e2f8ef2d)

[2] [http://unicode.org/emoji/charts-beta/full-emoji-list.html](http://unicode.org/emoji/charts-beta/full-emoji-list.html)

~~~
tomjakubowski
> A reasonable test for passwords is to run them through an IDNA checker,
> which checks whether a string is acceptable as a domain name component. This
> catches most weird stuff, such as mixed left-to-right and right-to-left
> symbols, zero-width markers, homoglyphs, and emoji.

Why test this at all? It's not as if a website should ever need to render a
user's password as text. Is there another use case for excluding this "weird
stuff" that I'm not seeing?

~~~
xnyhps
Suppose I include 'ü': LATIN SMALL LETTER U WITH DIAERESIS in my password. I
switch to a different browser/OS/language and now when I enter "ü" I get 'u':
LATIN SMALL LETTER U + ' ̈': COMBINING DIAERESIS. I can't log in anymore,
though what I do is identical and defined to be equivalent. Especially if the
password is hashed before comparing it, you can't treat it as just a sequence
of bytes.

You don't need to use IDNA for this, though. There are standards specifically
for dealing with Unicode passwords, such as SASLprep (RFC 4013) and PRECIS
(RFC 7564).
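
A minimal sketch of the problem, and of the usual fix of normalizing before hashing; SASLprep/PRECIS do considerably more than this (mapping, prohibited characters, bidi rules):

    const precomposed = "\u00FC";    // ü as a single code point
    const decomposed  = "u\u0308";   // u + COMBINING DIAERESIS
    precomposed === decomposed                                    // false
    precomposed.normalize("NFC") === decomposed.normalize("NFC")  // true
    // so normalize to a fixed form before feeding the password to the hash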

~~~
StringEpsilon
I would not actually disallow these characters, but you may warn the user
about the existance of problematic characters in their password of choice.

If I want to use äöüßÄÖÜẞ because I'm confident that I can properly type them
on all devices I'll need to type then, then let me. It's not your concern what
method of input I'm using.

And maybe, just maybe, using latin characters is actually more of a hassle for
a user anyway. (I think the risk of that occoring is low, but still. At the
moment, it's a self-fulfilling prophecy that all users have proper method to
input atin script available. We simply force them to have one.)

Edit: And the confusion is also possible with just Latin-looking characters.
U+0430 (Cyrillic small a) looks exactly like "a", but has a different code
point and thus ruins the hash.
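
The two code points really are distinct, so any hash of the input differs:

    "a".codePointAt(0).toString(16)        // "61":  LATIN SMALL LETTER A
    "\u0430".codePointAt(0).toString(16)   // "430": CYRILLIC SMALL LETTER A, visually identical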

------
kmill
There are multiple ways of counting "length" of a string. Number of UTF-8
bytes, number of UTF-16 code units, number of codepoints, number of grapheme
clusters. These are all distinct yet valid concepts of "length."
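
All four can be computed in JavaScript; a sketch, assuming a runtime that provides `TextEncoder` and `Intl.Segmenter` (the latter is newer than most of the code discussed here):

    const s = "\u{1F469}\u{1F3FF}";  // woman + dark skin tone modifier
    new TextEncoder().encode(s).length   // 8 UTF-8 bytes
    s.length                             // 4 UTF-16 code units
    [...s].length                        // 2 code points
    [...new Intl.Segmenter("en", { granularity: "grapheme" }).segment(s)].length  // 1 grapheme cluster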

For the purpose of allocating buffers, I can see the obvious use in knowing
number of bytes, UTF-16 code units, or the number of codepoints. I also see
the use in being able to iterate through grapheme clusters, for instance for
rendering a fragment of text, or for parsing. Perhaps someone can shed light
on a compelling use case for knowing the number of grapheme clusters in a
particular string, because I haven't been able to think of one.

I'm not sure about calculating password lengths: if the point is entropy, the
number of bytes seems good enough to me!

The password field bug is possibly compelling, but I don't think it's obvious
what a password field _should_ do. Should it represent keystrokes? Codepoints?
Grapheme clusters? Ligatures? Replace all the glyphs with bullets during font
rendering?

(Similarly, perhaps someone could explain why they think reversing a string
should be a sensible operation. That this is hard to do is something I
occasionally hear echoing around the internet. The best I've heard is that you
can reuse the default forward lexicographic ordering on reversed strings for a
use I've forgotten.)

~~~
toast0
> Perhaps someone can shed light on a compelling use case for knowing the
> number of grapheme clusters in a particular string, because I haven't been
> able to think of one.

If you have a limit on the length of a field, it helps to tell the user what
it is in a way they understand. For non-technical users, bytes (and the
embedded issue of encoding) and code points are both pretty esoteric, but
number of symbols is less so. OTOH, SMS has strict data and encoding limits,
and people managed with that; also provisioning byte storage for grapheme
limited fields is hard: some graphemes use a ton of code points, family emoji
and zalgo text are clear examples.

~~~
paulddraper
Why do you have a limit on the length of a field?

So it can fit in a database, i.e. with a certain number of bytes?

~~~
desdiv
Without a limit on password length, an attacker can DOS you by forcing you to
run your KDF on gigabyte-sized strings.

~~~
paulddraper
Giga _byte_ sized strings?

Oh, no. That doesn't make sense. You need to limit by Giga _grapheme_ strings.

------
TorKlingberg
If you want to do Unicode correctly, you shouldn't ask for the "length" of a
string. There is no true definition of length. If you want to know how many
bytes it uses in storage, ask for that. If you want to know how wide it will
be on the screen, ask for that. Do not iterate over strings character by
character.

~~~
fryguy
How many dots/stars should one display for a password? That's a question that
can't be answered by your two valid questions. Are you suggesting that
dots/stars shouldn't be displayed for passwords, since you can't ask how many
"characters" it is?

~~~
slededit
You could divide the rendered length of the string by the length of the '*'
character in a monospaced font. It doesn't really make sense for a combining
or other invisible character to get its own asterisk.
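
A browser-only sketch of that idea using the canvas text metrics API (`starCount` is a hypothetical helper; it sidesteps counting "characters" entirely):

    function starCount(password) {
      const ctx = document.createElement("canvas").getContext("2d");
      ctx.font = "16px monospace";   // assume the field is rendered in a monospaced font
      return Math.round(ctx.measureText(password).width / ctx.measureText("*").width);
    }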

------
chungy
> The current largest codepoint? Why that would be a cheese wedge at U+1F9C0.
> How did we ever communicate before this?

Sounds cute, but inaccurate.

If we count the last two planes that are reserved for private use (aka,
applications/users can use them for whatever domain problems they like), that
would be U+10FFFD.

If we count the variation selector codepoints (used for things like changing
skin tone, or the look of certain other characters), U+E01EF.

If we count the last honestly-for-real-written-language character assigned, it
would be 𪘀 U+2FA1D CJK COMPATIBILITY IDEOGRAPH-2FA1D.

But I suppose none of that sounds as fun as an emoji (which are really a very
small part of the Unicode standard).

~~~
rspeer
I tried to look up what U+2FA1D, the highest-numbered printable character,
means in context.

It is a Traditional Chinese character. It's a compatibility variant of U+2A600,
𪘀, which is pronounced "pián". It apparently is used in zero words. It's in Unicode
because it's listed in the 7th section of TCA-CNS 11643-1992, a Taiwanese
computing standard.

Searching for it gives lots of sites that acknowledge that it's a character
that exists and then provide no definition for it.

My guess: it occurred in someone's name at some point. Pretty strange that it
ended up requiring a compatibility mapping, though, when nobody seems to use
the character or the character it's mapped to!

------
zach417
Tom Scott did a nice YouTube video related to this:
[https://www.youtube.com/watch?v=sTzp76JXsoY](https://www.youtube.com/watch?v=sTzp76JXsoY)

------
teknologist
This appears to be a rehash of what Mathias Bynens was talking about a few
years ago.

[http://vimeo.com/76597193](http://vimeo.com/76597193)

[https://mathiasbynens.be/notes/javascript-
unicode](https://mathiasbynens.be/notes/javascript-unicode)

------
ge0rg
I've gone through exactly the same discovery process when implementing faux
stamps (something between images and Emoji) in my xmpp app yesterday.

My idea was to increase the font size of a message that only consists of
Emoji, depending on the number of Emoji in the message, like this:

[https://xmpp.pix-
art.de/imagehost/display/file/2017-03-09_09...](https://xmpp.pix-
art.de/imagehost/display/file/2017-03-09_09-36-09_r8m468so4vh7.jpg)

The code turned out more complex than first expected, mirroring the same
problems OP encountered:

[https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/and...](https://github.com/ge0rg/yaxim/blob/gradle/src/org/yaxim/androidclient/util/XMPPHelper.java#L66-L93)
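
A rough JavaScript equivalent of that counting step (`countEmoji` is a hypothetical helper; it assumes a runtime with `Intl.Segmenter` and Unicode property escapes, both of which the linked Java code predates):

    const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
    function countEmoji(msg) {
      // count grapheme clusters that contain at least one pictographic code point
      return [...seg.segment(msg)]
        .filter(g => /\p{Extended_Pictographic}/u.test(g.segment))
        .length;
    }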

~~~
kalleboo
I'm working on a project that has to handle special rendering of emoji as
well, and I simply ask the system "will this string render in the emoji font"
and "how big of a rect do I need to render this string" to calculate the same
thing, rather than trying to handle it myself and relying on assumptions about
the sizing of the emoji. I figure this way I also future proof against
whatever emoji they think up in the future.

------
mhils
The Zero-Width-Joiner allows for some really strange things:
[https://blog.emojipedia.org/ninja-cat-the-windows-only-
emoji...](https://blog.emojipedia.org/ninja-cat-the-windows-only-emoji/).

One can basically achieve an unlimited number of emojis by concatenating the
current ones.
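
For example, the family sequence is just three ordinary emoji glued together with U+200D (a sketch; whether it renders as one glyph or falls back to separate faces depends on the font):

    const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";  // man + ZWJ + woman + ZWJ + girl
    family.length        // 8 UTF-16 code units
    [...family].length   // 5 code points, ideally 1 rendered glyph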

------
joeblau
I ran into this 2 years ago on Swift when I was creating an emojified version
of Twitter. I wanted to ensure that each message sent had at least 1 emoji and
I quickly realized that validating a string with 1 emoji was not as simple as:

    
    
      if (lastString.characters.count == 2) {
         // pseudo code to allow string and activate send button
      }
    

This was the app I was working on [1]; the code is finished, but I'm not
launching it (probably ever). The whole emoji length piece was quite
frustrating because my assumption about character counting went right out the
window when I had people testing the app in TestFlight.

[1] - [https://joeblau.com/emo/](https://joeblau.com/emo/)

~~~
Manishearth
Actually, this is just due to Swift not implementing Unicode 9's version of
UAX 29 (which had just come out at the time). Swift _should_ handle it
correctly, but it's lagging behind in unicode 9 support. In general a
"character" in a string is a grapheme cluster, and most visually-single emoji
are single grapheme clusters. The exception is stuff like the man judge ZWJ sequence[1]. That _should_
render as a male judge (I don't think there's font support for it yet)
according to the spec, and it should be a single grapheme cluster, but the
spec has what I consider a mistake in it where it isn't considered to be one.
I've filed a bug about this, since the emoji-zwj-sequences file lists it as a
valid zwj sequence, but applying the spec to the sequence gives two grapheme
clusters.

There's active work now for Unicode 9 support in Swift. Since string handling
is heavily dependent on this algorithm (they have a unicode trie and all for
optimization!) it's trickier than just rewriting the algorithm.

But, in general, you should be able to trust Swift to do the right thing here,
barring bugs like "not up to date with the spec". Swift is great like that.

[1]:
[https://r12a.github.io/uniview/?charlist=%F0%9F%91%A8%F0%9F%...](https://r12a.github.io/uniview/?charlist=%F0%9F%91%A8%F0%9F%8F%BB%E2%80%8D%E2%9A%96%EF%B8%8F)

------
hwc
How can that entire article never mention the term UTF-16?

~~~
Retr0spectrum
Why should it? Other than for explaining why the abomination of surrogate
pairs came into existence.

------
tantalor
> I have no idea if there’s a good reason for the name “astral plane.”
> Sometimes, I think people come up with these names just to add excitement to
> their lives.

[https://en.wikipedia.org/wiki/Plane_(esotericism)#The_Planes](https://en.wikipedia.org/wiki/Plane_\(esotericism\)#The_Planes)

------
openasocket
The issue doesn't really seem to be the emojis, but rather the variation
sequences, which seem to be really awkward to work with, but I can sort of see
why they're necessary. But the fact that we need special libraries to answer
fairly basic queries about unicode text doesn't bode well.

~~~
masklinn
> But the fact that we need special libraries to answer fairly basic queries
> about unicode text doesn't bode well.

That's always been needed to actually properly work with unicode, what do you
think ICU is? Few if any languages have complete native Unicode support. And
it's hardly new, Unicode has an annex (#29) dedicated to text segmentation:
[http://www.unicode.org/reports/tr29/](http://www.unicode.org/reports/tr29/)

------
codezero
I see your 2 and raise you 2:

"(this is a color-hued hand from Apple that doesn't render on HN)".length == 4

I ran into the length==2 bug when truncating some text; it led to errors
trying to URL-encode a string :)

The author's `fancyCount2` still returns a size of 2 for these kinds of emoji,
but I'm not too surprised.

------
sorenjan
I think the article "A Programmer's Introduction to Unicode" that was shared
here recently is a good read and explains Unicode well.

[https://news.ycombinator.com/item?id=13790575](https://news.ycombinator.com/item?id=13790575)

------
pc2g4d
Just ran into this yesterday when I discovered that an emoji character
wouldn't fit into Rust's `char` type. I just changed the type to `&'static
str` but I still wish there was a single `grapheme` type or something like
that.

------
gtrubetskoy
In Go:

    
    
      package main
      import "fmt"
      import "unicode/utf8"
      
      func main() {
          shit := "\U0001f4a9" // U+1F4A9 PILE OF POO
          fmt.Printf("len of %s is %d\n", shit, utf8.RuneCountInString(shit))
      }
    

$ len of 💩 is 1

Though I can't say that this is all that intuitive either...

~~~
geocar
Codepoints still aren't the same as characters.

Consider the examples given about combining emoji, or consider two runes that
make one character: e and ◌́
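
The same pair in JavaScript, for comparison with the Go rune count above:

    "e\u0301".length                          // 2 code units and 2 code points, but 1 visible character
    "e\u0301".normalize("NFC") === "\u00E9"   // true: NFC folds it into precomposed é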

------
remx
Just going to leave this link here:
[https://mathiasbynens.be/notes/javascript-
unicode](https://mathiasbynens.be/notes/javascript-unicode)

------
Traubenfuchs
If this interests you, read the source of Java's
`AbstractStringBuilder.reverse()`. It's interesting and very short. I am not
sure it can deal with multi-emoji emoji (ZWJ sequences) though.

------
xem
Here are my 2 cents: you can decompose a Unicode string with the ES6 spread
operator:

[..."(insert 5 poo emoji here)"].length === 5

[..."(insert 5 poo emoji here)"][1] === "(poo emoji)"

------
lsv1
As a developer dealing with the encoding of UTF-8 user input into legacy
systems which only support ASCII... I prefer this.

------
rsmets
(U+200B), zero width space, should be outlawed... got me good a couple years
ago! Had to do a hexdump to see what was going on.

------
beaugunderson
lodash's toArray and split both support emoji, with good unit tests. I also
wrote emoji-aware for this purpose:

[https://www.npmjs.com/package/emoji-
aware](https://www.npmjs.com/package/emoji-aware)

------
nutbutter
The golf course flag equals one, obviously, because it's a hole-in-one. :)

------
jtymann
Makes me wonder whether or not that should be considered a bug.

~~~
Manishearth
I'm sure all browser designers out there would love it if we could switch JS
over to UTF-8, or in general have any system where JS uses a well-formed
encoding when it comes to Unicode. We can't, because of backwards
compatibility.

------
TheRealPomax
but the real question is why he needed password length constraints instead of
password strength constraints...

------
marichards
create table twitter(tweet varchar(? ... that's it, I give up, time to become
an Uber driver

------
wcummings
> Sometimes, I think people come up with these names just to add excitement to
> their lives.

Let's get outta here guys, we've been rumbled!

------
phkahler
Unicode is fucked. All these bullshit emojis remind me of the 1980s when ASCII
was 7 bits but every computer manufacturer (Atari, Commodore, Apple, IBM, TI,
etc...) made their own set of characters for the 128 values of a byte beyond
ASCII. Of course Unicode is a global standard so your pile-of-poop emoji will
still be a pile-of-poop on every device even if the amount of steam is
different for some people.

It's beyond me why this is happening. Who decides which bullshit symbols get
into the standard anyway?

~~~
TAForObvReasons
There is a Unicode encoding "UTF-32" which has the advantage of being fixed
width. This is not popular for the obvious reason that even ASCII characters
are expanded to 4 bytes. Additionally the Windows APIs, among other
interfaces, are not equipped to handle 4-byte codepages.

~~~
marcosdumay
> "UTF-32" which has the advantage of being fixed width

It's fixed width for now. It cannot hold all the currently available code
points, so it will probably have the same fate as UTF-16 (but it will
probably take a long time).

I'd stay away from it.

~~~
jcranmer
There are currently 17 × 65536 code points (U+0000..U+10FFFF) in Unicode.
UTF-32 could theoretically encode up to a hypothetical U+FFFFFFFF and still be
fixed-width.

Note that, at present, only 4 of the 17 planes have defined characters (Planes
0, 1, 2, and 14), two are reserved for private use (15 and 16), and one
additional plane is unused but thought to be needed (Plane 3, the TIP for
historic Chinese script predecessors). Four planes appear to be sufficient to
support every script ever written on Earth, as it's doubtful there are
unidentified scripts with an ideographic repertoire as massive as the Unified
CJK ideographs database.

We are very unlikely to ever fill up the current space of Unicode, let alone
the plausible maximum space permissible by UTF-8, let alone the plausible
maximum space permissible by UTF-32.

~~~
nercht12
The bummer is when you want to create a font that supports all the characters.
Ugh. Talk about a lot of work.

