
Unicode programming, with examples in C - mmoez
https://begriffs.com/posts/2019-05-23-unicode-icu.html
======
jancsika
> This is called Normalization Form Canonical Composition (NFC).

Is there something like a "Round Midnight and No Coffee Form" where the
programmer just renders the text to check whether the output of each set of
codepoints matches pixel for pixel?

~~~
a1369209993
The closest Unicode has seems to be NFKC, but last I checked it still
didn't correctly handle Greek and Cyrillic aliases of Latin characters,
never mind anything more obscure.

~~~
LunaSea
Do you have examples of this? I wonder if it's implementation specific or due
to the NFKC method itself?

~~~
dfox
The issue doesn't have anything to do with normalization per se; it stems
from the fact that there are Unicode characters that are semantically
different but nevertheless look exactly the same in most fonts. For example
'a' (U+0061 LATIN SMALL LETTER A) vs. 'а' (U+0430 CYRILLIC SMALL LETTER A).
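
You can see this in plain C, with no library at all: the two characters
render alike in most fonts but are different codepoints, so their UTF-8
bytes differ, and no normalization form (NFC, NFD, NFKC, NFKD) maps one to
the other.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *latin    = "a";        /* U+0061, UTF-8: 0x61 */
        const char *cyrillic = "\xD0\xB0"; /* U+0430, UTF-8: 0xD0 0xB0 */
        printf("latin: %zu byte(s), cyrillic: %zu byte(s), equal: %s\n",
               strlen(latin), strlen(cyrillic),
               strcmp(latin, cyrillic) == 0 ? "yes" : "no");
        return 0;
    }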

~~~
ygra
It's probably also futile to keep homoglyph tables without the context of the
font being used, as т and _т_ are the same letter, just with different styles
(and in some fonts resemble T and _m_ respectively in Latin). Unicode mentions
homoglyphs briefly in UTR #36, but overall I feel that it's not really their
job to solve that issue, as visually similar characters already exist in ASCII
and other limited character sets and every context where those things are a
problem probably needs to be evaluated differently with different mitigations.

------
jhasse
Btw: In C99 you don't have to declare all your variables at the top. You can
also use // comments.
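
For example, this compiles as C99 but not as strict C89:

    #include <stdio.h>

    int main(void) {
        puts("a statement first...");
        int x = 42;   // ...then a declaration, plus a // line comment:
                      // both are C99 features rejected by strict C89
        printf("%d\n", x);
        return 0;
    }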

~~~
vardump
We embedded devs might be able to start consistently using C99 in about fifty
years or so. Hopefully.

Lots of development kits barely support C89.

~~~
goto11
Couldn't pre-processors be used to add such features? (I don't know anything
about embedded development.)

~~~
mbel
CPP would not be enough; you would need a full C99-to-C89 compiler that
parses to an AST.
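
A quick sketch of why: mixed declarations have to be hoisted to the top of
their enclosing block (and sometimes renamed to avoid collisions), which
requires scope analysis rather than token substitution.

    /* C99 input */
    void f99(int n) {
        for (int i = 0; i < n; i++) { /* ... */ }
        int total = n * 2;  /* declaration after a statement */
        (void)total;
    }

    /* what a C99-to-C89 translator must emit */
    void f89(int n) {
        int i;
        int total;
        for (i = 0; i < n; i++) { /* ... */ }
        total = n * 2;
        (void)total;
    }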

------
macintux
I found this quite interesting. Perhaps the closest I’ve come to grasping the
core concepts, even though I’ve done a fair bit of work around the edges
previously.

------
stabbles
I'm just confused that the libicu shared libs are ~25 MB. Why is that?

~~~
rurban
Because it has to store all the properties of all valid codepoints,
0-0x10FFFF. It does it via perfect hashes for the fastest lookup, not via
space-saving 3-level arrays as most others do. I described various
implementation strategies here:
[http://perl11.org/blog/foldcase.html](http://perl11.org/blog/foldcase.html)
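
For contrast, here is a toy sketch of the multi-stage array approach (not
ICU's or any real implementation's layout; actual tables are generated from
the Unicode Character Database). Blocks with identical contents are stored
once and shared, which is where the space saving comes from:

    #include <stdint.h>
    #include <stdio.h>

    enum { BLOCK_SHIFT = 8, BLOCK_SIZE = 1 << BLOCK_SHIFT };

    /* Toy property blocks: block 1 marks U+0030 and U+0031 as digits. */
    static const uint8_t blocks[2][BLOCK_SIZE] = {
        { 0 },
        { [0x30] = 1, [0x31] = 1 },
    };

    /* Maps the high bits of a codepoint to a (shared) block. */
    static const uint16_t stage1[0x110000 >> BLOCK_SHIFT] = {
        [0x00] = 1,  /* codepoints 0x0000-0x00FF use block 1 */
    };

    static int property(uint32_t cp) {
        return blocks[stage1[cp >> BLOCK_SHIFT]][cp & (BLOCK_SIZE - 1)];
    }

    int main(void) {
        printf("%d %d\n", property(0x31), property(0x10FFFF));  /* 1 0 */
        return 0;
    }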

~~~
burfog
Benchmarked? For one lookup, or for repeated lookups?

Hashes have terrible cache locality. Unicode itself has locality, with the
Greek characters generally separate from the Chinese characters and so on.
The tree-based and array-based methods take advantage of this locality.

~~~
gmueckl
Just guessing, but based on statistics of web pages in Asian languages, most
text consists mostly of the lower code points, no matter the language. So
hash lookups end up being heavily biased towards small subsets of the data.
And I wouldn't be surprised if the cache sizes of modern processors conspire
to accelerate this pretty lopsided distribution of accesses considerably.

~~~
derefr
I’ve always wondered whether, in the context of segmenting/layout-ing entire
Unicode documents (or large streams where you’re willing to buffer kilobytes
at a time, like browser page rendering), there’d be an efficiency win for
Unicode processing, in:

1. detecting (either heuristically, or using in-band metadata like HTML
“lang”) the set of languages in use in the document; and then

2. rewriting the internal representation of the received document/stream-
chunk from “an array of codepoints” to “an array of pairs {language ID,
offset within a language-specific tokens table}.”

In other words, one could—with knowledge of which languages are in use in a
document—denormalize the codepoints that are considered valid members of
multiple languages’ alphabet/ideograph sets, into separate tokens for each
language they appear in.

Each such token would “inherit” all the properties of the original Unicode
codepoint it is a proxy for, but would only have to actually _encode_ the
properties that actually matter in the language it’s a token of.

And, as well, each language would be able to set defaults for the properties
of its tokens, such that the tokens would only have to encode the exceptions
to the defaults; or there could even be language-specific functions for
decoding each property, such that languages could Huffman-compress together
the particular properties that apply to them, given known frequencies of those
properties among its tokens, making it cheaper to decode properties of
commonly-encountered tokens, at the expense of decoding time for rarely-
encountered tokens.

And, of course, this would give each language’s tokens data locality, such
that the CPU could keep only the data (or embodied decision trees) in cache,
for the languages that it’s actually using.

Each token would know what its codepoint is, so you could map this back to
regular Unicode (e.g. UTF-8) when serializing it.
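
Roughly, in C (all names and layouts here hypothetical):

    #include <stdint.h>
    #include <stdio.h>

    /* One token per character: a language ID plus an offset into that
       language's table, instead of a bare codepoint. */
    typedef struct { uint8_t lang; uint16_t offset; } token;

    /* Per-language entry: the original codepoint (for serializing back
       out to UTF-8) plus only the properties that matter in that
       language; everything else comes from language-level defaults. */
    typedef struct { uint32_t codepoint; uint8_t props; } lang_entry;

    static const lang_entry latin_table[]    = { { 0x0061, 0 } };  /* 'a' */
    static const lang_entry cyrillic_table[] = { { 0x0430, 0 } };  /* 'а' */
    static const lang_entry *lang_tables[]   = { latin_table, cyrillic_table };

    static uint32_t token_codepoint(token t) {
        return lang_tables[t.lang][t.offset].codepoint;
    }

    int main(void) {
        token t = { 1, 0 };  /* Cyrillic 'а' */
        printf("U+%04X\n", (unsigned)token_codepoint(t));  /* U+0430 */
        return 0;
    }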

(Yes, I’m sort of talking about reimplementing code pages. But 1. they’d be
code pages _as materialized views of Unicode_, and 2. you’d never expose the
code-page representation to the world, only using it in your own text
system.)

------
gpvos
Money quote: "Puh-leaze, if your program can’t handle Medieval Irish carvings
then I want nothing to do with it."

------
kazinator
> _Reading lines into internal UTF-16 representation_

Fail.

> _It’s unwise to use UTF-32 to store strings in memory. In this encoding it’s
> true that every code unit can hold a full codepoint._

wchar_t is 32 bits on a number of platforms such as GNU/Linux, macOS and
Solaris. It behooves you to use that, and all the associated library
functionality, rather than roll your own.
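
For example, on those platforms each wchar_t holds one full codepoint, so
wcslen counts codepoints, even outside the BMP:

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        setlocale(LC_ALL, "");  /* pick up the user's (UTF-8) locale */
        wchar_t s[] = L"na\u00EFve \U0001F642";  /* "naïve 🙂" */
        /* 7 where wchar_t is 32 bits; the emoji alone would need two
           16-bit code units in UTF-16 */
        printf("%zu codepoints\n", wcslen(s));
        return 0;
    }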

~~~
bhaak
Curiously, the paragraph before the line "it's unwise to use UTF-32" ends with
"Use the encoding preferred by your library and convert to/from UTF-8 at the
edges of the program."

And that is the best advice there is. If you have a choice, use UTF-8,
otherwise use whatever your libraries use.

Unless you have very special needs, forget about UTF-16.

~~~
Const-me
> Unless you have very special needs, forget about UTF-16.

I don’t think programming for Windows, Android, or iOS, or programming in /
interoperating with Java, JavaScript, or .NET, qualifies as “very special
needs”.

~~~
Asooka
Even on Windows it's best to keep your text in UTF-8 and convert it to and
from UTF-16 when interacting with Win32 APIs. Java, .NET and JavaScript are
the worst of all worlds because you're both stuck with wide characters (in
their native string types) and have the intricacies of UTF-16 to consider. I
guess the advice might have been better phrased as "Unless you're forced to,
or have very special needs, stay away from UTF-16".
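
The conversion in question is a few lines around MultiByteToWideChar (a
minimal sketch, without error handling):

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* UTF-8 -> UTF-16 for passing to W-suffixed Win32 APIs.
       Caller frees the result. */
    static wchar_t *utf8_to_utf16(const char *utf8) {
        int n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        wchar_t *w = n ? malloc(n * sizeof *w) : NULL;
        if (w) MultiByteToWideChar(CP_UTF8, 0, utf8, -1, w, n);
        return w;
    }

    int main(void) {
        wchar_t *w = utf8_to_utf16("héllo");
        if (w) { wprintf(L"%ls\n", w); free(w); }
        return 0;
    }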

~~~
Const-me
> it's best to keep your text in UTF-8 and convert it to and from UTF-16
> when interacting with Win32 APIs

It’s extra source code to write and then support, extra machine code to
execute, and likely extra memory to malloc/free. Too slow, in my book,
automatically means “not best”.

> Java, .NET and JavaScript are the worst of all worlds because you're both
> stuck with wide characters (in their native string types) and have the
> intricacies of UTF-16 to consider.

Just normal UTF-16, like in WinAPI and many other popular languages,
frameworks and libraries. E.g. Qt is used a lot in the wild.

> the advice might have been better phrased as

It says exactly the opposite, “Use the encoding preferred by your library and
convert to/from UTF-8 at the edges of the program.”

~~~
fwip
"Too" slow depends on a lot of factors.

~~~
Const-me
When you’re writing code that you’re 100% sure won’t ever become a
performance bottleneck, you still care about development time. Very often,
unless it’s throwaway code, also about the cost of support.

Writing any code at all when that code is not needed is always too slow,
regardless of any technical factors.

~~~
fwip
Very little code in this world is needed. Much of it is, however, useful.

The person you replied to obviously isn't advocating for something they find
useless.

Perhaps you could have instead asked "Why do you recommend doing this? I don't
understand the benefit." But instead, you decided that they're advocating to
do something useless for no reason.

~~~
Const-me
> you decided that they're advocating to do something useless for no reason.

No, I decided they’re advocating to do something harmful for no reason.

They're advocating wasting hardware resources (as a developer I don’t like
doing that) and wasting development time (as a manager I don’t like when
developers do that). But worst of all, UTF-8 on Windows with conversion
to/from UTF-16 at the WinAPI boundary is a source of bugs: the kernel
doesn’t guarantee that the bytes you get from these APIs are valid UTF-16;
quite the opposite, it guarantees to treat them as an opaque chunk of 16-bit
words.

UTF-8 has its place even on Windows, e.g. it makes sense for some network
services, and even for in-memory data when you know it’ll be 99% English,
since that saves resources, as long as the data never hits WinAPI. But as
soon as you’re consuming WinAPI, COM, UWP, the Windows shell, or any other
native stuff, UTF-8 is just not good.

