

UCS vs. UTF-8 as Internal String Encoding - romeoonisim
http://lucumr.pocoo.org/2014/1/9/ucs-vs-utf8/

======
rectang
Random-access idioms where Unicode strings are treated as arrays of character
data result in userland code which is either incorrect or inefficient.
Incorrect, if the implementation cheats and treats variable-width data as
constant-width, severing logical units and reporting wrong length values.
Inefficient, if each random-access operation counts variable-width elements
from the beginning of the string.

It is better to provide string-manipulation facilities which rely on iteration
and keep track of offset internally.
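
To illustrate (a minimal Python sketch, not from the original comment):
byte-indexing UTF-8 can sever a multi-byte sequence, while a single pass that
tracks its own offset stays correct.

    data = "naïve".encode("utf-8")   # b'na\xc3\xafve': 6 bytes, 5 characters

    # Incorrect: treating variable-width data as constant-width.
    print(len(data))    # 6, not 5
    print(data[2:3])    # b'\xc3', half of 'ï' -- an invalid UTF-8 fragment

    # Better: iterate once, keeping track of the byte offset internally.
    offset = 0
    for ch in data.decode("utf-8"):
        print(offset, ch)
        offset += len(ch.encode("utf-8"))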

For an extended explanation, see Tom Christiansen's comment on this old Python
issue:
[http://bugs.python.org/issue12729#msg142036](http://bugs.python.org/issue12729#msg142036)

~~~
geocar
If we're iterating characters, we might as well use zlib-compressed 21-bit
Unicode characters. They save even more room than UTF-8, and it's difficult to
pretend that characters can be accessed by index.
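
A rough Python sketch of the idea (mine, for illustration), packing each
21-bit code point into 3 bytes before compressing; the actual savings depend
entirely on the input:

    import zlib

    text = "Hello, wörld! " * 100
    utf8 = text.encode("utf-8")
    # Pack each 21-bit code point into a 3-byte little-endian word.
    packed = b"".join(ord(c).to_bytes(3, "little") for c in text)
    print(len(utf8), len(zlib.compress(packed)))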

~~~
barrkel
There's at least an order of magnitude difference between accessing memory
and unzipping. The constant space overhead isn't great for small strings,
either.

For big strings like web pages, we already use zlib (via gzip).

------
TazeTSchnitzel
Something I've wondered about: given that Unicode is now limited to 21 bits,
why does nobody seem to use a 24-bit format ("UTF-24")? If you want
constant-time indexing without bit-packing tricks, that'd be the ideal.

Though, of course, constant-time indexing isn't what it's hyped up to be.
That's only for codepoints. Actual characters you see on screen are often
combinations of codepoints.

So I've answered my own question really. Why does nobody use it? Because
there's no point in it.

[http://stackoverflow.com/questions/10143836/why-is-there-no-utf-24](http://stackoverflow.com/questions/10143836/why-is-there-no-utf-24)

~~~
protomyth
"That's only for codepoints. Actual characters you see on screen are often
combinations of codepoints."

So, there is no encoding that actually has 1 word (of whatever byte length)
per character on the screen?

~~~
gsnedders
There's an unbounded number of possible graphemes in Unicode, because each is
formed of a base character followed by an unbounded number of combining
characters. As such, no fixed-length encoding can actually enumerate all
possible graphemes.
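
A quick Python illustration (mine, not part of the original comment): pile
combining marks onto a base character and the code point count keeps growing,
yet the result is still a single grapheme on screen.

    import unicodedata

    cluster = "e" + "\u0301" * 5    # 'e' plus five COMBINING ACUTE ACCENTs
    print(len(cluster))             # 6 code points, rendered as one grapheme
    print([unicodedata.combining(c) != 0 for c in cluster])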

~~~
protomyth
What is the practicality of an unbounded number of possible graphemes? Does
anyone writing real languages, rather than some concocted test case, need
this? I understood there is some conflict over the encoding of kanji in
Unicode; does this relate?

~~~
masklinn
> What is the practicality of an unbounded number of possible graphemes?

That question doesn't really make sense. Unbounded cluster sizes are simply a
result of Unicode's design: there's no reason to bound them.

> Does anyone in real writing of languages and not some concocted test case
> need this?

In the real writing of languages there are grapheme clusters of size 4+,
e.g. क्षि (Devanagari kshi) is a single cluster made up of क, ्, ष and ि.
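
You can see the decomposition in a couple of lines of Python (a sketch for
illustration):

    import unicodedata

    kshi = "\u0915\u094d\u0937\u093f"    # क्षि
    for ch in kshi:
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
    # U+0915 DEVANAGARI LETTER KA
    # U+094D DEVANAGARI SIGN VIRAMA
    # U+0937 DEVANAGARI LETTER SSA
    # U+093F DEVANAGARI VOWEL SIGN I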

> I understood there is some conflict in the encoding of kanji into unicode,
> does this relate?

It's unrelated. Han Unification compressed the number of codepoints needed by
unifying Han characters across simplified Chinese, traditional Chinese,
Japanese and Korean. Han characters were an issue in the original 16-bit
Unicode because supporting all of them was estimated at possibly topping
100,000 codepoints (where 16-bit Unicode only supported 65k codepoints). AFAIK
Han characters don't use much if any composition (outside of romanisation).
Brahmic scripts, and the Hebrew and Arabic scripts, on the other hand, pile on
diacritics.

~~~
protomyth
> That question doesn't really make sense. Unbounded cluster sizes are simply
> a result of Unicode's design: there's no reason to bound them.

If I'm trying to draw a character on the screen, it makes a lot of sense. I
guess I'm struck by how hard it would be to select the 4th through 8th
characters displayed on the screen from rows 3 through 13.

~~~
lambda
Displaying characters on the screen and giving the user the ability to select
text are indeed difficult, complex problems in the face of full support for
the world's written languages.

Glyphs vary in width. There are combining characters. There are ligatures (and
in some languages, ligatures are necessary for the text to be readable, not
just stylistic). There is contextual reordering of characters. There's
bidirectional text, which can switch direction mid-sentence when quoting some
Roman text in Hebrew or vice versa. There's vertical text. There are a number
of fonts, font weights, font styles, font sizes, font features like
alternative characters, language dependent selection of glyphs, and so on,
plus fallback if the user doesn't have the designated fonts or fonts that
cover the given Unicode range installed. There's line breaking, which can't
always operate on spaces because not all languages include spaces between
words. There's justification of text, distributing the spacing between
characters to make a block of text even on both sides.

All of this complexity exists regardless of the encoding that you use. Simply
picking a fixed-width encoding does nothing about the vast majority of it, and
causes problems of its own (size bloat of strings in memory). It also leads
people astray into thinking they can get away with the assumptions that apply
to ASCII on a fixed-width terminal (and even there, assuming that 1 character
equals 1 unit of width fails in the case of tabs), when in fact they can't.
Most text handling beyond simply appending strings together is best left to
dedicated libraries that have had hundreds of developer-years of effort put
into solving all of these problems.

------
TazeTSchnitzel
The only reason UTF-16 was so widely used is as a backwards-compatibility
measure. All these systems originated in the UCS-2 era, or had to be
compatible with ones which did. Then UTF-16 came along, and slotted in where
UCS-2 was. This is why you have such sloppiness about treating surrogate pairs
as two characters, for instance: the systems were made for UCS-2 and only
poorly updated to the UTF-16 reality.
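
A small Python sketch of that sloppiness (for illustration): an astral code
point becomes a surrogate pair in UTF-16, and UCS-2-era code counts it as two
characters.

    s = "\U0001F600"          # an emoji outside the BMP
    units = s.encode("utf-16-le")
    print(len(s))             # 1 code point
    print(len(units) // 2)    # 2 UTF-16 code units (a surrogate pair)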

Any modern system would use UTF-8.

------
userbinator
"UTF-24" feels like the right choice for an internal format if you want true
constant-time access to code points with 25% less waste than UTF-32.

I know 3 bytes isn't a power of 2, and that's the most common reason cited
against using it, but multiplying by 3 is basically n + 2n, i.e. a shift and
an add, so it's not really hard to do.
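
A hypothetical sketch in Python ("UTF-24" here just means one 3-byte
little-endian word per code point; nothing standard):

    def utf24_offset(i):
        return (i << 1) + i    # i * 3 as a shift and an add

    buf = b"".join(ord(c).to_bytes(3, "little") for c in "abc€")
    off = utf24_offset(3)
    print(chr(int.from_bytes(buf[off:off + 3], "little")))    # '€'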

~~~
lambda
Why would you want constant time access to code points? What can you do with
constant time access to code points that is actually correct?

Every time someone tries to promote UTF-16, UTF-32, or some imagined encoding
like UTF-24, they bring up constant-time access to code points, but I have
never heard a good reason why you would want it.

Pretty much every use case I have ever heard for constant-time access can be
handled just as well by constant-time access at the code unit (byte in UTF-8,
16 bit value in UTF-16, 32 bit value in UTF-32) level. Matching text exactly?
Matching works just as well at the code unit level. Doing any kind of fuzzy or
regular expression match? You're going to need to iterate over each item
anyhow to normalize it or classify it. Need to store offsets? That works just
fine at the code unit level.
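
For instance, exact matching and offset storage work fine at the byte level
in UTF-8 (a minimal sketch):

    text = "naïve café".encode("utf-8")
    needle = "café".encode("utf-8")
    off = text.find(needle)    # a plain byte offset
    print(off, text[off:off + len(needle)].decode("utf-8"))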

Most of the other suggestions I've heard for what to do with constant-time
access to code points are incorrect once you consider combining characters,
normalization, differing character widths, the need for linguistically
appropriate word splitting, etc. If you're doing something that doesn't take
these into account, why are you using Unicode instead of just ASCII?

In addition, for the ASCII range, UTF-24 would take up 3 times the space of
UTF-8, and the vast majority of text processed is actually in the ASCII range
due to verbose markup formats like HTML, XML, etc. Plus, if you do anything
with the code points in UTF-24, you need a bunch of bit-fiddling to move them
into 32-bit alignment, so as far as actually decoding the individual code
points goes, it's pretty much a wash with UTF-8.
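
The size difference for ASCII-heavy markup is easy to see (a sketch):

    markup = "<p class='greeting'>hello</p>"
    print(len(markup.encode("utf-8")))    # 29 bytes in UTF-8
    print(3 * len(markup))                # 87 bytes at 3 bytes per code point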

~~~
vorg
> UTF-24 would take up 3 times the space as UTF-8, and the vast majority of
> text processed is actually in the ASCII range due to verbose markup formats
> like HTML, XML, etc

Sounds like these "verbose" markup formats are the _real_ problem in taking up
too much space, not the Unicode transformation formats.

~~~
lambda
I think that moving the world away from HTML and XML as markup languages is
likely to be considerably harder than simply defaulting to using UTF-8 in any
new code for which there isn't already a natural encoding to choose based on
the platform or APIs you're developing on.

Furthermore, even beyond the markup, the vast majority of written content is
in the Roman alphabet, of which most characters are in the ASCII range, and
those that aren't fit into 2 bytes of UTF-8. About 55% of text on the web is
in English, and languages using the Roman script or other scripts in the
2-byte UTF-8 range make up 21 of the next 25 most commonly used languages on
the internet; of the top 25 languages, only Chinese, Japanese, Korean, and
Thai are in ranges that require 3 bytes in UTF-8 for the majority of their
characters
([https://en.wikipedia.org/wiki/Languages_used_on_the_Internet](https://en.wikipedia.org/wiki/Languages_used_on_the_Internet)).
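
The per-character byte counts are easy to check (a quick sketch):

    for ch in "a", "é", "ж", "中", "😀":
        print(ch, len(ch.encode("utf-8")))    # 1, 2, 2, 3, 4 bytes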

So for the vast majority of text that is processed, UTF-8 is dramatically more
efficient than a hypothetical UTF-24 or the real UTF-32. Now, you might say
"well, it should all compress away anyhow", but when dealing with text in RAM,
it is generally not compressed (and that would make handling it far more
complex than just using UTF-8), and memory capacity, latency, and bandwidth
can all be important.

Trying to deal with text as fixed-width characters is simply incorrect.
Almost every non-trivial means of handling text needs to support
arbitrary-length strings, needs to deal with clusters of more than one
codepoint as a single unit, and needs to iterate over the characters linearly
at least once (after which it can store byte offsets for random access later
on, as the sketch below shows). Dramatically increasing the storage
requirements of text for the non-benefit of being able to deal with
fixed-width codepoints just doesn't make sense.
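
A sketch of that last pattern: one linear pass over UTF-8 records byte
offsets, after which access via a recorded offset is constant-time.

    data = "añ😀b".encode("utf-8")
    offsets, i = [], 0
    while i < len(data):
        offsets.append(i)
        b = data[i]
        # The leading byte determines the length of this code point.
        i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    print(offsets)    # [0, 1, 3, 7]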

