
How Python does Unicode - scw
http://www.b-list.org/weblog/2017/sep/05/how-python-does-unicode/
======
camgunz
UCS-4 is essentially never the right choice. It wastes space and thus messes
up your cache. UCS-2 can be the right choice if the language you're encoding
uses a lot of non-Latin glyphs (e.g. East Asian languages), but it suffers from
the same problem as UCS-4. UTF-8 is a good default: for most strings it's very
compact, and for strings with a lot of multibyte codepoints it doesn't compare
too unfavorably with UTF-16.

Python 3 tried to have its cake and eat it too by choosing the most compact
encoding depending on the string, but in practice this wastes a lot of space.
You'll double (or heaven forbid _quadruple_) your string size because of a
single codepoint, and these codepoints are almost always a small percentage of
the string. That's actually why UTF-16 and UTF-8 exist.
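
A quick way to see the effect in CPython 3.3+ (a minimal sketch; exact byte
counts depend on interpreter version, build, and per-object overhead):

    import sys

    # PEP 393: a str is stored with the narrowest element width that can hold
    # its widest code point -- 1, 2, or 4 bytes per code point.
    ascii_only = "a" * 1000
    with_emoji = ascii_only + "\U0001F600"  # one emoji forces 4 bytes per code point

    print(sys.getsizeof(ascii_only))  # ~1000 bytes of payload plus object overhead
    print(sys.getsizeof(with_emoji))  # ~4004 bytes of payload plus object overhead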

It would have been better for strings to default to UTF-8, and to add an
optional encoding so the programmer can specify what kind of encoding to use.
As it is now, in order to use (for example) UTF-16 strings in Python you have
to keep them around as bytes, decode them to a string, perform string
operations, and reencode them to bytes again. Any benefit you get from using
UTF-16 vanishes the moment you need to operate on it like a string, in other
words.
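
A sketch of that round trip (the UTF-16 payload here is just an illustration):

    payload = "naïve café".encode("utf-16")  # pretend these bytes arrived from outside

    s = payload.decode("utf-16")   # 1. decode the bytes to a str
    s = s.upper()                  # 2. do the actual string work
    payload = s.encode("utf-16")   # 3. re-encode to get UTF-16 bytes back out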

I get that the idea was to maintain indexing via codepoint, but (again) in
practice that's not great: usually you want to index via grapheme -- if you
want to index at all.

A better solution is to allow programmers to specify string encoding and
default it to UTF-8. From that, there's a clear path to everything you'd want
to do.

~~~
jstimpfle
> I get that the idea was to maintain indexing via codepoint, but (again) in
> practice that's not great: usually you want to index via grapheme -- if you
> want to index at all.

I definitely need indexes, and I don't really care about graphemes. I actually
have only a vague idea what that is.

I write parsers typically by using a global string and lots of indices. The
important thing for me is to be able to extract characters and slices at given
positions, and to be able to say "parse error at line X character Y" where X
and Y are helpful to the user most of the time.

I would be absolutely fine with working in UTF-8 bytes only (and that would be
faster I guess), but there would be a more pressing need to recompute
character positions (as a code point or grapheme index) from byte offsets at
times.

There are more abstract parsing methods where parser subroutines are
implemented in a position agnostic way, but I'm very happy with my simple
method.

If everything works on graphemes instead of code points (as I think does
Perl6) I will be happy to use that, but it's not so important from a practical
standpoint.

~~~
Avernar
> I definitely need indexes

No you don't. You need iterators, which behave like pointers. Let's say you're
hundreds or thousands of characters into a string at the start of some token.
Now you want to scan from that position to the end of the token.

With indexes it works fast only if it's by code point. In a language that
properly supports graphemes, this would mean scanning from the beginning of
the string to get to that index.

With iterators it can start scanning from that position directly. Same speed
no matter where you are in the string. With indexes the larger your input the
slower your parse gets, and not in a linear way.

It's also super easy to get a slice using a start and end iterator. As for
line X character Y messages, you can't get that directly from an index: it
depends on how many newlines you've parsed, so indexing doesn't help there.

~~~
jstimpfle
Well, I could roll my own iterator which encapsulates a string and some
position information, but then I'd have to wrap a lot of different operations,
like advance, advance by n, compare two iterators by position, test for end
position, extract character, extract slice, etc.

And the code would get a lot noisier, while the only advantage I see is
grapheme support, which I have never needed so far. (And I hope graphemes are
actually designed with a similar sensibility for technical concerns as UTF-8,
where I can simply parse with indexes at the byte level, looking only
for ASCII characters, without headaches and with maximum performance.)

As for getting line/character from a byte or codepoint offset, that's no
problem if I do the calculation only in case of an error. The alternative
would be to do it on each advance, which again means ADT wrapping, thus line
noise and slower performance.
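
A minimal sketch of that deferred calculation over a UTF-8 byte buffer (the
function name and error policy are made up for illustration):

    def line_and_column(buf, offset):
        """Translate a byte offset in UTF-8 bytes into a 1-based (line, column).

        Only called when reporting an error, so the linear scan is acceptable.
        Column is counted in code points by decoding just the current line.
        """
        line = buf.count(b"\n", 0, offset) + 1
        line_start = buf.rfind(b"\n", 0, offset) + 1
        column = len(buf[line_start:offset].decode("utf-8", errors="replace")) + 1
        return line, column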

~~~
Avernar
I'm not advocating that the programmer needs to implement the iterators, but
that the language/runtime should have built-in support for them.

As for searching for ASCII, which is prevalent in parsing, the iterator
function to find the next specified character can do a low-level, fast byte
search. That's one of the benefits of UTF-8: searching for ASCII characters is
super fast.

You wouldn't have to compute the character position on each advance. Just keep
a beginning-of-line iterator that's updated every time you see a newline
character, and on error call a function that gives you the number of
characters between the current position iterator and the start-of-line
iterator.

Working with iterators is no more complex than working with indexes. But it's
the language that needs to provide them.

------
dguaraglia
My favorite story about Python's handling of Unicode was when one of my
coworkers did a hotfix for our Python website, wrote tests, confirmed
everything worked as expected... but right before committing and pushing to
production wrote a comment like:

# Apparently we expect the field to be in this format ¯\\_(ツ)_/¯

Right above the code he'd just fixed.

Of course, the moment we pushed the update it brought production down, because
the Python interpreter doesn't understand Unicode in source files unless you
specify which encoding you are using.

After that, "¯\\_(ツ)_/¯" became a synonym for his name on our HipChat server,
heh.

~~~
ubernostrum
This would be the case in Python 2, where source code files are assumed to be
ASCII-encoded unless there's an encoding comment at the top of the file.

In Python 3, source code files are assumed to be UTF-8.
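
For reference, the Python 2 escape hatch is the PEP 263 encoding declaration,
which must appear on the first or second line of the file:

    # -*- coding: utf-8 -*-
    # With this comment present, Python 2 decodes the source file as UTF-8,
    # so non-ASCII bytes in literals and comments no longer raise a SyntaxError.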

~~~
ninkendo
Interesting that Python 2 couldn't fix that in a hotfix/point release... UTF-8
is backwards compatible with ASCII so it shouldn't break anything if source
started being interpreted as UTF8. I'd be curious to see what their reasoning
is.

~~~
dguaraglia
I would imagine Python's approach to introducing new language features had a
lot to do with it. Having to go through the PEP system takes some time, and
changes like these tend to be reserved for minor-version releases. All in all,
I love the PEP system, it's such an open concept and I've been surprised by
the amount of quality proposals that get implemented. Wish Go had something
like it.

------
wrs
This was a pretty gutsy move on Python's part. The presence of a single emoji
in an English string will blow up memory usage for the whole string by 4x. And
because graphemes aren't 1:1 to code points, the O(1) indexing and length
operations you bought with that trade-off will _still_ confuse people who
don't understand Unicode.

~~~
ubernostrum
As I said in the article, I think adding yet more weirdness in the form of
quirks of the internal encoding (which could vary according to how the Python
interpreter was compiled!) is a bad thing to do on top of how much people seem
to struggle mentally just to get Unicode all on its own.

Though I also think the struggle is mostly due to people being stuck in an
everything-is-like-ASCII mindset, and though I didn't get into that, it's one
big reason why I think UTF-8 is generally the _wrong_ way to expose Unicode to
a programmer, since it lets them think they can keep that cherished "one byte
== one character" assumption right up until something breaks at 2AM on a
weekend.

Personally I'd like everyone to just actually learn at least the things about
Unicode that I went into here (such as why "one code point == one character"
is a wrong assumption), and I think that'd alleviate a lot of the pain. I also
avoided talking much about normalization, because too many people hear about
it and decide they can just normalize to NFKC and go back to assuming code
point/character equivalence post-normalization.

~~~
ekidd
> it's one big reason why I think UTF-8 is generally the wrong way to expose
> Unicode to a programmer, since it lets them think they can keep that
> cherished "one byte == one character" assumption right up until something
> breaks at 2AM on a weekend.

Unfortunately, as long as you believe that you can index into a Unicode
string, your code is going to break. The only question is how soon.

I actually like UTF-8 because it will break _very_ quickly, and force the
programmer to do the right thing. The first time you hit é or € or an emoji,
you'll have a multibyte character, and you'll need to deal with it.

All the other options will _also_ break, but later on:

\- If you use UTF-16, then é and € will work, but emoji will still result in
surrogate pairs.

\- If you use a 4-byte representation, then you'll be able to treat most emoji
as single characters. But then somebody will build é from two separate code
points as "e + U+0301 COMBINING ACUTE ACCENT", or you'll run into a flag or
skin color emoji, and once again, you're back at square zero.

You _can't_ really index Unicode characters like ASCII strings. Written
language is just too weird for that. But if you use UTF-8 (with a good API),
then you'll be forced to accept that "str[3]" is hopeless very quickly. It
helps a lot if your language has separate types for "byte" and "Unicode
codepoint", however, so you can't accidentally treat a single byte as a
character.
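
Each of those breakage points is easy to reproduce from a Python 3 prompt (the
characters chosen are just examples):

    >>> len("\u00e9".encode("utf-8")), len("\u20ac".encode("utf-8"))  # é and € are already multibyte in UTF-8
    (2, 3)
    >>> len("\U0001F389".encode("utf-16-le")) // 2   # an emoji becomes a surrogate pair in UTF-16
    2
    >>> len("e\u0301")   # é built from e + COMBINING ACUTE ACCENT is two code points even in a 4-byte representation
    2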

~~~
patrickthebold
I get the variable byte encodings. And I know that Unicode has things like
U+0301 as you say, and so code points are not the same as characters/glyphs.
But I don't understand why it was designed that way. Why is Unicode not simply
an enumeration of characters?

~~~
Avernar
One reason is because it would take a lot more code points to describe all the
possible combinations.

Take the country flag emoji. They're actually two separate code points. The 26
code points used are just special country-code letters A to Z. The pair of
letters is the country code and shows up as a flag. So just 26 code points
make all the flags in the world. Plus new ones can be added easily without
having to add more code points.

Another example is the new skin tone emoji. The new codes are just the colour
and are placed right after the existing emoji codes. Existing software just
shows the normal coloured emoji, but you may see a square box or question mark
symbol next to it.
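
Both mechanisms are visible from Python (a small illustration; the particular
flag and emoji are arbitrary):

    >>> flag = "\U0001F1E8\U0001F1E6"    # REGIONAL INDICATOR SYMBOLS C + A render as 🇨🇦
    >>> len(flag)                        # one visible flag, two code points
    2
    >>> thumbs = "\U0001F44D\U0001F3FD"  # THUMBS UP SIGN followed by a skin tone modifier
    >>> len(thumbs)                      # one visible emoji, two code points
    2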

~~~
coldtea
> _The pair of letters is the country code and shows up as a flag. So just 26
> codes to make all the flags in the world. Plus new ones can be added easily
> without having to add more code points. Another example is the new skin tone
> emoji._

Still not answering the question though.

For one, when the Unicode standard was originally designed it didn't have
emoji in it.

Second, if it was limitations to the arbitrary addition of thousands of BS
symbols like emoji that necessitate such a design, we could rather do without
emojis in unicode at all (or klingon or whatever).

So, the question is rather: why not a design that doesn't need "normalization"
and runes, code points, and all that...

Using less memory (like utf-8 allows) I guess is a valid concern.

~~~
Avernar
It didn't have emoji but it did have other combining characters. For some
languages it's feasible to normalize them to single code points, but for other
languages it would not be.

Plus, the fact that some visible characters are made up of many combining code
points means the number of single code points would be huge.

As to your second point, it seems to me to be a little closed-minded. The
whole point of a universal character set was that languages can be added to
it, whether they be textual, symbolic or pictographic.

~~~
coldtea
> _As to your second point, it seems to me to be a little closed-minded. The
> whole point of a universal character set was that languages can be added to
> it, whether they be textual, symbolic or pictographic._

Representing all languages is ok as a goal -- adding klingon and BS emojis not
so much (from a sanity perspective, if adding them meddled with having a
logical and simple representation of characters).

So, it comes down to "the fact that some visible characters are made up of
many combining code points means the number of single code points would be
huge" and "for some languages it's feasible to normalize them to single code
points, but for other languages it would not be".

Wouldn't 32 bits be enough for all possible valid combinations? I see e.g.
that: "The largest corpus of modern Chinese words is as listed in the Chinese
Hanyucidian (汉语辞典), with 370,000 words derived from 23,000 characters".

And how many combinations are there of stuff like Hangul? I see that's 11,172.
Accents in languages like Russian, Hungarian, Greek should be even easier.

Now, having each accented character as a separate code point might take some
lookup tables -- but we already require tons of complicated lookup tables for
string manipulation in UTF-8 implementations IIRC.

~~~
Avernar
You might be correct and 32 bits could have been enough but Unicode has
restricted code points to 21 bits. Why? Because of stupid UTF-16 and surrogate
pairs.
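
The arithmetic behind that ceiling (a worked example; the helper function is
just for illustration):

    # A UTF-16 surrogate pair carries 10 bits in the high surrogate (U+D800..U+DBFF)
    # and 10 bits in the low surrogate (U+DC00..U+DFFF), offset by 0x10000.
    def from_surrogates(high, low):
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

    print(hex(from_surrogates(0xDBFF, 0xDFFF)))  # 0x10ffff -- the largest value UTF-16 can reach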

I'm curious why you think that UTF-8 requires complicated lookup tables.

~~~
coldtea
> _I'm curious why you think that UTF-8 requires complicated lookup tables._

Because in the end it's still a Unicode encoding, and still has to deal with
BS like "equivalence", right?

Which is not mechanically encoded in the, err, encoding (e.g. all characters
with the same bit pattern there are equivalent) but needs external tables for
that.

~~~
Avernar
But that's the same for UTF-16 and UTF-32. That's why I was wondering why you
singled UTF-8 out, implying it needed extra handling.

~~~
coldtea
Nah, I didn't single it out; I asked why we don't have a 32-bit, fixed-size
code point, non-surrogate-pair-BS etc. encoding.

And I added that while this might need some lookup tables, we already have
those in UTF-8 too anyway (a non-fixed-width encoding).

So the reason I didn't mention UTF-16 and UTF-32 is because those are already
fixed-size to begin with (and increasingly less used nowadays except in
platforms stuck with them for legacy reasons) -- so the "competitor" encoding
would be UTF-8, not them.

------
Animats
Python took the obvious approach - they already had UTF-16 and UTF-32 builds,
so this was just making that mechanism dynamic.

Go and Rust expose UTF-8 at the byte level. This is something of a headache
and may result in invalid string slices. It basically punts the problem back
to the user.

Here's an alternative: Use UTF-8 as the internal representation, but don't
expose it to the user.

If you're iterating over a string one rune or one grapheme at a time, the
UTF-8 substructure is hidden from the user. Only if the user uses an explicit
numeric subscript do you need to know a rune's position in the string. When a
request by subscript comes in, scan the string and build an index of rune
subscript->byte position. This is expensive, but no worse than UTF-32 in space
usage or expansion to UTF-32 in time.

Optimizations:

\- Requests for s[0] to s[N], and s[-1] to s[-N], for small N, should be
handled by working forwards or backwards through the UTF-8. (Yes, you can back
up by rune in UTF-8 -- see the sketch after this list. That's one of the neat
features of the representation.)

\- Lookup functions such as "index" should return an opaque type which
represents the position into that string. If such an object is used as a
subscript, there's no need to build the index by rune. If you coerce this
opaque type into an integer, the index table has to be built. Adding or
subtracting small integers from this opaque type should be supported by
working backwards or forwards in the string.

\- Regular expression processing has to be UTF-8 aware. It shouldn't need an
index by rune.
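
A minimal sketch of the backing-up trick from the first optimization, assuming
a bytes buffer (names are illustrative):

    def prev_rune_start(buf, i):
        """Step back from byte index i to the start byte of the previous rune.

        UTF-8 continuation bytes all match 0b10xxxxxx, so skip them until a
        lead byte (or the start of the buffer) is reached.
        """
        i -= 1
        while i > 0 and (buf[i] & 0xC0) == 0x80:
            i -= 1
        return i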

This would maintain Python's existing semantics while reducing memory
consumption.

A performance measurement tool that finds all the places where an index by
rune has to be built would be useful. It's rare that you really need this, but
sometimes you do.

~~~
hsivonen
> Go and Rust expose UTF-8 at the byte level. This is something of a headache
> and may result in invalid string slices. It basically punts the problem back
> to the user.

In Go, yes. In Rust, no. UTF-8 in Go is garbage in, garbage out. Rust,
however, won't let you materialize an invalid &str without "unsafe".

~~~
hackits
> Go and Rust expose UTF-8 at the byte level.

Or you can take the C/C++ approach and have a character be 1 byte, 2 bytes, or
multi-byte. It's a constant pain in the ass in C/C++ to have to interface
between two libraries when one decided to use char and the other wchar_t!

~~~
hsivonen
The way the C and C++ committees approach Unicode is even worse than Python
breaking away from UTF-16 in the wrong direction (UTF-32 being the wrong
direction and UTF-8 being the right direction).

The first rule of reasonably happy C and C++ Unicode programming is not to use
wchar_t for any purpose other than immediate interaction with the Win32 API.

The second rule of reasonably happy C and C++ Unicode programming is not to
use the standard library facilities (which depend on the execution
environment) for text processing but to use some other library where the UTF-*
interpretation of inputs and outputs doesn't shift depending on the execution
environment or compilation environment.

~~~
hackits
This is where we run into a little bit of a problem. You have a char pointer
that can be multi-byte encoded (depending on the code page Windows is using).
It can also be UTF-8 encoded. Then you move on to Windows' wchar_t, which was
originally defined as UCS-2 and later redefined as UTF-16, due to surrogate
pairs.

So in the Windows world with COM/DCOM you're basically nudged into using
UTF-16 wchar_t or it becomes a hell of a lot of pain. So it is easier to
simply accept UTF-16 and convert everything -- UTF-8, UTF-32, code pages -- to
a single encoding standard.

~~~
a_t48
You could just wrap that pointer in a class that describes what it is -
ideally at the type level (Utf8String, etc). Each string class knows how to
convert from other string types, and any library calls get wrapped in a method
that is either templated on the string input type(s) or takes a BaseString*
and calls virtual conversion functions. Or force a manual call to convert each
time so that your fellow developers know when slow conversions are happening
for sure.

It is a crappy situation though. Pick where you want your pain point to be.

------
Avernar
I'm not a fan of how Python 3 stores Unicode strings internally. In my opinion
they should have gone with UTF-8. The extra scanning and conversion puts more
pressure on the processor and caches under load.

I agree that Python 2's Unicode handling is broken. That's why I just stored
UTF-8 in a normal string and avoided the whole mess. The only thing I have to
do is validate any input from the outside world is really UTF-8.

~~~
ubernostrum
Since the high-level API is supposed to let you treat a string as a sequence
of code points, a correct implementation (which Python didn't have until 3.3!)
would've imposed the overhead of conversion to something resembling a fixed-
width encoding whenever a programmer invoked certain operations.

And the vast majority of strings in real-world Python contain only code points
also present in latin-1, which means they can be stored in one byte per code
point with this approach. And for strings which can't be stored in one byte
per code point, you were similarly going to pay the price sooner or later.

~~~
Avernar
> Since the high-level API is supposed to let you treat a string as a sequence
> of code points,

I disagree with that premise. It should operate on grapheme clusters.
Operating on code points falls into the same trap as operating on bytes.

> a correct implementation (which Python didn't have until 3.3!) would've
> imposed the overhead of conversion to something resembling a fixed-width
> encoding whenever a programmer invoked certain operations.

Those operations should have been removed. Indexing is the big one that needs
a fixed-width internal representation for speed. Code could have been
rewritten to not require indexing. But mechanical translation from Python 2 to
3 was a goal, and because of that they couldn't radically change the Unicode
API for the better.

> And the vast majority of strings in real-world Python contain only code
> points also present in latin-1, which means they can be stored in one byte
> per code point with this approach. And for strings which can't be stored in
> one byte per code point, you were similarly going to pay the price sooner or
> later.

You're going to pay the price for 4 byte per codepoint strings quite often. A
single emoji will blow up a latin-1 string to 4 times the size.

------
hprotagonist
[http://bit.ly/unipain](http://bit.ly/unipain) is my go-to reference whenever
I get tripped up on what's going on with Unicode in Python.

It is significantly more sane in Python 3.3+.

------
mark-r
I've always been curious about how this change in 3.3 impacts the C/C++
interface. I don't really know where to look it up, and since I haven't yet
had to code a C++ library for Python I've had no burning need to answer the
question.

~~~
ubernostrum
The Python C API grew some new functions and constants which are aware of
what's going on and can tell you what encoding a particular Unicode object is
using, read from/write to it, etc. The pre-3.3 APIs have a lot of deprecations
in favor of the new API. If you want to use the new API on a Unicode string
created via the old API, you have to call PyUnicode_READY() on it first.

------
baby
Question: if you read a file, is there an algorithm that will make sure you
are parsing it with the right encoding?

~~~
pmyteh
Not reliably, no. You can detect if it's an _invalid_ string according to the
encoding you're currently using (value > 127 for ASCII, invalid surrogate pair
for UTF-16) but there are lots of byte sequences that produce valid (but
semantically meaningless) output in multiple encodings. To choose between them
programmatically requires your algorithm to _understand_ the meaning of the
string as well as be able to decode it, which might be possible in limited
domains, but is a very hard problem in general.
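
In Python, about the best you can do is check validity and then fall back to a
statistical guess (a sketch; the third-party chardet package is one common
guesser and is an assumption here):

    def sniff_text(data):
        # Definitive only in the negative: if it doesn't decode, it isn't UTF-8.
        try:
            return data.decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            pass
        # Anything beyond that is a guess based on byte statistics, not a guarantee.
        import chardet  # third-party: pip install chardet
        guess = chardet.detect(data)  # e.g. {'encoding': 'windows-1251', 'confidence': 0.87, ...}
        if guess["encoding"] is None:
            raise ValueError("no plausible encoding found")
        return data.decode(guess["encoding"], errors="replace"), guess["encoding"]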

------
yarrel
If this was phrased as a question it would be a trick one.

~~~
btym
"How Python does Unicode: Poorly."

~~~
scrollaway
Python 2 maybe. Python 3 does Unicode wonderfully well; I miss it whenever I'm
working with other languages.

~~~
Avernar
All Python 3 did was put a hard barrier between bytes and strings. That's it.

Missing is all the grapheme handling that languages that do Unicode strings
right have.
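
For comparison, grapheme-aware iteration in Python today means reaching for a
third-party library; a sketch using the regex package's \X (extended grapheme
cluster) pattern, assuming that package is installed:

    >>> import regex                # third-party: pip install regex
    >>> s = "che\u0301rie"          # 'chérie' spelled with a combining acute accent
    >>> regex.findall(r"\X", s)     # \X matches one grapheme cluster at a time
    ['c', 'h', 'é', 'r', 'i', 'e']
    >>> len(s)                      # the built-in view: seven code points
    7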

------
carapace
Unicode is a horrible scam, the worst thing to happen to digital language
representation. This is all just so much _turd polishing_.

(Also, that explanation of UTF-8 is crap. UTF-8 is _beautiful_ quite apart
from its utility, but you'd hardly know it from the article.)

I've said it before: Unicode is a conflation of a good idea and an impossible
idea. The good idea is a standard mapping from numbers to little pictures.
That's all ASCII was. The impossible idea is a digital code for _every way
humans write_. It's a form of digital cultural imperialism.

Unicode Consortium et al. are absurdly arrogant.

~~~
simonh
Critical rants that don't suggest a better alternative, or describe what a
better alternative might look like even in outline, are rarely informative or
persuasive.

~~~
carapace
Step One: Admit there's a problem.

I heard, "Tell me more about what you think would be better." Here goes:

For written languages that are well-served by a simple sequence of symbols
(English, etc.) there is no problem: a catalog of the mappings from numbers to
pictures is fine is all that is required. Put them in a sequence (anoint UTF-8
as the One True Encoding) and you're good-to-go.

For languages that are NOT well-served by this simple abstraction the first
thing to do (assuming you have the requisite breadth and depth of linguistic
knowledge) is to figure out _simple formal systems_ that _do_ abstract the
languages in question. Then determine equivalence classes and standardize the
formal systems.

Let the structure of the language abstraction be a "first-class" entity that
has reference implementations. Instead of adding weird modifiers and other
_dynamic behavior_ to the code, let them be actual simple DSLs whose output is
the proper graphics.

Human languages are like a superset of what computers can represent.

Here's the Unicode Standard[1] on Arabic:

> The basic set of Arabic letters is well defined. Each letter receives only
> one Unicode character value in the basic Arabic block, no matter how many
> different contextual appearances it may exhibit in text. Each Arabic letter
> in the Unicode Standard may be said to represent the inherent semantic
> identity of the letter. A word is spelled as a sequence of these letters.
> The representative glyph shown in the Unicode character chart for an Arabic
> letter is usually the form of the letter when standing by itself. It is
> simply used to distinguish and identify the character in the code charts and
> does not restrict the glyphs used to represent it.

They baldly admit that Unicode is not good for drawing Arabic. I find the
phrase "the inherent semantic identity of the letter" to be particularly rich.
It's nearly mysticism.

If it is inconvenient to try to represent a language in terms of a sequence of
symbols, then let's represent it as a (simple) program that renders the
language correctly, which allows us to shoehorn non-linear behavior into a
sequence of symbols.

If you think about it, _this is already what Unicode is doing_ with modifiers
and such. If you read further in the Unicode Standard doc I quoted above
you'll see that they basically _do_ create a kind of DSL for dealing with
Arabic.

I'm saying: make it explicit.

Don't try to pretend that Unicode is one big standard for human languages.
Admit that the "space" of writing systems is way bigger and more involved than
Latin et. al. Study the problem of representing writing in a computer as a
first-class issue. Publish reference implementations of code that can handle
each kind of writing system _along with_ the catalog of numbered pictures.

From the Unicode Standard again:

> The Arabic script is cursive, even in its printed form. As a result, the
> same letter may be written in different forms depending on how it joins with
> its neighbors. Vowels and various other marks may be written as combining
> marks called tashkil, which are applied to consonantal base letters. In
> normal writing, however, these marks are omitted.

Computer systems that are adapted to English are not going to work for Arabic.
I'd love to use a language _simpler than PostScript_ to draw Arabic! Unicode
strings are _not_ that language.

Consider the "Base-4 fractions in Telugu"
[https://blog.plover.com/math/telugu.html](https://blog.plover.com/math/telugu.html)

The fact that we have a way to represent the graphics ౦౼౽౾౸౹౺౻ is _great_! But
any software that wants to _use them properly_ will require some code to
translate to and from numbers in the computer to Telugu sequences of those
graphics.

Let that be part of "Unicode" and I'll shut up. In the meantime, I feel like
it's a huge scam and a kind of cultural imperialism from us hacker types to
the folks who are late to the party and for whom ASCII++ isn't going to really
cut it.

To sum up: I think the thing that replaces Unicode for dealing with human
languages in digital form should:

A.) Be created by linguists with help from computer folks, not by computer
folks with some nagging from linguists (apologies to the linguist/computer
folk who actually did the stuff.)

B.) We should clearly state the _problems_ first: What are the ways that human
languages are written down?

C.) Write specific DSLs for each _kind_ of writing. Publish reference
implementations.

I think that's it. Are you informed? Persuaded even? Entertained at least? ;-)

[1]
[http://www.unicode.org/versions/Unicode9.0.0/ch09.pdf](http://www.unicode.org/versions/Unicode9.0.0/ch09.pdf)

~~~
simonh
That's a really good explanation of your position and reasons for it, thank
you.

> They baldly admit that Unicode is not good for drawing Arabic... I'd love to
use a language simpler than PostScript to draw Arabic! Unicode strings are not
that language.

Unicode isn't good for drawing anything. Unicode is not intended to, and does
not try to, encode how a text should be displayed. At all, even slightly. This
is the root of my disagreement with your post. You're claiming it can't
accurately render the appearance of text, but that simply isn't its purpose.
It is purely and only about encoding the graphemes. Glyphs are what fonts and
display technologies like PostScript are for, not Unicode.

You could argue that it should do that, perhaps Unicode should be a vector
drawing language or something, but it's hard to see how that would make it
useful for text processing that does concern itself with graphemes and
grapheme like units. Unless the display oriented system you want contained
within it a grapheme encoding system like Unicode to facilitate that - but
then why not work the other way around and use Unicode for that and build a
display system on top of Unicode to address your concerns?

I think trying to have your cake and eat it with a family of distinct DSLs
would be problematic. Text processing is bad enough, but how would you process
the content of a string that is actually a DSL? With Unicode it's possible to
write a library that can process text in any script, even ones not in the
standard yet, but if text could consist of computer code in any one of
thousands of different domain specific languages, how would you ever be able
to write one piece of code to work with all of them and all possible future
permutations? Finally if your DSL is producing display output, how does that
work with fonts? What if you want to vary the appearance of the output, how do
you apply that to the encoding output? It just seems that this approach
produces an enormous monolithic super-complex rabbit hole with no bottom in
sight.

~~~
carapace
> That's a really good explanation of your position and reasons for it, thanks
> you.

Cheers, I've had time to think and some sleep. I apologize to you and the
people I've offended with my cranky trollish manner.

> Unicode is not intended to, or try to encode how a text should be displayed.

This made me realize that "text" traditionally is exactly language that is
displayed somehow. The whole concept of storing writing as digital bits is
metaphysical. Barely so for e.g. English, but quite a lot for e.g. Arabic.

> [Unicode] is purely and only about encoding the graphemes.

If it's just a catalog mapping numbers to little pictures (technically to
collections, or families, of glyphs, or even to non-specific heuristics for
deciding if a graphical structure counts as a glyph for a grapheme [1]) then
I'll shut up. But what about the modifiers and stuff?

Maybe I _am_ being unfair to Unicode. I don't want to deny or denigrate the
cool and useful things it actually does do. As I said I think it's a
combination of a good idea (encoding graphemes) with an impossible idea
(encoding written human languages). If Unicode _isn't_ the latter then I've
been shouting at the wrong cloud!

\- - - -

Here's what I'm trying to say: Imagine a conceptual "space" with ASCII on one
side and PostScript on the other. In between there's a countably infinite set
of formalisms that can describe _and render_ human languages. From this point
of view, the Unicode standard is a small part of that domain but it is
absorbing (in my opinion) so much of the available time and attention that
other potentially more-useful regions of the domain are completely neglected.

\- - - -

So, yeah, I think we should study languages and writing systems and
computerize them carefully with native speakers and writers and linguistic
experts in the room. And I think we would need what are in effect DSLs for
each kind of writing system. (Not every language, but rather every _kind_ of
way that languages are written down.)

> how would you process the content of a string that is actually a DSL

Parse it to a data-structure, the simplest that will suffice for the
language's structure. Work with it using defined functions (API). This is what
we do already but the fact that English could be represented as array<char>
reasonably well tends to obscure it.

string_value.split()

Or better yet:

    >>> s = "What is the type of text?"
    >>> s.title()
    'What Is The Type Of Text?'

> With Unicode it's possible to write a library that can process text in any
> script

That seems like it's true but I don't think it is true in practice. In your
reply to mjevans elsewhere in this thread,

> You can't determine [the correct way of connecting the characters] purely
> from Unicode, you have to also know the conventions used in writing Arabic
> script. However Unicode is not intended to encode such conventions.

And you point out that Unicode won't help you properly support cut-and-paste
for Arabic. So you can't process text using Unicode if that text is Arabic. In
fact, there may not _be_ "text" in Arabic the way there is in English! There
is written Arabic but not textual Arabic. In other words, Unicode may well be
engaged in _creating_ the textual form of Arabic (and other languages.)

> any one of thousands of different domain specific languages

I think there would be less than a hundred distinct formalisms that together
could capture the ways we have come up with to write, perhaps less than a
dozen, but I wouldn't want to bet on it.

> how would you ever be able to write one piece of code to work with all of
> them and all possible future permutations?

 _Maybe you can 't._

But if it's possible it will be by figuring out the _type_ of text, which
means exactly to figure out the set of functions that make sense on text. At
which point your code can use those functions (the API of the TextType) to
abstract over text. Like the str.title() method. Does that even make sense in
Chinese or Arabic?

The comment by int_19h in this thread speaks to this point really well:

> It's not about encodings at all, actually. It's about the API that is
> presented to the programmer.

> And the way you take it all into account is by refusing to accept any
> defaults. So, for example, a string type should not have a "length"
> operation at all. It should have "length in code points", "length in
> graphemes" etc operations. And maybe, if you do decide to expose UTF-8
> (which I think is a bad idea) - "length in bytes". But every time someone
> talks about the length, they should be forced to specify what they want (and
> hence think about why they actually need it).

> Similarly, strings shouldn't support simple indexing - at all. They should
> support operations like "nth codepoint", "nth grapheme" etc. Again, forcing
> the programmer to decide every time, and to think about the implications of
> those decisions.

> It wouldn't solve all problems, of course. But I bet it would reduce them
> significantly, because wrong assumptions about strings are the most common
> source of problems.

What you're asking for is the base type for "text" for all languages, the
_ur-basestring_, if you will. (It may not exist.)

> Finally if your DSL is producing display output, how does that work with
> fonts? What if you want to vary the appearance of the output, how do you
> apply that to the encoding output? It just seems that this approach produces
> an enormous monolithic super-complex rabbit hole with no bottom in sight.

Well again, computerized text is a new thing under the sun, different from
writing, which has been happening all over the world for thousands of years
(cf. Rongorongo[2]). Separating the "text" from the written form of the text
(the display) is a new and metaphysical thing to do. For languages like
English we get pretty far with encoding the Alphabet and some punctuation
marks and putting them in a row. We completely punted on capitalization,
though: we pretend that 'a' and 'A' are two different things. Typefaces can be
abstracted from the stream of encoded bytes/characters and treated as metadata.
If you want to include it in a digital document you immediately have to define
a DSL (Rich Text Format for example) to shoehorn the metadata back into the
byte stream. Complications ensue.

For some languages (e.g. Arabic) _it may not make sense to abstract the
display of the text from the text_. (Again, writing is exactly _display_. It
is literally (no pun intended) the act of displaying language.) You _have_ to
include metadata in addition to the graphemes in order to recreate the correct
display of the text, so you _have_ to have some kind of DSL for the task.

As I said above, I don't think there are more than one or two dozen truly
different ways of writing. A set of DSLs (perhaps not dissimilar to the
generative L-Systems that can produce myriad realistic plant-like images from
a small set of operations) could presumably model those ways of writing.

Unicode was a start on computerization of written languages. I think an
approach that treats each kind of writing system as a first-class object of
study in its own right will give us standard models for dealing with text in
each kind in digital form. We should strive for computerized writing systems
that are "as simple as possible, but no simpler." And, yes, it seems to me
that some of them will have to include producing display output.

[1] DuckDuckGo image search for "letter A"
[https://duckduckgo.com/?q=letter+a&t=ffsb&atb=v60-2_b&iax=1&...](https://duckduckgo.com/?q=letter+a&t=ffsb&atb=v60-2_b&iax=1&ia=images)

[2]
[https://en.wikipedia.org/wiki/Rongorongo](https://en.wikipedia.org/wiki/Rongorongo)

\- - - -

Here's my "Cartoon History of Unicode":

    
    
        1. ASCII exists
        2. Europe does too!  Extend ASCII with the funky umlauts or whatever.
        3. Oh shit! Japan! Mojibake!
        4. I know! Let's use *sixteen* bits!  That'll solve everything.
        5. What do you mean Chinese is different from Japanese?
        6. WTF Arabic!?
        7. Boy there sure are a lot of graphemes.  Gotta collect 'em all.
        8. PIZZA SLICE
        9. POOP
    	

At which point we reach "peak internet" and Doge appears to say "wow".

