
Let’s Stop Ascribing Meaning to Code Points - geofft
https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
======
GolDDranks
Excellent post, thank you!

By the way, I've sometimes heard that it was the popularity of emoji that
finally made American and Western European coders learn the particulars of
unicode. Is this just an anecdote told in jest, or did emoji really help?

~~~
klibertp
Western European? I don't know - there are many languages in that part of the
world which use some non-standard characters. Germany (not sure if it's still
the case), France, Austria, Italy, Spain... Anyway, I always thought UNICODE
ignorance is strictly an anglo-sphere thing - most of the rest of the world is
_very_ aware.

~~~
throwawayish
Central European languages can more or less get by with Latin-1, although
even there many characters are missing, IIRC in French, Czech, Dutch and
Finnish. Special characters of many regional dialects are missing as well.

FWIW (more or less proper) Unicode support seems to be pretty much an
implicit, hard requirement today for almost anything facing users.

~~~
1wd
Many government databases still just use some codepage and won't be changing
anytime soon though. You can't get your name in a Swiss passport without first
changing it into something ISO 8859-15 can represent.

[http://www.24heures.ch/suisse/Les-noms-de-l-Est-mutiles-lors-de-la-naturalisation/story/25231514](http://www.24heures.ch/suisse/Les-noms-de-l-Est-mutiles-lors-de-la-naturalisation/story/25231514)

------
scrollaway
Excellent post.

> _Now, Rust is a systems programming language and it just wouldn’t do to have
> expensive grapheme segmentation operations all over your string defaults.
> I’m very happy that the expensive O(n) operations are all only possible with
> explicit acknowledgement of the cost. So I do think that going the Swift
> route would be counterproductive for Rust._

I'm not sure I understand the logic here. How is Rust exempt but not, say,
Python? Just because it's a systems language? I get why you wouldn't make it
the default in Rust, but the logic applies to all languages, doesn't it?

Aside: I was talking about this with a friend just yesterday. Is there any
language that separates the concept of a string and a user-facing string?
Specifically for localization purposes, the way gettext works.

eg. I wish that in Python you could do this:

    
    
        err = "internal_error"
        msg = t"Internal error!"
    

`msg` would be easy to pick up for translation, rather than having to do
some kludgy import as _ and _("Internal error!"). Hm, now that I'm writing
this here, I guess it's still not a type you'd want to operate differently on
than just plain unicode strings.
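Python has no t"..." prefix, but a plain str subclass used as a marker gets close. A rough sketch of the idea — the `T` class, the `CATALOG` dict, and the `render` helper are all invented here for illustration, not an existing API:

```python
class T(str):
    """Hypothetical marker type: behaves exactly like str, but a tool
    scanning the source could collect T("...") literals for translation,
    the way xgettext collects _("...") calls."""
    __slots__ = ()

# Made-up translation catalog; real code would load a gettext .mo file.
CATALOG = {"Internal error!": "Interner Fehler!"}

def render(s):
    # Only user-facing strings get looked up in the catalog;
    # plain str values (identifiers, keys) pass through untouched.
    if isinstance(s, T):
        return CATALOG.get(s, str(s))
    return str(s)

err = "internal_error"      # machine-readable, never translated
msg = T("Internal error!")  # user-facing, eligible for translation
```

Since `T` is still a str, everything that works on plain strings keeps working; only `render` (at the UI boundary) treats the two differently.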

~~~
Manishearth
> Just because it's a systems language? I get why you wouldn't make it the
> default in Rust, but the logic applies to all languages, doesn't it?

You deal with strings as bytes (opaque data) (edit2: didn't really mean bytes
here. see child comments) more often than strings as text in systems
programming, basically.

Also, you'll want to write stuff like fast parsers in Rust. For a parser
you're usually doing ascii matching anyway.
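One reason byte-level ascii matching works on UTF-8 at all: ascii bytes never appear inside a multi-byte UTF-8 sequence, so a parser can look for ascii delimiters in the raw bytes without decoding. A minimal Python sketch (the `split_header` helper and the header format are made up for illustration):

```python
def split_header(buf: bytes):
    """Parse b"Name: value" out of a raw byte buffer by matching ascii
    delimiters directly. Safe on UTF-8 input because the b":" byte can
    never be a continuation byte of a multi-byte sequence."""
    colon = buf.index(b":")          # find the ascii delimiter
    name = buf[:colon]
    value = buf[colon + 1:].strip()  # strip ascii whitespace only
    return name, value
```

The value bytes can be decoded as UTF-8 later, only if and when the parser actually needs them as text.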

Swift _hides_ the actual encoding of the string from the user. That's not
something we can do in Rust.

> I get why you wouldn't make it the default in Rust

That's all I'm saying? I'm totally for having EGC operations there, I don't
want them in the _default_ operations. "it just wouldn’t do to have expensive
grapheme segmentation operations all over your string defaults"

Basically, Rust tries to keep costs explicit, unlike languages like Python.
This would be against that philosophy.

Edit: I'd also like to point out that indexing by code point or grapheme is a
rare operation in Rust _anyway_.

~~~
xenadu02
In Swift you have access to the utf8 and utf16 views. For a string constructed
from UTF8 the utf8 view is a view onto the underlying storage more or less.
That said the String API is getting some revisions in Swift 4 to make it both
faster and easier to use without sacrificing correctness.

Swift supports unicode in source code so its parser is unicode aware. I'm not
sure there is a good case for parsing in a non-unicode-aware way.

If you're working with strings as opaque data why are you even using a String
type?

~~~
hsivonen
Is the memory representation of Swift strings now documented as something that
programmers can rely on for performance characteristics?

I read the Swift Book when Swift first came out and it bothered me a lot that
the book didn't say what the memory representation of strings is. In contrast,
Rust documentation is very clear about this stuff.

~~~
Manishearth
It's a tagged union of utf8 and utf16 IIRC. I had to look at the stdlib source
when I was trying to figure this out.

------
sanxiyn
> The main time you want to be able to index by code point is if you’re
> implementing algorithms defined in the unicode spec that operate on unicode
> strings (casefolding, segmentation, NFD/NFC). Most if not all of these
> algorithms operate on whole strings, so implementing them as an iteration
> pass is usually necessary anyway, so you don’t lose anything if you can’t do
> arbitrary code point indexing.

Another algorithm for unicode strings which wants to index by code point is
unicode regular expression matching. I heard that you _do_ lose something here
if you can't assume all code points encode to the same length. Unlike the
other algorithms you mentioned, which only iterate forward, a regex may need
to backtrack, depending on the implementation.

I once heard that complexity of implementing regex directly on UTF-8 is one
reason why Python will not use UTF-8 internally.

~~~
Animats
If everything that can match in a regular expression is valid UTF-8, and
everything the regular expression can generate is valid UTF-8, you can just
run the regular expression on the byte string.

This requires that matches match only whole UTF-8 substrings. That's easy for
explicit strings, such as "abc". For more general forms, "." must match one
rune, not one byte. But that's not hard.
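The "." point can be shown directly in Python: on raw bytes a naive "." consumes one byte, but it can be replaced with an alternation that consumes exactly one UTF-8-encoded code point. The `UTF8_ANY` pattern below is a sketch that assumes the input is well-formed UTF-8:

```python
import re

# One UTF-8 encoded code point, as a byte-level alternation
# (assumes valid UTF-8; lead-byte ranges per the UTF-8 encoding rules):
UTF8_ANY = (
    rb"[\x00-\x7F]"                 # 1-byte sequence (ascii)
    rb"|[\xC2-\xDF][\x80-\xBF]"     # 2-byte sequence
    rb"|[\xE0-\xEF][\x80-\xBF]{2}"  # 3-byte sequence
    rb"|[\xF0-\xF4][\x80-\xBF]{3}"  # 4-byte sequence
)

data = "aéc".encode("utf-8")  # b'a\xc3\xa9c' — é is two bytes

naive = re.fullmatch(rb"a.c", data)                       # '.' eats one byte
rune_dot = re.fullmatch(rb"a(?:" + UTF8_ANY + rb")c", data)  # one code point
```

`naive` fails to match (the pattern covers three bytes, the data is four), while `rune_dot` matches, because its "." stands for one rune.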

Python's problem is that strings are random-access indexable by rune (in the
Go sense). Python either needs a representation that's rune-indexable, or it
needs to generate an index array for strings that need one. There's an
argument for the second approach, because most loops don't need a random-
access index. 'for i in "abcdef": ...' does not, for example.

~~~
Manishearth
> Python's problem is that strings are random-access indexable by rune

Yeah, one of my points in the blog post is that random-access rune (code
point) indexing isn't very valuable. Python is stuck with it, but if anyone in
the future needs to make a similar choice, I hope they don't use O(1) rune
indexing as a reason unless they have a very specific use case where O(1) rune
indexing actually matters.

(I, for one, am very happy with the "rune" terminology, "char" is way too
overloaded. It took me a minute to understand why they did that in Go when I
first picked it up, but when I realized the reason I found it brilliant)

~~~
Animats
One approach for a Python implementation would be to store strings as UTF-8,
and generate an index array as needed. For most operations, you don't need an
index. Concatenation doesn't need one. Ordinary iteration (for c in "hello" :)
doesn't need one. Regular expressions don't need an index. That covers most of
the common use cases.
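An illustrative sketch of that scheme (not how any Python implementation actually works): store the text as UTF-8 bytes, iterate without any index, and build the code-point-to-byte-offset table only the first time someone does random access.

```python
class Utf8String:
    """Store text as UTF-8 bytes; build a rune -> byte-offset index
    lazily, only when random access is first requested."""

    def __init__(self, text):
        self._bytes = text.encode("utf-8")
        self._index = None  # built on first __getitem__

    def __iter__(self):
        # Ordinary forward iteration decodes as it goes; no index needed.
        return iter(self._bytes.decode("utf-8"))

    def _build_index(self):
        offsets, i, b = [], 0, self._bytes
        while i < len(b):
            offsets.append(i)
            lead = b[i]  # valid UTF-8 assumed: i always lands on a lead byte
            if lead < 0x80:
                i += 1
            elif lead < 0xE0:
                i += 2
            elif lead < 0xF0:
                i += 3
            else:
                i += 4
        offsets.append(len(b))  # sentinel: end of the last rune
        self._index = offsets

    def __getitem__(self, k):
        if self._index is None:
            self._build_index()
        return self._bytes[self._index[k]:self._index[k + 1]].decode("utf-8")
```

Strings that are only concatenated, iterated, or regex-matched never pay for the index at all.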

I like "rune" as terminology. A rune is one Unicode code point. A grapheme is
a sequence of runes which should not be split.

~~~
Manishearth
Yeah, that could work. You don't even need a full index array, it can be a
binary datastructure (I am reminded of how skip lists work), e.g. you start
off with just one index for the halfway point and fill in things afterwards.
There are lots of optimizations you could do.

I'm a bit wary of this because I fear indexing in Python is currently used
more than this scheme can handle (i.e. too much for it to be considered a
win), if erroneously. I'm not too sure. I certainly have written bad python
code using indexing in the past, but that's just me (and many years ago).

~~~
throwawayish
String indexing is pretty rare. Most of them are s[0] or s[:n], where you
don't need an index.

Typically this is found when parsing something, eg.

    if line[0] == '#':
        # skip comment lines
        continue

Or

    prefixes = {...}
    for line in lines:
        prefix, remainder = line[:4], line[4:]
        if prefix not in prefixes:
            raise ValueError('Invalid prefix ' + prefix)
        parsed = prefixes[prefix](remainder)

~~~
Manishearth
Don't you need to index for `s[:n]`? You need to find the byte position of
code point n. In current python this isn't a problem because the strings don't
use a variable-width encoding, but we're talking about making Python use utf8
:)
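What that lookup costs on UTF-8 bytes can be sketched in a few lines: finding the byte offset of code point n means walking the buffer and skipping continuation bytes (every byte of the form 0b10xxxxxx), which is O(n) without an index. The function name here is my own:

```python
def byte_offset_of_codepoint(b: bytes, n: int) -> int:
    """Return the byte offset where the n-th code point of UTF-8 bytes b
    starts, by counting lead bytes (any byte that is not 0b10xxxxxx)."""
    count = 0
    for i, byte in enumerate(b):
        if byte & 0xC0 != 0x80:  # lead byte, i.e. start of a code point
            if count == n:
                return i
            count += 1
    if count == n:               # n == number of code points: end offset
        return len(b)
    raise IndexError(n)
```

So `s[:n]` on a UTF-8 representation is a linear scan, not pointer arithmetic, unless an index array like the one discussed above has been built.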

------
makecheck
Yep. On Apple platforms there is basically just ONE way to do this
conveniently _and_ correctly: an enumeration function on NSString that visits
composed character sequences individually, as substrings. And, it was only
added in OS 10.6, which shows how slow the evolution of correct Unicode
processing can be.

And, there is _still_ no completely reliable way to guess the cell/column
width of a symbol just by looking at it, especially if your goal is to be
consistent with any other interpreter of the string (like a text editor).
Still heuristics at best for any case outside common tables like CJK.

------
slededit
The flip side to this is that you can render text several orders of magnitude
faster if it's from the Latin subset. A fast path is extremely helpful if you
have a lot of text on the screen.

------
chris_wot
I'm not sure if this has changed, but LibreOffice has inherited the most
amazingly complicated and bizarre text layout system you can ever imagine.

~~~
std_throwaway
How so?

