Hacker News
Let’s Stop Ascribing Meaning to Code Points (manishearth.github.io)
132 points by geofft 130 days ago | 55 comments

Excellent post, thank you!

By the way, I've sometimes heard that it was the popularity of emoji that finally made American and Western European coders learn about the particulars of unicode. Is this just a jokingly told anecdote, or did emoji really help?

Emoji may have helped popularize support for Unicode characters outside of the Basic Multilingual Plane (the first 2¹⁶ code points in Unicode), because all but a few emoji fall outside of it. Before emoji, any code point outside the BMP tended to have a niche audience. When I wrote my master's thesis in 2009, support for non-BMP characters was still sketchy (although XeLaTeX and the PDF viewer Evince could handle them neatly on the Linux desktop).
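A quick Python sketch of the point about emoji living outside the BMP:

```python
# Most emoji sit outside the Basic Multilingual Plane (code points > U+FFFF),
# so they take four bytes in UTF-8 and a surrogate pair in UTF-16.
grinning = "\U0001F600"  # GRINNING FACE

assert ord(grinning) > 0xFFFF                  # beyond the BMP
assert len(grinning.encode("utf-8")) == 4      # four UTF-8 bytes
assert len(grinning.encode("utf-16-le")) == 4  # two 16-bit code units
```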

I doubt that emoji helped further popularize Unicode in Europe, because in my experience (anecdotal of course) developers were already well aware of the benefits of Unicode (and UTF-8 in particular for the web) before emoji got adopted from Japanese telephones and embraced en-masse. I suspect that in Europe the need for Unicode arose sooner. For Dutch for example; with the introduction of ISO-8859-15 in 1999 (subtly different from ISO-8859-1) you now often had two character encoding standards in mixed use for a single language, and when you had to render documents with names of people from various countries, things just got confusing (in the US diacritics tend to get dropped instead). UTF-8 offered a clean break from this.
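The ISO-8859-15 vs. ISO-8859-1 subtlety is easy to demonstrate in Python: the two encodings differ in only eight positions, so the same byte can decode to different characters.

```python
# Byte 0xA4 is the currency sign in ISO-8859-1 ("Latin-1") but the euro
# sign in ISO-8859-15 ("Latin-9") -- one of the eight positions where the
# two standards diverge, and exactly the kind of mixed-use confusion
# UTF-8 made a clean break from.
raw = b"\xa4"
assert raw.decode("iso-8859-1") == "\u00a4"   # ¤ CURRENCY SIGN
assert raw.decode("iso-8859-15") == "\u20ac"  # € EURO SIGN
```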

It seems plausible that emoji may have helped Americans break the remnants of the ASCII barrier though.

(where "niche audience" includes speakers of the most spoken language in the world, but I assume you're talking about the West specifically)

I was under the impression that for the average Chinese speaker the Basic Multilingual Plane sufficed. Chinese characters (as used in Chinese, Japanese, Korean, and Vietnamese) were certainly added beyond the BMP as well (the CJK Unified Ideographs Extensions), but to my knowledge these all tend to be rather rare (think characters from Buddhist sutras, linguistic research, historical use, et cetera, but also from names not yet provided for in the BMP).

Huh. IIRC they're a bit more common than that, but yes, not very common.

Or perhaps you're just talking about Chinese where I'm also considering Kanji and others. Which isn't from the most spoken language in the world, but still, top 10!

Actually for Japanese in particular the extended CJK ranges are of limited use (I'm more familiar with Japanese than Chinese), although they do contain the parts of JIS X 0213 that were still missing from Unicode (all of them rare characters) and some ideographical component parts that are useful for dictionary and educational software. With the latter you can for example use Unicode codepoints for the distinct components of a single Chinese character (e.g., 𠂉 as the top-right part of 族). That is a fairly specialised use though.

For Korean and Vietnamese the extra Chinese characters might apply to ancient names (geographical usually I would wager), but neither language has a lot of ideographs in modern usage.

It's certainly an interesting topic though.

Western European? I don't know - there are many languages in that part of the world which use non-ASCII characters: Germany (not sure if it's still the case), France, Austria, Italy, Spain... Anyway, I always thought Unicode ignorance is strictly an anglosphere thing - most of the rest of the world is very aware.

Western Europe uses more characters than ASCII, like umlauts (äöü), accents (àèé), €¢, German ß and a few more, but large parts are still comfortably stuck in a codepage-style mindset thinking "ISO Latin-1 is good enough". (Also known as, or confused with, ISO/IEC 8859-1, CP-1252 or "ANSI".)


Central European languages can more or less get by with Latin-1, although even there many characters are missing; iirc French, Czech, Dutch and Finnish are all affected. Special characters of many regional dialects are missing as well.

FWIW (more or less proper) Unicode support seems to be pretty much an implicit, hard requirement today for almost anything facing users.

Many government databases still just use some codepage and won't be changing anytime soon though. You can't get your name in a Swiss passport without first changing it into something ISO 8859-15 can represent.


I find it pretty accurate, but that's just anecdotal from how I've seen software I use change in that time period, and how I've seen attitudes of people online change.

I'm sure plenty of Western programmers were aware of this stuff, though. Just not enough.

Excellent post.

> Now, Rust is a systems programming language and it just wouldn’t do to have expensive grapheme segmentation operations all over your string defaults. I’m very happy that the expensive O(n) operations are all only possible with explicit acknowledgement of the cost. So I do think that going the Swift route would be counterproductive for Rust.

I'm not sure I understand the logic here. How is Rust exempt but not, say, Python? Just because it's a systems language? I get why you wouldn't make it the default in Rust, but the logic applies to all languages, doesn't it?

Aside: I was talking about this with a friend just yesterday. Is there any language that separates the concept of a string and a user-facing string? Specifically for localization purposes, the way gettext works.

eg. I wish that in Python you could do this:

    err = "internal_error"
    error_msg = t"Internal error!"
error_msg would be easy to pick up for translation, rather than having to do some kludgy import as _ and _("Internal error!"). Hm, now that I'm writing this here, I guess it's still not a type you'd want to operate on differently than just plain unicode strings.

> Just because it's a systems language? I get why you wouldn't make it the default in Rust, but the logic applies to all languages, doesn't it?

You deal with strings as bytes (opaque data) (edit2: didn't really mean bytes here. see child comments) more often than strings as text in systems programming, basically.

Also, you'll want to write stuff like fast parsers in Rust. For a parser you're usually doing ascii matching anyway.

Swift hides the actual encoding of the string from the user. That's not something we can do in Rust.

> I get why you wouldn't make it the default in Rust

That's all I'm saying? I'm totally for having EGC operations there, I don't want them in the default operations. "it just wouldn’t do to have expensive grapheme segmentation operations all over your string defaults"

Basically, Rust tries to keep costs explicit, unlike languages like Python. This would be against that philosophy.

Edit: I'd also like to point out that indexing by code point or grapheme is a rare operation in Rust anyway.

In addition to the Rust design making more sense for systems programming than Swift's, what Rust puts on the language-level is forward-compatible with future Unicode versions. A future version of Unicode could define combining behavior for currently unassigned code points, which means that iterating by grapheme cluster will change as Swift becomes aware of new Unicode versions.

> You deal with strings as bytes (opaque data) (edit2: didn't really mean bytes here. see child comments) more often than strings as text in systems programming, basically.

That's true in some sense [1] for unix-ish OSes, but doesn't hold on other operating systems.

[1] Even on unix-ish OSes it's pretty obvious that filenames are supposed to be encoded in something and aren't just uint8_t*. There's just no one telling you what that encoding is, so normally it's guessed (eg. $LANG). For filesystems that actually have a defined encoding (eg. NTFS\Win32, JFS, HFS+, APFS and the whole bunch of "non-unix OS FS") these are re-coded by the filesystem driver in the unix-ish OS. Eg. ntfs-3g transcodes filenames to UTF-8; cifs takes a "iocharset" option; and we of course have convmv(1).

The idea that "file names and generally everything in the OS is a bunch of bytes" is pretty much only shared by unix-ish systems.

Oh, that's not the sense in which I was using the term "bytes". Rust has different string types for file contents and path names, with their own (platform-dependent) restrictions.

In Swift you have access to the utf8 and utf16 views. For a string constructed from UTF8 the utf8 view is a view onto the underlying storage more or less. That said the String API is getting some revisions in Swift 4 to make it both faster and easier to use without sacrificing correctness.

Swift supports unicode in source code so its parser is unicode aware. I'm not sure there is a good case for parsing in a non-unicode-aware way.

If you're working with strings as opaque data why are you even using a String type?

Isn't the access to the views an iterator?

> I'm not sure there is a good case for parsing in a non-unicode-aware way

I'm not saying that you should parse in a non-unicode-aware way, I'm saying that most parsing code deals with ASCII delimiters, so grapheme clusters are a distraction (and will mess up parsing, since most formats do not consider EGCs. An HTML closing tag followed by an accent would totally break the parser otherwise). You rarely need to index, though.

Basically, in parsing, you're looking for characters which themselves are single code points. So it doesn't matter.
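A small Python illustration of why code point scanning is the right granularity here (the HTML-ish string is a made-up example):

```python
# A combining acute accent (U+0301) directly after "</b>" would glue onto
# the ">" if we segmented by grapheme cluster first; scanning code points
# finds the delimiter regardless.
s = "<b>e</b>\u0301"
assert s.index(">") == 2   # str.index walks code points, not EGCs
assert s[8] == "\u0301"    # the accent is its own code point
```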

> If you're working with strings as opaque data

Well, not entirely opaque :)

But the String type is basically "here is a valid piece of UTF8 data", and you can pass that around. That's a distinct enough purpose to have a String type, even if you don't mess with its innards.

IME the most used String APIs are those for interpolation, concatenation, and search, all of which don't need to be EGC aware (though search being EGC aware is sometimes a good thing). Slicing sometimes gets used in conjunction with the others.

Is the memory representation of Swift strings now documented as something that programmers can rely on for performance characteristics?

I read the Swift Book when Swift first came out and it bothered me a lot that the book didn't say what the memory representation of strings is. In contrast, Rust documentation is very clear about this stuff.

It's a tagged union of utf8 and utf16 IIRC. I had to look at the stdlib source when I was trying to figure this out.

> You deal with strings as bytes (opaque data) more often than strings as text in systems programming, basically.

Now you're really confusing me. Are you talking about simply byte strings vs. unicode strings? Most modern languages, including python, have that distinction. What am I missing?

No, no, Rust has unicode (utf8) strings (also bytestrings if you need them). "bytes" was a mischaracterization, sorry.

I'm saying that most string handling in Rust is just lugging them around and copying them and concatenating/interpolating them.

On top of that, the few times you do iterate through them, you're looking for specific characters or character sequences (e.g. while parsing), and EGCs aren't so important there. They can be, but usually aren't.

Basically, personally the only times I've seen .chars() being used is in parsery things (where your grammar is defined over code points, not EGCs, so using EGCs is counterproductive and wrong), in the Servo DOM (where the spec tells you to operate on chars), and code where you're just verifying that the string contains or does not contain code points of a certain type (e.g. whitespace).

None of this is ascribing meaning to the concept of a code point, it's just using them as a crutch to reason about things like equality. (Of course, equality in Unicode is a whole other can of worms).
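The "verify code points of a certain type" case mentioned above is a plain iteration pass; in Python it might look like this (has_whitespace is a made-up helper name):

```python
# Checking for whitespace code points needs neither indexing nor
# grapheme awareness -- just a forward scan.
def has_whitespace(s: str) -> bool:
    return any(c.isspace() for c in s)

assert has_whitespace("hello world")
assert not has_whitespace("hello\u0301")  # a combining accent is not whitespace
```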

> Is there any language that separates the concept of a string and a user-facing string?

Not quite, but Nim has a concept of a tainted string, which comes from outside your (web) application and has to be escaped or explicitly cast into a regular string (the Python web framework Django has something similar). You could use the same mechanism of "distinct types" to create a user-facing string vs. an internal string if you want. I used this in a little Unicode-related experiment to create a number-of-bytes and a number-of-chars type, and made my strings take these as indices, so I would never confuse them. Pretty neat language feature, I thought.

FYI: According to wikipedia: "Perl supported tainting from at least 1989 as the -T switch was included in Perl 3."


Yeah, I had a hunch that the concept goes back a bit :-)

Nim does have something new though: its "distinct" types are more general than just tainted strings, and can be used whenever you have the same raw data type but a different meaning (and want the compiler to prevent accidental mixing). Of course, Haskellers will say that's old and boring...

> error_msg would be easy to pick up for translation, rather than having to do some kludgy import as _ and _("Internal error!"). Hm, now that I'm writing this here, I guess it's still not a type you'd want to operate on differently than just plain unicode strings.

This is maybe ok for very simple cases, but doesn't work as soon as you need context (in Python this is either done by extra _ parameters or through special comments) or pluralization.

In Python I've seen the convention "user facing string" vs. 'internal string'. It doesn't behave differently of course and there's no type checking, but it can be useful for showing intent.

Ruby has separate strings and symbols. While you can express the same data in both forms, the syntax biases symbols toward simplicity.

> The main time you want to be able to index by code point is if you’re implementing algorithms defined in the unicode spec that operate on unicode strings (casefolding, segmentation, NFD/NFC). Most if not all of these algorithms operate on whole strings, so implementing them as an iteration pass is usually necessary anyway, so you don’t lose anything if you can’t do arbitrary code point indexing.

Another algorithm for Unicode strings which wants to index by code point is Unicode regular expression matching. I've heard that you do lose something here if you can't assume all code points encode to the same length. Unlike the other algorithms you mentioned, which only iterate forward, a regex engine may need to backtrack, depending on the implementation.

I once heard that complexity of implementing regex directly on UTF-8 is one reason why Python will not use UTF-8 internally.

If everything that can match in a regular expression is valid UTF-8, and everything the regular expression can generate is valid UTF-8, you can just run the regular expression on the byte string.

This requires that matches be valid UTF-8 substrings, only. That's easy for explicit strings, such as "abc". For more general forms, "." must match one rune, not one byte. But that's not hard.

Python's problem is that strings are random-access indexable by rune (in the Go sense). Python either needs a representation that's rune-indexable, or it needs to generate an index array for strings that need one. There's an argument for the second approach, because most loops don't need a random-access index. "for c in 'abcdef': ..." does not, for example.

> Python's problem is that strings are random-access indexable by rune

Yeah, one of my points in the blog post is that random-access rune (code point) indexing isn't very valuable. Python is stuck with it, but if anyone in the future needs to make a similar choice, I hope they don't use O(1) rune indexing as a reason unless they have a very specific use case where O(1) rune indexing actually matters.

(I, for one, am very happy with the "rune" terminology, "char" is way too overloaded. It took me a minute to understand why they did that in Go when I first picked it up, but when I realized the reason I found it brilliant)

One approach for a Python implementation would be to store strings as UTF-8, and generate an index array as needed. For most operations, you don't need an index. Concatenation doesn't need one. Ordinary iteration (for c in "hello" :) doesn't need one. Regular expressions don't need an index. That covers most of the common use cases.

I like "rune" as terminology. A rune is one Unicode code point. A grapheme is a sequence of runes which should not be split.
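The on-demand index idea can be sketched in a few lines of Python, exploiting the fact that UTF-8 continuation bytes are recognizable on their own:

```python
def byte_offsets(data: bytes) -> list:
    """Byte offset of each code point in UTF-8 data.

    UTF-8 continuation bytes all look like 0b10xxxxxx, so code point
    starts are exactly the bytes that don't match that pattern.
    """
    return [i for i, b in enumerate(data) if b & 0xC0 != 0x80]

data = "héllo\U0001F600".encode("utf-8")
idx = byte_offsets(data)
# Code point 1 ("é") starts at byte 1; code point 2 ("l") at byte 3.
assert data[idx[1]:idx[2]].decode("utf-8") == "é"
```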

Yeah, that could work. You don't even need a full index array, it can be a binary data structure (I am reminded of how skip lists work), e.g. you start off with just one index for the halfway point and fill in things afterwards. There are lots of optimizations you could do.

I'm a bit wary of this because I fear indexing in Python is currently used more than this scheme can handle (i.e. too much for it to be considered a win), if erroneously. I'm not too sure. I certainly have written bad python code using indexing in the past, but that's just me (and many years ago).

String indexing is pretty rare. Most of them are s[0] or s[:n], where you don't need an index.

Typically this is found when parsing something, eg.

    if line[0] == '#':
        continue  # skip comment lines

    prefixes = {...}
    for line in lines:
        prefix, remainder = line[:4], line[4:]
        if prefix not in prefixes:
            raise ValueError('Invalid prefix ' + prefix)
        parsed = prefixes[prefix](remainder)

Don't you need to index for `s[:n]`? You need to find the byte position of code point n. In current python this isn't a problem because the strings don't use a variable-width encoding, but we're talking about making Python use utf8 :)

It's also possible to chunk the index array: to store the byte-index of every 16th rune, say, and do a bit of cheap iteration afterward. I've had good success with this approach: it's fast, it's light on memory, and it preserves constant-time indexing by code point.
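A minimal Python sketch of that chunked scheme (the helper names and the chunk size of 16 are illustrative, not from any particular implementation):

```python
CHUNK = 16  # store the byte offset of every 16th code point

def build_checkpoints(data: bytes) -> list:
    """Byte offsets of code points 0, CHUNK, 2*CHUNK, ... in UTF-8 data."""
    offsets, n = [], 0
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:        # lead byte => start of a code point
            if n % CHUNK == 0:
                offsets.append(i)
            n += 1
    return offsets

def byte_index(data: bytes, checkpoints: list, cp_index: int) -> int:
    """Byte offset of code point cp_index: jump to the nearest checkpoint
    at or before it, then iterate at most CHUNK-1 code points forward."""
    i = checkpoints[cp_index // CHUNK]
    for _ in range(cp_index % CHUNK):
        i += 1
        while i < len(data) and data[i] & 0xC0 == 0x80:
            i += 1                  # skip continuation bytes
    return i

data = ("a\U0001F600" * 20).encode("utf-8")
cps = build_checkpoints(data)
# Code point 7 is an emoji; each "a" + emoji pair is 1 + 4 = 5 bytes.
assert byte_index(data, cps, 7) == 16
```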

Backtracking in a UTF-8 string is easy: when you need to remember a location, just record its byte offset. You have exactly the same issue in UTF-16, which is also variable-width and therefore not O(1)-indexable by code point. There are loads of regular expression engines which operate on both of these encodings, and I've never seen them go to any great pains to work around the lack of code point indexing.

Yes, you have exactly the same issue in UTF-16, I never said otherwise.

I agree that you never said otherwise, and I applaud you for your accuracy in not doing so. :-D

The reason I bring it up is that UTF-16 is a very commonly used internal encoding for strings, and regular expression engines for platforms which use UTF-16 internally are generally designed to work on that encoding without conversion to UTF-32. Examples that spring to mind are most JavaScript regular expression engines, the ICU regular expression engine, and java.util.regex. If variable-width encodings were so difficult to deal with then you would expect these to have run into some problems with it; however, as far as I can tell, they have not. This is evidence against variable-width encodings being difficult for regular expression engines to deal with.

(Sanity check for my argument above: If you read through https://swtch.com/~rsc/regexp/regexp2.html, which covers both backtracking and non-backtracking regex engines and includes code, none of the code examples would need more than trivial modification to deal with variable-width encodings.)

Note that JS isn't UTF-16, it's slightly different. In JS unpaired surrogates are allowed. String indexing is by code unit, so non-BMP code points are considered to be two "characters".

IIRC Java also allows unpaired surrogates? But does consider non-BMP codepoints to be single characters. I think. Idk.

So from JSs point of view it is a fixed-width encoding, one where all byte values are valid. `"\uD83D\uDC68".match(/\ud83d/)` works (U+D83D is a high surrogate, and "\uD83D\uDC68" is U+1F468 MAN), so the regex engine operates on code units, not code points.
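The surrogate pair in that example is easy to inspect from Python, where strings are code point sequences rather than code unit sequences:

```python
# U+1F468 MAN encodes to the surrogate pair D83D DC68 in UTF-16, which is
# why JS (indexing by 16-bit code unit) sees it as two "characters".
man = "\U0001F468"
assert man.encode("utf-16-be") == b"\xd8\x3d\xdc\x68"
assert len(man.encode("utf-16-be")) // 2 == 2  # two UTF-16 code units
assert len(man) == 1                           # one code point in Python
```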

I think for Java the regex engine would have to deal with it as a variable-width encoding, though.

Edit: Hacker News removes emoji from comments.

Java strings are sequences of 16-bit units and don't guarantee UTF-16 validity. There are newer standard library APIs that perform astral-aware operations, though.

So does regex matching break up paired surrogates? It seems like Java Characters can be from the higher planes. JS treats surrogates as their own kind of character, but I think Java only does this for unpaired surrogates, so the encoding is still multibyte. ICBW.

Java, like NT, was originally UCS-2 and moved to UTF-16 when it became clear that UCS-2 wouldn't suffice. APIs that operate on code units (anything involving "char"s [the type], CharSequences and so on) are legacy and, when used, are almost certainly used incorrectly. Note that newer APIs on String also use the word "char" but the type "int"; these actually operate on code points and not code units, i.e. these APIs are correct.

(Note that java.lang.Character is a char, so wrong. I think there is no CodePoint type or similar that wraps int.)

In Java regexps, the dot matches a code point or an unpaired surrogate, so if you have a proper surrogate pair, the dot matches the whole pair without splitting it. It does split grapheme clusters.
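For comparison, Python's re never faces the pair-splitting question at all, since Python 3 strings are sequences of code points rather than UTF-16 code units:

```python
import re

# "." matches one character; an astral character is a single code point
# in a Python 3 string, so there is no surrogate pair to split.
man = "\U0001F468"
m = re.fullmatch(".", man)
assert m is not None and m.group() == man
```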

It's unclear to me what you mean by "Java Characters". A Java char is a UTF-16 code unit, and the java.lang.Character boxes a char (16 bits unsigned). However, since Java 5, java.lang.Character has static methods for astral-aware UTF-16 operations where a code point is a Java int (32 bits signed).

Yeah, by Character I meant java.lang.Character.

> the dot matches the whole pair without splitting it

Yeah, so UTF16 in Java is being used as a variable-width encoding. Nice to know.

> Another algorithm for unicode strings which wants to index by code point is unicode regular expression matching.

Why? This is certainly not a requirement for at least some levels of Unicode support.

> I once heard that complexity of implementing regex directly on UTF-8 is one reason why Python will not use UTF-8 internally.

It's really not that bad. Both RE2 and Rust's regex engine operate on UTF-8 directly, and this is one of the reasons why these engines are fast. If your automaton is byte based, then you can do a lot of nice tricks in your DFA implementation. (In fact, it's a critical reason why ripgrep doesn't slow down when handling Unicode features, unlike GNU grep.)
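Python's re can also run on raw bytes, which gives a small taste of the byte-oriented approach (this is just an illustration of matching UTF-8 at the byte level, not of RE2's DFA construction):

```python
import re

# Encoding the pattern to UTF-8 lets a pure byte-level engine match
# non-ASCII text without ever decoding the haystack.
text = "naïve café".encode("utf-8")
pattern = re.compile(re.escape("café".encode("utf-8")))
m = pattern.search(text)
assert m is not None
assert m.group().decode("utf-8") == "café"
```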

For Rust at least, the trickier parts of generating the UTF-8 automaton are available for reuse in the utf8-ranges crate: https://docs.rs/utf8-ranges/1.0.0/utf8_ranges/ (Which is used in at least two regex implementations I've written.)

I'd say the complexity claim might be coming from some other component of the implementation, or something more related to how Python strings are generally represented, rather than it being a fundamental complexity of building a regex engine on UTF-8 directly.

There are many considerations of course and I might not have fully grasped your point, but in reference to your last paragraph:

By default RE2 matches directly against the UTF-8 encoding, see this neat state diagram:


In terms of backtracking engines, perl also stores strings internally as UTF-8 (on most platforms) and its regexp engine runs directly on them. It can also match eXtended grapheme clusters with \X.

Hm, interesting. You still don't need arbitrary indexing for that; just the ability to iterate backwards, but I can see how it would complicate things. I wonder if it could be implemented on utf8 without affecting performance much.

burntsushi's utf8-ranges library used by the Rust Regex lib probably makes this easier to deal with. I wonder how it performs in relation to the Python implementation (which I assume is in native code)

Searching for a UTF-8 substring can't find a misaligned match, because of the way UTF-8 is encoded: lead bytes are unambiguous. I had to write "rfind" in Rust, and here it is:

    fn rfind(s: &[&str], key: &str) -> Option<usize> {
        (0 .. s.len()).rev().find(|&i| s[i] == key)
    }
No need to explicitly worry about rune boundaries. This, like forward find, returns a byte index into a UTF-8 string. Explicit byte indices into UTF-8 strings are troublesome.
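The self-synchronization property can be checked directly in Python by brute-forcing every byte offset:

```python
# UTF-8 lead bytes (0xxxxxxx or 11xxxxxx) and continuation bytes
# (10xxxxxx) occupy disjoint ranges, so a byte-level search for a whole
# encoded character can never match at a misaligned offset inside
# another character.
data = "é\U0001F600é".encode("utf-8")
needle = "é".encode("utf-8")  # b"\xc3\xa9"
hits = [i for i in range(len(data)) if data[i:i + len(needle)] == needle]
assert hits == [0, 6]                              # only real boundaries
assert all(data[i] & 0xC0 != 0x80 for i in hits)   # every hit is a lead byte
```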

Incidentally, that function could be s.iter().rev().position(|ss| ss == key), or drop the `rev()` and use `.rposition(...)` if one wants indices from the start of the slice. Although I'm not sure how searching a slice of strings for a specific one is related to UTF-8: the &str's are used completely opaquely. Did you mean to be searching a &str itself?

> because of the way UTF-8 is encoded.

Yes, I know, I mention this in my blog post :)

Rust's indexing methods on strings actually do operate on byte indices. You can't do "foo"[2], but you can do "foo"[2..3], and it will panic if you are not on a boundary. Nobody uses this (like you say, it's troublesome) :P I've seen folks just break out the byte/char iterator whenever they need something more complicated, because usually they do need to iterate anyway.

Rust internally stores slices as byte indices (well, pointer + byte length), but the compiler enforces the validity of that.

Yep. On Apple platforms there is basically just ONE way to do this conveniently and correctly: an enumeration function on NSString that visits composed character sequences individually, as substrings. And, it was only added in OS 10.6, which shows how slow the evolution of correct Unicode processing can be.

And, there is still no completely reliable way to guess the cell/column width of a symbol just by looking at it, especially if your goal is to be consistent with any other interpreter of the string (like a text editor). Still heuristics at best for any case outside common tables like CJK.

The flip side to this is that you can render text several orders of magnitude faster if it's from the Latin subset. A fast path is extremely helpful if you have a lot of text on the screen.

I'm not sure if this has changed, but LibreOffice has inherited the most amazingly complicated and bizarre text layout system you can ever imagine.

How so?
