
The Unicode standard describes in Annex 29 [1] how to properly split strings into grapheme clusters. And here [2] is a JavaScript implementation. This is a solved problem.

[1] http://www.unicode.org/reports/tr29/

[2] https://github.com/orling/grapheme-splitter
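
For reference, usage is roughly this (a sketch based on the library's README, so the exact package and method names are worth double-checking):

  var GraphemeSplitter = require('grapheme-splitter');
  var splitter = new GraphemeSplitter();

  // "e" followed by a combining acute accent is two code points but one grapheme cluster
  splitter.splitGraphemes('e\u0301'); // ['é']
  splitter.countGraphemes('e\u0301'); // 1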




This is most definitely not a solved problem, because graphemes (visual symbols) are a poor way to deal with Unicode in the real world. Pretty much all systems either deal with the length in bytes (if they're old-style C), in code units / byte pairs (if they're UTF-16 based, like Windows, Java and JavaScript), or in Unicode code points (if they're UTF-8 based, like every proper system should be). Dealing with the length in visual symbols is actually pretty much impossible in practice because databases won't let you define field lengths in graphemes.

The way things compose: bytes combine into code points (Unicode numbers), and code points combine into graphemes (visual symbols). In UTF-16, for legacy compatibility with UCS-2, code points decompose into code units (byte pairs), and high code points, which need a lot of bits to represent their number, need two code units (4 bytes) instead of one.
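
To make the layers concrete, here is one visible symbol, thumbs-up with a skin-tone modifier, taken apart in JavaScript (a sketch; the byte count assumes Node.js and its Buffer API):

  const s = '\u{1F44D}\u{1F3FD}';  // THUMBS UP SIGN + skin tone modifier
  Buffer.byteLength(s, 'utf8');    // 8 bytes in UTF-8
  s.length;                        // 4 UTF-16 code units (two surrogate pairs)
  [...s].length;                   // 2 code points
  // ...and 1 grapheme: a single visible thumbs-up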

Java and JavaScript are UTF-16 based, so they measure length in code units and not code points. An emoji code point can be a low or high number depending on when it was added. Low numbers can be stored in two bytes, high numbers need four bytes. So an emoji can have length 1 or 2 in UTF-16. However, when moving to the database it will typically be stored in UTF-8, and the field length will be code points, not code units. So, that emoji will have a length of 1 regardless of whether it is low or high. You don't notice this as a problem because app-level field length checks will return a bigger number than what the database perceives, so no field length limits are exceeded.
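
You can see the low/high difference directly in JavaScript (a sketch; U+263A is an old BMP emoji, U+1F600 an astral one):

  '\u263A'.length;         // 1: fits in a single UTF-16 code unit
  '\u{1F600}'.length;      // 2: stored as a surrogate pair
  [...'\u{1F600}'].length; // 1: counted in code points, as the database column would count it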

There isn't any such thing as "characters" in code. In documentation when they say "characters" usually they mean bytes, code units or code points. Almost never do they mean graphemes, which is intuitively what people think they mean. The bottom line is two-fold: (A) always understand what is meant in documentation by "length in characters", because it almost never means the intuitive thing, and (B) don't try to use graphemes as your unit of length, it won't work in practice.


> This is most definitely not a solved problem, because graphemes (visual symbols) are a poor way to deal with Unicode in the real world.

How do you think text editing controls work? Your cursor moves one grapheme cluster at a time, selections start and end at grapheme cluster boundaries, and pressing backspace once deletes the last grapheme cluster even if it took you several keystrokes to enter. Grapheme clusters are obviously useful and certainly not a poor way to deal with Unicode in the real world.

Sure, grapheme clusters are neither the most common way to talk about strings, nor are they the most useful one in all situations, but nobody claimed that. If you have to allocate storage, you of course use the size in bytes after encoding. If you translate between encodings, you may want to look at code points. The right tool for the job, and sometimes the right tool is grapheme clusters.

> There isn't any such thing as "characters" in code.

Sure, there is. Actually, characters exist only in code; they are not used in any field dealing with written language besides computing. A character is the smallest unit of text a computer system can address.


Backspace is typically not one grapheme at a time, though it is for emoji. For scripts such as Arabic, it typically deletes ḥarakāt when they are composed on top of a base character. For a bit more discussion of how I hope to handle this in xi-editor, as well as links to the logic in Android, see https://github.com/google/xi-editor/issues/159


Clarification: it is for some emoji. E.g. backspace on a family emoji will eliminate family members one by one (on most browsers and platforms, AFAICT), but flag emoji will be deleted as a group. IIRC, handling of multi-codepoint profession emoji is inconsistent.
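
For the curious, the code point make-up behind that behaviour looks like this (a sketch in JavaScript):

  // family: man, ZWJ, woman, ZWJ, girl: five code points rendered as one symbol
  [...'\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}'].length; // 5
  // flag: two regional-indicator code points rendered as one symbol
  [...'\u{1F1E9}\u{1F1EA}'].length; // 2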


> backspace on a family emoji will eliminate family members one by one.

Unexpectedly sinister.


I would point you to Swift's implementation of its "Character" type. Swift string handling is a model for how programming languages should approach Unicode characters and their complex combinations. The standard interface into all Swift strings is its "Character" type, which works exclusively with grapheme clusters.

[1] https://developer.apple.com/library/content/documentation/Sw...


And they are changing a bunch of it for Swift 4: https://github.com/apple/swift/blob/master/docs/StringManife...


A "code unit" exists in UTF-8 and UTF-32; they are not unique to UTF-16.[1] UTF-8's relationship with code points is approximately the same as UTF-16's, except that UTF-8 systems tend to understand code points better because if they didn't, things break a lot sooner, whereas they mostly work in UTF-16.

Your entire argument that graphemes are a poor way to deal with Unicode seems to be that current programming languages don't use graphemes, instead dealing in a mix of code units or points. But the article here shows a number of cases where that does break down, and the person you're responding to clearly points out that, for the cases covered in the article, graphemes are the way to go (and he's correct).

Graphemes aren't always the correct method (and I don't think your parent was advocating that), just like code units or code points aren't always the right way to count. It's highly dependent on the problem at hand. The bigger issue is that programming languages make the default something that's often wrong, when they probably ought to force the programmer to choose, and so, most code ends up buggy. Worse, some languages, like JavaScript, provide no tooling within their standard library for some of the various common ways of needing to deal with Unicode, such as code points.

[1]: http://unicode.org/glossary/#code_unit


How would you implement grapheme support if an unlimited number of code points can combine into a single grapheme? Designing an efficient storage solution for such text seems like a nightmare.

Any time you define an upper limit, someone will come up with more emoji that require a larger number of code points per grapheme.


> A "code unit" exists in UTF-8 and UTF-32; they are not unique to UTF-16.

Technically yes. But they are only "exposed" in UTF-16. In UTF-32 code points and code units are the same size, so you only have to deal with code points. In UTF-8 you only have to deal with code points and bytes. UTF-16 is unique in having something that is neither code point nor byte but sits in between.
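
JavaScript, being UTF-16 based, is a handy place to see that in-between layer exposed (a sketch):

  const s = '\u{1F600}';         // one code point, U+1F600
  s.charCodeAt(0).toString(16);  // 'd83d': a code unit (high surrogate)
  s.charCodeAt(1).toString(16);  // 'de00': a code unit (low surrogate)
  s.codePointAt(0).toString(16); // '1f600': the code point the two units encode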


That is certainly true if you only look at the word sizes at different layers. But any implementation will at least logically start with a sequence of bytes, then turn them into code units according to the encoding scheme, group code units into minimal well-formed code unit subsequences according to the encoding form, and finally turn them into code points.

While different layers may use words of the same size, there are still differences, for example in what is valid and what is not. For example, while U+00D800, the first high surrogate, is a perfectly fine code point, 0x0000D800 is not a valid UTF-32 code unit. And 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence.

So yes, while I agree that UTF-16 is special in the sense that one has to deal with 8-, 16- and 32-bit words, I don't think that one should dismiss the concept of code units for all encoding forms but UTF-16. There are enough subtle details between the different layers that the distinction is warranted. And that is actually something I really like about the Unicode standard: it is really precise and doesn't mix up things that are superficially the same.


> But any implementation will at least logically start with a sequence of bytes, then turn them into code units according to the encoding scheme, group code units into minimal well-formed code unit subsequences according to the encoding form, and finally turn them into code points.

Not at all. I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.

> While for example U+00D800 is a perfectly fine code point, the first high-surrogate, 0x0000D800 is not a valid UTF-32 code unit.

I thought that was an invalid code point. Where would I look to see the difference? Nevertheless I would expect most code to make no distinction between the invalidity of 0x0000D800 and 0x44444444, except perhaps to give a better error message.

> 0xC0 0xA0 is a perfectly fine pair of bytes, both are valid UTF-8 code units, and they could become the code point U+000020 if only 0xC0 0xA0 were not an invalid code unit subsequence.

If you say that they're correct code units then at what point do you distinguish bytes and code units? In practice almost nobody decodes UTF-8 with an understanding of code units, neither by that name nor any other name. They simply see bytes that correctly encode code points, and bytes that don't.

Especially if you say that C0 is a valid code unit despite it not appearing in any valid UTF-8 sequences.


> I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.

Well, that's probably because in UTF-8 the code unit is a byte :)

Quoting https://en.wikipedia.org/wiki/UTF-8:

> The encoding is variable-length and uses 8-bit code units.

By definition, a code unit is a bit sequence of a fixed size from which code points are formed. In UTF-8 you form code points using 8-bit bytes, therefore in UTF-8 the code unit is a byte. In UTF-16 it is a sequence of two bytes. In UTF-32 it is a sequence of four bytes.
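
To put numbers on it, here is the same astral code point under each encoding form (off the top of my head, worth double-checking against the standard):

  U+1F600 in UTF-8 : 4 code units of 8 bits  -> F0 9F 98 80
  U+1F600 in UTF-16: 2 code units of 16 bits -> D83D DE00
  U+1F600 in UTF-32: 1 code unit  of 32 bits -> 0001F600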


I said as much in my first comment, yes. I'm not sure if I'm missing something in your comment?

Code units may 'exist' in all three through the fiat of their definition, but only in UTF-16 do they have a visible function and require you to process an additional layer.


> I thought that was an invalid code point.

Surrogate codepoints are indeed valid codepoints. It's just that valid UTF-8 is not allowed to encode surrogate codepoints, so the space of codepoints supported by UTF-8 is actually a subset of all Unicode codepoints. This subset is known as the set of Unicode scalar values. ("All codepoints except for surrogates.")
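
A quick way to see the distinction (a sketch, assuming an environment that provides TextEncoder):

  const lone = '\uD800';            // a lone high surrogate: a code point, but not a scalar value
  lone.codePointAt(0).toString(16); // 'd800': JS strings will happily hold it
  new TextEncoder().encode(lone);   // Uint8Array [239, 191, 189]: the encoder substitutes U+FFFD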


Those points cannot be validly encoded in any format. I suppose you can argue that they are valid-but-unusable in an abstract sense, since the Unicode standard does not actually label any code points as valid/invalid, but if you were going to label any code points as invalid then those would be in the group.


You are certainly correct that it is common to not pay too much attention to what things are called in the specification, especially if you want to create a fast implementation. Logically you will still go through all the layers even if you operate on only one physical representation.

My admittedly quite limited experience with Unicode is from trying to exploit implementation bugs. And with that focus it is quite natural to pay close attention to the different layers in the specification in order to see where optimized implementations might miss edge cases.

And I am generally a big fan of staying close to the word of standards: if it does not cause unacceptable performance issues, I would always prefer to stick with the terminology of the standard even if it means that there will be transformations that are just the identity.

The distinction between code points and scalar values will for example become relevant if you implement Unicode metadata. There you can query metadata about surrogate code points even if a parser should never produce those code points.


> There isn't any such thing as "characters" in code. In documentation when they say "characters" usually they mean bytes, code units or code points. Almost never do they mean graphemes, which is intuitively what people think they mean. The bottom line is two-fold: (A) always understand what is meant in documentation by "length in characters",

This is because languages usually have a built-in char type.

> don't try to use graphemes as your unit of length, it won't work in practice.

Swift does this and it's a really good thing. Everything is in graphemes by default -- char segmentation, indexing, length, etc.

There are way too many problems caused by programmers interpreting "code point" as a segmentable unit of text and breaking so many other scripts, not to mention emoji.


> This is a solved problem.

Not … really. Yes, we "know" the solution, but the terrible APIs that make up so many languages' standard string types goad the programmer into choosing the wrong method or type.

JavaScript has — to an extent — the excuse of age. But the language still really (to my knowledge) lacks an effective way to deal with text that doesn't involve dragging in third-party libraries. You are not a high-level language if your standard library struggles with Unicode. Even recent additions to the language, such as String.prototype.padStart (the standard-library answer to leftPad), ignore Unicode (and, in that particular example, render the function mostly useless).
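
For instance, the padding problem looks like this (a sketch):

  '\u{1F600}'.padStart(3, '-'); // '-😀': only one dash is added, because '😀'.length is already 2
  'ab'.padStart(3, '-');        // '-ab': the width you actually intended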


> You are not a high-level language if your standard library struggles with Unicode

So C++, Lisp, Java, Python, Ruby, PHP, and JS are not high-level languages.

HN teaches me something new every day.


What would you say Java is missing? Sure, it does have the "oops, we implemented Unicode when they said we only needed 16 bits" problem, but unlike, say, JS, it actually handles astral plane characters well (e.g., the regex implementation actually says that . matches an astral plane code point rather than half of one).

It does have all the major Unicode annexes--normalization (java.text.Normalizer), grapheme clusters (java.text.BreakIterator), BIDI (java.text.Bidi), line breaking (java.text.BreakIterator), not to mention the Unicode script and character class tables (java.lang.Character). And, since Java 8, it does have a proper code point iterator over character sequences.


I stand corrected. Java 8 has everything you could expect.


Python 3 does pretty good.


It does better than most, though Python 3 lacks grapheme support in the standard library, requiring developers to use a library like uniseg. I.e. it "lacks an effective way to deal with text that doesn't involve dragging in third-party libraries", and is thus evidently not a "high-level language".


How does Ruby struggle with Unicode?


This is from a couple of weeks ago; there are a few things still broken, but what languages do have full support out of the box?

http://blog.honeybadger.io/ruby-s-unicode-support/



Swift


Go



Well, for one, I can't even write a portable Unicode string literal.

> "\xAA".split ''

That works on a platform where my default encoding is UTF-32, but not on one where it is UTF-8.


That is what I meant: there is an existing algorithm to do this, which matters because the author tried to come up with his own. That JavaScript fails to provide an implementation is, well, too bad, but this is of course a problem one may have to solve in any language.

And while other languages provide the necessary support at the language or standard library level, I would guess there are quite a few developers out there who are not even aware that what they are looking for is enumerating grapheme clusters. But now a few more know, and if they made a good language choice, it is a solved problem for them.


It's not a problem that comes up that often, to be fair. Many of the cases where you think you'd need to split a string like that, you have some library doing the work for you. One of the main purposes is to figure out where valid cursor locations are in text editors... but in JavaScript, you just put a text field in your web page and let the browser do the heavy lifting. Same with text rendering... hand it off to a library which does the right thing.


I didn't know perfect unicode support in the stdlib was a requirement for being a high-level language.


I'm the author. Thank you for sharing! I will check it out shortly.


> And here [2] is a JavaScript implementation.

Is it up to date?

The last commit to this repo was on July 16, 2015, and the code says it conforms to the 8.0 standard. But Unicode 9.0 came out in June 2016. The document in your link [1] indicates that there were changes in the text-segmentation rules in the 9.0 release. However, I can't say whether any of these affect the correctness of the code.



