
I don't understand the argument for "characters = grapheme clusters". From my perspective, there are a lot of different ways you'd want to iterate over a string. Grapheme cluster breaks, tailored grapheme cluster breaks, word breaks, line breaks, code points, code units… all of these make sense in some context. However, there are precious few times that I've wanted to iterate over grapheme clusters, so telling people that they should do that instead of something else doesn't make sense to me. (I mean, what problem is so common that we would want to iterate this way by default?)
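To make the "different ways to iterate" point concrete, here is a small Python sketch of the same string counted three ways. The sample string is my own choice for illustration. Python's standard library has no grapheme cluster segmentation, so that count is only stated in a comment, not computed.

```python
# One short string: "e" + COMBINING ACUTE ACCENT, then U+1F600 (an emoji).
s = "e\u0301\U0001F600"

# Code points: what Python's str iterates over.
code_points = [f"U+{ord(c):04X}" for c in s]

# UTF-8 code units: one per byte.
utf8_units = list(s.encode("utf-8"))

# UTF-16 code units: two bytes each, surrogate pairs count as two units.
utf16_unit_count = len(s.encode("utf-16-le")) // 2

print(len(code_points), len(utf8_units), utf16_unit_count)
# A UAX #29 segmenter would report 2 grapheme clusters here; that needs a
# third-party library and is not computed in this sketch.
```

Which of these numbers is the "length" depends entirely on what you are about to do with the string.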

For parsing, it often makes sense to iterate over code points or code units, since many languages are defined in terms of code points (and you can translate that to code units, for performance). XML 1.1, JavaScript, Haskell, etc... many languages are defined in terms of the underlying code points and their character classes in the Unicode standard. JSON and XML 1.0 are not everything.




We're pretty much on the same page here. When you want to slice a string (because you can only display or store a certain amount), or you want to do text selection and other cursor operations, you can't do it by code point. That's where you want to break at character boundaries, which are graphemes or grapheme clusters.
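A rough Python sketch of why slicing by code point goes wrong. The helper name `truncate_keep_marks` is hypothetical, and it is a deliberate simplification: it only avoids cutting a base character off from its combining marks. It does not implement the full UAX #29 grapheme cluster rules (ZWJ emoji sequences, Hangul jamo, regional indicators and so on).

```python
import unicodedata

def truncate_keep_marks(text: str, limit: int) -> str:
    """Truncate to at most `limit` code points without separating a base
    character from the combining marks that follow it (simplified rule)."""
    if len(text) <= limit:
        return text
    end = limit
    # text[end] is the first code point that would be dropped. If it is a
    # combining mark, the cut falls inside a cluster, so back up until the
    # first dropped code point is a base character.
    while end > 0 and unicodedata.combining(text[end]) != 0:
        end -= 1
    return text[:end]

s = "cafe\u0301s"                       # "cafés" spelled with a combining accent
naive = s[:4]                           # "cafe": the accent is silently lost
careful = truncate_keep_marks(s, 4)     # "caf": the whole cluster is dropped
```

The naive slice changes the last visible character; the careful one gives up one code point of budget instead.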

For parsing it's easier to just scan for a byte sequence in UTF-8 because you know what you're looking for ahead of time. If you're looking for a matching quote, brace, etc., you just need to scan for a single byte in your text stream. Adding a smart iterator to the process that moves to the start of each code point is not necessary and will slow things way down.
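This works because every byte of a multi-byte UTF-8 sequence is 0x80 or above, so an ASCII byte can never turn up in the middle of another character. A minimal Python sketch; `find_ascii_byte` and the sample text are made up for illustration, and escape handling inside the literal is left out.

```python
def find_ascii_byte(buf: bytes, target: str, start: int = 0) -> int:
    """Return the index of the first occurrence of an ASCII character in
    UTF-8 bytes at or after `start`, or -1. No decoding is needed."""
    needle = target.encode("ascii")
    if len(needle) != 1:
        raise ValueError("target must be a single ASCII character")
    return buf.find(needle, start)

# A string literal containing 2-, 3- and 4-byte characters.
text = 'x = "na\u00efve \u2603 \U0001D11E" + y'
buf = text.encode("utf-8")

open_q = find_ascii_byte(buf, '"')
close_q = find_ascii_byte(buf, '"', open_q + 1)
literal = buf[open_q + 1:close_q].decode("utf-8")

# Every byte of the non-ASCII characters is >= 0x80, so none can match '"'.
non_ascii_bytes_high = all(
    b >= 0x80 for ch in "\u00ef\u2603\U0001D11E" for b in ch.encode("utf-8")
)
```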

I just gave JSON and XML as examples, not as an exhaustive list. If you know the code points you are scanning for, it's way more efficient to scan for their code units. The state machine in a parser will be operating at the byte level anyway.

I have yet to see a good example where processing/iterating by code point is the better choice (other than the grapheme code of the Unicode library).


I'm not convinced that state machines will operate at the byte level. First of all, not all tokenizers are written using state machines. Even if that is the mathematical language we use to talk about parsers, it's still relatively common to make hand-written parsers. Secondly, if you take a Unicode-specified language and convert it to a state machine that operates on UTF-8, you can easily end up with an explosion in the number of possible states. Remember, this trick doesn't really change the size of the transition table, it just spreads it out among more states. On the other hand, you can get a lot more mileage out of using equivalency classes, as long as you're using something sensible like code points to begin with.

If you're curious, here's the V8 tokenizer header file:

https://github.com/v8/v8/blob/master/src/parsing/scanner.h

You can see that it works on an underlying UTF-16 code unit stream which is then composed into code points before tokenization. This extra step with UTF-16 is a quirk of JavaScript.

If you think that V8 shouldn't be processing by code point, feel free to explain that to them.


State machines would have to operate on the byte level. Otherwise each state would need 65536 entries. The trick to handle UTF-8 would be to have 0-127 run as a state machine and > 127 break out to functions to handle the various Unicode ranges that are valid for identifiers.
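Roughly what I mean, as a Python sketch. All the names here are hypothetical, the set of general categories accepted for non-ASCII identifier characters is an assumption rather than any particular language's rule, and the input is assumed to be valid UTF-8 with `pos` on a character boundary.

```python
import unicodedata

# Fast path: a 128-entry table answers "may this ASCII byte be part of an identifier?"
ASCII_IDENT = [chr(b).isalnum() or chr(b) == "_" for b in range(128)]

# Assumed category set for the slow path (letters, marks, digits, connectors).
IDENT_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd", "Pc"}

def is_ident_part_non_ascii(ch: str) -> bool:
    """Slow path: consult the Unicode database for one decoded character."""
    return unicodedata.category(ch) in IDENT_CATEGORIES

def scan_identifier(buf: bytes, pos: int) -> int:
    """Return the byte index just past the identifier starting at `pos`."""
    end = pos
    while end < len(buf):
        b = buf[end]
        if b < 0x80:
            if not ASCII_IDENT[b]:
                break
            end += 1
        else:
            # Lead byte gives the sequence length in valid UTF-8.
            n = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            if not is_ident_part_non_ascii(buf[end:end + n].decode("utf-8")):
                break
            end += n
    return end

src = "gr\u00f6\u00dfe_1 = 2".encode("utf-8")
ident = src[:scan_identifier(src, 0)].decode("utf-8")   # "größe_1"
```

The table stays at 128 entries, and the function call is only paid for when a byte above 127 shows up.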

For languages that only allow non-ASCII characters in string literals, a pure state machine would suffice.

Not sure why you're mentioning parsers. At that point you're dealing with tokens.

As for UTF-16, it's an ugly hack that never should have existed in the first place. Unfortunately the Unicode people had to fix their UCS-2 mistake.

Since JavaScript is standardised to be either UCS-2 or UTF-16, it probably made sense to make the scanner use UTF-16.


State machines don't have to operate on the byte level because the tables can use equivalency classes. This will often result in smaller and faster state machines than byte-level state machines, if your language uses Unicode character classes here and there.
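Here's a toy Python sketch of the idea. The class names, states and token kinds are all invented for illustration, and the classes are much cruder than what a real language spec would define. Each code point is first mapped to one of four classes, so every state only needs four table entries no matter how large the code point space is.

```python
import unicodedata

LETTER, DIGIT, SPACE, OTHER = range(4)          # equivalence classes
START, IN_IDENT, IN_NUMBER = range(3)           # DFA states

def char_class(ch: str) -> int:
    """Collapse any code point into one of four equivalence classes."""
    cat = unicodedata.category(ch)
    if cat.startswith("L") or ch == "_":
        return LETTER
    if cat == "Nd":
        return DIGIT
    if ch.isspace():
        return SPACE
    return OTHER

# TRANSITIONS[state][class] gives the next state; a missing entry ends the token.
TRANSITIONS = {
    START: {LETTER: IN_IDENT, DIGIT: IN_NUMBER},
    IN_IDENT: {LETTER: IN_IDENT, DIGIT: IN_IDENT},
    IN_NUMBER: {DIGIT: IN_NUMBER},
}

def tokenize(src: str) -> list:
    tokens, i = [], 0
    while i < len(src):
        state, j = START, i
        while j < len(src):
            nxt = TRANSITIONS[state].get(char_class(src[j]))
            if nxt is None:
                break
            state, j = nxt, j + 1
        if j == i:
            # No transition out of START: whitespace is skipped,
            # anything else becomes a one-character token.
            if char_class(src[i]) != SPACE:
                tokens.append(("punct", src[i]))
            i += 1
        else:
            tokens.append(("ident" if state == IN_IDENT else "number", src[i:j]))
            i = j
    return tokens

tokens = tokenize("gr\u00f6\u00dfe = 42")
```

The whole table is 3 states by 4 classes, and adding a new script to the LETTER class does not change its size.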


Looks like JavaScript source code is required to be processed as UTF-16:

ECMAScript source text is represented as a sequence of characters in the Unicode character encoding, version 3.0 or later. [...] ECMAScript source text is assumed to be a sequence of 16-bit code units for the purposes of this specification. [...] If an actual source text is encoded in a form other than 16-bit code units it must be processed as if it was first converted to UTF-16.


Right, but the UTF-16 is read code point by code point, not code unit by code unit. At that point, it might as well be UTF-8 or UTF-32.
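For anyone unfamiliar with that composition step, here's a minimal Python sketch. The function name is hypothetical and this is not V8's code; lone surrogates are simply passed through unchanged, which is one possible policy among several.

```python
import struct

def code_points_from_utf16(units: list) -> list:
    """Compose 16-bit code units into code points, pairing surrogates."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High and low surrogate each carry 10 bits of the code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code point, or a lone surrogate passed through as-is.
            out.append(u)
            i += 1
    return out

# "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), which needs a surrogate pair.
raw = "a\U0001D11E".encode("utf-16-le")
units = list(struct.unpack(f"<{len(raw) // 2}H", raw))
points = code_points_from_utf16(units)
```

Once the tokenizer sees `points` rather than `units`, the original encoding no longer matters to it.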


Walking over grapheme clusters is common in UIs, e.g. visually truncating a string.

Really I think you are arguing against the notion of "default iteration" altogether. As you say, the right type of iteration is context dependent, and it ought to be made explicit.


I'm not sure it is so common in UIs. Truncation is done by a single library function, so that's one case where it's used. Another case is for character wrapping, but that's fairly uncommon. I'm having trouble coming up with another case where it's used. Font shaping is done by a font shaping engine, which applies an enormous number of rules specific to the script in use. Text in a text editor isn't deleted according to grapheme cluster boundaries, and the text cursor doesn't fall on grapheme cluster boundaries either. These are all rules that change according to the script in use.



