
> For len() in bytes or code points and graphemes, which are ALL identical for ASCII, you can calculate it directly on the tagged pointer.

For 7-bit ASCII, yeah, but not for 8-bit, which appears to be supported in tagged pointers per your original comment.

8-bit "extended ASCII" has fun characters like é (byte 130, 0x82, in code pages such as CP437), which may be 1 grapheme cluster, but there's a catch.

There are several ways to represent it in Unicode. You can do it as U+00E9 (the NFC form) or as U+0065 U+0301 (e followed by the combining acute accent, the NFD form).

This means it has a byte length of 1 in the 8-bit encoding, but 2 or 3 in UTF-8 (since UTF-8 is only directly compatible with the 7-bit range), and either 2 or 4 in UTF-16.

Also, it means this string has a Unicode code point length of 1 or 2 depending on context and normalization form.
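To make those numbers concrete, here is a rough Python sketch. The `measure` helper is just a name I made up for illustration, and I'm using CP437 as the example 8-bit code page.

```python
import unicodedata

# The same user-visible character "é" in its two normalization forms.
nfc = unicodedata.normalize("NFC", "\u00e9")  # one code point: U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")  # two code points: U+0065 U+0301


def measure(s):
    """Hypothetical helper: report several different 'lengths' of one string."""
    return {
        "code_points": len(s),                      # Python's len() counts code points
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_bytes": len(s.encode("utf-16-le")),  # little-endian, no BOM
    }


nfc_sizes = measure(nfc)
nfd_sizes = measure(nfd)

# In the 8-bit code page CP437 the precomposed character is a single byte.
cp437_bytes = nfc.encode("cp437")

print(nfc_sizes)
print(nfd_sizes)
print(cp437_bytes)
```

Both strings render as one grapheme cluster, yet every other count differs between them, which is why len() needs a stated unit and normalization form.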

Again, asking the context-free len() of a string is a meaningless question.

> And you also have random access to bytes, code points, and graphemes in that special case.

You should never make this assumption. You should treat strings as opaque data blobs.

Unicode is just a billion edge cases in a trench coat ;)




Meta string is used only for ASCII chars, so every char fits in a single byte and we can random-access them directly. But your reminder is a good one: I just found out that we don't check that the passed string is actually ASCII, although in our case we always pass ASCII strings.
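The missing check could look something like the sketch below. This is Python for illustration only, not the actual meta string code, and `char_at` is a hypothetical name. The idea is to reject non-ASCII input up front so that byte index, code point index, and grapheme index really do coincide.

```python
def char_at(s: str, index: int) -> str:
    """Hypothetical helper: random access that is only valid for ASCII input."""
    # str.isascii() (Python 3.7+) is True only if every code point is below 128.
    if not s.isascii():
        raise ValueError("expected an ASCII-only string")
    # For pure ASCII, one byte == one code point == one grapheme,
    # so indexing the encoded bytes is safe.
    return chr(s.encode("ascii")[index])
```

With the guard in place, `char_at("hello", 1)` returns `"e"`, while `char_at("café", 3)` raises `ValueError` instead of silently returning part of a multi-byte sequence.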



