My point was something like - if you want to send a bunch of small strings over ...

chaokunyang · 2024-05-08T16:23:02

Yes, we only use this encoding when it brings benefits. `wire format` is a good term for such cases, where the context is constrained to the current message only. So the data to compress is small, and many statistical methods won't work since it's a `wire format`: the stats themselves must be sent to peers, which would cost more space and CPU.

arcticbull · 2024-05-08T21:33:41

> For len() in bytes or code points and graphemes, which are ALL identical for ASCII, you can calculate it directly on the tagged pointer.

7-bit ASCII yeah, not 8-bit, which appears to be supported in tagged pointers per your original comment.

8-bit ASCII has fun characters like é (ASCII 130, 0x82) which may be 1 grapheme cluster -- but.

There are several ways to represent it in Unicode. You can do it as U+00E9 (NFC) or you can do it as U+0065 U+0301 (e plus the combining acute accent, NFD).

This means it's a byte length of 1 in 8-bit ASCII -- but 2 or 3 in UTF-8 depending on normalization form (since UTF-8 only has direct compatibility with the 7-bit plane). And either 2 or 4 in UTF-16.

Also, it means this string has a Unicode code point length of 1 or 2 depending on context and normalization form.
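
For anyone who wants to check those numbers, here is a minimal Python sketch. It assumes the "8-bit ASCII" in question is code page 437, where é is byte 0x82; the helper name `lengths` is made up for illustration.

```python
import unicodedata

def lengths(s: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for s."""
    return (
        len(s),                      # Python str length counts code points
        len(s.encode("utf-8")),
        len(s.encode("utf-16-le")),  # explicit endianness, so no BOM is added
    )

nfc = unicodedata.normalize("NFC", "\u00e9")  # U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")  # U+0065 U+0301

print("NFC:", lengths(nfc))
print("NFD:", lengths(nfd))

# Assumption: the 8-bit encoding is code page 437, where é is one byte (0x82).
print("cp437:", "\u00e9".encode("cp437"))
```

Both strings render as the same single grapheme, yet every length differs between the two forms.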

Again, asking the context-free len() of a string is a meaningless question.

> And you also have random access to bytes, code points, and graphemes in that special case.

You should never make this assumption. You should treat strings as opaque data blobs.

Unicode is just a billion edge cases in a trench coat ;)

chaokunyang · 2024-05-09T14:47:01

Meta string is used only for ASCII chars, so every char fits in a single byte and we can random-access them. But your reminder is a good one: I just found out that we don't check that the passed string is ASCII, although in our case we always pass ASCII strings.
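
A rough sketch of what such a guard could look like, in Python for brevity. This is not the project's actual code; `char_at` is a hypothetical helper, and it assumes rejecting non-ASCII input outright is acceptable.

```python
def char_at(s: str, index: int) -> str:
    """Random-access one character, valid only because s is pure 7-bit ASCII."""
    # Guard: byte index == code point index == grapheme index holds for ASCII only.
    if not s.isascii():
        raise ValueError("meta string must contain only 7-bit ASCII characters")
    data = s.encode("ascii")   # exactly one byte per character
    return chr(data[index])    # plain byte indexing is now safe

print(char_at("org.example.Foo", 4))
```

With the check in place, a string such as "café" raises instead of silently producing a wrong offset.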