
My point was something like

- if you want to send a bunch of small strings over the network, you can pick a variable length encoding even more compact than UTF-8

- if you want to pass and return a lot of strings as values, or store them in data structures like a List<Str>, then you can pick a variable length encoding even more compact than UTF-8. So you can compute on the entire List<Str> without indirections for the common case of small strings, with good cache locality

---

That doesn't really contradict anything you said -- the tagged pointer is in some sense a "wire format", and then you have to do a conversion to actually do something useful with the string.

Although sometimes you don't need conversions either. For len() in bytes or code points and graphemes, which are ALL identical for ASCII, you can calculate it directly on the tagged pointer.

And you also have random access to bytes, code points, and graphemes in that special case. A lot of the "work" happened when deciding that this encoding is valid for a particular string, which is nice.
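
To make the ASCII special case concrete, here is a minimal Python sketch of a small string packed into one 64-bit word. The layout (3 tag bits, 3 length bits, up to 7 payload bytes), the tag value, and all function names are assumptions for illustration, not the layout of any particular runtime. len() and indexing read the word directly, with no decoding step. One caveat on graphemes: under the default Unicode segmentation rules, CR LF counts as a single grapheme cluster, so the three counts agree only for ASCII text without that pair.

```python
from typing import Optional

# Hypothetical layout of one 64-bit word:
#   bits 0-2  : tag marking "small ASCII string"
#   bits 3-5  : length in bytes (0..7)
#   bits 6-61 : up to 7 ASCII bytes, first character in the lowest byte
TAG_BITS = 3
TAG_SMALL_STR = 0b101          # made-up tag value
LEN_BITS = 3
MAX_LEN = (1 << LEN_BITS) - 1  # 7
PAYLOAD_SHIFT = TAG_BITS + LEN_BITS


def pack_small_ascii(s: str) -> Optional[int]:
    """Return the packed word, or None if s does not qualify."""
    if len(s) > MAX_LEN or not s.isascii():
        return None  # caller falls back to a heap-allocated string
    word = TAG_SMALL_STR | (len(s) << TAG_BITS)
    for i, ch in enumerate(s):
        word |= ord(ch) << (PAYLOAD_SHIFT + 8 * i)
    return word


def small_len(word: int) -> int:
    """Length in bytes, which equals the code point count for ASCII."""
    return (word >> TAG_BITS) & MAX_LEN


def small_char_at(word: int, i: int) -> str:
    """Random access to character i without unpacking the whole string."""
    if not 0 <= i < small_len(word):
        raise IndexError(i)
    return chr((word >> (PAYLOAD_SHIFT + 8 * i)) & 0xFF)


w = pack_small_ascii("hello")
print(small_len(w))         # 5
print(small_char_at(w, 1))  # e
```

The validity check happens once, in pack_small_ascii; after that, every operation can rely on one byte per character.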




Yes, we only use this encoding when it brings benefits. `wire format` is a good term for such cases, where the context is constrained to the current message only. So the data available for compression is small, and many statistical methods won't work, since it's a `wire format`: the stats themselves would have to be sent to peers, which costs more space and CPU.


> For len() in bytes or code points and graphemes, which are ALL identical for ASCII, you can calculate it directly on the tagged pointer.

7-bit ASCII yeah, not 8-bit, which appears to be supported in tagged pointers per your original comment.

8-bit "extended ASCII" code pages have fun characters like é (130, 0x82 in code page 437; 0xE9 in Latin-1), which is 1 grapheme cluster -- but.

There are several ways to represent it in Unicode. You can do it as U+00E9 (NFC) or you can do it as U+0065 U+0301 (e plus the combining acute accent, NFD).

This means it has a byte length of 1 in an 8-bit code page -- but 2 or 3 in UTF-8 (since UTF-8 is only byte-compatible with 7-bit ASCII). And either 2 or 4 in UTF-16.

Also, it means this string has a Unicode code point length of 1 or 2 depending on context and normalization form.
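
A quick Python check of those numbers, using only the standard library. The helper name `lengths` is made up for this example, and UTF-16 is encoded without a byte order mark so that only the code units are counted.

```python
import unicodedata


def lengths(s: str) -> dict:
    """Count the same string three different ways."""
    return {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_bytes": len(s.encode("utf-16-le")),  # -le: no BOM added
    }


nfc = unicodedata.normalize("NFC", "\u00e9")  # U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")  # U+0065 U+0301

print(lengths(nfc))  # {'code_points': 1, 'utf8_bytes': 2, 'utf16_bytes': 2}
print(lengths(nfd))  # {'code_points': 2, 'utf8_bytes': 3, 'utf16_bytes': 4}

# Single-byte encodings do not even agree on which byte it is.
print(nfc.encode("cp437"))    # b'\x82'
print(nfc.encode("latin-1"))  # b'\xe9'
```

Both strings render as the same single grapheme, yet none of the other counts match.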

Again, asking the context-free len() of a string is a meaningless question.

> And you also have random access to bytes, code points, and graphemes in that special case.

You should never make this assumption. You should treat strings as opaque data blobs.

Unicode is just a billion edge cases in a trench coat ;)


Meta string is used only for ASCII chars, so every char fits in a single byte and we can just random access them. But your reminder is a good one: I just found out that we don't check that the passed string is ASCII, although in our case we always pass ASCII strings.
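
The missing check could be as small as this Python sketch. The function name `encode_ascii_meta` is hypothetical and this is not the project's actual code; it only shows rejecting non-ASCII input up front, so that one-byte-per-char indexing stays valid afterwards.

```python
def encode_ascii_meta(s: str) -> bytes:
    """Encode s as ASCII bytes, refusing anything outside the 7-bit range."""
    try:
        return s.encode("ascii")
    except UnicodeEncodeError as exc:
        # Fail loudly instead of silently producing a multi-byte encoding.
        raise ValueError(f"meta string must be ASCII, got {s!r}") from exc


data = encode_ascii_meta("field_name")
print(data[3])  # 108, the byte for 'l': byte index equals char index
```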



