We tried Huffman first, but we can't know in advance which strings will be serialized. With Huffman, we would have to collect most strings up front and compute a static Huffman tree. Otherwise, we would have to send the Huffman stats to the peer, which would cost even more.
Good suggestion, I was planning it as an optional feature. We can crawl some GitHub repos, extract most types with their class names, paths, and field names, and use that data as the corpus.
> This is vaguely similar to how data is encoded for QR codes.
Doesn't QR use variable static encodings (alphabets)? E.g. mode 1 is numeric, mode 2 is uppercase alphanumeric, mode 3 is binary, and mode 4 is JIS (Japanese).
QR codes use a series of (mode, length, characters) segments - a given code can switch between modes. I think for QR codes the length encoding is mode-specific.
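As a rough illustration of the (mode, length, characters) layout, here is a minimal sketch that counts the bits in one segment. It assumes QR versions 1 to 9, where the character count field is 10 bits for numeric, 9 for alphanumeric, and 8 for byte mode. The function names are made up for this example, and Kanji mode is left out.

```python
# Character count field widths for QR versions 1-9 (larger versions use wider fields).
COUNT_BITS_V1_TO_9 = {"numeric": 10, "alphanumeric": 9, "byte": 8}

MODE_INDICATOR_BITS = 4  # every segment starts with a 4-bit mode indicator


def data_bits(mode: str, n: int) -> int:
    """Bits needed for n characters of payload in the given mode."""
    if mode == "numeric":
        # 3 digits -> 10 bits; a trailing group of 2 digits -> 7 bits, 1 digit -> 4 bits
        return 10 * (n // 3) + (0, 4, 7)[n % 3]
    if mode == "alphanumeric":
        # 2 characters -> 11 bits; a trailing single character -> 6 bits
        return 11 * (n // 2) + 6 * (n % 2)
    if mode == "byte":
        return 8 * n
    raise ValueError(f"unsupported mode: {mode}")


def segment_bits(mode: str, n: int) -> int:
    """Total bits for one (mode, length, characters) segment."""
    return MODE_INDICATOR_BITS + COUNT_BITS_V1_TO_9[mode] + data_bits(mode, n)


print(segment_bits("numeric", 8))        # 41
print(segment_bits("alphanumeric", 11))  # 74
print(segment_bits("byte", 5))           # 52
```

The width of the length field depends on the mode (and on the version range), which is the mode-specific length encoding mentioned above.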
The fact that the example is "org.apache.fury.benchmark.data" makes me think there is probably a lot of room to do better than 5 bits per char, if the goal is simply to reduce space. E.g. you probably know ahead of time that the string "org.apache.fury" is going to be very common in this context, and giving it a 2-bit or even an 8-bit codepoint would be a huge space savings across the project. That said, it's not super surprising that you'd rather have a suboptimal encoding than the complexity of a decoder that depends on the project. But of course you've already chosen to sacrifice simplicity for the sake of space to some degree by rolling your own encoding when you could use readily agreed-upon ones, which people assume is because space is at a premium. So it's not surprising that the reader wonders why this space savings is worth this complexity, while that further space savings is not worth that further complexity. I think what's missing is the justification of the tradeoffs, at least in the blog post: both the space vs. time tradeoffs, which I can imagine are important here, and the space vs. complexity tradeoffs, which are more subjective.
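To put rough numbers on the 5 bits per char figure, here is a minimal sketch of fixed-width 5-bit packing. The 30-symbol alphabet below (lowercase letters plus four punctuation characters) is an assumption for illustration, and the function names are hypothetical. The real format would also need a header carrying the length or padding, which this sketch passes out of band.

```python
# Assumed alphabet for illustration: 30 symbols fit in 5 bits (2**5 = 32).
ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|"


def pack_5bit(s: str) -> bytes:
    """Pack each character into 5 bits, left-aligned, zero-padded to a whole byte."""
    bits = 0
    for ch in s:
        bits = (bits << 5) | ALPHABET.index(ch)  # raises ValueError on other chars
    nbits = 5 * len(s)
    nbytes = (nbits + 7) // 8
    bits <<= nbytes * 8 - nbits  # shift so padding sits in the low bits of the last byte
    return bits.to_bytes(nbytes, "big")


def unpack_5bit(data: bytes, length: int) -> str:
    """Inverse of pack_5bit; the character count is supplied by the caller."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    chars = []
    for i in range(length):
        shift = total - 5 * (i + 1)
        chars.append(ALPHABET[(bits >> shift) & 0x1F])
    return "".join(chars)


name = "org.apache.fury.benchmark.data"
packed = pack_5bit(name)
print(len(name.encode("utf-8")), len(packed))  # 30 19
assert unpack_5bit(packed, len(name)) == name
```

For this 30-character example the packed form takes 19 bytes instead of 30. A dedicated codepoint for a known prefix such as "org.apache.fury" would shrink it further, at the cost of a decoder that depends on project-specific data.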
We have such an option in Fury too. We support registering classes. In that case, the string is encoded as a varint id, which is a kind of dict encoding. But not all users like to register classes. Meta string tries to reduce the space cost when such dict encoding is not available.
Your suggestions are great. I should clarify why dict encoding can't be used to reduce the cost.
As for performance, it won't be an issue. The set of strings that go through meta encoding is limited, so we can cache the encoded results. It's not on the critical path.
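A minimal sketch of the idea, not Fury's actual API or wire format: registered names are written as a varint id, and unregistered names fall back to writing the string itself. `ClassRegistry`, `encode_varint`, and the one-byte flag are all hypothetical, and the fallback uses plain UTF-8 where a meta string encoding would go.

```python
def encode_varint(n: int) -> bytes:
    """Unsigned varint: 7 payload bits per byte, high bit set on all but the last byte."""
    if n < 0:
        raise ValueError("varint must be non-negative")
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)
        else:
            out.append(low)
            return bytes(out)


class ClassRegistry:
    """Hypothetical registry mapping class names to small integer ids."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}

    def register(self, name: str) -> int:
        # Assign ids in registration order; re-registering returns the existing id.
        return self._ids.setdefault(name, len(self._ids))

    def encode_name(self, name: str) -> bytes:
        class_id = self._ids.get(name)
        if class_id is not None:
            return b"\x01" + encode_varint(class_id)  # flag 1: registered id
        raw = name.encode("utf-8")  # stand-in for a compact string encoding
        return b"\x00" + encode_varint(len(raw)) + raw  # flag 0: inline string


registry = ClassRegistry()
registry.register("org.apache.fury.benchmark.data.Foo")
print(len(registry.encode_name("org.apache.fury.benchmark.data.Foo")))  # 2
print(len(registry.encode_name("org.apache.fury.benchmark.data.Bar")))  # 36
```

Both peers have to register the same classes in the same order for the ids to line up, which is part of why some users prefer not to register.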
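A minimal sketch of that caching, assuming the encoder is a pure function of the string. `MetaStringCache` is a hypothetical name, and the UTF-8 encoder in the usage lines is only a stand-in for the real one.

```python
from typing import Callable


class MetaStringCache:
    """Memoize encoded bytes so each distinct string is encoded only once."""

    def __init__(self, encoder: Callable[[str], bytes]) -> None:
        self._encoder = encoder
        self._cache: dict[str, bytes] = {}
        self.misses = 0  # number of times the encoder actually ran

    def encode(self, s: str) -> bytes:
        encoded = self._cache.get(s)
        if encoded is None:
            self.misses += 1
            encoded = self._encoder(s)
            self._cache[s] = encoded
        return encoded


cache = MetaStringCache(lambda s: s.encode("utf-8"))  # stand-in encoder
for _ in range(1000):
    cache.encode("org.apache.fury.benchmark.data")
print(cache.misses)  # 1
```

Since class names, package names, and field names form a small, fixed set per application, the cache stays small and serialization only pays for a dictionary lookup after the first use.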