We tried huffman first, But we can't know which string will be serialized. If us...

masklinn · 2024-05-07T15:24:24

> If using huffman, we must collect most strings and compute a static huffman tree.

Isn't that essentially what you've done to define your bespoke alphabets / partial encodings?

chaokunyang · 2024-05-07T15:46:31

The thing is that we don't the frequency about every char happens

ryan-c · 2024-05-07T16:05:39

You don't need to. You compute a static huffman tree over fixed probabilities you've estimated or generated from a corpus.

chaokunyang · 2024-05-07T16:10:48

Good suggestion, I was planing it as an optional option. We can crawl some github repos, and extract most types, extract their class names, paths, fields names, and use such data as the corpus

ryan-c · 2024-05-07T16:28:10

Crude static Huffman example, that could definitely be improved:

     5 bits 0 0000-1111      acde fghi lmno rstu
     7 bits 10 00000-11111   bjkp qvwx yzAB CDEF GHIK LMNO PRST UVWY
     7 bits 110 0000-1111    JQXZ 0123 4567 89/.
     8 bits 111 00000-11110  !@#$ $%^& *()_ {}[] `~+= |\"' ;:<> ,?  (space at the end)
    16 bits 11111111 00000000-11111111 plus UTF-8 1 to 256 unicode characters encoded as UTF-8

You could even include some bigrams in this sort of table.

There's some code here that could maybe be used for that sort of static huffman tree:

https://github.com/ryancdotorg/lztp/blob/main/generate-seed....

Alternatively, have something like this:

    00 000000-111111 1-64 characters, 5 bits each
    01 000000-111111 1-64 characters, 6 bits each
    10 000000-111111 1-64 characters, ~6.4 bits each (character set of 84 characters, every 5 packed into 32 bits)
    11 000000-111111 1-64 characters, UTF-8

This is vaguely similar to how data is encoded for QR codes.

As pointed out elsewhere, this will not outperform zstd with a dictionary trained on a corpus, but zstd would require pulling in a lot of code.

masklinn · 2024-05-07T16:49:56

> This is vaguely similar to how data is encoded for QR codes.

Doesn't QR use variable static encodings (alphabets)? e.g. mode 1 is numerics, mode 2 is uppercase alphanumeric, mode 3 is binary, and mode 4 is jis (japanese).

ryan-c · 2024-05-07T16:56:00

QR codes use a series of (mode, length, characters) segments - a given code can switch between modes. I think for QR codes the length encoding is mode-specific.

topaz0 · 2024-05-07T16:32:25

The fact that the example is "org.apache.fury.benchmark.data" makes me think that there is probably a lot of room to do better than 5 bits per char, if the goal is simply to reduce space. E.g. you probably know ahead of time that the string "org.apache.fury" is going to be very common in this context, and giving that a 2-bit or even an 8-bit codepoint would be a huge space savings over the project. That said, it's not super surprising that you'd rather have a suboptimal encoding than have the complexity of a decoder that depends on the project. But of course you've already chosen to sacrifice simplicity for the sake of space to some degree by rolling your own encoding when you could use readily agreed-upon ones, which people assume is because space is at a premium. It's not surprising that the reader wonders why this space savings is worth this complexity, but that space savings is not worth it. I think what's missing is the justification of the tradeoffs, at least in the blog post. Both the space vs time tradeoffs which I can imagine are important here, and the space vs. complexity tradeoffs that are more subjective.

chaokunyang · 2024-05-07T17:25:11

We have such options in Fury too. We support register classes. In such cases, such string will be encoded with a varint id. It's kind of dict encoding. But not all users like to register classes. Meta string will try to reduce space cost when such dict encoding are not available.

Your suggestions are great. I should clarify why dict encoding can't be used to recude cost.

As for performance, it won't be an issue. The string for meta encoding are limited, we can cache the encoded result. So it's not in critical path