Hacker News
Meta String: A more space-efficient string encoding than UTF-8 in Fury (apache.org)
75 points by chaokunyang 12 days ago | hide | past | favorite | 62 comments





When doing ad-hoc space optimized encodings, it’s usually a good idea to compare the compress+decompress speed in addition to the resultant size against just compressing the data. This may win on one-off encodings of strings within an RPC, but the compression mechanism probably wins on the overall RPC message unless the message is very small & you can’t build a more optimized baseline dictionary for small messages.

It’s really hard to actually beat something like zstd with these ad-hoc compression schemes, since the compressed size will be bigger than what zstd produces with a dictionary (or on larger messages without a dictionary) AND the speed of a round trip through zstd is likely faster.


We use meta string mainly for object serialization. Users pass an object to Fury, and we return a binary. In such cases, the serialized binary is mostly 200~1000 bytes. Not big enough for zstd to work.

For dictionary encoding, Fury uses a similar technique too. When a string first comes up, we encode it as binary; later writes in the same object graph will write that string as a varint id.
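A minimal sketch of this kind of reference/dictionary encoding (illustrative only; the flag byte, varint layout, and function names are assumptions, not Fury's actual wire format):

```python
def write_varint(out: bytearray, n: int) -> None:
    # LEB128-style varint: 7 data bits per byte, high bit = continuation.
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)

def write_string(out: bytearray, seen: dict, s: str) -> None:
    if s in seen:
        out.append(1)               # flag: back-reference to an earlier string
        write_varint(out, seen[s])  # varint id assigned on first occurrence
    else:
        seen[s] = len(seen)
        data = s.encode("utf-8")
        out.append(0)               # flag: literal bytes follow
        write_varint(out, len(data))
        out.extend(data)

buf, seen = bytearray(), {}
for name in ["org.apache.fury", "x", "org.apache.fury"]:
    write_string(buf, seen, name)
# The repeated "org.apache.fury" costs 2 bytes (flag + id) instead of 17.
```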

Zstd can be used with Fury too, for multiple serializations across different objects. Fury doesn't have a context for compression unless users enable Fury's meta share mode. If meta share mode is not enabled, zstd can be used to compress data across multiple object graph serializations.


> In such cases, the serialized binary is mostly 200~1000 bytes. Not big enough for zstd to work

You're not referring to the same dictionary that I am. Look at --train in [1].

If you have a training corpus of representative data, you can generate a dictionary that you preshare on both sides which will perform much better for very small binaries (including 200-1k bytes). It's the same kind of technique but zstd's mechanism is absolutely general whereas purpose-built non-entropy based dictionaries are more limited & less likely to outperform.

If you want maximum flexibility (i.e. you don't know the universe of representative messages ahead of time or you want maximum compression performance), you can gather this corpus transparently as messages are generated & then generate a dictionary & attach it as sideband metadata to a message. You'll probably need to defer the decoding if it references a dictionary not yet received (i.e. send delivers messages out-of-order from generation). There are other techniques you can apply, but the general rule is that your custom encoding scheme is unlikely to outperform zstd + a representative training corpus. If it does, you'd need to actually show this rather than try to argue from first principles.

[1] https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md


Sadly, we don't have such a training corpus of representative data. Fury is just a serialization framework; we can't assume any string distribution. I thought about scanning the code of Apache OFBiz and using the domain objects in that project as the default corpus to carry a static huffman/zstd dictionary, but OFBiz may still not be representative. As for your second suggestion, I can't agree more. Actually Fury has already implemented this; we call it meta share mode. Fury will write such meta only once on a channel, and resend it if the channel gets disconnected. But this is not easy to use and imposes overhead on users, so we proposed meta encoding here. Anyway, your suggestions are very insightful. Thanks very much.

Wow. Didn't know about Train and custom dictionaries. Very cool.

It's not designed to beat zstd; those are used for different scenarios. zstd is used for data compression, the meta string encoding here is used for metadata encoding. The metadata here is mostly classname/fieldname/packagename/namespace/path/modulename. For such small strings, any statistical compression won't produce a smaller size.

> For such small strings, any statistical compression won't produce a smaller size

That is a statement you can make, but you have to actually demonstrate it's true against zstd's --train mechanism [1], which generates a more optimized "small string" dictionary based on representative training data precisely for this use-case.

[1] https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md

> those are used for different scenarios. zstd is used for data compression, the meta string encoding here is used for metadata encoding

Not sure what this means. The motivating use-case described for the alternate string encoding is precisely compression.


Dictionary encoding is already used in Fury. We encode the same string as a varint when it's seen again. The thing here is that in many cases the string doesn't repeat. Imagine you send `record Point(int x, int y)` in an RPC: you have one 'Point' to send, so no dictionary encoding will be used in such cases.

You seem to be fundamentally misunderstanding how the zstd dictionary works. It’s a dictionary at the statistical level. Your `record Point(int x, int y)` RPC would statistically at the bit level have to look like no other message for it to have no help. That seems unlikely since there’s all sorts of framing. And I question the mindset of ad-hoc compression schemes that are trying to shrink the long tail of “sent only once” messages.

zstd’s approach is fundamentally very different from a basic interning dictionary. Seriously, there’s been tons of research on this. It’s also interesting that Fury claims to be 0-copy and then does all these ad-hoc field compression which is the literal definition of not 0-copy (& yes, varint encoding is not 0-copy and neither is this custom string compression).

Anyway, unless you actually do a proper comparison against zstd with a trained dictionary against a representative corpus, we’re just going in circles with me saying “you need to evaluate your ad-hoc compression against zstd” and you trying to argue from first principles that the custom compression scheme wins.


Hi, it's not “sent only once”. It's millions of “sent only once” messages.

The thing here is that all those RPCs are stateless, so the context for compression is the one message itself, i.e. compressing an object like `Point(1,2)` alone, without prior knowledge. Users may train a static zstd dictionary, but Fury can't; Fury doesn't know which data will be serialized.

And meta string encoding is not statistical; it won't beat zstd. It's just for cases where zstd is not suitable.


> For such small strings, any statistical compression won't produce a smaller size.

You can hardcode expected frequencies and throw arithmetic encoding at it and the average size will probably drop a meaningful amount.

And I can't easily find an example corpus, but the description of these strings sounds like they'd often have repetition, so another symbol to encode repetition and make this into a normal compression algorithm is probably worth it.

I wonder how many of these strings start with org.apache


I'm curious about this. Why wouldn't statistical compression go smaller? Assuming you restrict the statistics in question to be tailored to the metadata being compressed, it feels like it should be comparable? Yes, you may need/want to store a precomputed dictionary at the receiving end so that you aren't sending it every time, but that is really no different from having custom code at the receiving end. Right?

The thing is that we can't store a precomputed dictionary. Fury is just a serialization framework; we don't know which data will be serialized.

If you can store new code, you can get a more optimized dictionary for what you are doing, though? Why not?

Yes, if we can. The users of Fury may be able to do this. Fury may only provide an interface to let users pass such a dict/zstd/huffman table or something like that.

Seems like they are just using a custom 5 bit charset for strings with a limited alphabet.

That's not really a rethink of string encoding, just using a different fixed encoding.

Given the use case, I'm not sure I understand why not just use static huffman.


It looks like it's the same idea as static Huffman without the variable length part. Comparing it to UTF-8 is pointless, of course - UTF-8 is a universal text representation (whose major selling point is that it's also ASCII-compatible and human-readable), and this is pretty much a compression algorithm. It is obvious that natural-language-based text is redundant, so no wonder it's compressible, and I am sure a lot of literature exists on which compression works best on which kinds of texts.

We tried huffman first, but we can't know which strings will be serialized. If using huffman, we must collect most strings and compute a static huffman tree; if not, we must send the huffman stats to the peer, and that cost will be bigger.

> If using huffman, we must collect most strings and compute a static huffman tree.

Isn't that essentially what you've done to define your bespoke alphabets / partial encodings?


The thing is that we don't know the frequency at which every char occurs.

You don't need to. You compute a static huffman tree over fixed probabilities you've estimated or generated from a corpus.

Good suggestion, I was planning it as an optional feature. We can crawl some GitHub repos, extract the most common types, extract their class names, paths, and field names, and use such data as the corpus.

Crude static Huffman example, that could definitely be improved:

     5 bits 0 0000-1111      acde fghi lmno rstu
     7 bits 10 00000-11111   bjkp qvwx yzAB CDEF GHIK LMNO PRST UVWY
     7 bits 110 0000-1111    JQXZ 0123 4567 89/.
     8 bits 111 00000-11110  !@#$ $%^& *()_ {}[] `~+= |\"' ;:<> ,?  (space at the end)
    16 bits 11111111 00000000-11111111 plus UTF-8 1 to 256 unicode characters encoded as UTF-8
You could even include some bigrams in this sort of table.

There's some code here that could maybe be used for that sort of static huffman tree:

https://github.com/ryancdotorg/lztp/blob/main/generate-seed....

Alternatively, have something like this:

    00 000000-111111 1-64 characters, 5 bits each
    01 000000-111111 1-64 characters, 6 bits each
    10 000000-111111 1-64 characters, ~6.4 bits each (character set of 84 characters, every 5 packed into 32 bits)
    11 000000-111111 1-64 characters, UTF-8
This is vaguely similar to how data is encoded for QR codes.

As pointed out elsewhere, this will not outperform zstd with a dictionary trained on a corpus, but zstd would require pulling in a lot of code.
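The QR-style mode-selection idea above can be sketched roughly like this (the alphabets and mode table are made up for illustration, not taken from QR or Fury):

```python
import math

# Hypothetical mode table: pick the densest character set that covers
# the whole string, falling back to UTF-8 (QR-style mode selection).
MODES = [
    ("5-bit", "abcdefghijklmnopqrstuvwxyz._$|", 5),
    ("6-bit", "abcdefghijklmnopqrstuvwxyz"
              "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789._", 6),
]

def choose_mode(s: str):
    for name, alphabet, bits in MODES:
        if all(c in alphabet for c in s):
            return name, math.ceil(len(s) * bits / 8)  # payload bytes
    return "utf-8", len(s.encode("utf-8"))

print(choose_mode("org.apache.fury"))  # 15 chars * 5 bits -> 10 bytes
print(choose_mode("MyClass2"))         # 8 chars * 6 bits -> 6 bytes
```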


> This is vaguely similar to how data is encoded for QR codes.

Doesn't QR use various static encodings (alphabets)? e.g. mode 1 is numeric, mode 2 is uppercase alphanumeric, mode 3 is binary, and mode 4 is JIS (Japanese).


QR codes use a series of (mode, length, characters) segments - a given code can switch between modes. I think for QR codes the length encoding is mode-specific.

The fact that the example is "org.apache.fury.benchmark.data" makes me think that there is probably a lot of room to do better than 5 bits per char, if the goal is simply to reduce space. E.g. you probably know ahead of time that the string "org.apache.fury" is going to be very common in this context, and giving it a 2-bit or even an 8-bit codepoint would be a huge space savings over the project. That said, it's not super surprising that you'd rather have a suboptimal encoding than the complexity of a decoder that depends on the project. But of course you've already chosen to sacrifice simplicity for the sake of space to some degree by rolling your own encoding when you could use readily agreed-upon ones, so it's not surprising that the reader wonders whether this space savings is worth this complexity. I think what's missing is a justification of the tradeoffs, at least in the blog post: both the space vs. time tradeoffs, which I can imagine are important here, and the space vs. complexity tradeoffs, which are more subjective.

We have such options in Fury too. We support registering classes; in such cases, the string will be encoded with a varint id, which is a kind of dict encoding. But not all users like to register classes. Meta string will try to reduce space cost when such dict encoding is not available.

Your suggestions are great. I should clarify why dict encoding can't always be used to reduce cost.

As for performance, it won't be an issue. The strings for meta encoding are limited, so we can cache the encoded results; it's not on the critical path.


We first considered using some lossless compression algorithms, but since we are just encoding meta strings such as classname/fieldname/packagename/path, there isn't much duplication pattern to detect, and the compression won't have a very good effect. If we were encoding generic strings, compression algorithms such as zstd/deflate would be better, but that's not the case here. This is why we named it meta string.

Everything old is new again. This reminds me of ZSCII, which was how Infocom games encoded string data to fit ~100kb of text data into a Commodore 64, packing 3 characters into every 2 bytes.

    --first byte-------   --second byte---
    7    6 5 4 3 2  1 0   7 6 5  4 3 2 1 0
    bit  --first--  --second---  --third--
The 5-bit character encoding was one of three character sets:

       Z-char 6789abcdef0123456789abcdef
    current   --------------------------
      A0      abcdefghijklmnopqrstuvwxyz
      A1      ABCDEFGHIJKLMNOPQRSTUVWXYZ
      A2       ^0123456789.,!?_#'"/\-:()
              --------------------------
Z-char 0 was space in all character sets, z-char 1 was newline, z-chars 2 and 3 switched to one of the other character sets depending on which one you were already in for the next character only, and z-chars 4 and 5 switched permanently, the "shift-lock".

https://inform-fiction.org/zmachine/standards/z1point0/sect0...
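The 3-characters-into-2-bytes packing described above can be sketched like this (a simplified layout based on the description; the end-of-string flag in the top bit follows the Z-machine convention):

```python
def pack_zchars(z1: int, z2: int, z3: int, last: bool = False) -> bytes:
    # 16 bits: top bit marks the final word, then three 5-bit z-chars.
    word = (int(last) << 15) | (z1 << 10) | (z2 << 5) | z3
    return word.to_bytes(2, "big")

def unpack_zchars(b: bytes):
    word = int.from_bytes(b, "big")
    return (word >> 10) & 0x1F, (word >> 5) & 0x1F, word & 0x1F

# z-chars 6, 7, 8 are 'a', 'b', 'c' in alphabet A0 (A0 starts at z-char 6).
assert unpack_zchars(pack_zchars(6, 7, 8)) == (6, 7, 8)
```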


This is interesting, thanks for sharing this. I will take a deep look at it.

It's compression via domain-specific frequency encoding.

Good, but that's a pretty common technique for cases where every bit counts.


Yes, in many rpc/serialization systems, we need to serialize class names, field names, and package names so we can support dynamic deserialization and type forward/backward compatibility. In such cases, that meta may take more bytes than the actual data; this is where meta string brings gains.

In rpc/serialization systems, we often need to send namespace/path/filename/fieldName/packageName/moduleName/className/enumValue string between processes.

Those strings are mostly ASCII strings. In order to transfer them between processes, we encode them using utf-8. Such encoding takes one byte for every char, which is not actually space efficient.

If we take a deeper look, we will find that most chars are lowercase letters, `.`, `$` and `_`, which can be expressed in a much smaller range of 32 values. But one byte can represent the range 0~255, so the significant bits are wasted, and this cost is not ignorable. In a dynamic serialization framework, such meta will take considerable cost compared to the actual data.

So we proposed a new string encoding in Fury, which we call meta string encoding. It encodes most chars using 5 bits instead of the 8 bits used by utf-8, which brings a 37.5% space saving compared to utf-8.

For strings that can't be represented with 5 bits, we also proposed an encoding using 6 bits, which brings a 25% space saving.
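For a rough sense of the claimed savings, here is a toy 5-bit packer (the 32-character alphabet and bit layout are assumptions for illustration, not Fury's actual meta string spec):

```python
# An assumed 32-symbol alphabet (lowercase letters plus a few extras);
# Fury's real spec may choose different characters.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123._"
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode5(s: str) -> bytes:
    acc, bits = 0, 0
    for c in s:
        acc = (acc << 5) | INDEX[c]   # 5 bits per char instead of 8
        bits += 5
    acc <<= (-bits) % 8               # zero-pad the final byte
    return acc.to_bytes((bits + 7) // 8, "big")

s = "org.apache.fury.benchmark.data"
print(len(s.encode("utf-8")), "->", len(encode5(s)))  # 30 -> 19 bytes
```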


You don't include '/' in the list of special characters so how are you using this to encode paths? As an array of strings?

We can use `|` instead; `|` is not used in paths. Currently we have only 2 chars that can be extended. I haven't decided what to include.

In addition to the existing dictionary compression algorithms, it is very important to know the characteristics of string contents. The proposed encoding has a very crude mechanism to describe them, but that's all (and that encoding flag is included to every string encoded, even though the schema could have included that information instead). Namespaces would start with only a small number of fixed prefixes like `java.`, `com.` and `org.`. Namespaces, paths and file names will have a lot of overlapping substrings as well. Class and field names are more chaotic, but some heuristics can be easily extracted as most of them will obey the common naming convention.

By the way, the specification is highly confusing to interpret to say the least as well. For example, is the "ref(erence) meta" a separate section in the serialization format or not? The corresponding section never mentions which bytes are to be written, and reference flags apparently describe the state of each object, so it can't be in that section anyway. Providing at least a single worked example would have significantly reduced confusion.

(Due to the inability to fully comprehend the specification, I can't give general criticism and/or advice for now. But it does seem to miss several lessons from existing serialization formats. For instance, a varint encoding based on a continuation bit is actually the worst encoding for the given distribution, even though it is all too common! Consider an alternative like `1^n 0 bit^(7n+7)`, which can determine the continuation length from the first byte instead.)
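The alternative varint layout the comment describes, with the continuation length encoded in unary at the front of the first byte, might be sketched like this (names and exact bit layout are illustrative):

```python
def encode_prefix_varint(n: int) -> bytes:
    # k leading 1-bits then a 0 in the first byte signal k extra bytes,
    # so payload capacity is 7k+7 bits and the length is known up front.
    k = 0
    while n >= 1 << (7 * k + 7):
        k += 1
    assert k < 8, "value too large for this sketch"
    prefix = ((1 << k) - 1) << (8 * (k + 1) - k)
    return (prefix | n).to_bytes(k + 1, "big")

def decode_prefix_varint(b: bytes):
    k = 0
    while b[0] & (0x80 >> k):         # count leading 1-bits: one branch point
        k += 1
    value = int.from_bytes(b[:k + 1], "big") & ((1 << (7 * k + 7)) - 1)
    return value, k + 1               # decoded value, bytes consumed

for v in (0, 127, 128, 16384, 1 << 20):
    assert decode_prefix_varint(encode_prefix_varint(v)) == (v, len(encode_prefix_varint(v)))
```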


I can't help but feel like something has gone fundamentally wrong when there are so many arbitrary strings in the system that this optimisation makes a tangible difference.

Why is it wrong? It is a widely known fact that texts are redundant and compress well. Many systems work with texts and there's an advantage when those texts are human-comprehensible. You can make a system which would auto-compress and decompress any text of sufficient length - and some do - but there's absolutely no surprise that texts are compressible and it doesn't indicate there's something wrong. Texts are not optimized for space, they are optimized for human convenience.

But these are not general text, they are identifiers. If you are building space-efficient RPC system then I do think it is reasonable to question why are these identifiers passed as strings in such quantity that it makes a difference.

On top of my head, a better approach could be to have some mechanism to establish mapping from these human-readable identifiers to numeric identifiers (protocol handshake or some other schema exchange), and then use those numeric identifiers in the actual high-volume messages.

edit: umm... seems like Fury is doing something like that already https://fury.apache.org/docs/guide/java_object_graph_guide#m... so I am a bit puzzled whether this saving really makes a meaningful difference?!


Identifiers are textual also (though shorter). You don't have arbitrary byte strings as identifiers. And yes, if you build RPC specifically, then you have to be ready for the fact that you will have to deal with symbolic identifiers. You can work around that by transcoding them somehow into an efficient format - like protobuf does by forcing the schema to provide translation, or by additional communication as you mentioned - but it's unavoidable that the symbolic identifiers will pop up somewhere, and they will be human-readable and thus redundant.

> I am bit puzzled if this saving really makes meaningful difference?!

They make a somewhat misleading claim - "37.5% space efficient against UTF-8" - but that doesn't tell us how much gain it is on a typical protocol interaction. It could be a 37.5% improvement on 0.1% of all data or on 50% of all data - depending on that, the overall effect could vary drastically. I guess if they are doing it, it makes some sense for them, but it's hard to figure out how much it saves in the big picture.


If you establish mapping first you could probably use 4 or 8 byte integers as ids instead of using custom encodings to gain 30 vs 19 bytes. With 8 bytes you could even use a hash.

We already support such dict encoding: we let users register classes with an id and write the class by id; the id is encoded as a varint, which uses only 1~5 bytes. But not every user likes the registration. Meta string encoding here is for cases where no mapping is available.

Meta string is not designed for encoding arbitrary strings; it is used to encode limited, enumerated strings only, such as classname/fieldname/packagename/namespace/path/modulename. This is why we name it meta string: it's used in cases where lossless statistical compression can't provide a gain.

OOOh, very very nice. I might have to copy this into the H2 database engine. We already use variable-size encoding for numbers, and this would be a nice addition for reducing the size of row data containing text.

Glad to see the H2 database engine benefit from this. I use H2 too.

If you're interested in this topic, there exist more string compression algorithms that save data transfer or memory in different ways. Look into Thrift Compact Protocol, MessagePack, Protocol Buffers and Avro.

Protobuf only supports varint encoding; Fury supports that too. But as far as I know, protobuf uses utf-8 for string encoding. Could you share more details on how protobuf compresses strings?

And Fury meta string is not used to compress the data plane; it's used to compress meta in the data plane.


For implementation details, https://github.com/apache/incubator-fury/blob/main/java/fury... can be taken as an example

What about using gzip on messages? I'd expect for it to yield similar savings while keeping message format simpler.

You'd never use gzip for that, it has two dozen bytes of overhead before you can even start building a deflate stream. Even a raw deflate stream might not recoup the overhead on values which are only a few dozen bytes at the most, and if you're not using a static huffman tree the overhead for the tree is going to end all possibility of a gain.

Furthermore LZ constructions work off of redundancy, which is difficult to find in large amounts in such short strings. The overhead of framing will just increase the size of the payload after construction.

You could define just a static huffman tree and make that part of your spec.


Yes, totally agree. I tried gzip first, but it introduces about 10~20 bytes of cost up front. So I proposed this meta encoding here. If we had much bigger strings, gzip would be better, but all the strings we want to compress here are just classname/fieldname/packagename/namespace/path/modulename. We don't have enough repeated patterns for gzip to work.

If you rip off the gzip headers and footers, DEFLATE applied to a small string only has 10 bits of overhead, 2 bytes with padding.

The default Huffman table is not suited for this use case, and a Huffman table in general is a waste of space these days, but the overhead is very small and can be reduced to nothing if you know the original string length.
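The framing overhead is easy to see with Python's zlib, where a negative wbits value produces a raw DEFLATE stream with no zlib/gzip header, trailer, or checksum:

```python
import gzip, zlib

s = b"org.apache.fury.benchmark.data"

gz = gzip.compress(s)

# Negative wbits = raw DEFLATE, stripped of all container framing.
co = zlib.compressobj(9, zlib.DEFLATED, -15)
raw = co.compress(s) + co.flush()

print(len(s), len(gz), len(raw))
# gzip framing alone adds 18 bytes (10-byte header + 8-byte trailer),
# which dwarfs any gain on strings this short.
assert zlib.decompress(raw, -15) == s
```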


Applying gzip to the whole stream will introduce extra CPU cost; we try to make our serialization implementation as fast as possible. If we apply gzip only to such small meta strings, it emits a bigger size. I believe some stats are written in the data.

For spec details, please see fury meta string spec: https://fury.apache.org/docs/specification/fury_xlang_serial...

Apple's NSString apparently uses an in-memory encoding (not wire encoding) that can go as low as 5 bits per byte, as of OS X 10.10 in 2014:

https://mikeash.com/pyblog/friday-qa-2015-07-31-tagged-point...

If the length is between 0 and 7, store the string as raw eight-bit characters.

If the length is 8 or 9, store the string in a six-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013bDNvwyUL2O856P-B79AFKEWV_zGJ/HYX".*

If the length is 10 or 11, store the string in a five-bit encoding, using the alphabet "eilotrm.apdnsIc ufkMShjTRxgC4013"

---

How about 5 bits? This isn't totally ludicrous. There are probably a lot of strings which are just lowercase, for example. 5 bits gives 32 possible values. If you include the whole lowercase alphabet, there are 6 extra values, which you could allot to the more common uppercase letters, or some symbols, or digits, or some mix.
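As a toy version of the idea (alphabet taken from the quoted write-up; the tag layout is invented for illustration, not Apple's actual scheme):

```python
# The 32-character alphabet quoted from the NSString write-up above.
ALPHABET = "eilotrm.apdnsIc ufkMShjTRxgC4013"

def pack_tagged(s: str) -> int:
    # Up to 11 chars at 5 bits each fit in 55 bits of a 64-bit word,
    # leaving the remaining bits free for a pointer tag.
    assert len(s) <= 11
    word = 0
    for c in s:
        word = (word << 5) | ALPHABET.index(c)
    return word

def unpack_tagged(word: int, length: int) -> str:
    chars = []
    for _ in range(length):
        chars.append(ALPHABET[word & 0x1F])
        word >>= 5
    return "".join(reversed(chars))

w = pack_tagged("translation")
assert unpack_tagged(w, 11) == "translation"
```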

Related thread - https://lobste.rs/s/5417dx/storing_data_pointers#c_l4zfrv

It's interesting how network formats and in-memory formats kinda converge in design, because retrieving data from memory to CPU is so expensive now.


... only in tagged pointers meaning the total size can't exceed like ~60 bits. After that, the canonical representation of NSString is UTF-16 and in Swift strings is UTF-8.

So not "in memory" (although obviously it is in memory) it's when the goal is to fit the totality of the string plus a tag into a register.

If they can't completely encode the 10-11 character value using the 5-bit charset it spills to UTF-16 on the heap.

> It's interesting how network formats and in-memory formats kinda converge in design, because retrieving data from memory to CPU is so expensive now.

Honestly Strings are probably one of the worst examples of this. Strings have like 50 different representations depending on what you're trying to do with them.

1. Wire format for storage and retrieval is generally UTF-8.

2. If you're trying to parse a string you generally want to expand it into code points.

3. If you're trying to sort or compare strings you generally want to expand that into a normalized form like NFC or NFD.

4. If you're trying to edit text, you generally want grapheme clusters.

5. If you're trying to present text, you generally want pixels.

The way to think about it is: "there is no such things as the length of a string without context." Strings are basically a bytecode format for representing concepts.

[1] https://developer.apple.com/documentation/foundation/nsstrin...


My point was something like

- if you want to send a bunch of small strings over the network, you can pick a variable length encoding even more compact than UTF-8

- if you want to pass and return a lot of strings as values, store them in data structures like a List<Str>, then you can pick a variable length encoding even more compact than UTF-8. So you can compute on the entire List<Str> without indirections for the common case of small strings, with good cache locality

---

That doesn't really contradict anything you said -- the tagged pointer is in some sense a "wire format", and then you have to do a conversion to actually do something useful with the string.

Although sometimes you don't need conversions either. For len() in bytes or code points and graphemes, which are ALL identical for ASCII, you can calculate it directly on the tagged pointer.

And you also have random access to bytes, code points, and graphemes in that special case. A lot of the "work" happened when deciding that this encoding is valid for a particular string, which is nice.


Yes, we only use this encoding when it brings benefits. `Wire format` is a good term to describe such cases, where the context is constrained to the current message only. So the data for compression is small, and many statistical methods won't work since it's a `wire format`: the stats themselves must be sent to peers, which has bigger space and cpu cost.

> For len() in bytes or code points and graphemes, which are ALL identical for ASCII, you can calculate it directly on the tagged pointer.

7-bit ASCII yeah, not 8-bit, which it appears are supported in tagged pointers per your original comment.

8-bit ASCII has fun characters like é (ASCII 130, 0x82) which may be 1 grapheme cluster -- but.

There's several ways to represent it in Unicode. You can do it as U+00E9 (NFC) or you can do it as U+0065 U+0301 (the combining acute accent, NFD).

This means it's a byte length of 1 in ASCII -- but several in UTF-8 (since UTF-8 only has direct compatibility with the 7-bit plane). And either 2 or 4 in UTF-16.

Also, it means this string has a Unicode code point length of 1 or 2 depending on context and normalization form.

Again, asking the context-free len() of a string is a meaningless question.

> And you also have random access to bytes, code points, and graphemes in that special case.

You should never make this assumption. You should treat strings as opaque data blobs.

Unicode is just a billion edge cases in a trench coat ;)


Meta string is used only for ASCII chars, so every char is in the range of a byte, and we can just randomly access them. But your reminder is good. I just found out that we didn't check that the passed string is an ASCII string, although our cases always pass ASCII strings.


