I can't help but feel like something has gone fundamentally wrong when there are...

smsm42 · 2024-05-07T17:38:18

Why is it wrong? It is a widely known fact that texts are redundant and compress well. Many systems work with texts and there's an advantage when those texts are human-comprehensible. You can make a system which would auto-compress and decompress any text of sufficient length - and some do - but there's absolutely no surprise that texts are compressible and it doesn't indicate there's something wrong. Texts are not optimized for space, they are optimized for human convenience.

zokier · 2024-05-07T18:11:46

But these are not general text, they are identifiers. If you are building a space-efficient RPC system, then I do think it is reasonable to question why these identifiers are passed as strings in such quantity that it makes a difference.

Off the top of my head, a better approach could be to have some mechanism to establish a mapping from these human-readable identifiers to numeric identifiers (a protocol handshake or some other schema exchange), and then use those numeric identifiers in the actual high-volume messages.
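To make the idea concrete, here is a minimal sketch of such a mapping. Everything in it is hypothetical: the `IdentifierTable` class, the `handshake` method and the example class names are made up for illustration and are not taken from any real RPC framework.

```python
class IdentifierTable:
    """Assigns small integers to identifier strings in first-seen order."""

    def __init__(self):
        self._ids = {}    # name -> numeric id
        self._names = []  # numeric id -> name

    def intern(self, name: str) -> int:
        # Return the existing id, or assign the next free one.
        if name not in self._ids:
            self._ids[name] = len(self._names)
            self._names.append(name)
        return self._ids[name]

    def lookup(self, ident: int) -> str:
        return self._names[ident]

    def handshake(self) -> list[str]:
        # Sent once per connection; the receiver rebuilds the same table from it.
        return list(self._names)


# Sender registers its identifiers up front (hypothetical names).
sender = IdentifierTable()
for name in ["org.example.Order", "org.example.Customer"]:
    sender.intern(name)

# Receiver replays the handshake and ends up with identical ids.
receiver = IdentifierTable()
for name in sender.handshake():
    receiver.intern(name)

# High-volume messages then carry only the small integers.
message = [sender.intern("org.example.Customer"), sender.intern("org.example.Order")]
decoded = [receiver.lookup(i) for i in message]
```

The full strings cross the wire once, in the handshake, and every later message pays only for the integers.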

edit: umm... seems like fury is doing something like that already https://fury.apache.org/docs/guide/java_object_graph_guide#m... so I am bit puzzled if this saving really makes meaningful difference?!

smsm42 · 2024-05-07T18:57:14

Identifiers are also textual (though shorter). You don't have arbitrary byte strings as identifiers. And yes, if you build RPC specifically, then you have to be ready for the fact that you will have to deal with symbolic identifiers. You can work around that by transcoding them somehow into an efficient format - like protobuf does by forcing the schema to provide the translation, or by additional communication as you mentioned - but it's unavoidable that the symbolic identifiers will pop up somewhere, and they will be human-readable and thus redundant.

> I am bit puzzled if this saving really makes meaningful difference?!

They make a somewhat misleading claim - "37.5% space efficient against UTF-8" - but that doesn't tell us how much gain it is on a typical protocol interaction. It could be a 37.5% improvement on 0.1% of all data or on 50% of all data - depending on that, the overall effect could vary drastically. I guess if they are doing it, it makes some sense for them, but it's hard to figure out how much it saves in the big picture.
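A quick back-of-the-envelope calculation shows how wide that range is. This is only a sketch: it assumes the 37.5% figure applies uniformly to every identifier byte and that the rest of the message is unchanged, and `overall_saving` is a name made up here.

```python
def overall_saving(identifier_fraction: float, identifier_saving: float = 0.375) -> float:
    """Fraction of total message bytes saved when only identifier bytes shrink."""
    return identifier_fraction * identifier_saving


# The two scenarios from the comment: identifiers are 0.1% or 50% of all data.
for fraction in (0.001, 0.5):
    saved = overall_saving(fraction)
    print(f"identifiers are {fraction:.1%} of the data -> {saved:.4%} saved overall")
```

Under those assumptions the overall saving is roughly 0.04% in the first case and 18.75% in the second, so the headline number alone says little.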

IsTom · 2024-05-08T14:00:27

If you establish a mapping first, you could probably use 4- or 8-byte integers as ids instead of using custom encodings to get 19 bytes instead of 30. With 8 bytes you could even use a hash.
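A rough sketch of the hash variant, assuming BLAKE2b truncated to 8 bytes (the hash choice and the `identifier_hash64` name are my own picks, nothing here comes from Fury):

```python
import hashlib


def identifier_hash64(name: str) -> bytes:
    """Derive a fixed 8-byte id from an identifier string."""
    # digest_size=8 asks BLAKE2b for a 64-bit digest directly.
    return hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()


# Hypothetical class names, used only to show the fixed size.
order_id = identifier_hash64("org.example.Order")
customer_id = identifier_hash64("org.example.Customer")
```

Both sides still need to know the full names in order to resolve a hash back to a class, and collisions are unlikely but possible, so this only replaces the string on the wire, not the mapping itself.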

chaokunyang · 2024-05-09T14:49:05

We already support such dict encoding: we let users register a class with an id and write the class by id. The id is encoded as a varint, which uses only 1~5 bytes. But not every user likes the registration. Meta string encoding here is for cases where no mapping is available.
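For readers unfamiliar with varints, here is a minimal sketch of the common base-128 scheme (7 payload bits per byte, high bit set when more bytes follow). It assumes unsigned 32-bit ids; Fury's exact wire format may differ, and the function names are illustrative.

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned 32-bit integer in 1 to 5 bytes, low 7 bits first."""
    if not 0 <= value < 2**32:
        raise ValueError("value must fit in an unsigned 32-bit integer")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode a single varint that starts at the beginning of data."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated varint")
```

Ids below 128 take a single byte, and the largest 32-bit value takes five, which is where the 1~5 byte range comes from.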

chaokunyang · 2024-05-07T15:40:14

Meta string is not designed for encoding arbitrary strings; it is used only to encode a limited, enumerated set of strings, such as classname/fieldname/packagename/namespace/path/modulename. This is why we named it meta string: it is used in cases where lossless statistical compression can't provide a gain.
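As an illustration of why a restricted alphabet helps, here is a sketch that packs each character into 5 bits instead of 8. The 30-symbol alphabet, the bit layout and the `pack5`/`unpack5` names are assumptions made for this example, not the actual meta string format.

```python
# Assumed alphabet: 30 symbols, so every character fits in 5 bits (2**5 = 32).
ALPHABET = "abcdefghijklmnopqrstuvwxyz._$|"
CODE = {ch: i for i, ch in enumerate(ALPHABET)}


def pack5(name: str) -> bytes:
    """Pack each character into 5 bits, zero-padding the final byte."""
    bits = 0
    for ch in name:
        bits = (bits << 5) | CODE[ch]  # raises KeyError outside the alphabet
    nbits = 5 * len(name)
    nbytes = (nbits + 7) // 8
    bits <<= nbytes * 8 - nbits        # left-align, padding bits on the right
    return bits.to_bytes(nbytes, "big")


def unpack5(data: bytes, length: int) -> str:
    """Inverse of pack5; the character count must be supplied separately."""
    bits = int.from_bytes(data, "big")
    total = len(data) * 8
    return "".join(
        ALPHABET[(bits >> (total - 5 * (i + 1))) & 0x1F] for i in range(length)
    )


name = "org.apache.fury.benchmark.data"  # 30 characters, 30 bytes in UTF-8
packed = pack5(name)                      # 150 bits, rounded up to 19 bytes
```

Five bits instead of eight is a 37.5% reduction per character; with padding, this 30-character name goes from 30 bytes to 19. A real format would also need to record the length or the padding, which this sketch leaves to the caller.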