This is SO good. The interactive demos, including the way it presents the binary format towards the end, are clear and really engaging. I love this kind of article.
It's a great presentation and an interesting topic. One nit, though: it's good form to spell out an acronym or initialism in full first and then append the short form, unless it's quite a well-known one like PDF (usually, I suppose, as it could be probability distribution function on a maths page).
OP implements their own RLE-based compression scheme. I bet you'd get better results from emitting an uncompressed binary format, and then passing the whole thing through a general-purpose compressor like zstd.
I'd also be interested to see what zstd (or even just gzip) would make of their original un-optimised format - I bet you'd get most of the way there "for free".
Given equivalent data stored in both JSON and BSON format, I would expect them both to compress down to blobs of roughly equivalent size. This is because both encode roughly the same amount of information, so the compression algorithm will tend to compress down to the same final result. I haven't run this as an experiment, though... that would be fun.
General compression algorithms like zstd will do way, way better on previously encoded data. Sorting, RLE, etc, are all going to lead to a smaller resulting binary after going through the generalized algorithm.
You're also going to be relying on heuristics that may fail. Zstd has no idea that your UUIDs are going to repeat, it doesn't know to dictionary encode them or to create frequency mappings. What it ends up doing is unlikely to be as effective as doing the work ahead of time, then allowing zstd to find the interesting and complex patterns that arise out of your encoded format.
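A quick sketch of what I mean (Python, with zlib standing in for zstd; the ten peers and the 1,000-op log are made-up numbers):

```python
import random
import uuid
import zlib

random.seed(42)

# A log of 1,000 operations, each tagged with one of 10 distinct peer UUIDs.
peers = [uuid.UUID(int=random.getrandbits(128)) for _ in range(10)]
ops = [random.choice(peers) for _ in range(1000)]

# Naive encoding: every operation stores the full 36-char UUID string.
raw = "".join(str(u) for u in ops).encode()

# Dictionary encoding done ahead of time: store each distinct UUID once
# (16 raw bytes), then reference it with a 1-byte index per operation.
table = b"".join(u.bytes for u in peers)
indexes = bytes(peers.index(u) for u in ops)
encoded = table + indexes

print(len(zlib.compress(raw)))      # compressed naive encoding
print(len(zlib.compress(encoded)))  # compressed dictionary encoding (smaller)
```

The general-purpose compressor still helps on the encoded form, but it starts from a much better place because the repetition has already been made explicit.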
I wonder if using the rsync algorithm would also help? Looking at the state changes made to one CRDT on one machine versus another, both states still share a lot in common. So to merge A on machine 1 with B on machine 2, you can rsync A to B on machine 2 without overwriting B, only having to send the deltas. My understanding is that there are CRDTs able to send over just the operations performed, but for simplicity, for a full-state CRDT, this might be easier and almost as efficient as a purpose-built algorithm.
Honestly, a good showcase of how expensive UUIDs are. If you can perform a mapping to integers you should really consider it. Not only will your UUIDs take up 2x the space of a 64bit integer, they'll compress horribly by comparison. Integer compression is so good these days you can compress integers such that they basically end up taking up ~1/2 a byte instead of 8. Doing that with UUID v4s is fundamentally not going to work.
The article mentions RLE, which is also incredibly good for integer IDs if you diff-encode them, since ids are typically sequential with no or small gaps. Diff + RLE can turn your encoded structure into ~1 byte.
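A minimal sketch of that diff + RLE idea (my own toy encoder, not the article's scheme):

```python
def delta_rle_encode(ids):
    """Delta-encode a sorted id list, then run-length encode the deltas."""
    deltas = [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]
    runs = []  # list of [delta, count] pairs
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    return runs

def delta_rle_decode(runs):
    ids, acc = [], 0
    for delta, count in runs:
        for _ in range(count):
            acc += delta
            ids.append(acc)
    return ids

# 10,000 perfectly sequential ids collapse to a single [delta, count] pair.
ids = list(range(1, 10_001))
runs = delta_rle_encode(ids)
print(runs)  # [[1, 10000]]
assert delta_rle_decode(runs) == ids
```

With small gaps you get a handful of runs instead of one, but the output stays tiny compared to storing every id.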
Also, incredible website. The interactivity is so fun.
I did something similar to this very recently where I took a dataset and just continuously applied data encoding methods to it. It was much smaller in memory and compressed with zstd to a smaller size as well. I've found that 'prepping' data before using a generalized compression algorithm has significant gains both for encode/decode performance + the output size. These were, incidentally, CRDT operations :D
If you care about compressibility, you can simply use a different way of generating UUIDs instead of switching to another type. For example, initialize a 128-bit counter when starting the process and increment it for each number generated. This should compress almost as well as sequential integers, while still being collision resistant without coordination or persistent state.
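Something like this, I mean (the class name is mine; `secrets` just gives a cryptographically random start value):

```python
import secrets

class CounterUUIDGenerator:
    """Generate 128-bit ids from a randomly initialized counter.

    The random start makes cross-process collisions unlikely without any
    coordination; the sequential increments make the ids compression-friendly.
    """
    def __init__(self):
        self.counter = secrets.randbits(128)

    def next_id(self) -> bytes:
        self.counter = (self.counter + 1) % (1 << 128)
        return self.counter.to_bytes(16, "big")

gen = CounterUUIDGenerator()
a = int.from_bytes(gen.next_id(), "big")
b = int.from_bytes(gen.next_id(), "big")
assert (b - a) % (1 << 128) == 1  # consecutive ids differ by exactly 1
```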
I have never seen this strategy described. Is there any literature on it? It seems like this would be as good as the other algorithms for generating UUIDs, but faster and with the benefit of compressibility. It would make the resulting UUIDs easier to guess, but that has never been a design goal of UUIDs anyway.
This method doesn't scale in a distributed system. Two machines running this algorithm will generate conflicts, which violates the "universal" part of UUID: the promise that you can run the same algorithm anywhere and you won't get a conflict. So if you want to make this unique, you have to make a remote query to generate the UID uniquely; an example of this is Snowflake. Sled (the database written in Rust) also does a variant of this, but it's not a distributed system per se (i.e. it generates identifiers that are unique globally on a single machine across process restarts, which is useful for transaction IDs).
The current SOTA as I understand it is to have the timestamp in the most significant digits and a random value in the least (ULID, UUIDv7, etc.). That way you get temporal locality for the UUID, which helps a bit for storage since it's typically an indexed column (not so much for lookups, because it's still a random value relative to other concurrently accessed UIDs entering the system).
Although CodesInChaos didn't specify the initialization, I assume the intent is that each machine would randomly initialize their counter, in order to prevent collisions. While collisions are possible in any scheme without coordination, I suspect that 128 bits is enough entropy for this to be extremely improbable.
Yup, and I am curious about the math on collisions with this scheme. Because 128 bits is a lot of entropy, I would expect it to have the same "it never happens with a good RNG" collision resistance as regular UUIDs, but a proof would be nice.
Yeah, intuitively I assume that the gaps between any two UUID guesses will be sufficiently large such that you can increment a lot without overlap but... until someone shows me the math, I don't love the idea.
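Here's a back-of-envelope union bound, not a proof (the machine and id counts are made-up numbers): two counter ranges of length m collide only if their random 128-bit starts land within m of each other, so each pair of processes overlaps with probability about 2m / 2^128.

```python
k = 1_000_000       # processes, each with a random 128-bit counter start
m = 1_000_000_000   # ids generated per process

# Union bound over all pairs of processes: each pair's intervals overlap
# with probability roughly 2m / 2^128.
pairs = k * (k - 1) / 2
p_collision = pairs * (2 * m) / 2**128

print(p_collision)  # ≈ 3e-18, vanishingly small even at this absurd scale
```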
I meant using a random initial value. This will make collisions less likely than v4 UUIDs (because the quadratic scaling of the birthday problem only applies to counter initialization), however once you get one collision you get many.
That’s an interesting approach, but what’s the use case you’re thinking of? There’s no correlation of UUIDs across nodes like there is with something like UUIDv7 or ULIDs.
I guess you'll still have batches of locality depending on the uptime of the server, which might be useful, and generation would of course be significantly faster, although I doubt many applications are really bottlenecked by UUID generation speed.
Within a node it would improve compression, so in theory your overall system would benefit. This diminishes as you increase the number of nodes. TBH if you have a very small number of nodes I think just having a synchronized counter is fine. You can make extremely fast counters if you allow for gaps.
Yep. Different UUIDs will have different characteristics when compressed. Still, if you can afford to just have '1, 2, 3, ..' as IDs that's hard to beat.
One idea is just to use fewer random bits in peerIDs. Yjs (https://docs.yjs.dev/) gets away with just 32 random bits. If you compromise and use 64 random bits, then even a very popular doc with 1 million lifetime peerIDs will have a < 10^-7 lifetime probability of collision.
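The quoted figure checks out with the standard birthday approximation:

```python
# Birthday approximation: p ≈ n(n-1) / (2 * 2^bits)
n = 1_000_000   # lifetime peerIDs in a very popular doc
bits = 64       # random bits per peerID
p = n * (n - 1) / (2 * 2**bits)
print(p)  # ≈ 2.7e-8, comfortably under 10^-7
```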
I don't quite get it. The normal string encoding of a UUID (d92b13b6-6780-4b98-82cc-d469dcdd42ab) clearly requires more bytes (36B) than the actual UUID itself (16B). The string encoding is for showing to users, it's not what you store in your DB, or your CRDT -- for that, you use the actual 16-byte value.
> Not only will your UUIDs take up 2x the space of a 64bit integer, they'll compress horribly by comparison. Integer compression is so good these days you can compress integers such that they basically end up taking up ~1/2 a byte instead of 8. Doing that with UUID v4s is fundamentally not going to work.
Yes, UUIDs are 128 bits, and 64 bit integers are 64 bits, but in both cases you're storing them as raw bytes -- or, maybe, integers, sure. Whatever compression works on your int64s will work just as well on your UUID int128s.
> Whatever compression works on your int64s will work just as well on your UUID int128s.
That's not true in general; it depends on how the ints are distributed. UUIDs are, by design, distributed in a way that's pretty much maximally unfriendly to compression.
I assume they are referring specifically to a monotonic integer sequence. This will compress and index very nicely (low entropy) whereas v4 UUIDs are random and will not (high entropy). This is the same principle that explains why gzip works well on text but not photos.
That is definitely not why the vast, vast majority of people choose UUIDs, though it is a nice property of them. The reason people choose them is because you can generate them without any synchronization in a distributed system.
To reply to your initial point:
> Whatever compression works on your int64s will work just as well on your UUID int128s.
This is not the case. Compressing non-random 64bit values (such as the ones in the article - they are sequential) is going to be way more effective than compressing UUIDs. UUIDs do not compress; you'll likely end up with an output larger than the input.
What the article does, essentially, is a dictionary encoding pass. The duplicate, expensive UUIDs are mapped to small, sequential integers.
I explained elsewhere that you can turn nearly arbitrary numbers of sequential integers into a few bits each.
This assumes that you're compressing IDs in batches. I'm sure that's a common situation, but I'm not sure why it's assumed to be the _default_ situation, nor am I sure why this is assumed to be the primary metric for IDs.
It's the situation we're presented with in the blog post. It's not an uncommon situation given that UUIDs are used in databases and, for many databases, compression is done at a block level.
> nor am I sure why this is assumed to be the primary metric for IDs.
Well if your integers are sequential you can encode huge numbers of them using diff + RLE in just a few bytes, likely far fewer than 1/2 a byte on average, for the right dataset (in theory you can store 1,2,3,4,5...10_000 in 2 bytes).
The linked papers there will talk about techniques that can be used to store multiple 32bit integers into a single byte, etc. Integer compression is pretty powerful if your data isn't random. The thing with UUIDs is that your data is pretty random - even a UUIDv7 contains a significant amount of random data.
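A toy version of that style of packing (delta encode, then pack every delta in the block at the minimal fixed bit width - the real libraries are SIMD-optimized, this is just the idea):

```python
def bitpack(values):
    """Pack non-negative ints at the minimal fixed bit width for the block."""
    width = max(v.bit_length() for v in values) or 1
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    nbytes = (len(values) * width + 7) // 8
    return width, acc.to_bytes(nbytes, "little")

# 1,000 sequential ids -> 999 deltas of 1 -> 1 bit each -> 125 bytes total.
ids = list(range(1000))
deltas = [b - a for a, b in zip(ids, ids[1:])]
width, packed = bitpack(deltas)
print(width, len(packed))  # 1 125
```

Random 128-bit values get no benefit from any of this, since every delta is itself a huge random number.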
Also had great success with Roaring Bitmaps, a bit further on the compression/ease of manipulation axis.
Edit: I see you're already linking to Daniel Lemire's github, so I guess you're familiar with them, just mentioning them, since it's in a comment thread like this one that I discovered them.
With all due respect (which is a lot; this blog post series is quite good!), it's not quite 98%. The webserver will gzip it in transit, and the first, most inefficient JSON bit of data gzips, from 255 bytes, down to 152 bytes. 152 is still 27% larger than the final, binary, format, of 119 bytes.
Not that I think this optimization is unnecessary - I very much enjoy products that use CRDTs under the hood, which undoubtedly went through this same sort of optimization and evolution from JSON to a binary format. But I wanted to give a very brief, bigger-picture view of where this fits in the system.
Transmitting it is not the only concern. Storing the history in a compressed format that allows fast reads is very important. GZIP needs to read the whole content at once, so incremental reads/writes — e.g. to implement undo — are expensive.
When I saw the headline, I thought it would be a neat article. Then when seeing the first data representation snippet, I thought "OK cool, that's what you conceptually have". Then I realized that it was the actual data representation.
The article is still neat, but more for the presentation. Of course you can get a huge reduction in data size if what you are starting out with is just about the most bloated representation you can imagine.
Author here — the original data structure is basically the plain state you'd get by implementing these CRDTs. So I agree that the original data structure uses a lot of space, but I also think the post is broadly representative of the type of improvement you can get if you started with a naive implementation.
TL;DR brotli does the best — it compresses the JSON down to 21.42kb, which is a 96.6% reduction. That's probably the most bang for the buck. If you take the final binary encoding format from the end of the post, though, brotli crunches it down to 5.99kb — a 99% reduction!
I don't know if the author is rediscovering or inspired by lossless image compression algorithms. For example, PNG has a palette mode that supports up to 256 colors in the palette, and of course so does GIF. The run-length encoding is probably just a primitive version of LZW.
I wonder if it might be better to simply use PNG with an extra chunk for the UUID and timestamp information.
Did anyone else experience the demos getting sluggish the further down I scrolled? They were pretty fast at the start, but as the CRDT got more optimised, the demos started showing more lag.
Author here! That makes sense; it only shows the algorithms that appear in the table, but it's running every algorithm under the hood. I tried to make them build on each other (for example, every one after the UUID table reuses that same table without recreating it), but I'm probably doing a lot of extra allocation.
I'll see if I can find any performance gains. Sorry about that!
I'm surprised the RLE example shows an improvement if you fill the entire artboard. It seems like it's only an improvement if there are null entries.
Why does a completely full art board benefit from RLE that only tracks null pixels?
Because the "run-length binary encoding" does not track only null pixels.
It switches the data from row-major / AoS to column-major / SoA form, and then each type gets its own RLE scheme: the timestamp column still RLEs only 0 values, but the color and writer columns switch to RLE-ing every value (so each entry is a pair of a value and a repetition count).
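So a completely full board still shrinks as long as neighboring pixels share values. A minimal sketch of that value-plus-count scheme on one column (my own toy encoder, not the article's exact format):

```python
def rle_encode(column):
    """Encode a column as [value, count] pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A fully painted board column: no nulls, but long runs of the same color.
colors = ["red"] * 60 + ["blue"] * 40
print(rle_encode(colors))  # [['red', 60], ['blue', 40]]
```

The worst case is a column where every adjacent value differs, which RLE actually makes bigger; the per-column schemes exist because timestamps and colors have very different run behavior.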
From 650kb to 14kb. Can we just have this as a new image format? Joking but not joking. I’m looking for an extremely packed image format and I think you might have just defined one. Also, those binary layouts with interactivity are awesome, don’t stop doing that. It’s a killer feature.