Hacker News
Making CRDTs More Efficient (jakelazaroff.com)
275 points by jakelazaroff on Oct 17, 2023 | 57 comments



This is SO good. The interactive demos, including the way it presents the binary format towards the end, are clear and really engaging. I love this kind of article.


Thank you so much! I’m a longtime reader of your blog, that really means a lot.


It's a great presentation and an interesting topic... one nit, though: it's good form to spell out an acronym or initialism as words first and then append the initialism at the end, unless it's quite a well-known one like PDF (usually, I suppose, since it could also mean probability distribution function on a maths page).


OP implements their own RLE-based compression scheme. I bet you'd get better results from emitting an uncompressed binary format, and then passing the whole thing through a general-purpose compressor like zstd.

I'd also be interested to see what zstd (or even just gzip) would make of their original un-optimised format - I bet you'd get most of the way there "for free".
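If anyone wants to try this quickly, here's a rough sketch in Node/TypeScript using the built-in zlib module. The `state` object is just a stand-in for the exported CRDT state, not anything from the post:

  // Sketch only: measure gzip and brotli on the raw JSON state.
  // `state` is a placeholder for the uncompressed CRDT state.
  import { gzipSync, brotliCompressSync } from "node:zlib";

  const state: Record<string, unknown> = { /* paste the exported JSON here */ };
  const json = Buffer.from(JSON.stringify(state));

  console.log("raw bytes:   ", json.length);
  console.log("gzip bytes:  ", gzipSync(json).length);
  console.log("brotli bytes:", brotliCompressSync(json).length);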


I was curious about this myself. I extracted the json for the "uncompressed" image:

https://gist.github.com/jmfd/8dbb96fcd8a1ba1e8ad6e9167bd70ce...

And then tried these commands against it:

  gzip:        39.56kb    (93.7%)
  zstd:        35.86kb    (94.3%)
  gzip -9:     35.51kb    (94.4%)
  bzip2:       24.35kb    (96.1%)
  zstd -19:    24.01kb    (96.2%)
  brotli:      21.42kb    (96.6%)
(I'm not up-to-date on default web server configs, but I imagine most would automatically transmit with gzip over the wire for json?)


JSON compresses well in general. Can you do the same with a binary version?


I only have the "final" 13.5kb version to test against. It compresses very well:

  gzip:     6.95kb    (98.9%)
  brotli:   5.99kb    (99.0%)


Man, I should have done this — would've been a much nicer number for the headline. Thanks for the analysis, though, this is really cool!


I compared it to the image in PNG and QOI formats [0]

    img:         13.552kb (100%)
    img.png:      1.877kb (13.8%)
    img.qoi:     10.316kb (76.1%)
    img.qoi.gz:   1.167kb (8.6%)
While it does compress by ~50%, it's not a format optimized for compression, unlike PNG or QOI.

[0]: https://qoiformat.org/


Given equivalent data stored in both JSON and BSON format, I would expect them both to compress down to blobs of roughly equivalent size. This is because both encode roughly the same amount of information, so the compression algorithm will tend to compress down to the same final result. I haven't run this as an experiment, though; that would be fun.


Is it faster though?


the storage format may not be the only consideration - it may be preferable to have an in-memory representation that is also inexpensive.


General compression algorithms like zstd will do way, way better on previously encoded data. Sorting, RLE, etc, are all going to lead to a smaller resulting binary after going through the generalized algorithm.

You're also going to be relying on heuristics that may fail. Zstd has no idea that your UUIDs are going to repeat, it doesn't know to dictionary encode them or to create frequency mappings. What it ends up doing is unlikely to be as effective as doing the work ahead of time, then allowing zstd to find the interesting and complex patterns that arise out of your encoded format.


I wonder if using the rsync algorithm would also help? Looking at the state changes made to one CRDT on one machine and another, both states still share a lot in common. So to merge A on machine 1 with B on machine 2, you could rsync A over to machine 2 without overwriting B, only having to send the deltas. My understanding is that there are CRDTs that are able to send over just the operations performed, but for simplicity, for a full-state CRDT, this might be easier and almost as efficient as a more specific algorithm.


Honestly, a good showcase of how expensive UUIDs are. If you can perform a mapping to integers you should really consider it. Not only will your UUIDs take up 2x the space of a 64bit integer, they'll compress horribly by comparison. Integer compression is so good these days you can compress integers such that they basically end up taking up ~1/2 a byte instead of 8. Doing that with UUID v4s is fundamentally not going to work.

The article mentions RLE, which is also incredibly good for integer IDs if you diff-encode them, since ids are typically sequential with no or small gaps. Diff + RLE can turn your encoded structure into ~1 byte.
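As a rough illustration of that diff + RLE idea (a sketch, not the article's actual encoding):

  // Illustration only: delta-encode sorted integer IDs, then run-length
  // encode the deltas. Sequential IDs collapse to a handful of pairs.
  function diffRle(ids: number[]): [number, number][] {
    const pairs: [number, number][] = [];
    let prev = 0;
    for (const id of ids) {
      const delta = id - prev;
      prev = id;
      const last = pairs[pairs.length - 1];
      if (last && last[0] === delta) last[1]++;
      else pairs.push([delta, 1]);
    }
    return pairs;
  }

  // 10,000 sequential IDs become a single (delta, count) pair:
  console.log(diffRle(Array.from({ length: 10_000 }, (_, i) => i + 1)));
  // => [[1, 10000]]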

Also, incredible website. The interactivity is so fun.

I did something similar to this very recently where I took a dataset and just continuously applied data encoding methods to it. It was much smaller in memory and compressed with zstd to a smaller size as well. I've found that 'prepping' data before using a generalized compression algorithm has significant gains both for encode/decode performance + the output size. These were, incidentally, CRDT operations :D

Your blog posts are great, keep it up


If you care about compressability, you can simply use a different way of generating UUIDs, instead of switching to another type. For example initializing a 128-bit counter when starting the process and incrementing it for each number generated. This should compress almost as well as sequential integers, while still being collision resistant without coordination or persistent state.
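A minimal sketch of what that might look like (assuming, as clarified downthread, that the counter is seeded randomly once per process):

  // Sketch: seed a 128-bit counter from a CSPRNG once per process,
  // then hand out sequential values. IDs from one process compress
  // like a counter; cross-process collision odds are birthday math.
  import { randomBytes } from "node:crypto";

  let counter = BigInt("0x" + randomBytes(16).toString("hex"));
  const MASK = (1n << 128n) - 1n;

  function nextId(): string {
    counter = (counter + 1n) & MASK; // wrap at 128 bits
    return counter.toString(16).padStart(32, "0");
  }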


I have never seen this strategy described. Is there any literature on it? Cause it does seem that this would be as good as the other algorithms for generating UUIDs, but faster and with the benefit of compressing well. It would make the resulting UUIDs easier to guess, but that has never been a design goal of UUIDs anyway.


This method doesn't scale in a distributed system. Two machines running this algorithm will generate conflicts, which violates the "universal" part of UUID: the point is that you can run the same algorithm anywhere and you won't get a conflict. So if you want to make this unique, you have to make a remote query to generate the ID uniquely; Snowflake is an example of this. Sled (the database written in Rust) also does a variant of this, but it's not a distributed system per se (i.e. it generates identifiers that are globally unique on a single machine across process restarts, which is useful for transaction IDs).

The current SOTA as I understand it is to have the timestamp in the most significant bits and a random value in the least significant (ULID, UUIDv7, etc.). That way you get temporal locality for the UUID, which is typically an indexed column, and that helps a bit for storage (not so much for lookups, because it's still a random value relative to the other concurrently accessed IDs being put into the system).


Although CodesInChaos didn't specify the initialization, I assume the intent is that each machine would randomly initialize their counter, in order to prevent collisions. While collisions are possible in any scheme without coordination, I suspect that 128 bits is enough entropy for this to be extremely improbable.


Yup, and I am curious about the math on collisions with this scheme. Cause 128 bits is a lot of entropy, and I would expect this scheme to have similar collision resistance of "it never happens with good RNG" as regular uuids, but a proof would be nice.


Yeah, intuitively I assume that the gaps between any two UUID guesses will be sufficiently large such that you can increment a lot without overlap but... until someone shows me the math, I don't love the idea.


I meant using a random initial value. This will make collisions less likely than with v4 UUIDs (because the quadratic scaling of the birthday problem only applies to counter initialization); however, once you get one collision, you get many.


That’s an interesting approach, but what’s the use case you’re thinking of? There’s no correlation of UUIDs across nodes like there is with something like UUIDv7 or ULIDs.

I guess you'll still have batches of locality depending on the uptime of the server, which might be useful, and generation would of course be significantly faster, although I doubt many applications are really bottlenecked by UUID generation speed.


Within a node it would improve compression, so in theory your overall system would benefit. This diminishes as you increase the number of nodes. TBH if you have a very small number of nodes I think just having a synchronized counter is fine. You can make extremely fast counters if you allow for gaps.


Yep. Different UUIDs will have different characteristics when compressed. Still, if you can afford to just have '1, 2, 3, ..' as IDs that's hard to beat.


One idea is just to use fewer random bits in peerIDs. Yjs (https://docs.yjs.dev/) gets away with just 32 random bits. If you compromise and use 64 random bits, then even a very popular doc with 1 million lifetime peerIDs will have a < 10^-7 lifetime probability of collision.
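The rough birthday-bound arithmetic behind that number (a sketch, not a rigorous proof):

  // Birthday bound: with n random 64-bit IDs,
  // P(collision) <= n * (n - 1) / 2 / 2^64.
  const n = 1_000_000;
  const bound = (n * (n - 1)) / 2 / 2 ** 64;
  console.log(bound); // ~2.7e-8, i.e. below 1e-7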


> how expensive UUIDs are

I don't quite get it. The normal string encoding of a UUID (d92b13b6-6780-4b98-82cc-d469dcdd42ab) clearly requires more bytes (36B) than the actual UUID itself (16B). The string encoding is for showing to users, it's not what you store in your DB, or your CRDT -- for that, you use the actual 16-byte value.

> Not only will your UUIDs take up 2x the space of a 64bit integer, they'll compress horribly by comparison. Integer compression is so good these days you can compress integers such that they basically end up taking up ~1/2 a byte instead of 8. Doing that with UUID v4s is fundamentally not going to work.

Yes, UUIDs are 128 bits, and 64 bit integers are 64 bits, but in both cases you're storing them as raw bytes -- or, maybe, integers, sure. Whatever compression works on your int64s will work just as well on your UUID int128s.


> Whatever compression works on your int64s will work just as well on your UUID int128s.

That's not true in general; it depends on how the ints are distributed. UUIDs are, by design, distributed in a way that's pretty much maximally unfriendly to compression.


I assume they are referring specifically to a monotonic integer sequence. This will compress and index very nicely (low entropy) whereas v4 UUIDs are random and will not (high entropy). This is the same principle that explains why gzip works well on text but not photos.


Usually you use UUIDs when you don't want your IDs to be guessable.


That is definitely not why the vast, vast majority of people choose UUIDs, though it is a nice property of them. The reason people choose them is because you can generate them without any synchronization in a distributed system.

To reply to your initial point:

> Whatever compression works on your int64s will work just as well on your UUID int128s.

This is not the case. Compressing non-random 64-bit values (such as the ones in the article - they are sequential) is going to be way more effective than compressing UUIDs. UUIDs do not compress; you'll likely end up with a larger output than the input.

What the article does, essentially, is a dictionary encoding pass. The duplicate, expensive UUIDs are mapped to small, sequential integers.

I explained elsewhere that you can turn nearly arbitrary numbers of sequential integers into a few bits each.
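A sketch of that dictionary-encoding pass (purely illustrative; the function name and shape are made up, not the post's code):

  // Illustration: map each distinct UUID to a small sequential index,
  // then store the index per operation instead of the 16-byte UUID.
  function dictionaryEncode(uuids: string[]): { table: string[]; indices: number[] } {
    const table: string[] = [];
    const lookup = new Map<string, number>();
    const indices = uuids.map((id) => {
      let idx = lookup.get(id);
      if (idx === undefined) {
        idx = table.length;
        table.push(id);
        lookup.set(id, idx);
      }
      return idx;
    });
    return { table, indices };
  }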


This assumes that you're compressing IDs in batches. I'm sure that's a common situation, but I'm not sure why it's assumed to be the _default_ situation, nor am I sure why this is assumed to be the primary metric for IDs.


> I'm sure that's a common situation

It's the situation we're presented with in the blog post. It's not an uncommon situation given that UUIDs are used in databases and, for many databases, compression is done at a block level.

> nor am I sure why this is assumed to be the primary metric for IDs.

It isn't.


Thank you so much! I spent a long time on these diagrams haha.


> Integer compression is so good these days you can compress integers such that they basically end up taking up ~1/2 a byte instead of 8

This needs an explanation.


Well if your integers are sequential you can encode huge numbers of them using diff + RLE in just a few bytes, likely far fewer than 1/2 a byte on average, for the right dataset (in theory you can store 1,2,3,4,5...10_000 in 2 bytes).

But for other integer datasets there's FastPFOR

https://github.com/lemire/FastPFor

The linked papers there will talk about techniques that can be used to store multiple 32bit integers into a single byte, etc. Integer compression is pretty powerful if your data isn't random. The thing with UUIDs is that your data is pretty random - even a UUIDv7 contains a significant amount of random data.


I've also had great success with Roaring Bitmaps, which sit a bit further along the compression/ease-of-manipulation axis.

Edit: I see you're already linking to Daniel Lemire's github, so I guess you're familiar with them, just mentioning them, since it's in a comment thread like this one that I discovered them.


With all due respect (which is a lot; this blog post series is quite good!), it's not quite 98%. The webserver will gzip it in transit, and the first, most inefficient JSON representation gzips from 255 bytes down to 152 bytes. 152 is still 27% larger than the final binary format of 119 bytes.

Not that I think this optimization is unnecessary - I very much enjoy products that use CRDTs under the hood, and they undoubtedly went through this same sort of optimization and evolution from JSON to a binary format. But I wanted to give a very brief, bigger-picture view of where this fits in the overall system.


Transmitting it is not the only concern. Storing the history in a compressed format that allows fast reads is very important. GZIP needs to read the whole content at once, so incremental reads/writes — e.g. to implement undo — are expensive.


When I saw the headline, I thought it would be a neat article. Then when seeing the first data representation snippet, I thought "OK cool, that's what you conceptually have". Then I realized that it was the actual data representation.

The article is still neat, but more for the presentation. Of course you can get a huge reduction in data size if what you are starting out with is just about the most bloated representation you can imagine.

https://thedailywtf.com/articles/The-Speedup-Loop


Author here — the original data structure is basically the plain state you'd get by implementing these CRDTs. So I agree that the original data structure uses a lot of space, but I also think the post is broadly representative of the type of improvement you can get if you started with a naive implementation.


What sort of savings would you get with basic compression algorithms...?


Looks like someone else in this thread tried it out: https://news.ycombinator.com/item?id=37920255

TL;DR brotli does the best — it compresses the JSON down to 21.42kb, which is a 96.6% reduction. That's probably the most bang for the buck. If you take the final binary encoding format from the end of the post, though, brotli crunches it down to 5.99kb — a 99% reduction!


Would also be interesting to run-length-decode the final file and compress that, to see how much (or how little) the RLE helps.

Though I don’t think it’s super useful over the wire, LZ4 could also be an interesting thing to look at.


What would be an accurate and neutral title?

(Edit: I've taken out the "98%" bit per the HN guidelines: https://news.ycombinator.com/newsguidelines.html)


Whoops, sorry — I didn't realize that was against the guidelines!


I don't know if the author is rediscovering lossless image compression algorithms or was inspired by them. For example, PNG has a palette mode that supports up to 256 colors in the palette, and of course so does GIF. The run-length encoding is probably just a primitive version of LZW.

I wonder if it might be better to simply use PNG with an extra chunk for the UUID and timestamp information.


Recent and related:

Building a collaborative pixel art editor with CRDTs - https://news.ycombinator.com/item?id=37832432 - Oct 2023 (23 comments)

An interactive intro to CRDTs - https://news.ycombinator.com/item?id=37764581 - Oct 2023 (130 comments)

Normally we downweight multipart posts after one has had significant frontpage attention (see https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que... and https://hn.algolia.com/?dateRange=all&page=0&prefix=false&so... for why), but this seems to be something of an exception.


I love the presentation; however, this is mostly optimization of the data transfer. I was hoping for something CRDT-specific.


For the uninitiated, this might be a better starting point https://jakelazaroff.com/words/an-interactive-intro-to-crdts...


Did anyone else experience the demos getting sluggish the further down I scrolled? It was pretty fast at the start, but as the CRDT got more optimised, the demos started showing more lag.


Author here! That makes sense; each demo only runs the algorithms that show up in its table, but that means the later ones are running every algorithm so far under the hood. I tried to make them build on each other (so, for example, every one after the UUID table reuses that same table without recreating it), but I'm probably doing a lot of extra allocation.

I'll see if I can find any performance gains. Sorry about that!


So it’s essentially using a few compression techniques:

1. Palettes, aka dictionary compression
2. Storing as column arrays + run-length encoding
3. Binary encoding instead of ASCII

All standard techniques used in modern compression, especially columnstore databases.

Very well written and approachable by a beginner.


I'm surprised the RLE example shows an improvement with RLE if you fill the entire artboard. Seems like it's only an improvement if there are null entries.

Why does a completely full art board benefit from RLE that only tracks null pixels?


Because the “run length binary encoding” does not track only null pixels.

It switches the data from row-major / AoS to column-major / SoA form, then each type gets its own RLE scheme: the timestamp column keeps RLE-ing only 0 values, but the color and writer columns switch to RLE-ing every value (so each entry is a pair of a value and a repetition count).
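A rough sketch of that per-column RLE (illustrative only, not the post's actual code), where every entry is a (value, count) pair:

  // Illustration: per-column RLE where each entry is a (value, count)
  // pair. A full artboard still collapses to long runs wherever
  // neighbouring pixels share the same color or writer.
  function rlePairs<T>(column: T[]): [T, number][] {
    const runs: [T, number][] = [];
    for (const value of column) {
      const last = runs[runs.length - 1];
      if (last && last[0] === value) last[1]++;
      else runs.push([value, 1]);
    }
    return runs;
  }

  // e.g. a fully painted row of mostly one color:
  console.log(rlePairs(["#ff0000", "#ff0000", "#ff0000", "#00ff00"]));
  // => [["#ff0000", 3], ["#00ff00", 1]]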


That protocol/encoding visualizer is beautiful. What fantastic work!


From 650kb to 14kb. Can we just have this as a new image format? Joking but not joking. I’m looking for an extremely packed image format and I think you might have just defined one. Also, those binary layouts with interactivity are awesome, don’t stop doing that. It’s a killer feature.



