I ported a tool 'detools' which generates binary patches and compresses them using heatshrink to ESP32 and got some great results! Use case: Compressed Delta OTA Updates for ESP32
Once did a prototype that had to work with a really memory poor device. Compression and speed is meh, but still worth it (small serial interfaces are no racing dogs either).
Also very simple to understand and port and play around with, and to compare with other algorithms. More or less took this project as an excuse to learn some compression basics. Can recommend
I’m curious about embedded and constrained environments, do you mind telling us a bit more? What was the project? What hardware was involved? What tooling and OS’s did you use? What were the most challenging technical problems?
Not the OP, but check out devices like Microchip's PIC (8 bits) or AVR (8 bits), or ARM Cortex M0 or Cortex M3/4 devices. On cheaper devices you can get down to a few kB of RAM and Flash memory, so you need to be mindful of what you use!
When working with embedded devices like that, the main thing is you can't just easily import libraries or use many layers of abstraction (limited stack depth too). So generally everything is done at a low level.
> What were the most challenging technical problems?
While technically the customer asked for the solution to problem A, they wanted us to solve problem B, but we failed to invoke our inner patio11, and as a result delivered what they said instead of what they meant. A frustrating lesson for me. At that point in my career I should have known better already.
ETA: sibling comment about Cortex-M0 is spot on. You really start counting bytes. Some kiB of RAM sound a lot, but when you have to juggle several tasks, and don't have to option to write from scratch but instead reuse existing old code you've lying around, stuff gets dirty pretty quickly. Other team member was doing the hard work, I was just providing tooling for them.
And some microcontrollers have even less RAM. Quite a few AVRs have 1k or 512 bytes of RAM. A few don't have RAM (just CPU registers) but that's very different again.
This seems really nice, very ambitious to do it as a "public" state machine that you sink data into and poll compressed bytes out. I can imagine that makes it really nice if you're trying to decompress something in a real-time environment, to cap the execution time. Cool.
Another related algorithm is LZJB, which I've successfully used in embedded environments in production. I have a pure Python implementation [1] that can be useful for "offline" tool development, i.e. when building images for consumption by some embedded target, just thought I'd mention that.
What would be a good approach to compress not just a steam of arbitrary bytes, but a stream of 4-byte numbers which are similar to each other (like some sensor output)?
Look into time-series compression. Delta-encoding, mentioned by sibling, is a very old but effective technique also known as differential PCM. You can also delta-encode delta-encoded values, which works especially well for values whose delta is constant (classic example: timestamped data). The advantage of these over general purpose compressors is that they require basically no resources to implement. TFA is "low memory usage (as low as 50 bytes)", but delta encoding needs just one register.
You could start by encoding the deltas using variable length encoding, so a 0 delta could be as little as 1 bit. But anything off-the-shelf should be able to compress this type of data well.
I used tiny lz4 variants in one of my embedded projects, but using delta decoding of known default values is much better beforehand. Will look at that encoder also
The linked blog post [1] contains a benchmark against the Canterbury Corpus [2]. It is hard to test against other compressors because of its very low memory requirement, but LZO1X-1(11) (`lzo1x -11` in LZBench [3]) comes close as both LZO1X-1(11) and HS 13,4 targets 8 KB of work memory. The result (higher is better):
It does seem to have a good compression compared with LZO1X. I think this is mostly because LZO1X has a requirement that literals should be aligned at byte boundary for the performance but Heatshrink doesn't.
ESP32 port: https://github.com/ESP32-Musings/esp32_compressed_delta_ota_...
detools: https://github.com/eerimoq/detools