Iguana: fast SIMD-optimized decompression (sneller.io)
162 points by l2dy on June 1, 2023 | 40 comments



As a general-purpose compressor, Iguana decompresses a lot slower than advertised when tested with a typical data compression corpus.

It requires AVX-512 VBMI2, which is available only on Ice Lake / Tiger Lake / AMD Zen 4.

- Benchmark from encode.su experts: https://encode.su/threads/4041-Iguana-a-fast-vectorized-comp...

- Benchmark from the Iguana developers: https://github.com/SnellerInc/sneller/tree/master/cmd/iguana...

Silesia corpus, CPU: Xeon Gold 5320 (compression ratio / decompression speed):

  zstd -b3      3.186    943.9 MB/s
  zstd -b9      3.574   1015.8 MB/s
  zstd -b18     3.967    910.6 MB/s
  lz4 -b1       2.101   3493.8 MB/s
  lz4 -b5       2.687   3323.5 MB/s
  lz4 -b9       2.721   3381.5 MB/s
  iguana -t=0   2.58    4450   MB/s
  iguana -t=1   3.11    2260   MB/s

As you can see, iguana with entropy coding enabled (-t 1) has a similar compression ratio to zstd -3, but it decompresses more than twice as quickly. With entropy coding disabled (-t 0), iguana has a compression ratio roughly equivalent to lz4 -5 and decompresses about 33% faster.


> It requires: avx512-vbmi2, available only on skylake-X/Tiger-lake/AMD zen4

Isn't AVX-512 also available on Xeon, which is what most hosted servers (like AWS) run?


No modern Intel consumer CPU has AVX-512 anymore; they support at most the roughly 10-year-old AVX2 vector instruction set. That's because Intel wants to give every single consumer chip efficiency cores, which can't handle AVX-512, and even the performance cores have this feature removed due to it.

The modern AMD Zen 4 CPUs for consumers do support AVX-512 (including AVX512-VBMI2).


Odd how you completely dodged the question as asked. Every EC2 instance based on an Intel CPU has had AVX-512 since the release of the C/M5 family in 2017.


So .. what's the vector set that people should be targeting for wide compatibility these days?


Ideally, detect the CPU ISA (SSE/AVX2/AVX-512) dynamically at runtime and dispatch accordingly; this is what most of my projects do, e.g. [1] and [2] (see the sketch after the links).

- [1] https://github.com/powturbo/Turbo-Range-Coder

- [2] https://github.com/powturbo/Turbo-Base64
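A minimal sketch of what such runtime dispatch can look like in C with the GCC/Clang CPU-feature builtins; the decode_* names and bodies here are placeholders, not taken from the linked projects:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Placeholder implementations: real code would use AVX2/AVX-512 intrinsics. */
    static size_t decode_scalar(uint8_t *dst, const uint8_t *src, size_t n) { memcpy(dst, src, n); return n; }
    static size_t decode_avx2(uint8_t *dst, const uint8_t *src, size_t n)   { return decode_scalar(dst, src, n); }
    static size_t decode_avx512(uint8_t *dst, const uint8_t *src, size_t n) { return decode_scalar(dst, src, n); }

    typedef size_t (*decode_fn)(uint8_t *, const uint8_t *, size_t);

    /* Pick the best implementation once at startup, then reuse the pointer. */
    static decode_fn pick_decoder(void)
    {
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx512f"))   /* check "avx512vbmi2" too on newer compilers */
            return decode_avx512;
        if (__builtin_cpu_supports("avx2"))
            return decode_avx2;
        return decode_scalar;
    }

    int main(void)
    {
        decode_fn decode = pick_decoder();
        uint8_t in[4] = {1, 2, 3, 4}, out[4];
        printf("decoded %zu bytes\n", decode(out, in, sizeof in));
        return 0;
    }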


Honestly, you can probably stop supporting pre-AVX2 for a lot of use cases. AVX2 dates back to Haswell (2013), so those machines are pretty much at end of life. This also has the advantage of letting you assume native FMA hardware.


You can use function multiversioning (1) to provide multiple implementations of the same function for different targets, with the right one selected at runtime.

(1) https://releases.llvm.org/7.0.0/tools/clang/docs/AttributeRe...
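A hedged sketch of what that can look like with the target_clones attribute (supported by GCC, and by newer Clang versions than the one in the linked docs); the compiler emits one clone per listed target plus a resolver that picks the best one at load time. The function below is just an illustrative example, not from any of the projects discussed here:

    #include <stddef.h>

    /* The compiler builds an AVX-512, an AVX2 and a baseline version of this
       function and wires up an ifunc resolver to select one at load time. */
    __attribute__((target_clones("avx512f", "avx2", "default")))
    void add_arrays(float *dst, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = a[i] + b[i];   /* auto-vectorized differently per clone */
    }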


> and even the performance cores have to be downgraded for it.

Old rumours, it's no longer true.


Not really rumors, but also not necessarily true: the performance cores on consumer Intel parts no longer have AVX-512 enabled. This is because it's tough to implement on the smaller efficiency cores, and scheduling is more annoying with heterogeneous ISAs.


I understand downgrading as thermal throttling.


I did mean that the feature was removed; I edited the comment (replaced "downgraded" with "have this feature removed"; I actually originally typed "crippled" but wanted to phrase it more mildly). What I mean is that, to avoid heterogeneous cores, they chose to disable it on the performance cores rather than, e.g., emulate or double-pump it on the efficiency cores.

And they chose to give you efficiency cores plus performance cores without AVX-512 on all consumer CPUs, rather than, as would be my preference, also selling some with only performance cores with AVX-512, like their previous-generation CPUs. I'd be OK with efficiency cores on a laptop, sure, but not on a desktop machine.

But AMD already provides plenty of fast cores with AVX-512 now, so I'm satisfied; I'm simply not an Intel user anymore. Of course Intel's choice still affects me, since the fewer consumers have AVX-512, the less software will make use of it.


All major cloud providers (AWS, Azure and GCP) provide instance types that support AVX512-VBMI2. Iguana targets large amounts of data that are decompressed often and where speed is of the essence. Those are typical server scenarios, so the AVX512-VBMI2 requirement isn't really an issue in common-case scenarios.

DISCLAIMER: I work for Sneller


Not only AVX-512, but AVX-512 + VBMI2


VBMI2 is not available on Skylake-X.


Sorry, Ice Lake, not Skylake.


It looks similar to LZSSE. We tried it in ClickHouse, but then removed it:

https://github.com/ClickHouse/ClickHouse/pull/24424

Reasons:

- the decompression speed is slightly better than lz4's, but the compression speed is low;

- the code was imperfect, and the fuzzer found issues.

The LZSSE library was abandoned five years ago, but it has great blog posts to read: https://github.com/ConorStokes/LZSSE

Iguana looks promising, but the AVX-512 requirement is too restrictive. We need something that works on both x86 and ARM. Also, integrating Go assembly into other software is not easy. And the AGPL license makes it incompatible.


Looks like they switched from AGPL to Apache 2 last week?

https://github.com/SnellerInc/sneller/commit/05410a85f900e02...


This is amazing news!

PS: I'd be interested if someone has experience integrating the Go assembly dialect into a C or C++ build system. We needed it for another experiment: https://github.com/ClickHouse/ClickHouse/issues/45130#issuec...


A C version of the library with a portable implementation and AVX-512 optimizations is planned.


Technically this looks really impressive. It's great to see a new compression approach that supports extremely high-performance decompression, with a high-performance open-source implementation.

re: winning adoption of new compression approaches, there's an interesting podcast interview [1] with Yann Collet (of lz4 / zstd):

Some factors Yann discussed that helped lz4 and zstd gain traction:

- permissive licensing (BSD);

- implementation in C, which gives the widest support for including it in other software ecosystems;

- open development and paying attention to issues raised by users of the software;

- the new compression approach being able to beat an existing popular approach in some use cases with no downside. For example, if a hypothetical new compression approach has 200% faster decompression but a 10% worse compression ratio, there's friction in introducing it into an existing system, as the new approach might first require the purchase and deployment of additional storage. Whereas a new approach that is 50% faster with exactly the same or a slightly better compression ratio can be adopted with much less friction.

It looks like the Iguana code has recently been relicensed as Apache instead of AGPL (which is used for the rest of the sneller repo). That could lower the barrier for other projects to consider adopting Iguana, although there are still dependencies from the Iguana code on code in AGPL-licensed files elsewhere in the sneller repo.

[1] https://corecursive.com/data-compression-yann-collet/


Hi! Sneller CTO here.

Thanks for pointing out the dependency on the AGPL bits -- I'm going to fix that ASAP. (There's just a small utility library we've got to re-license as Apache-2 as well.)

Once we're comfortable committing to the format for the long haul, we're planning on publishing a C implementation as well. There are still a few tweaks we've been evaluating to try to make the compression ratio a tiny bit more competitive.


This is very cool, thank you for releasing it! If you had to guess, how much of the performance depends on AVX-512 specifically? If this could run reasonably well on AVX2, IMO this would be a really great general successor to LZ4.


That's a good question.

For the pure-LZ77 Iguana variant (no rANS encoding), most of the decoding time is spent moving memory around rather than decoding the match+length tuples from the input stream, which suggests the performance difference wouldn't be that great if we were only on AVX2. That said, AVX-512 has a bunch of instructions that are super helpful for parsing our base254 integers quickly. If I had to take a wild guess, I'd say it would cost an additional 15%.

One sacrifice we made in the design is that the minimum match offset distance is always 32 bytes, which means we can always perform a literal or match copy by starting with a ymm register load + store. This hurts the compression ratio a bit but it helps performance immensely, and for that reason alone I suspect we'd still come out ahead of lz4 even without AVX-512.
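To illustrate the 32-byte minimum offset point: because a match source is guaranteed to be at least 32 bytes behind the write position, every literal or match copy can proceed in whole ymm-sized chunks without the source and destination of a single chunk overlapping. A hedged sketch with AVX2 intrinsics (not Iguana's actual code, and it assumes an output buffer with slack for overshoot):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Copy len bytes in 32-byte (ymm-sized) steps. Assumes the output buffer
       has enough slack that overshooting by up to 31 bytes is safe, a common
       trick in fast LZ decoders. */
    static void copy32_chunks(uint8_t *dst, const uint8_t *src, size_t len)
    {
        for (size_t i = 0; i < len; i += 32) {
            __m256i v = _mm256_loadu_si256((const __m256i *)(src + i));
            _mm256_storeu_si256((__m256i *)(dst + i), v);
        }
    }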


I remember reading that AVX-512 hurt the ability of Intel CPUs to turbo and run other tasks in parallel. This was a few years ago and I would hope it’s not the case anymore, especially since AMD has managed to add AVX-512 support too.

Have you done any testing with running multiple decompression tasks in parallel, or just running a single decompression task while at the same time running other tasks like maybe a web server?


The initial AVX-512 implementation brought a lot of issues with it. The biggest problem was that Intel used 512-bit ALUs from the beginning, and I think that was just too much at the time (the initial 14 nm node). Even AMD's Zen 4 architecture, which came years after Skylake-X, uses 256-bit ALUs for most operations, except complex shuffles, which use a dedicated 512-bit unit to make them competitive. And from my experience, AMD's Zen 4 AVX-512 implementation is a very competitive one; I just wish it had faster gathers.

Our typical workload at Sneller uses most of the computational power of the machine: we typically execute heavy AVX-512 workloads on all available cores and we measure our processing performance in GB/s per core. This is generally why we needed faster decompression: before Iguana, almost 50% of the computational power was spent in the zstd decompressor, which is scalar. The rest of the code is written in Go, but it's insignificant compared to how much time we spend executing AVX-512 now.

(I work for Sneller)


The fewer implementations in C, the better.

Unless they provide certified QA proof of memory-corruption-free code, with the corresponding data-fuzzing tests.

OS-ABI compliant, by all means; as an implementation language, better not.

Thankfully the ongoing cybersecurity bills will improve on this.


Thank you, promising work!

Question: how was zstd built in the tests?

In other words, was the possibility of a 2-stage PGO+LTO optimized build taken into account?

(The Alpine zstd package claims it is "+30% faster on x86_64 than the default makefile" [1].)

[1] ""

# 2-stage pgo+lto build (non-bootstrap), standard meson usage.

# note that with clang,

# llvm-profdata merge --output=output/somefilename(?) output/.profraw

# is needed.

# believe it or not, this is +30% faster on x86_64 than the default makefile build (same params)..

# maybe needs more testing

# shellcheck disable=2046

""*

https://github.com/alpinelinux/aports/blob/master/main/zstd/...


Most time-series / analytical databases and similar systems are already using or switching to integer compression [1], where you can compress/decompress several times faster than with general-purpose compressors (>100 GB/s; see TurboBitByte in [2]).

[1] - https://github.com/powturbo/TurboPFor-Integer-Compression

[2] - https://github.com/powturbo/TurboPFor-Integer-Compression/is...


Is this affected by Microsoft's patent on various rANS coding and decoding techniques?

If not, how does it avoid the (rather vague) claims?

https://patents.google.com/patent/US11234023B2/en


Decompression speed looks good, but in my experience, once you get past a certain point (~X000 MB/s) performance gains become pretty marginal in real-world applications. I'd like to see compression speeds, and performance on AVX if AVX-512 is not available.


I don't agree with this. The whole point of tools like this is that we don't live in the age of dial-up internet anymore, and haven't for a long time.

Back then you wanted maximum compression, like xz, because the slowest part of the process was the download. Now, especially with xz's awful decompression rate, the slowest part is the extraction. In many cases it's better to have a 9% larger size in exchange for a 500% increase in extraction speed, like you get with zstd.


There are definitely a lot of workloads that are pure data munging, where you don't want to store the whole bloated JSON (or protobuf) mess on e.g. S3, where you're paying by the byte, but where at the same time key parts of your infra are spending double-digit percentages of their CPU time just decompressing stuff.

Sure, you can say the problem is JSON, protobuf, whatever. But in a huge organization with hundreds or thousands of types of documents, it's needless friction to require custom optimized formats for each one.

So a faster decompressor is often a no-brainer of a win and can have substantial dollar values attached.


> in my experience once you get past a certain point (~X000 MB/s) performance gains become pretty marginal in real world applications

There might be no latency benefit to that, but there's always a throughput gain from using fewer CPU cycles overall: the CPU cycles you would have spent decompressing can be used for other things.

For that reason, single-core performance is still nice to have, and it's always great to have performance-centric open-source code so that we all get to find out "how the hell did they do it?" (it is very readable assembly [1]).

> I'd like to see compression speeds and performance on AVX if AVX-512 is not available.

ARM is probably more interesting in the elastic-compute world right now; the AVX world is unlikely to get cheaper.

And unlike before, I don't need to wait for a vendor to sell me an SoC; Gravitons are already here.

[1] - https://github.com/SnellerInc/sneller/blob/master/ion/zion/i...


Yeah, I think these benchmarks often don't capture the constant factor. The rate at which 4 KB strings can be decompressed probably matters more than the rate at which 400 MB files can be decompressed.

Benchmarks are hard, though.
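For what it's worth, a minimal sketch of the small-block case (using the lz4 C API; the block size and iteration count are arbitrary choices for illustration), where per-call overhead shows up in a way that a single big-file benchmark hides:

    #include <lz4.h>
    #include <stdio.h>
    #include <time.h>

    #define BLOCK 4096
    #define ITERS 100000

    int main(void)
    {
        char src[BLOCK], comp[LZ4_COMPRESSBOUND(BLOCK)], out[BLOCK];
        for (int i = 0; i < BLOCK; i++)              /* mildly compressible input */
            src[i] = (char)(i % 64);
        int csize = LZ4_compress_default(src, comp, BLOCK, (int)sizeof comp);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < ITERS; i++)              /* many small, independent blocks */
            LZ4_decompress_safe(comp, out, csize, BLOCK);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f MB/s on %d-byte blocks\n",
               (double)BLOCK * ITERS / secs / 1e6, BLOCK);
        return 0;
    }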


What do people use these for?


DISCLAIMER: I'm one of the Sneller developers

Compression is required to make more efficient use of the available bandwidth and RAM. Although AWS provides up to 100 Gb/s networking, we improve the net throughput using compression. Also, we cache data on the computing nodes, so storing it in compressed form allows us to keep more data on each node.

Our query engine is often limited by memory bandwidth when using uncompressed data. So it needs a lot of data and it needs it fast. Because we cache data in compressed form, decompression is always needed, so it should be as fast as possible. That's the primary reason we developed Iguana: a faster decompressor results in a faster query engine.


I don't use Iguana in particular, but zstd and lz4 in Marginalia Search.

First, I use it to store and retrieve crawl data faster. Mechanical hard drives are much slower than these decompression algorithms, so for sequential access you can make a cheap mechanical drive effectively faster and bigger, as long as you've got some CPU cycles to spare to decode the data. (This is also true for the network; basically anything outside of RAM, as long as the access is nice and sequential, there's probably an argument for compressing it.)

I also use them to keep more stuff in RAM than should otherwise be possible. I do transparent compression/decompression of some large strings, e.g. document HTML, which typically compresses by a factor of about 10x. These algorithms are so fast that the performance hit of decompressing the strings as they're used is something like a 10% run-time pessimization in exchange for significantly slashing RAM usage.
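The pattern is simply to keep the compressed bytes plus the original length, and decompress on access. A hedged sketch in C with the lz4 API purely for illustration (the struct and function names are made up, and error handling is omitted):

    #include <lz4.h>
    #include <stdlib.h>

    typedef struct {
        char *data;       /* compressed bytes */
        int   comp_size;  /* compressed length */
        int   orig_size;  /* original length, needed for decompression */
    } packed_str;

    static packed_str pack(const char *s, int len)
    {
        packed_str p = { malloc(LZ4_compressBound(len)), 0, len };
        p.comp_size = LZ4_compress_default(s, p.data, len, LZ4_compressBound(len));
        return p;
    }

    static char *unpack(const packed_str *p)   /* caller frees the result */
    {
        char *out = malloc((size_t)p->orig_size + 1);
        LZ4_decompress_safe(p->data, out, p->comp_size, p->orig_size);
        out[p->orig_size] = '\0';
        return out;
    }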


Someone else mentioned the recent CoRecursive podcast episode. One benefit discussed there was compression of caches close to the user's physical location. Facebook has a lot of machines around the world dedicated to this sort of cache, and they need a balance between compression ratio and compression/decompression throughput. There's no point using a great compression algorithm to achieve an amazing compression ratio if you end up being CPU-bound when you need to decompress it later on.


You can check my presentation on this topic. It covers the case when compression has to be faster than memory bandwidth: https://presentations.clickhouse.com/meetup53/optimizations/



