Hacker News | klauspost's comments

About the versioning, glad you spotted it anyway. There isn't as much use of the gzhttp package compared to the other ones, so the bar is a bit higher for that one.

Also making good progress on getting a slimmer version of zstd into the stdlib and improving the stdlib deflate.


Yeah, I make it a habit to read the changelogs of every update to every direct dependency. I was anticipating this change for years, thanks for doing it!

> Also making good progress on getting a slimmer version of zstd into the stdlib

Awesome! Please let me know if there is anything I can do to help



I just released about 2 years of work on improving compression with a fixed encoding LZ77 style compressor.

Our goal was to improve compression by combining and tweaking the best aspects of LZ4 and Snappy.

The package provides block (up to 8MB) and stream compression. Both compression and decompression have amd64 assembly that delivers multi-GB/s speeds - typically at memory throughput limits. But even the pure Go versions outperform the alternatives.

Full specification available.



The blog post is quite nice: https://blog.min.io/minlz-compression-algorithm/

But comparisons with zstd & lz4 would be nice.

Is someone planning on a wuffs/rust implementation with C API so that it wouldn't be Go only?


The main reason I took the effort to write the spec was to encourage ports - and commit to the format.

I've converted a basic block encoder/decoder to C (see end of README for link), to assist anyone wanting to get started on a C implementation.

It's been 10+ years since I've done any serious C, so I didn't want to assume I could write a good C API for it.

If anyone is interested in picking it up, reach out.

> comparisons with zstd & lz4

The repo has lz4 comparisons (Go version) for both blocks and streams.

I see MinLZ more as a complement to zstd, just like lz4 - for cases where speed matters more than ultimate size.

Here is the fully parallel speed of block decompressing with zstd or minlz - Go versions:

* Protobuf - zstd: 31,597 MB/s - MinLZ: 155,804 MB/s
* HTML - zstd: 25,157 MB/s - MinLZ: 82,292 MB/s
* URL List - zstd: 16,869 MB/s - MinLZ: 45,521 MB/s
* GEO data - zstd: 11,837 MB/s - MinLZ: 36,566 MB/s

Of course zstd compresses to a smaller size - but for things like stream transfers or fast readbacks from disk you probably want the fastest option.

Overall, I would say that MinLZ decompresses about 2x faster, and compresses at about 2x the speed at the fastest settings - but of course with less compression.


Almost all (x86) CPUs sold have GFNI. That can pretty much saturate memory bandwidth on a single core or two. You can use SSSE3 pshufb for the rest which is about half the speed.

ARM has NEON and SVE/SVE2, which are also very fast.

So not sure what you are thinking of.


GFNI only does 8-bits and sticks you with a single modulus. But for some reason I'd forgotten it completely, you're totally right to give me a mystified response.

(FWIW, it's possible to project elements from one field to another isomorphic field, though it takes enough operations that for fast code like RS decoding the conversion is probably performance limiting).

For hybrid codes GFNI should be sufficient, though for things like using RS at 16/32 bit sizes it's not.


The matrix multiplication needed to correct grows as O(n^2) even though the code itself only grows as O(n). There are asymptotically faster matrix multiplication algorithms in theory, but in practice every algorithm is O(n^2).

As such, large Reed-Solomon codes are impractical. If you need a larger code than GF(2^8) can offer, you grow via 2-dimensional codes, slicing or similar constructions.

In practice, this sacrifices the Minimum Distance property, meaning you should use a Turbo Code (or another XOR-based code), which is O(n) but imperfect.

---------

CRC32 can be implemented in GFNI. And AES is also GF(2^8).

----------

I don't think there are many algorithms where GF(2^16) or bigger are needed.

And if they did, it's possible to turn 8x8 into 16x16 or 32x32 anyway.


RS codes can be done in N log N time for both encoding and erasure decoding, and sometimes achieving MDS is useful for attack resistance - e.g. you can always decode with N blocks, whereas with non-minimum-distance codes there can be adversarial subsets that fail to decode even when you have significantly more blocks than needed. Larger word sizes for RS codes are also useful for list decoding, even when the number of data blocks isn't that great.

And sure, list decoding is slow. But I think there are two distinct groups of applications for good instruction sets. One is where you're trying to make a fast thing, like an 8-bit RS code, "free" by having it run at near memory, wire, or disk speeds (or consume little power). The other is where you're doing something legitimately slow, including things that are O(N^2) or worse. In those cases a small constant factor sometimes makes the difference between usable and very much not usable.


In those cases, I know that ARM has PMULL / PMULL2 and x86 has PCLMULQDQ, which go up to 64x64 == 128-bit multiplication.

There is even an AVX512 version of PCLMULQDQ.


In my experience, the hybrid approach is practically better due to cache locality anyway, so I don't think even native GF16 or GF32 would be of much help.

FWIW, I've ported rs-leopard to Go and found it very effective, except in the "recover one" scenario, where it is only at 50% speed of the "recover all" scenario, since it has to do several reconstructions to get one output. But even so, I am not too sure it would be much better with plain GF16, since you would still need to touch most shards to get one shard out.


Looking at it quickly, it seems like you are just moving entropy into the frequency table.

If the frequency table isn't included in the count, I could make an "infinite" compressor by storing the frequencies and cutting one byte off the end. From the frequency table I could then deduce what the last byte should be.

> typical entropy encoders (Huffman, ANS, etc) would require 1 bit per symbol,

No. ANS (and range/arithmetic coding) allows for probabilities to be stored with fractional bits.


ANS requires 1 bit per symbol if the ratio is 1:1, you can confirm this here: https://kedartatwawadi.github.io/post--ANS/


The value of the ratio is information that needs to be transmitted. You are pretending that it’s free


I was using what seemed to me to be the standard with other encoders; storing the ratios costs roughly the same as with other compressors - a few extra bits for exact counts. But I do see how this isn't i.i.d., so I'm fixing that.


I don't think you understand the basic concepts, and you are wasting your time trying to "fix" it. Let me try to explain a bit differently: the Shannon limit depends on assuming a model of the source (the thing producing symbols).

For example, imagine the output of something you think is "true" noise, like a fair coin flip that produces 0 (for heads) and 1 (for tails). That has an entropy of 1 bit, and you cannot compress it. Now you learn that it is actually a pseudorandom number generator that produces 0 or 1 in a 1:1 ratio. If you know the program (and seed, if required), the source now has 0 entropy - you can run the program to know every bit that will be produced. Same symbol outputs, different models of how they are produced, different theoretical compression limits.

As another example, if you are compressing text, one model would be to assume that words are produced independently of one another. We know that isn't true (certain words are more likely to follow others, much like the next-word prediction on your phone), but you might still choose to implement a compression system based on that naive model because it's less complex, uses less memory or compression/decompression time, etc. If you implemented compression using a less naive model (e.g. what is the probability of the next symbol in all human text, given all the symbols seen before it) you could get better compression rates, but you wouldn't be breaking the Shannon limit you calculated using the naive model - the "true" limit is based on the "true" model of the source. The entropy of human-produced text is not really calculable (people produce new sentences every day), but you might approximate it using something like all the human text you can find.

As I alluded to above note that Shannon entropy doesn't take into consideration practical complexities (memory, program size, processing time) that are real considerations when implementing a compression scheme. Most of the time you are going to trade higher entropy for simpler, faster, less resource intensive compression.

To sum up: You cannot beat the Shannon limit of the "true" model of the source but you can beat the Shannon limit of a naive approximation of the source. Those naive models are used because they are more practical to implement.


I appreciate the input. I did not mean to imply I could encode a random stream of symbols/characters, that is absolutely valid. I was approaching this as a compression technique for something like a text file, where the symbol counts are known at compression time.


First issue: if you do comparisons, you have to do them apples-to-apples. You can't assume that method A has that information available for free but method B does not, or that method A assumes one source model while method B assumes a different one.

Second, I really don't understand how you intend to use a table of symbol counts. If you build it over the entire file, the table might be a reasonable size, but the number of permutations becomes infeasible. Conversely, if you do it in small windows (like the 8 or so in your examples), you have to store a separate symbol count table for each window, which explodes the table's size. I really doubt you are gaining anything from this. You are going to create an enormous per-file symbol frequency table and then not count it against the compressed size - that isn't compression, it's misdirection.


When I get time I may compress the table as well; looking at example arithmetic encoders, they seem to have some flat table implementations too (https://github.com/nayuki/Reference-arithmetic-coding/blob/m...). I don't see how that's misdirection - I left the table uncompressed specifically for transparency. I was just trying to keep that part simple.


Not if the counts are updated correctly.


> If I could change one thing about S3's API I would like an option to read the metadata with the listings.

Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

> You can create zero length files ending in /

Yeah. Though you could also consider "shared prefixes" in listings to be directories in themselves. That makes directories "stateless" - and unable to exist if there are no objects in them - which has pros and cons.

> Or alternatively a way of setting the Last-Modified on an object when you upload it would do too.

Yes, though that imposes severe limitations on clients. It does make the "server" time the reference. But we have to deal with the same limitation for client-side replication/mirroring.

My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key. AWS recently added "GetObjectAttributes" - but it doesn't add version information, which would have fit in nicely there.


> Agree. In MinIO (disclaimer: I work there) we added a "secret" parameter (metadata=true) to include metadata and tags in listings if the user has the appropriate permissions. Of course it being an extension it is not really something that you can reliably use. But rclone can of course always try it and use it if available :)

Is this "secret" parameter documented somewhere? Sounds very useful :-) Rclone knows when it is talking to Minio so we could easily wedge that in.

> My personal biggest complaint is that there isn't a `HeadObjectVersions` that returns version information for a single object. `ListObjectVersions` is always going to be a "cluster-wide" operation, since you cannot know if the given prefix is actually a prefix or an object key

Yes that is annoying having to do a List just to figure out which object Version is being referred to. (Rclone has this problem when using --s3-list-version).


Repo: https://github.com/mxmlnkn/rapidgzip

As someone who has dabbled quite a lot with deflate this was a very interesting find. It seems like the format that never wants to die.

First of all, I am surprised this is even possible. Given the extremely minimal and non-aligned nature of deflate headers, I had actually dismissed this as infeasible, expecting the false positive rate to be too high.

The paper and implementation are an amazing piece of engineering. Hats off to the author. With an index you are in "easy" mode, so I consider the unindexed "hard mode" the big accomplishment.

I am still digesting it. So if I read the paper correctly the false positive rate is approximately 1 for every 5GB. Very reasonable.


Hi, author here.

You are right that the index is the easy mode. Over the years there have been lots of implementations trying to add an index like that to the gzip metadata itself or as a sidecar file, with bgzip probably being the best known. None of them really stuck, hence the necessity for a generic multi-threaded decompressor. A probably incomplete list of such implementations can be found in this issue: https://github.com/mxmlnkn/rapidgzip/issues/8

The index makes it so easy that I can simply delegate decompression to zlib. And since paper publication I've actually improved upon this by delegating to ISA-l / igzip instead, which is twice as fast. This is already in the 0.8.0 release.

As derived from table 1, the false positive rate is 1 Tbit / 202 = 5 Gbit or 625 MB for deflate blocks with dynamic Huffman code. For non-compressed blocks, the false positive rate is roughly one per 500 KB, however non-compressed blocks can basically be memcpied or skipped over and then the next deflate header can be checked without much latency. On the other hand, for dynamic blocks, the whole block needs to be decompressed first to find the next one. So the much higher false positive rate for non-compressed blocks doesn't introduce that much overhead. Furthermore, you need to consider that the block finder is basically only applied to the first ~0-128 KiB or so data of each 4 MiB large block (until it finds a valid one). So, in terms of real input archive size, the effective false positive rate is smaller by a factor of 32.

I have some profiling built into rapidgzip, which is printed with -v, e.g., rapidgzip -v -d -o /dev/null 20xsilesia.tar.gz :

    Time spent in block finder              : 0.227751 s
    Time spent decoding                     : 20.4015 s
    Time spent allocating and copying       : 2.71328 s
    Time spent applying the last window     : 1.39377 s
    Replaced marker bytes                   : 1 GiB 497 MiB 311 KiB 503 B
The block finder adds about 1% overhead and the marker replacement about 7% overhead. It probably would be nice to add a statistic for the false positive rate to the -v output.


Last I checked QAT was limited to 64KB backreferences and independent 128KB blocks.

Now you know why they are comparing 16KB payloads only.


Yeah, that is definitely a limitation with QAT. It isn't a great fit for larger data, as it can quickly lose compression ratio due to its smaller window. However, there is a lot of compression done on data that is <64KB. E.g. compression in RocksDB.

We also have some half-baked ideas to combine it with a fast software match finder that only looks for matches >64KB away, supplementing the matches that QAT finds.


.. for small payloads.

QAT cannot compress across blocks. So as soon as your input is more than 128KB the compression ratio tanks.

Usually well below what even level 1 of software zstd does. They choose the input very carefully.


Interesting. That sounds basically perfect for ZFS where the default is 128KB blocks and zstd compression.


This could be very useful for a project I am just starting on.

No documentation, and an example with no content, makes the learning curve a bit steep.

Does anyone have any pointers on how to use this?


It's pretty simple to use if you are familiar with dlopen and friends.

Just call purego.Dlopen("libname.so", purego.RTLD_GLOBAL).

Take the returned library handle (make sure to check for errors with purego.Dlerror() first) and call purego.Dlsym(lib, "cfuncName"). If the symbol exists, you can call it with either purego.SyscallN or purego.RegisterFunc.


It would be good to have more documentation on usage, though; things like how to deal with struct padding (or packed structs), common OS API types (presumably manual munging of UCS2/UTF16 is needed for Windows), etc; at least to mention that it's unchanged from …/x/sys?

It's easier with dlopen because it's still C and therefore you have the normal headers…


+1 for additional examples or documentation.

Particularly an example that takes a c struct pointer would be awesome.

What happens to const char* return values that are null? I think it becomes an empty string, but either a test case or a doc note confirming it would be awesome.


The example seems straightforward: Include this package, and your usual os calls start going through their "fakecgo" path.


Reading it, it seems very similar to the PlayStation 3's Cell, and its fate mirrors it very much.

A highly specialized processor that has very high computational throughput for specialized operations, but a quite limited scalar unit.

In both cases you really have to write software specifically for it to get it to perform with any reasonable speed. If you do that, you get great value, but any existing code will need significant modifications to perform.

Granted, AVX-512 is (now) more common than SPE code ever became. It is slightly better than the Itanium approach, but scalar performance (especially single-threaded) will have limited the value you got from this CPU, unless you wrote your own software and it could utilize AVX-512.


It was much easier to program, though. The Cell had two different ISAs, one for the PPU and one for the SPUs. The SPUs also didn't have direct access to main memory, and the PPU had to manage task assignment and completion, as well as set up DMA transfers between SPU memory and main memory.

The Phi was closer to the Sun Niagara family - lots and lots of simple, slow, cores, with the note that, in the case of the Phi, the weakling x86's had mighty SIMD abilities while the Niagara had more or less standard SPARC stuff.

Neither will have amazing performance unless you have at least as many threads running as you have cores, and most of the time, at least twice as many. For single-threaded code, they were on the slow side.

Still, I always suggested people use Phis for development because they'd get a taste of future computers. Nowadays a decent laptop will have half a dozen cores and, unless your mail client has that many threads, it won't feel as fast as it could.

