This is one of the most useful documents on the web regarding bzip2 implementation. I wrote a bzip2 encoder in Rust [1] a couple of years ago and it would have been an uphill struggle without Joe Tsai's work.
Like Tsai did for his Go implementation, I used the 'SA-IS' algorithm for computing the Burrows-Wheeler Transform. Unlike the algorithm used in the reference implementation of bzip2, it is linear time, which in practice means it has much better upper-bound performance (though it is typically somewhat slower on average).
The problem of suffix array construction, which SA-IS solves (with computation of the BWT being a natural corollary), is a very interesting one in which theoretical and practical advances are still being made. There is a notable implementation of SA-IS by Ilya Grebnov which is spectacularly fast due to its use of techniques like cache prefetching [2]. It's worth a look for anyone interested in really high-performance software compression.
It's a consequence of being block-based as mentioned elsewhere, but interesting to note that cat'ing together bzip2 files gives a valid bzip2 file. That's the basis of pbzip2 [0] - it breaks the input file into chunks of 900K by default, compresses each chunk and then concatenates the compressed chunks. The individual chunks can be compressed in parallel if hardware allows.
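A rough sketch of that chunk-and-concatenate idea, using Python's bz2 module rather than pbzip2 itself (the process pool and the 900K chunk size here are just illustrative):

    import bz2
    from concurrent.futures import ProcessPoolExecutor

    CHUNK = 900_000  # pbzip2's default chunk size, as mentioned above

    def compress_parallel(data: bytes) -> bytes:
        chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
        with ProcessPoolExecutor() as pool:
            compressed = list(pool.map(bz2.compress, chunks))
        # Concatenating independent bzip2 streams yields a valid bzip2 file.
        return b"".join(compressed)

    if __name__ == "__main__":
        data = b"hello bzip2 " * 200_000
        blob = compress_parallel(data)
        # bz2.decompress handles multi-stream input, returning the original data.
        assert bz2.decompress(blob) == data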
gzip isn't block based by default, but it does effectively support a dictionary reset command in the compressed stream. This “command” is essentially the start of the gzip header, so if you cat two bits of gzipped data together, decompressing the result gives the same output as the source data streams concatenated. This means you can turn gzip into a block-based process and therefore parallelise it in the same manner as bzip2, which is how pigz⁰ ¹ works.
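That property is easy to check with Python's gzip module (a toy check, nothing to do with pigz's actual implementation):

    import gzip

    part1 = gzip.compress(b"first part, ")
    part2 = gzip.compress(b"second part")

    # Concatenated gzip members decompress to the concatenated source data,
    # which is what makes chunked, parallel gzip compression possible.
    assert gzip.decompress(part1 + part2) == b"first part, second part"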
This dictionary reset trigger is how the “rsyncable”² option³ is implemented too. Resetting the compression dictionary this way every 1000 input bytes increases the size of the compressed output by surprisingly little⁴.
[1] I actually started making my own version of this, way back when, inspired by looking into how gzip's rsyncable option² worked, before discovering it already existed! I did “finish” my version as far as a working PoC though, as it was an interesting enough exercise.
[3] also supported by pigz⁰, where it is applied within each block it compresses. Because pigz splits the input at regular intervals anyway (rather than taking a more dynamic approach), its output is naturally already more rsync-compatible than plain gzip's (though with the default 128KiB block size, notably less so than with a reset every 1000 input bytes).
[4] usually between 1% and 3% IIRC, depending on input content of course; for some inputs the difference can be below that range, or much higher.
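To get a feel for that overhead yourself, you can approximate the reset behaviour with raw DEFLATE by forcing a full flush (which empties the dictionary) at a fixed interval. This is only an approximation, since the real rsyncable option picks its reset points from the input content rather than at fixed offsets, but the size comparison is illustrative:

    import random
    import zlib

    # Moderately compressible sample data: repeated words in a random order.
    random.seed(0)
    words = [b"alpha", b"beta", b"gamma", b"delta", b"epsilon"]
    data = b" ".join(random.choice(words) for _ in range(200_000))

    def deflate(data, reset_every=None):
        comp = zlib.compressobj(9)
        out = []
        if reset_every is None:
            out.append(comp.compress(data))
        else:
            for i in range(0, len(data), reset_every):
                out.append(comp.compress(data[i:i + reset_every]))
                # Z_FULL_FLUSH discards the dictionary, like a reset point.
                out.append(comp.flush(zlib.Z_FULL_FLUSH))
        out.append(comp.flush())
        return b"".join(out)

    plain = deflate(data)
    reset = deflate(data, reset_every=1000)
    print(f"overhead: {len(reset) / len(plain) - 1:.1%}")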
This looks really good. I remember looking into BWT as a kid. It's a true "wat" once you understand it.
And once you understand it, why does it compress so well? Because suffixes that sort near each other tend to have the same byte preceding them.
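A tiny, naive forward transform (sentinel-based, nothing like the SA-IS construction mentioned elsewhere in the thread) makes the clustering visible:

    def bwt(s: str) -> str:
        s += "\0"  # sentinel marking the end of the string
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(rot[-1] for rot in rotations)

    # Rotations starting with the same characters sort next to each other,
    # so the bytes preceding them tend to come out as runs in the last column.
    print(repr(bwt("banana bandana banana")))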
Bzip2 is still highly useful because it is block based and can thus be scaled nearly linearly across CPU cores (both on compress and decompress)! Especially at higher compression levels. See e.g. lbzip2.
Bzip2 is still somewhat relevant if you want to max out cores. Although it has a hard time competing with zstd.
When I first heard about bzip3, a few months ago, I ran a series of tests comparing it with zstd and other compression programs.
At first I was extremely impressed, because with the test archives that I happened to use, bzip3 outperformed zstd in all cases, at all possible settings for both: either in compression time at the same compressed size, or in compressed size at the same compression time.
Later, though, my initial enthusiasm had to be tempered, because I found other test archives where the compression ratio achieved by bzip3 was more modest, falling behind other compression programs.
Therefore, the performance of bzip3 seems a little hard to predict, at least for now, but there are circumstances when it has excellent performance, with a much better compromise between compression speed and compressed size than the current alternatives.
There is one thing you can't do with most algorithms: parallelize decompression. That's because most compression algorithms use sliding windows to remove repeated sections.
And decompression speed also drops as compression ratio increases.
If you transfer over, say, a 1 Gbit link, then transfer speed is likely the bottleneck, as zstd decompression can reach >200MB/s. However, if you have a 10 Gbit link, then you are CPU bound on decompression.
See e.g. decompression speed at [1].
Bzip2 is not window based but block based (level 1 == 100kB blocks, level 9 == 900kB blocks, IIRC). This means that, given enough cores, both compression and decompression can be parallelized, at something like 10-20MB/s per core. So somewhere beyond 10 cores you will start to outperform zstd.
Granted, that's a very, very narrow corner case. But it is one you might hit with servers; that's how I learned about it. Still, so far I've converged on zstd for everything. It is usually not worth the hassle to squeeze out those last bits of performance.
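As a quick aside on those block sizes: the bzip2 stream header makes the level-to-block-size mapping easy to check, since the digit after "BZh" is the block size in units of 100 kB:

    import bz2

    data = bytes(2_000_000)  # anything larger than a single block
    for level in (1, 9):
        header = bz2.compress(data, compresslevel=level)[:4]
        print(level, header)  # b'BZh1' -> 100 kB blocks, b'BZh9' -> 900 kB blocks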
That's possible with pzstd in any case. zstd upstream has a plan to eventually support parallel decompression natively but hasn't prioritized it given the complexity and lack of immediate need.
The issue talks about one vs. multiple frames. That's exactly the issue. It's not a matter of complexity, it's a matter of bad compromises.
The issue is easy to walk through. The simplest encoding where it happens is RLE (run-length encoding).
Say we have 1MB of repeated 'a'. Originally 'aaa....a'. We now encode it as '(length,byte)', so the stream turns into (1048576,'a').
Now we want to parallelize it over 16 cores, so we split the 1MB into sixteen 64k chunks and compress each chunk independently. This still works, but the output is ~16x larger.
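A quick sketch of that blow-up with a toy RLE coder (purely to illustrate the point, not how any real compressor's RLE stage works):

    # Toy RLE: collapse each run of identical bytes into (length, byte) pairs.
    def rle_encode(data: bytes):
        runs = []
        i = 0
        while i < len(data):
            j = i
            while j < len(data) and data[j] == data[i]:
                j += 1
            runs.append((j - i, data[i]))
            i = j
        return runs

    data = b"a" * 1_048_576

    whole = rle_encode(data)                       # one pair: [(1048576, 97)]
    chunks = [data[i:i + 65_536] for i in range(0, len(data), 65_536)]
    split = [rle_encode(c) for c in chunks]        # sixteen (65536, 97) pairs

    print(len(whole), sum(len(r) for r in split))  # 1 vs 16: ~16x more output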
Similar things happen for window based algorithms. We encode repeated content as (offset,length), referencing older occurrences. Now imagine 64k of random data, repeated 16 times. The parallel version can't compress anything (16x random data), the non-parallel version will compress it roughly 16:1.
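The same effect can be seen with a real dictionary-based compressor; here Python's lzma is used purely as a stand-in, since its default dictionary easily covers a 64k back-reference:

    import lzma
    import os

    block = os.urandom(65_536)
    data = block * 16            # 1MB: 64k of random data repeated 16 times

    # One stream: the later copies become cheap back-references, roughly 16:1.
    whole = lzma.compress(data)

    # Sixteen independent streams: each chunk alone is incompressible random data.
    chunks = [lzma.compress(data[i:i + 65_536]) for i in range(0, len(data), 65_536)]

    print(len(whole), sum(len(c) for c in chunks))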
There is a trick to avoid this downside. The lookup is not unlimited, there is a maximum window size to limit memory usage. For compatibility it's 8MB for zstd (at level 19), but you can go all the way to 2GB (ultra, 22, long=31).
As you make chunks significantly larger than the window, you only lose out on the fresh ramp-up at the start of each chunk. E.g. if you use 80MB chunks then a bit less than 10% of the file is encoded worse. You could still double your encoded size with a well-crafted file.
If you don't care about parallel decompression then you can parallelize only parts of the work, like the match search. This gives a good speedup, but only on compression. That's the current parallel compression approach in most cases (IIRC), leading to a single frame, just produced faster. The problem remains that back-references can only be resolved once the earlier data has been decompressed.
The whole problem is not implementation complexity. It's something you algorithmically can't do with current window-based approaches without significant trade-offs in memory consumption, compression ratio, or parallel execution.
For bzip2 the file is always chunked at boundaries of at most 900kB. Each block is encoded independently and can be decoded independently, which avoids this whole trade-off altogether.
I would also disagree with "no need". Zstd easily outperforms tar, but even my laptop SSD is faster than the zstd speed limits. I just don't have the _external_ connectivity to get something onto my disk fast enough. I've also worked with servers 10 years ago where the PCIe bus to the RAID card was the limiting factor. Again easily exceeding the speed limits.
Anyway, as mentioned a few times it's an odd corner case. And one can't go wrong by choosing zstd for compression. But it is real fun to dig into these issues and look at them, I hope this sparks some interest in it!
That does, of course, require that you compress it into multiple frames to begin with, which could be a problem if you don't control the source of the compressed files, because the default is a single frame. In theory if everyone used pzstd to compress their files, it would be strictly superior to BZ2 in nearly every circumstance. As it is, you do have to go out of your way to do that.
But I don't think that necessarily means the single-frame choice by default is a bad tradeoff. It's better in most circumstances. And if they do eventually figure out a reasonably efficient way to handle intra-frame parallel decompression, then it's just gravy.
A huge amount of time is spent optimizing for #3. Maybe instead we should offer descriptive commands that convey the goals. Say, "squash", "speedup", and "deflate", or some such.
I still struggle to get my head around BWT. I understand what it does conceptually and why it helps, and I can read code that implements it, but I don't fully get it - there's a mental disconnect for me somewhere. Mainly, I can't convince myself that computing the inverse transform is possible.
It's one of those algorithms that I can say for sure I'd never have been able to come up with on my own.
I think it really helps to stop thinking about the string as a linear sequence with a beginning and end, and instead consider an unbroken loop of characters. Literally imagine the string written into a circle like the letters on an Enigma rotor.
Then you can consider the construction of all the substrings of length 2, length 3, and so on. You may also wish to consider the same induction working backwards from its conclusion: start with the set of length-n substrings, then the length-(n-1) substrings, etc.
Either way, your objective should be to convince yourself that you can reconstruct the whole ring from the BWT. At this point it is not hard to make the final leap to understand how it can be applied to regular strings.
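For anyone who wants to see the reconstruction mechanically, here is a naive, sentinel-based sketch. It rebuilds the rotation table by repeatedly prepending the last column and re-sorting; real implementations use the LF-mapping instead, but the principle is the same:

    def bwt(s: str) -> str:
        s += "\0"  # unique sentinel, so every rotation is distinct
        table = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(row[-1] for row in table)

    def inverse_bwt(last: str) -> str:
        # After n rounds the table holds every rotation of the original, in sorted order.
        table = [""] * len(last)
        for _ in range(len(last)):
            table = sorted(last[i] + table[i] for i in range(len(last)))
        # The rotation ending in the sentinel is the original string (plus sentinel).
        original = next(row for row in table if row.endswith("\0"))
        return original[:-1]

    assert inverse_bwt(bwt("banana bandana")) == "banana bandana"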
[1] https://github.com/jgbyrne/banzai/
[2] https://github.com/IlyaGrebnov/libsais