Hacker News new | past | comments | ask | show | jobs | submit login
Decompressing a Gzip File by Hand (ttay.me)
144 points by thomastay 4 days ago | hide | past | favorite | 18 comments

For anyone looking to write their own Deflate/Gzip/Zlib, I would recommend looking at the little program called `infgen`:

> infgen is a deflate stream disassembler. It will read a gzip, zlib, or raw deflate stream, and output a readable description of the contents.


Wow! This program is really cool and I'm honestly not surprised it's written by Mark Adler, the guy who wrote zlib, that guy is an amazing programmer.

Thanks, this would have saved me a bunch of time and I wish I'd known this before writing this article. I'll add it in as a footnote later on.

For anyone who's curious, here's the output of the infgen program on the gzip file in the article (I remade the file so the timestamp is different)

  $ cat test.out.gz | ./infgen -ddis
  ! infgen 2.6 output
  time 1637813470 ! UTC Thu Nov 25 04:11:10 2021
  name 'test.out
  last                    ! 1
  fixed                   ! 01
  literal 'a              ! 10010001
  literal 'a              ! 10010001
  match 6 1               ! 00000 0000100
  literal 10              ! 00111010
  end                     ! 0000000
  ! stats literals 8.0 bits each (24/3)
  ! stats matches 66.7% (1 x 6.0)
  ! stats inout 5:6 (4) 9 0
                          ! 00
  ! stats total inout 5:6 (4) 9
  ! stats total block average 9.0 uncompressed
  ! stats total block average 4.0 symbols
  ! stats total literals 8.0 bits each
  ! stats total matches 66.7% (1 x 6.0)

> (I remade the file so the timestamp is different)

That's why you should always use "gzip -n". Having the file name and timestamp in the compressed file header was IMO a mistake, and newer single-file compressors (bzip2, xz, zstd) AFAIK don't even have these fields in their native file format.

Woah! I didn't even notice it was him. Most of the accepted answers relating to anything Deflate are by him on Stack Overflow, so I shouldn't be surprised.

If you use the < operator you don’t need to spawn a ‘cat’ process that does nothing but copy around the contents of the file.

Programming a gzip (deflate) decompressor is a really good learning project. I did it last year, ultimately producing a pretty poor but working implementation [0]. I ended up modifying Mark Adler's puff example program to see the intermediary tables it produces to help debug my own implementation. I wish I had known about infgen (also by Adler). The other resource I would recommend, beyond the official specs is this article by Joshua Davies[1], also mentioned in the original post here.

[0]: https://github.com/davecom/nflate [1]: https://commandlinefanatic.com/cgi-bin/showarticle.cgi?artic...

This tool served as a base for a fun project of mine [0], which consisted on injecting arbitrary bytes to a deflate stream without affecting the decompression output.

The main idea is that a stream is composed of blocks, and dynamic blocks can contain Huffman tables followed by a end-of-block symbol. Since there's no other symbols, such blocks won't produce any output when decompressed. So there's a few bytes for those tables that can be changed (e.g. to put a little ascii signature). However, this affects the codes present in the tables, which must pass some validations (for decompressors to unambiguously match parsed bits to codes). I used a SMT solver to figure out values for the remaining bytes, so that the tables had valid code counts.

[0]: https://nevesnunes.github.io/blog/2021/09/29/Decompression-M...

Oh, that's really nice.

I wish there was a widely accepted standard to write down arbitrary binary formats so one could inspect/debug them. Or even use it as a library to puck/unpack the data. I guess the 010 editor comes close with it's binary templates.

One of the mantras I used when I was a programming instructor was "How can you know what to tell the computer to do, if you can't do it yourself?" This is a great example of that.

Don’t forget that the huffman codes are packed LSB to MSB, but are to be interpreted as an integer in big endian format. Why this insanity?

Because that's what Phil Katz did. I vaguely remember reading about that years ago, and believe it had to do with how he implemented it; not as canonical table lookup, but as explicit tree-walking, since the former hadn't really been invented yet and the latter would fit within the memory constraints of a 16-bit system (DOS on a PC).

Looks like the bytes shall not be reversed beforehand. Bits are easier to extract at the lower end. For example, let's take the first byte as an example, 01001011. The first field is one bit wide, indicating the final record of the stream, 01001011 & 1 = '1'. Remove the extracted bit by shifting, 0100101 remains. The next field is two bits wide, so extract those: 0100101 & 11 = '01' and remove the bits giving 01001. And so on...

Not so much. There are several degrees of freedom in this design, see ryg's Reading bits in far too many ways [1] for the introduction.

[1] https://fgiesen.wordpress.com/2018/02/19/reading-bits-in-far...

And here's a gzip quine for anyone who likes that sort of thing:

  ( echo H4sIAAAAAAAAA5Pv5mCIeLzKnIGZgYHh/38GEYbX/+WxiDl1 ;
    echo KB5hIIJ1TXfrBNeU3DznNGYG78SS1AKgEchsAEthdGVwAAAA ) |
  base64 -d | gunzip > quine.gz
  zcat quine.gz | diff - quine.gz

For some reason I’m unsettled by the fact that, as presented, you have to unzip it once before it become a quine for subsequent unzippings. It lets you save 56 characters of base64 data, at the cost of `| gunzip`.

There’s some superfluous echo and subshell in there, incidentally; I like it better written this way:

  base64 -d <<<H4sIAAAAAAAAA5Pv5mCIeLzKnIGZgYHh/38GEYbX/+WxiDl1KB5hIIJ1TXfrBNeU3DznNGYG78SS1AKgEchsAEthdGVwAAAA | gunzip > quine.gz
  zcat quine.gz | diff - quine.gz

This was a great read. I look forward to the Part II teased at the end. In addition to deep diving on the Huffman codes, I hope you find a way to include an explanation of when to read left-to-right vs right-to-left, which is what tripped me up the most in this article.

Thanks, appreciate it. Honestly, the switch from little endian to big endian within the same compressed blocks is incredibly confusing and is probably one of the things that is so weird about the Deflate specification. I had to read that part of the spec over and over again, and compare it to other people's blog posts (made harder because some people write bits from Right to Left???)

I personally think it's easiest if you write bits Left to right, from LSB to MSB, which fits how humans write words, but i also get that when you're programming, you tend to think of MSB being on the left and LSB on the right (for bit twiddling operations)

So I can see why people would write bits right to left too, to fit how you program, but frankly I don't think that is any easier to read on paper.

It's funny, just earlier today I decided to pick up my Deflate project again (for the fourth time), and I was reading up on the odd ordering stuff and struggling.

I usually write them RTL because it better aligns with shift operations in my head. This makes it easier to think about certain kinds of binary algorithms in my opinion.

I love this! Great post.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact