How is that possible?

tar.gz files don't have a central directory (the way zip does), and they're compressed as a single stream (which is almost always non-seekable).

.tar itself gives you enough information to seek forward past each file, though every file must be visited.
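
For illustration, a rough sketch of that walk over a plain ustar-format .tar (long-name extensions and error handling omitted):

    import os

    def walk_tar(path):
        """Yield (name, size) for each member by hopping header to header."""
        with open(path, "rb") as f:
            while True:
                header = f.read(512)
                if len(header) < 512 or header == b"\0" * 512:
                    break  # end-of-archive marker
                name = header[0:100].rstrip(b"\0").decode()
                size = int(header[124:136].rstrip(b"\0 "), 8)  # octal ASCII field
                yield name, size
                # file data is padded to a 512-byte boundary; seek past it
                f.seek((size + 511) // 512 * 512, os.SEEK_CUR)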

.gz does not give you enough information to seek randomly within the compressed stream, so you cannot skip past files in the .tar it contains.

But if you consume the entire .gz stream while keeping periodic checkpoints of the past sliding window (32KB) every 1MB or so, you get random access with 1MB granularity. You still had to consume the entire stream once to build the lookup, though.
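
A minimal sketch of that idea using Python's zlib, where decompressobj.copy() snapshots the decompressor state so you don't have to save the window by hand (the interval and names are illustrative; zlib ships a real version of this as examples/zran.c, which stores the 32KB window plus a bit offset per access point):

    import zlib

    CHECKPOINT_EVERY = 1 << 20  # ~1MB of output between checkpoints (illustrative)
    CHUNK = 64 * 1024

    def build_checkpoints(gz_bytes):
        """One full pass over a gzip stream, snapshotting decompressor state."""
        d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16+: expect a gzip header
        checkpoints = [(0, 0, d.copy())]  # a fresh state covers offset 0
        out_pos = in_pos = 0
        next_mark = CHECKPOINT_EVERY
        while in_pos < len(gz_bytes):
            out_pos += len(d.decompress(gz_bytes[in_pos:in_pos + CHUNK]))
            in_pos += CHUNK
            if out_pos >= next_mark:
                checkpoints.append((out_pos, in_pos, d.copy()))
                next_mark = out_pos + CHECKPOINT_EVERY
        return checkpoints

    def read_at(gz_bytes, checkpoints, offset, length):
        """Random read: resume from the nearest checkpoint at or before offset."""
        out_pos, in_pos, state = max(c for c in checkpoints if c[0] <= offset)
        d = state.copy()  # copy again so the checkpoint stays reusable
        buf = bytearray()
        while len(buf) < (offset - out_pos) + length and in_pos < len(gz_bytes):
            buf += d.decompress(gz_bytes[in_pos:in_pos + CHUNK])
            in_pos += CHUNK
        return bytes(buf[offset - out_pos:offset - out_pos + length])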


Decompress, scan as you go, discard. Having to read a few hundred GB and scan a terabyte is a nuisance. Not having to write a terabyte is priceless.

Could also maintain an in-memory index so that you can go back after the fact and extract individual files.
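
Something like this sketch: tarfile's stream mode reports each member's offset within the uncompressed tar as it goes past:

    import tarfile

    def index_tar_gz(path):
        """Stream once; record each member's place in the *uncompressed* tar."""
        index = {}
        with tarfile.open(path, mode="r|gz") as tf:  # '|' = non-seekable stream
            for member in tf:
                index[member.name] = (member.offset_data, member.size)
        return index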

That's less helpful than you might imagine: gzip isn't seekable by default, so if all you know is the seek point, you still have to decompress everything up to that point before you can start decompressing from there. And if you have to do that anyway, reading the tar headers as you go isn't a serious burden.

What might help is saving the state of the decompressor periodically, rather than just the index in the file. But that's getting pretty far into the weeds for an optimization to an infrequently used feature.


Interesting, yeah, that makes sense. And I agree, it would be tricky to figure out the proper balance between caching the actual contents somewhere and just caching the decompressor state, and whether that cache goes to RAM or disk. There isn't an obvious right answer for either, nor is there necessarily a right way to expose that option to the user.

Can definitely see why systems like Python's wheel format chose zip: it's natively seekable out of the box. I believe Nix now does something similar, with flake repo archives stored as zipfiles, since individual files can be read and evaluated without a full decompression, saving a lot of disk space.
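
That works because zip compresses each member independently and keeps a central directory at the end of the file, so pulling out one member is cheap. A sketch (the filenames are made up):

    import zipfile

    # Only the central directory and the one member are ever read.
    with zipfile.ZipFile("some_package.whl") as zf:  # hypothetical filename
        metadata = zf.read("some_package-1.0.dist-info/METADATA")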


I'm guessing the gzip is retrieved as a stream, and the tar is then read from that stream in memory?
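
Right; something along these lines would do it (a sketch, with a placeholder URL):

    import tarfile
    import urllib.request

    # Decompress and parse the tar on the fly; nothing ever hits the disk.
    with urllib.request.urlopen("https://example.com/archive.tar.gz") as resp:
        with tarfile.open(fileobj=resp, mode="r|gz") as tf:
            for member in tf:
                if member.isfile():
                    data = tf.extractfile(member).read()
                    print(member.name, len(data))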
