Hop: Faster than unzip and tar at reading individual files (github.com/jarred-sumner)
139 points by ksec on Nov 10, 2021 | 128 comments



Since Hop doesn't do compression, the most appropriate comparison would be to asar

https://github.com/electron/asar

It's not hard being faster than zip if you are not compressing/uncompressing.


tar doesn't do compression either, and zip doesn't NEED to (several file formats are just bundles of files in a zip with/without compression)


Tar does do compression, via the standard -z flag. Every tar I have ever downloaded used some form of compression, so it's hardly an optional part of the format.


It is important to distinguish between tar the format and tar the utility.

tar the utility is a program that produces tar files, but it can also compress the resulting file.

When you produce a compressed tar file, the contents are written into a tar file, and this tar file as a whole is compressed.

Sometimes the compressed files are named in full like .tar.gz, .tar.bz2 or .tar.xz but often they are named as .tgz, .tbz or .txz respectively. In a way you could consider those files a format in their own right, but at the same time they really are simply plain tar files with compression applied on top.

You can confirm this by decompressing the file without extracting the inner tar file using gunzip, bunzip2 or unxz respectively. This will give you a plain tar file as a result, which is a file and a format of its own.

You can also see the description of compressed tar files with the “file” command and it will say for example “xz compressed data” for a .txz or .tar.xz file (assuming of course that the actual data in the file is really this kind of data). And a plain uncompressed tar file described by “file” will say something like “POSIX tar archive”.
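
To illustrate the layering (a sketch with a hypothetical example.tar.gz; the exact wording of the “file” output varies between versions):

  file example.tar.gz     # reports something like "gzip compressed data"
  gunzip example.tar.gz   # strips the compression layer, leaving example.tar
  file example.tar        # reports "POSIX tar archive"
  tar -tf example.tar     # the plain tar file lists (and extracts) as usual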


> When you produce a compressed tar file, the contents are written into a tar file, and this tar file as a whole is compressed.

I don't think that's quite how it works. I'm pretty sure it's more similar to doing something like `tar ... | gzip > output.tgz`. That is, it streams the tar output through gzip before writing to the file. In fact, with GNU tar at least, you can use the `-I` option to use an arbitrary program to compress.
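
For example, these produce roughly the same kind of result (a sketch; -I/--use-compress-program is GNU tar specific, and zstd is just an example compressor):

  tar -czf output.tgz somedir/               # built-in gzip support
  tar -cf - somedir/ | gzip > output.tgz     # explicit pipe through gzip
  tar -I zstd -cf output.tar.zst somedir/    # arbitrary compressor via -I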


Yeah I formulated it simplistically to not make the comment too long. But I didn't mean to imply that a tar file is actually created on disk or anything like that before compression happens.


Tar (the tool) does compression. Tar (the format) does not. Compression is applied separately from archiving.

https://www.gnu.org/software/tar/manual/html_node/Standard.h...


The numbers shown are with zip -0, which disables compression.


What is the meaning then? What's the benefit of an archive that's not compressed? Why not use a FS?


Plenty of uncompressed tarballs exist. In fact, if the things I’m archiving are already compressed (e.g. JPEGs), I reach for an uncompressed tarball as my first choice (with my second choice being a macOS .sparsebundle — very nice for network-mounting in macOS, and storable on pretty much anything, but not exactly great if you want to open it on any other OS.)

If we had a random-access file system loopback-image standard (open standard for both the file system and the loopback image container format), maybe we wouldn’t see so many tarballs. But there is no such format.

As for “why archive things at all, instead of just rsyncing a million little files and directories over to your NAS” — because one takes five minutes, and the other eight hours, due to inode creation and per-file-stream ramp-up time.


> random-access file system loopback-image standard

> But there is no such format.

What do you think of ISO (ISO9660)? I just downloaded a random image to double-check; it opens on OSX, Windows and Linux just fine.

It's read-only ofc, but that should not be a problem given we're talking about archives


Archives often have checksums, align files in a way conducive to continuous reading (which can be great performance-wise in some cases; and, like zip, they can also allow random read/write), and can provide grouping and logical/semantic validation that is hard to do on a ‘bunch of files’ without messing it up.


FWIW, IPFS does all of that by default (maybe outside of the continuous reading part).


Hop reduces the number of syscalls necessary to both read and check for the existence of multiple files nested within a shared parent directory.

You read from one file to get all the information you need instead of reading from N files and N directories.

You can't easily mount virtual filesystems outside of Linux. Linux does, however, support the copy_file_range syscall, which also makes it faster to copy data around than doing it in application memory.
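
A rough way to see that per-file overhead on Linux, assuming strace is available (the paths, including bundle.hop, are just illustrative):

  # one openat/read/close cycle per file when touching many small files...
  strace -c -e trace=openat,read,close cat node_modules/*/package.json > /dev/null
  # ...versus a single open and a few large sequential reads for one bundled file
  strace -c -e trace=openat,read,close cat bundle.hop > /dev/null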


An archive is essentially a user-mode file system. User-mode things are often more efficient in many ways, as they don't need to call into the kernel as often.



I don't know if it allows streaming, but if it does, transferring files on portable devices or streaming them over the wire is a lot faster this way compared to sending the files directly. Especially for small files.


The speed of accessing a zip -0 archive would seem to be an implementation issue, not a format issue. Why didn't you fix the performance of your zip implementation instead of inventing yet another file format?


The format makes different tradeoffs vs. the zip format, making creation more expensive but reading cheaper. With that said, if you imposed additional restrictions on the zip archive (e.g., sorted entries, always include end of central directory locator), and the reader understands that the archive conforms, the performance difference will probably be imperceptible.


For a random-access archive format that supports compression see DAR: http://dar.linux.free.fr/home.html

It doesn't seem very well known, which is unfortunate because it's much better suited for archiving files compared to gzipped tar (which is great for distributing files, but not great for archiving/backup).


> It's not hard being faster than zip if you are not compressing/uncompressing

Actually, since CPUs are so fast and disk IO is so slow, it is essentially impossible to beat a properly tuned program that reads data from disk and decompresses it on the fly using one that doesn't use compression at all.


This looks really cool, I was looking for a simple compressed archive format that doesn't have junk like permissions, filename encoding issues, etc.

It claims it's easy to write a parser but then requires "Pickle" (derived from Python's Pickle?) which is just a link to a chrome header file... I'm not aware of a standard and like that it doesn't seem like it even wants to be standardized. Do they mean easy to write a parser as long as you're using chrome's JS engine?


I was poking around, and I didn’t see a Weissman score. Wonder if anyone ran a test.


There exists a utility called tarindexer [0] that can be used for random access to tar files. An index text file is created (one time) that is used to record the position of the files in the tar archive. Random reads are done by loading the index file and then seeking to the location of the file in question.

For random access to gzip'd files, bgzip [1] can be used. bgzip also uses an index file (one time creation) that is used to record key points for random access.

[0] https://github.com/devsnd/tarindexer

[1] http://www.htslib.org/doc/bgzip.html
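
A minimal bgzip sketch of that workflow (flag names as I remember them from the htslib man page, so double-check against your version; big.txt and the offsets are made up):

  bgzip -i big.txt                            # compress and write a .gzi index alongside
  bgzip -b 1048576 -s 4096 big.txt.gz > part  # decompress 4 KiB starting at a 1 MiB uncompressed offset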


I've recently been looking into this same issue because I analyse a lot of data like sosreports or other tar/compressed data from customer systems. Currently I untar these onto my zfs filesystem which works out OK because it has zstd compression enabled but I end up decompressing and recompressing which is quite expensive as often the files are GBs or more compressed.

But I've started using a tool called "ratarmount" (https://github.com/mxmlnkn/ratarmount) which creates an index once (something I could automate our upload system to generate in advance, but you can also just process it locally) and then lets you FUSE-mount the file (rough invocation sketch below). This works pretty great, with the only exception that I can't create scratch files inside the directory layout, which in the past I'd wanted to do.
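
Roughly like this, if memory serves (sosreport.tar.xz is a made-up name; the first run builds the index, later mounts reuse it):

  mkdir -p mnt
  ratarmount sosreport.tar.xz mnt/   # builds the index on first run, then FUSE-mounts read-only
  ls mnt/                            # browse and read files without extracting anything
  fusermount -u mnt/                 # unmount when done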

I was surprised how hard a problem it is to get a bundle file format that is indexable and compressed with a good, fast compression algorithm, which at this point mostly boils down to zstd.

While it works quite well, especially with gzip and bzip2, sadly zstd and xz (and some other compression formats) don't allow decompressing only parts of a file by default: it's possible in the formats, but the default tools don't do it. The nitty gritty details are summarised here: https://github.com/mxmlnkn/ratarmount#xz-and-zst-files

The other main option I found was squashfs, which recently grew zstd support. There is also some preliminary zstd support in the zip format, but there are multiple competing standards for it, which is not helpful!


That just sounds like an inferior version of ZIP. IMO, unless you can only work with the tar format (which is a perfectly valid reason, some programs are long-lived), ZIP is a better option for seekable archives because it even supports LZMA compression.


Zip is….. weird. In some undesirable ways sometimes, and desirable in others.

The index is stored at the tail end of a zip file, for instance, which is really not cool for something like tape, and doubly not cool when you don’t know in advance how big the data on the tape is.


It is a little more complicated than that: having the index at the end is great for writing tape, but sucks for reading tape. The nice thing of that design is that you can "stream" the data (read one file, write it, and just make some notes on the side for your eventual index). But you can't stream things off (you have to read the index first).

Tar is of course all-around great for tape, since every file is essentially put on by itself (with just a little bit of header about it). Again this is great for streaming things both on and off tape. But you can't do any sort of jumping around. You have to start at the beginning and then go from file to file. This gets even worse if you try to compress it (e.g. .tar.gz), as you then have to decompress every byte to get to the next one.


couple of points:

a) zip spec does not require file index to be at the end. Only the pointer to file index

b) You can do streaming decompression of zip as each zip entry has a header + filename inline with data.

c) I came up with this optimization (placing the index at the front and making file IO sequential to take advantage of OS readahead) for omni.ja files in Firefox. It's still a standard zip file, but the index lives at the front. Most zip tools can open that file unmodified (though they sometimes complain).


Oh, you learn something new every day, thanks! Also, for anyone else reading this: I found this elsewhere and it's applicable [https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT]

That is actually applicable to a project I’m working on now, any chance your implementation is in python and open source somewhere?


Yes, it's in python :)

https://github.com/humphd/mozilla-central-old/blob/9d4d9f265...

Curious, how is it applicable?


Offline high-speed data ingestion of multi-thousand-file, multi-hundred-GB data sets, followed by rapid transfer to permanent online storage (and replication fan-out, etc).

Seems convenient to allow optimization for high speed sequential reads and random read/writes at different parts of the life cycle, along with indexing, crcs, signatures, etc.

One big issue with zip storing the index at the end, of course, is that a truncated file basically loses most of its context and is generally unrecoverable even in part, which this could also help with from a durability perspective.

Storing it at the beginning (without an end pointer) opens up the possibility that you have a valid-looking archive that is actually truncated and missing a lot of data, and you won't know until you look past the end (or validate total bytes or whatever, which doesn't work well when streaming).

Storing the index at the beginning, with a pointer and file signature at the end, plus all the other format extensions, does solve all of this. Which is convenient.


Neat, let me know if you have any further questions. Would love to make this a more common thing that happens to zip files. As a result of me doing this in Firefox, zip utilities (e.g. 7zip) started complaining a lot less about this creative interpretation of the standard :)


Also limited to 4 GB unless it's ZIP64, which is limited to 16 EB and not supported by all zip implementations.


Don't tapes support some form of seeking?


What lazide is saying is that some important information is stored at the end (as in the very end of a book/movie, or the end of the line). So imagine there is a 1 TB .zip archive on the tape: the tape device has to go further (deeper) into that 1 TB file to get the last bit of vital data the user wants to see. Normally the vital bit is at the start of the file (as in the front of the line, or the beginning of the book), so the user has the information ready before the rest is transferred. But in lazide's case, the tape device has to keep reading through the entire 1 TB zip to get that last vital bit of information, which makes it slow. It's more like it cannot "skip the line" and has to go through the entire line to get there.


They do - it’s very slow, and entirely linear. It also puts wear on the tape, so if you do it a lot you’ll break the tape in a not-super-long-time.

And since you wouldn't know where the end is by reading the beginning, you'll have to keep reading until you hit the right marker, then seek backwards (and you generally can't read backwards on most tape drives, so you need to jump back and re-read).


Also relevant is pixz [1] which can do parallel LZMA/XZ decompression as well as tar file indexing.

[1] https://github.com/vasi/pixz


I've done the same thing for decades with the -v -R options to GNU tar, which cause it to print the block number of each file to stderr while creating the archive. With knowledge of the block number, you can then pipe dd to tar to extract a single file without a linear search. Or, in the case of tapes, you can use mt to seek to the block before you run tar. Of course, something like tarindexer is more convenient!
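
The recipe looks roughly like this (a sketch; file names and the block number are made up, and I'm assuming the -R numbers are 512-byte tar blocks, so verify against your tar's manual):

  # capture the listing with block numbers (tar sends it to stdout or stderr depending on where the archive goes)
  tar -cvRf backup.tar bigdir/ > blocklist.txt 2>&1
  # blocklist.txt contains lines like "block 12345: bigdir/some/file"; jump straight to that member later
  dd if=backup.tar bs=512 skip=12345 | tar -xvf - bigdir/some/file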


Also see SQLite archive files. Random access and compression. https://www.sqlite.org/sqlar.html
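
For reference, the sqlite3 shell's archive mode looks roughly like this (going from the sqlar docs; needs a reasonably recent sqlite3, and the file names are made up):

  sqlite3 project.sqlar -Acv src/ README.md   # create: members become zlib-compressed rows in a normal SQLite db
  sqlite3 project.sqlar -Atv                  # list contents
  sqlite3 project.sqlar -Axv src/main.c       # extract a single member, random-access style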


Yeah I wonder how the format compares to this. It also wasn’t obvious whether hop supports adding data with an operation that isn’t just append (a stricter requirement for tar because it was designed for magnetic tape). It wasn’t clear to me from the table. sqlar can add to an existing archive and doesn’t require uncommon software to extract if you don’t have the right tool to hand.


From the README:

"Why? Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. (...)"

It seems the author is working on a fast JS bundler tool (https://bun.sh) and the submission links to an attempt to fix some of the challenges in processing lots of npm files quickly. (But of course could be useful beyond that.)


Came to mention Bun when I saw this hit the front page. I’ve been following Jarred’s Twitter since I heard about Bun and it’s quite impressive (albeit incomplete). To folks wondering why another bundler/what makes Bun special:

- Faster than ESBuild/SWC

- Fast build-time macros written as JSX (likely friendlier to develop than say a Babel plugin/macro). These open up a lot of possibilities that could benefit end users too, by performing more work on build/server and less client side.

- Targeting ecosystem compatibility (eg will probably support the new JSX transform, which ESBuild does not and may not in the future)

- Support for integration with frameworks, eg Next.js

- Other cool performance-focused tools like Hop and Peechy[1] (though that’s a fork of ESBuild creator’s project Kiwi)

This focus on performance is good for the JS ecosystem and for the web generally.

1: https://github.com/Jarred-Sumner/peechy


> "Why? Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files"

Ah, those trees of is-character packages depending on is-letter-a packages, themselves depending on is-not-number packages, each appearing several times in different versions, are probably challenging to bundle and unpack efficiently. We might want to address file systems too, so they can also handle npm efficiently (actual size on disk and speed).

Or maybe the actual fix is elsewhere.

Runs, fleeing their screen screaming in fear and frustration

(no offense to the author, I actually find the problem interesting and the work useful)


If I had asked them what they wanted, they would've said faster filesystems.


Obviously, what we really need is an OS optimized for npm.


Maybe the way to fix these issues would be to try to reduce the package bloat in the npm ecosystem? Otherwise I'm afraid a faster bundler will only result in even more packages being bundled. But I guess it's hard to get that genie back into the bottle...


Speaking of faster coreutils replacements, I highly recommend ripgrep (rg) and fd-find (fd), which are Rust-based, incompatible replacements for grep and find.

I know I'm way behind the popularity curve* (they're already in debian-stable, for crying out loud); but for the benefit of anyone even more out of the loop than myself, do check these out. The speed increase is amazing: a significant quality-of-life improvement, essentially for free**. Particularly if you're on a modern NVMe SSD and not utilizing it properly. (Incidentally, did you know the "t" in coreutils' "tar" stands for magnetic [t]ape?)

* (The now-top reply to this comment says 'ag' is even superior to 'rg', and they're probably right, but I had no clue about it! I did say I'm ignorant!)

**(Might have some cost if you're heavily invested in power-user syntax of the GNU or BSD versions, in which case incompatibility has a price).

https://github.com/BurntSushi/ripgrep

https://github.com/sharkdp/fd


When I looked, ag https://github.com/ggreer/the_silver_searcher was more featureful than ripgrep, yet it's always ripgrep that gets mentioned. :/


Which features does it have that ripgrep doesn't?


I still haven't seen anyone elaborate on what features ag has over ripgrep. Can you explain what you like about ag that ripgrep currently doesn't do?


weird right? I don't know what ripgrep has, maybe it's just the name? the fact that you call `ag` but it's called `the_silver_searcher`?


It's not weird because ripgrep has had more features than ag for a long time. Originally ripgrep didn't have multi-line support or support for fancy regex features like look-around, but it has both of those now. It also has support for automatic UTF-16 transcoding, preprocessors for searching non-text files and overall less buggy support for gitignore. (Look at ag's issue tracker.)

And then there's also the fact that ag isn't that great as a general purpose grep tool. It really falls over in terms of speed when searching, say, bigger files:

    $ time rg 'Sherlock Holmes' OpenSubtitles2018.raw.en | wc -l
    7673

    real    1.475
    user    1.115
    sys     0.356
    maxmem  12511 MB
    faults  0

    $ time ag 'Sherlock Holmes' OpenSubtitles2018.raw.en | wc -l
    7673

    real    20.276
    user    19.850
    sys     0.413
    maxmem  12508 MB
    faults  0
A lot of people like to comment and say, "well ag is fast enough for me." Well, OK, that's fine. But if you're wondering about why other people might mention ripgrep more, well, maybe it isn't just about the way you use the tools. For example, if you only ever search tiny repositories of code, then you aren't going to care that ripgrep is faster than ag. Which is perfectly reasonable, but it should also be reasonable to be aware that others might actually search bigger corpora than you.


ripgrep's regexp has lookaround? The linked docs still don't say that.

https://docs.rs/regex/1.5.4/regex/#syntax

> This crate provides a library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a few features like look around and backreferences.

Edit: I see. It can use pcre2 but it's a build time option which of course Ubuntu has off.


> Edit: I see. It can use pcre2 but it's a build time option which of course Ubuntu has off.

That's unfortunate. Perhaps a bug report to the packagers is in order? Archlinux enabled it: https://github.com/archlinux/svntogit-community/blob/0dc033f...



From my side, I knew about both `ripgrep` and `the_silver_searcher` but I will openly admit I've lost faith in C and C++'s abilities to protect against memory safety errors.

Thus, Rust tools get priority, at least for myself. There are probably other memory-safe languages, but I haven't had the chance to give them a spin like I did with Rust. If I find out about them, I'll also prefer tools written in them, provided there's no Rust port and the alternatives are C/C++ tools.


Ag is shorthand for argentum which in turn means silver.

https://en.wikipedia.org/wiki/Silver


Code search on your own codebase is generally not a place where memory safety shines.


When it's about security, I try not to pick and choose.

Plus a memory error might lead to scary consequences (like a random script getting elevated privileges).

Finally, and last I checked, `rg` is quite featureful and I've never felt constrained by it. So for me it's a win all around.


This is kind of my point: security requires a threat model, and you don't really have one. Rust has a lot going for it, and it does hold promise in improving the security of a lot of critical software. But in this case, it's not really doing that, so it's kind of misleading to say it is meaningfully doing anything for security.


I agree that in the case of grep-ing in the terminal the odds of covering your butt well enough by using a Rust tool are super slim.

That being said, there are powerful adversaries of anonymity and the right to personal data out there -- and security in depth is what works best against them. There's no one UltimateSecuritySolution™; there are many small ones that we layer on top of each other so we don't allow even a smidgen of air to pass between the cracks.

But yeah, I am paranoid. I am gradually preparing myself to move from macOS to Linux and even though I am not a criminal and never will be, I'll still make a heroic effort to make the odds of any foul play against me practically zero. (And that's why I will start using the userland Rust tools alternatives as well.)

I'll concede that in my case the biggest impact would probably come from running Chrome in a jail, and not from using `rg` vs. `ag`. That much is true, yep.


Yeah, it's kind of a shame in this case. There's tons to talk about in the area of where Rust shines here ("makes concurrency easy", "provides easy access to fast algorithm libraries", etc.) but security is just not really one of those points.


I don't disagree. I am just happy to point out that Rust increases security (since most security vulnerabilities I see reported are buffer over/under-flows or other memory safety mishaps). Rust definitely does not solve everything in security. You can still open yourself up for an elementary replay attack if you're not careful -- like I did just a few days ago.


This is what I suspected: Rust zealotry. We are searching for a string. Language matters to the guy writing it but the user? Nah. There's no security hole here.


Who's the zealot here? The guy who doesn't want to risk and openly states so, or the guy proclaiming there's no risk, even with ample historical evidence for the opposite?

I don't accept your labeling, especially when it's so egregiously misguided.

Also, you're clearly seeing what you want to see.


ripgrep is noticeably faster than ag. I'm not sure what features you mention that ripgrep is missing, but I've been plenty happy with it for basic grepping around. I'm sure it's also partly because BurntSushi, the author of ripgrep, is reasonably active here.

Some fun benchmarks, and a fair description of how text search works, can be found in this 2016 post https://blog.burntsushi.net/ripgrep/


> ripgrep is noticeably faster than ag.

Since both are instant on every codebase I care about, I'm not so sure about that. These tools were always chiefly I/O bound, and SSDs have largely torn that down by now.

But for features: there's no lookaround in the regex.


They are only I/O bound when searching data that isn't in cache. It's often the case that you're searching, for example, a repository of code repeatedly.

> Since both are instant on every codebase I care about, not so sure about that.

It is very possible that there is no performance difference between ripgrep and ag for your use cases, but that does not mean there isn't a performance difference between ripgrep and ag. For example, in my checkout of the chromium repository:

    $ time rg -c Openbox
    testing/xvfb.py:2
    tools/metrics/histograms/enums.xml:1
    ui/base/x/x11_util.cc:1

    real    0.448
    user    2.593
    sys     2.490
    maxmem  77 MB
    faults  0

    $ time ag -c Openbox
    ui/base/x/x11_util.cc:1
    tools/metrics/histograms/enums.xml:1
    testing/xvfb.py:2

    real    2.302
    user    2.996
    sys     10.462
    maxmem  15 MB
    faults  0
> But for features: there's no lookaround in the regex.

That's not true and hasn't been true for a long time. ripgrep supports PCRE2 with the -P/--pcre2 flag. You can even put `--engine auto` in an alias or ripgreprc file and have ripgrep automatically select the regex engine based on whether you're using "fancy" features or not.
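
For anyone wanting to try that, a sketch of the setup (assumes a ripgrep build with PCRE2 available; ripgrep reads extra flags, one per line, from the file named by RIPGREP_CONFIG_PATH):

  echo '--engine=auto' >> ~/.ripgreprc
  export RIPGREP_CONFIG_PATH="$HOME/.ripgreprc"
  rg 'foo(?=bar)' src/    # the look-around pattern now transparently selects the PCRE2 engine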

In general, I also claim that ripgrep has far fewer bugs than ag. ag doesn't really get gitignore support correct, although if you only have simple gitignores, its support might be good enough.


These tools are I/O bound, but not disk bound. I often perform multiple searches in a row, and all but the first will hit the disk cache.


In an attempt to remove some of my ignorance, I tried this: (MBA, M1)

  time rg Query src > /tmp/junk.rg
  rg Query src > /tmp/junk.rg  0.01s user 0.04s system 156% cpu 0.037 total
  time pt Query src > /tmp/junk.pt
  pt Query src > /tmp/junk.pt  1.44s user 10.80s system 220% cpu 5.545 total
  time ag Query src > /tmp/junk.ag
  ag Query src > /tmp/junk.ag  4.32s user 9.82s system 204% cpu 6.918 total

  ls -l /tmp/junk*
  -rw-r--r--  1 sramam  wheel  16492080 Nov 10 16:05 /tmp/junk.ag
  -rw-r--r--  1 sramam  wheel  16492080 Nov 10 16:04 /tmp/junk.pt
  -rw-r--r--  1 sramam  wheel     27477 Nov 10 16:04 /tmp/junk.rg

This is a TypeScript module; it turns out ripgrep correctly ignores the sourcemaps, hence the improved performance. I like tools that are smart by default.

(EDIT: formatting)


I think rg isn’t ignoring sourcemaps to be fast. It tries to ignore based on .gitignore and similar files, but maybe the sourcemaps are ignored for being binary.


I thought so too. The documentation for all of them states that they honor .gitignore. In my case, the compiled code is committed.

My query was for a generic term, "src", so it likely matched the file paths embedded within the long lines of a source map. Hence the massive file-size difference.

(I did inspect the files to validate)


The main goal of respecting gitignore is generally to decrease noise in search results. But, this does of course also have the effect of improving performance in a lot of cases.


Okay, call me lazy. Not for ripgrep, I installed it early from source. However, your fdfind (fd) mention made me curious. Thought I'd give it an apt install on Ubuntu 20.04 LTS but hit a dead end. So perhaps it's in debian-stable but not in Ubuntu. Just saying, so I can feel less out of the loop ;)


Upstream says Ubuntu's package is 'fd-find' (and the executable is 'fdfind' -- both renamed from upstream's 'fd' because of a name collision with an unrelated Debian package. If you don't care about that one, you can safely alias fd="fdfind").

https://github.com/sharkdp/fd

(I've edited my first comment in response to this reply: I originally wrote "fdfind". (For a comment about regexp tools, this is a uniquely, hilariously stupid oversight. Sorry!)).


Well, even though it's late here, I should still have had enough creative energy to insert the minus in there ... yeah, thanks for that. So now I feel really behind because sure, it's packaged in Ubuntu 20.04. Thanks again.


There is also ouch, a fast multi-format compressor/decompressor written in Rust

https://github.com/ouch-org/ouch


I made this

Happy to answer any questions or feedback


If you manage the index using a B-tree, then you can perform partial updates of the B-tree by appending the minimum number of pages needed to represent the changes. At that point, you can append N additional new files to the tail of the archive, and add a new index that re-uses some of the original index.

Just an idea to check the "append" box.

See also B-trees, Shadowing, and Clones https://www.usenix.org/legacy/events/lsf07/tech/rodeh.pdf


Good idea, worried a little about impact to serialization/deserialization time though. Maybe could still store it as a flat array


Consider variable-length integers for file sizes and string lengths. When I need them, I usually implement what’s written in the MKV spec: https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si...

That’s a good way to bypass that 4GB size limit, and instead of wasting 8 bytes per number this will even save a few bytes.

However, I’m not sure how easy it is to implement in the language you’re using. I have only done that in C++ and modern C#. They both have intrinsics to emit the BSWAP instruction (or an ARM equivalent) to flip the byte order of an integer, which helps with the performance of that code.


Don't use 32-bit file times! Change it quick while you have the chance.


How far are we from public beta or Bun 1.0?

Not related to Bun or Hop, but regarding the use of Zig: I am wondering if you will do a write-up on it someday.


Something like two weeks before a public beta


Have you looked at using LZ4 compression? Supposedly it has a very fast decompressor, so it could be better than uncompressed files for your purposes.


I don't think builtin compression is necessary, but it would be better if the CLI supported it similarly to tar.

Compressing a 57.5 MB .hop archive with `gzip -c` returns an 11.7 MB file. This is a similar compression ratio as tar -> tar.gz

Compression is great for sending files over the network, but for fast reads, it might be undesirable


It’s worth benchmarking -- LZ4 can decompress faster than SSD read speeds, so it could well be a net time saving.


How about compressing the files while building the archive instead of compressing the entire archive itself? That ought to preserve reasonably fast reads. It won't give optimal compression but in practice should do alright (and of course will be smaller than no compression).
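
That's essentially the zip model: each entry is compressed on its own, so members stay individually seekable, unlike a .tar.gz where one stream covers everything. Roughly:

  zip -r archive.zip somedir/        # per-entry compression: any member can be read without touching its neighbours
  tar -czf archive.tar.gz somedir/   # whole-stream compression: better ratio, but reaching a member means decompressing everything before it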


What do you think about Valve VPK?


This seems like more of a tar problem than a zip problem, unless I'm missing something, given the lack of compression on Hop.


I think you can still zip files without compression.


Usually the -0 option (https://linux.die.net/man/1/zip), or -mx=0 for 7zip-style programs.

If you also used the mt (multithreading) option with 7zip and just stored, you could probably get a decent read rate, as I think it spins up extra threads.
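
Concretely, something like this (flag spellings per the zip and 7-Zip manuals; the multithreading switch is -mmt, and the file names are just examples):

  zip -0 -r bundle.zip somedir/            # zip, store only (no compression)
  7z a -mx=0 -mmt=on bundle.7z somedir/    # 7-Zip equivalent: store only, multithreading on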


Yes, but I don't think anyone does that.


This ability is an important aspect of “document” archives following the Open Container Format (OCF): the first file in the archive must be uncompressed, called “mimetype”, and contain the mimetype of the document encoded in US-ASCII.
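
The classic EPUB packaging recipe shows this (a sketch; META-INF/ and OEBPS/ are just the conventional directory names, and -X keeps extra file attributes out of the way):

  zip -X0 book.epub mimetype            # the mimetype member goes first, stored uncompressed
  zip -rX9 book.epub META-INF/ OEBPS/   # everything else can be deflated as usual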


File names not being normalized across platforms is sometimes beneficial. Ignoring symlinks is also sometimes beneficial. However, sometimes these are undesirable features. The same is true of solid compression, concatenatability, etc. Also, being able to efficiently update an archive means that some other things may be lacking, so there are advantages and disadvantages to it.

I dislike the command-line format of Hop; it seems to be missing many features. Also, 64-bit timestamps and 64-bit data lengths (and offsets) would be helpful, as some other people mention; I agree with them, to improve it in this way. (You could use variable-length numbers if you want to save space.)

My own opinion for the general case is that I like to have a concatenable format with separate compression (although this is not suitable for all applications). One way to allow additional features might be having an extensible set of fields, so you can include/exclude file modes, modification times, numeric or named user IDs, IBM code pages, resource forks, cryptographic hashes, multi-volumes, etc. (I also designed a compression format with an optional key frame index; this way the same format supports both solid and non-solid compression, whichever way you want, and this can work independently from the archive format being used.)

For making backups I use tar with a specific set of options (do not cross file systems, use numeric user IDs, etc); this is then piped to the program to compress it, stored in a separate partition, and then recorded on DVDs.

For some simple uses I like the Hamster archive format. However, it also limits each individual file inside to 4 GB (although the entire archive is not limited in this way), and no metadata is possible. Still, for many applications, this simplicity is very helpful, and I sometimes use it. I wrote a program to deal with these files, and it has a good number of options (which I have found useful) without being too complicated. Options that belong in external programs (such as compression) are not included, since you can use separate programs for that.


> I dislike the command-line format of Hop; it seems to be missing many features.

I agree 100%. I wrote most of it in like three hours; it's not a polished product.

> My own opinion for the general case is that I like to have a concatenable format with separate compression (although this is not suitable for all applications). One way to allow additional features might be having an extensible set of fields, so you can include/exclude file modes, modification times, numeric or named user IDs, IBM code pages, resource forks, cryptographic hashes, multi-volumes, etc. (I also designed a compression format with an optional key frame index; this way the same format supports both solid and non-solid compression, whichever way you want, and this can work independently from the archive format being used.)

I'm wary of slowing it down by adding lots of features. I think that, generally speaking, _more_ purpose-built binary formats should exist.

Engineers do this all the time with YAML files and JSON, but why not binary files?


> I'm wary of slowing it down by adding lots of features. I think that, generally speaking, _more_ purpose-built binary formats should exist.

It is a valid point, yes. However, my point was that unknown fields can easily be skipped.


> Can't be larger than 4 GB

How does that even happen in 2021?


4 GB is the maximum value of a 32-bit unsigned int. If I had to guess, that's the maximum size of the array/vector container in Zig.


Zig has weird limits in a few places (1k compile-time branches without additional configuration, the width of any integer type needs to be less than 2^16, ...). Array/vector lengths aren't the issue though; you can happily work with 64-bit lengths on a 64-bit system.

Skimming the source, there are places where the author explicitly chooses to represent lengths with 32-bit types (e.g., schema.zig/readByteArray()). I bet you could massage the code to work with larger data without many issues.


> I bet you could massage the code to work with larger data without many issues.

I'm sure the author would appreciate a pull request if it's that straight forward.


So it is that straightforward for a proof of concept (downgrade to an old version of Zig compatible with the project, patch an undefined variable bug the author introduced yesterday, s/32/64, add 4 bytes to main.zig->Header and its accesses).

Doing so makes the program slower though which might be a non-starter for a performance focused project. Plus you'd need a little more work to properly handle large archives on 32-bit and smaller systems (at least those which support >4GB files).


Offsets and lengths are stored as unsigned 32 bit integers. This saves space/memory, but means it won’t work for anything bigger than 4 GB

Maybe that’s overly conservative. Wouldn’t be hard to change


With most processors being 64-bit today, what would be the impact of using uint64 everywhere?


The answer is at the bottom of the page: all the offset/length data are stored as uint32s.

2^32 = 4294967296

Not sure if this limitation is being enforced by upstream concerns, but this is why this code in particular is limited to that size.


I used to use cdb a lot for random access of small key/value pairs (~100k entries, 10kb per entry). It's effectively a hashtable on disk.

https://cr.yp.to/cdb.html


> No random limits: cdb can handle any database up to 4 gigabytes.

I suppose this could easily be changed, right? I could not compile it though. I got this error:

  /usr/bin/ld: errno: TLS definition in /lib/x86_64-linux-gnu/libc.so.6 section .tbss mismatches non-TLS reference in cdb.a(cdb.o)
  /usr/bin/ld: /lib/x86_64-linux-gnu/libc.so.6: error adding symbols: bad value
I never came across this before. Ideas as to why or how to fix?

Nevermind, I fixed it. I added `#include <errno.h>` to `error.h`.


Related: pixz – a variant of xz that works with tar and enables (semi)efficient random access: https://github.com/vasi/pixz


Is there anything like this for RAR files? I'm looking for an alternative to unrar, as I've recently learned that its code is actually non-free.


I think maybe you misunderstood this project....

Anyway libarchive can read rar files so just use bsdtar. It can do many other archive files [0] like cpio as well. One standardized interface for everything is nice

[0] https://github.com/libarchive/libarchive#supported-formats


I wonder how this format compares to SquashFS


Is 7zip = zip ?


No. The 7zip format has much better compression than the ancient (1990s) zip format.


ZIP as a standard is no longer ancient. It has support for modern encryption and compression, including LZMA. This is sometimes referred to as the ZIPX archive, but it's part of the standard in its later revisions.


All this "no longer ancient" means is that zip is now out of the window. Its value has been lost.

The format has been perverted, and we can't trust zip as the "just works" option that will open on any platform, with any implementation, anymore.

All because somebody thought it a good idea to try to leverage the zip name's attached popularity to try and make a new format instantly popular.

Great job. This is why we can't have good things.


The same can be said of HTML. I realise that file archive formats are expected to be more stable (they are for archiving, yes?), but is it right to expect them to be forever frozen in amber? Especially when open source or free decompressors exist for every version of every system? ZIP compressed using LZMA is even supported in the last version of 7-zip compiled for MS-DOS.


>ZIP compressed using LZMA is even supported in the last version of 7-zip compiled for MS-DOS.

But then, why wouldn't you just use the 7z format?

The expectation with ZIP is (or was) that it'll unpack fine, even under CP/M.

Moving to '.zipx' extension was the right move, but it was done far too late.


Moving to the .zipx extension was definitely the right (and only) move. This is due to the fact that Microsoft's compressed folder code hasn't been updated in years, and thus it isn't safe to send out any Zip files that use any new features [1].

[1]: https://devblogs.microsoft.com/oldnewthing/20180515-00/?p=98...


Thank you Microsoft.

I hope if/when they upgrade their zip support, they only accept the new format if the filename extension is zipx.


zipx != zip -- we are speaking of file compression standards NOT branded compression software programs


No, it literally is part of the ZIP standard [0]. I updated my original comment.

ZIPX archives are simply ZIP archives that have been branded with the X for the benefit of the users who may be trying to open them with something ancient, that doesn't yet support LZMA or the other new features.

[0] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT


When a 'hacker' (programmer) says 'zip file' they refer to the widely compatible implementation which is likely to work in any common implementation of archive handling software.

https://en.wikipedia.org/wiki/ZIP_(file_format)#Version_hist...

According to the above version table, that would conform to PKWARE 2.0, which, as with the ODF file specification, limits packing methods to STORE and DEFLATE only, as mentioned in the standardization section and ISO/IEC 21320-1.

When you say 'zip' you should be saying 'zipx' at the very least, but you may as well be comparing a tar file to a tar.zstd file in the same case.


> may as well be comparing a tar file to a tar.zstd file in the same case.

It's not exactly hard to get a new compressor on a *nix system. It's either in the repos, or not so hard to compile. On non-nixes, there's usually a binary. Tarballs have not been limited to only tar.gz for a very long time, though a lot of people do choose to be conservative in how their distributed files are compressed.

> When a 'hacker' (programmer) says 'zip file' they refer to the widely compatible implementation which is likely to work in any common implementation of archive handling software.

You don't speak for every hacker, certainly not for me. If you're implementing ZIP these days and it's not, at the very least, capable of reading ZIP64 archives (forget about newer compression methods), then you're just creating obsolete software.


7zip uses LZMA, a much more advanced and slower format than the RFC 1951 DEFLATE used in zlib/gzip/png/zip/jar/office docs/etc.


Zip can also use LZMA, assuming the recipient can interoperate with that.


(cries in winrar)


Hey! Did you register me? </nag>



