Hop: Faster than unzip and tar at reading individual files (github.com/jarred-sumner)
139 points by ksec on Nov 10, 2021 | 128 comments



Since Hop doesn't do compression, the most appropriate comparison would be to asar

https://github.com/electron/asar

It's not hard being faster than zip if you are not compressing/uncompressing.


tar doesn't do compression either, and zip doesn't NEED to (several file formats are just bundles of files in a zip with/without compression)


Tar does do compression, via the standard -z flag. Every tar I have ever downloaded used some form of compression, so it's hardly an optional part of the format.


It is important to distinguish between tar the format and tar the utility.

tar the utility is a program that produces tar files, but it can also compress the resulting file.

When you produce a compressed tar file, the contents are written into a tar file, and this tar file as a whole is compressed.

Sometimes the compressed files are named in full like .tar.gz, .tar.bz2 or .tar.xz but often they are named as .tgz, .tbz or .txz respectively. In a way you could consider those files a format in their own right, but at the same time they really are simply plain tar files with compression applied on top.

You can confirm this by decompressing the file without extracting the inner tar file using gunzip, bunzip2 or unxz respectively. This will give you a plain tar file as a result, which is a file and a format of its own.

You can also see the description of compressed tar files with the “file” command and it will say for example “xz compressed data” for a .txz or .tar.xz file (assuming of course that the actual data in the file is really this kind of data). And a plain uncompressed tar file described by “file” will say something like “POSIX tar archive”.
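
To illustrate the layering (a sketch with a hypothetical example.tar.gz; the exact wording of the “file” output varies between versions):

  file example.tar.gz     # reports something like "gzip compressed data"
  gunzip example.tar.gz   # strips the compression layer, leaving example.tar
  file example.tar        # reports "POSIX tar archive"
  tar -tf example.tar     # the plain tar file lists (and extracts) as usual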


> When you produce a compressed tar file, the contents are written into a tar file, and this tar file as a whole is compressed.

I don't think that's quite how it works. I'm pretty sure it's more similar to doing something like `tar ... | gzip > output.tgz`. That is, it streams the tar output through gzip before writing to the file. In fact, with GNU tar at least, you can use the `-I` option to use an arbitrary program to compress.
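
For example, these produce roughly the same kind of result (a sketch; -I/--use-compress-program is GNU tar specific, and zstd is just an example compressor):

  tar -czf output.tgz somedir/               # built-in gzip support
  tar -cf - somedir/ | gzip > output.tgz     # explicit pipe through gzip
  tar -I zstd -cf output.tar.zst somedir/    # arbitrary compressor via -I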


Yeah I formulated it simplistically to not make the comment too long. But I didn't mean to imply that a tar file is actually created on disk or anything like that before compression happens.


Tar (the tool) does compression. Tar (the format) does not. Compression is applied separately from archiving.

https://www.gnu.org/software/tar/manual/html_node/Standard.h...


The numbers shown are with zip -0, which disables compression.


What is the meaning then? What's the benefit of an archive that's not compressed? Why not use a FS?


Plenty of uncompressed tarballs exist. In fact, if the things I’m archiving are already compressed (e.g. JPEGs), I reach for an uncompressed tarball as my first choice (with my second choice being a macOS .sparsebundle — very nice for network-mounting in macOS, and storable on pretty much anything, but not exactly great if you want to open it on any other OS.)

If we had a random-access file system loopback-image standard (open standard for both the file system and the loopback image container format), maybe we wouldn’t see so many tarballs. But there is no such format.

As for “why archive things at all, instead of just rsyncing a million little files and directories over to your NAS” — because one takes five minutes, and the other eight hours, due to inode creation and per-file-stream ramp-up time.


> random-access file system loopback-image standard

> But there is no such format.

What do you think of ISO (ISO9660)? I just downloaded a random image to double-check; it opens on OSX, Windows and Linux just fine.

It's read-only ofc, but that should not be a problem given we're talking about archives


Archives often have checksums, align files in a way conducive to continuous reading (which can be great performance-wise in some cases; and, like zip, they can also allow random read/write), and can provide grouping and logical/semantic validation that is hard to do on a ‘bunch of files’ without messing it up.


FWIW, IPFS does all of that by default (maybe outside of the continuous reading part).


Hop reduces the number of syscalls necessary to both read and check for the existence of multiple files nested within a shared parent directory.

You read from one file to get all the information you need instead of reading from N files and N directories.

You can't easily mount virtual filesystems outside of Linux. Linux does, however, support the copy_file_range syscall, which also makes it faster to copy data around than doing it in application memory.
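
A rough way to see that per-file overhead on Linux, assuming strace is available (the paths, including bundle.hop, are just illustrative):

  # one openat/read/close cycle per file when touching many small files...
  strace -c -e trace=openat,read,close cat node_modules/*/package.json > /dev/null
  # ...versus a single open and a few large sequential reads for one bundled file
  strace -c -e trace=openat,read,close cat bundle.hop > /dev/null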


An archive is essentially a user-mode file system. User-mode things are often more efficient in many ways, as they don't need to call into the kernel as often.



I don't know if it allows streaming, but if it does, transferring files on portable devices or streaming them over the wire is a lot faster this way compared to sending the files directly. Especially for small files.


The speed of accessing a zip -0 archive would seem to be an implementation issue, not a format issue. Why didn't you fix the performance of your zip implementation instead of inventing yet another file format?


The format makes different tradeoffs vs. the zip format, making creation more expensive but reading cheaper. With that said, if you imposed additional restrictions on the zip archive (e.g., sorted entries, always include end of central directory locator), and the reader understands that the archive conforms, the performance difference will probably be imperceptible.


For a random-access archive format that supports compression see DAR: http://dar.linux.free.fr/home.html

It doesn't seem very well known, which is unfortunate because it's much better suited for archiving files compared to gzipped tar (which is great for distributing files, but not great for archiving/backup).


> It's not hard being faster than zip if you are not compressing/uncompressing

Actually, since CPUs are so fast and disk IO is so slow, it is essentially impossible to beat a properly tuned program that reads data from disk and decompresses it on the fly using one that doesn't use compression at all.


This looks really cool, I was looking for a simple compressed archive format that doesn't have junk like permissions, filename encoding issues, etc.

It claims it's easy to write a parser but then requires "Pickle" (derived from Python's Pickle?) which is just a link to a chrome header file... I'm not aware of a standard and like that it doesn't seem like it even wants to be standardized. Do they mean easy to write a parser as long as you're using chrome's JS engine?


I was poking around, and I didn’t see a Weissman score. Wonder if anyone ran a test.


There exists a utility called tarindexer [0] that can be used for random access to tar files. An index text file is created (one time) that is used to record the position of the files in the tar archive. Random reads are done by loading the index file and then seeking to the location of the file in question.

For random access to gzip'd files, bgzip [1] can be used. bgzip also uses an index file (one time creation) that is used to record key points for random access.

[0] https://github.com/devsnd/tarindexer

[1] http://www.htslib.org/doc/bgzip.html
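
A minimal bgzip sketch of that workflow (flag names as I remember them from the htslib man page, so double-check against your version; big.txt and the offsets are made up):

  bgzip -i big.txt                            # compress and write a .gzi index alongside
  bgzip -b 1048576 -s 4096 big.txt.gz > part  # decompress 4 KiB starting at a 1 MiB uncompressed offset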


I've recently been looking into this same issue because I analyse a lot of data like sosreports or other tar/compressed data from customer systems. Currently I untar these onto my zfs filesystem which works out OK because it has zstd compression enabled but I end up decompressing and recompressing which is quite expensive as often the files are GBs or more compressed.

But I've started using a tool called "ratarmount" (https://github.com/mxmlnkn/ratarmount) which creates an index once (something I could automate our upload system to generate in advance, but you can also just process it locally) and then lets you FUSE-mount the file (rough invocation sketch below). This works pretty great, with the only exception that I can't create scratch files inside the directory layout, which in the past I'd wanted to do.
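
Roughly like this, if memory serves (sosreport.tar.xz is a made-up name; the first run builds the index, later mounts reuse it):

  mkdir -p mnt
  ratarmount sosreport.tar.xz mnt/   # builds the index on first run, then FUSE-mounts read-only
  ls mnt/                            # browse and read files without extracting anything
  fusermount -u mnt/                 # unmount when done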

I was surprised how hard a problem it is to get a bundle file format that is indexable and compressed with a good, fast compression algorithm, which at this point mostly boils down to zstd.

While it works quite well, especially with gzip and bzip2, sadly zstd and xz (and some other compression formats) don't allow decompressing only parts of a file by default: it's possible in the formats, but the default tools don't do it. The nitty gritty details are summarised here: https://github.com/mxmlnkn/ratarmount#xz-and-zst-files

The other main option I found was squashfs, which recently grew zstd support. There is also some preliminary zstd support in the zip format, but there are multiple competing standards for it, which is not helpful!


That just sounds like an inferior version of ZIP. IMO, unless you can only work with the tar format (which is a perfectly valid reason, some programs are long-lived), ZIP is a better option for seekable archives because it even supports LZMA compression.


Zip is….. weird. In some undesirable ways sometimes, and desirable in others.

The index is stored at the tail end of a zip file, for instance, which is really not cool for something like tape, and doubly not cool when you don’t know in advance how big the data on the tape is.


It is a little more complicated than that: having the index at the end is great for writing tape, but sucks for reading tape. The nice thing of that design is that you can "stream" the data (read one file, write it, and just make some notes on the side for your eventual index). But you can't stream things off (you have to read the index first).

Tar is of course all-around great for tape, since every file is essentially put on by itself (with just a little bit of header about it). Again this is great for streaming things both on and off tape. But you can't do any sort of jumping around. You have to start at the beginning and then go from file to file. This gets even worse if you try to compress it (e.g. .tar.gz), as you then have to decompress every byte to get to the next one.


couple of points:

a) zip spec does not require file index to be at the end. Only the pointer to file index

b) You can do streaming decompression of zip as each zip entry has a header + filename inline with data.

c) I came up with this optimization (placing the index at the front and making file IO sequential to take advantage of OS readahead) for omni.ja files in Firefox. It's still a standard zip file, but the index lives at the front. Most zip tools can open that file unmodified (though they sometimes complain).


Oh, you learn something new every day, thanks! Also, for anyone else reading this: I found this elsewhere and it's applicable [https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT]

That is actually applicable to a project I’m working on now, any chance your implementation is in python and open source somewhere?


Yes, it's in python :)

https://github.com/humphd/mozilla-central-old/blob/9d4d9f265...

Curious, how is it applicable?


Offline high-speed data ingestion of multi-thousand-file, multi-hundred-GB data sets, followed by rapid transfer to permanent online storage (and replication fan-out, etc).

Seems convenient to allow optimization for high speed sequential reads and random read/writes at different parts of the life cycle, along with indexing, crcs, signatures, etc.

One big issue with zip storing the index at the end, of course, is that a truncated file basically loses most of its context and is generally unrecoverable even in part, which this could also help with from a durability perspective.

Storing it at the beginning (without an end pointer) opens up the possibility that you have a valid-looking archive that is actually truncated and missing a lot of data, and you won't know until you look past the end (or validate total bytes or whatever, which doesn't work well when streaming).

Storing the index at the beginning, with a pointer and file signature at the end, plus all the other format extensions, does solve all of this. Which is convenient.


Neat, let me know if you have any further questions. Would love to make this a more common thing that happens to zip files. As a result of me doing this in Firefox, zip utilities (e.g. 7zip) started complaining a lot less about this creative interpretation of the standard :)


Also limited to 4 GB unless it's ZIP64, which is limited to 16 EB and not supported by all zip implementations.


Don't tapes support some form of seeking?


What lazide is saying is that some important information is stored at the end (as in the very end of a book/movie, or the end of the line). So imagine there is a 1 TB .zip archive on the tape: the tape device has to go further (deeper) into that 1 TB file to get the last bit of vital data the user wants to see. Normally the vital bit is at the start of the file (as in the front of the line, or the beginning of the book), so the user has the information ready before the rest is transferred. But in lazide's case, the tape device has to keep reading through the entire 1 TB zip to get that last vital bit of information, which makes it slow. It's more like it cannot "skip the line" and has to go through the entire line to get there.


They do - it’s very slow, and entirely linear. It also puts wear on the tape, so if you do it a lot you’ll break the tape in a not-super-long-time.

And since you wouldn't know where the end is by reading the beginning, you'll have to keep reading until you hit the right marker, then seek backwards (and you generally can't read backwards on most tape drives, so you need to jump back and re-read).


Also relevant is pixz [1] which can do parallel LZMA/XZ decompression as well as tar file indexing.

[1] https://github.com/vasi/pixz


I've done the same thing for decades with the -v -R options to GNU tar, which cause it to print the block number of each file to stderr while creating the archive. With knowledge of the block number, you can then pipe dd to tar to extract a single file without a linear search. Or, in the case of tapes, you can use mt to seek to the block before you run tar. Of course, something like tarindexer is more convenient!
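
The recipe looks roughly like this (a sketch; file names and the block number are made up, and I'm assuming the -R numbers are 512-byte tar blocks, so verify against your tar's manual):

  # capture the listing with block numbers (tar sends it to stdout or stderr depending on where the archive goes)
  tar -cvRf backup.tar bigdir/ > blocklist.txt 2>&1
  # blocklist.txt contains lines like "block 12345: bigdir/some/file"; jump straight to that member later
  dd if=backup.tar bs=512 skip=12345 | tar -xvf - bigdir/some/file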


Also see SQLite archive files. Random access and compression. https://www.sqlite.org/sqlar.html
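
For reference, the sqlite3 shell's archive mode looks roughly like this (going from the sqlar docs; needs a reasonably recent sqlite3, and the file names are made up):

  sqlite3 project.sqlar -Acv src/ README.md   # create: members become zlib-compressed rows in a normal SQLite db
  sqlite3 project.sqlar -Atv                  # list contents
  sqlite3 project.sqlar -Axv src/main.c       # extract a single member, random-access style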


Yeah I wonder how the format compares to this. It also wasn’t obvious whether hop supports adding data with an operation that isn’t just append (a stricter requirement for tar because it was designed for magnetic tape). It wasn’t clear to me from the table. sqlar can add to an existing archive and doesn’t require uncommon software to extract if you don’t have the right tool to hand.


From the README:

"Why? Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files. (...)"

It seems the author is working on a fast JS bundler tool (https://bun.sh) and the submission links to an attempt to fix some of the challenges in processing lots of npm files quickly. (But of course could be useful beyond that.)


Came to mention Bun when I saw this hit the front page. I’ve been following Jarred’s Twitter since I heard about Bun and it’s quite impressive (albeit incomplete). To folks wondering why another bundler/what makes Bun special:

- Faster than ESBuild/SWC

- Fast build-time macros written as JSX (likely friendlier to develop than say a Babel plugin/macro). These open up a lot of possibilities that could benefit end users too, by performing more work on build/server and less client side.

- Targeting ecosystem compatibility (eg will probably support the new JSX transform, which ESBuild does not and may not in the future)

- Support for integration with frameworks, eg Next.js

- Other cool performance-focused tools like Hop and Peechy[1] (though that’s a fork of ESBuild creator’s project Kiwi)

This focus on performance is good for the JS ecosystem and for the web generally.

1: https://github.com/Jarred-Sumner/peechy


> "Why? Reading and writing lots of tiny files incurs significant syscall overhead, and (npm) packages often have lots of tiny files"

Ah, those trees of is-character packages depending on is-letter-a packages, themselves depending on is-not-number packages, each appearing several times in different versions, are probably challenging to bundle and unpack efficiently. We might want to address file systems too, so they can also handle npm efficiently (actual size on disk and speed).

Or maybe the actual fix is elsewhere.

Runs, fleeing their screen screaming in fear and frustration

(no offense to the author, I actually find the problem interesting and the work useful)


If I had asked them what they wanted, they would've said faster filesystems.


Obviously, what we really need is an OS optimized for npm.


Maybe the way to fix these issues would be to try to reduce the package bloat in the npm ecosystem? Otherwise I'm afraid a faster bundler will only result in even more packages being bundled. But I guess it's hard to get that genie back into the bottle...


Speaking of faster coreutils replacements, I highly recommend ripgrep (rg) and fd-find (fd), which are Rust-based, incompatible replacements for grep and find.

I know I'm way behind the popularity curve* (they're already in debian-stable, for crying out loud); but for the benefit of anyone even more out of the loop than myself, do check these out. The speed increase is amazing: a significant quality-of-life improvement, essentially for free**. Particularly if you're on a modern NVMe SSD and not utilizing it properly. (Incidentally, did you know the "t" in coreutils' "tar" stands for magnetic [t]ape?)

* (The now-top reply to this comment says 'ag' is even superior to 'rg', and they're probably right, but I had no clue about it! I did say I'm ignorant!)

**(Might have some cost if you're heavily invested in power-user syntax of the GNU or BSD versions, in which case incompatibility has a price).

https://github.com/BurntSushi/ripgrep

https://github.com/sharkdp/fd


When I looked, ag https://github.com/ggreer/the_silver_searcher was more featureful than ripgrep, yet it's always ripgrep that gets mentioned. :/


Which features does it have that ripgrep doesn't?


I still haven't seen anyone elaborate on what features ag has over ripgrep. Can you explain what you like about ag that ripgrep currently doesn't do?


weird right? I don't know what ripgrep has, maybe it's just the name? the fact that you call `ag` but it's called `the_silver_searcher`?


It's not weird because ripgrep has had more features than ag for a long time. Originally ripgrep didn't have multi-line support or support for fancy regex features like look-around, but it has both of those now. It also has support for automatic UTF-16 transcoding, preprocessors for searching non-text files and overall less buggy support for gitignore. (Look at ag's issue tracker.)

And then there's also the fact that ag isn't that great as a general purpose grep tool. It really falls over in terms of speed when searching, say, bigger files:

    $ time rg 'Sherlock Holmes' OpenSubtitles2018.raw.en | wc -l
    7673

    real    1.475
    user    1.115
    sys     0.356
    maxmem  12511 MB
    faults  0

    $ time ag 'Sherlock Holmes' OpenSubtitles2018.raw.en | wc -l
    7673

    real    20.276
    user    19.850
    sys     0.413
    maxmem  12508 MB
    faults  0
A lot of people like to comment and say, "well ag is fast enough for me." Well, OK, that's fine. But if you're wondering about why other people might mention ripgrep more, well, maybe it isn't just about the way you use the tools. For example, if you only ever search tiny repositories of code, then you aren't going to care that ripgrep is faster than ag. Which is perfectly reasonable, but it should also be reasonable to be aware that others might actually search bigger corpora than you.


ripgrep's regexp has lookaround? The linked docs still don't say that.

https://docs.rs/regex/1.5.4/regex/#syntax

> This crate provides a library for parsing, compiling, and executing regular expressions. Its syntax is similar to Perl-style regular expressions, but lacks a few features like look around and backreferences.

Edit: I see. It can use pcre2 but it's a build time option which of course Ubuntu has off.


> Edit: I see. It can use pcre2 but it's a build time option which of course Ubuntu has off.

That's unfortunate. Perhaps a bug report to the packagers is in order? Archlinux enabled it: https://github.com/archlinux/svntogit-community/blob/0dc033f...



From my side, I knew about both `ripgrep` and `the_silver_searcher` but I will openly admit I've lost faith in C and C++'s abilities to protect against memory safety errors.

Thus, Rust tools get priority, at least for myself. There are probably other memory-safe languages, but I haven't had the chance to give them a spin like I did with Rust. If I find out about them, I'll also prefer tools written in them, provided there's no Rust port and the alternatives are C/C++ tools.


Ag is shorthand for argentum which in turn means silver.

https://en.wikipedia.org/wiki/Silver


Code search on your own codebase is generally not a place where memory safety shines.


When it's about security, I try not to pick and choose.

Plus a memory error might lead to scary consequences (like a random script getting elevated privileges).

Finally, and last I checked, `rg` is quite featureful and I've never felt constrained by it. So for me it's a win all around.


This is kind of my point: security requires a threat model, and you don't really have one. Rust has a lot going for it, and it does hold promise in improving the security of a lot of critical software. But in this case, it's not really doing that, so it's kind of misleading to say it is meaningfully doing anything for security.


I agree that in the case of grep-ing in the terminal the odds of covering your butt well enough by using a Rust tool are super slim.

That being said, there are powerful adversaries of anonymity and the right to personal data out there -- and security in depth is what works best against them. There's no one UltimateSecuritySolution™; there are many small ones that we layer on top of each other so we don't allow even a smidgen of air to pass between the cracks.

But yeah, I am paranoid. I am gradually preparing myself to move from macOS to Linux and even though I am not a criminal and never will be, I'll still make a heroic effort to make the odds of any foul play against me practically zero. (And that's why I will start using the userland Rust tools alternatives as well.)

I'll concede that in my case the biggest impact would probably come from running Chrome in a jail, and not from using `rg` vs. `ag`. That much is true, yep.


Yeah, it's kind of a shame in this case. There's tons to talk about in the area of where Rust shines here ("makes concurrency easy", "provides easy access to fast algorithm libraries", etc.) but security is just not really one of those points.


I don't disagree. I am just happy to point out that Rust increases security (since most security vulnerabilities I see reported are buffer over/under-flows or other memory safety mishaps). Rust definitely does not solve everything in security. You can still open yourself up for an elementary replay attack if you're not careful -- like I did just a few days ago.


This is what I suspected: Rust zealotry. We are searching for a string. Language matters to the guy writing it but the user? Nah. There's no security hole here.


Who's the zealot here? The guy who doesn't want to risk and openly states so, or the guy proclaiming there's no risk, even with ample historical evidence for the opposite?

I don't accept your labeling, especially when it's so egregiously misguided.

Also, you're clearly seeing what you want to see.


ripgrep is noticeably faster than ag. I'm not sure what features you mention that ripgrep is missing, but I've been plenty happy with it for basic grepping around. I'm sure it's also partly because BurntSushi, the author of ripgrep, is reasonably active here.

Some fun benchmarks, and a fair description of how text search works, can be found in this 2016 post https://blog.burntsushi.net/ripgrep/


> ripgrep is noticeably faster than ag.

Since both are instant on every codebase I care about, I'm not so sure about that. These tools were always chiefly I/O bound, and SSDs have largely torn that down by now.

But for features: there's no lookaround in the regex.


They are only I/O bound when searching data that isn't in cache. It's often the case that you're searching, for example, a repository of code repeatedly.

> Since both are instant on every codebase I care about, not so sure about that.

It is very possible that there is no performance difference between ripgrep and ag for your use cases, but that does not mean there isn't a performance difference between ripgrep and ag. For example, in my checkout of the chromium repository:

    $ time rg -c Openbox
    testing/xvfb.py:2
    tools/metrics/histograms/enums.xml:1
    ui/base/x/x11_util.cc:1

    real    0.448
    user    2.593
    sys     2.490
    maxmem  77 MB
    faults  0

    $ time ag -c Openbox
    ui/base/x/x11_util.cc:1
    tools/metrics/histograms/enums.xml:1
    testing/xvfb.py:2

    real    2.302
    user    2.996
    sys     10.462
    maxmem  15 MB
    faults  0
> But for features: there's no lookaround in the regex.

That's not true and hasn't been true for a long time. ripgrep supports PCRE2 with the -P/--pcre2 flag. You can even put `--engine auto` in an alias or ripgreprc file and have ripgrep automatically select the regex engine based on whether you're using "fancy" features or not.
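
For anyone wanting to try that, a sketch of the setup (assumes a ripgrep build with PCRE2 available; ripgrep reads extra flags, one per line, from the file named by RIPGREP_CONFIG_PATH):

  echo '--engine=auto' >> ~/.ripgreprc
  export RIPGREP_CONFIG_PATH="$HOME/.ripgreprc"
  rg 'foo(?=bar)' src/    # the look-around pattern now transparently selects the PCRE2 engine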

In general, I also claim that ripgrep has far fewer bugs than ag. ag doesn't really get gitignore support correct, although if you only have simple gitignores, its support might be good enough.


These tools are I/O bound, but not disk bound. I often perform multiple searches in a row, and all but the first will hit the disk cache.


In an attempt to remove some of my ignorance, I tried this: (MBA, M1)

  time rg Query src > /tmp/junk.rg
  rg Query src > /tmp/junk.rg  0.01s user 0.04s system 156% cpu 0.037 total
  time pt Query src > /tmp/junk.pt
  pt Query src > /tmp/junk.pt  1.44s user 10.80s system 220% cpu 5.545 total
  time ag Query src > /tmp/junk.ag
  ag Query src > /tmp/junk.ag  4.32s user 9.82s system 204% cpu 6.918 total

  ls -l /tmp/junk*
  -rw-r--r--  1 sramam  wheel  16492080 Nov 10 16:05 /tmp/junk.ag
  -rw-r--r--  1 sramam  wheel  16492080 Nov 10 16:04 /tmp/junk.pt
  -rw-r--r--  1 sramam  wheel     27477 Nov 10 16:04 /tmp/junk.rg

This is a TypeScript module; it turns out ripgrep correctly ignores the sourcemaps, hence the improved performance. I like tools that are smart by default.

(EDIT: formatting)


I think rg isn’t ignoring sourcemaps to be fast. It tries to ignore based on .gitignore and similar files, but maybe the sourcemaps are ignored for being binary.


I thought so too. The documentation for all of them states that they honor .gitignore. In my case, the compiled code is committed.

My query was for a generic term, "src", so it likely matched the file paths embedded within the long lines of a source map. Hence the massive file-size difference.

(I did inspect the files to validate)


The main goal of respecting gitignore is generally to decrease noise in search results. But, this does of course also have the effect of improving performance in a lot of cases.


Okay, call me lazy. Not for ripgrep, I installed it early from source. However, your fdfind (fd) mention made me curious. Thought I'd give it an apt install on Ubuntu 20.04 LTS but hit a dead end. So perhaps it's in debian-stable but not in Ubuntu. Just saying, so I can feel less out of the loop ;)


Upstream says Ubuntu's package is 'fd-find' (and the executable is 'fdfind' -- both renamed from upstream's 'fd' because of a name collision with an unrelated Debian package. If you don't care about that one, you can safely alias fd="fdfind").

https://github.com/sharkdp/fd

(I've edited my first comment in response to this reply: I originally wrote "fdfind". (For a comment about regexp tools, this is a uniquely, hilariously stupid oversight. Sorry!)).


Well, even though it's late here, I should still have had enough creative energy to insert the minus in there ... yeah, thanks for that. So now I feel really behind because sure, it's packaged in Ubuntu 20.04. Thanks again.


There is also ouch, a fast multi-format compressor/decompressor written in Rust

https://github.com/ouch-org/ouch


I made this

Happy to answer any questions or feedback


If you manage the index using a B-tree, then you can perform partial updates of the B-tree by appending the minimum number of pages needed to represent the changes. At that point, you can append N additional new files to the tail of the archive, and add a new index that re-uses some of the original index.

Just an idea to check the "append" box.

See also B-trees, Shadowing, and Clones https://www.usenix.org/legacy/events/lsf07/tech/rodeh.pdf


Good idea, worried a little about impact to serialization/deserialization time though. Maybe could still store it as a flat array


Consider variable-length integers for file sizes and string lengths. When I need them, I usually implement what’s written in the MKV spec: https://www.rfc-editor.org/rfc/rfc8794.html#name-variable-si...

That’s a good way to bypass that 4GB size limit, and instead of wasting 8 bytes per number this will even save a few bytes.

However, I’m not sure how easy it is to implement in the language you’re using. I have only done that in C++ and modern C#. They both have intrinsics to emit the BSWAP instruction (or an ARM equivalent) to flip the byte order of an integer, which helps with the performance of that code.


Don't use 32-bit file times! Change it quick while you have the chance.


How far are we from public beta or Bun 1.0?

Not related to Bun or Hop, but regarding the use of Zig: I am wondering if you will do a write-up on it someday.


Something like two weeks before a public beta


Have you looked at using LZ4 compression? Supposedly it has a very fast decompressor, so it could be better than uncompressed files for your purposes.


I don't think builtin compression is necessary, but it would be better if the CLI supported it similarly to tar.

Compressing a 57.5 MB .hop archive with `gzip -c` returns an 11.7 MB file. This is a similar compression ratio as tar -> tar.gz

Compression is great for sending files over the network, but for fast reads, it might be undesirable


It’s worth benchmarking -- LZ4 can decompress faster than SSD read speeds, so it could well be a net time saving.


How about compressing the files while building the archive instead of compressing the entire archive itself? That ought to preserve reasonably fast reads. It won't give optimal compression but in practice should do alright (and of course will be smaller than no compression).
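
That's essentially the zip model: each entry is compressed on its own, so members stay individually seekable, unlike a .tar.gz where one stream covers everything. Roughly:

  zip -r archive.zip somedir/        # per-entry compression: any member can be read without touching its neighbours
  tar -czf archive.tar.gz somedir/   # whole-stream compression: better ratio, but reaching a member means decompressing everything before it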


What do you think about Valve VPK?


This seems like more of a tar problem than a zip problem, unless I'm missing something, given the lack of compression on Hop.


I think you can still zip files without compression.


Usually the -0 option (https://linux.die.net/man/1/zip), or -mx=0 for 7zip-style programs.

If you also used the mt (multithreading) option with 7zip and just stored, you could probably get a decent read rate, as I think it spins up extra threads.
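
Concretely, something like this (flag spellings per the zip and 7-Zip manuals; the multithreading switch is -mmt, and the file names are just examples):

  zip -0 -r bundle.zip somedir/            # zip, store only (no compression)
  7z a -mx=0 -mmt=on bundle.7z somedir/    # 7-Zip equivalent: store only, multithreading on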


Yes, but I don't think anyone does that.


This ability is an important aspect of “document” archives following the Open Container Format (OCF): the first file in the archive must be uncompressed, called “mimetype”, and contain the mimetype of the document encoded in US-ASCII.
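
The classic EPUB packaging recipe shows this (a sketch; META-INF/ and OEBPS/ are just the conventional directory names, and -X keeps extra file attributes out of the way):

  zip -X0 book.epub mimetype            # the mimetype member goes first, stored uncompressed
  zip -rX9 book.epub META-INF/ OEBPS/   # everything else can be deflated as usual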


File names not being normalized across platforms is sometimes beneficial. Ignoring symlinks is also sometimes beneficial. However, sometimes these are undesirable features. The same is true of solid compression, concatenatability, etc. Also, being able to efficiently update an archive means that some other things may be lacking, so there are advantages and disadvantages to it.

I dislike the command-line format of Hop; it seems to be missing many features. Also, 64-bit timestamps and 64-bit data lengths (and offsets) would be helpful, as some other people mention; I agree with them, to improve it in this way. (You could use variable-length numbers if you want to save space.)

My own opinion for the general case is that I like to have a concatenable format with separate compression (although this is not suitable for all applications). One way to allow additional features might be having an extensible set of fields, so you can include/exclude file modes, modification times, numeric or named user IDs, IBM code pages, resource forks, cryptographic hashes, multi-volumes, etc. (I also designed a compression format with an optional key frame index; this way the same format supports both solid and non-solid compression, whichever way you want, and this can work independently from the archive format being used.)

For making backups I use tar with a specific set of options (do not cross file systems, use numeric user IDs, etc); this is then piped to the program to compress it, stored in a separate partition, and then recorded on DVDs.

For some simple uses I like the Hamster archive format. However, it also limits each individual file inside to 4 GB (although the entire archive is not limited in this way), and no metadata is possible. Still, for many applications, this simplicity is very helpful, and I sometimes use it. I wrote a program to deal with these files, and it has a good number of options (which I have found useful) without being too complicated. Options that belong in external programs (such as compression) are not included, since you can use separate programs for that.


> I dislike the command-line format of Hop; it seems to be missing many features.

I agree 100%. I wrote most of it in like three hours; it's not a polished product.

> My own opinion for the general case is that I like to have a concatenable format with separate compression (although this is not suitable for all applications). One way to allow additional features might be having an extensible set of fields, so you can include/exclude file modes, modification times, numeric or named user IDs, IBM code pages, resource forks, cryptographic hashes, multi-volumes, etc. (I also designed a compression format with an optional key frame index; this way the same format supports both solid and non-solid compression, whichever way you want, and this can work independently from the archive format being used.)

I'm wary of slowing it down by adding lots of features. I think that, generally speaking, _more_ purpose-built binary formats should exist.

Engineers do this all the time with YAML files and JSON, but why not binary files?


> I'm wary of slowing it down by adding lots of features. I think that, generally speaking, _more_ purpose-built binary formats should exist.

It is a valid point, yes. However, my point was that unknown fields can easily be skipped.


> Can't be larger than 4 GB

How does that even happen in 2021?


4 GB is the maximum value of a 32-bit unsigned int. If I had to guess, that's the maximum size of the array/vector container in Zig.


Zig has weird limits in a few places (1k compile-time branches without additional configuration, the width of any integer type needs to be less than 2^16, ...). Array/vector lengths aren't the issue though; you can happily work with 64-bit lengths on a 64-bit system.

Skimming the source, there are places where the author explicitly chooses to represent lengths with 32-bit types (e.g., schema.zig/readByteArray()). I bet you could massage the code to work with larger data without many issues.


> I bet you could massage the code to work with larger data without many issues.

I'm sure the author would appreciate a pull request if it's that straight forward.


So it is that straightforward for a proof of concept (downgrade to an old version of Zig compatible with the project, patch an undefined variable bug the author introduced yesterday, s/32/64, add 4 bytes to main.zig->Header and its accesses).

Doing so makes the program slower though which might be a non-starter for a performance focused project. Plus you'd need a little more work to properly handle large archives on 32-bit and smaller systems (at least those which support >4GB files).


Offsets and lengths are stored as unsigned 32 bit integers. This saves space/memory, but means it won’t work for anything bigger than 4 GB

Maybe that’s overly conservative. Wouldn’t be hard to change


With most processors being 64-bit today, what would be the impact of using uint64 everywhere?


The answer is at the bottom of the page: all the offset/length data are stored as uint32s.

2^32 = 4294967296

Not sure if this limitation is being enforced by upstream concerns, but this is why this code in particular is limited to that size.


I used to use cdb a lot for random access of small key/value pairs (~100k entries, 10kb per entry). It's effectively a hashtable on disk.

https://cr.yp.to/cdb.html


> No random limits: cdb can handle any database up to 4 gigabytes.

I suppose this could easily be changed, right? I could not compile it though. I got this error:

  /usr/bin/ld: errno: TLS definition in /lib/x86_64-linux-gnu/libc.so.6 section .tbss mismatches non-TLS reference in cdb.a(cdb.o)
  /usr/bin/ld: /lib/x86_64-linux-gnu/libc.so.6: error adding symbols: bad value
I never came across this before. Ideas as to why or how to fix?

Nevermind, I fixed it. I added `#include <errno.h>` to `error.h`.


Related: pixz – a variant of xz that works with tar and enables (semi)efficient random access: https://github.com/vasi/pixz


Is there anything like this for RAR files? I'm looking for an alternative to unrar, as I've recently learned that its code is actually non-free.


I think maybe you misunderstood this project....

Anyway libarchive can read rar files so just use bsdtar. It can do many other archive files [0] like cpio as well. One standardized interface for everything is nice

[0] https://github.com/libarchive/libarchive#supported-formats


I wonder how this format compares to SquashFS


Is 7zip = zip ?


No. The 7zip format has much better compression than the ancient (1990s) zip format.


ZIP as a standard is no longer ancient. It has support for modern encryption and compression, including LZMA. This is sometimes referred to as the ZIPX archive, but it's part of the standard in its later revisions.


All this "no longer ancient" means is that zip is now out of the window. Its value has been lost.

The format has been perverted, and we can't trust zip as the "just works" option that will open on any platform, with any implementation, anymore.

All because somebody thought it a good idea to try to leverage the zip name's attached popularity to try and make a new format instantly popular.

Great job. This is why we can't have good things.


The same can be said of HTML. I realise that file archive formats are expected to be more stable (they are for archiving, yes?), but is it right to expect them to be forever frozen in amber? Especially when open source or free decompressors exist for every version of every system? ZIP compressed using LZMA is even supported in the last version of 7-zip compiled for MS-DOS.


>ZIP compressed using LZMA is even supported in the last version of 7-zip compiled for MS-DOS.

But then, why wouldn't you just use the 7z format?

The expectation with ZIP is (or was) that it'll unpack fine, even under CP/M.

Moving to '.zipx' extension was the right move, but it was done far too late.


Moving to the .zipx extension was definitely the right (and only) move. This is due to the fact that Microsoft's compressed folder code hasn't been updated in years, and thus it isn't safe to send out any Zip files that use any new features [1].

[1]: https://devblogs.microsoft.com/oldnewthing/20180515-00/?p=98...


Thank you Microsoft.

I hope if/when they upgrade their zip support, they only accept the new format if the filename extension is zipx.


zipx != zip -- we are speaking of file compression standards NOT branded compression software programs


No, it literally is part of the ZIP standard [0]. I updated my original comment.

ZIPX archives are simply ZIP archives that have been branded with the X for the benefit of the users who may be trying to open them with something ancient, that doesn't yet support LZMA or the other new features.

[0] https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT


When a 'hacker' (programmer) says 'zip file' they refer to the widely compatible implementation which is likely to work in any common implementation of archive handling software.

https://en.wikipedia.org/wiki/ZIP_(file_format)#Version_hist...

According to the above version table, that would conform to PKWARE 2.0, which, as with the ODF file specification, limits packing methods to STORE and DEFLATE only, as mentioned in the standardization section and ISO/IEC 21320-1.

When you say 'zip' you should be saying 'zipx' at the very least, but you may as well be comparing a tar file to a tar.zstd file in the same case.


> may as well be comparing a tar file to a tar.zstd file in the same case.

It's not exactly hard to get a new compressor on a *nix system. It's either in the repos, or not so hard to compile. On non-nixes, there's usually a binary. Tarballs have not been limited to only tar.gz for a very long time, though a lot of people do choose to be conservative in how their distributed files are compressed.

> When a 'hacker' (programmer) says 'zip file' they refer to the widely compatible implementation which is likely to work in any common implementation of archive handling software.

You don't speak for every hacker, certainly not for me. If you're implementing ZIP these days and it's not, at the very least, capable of reading ZIP64 archives (forget about newer compression methods), then you're just creating obsolete software.


7zip uses LZMA, a much more advanced and slower format than the RFC 1951 DEFLATE used in zlib/gzip/png/zip/jar/office docs/etc.


Zip can also use LZMA, assuming the recipient can interoperate with that.


(cries in winrar)


Hey! Did you register me? </nag>



