Honey, I shrunk the NPM package

terrelln · on Oct 3, 2023

In parts (1) and (2) comparing the default setting of Zstd (level 3) against the default setting of Brotli (level 11) is a bit misleading. It shows Brotli compressing ~30% better than Zstd, but Brotli's default level is >100x slower than Zstd's default level. Zstd level 3 is expected to run at hundreds of MB/s, and Brotli level 11 is expected to run at ~2 MB/s. The compression speed is only 30% slower because that benchmark includes the time to tar the directory, which is likely more expensive than the compression itself. As @sfink already suggested, just running lzbench on the npm-9.7.1.tar would be a better benchmark.

In part (3), because its running only on lib/npm.js which is 13KB, you are getting skewed results which aren't directly applicable to the compression of npm-9.7.1.tar. Brotli excels at compressing small Javascript files, as this is where its dictionary provides the most benefit. The benefits of the dictionary for a large tar file will be negligible.

However, in the npm-9.7.1.tar scenario we still expect Brotli level 11 to produce slightly smaller files than Zstd level 19. Likely ~5% smaller. But we do expect Zstd to provide significantly faster decompression speed.

tuatoru · on Oct 3, 2023

> The compression speed is only 30% slower because that benchmark includes the time to tar the directory, which is likely more expensive than the compression itself.

Amdahl's Law in action.

prollings · on Oct 3, 2023

Amdahl's law states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".

For anyone else who didn't know.

ornornor · on Oct 4, 2023

Thanks

terrelln · on Oct 3, 2023

That said, this is an interesting article, and I love to see people experimenting with modern compression algorithms for package management! There are a lot of easy wins in this space.

unilynx · on Oct 4, 2023

I think the article points out that there are no easy wins at all. Sure, you could try to update the compression algorithm any time it gives a small, easy win (and as others have pointed out, his approach to comparing algorithms is debatable)

The interesting part of the article to me is: okay, now that we have a better compression option, how do we deploy it to an existing ecosystem? And suddenly we're looking at a 4 year migration path!

No easy win at all.

sfink · on Oct 3, 2023

> The first [caveat] is that lzbench isn’t able to compress an entire directory like tar , so I opted to use lib/npm.js for this test.

As opposed to... just using the npm-9.7.1.tar file that the other tests were just using (sorta barely internally to tar, but I don't think tar does any fancy streaming or anything if you're passing something via --use-compress-program, certainly nothing that would skew the results more than replacing all of npm.tar with just npm.js.)

In my local install, npm.tar is 25MB. npm.js is 16KB. It may not change the final outcome, but the data in the article do not support the conclusion. I would strongly suggest tarring up the npm directory and rerunning lzbench.

kevincox · on Oct 4, 2023

I think the author is unaware that `tar --compress-program foo` is equivalent to `tar | foo` (other than handling output writing and things.

The analysis would have been much simpler, clearer and more focused on the compressor if they just benchmarked the compression with an already-generated tar archive.

(Notably zip is different where files are compressed individually in the archive)

bhouston · on Oct 3, 2023

What to cut total NPM traffic by 50% or more overall? Easy:

Create a shared Brotli dictionary (or zstandard or whatever) based on the top NPM packages by download bandwidth and then have all npm packages compressed using it.

I think this can be done server side by npmjs.org, where NPM packages are recompressed in this fashion after upload using the shared dictionary, thus it is an optional feature and fully backwards compatible.

Riffing on this idea because of this new chrome feature, which does this in a flexible fashion: https://chromestatus.com/feature/5124977788977152

EDIT: This recompression of packages may be insecure as the digital signature of the package no longer aligns, but then the trick is to sign the contents of the package rather than package itself.

shanemhansen · on Oct 3, 2023

It would be a fun test to run. But I'm not encouraged by the fact that the existing brotli dictionary already contains a bunch of javascript specific stuff:

https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e

brotli literally already has a tokens for function/return/throw/indexOf(/.match/.length/etc.

Also verify after decompress is not without tradeoffs. On one hand we have folks like github who can't change the version of zlib because people rely on identical .tar.gz. https://news.ycombinator.com/item?id=34586917

On the other hand we have a whole lot of iffy stuff you can do to make programs decompressing content use large amounts of resources https://en.wikipedia.org/wiki/Zip_bomb which makes "decompress this potentially untrusted file so that I can validate it's safe to use" hard.

bhouston · on Oct 4, 2023

> brotli literally already has a tokens for function/return/throw/indexOf(/.match/.length/etc.

Yeah, I see it already has a lot of JavaScript, HTML and CSS content. Interesting. I didn't realize it had an existing web-focused token library, and figured it was more like zstd, 7z and zlib, which I believe have none.

I would love to do the experiment if I had time. I wonder what is the laziest way to do it?

anonacct37 · on Oct 4, 2023

I've done that experiment with zstd before.

https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md...

Not sure about brotli though.

Aeolun · on Oct 3, 2023

> On one hand we have folks like github who can't change the version of zlib because people rely on identical .tar.gz.

People rely on the same checksums for the same files. They’re perfectly fine with changing the method for new files.

orf · on Oct 4, 2023

The files (repo tarballs) are generated on demand, there are no “new” or “old” files.

I guess you could base it on the commit time, but this is user supplied.

hoten · on Oct 3, 2023

Signing the contents means you must unzip it to confirm its validity, which exposes users to the latent security bugs of their unarchival program (or a zip bomb).

bhouston · on Oct 3, 2023

If recompressing it on the server breaks the chain of trust of the client provided zip, then one would need to upgrade npm on the creator's side of things so that they create both uploads themselves (legacy zip, and the shared-dictionary brotli package) and sign both themselves.

Wait, does npm have digital signatures at all? I sort of assumed it did, but does it really?

IshKebab · on Oct 3, 2023

50%? I doubt that, unless your dictionary is literally the top NPM packages, in which case you're not saving much bandwidth by sending them all in advance...

Maybe a small dictionary would save you 20% though.

bhouston · on Oct 3, 2023

It is hard to judge ahead of time and yeah it depends on the dictionary size. It is hard to argue concretely about hypotheticals.

It would be a fun experiment to figure out the compression ratio of say the top 1000 packages given a shared brotli dictionary of X size. Just keep increases the dictionary size until you see diminishing returns.

My estimate is based on NPM packages contain a bunch of super stereotypical files that if they are used to create the dictionary likely result in amazing compression ratios: package.json, package-lock.json, CSS, Tailwind, Bootstrap .gitignore, LICENSE, .eslint, README, React/Vue/Angular code...

Cthulhu_ · on Oct 3, 2023

I don't know how compression works; would a common library for Javascript save bandwidth? I'm just thinking that JS is a lot of repeated characters and keywords, and any browser or npm or node installation could be shipped with a (partial) dictionary.

Although the bandwidth savings would only be the size of that dictionary.

Still, I can't help but think there's ways to improve size & bandwidth usage by a lot, even besides using a different compression algorithm; a non-HTTP transfer method, for example.

infogulch · on Oct 3, 2023

The bandwidth savings would be the size of the dictionary per downloaded package.

corbezzoli · on Oct 4, 2023

You say that as packages never receive updates. Unless you’re doing a lot of installs every day (let’s say on CI), you’re suggesting it’s better to forcibly download a hundred megabytes of data daily just to potentially save a couple of megabytes once a week.

There’s a smarter mechanism than that: npm already caches packages locally twice, both in node_modules and in a global location.

In reality you’re probably never going to install every dependency from scratch every day, so you’re already using either cache on every install.

What I’d like to see instead is a pre-resolution of packages done by the registry. I have a list of 30 dependencies, please resolve the tree and send it all over instead of forcing me to do a waterfall of fetches.

bhouston · on Oct 4, 2023

I was saying a dictionary trained on the top packages but not the size of them. I believe you can specify the size of the dictionary right? So you want to figure out when you have diminishing returns for average users. I figure an effective shared dictionary is anywhere from a few 100kb to a few MB. Where between those is optimal I don’t know.

I am definitely not saying download 100MB to save a few MB per user. That is just dumb.

I find NPM in the last year or so is faster to resolve that yarn. Although I think bun is even faster and may have better global caching.

kevincox · on Oct 4, 2023

If you are going to create a custom dictionary you may as well use zstd.

IIRC the main "innovation" that Brotli brings to compression is a default dictionary trained on web content. So if you are replacing this you best use the better algorithm.

hinkley · on Oct 3, 2023

I thought this may be going another way.

I make a periodic practice of searching our node_modules folder for files that shouldn't be there and reporting bugs against the offending projects.

Usually that's been pretty effective, and now the total cruft is around a megabyte whereas before it was somewhere north of 50MB all told. (Conditions apply).

coverage reports, test results, build detritus, etc.

The one I'm still debating, because it's becoming a serious problem for a couple of our libraries: should the tests be included in .npmignore or kept along with the library? I'm not sure what the right answer is there. Test sizes especially including fixtures can creep up quite a lot over time. I know what I'd like it to be, but I'm not sure I can win that argument with a bunch of maintainers on different projects.

diggan · on Oct 3, 2023

> should the tests be included in .npmignore or kept along with the library?

What are the reasons it would be a good idea to include the tests with the release/distribution of a package?

Seems like something you don't care about when you're just using it as a library, unless you want to modify something in it, but then you'll clone the library straight from a repository anyways, which includes the tests.

mwilliamson · on Oct 3, 2023

For packages where I don't include tests, I've had at least one downstream distro maintainer request that I include tests, since at least some of them treat npm or PyPI or whatever as the source of releases.

For packages where I do include tests, I've had at least one user request that I remove tests so that the footprint of the Docker image they're building is smaller.

Both are entirely reasonable requests, but package repositories don't really provide a good way of accommodating both at the same time, for instance, by allowing a separate upload of the dev gubbins such as tests.

j1elo · on Oct 3, 2023

As a software maintainer myself, I believe the downstream distro maintainer is the one being wrong there.

You have a software project, with a build process, and the "output" or final product of that project is the library that gets uploaded to NPM.

If they are packaging a software library, they should do it from the project's repository, not from one of its output artifacts.

They would probably reject a request if someone who was downstream of their work decided to repackage their stuff and asked them to include tests and other superfluous content on their packages.

mwilliamson · on Oct 3, 2023

I don't know about other distros, but Debian makes it extremely easy to download both the binary package and the source package. For instance, on the page for the jq package [1], you can download the source using the links down the right-hand side, which includes the full test suite. The key, in my view, is that Debian has a nice way to associate both the final output artefact and the source (both the original source and their patches) with a specific version.

[1] https://packages.debian.org/bookworm/jq

j1elo · on Oct 3, 2023

The way it works for Debian packaging is that they usually have their own copy of the project's source code (what they call upstream). So the packaging process does start from the actual, original source code repo of the upstream project being packaged. This code is kept in a "upstream" branch, to which Debian-specific files and patches are added, usually in a different branch named "debian". For new versions, the "upstream" branch is updated with new code from the project.

All of which, if you ask me, is the correct way of doing any kind of packaging. Following that, IMO the same should be done for JavaScript libraries: the packaging should be done by cloning the project repo and adding whatever packager-specific files in there.

Notice in your link how in the upper part it says: [ Source: jq ], where "jq" is a link to the source description. In that page, the right hand section will now contain a link to the Debian copy repository where packaging is done:

https://salsa.debian.org/debian/jq

You can clone and explore the branches of that repo.

(Maybe you are a Debian maintainer, in any case I'm writing this for whoever passes by and is curious about how I think JS or whatever else should be packaged if done properly)

Cthulhu_ · on Oct 3, 2023

The downstream distro maintainer is in the wrong IMO; if they want the source code, they can get the source code off of e.g. github and roll their own release.

That said, in old Java dependency management (i.e. Maven), you could often find a source file and a docs file alongside a compiled / binary release, so that you get the choice.

But this can also be done with NPM libs already; the package.json shipped in the distribution contains metadata, including the repository URL, which can be used to get the source.

corbezzoli · on Oct 4, 2023

I had that discussion with someone before. They were under the impression that the registry is rally permanent while GitHub could go away. Aside from the platform betting, they really thought their 2012 package (and particularly its tests) would be useful in 2052. But who am I to say otherwise.

The recent HN discussion about “that one npm maintainer” confirms please hold onto the most painful ideas.

mook · on Oct 4, 2023

That's a really interesting view, given that… GitHub owns npm. https://github.blog/2020-03-16-npm-is-joining-github/

hinkley · on Oct 3, 2023

My first thought was, "include a dev and prod version of the package" but that creates a ton of regression surface area for a feature that most people can't be bothered with anyway.

It's easy enough to have things work in pre-prod and fail in prod without running slightly different code between them.

I think there is a solution to this, but it's going to require that we change to something a lot more beefy than semver to define interface compatibility. Semver is a gentlemen's agreement to adhere to the Liskov Substitution Principle. We are none of us gentlemen, least of all when considered together.

simonw · on Oct 3, 2023

The most interesting thing about zstandard is how easy it makes it to train custom compression dictionaries, which can provide massive improvements against certain types of data.

I'd like to see how well a custom dictionary trained against a few hundred npm packages could work against arbitrary extra npm packages. My hunch is that there are a lot of patterns - both in JavaScript code and in the JSON and README conventions used in those packages - that could help achieve much better compression.

mbb70 · on Oct 3, 2023

We had billions of Protobufs to store in Cassandra as byte blobs, using a zstd dictionary dramatically reduced storage size and improved latency over the built in compression. The complexity overhead of managing these dictionaries and making sure the client always has access to the right dictionary to decompress was non-trivial but well worth it.

We looked at Brotli as well but decompression speed at acceptable ratio was the most important factor for us, that plus the far superior docs and evangelism sealed the deal for zstd.

metadat · on Oct 3, 2023

What kind of additional gains (% wise) did you see with custom dictionaries compared to vanilla zstd?

bhouston · on Oct 3, 2023

Depends on how much shared entropy your data has. Could you test this by trying to compress all your content into one stream (shared dictionary) compared to compressing it into separate streams (no shared dictionary)?

bhouston · on Oct 3, 2023

We think similarly. :) I posted the same idea while you were typing yours: https://news.ycombinator.com/item?id=37755005

erikpukinskis · on Oct 4, 2023

The one simple thing I do that cuts my NPM package size massively:

Don’t push devDependencies.

I honestly don’t understand why devDependencies come with the package. Most packages don’t distribute their source files, so what use are the dev dependencies?

If you want to do dev, clone the got repo and get everything. To me `npm publish` is for distributing the production library, plus source maps and types—that’s it.

kimi · on Oct 3, 2023

Who cares about how large a NPM library is? the issue is what you send to your clients, not how monstrous is what you download once on the integration server.

justinsaccount · on Oct 3, 2023

If anyone really cared, they would fix the broken ecosystem that encourages people to redownload the same packages millions of times.

Picked a popular package at random, webpack. npm says version 5.88.2 released 3 months ago has 5,992,398 downloads in the last 7 days.

I don't know how anyone can look at that see it as anything other than a massive failure.

Fast connections and free bandwidth have caused people to completely ignore the fact that every time some CI pipeline runs, npm goes off and downloads 100MB of dependencies. Dependencies that haven't changed since the pipeline last ran 30 seconds ago.

npm could fix this by aggressively rate limiting clients that have already downloaded the same package multiple times, but I guess as long as the vc funding is paying the bandwidth bill it's not a problem, and those "millions of downloads" make you look good.

CharlieDigital · on Oct 3, 2023

    > Fast connections and free bandwidth have caused people to completely ignore the fact that every time some CI pipeline runs, npm goes off and downloads 100MB of dependencies. Dependencies that haven't changed since the pipeline last ran 30 seconds ago.

Maybe it's just me, but I've always thought it was well known best practice to cache your deps[0].

I'm pretty certain that this can be achieved with most CI/CD tools.

https://docs.github.com/en/actions/using-workflows/caching-d...

justinsaccount · on Oct 3, 2023

I've seen a lot of pipelines that simply don't bother. Or maybe they tried, but the caching isn't working and since the build works in the end no one notices an extra 30 seconds.

paulddraper · on Oct 3, 2023

Elaborate, please.

The vast majority of those are from CI on ephemeral cloud instances.

Do you think CI should not be run?

Or CI should be run, but not on ephemeral cloud instances?

Or CI should be run on ephemeral cloud instances, but the packages should be cached using a separate service from npmjs.com (e.g. S3)? If so, what makes this other service preferable?

justinsaccount · on Oct 3, 2023

> Or CI should be run on ephemeral cloud instances, but the packages should be cached using a separate service from npmjs.com (e.g. S3)? If so, what makes this other service preferable?

Yes, you should vendor external dependencies.

A build should ideally not require internet access to complete.

People learned nothing from leftpad.

paulddraper · on Oct 3, 2023

> A build should ideally not require internet access to complete.

You've got a non-internet CI with non-internet source code repository with non-internet vendored dependencies??

Technically possible, but I call BS.

justinsaccount · on Oct 3, 2023

> You've got a non-internet CI with non-internet source code repository with non-internet vendored dependencies??

Vendored dependencies are pulled down from an internal s3 bucket (and cached locally) before the build starts, the rest of the build runs with no internet access.

Look how nix does this, it's basically the same.

mixmastamyk · on Oct 4, 2023

Woosh.

Karellen · on Oct 3, 2023

> In real terms that means a saving of around 1MB. That doesn’t sound like much, but at 4 million weekly downloads, that would save 4TB of bandwidth per week.

Yeah, who cares about 4TB/week? What is this, the '90s?!

kimi · on Oct 4, 2023

A dedicated 10 Gigabit internet link at a data center will cost you around $1,000 per month from any of a number of providers.

A 10 gigabit link can transmit 3240 Terabytes of data in one month.

kimi · on Oct 3, 2023

That's 6 Mbytes per second.

taway1237 · on Oct 3, 2023

For a single package. And npm hosts a few more than that.

They shrinked a package size by almost 40%. No way this isn't significant improvement. Hell, at their scale 4% improvement is big.

dpcx · on Oct 3, 2023

That's a lot of data that you're not paying for...

bhouston · on Oct 3, 2023

> Who cares about how large a NPM library is?

NPMJS.org probably cares. And having smaller downloads for everyone else would speed things up a bit.

BTW I saw this recently about shared brotli dictionaries for delivering JS, which is nice: https://chromestatus.com/feature/5124977788977152

fabiospampinato · on Oct 3, 2023

I mean they care to some degree, if they _really_ cared presumably everything would be compressed with zstd and served to the more modern npm-cli installations, and npm-cli would refuse to upload binaries that are not explicitly allowlisted.

Too · on Oct 4, 2023

> download once on the integration server.

Have you seen most CI systems in the wild? Majority of projects just wipe everything and do a fresh install, on every single run.

Maybe it will be counter-productive to shrink the packages, since it would encourage that behavior of not caring about local cache even more.

hoten · on Oct 3, 2023

Lots. See the "npm node_modules blackhole" meme. For one practical reason, confirming the quality of the code in node_modules is so impossible a task it just isn't even attempted. So people are shipping code they have 0 knowledge of. For a paranoia-fueled reason, even devDep packages run the risk of being harmful (malware in the build process).

kimi · on Oct 3, 2023

Of all the reasons why the NPM ecosystem sucks, the compression of archives is but a side note.

bdcravens · on Oct 3, 2023

This assumes third-party code is never served directly from the node_modules folder (or via proxies like unpkg)

yellow_lead · on Oct 3, 2023

As tech improves and new libraries are created, limitations of the past are sometimes removed. From this post, I'm curious about other 'low hanging fruit' (if I may call it that) in open source.

*I don't mean it's a simple endeavor but at least it's simple to describe.

Aldo_MX · on Oct 3, 2023

Got an unrelated question: Why do we need to decompress the files?

ianlevesque · on Oct 4, 2023

It's a fair question. If you're using a zipfile you can random-access the individual files in the zip easily enough. The node_modules folder is famously heavy with thousands of tiny files, so there's likely to be performance wins to be had. Old, less fashionable programming languages, like Java, of course already do all this with .jar files.

kevincox · on Oct 4, 2023

To make your point explicit:

tar archives don't support random access. So it would be slow to scan for each file in the archive as you load it.

3np · on Oct 4, 2023

I guess support would also have to be added to yarn and pnpm at the very least before npmjs.org can consider enabling it for new public uploads.

pcblues · on Oct 4, 2023

Hopefully the testing avoided disk caching artifacts.

wolfi1 · on Oct 4, 2023

ok, I totally misunderstood the headline, I assumed that it would show how to minimize dependencies

onedognight · on Oct 3, 2023

Yann Collet went from project management, to programming his calculator, to creating zstd. This podcast detailing his journey in quite inspiring.

From Project Management to Data Compression innovator: https://podcasts.apple.com/us/podcast/corecursive-coding-sto...