Node_modules: One character saved 50 GB of disk space (mainmatter.com)
59 points by feross on Sept 30, 2022 | 62 comments



This is an extremely clickbait title.

The 50GB figure is the disk space taken up by Node modules a developer has installed on their local machine that are duplicated between multiple Node projects/repos checked out on that machine. Even for the notoriously bloated JS ecosystem, that seems well above average - even assuming any given developer has >1 project checked out at once.

For reference, I'm a developer primarily working in JS. I use an old 2017 MB Air with the smallest disk (120GB) and have many many Node projects checked out (including random GH FOSS I've contributed to once). I don't use pnpm & I've never had disk space issues.

Don't get me wrong, pnpm is cool. I've started trying it out and will likely convert a lot of stuff. But 50GB is extreme even for Node.


I didn’t really find it clickbait. It’s not unusual for a complex React app to have 600+ MB in dependencies, and at work I have at least 50 repositories checked out at once. 50GB may be a bit of a stretch for most users but I imagine multi-gig savings to be fairly common! We’ve moved many of our projects over to pnpm.


By that metric OP would need to have almost 100 complex react projects checked out to reach the 50GB number. Seems highly unusual to me.


Not really. Some dependencies can be heavy. As soon as your projects include, let's say, Electron, you have ~200MB consumed by a single dependency on top of anything else used by your project. And if your node_modules was populated by an older npm version, you may have multiple copies of that dependency within a single project.


> 50GB is extreme even for Node

Depends entirely on how many projects you regularly work with.

pnpm is objectively better than npm in many other ways too, though. It does, by default, all the things that npm finally realized they needed to do in version 8.


No disagreement - as I said in the last line of my comment.

Only taking specific issue with the hyperbole hook in this article.


> even assuming any given developer has >1 project checked out at once.

One? Oh lord you don't want to look at my work machine then. Probably a couple dozen projects pulled down, and once added, they aren't getting removed until I have to swap PCs.


I work in code analysis automation so I have way more than a couple dozen but the point is:

1. I'm not an average case.

2. Even I've never hit 50GB

3. (not mentioned in my original post but...) if you did reach 50GB, that's likely because you've got a load of old projects lying around that need deleting. Using a new tool in order to retain that dysfunction isn't exactly a great recommendation.

Pnpm is great for other reasons: not recommending against it, just calling out the ridiculous title here.


"A couple dozen" is a pretty conservative estimate:)


Clickbaity title. I sifted through the article to find out what it was and got really annoyed by what they presented.

For those who haven't read it: it's about using `pnpm` instead of `npm`. Saved you time and possible frustration.


In case the author sees this:

> Believe it or not, this is basically how everything in the Python and Ruby worlds work

And those two paragraphs are not really true. Python and Ruby environments aren't a "these days" thing - they've been available for longer than npm has existed (virtualenv 2007, npm 2010).

The system/project split exists in the same way npm --global / npm exists. The only real difference is that you can't have different versions installed in the same environment at the same time - not the other things implied by the post.


The big problem for me is that Python does not seem to have settled on a single (or even obvious) way of dealing with this. For every project I have to figure out whether I need to run setuptools, pip, virtualenv, etc.

That said, I did have issues with the same paragraph, because PHP's Composer has been doing it the correct way since forever.


> Every project I have to figure out if I need to run setuptools, pip, virtualenv, etc.

You're mixing layers. Pip uses setuptools to install packages inside a virtualenv. You need something that manages environment/dependencies and something that installs them. Sometimes they're the same thing, like with poetry or pipenv.

How do you know what to use? Read the project readme. Same as people choosing yarn or npm.


> How do you know what to use? Read the project readme. Same as people choosing yarn or npm.

Actually no. If a repository has a package.json file I can run any package manager I want and it’ll work.


The worst part about all these node modules is the little small silly ones that do something really inane - like to just get the current year.

I said the same thing about some ruby gems years ago and thankfully that’s a little bit sane now.

I don’t use JS that often. But recently I looked at the dependencies for some library I was using and I was astonished at the literally hundreds of tiny modules that were being used.

And it gets even worse - those tiny little modules have their dependencies too.

Amazing.


Someone linked me to 1-liners[0], which is - you guessed it - a bunch of one-liners. I think it's nice to have as a reference. But a dependency? Really?

My least favorite is assign.

Not only does JavaScript feature that natively (though I suppose the library may predate widespread support for Object.assign), the 1-liner assign flips the order of the parameters!

    assign({ a: true }, { a: false }) -> { a: true }
    Object.assign({ a: true }, { a: false }) -> { a: false }
And most of them are just straight-up pointless! Like, let's introduce a dependency for decrement lol

EDIT: In looking up whether the 1-liners assign predated widespread Object.assign support, I found that their implementation - confusingly named extend[1] at first - literally used Object.assign from the very beginning. And they still chose to mess with the parameter order. For shame lol
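For the curious, the flipped wrapper amounts to something like this (a minimal sketch inferred from the behavior shown above, not the library's actual source):

    // data-last: the object being modified comes second
    const assign = (patch, obj) => Object.assign(obj, patch);

    assign({ a: true }, { a: false }); // -> { a: true }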

[0] https://github.com/1-liners/1-liners

[1] https://github.com/1-liners/1-liners/blob/7c1f8d51df4b4b3e0a...


> Like, let's introduce a dependency for decrement lol

Here's a package that basically does that: https://www.npmjs.com/package/number-precision

Not entirely unreasonable as all `number`s are floats by default in JS, but the implementation of the entire package (https://github.com/nefe/number-precision/blob/master/src/ind...) is less than 100 lines of code and actually contains a method called "minus".
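For context, the float quirk such a package papers over:

    0.1 + 0.2   // -> 0.30000000000000004
    0.3 - 0.1   // -> 0.19999999999999998, hence a "minus" helper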

The very worst JS packages I've seen have got to be is-odd and is-even. 430,796 and 202,268 downloads every week, I kid you not!


Seems like a perfect candidate for a drive by fix. If you see it in your dependency tree, fix it in the project that uses it.


It would appear this argument order is due to Principle 5: Data comes last, for consistent currying.

You might not like it, or the library (I don't write JS on purpose, so no opinion), but it's right there in the README.


I was not aware of the data-last convention. Thanks! Makes a lot of sense.

My mental model of assign was backwards, and it took me a while to comprehend why their implementation would be data-last.

So, the data is the object that is being modified? And the idea is it lets you write stuff like this more easily:

    const addStuffToObject = stuff => object => assign({ "some_stuff": stuff }, object);
    const addWeirdStuff = addStuffToObject("weird stuff");
    const weirdObj = addWeirdStuff({ "foo": "bar" });


Yes, and this sort of application (pun intended) is what the one-liner library is for.

If you have a function isEqualTo(a, b), you can curry(isEqualTo, 5) and filter with it.

`assign` could be used in the same way for assigning/overwriting the same field in an array of Objects, and so on.
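Concretely, something like this (with an illustrative partial-application helper matching the shape above; not necessarily the library's exact API):

    const isEqualTo = (a, b) => a === b;
    const curry = (f, a) => (b) => f(a, b); // illustrative helper

    [3, 5, 7, 5].filter(curry(isEqualTo, 5)); // -> [5, 5]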


It's got 2,000-3,000 installs/week, so not many people are using it.

https://www.npmjs.com/package/1-liners


A project I've been working on has roughly 40 dependencies. If you run `npm install` it'll pull about 1050 npm packages.

Change one minor version number and everything breaks. Forget one dependency and npm will not tell you that a dependency is missing; instead it'll complain that Steam has a broken link in the home directory (this has been a known open issue for years, and the only two solutions are to uninstall Steam or to use a Docker container).

Needless to say I am not a fan of large web projects.


I don't understand why you're getting downvoted. I'm a JS developer (and framework developer) and what you describe is a serious problem. It's great that there are so many problems already solved, but some stuff is just a one- or two-liner that should be in your own app's /lib, not a dependency.

This is one reason I like Deno's idea of having a standard library (I just wish Ryan would have proposed that for Node directly instead of creating a brand new runtime).


The solution to all this dependency mess is for NPM to make a standard library. Yeah, that sounds crazy and weird. But they are in the best place to make a unified standard library for JavaScript. This would bypass all the junk transitive dependencies and let more libraries rely on a centralised but standard library.


Anyone who's run a CI platform for more than a few devs and NodeJS projects quickly bumps into inode problems unless they thought about build server filesystems in advance. Very quickly you end up with hundreds of thousands of minuscule files filling up the disk.


Imagine how easy it would be to exploit one of those too.


Right - every single dependency adds potentially another human maintainer who, if bribed or threatened, could release an update that exploits your project.


Is there any need to have so many dependencies in the first place? Seems like slovenly development practices are to blame.


Kind of yes... not all dependencies are direct for the app; a lot are just dev dependencies. Just getting eslint/prettier to warn, auto-format, and clean up my code when I save a file takes 13 direct dev dependencies in my project [0].

[0] https://github.com/lookfirst/mui-rff/blob/master/package.jso...


My projects tend to have vastly more dev dependencies as well. Especially with something like create-react-app. It's a pretty sad ecosystem.


The JavaScript development ecosystem is completely insane. Why do people do this to themselves.


I often hear “why would people do this to themselves” and I look around and wonder what they’re talking about given I’m perfectly happy using it daily.

I’d rather use node with all its ugly bits than Python or .NET or C++ with all their ugly bits. I’d rather use Rust over any of them but rarely is it the right tool for the job with what I do.


JS runs everywhere and is the most popular language. The package managers are maturing, as is the rest of the tooling. C# didn't have NuGet at first, Python didn't have pip, etc.; these tools evolve as part of the ecosystem over time. NuGet didn't reuse package references from a central cache until recently, and had pretty much the same issue as this. If you think this issue means JS isn't worth using, that's an interesting line to draw in the sand, but it doesn't mean people who use JS are somehow suffering because of this, or that the ecosystem is insane.


Because (IMO)

1. Looks like it attracts a lot of talent these days

2. It's basically so well designed that its smooth learning curve has enabled generations of people to develop cs skills/get jobs/build stuff independently

3. While this and the previous points don't mean the JS universe is perfect, I have seen far worse stuff haunting the industry in the past: PHP and Java alone have produced thousands of so-called professionals who still have trouble distinguishing db from backend from frontend, and have no remote idea of what dependency management is whatsoever.


A different / additional thing you can do on a few systems is compress node_modules specifically. afsctool on macOS, chattr +c on btrfs, folder properties on Windows - you don't have to have compression enabled on the whole drive to use that.

Since node_modules is mostly text, this has amazing results and can be applied to the deduplicated pnpm store as well.


I am just amazed that this is still an issue for JS when it has basically been fixed for other languages. For such a widely used language with such a fanatical user base claiming "it's the best", I would expect this to no longer be an issue, nor all the security issues that come with it.


It's not really "fixed" for other languages in general. The support in Node for multiple different versions of transitive dependencies is actually quite nice. In Python, for example, you simply can't have multiple versions of transitive dependencies, and this can lead to issues with commonly-used utility packages. I've seen issues like this come up with utility libraries like six or boto and its variants. Likewise with larger libraries like numpy.
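By way of illustration, the npm-style layout that makes this possible (package names and versions illustrative):

    node_modules/
      lodash/          <- v4, hoisted, shared by most packages
      some-lib/
        node_modules/
          lodash/      <- v3, nested copy just for some-lib

Python's flat site-packages has no equivalent of the nested copy, which is where the conflicts come from.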

As someone who's worked pretty heavily in both ecosystems – it's definitely not something I think about every day on the Python side, but Python dependency conflicts are very annoying... while in Node they're mostly not a big deal except in a small set of cases where peer dependencies show up.


I agree it's not entirely fixed in any language yet. As you mentioned, Python hasn't nailed it either, even with virtual envs plus virtual env managers (pipenv, for example).

In Java it's basically a non-existent problem. You CAN have dependency conflicts, yes; nonetheless dependency management is simple, and you keep everything in a local central repo when using Maven, which also provides a very nice dependency tree plus tools for filtering it - which you can of course also achieve with grep - for even easier dependency conflict debugging.

Also, using features like dependencyManagement in Maven allows you to replace all usages of a library across your entire application "at your own risk", which simplifies addressing security vulnerabilities.


> while in Node they're mostly not a big deal

Until you have to figure out that the reason something doesn’t work is that dependency v1 is storing the data that dependency v2 is trying to use, and it complains about missing data that you are sure is there.

I very much enjoy having those issues up front, instead of at runtime.


The limitation of this approach is that all projects must reside on the same device as the pnpm store. There are no cross-device symlinks.


Which becomes a problem using containers in a monorepo.


How does that make it problematic?


Can’t hard link across mount points, so either copy to the container or install inside the container (both negate pnpm’s benefit).

Our workaround is to put the global cache in the project so we can mount it alongside the code in the same mountpoint.
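For anyone wanting to replicate that, a sketch of the idea via pnpm's store-dir setting (path is illustrative):

    # .npmrc in the repo root: keep the store on the same
    # mountpoint as the code so hard links keep working
    store-dir=.pnpm-store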


Thanks, I wouldn’t have thought of sharing npm packages like that.

I would argue building different containers from a shared node_modules is inherently dangerous anyway. Sounds like your "workaround" is in fact pretty much the optimal setup for quickly performing multiple similar builds.


> There is no cross-device symlinks

Do you mean hardlinks?


You can also use yarn berry (the codename for version 2 and onward). It has a Plug'n'Play install strategy instead of node_modules, but you can also use it with a pnpm resolver if PnP breaks stuff, which it sometimes does as many libraries assume node_modules exists.


This is where a file system with compression support becomes useful, such as ZFS or APFS.


Or btrfs, which can also deduplicate files.


Fwiw, it looks like you can use pnpm as nodeLinker with yarn berry. I’m a fan of yarn’s plugin support and the extensibility that provides, so I’ll likely be trying that with pnpm as the linker and see how that goes.
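If I'm reading the docs right, it's a one-line switch (a sketch, assuming yarn 3.x's nodeLinker setting):

    # .yarnrc.yml
    nodeLinker: pnpm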


Maybe I’m out of date, but I thought hardlinks were generally a bad idea.


Why, exactly? I know Windows wasn't built to withstand hardlink-based attacks so it puts them behind admin permissions by default, but this seems like an excellent use case for hard links to me.

That said, soft links should work just as well.


Hard links are not too bad of an idea as long as they are done on files, though they do have some downsides. There are some very good reasons why hard links on directories are not easy to enable. If something is scanning things recursively, it would end up in an infinite loop unless it keeps track of (disk, inode) tuples for parent directories, especially if it follows symlinks. IIRC they are explicitly disallowed in Linux; you would need to modify the kernel source to allow them.

I have no clue why they are saying that they are hard linking directories, I sure hope they are only doing it on file level.
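The (disk, inode) bookkeeping is cheap to sketch in Node terms (illustrative, not production code - broken symlinks and permissions are ignored):

    const fs = require('fs');
    const path = require('path');

    function walk(dir, seen = new Set()) {
      const { dev, ino } = fs.statSync(dir); // follows links
      const key = `${dev}:${ino}`;
      if (seen.has(key)) return;             // loop detected, bail
      seen.add(key);
      for (const name of fs.readdirSync(dir)) {
        const p = path.join(dir, name);
        if (fs.statSync(p).isDirectory()) walk(p, seen);
      }
    }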


I've caused infinite scanning bugs to appear by accident with just soft links, though. I know soft links and hard links are processed differently at the lowest levels of the I/O stack, but I think there are few languages and runtimes where that's actually the default.

In most programming languages I've used on Linux, symlink expansion is on by default, creating all the problems you can think of.

Yes, you can use whatever API call readlink relies on to prevent loops, but you can also keep a set of inode numbers on a file system and stop processing on duplicates. In both cases you need to do all kinds of workarounds.

The best argument I've heard is that hard linked directories break the acyclic graph property of the file system but even there I'm not so sure if that's really a problem in an environment where most tools recurse into soft links anyway.

I suppose the C folks that like to do all the hard things themselves would get annoyed by having to add another check?

I don't think the kernel forbids directory hard links per se; NTFS has junction points which are somewhere between hard links and soft links for directories, and NTFS support made it into Linux. I don't know how the kernel exposes directory junctions in Linux, but they're different from soft links (in that they'll be processed in the server for SMB file servers, as opposed to symlinks which are resolved on the client).


Maven has been doing this for more than a decade now.


> I'd rather have two versions of the same dependency in a project, than not be able to use the dependency at all.

This is where a poor choice was made and everything went wrong.


> npm dedupe


tl;dr: pnpm


Sadly my team had to go back to npm. pnpm has issues in resolving dependencies. Also the build box didn’t have it. There were enough quirks to make it hard to write scripts that ran smoothly between npm and pnpm.


Thanks! I thought it was because of a bug or some interesting behavior.


Good tl;dr !


The TL;DR is: use pnpm instead of npm.



