Unix philosophy without left-pad, Part 2: Minimizing dependencies (raku-advent.blog)
112 points by lizmat on Dec 11, 2021 | 67 comments

People keep saying "pin versions" as a defense against supply chain attacks. That's all well and good until something widely used like log4j has a remote-code-execution exploit and then it all comes crashing down.

Trusting any single author is a single point of failure — eventually the author of one of the packages you depend on will get compromised and an attacker will publish a malicious package. To combat this, you need package validation by multiple independent identities. The classic ways to do this are to have multiple people sign a package using PGP, or to rely on vendor endorsement — but the theory behind it is just multi-factor authentication.

A second useful step is to connect releases to an open source commit history. This makes it much more feasible for independent authorities to review the differences between release versions as a sequence of logical, coherent commits. The ideal is to have multiple committers on a project sign a release package, after having followed the commit history as it played out.

If a package cannot be connected to an auditable history — because a source package is grossly transformed from what's in a repo, because there's no public repo, because the history is just one big commit or similarly useless, or because a binary package is not created using a reproducible build — then it is harder to have confidence in it.

> People keep saying "pin versions" as a defense against supply chain attacks. That's all well and good until something widely used like log4j has a remote-code-execution exploit and then it all comes crashing down.

And it doesn't come crashing down for those who didn't pin log4j? They're somehow immune to the 0-day?

Or do you mean that the next time they build from scratch they'll have their arse saved by a security update they didn't even bother tracking?

In a nutshell, what I favor is automatically accepting upstream security releases when those releases can be validated by multiple identities. Probabilistically, this shortens but does not eliminate the window when you are vulnerable.

Unfortunately, as far as I know, typical primary package management systems trust the single author/uploader of a package and don't provide support for multi-authority validation, and so are vulnerable as soon as a single author's credentials get compromised. (If that's wrong, and npm, PyPI, crates.io, Maven, or anybody else supports multi-authority validation, I would love to hear about it.)

I came to these conclusions having been deeply involved with release policy at the ASF. (I redrafted the official release policy documents in 2015.) The ASF, notably, requires at least one PGP signature for every release, but some projects have a tradition of multiple signatories — including the Apache HTTPD project, which is where the tradition arose.

> Trusting any single author is a single point of failure — eventually the author of one of the packages you depend on will get compromised and an attacker will publish a malicious package.

Thanks, this is exactly the sort of thing I had in mind when writing the "Making _ trustworthy"[0] section and exactly the sort of conversation I was hoping my post would prompt. One benefit I'm hoping to get from keeping the `_` sub-packages as simple/self-contained as possible is that that sort of supply-chain attack will be easier to spot (e.g., with a 0-dependency file, you couldn't use an attack like the event-stream incident, where a dependency was swapped out for a malicious copy – the malicious code would have to be in the repo itself).

Of course "easier to spot" ≠ "won't happen", which is where your other point comes in:

> To combat this, you need package validation by multiple independent identities. The classic ways to do this are to have multiple people sign a package using PGP

Someone else made a similar point in an r/programminglanguages comment[1] in response to part 1:

> One thing I'd like to see package managers adapt, though, is quorums for publishing. A simple majority quorum of amongst 3+ people would naturally make hacking much more difficult

Do you happen to know any details about how something like that could be put into practice? I agree that it seems like something that'd be worth investing in as an ecosystem, and I would be interested in any info/thoughts others care to share.

[0]: https://raku-advent.blog/2021/12/11/unix_philosophy_without_...

[1]: https://www.reddit.com/r/ProgrammingLanguages/comments/raau0...

The machinery for quorums could be built on top of PGP. Multiple people can sign a package, and the trustworthiness of their endorsements can be evaluated based on a web of trust — including by downstream users, so you don't actually have to rely on the robustness of the package manager's authentication at the moment of upload.

Because PGP is not universally loved, I think it's important to reiterate that the fundamental theory behind quorums is just multi-factor auth. But PGP does solve some of the hardest parts.

From there it's a matter of defining which authorities to trust, and then gating acceptance of a release once a quorum is reached (however that quorum is defined).
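
As a rough illustration of that gating step, here is a minimal Python sketch. It assumes signature verification has already been done by some external tool (PGP or otherwise) and has produced a list of key IDs whose signatures checked out. The key IDs, the `TRUSTED_SIGNERS` set, the `quorum_reached` function, and the threshold of 2 are all hypothetical choices for illustration, not part of any existing package manager.

```python
# Sketch of quorum gating for a release. Assumes signatures were already
# verified elsewhere; this code only counts distinct trusted endorsers.

# Hypothetical key IDs of the authorities a downstream user chooses to trust.
TRUSTED_SIGNERS = {"alice-key", "bob-key", "carol-key"}


def quorum_reached(verified_signers, trusted=TRUSTED_SIGNERS, threshold=2):
    """Return True if at least `threshold` distinct trusted keys signed."""
    endorsements = set(verified_signers) & set(trusted)
    return len(endorsements) >= threshold


# One trusted signer plus an unknown key: not enough to accept the release.
print(quorum_reached(["alice-key", "mallory-key"]))
# Two distinct trusted signers: the release can be accepted.
print(quorum_reached(["alice-key", "bob-key"]))
```

Because the check runs on the consumer's side against the consumer's own trust set, a single compromised uploader account is not enough to get a release accepted.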

Finally, the idea needs buy-in and participation from package authors, which could be encouraged by privileging releases with multiple endorsers.

Thanks for sharing these ideas. Raku is actually in the process of migrating to a new package ecosystem, so this could be an ideal time to get something like this set up. I'm not sure how much work would be involved from a technical standpoint, but I've opened an issue[0] to ask the maintainer of our ecosystem package repository; hopefully we'll be able to implement a system somewhat along these lines.

[0]: https://github.com/tony-o/raku-fez/issues/50

The main thing the central package manager has to do is support uploading some significant number of .asc PGP signature files (max 10? max 100?) alongside a specific package. That's enough for third parties to start experimenting.

The package manager might also boost search rankings for packages with multiple sigs, but it's just one contributing measure of "kwalitee", like docs, a complete metadata file, etc.

> Trusting any single author is a single point of failure

This is what Linux distributions are for: all big distributions have a team of maintainers plus a dedicated security team.

I'd rather use a really small, static (as in never changing) package than something bloated that gets updates every day and breaking changes from time to time. left-pad was not the problem. The problem was that NPM changed ownership of already existing package-names - which caused the left-pad owner to remove his packages in protest.

> I'd rather use a really small, static (as in never changing) package [instead of one] that gets updates every day and breaking changes from time to time.

That's an entirely fair point and, as I got into a bit in the versioning[0] section, something that I'm giving a good deal of thought to.

I'm currently leaning towards tracking the Rakudo[1] compiler releases (~monthly), so updates wouldn't be anything like daily. As far as *breaking* changes go – well, again, I'm still thinking about/discussing what guarantees to make, but I'm hoping to be able to promise to (try to) provide strong backwards compatibility. One thing I mentioned in the post is that Raku's strong support for multiple dispatch[2] makes backwards compatibility a bit easier: `_` can add a new version of a function without impacting the existing one.

That still leaves _accidental_ breakage (i.e., bugs) – which is the area I'm currently most concerned about. If not handled correctly, a utility package risks creating its own sort of internal dependency hell: if there's a bug in one sub-package that you use, it could potentially block you from using that version – even if a different sub-package has a feature you want. I'm not sure of the best solution yet, but I'm exploring a few Raku options that I think may let me provide versions at the sub-package level (or maybe even the function level?). That's very much a WIP for now, but it's something that'll happen before a 1.0.0 release.

[0]: https://raku-advent.blog/2021/12/11/unix_philosophy_without_...

[1]: https://rakudo.org/

In this vein, for particularly sensitive applications (e.g. password managers) I prefer to disable automatic updates.

How do you identify when to perform updates for security reasons?

I think the poster is saying turn _off_ "Automatically install updates" but leave _on_ "Check for updates [Daily/Weekly/Monthly]". That way you at least become aware that updates are available, and can assess them for yourself.

Maybe published package versions should be immutable.

I get the malware concerns but in practice I don't think they are such a big blocker.

> Maybe published package versions should be immutable.

They are in many languages. Of those I'm familiar with, Raku, Rust, and JavaScript all have immutable package repos. (npm wasn't when left-pad was pulled but has changed since then).

Of course, in each case they're only "immutable" in the sense that some organization (with varying degrees of centralization) has promised to host them forever; people clearly vary in their willingness to believe promises of that nature.

Haskell's Stack and Nix have immutable repositories in a more technical sense: you can specify the entire set of dependencies using cryptographic hashes of their contents.

Other package managers also store hashes, just separately in lock files. The main issue is a takedown for legal or administrative reasons. If you have a hash, that might possibly be helpful in searching for an alternative source after a takedown, but it's not that much protection.
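
To make the hash point concrete, here is a small Python sketch of what a pinned content hash does and does not buy you. The byte strings stand in for package archives and the function names are made up for illustration; real lock files differ in format and in which hash algorithm they record.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest of the archive contents, as a lock file might record it."""
    return hashlib.sha256(data).hexdigest()


def matches_pin(data: bytes, pinned_hex: str) -> bool:
    """Check that bytes fetched from any source match the pinned digest."""
    return sha256_hex(data) == pinned_hex


# Stand-in for the original archive and the digest recorded at lock time.
original = b"left-pad 1.0.0 source"
pinned = sha256_hex(original)

# The same bytes fetched from some alternative mirror still validate.
print(matches_pin(original, pinned))
# Anything else is rejected, wherever it came from.
print(matches_pin(b"tampered source", pinned))
```

The digest lets you recognize the right bytes if you find them again, but it cannot bring the bytes back once every copy has been taken down.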

>Maybe published package versions should be immutable.

Still won't help you if the leftpad dev, wanting to send a message or protest, had pushed a small update that did something bad.

The problem is when you are not the idiot that installs leftpad, but you need to install some other package like a GUI or testing framework, and those "smart" devs decided to depend on leftpad directly or indirectly because of some stupid philosophy. I have inherited a project with this kind of idiotic dependencies, including small stupid shit and packages with an incorrect package.json that declare dependencies they do not actually use, or things they should not depend on.

It would help if you use a lock file and/or pin to badges of dependency versions.

Yeah, but I've never seen people locking the dependencies to an exact version, probably to get small fixes and important security ones.

But then you still have issues with packages that depend on the npm website existing in the future, or even packages that just link to a git repo directly, so if the repo is gone or GitHub is gone you (or others) can't re-create your project.

> I've never seen people locking the dependencies to an exact version

This depends heavily on the language/ecosystem. For example, golang's Minimal Version Selection[0] basically requires libraries to specify an exact version – the only way they'd get a higher one is if another library in the dependency graph had manually upgraded to the higher version.

But yeah, if the source is hosted externally and you don't have a local copy somewhere, then that's going to hurt. Which is (part of) why "should I vendor my dependencies" is such a perennial topic.

[0]: https://research.swtch.com/vgo-mvs
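
For readers who haven't met Minimal Version Selection, here is a simplified Python sketch of the core idea as I understand it from the linked article: walk the requirement graph from the root and, for each module, keep the highest version that anything actually asks for. The module names, the `REQUIREMENTS` graph, and `build_list` are hypothetical, and the real algorithm handles more cases (upgrades, downgrades, exclusions) than this does.

```python
# Simplified sketch of Minimal Version Selection: each module states the
# minimum version it needs, and the build uses the highest stated minimum.

# Hypothetical requirement graph: (module, version) -> list of requirements.
# Versions are tuples so they compare numerically, e.g. (1, 10) > (1, 9).
REQUIREMENTS = {
    ("app", (1, 0)): [("a", (1, 1)), ("b", (1, 2))],
    ("a", (1, 1)): [("c", (1, 3))],
    ("b", (1, 2)): [("c", (1, 4))],
    ("c", (1, 3)): [],
    ("c", (1, 4)): [],
}


def build_list(root, requirements):
    """Return {module: version} for everything reachable from `root`."""
    selected = {}
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        module, version = node
        # Keep the newest version that some module in the graph requires.
        if module not in selected or version > selected[module]:
            selected[module] = version
        stack.extend(requirements.get(node, []))
    return selected


print(build_list(("app", (1, 0)), REQUIREMENTS))
```

In this example `c` resolves to 1.4 only because `b` asks for it; a newer `c` published upstream would not be picked up until some module in the graph raised its requirement.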

>But yeah, if the source is hosted externally and you don't have a local copy somewhere, then that's going to hurt. Which is (part of) why "should I vendor my dependencies" is such a perennial topic.

It's not only this. What if I create an open source thing and share it on GitHub/npm or whatever package website? The best practice is not to bundle my dependencies and just list them. Then 5 years later someone wants to install my package, which depends on their package, which depends on some leftpad/isOdd package that is now gone. In other ecosystems it is accepted as good practice that, besides the sources, you offer an .exe, .dll, .jar, or .tar.gz, but in the Node and Python communities I see that developers now only distribute with npm, pip, or similar.

Part of the solution would be to put important core stuff in the standard library. Then somehow we need to stop the CV-driven development that causes this fragmentation, with so many alternatives for the same thing that you don't get a clear answer on which one you should use for X.

Those large libraries don't change either, if you use a fixed version.

Security updates are important, but it's not like CVEs are particularly common in 5 functions like left-pad, and bloated code that isn't reachable in your app is probably not going to be an attack surface.

Especially if a dead code remover gets it.

JNDI was "dead code" for almost everybody, in that nobody intentionally used or wanted it. Unfortunately it was really just "dormant code" waiting to wake up.

There is a dial from small to large (in terms of code size or feature set), static or growing feature set, pinned to floating dependencies (if you are notified on available updates to your pinned version and review them, which is actually possible for small dependencies, it's largely equivalent to floating).

I don't think anyone is going to be able to convincingly say "my choice of large, growing, floating is best", or even though it's my starting preference, convincingly say "small, static, pinned is best". If you don't have the features or performance you need yet, you can make a good choice in picking a growing, floating dependency -- your call.

But there is absolutely in my mind a need for greater general discipline in dependency capabilities. We don't need a monad stack or effects system in order to say "you can't add code to remote download, deserialize, and execute, in a call to log, that's just not a sensible thing to do". Or maybe we do, because many of us still haven't learned this lesson.

It doesn't apply universally, but in front end development a large part of the problem is that engineering is not, in practice, the ultimate decider for (all) dependencies. The business/product side often dictates integration with a particular third-party service; those services don't have an option to not use their own SDK, and the SDK itself may come with its own dependencies. To the business side, security risk looks abstract, theoretical, and easy to dismiss with "it can't happen to me", especially compared to whatever goal is tied to the integration.

Facebook in mobile applications is a perfect example. Not a security issue, but the two crash incidents last year caused some havoc for iOS developers. But as far as I am aware the only way to get Facebook login in your mobile app is to use the SDK, and no product manager on the planet is going to let engineering talk them out of Facebook user support.

Sure, people have to use graphics APIs they hate (D3D12, Vulkan, Metal) because there's no realistic way to avoid them. If you're forced, you're forced.

Log4j type libraries don't really fall into that camp (unless it's an awful transitively forced dependency, which unfortunately can happen, but it's at least usually slightly easier to fight against/mitigate). And I was mainly challenging an implied equivalence between "large, growing, pinned dependencies plus dead code elimination", and "small, static, pinned/floating dependencies". There are plausible trade-offs that could cause you to choose any combination of those factors, but I think it's wrong if one were to make an equivalence like that.

> The problem was that NPM changed ownership of already existing package-names - which caused the left-pad owner to remove his packages in protest.

And of course that npm allowed unilaterally pulling packages and breaking all dependents.

The funny thing about this is the Unix philosophy is just about keeping functional units small, separate, and theoretically independent of each other. It says nothing about the granularity of packaging for end users. Nobody has ever, to my knowledge, individually provided each Unix utility in its own package. A GNU system has most stuff in coreutils, with most everything else in findutils, binutils, and util-linux on Linux systems. Only grep, awk, and sed are single-tool packages among the POSIX utilities. In BSD systems, one base package contains the entire POSIX toolchain.

The idea of having a gigantic "utils" package like this, or even a batteries included standard library like Ruby and Python, is perfectly in keeping with Unix philosophy. The point is not to have a single executable that does everything, but you can provide many executables and shared objects in one addressable package with a common version, a single build, and a monorepo.

Separating the question of "functional units" from "packaging units" is a good point – you're right that there's nothing non-Unix-y about packaging coreutils together.

I might add a third category, though, maybe "development units"? Something like Python's batteries-included standard library strikes me as a bit less Unix-y – not because it packages things together but because they (as I understand it) develop things together and do so in a way that creates barriers to outside packages integrating quite as well as standard library packages. (Or at least that's what I've understood from the outside, looking in)

Ruby has truly ruined me for stuff like this. Most basic functionality and some non-trivial functionality is covered in the standard library. And if for some reason Ruby doesn’t have enough, Rails’ ActiveSupport probably has you covered.

But Ruby is quite famously a batteries included language and its libraries follow in that philosophy. Solve the entire problem, not tiny pieces of it.

> Ruby has truly ruined me for stuff like this. Most basic functionality and some non-trivial functionality is covered in the standard library.

Ruby is one language I haven't had the chance to explore yet. Are there any Ruby functions you particularly miss in other languages? Any that aren't built in to Raku might be ones I consider for the `_` utility library.

Code blocks. You don’t use a for loop, you call Array#each. On top of that, Ruby’s Enumerable module allows any object with an each method to access a whole bunch of convenience methods.

Note, the # means instance method when discussing ruby code. It’s not valid syntax.

Ruby allows you to be extremely concise, while maintaining readability for anyone moderately familiar with the language. In a way Perl is most definitely not.

Short example:

    arr = [<blog posts objects>]
    # Author may be nil
    number_of_authors = arr.map(&:author).compact.uniq.count

You can imagine how easy it is to throw data around with very little code. Loops are abstracted, so you never think about them as loops. Instead you just see data moving around.

Edit: A few languages (like JavaScript) implement this behavior explicitly by passing anonymous functions as callbacks.

Code blocks are just lambdas though? It's pretty much mainstream these days, even Java and C++ have them.

    // java equivalent
    int numberOfAuthors = arr.stream().map(BlogPost::getAuthor).distinct().collect(Collectors.toList()).size();

I happen to think for loops are usually a better choice in languages like Java, but the option is there.

EDIT: actually there's a better way, no need to create a list just to count its elements:

    long numberOfAuthors = arr.stream().map(BlogPost::getAuthor).distinct().collect(Collectors.counting());

Blocks are a bit different, they can do things like return from the enclosing method. Ultimately I don't think it is worth the complexity and code coloring it adds.

Note how much more concise and yet readable the equivalent Ruby is.

If you want to play code golf you can do static imports and write this:

    long numberOfAuthors = arr.stream().map(getAuthor).distinct().collect(counting());

You still save a few parentheses and one method call in Ruby (.compact vs .stream() and .collect()) and the method names are shorter. Mostly it's a matter of static vs dynamic typing and naming conventions, not a consequence of Ruby code blocks. And is it worth shortening at this point?

BTW I know uniq is a thing in unix, but I hate this naming decision.

Thanks, those all seem useful.

Assuming I'm following you correctly, Raku already has equivalents to each of those built in: we have code blocks[0], List.map[1] (or `for list -> $el { [codeblock]}`[2] which is a `for` loop, but not a C-style one), and the Iterable/Iterator Roles[3].

So I don't think those features give me any ideas for items to add to the library – but I agree that I'd be sad in a language without them!

[0]: https://docs.raku.org/language/control#index-entry-control_f...

[1]: https://docs.raku.org/routine/map

[2]: https://docs.raku.org/syntax/for

[3]: https://docs.raku.org/language/iterating

I just saw the code you added; here's a pretty literal translation into Raku in case you're curious (I'm assuming that `<blog posts objects>` is a stand in for omitted code)

    my @arr = [blog_post_objects];
    # Author may be Nil
    my $author-count = +@arr.map(*<author>).grep(*.defined).unique;

(Though note that this would only exclude undefined authors. In most situations, I'd probably either know that all defined authors are truthy (e.g., an object) or would want to exclude the falsy ones as well (e.g., empty string). In that case I'd use `.grep(?*)` to save a few characters.)

Not familiar with Raku, but that grep call seems like it would be quite a bit slower than the Ruby equivalent at runtime if it’s anything like a normal grep. Is there some magic there that would make that not so?

Edit: clarity

“grep” is how raku spells what is elsewhere called “filter” or “where”.

The entire set of string and enumerable methods. Extraordinarily useful.

I took a (somewhat quick) look and I'm pretty sure that Raku has equivalents for all of the Enumerable methods except for partition. (We could do basically the same thing with our classify[0] method by converting the Hash to an Array, or do it manually with a reduce. But there are times when a partition method would be handy.)

[0]: https://docs.raku.org/type/List#routine_classify

The String methods might offer a few more options, but I'll need to think more carefully about that. It's idiomatic (and supported with syntax) to use Regexes for at least some of those tasks in Raku. Plus, Raku's strings aren't directly iterable/indexable (though it's trivial to convert them to a list of characters), and Raku doesn't have a direct equivalent to a symbol (something that I do miss). I suspect that, even with those factors, there might be some ideas worth stealing in there, so thanks for the pointer.

(Oh, and I know that it's "just" syntax, but for some reason I *really* like the idea of having an sprintf operator. I hadn't thought of a language doing that, but I just might borrow that one!)

Yup, similar in C#, where the .NET framework has tons of stuff already built-in.

I don't think the Unix philosophy makes too much sense for things other than CLI commands, and even there, I'm not 100% convinced.

Going by Salus' summary:

1. Write programs that do one thing and do it well.

2. Write programs to work together.

3. Write programs to handle text streams, because that is a universal interface.

(3) definitely makes no sense outside of CLI commands, (2) is too ill-defined to be of any use (at least outside of programs, though even for programs it seems to be quite redundant with 3), which leaves (1)... which is basically a matter of taste: "one thing" is an extremely ill-defined concept, and you could easily argue that most of the useful SUS commands break it, especially (though not exclusively) if you look at it from GNU's coreutils.

> (3) ["write programs to handle text streams"] definitely makes no sense outside of CLI commands

I'm not sure I'd say that. In the web development world, you'll definitely see people arguing to "JSON all the things" (text) but others arguing to "protobuf all the things" (binary). And they raise many of the same simplicity-vs-performance issues that came up for Unix CLI commands.

As for (1) and (2) being too ill-defined/a matter of taste – well, I agree, but I don't think that means they're useless. I think of them as being in the same category as advice for writing prose ("Use short sentences where possible", "Avoid cliches"): helpful goals to keep in mind, even if I can't pin down exactly what they mean.

For libraries, defaulting to an untyped interface makes no sense.

If you're using any kind of software that exchanges data in a human readable serialization format you're following 3, in some form.

Like for example, an HTTP server or client

That makes even less sense, the output is dictated by the purpose, it's not a choice.

Computers don't run for computer's sake. They run due to extant human utility. That utility is itself existentially traceable to a human making a choice.

No computer has ever done something whose chain of causality does not return to a matter of human choice in orchestrating the circumstances for said outcome.

The good parts of the Unix philosophy have already been subsumed into common sense as a programmer. So what remains as "The Unix philosophy" are the controversial, more religious components.

You can tell it's a little religious because, like Agile and REST, "everyone is doing it wrong". Where the thing everyone is doing wrong is a weird little corner of the thing with dubious utility.

The general motivation behind it can apply to any kind of software, but it's definitely not fitting for many categories.

In desktop applications I usually prefer tools that cover the majority of use-cases and provide an easy way to extend them (the last part is important too, otherwise you end up with the Windows 8 default apps). It doesn't make sense to split up a photo editor's core featureset, but that doesn't mean it's a good idea to just bury it in a pile of features out of the box that only 5% of users care about.

The Unix philosophy is unhelpful dogma at this point - something that people use to blindly criticise software that they've arbitrarily decided does "too much" even if it makes perfect sense for that software to do more than one "thing". SystemD comes to mind.

Maybe it made sense at the time.

Unix desktops are composed of little tools. Under xfce you can switch out Thunar and Xfce4-terminal for something else. And long ago you could use your own WM in Gnome.

Right but you've just retrospectively decided that Thunar and Xfce4-terminal are the ideal unit of "doing one thing". It's a completely arbitrary threshold.

Thunar can unzip files. Oh no it's doing two things! Shouldn't it leave that job up to `tar`? Xfce4-terminal has tabs. Erm shouldn't that be the job of the window manager like on Mac? Stop doing more than one thing!!

Thunar should call an external program to unzip.

It should use a third party library. Having a file manager use `system()` to unzip files would be insane for so so many reasons.

You could potentially do it with something like D-Bus I suppose. But it would still probably be simply better if it was built in.

Huh? What's wrong with calling the unzip binary? Using a library is only a slight optimization (though I admit -- it's better).

1. Way more likely to break because the user doesn't have the program installed or it isn't in PATH or whatever.

2. You have to do all communication through argv and parsing stdout which is extremely error-prone.

3. You're almost certainly not going to get proper progress reporting.

4. It can't be interactive (e.g. how do you implement the "This file exists, what do you want to do?" dialog?)

It's so obviously worse I'm surprised you even asked. I'm not sure what "D-BUS??" means but D-Bus mitigates a few of those problems at least.
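
Points 3 and 4 are the ones a library handles most naturally. Here is a hedged Python sketch using the standard `zipfile` module; it is not how Thunar works (Thunar is not written in Python), and the `extract` function and its two callbacks are hypothetical names. It only shows that, with in-process access to the archive, progress reporting and an overwrite prompt become ordinary function calls instead of something parsed out of another program's stdout.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract(archive, dest, on_progress, on_conflict):
    """Extract `archive` into `dest`, calling back for progress and conflicts."""
    with zipfile.ZipFile(archive) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
        for done, name in enumerate(names, start=1):
            # Interactive hook: the caller decides whether to overwrite.
            if (Path(dest) / name).exists() and not on_conflict(name):
                on_progress(done, len(names), name, skipped=True)
                continue
            zf.extract(name, dest)
            on_progress(done, len(names), name, skipped=False)


# Build a small archive in memory so the example is self-contained.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", "new a")
    zf.writestr("b.txt", "new b")

events = []
with tempfile.TemporaryDirectory() as dest:
    Path(dest, "a.txt").write_text("old a")  # pre-existing file
    extract(
        buf,
        dest,
        # A GUI would update a progress bar here; this just records the calls.
        on_progress=lambda done, total, name, skipped: events.append(
            (done, total, name, skipped)
        ),
        # A GUI would show a dialog here; this always answers "keep existing".
        on_conflict=lambda name: False,
    )
    kept = Path(dest, "a.txt").read_text()

print(events)
print(kept)
```

In a file manager the two lambdas would be wired to a progress bar and a "This file exists, what do you want to do?" dialog.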

Exactly, and integration (launchers, notifications, menu integration) can be done around a common ground (a la freedesktop.org)

That's only possible because DEs got together and made freedesktop.org, against the desires of many in their various communities at the time. There's little in the way of "Unix philosophy" built into the various freedesktop conventions. In fact there's a lot of "don't reinvent the wheel" in the conventions which is not really addressed in the "Unix philosophy" which is rife with reinvented wheels.

There's still plenty of GUI tools in the FOSS space that don't heed freedesktop conventions. There's also still people that bitch endlessly about D-Bus, systemd, and anything that's not just piped plain text everywhere.

This is the followup to Following the Unix philosophy without getting left-pad, https://raku-advent.blog/2021/12/06/unix_philosophy_without_...

Raku, mentioned in the blog, was formerly Perl 6.

> The idea of black box abstraction is that you can implement some complex functionality, box it up, and expose it to the outside world so carefully that the world can totally ignore the implementation details and can care only about the inputs and outputs.

Is there such a thing as "glass box" abstractions? :)

The opposite of black box is "white box", e.g. white box testing is when you can poke into the internals.

But now that I think about it glass box does make more sense!

The Unix Way is small, replaceable single purpose binary tools that are vendor blind.

This seems to be the exact opposite.

How about we minimize the UNIX philosophy instead?
