A peculiarity of the GNU Coreutils version of 'test' and '[' (utcc.utoronto.ca)
78 points by ingve on Nov 26, 2023 | 66 comments



> PPS: In theory GNU Coreutils is portable and you might find it on any Unix. In practice I believe it's only really used on Linux.

Oh how times have changed:) But even with the demise of most commercial unixen, I was under the impression that Darwin preserved the ancient tradition of installing the system and then immediately replacing its coreutils with GNU?


Darwin's coreutils are mainly updated from FreeBSD, I think. They've got some GNU stuff, but it hasn't been updated since GNU switched to GPLv3.


Right; the coreutils built in to Darwin are BSD-derived, I was referring to a user installing the GNU versions over top of the OS proper.


I bet many of us do this due to small differences between BSD and GNU coreutils. One that always gets me is:

  sed -i '' 's/foo/bar/g' file   # BSD sed
  sed -i 's/foo/bar/g' file      # GNU sed


That is indeed an incredibly frustrating difference, because there is no version of it that works on both GNU and macOS. In practice I often end up using perl to do an in-place search-replace instead.
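
For example, something along these lines behaves the same on GNU/Linux and macOS, since Perl handles the optional backup suffix to -i consistently (the file name is just a placeholder):

  perl -pi -e 's/foo/bar/g' file.txt        # in-place, no backup
  perl -pi.orig -e 's/foo/bar/g' file.txt   # in-place, keeping file.txt.orig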


It takes a second to install GNU coreutils with Homebrew and make Macs barely usable.


I much prefer Linux but having used Macs a few times for work, I absolutely cannot recommend making GNU coreutils the default option on Macs.

If you make GNU coreutils the first binaries in PATH, you can expect subtle, _very nasty_ issues.

The last one I encountered was `hostname` taking something like 5 seconds to run. Before that, there was some bug with GNU’s stty which I don’t remember the specifics of.


Or just use Perl


As long as you're using stone knives, might as well throw in some bear skins.

https://www.youtube.com/watch?v=F226oWBHvvI


Fun! But honestly there's a place for Perl.

There's far less variance between Perls on different systems than there is for sed/awk/shell. So if you want performant portable code, Perl does better than all of those.

I'd never use Perl for a 'big' program these days but it still beats the crap out of the mess of sed/awk/bash/ksh/zsh.


Apple has been threatening to remove perl for ages. One of these days they'll find the "courage" to do so and we'll need a better strategy.


FreeBSD removed Perl from base ages ago, in FreeBSD 5 or 6. That went alright.

I don't even have it on my Linux system any more; for better or worse, fewer and fewer things use it.


But "sed -i'' s/../../ FILE" will work on both, no? Or -i.orig if you want to keep a backup file?


That does not work on macOS. "-i''" is the same as just a bare "-i", and it will then interpret the next argument as the backup suffix.


Oops yes, that should have been -i '' with a space. But I see now that GNU doesn't accept -i '', or even -i .orig. It MUST be without a space: "-i.orig". The reason for that I assume is there's no way to disambiguate:

  echo foo | sed -i pattern
  echo foo | sed -i .orig
But you can still use "-i.orig"; that should work on both. That will leave you with a .orig file to clean up, but arguably that's not a bad thing as -i can clobber files.
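
So the form that should work with both looks something like this (file and pattern are placeholders):

  sed -i.orig 's/foo/bar/g' file.txt   # accepted by both GNU and BSD sed
  rm file.txt.orig                     # remove the backup if you don't want it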


Yes, like I did first with homebrew and now with nix.


That was the case for a while, but when they were updated to use GPLv3, Apple stopped updating them, probably to avoid licensing problems on iOS. Nowadays you can install them from Homebrew, MacPorts, straight from source, or other methods.


Yes, macOS takes its utils from FreeBSD.

An interesting factoid is that FreeBSD/macOS sort(1) was using GNU code until recently, since this is quite tricky to implement. Eventually it was reimplemented for GPL-avoidance reasons.

We do consider macOS though, and ensure all tests pass on macOS for each release.


Does Apple compensate or sponsor that work at all?


Definitely not


I tend to install coreutils on FreeBSD. Besides some minor annoyance with every command being prefixed with "g", some of the programs work a bit nicer than the FreeBSD-shipped versions (or some Linux-centric programs just want the coreutils versions...).
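
To illustrate the "g" prefix (a rough sketch; exact names depend on how the package installs them):

  gls --color=auto           # GNU ls from the coreutils package
  gdate --iso-8601=seconds   # GNU date; the BSD date has no such option
  ls -G                      # the base-system BSD ls is still plain "ls"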


I'd be interested in learning which commands you have in mind and what specifically is a bit nicer about their coreutils implementation.


In GNU utilities, option arguments can come after (or between) positional arguments. Personally I find this small convenience invaluable, because I'm used to it.
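
A small illustration (GNU's argument parser permutes argv by default; a strict POSIX parser stops at the first operand):

  ls -l /tmp    # works everywhere
  ls /tmp -l    # works with GNU ls; BSD ls looks for a file named "-l"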


Oh, I had no idea that GNU utilities allow this.

As a Unix graybeard I always place options first. Options last feels like Windows command prompt, so nothing I want to see...

I always tell younger colleagues who place options at the end: it might work with some commands, but just don't do it. I did not know that "some" includes all of GNU coreutils. A single common code style is a virtue, even in interactive use if there are onlookers. So I guess I will continue to point it out.


Even many command line parsing libraries support it and scan the entire argv for options. You should always terminate the options with "--" if it's in a script and any of the positional arguments are variables that might or might not start with a dash.
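
For example (the file name here is deliberately contrived to show the hazard):

  file="-rf"
  rm "$file"      # rm parses this as the options -r -f
  rm -- "$file"   # "--" ends option parsing, so rm removes the file named "-rf"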


Unfortunately not. macOS comes with violently outdated FreeBSD coreutils, due to the GPLv3 situation. Though the default shell was recently changed to zsh from the 2007-vintage Bash 3.2.


Right, which is why I thought it was common for people to install the operating system, then immediately use macports or homebrew to go get the GNU coreutils and a modern version of bash because the BSD versions are less friendly.

(Thus continuing an extremely long tradition of layering GNU over the vendor tools; e.g. Sun had their own tools, but everyone liked the GNU versions better.)


Oh sorry, I interpreted your comment as asking whether Darwin installed GNU coreutils itself in some roundabout, very-expensive-lawyer sanctioned manner :)


FWIW that's exactly what I've done


People do it, but why you would when zsh is right there, is beyond me.

Possibly the same type of people that target bash specifically in shell scripts I guess.


What's wrong with bash though? Targeting bash specifically has massively improved my productivity and decreased the incidence of easily avoidable mistakes. Portable POSIX shell scripting is hell on earth but bash scripting with shellcheck can be surprisingly pleasant. I had the same experience with portable makefiles and GNU Make.

For example, I managed to create a surprisingly good test suite with bash and the rest of the GNU coreutils:

https://github.com/lone-lang/lone/blob/master/scripts/test.b...

It even runs in parallel. Submitted a patch to coreutils to implement the one thing it couldn't test, namely the argv[0] of programs. I should probably go check if they merged it...


> Portable POSIX shell scripting is hell on earth but bash scripting with shellsheck can be surprisingly pleasant.

Assuming you meant `shellcheck`: you know it works for POSIX-compatible shell too, right?

What's wrong with bash is that people invariably end up requiring features that aren't available in some deployed version, and then you've lost a lot of the benefit of writing in shell in the first place.

Similar to shellcheck, shunit2 works just fine for running unit tests in POSIX-compatible shell.
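
A minimal sketch of what a shunit2 test looks like, assuming the shunit2 script is on your PATH (otherwise source it by path):

  #!/bin/sh
  testUppercase() {
      result=$(printf 'foo' | tr a-z A-Z)
      assertEquals 'FOO' "$result"
  }
  . shunit2   # sourcing shunit2 at the end discovers and runs the test* functions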


> Assuming you meant `shellcheck`

I did. Edited my comment, thanks. I don't know why I'm so prone to that particular typo. My shell history is full of it.

> What's wrong with bash is that people invariably end up requiring features that aren't available in some deployed version, and then you've lost a lot of the benefit of writing in shell in the first place.

But I've gained quite a lot too. Bash has associative arrays. I just can't go back to a shell that doesn't have that.
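
For instance (a minimal sketch; the keys and values are made up):

  declare -A ports=( [http]=80 [https]=443 )
  ports[ssh]=22
  echo "${ports[https]}"            # prints 443
  for proto in "${!ports[@]}"; do   # iterate over the keys
      echo "$proto -> ${ports[$proto]}"
  done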

Shell scripting makes it simple to manage processes and the flow of data between them. It's the best tool for the job in these cases. So there are still reasons for scripting the shell even if one is willing to sacrifice portability.


I just can't believe it's 2023 and we're actually praising a shell for having arrays and dictionaries. Or that there are still multiple shells in use that don't. Or that people still ask questions like "What's wrong with bash though?" with a straight face, as if they don't know.

Now how long are we going to have to wait until somebody invents a way to do named parameters? That will revolutionize the computer industry! I guess it's way too much to ask for a built-in JSON or YAML parser; all we can hope for is maybe a stringly typed SAX-callback-based XML parser after another 20 years from now, because DOM objects in a shell would be heretical and just so unthinkably complicated.

Why are people so afraid to just use Python? Shell scripting and cobbling together ridiculously inefficient incantations of sed, awk, tr, test, expr, grep, curl, and cat with that incoherently punctuated toenail, thumbtack, and asbestos chewing gum syntax that inspired Perl isn't ever any easier than using Python, especially when you actually need to use data structures, named function parameters, modules, libraries, web APIs, XML, JSON, or YAML.


This answer is pretty convincing:

https://stackoverflow.com/a/3640403/512904


Thank you! That's a great analysis, and I love the design of PowerShell, which addresses most of my arguments against bash. The Unix community (especially Sun Microsystems) has traditionally had this arrogant self-inflicted blind spot of proudly and purposefully cultivated ignorance in its refusal to look at and learn from anything that Microsoft has ever done, while Microsoft is humble and practical enough to look at Java and fix it with C#, look at bash and fix it with PowerShell, etc.

Here's the summary of a discussion I had with ChatGPT about "The Ultimate Shell Scripting Language", in which I had it consider, summarize, and draw "zen of" goals from some discussions that I feel are very important (although I forgot about and left out PowerShell, that would be a good thing to consider too -- when I get a chance I'll feed it the discussion you linked to, which made a lot of important points, and ask it to update its "zen of" with PowerShell's design in mind):

The Zen of Python.

https://peps.python.org/pep-0020/

Discussion about Guido van Rossum's point that "Language Design Is Not Just Solving Puzzles".

http://lambda-the-ultimate.org/node/1298

https://www.artima.com/weblogs/viewpost.jsp?thread=147358

Discussion of Ousterhout's dichotomy.

https://en.wikipedia.org/wiki/Ousterhout%27s_dichotomy

https://wiki.tcl-lang.org/page/Ousterhout%27s+Dichotomy

Email from The Great TCL War Part 1, started by RMS's "Why you should not use TCL".

https://news.ycombinator.com/item?id=12025218

https://vanderburg.org/old_pages/Tcl/war/

Email from The Great TCL War Part 2, started by Tom Lord's "GNU Extension Language Plans".

https://vanderburg.org/old_pages/Tcl/war2/index.html

Summarization of the important points in those discussions that apply to The Ultimate Shell Scripting Language, resynthesized into a "zen of" list.

Discussion of Support for Declarative and Procedural Paradigms and how it applies to supporting standard declarative syntaxes including JSON, YAML, and XML (which bash still doesn't and probably never will, and PowerShell does of course).

Suggestions for some more "zen of" and design goals, specifically focused on addressing weaknesses or design flaws in popular languages like sh, bash, tcl, python, perl, etc.

Discussion of how "Simplified Debugging and Error Handling" can be balanced with "Macro Processing and Syntax Flexibility", better than debugging C++ templates, Lisp macros, TypeScript code compiled to minified JavaScript, etc.

Discussion of how JavaScript / TypeScript fall short of those goals.

Discussion of visual programming languages for interactive shell scripting as well and general purpose programming.

Discussion of layering visual programming languages on top of textual programming languages.

https://donhopkins.medium.com/the-shape-of-psiber-space-octo...

Interoperability of text and visual programming languages with LLMs for efficiently and reliably analyzing and generating code.

https://docs.google.com/document/d/1QJ98QwC2ubsTNKOFAzUAw6Zy...

Discussion of using Python to build a visual data flow node based shell scripting language on top of Blender (that just happens to support 3D, image processing, video editing, GPU programming, machine learning, Python module integration, and everything else that Blender is great at).

https://www.youtube.com/watch?v=JOeY07qKU9c

Discussion of how to make efficient use of token budgets when using LLMs with text and visual programming languages.

Here is the condensed discussion with the bulk text I had it analyze omitted, so you can more easily read the summaries and recommendations and "zen of" manifestos:

https://docs.google.com/document/d/1wKhdEoLWCZX9TNaftQxLp6ot...

Here is the entire unexpurgated discussion including all the email messages and articles I had it analyze, if you want to see what it was considering (they're interesting discussions to read in their entirety if you're interested in that kind of stuff, but ChatGPT is excellent at summarizing them and highlighting the important and relevant points):

https://docs.google.com/document/d/1RTxeYjZ2vZsNU4xTj0ZhiXXj...

With the rise of LLMs I think now is a great time to reconsider the design of text and visual scripting and programming languages, with LLM interoperability and compatibility in mind.

ChatGPT:

Designing a text representation for a visual programming language that interoperates well with Language Models like GPT necessitates careful planning to ensure the representation is both meaningful and efficient. Here are several considerations and strategies to optimize the text representation:

Token Efficiency: The text representation should be concise to make efficient use of the token budget of the language model. Avoid verbose syntax and prefer shorthand notations where possible.

Syntax Simplicity: A simple and consistent syntax helps the language model predict and generate code more accurately. Overly complex syntax can lead to increased errors and confusion for both the model and the user.

Standardization: Using standardized syntax and well-known programming patterns can help the language model better understand and process the code since it has likely been trained on similar patterns.

Semantic Density: Each token should carry a significant amount of meaning. This can be achieved through the use of expressive keywords or operators that encapsulate complex actions.

Modular Design: Break down the text representation into modular components. This approach allows the language model to process each part in isolation, reducing the complexity and cognitive load.

Contextual Keywords: Use keywords that clearly indicate their function within the language, making it easier for the language model to provide contextually relevant suggestions.

Window Size Awareness: Design the text representation to be legible and understandable within a typical window size, avoiding lines that span too long and require horizontal scrolling.

Minimize Ambiguity: Ambiguity in text representations can lead to multiple interpretations by the language model, resulting in inaccurate predictions. Clear and unambiguous syntax helps maintain consistency.

Commenting and Documentation: Incorporate support for comments and documentation within the text representation. This not only aids human understanding but can also provide language models with additional context for generating code.

Escaping Visual Elements: For elements that are inherently visual and do not translate well into text, provide an escape mechanism that allows these elements to be described textually in a way that the language model can process.

Alignment with Model's Training Data: Align the text representation with the structures and syntax that the language model has been trained on. This can include following common coding standards and naming conventions.

Iterative Refinement: Use an iterative approach to refine the text representation based on feedback from testing with the language model, ensuring that the design evolves to better suit the model's capabilities.

By considering these aspects, you can create a text representation for a visual programming language that is both language model-friendly and conducive to an efficient programming experience.


Agreed. If you don't test your script in other shells and on other platforms, why advertise compatibility? And not just the script, but also the programs called.

I got burned by supposedly "portable" scripts by the whole Ubuntu dash disaster, and again when I started using Mac OS X, and then again once I used Cygwin and msys2 on Windows.

I do keep portability in mind when writing shell scripts to ease porting later, but without testing there's really no way to be sure "/bin/sh" is right. And some of the Bash features such as arrays are legitimately useful.


I favor POSIX sh myself, but Bash sits at a happy medium of portability and features; zsh might well win on features, but Darwin is the only OS I know of that installs it by default, whereas Bash is nearly universally installed by default on Linux distros and still has considerably more features than POSIX /bin/sh.


For scripts I'm with you on POSIX.

I was referring specifically to interactive use: ie why someone would install bash to use as their shell, when zsh is already there.


I use bash mostly out of habit and because it's installed by default, but also because zsh had some sort of incompatibility with some of my aliases that I never got around to debugging.


The BSD licensed stuff isn't outdated, it's actually directly from FreeBSD 14.0 afaict. ZSH is current (5.9) in Sonoma for example.


Installing GNU coreutils etc. was also common on Solaris. However, as it usually wasn't in your PATH ahead of the SysV utils, it was only used via its full path.


and in case anyone wants to bring up GNU/kFreeBSD - that's officially dead as of July.


Tragically true:( But you can still install https://www.freshports.org/sysutils/coreutils/ on FreeBSD if desired.


Recent and related:

test, [, and [[ (2020) - https://news.ycombinator.com/item?id=38387464 - Nov 2023 (225 comments)


Thanks hn for doing what you do! I'm so glad to see this here.

A few months back I noticed '[' under /bin on my mac. I tried googling to understand what it was, but my google-fu came up short. It felt like one of those ungoogleable things. This link is an excellent starting point for me.


Searching for /bin/[ gets reasonable results for me

https://www.google.com/search?q=%2Fbin%2F%5B


Doesn't mac have manual pages for shell utilities? In other words, this should work: man [


If I can go "man [", then why doesn't "man (" work? And why aren't there man entries for the rest of all the punctuation marks that bash uses?


They look similar, but to your shell they are different: [ is the name of an executable, ( is a syntax symbol of bash. The man page for the syntax is the man page of bash.


it is funny that [ is the name of a binary, but when using it you can't just type [ without the closing ]

by the name, you'd think you could just use [ 0 -gt 1 <enter>

after all, we don't have to type grep query perg
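
Roughly (a sketch; the exact error wording differs between implementations):

  /bin/[ 0 -gt 1 ]   # fine: exits with status 1 (false)
  /bin/[ 0 -gt 1     # error: the standalone [ insists on a closing ]
  test 0 -gt 1       # fine: test takes no brackets at all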


What's not so funny is that /bin/[ has no way of enforcing this syntax rule, or any other. Just like sed can't enforce any syntax rules on its regular expressions. This is why the shell is still a crock of shit for scripting even with "set -e" on.

The invoked binary has no way of aborting script execution. All it can do is barf out on stderr and return an error code, which the shell interprets as false and `if [ "x" = "x"` (without ]) goes into the else branch.
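
A sketch of that failure mode (bash-ish; the exact stderr message varies):

  if [ "x" = "x"            # note: the closing ] is missing
  then echo "then branch"
  else echo "else branch"   # this runs: [ complained on stderr and returned non-zero
  fi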


I like the cut of your jib. Finally a breath of fresh air from a sane person amongst all the crazy people making fanatically apologetic excuses and heaping evangelical praise on toxic burning dumpster fires.


That does work if you call it as `test`; just a quirk of faking shell syntax in an external binary I think.


I was halfway expecting /bin/[ -> /bin/test, but that didn't pass the smell test since the ] is required, so it couldn't just be a symlink.


It could have still been implemented like that in the filesystem, if the binary read the name it was executed under and modified its behavior based on that. As an extreme case, on a stock Alpine Linux system, both test and [ - along with most other core system programs - are all symlinks to a single busybox binary that reads argv[0] and acts like whatever program it's been called as. I'm actually somewhat surprised that GNU didn't do that in this case; I, too, would have expected test and [ to be some manner of link to the same program, either with identical behavior or using invocation name to decide how to behave.


There's the old $0 trick. I use that for shell scripts that share a lot of code.

    if [[ $0 == "norg" ]]; then
      gron --ungron "$@"
    fi

And so on


Are norg and ungron similar to wobble and wibble?


The really funny part is running it as /bin/[ probably requires omitting the ].


nope, but it would have been as easy as trying it in the terminal to see that your guess is not correct at all


There must be reasons to use the standalone programs instead of the shell builtins. But the test builtin will be the same in, e.g., NetBSD sh as it is in Linux or Android dash. That is what I use 100% of the time. It is much faster to use builtins.


I vaguely remember that the program and built-in could not always do the same, eg the program might not have supported -e. But it’s possible that ksh was used to get -e. Does anyone remember better than I?


Much faster in theory, unnoticeable in practice. I assume nobody does number crunching using shell scripts.


You don't need to do "number crunching"; the performance differences are very real even for simple scripts on modern systems. Launching a process is comparatively expensive (read binary from disk, run linker, allocate new process space in kernel, etc.) whereas a builtin is just "if progname == "[" { .. }". And when you do something like "loop over a few thousand files" the difference really adds up.

Consider this example:

  for i in $(seq 1 10); do
      for f in /*; do
          if [ "$f" = /etc ]; then
              : # Do nothing.
          fi
      done
  done
I have 23 entries in /, so this will run [ 230 times – not a crazy number.

  % time sh test
  sh test  0.00s user 0.01s system 91% cpu 0.009 total
bash, dash, zsh: they all have roughly the same performance: "fast enough to be practically instantaneous".

But if I replace [ with /bin/[:

  % time sh test
  sh test  0.11s user 0.38s system 96% cpu 0.509 total
Half a second! That's tons slower! And you can keep "fast enough to be practically instantaneous" for thousands of files with the built-in [, whereas it will take many seconds with /bin/[.

(Aside: if I statically link [ it's about 0.4 seconds).


I know what it costs to start a process. My point is just if that script is done in 0.5 seconds that's good enough for me as an occasional human caller. I don't care that it could be much faster.

Of course if the script were to handle all files of a filesystem with many small files it could get disturbingly slow. I don't deny that there are cases where it matters. But in over 90% of the scripts I write or use it doesn't.


230 files is hardly "all files of a filesystem", and if you run [ 2 or 3 times per file it becomes even slower. Many scripts are little more than glorified for-loops, and "run this on a bunch of files" in particular is a major use case of shell scripts, which is also why we have find -print0, find -exec {} +, etc.

It's up to you whether half a second is "fast enough" (just an example: can easily also be 2 seconds, or 5 seconds), but it's definitely a lot slower than 9ms and not "only much faster in theory", or "unnoticeable in practice", and "number crunching" doesn't come in to play regardless.

Since the cost of this optimisation is minimal (you can just use the source of /bin/[ as a builtin) I don't see why anyone would choose half a second over 9ms.



