Bringing the Unix philosophy to the 21st century (2019) (kellybrazil.com)
238 points by todsacerdoti on Aug 22, 2021 | 146 comments

So I've been writing shell scripts for about two decades, about 75% of that time professionally. Parsing the unstructured, text-based output of utilities is not the problem for anyone who's had maybe a few weeks of training. Most `... | grep ... | cut ... | sed ... | awk ...`-abominations the post laments can be replaced by a single informed call to `awk`, making everything a lot more elegant and concise.

Having JSON as an intermediate representation to work on instead is not going to save anyone - what we'd really need is for all tool versions/variants output on all platforms (all GNU/Linux distros, all the BSDs, all embedded Linux variants, all commercial UNICES, etc.) to be the same, all the time. That's not going to happen, so shell scripting is going to stay messy.

Also, for my INTERACTIVE shell, anyone can try to pry free-form, text-based and semi-structured output from my cold, dead hands. JSON or YAML output might be an acceptable compromise between being easy to parse and bearable for human consumption, but for my daily work, I would rather have my tools make it easy for me, the human part in the whole equation, and not some parsing logic that might not even (need to) exist. Shell scripting provides most of its value from the fact that since I'm in its repl-of-sorts all the time, I can translate that familiarity to scripts and executables effortlessly, and I would not want that going away. But I am rather certain it would, if we had JSON (or another form of more structured data interchange syntax) adopted as the "universal" interface between UNIX tools.

You mentioned one pain point of shell scripting, variability among platforms. The other pain point is that different tools have different methods to treat errors and warnings. This makes debugging shell scripts a nightmare compared to scripting languages like python. If you need to debug your work, you need to check several levels of script vs. commands, and use different ways to check for errors depending on what is causing the problem.

If I had to decide on a tool to replace shell script I would vote for tcl, since it maintains many of the advantages of the shell but provides a better handling of the programming aspect. Unfortunately these days it seems that you either use shell or some full-scale language like python.

Tcl is also a full-scale language; the startup I was at during the dot-com wave was shipping products written in Tcl + C, just like Python.

I don’t think the issue is that it’s hard to manually parse. The problem is that it’s hard for someone else to read your ad-hoc parser years later and reason about what you did if they need to modify it.

Disclaimer: I am the author of the article and JC.

This is even more true for the ungodly long `jq` incantations that people write.

It's like I get it, the old way is ugly and not always easy to decipher but at least it's shorter and your chances of understanding it are better.

I've had both -- the classic piped chain of UNIX commands and various JSON-producing tools piped to `jq`. The former were still easier to work with.

Opinion: Today's shells should have structured data; `jq` is a symptom of a shell that can't handle structured data.


My bias: I'm the author of Next Generation Shell.

Yes, I have seen those too! That’s why I also wrote Jello, which is like jq but uses pure python without the boilerplate. Python is nearly universal now and typically easy to read, though more verbose. Jq is just as much a write-once tool as awk and perl for more complex queries. For simple attribute calls, though, it’s both terse and readable.

This is why I wrote murex shell (https://github.com/lmorg/murex), it's an alternative $SHELL, so you'd use it in place of Bash or Zsh, but it's optimised for modern DevOps tools. Which means JSON and YAML are first class citizens.

Its syntax isn't 100% POSIX compatible so there is some new stuff to learn, but it works with all the existing POSIX tools and is more readable than AWK and Perl while also being terse enough to write one-liners.

I wrote something similar by description in ruby, I’d be curious to see your python implementation.

I considered python but Ruby’s easier chaining with map/filter/etc made it easier for me to use when writing just one line with it to transform some json.


Sure thing!


Other languages are superior in their handling of maps/arrays, but Python is just so damned popular now I thought it was a good choice to democratize JSON handling.


(Sorry, this is my first time trying to do a formatted comment here.)

What I like to do is comments like:


  * Collates

  * [

  *    { 

  *        id: 4

  *        dept: 'oncology',

  *        name: 'Joe S.'

  *    }

  *    .

  *    .

  *    .

  *  ]


  *  into


  *  {

  *     4: {  // id

  *        'Joe. S': { // name

  *           dept: oncology

  *        }

  *     .     

  *     .     

  *     .     

  *   }


( Then insert horrible one-liner that does the transformation. )
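For illustration, here is a minimal Python sketch of the transformation that comment describes, with the before/after shape kept in the docstring. The function name `collate` is hypothetical, and it assumes every record carries an `id` and a `name` key; it is one way to write it, not the commenter's actual one-liner.

```python
from typing import Any


def collate(records: list[dict[str, Any]]) -> dict[Any, dict[str, dict[str, Any]]]:
    """Collates

        [{"id": 4, "dept": "oncology", "name": "Joe S."}, ...]

    into

        {4: {"Joe S.": {"dept": "oncology"}}, ...}
    """
    result: dict[Any, dict[str, dict[str, Any]]] = {}
    for record in records:
        # Everything except the two grouping keys stays in the innermost dict.
        rest = {k: v for k, v in record.items() if k not in ("id", "name")}
        result.setdefault(record["id"], {})[record["name"]] = rest
    return result
```

Two records that share an `id` end up under the same outer key, and a repeated `(id, name)` pair simply overwrites the earlier entry.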

Two spaces in front of a line will create a code block:

   * Collates
   * [
Removes all the unneeded vertical space created by inserting the extra newlines and makes the entire thing more readable. Put two spaces in front of each line that will be part of the code block and then any extra needed spaces. Some text editors can format text in this manner automatically for you.

But it's very gratifying to the person who re-invents the wheel for the millionth time and feels good because of it.

The Unix command line is the Candy Crush of interfaces, giving us a little dopamine hit every time we solve a problem that didn't need to be solved.

I can see the potential in Babashka and other Clojure-based systems from the approach point of view. There needs to be a simple integrated editor or much better readline implementation though that needs to work out of the box. That should improve the interactivity in a major way. We also need to have a set of very simple tools to work with files, network connections (e.g. traffic dumps, netcat) and other typical shell tasks. This isn't really as efficient when used interactively in the Clojure ecosystem or anywhere else from what I have seen besides just using e.g. ls, du, df, fdisk, tcpdump, nc ...

More long term, I have high hopes of having an environment above the kernel that basically does what /proc and /sys and others do currently. Programs equivalent to those in /bin /usr/bin etc. would be just dynamically loadable modules or built into the basic tool set. No shell scripts anymore, no random filesystems with custom formats for everything, just a JITed VM that has a strong set of tools but anybody can extend it with either stuff written in some Clojure-like language, or something compiling to the VM or something native that has some kind of interface (FFI?) to be usable from the programs running on the VM (e.g. for cryptographic stuff or stuff that needs to be as efficient as possible). We would also need something like SSH but for structured data that would support a shell/REPL-like workflow as a byproduct but really be meant for more or less high performance, efficient communication (e.g. useable even for large file copy operations and such). In the end, parts of this system could connect to an in-kernel VM (BPF?) and execute there but we would interact with them using the nice, structured REPL.

This would be a huge undertaking, but I can't really see how we can radically improve the efficiency of work with the current systems. It seems we are mostly just patching old approaches to do new tricks, and to me it seems to be falling apart. The complexity we impose upon ourselves is crushing and I don't think all of it is necessary.

Babashka is really nice in that it has json and yaml libraries built-in, so you end up using your Clojure data structures throughout the script, and yet still consuming/emitting json/yaml if you need to.

So basically a Lisp Machines REPL.

Can JSON compare with line-based data, which many original UNIX utilities seem to target? JSON's design assumes the user can read the entire file into memory. It's really easy to exhaust resources with JSON. And fast, crash-proof JSON parsers become more challenging to write.

Whereas it's not nearly as easy to exhaust memory with line-based data processing nor to crash utilities that read line-by-line, e.g., sed. If lines are too long, I can chop them down to a reasonable size on some sentinel.

IMO, JSON, like Javascript, is web/browser centric. For someone who rarely uses a browser or Javascript and is comfortable with UNIX, e.g., yours truly, JSON is not particularly advantageous. For large data, line-based is more robust (and memory-efficient) than JSON, IME.

Better than JSON is netstrings or bencode.



> JSON’s design assumes the user can read the entire file into memory

No? The design of most JSON libraries assumes that, but there are perfectly good incremental JSON parsers out there[1–3]. It’s just that people don’t seem to have figured out a good API for not-completely-incremental parsing (please prove me wrong here!), but this applies equally to any structured data format as soon as you want to pull out pieces of data that are nested more than one level down.

The lack of length prefixes in JSON does indeed make a solid parser somewhat more difficult, but you get the ability to author and validate it manually instead. All in all a draw and not because of the incremental parsing thing.

(Tabular or otherwise homogeneous data is indeed represented wastefully, but unless the individual records are huge json+gzip is a perfectly serviceable “worse-is-better” solution, and its self-describing nature along with support for structured cell values can at times make for a better experience than TSV. And at other times not.)

[1] https://github.com/ICRAR/ijson

[2] https://github.com/AMDmi3/jsonslicer

[3] https://github.com/danielyule/naya
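As a rough standard-library illustration of decoding JSON without holding the whole input in memory, the sketch below pulls complete values out of a stream of concatenated JSON documents as chunks arrive, using `json.JSONDecoder.raw_decode`. It is far less capable than the libraries linked above: it does not descend into a single huge document, and it assumes every top-level value is an object or an array (a bare number split across chunks would be misread). The name `iter_json_values` is made up for this example.

```python
import json
from typing import Any, Iterable, Iterator


def iter_json_values(chunks: Iterable[str]) -> Iterator[Any]:
    """Yield each complete top-level JSON object/array as soon as it is available."""
    decoder = json.JSONDecoder()
    buf = ""
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break
            try:
                value, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # The buffered value is still incomplete; wait for more input.
                break
            yield value
            buf = buf[end:]
    if buf.strip():
        raise ValueError("input ended with incomplete or invalid JSON")
```

Only the not-yet-complete tail stays buffered, so memory use is bounded by the largest single top-level value rather than by the whole stream. Re-parsing the buffer on every chunk is wasteful for very large values, which is one reason purpose-built incremental parsers exist.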

All three of those Python scripts use the same library, YAJL

It comes with an example program called json_reformat which I have experimented with in the past.

However the "reformatters" I write using only shell utilities work just as well. YMMV.

Yes, but raw YAJL (or its Yajl-Py binding, or other incremental parsers like jsmn) is a right pain to use while these are actually interesting in terms of API design.

I doubt your JSON reformatter was entirely correct (hey, we all must parse [X]HTML using regex from time to time), as you either have to pretend you can make sense of JSON using regular expressions (you can’t, no language or format supporting unbounded nesting can be regular) or write what’s essentially a standard JSON parser in shell with all the associated inefficiencies. My own needs are usually adequately served by the likes of jq -r ... | while IFS=$'\t' read -r ..., which feels much less hackish.

(Even correctly handling CSV with quoting, escaping, and embedded newlines in a UNIX pipeline without a purpose-built utility is surprisingly difficult—although not impossible, as CSV is a regular language.)
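To make the CSV point concrete, here is a small sketch using Python's standard `csv` module as the "purpose-built utility": it copes with quoted fields, doubled-quote escapes, and newlines embedded inside a field, which a naive line-by-line split would break. The helper name `read_csv_records` is only for illustration.

```python
import csv
import io


def read_csv_records(text: str) -> list[list[str]]:
    """Parse CSV text into rows, honoring quotes, escapes and embedded newlines."""
    # newline="" leaves line endings untouched so the csv module can
    # distinguish record separators from newlines inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))
```

A record such as `1,"line one\nline two"` comes back as a single row with two fields, even though it spans two physical lines.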

Thanks for the summary of problems with JSON. For ease of use, it will not likely become a de facto interchange format in the UNIX context. However not all JSON is the same. Simple JSON is easy to parse and one does not need jq or libraries, for example JSON DoH responses. People have even tried to standardise simpler JSON, e.g., jsonl, as another reply mentions. But, as I said, the design of JSON allows for and encourages complexity. If that's what you like, go for it. An HN commenter recently identified the "complexity fetish" that many programmers have.

Not everyone suffers from that fetish, fortunately. With excessively complex JSON, I prefer to extract the specific data I want. The speed of json_reformat demonstrates how slow reformatting is when done "correctly".

Every user's needs are different. jq fails to meet mine.

JSON Lines is often used for this type of thing.


> JSON's design assumes the user can read the entire file into memory

It doesn't. I have personally written JSON parsing code that runs on embedded systems with less RAM than the size of the file.

"I have personally written JSON parsing code that runs on embedded systems with less RAM than the size of the file."

I have too. I used shell utilities. :)

YAJL will also work.

PowerShell is just so much nicer to use than anything where text munging is the only way to do things, and it’s just as “pluggable” as Unix shell commands. And it can output text (or JSON or XML or YAML) or whatever you want easily by piping into relevant commands (or just not piping anywhere if you wanted text). I don’t imagine those are going anywhere anytime soon but I think it is wrong to say the system cannot be improved upon.

I had to write a powershell script and my impression was that objects are a much worse interface than the worst text manipulation tricks. Because you are passing objects around, it quickly becomes really hard to understand what is going on in the code and which object interfaces the script is using, because the object lives behind the pipe, so to speak. It was my impression that it is too easy to create unreadable powershell scripts.

It also was unbearably slow on a modern computer, to the point that it made debugging difficult.

I'm not sure I understand exactly what you mean. I would rather read Select { $_.Property } than some awk or sed stuff to fetch the same thing any day. I'm not sure how this is different than most other modern languages used to write scripts.

While you can produce unreadable spaghetti in any language, powershell is just a poor man's Python without any real language features. I'd much rather use csharp(script) if I have a need to interface with some ms stuff (.net system libraries or whatever), powershell is just underwhelming and bloated without any real performance or productivity gains.

Unix shells, for all their idiosyncrasies, tend to focus on small bits of functionality directly composed from common shell or system utilities. The complexities lie in outdated design, not in the language or form.

We could definitely do with an upgrade to the underlying tooling, but I don't find anything otherwise lacking in shell scripting. I don't agree with the OP that we need a web readable data format - perhaps the jq syntax could be useful, but the best part of shell scripting is that it is approachable by humans - if it is not, you are doing too much and should be writing your missile control program in python or c, or similar. I don't think it should be easier to produce functioning junk, nor should it be a requirement to master a data syntax in order to print a formatted date. Maybe you need a different date program if you need to do something special.

Powershell is slow to start up. Do you have any benchmark to show that it actually runs slower than reparsing the same data thousands of times as text?

I would disagree and say that while PowerShell is well...powerful, it's not as pluggable as the Unix shell. Purely due to objects vs plaintext. PowerShell works off of .NET objects thus if I want to take output from one command and send it to another I have to verify that the receiving command can in fact receive the object type I'm sending. This is where the Unix shell to me is more 'pluggable' but it makes it more muddy as well. I can pipe plain-text information to any utility but to ensure that it will do what I want and not error out I may have to slice and dice my initial output a bit.

Maybe to you pluggable means an output object from a utility being the same input object as another utility? In this case, Unix isn't pluggable at all.

Powershell is much more than that, you are only seeing the basic stuff.

It is the only shell that ships by default on modern OSes that builds up on Xerox PARC ideas.

It is not only .NET, rather anything on the OS.

It also handles COM, DLLs, OLE Automation, WMI, pluggable filesystems.

With Powershell you can easily automate something like use the currently selected cell on an Excel document, and use it as input for something else, including another active application.

Something like that would be possible on UNIX shells with DBus-like protocols, and the ability to load shared objects into the shell, but in typical UNIX fashion everyone does their own thing and thus the whole experience remains fossilized.

You can always transform the output by selecting particular properties, so the commands don’t have to expect exactly the shape of your first command’s output. This is similar to what you’d do in Unix to massage text output anyway.

> awk

Or Perl. That's what Perl was originally created for.

Or sed. If awk is too mainstream for you.

Or Python.

Or just write the ugly pipeline in your script, and promptly forget how it actually works. I've done that a lot.

> Or sed

After learning how to use ed in scripts, I've found it's actually easier to use it compared to sed because it effectively has random access through the input rather than going from beginning to end.

But both awk and Perl have the advantage of storing values in variables over sed or ed.

sed, being a stream editor, is more efficient than ed, in that it does not need to keep the entire file in memory.

True, but if the file isn't that big, then the difference isn't noticeable.

Being able to do something like

  /some string/-2 d
Which deletes a line 2 lines above where some string matched is something that's trivial to do in ed, but takes a bit of work in sed.

I know enough awk to know it solves my problem, but use it too little to remember how, and the development experience is never very clean.

What I would like is a CLI awk IDE. (Rolls off the tongue, right?)

Ideally I could write ... | awk-ide "scripts/thing.awk" | ...

If it exists, run it. Otherwise have it open an editor session that runs an awk program as I type it on some buffered data, shows the result, and has some hints on what syntax/variables are available.

I use this occasionally in Emacs with awk-mode to live edit awk programs against a data file:


What about something like

  while inotifywait --event modify,move_self,delete_self script.awk; do awk -f script.awk input.txt; done
in a tmux pane on the right with vim/emacs/whatever on the left?

Note: You need to watch the move_self and delete_self events, since editors might use some file move/deletion dance instead of a direct modify to help prevent data loss in the event of a crash.

I wrote something similar to this to query JSON and JSON lines with python instead of awk for text. It’s called Jellex (Jello Explorer) which is a TUI front-end to Jello. Jello is a python analog to JQ.


> what we'd really need is for all tool versions/variants output on all platforms (all GNU/Linux distros, all the BSDs, all embedded Linux variants, all commercial UNICES, etc.) to be the same, all the time

The industry has approximated this state of affairs by approximating a GNU+Linux monoculture. Doesn’t matter that BSD tar behaves differently if ~nobody uses it.

Nobody except OSX, which means the great majority of Unix desktops use BSD tar.

HP-UX, AIX and Solaris are still around.

Plus there are a couple of POSIX like OSes for embedded deployment.

About the few weeks of training you mentioned for Unix-style parsing of unstructured text - what resources do you recommend?

The GNU Awk manual has several practical examples in it - as manuals should. (Looking at you, nearly all man pages in linux distros.)

The original 1988 book The AWK Programming Language by A, W and K is, in my opinion, one of the finest pieces of documentation ever written, on any subject. A joy to read and be instructed by.

And $722 AUD on Amazon. Did they never make a second printing?


I have occasionally dipped into this book for instruction and consider it a freely accessible reference:

Unix Text Processing, Dougherty and O'Reilly, 1987 https://www.oreilly.com/openbook/utp/

Cherry pick chapters like those on the Shell and AWK, and avoid those on troff, macros etc unless specifically interested.

I can understand why the idea of more structured, object-like input and output is appealing, but after using PowerShell for a while, my take is that it's much harder to manipulate objects into a consistent format than it is to manipulate text. For instance, if you want to compare Azure DNS records with DNS records from a Windows server, it's a huge pain because Get-AzDnsRecordSet and Get-DnsServerResourceRecord return objects with different structures. Same problem if you want to pipe output of one util to another which expects slightly different format of input. More generally, text is great for loose coupling; structured objects, less so.

> More generally, text is great for loose coupling; structured objects, less so.

I disagree. Parsing "loose coupled" text and converting it to the format that a different tool expects is a rather non-trivial problem, and one that's often poorly specified to begin with; converting structured outputs is generally straightforward in comparison.

Sure, converting (or interpreting) structured data is easier than unstructured. I doubt anyone is arguing that. However, I interpret op's statement as saying that unstructured data makes it easier to couple unrelated tools: tools which perhaps haven't been written yet, by teams that don't know of each other and hence are unable to agree on a structure.

I feel like if that were true then you’d have the majority of REST APIs delivering unstructured output instead of structured JSON, XML, CSV, etc. The market has spoken - there is true utility in structured text output.

I think that's why the article proposes adding command line options and alternative APIs to output JSON objects, not forcing all things to use objects all the time.

For any serious data retrieving or scripting the light object wrapping will be superior. For quick one off use maybe the plain text option will be quicker to reason about.

It's that the languages suck.

I've tried writing a language around this but it's... rusty. You kind of want to pipe Get-AzDnsRecordSet and Get-DnsServerResourceRecord to the same interface structure and then build another transform to some other structure from the interface... The program wires together the transforms. So the programming paradigm becomes rather unconventional, you only write transformations of “events”, you address other shell commands by listening to outputs of their “eventspace,” etc. ... You get a sort of strange logic programming language where the pipe operator has to be slightly specialized every time because it always needs to transform a little as it pipes.

You can always just thunk down to the text representation and do things that way. Having the structure is a strict plus, no?

This is a much more complex text representation.

Mandating tab separated columns with a consistent quoting for embedded spaces would be a net benefit. And it would match today's tools well.

There are better shells out there for handling structural data, like Murex and Elvish

No seriously, there really are. I appreciate it requires learning a new tool but rather than downvoting me how about those unconvinced ask me questions instead? I’m happy to answer.

While there will always be a need to keep Bash around for compatibility, there are a plethora of other tools out there that solve many of the shortcomings of POSIX.

Are you using one of these exclusively? Which one have you chosen and why?

I use murex exclusively. It’s still beta quality so not totally bug free but the bugs are rarely showstoppers and it’s still in active development so fixes are usually just one GitHub issue away.

I don’t use it on servers, I stick with Bash for that. But I do use it as my primary local shell.

As for why murex, that’s a combination of personal preference and simply not being aware of other shells until after I’d already started using murex.

What I like about it is it’s typed and does a lot of the boring stuff automatically in the pipeline based on what the type is (JSON, YAML, CSVs etc are all “types”). So working with a JSON file is as natural as working with a flat text file. But also it’s trivial to bypass the clever stuff and use it like a dumb bash shell too. Which is where Powershell and its ilk fall down.

Thank you, I'll check out Murex.

Why not go all the way and use a format capable of expressing code and data? I refer, of course, to S-expressions.

They also have the benefit of properly handling numbers. Some might look at the absence of maps as a negative, but I think alists are preferable anyway due to their constant ordering.

Yes. I never understood why JSON over s-exprs. The absence of maps is not a negative. S-exprs can represent maps. There are no maps in JSON, really anyway. It is just text. How that data is represented in memory is the output of parsing. You could just as well parse (dict (a 1)(b 2)(c 3)) into a hash table if you wanted. You could also have sets (set 1 2 3) or whatever other data structure.

You could do any of those things, but you have to pick a convention and other people have to agree on it.

JSON is nice in that it has just enough structure to do a good number of tasks in one obvious way. The biggest omission is probably some kind of time and/or date type (but ISO8601 in a string is the obvious solution there).

It’s not a coincidence that JSON was reverse-engineered from a language with convenient literals for dictionaries and arrays, and most languages provide those two collection types because they cover most use cases, so JSON fits most languages fairly well.

It’s just handy having both arrays and dictionaries available, rather than stretching one data structure to cover both, whatever Lua or Lisp might say.

You parse the s-exprs and execute them in the context of a namespace of data constructors. Then you can have whatever data structures in memory that are defined by the constructors. This is NOT equivalent to having one data structure to cover both as Lua does. It is having one text format that can construct any kind of data structure in memory for which you have constructors defined.
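A toy Python sketch of that idea, assuming a deliberately tiny S-expression dialect (bare symbols and integers only, no string quoting): the text is parsed into nested lists, and a list whose head names a known constructor is built into the corresponding in-memory structure. The `CONSTRUCTORS` table and the function names are invented for this example and are not taken from any real library.

```python
from typing import Any

# Namespace of data constructors; whoever reads the data decides what is in here.
CONSTRUCTORS = {
    "dict": lambda *pairs: {key: value for key, value in pairs},
    "set": lambda *items: set(items),
}


def tokenize(text: str) -> list[str]:
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: list[str]) -> Any:
    """Turn a token list into nested Python lists, ints and symbol strings."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return items
    try:
        return int(token)
    except ValueError:
        return token


def build(node: Any) -> Any:
    """Apply constructors bottom-up; lists with an unknown head stay plain lists."""
    if not isinstance(node, list):
        return node
    if node and isinstance(node[0], str) and node[0] in CONSTRUCTORS:
        return CONSTRUCTORS[node[0]](*(build(child) for child in node[1:]))
    return [build(child) for child in node]


def read_sexpr(text: str) -> Any:
    return build(parse(tokenize(text)))
```

With this table, `read_sexpr("(dict (a 1)(b 2)(c 3))")` produces a hash table and `read_sexpr("(set 1 2 3)")` a set. The reply that follows still applies: both sides of a pipe have to agree on which constructors exist.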

Okay, but we’re talking about a lingua franca for exchanging plain old data between different programs. You would have to pick your data constructors in advance.

I’m saying that arrays and maps give good bang for the buck, so you don’t really need to define anything beyond those. And if you accept that, having special syntax for arrays and maps is more convenient and readable than S-exprs.

> And if you accept that, having special syntax for arrays and maps is more convenient and readable than S-exprs.

S-expressions are arrays, and maps are really just degenerate unsorted arrays of key-value pairs. Taking a look at https://json.org/example.html, I think this is easily more readable:

    (menu
      (id file)
      (value File)
      (popup
        (item (value New) (onclick "CreateNewDoc()"))
        (item (value Open) (onclick "OpenDoc()"))
        (item (value Close) (onclick "CloseDoc()"))))

    {"menu": {
      "id": "file",
      "value": "File",
      "popup": {
        "menuitem": [
          {"value": "New", "onclick": "CreateNewDoc()"},
          {"value": "Open", "onclick": "OpenDoc()"},
          {"value": "Close", "onclick": "CloseDoc()"}
        ]
      }
    }}

Each to their own!

A big part of the difference is that all the JSON keys are quoted, which I agree is ugly (I like JSON5 myself).

You’ve also omitted the “menuitem” from the S-expr version. That could have been omitted from the JSON too but I assume it’s meant to be there for some good reason.

Fear of parentheses, basically.

But json already has quotes, commas, brackets and braces :-)

Yeah, apparently finding misplaced quotes, commas, brackets and braces is magically easier than misplaced parentheses. :)

You jest, but I find it easier to visually parse code organized using parentheses, quotes, commas, and brackets compared to code organized using parentheses, parentheses, parentheses, and parentheses. The latter approach is simple and elegant but makes it hard for me to read someone else's code.

I think the first reason is that the distinct characters provide a kind of visual checksum which makes me (slightly) more confident when initially matching a closing character to the correct starting character.

The second reason is that each character has a conventional meaning which makes it possible to form an initial guess as to its purpose.

Having said this, I freely admit this opinion may be colored by my only (paid) experience writing Lisp which was for a sprawling 20 year-old AI codebase mostly written by professors and graduate students who were often learning Lisp as they went.

That’s a good debate to have. I settled on JSON due to its readability and ubiquity in web APIs. It’s something people all the way down and up the stack are very familiar with these days.

True, folks are very familiar with JSON, but it does have problems, and the best time to pick the best solution is before one has to deprecate a lesser solution. Computing is built atop a pile of decisions which made sense at the time and cannot be changed now due to compatibility (spaces in Makefiles, anyone?): there is no time like the present to simply choose to do the most correct thing.

'But folks won't use it!' Well, they might not. But if one gives folks a choice between the capability they need using an unfamiliar technology and not having the capability at all, they will learn the unfamiliar tech.

Especially because it keeps being forgotten that Lisp Machines and Interlisp-D workstation shells were basically a graphical REPL.

To put it in 2021 terms, Jupyter Notebooks in the 1980s instead of PDP-11 green phosphor terminals.

Even the best Unix shells are mere toys in comparison to what Lisp machines were doing in the 80s. I don't wanna say that using a Lisp machine was "life changing" or anything but let's just say I don't like Linux anymore.

I may look very anti-UNIX; ironically, I was big into UNIX, and my first experience was with Xenix.

However, my university library had a very good section on all kinds of OSes and programming languages since the dawn of computing, thus I could see how the future might have been, and there was definitely a much better path, especially when coupled with my own experience across Amiga, Windows and Mac OS.

It is like the Blub paradox, just applied to OSes.

I'd rather avoid mixing the two. Shell injection is already a danger.

Mostly because S-expressions are terribly hard for humans to read... and everyone is already familiar with javascript syntax.

I feel that depends on the formatting. Non-pretty JSON isn't very readable either.

  (dict :name apple :ip :nested (dict :property value ) )

is almost JSON.

Rivest's proposal includes an "advanced transport" representation, a more human-readable version. Here's a sample: http://people.csail.mit.edu/rivest/sexp-sample-a

S-expressions aren't inherently more difficult to read than JSON, it's just a matter of getting used to it.

Chinese isn't more difficult to read than English; it's just a matter of getting used to it.

Obligatory links:

Rivest's proposal, with source code: http://people.csail.mit.edu/rivest/sexp.html

McCarthy's Common Business Communication Language: http://jmc.stanford.edu/articles/cbcl.html

> (iii) [...] Don't hesitate to throw away clumsy parts and rebuild them.

Some forgotten wisdom right there.

Uh ... what about when I don't want to load the entire stream into memory before the next stage starts running? Are these implicit json arrays that stream out?

Or do we now have incompatible json shell tools and streaming text tools as a permanent fixture?

I'm excited about structured output ideas, but json? I'd much rather have streams of whitespace separated words than json. That's in that "No type system is better than a bad type system" metaphorical area.

I'll take grepping with a theoretically brittle regex over this jq[1] any day of the week.

[1] jq -nc --stream 'inputs | select(length==2) | select( [.[0][0,2,4]] == ["results", "data", "row"]) | [ .[0][6], .[1]] '

I agree with you in broad strokes, but as a piece of anecdata: I've had a lot of success building tools that emit and consume JSONL[1] instead of entire JSON documents. JSONL preserves the Unix pipeline's inherently parallel design (people tend to forget this, even when waxing about the Unix philosophy!) but gives us all of the nice typing of a JSON stream.

That being said, I too will take a `sed` or `awk` one-liner over some of the `jq` monstrosities that I've seen.

[1]: https://jsonlines.org/
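To make the JSONL point concrete, here is a minimal sketch of a pipeline stage that consumes JSON Lines one record at a time, so the next stage can start before the whole input has arrived. The record fields (`name`, `size`) and the helper names `read_jsonl` and `big_files` are made up for illustration; the `io.StringIO` sample stands in for `sys.stdin`.

```python
import io
import json
from typing import Iterator, TextIO


def read_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, without buffering the whole input."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def big_files(records, min_size):
    """Pass through only records whose (hypothetical) 'size' field is large enough."""
    for rec in records:
        if rec["size"] >= min_size:
            yield rec


# Stand-in for sys.stdin: three JSON Lines records.
sample = io.StringIO(
    '{"name": "a.txt", "size": 120}\n'
    '{"name": "b.log", "size": 4096}\n'
    '{"name": "c.bin", "size": 9000}\n'
)
selected = [rec["name"] for rec in big_files(read_jsonl(sample), 1000)]
print(selected)
```

Because both functions are generators, each record flows through the filter as soon as its line has been read, which is roughly the behaviour a text pipeline gives you for free.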

Yeah, if we can't get something like edn (json but well designed and extensible) then jsonl is at least OK.

> people tend to forget this

There are people with strong opinions about this that don't even understand why you need something like JSONL in the first place and I'm surprised by that. If they never use unix pipelines I don't get why they feel we want to hear their hot take about how they should be redesigned.

I agree JSON is probably not right for every type of program output, but the age of web APIs has shown us that it is probably great or adequate 90% of the time. If something is spewing out long lines of data I think JSON Lines would be a good option so you don’t need to read the whole structure into RAM. But any other structured output that has a healthy community and ecosystem supporting it would be better than just space-delimited lines, or worse - groups of lines you need to deal with.

For the first point, there's no reason why we can't use JSON array wrapped output. In fact, we likely should use something like this for uniformity. Loading everything into memory is also not too problematic assuming a working swap space and a reasonably well-architected output schema.

For the second point, whitespace sensitivity is the one mistake that greatly pisses me off with Unix. I should be able to pass arguments and filenames with as many spaces as I want. We are in the 21st century and occasionally do use spaces in filenames.

While you're at it, bring Unix into the 21st century, in which we use Unicode (UTF-8 encoding). The number of Unix utilities that badly support Unicode is, well, painful to those of us who deal with non-ASCII data all the time.

Which utilities don’t support UTF8? I’ve never had encoding issues when using coreutil binaries e.g. grep or sed or awk. I guess if you want to display emojis or something, that might not work properly, but that’s an issue with your terminal, not with the utility you’re running.

Unix "2.0" (plan9) invented utf-8.

That `jc` command that converts non-json output to json is neat. But, um, seems slightly kludgy to be obvious. What if the `ifconfig` format changes slightly. I suppose it's a stop-gap until all commands have a json output option.

If the ifconfig output changes slightly, then it will probably break the thousands of ad-hoc parser implementations buried in scripts worldwide. With this approach, only a central parser library (open source) needs to be maintained so it can be fixed quickly and robustly.

That being said, the goal is that command line tools that output useful data for scripts should have a structured output option like JSON while still keeping the human text output option so something like JC doesn’t even need to exist.

You have a point, but when I write a script to parse a command's output I don't read/parse every line. I'd probably grep for the one line I want.

True. When I try it with `date` on Debian Sid I get a parser error:

    $ jc -p date
    jc:  Error - date parser could not parse the input data. Did you use the correct parser?
             For details use the -d or -dd option.

Odd. What locale are you using? Should work fine with C or en_US.UTF-8, per the readme.

    $ echo $LANG
    $ date
    2021-08-23T08:07:02 CEST

Ah, yes - that locale is not supported. From the Caveats section:

For best results set the LANG locale environment variable to `C` or `en_US.UTF-8`. For example, either by setting directly on the command-line:

    $ LANG=C date | jc --date

It is possible to add support for more locales and you can always override the built in parsers with your own plugin to support it.

Yea, however this means that @ape4's point still stands:

> But, um, seems slightly kludgy to be obvious. What if the `ifconfig` format changes slightly.

Absolutely! I didn't write JC to be the be-all end-all. I wrote it because I believe using structured text between processes is usually better than plain text and this tool allows people to try it out and see for themselves. The goal is for people to see the benefits and require tools to output structured data, either for old utilities, or especially new ones.

Lots of people have asked for this, but the argument has always been that it can't be done, it's too hard. So I created JC to help open minds and change behavior. As I've said many times: the goal of JC is for JC to never have to exist. Hopefully it will persuade people that there is a better way and we should expect better from our existing tools without having to completely change the way we do things today.

JC supports over 70 programs and file-types today. I slowly, incrementally added more and more parsers over the last two years. Now it's to the point that it's hard to find popular apps that don't have coverage. Lately, I get requests to add parsers for apps that already provide JSON output, which I don't do. :)

And those old apps like `ifconfig`? Many of those haven't been touched in a decade. There's not a huge risk of the output changing any time soon. (Believe me, I know - I went through the source code in several of these utilities to be able to figure out what to call some of the undocumented fields)

What I'm trying to say is that there are 100 reasons to say it won't work until you actually try it and you find that it actually works pretty well and opens up possibilities you hadn't thought of before.

This `date | jc --date` seems to be the way

FreeBSD has supported JSON output for various tools in its base system via `libxo` for quite some time now.

Article from 2019 basically advertises https://github.com/kellyjonbrazil/jc which converts output of many unix cli commands into json, depends on Python

"Up until about 2013 it made just as much sense as anything to assume unstructured text was a good way to output data at the command line..."

"But in 2013 a certain data format called JSON was standardized as ECMA-404..."

"Had JSON been around when I was born in the 1970’s Ken Thompson and Dennis Ritchie may very well have embraced it as a recommended output format to help programs “do one thing well” in a pipeline."

This whole post hinges on the theory that JSON is a revolutionary technology that no one had created something like before, and no one even considered creating before.

But that seems completely wrong, right?

"The Xerox Network Systems Courier technology in the early 1980s influenced the first widely adopted standard. Sun Microsystems published the External Data Representation (XDR) in 1987. XDR is an open format, and standardized as STD 67 (RFC 4506)"


I don’t follow. Your examples are of data formats standardized in the 80’s while Unix was developed in the 60’s and 70’s. JSON even existed before 2013, but the fact that it became a standard in addition to being popular is the point I was making.

I've spent a lot of time reverse engineering long dead hardware (and forgotten protocols), which means I've also spent a lot of time reading ancient papers generated by IBM, Bell, Arpa (so many long gone DoD programs), etc. So I sometimes have a hard time distinguishing between what is common knowledge and what is something that only myself and maybe a handful of others care about...

That said, you know that every potential variation of every possible approach to accomplish what I presume your objectives are has already been exhaustively explored, documented, implemented, and finally abandoned by organizations with functionally limitless resources (aka open-ended government contracts) - 50 years ago? Have you considered leveraging some of that work? I don't think many people know about the massive amount of work already done - that was just abandoned for a variety of reasons: reasons that often no longer apply, and very rarely have anything to do with the technology's utility. For example: everybody here knows about the OSI model and how sparsely filled out it is - but did you know that there is a layer set aside to do exactly what you are talking about, and that it just isn't being used? That's right, #6, the presentation layer - specifically the virtual terminal protocol: which is where the designers wanted the object exchange and structured data to go... not in a mess of json one layer up. The VTP was outlined in several papers going back to at least '72. You could also lean on the Air Force's work for your data model, they ran that to ground pretty thoroughly with IDEF. You've also got a huge amount of free work from IBM when it comes to structured documents, and architecture that lends itself to semantic reasoning.

Anyway, my larger point is that instead of exacerbating the issue of wastefully bloated software teetering on increasingly high layers of abstractions (do a stack trace and consider the insanity of it), maybe the way to really improve our circumstances has already been discovered and then lost for a time. It would be no more difficult than trying to make kornshell-json a thing.

A Survey of Terminal Protocols (1979) DOI 10.1016/0376-5075(79)90001-1

Computer Network Architectures and Protocols (1983) DOI 10.1007/978-1-4615-6698-4

IDEF: https://en.wikipedia.org/wiki/IDEF

IBM's various journals (pre-'00) are also worth reading.

The first version of UNIX may have been created in 1969, but it continued to evolve for 20ish years, and I don't think 1969 UNIX really resembled what we had in 1989. But then the UNIX world stagnated, because Linux people were obsessed with cloning System V. Sometime during the Linux era, text became "cheap" enough to be used as a data format. So perhaps I'm wrong, and JSON's time has come in the UNIX world after all.

I was literally just thinking about this a few days ago. I'm super excited by https://www.nushell.sh/ . I think they are hitting on an order-of-magnitude improvement in the shell paradigm that fits very nicely with the theme of this article.

This reminds me of powershell

I've always been looking for anushell.


This goes against the philosophy explicitly mentioned in OP's article. E.g. avoid tabular formats.

This is systemd against init again. A powerful, but overreaching shell that becomes unreplaceable and bloated with concerns. In contrast, traditional *nix/GNU programs work well, and interact well, in every shell.

You see traditional Unix/GNU programs as working well and interacting well. I see a programming paradigm designed for six-char identifiers, not designed for whitespace and punctuation in names, running in something that emulates a physical terminal that emulated a paper-based terminal. Terseness is there, but ergonomics of use for non-utter-experts could be improved.

I think that the shell paradigm could be so much better. If there's a momentum to make a change from that, I'll jump on the bandwagon. Any change is better than the current state of the shell.

Nushell has a 'to json' command. The tables are just nice to look at, it's not a data format they're expecting people to parse.

FWIW, "ip -j route" should work correctly since v4.17:


Thankfully he didn't propose XML. Unfortunately it looks like he (and many others) really thinks that JSON is better, even if it's underspecified, thus leading to possible insecurities. See the recent jsonsec thread. (Undefined key ordering and duplicate handling)

So I have to bring in jsmn.h to parse protocols? Sorry no. Been there, done that. We are pushing too much unnecessary JSON around already.

Unix is also about KISS. Structured data are lines and paragraphs.

> even if it's underspecified, thus leading to possible insecurities

what it replaces (unstructured text) is much less secure so I don't think that counts against json

But at least the parsers are very, very battle tested.

Parsing ad-hoc text formats always seems very fragile, as a single space character (or even a newline character) could easily break the pipeline. Using a proper data structure could make scripts work more "properly" when encountering these edge cases.

Which format to use as a representation of the data structure might be debatable, but I think JSON is a reasonable choice. (Formatted) JSON is readable by humans and can be easily parsed by programs.

So, this is not "bringing it to the 21st century." I feel like what would actually do that is related to the idea of don't try to succeed perfectly, try to fail elegantly?

I want some sort of non-destructive way for "the shell" to take a best guess at what's going on when I make a typo, or when the data's not formatted quite right, like a live linter that's there all the time.

I'm more impressed by

    12 Principles for a Diverging Desktop Future [dd]

which is the vision for the Arcan project.

[dd]: https://www.divergent-desktop.org/blog/2020/08/10/principles...

The idea is good, but the approach cannot execute sufficiently. First, jc has to individually support each program and program mode, so improvements are bottlenecked on jc. Every time you load a program you need to load all jc parsers. I also wonder how resistant the parsing is to surprises (filenames with bracket characters in ls output, for example).

Second, this does not handle interprogram piping (e.g. 'find . | xargs dosomething'). We have filtering to console but little piping (we can map the json output to a single column and pipe that, using the textual interface with all its deficiencies).

To be clear, the author has no choice (rewriting all userland is a daunting concept), but a proper approach for 21st century Unix would be for all utilities to output and receive json (or whatever format is chosen) natively, and that the console would then be able to mangle that into a textual format for the user.

Apparently, Plan9 has failed in this regard. What went wrong?

I guess the team moved on to Inferno, plus these kinds of shells don't play well with UNIX-related culture.

Xerox PARC workstations and Lisp Machines already solved the problem via their REPLs.

This is how you "integrate" `jc` with Next Generation Shell ( https://github.com/ngs-lang/ngs ):

    data = ``jc PROGRAM ARGS ...``

The double-backtick syntax runs the external program, `jc` in our case, and parses the output. It means that the "integration" is not `jc` specific.

`data` is now structured data that comes from the parsed JSON.

Example (run from your shell):

    ngs -pl '``jc ifconfig``.filter({"name": /docker/}).ipv4_addr'

Will print IPs of all docker interfaces, one IP per line

Related, I am wondering if somebody is aware of a project to wrap CLI utilities in a minimal 'TUI', along the lines of this: https://github.com/chriskiehl/Gooey , but staying in the terminal. You'd pass the name of the program to run as the first argument. Haven't thought this through fully.

EDIT: well, some discussion about this: https://github.com/chriskiehl/Gooey/issues/296

With all the discussion here about object shells, it's a bit surprising that the project page doesn't put stronger emphasis on the fact that this (jc) is also a Python library that can take in the output of subprocess.Popen, rather than fooling about with the mess that is jq. That said, Python isn't especially nice for calling external programs. I can't wait for nushell/oil shell/elvish to hit maturity.
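For what it's worth, the "consume the output from Python instead of jq" idea can be sketched with only the standard library. This does not use jc itself: it assumes the command already prints JSON (as `ip -j route` or a `jc` pipeline would), and a child Python process stands in for such a tool so the snippet runs anywhere. `run_json`, `fake_tool`, and the interface records are hypothetical names and data.

```python
import json
import subprocess
import sys


def run_json(cmd):
    """Run a command that prints JSON on stdout and return the parsed value."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


# Stand-in for a JSON-emitting tool: a child Python process printing a fixed document.
fake_tool = [
    sys.executable,
    "-c",
    'import json; print(json.dumps([{"name": "docker0", "up": True}, {"name": "lo", "up": True}]))',
]
interfaces = run_json(fake_tool)

# Ordinary list/dict operations replace the jq filter expression.
docker_names = [i["name"] for i in interfaces if "docker" in i["name"]]
print(docker_names)
```

With `check=True`, a non-zero exit status raises `subprocess.CalledProcessError`, so error handling goes through one mechanism rather than per-tool conventions.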

If I could wish for only one thing in the context of shell scripts, it would be to be able to separate passing back the actual result of a function from stdout/stderr/logging without messing with boilerplate around non-standard file descriptors/named pipes/temp files etc.

What, besides the actual result, do you write to stdout? (And shouldn't it go to stderr instead?)

Typically informational messages/log messages. Ideally, yes, logging should go to stderr, but some applications that intercept the file descriptors treat anything written to stderr as an error, and it is not always possible to change the handling for those applications. E.g. Octopus Deploy[0] does this: it doesn't outright fail the step, assuming other "successful" steps happen after that, but if the last command writes to stderr, it would; I use explicit exit statements as a workaround to handle this.

[0]: https://octopus.com/

I guess I just wanted to be on a high horse and declare that anybody who puts something other than what you asked for on stdout is doing it WRONG. But you're right, fixing the universe to fit the convention isn't always in the cards.

If I could design it all from scratch I'd have a convention for getting a program to tell me:

- info, user prompts, debug -> stderr

- foo data, bar data -> stdout

That way I could use those defaults sometimes, but other times specify per-execution preferences that would route the data differently.
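As a rough illustration of that default convention (data on stdout, everything else on stderr), here is a small sketch. The function `list_squares` and its messages are invented for the example, and the `redirect_*` context managers only stand in for what the caller's shell would do with `>` and `2>`.

```python
import contextlib
import io
import sys


def list_squares(n):
    """Emit results on stdout and diagnostics on stderr."""
    print(f"computing {n} squares", file=sys.stderr)  # info/debug channel
    for i in range(n):
        print(i * i)  # the actual data


# Capture the two channels separately, as a caller redirecting them would.
out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    list_squares(3)

data_lines = out.getvalue().splitlines()
log_lines = err.getvalue().splitlines()
```

Per-execution routing then stays with the caller: it can discard, merge, or log either channel without the program knowing.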

I was looking more into this and found something interesting[0]:

> A variable can be assigned the nameref attribute using the -n option to the declare or local builtin commands (see Bash Builtins) to create a nameref, or a reference to another variable. This allows variables to be manipulated indirectly. Whenever the nameref variable is referenced, assigned to, unset, or has its attributes modified (other than using or changing the nameref attribute itself), the operation is actually performed on the variable specified by the nameref variable’s value. A nameref is commonly used within shell functions to refer to a variable whose name is passed as an argument to the function.

So something like below:

    set_var() {
        local -n vname=$1   # use nameref for indirection
        # some processing and assign value
        vname="some value"  # placeholder assignment, for illustration only
    }

    use_var() {
        local output
        set_var output # call function to populate the variable
        echo "output=$output"
        unset -n output
    }


I guess this is pretty close to "clean". Unfortunately, it is only available from bash 4.3 onward, if I'm not wrong, so I don't have the option of using it (our hosts use 4.2.x).

[0]: https://www.gnu.org/software/bash/manual/html_node/Shell-Par...

I have noticed that people intuitively use the unix philosophy _inside_ their programs. You don't often see well-written software with functions that do not respect the unix philosophy (if only they were programs and not functions).

Emacs is in many ways anti-Unix-philosophy and I think the way its software tends to be structured is no exception. Functions can get ad-hoc extensions or modifications all over the place with advice and hooks and dynamic scoping[1]. The only similarity is that in Unix most things are files and in Emacs most things are buffers.

[1] Unix has weird ad-hoc mechanisms too like environment variables or your PATH containing modified versions of programs, and it has programs that do a thousand and one things, but I claim those are mostly violations of the Unix philosophy.

The Unix philosophy includes the use of software libraries in one's programs, albeit only resorted to when truly necessary. That often involves resorting to many of these tricks, usually for the sake of greater software reuse.

A philosophy that is cargo cult, as it was hardly ever followed on commercial UNIX systems.

I agree that they handle some amount of unix philosophy intuitively. But, for example, lots of command line programs don't handle SIGPIPE.
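A sketch of what handling a closed pipe can look like in a Python tool: stop writing quietly when the reader goes away (think `tool | head`). In Python the condition surfaces as `BrokenPipeError`. `emit` and the `ClosedAfter` class are hypothetical; the class only simulates a pipe whose reader exits after a couple of writes, so the example runs without a real pipeline.

```python
import sys


def emit(lines, out=sys.stdout):
    """Write lines until done or until the reader closes the pipe; return the count written."""
    written = 0
    try:
        for line in lines:
            out.write(line + "\n")
            written += 1
        out.flush()
    except BrokenPipeError:
        # Downstream exited early: stop quietly instead of dumping a traceback.
        pass
    return written


class ClosedAfter:
    """Hypothetical stand-in for a pipe whose reader exits after `limit` writes."""

    def __init__(self, limit):
        self.limit = limit
        self.seen = []

    def write(self, text):
        if len(self.seen) >= self.limit:
            raise BrokenPipeError
        self.seen.append(text)

    def flush(self):
        pass


pipe = ClosedAfter(2)
count = emit((str(i) for i in range(1000)), pipe)
```

On a real pipe the writes are buffered, so the count is only approximate, and a real script may also need to deal with the interpreter's final flush of stdout at exit.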

Why in the world would you pipe through grep to select certain lines and then use awk? You could just select lines with awk. The same could be said of cut. If you want to use grep and cut that's fine, but using grep and cut with awk implies you are doing it wrong.

Cut for field 1 (`cut -d/ -f1`) is a cheesy way of saying we want to delimit fields with '/' too. So add that to awk via a script variable or via the -F command-line option.

It's probably not the most correct, but it is a very common practice[0].


Huh. I like that. jc.

JSON is a terrible intermediary format since its structure is incompatible with streaming data; and demands full structure parsing before consuming the data.

The input and output of jq are JSON streams:

> jq filters run on a stream of JSON data. The input to jq is parsed as a sequence of whitespace-separated JSON values which are passed through the provided filter one at a time. The output(s) of the filter are written to standard out, again as a sequence of whitespace-separated JSON data.
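The "sequence of whitespace-separated JSON values" model from that quote can be imitated in a few lines with the standard library's `json.JSONDecoder.raw_decode`. This is only a sketch of the idea, not how jq is implemented: `iter_json_values` is a made-up helper, and it works on text already in memory, so a truly incremental reader would have to feed it chunks and handle values split across chunk boundaries.

```python
import json


def iter_json_values(text):
    """Yield each JSON value from a string of whitespace-separated values."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        value, pos = decoder.raw_decode(text, pos)
        yield value


stream = '{"a": 1} {"a": 2}\n[3, 4]\n"five" 6'
values = list(iter_json_values(stream))
print(values)
```

Each value is handed to the consumer as soon as it has been decoded, which is the property that matters for pipelines.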


Structured data people should prepare for a rude awakening from deep learning based tools. You can’t even regex into the working of a convolutional or recurrent neural network. The map will never be the territory.
