
Null separators between fields are structured data. It might just be the simplest structure that can possibly work (because the null character is outside the normal data range for text, unlike the newline, which demonstrably doesn't work in that sense), but it is structure. A list structure, specifically.

The world would be a much happier place if -print0 (or whatever) was universal, but no such luck.
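
For example, with find and xargs (a sketch assuming the common -print0/-0 flags, which both GNU and BSD ship):

  # NUL-terminate each filename so names containing spaces or
  # newlines survive the trip through the pipe
  find . -type f -print0 | xargs -0 grep -l 'pattern'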




The painful part of all this is that there were and still are actual record separator and field separator characters in ASCII.

ASCII was perfect for table output. All tools had to do when outputting was use the standard, existing separator characters when no TTY was detected, and tabs/spaces and newlines otherwise.

I always wondered why no one uses the ASCII RS and FS codes.
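
A sketch of how that could look (my own example; strictly, US, octal 037, is the unit/field-level separator and RS, octal 036, the record separator, and awk accepts both as octal escapes):

  printf 'alice\03730\036bob\03725\036' |
    awk 'BEGIN { FS = "\037"; RS = "\036" } NF { print $1 " is " $2 }'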


That ... is an excellent question.


How would you deal with a tree?



That seems like a text format on its own? It doesn't even use RS/FS. Even when translating, it uses { (),; }: that's 4 characters to indicate a tree. Maybe you could use

  FS=, RS=; GS=( US=)
Is that what you mean?


Who cares?

It deals with 99.99% of the use cases that pipes are being used for right now.

That's more than good enough.

If you really needed to output a tree using only the RS and FS characters, and since they are only a single byte each, you could cover that 1-in-a-thousand use case by using empty fields for the depth of a record, and attaching any record to the current parent until the depth changes.
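
A rough sketch of that encoding (my own reading of it, toy-sized): a record's depth is its number of leading empty fields:

  printf 'root\036\037child\036\037\037grandchild\036\037sibling\036' |
    awk 'BEGIN { RS = "\036"; FS = "\037" }
    NF {
      d = 0
      while (d < NF && $(d + 1) == "") d++   # leading empty fields = depth
      indent = ""
      for (i = 0; i < d; i++) indent = indent "  "
      print indent $(d + 1)
    }'

which prints the tree as an indented outline.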


I care. I wrote a converter[1] for it a long time ago, but when I wanted to write a JSON one I got stuck.

Depth of a record seems useful, but then you'd have to make fields mandatory to distinguish between real records and depth markers, slowing down the parsing.

[1]: https://github.com/Nomarian/AsciiDT


Sure, but the reason it works with any tool is because it's generic and simple. If each tool had to implement anything more sophisticated, like a JSON parser and serializer, it would be a nightmare to maintain. Projects like Nushell essentially need to handle every type of output from any command, and every type of input to any command, which is an absurd amount of work, and just not scalable. Subtle changes in this strict contract mean that pipelines will break, or just not be well supported[1].

If programs simply input and output unstructured data, it's up to the user to (de)structure this data in any way they need. The loose coupling is a feature, not a bug.
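
For example, destructuring ps output ad hoc (a sketch; the column layout varies between implementations):

  # pull the PID and command columns out of typical `ps aux` output
  ps aux | awk 'NR > 1 { print $2, $11 }'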

[1]: You can see this in Nushell's issue tracker[2]. I'm not judging the number of issues, as any healthy OSS project will have many, but some of these are critical bugs related to command handling and interop. I'm not blaming the Nushell team either, my hat's off to them, but just pointing out that the nature of the project will inevitably lead to a never-ending stream of these types of issues.

[2]: https://github.com/nushell/nushell/issues


> Projects like Nushell essentially need to handle every type of output from any command, and every type of input to any command, which is an absurd amount of work, and just not scalable.

I think you're misunderstanding how Nushell works. They don't parse outputs, or generate inputs, from/for standard Unix commands. Instead, they implement their own commands with the same names as standard commands, and generate/consume structured data by default, using the same data structures everywhere. There is only a single implementation of those data structures. That's very easy to maintain.

So running `ls` from Nushell does not shell out to the `ls` program on your system and then try to make sense of its output. It runs a Nushell-internal command that is tailored to the kind of pipelines that Nushell is built around. They already have hundreds of such commands implemented and working, and that approach absolutely does scale. Whatever issues may remain, it already works much more reliably than the default Unix tools.

Saying that unstructured text streams are a universal interface is like saying that atoms are a universal construction kit – it's technically correct, but pretty useless in practice.


You're right, I misunderstood the way it worked. But I'm not sure that approach is better. They either need to maintain full compatibility with existing tools, or users need to learn the idiosyncrasies of Nushell tools. And commands not reimplemented by Nushell wouldn't work, or they would need some kind of generic wrapper, which would have the drawbacks I mentioned.

But, hey, this obviously has users who prefer it, so if this works for you, that's great. Personally, I'll stick to the standard GNU and POSIX tools. I do concede that this is partly due to the robustness of this ecosystem and my familiarity with it, which is hard to abandon.

> Saying that unstructured text streams are a universal interface is like saying that atoms are a universal construction kit – it's technically correct, but pretty useless in practice.

My point is that offloading to the user the decision of how tools are integrated beyond raw byte streams is the most flexible and future-proof approach, with the least overhead for individual tools. Doing anything more sophisticated, while potentially easier for the user, would require maintenance of the glue layer by each tool's developer, or by a central maintainer à la Nushell. This loose coupling is a good thing.


> They either need to maintain full compatibility with existing tools, or users need to learn the idiosyncrasies of Nushell tools.

The existing tools aren't fully compatible with each other either. There are significant differences between GNU and BSD tools, for example, and yet more differences with BusyBox and others. The idea of "standard" tools is unfortunately an illusion, so not much is lost there.

But more importantly, most of the many options the traditional tools offer are related to output selection and formatting. In Nushell, those problems are solved in a unified way by piping to builtin commands that work with structured data. So instead of learning twenty different cryptic flags for `ls`, you just learn three or four postprocessing commands, and use them for `ls` and everything else.


Both GNU and BSD grep input and output a stream of bytes.

Without ever having tried it, I know that one random day, when I for whatever reason have a wish to take the output from BSD grep and send it over TCP through netcat, to be collected by zsh's built-in TCP support and fed into GNU grep, it will work. No piece along the way made any jerk assumptions or required any jerk tight coupling, and the BSD and GNU tools were completely compatible.
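
Sketched out, with the host, port, and patterns invented for illustration:

  # sender: BSD grep into netcat
  grep 'error' /var/log/messages | nc somehost 9999

  # receiver: zsh's built-in TCP support into GNU grep
  zmodload zsh/net/tcp
  ztcp -l 9999 && ztcp -a $REPLY   # listen, then accept; fd lands in $REPLY
  grep 'disk' <&$REPLY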

That is more valuable, and a greater convenience, than any of the poorly conceived ideas to make it more convenient.

This is all 50 years after Unix pipes were invented, and in an environment the inventors did not even try to predict and handle. Instead, they handled infinity by not trying to predict anything. They just made useful low-level tools which you assemble however you may turn out to need to, and the tools make as few assumptions as possible, since assumptions all eventually break. The hammer doesn't only work with one kind of nail.


> The existing tools aren't fully compatible with each other either.

Right, but those incompatibilities, as well as the way commands interoperate, are left to the user to resolve. No monolithic tool could realistically make that easier unless it reimplements everything from scratch, as Nushell has done. But then you have to work with an entirely different and isolated ecosystem, and you depend on a single project to maintain all your workflows for you. Again, the ability of loosely coupled tools to work together is one of the strengths of Unix.

We clearly have a difference of opinion here, so let's agree to disagree. :)


Murex (https://GitHub.com/lmorg/murex) doesn’t replace coreutils with builtins but manages interop with commands just fine.

Most output is relatively easy to parse; sometimes you need to annotate the pipe with what format to expect, but that's easy enough to do. And Murex does come with builtins that cover some of the more common coreutils use cases, for instances when you want greater assurances of the quality of the data, but those are named differently from their coreutil counterparts to avoid confusion.


Murex is pretty neat, thanks for sharing.

Still, you must have issues parsing all the variations of output, depending on the flags passed to the source command and its version. How do you parse the output of ls or ps without knowing the column headers, delimiters, or which version of the command was run (GNU, BSD, BusyBox, etc.)? Piping data into commands must also require a wrapper of some sort.

Not knocking the project, it does look interesting, especially the saner scripting language. But the usefulness seems limited to the commands and workflows it supports.


Basically the same way you’d parse a CSV, except whitespace-delimited. You assume the headings are the first row. You can use named headings or numbered headings (like AWK), so you have options depending on the input and whether it contains headings.
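
In traditional tooling, the named-headings trick looks roughly like this (a sketch, not Murex's actual parser):

  # map the first row's headings to column numbers, then select by name
  ps | awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
            { print $(col["PID"]), $(col["TTY"]) }'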

The current implementation does break a little if records contain a space as part of their content (e.g. ‘command parameter parameter’ in ps), but I’m working on some code that would look at column alignment as well as separators, etc.: basically reading the output like a human might, but without going to the extreme of machine learning. (I’m already doing this to parse man pages and --help output as part of automatic autocompletions, so I know the theory works; I just haven’t yet applied it to more generalised command output.)


That's the first time I'm hearing about this project (which I take it you are the creator of?). Very interesting!

How would you say Murex compares to Nushell? The syntax seems vaguely similar. Are there any fundamental differences?


Yes, I'm the author :)

Murex was created before most of the alt shells existed, to scratch a personal itch. It's only relatively recently that I've been promoting it. What I wanted to create was a shell that had typed pipes but still worked 100% with traditional POSIX abstractions. So it's still just standard POSIX pipes underneath, with type information sent out-of-band. This basically means you can have a richer set of functionality from anything that understands Murex, while still falling back to plain old byte streams for anything that doesn't.

I've also taken inspiration from IDEs with regard to the interactive UX. You'll get syntax highlighting, dynamic autocompletions based on man pages (I'm shortly going to push an enhancement in that area as well), smarter hints (like tooltips), inline spell checking, and all sorts.

There's also been some focus on making the shell more robust, such as a built-in unit test framework, watches (for debugging), etc.

There will still be plenty of rough edges (as is the case with all shells, to be honest), but it's a vast improvement over Bash, in my biased opinion. So much so that it's been my primary shell for more than 5 years.


> Null separators between fields are structured data. It might just be the simplest structure that can possibly work (because the null character is outside the normal data range for text, unlike the newline, which demonstrably doesn't work in that sense), but it is structure. A list structure, specifically.

Sometimes I daydream about a parallel universe in which the designers of Unix decided on record-oriented IO instead of stream-oriented IO.

If pipes were defined in terms of records as opposed to an unstructured byte stream, there'd be no need for a special character (whether newline or null) to separate records. How is in-band signalling in pipes any better than in-band signalling in telecommunications?


> Sometimes I daydream about a parallel universe in which the designers of Unix decided on record-oriented IO instead of stream-oriented IO.

I would attack you with tea bags and waxed paper if you tried to alter the timeline and make this so.

On the other hand, I would happily sing your praises if you invented another kind of pipe, to your specs. It sounds like a great addition.

Imagine combining the two, in one series of commands!


> On the other hand, I would happily sing your praises if you invented another kind of pipe, to your specs. It sounds like a great addition.

Already exists: Unix domain sockets. Some shells (in particular some versions of ksh) use them to implement pipes. And, on some platforms (Linux yes, but I think maybe not macOS?), Unix domain sockets support record-oriented operation (SOCK_SEQPACKET). The problem is that for it to work you don’t just need the kernel to support it and the shell to use it, you also need all the utilities to support it too, and that’s a big ask.

The idea has been implemented on IBM mainframes (CMS pipelines aka Hartmann pipelines). But that’s a radically different platform, and IBM has never tried porting that to a non-mainframe platform, and nobody has ever sought to directly clone it (although stuff like PowerShell and NuShell share some of its ideas, albeit none of the details)


Hmm. Shame it never caught on in the OSS/GNU world.

Thanks for the FYI.


What if most standard Unix commands had a new option to insert those nifty separators (FS, GS, RS, US) into their output?

That ought to make life simpler for downstream commands that try to go beyond processing a text stream.


> What if most standard Unix commands had a new option to insert those nifty separators (FS, GS, RS, US) into their output?

That's not what I mean by record-oriented IO, though. That's signalling record boundaries in-band; I'm talking about signalling them out-of-band, so you have record lengths kept separately from the data, and passed around by APIs separately from the data bytes.
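
A toy version of the idea in shell (a sketch only: here the lengths travel on a second file descriptor rather than through a kernel record API):

  # writer: data bytes on fd 1, one record length per line on fd 3
  emit() { printf '%s' "$1"; printf '%d\n' "${#1}" >&3; }
  { emit 'hello'; emit 'wor ld'; } > data 3> lens

  # reader: take a length from fd 3, then exactly that many data bytes
  while read -r len <&3; do
    dd bs=1 count="$len" 2>/dev/null
    printf '\n'                     # show the recovered record boundary
  done < data 3< lens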


I've had the same idea. It would (I guess) require rewriting the kernel from the ground up to turn every stream into two streams?

Telecoms uses both concepts (in-band signalling and out-of-band signalling), but I guess the Unix developers didn't get the memo?


> I've had the same idea. It would (I guess) require rewriting the kernel from the ground up to turn every stream into two streams?

Many (but not all) Unix implementations already have support for “record-oriented pipes” in the kernel, it is just that support has rarely been used. Linux has record-oriented Unix domain sockets (SOCK_SEQPACKET), but few use them.

> Telecoms uses both concepts (in-band signalling and out-of-band signalling), but I guess the Unix developers didn't get the memo?

The original Bell Labs Unix team created STREAMS, which does support true record-oriented IO. Most commercial Unix implementations (such as Solaris and AIX) support it (or at least did at one point). But Berkeley sockets won the mindshare competition, and the open source Unix-likes (such as Linux and the BSDs) never added STREAMS support. There was a project to add it to Linux, but Linus was opposed to the idea, so it never got merged, and I think it has since been abandoned.

Even before STREAMS, Unix supported record-oriented IO in the terminal subsystem (cooked mode). But it wasn’t general; it was very specific to the needs of interactive use. STREAMS was intended as a generalisation, but it never caught on. So even today, all Unix-likes have a purpose-specific (rather than general) record-IO implementation in their tty subsystems, pseudoterminals, etc.


It sounds like the situation is ripe for a couple of dedicated (crazed) developers to pick up the ball.



