Next generation Unix pipe by Alex Larsson (gnome.org)
156 points by jobi on Aug 11, 2012 | 82 comments

I find the tendency to repeat Microsoft's mistakes deeply disturbing. Even if, in this case, the author acknowledges PowerShell goes too far, his own idea goes too far.

I'd be all in with flags that make ps or ls spit JSON or XML, but this typed nonsense? What about when I want to output a color? Will I need a new type?

Oh... and the sort thing... it's not hard to sort numerically.

>I find the tendency to repeat Microsoft's mistakes deeply disturbing.

What, in this case, do you consider "Microsoft's mistake"? I thought that PowerShell was commonly considered conceptually sound but flawed in the implementation, mostly because its verbosity makes it unwieldy for interactive use. If this project can solve that, then I don't see it "repeating Microsoft's mistakes". Instead it would be correcting them.

What do you mean by "what when I want to output a color"?

I wonder where this comes from. There's no need for next generation.

People have, for thirty years or so, successfully printed their data into suitable textual streams for processing with programs glued together with pipes, and optionally parsed the results back into some native format if so required.

Meanwhile, none of the "next generation" pipes have gained any momentum. Evidently they either solve something which is not a problem, or they solve some problems but create new ones in greater numbers than the ones they solved, tipping the balance into the negative.

Any object or intermediate format you can think of can be represented in text, and you're back to square one. For example, even though XSLT and XQuery exist, you can just serialize trees of XML elements into a row-based form and use grep on the resulting stream to effectively make hierarchical searches inside the tree.

I have the opposite experience, actually. It's so damn annoying to have programs communicating structured data with each other over pipes that more complicated things inevitably diverge into a) monocultural programs, written in a single programming language, that don't communicate with the outside world, or b) some form of exchange format (JSON, XML, etc.) that needs to be explicitly supported by every participant.

And unix utilities suck at handling structured data. If your file format is line based you might have a chance of it being easy to work with, but don't even ask what happens if you insert a newline inside a text field.

That is a terrible idea: sometimes the app can take advantage of a constraint to minimize work done.

In your example, if we just wanted to filter for a particular user, dps would have to print out ALL of the information and then you could pick at it. This doesn't seem bad for ps (because there's a hard limit) but in many other examples the output could be much larger than what is needed. That's why having filtering and output flags is in many cases more efficient than generating everything.

As a side note: To demonstrate a dramatic example, I tried timing two things:

    - dumping NASDAQ feed data for an entire day, pretty-printing, and then using fgrep
    - having the dumper do the search explicitly (new flags added to program)
Both outputs were sent to /dev/null. The first ran in 35 minutes, the second in less than 1 minute.

Streams clamp everything in them to O(n). That's a problem in some cases; for example, your NASDAQ feed dumper probably has some kind of database inside itself that lets it run filters in massively sublinear time, and making it linear would be a significant performance hit.

However, there are an equal number of tasks that are not sublinear. Some of them are also very common and important sysadmin-y things. Iterate through a directory applying some operation to every file. Slurp a file and look for a particular chunk of bits. And so on. For those sysadmins, a little structure in their stream can make their job a lot easier. It'd be like the difference between assembly and C: all of a sudden things have names.

Obviously for many cases avoiding output is better than post-output filtering. For these cases the originating process should do the filtering. However, in many practical situations the data sets are small enough to not matter, or the operation wanted will not filter out most data anyway.

Basically, you're arguing that grep is a bad tool (it has the same issues), yet it's a very commonly used tool.

If "sometimes the app can take advantage of a constraint" is an argument here, you should be against all usage of pipes.

That's not true. In the case of `ps`, there is a known limit to the number of processes, and it is fairly small, so the performance hit is limited.

As another example in this context, if the original data source is gzip'd, it's faster to gunzip and then pipe rather than integrating the gzip logic into the app itself.

I still disagree. I think you are arguing for the inclusion of, at the least, grep, cut, head and tail in cat.

I do not claim that is a bad idea (conceptually, pipes do not require multiple processes, and those tools could be dynamically linked in) but why stop at those tools? Some people would argue that sed and awk also should be in, others would mention perl, etc.

I also do not see why it would be faster to use an external gzip tool through a pipe. If it is, the writer of the 'tool with built-in unzip' could always, in secret, start an external unzip process to do the work.

So the NASDAQ dumper should accept a structured query as its input. This is an architecture issue, not a data format issue.

For certain types of unix piping, I have found it useful to pipe from tool to CSV, and then let sqlite process the data using SQL statements. SQL solves many of the sorting, filtering, and joining things that you can do with unix pipes too, but with a syntax that is broadly known. Especially the joining I have found hard to do well with shell/piping.

I think a sqlite-aware shell would be awesome, especially if common tools had a common output format (like CSV with header) where that also included the schema / data format.
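A minimal sketch of that CSV-to-sqlite workflow, assuming the `sqlite3` CLI is available; the file, table name, and columns are invented for the demo:

```shell
# Build a tiny CSV (header + rows) standing in for some tool's output,
# load it into an in-memory SQLite database, then sort with SQL instead
# of sort/awk gymnastics.
csv=$(mktemp)
printf 'name,n\nfoo,2\nbar,5\nbaz,3\n' > "$csv"

# .mode csv makes .import treat the first row as column names for the
# new table; .mode list switches back to plain one-value-per-line output.
printf '.mode csv\n.import %s t\n.mode list\nSELECT name FROM t ORDER BY CAST(n AS INTEGER) DESC;\n' "$csv" |
  sqlite3 :memory:
# -> bar, baz, foo (descending by n)

rm -f "$csv"
```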

very clever

My preference for a "next generation pipe": Shared file descriptors. (Sort of)

It would work virtually the same as a standard pipe; the difference being you could control whether it was read, write, or both, and every application you 'piped' to would have access to the same file descriptors as the parent, unless a process in the path of the pipe closes one.

The end result will be the equivalent of passing unlimited individual arbitrary bitstreams combined with the ability to chain arbitrary programs. In fact, you could simplify things by simply passing the previous piped command's output as a new file descriptor to the next program, so you could easily reference the last piped program's output, or any of the ones before it.

For example:

cat arbitrary_data.docx | docx --count-words <$S[0] | docx --head 4 <$S[0] | docx --tail 4 <$S[0] | docx --count-words <$S[2] | echo -en "Number of words: $STREAM[1]\n\nFirst paragraph: $STREAM[2]\n\nLast paragraph: $STREAM[3]\n\nNumber of words in first paragraph: $STREAM[4]\n"

STREAM[0] is the output of 'cat'. STREAM[1] is the counted words of STREAM[0] ($S[0] is an alias). STREAM[2] is the first 4 lines of the doc. STREAM[3] is the last 4 lines of the doc. STREAM[4] is the counted words from STREAM[2] (note the "<$S[2]"). And STREAM[5] is the output of 'echo', though since it's the last command, it becomes STDOUT.

There may be a more slick way of doing this, but you can see the idea. Pass arbitrary streams as you pipe, and reference any of them at any point in the pipe to continue processing data arbitrarily in a one-liner.


Actually, it looks like this is already built into bash (sort of), as the Coprocesses functionality. I don't know if you can use it with pipes, but it's very interesting.

I like the idea of processes dumping structured objects: pipes are rather often used for the processing of structured data, and while tabulated output certainly makes it easier, we still end up effectively using constants: cut to the third column, sort the first 10 characters, and print the first four lines.

This method is fragile when given diverse input: what if the columns could themselves contain tabs, newlines, or even nul bytes?

Passing objects as binary blobs, on the other hand, doesn't allow for ease of display or interoperability with other tools that don't support whatever format they happen to be in. This, of course, can be rectified with a smart shell with pretty-print for columnar data (insofar as a shell could be charged with data parsing; you may imagine an implicit |dprint at the end of each command line that outputs blobs).

I'd also be interested in seeing a utility that took "old-format" columnar data and generated structured objects from it, of course, with the above format caveats.

Something like a cut, only we call it dcut? Actually sounds like a pretty good idea - that way those who don't want to switch to the new format don't have to, and you can pipe it through this program to create the new style structured output...

The inverse of dtable? Yeah, that would be very nice.

How would a column contain a newline?

What would be ideal to solve first is some sort of initial format negotiation on pipes. Otherwise you will end up with the wrong thing happening (eg having to reimplement every tool, spewing "rich" format to tools that don't know it, or regular text to tools that could do better).

We've already seen something like this - for example ls does column output if going directly to a screen, otherwise one per line, and many tools will output in colour if applicable. However, this is enabled by isatty(), which uses system calls, and by inspecting the terminal environment for colour support.
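The isatty() switch is easy to observe from a shell; a quick sketch (the temp directory is just scaffolding):

```shell
dir=$(mktemp -d)
touch "$dir/alpha" "$dir/beta"

# On a terminal, ls would print both names on one row. Through a pipe,
# isatty(STDOUT_FILENO) is false, so ls falls back to one name per line:
ls "$dir" | wc -l   # -> 2

rm -rf "$dir"
```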

Another example is telnet which does feature negotiations if the other end is a telnet daemon, otherwise just acts as a "dumb" network connection. (By default the server end initiates the negotiations.)

However the only way I can see this being possible with pipes is with kernel/syscall support. It would provide a way for either side to indicate support for richer formats, and let them know if that is mutually agreeable, otherwise default to compatible plain old text. For example an ioctl could list formats supported. A recipient would supply a list before the first read() call. The sender would then get that list and make a choice before the first write() call. (This is somewhat similar to how clipboards work.)

So the question becomes would we be happy with a new kernel call in order to support rich pipes, which automatically use current standard behaviour in its absence or when talking to non-rich enabled tools?

I would love it if grep/find/xargs automatically knew about null terminating.

man grep:

   -Z, --null
      Output a zero byte (the ASCII NUL character) instead  of  the  character  that  normally
      follows  a  file  name.   For example, grep -lZ outputs a zero byte after each file name
      instead of the usual newline.  This option makes the output  unambiguous,  even  in  the
      presence  of file names containing unusual characters like newlines.  This option can be
      used with commands like find -print0,  perl  -0,  sort  -z,  and  xargs  -0  to  process
      arbitrary file names, even those that contain newline characters.

   -z, --null-data
      Treat the input as a set of lines, each  terminated  by  a  zero  byte  (the  ASCII  NUL
      character)  instead of a newline.  Like the -Z or --null option, this option can be used
      with commands like sort -z to process arbitrary file names.
man xargs:

   -0     Input  items are terminated by a null character instead of by whitespace, and the quotes
      and backslash are not special (every character is taken literally).  Disables the end of
      file  string,  which  is treated like any other argument.  Useful when input items might
      contain white space, quote marks, or backslashes.  The GNU find -print0 option  produces
      input suitable for this mode.
man find:

   -print0
      True;  print  the  full  file  name on the standard output, followed by a null character
      (instead of the newline character that -print uses).  This allows file names  that  con‐
      tain newlines or other types of white space to be correctly interpreted by programs that
      process the find output.  This option corresponds to the -0 option of xargs.
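A small sketch of why those flags exist, using a deliberately hostile filename (the scratch directory and names are invented for the demo):

```shell
dir=$(mktemp -d)
printf 'hit' > "$dir/plain.txt"
printf 'hit' > "$dir/has
newline.txt"   # a filename containing a real newline

# Newline-delimited output splits that one name into two "lines":
find "$dir" -type f | wc -l                          # -> 3

# NUL-delimited output keeps one record per file:
find "$dir" -type f -print0 | tr -cd '\000' | wc -c  # -> 2

# And the -print0 / -0 / -Z family chains safely end to end:
find "$dir" -type f -print0 | xargs -0 grep -lZ 'hit' | tr -cd '\000' | wc -c  # -> 2

rm -rf "$dir"
```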

Yes, in other words, the parent is right that zero termination is currently not automatic.

He did not say automatic, he said "knew about" nulls. When you talk about automagically detecting nulls I have this image of an ascii-art Clippy with a cowsay bubble that says "I see you are using null terminated data, I have enabled --null for you."

I did exactly say automatic. It is the word immediately before the "knew about" you quoted!

And yes, I would expect that find detects that when it is talking to xargs then null termination should be used without the user having to go and fish out what the options are for each tool. And if you used ps with another tool that prefers json then ps can automatically do that, again without having to find and maintain flags.

"automatically knew", in a post which talks about format negotiation. It was fairly obvious to me he meant that it would use the format negotiation to automatically enable the --null switch.

Null-terminated strings are hard to read in a shell window. And isatty(3) does not work for pagers.

Content negotiation only works with bi-directional data transfer (i.e. not with pipes).

That's only true if you only negotiate via data in the pipe. dtools (in the article) uses non-mandatory file locks to do the content negotiation on the pipe.

My code does format negotiation on the pipe to determine whether to send the data in textual form or binary form.

It uses file locks (F_SETLK) on the pipe with a magic offset value to do the negotiation.

But you still have race conditions. The sender would have to ensure that the locks are set up before the receiver calls read() for the first time. Since pipes are often set up by the shell, you have no control over the startup times. Sure, you could have heuristics such as the receiver waiting a few seconds just in case locks show up, but that just makes things slow and unpredictable.

I stand by my assertion that this can only be solved well (i.e. 100% predictable behaviour no matter what order things start in or how long they take to initialise) by a new system call/ioctl.

No, I avoid the race condition as follows: 1) the reader sets the lock before reading any data; 2) the writer writes a byte to the pipe; 3) the writer waits until the pipe is empty (FIONREAD ioctl); 4) the writer checks for the existence of the lock.

This should be race free.

That requires the first byte sent to be compatible with whatever format is ultimately used. As an example for ps, the first byte in JSON should be a { while for plain text it should be a space. (We get a little lucky since a space would also be acceptable for JSON, but I doubt there is a universal first byte.) And an initial space isn't acceptable for programs like find or grep in either text or null separation mode.

I don't want to belittle what you've done, but the point remains. This can't be done robustly without an additional system call. What you have is tantalizingly close. Even a call as simple as telling the sender that the receiver has called read() would complete your solution.

Not really. All you need to be able to do is to produce the first byte of whatever would have been produced in the "fallback case", i.e. when the reader does not handle format negotiation. Then, when the writer sees that the reader supports format negotiation it will need to signal that the alternative format was chosen. I do this by sending a zero byte (which should never appear in the fallback text format). Then the two first bytes are skipped as part of the negotiation framework when a non-fallback format was chosen.

At the risk of stating the obvious - this won't take off for a simple reason of being too complex by Unix standards.

I'm not sure, it's probably not too complex by GNU standards.

Have you seen the command line options for ps? They're going to have to start using Unicode accent marks if they extend it much further.

"Even something as basic as numerical sorting on a column gets quite complicated."

    sort -g -k field_num

Two problems: the header is sorted along with the fields, and you have to look up the field number. Insurmountable? No; but somewhat complicated.
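Both nits have stock workarounds, though they illustrate the complication: read off the header first, then sort the rest. A sketch with made-up sample rows standing in for ps output:

```shell
# Pass the header through untouched, then numerically sort the data
# rows by the second column (pcpu), descending.
printf '%s\n' 'PID PCPU CMD' '12 5.0 foo' '7 40.2 bar' '30 9.1 baz' |
  { IFS= read -r header; printf '%s\n' "$header"; sort -k2,2 -gr; }
# -> header row first, then bar (40.2), baz (9.1), foo (5.0)
```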

`grep -v` or `tail` seems like a lot easier workaround than designing a brand new shell piping system. Maybe there are other use cases, but sorting numeric fields is definitely not worth the effort. A lot of black magic can be easily conjured up with `col`, `tr`, `cut`, `column`, `head`, `tail`, `grep`, `pr`, and `sort`; and that's without ever even touching `sed` and `awk`.

Obviously a lot can be done, but it's hardly easy; you yourself call it black magic. But it's very easy to do with typed data, and that is only the beginning of what you can do.

When I said black magic[1] the last thing I was trying to convey was that using coreutils/bsdmainutils was complicated. Typed data is easy to work with, but creating a sophisticated unix pipes 2.0 is not. No matter how complicated you think coreutils/bsdmainutils mastery is, you have to admit it's a lot easier than building unix pipes 2.0.

If you throw in numutils and moreutils you can go nuts with columns of data. What tasks would you like to accomplish on the command line with columns of typed data and pipes 2.0?

[1] On a side note, I was surprised that we had different conceptions of what black magic is. I was going for evil, nefarious, and/or unorthodox. Have I been using the term wrong? That's an honest question; it would not surprise me if I have been oblivious.

I don't expect every user to create unix pipes 2.0, so the difficulty of that is not really what needs to be compared. It will only have to be done once.

And once this is done any user can avoid having to painstakingly construct pipelines that try to cut out the right columns to treat as numbers, or avoid all the problems parsing strings that may contain spaces or other control characters. You can do an operation like:

filter out all processes with %cpu > 20 and uid > 1000, and sort by second cmdline arg, as:

dps | dfilter pcpu ">" 20 uid ">" 1000 | dsort "cmdvect[1]"

Obviously a made up example, but something like this is easy to read and write, whereas something working on tabular ascii data would be quite long and complicated.

As for black magic, I have about the same interpretation as you; I didn't really misunderstand it to be about how complicated it was. However, "black magic" certainly has a feel of "you should not do this", and arguing that you can use that to do something which could instead be simple and obvious in a typed system seems kind of weird.

Your example would be about the same length with awk and sort, with the only caveat that you need to figure out the field numbers, and the upside that I can trust the tools are available pretty much everywhere.
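For concreteness, a rough sketch of that awk/sort version. The sample rows stand in for `ps -eo pcpu=,uid=,args=` output, and (as the replies note) args containing spaces would break the field numbering:

```shell
# Fields: pcpu, uid, then the command line. Filter pcpu > 20 and
# uid > 1000, then sort by the first command-line argument (field 4).
printf '%s\n' \
  '35.0 1001 /bin/foo beta' \
  '10.0 1001 /bin/bar alpha' \
  '50.0 0 /sbin/baz zeta' |
  awk '$1 > 20 && $2 > 1000' |
  sort -k4,4
# -> only "35.0 1001 /bin/foo beta" survives the filter
```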

It's doable, yeah, but it's a lot more work.

First you have to handle the header specially (want it in the result but not in the comparisons).

In order to compare by uid you need numeric uids (-n), but that means you can't also get the readable username, so you need a custom output format.

Then you need to ensure the output format is such that nothing with possible spaces or control chars can end up in a column before the data you're looking at, as then finding the right column is hard.

Even then, extracting the first command line arg like in the example will fail in the case of a binary name that has a space in it (as there is no way to know which spaces in the command line correspond to actual spaces in the arguments and which are just delimiters).

Well, what's the awk/sort version?

As you admit, your system is not simple to implement, but you are correct that the onus is on you to create pipes 2.0. It also means, though, that every other developer is going to have to implement pipes 2.0 compliant output for their program. Unless pipes 2.0 is going to auto-identify everything in addition to nulls?

Yes, the weak link here is obviously getting all kinds of output into the pipes 2.0 format. Triggering such output via format negotiation is possible, but you still would have to add support for actually outputting it.

Black magic is "A technique that works, though nobody really understands why." http://www.catb.org/jargon/html/B/black-magic.html

Sometimes I feel like POSIX idioms are like the bible to some here: untouchable.

Sometimes I feel like Hacker News stories are like mom's fridge to some here: everything gets two gold stars.

Neat! I've commented about this very problem before on several of the many threads regarding "object pipes", ie. REPLs.




Since that last comment, I've been working a bunch with Clojure, which has a far more expressive variant of JSON, as well as some heavy duty work with Google's Protocol Buffers.

A few points:

1) Piping non-serializable objects is a BAD IDEA. That's not a shell, that's a REPL. And even in a REPL, you should prefer inert data, a la Clojure's immutable data structures.

2) Arbitrary bit streams are, fundamentally, unbeatable. They're completely universal. Some use cases really don't want structured data. Consider gzip: you just want to take bytes in and send bytes out. You don't necessarily want typed data in the pipes, you want typed pipes, which may or may not contain typed data. This is the "format negotiation" piece that is mentioned in the original post. I'd like to see more details about that.

3) There seems to be some nebulous lowest common denominator of serializable data. So many things out there: GVariant, Clojure forms, JSON, XML, ProtoBuf, Thrift, Avro, ad infinitum. If everything talks its own serialization protocol, then none of the "do one thing well" benefits work. Every component needs to know every protocol. One has to "win" in a collaborative shell environment. I need to study GVariant more closely.

4) Whichever format "wins", it needs to be self-describing. A table format command can't work on field names, unless it has the field names! ProtoBufs and Thrift are out, because you need to have field names pre-compiled on either side of the pipe. Unless, of course, you start with a MessageDescriptor object up front, which ProtoBufs support and Avro has natively, but I digress: Reflection is necessary. It's not clear if you need header descriptors a la MessageDescriptor/Avro, or inline field descriptions a la JSON/XML/Clojure. Or a mix of both?

5) Order is critical. There's a reason these formats are called "serializable". Clojure, for example, provides sets using the #{} notation. And, like JSON, supports {} map notation. Thrift has Maps and Sets too. ProtoBufs, however, don't. On purpose. And it's a good thing! The data is going to come across the pipe in series, so a map or set doesn't make sense. Use a sequence of key-value-pairs. It might even be an infinite sequence! It's one thing to support un-ordered data when printing and reading data. It's another thing entirely to design a streaming protocol around un-ordered data. Shells need a streaming protocol.

6) Going back to content negotiation, this streaming protocol might be able to multiplex types over a single stream. Maybe gzip sends a little structured metadata up front, then a binary stream. ProtoBufs label all "bytes" fields with a size, but you might not know the size in advance. Maybe you need two synchronized streams on which you can multiplex a control channel? That is, each pipe is two pipes. One request/response pair and the other a modal byte stream vs typed message stream.

Overall. This is the nicest attempt at this idea I've seen yet. I've been meaning to take a crack at it myself, but refused to do it without enough time to re-create the entire standard Unix toolkit plus my own shell ;-)

Regarding order. The dtools approach uses a stream (i.e. potentially infinite) of variants. Each variant is a self contained typed data chunk which is by itself not "streamable" (i.e. you have to read all of it). The data chunk is strongly typed and the type is self-described.

The supported primitive types are: bool, byte, int16, uint16, int32, uint32, int64, uint64, double, utf8 string (+ some dbus specific things).

These can be recursively combined with: arrays (of same type), tuples, dicts (primitive type -> any type map), maybe type, and variant type

In my dps example I generate a stream of dictionaries mapping from string to variant (i.e. any type). The type of each item in the map differs. For instance cmdvec is an array of strings, whereas euid is a uint32.

Thanks. I looked at the GVariant page a bunch too.

It seems like the encoding is a stream of {type, value} pairs, where values can contain per-type headers as well.

Protobufs, on the other hand, use {field, wire-type, value} where field is required to have an externally known type to parse value, but wire-type is sufficient to determine the length of value, so you can skip unknown fields (used for backwards compatible protocols). In theory, required fields could omit field and wire-type, but Protobuf's designers deemed that more complexity than the space and performance savings justify.

Primitive values like integers are totally expected in any such format like this. Their salient feature is that they're of known length. I'm a little more leery about "arrays" or other data structures of variable length which are encoded with a known length. Consider Pascal strings {length, [chars]} vs C strings {[chars], NULL}. The latter lends itself much better to streaming protocols, but the former is far simpler to work with when you have a complete dataset.

I ran into this situation with a Protobuf I was designing where the first attempt had a message with a repeated field, but it became obvious that I wanted a begin message, a repeated message of singular fields, and then an end message, to allow a fast start on the send, which didn't require knowing the full data set length up front.

There are, however, situations where you do want the length up front. For example, if you need to allocate space to put things. You can get faster parsing if you know the total message size immediately. In general, however, I don't think it matters all that much with modern languages and hardware.

This is one reason why Clojure has both lists and vectors. Lists are lazy head/tail pairs and (count some-vector) is a constant time operation. Unfortunately, Clojure's reader doesn't seem to offer streaming reads of lists (I may be wrong about this).

The bigger issue with unbounded values is that they are more difficult to work with in most languages. Haskell, Lisps, and other functional languages fare far better than most, but once you start mixing fixed-sized messages with known fields, with variable-sized sequences, you wind up with a situation like {x, [ys], z} where a piece of code wants to look at z before looking at ys. If that tuple is represented as an associative structure {:x 1, :ys [2 3], :z 4} then it's suddenly very confusing that it's an ORDERED map and all sorts of assumptions go out the window.

Even more fundamentally: Source code is a serialized protocol. You write down text and the order of the characters on the page have meaning. Sometimes, that order may be over-specified, but regardless, humans see order and make assumptions from it, even when order doesn't matter.

I've quickly jotted some thoughts here: http://damnkids.posterous.com/rich-format-unix-pipes

Regarding this version, standardizing on a particular transfer format is a bad idea. If history has shown anything, it's that we like to reinvent this stuff and make it more complicated than necessary (see also XDR, ASN.1, XML, etc. :) pretty much on a 5 year cycle or thereabouts.

Do the bare minimum design necessary and let social convention evolve the rest.

Having too many different formats is also a problem, though, as incompatible formats mean you can't combine two apps in a pipeline.

The negotiation in dtools is done using an F_GETLK hack with a magic offset value. That approach could easily be extended to support multiple formats.

The way I see it there are two distinct problems being coupled together, kinda like inventing HTTP but defining it only to be used with (html, gif, png) or something. The reality is that if your solution gets even some adoption, conventions will quickly emerge based on actual use rather than expected use, which almost never goes well. Additionally, when some after-market use is discovered that wasn't part of the original spec, yet makes fabulous sense, existing implementations may be better positioned to deal with it (instead of suddenly finding they're being fed PNG files which are actually base64-encoded XML, or something mad like that, typical shoehorning crap).

Similarly, producing a big ecosystem of utilities to go along with it will probably result in a bunch of 1970s style compatibility commands that nobody actually uses any more (say, in 2030).

But feel free to bake a glib-specific serialization in and I'll feel free to pass it up. ;)

Love your fcntl() hack. My kind of hack!

If I can do a slight PG impression, "what problem does this solve?"

Among others, this problem:


find -print0 is a lame hack, and even filenames with spaces (not newlines) are somewhat messy to work with on the Unix shell.

Or a little recurring problem I have: How do I grep the output of grep -C (matches showing multiple lines delimited with a "--" line)? I wrote a custom tool to do it, which does the job, but really it would be nice if I could use all the normal line-based Linux tools (sort, uniq, awk, wc, sed) with a match as a "line".
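One hedged workaround, assuming GNU awk (multi-character RS is a GNU extension): treat each "--"-delimited group as one awk record. The sample input imitates grep -C output:

```shell
# Two context groups, separated by the "--" line grep -C emits.
# Setting RS to the delimiter lets the pattern select the whole
# matching group, not just one line of it.
printf '%s\n' 'a' 'needle1' 'b' '--' 'c' 'needle2' 'd' |
  awk 'BEGIN { RS = "--\n" } /needle2/ { printf "%s", $0 }'
# -> prints the second group: c, needle2, d
```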

> find -print0 is a lame hack, and even filenames with spaces (not newlines) are somewhat messy to work with on the Unix shell.

This problem is simply a flaw in sh (and its descendants), other shells handle it much better, see for example Tom Duff's rc shell: http://rc.cat-v.org

Also note that Plan 9, the successor to Unix (and which uses the rc shell as its main shell) doesn't even have a find command, find's design is not really very unix-y.

As for your second questions, the answer might be structural regular expressions: http://doc.cat-v.org/bell_labs/structural_regexps/

> This problem is simply a flaw in sh (and its descendants)

Indeed, although it's not just sh; if you want to, say, make a table of filenames and some attributes of each file, you're in trouble if the filenames contain spaces (awk, cut, sort don't work as easily) and screwed if they contain newlines.

What does Plan 9 use instead of find?

> As for your second questions, the answer might be structural regular expressions:

I've actually been meaning to write a clone of the command line portion of sam, tack on some slightly more powerful features, and try living with it... it would be able to solve much of that use case, but I think it would be cleaner if all the normal tools just knew that the output of grep -C is, in fact, a list of multiline strings.

The Plan 9 approach is to avoid creating problems for yourself by not using spaces in file names to begin with. The file server initially disallowed spaces in file names just as nulls and slashes are disallowed. That restriction has since been relaxed,† but everyone still avoids spaces. If you cannot avoid files with spaces in their names, there exists trfs,†† a file system that transparently replaces spaces with something more convenient.

Instead of find, I run /bin/test on a list of files. For anything more complicated than what test can handle, I use Inferno's fs program.†††


†† http://a-30.net/inferno/man/4/trfs.html

††† http://www.vitanuova.com/inferno/man/1/fs.html It has a misleading name. It is not a file server.

Filenames with newlines are an edge case and a data problem. Just because you can have fucked up characters in filenames doesn't mean you should.

You can spend 10 hours solving for edge cases or 1 hour redefining the problem, in this case by mandating that we only work on files with ASCII-printable names. It's often much, much easier to massage input than to have every follow-on tool handle every freaking possible edge case ever.

The GVariant/dbus typesystem has both "string", which is a UTF-8 text string, and "bytestring", which is an array of bytes. The latter is what you would use for filenames and would avoid problems with weird characters in filenames, etc.
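To make the string/bytestring distinction concrete, here's a small Python illustration (not GVariant itself): passing bytes paths to os functions keeps filenames as raw byte arrays, so a name that isn't valid UTF-8 still round-trips.

```python
import os
import tempfile

d = tempfile.mkdtemp()
weird = b"caf\xe9.txt"  # latin-1 bytes, NOT valid UTF-8, so it has no
                        # lossless "string" representation

# create the file using the bytes name, then list the directory as bytes
open(os.path.join(d.encode(), weird), "wb").close()
names = os.listdir(d.encode())  # bytes path in -> bytes names out
```

This is the same design call: a typed pipe needs a byte-array type for filenames, because a text type silently assumes an encoding.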

You don't always need to use sed and awk. You can sort more easily, without resorting to cut. You can let your pipe work on ranges.

Basically, use your imagination :-)

What's wrong with sed and awk? This may be the Stockholm Syndrome talking, but I like awk!

As for the sorting, what's wrong with using the -k flag? I only use cut when I really don't care about that field.

Well I'm not going to learn a new tool to solve imaginary problems. Is there anything in the real world, a simple scenario where this would be handier than tools that have existed for 20+ years?

If you know the tools, great! Don't let that stop you from using them.

If I can do a slight Henry Spencer impression, "the problem of not understanding Unix and being condemned to reinvent it, poorly."

I've been playing around with a similar idea - using plain JSON as the message format, you can make a set of pipeable command line utilities for manipulating data from many web APIs.
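A minimal sketch of that idea (tool names and API entirely hypothetical): each utility reads and writes one JSON object per line, so they compose over ordinary pipes without any binary protocol.

```python
import json

def jfilter(lines, pred):
    """Filter a JSON-lines stream by a predicate on the decoded object."""
    for line in lines:
        obj = json.loads(line)
        if pred(obj):
            yield json.dumps(obj)

# demo: a two-record stream, keep only records with age >= 21
stream = ['{"name": "alice", "age": 31}', '{"name": "bob", "age": 19}']
adults = list(jfilter(stream, lambda o: o["age"] >= 21))
```

The nice property is that any language with a JSON parser can join the pipeline, which sidesteps the monoculture problem mentioned upthread.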

Have you seen RecordStream? It's some good cli tools based on streams of json which might be handy.


I've used it a lot and it's a godsend for most "record-y" manipulation.

A textual format needs parsing at every stage in the pipeline, though. That's why I think using an (optional, when supported) binary format is important.

I've noticed that the NUL-termination problem [1] has come up a number of times in these comments. If you want a solution to this that isn't so drastic as an object system, perhaps take a look at Usul [2], non-POSIX 'tabular' Unix utilities which use an AWK-style $RS.

[1]: http://news.ycombinator.com/item?id=4369699

[2]: http://lubutu.com/soso/usul

What about providing a filter that converts to whatever format you can think of? e.g. outputs in JSON or XML

Because what are you converting from? It can't be turtles all the way down, at some point there must be a defined system that everything speaks. Adding output formats after that is relatively simple.

The output is typed, in a standard format. He defines it in his post (cf. the output of the dfs program).

Actually I prefer PowerShell's approach of transferring objects, as it is more flexible than standardizing on a specific transfer format.

But I do concede that it has the downside that if the object lacks the properties you want to access, then it might be painful in some cases.

But it is not like you cannot do normal string processing using cmdlets like "Select-String". And an object missing a property is almost the same as a column missing in the returned text output, right?

Good point, I've forgotten about that.

I can't believe this. Just 2 or so weeks ago I set about writing exactly something like this in Haskell [1]. It's by no means complete or even working at this point, but basically what I had in mind was something like:

    yls | yfilter 'mdate = yesterday && permissions.oread = true' | yformat -ls
Every tool emits or consumes "typed" JSON (i.e. JSON data with an additional JSON schema). Why typed? Because then the meaning of things like mdate = yesterday can be inferred from the type of mdate and mean different things depending on whether mdate is a string or a date. In the case of a date, the expression mdate = yesterday can automatically be rewritten to mdate >= 201208110000 && mdate < 201208120000 etc. In the case of a string we do string comparison. In the case of a bool we emit an error if the compared-to value isn't either true or false, etc.
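That type-directed rewriting can be sketched in a few lines of Python (schema format and semantics are my guesses at the idea, not the actual ytools code): the declared type of the field decides what "=" means.

```python
import datetime

def compile_eq(field, value, schema):
    """Rewrite 'field = value' into a predicate, using the field's declared
    type from the schema to pick the comparison semantics."""
    ftype = schema[field]
    if ftype == "date" and value == "yesterday":
        # dates: rewrite equality into a half-open one-day range
        day = datetime.date.today() - datetime.timedelta(days=1)
        lo = datetime.datetime.combine(day, datetime.time.min)
        hi = lo + datetime.timedelta(days=1)
        return lambda rec: lo <= rec[field] < hi
    if ftype == "bool":
        # bools: reject anything but true/false literals
        if value not in ("true", "false"):
            raise ValueError("bool field compared to non-bool literal")
        return lambda rec: rec[field] is (value == "true")
    # default: plain string equality
    return lambda rec: rec[field] == value

schema = {"mdate": "date", "name": "string", "oread": "bool"}
p = compile_eq("name", "foo.txt", schema)
```

The same dispatch would extend to globs on strings, numeric ranges, and so on, all driven by the schema rather than by per-tool flags.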

Basically, I wanted to build a couple of standard tools inspired by the FP world, like filter, sort, map, fold (reduce) and have an universal tool for outputting formatted data in whatever form is desired - be it JSON, csv files, text files or custom formats. Every tool would support an -f parameter, which means that its output is automatically piped through the format tool, so that something like

    yls -fls
is functionally equivalent to

    yls | yformat -ls
which would output the JSON data from yls in the traditional ls way on a unix system.

    yls | yformat -csv
would output csv data. Some more examples:

    yls | yfold '+ size' 0
prints out the combined size of all files in the current directory.
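That fold is just a reduce over the size field of each record; a hedged Python equivalent of `yfold '+ size' 0` over a small JSON-lines stream:

```python
import functools
import json

# stand-in for the JSON records a hypothetical yls would emit
records = [json.loads(s) for s in (
    '{"name": "a", "size": 100}',
    '{"name": "b", "size": 250}',
)]

# yfold '+ size' 0  ~  fold (+) 0 over the size field
total = functools.reduce(lambda acc, rec: acc + rec["size"], records, 0)
```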

    yls | ymap 'name = name + .jpg' | ymv
would append .jpg to all files in the current directory.

    ycontacts | yfilter -fcsv 'name = *John*'
would print out all Google contacts containing John in their name as a csv file.

    yps | yfilter 'name = java*' | yeval 'kill name'
would kill all processes whose names start with 'java'.

The cool thing about this is that this approach conserves one of the main selling points of FP: composability. I.e. you can throw something like yfold '+ size' 0 in a shell script and then write:

    yls | size.sh
This way people would be able to build an ever growing toolbelt of abstracted functionality specifically tailored to their way of doing things, without losing composability.

[1] https://github.com/pkamenarsky/ytools

Personally, I'm not feeling the quotes and would prefer parens since they're nestable.

