
Rob Pike: "Current Unix tools are weakened by the built-in concept of a line" - akkartik
http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf
======
haberman
The real problem is that Unix commands produce flat text output without any
information about how to parse that text back into structured data. Any user
who wants the structured version of the data has to parse it themselves, but
these parsers are ad hoc and incomplete by their very nature.

People praise perl, sed, awk, cut, etc. for being good at text processing. But
the only reason they need these text-processing tools for pipelines is because
they are trying to recover structure from the data that was already present
before the previous stage of the pipeline threw that structure away by dumping
it to flat text!

Text is obviously a convenient way for humans to view a program's output, so
clearly it's useful that all Unix programs (ls, ps, etc) can dump their output
as text. But there's no need to dump to text until the output is being sent to
a human. If you're piping "ls | grep" there is no reason for "ls" to dump to
text and "grep" to parse it back from text, especially since "grep" doesn't
know anything about the format of ls's output. It would be way more convenient
if you could say something like:

    
    
        ls | grep 'file.size > 1M'
    

But the only way to do this today is to parse ls's output first. There would
be no reason for this if ls could send _structured_ data to grep.
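
A minimal sketch of what such a structured pipeline could look like, using JSON lines as a stand-in wire format. The `emit_ls` and `structured_grep` names, the record fields, and the sample entries are all invented for illustration; no such tools exist:

```python
import json

# Hypothetical structured "ls": emits one JSON record per directory entry
# instead of a flat text listing (names and layout are invented here).
def emit_ls():
    entries = [
        {"name": "notes.txt", "size": 4096},
        {"name": "video.mkv", "size": 3 * 1024 ** 2},
    ]
    return [json.dumps(e) for e in entries]

# Hypothetical structured "grep": filters whole records by a field
# predicate, with no text parsing of ls's human-readable output.
def structured_grep(lines, predicate):
    records = (json.loads(line) for line in lines)
    return [r for r in records if predicate(r)]

# The equivalent of: ls | grep 'file.size > 1M'
big = structured_grep(emit_ls(), lambda r: r["size"] > 1024 ** 2)
print([r["name"] for r in big])
```
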

What I'm describing is similar to Monad, Microsoft's next-gen shell. AIUI it
can send .NET objects between processes instead of flat text. But IMO it's too
imposing to mandate a single object representation like .NET objects.

I'm experimenting with the idea of letting people specify the output of
command-line utilities as a Protocol Buffer schema, for example:

    
    
      message DirectoryEntry {
        optional uint64 inode = 1;
        optional string name = 2;
        optional uint64 size = 3;
        // etc.
      }
    

I think this could enable the next generation of usability in command-line
pipelines by saving people from having to write ad-hoc parsers all the time.

~~~
stcredzero
_The real problem is that Unix commands produce flat text output without any
information about how to parse that text back into structured data. Any user
who wants the structured version of the data has to parse it themselves, but
these parsers are ad hoc and incomplete by their very nature._

There are a variety of serialization schemes that are quite easy to parse and
would be suitable for the output of most Unix command-line tools. Bencode from
BitTorrent would do nicely. JSON would do nicely as well.

Better yet, unify the shell with a virtual machine that is used to implement
the OS, and have everything available as 1st class Objects.

~~~
haberman
> JSON would do nicely as well.

Yep, some friends of mine did this with JSON, but didn't make the schema
explicit like I mean to: <https://github.com/benbernard/RecordStream>

> Better yet, unify the shell with a virtual machine that is used to implement
> the OS, and have everything available as 1st class Objects.

Please no. This is the Microsoft PowerShell approach, where everything is a
.NET object. Once you start dictating _representations_ of objects, you are
dictating far too much about the implementation of individual pipeline nodes.

~~~
neutronicus
How can your tools all accept the same kind of structured data _without_
dictating its representation? I don't get it.

~~~
haberman
When I talk about a "representation," I mean an in-memory format. For example,
the "representation" of an HTML tree is the DOM.

Yes, you have to agree on a serialization format (JSON, Protocol Buffers,
etc), but that's not the same thing. From a serialization format you can
represent the data however you see fit in your process. For example, a C++
user might represent a string as a std::string object whereas a Python user
would represent it as a native Python string.

The VM-based approach (like PowerShell) defines an in-memory tree
representation, namely .NET objects. This means that you can't really
interoperate with this stack unless you use .NET too, since you don't have an
easy way of converting .NET objects to your own objects.
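
A small illustration of the distinction, using only the standard library: the serialization format (JSON here) is agreed on, but each consumer maps the same bytes onto its own in-memory representation. The `DirectoryEntry` type is invented for the example:

```python
import json
from dataclasses import dataclass

wire = '{"name": "core.c", "size": 1024}'  # the agreed serialization format

# Consumer A: a generic dict representation.
as_dict = json.loads(wire)

# Consumer B: its own typed in-memory representation.
@dataclass
class DirectoryEntry:
    name: str
    size: int

as_typed = DirectoryEntry(**json.loads(wire))

# Same bytes on the wire, two different in-memory representations.
print(type(as_dict).__name__, type(as_typed).__name__)
```
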

~~~
akkartik
Just a nit: I think protocol buffers include representations, not just
serialization formats. You need to have the schema of the proto to parse it
correctly, know which fields are required, repeated, etc. Am I understanding
you correctly?

~~~
haberman
It's true that many Protocol Buffer libraries include representations, but
these are for convenience; Protocol Buffers are defined in terms of their
serialization format and schema.

~~~
akkartik
I don't follow. Isn't schema the same as representation? It's the equivalent
of the DTD for an XML document.

To be super concrete, I can't read a file containing protos without knowing
their type, what fields they contain, etc. I can however read JSON just fine
without knowing the precise schema being encoded.
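
To make the contrast concrete (a sketch; the binary layout below is illustrative and not the real protobuf wire format): JSON carries its field names in-band, so a reader with no schema can still discover the structure, while a tag-less binary record is opaque without one:

```python
import json
import struct

# JSON: field names travel with the data.
record = json.loads('{"inode": 42, "name": "a.txt", "size": 512}')
print(sorted(record))  # field names recovered from the bytes alone

# A schema-less binary encoding: two unsigned 64-bit integers,
# no tags, no names (again, not actual protobuf encoding).
blob = struct.pack("<QQ", 42, 512)
# Without the schema, a reader only sees 16 anonymous bytes.
print(len(blob))
```
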

------
stcredzero
Take this sort of thinking to the limit, and you end up with a unification of
a programming language with the OS. Lisp Machines and some of the first
instances of Smalltalk were like this. (Smalltalk used to be an OS.) In this
case, you don't ever parse text output from your shell tools. You just code
directly against objects and streams and collections of objects.

~~~
kbd
This is the principle behind Microsoft's PowerShell. Everything is .NET
objects.

------
buster
I guess one can say that 25 years later this has been shown as (partly) wrong.

I don't think PowerShell really solves a problem; it's too complex to work
with for the majority of tasks. If I want complex data handling, I write a
script and put #!/usr/bin/env {bash,python,perl} in the first line.

I think the missing point here is that the nice, line-based, really simple
approach mirrors how we speak and write and think (in a series of flat words,
so to speak). It's extremely easy to get into this kind of data handling, and
it's sufficient for a lot of tasks. I always admired what can be done with one
line of bash/GNU utils.

As I said: if it gets more complex, we use a "real" programming language with
more complex data structures anyway.

~~~
xtractinator
I'm wary of using structured data, that is, data structured in such a way
that you don't need to understand what is being passed around. Keeping it
simple means that the developer will always know exactly what data is being
passed.

This is important for a lot of reasons, not all of them related only to
development as a job.

------
saurik
Some researchers at the University of Helsinki have studied the hell out of
this problem, and even provided a useful tool that is available in many Unix
distributions called "sgrep".

<http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>

~~~
luriel
There is also sam -d ;) <http://man.cat-v.org/plan_9/1/sam>

(It hides the GUI and gives it a more ed-like interface that you can easily
script, but this is kind of a hack for Plan 9-nuts ;)

------
mooism2
When is this from? (I presume this is a historical paper that was not
influential at the time.)

~~~
wcarss
Not to say that anyone should have to do this, but I did the following:

<http://scholar.google.ca/scholar?hl=en&q=Structural+Regular+Expressions>

Which yields the paper as the first result, showing its publication year of
1987.

~~~
4ad
Not sure about the date of publication, but Rob developed these ideas in the
early 80s; his Sam editor (which Ken Thompson still uses) is based on them.

~~~
gits1225
Offtopic:

sam (and acme) sounds like an interesting approach at first, but when you try
it out, it's really weird and _slow_ to get stuff done. That's because, for
me, the mouse is really inferior to the keyboard for the majority of tasks.

This is an excerpt from Coders at Work:

 _Seibel: Is there anything you would have done differently about learning to
program? Do you have any regrets about the sort of path you took or do you
wish you had done anything earlier?

Thompson: Oh, sure, sure. In high school I wish I’d taken typing. I suffer
from poor typing yet today, but who knew. I didn’t plan anything or do
anything. I have no discipline. I did what I wanted to do next, period, all
the time. If I had some foresight or planning or something, there are things,
like typing, I would have done when I had the chance. I would have taken some
deeper math because certainly I’ve run across things where I have to get help
for math. So yeah, there are little things like that. But if I went back and
had to do it over I’m sure that I just wouldn’t have it in me to do anything
differently. Basically I planned nothing and I just took the next step. And if
I had to do it over again, I’d just have taken the next step again._

It would be interesting to know if Ken Thompson is using sam because he is a
slow typist and hence doing stuff the sam way (mouse oriented) boosts his
productivity or simply because he likes using sam.

~~~
luriel
Acme mouse chording is really cool, doing something similar with keyboard-
oriented editors is much more tedious and often outright painful (
<http://acme.cat-v.org/mouse> ).

I suspect ken spends most of his time thinking rather than typing, and I have
found this to be true of most great hackers.

Other famous Sam users include Brian Kernighan, Bjarne Stroustrup(!) and Tom
Duff. Kernighan writes at least as much English as code and I wonder how that
affects his editor usage patterns, but I suspect that even when writing
natural languages most time is best spent thinking (his writing style is very
concise and clear, one could say similar to ken's code).

I think the obsession with saving keystrokes is very misguided. I still use
editors like vi frequently, and having to think about what magic combination
of commands will perform a task can be very distracting. It is fun and feels
good, like a tiny puzzle game built into your editor, but it doesn't help you
write better code faster IMHO.

P.S.: It's interesting who has stuck with Sam and who moved to Acme (Dennis
Ritchie switched to Acme, and most of the Go team at Google besides Ken use
it too).

See also: <http://sam.cat-v.org/>

~~~
adbge

        > I suspect ken spends most of his time thinking rather than
        > typing, and I have found this to be true of most great 
        > hackers.
    

Who cares? Slow input is still not a desirable property in a text editor.

------
silas
Every scribd link on Hacker News is marked as private for me for some reason.
Is this broken for anyone else, or..?

~~~
saraid216
Me, too. I'd really like to read this...

~~~
akkartik
But surely the main non-scribd link works?

~~~
saraid216
It didn't work when I clicked initially; it worked later on.

------
honr
I think tools that generate or pass records or rows of data should certainly
have the option of providing schema-based output as well. In addition to
"find ... -print0", having either "find ... -proto" or "find ... -json"
(assuming the schema for the JSON is known, e.g. in
/usr/share/SOMEWHERE/find.json-schema or similar) would be really
appreciated. But let's not go overboard with this, as there are many cases
where parsing is genuinely the sanest approach.
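
A sketch of what such a `find ... -json` mode might emit: one self-describing record per line. The flag, the schema location, and the field names are the commenter's proposal, not an existing feature:

```python
import json
import os

def find_json(root):
    """Yield one JSON record per file, roughly what a hypothetical
    'find root -json' might print (this flag does not exist)."""
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            yield json.dumps({"path": path, "size": os.path.getsize(path)})

# Conceptual downstream use, filtering records instead of columns:
#   find . -json | jq 'select(.size > 4096) | .path'
```
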

------
postfuturist
I ran into a wall when trying to use unix-y tools for somewhat complex
regex/replace operations during code refactors. Basically, anything inside a
single line is easy, but once you cross that line barrier, the complexity
increases dramatically. That rendered the changes useless, because most
programming languages allow you to add arbitrary newlines between any two
tokens in the language.

~~~
nhaehnle
I know what you're saying. I ran into the same problem a while back and ended
up hacking a tool that does the kind of structured pattern matching I wanted.
Its syntax is a bit awkward, but what the hell - you can find it here if
you're interested: <https://github.com/nhaehnle/patrex>

~~~
bazzargh
I also hacked up a tool for structural search. I'd noticed IntelliJ's
built-in structural search has a very simple pattern language: you just type
the code you're searching for and insert a wildcard like $x$ or $foo$ where
an expression can vary. However, it's restricted to the languages IntelliJ
can parse.

(in what follows HN is mangling asterisks, so I used 'X')

Tokenizing a language is much simpler though. So I just did that, and let
wildcards match like <token>X? up to the next expression boundary at the same
nesting level - the boundaries were ',' ';' ')' '}' and ']' when I tried this
on java, C and perl.

This turns out to be simple enough and powerful enough to be useful. For
instance, your example:

boost::bind(& ${id} $( :: ${id} )+, $( boost::ref( Xthis ) )|ref| $.X )

Would be: boost::bind(& $id1$ :: $id2$, boost::ref( Xthis ) ) (for 2 args -
I'd need to repeat for more args like so: boost::bind(& $id1$ :: $id2$,
boost::ref( Xthis ), $id3$)

This works because $id1$ is non-greedy, and whitespace is ignored. I did tend
to tweak the tool to what I was searching for (I'd have dropped ',' as a
delimiter for this one), which I guess is cheating on keeping the syntax
simple!
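
A toy version of the tokenize-and-wildcard idea described above, assuming a crude tokenizer and treating a `$x$` wildcard as a non-greedy run of tokens that stops at `,` `;` `)` `}` `]`. Unlike the real tool, this sketch ignores nesting level entirely:

```python
import re

BOUNDARIES = {",", ";", ")", "}", "]"}

def tokens(src):
    # Crude tokenizer: $name$ wildcards, identifiers/numbers, single symbols.
    # Whitespace (including newlines) disappears, so patterns span lines.
    return re.findall(r"\$\w+\$|\w+|\S", src)

def match(pat, toks, bindings=None):
    """Match a token-level pattern against a token list. A $name$ wildcard
    binds a non-greedy run of tokens and may not cross an expression
    boundary. This toy does not track nesting level."""
    bindings = {} if bindings is None else bindings
    if not pat:
        return bindings if not toks else None
    head, rest = pat[0], pat[1:]
    if re.fullmatch(r"\$\w+\$", head):
        bound, i = [], 0
        while True:
            trial = {**bindings, head: " ".join(bound)}
            result = match(rest, toks[i:], trial)
            if result is not None:
                return result
            if i >= len(toks) or toks[i] in BOUNDARIES:
                return None  # wildcard cannot swallow a boundary token
            bound.append(toks[i])
            i += 1
    if toks and toks[0] == head:
        return match(rest, toks[1:], bindings)
    return None

# The call spans two lines, but tokenization makes that invisible:
hit = match(tokens("foo( $a$ , $b$ )"),
            tokens("foo(longArg1,\n    shortArg2)"))
print(hit)
```
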

------
sitkack
His example is poor, but the message is that the body of a record can cross
line boundaries. While the UNIX tool chain is predicated on the concept line
== record, this doesn't have to be the case. With a generic record level
marshalling system the class of problems solved by composing command line
tools together would be greatly expanded.

What Pike describes is analogous to the RecordReader in Hadoop.
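
The record-level idea can be sketched as a reader that splits a stream on an arbitrary record delimiter rather than on newlines, loosely analogous to Hadoop's RecordReader mentioned above (the function and its delimiter convention are invented for illustration):

```python
import io

def records(stream, delimiter, chunk_size=4096):
    """Yield records from a stream, split on an arbitrary delimiter.
    Records are free to contain newlines."""
    buf = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        while delimiter in buf:
            rec, buf = buf.split(delimiter, 1)
            yield rec
    if buf:
        yield buf  # trailing record with no final delimiter

# A "record" spanning multiple lines, delimited by a blank line:
data = io.StringIO("name: a\nsize: 1\n\nname: b\nsize: 2\n")
print(list(records(data, "\n\n")))
```
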

~~~
dougabug
Except that Pike's paper precedes Hadoop by two decades.

~~~
imd
That's not an 'except', grandparent never implied otherwise.

------
dllthomas
In terms of flexibility, I think this would be a fantastic addition to the
tools. Having it be a shell var instead of an argument might be worthwhile -
if I have a few stages in a pipeline dealing with the same kind of record, it
seems useful to be able to say

( RECORD_PATTERN=somepattern; my | pipe | line | whatever )

rather than

my -R somepattern | pipe -R somepattern | line -R somepattern | whatever -R
somepattern

~~~
stcredzero
It could just be a set of separate tools one could pipe data to.

<http://news.ycombinator.com/item?id=4113231>

~~~
dllthomas
Separate tools have the advantage (over regex) of handling nesting properly,
which could certainly be significant. On the other hand, handling deep-enough
nesting with regexp is usually not hard, and when you're stringing together a
bunch of unix commands quickly you're usually looking for "good enough". I
don't want to have to write a new everything to handle a new format. Maybe
there's something in between?

~~~
stcredzero
I guess I didn't explain well enough, since you completely misunderstood my
suggestion.

 _Separate tools have the advantage (over regex) of handling nesting properly,
which could certainly be significant._

Okay, thanks, but I've known about basic automata theory since I was an
undergrad, two decades ago.

I had something like this in mind:

    
    
        ls -af | jsonify 'ls -af' | this_reads_a_json_stream
    

The jsonify command would retrieve and run a script from a central repository,
which people could contribute code to. This way, the parsing efforts of one
coder could be re-used by the rest of the world.
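
A minimal sketch of what one entry in such a shared repository might look like: an ad-hoc parser for `ls -l`-style lines, written once so others can reuse it. The `jsonify_ls_l` name, the field layout, and the central repository are the commenter's proposal, not existing tools:

```python
import json

def jsonify_ls_l(line):
    """Parse one 'ls -l'-style line into a JSON record (exactly the kind
    of ad-hoc parser this scheme is meant to centralize)."""
    fields = line.split(None, 8)  # 9th field keeps spaces in the filename
    return json.dumps({
        "mode": fields[0],
        "size": int(fields[4]),
        "name": fields[8],
    })

line = "-rw-r--r-- 1 user group 4096 Jun 12 09:00 notes.txt"
print(jsonify_ls_l(line))
```
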

~~~
dllthomas
>Okay, thanks, but I've known about basic automata theory since I was an
undergrad, two decades ago.

I wasn't trying to educate; I was discussing the relevant limitations of my
approach. The fact that I can't spin a perfect regexp for anything (including
JSON, sexps, XML) that nests arbitrarily deeply is an issue with what I
proposed; one that I think can be worked around sufficiently, but an issue
nonetheless, and I wanted to acknowledge it. I don't see what prompted the
defensiveness I read in your comment; I'd expect most people here to know at
least that much automata theory. I've not known it for _quite_ two decades,
but two decades ago I was 8. I'll reply to the constructive bits of your
response separately.

~~~
stcredzero
_I wasn't trying to educate; I was discussing the relevant limitations of my
approach._

Oh, sorry, I thought you were implying that about my approach.

~~~
dllthomas
Ah, gotcha. Yeah, that would have been incoherent.

------
saintfiends
Given the examples in the paper and how old it is, I wonder what would be an
elegant way to do it today with the available tools.

------
snorkel
Nope, can't say I've ever needed to do any 2D pattern matching on the UNIX
command line. Sure, back in the age of dinosaurs, when the command line was
being used for 2D graphics, this may have been useful, but... we have actual
GUI interfaces for that now.

~~~
quanticle
What about source code? Like Pike, I think that it's a little bit nonsensical
that

    
    
        foo(shortArg1, shortarg2, shortArg3); 
    

is easy to find, but

    
    
        foo(longArg1, 
            methodCall().stuff(),
            evenMoreComplicatedStuff);
    

is much more difficult to grep for.
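
One way to see the asymmetry concretely: a line-at-a-time match never sees the whole call, while the same pattern applied to the full buffer can span it. This is a sketch using Python's `re` with DOTALL as a stand-in for structural matching, not Pike's actual mechanism:

```python
import re

source = """foo(longArg1,
    methodCall().stuff(),
    evenMoreComplicatedStuff);
"""

# Line-at-a-time matching (what grep does) never sees the whole call.
line_hits = [l for l in source.splitlines() if re.search(r"foo\(.*\);", l)]

# Matching over the whole buffer lets '.' cross newlines.
buffer_hit = re.search(r"foo\(.*?\);", source, re.DOTALL)

print(len(line_hits), buffer_hit is not None)
```
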

------
vegas
You, sir, have made my morning (I mean afternoon).

Thanks!

------
calinet6
... why would you ever use Unix tools to solve those sorts of problems?

~~~
dllthomas
Because you're already at the command line and your skills with them are such
that it's actually lower impedance to just rock it out, rather than write up a
proper script. Not true of everyone, but it happens.

------
gouranga
Possibly true, but we have perl, python, ruby to deal with those cases which
all embed regex in a structural way.

Occasionally you can just convert it to lines first as well and problem
solved.

~~~
_delirium
I don't think they really give you a structural way to use regexes; more of a
procedural way, where you can embed regexes within explicit loops that iterate
over lines, matching and updating state-variables as you go. Unlike regexes,
which are declarative and abstract away the details of the match algorithm and
its internal match-progress variables, _you_ are explicitly maintaining the
match state there. I find myself writing manual-FSM code like
if($scanning_for_new_record) { /.../; }. And even that only really works if
you can do a one-pass match, without needing to backtrack across line
boundaries.

It's inelegant enough that I sometimes do a two-step process instead: 1)
transform the input so that whatever view I want of it maps to a line-oriented
format; and then 2) process the result in the standard Unix line-oriented
fashion.
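
The two-step process can be sketched concretely: first flatten each multi-line record onto one line, then let ordinary line-oriented tools take over. Blank-line-separated records and the tab join are assumptions for the example:

```python
# Step 1: transform blank-line-separated, multi-line records into one
# line per record, so step 2 can be ordinary grep/sed/awk.
raw = "host: a\nstatus: up\n\nhost: b\nstatus: down\n"

flat = ["\t".join(block.splitlines())
        for block in raw.split("\n\n") if block.strip()]

# Step 2: standard line-oriented filtering now sees whole records.
down = [l for l in flat if "status: down" in l]
print(down)
```
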

~~~
Roboprog
Seems like it shouldn't be too hard to write a little function (Perl, Python,
whatever) to act as a "visitor" holding this state, then just pass it little
closures (maybe in a hash/map?) that evaluate the regexes and pass each match
into a code block.

Something like (perl):

    
    
        visit_matches({
            '( +)' => sub { $x += length($1); },
            '(#+)' => sub { print $1, ' at ', $y, ',', $x; $x += length($1); },
            '(\n)' => sub { $y++; $x = 0; }  # reset column on newline
        });
    

(not an exact match to the pseudo-awk, but enough to get the idea)

