Rob Pike: "Current Unix tools are weakened by the built-in concept of a line" (cat-v.org)
137 points by akkartik on June 14, 2012 | 106 comments



The real problem is that Unix commands produce flat text output without any information about how to parse that text back into structured data. Any user who wants the structured version of the data has to parse it themselves, but these parsers are ad hoc and incomplete by their very nature.

People praise perl, sed, awk, cut, etc. for being good at text processing. But the only reason they need these text-processing tools for pipelines is because they are trying to recover structure from the data that was already present before the previous stage of the pipeline threw that structure away by dumping it to flat text!

Text is obviously a convenient way for humans to view a program's output, so clearly it's useful that all Unix programs (ls, ps, etc) can dump their output as text. But there's no need to dump to text until the output is being sent to a human. If you're piping "ls | grep" there is no reason for "ls" to dump to text and "grep" to parse it back from text, especially since "grep" doesn't know anything about the format of ls's output. It would be way more convenient if you could say something like:

    ls | grep 'file.size > 1M'
But the only way to do this today is to parse ls's output first. There would be no reason for this if ls could send structured data to grep.

What I'm describing is similar to Monad, Microsoft's next-gen shell. AIUI it can send .NET objects between processes instead of flat text. But IMO it's too imposing to mandate a single object representation like .NET objects.

I'm experimenting with the idea of letting people specify the output of command-line utilities as a Protocol Buffer schema, for example:

  message DirectoryEntry {
    optional uint64 inode = 1;
    optional string name = 2;
    optional uint64 size = 3;
    // etc.
  }
I think this could be a compelling way of making the next generation of usability in command-line pipelines, by saving people from having to write ad-hoc parsers all the time.


For people who may not have heard what happened to Microsoft's experimental Monad shell: it turned into an actual product called PowerShell that comes with Windows 7. It's very nice, nice enough that I wish there were a good Linux port.


> The real problem is that Unix commands produce flat text output without any information about how to parse that text back into structured data. Any user who wants the structured version of the data has to parse it themselves, but these parsers are ad hoc and incomplete by their very nature.

There are a variety of serialization schemes that are quite easy to parse and would be suitable for the output of most Unix command-line tools. BEncode from bittorrent would do nicely, and it's quite easy to parse. JSON would do nicely as well.

Better yet, unify the shell with a virtual machine that is used to implement the OS, and have everything available as 1st class Objects.


> JSON would do nicely as well.

Yep, some friends of mine did this with JSON, but didn't make the schema explicit like I mean to: https://github.com/benbernard/RecordStream

> Better yet, unify the shell with a virtual machine that is used to implement the OS, and have everything available as 1st class Objects.

Please no. This is the Microsoft PowerShell approach, where everything is a .NET object. Once you start dictating representations of objects, you are dictating far too much about the implementation of individual pipeline nodes.


How can your tools all accept the same kind of structured data without dictating its representation? I don't get it.


When I talk about a "representation," I mean an in-memory format. For example, the "representation" of an HTML tree is the DOM.

Yes, you have to agree on a serialization format (JSON, Protocol Buffers, etc), but that's not the same thing. From a serialization format you can represent the data however you see fit in your process. For example, a C++ user might represent a string as a std::string object whereas a Python user would represent it as a native Python string.

The VM-based approach (like PowerShell) defines an in-memory tree representation, namely .NET objects. This means that you can't really interoperate with this stack unless you use .NET too, since you don't have an easy way of converting .NET objects to your own objects.


Just a nit: I think protocol buffers include representations, not just serialization formats. You need to have the schema of the proto to parse it correctly, know which fields are required, repeated, etc. Am I understanding you correctly?


It's true that many Protocol Buffer libraries include representations, but these are for convenience; Protocol Buffers are defined in terms of their serialization format and schema.


I don't follow. Isn't schema the same as representation? It's the equivalent of the DTD for an XML document.

To be super concrete, I can't read a file containing protos without knowing their type, what fields they contain, etc. I can however read JSON just fine without knowing the precise schema being encoded.


Schema isn't representation the way haberman is using it. He means something like implementation, which protobuf has in many languages in various VMs and can be easily ported to more.


Yes, but you could also just use IronPython and script in Python.


The more decisions you force on your users, the more reasons you give them to choose some other technology, and the less future-proof you are. For example, .NET is only really usable on Windows.


Allow a shell variable to control output record and field separators. Default to space and newline if nothing is specified.

OFS=: ORS=, ls -l # -rw-r--r--:1:uname:gname:1025:Jun 1:somefile.txt,...,...

Add a specifier for dates, usernames, groupnames, etc. DSF="%Y-%m-%d" ls -l
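
If ls actually honored such variables (it does not today, so the left half of this pipeline is purely a sketch of the proposal; the field numbers follow the sample line above), the consuming side could stay plain awk:

  OFS=: ORS=, ls -l | awk -v RS=, -F: '$5 > 1048576 { print $7 }'
That is, split records on ',' and fields on ':', then filter on the size field.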


Why does the protocol define the objects? Shouldn't the object metadata have enough information to transform the output? E.g. object is a URL vs protocol is a URL


> JSON would do nicely as well.

Until you need to incrementally process 2GB of data.


BSON in that case?


Actually I've been considering/dreaming about a thing like this for a long while now (ever since I got started with jQuery and discovered how smooth I could sail through the DOM with it). The idea I came up with (I'm most likely not the first to think of it, so tell me if someone already implemented it) is a sort of JavaScript shell for *nix that'd work similar to jQuery: passing along (collections of) JavaScript objects with properties and functions.

e.g.:

  var sum = 0;
  $("~").ls({"type": "file", "size": ">1M"}).each(function () {sum += $(this).size("mb");});
  $.echo("total size of home directory: %d MB", sum);
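For comparison, this is roughly what it takes today (a sketch assuming GNU find; like the snippet above, it only counts files larger than 1M directly under the home directory):

  find ~ -maxdepth 1 -type f -size +1M -printf '%s\n' |
    awk '{ sum += $1 } END { printf "total size of home directory: %d MB\n", sum / 1048576 }'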


Powershell on Windows does something like what you want:

    PS C:\Some Directory> $sum = 0
    PS C:\Some Directory> dir | where { $_.Length -gt 1MB } | %{ $sum += $_.Length }


First thing I looked for when I checked the comments was whether there was a mention of PowerShell. PowerShell has the concept of passing objects (via the .NET CLR) instead of passing strings. It sucks when trying to deal with streams of data, but it's fantastic for acting as script glue between various systems.

One of the things I feel Microsoft really got right.


First thing I checked for too. Second thing I looked for was someone pointing out that this is asking for object oriented systems (such as Smalltalk or Self).

But the third thing, no one mentioned: SNOBOL. Reportedly (because I haven't used it myself) it is better than AWK for complex matching.


I played with SNOBOL. It may have a more powerful matching engine, but power doesn't equate to usability. Having each line followed by 3 gotos does not good UX make.


Icon or its descendant Unicon might be worth a look, then.


Maybe. I don't think I've ever really been constrained by the power of Awk's pattern matching. I wish I could do it recursively, but that's not so much a power issue as a composability issue.


Or, one could write a library of standard parsers and serializers for the Unix tools that would parse and produce known JSON representations of data that could be passed between scripts.


i did something like that at my last job - the cluster manager was designed as a set of command line tools, and every program had a -j flag that would make it parse stdin as json and write json to stdout. we then wrote wrappers around all the linux utilities we were using, to enable the same behaviour. once the basic system was in place it let us experiment with new features very rapidly indeed because everything could be tested in isolation from the command line using automated test scripts. what was especially nice was that we could use the same test framework to chain several utilities together and test the combination, because it was all a black box with json going in and coming out.
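
not their code, but a minimal sketch of what such a wrapper can look like ("lsj" is a hypothetical name; it assumes the GNU ls -l column layout and breaks on filenames with spaces):

  lsj() {
    # emit one JSON object per directory entry instead of whitespace columns
    ls -ln "$@" | awk '!/^total/ && NF >= 9 {
      printf "{\"mode\":\"%s\",\"size\":%s,\"name\":\"%s\"}\n", $1, $5, $9
    }'
  }
once every stage reads and writes one record per line like this, the black-box testing described above falls out naturally.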


That's recordstream


recordstream seems to concentrate on lots of formats. Really, it should concentrate on lots of tools, with a shared repository of parsers for tools.

EDIT: Actually, they concentrate on JSON, but they also try to provide a generic set of tools for reading any format. I think this could be better structured to "just work."


Problem with PBs is that the receiver needs a schema to parse. JSON or S-expressions obviate that need.


Sender can send the schema, which it obviously has.


You're being too practical within Unix constraints - metadata doesn't need to be tied to a schema or format; the data only needs to carry metadata. Tell me what the metadata _does_, not what it is, and let the protocol / format decide how to format the data. E.g. object_1 is HTTP 1.1 is ...


find . -depth 1 -type 'f' -size +1M


I like your observation but I don't like your conclusion.

I think the major thing missing from the world is easy piping a la Unix. You can't pipe your list of paying customers' email addresses to paypal at the command line with some switches - really, no matter how many switches you use - and bill each one the amount stated, because "paypal" is not a commandline app. You can't pipe the results of some long-running analysis to Twitter to announce that you've finished computing it, no matter how many switches you add, because twitter is not a commandline app.

The direction you're suggesting we take things is, in fact, a fuller API. These exist. They're slower and worse.

The amazing thing about text is that it's a lowest-common denominator. Think of communicating with a person.

Communication is faster with a mind-meld where you're looking at another person's face and picking up micro-expressions and body language. That's also the easiest thing to misinterpret.

When you pipe through such a 'human-readable' lowest common denominator, you're actually setting the ground for a very dynamic and versatile channel.

I don't know what the easy solution is to the problem you bring up, but I don't think your proposal is it.

Maybe there is no easy solution. Several Unicode characters that generalize the tab and newline characters into n dimensions of separators, plus corresponding negative characters so you can put something on a 'line' out of channel (negative tab, comment, tab, text, newline, repeat), would probably solve some issues, but that is too abstract to even discuss. What we have really isn't that bad.


> The direction you're suggesting we take things is, in fact, a fuller API.

Nothing about my proposal has anything to do with an API. I'm just proposing a structured stream of data instead of an unstructured one.

> They're slower and worse.

I am proposing avoiding a serialize/parse step between every pair of pipeline elements (or using a more structured/optimized format if a serialization step is desired). Doing less work cannot possibly be slower.

> The amazing thing about text is that it's a lowest-common denominator. Think of communicating with a person.

Communicating with a person is an endless process of content negotiation. What are you and I talking about? Using our interface of plain text, we could be talking about literally anything. I could suddenly start talking about kumquat farming in Russia. At that point you could decide to follow suit and weigh in with your opinions on kumquat farming, or you could stop talking to me altogether because I've gone off-topic. If I start speaking complete gibberish, you could try to learn the language that I'm speaking, or you could start doing something equally nonsensical.

You and I can respond to unexpected communications in useful ways because we are fully autonomous, intelligent, sovereign beings that are capable of learning, creativity, and curiosity. I don't know about you, but I don't want my software to be autonomous or react to unexpected situations in unpredictable ways.

Data processing software should be as simple, predictable, and deterministic as possible. To use your example, if I somehow got an email address in my list called "send $1000 to Lucy," I don't want PayPal to decide to get smart and interpret the invalid email address as a command to send money to Lucy.


> Communicating with a person is an endless process of content negotiation. What are you and I talking about?

Right. So how does grep know it's talking to ls about dir contents with a (disk) size field and not talking to ps about process table contents with a (mem) size field? Some sort of slower content negotiation, I presume.

> cannot possibly

I suggest that every time you think that, you double-check your assumptions. In this case, I think you're just pushing all the extra hard work off onto some other part of the algorithm.


His point is that he's pushing off the extra hard work to be done only once, rather than at every pipe. I'm not sure how he needs to recheck his assumptions there since he stated them.


> I'm just proposing a structured stream of data instead of an unstructured one.

That's basically what an API is.


So things like CSV, XML, and JSON are APIs now?


Can any of those not be an interface to an application layer?


Plain text also "can be an interface to an application layer" (see SMTP, IMAP, POP, etc). Your argument is meaningless.


You are confusing the "how" (the API specification) with the "what" (the data the API exists to provide access to).


> I'm experimenting with the idea of letting people specify the output of command-line utilities as a Protocol Buffer schema, for example:

Looks like an API spec to me



No, they have command line interfaces.


I could not agree more with your comment. There's not a problem with text. Why do people pretend there is? They often only achieve making other people's jobs more difficult. Text is what people can read. People do not read binary. When something goes wrong, debugging binary formats becomes insanely cumbersome.

The concept of lines is a human one. It is how humans parse. If humans could parse without needing the concept of a "line" then, e.g., there would be no problems with programming in C which has very shaky support for the concept of "lines".

But there are problems as we all know. It's proof that people do need to think in terms of "lines". Even though the computer does not need them. The only "problem" with this is that people are not computers.

I'm not sure anyone outside of the most unrealistic nerds would agree that this is a "problem".

1. As for the paper, I can think of at least one tool/language to work with text that does keep data in a binary format while one performs a series of transformations. It's not line-based. And I would guess that Mr. Pike does not know how to use it. It's fast and efficient. Probably faster than sam.


> I could not agree more with your comment. There's not a problem with text. Why do people pretend there is?

My problem is not with text per se, but with unstructured text. I'm fine with JSON in cases where efficiency is not a top concern.

Let me ask you this; how would you do the equivalent of this hypothetical command?

  $ ls | structured-grep 'file.size > 1M'
The answer is that you can't in today's world without writing a parser (or some code that calls readdir/stat manually). That is the problem with unstructured text.

Text formats like CSV that seem simple actually end up being hugely complicated once you push them to their limits. Nothing is worse than software that breaks once something unexpected happens, like a string that contains an embedded comma.


"ls" prints a list of files. "find" knows about the filesystem and can print a list of files based on file metadata and a path (such as a filename). If you have a list of files (eg from "ls") -- you would need to look up the metadata you want to filter on; it is not part of ls' interface to give them to you directly:

  ls | xargs -I{} find {} -maxdepth 0 -size +1M
On the other hand, if you want to make a list of files and their sizes, that can be stored, sent over the network, etc -- and then filter that list, you can do:

  ls --size | awk '$1 > 1024 { print $2 }'
For most tasks, I think the fact that any human can look at the output from eg: "ls --size" and then produce a valid test dataset in the same format is more valuable than having to explicitly "cast" the metadata while processing.


> Let me ask you this; how would you do the equivalent of this hypothetical command?
> $ ls | structured-grep 'file.size > 1M'

# find . -maxdepth 1 -size +1M

really, flat text is fine. Maybe it's not perfect, but it's good enough that most people don't want to add any complexity to it that would make it incompatible.

And if you really need that complexity, it's usually worth whipping up a parser for.


"find" is a poor man's "structured-grep." It provides a bunch of functionality for filtering a result set, but is totally specific to lists of files. You can't use find with ps, netstat, iptables, ifconfig, or any other command-line program that produces a list of records.

> And if you really need that complexity, it's usually worth whipping up a parser for.

No work is worth doing if it could just as easily have been avoided.

The vision in my head has less complexity than the status quo, not more. How many flags does "find" have? One for every field name you can filter/sort by. Done right, a "structured-grep" that can grep on any field sent to it is much, much simpler.


I think I see your point, and it has merit ... but ... :)

in a way, the 'find' program is like what you envision, except it's just for files. That means that someone, somewhere along the road, had the same idea/problem (but limited to files) as you and whipped up a parser to produce that meta-data. That particular parser proved to be so useful to so many people that it became its own program.

There's more than 40 years of software "evolution" contained in Unix, and apparently retrieving structured data on the command line has only proven universally useful for files. Unix has outlived many operating systems that were more modern at the time, and I think that's partly because it lacked a "grand unifying vision". Instead it has a "small, quick&dirty unifying vision" of which "flat text processing" on the command line is a central part. It has turned out to be the greatest common denominator for being able to write programs that might be quick and dirty one-liners but ultimately got the job done. And only those tiny little utilities that proved to be universally useful were developed into bigger, more structured programs.

I'm not saying your idea is without merit, but it does apply the principle of "this concept A is useful for this particular problem-set. Let's apply it natively to all problem-sets so it can be useful there too!" (in a way like Java did with the OO concept).

When simpler visions and concepts are actually implemented in the end it usually turns out one has been replacing witchcraft with voodoo.


The solution I often contemplate is to address the process by which the data is created. (Let's assume we're talking about data you would find on the web.) That is, why don't we impose rules on that process? Why can't we "mandate" that the output be structured data from the get-go? Instead we allow vast quantities of unstructured data to be created and then we try to normalize it. I know this is a radical view, but it could make sense in some circumstances.

I disagree with your comment about CSV. But it's impossible to have a meaningful argument unless you provide an example: Give me a job to do, a CSV file and let me have a go at it. I'm serious. Post a link to a CSV file, define a task and let's see what we can do just using plain ole UNIX. Could be a fun exercise.

As for your hypothetical command, I do not understand what is so difficult about this. The stat command is what you want, not ls. No self-respecting UNIX user would parse ls when he can use stat (I recommend the BSD one over GNU.).
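
For what it's worth, the stat route for the size filter is a one-liner too (BSD stat syntax shown; GNU stat would use -c '%s %n'; like most of these it assumes no spaces in filenames):

  stat -f '%z %N' * | awk '$1 > 1048576 { print $2 }'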

But here's what I would do:

1. If your UNIX filenames have spaces in them, rename them. There is no sensible reason to leave spaces in filenames in UNIX. Fix this first before it becomes a problem.

2. Write a one-liner and save it as a function, perhaps in your .profile, or maybe in RCS, or save it as a script. There's so many ways to manipulate output as a stream. Pick one that suits your tastes. That's the beauty of UNIX. Make your own solutions as you go. There is no right or wrong answer. It is a form of customization. My choice will no doubt make some people cringe. Assuming there's no user named "[0-9]M":

   whatever(){ ls -lhS |tr '\011' '\040'|sed '/[0-9]M /!d' ;}
or save what's between the brackets as a file named "whatever". Maybe you save it in a directory called "x" and add that to your PATH. Then you do

   . whatever
Of course how long the list of files is going to be makes a difference. I might take a different approach if the output was going to be an enormous list.

There are so many ways to get what you're after. The point is that you should be able to tap out a one-liner that does the job. Maybe it takes a few iterations to get the right output. Tweak it until the output is what you want. Viewing command-line history is perfect for seeing the process of creating a one-liner to manipulate output. You can see the line grow incrementally as you build it, until you finally have the output you want. This sort of history allows you to go back to any stage in the process. If you're a vi fan, you can use vi-mode on the command line to move around the line quickly as you edit. Eventually you can hit "v" and edit the thing visually in your EDITOR, then save it. I've built over 700 useful functions this way and the number keeps growing.

I do understand there should be a way to "extract" the file size column the way Pike describes in the article. To do this I think you have to free yourself from "line-oriented" thinking and imagine another type of structure. And I think using another language you can do it. But for something as simple as this -- manipulating ls command output (cf. manipulating large datasets) -- an "ugly" one-liner suits me fine. The more you use the boring old utilities the more you can get them to do.

Regular Expressions are indeed "crude". But, to me, that is just fine in a lot of cases.


Yes, try working on the full OSI stack: you had to learn ASN.1 just to be able to read what a concrete decode was doing.

Though my boss did impress me by watching an OSI transaction in flight on our network monitor, stopping it, pointing to a dword and saying "that's wrong, and it's Sprint's broken X.400 implementation."


Take this sort of thinking to the limit, and you end up with a unification of a programming language with the OS. Lisp Machines and some of the first instances of Smalltalk were like this. (Smalltalk used to be an OS.) In this case, you don't ever parse text output from your shell tools. You just code directly against objects and streams and collections of objects.


This is the principle behind Microsoft's PowerShell. Everything is .NET objects.


I guess one can say that, 25 years later, this has been shown to be (partly) wrong.

I don't think PowerShell really solves a problem; it's too complex to work with for the majority of problems. If I want complex data handling I write a script and put #!/usr/bin/env {bash,python,perl} in the first line.

I think the point being missed here is that the nice, line-based, really simple approach works because this is how we speak and write and think (in a series of flat words, so to speak). It's extremely easy to get into this kind of handling of "data" and it's sufficient for a lot of tasks. I always admired what can be done with one line of bash/GNU utils.

As I said: if it gets more complex, we use a "real" programming language with more complex data structures anyway.


I'm wary of using structured data, that is, data that is structured in such a way that you don't need to understand what it's passing around. Keeping it simple means that the developer will always know exactly what data is being passed around.

This is important for a lot of reasons, not all of which are related only to development as a job.


Some researchers at the University of Helsinki have studied the hell out of this problem, and even provided a useful tool that is available in many Unix distributions called "sgrep".

http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html


There is also sam -d ;) http://man.cat-v.org/plan_9/1/sam

(It hides the GUI and gives it a more ed-like interface that you can easily script, but this is kind of a hack for Plan 9-nuts ;)


When is this from? (I presume this is a historical paper that was not influential at the time.)


Good question; I updated the title.

Update: oh, no room in the title. To answer your question, it's from 1987. I found it via this morning's C vs Go article (http://news.ycombinator.com/item?id=4110480)


Not to say that anyone should have to do this, but I did the following:

http://scholar.google.ca/scholar?hl=en&q=Structural+Regu...

Which yields the paper as the first result, showing its publishing year of 1987.


Not sure about the date of publication, but Rob developed these ideas in the early 80s, his Sam editor (that Ken Thompson still uses) is based on it.


offtopic:

Sam (and Acme) sounds like an interesting method at first, but when you try it out, it's really weird and slow to get stuff done. That's because, for me, the mouse is really inferior to the keyboard for the majority of tasks.

This is an excerpt from Coders at Work:

Seibel: Is there anything you would have done differently about learning to program? Do you have any regrets about the sort of path you took or do you wish you had done anything earlier?

Thompson: Oh, sure, sure. In high school I wish I’d taken typing. I suffer from poor typing yet today, but who knew. I didn’t plan anything or do anything. I have no discipline. I did what I wanted to do next, period, all the time. If I had some foresight or planning or something, there are things, like typing, I would have done when I had the chance. I would have taken some deeper math because certainly I’ve run across things where I have to get help for math. So yeah, there are little things like that. But if I went back and had to do it over I’m sure that I just wouldn’t have it in me to do anything differently. Basically I planned nothing and I just took the next step. And if I had to do it over again, I’d just have taken the next step again.

It would be interesting to know if Ken Thompson is using sam because he is a slow typist and hence doing stuff the sam way (mouse oriented) boosts his productivity or simply because he likes using sam.


Acme mouse chording is really cool, doing something similar with keyboard-oriented editors is much more tedious and often outright painful ( http://acme.cat-v.org/mouse ).

I suspect ken spends most of his time thinking rather than typing, and I have found this to be true of most great hackers.

Other famous Sam users include Brian Kernighan, Bjarne Stroustrup(!) and Tom Duff. Kernighan writes at least as much English as code and I wonder how that affects his editor usage patterns, but I suspect that even when writing natural languages most time is best spent thinking (his writing style is very concise and clear, one could say similar to ken's code).

I think the obsession with saving keystrokes is very misguided. I still use editors like vi frequently, and having to think about what magic combination of commands to use to perform a task can be very distracting; it is fun and feels good, like a tiny puzzle game built into your editor, but it doesn't help you write better code faster IMHO.

P.S.: It is interesting who has stuck with Sam and who moved to Acme (Dennis Ritchie switched to Acme, and most of the Go team at Google besides ken use it too).

See also: http://sam.cat-v.org/


> I suspect ken spends most of his time thinking rather than typing, and I have found this to be true of most great hackers.

Who cares? Slow input is still not a desirable property in a text editor.


continuing the offtopic:

You say it sounds like an interesting method at first, but then you try it out and it's weird and slow. How much of a try did you give these editors? In the same way that a new Vim user will basically use the arrow keys, "i", ":w", and ":q" and nothing else, a new user to Sam or Acme can very easily miss a lot of the power.

The Sam language is pretty powerful. For instance, you can use the "X/regexp/ command" form to apply a command over every file whose filename matches the regex, so you can (for instance) make a change to every .c file while leaving README alone. Acme lets you use the same command language, but also lets you execute other arbitrary commands by simply typing them and mid-clicking on them--yes, I know Vi lets you do something similar, but with Acme you'll typically build up a "guide" file, full of convenient commands that you just sweep over with the middle mouse button and release to execute. Acme also presents files in a sort of tiling window manager fashion that makes it one of the most convenient editors I've used. I'll frequently have up to about 50 files open, which in Emacs or Vi would drive me nuts trying to constantly switch around them or split the screen into one or two panels (oh boy, C-x b!). In Acme, I can always see the titles of the individual buffers, and if I need to see into the file I can with a mouse click or two expand the buffer into a convenient size.

I think one change that could really have a big impact is making the ESC key switch back to the command window; this makes things a lot more familiar for Vi-heads, and it makes sense to eliminate a mouse movement for something that simple and frequent.

That ended up a lot longer than I intended, but I wanted to try and share a little bit of my thoughts on Acme and Sam, and encourage others to give them a more thorough try. At this point, I use vi for quick edits, but when I have to write a lot or make a lot of changes, I bring out Acme. When I'm stuck on a strange system that's not my own, I have a tarball of an old version of Sam which, with a little tweaking, has compiled on every version of Unix I've had to use.


I have been using Emacs and doing the same things in Acme for around two weeks now. It's just that whenever I need to type I have to disconnect from the mouse, and whenever I need to use the mouse I have to disconnect from the keyboard. Also, when using the mouse I miss the precision with which I can navigate to different parts of the screen quickly and efficiently; there is always a delay and a re-focus to get to the target.


Every Scribd link on Hacker News is marked as private for me for some reason. Is this broken for anyone else, or..?


Yes, it's broken.


Me, too. I'd really like to read this...


But surely the main non-scribd link works?


It didn't work when I clicked initially; it worked later on.


I think tools that generate or pass records or rows of data should certainly have the option of providing a schema-based output as well. In addition to "find ... -print0", providing either "find ... -proto" or "find ... -json" (assuming the schema for the JSON is known, as in a /usr/share/SOMEWHERE/find.json-schema or similar) would be really appreciated. And let's not go overboard with this, as there are many cases where parsing is genuinely the sanest approach.


I ran into a wall when trying to use unix-y tools to do somewhat complex regex/replace operations for code refactors. Basically, anything inside a single line is easy, but once you cross that line barrier, the complexity increases dramatically. That rendered the approach useless, because most programming languages allow you to add arbitrary newlines between any two tokens in the language.


I know what you're saying. I ran into the same problem a while back and ended up hacking a tool that does the kind of structured pattern matching I wanted. Its syntax is a bit awkward, but what the hell - you can find it here if you're interested: https://github.com/nhaehnle/patrex


I also hacked up a tool for structural search. I'd noticed IntelliJ's built-in structural search has a very simple pattern language - you just type the code you're searching for and insert a wildcard like $x$ or $foo$ where an expression can vary. However, it's restricted to the languages IntelliJ can parse.


Tokenizing a language is much simpler though. So I just did that, and let wildcards match like <token>*? up to the next expression boundary at the same nesting level - the boundaries were ',' ';' ')' '}' and ']' when I tried this on java, C and perl.

This turns out to be simple enough and powerful enough to be useful. For instance, your example:

boost::bind(& ${id} $( :: ${id} )+, $( boost::ref( *this ) )|ref| $.* )

Would be: boost::bind(& $id1$ :: $id2$, boost::ref( *this ) ) (for 2 args - I'd need to repeat for more args like so: boost::bind(& $id1$ :: $id2$, boost::ref( *this ), $id3$))

This works because $id1$ is non-greedy, and whitespace is ignored. I did tend to tweak the tool to what I was searching for (I'd have dropped ',' as a delimiter for this one), which I guess is cheating on keeping the syntax simple!


As a sysadmin, I have two general rules about scripting:

1. If it's longer than 15 lines, rewrite it in {python,ruby,perl}.

2. Never accidentally rewrite grep, sed or awk.


His example is poor, but the message is that the body of a record can cross line boundaries. While the UNIX tool chain is predicated on the concept line == record, this doesn't have to be the case. With a generic record level marshalling system the class of problems solved by composing command line tools together would be greatly expanded.

What Pike describes is analogous to the RecordReader in Hadoop.
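
awk already has a restricted form of this: setting RS to the empty string switches it to blank-line-separated "paragraph" records, so a record body can span lines (file and pattern below are placeholders):

  awk 'BEGIN { RS = ""; FS = "\n" } /pattern/ { print $1 }' file
A generic marshalling layer would generalize exactly that: the record boundary could be anything, not just a newline or a blank line.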


Except that Pike's paper precedes Hadoop by two decades.


That's not an 'except', grandparent never implied otherwise.


Coincidentally, == is also the standard Unix separator for multiline records, IIRC.


In terms of flexibility, I think this would be a fantastic addition to the tools. Having it be a shell var instead of an argument might be worthwhile - if I have a few stages in a pipeline dealing with the same kind of record, it seems useful to be able to say

( RECORD_PATTERN=somepattern; my | pipe | line | whatever )

rather than

my -R somepattern | pipe -R somepattern | line -R somepattern | whatever -R somepattern


It could just be a set of separate tools one could pipe data to.

http://news.ycombinator.com/item?id=4113231


Separate tools have the advantage (over regex) of handling nesting properly, which could certainly be significant. On the other hand, handling deep-enough nesting with regexp is usually not hard, and when you're stringing together a bunch of unix commands quickly you're usually looking for "good enough". I don't want to have to write a new everything to handle a new format. Maybe there's something in between?


I guess I didn't explain well enough, since you completely misunderstood my suggestion.

Separate tools have the advantage (over regex) of handling nesting properly, which could certainly be significant.

Okay, thanks, but I've known about basic automata theory since I was an undergrad, two decades ago.

I had something like this in mind:

    ls -af | jsonify 'ls -af' | this_reads_a_json_stream
The jsonify command would retrieve and run a script from a central repository, which people could contribute code to. This way, the parsing efforts of one coder could be re-used by the rest of the world.
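
A purely local stand-in for the idea, with the central repository replaced by a directory of per-command awk parsers (jsonify and ~/.jsonify.d are hypothetical names, not an existing tool):

  jsonify() {
    # look up a parser keyed by the producing command, e.g. "ls_-af.awk"
    parser="$HOME/.jsonify.d/$(printf '%s' "$1" | tr ' /' '__').awk"
    if [ ! -r "$parser" ]; then
      echo "jsonify: no parser installed for '$1'" >&2
      return 1
    fi
    awk -f "$parser"    # read the command's output on stdin, emit JSON
  }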


>Okay, thanks, but I've known about basic automata theory since I was an undergrad, two decades ago.

I wasn't trying to educate; I was discussing the relevant limitations of my approach. The fact that I can't spin a perfect regexp for anything (including JSON, sexp, xml) that nests arbitrarily deeply is an issue with what I proposed - one that I think can be worked around sufficiently, but an issue nonetheless, and I wanted to acknowledge that. I don't see what prompted the defensiveness I take from your comment - I'd expect most people here to know at least that much automata theory. I've not known it for quite two decades, but two decades ago I was 8. I'll reply to the constructive bits of your response separately.


I wasn't trying to educate; I was discussing the relevant limitations of my approach.

Oh, sorry, I thought you were implying that about my approach.


Ah, gotcha. Yeah, that would have been incoherent.


It appears that I did misunderstand your suggestion - I'd thought you to be proposing a separate set of tools (grep, sed, etc) for operating on each packet format. Instead, it appears you meant a set of tools for converting output to a particular format? That still means either 1) reimplementing tools to deal with each format, or 2) many conversions and perhaps still an inability to get the data to chunk the desired way for any particular tool. Just the ability to have sed, grep, and sort chunk in an arbitrary way would be significant.


If one could simply get all of the data parsed to a particular format, people using such tools could easily pick which data they need and discard the rest.

Also, there would be "many conversions" - but I'm envisioning that these would be shared in a library, so it would "just work" for most developers.


Given the examples in the paper and how old it is, I wonder what would be an elegant way to do it today with the available tools.


Nope, can't say I've needed to do any 2D pattern matching on the UNIX command line, ever. Sure, back in the age of dinosaurs when the command line was being used for 2D graphics this may have been useful, but ... we have actual GUI interfaces to do that now.


What about source code? Like Pike, I think that it's a little bit nonsensical that

    foo(shortArg1, shortarg2, shortArg3); 
is easy to find, but

    foo(longArg1, 
        methodCall().stuff(),
        evenMoreComplicatedStuff);
is much more difficult to grep for.
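
To be fair, pcregrep's multiline mode gets partway there today (a sketch, not an answer to structural matching in general; it is still fooled by a ");" inside a string literal):

  pcregrep -M '(?s)foo\(.*?\);' *.c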


Look beyond the immediate, contrived example. Text can have meaningful structure beyond lines, which line-oriented tools handle poorly. So much power is available with UNIX command line tools, but often you must abandon them for python, ruby, or whatever. The proposed modest extension would allow the command line to be comfortably used for a larger problem space.


Graphics? What about just grepping over more than one line of code?


There are plenty of reasons to do graphics on the command line (e.g. batch operations) and anyway that is not at all the main point of the paper, it's just an example.


What are you talking about? Graphics have nothing to do with it.


You, sir, have made my morning (I mean afternoon).

Thanks!


... why would you ever use Unix tools to solve those sorts of problems?


Because you're already at the command line and your skills with them are such that it's actually lower impedance to just rock it out, rather than write up a proper script. Not true of everyone, but it happens.


Possibly true, but we have perl, python, ruby to deal with those cases which all embed regex in a structural way.

Occasionally you can just convert it to lines first as well and problem solved.


I don't think they really give you a structural way to use regexes; more of a procedural way, where you can embed regexes within explicit loops that iterate over lines, matching and updating state-variables as you go. Unlike regexes, which are declarative and abstract away the details of the match algorithm and its internal match-progress variables, you are explicitly maintaining the match state there. I find myself writing manual-FSM code like if($scanning_for_new_record) { /.../; }. And even that only really works if you can do a one-pass match, without needing to backtrack across line boundaries.

It's inelegant enough that I sometimes do a two-step process instead: 1) transform the input so that whatever view I want of it maps to a line-oriented format; and then 2) process the result in the standard Unix line-oriented fashion.
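
As a crude instance of that two-step dance, squashing newlines so a multi-line call becomes a single greppable line (foo.c is a placeholder; this throws away line numbers and assumes no ';' inside the argument list):

  tr '\n' ' ' < foo.c | tr -s ' ' | grep -oE 'foo\([^;]*\);'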


Seems like it shouldn't be too hard to write a little function (perl, python, whatever) to act as a "visitor" to hold this state, then just pass it little closures (maybe a hash/map?) to evaluate the regexes and pass the match into a code block?

Something like (perl):

  &visit_matches({
    ' +' => sub { $x += length( $1 ); },
    '#+' => sub { print $1, ' at ', $y, ',', $x; $x += length( $1 ); },
    '\n' => sub { $y++; }
  });

(not an exact match to the pseudo-awk, but enough to get the idea)


You do realize that this paper is from the early 80s, when Perl, Python and Ruby did not exist, right?

As for your second statement, if you had read the paper you would have seen why that is not the case.


Perl - Appeared in 1987

A bit late in the 80s, but still.

[Edit] I read your comment too quickly and hadn't noticed that you said "early 80s". My apologies.


Keep in mind that Perl4 was the first "big" version of Perl, available in 1991 with the publication of the camel book.

When this was written companies were advertising "Cut and Paste" as a big feature for word processors.


Except that PCREs (or most of their implementations, which are what all those languages use; RE2 and Go's standard regexp package are exceptions) have some fundamental issues: http://swtch.com/~rsc/regexp/regexp1.html



