Hacker News
Sculpting text with regex, grep, sed, awk, emacs and vim (2012) (might.net)
119 points by aburan28 975 days ago | 32 comments

One really cool tool that web programmers should know if they work with JSON data a lot is jq: http://stedolan.github.io/jq/. It's a line-oriented tool like sed, awk, and grep, but it's for manipulating JSON data. It can be really useful for quickly making sense of JSON-formatted log files. For example, you can do something like

    jq -c 'select(.server == "slow_server") | .timings.end_time - .timings.start_time' < my_log_file
where your log file might look like

    {"server": "slow_server", "timings": {"end_time": 1406611619.90, "start_time": 1406611619.10}}
to get your web request timings.

Because it's line-oriented, it also works seamlessly with other tools, so you can pipe the output to, say, sort, to find the slowest requests.
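For anyone without jq installed, here is a rough Python sketch of the same idea: filter JSON log lines by server, compute the request duration, and sort to find the slowest requests. It assumes the field names from the sample log line above (`server`, `timings.end_time`, `timings.start_time`); the function name, the `fast_server` entry and the extra sample lines are made up for illustration.

```python
import json


def slow_server_timings(lines, server="slow_server"):
    """Yield end_time - start_time for each JSON log line from `server`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("server") == server:
            timings = record["timings"]
            yield timings["end_time"] - timings["start_time"]


# Hypothetical sample log, one JSON object per line.
sample_log = [
    '{"server": "slow_server", "timings": {"end_time": 1406611619.90, "start_time": 1406611619.10}}',
    '{"server": "fast_server", "timings": {"end_time": 1406611620.05, "start_time": 1406611620.00}}',
    '{"server": "slow_server", "timings": {"end_time": 1406611625.50, "start_time": 1406611623.00}}',
]

# Slowest requests first, much like piping the jq output through `sort -rn`.
slowest_first = sorted(slow_server_timings(sample_log), reverse=True)
print(slowest_first)
```

To run it on a real file, pass an open file object instead of `sample_log`, since iterating over a file yields its lines.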

Somewhat similar to xmlstarlet (http://xmlstar.sourceforge.net/docs.php) for xml documents.

What a great article. Even though I've picked up a lot of this through osmosis, I wish I'd read such a clear and lucid primer of Unix basics (including the author's other articles on the subject) a few years ago.

A good follow-up to read, from the same person, is his article on relational shell programming: http://matt.might.net/articles/sql-in-the-shell/

Thanks for the kind words.

The comment on jq (which I'd never seen) had me thinking about the relational shell programming again.

One could implement a remarkably robust relational DB at the shell with jq.

The composite (or prime) filter regexp is brilliant.

See the OP's linked article http://zmievski.org/2010/08/the-prime-that-wasnt for details.
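A quick way to play with it, assuming the regexp meant here is the well-known unary one, `^1?$|^(11+?)\1+$`: write n as a string of n ones, and the pattern matches exactly when n is 0, 1, or composite. This is a Python sketch, not the article's own code, and `is_prime` is just an illustrative name.

```python
import re

# Matches a unary string ("1" repeated n times) when n is 0, 1, or composite:
#   ^1?$         -> n is 0 or 1
#   ^(11+?)\1+$  -> some group of 2+ ones repeats 2+ times, i.e. n = a * b
COMPOSITE = re.compile(r"^1?$|^(11+?)\1+$")


def is_prime(n):
    """Return True if n is prime, by failing to match the composite regexp."""
    return COMPOSITE.match("1" * n) is None


print([n for n in range(30) if is_prime(n)])
```

It relies on backtracking over every candidate divisor, so it is a curiosity rather than something to use on large numbers.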

One tool that should have become more common but hasn't is Rob Pike's structural regular expressions, which are a fascinating generalization of awk for non-line-oriented data.


Many people have tried to generalize unix pipes and homogeneous data; few have succeeded.

I prefer pcregrep; it's more feature-rich and its syntax is much neater. Using \d instead of [0-9], and so on, makes regexes more readable.

Is that any different than grep --perl (should be available most places GNU grep is)? I use that for complex regexes.

For even longer ones I just started using perl with /x, so you can use insignificant whitespace and comments.
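The same trick exists outside Perl. As a sketch in Python, `re.VERBOSE` plays the role of `/x`: whitespace in the pattern is ignored and `#` starts a comment. The pattern and the sample line below are invented for illustration.

```python
import re

# re.VERBOSE is Python's counterpart to Perl's /x flag.
ATTRIBUTE = re.compile(
    r"""
    (\w+)        # attribute name
    =            # literal equals sign
    "([^"]*)"    # quoted value, captured without the quotes
    """,
    re.VERBOSE,
)

# Hypothetical input line.
match = ATTRIBUTE.search('<a href="index.html">home</a>')
print(match.groups())
```

To match a literal space or `#` in this mode, escape it or put it inside a character class.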

I didn't really compare, since grep marks --perl as "highly experimental". pcregrep, on the other hand, has been around for a long time already.

One very useful feature in pcregrep is outputting the matched subpattern only. For example if you do:

    echo 'abcdefg' | pcregrep -o2 'a(bc)d(ef)g'
It will output only the second matched subpattern.

And, depending on the context, it makes them wrong: \d is not equivalent to [0-9].

Not only if it uses different locales. Normally it is equal to [0-9]. Don't even start on Unicode and regexes.
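To make the difference concrete, here is one context where the two are not equivalent. This is Python's `re` module on `str` patterns, not pcregrep or grep, whose behavior depends on their own Unicode options: by default `\d` matches any Unicode decimal digit, while `[0-9]` matches only the ten ASCII digits.

```python
import re

arabic_indic = "\u0661\u0662\u0663"  # the digits one, two, three in Arabic-Indic script

# On str patterns, \d covers every Unicode decimal digit by default.
unicode_digits = re.fullmatch(r"\d+", arabic_indic) is not None

# The explicit character class only covers ASCII 0-9.
ascii_class = re.fullmatch(r"[0-9]+", arabic_indic) is not None

# re.ASCII restricts \d to [0-9] again.
ascii_flag = re.fullmatch(r"\d+", arabic_indic, re.ASCII) is not None

print(unicode_digits, ascii_class, ascii_flag)
```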

Perl is better than sed/awk, but you're still going to write unreadable code. Python or Ruby are a better choice for maintainable scripts.

This is just a side effect of being the first popular scripting language. While Perl was being used by anyone & everyone to get stuff done, people who desired a more formalized and strict object-oriented structure migrated to Python. Non-programmers and get-it-done types hacked together Perl since the notion of objects (let alone subroutines) was beyond their ability. It's natural Python scripts are easier to read, since they tend to be written by more professional programmers.

Now that Python has eclipsed Perl's popularity, it's just a matter of time before you start seeing the same level of quality issues in Python. The untrained, non-programmers will be creating write-only scripts in the new language soon enough.

It was just a few months ago that I debugged some Python scripts for a QA department at a smallish company. This code was the equivalent of any nightmare that I've seen in Perl. Not only was it all very "un-Pythonic", it didn't use classes, it hardly used subroutines, and it was equal parts commented-out attempts and "working" code. The gem was a script that wrote another Python script and executed it (written because the author only knew how to initialize multidimensional arrays, but didn't know how to build them on-the-fly).

(And yes, there were popular scripting languages before Perl. I remember arguing the superiority of Bourne shell scripts over C-shell scripts...)

Perhaps we can retire the notion of "write-only" Perl -- all languages of sufficient complexity provide the means to obfuscate.

Are you sure that reams of special-use syntax and the "there's more than one way to do it" philosophy to language design don't play some role? While it's perfectly possible to write beautiful code in Perl and ugly code in Python, it seems a stretch to claim that it's just as easy as doing the reverse.

> Not only was it all very "un-Pythonic", it didn't use classes

I thought that we were supposed to stop using classes in python.


One shouldn't discount the free concurrency offered by the unix pipelining paradigm. There are several data-mining use cases where you'd wind up writing much more performant scripts by exchanging I/O between processes, with less brainpower than it takes to write a bunch of for-loops in python.

No, it really isn't.

Languages like Awk and especially tools (you could say "language" because it's strictly true but, come on) like sed have built in safeguards against writing code that stretches for more than a certain number of lines. That safeguard is that it's a really awful experience to actually do that. As a result these scripts tend to be short and to the point.

Perl does not have this safeguard.

It depends on what you want to do. Even for simple text processing in one-liners, there are quite a few common tasks that are difficult in awk. A big one for me is capture groups in regular expressions:

    perl -ne 'print $1 if /foo="(.*?)"/'
    awk '/foo=".*"/ { ??? }'
You can do it with gawk, but it's ugly (and gawk has no non-greedy .*?, so the capture uses [^"]* instead):

    gawk 'match($0, /foo="([^"]*)"/, a) { print a[1] }'
Another is manipulating hexadecimal numbers, which is also a gawk extension.
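For comparison, a sketch of the same capture-group extraction in Python, using only the standard `re` module. `extract_foo` and the sample lines are made-up names and data.

```python
import re

FOO = re.compile(r'foo="(.*?)"')  # non-greedy capture, as in the perl one-liner


def extract_foo(lines):
    """Yield the first foo="..." value found on each line that has one."""
    for line in lines:
        match = FOO.search(line)
        if match:
            yield match.group(1)


# Hypothetical input lines.
sample = ['x foo="bar" baz="qux"', "nothing to see", 'foo=""']
print(list(extract_foo(sample)))
```

Calling `extract_foo(sys.stdin)` after `import sys` makes it usable in a pipeline, at the cost of being noticeably longer than the perl version.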

Python and Ruby are just as bad as Perl for this, with Python being the least fluent of the three for scripting.

If you want a proper language to support your scripting goals, go right up to something like OCaml or Haskell and you'll skip all the pointless stringly-typed problems of perl/ruby/python.

TCL, REBOL or Red - or maybe some kind of Lisp even - could be better than Python for scripting. There are probably other good languages for this, like maybe Io.

Haskell and OCaml and Java and C++ are about equally badly suited for the job. No, they don't make good scripting languages. And they don't even want to. Why anyone would try to write shell scripts with them is really beyond me.

It's hard to take people seriously who suggest Haskell as a good language for general scripting. That's some powerful religion.

Yeah, "stringly-typed" isn't really a problem when you're mostly dealing with files made up of strings/lines. Interfacing between programs and files which output mostly idiosyncratic output over an interface of files and strings isn't really made any easier or more robust by using a heavy type system and functional purity...

Can you give an example where Haskell or OCaml would be more appropriate than the scripting languages you mention?

You're in for a fun surprise if you ever do embedded linux software, where a full python or ruby interpreter will either blow your flash space requirements, be too slow, or simply be unavailable. Perl is a way better option but even then might be too heavy. Busybox, however, will have a sed or awk.

These are tools that are not going away tomorrow just because something better exists.

awk+sed, perl, and python are at least somewhat universal; ocaml and haskell are not. Heck, awk+sed even more so: almost any unix system, no matter how old or odd, has some version of those two tools on it.

Bad programmers, not languages, write unreadable code.

Everyone writes unreadable code sometimes. Bad languages make it easier.

Insinuating that perl is a bad language.

It's not bad, it's just drawn that way.


You don't know how to write readable perl?

no, because Perl is a write only language ;)
