

Sculpting text with regex, grep, sed, awk, emacs and vim (2012) - aburan28
http://matt.might.net/articles/sculpting-text/

======
natnat
One really cool tool that web programmers should know if they work with JSON
data a lot is jq:
[http://stedolan.github.io/jq/](http://stedolan.github.io/jq/). It's a
line-oriented tool like sed, awk, and grep, but it's for manipulating JSON data.
It can be really useful for quickly making sense of JSON-formatted log files.
For example, you can do something like

    
    
        jq -c 'select(.server == "slow_server") | .timings.end_time - .timings.start_time' < my_log_file
    

where your log file might look like

    
    
        {"server": "slow_server", "timings": {"end_time": 1406611619.90, "start_time": 1406611619.10}}
    

to get your web request timings.

Because it's line-oriented, it also works seamlessly with other tools, so you
can pipe the output to, say, sort, to find the slowest requests.
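
If jq isn't available, roughly the same filter can be sketched in Python. This is only a sketch: it assumes one JSON object per line with the `server` and `timings` fields from the sample line above. The names `request_durations` and `sample_log` are made up for the illustration, and the second and third records are invented sample data.

```python
import json


def request_durations(lines, server="slow_server"):
    """Yield end_time - start_time for each log record from the given server."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        if record.get("server") == server:
            timings = record["timings"]
            yield timings["end_time"] - timings["start_time"]


# Hypothetical log contents: one JSON object per line.
sample_log = [
    '{"server": "slow_server", "timings": {"end_time": 1406611619.90, "start_time": 1406611619.10}}',
    '{"server": "fast_server", "timings": {"end_time": 1406611620.05, "start_time": 1406611620.00}}',
    '{"server": "slow_server", "timings": {"end_time": 1406611623.50, "start_time": 1406611621.00}}',
]

# Equivalent of piping the jq output through a reverse numeric sort.
slowest_first = sorted(request_durations(sample_log), reverse=True)
print(slowest_first)
```

With a real file you would pass the open file object instead of `sample_log`, since iterating over a file yields its lines.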

~~~
troels
Somewhat similar to xmlstarlet
([http://xmlstar.sourceforge.net/docs.php](http://xmlstar.sourceforge.net/docs.php))
for xml documents.

------
nilkn
What a great article. Even though I've picked up a lot of this through
osmosis, I wish I'd read such a clear and lucid primer of Unix basics
(including the author's other articles on the subject) a few years ago.

A good follow-up to read, from the same person, is his article on relational
shell programming:
[http://matt.might.net/articles/sql-in-the-shell/](http://matt.might.net/articles/sql-in-the-shell/)

~~~
mattmight
Thanks for the kind words.

The comment on jq (which I'd never seen) had me thinking about the relational
shell programming again.

One could implement a remarkably robust relational DB at the shell with jq.

------
agumonkey
The composite (or prime) filter regexp is brilliant.

    
    
        ^(11+)(\1)+$
    

See the article linked from the OP,
[http://zmievski.org/2010/08/the-prime-that-wasnt](http://zmievski.org/2010/08/the-prime-that-wasnt),
for details.
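
To see why it works: write n in unary as a string of n ones. The pattern `^(11+)(\1)+$` matches only if the string splits into two or more equal chunks of at least two ones each, which means n is composite. A small Python sketch of that idea (the helper name `is_prime` is mine, and this is a curiosity, not an efficient primality test):

```python
import re

# Matches a unary string whose length is composite:
# (11+) picks a chunk of length >= 2, (\1)+ repeats it one or more times.
COMPOSITE = re.compile(r"^(11+)(\1)+$")


def is_prime(n):
    """Return True if n is prime, using the unary-regex trick."""
    if n < 2:
        return False  # 0 and 1 are neither prime nor matched by the pattern
    return COMPOSITE.match("1" * n) is None


print([n for n in range(20) if is_prime(n)])
```

The regex engine is effectively doing trial division by backtracking over chunk lengths, so it gets slow quickly as n grows.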

------
shanemhansen
One tool that should have become more common but isn't is Rob Pike's
structural regular expressions, which are a fascinating generalization of awk
for non-line-oriented data.

[http://doc.cat-v.org/bell_labs/structural_regexps/](http://doc.cat-v.org/bell_labs/structural_regexps/)

Many people have tried to generalize Unix pipes and homogeneous data; few have
succeeded.

------
shmerl
I prefer pcregrep; it's more feature-rich and its syntax is much neater. Using
\d instead of [0-9] and the like makes regexes more readable.

~~~
chubot
Is that any different than grep --perl (should be available most places GNU
grep is)? I use that for complex regexes.

For even longer ones I just started using perl with /x, so you can use
insignificant whitespace and comments.
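
Python's `re.VERBOSE` flag gives roughly the same effect as perl's /x, for anyone who wants to see what that looks like. A minimal sketch; the date pattern and the sample string are made up for the illustration:

```python
import re

# Hypothetical pattern: an ISO-style date such as 2012-07-29.
# Under re.VERBOSE, whitespace in the pattern is ignored and
# '#' starts a comment that runs to the end of the line.
DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

m = DATE.search("posted on 2012-07-29 by someone")
print(m.group("year"), m.group("month"), m.group("day"))
```

One thing to keep in mind: in verbose mode a literal space or `#` has to be escaped or put inside a character class.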

~~~
shmerl
I didn't really compare, since grep marks --perl as "highly experimental".
pcregrep, on the other hand, has been around for a long time already.

One very useful feature in pcregrep is outputting the matched subpattern only.
For example if you do:

    
    
        echo 'abcdefg' | pcregrep -o2  'a(bc)d(ef)g'
    

It will output only the second matched subpattern (here, `ef`).
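
For comparison, the same capture-group extraction with Python's `re` module looks roughly like this. It's a sketch of the idea, not a replacement for pcregrep's `-o2`, and the variable names are arbitrary:

```python
import re

# Same pattern and input as the pcregrep example.
match = re.search(r"a(bc)d(ef)g", "abcdefg")

# group(2) is the text captured by the second parenthesized subpattern.
second = match.group(2) if match else None
print(second)
```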

------
alayne
Perl is better than sed/awk, but you're still going to write unreadable code.
Python or Ruby are a better choice for maintainable scripts.

~~~
dsturnbull2049
Python and Ruby are just as bad as Perl for this, with Python being the least
fluent of the three for scripting.

If you want a proper language to support your scripting goals, then going
right up to something like OCaml or Haskell lets you skip all the pointless
stringly-typed problems of perl/ruby/python.

~~~
alayne
It's hard to take people seriously who suggest Haskell as a good language for
general scripting. That's some powerful religion.

~~~
hderms
Yeah, "stringly-typed" isn't really a problem when you're mostly dealing with
files made up of strings/lines. Interfacing between programs and files which
output mostly idiosyncratic output over an interface of files and strings
isn't really made any easier or more robust by using a heavy type system and
functional purity...

