The Awk book’s 60-line version of Make (benhoyt.com)
260 points by nalgeon on Sept 10, 2023 | hide | past | favorite | 65 comments



Ben notes that Kernighan regrets the way local variables are handled in Awk.

I patched GNU Awk to have a @let extension that gives you scoped locals (usable in functions as well as in BEGIN/END blocks):

  $ egawk  'BEGIN { x = 3; print x; @let (x = 4, y) { print x } print x }'
  3
  4
  3
The @ prefix is used because there is at least one existing extension like it: @include.

https://www.kylheku.com/cgit/egawk/about/

This was rejected by the GNU Awk project, though. I was encouraged to make a fork and give it some kind of different name, so I did that.


Did they cite a reason that sounded reasonable? Like the particular implementation breaks some design principle they want to stick to or something? Did they suggest it might be acceptable some other way or in some other form?

It's curious, because gawk can't claim with a straight face that it needs to stick to some legacy standard.


If I were the gawk maintainer I would be unwilling to take on features by default. It is widely used infrastructure and keeping out bugs is far more important than taking on features. If the userbase keeps asking for the same feature over and over, at that point it would be up for consideration, but no sooner.


I think the point is that the list of features added in gawk vs. POSIX awk is miles long... They've not exactly shown restraint.

That said, proving its value in a fork first seems reasonable.


The thread is here:

https://lists.gnu.org/archive/html/bug-gawk/2022-04/msg00025...

There is more around it. I had the idea in two other forms.

Initially I had a @param:<ident> syntax which indicated that the given variable is to be allocated in the parameter space (a local variable frame where function parameters go). This only worked inside functions.

Between that and @let was a @local thing.

The maintainer of GNU Awk is one of the two authors of "Fork My Code, Please!", the other being the Bash guy:

https://www.skeeve.com/fork-my-code.html

So ...


> We are glad to receive input from our user community about:

> Suggestions for new features that:

> Cannot be accomplished using existing features in a straightforward way

> Don't (too badly) break compatibility for existing code

So... in theory, your @let should qualify.


I may end up renaming it let and hiding it in --posix mode.

GNU Awk has switch, which is an extension and not hidden by @.
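
For contrast, a minimal example of that extension (a plain keyword, no @ prefix); a sketch, assuming a reasonably recent gawk:

  gawk 'BEGIN {
      switch ("two") {
      case "one": print 1; break
      case "two": print 2; break
      default:    print "?"
      }
  }'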


Programming languages try very hard to be backward compatible so every feature you add is an eternal commitment.


I'm a big awk fan but I'm not sold on this. The awk program is not very readable. I think that's fine for a dense one-liner, but I'm not sure it carries over to a 60-line script. For something like this I'd prefer a bash script, maybe with awk invoked somewhere; that would be much easier to understand at a glance.

Is there something in the awk script that makes it advantageous over a shell script?

Edit: I hadn't read the author's conclusion yet when I posted; he agrees:

  I consider AWK amazing, but I think it should remain where it excels: for exploratory data analysis and for one-liner data extraction scripts


A while ago I wrote a program in awk to renumber TRS-80 Model 100 BASIC code. Then I re-wrote it in bash (pure bash, no sed/grep/cut/tr/awk/bc/etc), and the two are practically identical. It surprised me just how similar they turned out in the end.

awk is like a hidden miracle of utility just sitting there unused on every machine since the dawn of time.

Normally if you want something to be ultra portable, you write it in sh or ksh (though by now bash would be OK; I mean, there is bash for Xenix), but to get the most out of ksh or bash, you have to use all the available features and tricks that are powerful and useful but NOT readable. 50% of the logic of a given line of code is not spelled out in the keywords but in arcane brace expansion and word splitting rules.

But every system that might have some version of bash or ksh or plain sh always has awk too, and even the oldest plain non-GNU awk is a real, "normal", more or less straightforward explicit programming language compared to bash. Not all that much more powerful, but more readable and more straightforward to write. Things are done with functions that take parameters and do things to the parameters, not with special syntax that does magic transformations of variables which you then parlay into various uses.

Everyone uses perl/python/ruby/php/whatever when the project goes beyond bash scope, but they all need to be installed and need to be a particular version, and almost always need some library of modules as well, and every python script breaks every other year or on every other new platform. But awk is already there, even on ancient obscure systems that absolutely can not have the current version of python or ruby and all the gems.

I don't use it for current day to day stuff either, there's too many common things today that it has no knowledge of. I don't want to try to do https transactions or parse xml in awk. I'm just saying it's interesting or somehow notable how generically useful awk is pretty much just like bash or python, installed everywhere already, and almost utterly unused.


Well I think generally a 60 line program fits in that spot of "write once, read never, start from scratch if it ever turns out to be inadequate"

... also known as the APL Zone


I'm not dead set against it, but if there were any mistakes or bugs, I don't know how you'd find and fix them with that approach.


By checking the correctness of the outputs, which you need to do anyway?


Okay, so the first dev writes 60 lines of indecipherable code, runs some sample invocations, looks at the output, says it looks good. A few months later, someone - maybe the original dev, maybe some other sucker - notices that in some edge case the code misbehaves. Now what? (Obviously, any answer that involves "don't write code with bugs" or "write perfect tests" is a nonstarter)


If we're going with the "start from scratch if it ever proves inadequate" philosophy, then the person who notices the misbehavior looks at the original code, sees that it's written in some obscure language, is undecipherable, but also is only 60 lines long, and decides that it will probably be simpler to make a new (short) implementation in their own favorite language that correctly handles both the original use case and their new requirement. The key insight is that given how much easier it is to write fresh code than understand old stuff, they could very well be correct in that guess, and the end result is a single piece of small clean code, rather than a simple core with layers of patches glued on top.

In this particular case, we're talking about a "make" replacement, so testing the new implementation can be done by simply running "make all" for the project. If it passes, then the new implementation must be identical to the old one in all the ways that actually matter for the project at hand. In all likelihood, for a simple program like this, fixing one bug will also silently fix others because the new architecture is probably better than the old one.


I actually really like this approach, and have been thinking about this in regards to coding with an LLM - for a sufficiently simple program (and assuming no security concerns), once you trust your test suite, you should trust AI generated code that passes it. And then if requirements change, you should be able to amend the test cases, rerun the AI until it passes all tests and linters, maybe give the code a quick glance, and be on with your life.


The point is the “and fix them”


Not only is fixing more difficult, but so is looking for likely weaknesses (and thus knowing which inputs and outputs to focus on for testing).


> The awk program is not very readable

What do you find hard to read about it? If you know what make does, I think it is fairly easy to read, even for those who don't know awk at all but do know the Unix shell (to recognize 'ls -t') and C, both of which the audience for this book probably knew, given that the book is from 1988.

> I think for something like this I'd prefer a bash script

But would it be easier to read? I don't see why it would.


Bash also would have been an unlikely choice for a book published in 1988, considering it wasn't released until 1989 (Per Wikipedia).


It would have been ksh, which was the bash of the day, as in, the more featureful sh-compatible sh-superset.

But a bash or ksh script would have been less readable than awk.

bash (or ksh88 or ksh93) is powerful and useful, but not readable if you're actually using the powerful, useful features.

In bash, a lot of functionality comes in the form of brace expansions and word splitting, basically abusing the command parser to get results there is no actual function for. In awk and any other more normal programming language, those same features come in the form of an explicit function to do that thing.
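
A small illustration of the difference (a hypothetical example, stripping a .c extension from a filename):

  f=prog.c
  echo "${f%.c}"                                # bash: parameter-expansion magic
  echo "$f" | awk '{ sub(/\.c$/, ""); print }'  # awk: an explicit function call

Both lines print "prog".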


>In bash, a lot of functionality comes in the form of brace expansions and word splitting, basically abusing the command parser to get results there is no actual function for. In awk and any other more normal programming language, those same features come in the form of an explicit function to do that thing.

Right. That's one of the reasons why the man page for bash is so long. IIRC, going way back, even the page for plain sh was long, for the same reason.


Indeed. But at least it acknowledges it, with the iconic "It's too big and too slow."


Interesting, didn't know. Been a while since I read the page.


> It would have been ksh

No, it wouldn’t have been ksh or any other shell, nor C or Perl, nor anything else but awk, in a book titled “The AWK Programming Language”.


Someone didn't read the thread (or lost the plot), but that didn't stop them from making a nonsensical remark about it.


> for exploratory data analysis and for one-liner data extraction scripts

I think both you and the author just don't like AWK if that's the takeaway. What you're describing is literally 1% of the AWK language -- like, you don't have to like it, it's weird in many respects, but you're treating AWK like it's jq when it's actually closer to a Perl-lite/Bash mix. An AWK focused on just those use cases would look very different.

One of my favorite resources on AWK: https://www.grymoire.com/Unix/Awk.html


I think it should be appreciated in context: it's a good way to teach both awk(1) and make(1) to someone new to UNIX. It also demonstrates how to use awk(1) for prototyping, which IMO is a good programming habit to "develop": it forces you to focus on the essentials and not to overthink unnecessarily.


> Is there something in the awk script that makes it advantageous over a shell script?

Pseudo multi-dimensional associative arrays for representing the dependency graph of make. This part:

  for (i = 2; i <= NF; i++)
      slist[nm, ++scnt[nm]] = $i
The way awk supports them is hacky and not really multidimensional arrays, but it's still better than what you would have to do in bash, because of split() and some other language features.

It would be much easier with any scripting language though, Perl for example.
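
For the curious, a minimal sketch of the mechanism (hypothetical data, not the make program's): the comma in a subscript just joins the keys with SUBSEP, and split() takes the compound key apart again:

  awk 'BEGIN {
      slist["prog", 1] = "a.o"    # stored as slist["prog" SUBSEP "1"]
      slist["prog", 2] = "b.o"
      for (k in slist) {
          split(k, idx, SUBSEP)   # recover the pseudo-dimensions
          print idx[1], idx[2], slist[k]
      }
  }'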


It seems pretty readable to me, in particular the "update" function parses as JavaScript if you fix the implicit string concatenation (template literals or +) and replace the # comments with //. I'm actually surprised JavaScript is so similar to awk; it feels like a descendant language tbh.


Bash would really be a bad idea for this kind of job if it meant stitching together so many GNU utils.

I once had to rewrite a fairly big bash script in awk[1], and it made the program more readable and brought the total execution time down from 12 minutes to less than 1 second.

I think the original bash script may just have been written badly (each util command spawns its own process and has to be piped to the others, instead of awk running everything in a single process).

[1] - https://github.com/berry-thawson/diff2html/blob/master/diff2...


Writing this make program in bash would involve even harder-to-read hacks, as bash also does not support multidimensional arrays.


I would find it easier to read with more sensible, non-abbreviated variable names.


awk and sed are cool, but whenever someone tells me they're interested in learning them, I always redirect them to learn perl's `-n` and `-p` flags instead, particularly with `-la` added. This gives you, basically, a superset of sed and awk, which makes many things easier to express, often resulting in clearer and more concise code.

Those who have taken this advice have always told me later that they're really glad they did, and they generally express surprise that this isn't more widely known.

(If you already know awk and sed well, then you mightn't view learning perl on top as worth the effort -- I'm not sure either way. This advice is for people who aren't currently strong users of either.)
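
For a taste, here's a hedged example of those flags (mirroring an awk one-liner further down the thread): -n wraps the code in a read loop, -a autosplits each line into @F, and -l handles line endings:

  perl -lane 'print "$. $F[1]" if /^[A-Za-z]/'

This prints the line number ($.) and second field ($F[1]) of every line starting with a letter.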


ISTM one could use the same premises to reach the opposite conclusion, namely, that because awk is basically a subset of Perl (excluding CPAN, of course), many things are easier to read, often resulting in more regular, if sometimes longer code. :-)

(FWIW, I learned Perl before sed and awk, and when I was using Perl every day, it was easy enough to whip up one-liners and throwaway scripts. However, I find that as I stopped using Perl on a day-to-day basis about 17 years ago, I can't produce Perl without re-learning the language; but I can produce sed and awk a few times per year without any refresher. I suspect that -- for me -- the smallness of each of sed and awk has something to do with it. YMMV, of course.)


Ruby similarly has -n and -p switches that do much the same, and Ruby has many of the same `$` vars as Perl/Awk. But even so, I often find myself reaching for awk for the simplest stuff because I know it'll be there on pretty much every machine.

Only for 1-2 liners, typically, though; the moment something grows beyond that, I don't use Awk any more, so really I only use a tiny sliver of what it can do.


Is perl as widespread as sed and awk in linux distros and other OSes? If I want to make a script that works across the board I feel like the latter are much more adopted, is that correct?


Perl is essentially a standard on UNIX. In particular, there's a Perl interpreter shipped with OpenBSD's base installation[0]. If you avoid recent features (you wouldn't be missing too much), your Perl code should run with little trouble.

However, in my experience, when I start to feel the need to use something more sophisticated than sh/sed/awk, I tend to shy away from Perl in favor of more "robust" languages. Go often is a good-enough substitute (static typing, single-file deployment, trivial cross-compilation); YMMV.

[0]: https://marc.info/?l=openbsd-misc&m=159041121804486&w=2


Perl would be better suited as a portable solution, since you only have to take care of feature differences between Perl versions.

For grep/sed/awk, you also have to worry about implementation differences (GNU/BSD, gawk/mawk/nawk and so on).


I only write Perl for one-liners anymore, and use -F"\t" all the time (mostly tab-delimited files), but I wasn't familiar with -l to avoid writing chomp. Thank you!


Late to the party, but I believe Raku deserves a mention. I replaced most of my shell scripting with Raku. It offers a lot of features that help, one of the biggest wins being Grammars. They make parsing and transforming parsed data a breeze. There's a book full of oneliners[1] for many typical tasks - it's a little more verbose than Perl sometimes, but it's more consistent and predictable, which helped a lot in learning it.

[1] https://leanpub.com/raku-oneliners


I know that these programs are only for didactic purposes and my comment may seem like nitpicking, but I can't help noticing that the age comparisons in the two versions differ semantically: the AWK version uses greater-or-equal, but the Python one uses strictly-less-than. The behavior differs when target and prerequisite have exactly the same age/mtime: AWK will execute the commands, Python won't.

Python's behavior seems wrong to me. It shows up in rules with a phony target and phony prerequisites, which by definition share the same age (9999) and mtime (0). For example, it wouldn't delete prog in the following rule:

  clean: clean-objs
    rm prog

  clean-objs:
    rm *.o
On the other hand, the AWK version has a subtle bug in that it sets to zero the age of a newly updated target: this is not required in the most common cases (because the target will likely be the first file listed by "ls -t" anyway) and makes it incompatible with GNU make in those rare cases when the commands don't actually touch the target. I know they're rare, but just imagine a rule that uses rsync to replace a file with a copy fetched from a remote site only if a newer version exists on that site. If rsync does not download a new version, there's no need to artificially assume that the file was changed, and propagate "upwards" the need to recompile everything that depends on it.

Both bugs are easy to correct, though. That could be left as an exercise for the reader!
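
To make the tie behavior concrete, here is a tiny runnable illustration (the variable names are hypothetical, not the post's actual code):

  awk 'BEGIN {
      target_age = 9999; prereq_age = 9999    # phony rule: both sides tie
      target_mtime = 0;  prereq_mtime = 0     # the same tie, in mtime terms
      if (target_age >= prereq_age)           # AWK version: fires on a tie
          print "greater-or-equal: commands run"
      if (target_mtime < prereq_mtime)        # Python version: never fires on a tie
          print "strictly-less-than: commands run"
  }'

Only the first print executes.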


Funny, I would have approached this by removing the while loop and the if/else parts of the BEGIN clause, leveraging the stock file reading and line iteration along with AWK pattern matching (terminated with a next statement to skip to the next row), and then shoving the rest into the END clause.

It’s always been a “thing” with me of not liking to put everything into BEGIN. Kind of a “if I’m doing that, why am I using awk” thing.

Just how I approach problems with awk.
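
A rough sketch of that shape (not the book's code, just the structure): pattern rules with next do the parsing, and END does the work:

  awk '
  NF == 0 { next }                              # skip blank lines
  /^\t/   { cmd[nm] = cmd[nm] $0 "\n"; next }   # command lines attach to the current rule
          { sub(/:.*/, "", $1); nm = $1 }       # otherwise a rule line: remember the target
  END     { for (t in cmd) printf "%s:\n%s", t, cmd[t] }
  ' makefile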


I'm with you on this, but is there a way to force AWK to read a particular input file? That would be a reason to choose the while loop over AWK's implicit iteration. In particular, overriding FILENAME in BEGIN does not do anything.
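
(The closest thing I know of is getline's redirection form, which is exactly the explicit while loop again; a minimal sketch:)

  awk 'BEGIN {
      while ((getline line < "makefile") > 0)   # read a named file explicitly
          print line
      close("makefile")
  }'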


I wasn't aware that an updated book was on the way, and pre-ordered it immediately. It's wonderful to see that, even after 40+ years, the people who created a scripting language are still providing new, well-documented features.


These blog posts and discussions usually pit one language against others, and they often attempt to restrict a language to some specific context, ignoring that each user's experience, needs and preferences may be different. A more interesting debate would be language-agnostic, such as writing one-liners versus writing lengthy programs.

In short, the debate might be something like: what does the computer user prefer more, (a) writing one-liners or (b) writing lengthy programs? Not everyone will have the same answer. Knuth might prefer (b). McIlroy might prefer (a).

To someone reading this blog post who knew nothing about programming languages, it might seem to imply that Python is not well-suited for one-liners, or at least not comparable to AWK in that context. Perhaps the interpreter startup time has something to do with the failure to consider Python for one-liners.


I don't think Python is very well suited to one-liners, but it's not due to interpreter startup time (20ms on my machine). Rather, it's due to all the scaffolding needed, which AWK provides implicitly: AWK automatically reads input lines and splits them into fields, automatically initializes variables to the type's default value, and has terser syntax for things like regex matching.

Consider the following AWK one-liner which, for every input line that starts with a letter, prints the line number and the line's second field:

  awk '/^[A-Za-z]/ { print NR, $2 }'
The equivalent Python program has a ton more boilerplate: import statements, explicit input reading and field splitting, and more verbose regex matching:

  import re
  import fileinput

  inp = fileinput.input(encoding='utf-8')
  for line in inp:
      if re.match(r'[A-Za-z]', line):
          fields = line.split()
          print(inp.lineno(), fields[1])


Ruby and Perl have the -n switch to provide that boilerplate. E.g. Ruby:

    ruby -nae 'print $.," ",$F[1],"\n" if $_ =~ /^[A-Za-z]/'
-n wraps an implicit "while gets; ... ;end" around the code; "-a" adds an implicit "$F = $_.split" at the start of the loop; "-e" takes an expression from the command line; $_ contains the result of the `gets`; $. contains the line number of the last line read.

Alternatively:

    ruby -ne '$_.match(/^[A-Za-z]+(.*)/) { puts "#{$.}#{$1}" }'
`match` sets $1, $2 etc to the corresponding capture group, and calls the block if successful.

The scaffolding would be easy to provide with Python too, but the extra Awk/Perl-isms that make it convenient are another matter (and while I use them occasionally for one-liners, I will get shouty if I find $1 etc. in production code...).

Even the Ruby differences are sufficient extra noise that I still reach for awk for simple stuff like that.


Everyone has their own personal preferences.

Here is how I would do that task, assuming (a) I had to do it more than once and (b) I could choose any software. On the computer I'm using, the statically-linked, stripped binary is 50k versus a dynamically-linked gawk which is 623k. This solution is faster than AWK, Python, Go, etc. and uses much less CPU and memory. This is quick and dirty, written in a few minutes. I am not a paid programmer. I'm the so-called average user. I'm not compensated for writing programs.

usage: a.out <-- minimal typing

NB. There is a two space indent added to each line. One must remove exactly two spaces from each line or there will be error messages and this will not compile.

  #!/bin/sh
  flex -8Crf <<eof
   int fileno (FILE*);
   int x,y,n=1;
  %option noyywrap noinput nounput 
  %%
  ^[A-Za-z][^\n]+ {
   printf("%d ",n);
   for(x=0;x<yyleng;x++){if(yytext[x]==32)y++;
   if(y==1)putc(yytext[x],yyout);
   }
   putchar(10);y=0;
   }
  \n n++;
  .
  %%
  int main(){ yylex();exit(0);}
  eof
  cc -O3 -std=c89 -W -Wall -pedantic -pipe lex.yy.c -static


If you suggested this as a joke, it is hilarious. Well done.


I always thought we should make a sort of Python for one-liners, inspired by awk, where the loop over the lines would be implied.

line, lineno and fields would be predefined, and I guess re, os, shutil, pathlib and sys would be pre-imported. Maybe the whole stdlib could act as if it were pre-imported, while only being imported lazily.

here it would be something like

  if re.match(r'[A-Za-z]', line):
      fields = line.split()
      print(inp.lineno(), fields[1])

so

  cat makefile | pyawk 'if re.match(r"[A-Za-z]", line): print(lineno, fields[1])'

I don't see a way around multiple statements requiring multiple lines, though; otherwise you would have to introduce brackets to Python lol


Often whole program generation in a prog.lang (& ecosystem!) that you already know can substitute for a new prog.lang. Python even has eval. You may be interested in: https://github.com/c-blake/bu/blob/main/doc/rp.md

You can actually get pretty far depending upon boundaries with the always implicit command-option language (when launched from the shell language, anyway). For example, Ben's example can be adapted to:

    rp -m^\[A-Za-z\] 'echo nr," ",s[1]'
which is only 5 more characters and only 3 more key downs (less SHIFT-ing) than the space-optimized version of his `awk`. { key downs are, of course, just a start to a deep rabbit hole on HCI ergonomics ending in heatmaps, finger reach/strain/keyboard layouts, left-right hand switching dynamics, etc., but they seem the most portable idea. }

Nim is not Python - it is actually a bit more concise while also being statically typed, and it can be compiled to code which runs as fast as the best C/C++ (at more expense than one usually wants for one-liner interactive iteration, though, unless you need to test on very large data). That said, I find it roughly "as easy" to enter `rp` commands as `awk`.

If doing this in Python tickles your fancy, Ben actually has a write-up on these ideas that you might find interesting: https://benhoyt.com/writings/prig/

EDIT: and while I was typing in a sibling @networked mentions a bunch more examples, but I think my comment here remains non-redundant. I'm not sure even one of those examples has some simple `-m` for auto-match mode (although many would say a grep pre-filter is enough for this).


Sorry, I have removed the list of awk replacements for other languages from that comment because I thought it wasn't the right place for it in the thread. I'll just post it here.

- Common Lisp: https://github.com/sharplispers/clawk

- Haskell: https://github.com/gelisam/hawk

- Racket: https://gitlab.com/xgqt/racket-rawk

- Tcl: https://wiki.tcl-lang.org/page/owh+%2D+a+fileless+tclsh (disclosure: the page links to my fork)

One use for an awk replacement is emitting more structured data. I have used my fork of owh a few times to emit JSON after awk-style parsing. I know GNU Awk can generate JSON with https://www.gnu.org/software/gawk/manual/html_node/gawkextli..., but I haven't tried it.


No problem. It might also bear mentioning that if one is willing to learn more specialized tools, even less key-downing is possible, such as (using https://github.com/c-blake/bu/blob/main/doc/cols.md):

    grep ^[A-Za-z]|cols 2
You just lose that row number in the original input coordinates feature of Ben's example which could probably be recovered with `grep -n` & `cols -d' :'`, etc., etc. In exchange, you can say `cols 2:5` to get a block of columns trivially. And then, of course, once you have any oft-repeated atom you can save it in a tiny script/etc.

A lot of these choices come down to atom discovery & how willing/facile someone is juggling/remembering syntax/sub-languages. In my experience, willingness tracks facility and both are highly variable distributions over the human population.


Alec Thomas wrote a script like this called pawk.py (https://github.com/alecthomas/pawk). It reads input automatically, and for each line, defines "n" and "f" to the line number and fields list (among other things). It even supports /regex/ patterns. Even the print is implicit. So the example above would be:

  pawk '/^[A-Za-z]/ (n, f[1])'
By the way, triple backticks don't work on HN. You have to indent by 2 spaces to get a code block.


thanks a lot for mentioning pawk, it really looks like what I had in mind


A sibling comment already mentions PAWK. You can do

  cat makefile | pyawk 'if re.match(r"[A-Za-z]", line): print(lineno, fields[1])'
in a Python one-liner without PAWK by abusing list comprehensions:

  python -c 'import fileinput, re; [print(fileinput.lineno(), re.split(r"\s+", line)[1]) for line in fileinput.input() if re.match(r"[A-Za-z]", line)]' makefile
Edit: Removed a list of other awk replacements to post in a separate comment (https://news.ycombinator.com/item?id=37465164).


>The equivalent Python program has a ton more boilerplate: import statements, explicit input reading and field splitting, and more verbose regex matching:

"awks and pythons"


Like apples and oranges.


Very good article. Enjoyed how it both explored the program and made a similar Python port. I find awk nice for one-liners but, even if interesting more complex programs can be written in it, I prefer the Python version. Worth mentioning the book had some more such programs, like a simple rdbms and a calculator.


> return 1 to the caller to indicate we did make an update.

This (both in the awk and Python code) seems useless, as the return value from update() is not used anywhere. Am I missing something obvious?


Yeah, it's used in the top-level update() call in the BEGIN block:

  if (update(ARGV[1]) == 0)
      print ARGV[1] " is up to date"


Right, I missed that one; probably considered it a cosmetic feature!


Is there a way to get Awk to emit a non-terse version of the script passed in?

i.e. awk '/test/' -> '{ if($0~/test/){print $0} }'


I must admit that I use awk only via GPT-4, which writes me the one-liner I need and I just run it. I somehow can't remember the syntax, since I only use the tool occasionally.



