
GoAWK: an AWK interpreter written in Go - ngaut
https://github.com/benhoyt/goawk
======
vvern
I would love to see support for calling out into Go functions. The Go stdlib
so often has good implementations of functionality in places where things like
the Python stdlib don't.

There are some obvious questions around calling conventions and error
handling, method invocation, etc. but nothing there seems totally
insurmountable. Having a compliant implementation as a jumping off point is a
great start.

Looking at the interp internals, the representation of function call
expressions might need a little bit more structure to pull this off (rather
than just a big switch for the awk builtins and user calls as just more awk
instructions plopped inline). Furthermore, there are questions about how
exactly to represent Go objects, but I suspect with some boxing it could be
made relatively ergonomic.

~~~
benhoyt
That's a neat idea! (Though not necessarily in the spirit of AWK as a simple
text processing language. :-) I _started_ playing with something like this in
a previous language interpreter I made in Go
([https://github.com/benhoyt/littlelang/blob/master/interprete...](https://github.com/benhoyt/littlelang/blob/master/interpreter/native.go)).
It's definitely possible to do with the reflect package, though it wouldn't be
trivial to do a full implementation.
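
As a rough illustration of the reflect-based approach mentioned above, here is a minimal sketch of calling a Go function by name from an interpreter. The `natives` registry and the `callNative` helper are hypothetical names invented for this example; they are not part of GoAWK or littlelang. Arguments are converted with `reflect.Value.Convert`, so an AWK-style float such as `3.0` can be passed where a Go `int` is expected. Variadic functions are rejected to keep the sketch short.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// natives is a hypothetical registry mapping script-visible names to Go functions.
var natives = map[string]interface{}{
	"toupper": strings.ToUpper,
	"repeat":  strings.Repeat,
}

// callNative looks up a Go function by name and calls it via reflection.
// Each argument is converted to the parameter type where Go permits it.
func callNative(name string, args ...interface{}) (result []interface{}, err error) {
	fn, ok := natives[name]
	if !ok {
		return nil, fmt.Errorf("unknown native function %q", name)
	}
	v := reflect.ValueOf(fn)
	t := v.Type()
	if t.IsVariadic() || t.NumIn() != len(args) {
		return nil, fmt.Errorf("%s: expected %d fixed args, got %d", name, t.NumIn(), len(args))
	}
	in := make([]reflect.Value, len(args))
	for i, a := range args {
		av := reflect.ValueOf(a)
		if !av.IsValid() || !av.Type().ConvertibleTo(t.In(i)) {
			return nil, fmt.Errorf("%s: argument %d has the wrong type", name, i+1)
		}
		in[i] = av.Convert(t.In(i))
	}
	// Box every return value back into interface{} for the interpreter.
	for _, out := range v.Call(in) {
		result = append(result, out.Interface())
	}
	return result, nil
}

func main() {
	// AWK numbers are floats, so pass 3.0 where strings.Repeat wants an int.
	out, err := callNative("repeat", "ab", 3.0)
	fmt.Println(out, err)
}
```

A fuller version would also need to decide how Go `error` return values, methods, and struct values surface in the scripting language, which is where most of the design work would likely be.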

------
kiwidrew
This is really cool! It's rare to see new implementations of awk these days.
Bonus points for running it through the nawk test suite.

In my opinion /usr/bin/awk is a thing of beauty. Certainly it's the most
usable out of the trifecta of scripting languages that are mandated by POSIX.

(There's /bin/sh, where merely _using_ variables can quickly turn into a
quoting nightmare. But the true nightmare material is /usr/bin/sed, which has
actually been shown [1] to be a Turing complete language!)

[1]
[http://www.catonmat.net/ftp/sed/turing.txt](http://www.catonmat.net/ftp/sed/turing.txt)

~~~
comex
Hmm… you missed ed, which is apparently Turing complete as well, at least
arguably [1]. And apparently newer versions of POSIX also mandate m4…

[1] [https://nixwindows.wordpress.com/2018/03/13/ed1-is-turing-co...](https://nixwindows.wordpress.com/2018/03/13/ed1-is-turing-complete/)

------
stevekemp
Looks like a really good project, especially with the collection of test-
files.

I had fun fuzzing the "real" awk, finding a couple of trivial segfaults. If
you've not already experimented with fuzzing I'd recommend it - I found a few
minor issues in my own simple-interpreter, and language, via feeding them
malformed scripts.

------
linsomniac
I've been playing with writing a Python-based AWK-inspired library over the
last couple weeks. This isn't an implementation of AWK like here, but a Python
interpretation of "If I wanted to do the sort of things that AWK is good at,
in Python, what would that look like?"

For example, to extract and add line numbers to SQL table definitions:

    
    
      import sys
      import gawk  # the library linked below
      
      t = gawk.Gawk(sys.stdin)
      t.context.data = ''
      # the closing paren is escaped so the end pattern is a valid regex
      @t.range(r'CREATE TABLE', r'\);')
      def line(context, line):
          context.data += (('line %d:' % context.range.line_number) + line)
          if context.range.is_last_line:
              print(context.data)
              context.data = ''
      t.run()
    

[https://github.com/linsomniac/gawk](https://github.com/linsomniac/gawk)

I've used AWK for close to 30 years, but I've never achieved or maintained any
level of proficiency at it. I pretty much just use it for "{ print $1, $3 }"
in a filter or the like. Every time I try to do something more complicated I
spend an hour or more futzing around with it and more often than not getting
to "almost but not quite" where I want to be. This is, of course, a me failing
not an awk failing.

But it's left me wanting something that would make doing awk-like processes
easy in Python, which I'm very proficient at.

I ended up using the name "gawk" because it's an English word and nods to the
AWK inspiration, but then I remembered GnuAWK so I'll probably rename it.

~~~
srean
Yeah, Gawk is GNU's Awk; it's a well-established name.

Your code snippet does showcase Awk's utility. It brutally cuts through all
the ceremony around reading and iterating over lines.

------
samuell
This is so cool! I have been calling out to GNU Awk in like 50% of our SciPipe
workflow tasks lately (See e.g. [1]) ... now I should be able to keep it all
inside Go/SciPipe.

[1] [https://github.com/pharmbio/ptp-project/blob/master/exp/2018...](https://github.com/pharmbio/ptp-project/blob/master/exp/20180426-wo-drugbank/wo_drugbank_wf.go#L177-L329)

------
tomcam
Haven’t tried it yet but my favorite part is that it’s embeddable in your own
programs.

~~~
srean
Just in case it's useful, you can embed/extend awka or mawk or rather the
library they are based on. Gawk allows extending it but the interface is not
pretty.

------
hi41
I am a big fan of awk. I find it very beautiful. I admire the authors of awk
so much. When I read the awk programming book, I find that it has such clarity of
thought. Kudos to you for your implementation!

~~~
another-cuppa
I'm so glad I spent about twenty minutes to learn awk several years ago.
That's really all it takes to learn it and you get incredible power.

------
helper
Ha. I was just thinking how useful it would be to have the awk programming
language available in a tool that natively understood csv files. Suddenly that
seems a lot more doable!

~~~
benhoyt
Interesting point. I've thought that AWK should have a mode where it does
proper quote parsing of CSV files. Maybe I'll add a -csv option for that (or
just have it do it automatically when the FS is ',' -- though that wouldn't
be backwards compatible).

~~~
vram22
You probably know it already, but in case not, the CSV format is not fully
standardized, and so there are variations. So you might have to handle those
to provide better support of the feature, if you implement it.

[https://en.wikipedia.org/wiki/Comma-separated_values](https://en.wikipedia.org/wiki/Comma-separated_values)

For example, csv.reader in Python's csv module in the stdlib, has a dialect
argument, due to this.

[https://docs.python.org/2/library/csv.html](https://docs.python.org/2/library/csv.html)

[https://docs.python.org/3/library/csv.html](https://docs.python.org/3/library/csv.html)

------
srean
Ah! This makes me wish for an awk with first class channels and loadable
modules. Gawk does allow talking over sockets but this would have been so much
sweeter.

One thing I hope this implementation remedies is the absence of a linear time
string concatenation in awk. Awk has split but no join. Only way I know is to
iteratively join two strings which has a quadratic running time.

~~~
benhoyt
Huh, interesting. So one thing I haven't focused on much yet is performance.
It is slightly faster than "one true awk" on large inputs with very simple
programs, so I'm guessing Go's I/O speed is pretty good, but the actual
interpreter itself is significantly slower than awk's as yet -- hoping to work
on that soon.

I haven't looked at linear-time string concat. Interesting point -- I'll put
it on my TODO list. Though I think instead of string building you could simply
use printf to write output and that would be linear time.

~~~
kiwidrew
Yes, if the "destination" is stdout (or a pipe/file) then the obvious loop
works just fine:

    
    
      # split() fills indices starting at 1; i is declared as a local
      function join(ARRAY, i) {
        for (i=1; i in ARRAY; i++) printf "%s", ARRAY[i];
      }
    

But if you need the result back as a string for further processing, the
obvious methods are not linear:

    
    
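      # append one item at a time; _s and i are locals
      function join(ARRAY, _s, i) {
        for (i=1; i in ARRAY; i++) _s=_s ARRAY[i];
        return _s;
      }
      
      # try to be clever and join 2 items at the same time
      function join2(ARRAY, _s, i) {
        for (i=1; i in ARRAY; i+=2)
          # guard i+1 so an odd-length array does not gain an empty element
          _s=sprintf("%s%s%s",_s,ARRAY[i],((i+1) in ARRAY) ? ARRAY[i+1] : "");
        return _s;
      }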
    

I for one definitely miss having a join() function, and it seems odd that this
natural complement to the split() function was never implemented...
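
For comparison, here is a sketch of how an interpreter written in Go could back a join() builtin with a growable buffer. `joinNaive` mirrors the AWK loops above, where each concatenation may copy everything built so far, and `joinBuilder` uses `strings.Builder`, whose appends are amortized linear in the total output length. Both function names are made up for this example; this is not GoAWK's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// joinNaive mirrors the AWK loop: every += may copy the whole string so far,
// which is quadratic in the worst case.
func joinNaive(parts []string) string {
	s := ""
	for _, p := range parts {
		s += p
	}
	return s
}

// joinBuilder appends into a growable buffer, so the total work is
// amortized linear in the length of the result.
func joinBuilder(parts []string, sep string) string {
	var b strings.Builder
	for i, p := range parts {
		if i > 0 {
			b.WriteString(sep)
		}
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"a", "b", "c"}
	fmt.Println(joinNaive(parts))
	fmt.Println(joinBuilder(parts, "-"))
}
```

In practice Go's standard `strings.Join` already does this, so a builtin could simply delegate to it once the AWK array has been flattened into an ordered slice.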

------
yjftsjthsd-h
This is beautiful. I really love that awk has numerous implementations running
around.

------
wenc
Great work. Reimplementing a practical and useful mini-language in another
language is always a useful exercise.

I'm curious though, why code the lexer and parser by hand? What's the state of
lexing/parsing in the Go world?

~~~
benhoyt
Couple of reasons. 1) Because I'm a fan of few dependencies and lex/yacc are
non-Go dependencies. 2) I've never used them and it'd probably take me longer
to learn them than hand-write a lexer and parser. 3) Writing a lexer is
trivial, and writing a recursive-descent parser is fun and not that hard.

As to the state of lexing/parsing in the Go world: there's a simple scanner
(text/scanner) in the stdlib. I've run across this quite neat parser library
that's based off structs and tags:
[https://github.com/alecthomas/participle](https://github.com/alecthomas/participle)
... but I really don't know the landscape very well.
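
To give a feel for why a hand-written recursive-descent parser is manageable, here is a minimal sketch for arithmetic expressions: one method per grammar rule, with lexing folded into a tiny `peek` helper. It is only an illustration written for this thread (the `parser` type and `eval` function are invented names), not code from GoAWK, whose grammar is far larger.

```go
package main

import (
	"fmt"
	"strconv"
)

// parser holds the input text and the current byte offset.
type parser struct {
	src string
	pos int
}

// peek skips spaces and returns the next byte, or 0 at end of input.
func (p *parser) peek() byte {
	for p.pos < len(p.src) && p.src[p.pos] == ' ' {
		p.pos++
	}
	if p.pos < len(p.src) {
		return p.src[p.pos]
	}
	return 0
}

// expr = term { ("+" | "-") term }
func (p *parser) expr() (float64, error) {
	v, err := p.term()
	for err == nil && (p.peek() == '+' || p.peek() == '-') {
		op := p.src[p.pos]
		p.pos++
		var r float64
		if r, err = p.term(); err == nil {
			if op == '+' {
				v += r
			} else {
				v -= r
			}
		}
	}
	return v, err
}

// term = factor { ("*" | "/") factor }
func (p *parser) term() (float64, error) {
	v, err := p.factor()
	for err == nil && (p.peek() == '*' || p.peek() == '/') {
		op := p.src[p.pos]
		p.pos++
		var r float64
		if r, err = p.factor(); err == nil {
			if op == '*' {
				v *= r
			} else {
				v /= r
			}
		}
	}
	return v, err
}

// factor = number | "(" expr ")"
func (p *parser) factor() (float64, error) {
	if p.peek() == '(' {
		p.pos++
		v, err := p.expr()
		if err != nil {
			return 0, err
		}
		if p.peek() != ')' {
			return 0, fmt.Errorf("expected ) at offset %d", p.pos)
		}
		p.pos++
		return v, nil
	}
	start := p.pos
	for p.pos < len(p.src) && (p.src[p.pos] == '.' || (p.src[p.pos] >= '0' && p.src[p.pos] <= '9')) {
		p.pos++
	}
	if start == p.pos {
		return 0, fmt.Errorf("expected number at offset %d", p.pos)
	}
	return strconv.ParseFloat(p.src[start:p.pos], 64)
}

// eval parses and evaluates src, rejecting trailing input.
func eval(src string) (float64, error) {
	p := &parser{src: src}
	v, err := p.expr()
	if err == nil && p.peek() != 0 {
		err = fmt.Errorf("unexpected %q at offset %d", p.src[p.pos], p.pos)
	}
	return v, err
}

func main() {
	fmt.Println(eval("1 + 2 * (3 - 1)"))
}
```

Operator precedence falls out of the call structure: `expr` calls `term`, which calls `factor`, so multiplication binds tighter than addition without any precedence table.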

------
theparanoid
I feel let down, GAWK would have been a great name.

~~~
gexla
Since the mascot is a gopher... GOPHAWK.

~~~
bakoo
You could try to make GOPHAWK yourself.

