
GoAWK: an AWK interpreter written in Go
https://benhoyt.com/writings/goawk/
======
kris-s
AWK is such a cool little tool, great way to process log files (as shown in
the article). Glad to see it still getting some attention.

It's kinda nuts how far the unixy idea of "just streams of text" has gotten
us.

~~~
yolo1
_It's kinda nuts how far the unixy idea of "just streams of text" has gotten
us._

As much as I agree that the tooling is great, object-based tools would be
_much_ nicer.

        for i in `ls`; do echo $i.size; done;

versus

        ls -la | awk '{print $5}'

I know which I'd prefer.
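
To be fair, the stream version composes further: totaling the sizes needs
only an END clause. A sketch (assuming column 5 of `ls -l` is the size, as on
GNU and BSD, and skipping the "total" summary line `ls -l` prints first):

```shell
# Total the byte sizes of the files in the current directory.
# NR > 1 skips the "total" summary line that ls -l prints first.
ls -l | awk 'NR > 1 {total += $5} END {print total}'
```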

~~~
paavoova
No, that's awful if you consider its implications. Just do:

      for i in *; do du -bs "$i"; done

~~~
masklinn

        > for i in *; do du -bs "$i"; done
        du: illegal option -- b
        usage: du [-H | -L | -P] [-a | -s | -d depth] [-c] [-h | -k | -m | -g] [-x] [-I mask] [file …]

_cool_
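
For what it's worth, a portable variant can sidestep the GNU-only `-b` flag
entirely: `wc -c` is specified by POSIX and reports byte counts. A sketch
(the `tr -d ' '` is only there because BSD `wc` pads its output with spaces):

```shell
# Byte count per file, without GNU du's -b flag.
for i in *; do
  [ -f "$i" ] || continue                      # skip directories and specials
  printf '%s\t%s\n' "$(wc -c < "$i" | tr -d ' ')" "$i"
done
```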

------
kazinator
Awk in Lisp as a macro with Lisp AST syntax:

[http://nongnu.org/txr/txr-manpage.html#N-000264BC](http://nongnu.org/txr/txr-manpage.html#N-000264BC)

* Implements almost all salient POSIX features and some Gawk extensions.

* the _awk_ expression can be used anywhere an expression is allowed, including nested within another _awk_ invocation. Awk variables are lexically scoped locals: each invocation has its own _nf_, _fs_, _rec_ and others.

* _awk_ expression returns a useful value: the value of the last form in the last :end clause.

* can scan sources other than files, such as in-memory string streams, strings and lists of strings.

* supports regex-delimited record mode, and can optionally keep the record separator as part of the record (via "krs" Boolean variable).

* unlike Awk, range expressions freely combine with other expressions including other range expressions.

* ranges are extended with 8 semantic variations, for succinctly expressing range-based situations that would require one or more state flags and convoluted logic in Awk: [http://nongnu.org/txr/txr-manpage.html#N-000264BC](http://nongnu.org/txr/txr-manpage.html#N-000264BC)

* strongly typed: no duck-typed nonsense of "1.23" being a number or string depending on how you use it. Only _nil_ is false.

Recently accepted Unix Stackexchange answer featuring awk macro:
[https://unix.stackexchange.com/questions/316664/change-speci...](https://unix.stackexchange.com/questions/316664/change-specific-part-of-file-via-a-shell-script/316752#316752)

~~~
kristianp
Cool, but why not use Common Lisp?

~~~
kazinator
In 3.5 words: using isn't making.

------
jim_bailie
Lua and AWK. I'm inspired! I'm going to rewrite the Perl 5 interpreter in Go.
Take that Larry. Just kidding.

I'm all for scratching an itch, but why rewrite all these well established
tools?

~~~
stevekemp
I extended somebody else's programming language recently, then wrote a BASIC.
Mostly to make sure that I understood lexing, parsing, and AST stuff.

While you're right that this is reinventing the wheel, it can make sense to
reimplement old tools to improve safety and security, and to allow them to be
embedded in new environments.

Have you ever run a fuzz-tester against (GNU) awk? I have. Even now you can
segfault awk with bogus programs, for example:

[https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=816277](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=816277)

No doubt this new implementation won't be perfect, but segfaults should be
ruled out by Rust/Go/ML implementations.

~~~
jim_bailie
Ha! My interaction with AWK throughout the years has been a fuzz test in
progress. Very interesting link though.

On the other hand, I think what you're saying about expanding a well known
tool's range of use in new environments makes sense.

------
simplegeek
I also really like such posts. I have never written an interpreter. What do
experts recommend, i.e. how can someone without a formal CS background learn
to write an interpreter? Any good sources, articles, or books? Thanks in
advance.

~~~
nprescott
Depending on how involved the language you are interpreting is, you might get
by having only read chapter 6 of The AWK Programming Language[0] (linked in
the article), which covers "Little Languages", including what it terms an
assembler and interpreter.

If you are interested in more depth, either Crafting Interpreters[1]
(mentioned in the article) or Writing an Interpreter in Go[2] looks promising.
I've read more of Crafting Interpreters and really enjoy it, though it isn't
yet finished. One of the aspects I really enjoy is that the language is
implemented and re-implemented in different languages to gradually introduce
lower level concepts.

Finally, this one may be a little more "out there" than what you are looking
for, but if you are interested in designing a language more than the plumbing
of an interpreter Beautiful Racket[3] is really good.

caveat: not an expert

[0]: [https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoI...](https://ia802309.us.archive.org/25/items/pdfy-MgN0H1joIoDVoIC7/The_AWK_Programming_Language.pdf)

[1]: [http://www.craftinginterpreters.com/](http://www.craftinginterpreters.com/)

[2]: [https://interpreterbook.com/](https://interpreterbook.com/)

[3]: [https://beautifulracket.com/](https://beautifulracket.com/)

------
akavel
Troff + pic + eqn ported to Go would be cool :>

~~~
lbruder
Holy shit yes please. And grap.

------
tyingq
I wonder if there's a file size + workload at which the coordination overhead
of a parallel awk is low enough for an overall performance win.
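
A crude way to measure that with plain shell jobs: split the file, run one
awk per chunk, merge the partial results. A sketch (the file name, chunk
size, and column-sum task are made up here, and only associative reductions
like sums merge this easily):

```shell
# Naive parallel awk: one interpreter per chunk, merged at the end.
seq 1 100 > big.log                # demo input: one number per line
split -l 25 big.log chunk_         # four line-based chunks (chunk_aa..chunk_ad)
for f in chunk_??; do
  awk '{s += $1} END {print s + 0}' "$f" > "$f.sum" &   # one awk per chunk
done
wait                               # the coordination overhead lives here
cat chunk_*.sum | awk '{t += $1} END {print t}'         # prints 5050
```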

------
IloveHN84
Nice to see another implementation, but I still hope the original one will
always be kept alive.

~~~
pmarin
It's the one used by all the BSD systems.

Recently an official GitHub repository appeared, with commits from Brian
Kernighan.

[https://github.com/onetrueawk/awk/commits/master](https://github.com/onetrueawk/awk/commits/master)

------
AstroJetson
While I'm cool with writing other language processors in a new language (Lisp
written in Cobol anyone?) I'm missing the value of this past the bragging
rights.

There was a similar article about writing the LuaVM in Go, to package it in
bigger Go applications. I've done lots of C based systems and bolted Lua on,
so the Go version makes sense.

But is embedding Awk into a program something that gets done on a regular basis?

~~~
101km
I believe in this instance it is the author wanting to level up on AWK and Go.
The value is learning and fun.

An AWK interpreter written in Go is unlikely to be an improvement, except...
well, here is another blog post you might be interested in that has a similar
sense of adventurous tinkering (it's about improving on grep):
[https://ridiculousfish.com/blog/posts/old-age-and-treachery....](https://ridiculousfish.com/blog/posts/old-age-and-treachery.html)

That's from 2006 and the tl;dr was graybeards did things a certain way for a
reason. And yet nowadays we have things like rg (and ag and a bunch of
others).

~~~
bhengaij
I think my GP's objection (which I share) is to this being shared and voted
to the top when there is nothing to learn from it in terms of ideas, not to
people hacking away.

~~~
Insanity
But now other people who are interested in Go can learn from it. Seems pretty
much the point of HN - finding interesting things to learn.

------
LeoNatan25
Why do so many modern projects feel the need to include the language and/or
the tech stack used as part of the project name? Is it a type of virtue
signaling? Does “Go” or “JS” or “Swift” or “Node” make these project more
attractive somehow to an end-user (even if the end-user is a programmer)?

~~~
pjmlp
Back in the day GNOME projects had a G somewhere, while KDE ones had a K,
which now can be mixed up with Kotlin ones actually.

C++ projects used to add ++ as suffix, for example Rogue Wave Tools.h++ and
Motif++ libraries back in the glory days of commercial C++ compilers.

So this fashion is already quite old.

~~~
black-tea
But those are signalling compatibility with other tools. That's fine. It makes
sense. But just imagine if someone posted about the original awk (which
happens from time to time) and titled it "awk: a text processor written in C".

~~~
pjmlp
Thing is, we now have this notion that language X is not good for doing Y, so
when one does post software for doing Y in X, and it actually does a good
job, it works as marketing material for language X.

Programming languages are software products as well, and their customers want
to feel they have made the right choices sticking with their options.

------
TheJoYo
party hard.

------
snug
should have just named it gawk tbh

~~~
kelp
Not sure if you're making a joke or unaware of
[https://www.gnu.org/software/gawk/](https://www.gnu.org/software/gawk/) :)

------
luckylittle
I am hoping one day there is going to be an operating system, where every
single tool is written in Go and everything runs in containers! Well done!

~~~
Twirrim
Why? GC is overkill for a good number of command line tools.

What advantage do you see in command line tool running in its own container?
Given how important pipes are, that's going to be a lot of overhead punting
data between containers.

~~~
dymk
Actually, _lack_ of a GC is overkill (in terms of control needed over memory)
for most command line tools.

Having to manually track memory liveness in C adds a large amount of
complexity to tools like awk, sed, and grep (which are already complex beasts
themselves).

~~~
zzo38computer
Many commands that only do a few things (perhaps not awk, since it runs full
programs) don't need to free everything, since they only allocate a few
things and will soon terminate, freeing the entire process.

~~~
dymk
Nearly any command that you pipe into, or stream content out of, must
allocate and free memory in some non-trivial way.

Sure, those commands could just allocate and never free memory (a-la early C
compilers, or the D compiler), but now any use-case that involves a large
amount of data will leak noticeably. Not going to fly if you need these
commands to be durable and efficient. And unix commands need to be both.

A GC gets you the freedom to operate on large streams for free, without having
to worry about memory management (modulo optimization, but that happens later
anyways, regardless of GC presence or not).
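
awk itself illustrates the point: even the classic dedup one-liner allocates
a table entry per unique input line, and the interpreter's memory management
(a GC, in GoAWK's case, since Go is garbage-collected) is what keeps that
invisible to the user:

```shell
# Print each line only the first time it appears. The seen[] array
# grows with the number of unique lines -- memory the interpreter
# allocates and reclaims on the user's behalf.
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
```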

~~~
zzo38computer
In some cases, yes. But not in all kinds of programs. For example, my
Farbfeld Utilities programs differ in how much buffering is needed:

* Some deal with only one pixel at a time, or sometimes two. No dynamic allocation is needed.

* Some deal with one scanline at a time, or sometimes more than one (but a fixed number) at a time. The same buffer can be used for each scanline.

* Some deal with the entire picture (such as those that distort the picture).

But one possibility can be that a program might load multiple pictures, each
needing the entire picture in memory at once, but does not use them
simultaneously, in which case it makes sense to free each picture after it is
used.

(Or maybe I somehow misunderstood your message or something else.)

~~~
dymk
The point is, the default decision should be to not have to worry about
memory management. Most applications shouldn't, because they're not realtime
operating systems, or in an environment where memory must be allocated
statically.

Almost all unix command line utilities fall into this category. Having to
worry about pairing your `free`s with your `malloc`s is a strict increase in
cognitive overhead, which should have been spent on verifying the program's
semantics are correct.

Messing up low-level memory operations, when you just want to worry about
semantic correctness, potentially leads to bugs like RCEs, or dosing somebody
with too much radiation.

~~~
burntsushi
Thankfully, these days, you don't actually need to choose between GC and
manually matching up your `free`s with your `malloc`s.

There is at least some data that GC does have an impact on command line tools
like this: [https://boyter.org/posts/sloc-cloc-code/](https://boyter.org/posts/sloc-cloc-code/)
--- More experiments like that would be great to crystallize the exact
trade-offs here.

