
I wrote a compiler in awk!

To bytecode; I wanted to use the awk-based compiler as the initial bootstrap stage for a self-hosted compiler. Disturbingly, it worked fine. Disappointingly, it was actually faster than the self-hosted version. But it's so not the right language to write compilers in. Not having actual data structures was a problem. But it was a surprisingly clean 1.5kloc or so. awk's still my go-to language for tiny, one-shot programming and text processing tasks.

http://cowlark.com/mercat (near the bottom)

(...oh god, I wrote that in 1997?)




I've always thought that AWK's most important feature is its self limiting nature: no one would ever contemplate writing an AWK program longer than a page, but once Perl exists the world is doomed to have nuclear reactors driven by millions of lines of regexps.

But no, there's always one. :-)


> I've always thought that AWK's most important feature is its self limiting nature

I agree. This idea doesn't receive enough attention. If you pick your constraints, you can make a particular envelope of uses easy and the ones you don't care about hard.

AWK's choice to be a per-line processor, with optional sections for processing before all lines and after all lines, is self-limiting, but it defines a useful envelope of use.
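
A minimal sketch of that envelope, using nothing beyond standard awk (run it as `awk -f sum.awk data`, say):

    BEGIN { total = 0 }            # runs once, before any input
    { total += $1 }                # runs for every input line
    END { print "sum:", total }    # runs once, after the last line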


I've written one or two awk programs that probably went beyond what the tool was intended for, but mostly I use short one-liners or small scripts. I use awk, grep, sed, and xargs pretty much daily for all kinds of ad-hoc automation.


> beyond what the tool was intended for

Not sure what that would mean. I think the tool was designed to be a user's programming language. I like to think that `awk` was the Excel + VBScript of its day.


VBScript was largely replaced on Windows by PowerShell. Awk is still popular for what it's good at.


Fair point. I guess I meant more what I thought it was intended for, i.e. mainly smallish text transforms where the entire program is given on the command line, often as part of a pipeline of several different utilities.


I'm bookmarking that. The reason is a discussion David Wheeler and I had about countering compiler subversion. We need to bootstrap the C compiler with something trustworthy & local. I looked into Perl since it's industrial-strength and widely deployed. He mentioned bash since most (all?) UNIXes had it. My next thought was converting a small, non-optimizing compiler's source to bash or awk. So crazy it might work.

Now, you've posted a compiler/interpreter in awk for a C-like language that could allow easier porting of a C compiler's source. Hmmm. The license would have to be BSD so the BSDs could use it, too. Or pieces of it in my own solution.

I have a feeling whatever comes out of this won't make it into the next edition of Beautiful Code. ;)


Now that I look at it, I see that it's not open source --- I'll add a cover note saying it's actually BSD.

Let me also say that if you actually want to use this for anything you're crazy. I wrote it when I was... younger... and when I had no idea what I was doing. The only thing it's useful for these days is looking at and laughing at.


I figure it might give me ideas for how to express some compiler concepts in awk. What I plan to do is the most brain-dead, simple implementation possible so anyone can understand & vet it.


Almost relevant: I wrote a parser generator in and for awk (called 'yawk' even though it did LL(1) instead of LR grammars), even older than this. But at some point I lost it, and it was never released online.


Which do you think would be better in terms of coming with all major distros and being easiest to write a compiler in: awk or bash? I've forgotten both due to injury so I can't tell without lots of experimenting.


I've never done any serious programming with bash, just simple Bourne shell scripts, because I don't want to think about all the escaping rules and such. I did write some programs in Awk in the 90s (notably https://github.com/darius/awklisp), so I'd go with that. Maybe someone who's bent bash to their will could speak up here?

AFAIK they're both ubiquitous, though you might need a particular awk like gawk for library functions, depending on what you need to do. Nowadays I'm way more likely to use Python, though of course it's a much bigger dependency.

Sorry about the injury, and good luck -- I'd like to hear how it goes.


The escaping in Bash can be a pain. I was recently writing an execution wrapper in Bash and needed to send the results via JSON. Fighting with the quotes was almost enough to make me throw in the towel and move to a language with a builtin JSON parser, but I ran across this technique of embedding a heredoc to preserve quotes in a variable. https://gist.github.com/kdabir/9c086970e0b1a53c3df491b20fcb0... It 'simplified' things and kept them readable.

Thanks for sharing awklisp. Nice reading for a Sunday morning.


I'm glad you enjoyed that, thanks. :)


Thanks for publishing it. I had long thought about writing a compiler in Awk. Finding yours through a comment here on HN some time ago served as a major validation of the idea. I ended up writing one.

Here is the result: https://github.com/dbohdan/all-caps-basic. It targets C and uses libgc along with antirez's sds for strings. The multi-pass design with each pass consuming and producing text is intended to make the intermediate results easy to inspect, making the compiler a kind of working model. The passes are also meant to be replaceable, so you could theoretically replace the C dependencies with something else or generate native code directly in Awk. You can see some code examples in test/. Unfortunately, the compiler is very incomplete. I mean to come back to it at least to publish an implementation of arrays.


The combination of "it worked fine" and "so not the right language" is intriguing. You wrote about the lack of data structures; can you share more (in both directions)?


Bear in mind that this was twenty years ago, so it's not exactly fresh in my mind; but basically: it was intended to do one job, once, which was to compile the real compiler (written in a simplified version of the language) into bytecode. Once that worked, I would never need to touch it again.

Which meant that it was perfectly allowable for it to be hacky and non-future-proof, which it was.

Here's part of the code which reads local variable definitions (in C-like syntax):

    function do_local( \
    nt, st, n, t){
        nt = readtype()                        # parse the type
        st = tokens                            # the type's name, as text
        ensuretoken(readtoken(), token_word)   # expect an identifier next
        n = tokens                             # the variable's name
        outofscope(n, 1)                       # the name must not already be in scope
        ldb[n] = "var"                         # record it in the symbol table,
        ldb[n, "type"] = st                    # faking struct fields with
        ldb[n, "sp"] = sp - spmark             # keyword-indexed array entries
        emit("# local " n " of type " st " at " (sp-spmark) "\n")

`tokens` is the value of the current token, `ldb` is the symbol table; you can see how I'm faking a structure using an associative array indexed by keyword.
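
Incidentally, awk's comma subscripts are what make this trick work. Here's a minimal standalone sketch of the same struct-faking pattern (the names are made up):

    BEGIN {
        ldb["x"] = "var"            # the entry's "tag"
        ldb["x", "type"] = "int"    # a fake struct field
        # awk joins comma subscripts with SUBSEP, so the line above
        # really stores ldb["x" SUBSEP "type"] in one flat array
        if (("x", "type") in ldb)
            print "x has type " ldb["x", "type"]
    }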

There's nothing actually very wrong with this code, but there's no type safety, barely any checking for undefined variables, no checking for mistyped structure field names, no proper data types at all, in fact... awk wouldn't scale for making the compiler much more complicated than it currently is. But it did hit the ideal sweet spot for getting something working relatively quickly for a one-shot job. It's still really good at that.


Some points:

* Arrays can be passed to functions only by reference, never by value

* Cannot return an array from a function (the usual workaround is sketched after this list)

* Meta-programming or pointers are only (barely) available in gawk

* There is an `@include` statement for `gawk` that is not part of POSIX, and there is no namespacing involved.

* Function names can only exist in the global namespace
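
For instance, here's a minimal sketch of the usual workaround for the array limitations (the helper name is made up): the caller supplies an array and the function fills it through the reference, returning only a scalar:

    function split_fields(line, out,    n) {
        n = split(line, out, ",")   # fills the caller's array
        return n                    # scalars can be returned; arrays cannot
    }

    BEGIN {
        count = split_fields("a,b,c", parts)
        for (i = 1; i <= count; i++)
            print i, parts[i]
    }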

There are some reasons somebody felt an urge to create Perl... Still loving awk, and using it every day for text processing jobs.


> There are some reasons somebody felt an urge to create Perl

Larry Wall (creator of Perl) says something pretty close to that here[1]:

"I was too lazy to do it in awk because it would have been hard to get awk to jump through the hoops I was wanting it to jump through. I was too impatient to wait for awk to finish because it was so slow. And finally, I had the hubris to think I could do better."

[1] http://www.linuxjournal.com/article/3394


I still think Awk is better for one-liners, but Perl gets the advantage for full-size programs.
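
For instance, the kind of one-liner I mean (the file and column are made up), summing the third comma-separated field:

    awk -F, '{ sum += $3 } END { print sum }' data.csv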


I actually found it really interesting that he was working on a high-assurance VPN when he created Perl to reduce his grunt work:

http://cahighways.org/wordpress/?p=8019

http://ieeexplore.ieee.org/document/213253/

BLACKER's heavy lifting in security was done by a high-assurance kernel called GEMSOS:

http://www.cse.psu.edu/~trj1/cse443-s12/docs/ch6.pdf

It was classified work for TRW whose details took a long time to get released. That might be why he rarely mentioned the BLACKER project when discussing Perl's origins. Possibly he was trying to obfuscate it a bit to avoid breaking laws.


http://cowlark.com/mercat/com.awk.txt

This is neat! Question for you regarding the formatting/syntax used:

Why the slash-newlines in function declarations?

E.g.

    function scope(name, \
    s) {


Awk has no way to declare local variables; the only names local to a function are its parameters. To simulate local variables, you can therefore add extra function parameters that callers never supply. The backslash is a line continuation: it lets the "local variable" parameters sit on their own line, visually separated from the "real" parameters, which makes the code more readable. In your example, when the function `scope' is called only one actual parameter would be provided, and `s' starts out empty.
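
A minimal sketch of the idiom (the function body here is made up; only the declaration style comes from the compiler):

    # "name" is a real parameter; "s" is never passed by callers, so
    # it starts out empty and behaves like a local variable.
    function scope(name, \
    s) {
        s = "processing " name
        return s
    }

    BEGIN { print scope("x") }   # only the real parameter is supplied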



