Hacker News new | past | comments | ask | show | jobs | submit login
Let's write a compiler, part 5: A code generator (briancallahan.net)
96 points by ingve 39 days ago | hide | past | favorite | 29 comments

Hi all, I have a bit off topic question but seems related.

I'm trying to write sort of a SQL compiler. The current goal is to analyze queries and find similarities, later maybe to translate between sql dialects. I found Uber's QueryParser[1] but it's in haskell, so I started wrapping the python sqlparse[2] library and implement a Visitor to traverse their weird AST. 1. How close is it to implementing a compiler? 2. Is there theory you can suggest further reading for that matter? 3. Would you use a different language/library then I picked?

Thanks :)

[1] https://github.com/uber/queryparser [2] https://github.com/andialbrecht/sqlparse

I’ve been playing with ANTLR[1] and pretty happy with the generated parser. You can find sql grammar on GitHub[2]

[1] https://www.antlr.org/

[2] https://github.com/antlr/grammars-v4

In Go you've got: go-mysql-server [0], vitesse's parser [1], and one I wrote [2].

In JavaScript you've got: Alasql's [3] (this is a large file) and here's another SQLite parser [4].

In Java there's JSQLParser [5] and I think you can access Presto's parser too [6].

[0] https://github.com/dolthub/go-mysql-server

[1] https://github.com/blastrain/vitess-sqlparser

[2] https://github.com/eatonphil/gosql

[3] https://github.com/agershun/alasql/blob/develop/src/alasqlpa...

[4] https://github.com/codeschool/sqlite-parser

[5] https://github.com/JSQLParser/JSqlParser

[6] https://github.com/prestodb/presto

Depends on the complexity of your queries, but if you have a narrow subset that you're interested in, implementing a recursive descent parser for just those parts of the syntax that fits your problem like a glove could be a better solution.


I'm aiming to analyze all the BI queries in my organization, some of them are quite complicated and most of them are hundreds lines of sql. Thanks for the direction, I'll dive into it

ZetaSQL[1] seems like it could be a fit for your use case. I've worked with Apache Calcite in the past and found it to be very complex to work with. I found ZetaSQL to be a little easier to use.

[1] https://github.com/google/zetasql

Lookup Apache Calcite that does most of what you describe.

You can contribute to the project.

> The current goal is to analyze queries

Then you should focus on the optimization phase. Where indexes, table statistics, etc are involved. The parsing and even translating sql from one dialect to another wouldn't give much for you to analyze.

I'm assuming you want to learn about query performance and execution plans?

Frankly, if you want to analyze queries, just install the rdbms, set up a data warehouse and use the tools available to analyze query plans/etc. There are so much that affects queries beyond the actual sql since the hardware and data load affects the queries. You could run the same sql query and end up with wildly different query/execution plans depending on how the data/tables/indexes are changed on your system.

The ontology of compilation seems to be the only thing people are engaging with regarding this blog post series here on HN, so I don't see why this series is on the front page every day when the discussion is so stale. If you want to talk about how compiling to C is or isn't compiling, save yourself the trouble, everything has been said already in the past week (including this thread):




Alternatively, congrats to the author, for managing to get his blog on the front page 4 out of the last 5 days in a row. Has that ever been done before on HN, I wonder? It's gotta be some sort of record.

Crafting Interpreters [0] just came out in print and was widely discussed here [1], there's a Lang Jam [2] event happening this weekend which also spawned a useful HN discussion [3], so I just think a lot of readers here are interested in simple and fun language development tutorials and inspiration. But it is curious, part 2 of Brian Callahan's series [4] failed to get any traction here, maybe people aren't that into lexing?

[0] https://craftinginterpreters.com/

[1] https://news.ycombinator.com/item?id=27997167

[2] https://github.com/langjam/langjam

[3] https://news.ycombinator.com/item?id=28021161

[4] https://briancallahan.net/blog/20210815.html

> so I don't see why this series is on the front page every day when the discussion is so stale

People submit it and upvote it.

This series of blogposts is amusing but it is kind of frustrating too: the end result is merely a compiler. It is a source to source translator from a source language which is very very similar to a subset of the target language. Well I guess you can call that a compiler, but it really doesn't teach the readers what an actual compiler looks like. Exaggerating a little, I would say that it feels like a few calls to sed could do the same thing using regexp.

Lots of compiler tutorials are like this - there's very little out there to explain how compilers really work.

This is my effort - trying to show genuine data structures and processes.


Hey, I just skimmed through this, it seems to be brilliant!! Just one question, can one go through this tutorial without much knowledge about Ruby? Should one read it with 'Ruby Under a Microscope'? I have a decent enough knowledge about compilers, have gone through Thorsten Ball's, 'Writing an Interpreter in Go'.

Only need to know basic Ruby programming - no libraries or deep knowledge.

But the project is unfinished I’m afraid - you will find dead-ends.

Oh I see, thanks for the info!! I'll see what I can gain out of it. Thanks once again for the info!!

It could not. Unless your language is regular class, then regular expressions cannot parse the language accurately.

You can build a parser around regexes though, where most of the code is regexes and then you have a little bit of code to deal with the irregularity. For instance consider arithmethic expressions consisting of constants, -, +, *, /, and parentheses. You could evaluate that using something like (expression to parse a numeral is left as exercise to reader).

   while expression is not a numeral
       replace all "\((NUMERAL)\)" with first group
       if find first "(NUMERAL)([*/])(NUMERAL)"
         replace with result
       else if find first "(NUMERAL)([+-])(NUMERAL)"
         replace with result

What you are doing there is conflating the lexing phase and the parsing phase. Regexes are perfect for recognizing tokens, but for a language with nested parentheses, you must have a push-down automaton to process it. Otherwise, you will not be able to verify that your delimiters are matched.

That's one way of looking at it I suppose. But it does require stretching the definition of token a little bit if you are eg using regexes to identify arbitrarily long multiplications. I think my prefered perspective is that you are using regexes to parse regular sublanguages and then something more powerful (e.g. a push down automaton, but there are even more powerful parsers than that) to bind those sub-parsers together.

That's why I said "if feels like".

But nonetheless even if regexps would not be able to validate syntax, it does not mean that the source-to-source transformation could not be (mostly) achieved using them, because the source and target language are quite similar.

Imagine a new language exactly like C but every semicolon is replaced by a duck "\_o<". You could not parse it using regexps, but regexp could mostly "compile" it to C as it only requires to look for all "\_o<" to replace them with ";".

This wouldn't work as you could have a semicolon in a string, to handle that you would need to know when you're in a string, and due to the existence of backslash escapable quotes in strings, you can't do that with a regex*

*A real Regular Expression. The extended versions with e.g backrefs can. But they're also Turing complete anyway

Actually, I think that the grammar for a C string with escapes characters is regular. The only bit I am worried about are the back-slash octal and back-slash hexadecimal notations. But they are not relevant if you want to detect escaped quotes in strings.

I know that. But admit that it is nitpicking and completely besides the point.

No, it’s not beside the point at all. It disproves your statement for any real language.

Sigh. My point was that it _feels_ like this because the tutorial does not actually do an actual compiler's work. It is a quite simple rewrite to a very similar language with the same level of abstraction. No language constructs have to be compiled, i.e., to be automatically reimplemented in a target language which doesn't have this.

Now, if your only concern is that everything that people feel and express have to be technically correct, well, have fun in your life.

However, if you actually want to learn about compilation it would be a better exercise to use way simpler languages but actually do the work, for example: write a simple stack VM which can only do NAND operation, and then parse and compile to it a boolean expression language which has 0, 1, not, and, or, xor. There you will need to build a simple AST and traverse it one time to translate the basic boolean operations to use only nand, and then a second time to compile the nand expression to your VM's assembly language.

C is his hammer, but tenure is tenure.

Alright, let's write a comment about the actual implementation described in the post. So: why make the symbol table a linked list that has its head fixed at "main" and grows and shrinks at its tail? It is used in a stack-like fashion, so why not reverse the order: put "main" at its tail and grow and shrink it at its head?

That simplifies adding symbols: you keep checking for duplicates until the depth of the symbols you see drops below your depth, and actual insertion doesn't need the pointer to the list's last element, you insert at the head.

That simplifies destroying symbols: you keep popping the head until you meet a TOK_PROCEDURE at which point you stop.

That simplifies looking symbols up: you search until the first name match and return it immediately, instead of remembering the last seen matching symbol while traversing the full list.

I am not even talking about efficiency (although of course this implementation is also more efficient in all 3 use cases), it's about code simplicity: the less context you need to maintain during a list traversal, the easier it is to understand what this traversal looks for.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact