I'm trying to write a sort of SQL compiler. The current goal is to analyze queries and find similarities; later, maybe, to translate between SQL dialects.
I found Uber's QueryParser, but it's in Haskell, so I started wrapping the Python sqlparse library and implementing a Visitor to traverse its weird AST.
1. How close is it to implementing a compiler?
2. Is there any theory you can suggest as further reading on the matter?
3. Would you use a different language/library than the ones I picked?
In Java there's JSQLParser, and I think you can access Presto's parser too.
You can contribute to the project.
Then you should focus on the optimization phase, where indexes, table statistics, etc. are involved. Parsing, and even translating SQL from one dialect to another, wouldn't give you much to analyze.
I'm assuming you want to learn about query performance and execution plans?
Frankly, if you want to analyze queries, just install the RDBMS, set up a data warehouse, and use the tools available to analyze query plans, etc. There is so much that affects queries beyond the actual SQL, since hardware and data load affect them too. You could run the same SQL query and end up with wildly different query/execution plans depending on how the data/tables/indexes have changed on your system.
Alternatively, congrats to the author for managing to get his blog on the front page 4 out of the last 5 days. Has that ever been done before on HN, I wonder? It's gotta be some sort of record.
People submit it and upvote it.
This is my effort - trying to show genuine data structures and processes.
But the project is unfinished, I'm afraid - you will find dead-ends.
    while expression is not a numeral:
        replace all "\((NUMERAL)\)" with first group
        if find first "(NUMERAL)([*/])(NUMERAL)":
            replace with result
        else if find first "(NUMERAL)([+-])(NUMERAL)":
            replace with result
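For the curious, here is a literal Python transcription of that pseudocode - a toy that assumes well-formed input and numbers small enough that float repr never produces scientific notation:

```python
import re

NUM = r"(-?\d+(?:\.\d+)?)"

def evaluate(expr):
    """Evaluate arithmetic by repeated regex rewriting, as in the pseudocode."""
    expr = expr.replace(" ", "")
    while not re.fullmatch(NUM, expr):
        # replace all "(NUMERAL)" with the first group: "(3.0)" -> "3.0"
        expr, n = re.subn(r"\(" + NUM + r"\)", r"\1", expr)
        if n:
            continue
        # find first NUMERAL [*/] NUMERAL, else first NUMERAL [+-] NUMERAL
        m = re.search(NUM + r"([*/])" + NUM, expr)
        if m is None:
            m = re.search(NUM + r"([+-])" + NUM, expr)
        a, op, b = float(m.group(1)), m.group(2), float(m.group(3))
        result = {"*": a * b, "/": a / b, "+": a + b, "-": a - b}[op]
        # replace the matched span with the result
        expr = expr[:m.start()] + repr(result) + expr[m.end():]
    return float(expr)

print(evaluate("(1+2)*3"))  # 9.0
```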
But nonetheless, even if regexps cannot validate the syntax, that does not mean the source-to-source transformation could not be (mostly) achieved with them, because the source and target languages are quite similar.
Imagine a new language exactly like C but with every semicolon replaced by a duck "\_o<". You could not parse it using regexps, but a regexp could mostly "compile" it to C, since that only requires finding every "\_o<" and replacing it with ";".
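The whole "compiler" for that (hypothetical) language is then a single substitution:

```python
import re

duck_src = r'int main(void) { return 0 \_o< }'
# The only rewrite needed: every duck becomes a semicolon.
c_src = re.sub(re.escape(r'\_o<'), ';', duck_src)
print(c_src)  # int main(void) { return 0 ; }
```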
*A real Regular Expression can't. The extended versions with e.g. backrefs can. But they're also Turing complete anyway.
Now, if your only concern is that everything people feel and express has to be technically correct, well, have fun in your life.
However, if you actually want to learn about compilation, it would be a better exercise to use much simpler languages but actually do the work. For example: write a simple stack VM that can only do a NAND operation, then parse and compile to it a boolean expression language that has 0, 1, not, and, or, xor. You will need to build a simple AST and traverse it twice: once to translate the basic boolean operations to use only nand, and a second time to compile the nand expression to your VM's assembly language.
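A sketch of that exercise in Python, under my own toy assumptions: operators are written prefix with parentheses, e.g. `xor(and(1,1), not(0))`, and the VM has just PUSH and NAND.

```python
import re

def tokenize(src):
    return re.findall(r'[01]|not|and|or|xor|nand|[(),]', src)

def parse(tokens):
    """Prefix syntax: 0 | 1 | op(expr) | op(expr, expr)."""
    tok = tokens.pop(0)
    if tok in ('0', '1'):
        return int(tok)
    assert tokens.pop(0) == '('
    args = [parse(tokens)]
    while tokens[0] == ',':
        tokens.pop(0)
        args.append(parse(tokens))
    assert tokens.pop(0) == ')'
    return (tok, *args)

def lower(ast):
    """Traversal 1: rewrite every operation in terms of nand alone."""
    if isinstance(ast, int):
        return ast
    op = ast[0]
    args = [lower(a) for a in ast[1:]]
    if op == 'not':
        a, = args
        return ('nand', a, a)
    a, b = args
    if op == 'and':
        t = ('nand', a, b)
        return ('nand', t, t)
    if op == 'or':
        return ('nand', ('nand', a, a), ('nand', b, b))
    if op == 'xor':
        t = ('nand', a, b)
        return ('nand', ('nand', a, t), ('nand', b, t))
    return ('nand', a, b)  # already nand

def codegen(ast, out):
    """Traversal 2: post-order emit for the stack VM (PUSH and NAND only)."""
    if isinstance(ast, int):
        out.append(('PUSH', ast))
    else:
        codegen(ast[1], out)
        codegen(ast[2], out)
        out.append(('NAND',))
    return out

def run(program):
    stack = []
    for instr in program:
        if instr[0] == 'PUSH':
            stack.append(instr[1])
        else:  # NAND: the VM's only operation
            b, a = stack.pop(), stack.pop()
            stack.append(0 if a and b else 1)
    return stack.pop()

prog = codegen(lower(parse(tokenize('xor(and(1,1), not(0))'))), [])
print(run(prog))  # 0
```

Small as it is, this hits every phase the parent describes: lexing, parsing to an AST, a lowering pass, code generation, and execution.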
That simplifies adding symbols: you keep checking for duplicates only until the depth of the symbols you see drops below your depth, and the actual insertion doesn't need a pointer to the list's last element - you insert at the head.
That simplifies destroying symbols: you keep popping the head until you meet a TOK_PROCEDURE at which point you stop.
That simplifies looking symbols up: you search until the first name match and return it immediately, instead of remembering the last seen matching symbol while traversing the full list.
I am not even talking about efficiency (although of course this implementation is also more efficient in all 3 use cases), it's about code simplicity: the less context you need to maintain during a list traversal, the easier it is to understand what this traversal looks for.
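In Python, the head-insert version described above might look like this (a sketch; here TOK_PROCEDURE is a sentinel node pushed on scope entry, which is one reading of the parent's scheme):

```python
TOK_PROCEDURE = "TOK_PROCEDURE"

class Sym:
    def __init__(self, name, depth, nxt):
        self.name, self.depth, self.next = name, depth, nxt

class SymTab:
    def __init__(self):
        self.head = None
        self.depth = 0

    def enter_scope(self):
        self.depth += 1
        self.head = Sym(TOK_PROCEDURE, self.depth, self.head)  # sentinel

    def add(self, name):
        # Duplicate check only scans the current scope: stop as soon as
        # the depth of the symbols you see drops below yours.
        s = self.head
        while s is not None and s.depth >= self.depth:
            if s.name == name:
                raise KeyError(f"duplicate symbol {name!r}")
            s = s.next
        self.head = Sym(name, self.depth, self.head)  # insert at the head

    def leave_scope(self):
        # Keep popping the head until you meet the TOK_PROCEDURE sentinel.
        while self.head.name != TOK_PROCEDURE:
            self.head = self.head.next
        self.head = self.head.next
        self.depth -= 1

    def lookup(self, name):
        # First match wins: inner scopes shadow outer ones automatically.
        s = self.head
        while s is not None:
            if s.name == name:
                return s
            s = s.next
        return None
```

Each method is a single short loop with no extra state carried across the traversal, which is exactly the simplicity argument above.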