Anyone can do whatever they want, and if you do want to write a parser in C this may be a good guide for you. But it's hard to see the benefit of writing a language frontend in C, generally speaking. (An obvious exception might be if you are writing one of the first languages for a new architecture/operating system, or one where only a C compiler exists.)
If you are writing a compiler, you'll just be emitting some other language anyway so the parser/emitter being in C doesn't really give you anything even if you are emitting C.
If you are writing an interpreter and you want performance, you're going to have to emit bytecode and have a bytecode interpreter. Wanting your bytecode _interpreter_ to be in C could make sense but your frontend could just emit the bytecode to a file and your interpreter written in C could read from that file.
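As a sketch of that split, here is a minimal stack-based bytecode loop in C. The opcodes and encoding are entirely hypothetical; a real frontend would write the bytes to a file, and this loop would read them from there instead of the inline array:

    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };  /* hypothetical opcodes */

    static void
    run(const unsigned char *pc)
    {
        int stack[256], sp = 0;

        for (;;) {
            switch (*pc++) {
            case OP_PUSH:  stack[sp++] = *pc++; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[sp - 1]); break;
            case OP_HALT:  return;
            }
        }
    }

    int
    main(void)
    {
        /* what the frontend might have written out: code for "2 + 3" */
        const unsigned char prog[] = {
            OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT
        };

        run(prog);
        return 0;
    }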
It's a handwritten parser too so it's not like you picked the language so you could use a certain complex or nice parsing library.
And it's not like the performance (or binary size) of the parser/emitter typically matters so much that you need even that part to be so much faster (or smaller) than a parser/emitter in Go/Java/C#/whatever.
The overhead to writing C makes it a less friendly introduction to the topic if you have never written a compiler or interpreter (or parser, specifically) before.
Again, you can do what you want. And if you must do it in C, you will probably value any posts like this. Just wanted to throw out reasons you may not want to do this.
Brian's choice to use C is perfectly well justified:
"I am thinking that we should choose C to implement our PL/0 compiler. Eventually we will want to have this compiler run on many different platforms, including Unix, MS-DOS, and potentially even CP/M. C is a language that will run on every platform I would want to run our PL/0 compiler on, so C works for me. C also requires no special setup so our usual development platform of vim on OpenBSD will work just fine. Finally, it will help with bootstrapping the self-hosting version of the PL/0 compiler, since we in theory could natively compile the C implementation on our target platform then use that to compile the PL/0 implementation, also on the target platform."
The reader may learn something from these posts whether they choose to write a compiler in C or not.
I don't mean to debate Brian's choice (your quote shows it falls into my exception anyway) but to mention to readers here why they may be more interested in following guides on compiling/interpreting implemented in languages other than C.
C is actually a huge disadvantage here for me, since I have no use for it outside of this tutorial, which makes it hard to justify learning it on top of learning how a compiler works...
I don’t know why they’re writing the parser using globals though instead of passing a pointer to a parser state struct.
Because there's only one. Why obfuscate that fact? Don't be tempted by the cargo-cult to make things more complex than they need to be just to appease some dogmatic "best" practices that actually aren't.
I think this is a really good example to point to in discussions of ‘cargo-cult’-ing. GP is merely reciting the standard ‘best’ practice for development in C, and on a more complex project or in another use case, they may be correct to do so. However, as you pointed out, in this case the globals are true single-instance ‘objects’ and the use is semantically sensible and justified. It dovetails nicely with some of the DoD vs ECS discussion from the front page earlier today. Do what makes the most sense for your use case, for reasons that apply to your project; everything else is abstraction.
No, those things are not a complexity of C. They're just complex to implement at a lower level. They're often complicated in any language if approached at the level C provides. You have to hand-hold the machine through it. A C-like implementation of linked lists in an array of memory using pointers, for example, looks just as bad if written that way in unsafe {} Rust, or even as some mutating thing in Haskell. You can do that if you want, but you don't have to. Those languages allow you to abstract that away very quickly (often already pre-abstracted for you with default approaches). That's what people mean when they say C is simple. Plain and simple. Minimal. You have to do much of the heavy lifting.
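To make that concrete, even the "easy" half of a linked list takes this much ceremony in C, and every caller has to remember the failure case; in most higher-level languages the whole block is a built-in one-liner:

    #include <stdlib.h>

    struct node {
        int value;
        struct node *next;
    };

    /* push a value onto the front of a list; returns the new head */
    static struct node *
    push(struct node *head, int value)
    {
        struct node *n = malloc(sizeof *n);

        if (n == NULL)
            return NULL;  /* caller must check for allocation failure */
        n->value = value;
        n->next = head;
        return n;
    }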
Respectfully disagree. A language that fails to adequately provide safeguards to a user is a complex language because it forces all the complexity of dealing with the underlying machine onto the user. These are things a user needs to know to be effective in the language, which in turn makes using the language complex. It's a deceptively simple language.
C is complex because of what it doesn't do, not what it does. That's IMO worse.
It's not so much that we're disagreeing as arguing different points about the language, I think. You are right that it is not a simple language to use, in many ways. Archaic, unnecessarily complicated to implement many standard approaches, and very repetitive. The header system is just the wrong approach. Etc. I wouldn't pick it for the main language for a new project, if at all avoidable personally.
It's very easy to implement, though. A compiler fits in 32 kilobytes of RAM on some architectures. Only a handful of control structures. Limited types. That's the sense of simple that I meant, and which most people meant. It's not a compliment. It's a neutral observation. It's actually C's greatest weakness. It's too simple.
Well, only the lexer deals with strings, and even then… it’s not really creating strings, just extracting some information from them. Creating strings efficiently and safely is IMO the hard part of string management in C.
> The overhead to writing C makes it a less friendly introduction to the topic if you have never written a compiler or interpreter (or parser, specifically) before.
That's true and was my first thought. C seems like an arcane choice for an introduction course.
> But it's hard to understand the benefit to writing a language frontend in C, generally speaking.
The advantage of writing in C is that the language provides very few high-level features. All you have are functions and memory, so you end up doing everything yourself, which is a laborious process but rewarding to the mind, especially if you have never done low-level programming.
I expect this series is similar to Wirth's own chapter on writing a PL/0 compiler. It seems to be following the same recursive descent approach, but using C rather than Pascal.
Since the implementation is in C, perhaps it might make sense to compile a version of PL/0 ("C/0") with C syntax rather than Pascal syntax. It might be possible to get something close to a C/0 compiler that could compile its own source code.
Though one issue with C is you also need a cpp implementation.
The author forgot default cases in various switch (type) statements:
    static void
    factor(void)
    {
        switch (type) {
        case TOK_IDENT:
        case TOK_NUMBER:
            next();
            break;
        case TOK_LPAREN:
            expect(TOK_LPAREN);
            expression();
            expect(TOK_RPAREN);
        }
    }
The TOK_ enumeration has a good many more members than just those three. If any other one occurs in the code, or an outright corrupt type value occurs, control silently falls out of the switch as if a factor had been matched. If the error is handled at all, it will result in some strange diagnostic in some higher-level code that thinks it has seen an expression with a factor in it.
Adding a break after the last block in a switch is a good idea; someone adding a new case might forget. (With diagnostic support, like GCC requiring a fallthrough comment, this is less important nowadays.)
If a switch deliberately falls through for unmatched cases, have
    default:
        break;
to express that explicitly.
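Concretely, the factor() above might look like this with the unmatched cases handled; error() here is a hypothetical helper that prints a diagnostic and exits:

    static void
    factor(void)
    {
        switch (type) {
        case TOK_IDENT:
        case TOK_NUMBER:
            next();
            break;
        case TOK_LPAREN:
            expect(TOK_LPAREN);
            expression();
            expect(TOK_RPAREN);
            break;
        default:
            error("expected factor, got token %d", type);  /* hypothetical error() */
        }
    }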
This whole approach of the parser consisting of void functions is poor for 2021. It's as if the author is deliberately trolling everyone who has ever taken any level of interest in functional programming.
The author will have a hard time fitting the necessary semantic actions into this code, like constructing an abstract syntax tree, and might succumb to the temptation to do it with some hacks which build the structure procedurally using global variables.
The parser would be more future-proof if it anticipated the need for backtracking. That is to say, a function such as factor() above could be considered a pattern-matching construct and return a useful value indicating whether it matched or failed to match. Furthermore, if it failed to match, it would backtrack, leaving the state of the input stream as it was prior to entry.
You can then do interesting things, because you're essentially LL(k) now, like speculatively parse the input using several possible constructs in priority order, until one matches.
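A sketch of what that might look like, reusing the series' type/next() globals; struct lexer_state, mark(), and rewind_to() are hypothetical helpers that snapshot and restore the lexer position:

    static int
    factor(void)
    {
        struct lexer_state saved = mark();  /* hypothetical: snapshot input */

        switch (type) {
        case TOK_IDENT:
        case TOK_NUMBER:
            next();
            return 1;
        case TOK_LPAREN:
            next();
            if (expression() && type == TOK_RPAREN) {
                next();
                return 1;
            }
            /* FALLTHROUGH: parenthesized expression failed to match */
        default:
            rewind_to(saved);  /* hypothetical: restore input position */
            return 0;
        }
    }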
The code inconsistently mixes the use of concrete character constants like '.' and '{' and #define symbols that expand to character constants like #define TOK_PLUS '+'.
Using the character constants in the code as in expect('+') or whatever is clearer. The Yacc approach of using constants for abstract tokens, (which are enumerated starting at some value above 256), and concrete character constants for single-character tokens that denote themselves, is worth mimicking.
You're not going to redefine TOK_PLUS to something other than '+' in the future, and doing so would be like changing #define FOUR 4 to #define FOUR 5. TOK_PLUS doesn't inform any better than '+', and isn't any easier to find with grep.
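In code, the Yacc convention amounts to something like this; the values past 256 are arbitrary, and only the single-character tokens carry their own meaning:

    /* abstract tokens are enumerated above any possible character value */
    enum {
        TOK_IDENT = 256,
        TOK_NUMBER,
        TOK_ASSIGN,     /* ":=" */
    };

    /* ... */
    expect('+');        /* single-character token denotes itself */
    expect(TOK_IDENT);  /* abstract multi-character token */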
> The author will have a hard time fitting the necessary semantic actions into this code, like constructing an abstract syntax tree,
It's a declared non-goal.
> The code inconsistently mixes the use of concrete character constants like '.' and '{' and #define symbols that expand to character constants like #define TOK_PLUS '+'.
> Using the character constants in the code as in expect('+') or whatever is clearer.
That's not how it works: TOK_* represents whole elements; an element may happen to be a single character, but it can also be a word. It would be highly confusing to pass single characters to expect(), as you wouldn't know at first sight whether it expects a plain character or an element. The TOK_* constants could just as well be assigned random numbers, as you say in your following sentence. And they were, until yesterday :-) I guess it is easier to debug with mnemonic characters.
Compilers can go straight to output without building an AST; an AST is just one form of output. I should have written more generally about output.
Output can be produced procedurally in a single pass. However, it will still require refactoring to work that in. For instance, say we go to a register machine (one that has an unlimited number of registers). The expression() function needs to know the destination register of the calculation. If it returns void as before, then it has to take it as an argument. Or else, it has to come up with the register and return it.
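For instance, the register-returning variant might look like this; term_reg(), emit(), and the trivial bump allocator are all hypothetical:

    static int next_reg;  /* hypothetical: never-reused bump allocator */

    static int
    expression(void)
    {
        int lhs = term_reg();  /* hypothetical: returns register holding the term */

        while (type == TOK_PLUS || type == TOK_MINUS) {
            int op = type;
            int rhs, dst;

            next();
            rhs = term_reg();
            dst = next_reg++;
            emit("%s r%d, r%d, r%d",
                op == TOK_PLUS ? "add" : "sub", dst, lhs, rhs);
            lhs = dst;
        }
        return lhs;
    }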
I think that if the target is a stack-based machine, that would be the easiest to work into the parsing scheme, because stack code can be emitted without any context at all. Let's use term() as an example:
Original recognizer skeleton:
    static void
    term(void)
    {
        factor();
        while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
            next();
            factor();
        }
    }
Stack-based output:
    static void
    term(void)
    {
        factor();  /* this now outputs the code which puts the factor on the stack */
        while (type == TOK_MULTIPLY || type == TOK_DIVIDE) {
            int op = type;  /* save the operator: next() and factor() advance type */

            next();
            factor();  /* of course, likewise outputs code */
            output_stack_operation(op);  /* we just add this */
        }
    }
What output_stack_operation looks like:
    static void
    output_stack_operation(int op)
    {
        switch (op) {
        case TOK_MULTIPLY: output_line("mul"); break;
        case TOK_DIVIDE:   output_line("div"); break;
        /* ... */
        }
    }
For instance if we see
3 * 4 / 6
we have the grammar symbols
factor TOK_MULTIPLY factor TOK_DIVIDE factor
the term function calls factor() first, and that function produces output, which might be the line:
push 3
the term function calls factor() again, so this time the output:
push 4
is produced. Then it calls output_stack_operation(op) where op is TOK_MULTIPLY. This outputs:
mul
The loop continues and further produces
push 6
div
Because stack code implicitly accesses and returns operands using the stack, the syntax-directed translation for it doesn't have to pass around environmental contexts like register allocation info and whatnot.
The stack-based operations do not take any argument, and therefore a parser whose functions don't return anything and don't take any arguments can accommodate stack-based code generation.
Stack-based code can be an intermediate representation that is parsed again, and turned into something else, like register-based.
If the plan is to go to stack-based output, the code can work.
I second all this. Reading the blog post, I was wondering how the author would manage to actually build an AST with this parser, rather than just validate syntax.
The thing is that building an AST is typically the kind of thing where "do it with some hacks which build the structure procedurally using global variables" as kazinator said is a sure way to have horrible bugs. The inherently recursive nature of an AST doesn't mix very well with procedural manipulation of a global state.
I don’t think there will be an AST. https://briancallahan.net/blog/20210814.html: “We will be writing a single-pass compiler for a simple language and immediately output our final output code as soon as our compiler has enough information to do so”
That “as soon as” implies code will be generated before the entire program has been parsed.
Also, for me single-pass implies “no AST”, as you would need at least one pass to construct one, and iterating over an AST counts as another pass for me.
Single-pass to me also implies a code generator that writes the object file directly, rather than compiling to C or assembly (or some other language) as an intermediate format. But words don't mean anything anymore, so long as people can plausibly convince others that making them feel bad by calling out improper use means that they should be allowed to use words however they want, like a child owed the opportunity to exercise their unbridled spirit.
Note that loading the program into memory is also a pass. Producing an installation image is also a pass, as is installing the program into the target. When we are talking about compiler passes, we don't count these, because they are outside the compiler.
When we consider a compiler that puts out assembly language, if that works without producing an AST, just by generating assembly in a syntax-directed way as it followed the phrase structure of the code, then we call that a one-pass compiler. What it makes is assembly language, and it does that in one pass.
Producing executable code from the assembly requires another pass, but that pass belongs to the assembler. The compiler is not even running, and so it can't be a pass of that compiler.
HN's capacity for intellectual dishonesty knows no bounds. That's a dishonest attempt to bankrupt the term "single-pass compiler", and remove its value to communicate something meaningful. By that token, the Amsterdam Compiler Kit, notable for being an example of the classic multi-pass compilation strategy, is apparently not a multi-pass compiler at all! It's just a bunch of single-pass compilers.
"a one-pass compiler is a compiler that passes through the parts of each compilation unit only once, immediately translating each part into its final machine code"
Even if you want to raise the "That's just Wikipedia! Anyone can edit it!" argument, rational thought is still not on your side.
Unlike a program that has been compiled into object code, there is no value in a program that has been compiled into assembly except to run it through an assembler and turn it into a binary. The only accurate description for this would be a two-pass compiler (not single-pass). The assembly is an intermediate representation, and the "compiler" is the union of the language frontend that outputs assembly + the assembler itself. (It does not matter that the assembler runs in a separate process and is developed independently.)
There's already plenty of bikeshed here but that just takes the cake. The only code that's truly "future proof" is code that you're willing to rewrite; which I'm sure the author is capable of doing if/when the need arises.
The code isn't perfect; it's a learning project and still WIP, but you can check the code for a parser for the Lox language (C-like) from 'Crafting Interpreters' in this project. 'Ast.fs' and 'Parser.fs' should be enough to get an idea of what it's doing. You can also google the grammar of the language, which is quite simple:
F# is an ML-family language, and ML stands for meta-language: it was designed for language tools and compilers! Andrew Appel's Modern Compiler Implementation in ML is a great resource for code examples that aren't exactly F#, but will look damn close. The book has also been published with C and Java, if you want side-by-side comparisons.
ML was designed to be the metalanguage for the LCF proof system, used to write proof tactics. It was then recognized to be interesting in its own right as a programming language.
Oh, thanks for the correction and apologies for the inadvertent misinfo.
Your comment sent me reading the HOPL paper about SML -- I think I (based on folklore knowledge/informal conversations) conflate the original development of ML with subsequent development and standardization around SML. All of this was well before my time so it's quite interesting to read about some of the details.
Lemon is an infinitely better parser generator than Yacc; it solves a lot of the issues that make Yacc terrible to work with: named non-terminals, non-terminal destructors, re-entrancy, etc. It's part of the SQLite codebase. Highly recommended.
In many cases, though, I think parser combinators thread the needle elegantly - nom in Rust for instance. Or even just a hand-rolled LL(1).
Why would you say that? Yacc and other parser generators exist for a good reason: hand written parsers can be quite hairy to debug and extend, while parser generators offer a domain specific language to specify your grammar and can generate efficient parser code based on it.
What are the arguments in favor of manually writing parsers?
Fundamentally I don't think parsing is a problem that's complex enough to warrant a custom tool and language. And even if it were, Yacc is the wrong tool for the job in most cases.
Since most languages are context-sensitive, you almost always have to bend Yacc, which is designed for context-free languages, out of shape to apply it.
It's a hammer but we hardly have any nails, and the nails can just be pushed in by hand without a hammer, and they're the wrong type of nails anyway so that hammer isn't really compatible, and top of that it's a really expensive hammer.
Yes - so straight away you have to leave the DSL and bypass the formal mode of Yacc. At which point why bother with it? If the first thing you need to do with your tool is hack around it, maybe the tool isn’t so well-designed?
And actions often do things like side-effect changes to the lexer state, so it gets worse from there.
If you've got the right utility functions, a handwritten parser visually maps very directly to its grammar. It won't be as concise of course, but cognitively you can work with it almost as if it were just a grammar.
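For example, Wirth's PL/0 expression rule can sit right above the function that implements it, and the two read almost line for line:

    /* expression = [ "+" | "-" ] term { ( "+" | "-" ) term } . */
    static void
    expression(void)
    {
        if (type == TOK_PLUS || type == TOK_MINUS)
            next();  /* optional leading sign */
        term();
        while (type == TOK_PLUS || type == TOK_MINUS) {
            next();
            term();
        }
    }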
This post is good for learning. We have too many “systems” developed in Java (Hadoop et al, I'm looking at you, you sluggish bugs in the rug); we need more programmers to learn C and build systems in it.
Also, you can do this much more easily in Scala using parser combinators, if the goal is time to market and the code to be compiled is not very large. Here is an example to see how the approaches compare:
https://www.prophecy.io/blogs/scala-packrat-parser-combinato...
While C may be criticized for making it too easy to misuse pointers, for other features that are usually mentioned as security problems for C programs, e.g. out-of-bounds addressing and numeric overflow, the culprit is not the C language, but the manufacturers of the most popular CPUs, e.g. the Intel/AMD CPUs.
On most modern CPUs, checking for addressing bounds or for overflow is too expensive and the software developers almost always choose speed over correctness.
There have been a few C compilers with optional run-time checks for bounds and overflow, but almost nobody used those options for production code.
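For what it's worth, modern GCC and Clang do ship such checks as sanitizers (-fsanitize=address and -fsanitize=undefined), and the usual practice is still to enable them only during testing. A small program that trips both:

    /* build: cc -fsanitize=address,undefined demo.c
       UBSan reports the signed overflow, ASan the out-of-bounds read */
    #include <limits.h>
    #include <stdio.h>

    int
    main(void)
    {
        int a[4] = { 1, 2, 3, 4 };
        int x = INT_MAX;

        x += 1;                      /* signed overflow: undefined behavior */
        printf("%d %d\n", x, a[4]);  /* read one past the end of a[] */
        return 0;
    }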
Unlike the Intel/AMD ISA, there are other instruction sets which include a variety of exception conditions, for a cheap implementation of the run-time checks (e.g. the IBM POWER ISA), but even there I do not know if the most recent implementations of those architectures have efficient exceptions.
"security"? You mean the paranoia FUD of Big Tech justifying their enslavement of users and locking them out of their own bought general-purpose-computers --- I mean devices --- to further the authoritarian corporatocracy?
"Trusted" computing, DRM, and all its ugly ilk are premised on making things "secure". Maybe its time we opened our eyes to realise that uncomfortable truth. If the government and the megacorps are scared enough by encryption to want backdoors into our lives, maybe we should want ones into them too.
"Those who give up freedom for security deserve neither."
The comment you replied to had absolutely nothing to do with encryption or DRM.
C being low level means you're forced to do many tedious and complicated things manually. This makes the code you write prone to errors which (in the worst case) can be exploited by attackers to run arbitrary code on your system.
... but I'm sure you're already well aware of this which leaves me wondering why you wrote what you did.