Compiler Construction by Niklaus Wirth [pdf] (ethz.ch)
135 points by vector_spaces 27 days ago | 59 comments

OP's link points to a .pdf covering Chapters 1-8. Web page below has a link to Chapters 9-16 as well:


Wirthian languages, and especially Oberon-07, are definitely a topic I want to familiarise myself with, but there is so little information about them on the Web, and even fewer codebases to learn from. And a lot of the compilers and tooling seem to be weirdly Windows-oriented. The questions I still have very few answers to are:

* How is error handling done in Oberon-07?

* How does one do generic programming in Oberon-07?

* What is Oberon-07's async story?

I will be thankful for any pointers (heh).

Here (https://github.com/rochus-keller/Oberon) is a Qt/C++ based viewer/cross-referencer for Oberon-07 code bases as well as an Oberon to C++/Lua/LuaJIT bytecode transpiler/compiler; an IDE is work in progress.

> How is error handling done in Oberon-07?

There is an Oberon programming language with compiler written in Oberon, and there is an operating system called Oberon as well; you can have a look at the source code, but there are also a couple of free books about it, see http://www.projectoberon.com/

> How does one do generic programming in Oberon-07?

No support for generics yet; there was a proposal in the nineties, but Wirth didn't integrate it. Nor does the Go language, a brainchild of Oberon-2, have generics (yet).

> What is Oberon-07's async story?

Oberon is single-threaded; there is a coroutine library as part of the Oakwood standard, but nothing comparable to, e.g., Go.

Error handling is done the old-fashioned way: you just test for error conditions and such.

Generic programming and async are supported in Active Oberon, a branch of the Oberon family, which also supports exception handling.

The most recent version of Active Oberon:


The A2 OS, written in Active Oberon, and its documentation:


Some A2 overview


While Modula-2 had coroutines and exceptions, Wirth chose a minimalist path with Oberon.

Oberon-07's design goal was to remove as many features as possible from the original Oberon while keeping a GC-enabled systems programming language.

Then you have the other path, taken by others at ETHZ, which extended Oberon into a more expressive language, namely Oberon-2, Component Pascal and Active Oberon.

Active Oberon is what is actually still used at ETHZ for teaching some OS classes and for OS research.

> And a lot of the compilers and tooling seems to be weirdly Windows-oriented

When I read this book I had a hard time compiling and running the included source code. If you want to read the book and try out the example compiler (oberon0), you can follow the instructions here: https://github.com/lboasso/oberon0. You just need a JVM >= 8 and the oberonc compiler: https://github.com/lboasso/oberonc

People should note that this isn't usually how we construct compilers in industry today.

(This comment is posted every time but it's true.)

It really wasn't how compilers were constructed in '96 either. I was very into compilers in the late 80's and even then we didn't handwrite lexers or parsers. I was using bison and flex around '89.

Basically no serious programming language uses bison or flex, and everything that does (awk is yacc-based) has terrible error messages. People totally do handwrite lexers and parsers; if you use some language in earnest, the chances that it has a handwritten parser or lexer (or both) are far greater than the chances that it was developed with bison/flex.

> Basically no serious programming language uses bison or flex,

https://en.wikipedia.org/wiki/GNU_Bison#Use: Ruby, PHP, GCC until 2004, Go, Bash, PostgreSQL, MySQL, Perl.

Can we stop with this "duh, real programmers write their own handwritten parsers"? Parser generators are awesome! They force you to write a declarative grammar which a human can understand. If you're designing a language this is very useful. With a handwritten parser you're very likely to encode some unintended behaviour in edge cases; using a parser generator early in this process will very quickly reveal the ambiguities.

That doesn't mean that parser generators are always the best. There are many use cases where a handwritten parser is better. But the picture is way more nuanced than "no serious programming language uses bison or flex".

The current Go compiler uses a hand-written parser, judging from its source.

The benefit of having a BNF grammar for your language is enormous, and all modern languages lay out their lexical grammar and BNF explicitly in their specifications. But there are downsides. Debugging LALR grammars is painful; error conditions are almost uniformly painful. Many languages make tokenization dependent on the current context of a parse for language ergonomic reasons [1], and you might not be able to specify the action in the right place in your automated parser generator. Hell, the language might not even conform to the LL(k) or LALR(1) class your parser generator handles.

[1] I'd say this is context-sensitive lexing, but the resulting grammar isn't context-sensitive. Just somewhat more complicated than a naive translation of the grammar would indicate.

I can name just as many "serious programming language" implementations that use recursive descent or migrated to it because a parser generator turned out to be a bad idea after all:

Java, MSVC, GCC (after 2004), Clang/LLVM, ICC, C#, PowerShell, V8(JavaScript).

> They force you to write a declarative grammar which a human can understand

It's a separate language that also has to be learnt, an extra step in the compilation process, and the difficulty of debugging parser-generator-generated code cannot be overstated. A recursive-descent parser written in a programming language is easily understood, debuggable, and does not incur any extra complexity.

Perhaps that is where the "no serious programming language" argument comes from --- the migration away from parser generators is a sign that the language has received enough attention such that the limitations of a generated parser become prominent.

The problem with LALR parser generators is that you end up with an LALR parser. Which is fine as long as the input is error-free, but trying to get a reasonable error message out of any LALR parser is about as much fun as repeatedly poking yourself in the eye with a sharp stick.

FLEX, on the other hand, is perfect for the job. I can't imagine why anyone would hand-write a lexer.

> FLEX, on the other hand, is perfect for the job. I can't imagine why anyone would hand-write a lexer.

Flex introduces a new dependency to your project, which involves generating source files at build time. This is a lot of complexity in your build system if this is the first time you need it. And it also by default uses a style that isn't really modern: it's more annoying to use in a multithreaded program, or if you want to have multiple lexers in one program. (Or if you're not even using C/C++ in the first place!)

Furthermore, lexers are actually really easy to write. Most tokens [1] are recognizable within a character or two, and usually amount to a regex on the order of [a-z][a-z0-9_]* or "([^"\\]|\\.)*" -- the sorts of regexes that have very trivial implementations to handcode. The lexer I wrote for a JSON-ish file format is 128 lines of code. Sure, it'd be more like 30 lines of code if I used an automatic generator, but running into any issues integrating the automatic generator into my workflow would toss out the benefits of saving 100 lines of code.

[1] Excluding keywords, but keywords are usually handled by lexing the word as an identifier and then looking it up in a hashtable of known keywords.
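To make the "lexers are easy" point concrete, here is a minimal hand-rolled lexer sketch in Python. The token names and keyword set are made up for illustration; this is not the JSON-ish lexer mentioned above:

```python
import re

# Keywords are lexed as identifiers and promoted afterwards via a
# hashtable lookup, as described in the footnote above.
KEYWORDS = {"if", "else", "while"}  # hypothetical keyword set

def lex(src):
    tokens, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isalpha() or c == "_":
            m = re.match(r"[A-Za-z_][A-Za-z0-9_]*", src[i:])
            word = m.group()
            tokens.append(("KEYWORD" if word in KEYWORDS else "IDENT", word))
            i += m.end()
        elif c.isdigit():
            m = re.match(r"[0-9]+", src[i:])
            tokens.append(("NUMBER", m.group()))
            i += m.end()
        elif c == '"':
            m = re.match(r'"([^"\\]|\\.)*"', src[i:])
            if not m:
                raise SyntaxError(f"unterminated string at offset {i}")
            tokens.append(("STRING", m.group()))
            i += m.end()
        elif c in "+-*/(){}=,;":
            tokens.append(("PUNCT", c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r} at offset {i}")
    return tokens
```

Each branch is exactly the kind of one-or-two-character dispatch the parent describes, with a short regex doing the rest.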

In my experience, re2c is an improvement on lex/flex in most ways, but I agree - flex is indeed nicer than hand-rolled in every way (except if you are Arthur Whitney - his hand rolled lexers are a marvel)

I'm guilty of being overly hyperbolic, and you are right to correct me. But if your poster boys to illustrate the benefits of parser generators forcing you to write a declarative grammar which a human can understand are perl, bash or even postgresql, I'm not sure what to say ;)

Apart from Go (and maybe PHP, which I don't know well enough to judge), all the languages you list have pretty horrible syntax that probably few (Ruby?) to basically none of their users (bash, Perl, C++, PostgreSQL) really understand.

Agreed. I'm not going to say whether Ruby syntax is nice or not, but its use of parser generators is not a model of simplicity:



It also changes syntax in patch level versions: https://github.com/whitequark/parser.

> if your poster boys to illustrate the benefits of parser generators forcing you to write a declarative grammar which a human can understand are perl, bash or even postgresql, I'm not sure what to say ;)

And that is a straw man argument because, of course, no one is saying that.

> They force you to write a declarative grammar which a human can understand.

The problem is many languages' grammars are not declarative. They're full of completely uncontrolled side effects. So why are we trying to use a declarative tool for them? It's the wrong tool for the job.

> It's the wrong tool for the job.

Yes! We should have both handwritten parsers and parser generators in our toolkit and then depending on the language we want to parse we should choose the best tool for the job.

My point about a declarative grammar was more that if the language permits it (e.g. is simple enough for Bison), it's so valuable to have a declarative grammar which serves as both specification and parser. It will be way less bug-prone than any handwritten parser. (From my experience handwritten parsers tend to have both weird crashes and unexpected handling of edge cases. But maybe I just suck at programming?)

I've seen people use Bison on one complicated language, give up, and then bring that negative bias to other languages they want to parse, even when the other languages are very simple.

How is that relevant to this thread in which I am saying that the parsing/lexing as described in the link is very outdated? Maybe you should start a separate discussion.

> Maybe you should start a separate discussion.

I started this one!

Hmm, I wrote a VHDL (a hardware description language with Ada-derived syntax) front end using bison and flex. That's a serious language.

Regarding error messages: if one implements the grammar exactly as specified by the language spec, yes, confusing and useless error messages will result. That's because the language spec grammar includes restrictions based on semantics. To get good error messages, one has to remove/relax those parts of the spec grammar and then use semantic checking. Parsing is an art, not a science.

Right - no 'serious language' is hyperbole, and another counter example is Ruby.

But... if I was asked to implement a language tomorrow I wouldn't even consider using Bison and Flex. Why? Because they're context-free tools, and almost every industrial language is context-sensitive. Square pegs and round holes.

These tools are a great example of over-complicating and over-theorising programming languages. We invented a theory that doesn't even apply to the problem we have, and then created crazy tools to solve the wrong problem, and gloss over the fact that they don't apply with terrible hacks and side-effects. We made a mistake with that phase in programming language history...

Interesting. I've always thought that there's no real reason why e.g. Ruby has to be context-sensitive. Yes, I am aware of the fact that local variables are resolved statically, but do they have to be resolved during parsing? Couldn't they just be parsed as a method_call_or_var_lookup, and then resolved later? The main "feature" that Ruby's context-sensitive parser has is the ability to parse an expression differently depending on whether the name that precedes it is a local variable or not, and I'm not sure that's a good feature.

Are there other parts of Ruby's grammars that are also context-sensitive?

I'd say that Bison's biggest issue is that it's limited to LALR; if it actually supported full context-free grammars it would be far more useful.

> Yes, I am aware of the fact that local variables are resolved statically

That's the one that comes to my mind yes. I'd bet that there are others, but couldn't list them to you now.

> do they have to be resolved during parsing?

I guess not in practice?

But why are we doing this to ourselves, right? These parsing tools work great! (As long as you only do half the job you were trying to do and then sort of hack it back together afterward.)

> But why are we doing this to ourselves, right? These parsing tools work great! (As long as you only do half the job you were trying to do and then sort of hack it back together afterward.)

I agree with you in practice: The current parser generators are not as good as I'd like them to be. They are too restrictive and too locked into a "let's parse this subset of grammars mathematically optimally" instead of "let's make it easy for programmers to declare grammars".

(Disclaimer: I started working on a completely new parser compiler https://www.sanity.io/blog/why-we-wrote-yet-another-parser-c... which is capable of parsing full context-free grammars. I personally believe this is part of the puzzle of making them more usable.)

However, since we're talking about Ruby's syntax: I don't think Ruby's parser would be any nicer if it was handwritten. The problem here is that they discovered the power of "lexer tweaks parser state" and let it drive the design of the language. This has caused the language itself to be complicated (both for humans and for computers!) Unfortunately I don't have time for it, but I'd love to dive into the changes required in Ruby's syntax to make it context-free (e.g. pushing the local variable lookup into the bytecode generator).

Thanks I'll read that article.

Is there any grammar for a modern language that is context sensitive and expresses the constraints better than a context free grammar?

It’s true almost no usable language is really CF, but a CF grammar with a few semantic patches still seems to be the better way to describe them.

It’s been a while since I needed to parse a grammar, but if I had to I’d probably use re2c+lemon, Ometa/ohm, or a hand-rolled Pratt parser. (Or an Earley parser if the grammar was ambiguous and it made sense)
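For what it's worth, a hand-rolled Pratt parser for plain arithmetic is only a screenful of code. A sketch in Python, with tokens simplified to single characters and single-digit numbers:

```python
# Minimal Pratt (precedence-climbing) expression parser.
# Tokens are single characters; numbers are single digits, for brevity.
BINDING_POWER = {"+": 10, "-": 10, "*": 20, "/": 20}

def parse(tokens, min_bp=0):
    tok = tokens.pop(0)
    if tok == "(":
        left = parse(tokens, 0)
        assert tokens.pop(0) == ")", "expected closing paren"
    elif tok == "-":
        # prefix minus binds tighter than any binary operator here
        left = ("neg", parse(tokens, 30))
    else:
        left = int(tok)
    # Consume binary operators as long as they bind tighter than min_bp.
    while tokens and tokens[0] in BINDING_POWER:
        op = tokens[0]
        bp = BINDING_POWER[op]
        if bp <= min_bp:
            break
        tokens.pop(0)
        left = (op, left, parse(tokens, bp))  # left-associative
    return left
```

For example, `parse(list("1+2*3"))` yields the tree `("+", 1, ("*", 2, 3))`, with precedence handled purely by the binding-power numbers.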

That's more hyperbole. You're putting a lot of responsibility on some simple tools. Programming language design complexity came first and not as a result of yacc or lex.

> Programming language design complexity came first and not as a result of yacc or lex.

I'm really not sure where you got the impression I was blaming Yacc or Lex for the complexity of the languages - I think you've imagined that.

The situation we're in in practice is that our language grammars are complex and are context-sensitive. So why are we trying to use context-free tools? It's madness - we're trying to hammer a nail using a screwdriver.

I got that impression from this sentence of yours:

> These tools are a great example of over-complicating and over-theorising programming languages.

If programming languages are over-complicated, it's not the fault of these tools.

> So why are we trying to use context-free tools?

I no longer keep up with compilers (or languages), but when I was, there were no other tools. Either go through the tedium and effort of hand-rolling a lexer/parser or munge the language grammar to fit.

I see why you'd think that, but I meant over-complicating the implementation of languages, not their design.

> there were no other tools

I don't think you need tools. Just write your parser as a normal program directly in a general-purpose programming language. I'm not sure that parsing is so special as to require an external DSL.

No tool is ever necessary. But not using yacc is tedious, slow, error-prone, uninteresting, and utterly no fun.

What production languages are context sensitive? C++ surely, Java and C# only when dealing with angle brackets for generics, what else?

Requiring a variable to be defined before use makes a language context-sensitive. So does e.g. typedef in C: is (a)*b multiplying a by b, or typecasting what b points to? Depends on the types.

We usually semantically patch that at the right stage through the lexer or the AST so things still work in a CF setting, but almost no modern language is CF.

Use before declaration is resolved at the type-checker level, not during parsing. C-style casting isn't context-sensitive if you don't mind treating the casting type as an expression and then pulling a type out of it in the type checker (or evaluator, if dynamic).

Type checking is where all the complexity really is. Parsing and lexing are just easy things you do before the hard stuff.

C has a lot of things which are context sensitive, and you can't treat the casting type as an expression:

  c-(a)-b

parses under (double a, b, c;) as ((c-a)-b), but under (typedef double a; double b, c;) it parses as c minus (a)(-b), i.e. with a type conversion applied after the negation, which is a different parse, and cases can get more complicated than that to the point that fixing up the AST is equivalent to reparsing the source.

Use before declaration can only be resolved at the typechecker if you assume every name is a variable, but the introduction of typedefs makes that assumption wrong, so you can't wait for the type checker - every C parser I know patches that information back into the lexer so that it would parse correctly. Indeed, the "typename" keyword was added to C++ because you can't leave that to the typechecker if you want a unique parse before instantiation.
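The "patch it back into the lexer" approach (the classic lexer hack) can be sketched as follows. This is a toy model in Python, not how any particular C front end is organized:

```python
# Toy model of the C "lexer hack": the parser maintains a set of names
# declared via typedef, and the lexer consults it to decide whether an
# identifier should be tokenized as a TYPE_NAME or an ordinary IDENT.
typedef_names = set()

def classify(identifier):
    return "TYPE_NAME" if identifier in typedef_names else "IDENT"

def handle_typedef(name):
    # Called by the parser when it reduces a typedef declaration;
    # from this point on the lexer tokenizes `name` differently.
    typedef_names.add(name)

def kind_of_paren_expr(name):
    # "(a)-b" is a cast of -b if `a` names a type, otherwise it is
    # the parenthesized value `a` minus `b`.
    return "cast" if classify(name) == "TYPE_NAME" else "subtraction"
```

The point of the sketch is the feedback loop: a reduction in the parser (handle_typedef) changes how the lexer classifies later tokens, which is exactly what makes the grammar non-context-free in practice.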

Parsing with type information is not the norm. It was only done in C because no one knew any better back then. Later languages that adopted C-style casting (C#, Java) do not need to know whether something is a type to deal with parens; they look at what follows and evaluate the tree accordingly. That is why in C#:

  var x = (Foo) - foo;
Will be rejected with a type error (must enclose -foo in parentheses because Foo is a type) rather than a parse error.

I'm reminded of another comment from some time ago [0] that recommended 'permissive' parsers for a different reason: you can build a list of syntax errors for the user, which isn't possible if you explode immediately, and you may be able to generate better syntax errors than your parser would have.

[0] https://news.ycombinator.com/item?id=17072670

Many mass-distributed compilers have hand-crafted lexers and parsers since (a) they aren’t hard to write and (b) hand crafting gives you complete control over error recovery.

Still useful little tools for parsing configuration DSLs etc.

Yes that's true - configuration files, network protocols, etc.

How do we usually construct compilers in industry today? I saw others mentioning parser generators -- is that the main difference?

This whole thread has been about parser generators, but that's not really what I meant!

I think the biggest difference is the rich intermediate representations we use these days. These classic texts don't really address intermediate representations and are usually syntax-directed instead.

These days a program will spend almost all of its time in the compiler deconstructed into an IR.

Keep in mind that Wirth is a "simplificator": given any task, he'll try to take the simplest approach (as an example see his talk for his 80th birthday and his remark on the complexity he saw at Xerox PARC which drove him to design Oberon). His languages are even designed so that they are easy to implement. Intermediate representations might be what many compilers use nowadays, but they are not necessary to write a compiler, thus they're out - sacrificed at the altar of simplicity :-).

To be fair, he just adopted the Xerox PARC approach regarding code generation, where bytecode gets generated, with multiple possible execution paths.

The Xerox PARC workstations were microcoded, and one needed to load the microcode on boot for the specific environment (XDE, Smalltalk, Interlisp-D, ...).

Likewise P-Code as a means of portability for Pascal, M-Code on the Lilith, and the Ceres and later the FPGA RISC for Oberon.

I actually started reading Wirth because I was frustrated at how little coverage there seems to be of syntax related matters (lexing and parsing) in modern introductory books (with a couple of exceptions, notably the Dragon Book, but I've been having a hard time wrapping my head around it).

I emailed an author of a book that specifically calls parsing a "distraction" (Shriram Krishnamurthi, in PLAI) and he explained that he and other authors of recent textbooks are reacting to a sort of over-emphasis on parsing in older books and tools (for example, YACC standing for Yet Another Compiler-Compiler despite being simply a parser generator).

So it's interesting what you're saying about IRs -- I've definitely noticed Shriram's book and a few other recent ones seem to focus really heavily on them. I wonder if there's also a relationship between a more IR-focused approach and the reduced emphasis on parsing?

There might be a bit of a backlash yes. But I think for example the Dragon book is comically unbalanced towards parsing. It's just ludicrous how much it drones on about this tiny tiny part of writing a compiler.

I've been working professionally on one compiler continuously since 2013. I've spent like two or three days working on the parser. That's it. I've probably spent longer tweaking the Markdown of the README.

For me the issue is that parsing is basically a solved problem. I don't see how you can make it much better for compilers. If you give me a language I can write you a parser for it. It won't be interesting, fun, or be better or worse than any other parser. It's a purely mechanical job. (Of course it could be a nicer parser, or a faster parser, or give better error messages, or work better for IDEs or tooling, but stick with me.)

But the transformations that happen after that point? Well there a whole world opens up to you! You can do so much, in so many different ways, and achieve incredibly different results for different requirements.

The Tiger books are relatively alright, with ML, Java and C flavours.

They have a couple of typos, and one can feel that the ML one was the first variant, but they are relatively good, covering a lot, especially if one does the exercises.

There is a 2nd edition for the Java one, where the tiger language gets replaced by a Java subset.

Which are some books focused on IRs you’d recommend?

“Engineering a compiler” deals with IR

I think the main difference is that in today’s compilers the parser is such a small part that whether you use a parser generator or roll your own recursive descent parser is mostly just a footnote.

But a lot can still be learned by a hobbyist by doing so.

Hmmm... I'm not as sure. I wonder if we could do people a service by showing them how it's done today from the start, rather than showing them yesterday's stuffy, ponderous, inelegant approach (sorry for the opinion) and then having to roll that back when they start working on something in practice.

How would you show a novice how it's done today?

I've always wanted to learn about compilers, but something tells me the dragon book is no longer the go-to source.

It's actually a real problem - the knowledge of how for example JavaScript VMs and JITs work isn't really written down very well. It's all in people's heads and you learn as an apprentice (PhD student). I don't have a better solution than to show you in person.

One of the issues, I think, is that compilers has really become a topic that is way too broad to fit into one book. You'd need at least:

* Foundations: the frontend, codegen [to IR] (including vtables, lambdas, and coroutines, probably), the big scalar optimizations, register allocation, maybe instruction scheduling?

* Loop optimizations: dependence analysis, all the loop opts, autovectorization, autoparallelization. Maybe cover some OpenMP or heterogeneous computing (e.g., SYCL, CUDA) here

* "Fancy" optimizations: profiling (including what & how to profile), link-time (including thin-lto techniques), SLP vectorization, possibly superoptimization and dataflow analysis

* Dynamic language VMs: JITs, garbage collection, shape optimization, and several of the various tricks (e.g., NaN-boxing, deoptimization guards)

* You might be able to make an entire book out of fuzzing and symbolic/concolic execution, but I think it'd be thinner than the other ones here.

And there are still things I don't know where to put. Alias analysis, exception handling, debugging…
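Of the dynamic-VM tricks listed, NaN-boxing is small enough to sketch here. A 64-bit IEEE double has ample unused payload bits in a quiet NaN, enough to smuggle a tagged 32-bit value through. The tag layout below is made up for illustration; real VMs also canonicalize genuine NaNs so they cannot collide with boxed values:

```python
import struct

QNAN = 0x7ff8000000000000      # quiet-NaN bit pattern: exponent all ones, quiet bit set
TAG_INT = 0x0001000000000000   # hypothetical tag bit marking "boxed 32-bit int"

def box_int(i):
    # Store a 32-bit integer in the low payload bits of a quiet NaN.
    return QNAN | TAG_INT | (i & 0xFFFFFFFF)

def is_boxed_int(bits):
    return (bits & (QNAN | TAG_INT)) == (QNAN | TAG_INT)

def unbox_int(bits):
    return bits & 0xFFFFFFFF

def box_double(d):
    # Ordinary doubles are stored as their own bit pattern, untouched.
    return struct.unpack('<Q', struct.pack('<d', d))[0]

def unbox_double(bits):
    return struct.unpack('<d', struct.pack('<Q', bits))[0]
```

Every value in the VM is then a single 64-bit word: real doubles pass through unchanged, while ints (and, with more tag bits, pointers and booleans) hide inside NaN space.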

Just discovered the book The School of Niklaus Wirth: The Art of Simplicity. I would like to know: is it worth the time investment?

