Maintaining state and mutating state is useful and neatly done in OO. That is, when done right, which is hard. I certainly agree that for processing data, good functional languages reign supreme. They are usually terser while being more expressive.
It seems that different problems need their own tools. Compilers typically fit a data processing pipeline. Yet a use case such as editing an AST (e.g. a text editor showing incomplete syntax state) could probably make better use of OO as a tool.
Most current software then simply uses the wrong tool for the job: it does data processing with OO, and state representation and mutation with pure functional idioms. A problem that is not yet solved very well in the industry is the interplay between these idioms. .NET with C# and F# comes to mind, which allows for workflows that easily leverage both OO and functional styles.
I have yet to see the "best" design for the stateful parts of systems. IMO FP gives you better foundations: start stateless, then see where you can group things into private mutable memory. Not that I know how to do it, but at least I can reason about it instead of invoking innate black art or long experience.
> Maintaining state and mutating state is useful and neatly done in OO
I started laughing when I read "neatly done". I assume you don't mind long-range state-mutating effects making your code a lot more complex to reason about than it needs to be. Or the "grab a banana and you grab the whole planet" dependency issues OO almost inevitably introduces. Or ... (I could go on). OO with mutable state is anti-modular.
> I think we somehow took a wrong path when OO got embraced as wholly as it did back in the 90s or so.
I think the particular paradigm that was adopted by C++, Java, etc. was a bad one. Having now worked quite a lot in a language with multiple dispatch (Julia), I find it really cumbersome to work in class-based single dispatch languages.
Yep, because it forces you into awkward hierarchies that most people get wrong. And even if you get it right, it tends to be awkward a lot of the time.
I was always a big fan of CLOS (of Common Lisp) and Dylan (poor Dylan).
By the way, the spiritual successor to Dylan (IMHO) Stanza http://lbstanza.org/ doesn't seem to get much love.
Virtual dispatch in C++ and Java is dynamic dispatch, but it is singly so rather than multiply so as found in CLOS, Dylan, Julia, etc. If you want multiple dispatch in C++ and its kin, you need to jump through hoops with visitor patterns and such.
Not quite. Multiple dispatch (as mentioned in sibling comments) is basically 'virtual on both object and argument types', so like:
  class Car;
  class Truck;

  class Vehicle {
  public:
    virtual void Collide(Vehicle& other) { /* generic collision */ }
  };

  class Car : public Vehicle {
  public:
    void Collide(Truck& other) { /* hit a truck */ }
    void Collide(Car& other)   { /* hit another car */ }
  };

  class Truck : public Vehicle {
  public:
    void Collide(Truck& other) { /* hit another truck */ }
    void Collide(Car& other)   { /* hit a car */ }
  };

  int main() {
    Car car;
    Truck truck;
    Vehicle& c = car;
    Vehicle& t = truck;
    c.Collide(t); // with multiple dispatch, this would call Car::Collide(Truck&)
  }
You are using static overloading on the argument rather than dynamic dispatch. If the static type of the truck is obscured to a superclass, your behavior won't be as expected. In general, overloading shouldn't be used as a replacement for dynamic dispatch; rather, it should be used to accommodate multiple incompatible types (e.g. foo(X) and foo(Y) but never foo(Z) if X <: Z and Y <: Z, unless foo(Z) maintains the behavior of foo(X) and foo(Y) given an x or a y under subsumption).
Multiple dispatch is typically meant to be purely dynamic in meaning, as wiki puts it:
> Multiple dispatch or multimethods is a feature of some programming languages in which a function or method can be dynamically dispatched based on the run-time (dynamic) type or, in the more general case some other attribute, of more than one of its arguments.[1]
The static type of both c and t is Vehicle, but by dynamic multiple dispatch Car::Collide(Truck) is selected. That's not how C++ works, but the example was exactly about illustrating how C++ doesn't work.
Multiple dispatch is overriding virtual member functions on argument types, not just receiver type.
Overloading and overriding are duals; the former is statically determined, the latter dynamically determined. Just like you can overload on the full argument list, multiple dispatch lets you override on the full argument list. This lets you do things like implement the visitor pattern without double dispatch. And C++ doesn't support it without libraries.
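For contrast, here is roughly the hoop-jumping that single dispatch forces on you: a toy double-dispatch (visitor-style) sketch of my own, reusing the vehicle names from the example above. The first virtual call resolves the receiver's dynamic type, and a second virtual call resolves the argument's.

  #include <iostream>

  class Car;
  class Truck;

  class Vehicle {
  public:
    virtual ~Vehicle() = default;
    // First dispatch: resolves the dynamic type of the receiver.
    virtual void Collide(Vehicle& other) = 0;
    // Second dispatch: resolves the dynamic type of the argument.
    virtual void CollideWith(Car& other) = 0;
    virtual void CollideWith(Truck& other) = 0;
  };

  class Car : public Vehicle {
  public:
    void Collide(Vehicle& other) override { other.CollideWith(*this); }
    void CollideWith(Car& other) override { std::cout << "car hits car\n"; }
    void CollideWith(Truck& other) override { std::cout << "truck hits car\n"; }
  };

  class Truck : public Vehicle {
  public:
    void Collide(Vehicle& other) override { other.CollideWith(*this); }
    void CollideWith(Car& other) override { std::cout << "car hits truck\n"; }
    void CollideWith(Truck& other) override { std::cout << "truck hits truck\n"; }
  };

  int main() {
    Car car;
    Truck truck;
    Vehicle& c = car;
    Vehicle& t = truck;
    c.Collide(t); // prints "car hits truck": both dynamic types participated
  }

With real multiple dispatch the second round-trip disappears; the language picks the most specific method from the runtime types of all the arguments.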
Unless it's via some new-fangled C++ feature I haven't heard of yet, I'm pretty sure C++ doesn't have multiple dispatch. Can you elaborate on what you're referring to?
Similarly, I like Python and C++’s emphasis on duck-typing. I use STL containers often, and provide similar interfaces to my custom data structures for flexible, general, fast code.
Ah gotcha. Yes, in the current situation we can do nearly everything that the concepts proposal for C++20 does with enable_if, void_t, and a few other helpers. Though concepts have the benefit of letting you "say what you mean", instead of relying on template magic.
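For illustration, a toy example of my own (not from the proposal itself) showing the same constraint written both ways: a "has a size()" requirement detected with void_t/enable_if, and then as a C++20 concept that says what it means.

  #include <cstddef>
  #include <type_traits>
  #include <utility>
  #include <vector>

  // Pre-C++20: detect "has a .size()" with the void_t idiom.
  template <typename T, typename = void>
  struct has_size : std::false_type {};

  template <typename T>
  struct has_size<T, std::void_t<decltype(std::declval<T>().size())>>
      : std::true_type {};

  template <typename T>
  std::enable_if_t<has_size<T>::value, std::size_t>
  element_count(const T& c) { return c.size(); }

  // C++20: the same constraint, stated directly.
  template <typename T>
  concept Sized = requires(const T& c) { c.size(); };

  std::size_t element_count2(const Sized auto& c) { return c.size(); }

  int main() {
    std::vector<int> v{1, 2, 3};
    return static_cast<int>(element_count(v) + element_count2(v)); // 3 + 3
  }

Both versions reject types without size() at compile time; the concept version just states the requirement directly and tends to give far more readable errors.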
This should be more than possible with Python's type annotations, when done right. E.g. 'Iterable[str]' can accept a generator, a list, a tuple, any class with the right __iter__, and of course a string itself.
This is what I like about Rust and Rust's traits. You write a generic and specify the properties of the generic type parameter, for example that it must implement another trait or something.
> That's why I like Clojure's philosophy of a few data structures and many functions operating on them.
Which in practice means that a few data structures that are not perfect for solving a specific problem are forced (with repeated hammer blows) to bend to the problem. Like the old-style Lisp alists used as maps. Probably the worst possible structure for representing a map. But yeah, it kinda works as long as you don't have much data.
The association list is a poor data structure for representing a map if the most important requirement is to have fast access to any key of a large map.
However, the association list is an excellent data structure if an important requirement is to be able to cheaply extend a map with new keys functionally: without altering the original. It also satisfies the requirement that duplicate keys added to the front shadow existing ones. It is also very simple; it is easy to verify that its operations are correct.
With the above properties, association lists directly provide an easy and correct implementation model for lexical environments.
The empty top-level lexical environment is an empty list. Adding a frame with new variables, including ones that shadow existing variables, is just consing pairs onto the list. The assoc operation correctly implements lexical variable lookup. The environment for making a lexical closure is just a copy of the assoc list pointer that you have now. Because the bindings are independent pair objects in the association list, individual bindings can be located and selected, and combined into another association list. A newly created association list environment can trivially capture an arbitrary subset of bindings from another one without mutating it.
Even if we don't use this representation in production, it provides an important reference model. Anything more complicated has to produce the same results; we can start with assoc-based environments, build a lot of code, and then use that as regression test cases when moving to a different environment representation.
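As a rough sketch of that reference model (my own illustration, with an int standing in for an arbitrary value): each binding is an independent cons-like node, extending an environment conses onto the front, and lookup walks the chain so newer bindings shadow older ones without the original environment ever being touched.

  #include <iostream>
  #include <memory>
  #include <optional>
  #include <string>

  struct Binding {
    std::string name;
    int value;                           // stand-in for "any object"
    std::shared_ptr<const Binding> rest; // the enclosing environment
  };

  using Env = std::shared_ptr<const Binding>; // empty Env = empty alist

  // Extending never mutates the original environment; it just conses a node.
  Env extend(Env env, std::string name, int value) {
    return Env(new Binding{std::move(name), value, std::move(env)});
  }

  // assoc-style lookup: the first match wins, so new bindings shadow old ones.
  std::optional<int> lookup(Env env, const std::string& name) {
    for (; env; env = env->rest)
      if (env->name == name) return env->value;
    return std::nullopt;
  }

  int main() {
    Env global;                                        // empty top level
    Env outer = extend(global, "x", 1);
    Env inner = extend(extend(outer, "y", 2), "x", 3); // shadows x

    std::cout << *lookup(inner, "x") << "\n";            // 3 (shadowing binding)
    std::cout << *lookup(outer, "x") << "\n";            // 1 (original untouched)
    std::cout << lookup(outer, "y").has_value() << "\n"; // 0: y not in scope here
  }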
It's good enough for pattern matching and logic languages, where the number of variables isn't large and the search already has to do things that are complicated enough without a heavyweight environment structure.
> alists used as maps. Probably the worst possible structure for representing a map.
They're more flexible than hash tables and often faster; what makes them so awful? Both have trade-offs that make them better or worse in different contexts, depending on what you're optimising for.
Under the right circumstances, a lookup table could be worse. In a simple interpreter, I would hate to pass down a vector of pairs for the environment representation; it would have to be wholly duplicated to extend the environment.
I remember my undergrad classes where the various types of software engineering diagrams were presented. There were some very informative, useful data-transformation diagrams that were just glossed over in order to reach the useless UML shit.
I took a very important lesson from those classes, but not the one the professor was trying to explain.
If you want to avoid OO you'll also want to avoid treating functions as first-class values. As soon as you have structs that contain functions, you can build OO.
OO is a way of designing programs. You can avoid it in most languages if you want to and you can do it in most languages if you want to. The language only determines the amount of boilerplate you need.
Folks, this is Chapter 3 of Structure and Interpretation of Computer Programs. Technically you don't even need a struct; closures and mutability will do.
And how do you serialize data that has function pointers in it?
You're not downvoted because people ignore that fact. You are downvoted because people disagree that it's important.
Yes, you can create objects with closures, but when people talk against OO programming, they mean the paradigm, not merely the possibility that a language can do objects. Forth and C can do object oriented programming just fine too, if one is so inclined.
I am almost positive Rich Hickey hasn't read that book. The only thing more stunning than his ignorance when he talks about OO is the amount of young programmers who believe everything he says.
It blew my mind when I realised I could write objects from scratch in languages with closures, even statically typed ones! Of course each object has its own v-table in the SICP code, so probably not something you should use in production.
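A tiny sketch of the same idea in C++ terms (my own toy example, not the SICP code): the "object" is a closure over mutable captured state, and the returned function is its only method.

  #include <functional>
  #include <iostream>

  // An "object" built from nothing but a closure: the captured variable
  // `balance` is the private state, and the returned lambda is the method.
  std::function<int(int)> make_account(int balance) {
    return [balance](int amount) mutable {
      balance -= amount; // mutation hidden behind the closure
      return balance;
    };
  }

  int main() {
    auto acc1 = make_account(100);
    auto acc2 = make_account(100);
    std::cout << acc1(30) << "\n"; // 70
    std::cout << acc1(30) << "\n"; // 40 -- acc1 remembers its own state
    std::cout << acc2(10) << "\n"; // 90 -- acc2 is an independent "object"
  }

Dispatching on a message name (the full SICP construction) is just one more closure that returns different functions for different messages.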
For my main compiler project I’ve actually been moving away from this “pipeline” style because it’s inflexible and not great for incremental operations.
Inspired by the Forth dictionary, gradually I’ve arrived at a design where the compiler behaves like a database server—it takes requests in the form of program fragments and produces responses in whatever form you want by querying the database and generating intermediate results as needed. The pipeline is still there, it’s just not hard-coded, and modelled in a more relational way. (I plan on doing a detailed blog post on this after I land the code.)
Ordinary compilation? Push a bunch of files and pull the generated code of anything referenced transitively from the program entry point or library exports. Documentation or TAGS generation? Pull the appropriate metadata. Syntax highlighting? Pull the lexical classes and source spans of the tokens—no parsing needs to happen. In general, if you want an intermediate result, you can just ask for it—there’s no hard-coded code path for “run the pipeline up to this point and bail”.
The advantage is that this is amenable to producing partial results even when there are program errors—a lexical error doesn’t prevent parsing, it just produces an “error” token that can be displayed in an editor without compromising higher-level features like syntax highlighting, autocomplete, navigation, &c.
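I can only guess at the internals, of course, but a minimal sketch of the general shape (push program fragments in, pull whatever artifact you want out, memoize the intermediates) might look something like this, with all names hypothetical and the lexer and "AST" reduced to toys:

  #include <iostream>
  #include <map>
  #include <string>
  #include <vector>

  class CompilerDb {
  public:
    void push_file(std::string path, std::string text) {
      source_[std::move(path)] = std::move(text);
    }

    // Pull the token list for a file; computed at most once per file.
    const std::vector<std::string>& tokens(const std::string& path) {
      return memo(tokens_, path, [&] { return lex(source_.at(path)); });
    }

    // Pull a (fake) AST summary; pulls tokens on demand, reusing the cache.
    const std::string& ast(const std::string& path) {
      return memo(ast_, path, [&] {
        return "(program with " + std::to_string(tokens(path).size()) + " tokens)";
      });
    }

  private:
    template <typename Map, typename Fn>
    const typename Map::mapped_type& memo(Map& cache, const std::string& key, Fn compute) {
      auto it = cache.find(key);
      if (it == cache.end()) it = cache.emplace(key, compute()).first;
      return it->second;
    }

    static std::vector<std::string> lex(const std::string& text) {
      std::vector<std::string> out;
      std::string cur;
      for (char c : text) {
        if (c == ' ' || c == '\n') { if (!cur.empty()) out.push_back(cur); cur.clear(); }
        else cur += c;
      }
      if (!cur.empty()) out.push_back(cur);
      return out;
    }

    std::map<std::string, std::string> source_;
    std::map<std::string, std::vector<std::string>> tokens_;
    std::map<std::string, std::string> ast_;
  };

  int main() {
    CompilerDb db;
    db.push_file("main.src", "let x = 1 + 2");
    std::cout << db.tokens("main.src").size() << "\n"; // highlighting: tokens only
    std::cout << db.ast("main.src") << "\n";           // deeper query reuses cached tokens
  }

Incrementality then falls out of invalidating cache entries whose inputs changed, rather than re-running a fixed pipeline.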
Many pipeline approaches include error tolerance. I specialize in incremental lexing and parsing for IDEs, and even have experience with some state-based error recovery (so, for example, unbalanced braces remember previous brace associations, or a bad transient parse error doesn't destroy the good trees from the previous run). I don't see how a database approach would help; you just need to incrementalize the structures in your pipeline (or have reproducible persistent markers like Eclipse's elastic positions).
You can make your pipeline mostly oblivious to the incremental nature of each stage, which has some SE benefits. However, I like to run all my phases as damage/repair work lists after the initial incremental lex stage: token changes dirty the trees associated with them, putting them on a list for releasing, which can then cause re-typechecking of those nodes, putting those on a work list, and so on; the parsing worklist is drained before the type-checking one, which is drained before the execution (for incremental live programming) one.
I’ve never seen two IDE-oriented incremental compilers use the same approach actually, there is a lot of diversity here.
That’s an interesting approach. Are you generally working in imperative languages? Last I recall you were using…Scala?
I’m new to it, but it seems like there’s a lot of innovation quietly happening in this space.
My approach is more like a declarative build system, I guess. To add TAGS support, for example, you add an output that can generate a TAGS file, and declare that it depends on a resolved AST (one with fully qualified names). Then you can make queries like “pull tags from {this set of files, the whole database, everything in this namespace, &c.}”. Dependencies are resolved, and errors reported, mostly automatically unless you want to change the default behaviours.
If you want to regenerate tags after changing some input, then as an optimisation you can compute only the diff resulting from whatever dependency was changed upstream (usually a file, ultimately) at the memory cost of caching intermediates (which right now means leaving the server running).
It's a good article. Indeed, compiling a programming language is such a complex job, and it is frankly too quickly dismissed and taken for granted. My view is that learning and writing a simple compiler will make most developers a notch better. At the very least, you'll understand why parsing HTML or XML with regex is such a bad idea :)
> My view is that learning and writing a simple compiler will make most developers a notch better.
Agreed. One of the interesting insights I got while building my own compiler was that despite what every compiler book says, there are languages that can be compiled on the fly without building an AST. I'd say for most languages it's an unnecessary step that makes compilation slower and more resource-hungry. However, compiling without an AST is a next-level skill, another notch I think.
> One of the interesting insights I got while building my own compiler was that despite what every compiler book says, there are languages that can be compiled on the fly without building an AST.
PL (but not compiler) researcher here. What you're saying sounds to me like "One of the interesting insights I got while programming my own game was that despite what every game programming book says, there are games that can be programmed without defining any functions." I mean, yes, you can do that, and it may in fact be more efficient, but it's going to cause trouble down the line if you ever want to do something more complex.
> What you're saying sounds to me like "One of the interesting insights I got while programming my own game was that despite what every game programming book says, there are games that can be programmed without defining any functions."
Ex-game programmer and current PL programmer here. I think a better analogy is that you can make a game without having a scene graph. And, indeed, you can. And it works fine for many simple games, though at some scale it starts to be a hindrance.
Yeah, the grandparent is only true if you don't want to do interesting things with the language, like nearly anything beyond the most basic peephole optimization.
But a fun fact, if you know about single-pass compiling, you can figure out a lot of why the original C syntax is what it is. A declaration is a "reserve some memory" statement. Local variables had to be declared at the beginning of scope so that the compiler could deallocate them at scope close. Without an AST, the compiler had to produce instructions immediately.
With a true AST, you don't need such a restrictive syntax.
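As a toy sketch of the single-pass idea (my own illustration, not any particular compiler): a recursive-descent parser for "1+2+3"-style additions that emits stack-machine instructions the moment it has consumed enough tokens, with no tree in between.

  #include <cctype>
  #include <cstddef>
  #include <iostream>
  #include <string>

  // Toy single-pass "compiler": parse and emit in the same breath,
  // never building an AST.
  class OnePass {
  public:
    explicit OnePass(std::string src) : src_(std::move(src)) {}

    void compile() {          // expr := number ('+' number)*
      number();
      while (peek() == '+') {
        ++pos_;
        number();
        std::cout << "ADD\n"; // emitted as soon as both operands are pushed
      }
    }

  private:
    void number() {
      std::string digits;
      while (std::isdigit(static_cast<unsigned char>(peek()))) digits += src_[pos_++];
      std::cout << "PUSH " << digits << "\n";
    }
    char peek() const { return pos_ < src_.size() ? src_[pos_] : '\0'; }

    std::string src_;
    std::size_t pos_ = 0;
  };

  int main() {
    OnePass("1+2+3").compile(); // PUSH 1, PUSH 2, ADD, PUSH 3, ADD
  }

Anything that needs to look ahead or reorder code (most non-peephole optimizations) is exactly what this style can't easily do, which is the trade-off discussed above.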
Yes, there's a reason compiler books start with parsing and building an AST.
But I think studying Lua [1] or the Wren codebase [2] is still worthwhile, provided you keep in mind that they're relatively simple languages that are designed to be both fast and embedded.
No, it's not the same thing. A compiler without the AST phase is itself more compact and arguably more beautiful, but again, not every language is compilable this way (C++ being a notable example).
My experience has been that you can get incredibly far (much farther than you'd think) without any form of AST, just iterating over the raw parse tree and tokens. But the problem you eventually encounter is that, if you want to support all the different cases in a "real", messy, evolved language like C, you get a combinatorial explosion when you try to cover all the corner cases of the language. A great example of this would be trying to write a function that iterates over the tokens that make up a struct definition in C:
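For instance, a quick and non-exhaustive sample (my own, all names made up) of the shapes that "something introducing a struct" can take, each of which a token-walking function has to recognize:

  struct Point { int x, y; };                       // plain definition
  struct Point p1;                                  // use of a previously defined tag
  struct { int x, y; } p2;                          // anonymous struct plus variable
  typedef struct Point PointAlias;                  // typedef of an existing struct
  typedef struct Node { struct Node *next; } Node;  // typedef + definition + self-reference
  struct Rect { struct Point min, max; } r1, *r2;   // definition plus declarator list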
This is the reason for needing the extra 'abstract' syntax tree layer that lets you generalize over all those different cases and avoid having 20-level nested if statements in your compiler.
For example: once you see what the compiler does, you can write readable code without sacrificing efficiency.
When you don't really know what's going on under the hood, consciously or unconsciously you write code based on assumptions, which a lot of the time are wrong. It's very hard to give concrete examples because the mental models programmers have about programming can be very different, and therefore the better understanding gained by learning about compilers can have different effects on different people. And still, everyone who has tried recommends learning about compilers.
Programming languages are our most fundamental tool as programmers, and yet it's pretty much impossible to make a fair assessment of why they are designed the way they are, why they work the way they do, why they provide the abstractions and structures they provide, until you try for yourself. Understand programming languages better => use them more naturally.
Compilers are a great example of taking a very complicated task and breaking it down into simpler (though not necessarily simple) steps. Just going through the practice of building one might change how you look at other projects you've worked on in the past.
And I think the insights you take from designing a compiler translate into data processing as a whole, as other commenters have noted.
You get an insight into why current languages and compilers are designed the way they are. For example, why does C have #include? Probably because it's the easiest possible way of including/referencing source code from another file.
The dragon book is not recommended. It is very out of date. I write (among other things) high performance compilers for a living. And the Dragon book is not useful at all.
Sunday before last (twelve days ago) I started working on a VM, assembler/disassembler and compiler for TXR Lisp, just early morning and evening spare time.
Assembler and VM handle lexical closures, exception handling, unwind-protect, function calls (of course) and global variable access, and dynamic environments (special variables). Compiler handles a significant number of important special forms: let, lambda, function calls, assignment and others. lambda handles optional parameters with default values, and any parameter can be a special variable, correctly saved and restored, and excluded from the lexical environment.
VM closures are hooked into the rest of the environment; they are represented as a new kind of function and can be used in the normal ways.
Compiling Lisp is in many respects easier than interpretation. A huge simplifying factor is that in the compiler, you can mutate an environment object as you compile: that is to say, add a binding, compile something, add a binding, compile something ... The compiled something, even if it contains a lexical closure, does not see any of the bindings you didn't add yet. Can't do that in the interpreter; not with the same environment representation. If you interpret something which captures a closure and then extend the captured environment, you've screwed up.
I have to grapple with a couple of messy special forms. That's how it goes when you've been interpreting for years and now you're introducing a compiler. One form is the one which underlies the evaluation of the string quasiliteral syntax. Unfortunately, the special form is deeply interpretive; it passes the environment into some functions which walk the syntax of the quasiliteral doing eval here and there, which come up with the pieces of strings that catenate together to form the output. I'm unraveling that stuff for compiling; coming up with a way to hoist the evals out into static code, and some function API that can then just work with the results of the evaluation, yet maintain compatibility with the interpreted semantics.
The annoying thing is that just the partially complete transformation code for this crud is 60 lines of Lisp so far, whereas the entire compiler (so far), which accomplishes so much more, is only 378. One is a big percentage of the other, in other words. It's an annoying thing in programming that sometimes code which is very specialized and achieves little in the "big picture" burns more LOCs than you'd think it deserves.
One thing I tried to do in dfsch is to use the same IR format (in the sense of being interpretable by the interpreter) across all the stages. What I found out is that this is surprisingly hard to do in a Lisp-1 language while preserving the semantics, to the extent that I'm somewhat abandoning the language and thinking about making it a CL-style, fully lexically scoped (i.e. with support for labels and similar special forms) Lisp-2.
(CPS-style IR does not make much sense for dfsch, as it explicitly and intentionally does not support call/cc, as the main use-case is easy and straightforward C interoperability)
Edit: dfsch distinguishes mutable pairs, immutable pairs (which are somewhat redundant and in fact consume one combination of tag bits that could be used for something more useful, like strings), and CDR-coded lists, which are always immutable. CDR-coded lists have an additional slot that records where they came from, for things that were either parsed from source or are the result of macro-expansion, which in my opinion meets the intention of R5RS. On the other hand, the very aggressive constant folding done by the macro-expansion stage probably does not.
I find this interesting from the perspective of the design mentality. I've read so many comments here from people starting with "I am a PL researcher" or "I am a compiler designer", often with the qualification "but not the other one"; it seems like, unless you are doing a Forth-style language, you're almost never both...?
My interpretation of the author's main points is: breaking compilation down into a pipeline of blocks has obvious advantages, but much like when we break any large implementation problem down into blocks, we can lose sight of the whole. In short, it's important to maintain a holistic view while simultaneously making decisions about where to "cut" your problem. But that holistic view seems to me to mostly end at the intersection between compiler and language.
When I started my job it involved a lot of one-way design implementation. As the designs became more adventurous I gradually found myself questioning them more, until now, where I see basically no division between design and implementation: when a "design" for something comes my way I basically pull it to pieces, interrogate the author to understand all of the decisions behind it, and then rebuild it (with the author) with implementation and practicality in mind. It always benefits everyone involved. I feel like the division I tore down there is quite similar to this division between compiler and language, but I am mostly an outsider to that world, so I could be wrong.
Compiler design is somewhat unusual just because there are so many different cases to deal with. An AST might have 20 different kinds of expressions, and anything traversing it needs to deal with them all.
This means that whatever solution you come up with needs to scale to lots of cases, more than you can comfortably remember. And for a language that's not stable, you need to update all the tools when the AST changes, to handle the new construct.
(Lisp isn't immune; it calls them special forms and they're uniformly represented as lists, but you still have to deal with them for any kind of deep analysis.)
You can traverse the AST via internal functions (in OO terminology, virtual methods, which make the case match on the receiver implicit). It really isn't much of a problem if the functions are typecheck, codegen, execute, etc.
I don't really get the appeal of using giant mega case matches in a compiler.
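A compressed sketch of what that looks like (my own toy AST, two node kinds instead of twenty): the behaviour lives on the nodes as virtual methods, so there is no giant case match per operation.

  #include <iostream>
  #include <memory>

  struct Expr {
    virtual ~Expr() = default;
    virtual int execute() const = 0;                    // the "interpret" pass
    virtual void codegen(std::ostream& out) const = 0;  // the "compile" pass
  };

  struct Lit : Expr {
    int value;
    explicit Lit(int v) : value(v) {}
    int execute() const override { return value; }
    void codegen(std::ostream& out) const override { out << "PUSH " << value << "\n"; }
  };

  struct Add : Expr {
    std::unique_ptr<Expr> lhs, rhs;
    Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : lhs(std::move(l)), rhs(std::move(r)) {}
    int execute() const override { return lhs->execute() + rhs->execute(); }
    void codegen(std::ostream& out) const override {
      lhs->codegen(out); rhs->codegen(out); out << "ADD\n";
    }
  };

  int main() {
    Add e(std::make_unique<Lit>(1), std::make_unique<Lit>(2));
    e.codegen(std::cout);             // PUSH 1 / PUSH 2 / ADD
    std::cout << e.execute() << "\n"; // 3
  }

Adding a new node kind touches one class; the flip side (the Expression Problem mentioned below) is that adding a new operation touches every class.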
AST transformations are an exemplary instance of the Expression Problem. Solutions in mainstream languages, e.g. the Polyglot framework for Java, lead to ugly, overengineered code. Writing AST transformations with multimethods might be nice.
I think the OO part is a bit ranty, but other than that it's a nice article: software engineers can really learn a lot from compiler design. I never regretted taking that lecture (we excluded C typedefs and some other meanies). If you're bored and want to learn, write an LLVM frontend, and maybe a simple optimization (e.g. constant propagation or dead code elimination).
There is a tutorial for a functional language in the LLVM docs, or just do a subset of C, or your own DSL ;)
This all assumes that you want most code to have the properties of compilers, but that seems misguided to me. Compilers don't handle complexity well; they're unreliable and not stable against small random perturbations; naïve approaches tend to be inefficient and complex ones global; etc. A lot of this is just that they're solving a difficult task (converting terrible C code into less terrible assembly), admittedly.
Almost every language is less thorny than C because of C's context-sensitivity and the preprocessor. Pascal is a classic language that's useful but easy to compile. Any of Wirth's successor languages are too (occam, etc.). SML is a step up, but still not intractable. Java 1.0 might be fun.
The hard part of writing a compiler is not the lexing, parsing, type checking, or code generation. The hard part is the optimizing bit.
There do exist languages that are more amenable to whole-program analysis, but the things that cause problems are usually not obvious. Java, weirdly, is among the easier languages to analyze because its memory model happens to be very easy to work with (excluding nulls).
Language design goes hand in hand with compiler design. No wonder we have so many languages being born every year: a different mix of ideas being tried in each of them, or forgotten lessons being rediscovered.
Just taking a look at what D, Rust, Lisp and Scala propose will give one the chance to ask crucial questions about the implications of each guarantee, the pros and cons, and how it all works out.
I assert that scannerless parsing is a bad idea, for reasons of efficiency and code complexity. He's making some argument around generated parsers which I don't understand.
I believe he's conflating the issue of whether you generate your parser with whether you have a separate parser and lexer. (The OSH parser is hand-written, while the lexer is generated. This is actually the opposite of what you see in many languages -- e.g. in Python the parser is generated, while the lexer is hand-written.)
There's no reason that you can't produce good error messages with a separate parser and lexer. In fact I believe you'll get better error messages. I use a "lossless syntax tree" representation which allows good error messages:
(On the one hand, it's like a parse tree, but less verbose. On the other, it's like an AST, but it doesn't lose information.)
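I can only guess at the exact representation, but a minimal sketch of the general "lossless" idea (field names hypothetical): every token keeps its source span and surrounding trivia, so the original text can be reconstructed byte for byte and error messages can point at exact spans.

  #include <cstddef>
  #include <iostream>
  #include <string>
  #include <vector>

  // Hypothetical "lossless" token: kind is what the parser consumes; the span
  // and trivia preserve everything needed to reproduce the source exactly.
  struct Token {
    std::string kind;           // e.g. "Word", "Comment"
    std::size_t start, end;     // byte offsets into the original source
    std::string leading_trivia; // whitespace (or comments) before the token
  };

  std::string reconstruct(const std::string& source, const std::vector<Token>& tokens) {
    std::string out;
    for (const Token& t : tokens) {
      out += t.leading_trivia;
      out += source.substr(t.start, t.end - t.start);
    }
    return out;
  }

  int main() {
    std::string src = "echo  hi  # comment";
    std::vector<Token> toks = {
      {"Word", 0, 4, ""},        // "echo"
      {"Word", 6, 8, "  "},      // "hi", preceded by two spaces
      {"Comment", 10, 19, "  "}, // "# comment"
    };
    std::cout << (reconstruct(src, toks) == src) << "\n"; // 1: nothing was lost
  }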
----
Regarding intermediate representations, I had a lot of success with Zephyr ASDL, which is a DSL for using algebraic data types in procedural/OOP languages:
I also don't really see the strong contrast between OOP and FP he's making. Oil uses both styles -- the IR is at the center, using a functional style (data without methods), but the other components are objects/modules/interfaces/encapsulated state/data hiding.
IMO OOP and FP are the same thing when done right: they're about being principled about state. The way I see it, a pure dependency injection style is pretty much equivalent to functional programming.
ASDL has some relation to "Stanford University Intermediate Format", which sounds like it could have been a precursor to LLVM IR. In other words it happens to be used in interpreters right now, but it was designed for compilers.
The use of middleware seems to embrace this concept too for very flexible systems. Just pass it along the assembly line and you don't care who's next in line or who was before you.
It's there. E.g. ANSI Lisp has the concept of a token. Reading multiple objects until a closing ) via read-delimited-list is like parsing. The actions which accumulate token constituent characters into a token and produce some object from it, like a number or symbol, are like lexing.
Of course Alan Perlis wrote "It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures." in 1982.
But pipelining outputs into inputs (you'll see it a lot in F# too) is just a simple, flexible concept.
As Rich Hickey once said, "We're data processors".
I think we somehow took a wrong path when OO got embraced as wholly as it did back in the 90s or so.