
Oops, I Wrote a C++ Compiler - mpalme
https://praeclarum.org/2018/08/27/oops-i-wrote-a-c-compiler.html
======
chubot
_The process of creating the parser went very smoothly thanks to this. The
real work involves creating C# syntax classes that mirror the grammar. These
classes form the Abstract Syntax Tree (AST) of the compiler._

To address this problem, I use Zephyr ASDL to describe the abstract syntax of
a language (which CPython also uses!)

It's a little DSL that lets you write an ML-style algebraic data type and
generates a bunch of code for you. In CPython, you go from a ~100-line ASDL
schema to ~10,000 lines of C! It underlies the 'ast' module in the stdlib.
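For a flavor of what that schema looks like, here is a small fragment written in the style of CPython's Python.asdl (a cut-down illustration, not the real schema):

```
module Example
{
    expr = BinOp(expr left, operator op, expr right)
         | Num(int value)

    operator = Add | Sub | Mult | Div
}
```

Each constructor (BinOp, Num, ...) becomes a generated class with typed fields, which is where the ~100x code expansion comes from.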

Why Zephyr ASDL?
[http://www.oilshell.org/blog/2016/12/11.html](http://www.oilshell.org/blog/2016/12/11.html)

It eliminates the need for a lot of the boilerplate he pointed out:

[https://github.com/praeclarum/CLanguage/tree/master/CLanguag...](https://github.com/praeclarum/CLanguage/tree/master/CLanguage/Syntax)

~~~
fmihaila
For those writing language processors in Java, I highly recommend having a
look at JastAdd [1]. It's a great framework for implementing semantic analysis
using attribute grammars [2]. The tool automatically generates Java classes to
represent ASTs, based on a high-level description (a CFG grammar) of the
desired AST shape. You can use any Java-based parser as long as the AST it
produces is constructed from those classes. That's only the starting point;
the real power is in the excellent support for expressing attribute
computation. It uses a high-level, lazy equational formalism extended by the
programmer with custom logic, which is then turned into automatically-
generated Java code. Compilers/interpreters written in JastAdd tend to be very
well structured and easy to maintain. It's one of my all-time favourite
libraries, actively maintained, and very well documented.

[1] [http://jastadd.org/web/documentation/concept-overview.php](http://jastadd.org/web/documentation/concept-overview.php)

[2] [https://en.wikipedia.org/wiki/Attribute_grammar](https://en.wikipedia.org/wiki/Attribute_grammar)

~~~
lopatin
And for Scala, there's Kiama
[https://bitbucket.org/inkytonik/kiama](https://bitbucket.org/inkytonik/kiama)

------
amelius
Please don't call it C++ if it doesn't even support constructors and
destructors.

People will be trying to port their existing code, and will be unpleasantly
surprised.

> But the truth is, I’m just going to keep listening to users and improving
> the parts I can. I am not trying to recreate GCC or LLVM.

Why not _use_ GCC or LLVM as part of your project? That seems like a better
investment of your time.

~~~
jchw
The post explains this pretty well: GCC and Clang would be nightmares to integrate into
an app that runs on many platforms including iOS, and even once you did, you'd
still need to implement an interpreter, because you're not allowed to JIT on
the app store. Compiling a toolchain for a desktop platform is easy. Cross
compiling a toolchain is moderately easy. But embedding a toolchain into an
iOS app, where you have to get it and all of its dependencies compiling for
iOS, definitely seems like a massive challenge.

~~~
comex
For Clang/LLVM, it wouldn’t be that hard. It’s one source tree to build (Clang
is in a separate repo from LLVM core, but you check it out into a specific
subdirectory and they build together), all C++, no external dependencies, and
the result is a set of static libraries you can easily link into an iOS app.

As for interpreting, there are a few options. The simplest approach would be
to use LLVM's built-in IR interpreter (lli), which did exist in 2010.

Alternately, if you want more control: these days, the targets supported by
upstream LLVM include:

- eBPF

- WebAssembly

- AVR

The first two are bytecode formats, both designed for (relative) simplicity
and portability. There are open-source interpreters available for both, or
alternately they’re both simple enough that you could relatively easily write
your own. AVR, on the other hand, is the actual instruction set used by the
Arduino; I’m not sure what the status is of open-source AVR emulators (I see a
few projects on GitHub but don’t know how good or complete they are), but
there’s probably something out there that would work.

But to be fair, all three of those targets were merged into LLVM upstream only
recently, much later than 2010. Also, 2010 was the year that Clang got C++
support working well enough to become self-hosting, according to Wikipedia
[1]; it was still relatively experimental, not nearly as much of a “safe bet”
as it is today. (Still, even then, it was pretty much guaranteed to work
better than writing your own C++ compiler from scratch! :p)

GCC is older but much harder to embed, even today. For one thing, it’s under
the GPL. iCircuit is closed source, and even if the author were willing to
open it, there are supposedly legal issues with publishing GPL code to the App
Store. And there are technical obstacles as well: last I checked, GCC still
expects to be run as the main executable as part of a traditional Unix
toolchain, not embedded as a library. If that has changed, it was only
recently.

[1] [https://en.m.wikipedia.org/wiki/Clang#Status_history](https://en.m.wikipedia.org/wiki/Clang#Status_history)

~~~
jchw
Honestly, Clang does seem like it would be pretty easy to embed in comparison.
They do mention their app is written in C#, and integrating C++ code with it
on every platform may carry its own challenges. I know it's relatively easy to
do on desktop platforms with Mono and Microsoft .NET just using normal
P/Invoke, but I have no idea about iOS; code signing looks complicated on iOS,
though I'm sure you know it inside and out. Either way... yeah, if you
consider Clang, it really does appear like a waste of time to write your own
C++ compiler, even for just a subset.

------
mynegation
It takes hundreds, if not thousands, of person-years to write a C++ compiler
intentionally; it's damn near impossible to write one accidentally, as the
title implies.

Frank is a talented engineer, I used to follow his work on Calca very closely.
But as others noted, this does not seem to be anywhere close to C++. The
problem with parsing C++ starts with C. Say you see the code “T(t);” at the
beginning of a function body. What is it? A declaration of variable t of type
T? A call of function T on variable t? You cannot parse it properly without a
symbol table. No context-free parser handles C properly, let alone C++. It
gets progressively and exponentially worse from there.

~~~
sedachv
> You cannot parse it properly without a symbol table. No context-free parser
> handles C properly, let alone C++. It gets progressively and exponentially
> worse from there.

C is actually pretty easy if you ignore all of the bad advice to use parser
generators. I looked into the various YACC grammars for C I could find on the
Internet, and all of them either had bugs or were incomplete. TCC[1] has a
simple recursive-descent parser. With a recursive descent parser you also have
the option of implementing the C pre-processor in the same step. Turns out I
was able to implement a single-pass C parser and pre-processor as a bunch of
Common Lisp read macros[2].

I have not looked into it, but the approach for C++ looks like it would be
very different because template instantiation needs its own step.

[1] [https://bellard.org/tcc/](https://bellard.org/tcc/) [2]
[https://github.com/vsedach/Vacietis/blob/master/compiler/rea...](https://github.com/vsedach/Vacietis/blob/master/compiler/reader.lisp)

------
njsubedi
This post reminded me of myself in 2012. Where did all that energy go?

------
ncmncm
Does this seem insane to anyone else?

Emulating an Arduino, you get no benefit at all from re-implementing a
compiler. The Instruction Set Architecture is rigorously defined. All you need
to do is implement that, plus some peripherals. Write a compiler from ATmega
machine code to your VM, if you like.

This C++-- will never be actually useful to anyone, and people developing for
Arduino already have more than one anyway. It might be educational, but almost
all the work has gone into the least useful and least educational parts of it.

The destructor is by far the single most essential feature of C++. After that,
constructors, templates, and the standard library take 2nd, 3rd, and 4th
place. If you think integer conversions are important, you have not learned
much at all.

There are already several ATmega emulators done sanely. One more would not be
bad, and could break new ground. Use virtualization primitives to make an x86
program translated from ATmega think it's a real chip, and emulate the Arduino
100x faster than the real one.

------
maxxxxx
Nice! I think doing this is a great learning experience. I think I understand
how to write a basic compiler but I am completely stumped by writing an
optimizer. How do you model that?

~~~
archgoon
The LLVM compiler has its optimization routines here:

[https://github.com/llvm-mirror/llvm/tree/master/lib/Transfor...](https://github.com/llvm-mirror/llvm/tree/master/lib/Transforms)

You can play around with how it turns LLVM IR into optimized LLVM IR by using
the opt tool.

Here's a basic example.

    
    
      $ cat test.c
      #include <stdio.h>
      int main() {
        for (int i = 0; i < 3; i++) {
          printf("hello world");
        }
      }
    
      $ clang -c -S -emit-llvm test.c
      $ cat test.ll
      ** too long, replacing with gist **
      https://gist.github.com/cwgreene/b6f33d40ad735a448ed15057d91fdbdc
    
      $ opt -S -print-before-all -O2 test.ll
      ** too long, replacing with gist**
      https://gist.github.com/cwgreene/99b3075dbfffae35a5745934ded217fb
      ** final result**
      @.str = private unnamed_addr constant [12 x i8] c"hello world\00", align 1
    
      ; Function Attrs: nounwind uwtable
      define i32 @main() #0 {
      entry:
        %call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
        %call.1 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
        %call.2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
        ret i32 0
      }
    

The command `opt -S -print-before-all -O2` can be manually modified to specify
exactly which compiler passes you want to see. You can run `opt --help` to see
a list of all compiler passes. (You can figure out what each pass does by
looking at the `*** IR Dump Before MODULE NAME ***` markers in the output and
finding the associated logic in the lib/Transforms directory.)

In the case of LLVM, the idea is to break the code into things called 'basic
blocks' which, together with static single assignment (SSA) form, are more
easily analyzed, transformed, and simplified.

------
aaaaaaaaaab
More like a “C++ [1] compiler”.

[1] restrictions may apply

~~~
saagarjha
To be honest, that's most C++ compilers.

~~~
mhh__
That's only technically true. The main C++ compilers in common use basically
support, aim to support, or have collectively agreed not to support parts of
the C++ standard.

~~~
jcranmer
> have collectively agreed not to support parts of the C++ standard.

Those parts were ripped out of the standard. (The big one is export templates,
where the expert compiler implementers, when asked for feedback about how to
implement it, said "don't.")

------
cift
This looks like a cool app, the post says it's available on Android but I
can't seem to find it on the Play Store. Anyone know what happened?

------
hasahmed
Very cool

