Hacker News
Show HN: My own C compiler (just for fun) (github.com/rcorcs)
175 points by rcorcs on June 14, 2016 | 50 comments



I know there are hundreds of C compilers and that I am just reinventing the wheel, but my main goal is learning. I have written a compiler before, but it was for a toy language. For this reason, I would like to build a compiler for a "real-life" language, and C is an important yet small one. I want to master the whole process of building a fully working compiler for a "real-life" language, and afterwards continue to build on top of this knowledge, since I have been doing research on automatic parallelisation and am interested in optimising compilers in general.

Even if this project doesn't turn out to be useful for a lot of people, I hope at least to inspire a few people to tackle big problems.


I'm curious to know your reasoning behind using a generated parser instead of writing your own. This might be the first "just for fun" C compiler I've seen on HN or elsewhere that doesn't use some variant of recursive-descent/precedence, which is employed even by the "real" compilers like GCC, Clang, ICC, and MSVC.

If you are doing this for learning, then I'd recommend studying C4's parser (https://news.ycombinator.com/item?id=8558822) and this article:

https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm

The precedence-climbing algorithm is so amazingly simple and concise that I think anyone playing with compilers should implement a parser using it at least once, just to experience its astounding elegance. IMHO seeing an entire programming language's AST parsed almost entirely using a single recursive function with very little code and a table concisely describing its grammar can be quite exhilarating.
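To make that concrete, here is a minimal sketch of precedence climbing (the names and the evaluate-as-you-go shortcut are mine, not taken from the article; a real compiler would build AST nodes instead of computing values):

```c
#include <ctype.h>

/* Hypothetical minimal precedence-climbing evaluator for + - * /.
   One recursive function plus a tiny precedence table. */

static const char *p;  /* cursor into the input string */

static int prec(char op) {
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    default:            return 0;  /* not a binary operator */
    }
}

static long parse_primary(void);

/* The single recursive function: keep consuming operators whose
   precedence is at least min_prec; recurse with prec(op)+1 so that
   equal-precedence operators associate to the left. */
static long parse_expr(int min_prec) {
    long lhs = parse_primary();
    while (prec(*p) >= min_prec && prec(*p) > 0) {
        char op = *p++;
        long rhs = parse_expr(prec(op) + 1);
        switch (op) {
        case '+': lhs += rhs; break;
        case '-': lhs -= rhs; break;
        case '*': lhs *= rhs; break;
        case '/': lhs /= rhs; break;
        }
    }
    return lhs;
}

static long parse_primary(void) {
    if (*p == '(') {           /* parenthesized subexpression */
        p++;
        long v = parse_expr(1);
        p++;                   /* skip ')' */
        return v;
    }
    long v = 0;                /* a decimal integer literal */
    while (isdigit((unsigned char)*p))
        v = v * 10 + (*p++ - '0');
    return v;
}

long eval(const char *s) { p = s; return parse_expr(1); }
```

Adding a new binary operator is just a new row in prec(), which is the "table concisely describing its grammar" part.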


> IMHO seeing an entire programming language's AST parsed almost entirely using a single recursive function with very little code and a table concisely describing its grammar can be quite exhilarating.

I recently implemented a toy interpreter for a small language which uses a hand-written, table-driven shift-reduce parser. The parser simply maintains a stack of terminals and non-terminals and applies the corresponding reduction if a suffix of the stack matches a rule in the table. Precedence rules are resolved through an additional "should_shift" function, keeping the grammar very understandable. I think it's a very elegant algorithm:

https://github.com/bbu/simple-interpreter
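For readers who don't want to dig through the repo, here's a stripped-down sketch of the idea with a toy grammar of my own (just + and *, not the linked interpreter's actual tables):

```c
#include <ctype.h>

/* Hypothetical shift-reduce evaluator for E -> E '+' E | E '*' E | NUM.
   The stack holds terminals and the nonterminal E (tagged 'E', carrying
   a value); a reduction fires whenever a suffix of the stack matches a
   rule, and should_shift() resolves precedence conflicts. */

struct sym { char tag; long val; };   /* tag: 'E', '+', or '*' */

static int prec(char op) { return op == '*' ? 2 : 1; }

/* Shift the incoming operator instead of reducing when it binds tighter. */
static int should_shift(char stack_op, char next) {
    return (next == '+' || next == '*') && prec(next) > prec(stack_op);
}

long sr_eval(const char *s) {
    struct sym st[64];
    int top = 0;                       /* stack size */
    for (;;) {
        char next = *s;
        /* Reduce E op E while the lookahead doesn't bind tighter. */
        while (top >= 3 && st[top-1].tag == 'E' && st[top-3].tag == 'E' &&
               !should_shift(st[top-2].tag, next)) {
            long v = st[top-2].tag == '*' ? st[top-3].val * st[top-1].val
                                          : st[top-3].val + st[top-1].val;
            top -= 3;
            st[top++] = (struct sym){ 'E', v };
        }
        if (next == '\0') break;
        if (isdigit((unsigned char)next)) {   /* shift NUM, reduce to E */
            long v = 0;
            while (isdigit((unsigned char)*s)) v = v * 10 + (*s++ - '0');
            st[top++] = (struct sym){ 'E', v };
        } else {                              /* shift an operator */
            st[top++] = (struct sym){ *s++, 0 };
        }
    }
    return st[0].val;
}
```

Because equal precedence means "don't shift", reductions fire left-to-right, giving left associativity for free.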


The code base looks nice and easy to navigate. I didn't see any license mentioned - I'd suggest stating one clearly for anyone considering building something on top of it, or contributing.


You can also look here for advice: https://www.reddit.com/r/Compilers/ There are people there who specialize in developing compilers.


Automatic parallelisation is a bad idea because the overhead caused by synchronisation is bigger than the gain. It's also a bad idea to take control away from the programmer. What if every compiler did automatic parallelisation differently? Then you'd have impossible-to-fix performance problems.


> Automatic parallelisation is a bad idea because the overhead caused by synchronisation is bigger than the gain.

A sufficiently intelligent compiler would be able to heuristically determine if the synchronization overhead for a program segment would be greater than the performance gain. Plus, there's lots of little cases where synchronization overhead is trivial, like map().

C isn't the best language for this, but there's plenty of languages (e.g. Haskell) that have data flows that can be thoroughly statically analyzed for dependencies.
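To illustrate the map() case, here's a sketch using an OpenMP pragma as a stand-in for what a hypothetical auto-parallelizer would emit (the function and names are mine):

```c
#include <stddef.h>

/* Hypothetical example: a map() over independent elements, the kind of
   loop a compiler can split across cores with negligible
   synchronization -- essentially one barrier at the end of the loop.
   Without -fopenmp the pragma is ignored and this is a plain loop. */
void map_square(const double *in, double *out, size_t n) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        out[i] = in[i] * in[i];  /* no cross-iteration dependency */
}
```

The safety argument is exactly the dependency analysis mentioned above: each iteration touches only its own in[i]/out[i], which is easy to prove here and much easier still in a language like Haskell.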

> It's also a bad idea to take away control from the programmer.

Ideally you'd have a compiler pragma to disable this on a case-by-case basis, and you'd still have more traditional parallel programming tools available.

> What if every compiler did automatic parallelisation differently? Then you'd have impossible to fix performance problems.

If you had a bad enough performance regression, you'd disable automatic parallelization and file a compiler bug (because it shouldn't regress below single-core performance). You could make this same argument against any optimizing compiler. They all apply different optimizations, and they do sometimes lead to "impossible" to fix performance problems or even contradictory execution behaviors between implementations. But I still think `gcc -O3` is invaluable.

I don't think automatic parallelization is a cure-all, or even that it should be enabled by default on compilers any time soon, but it's better to have it as an option than not.


This is awesome. Thank you for sharing it with us. :)


Kudos to you; this is something I don't think I could ever accomplish, and I know it's not easy. Keep learning and growing.


I will always upvote something like this, but I am wondering: why write it in C++? I didn't really read the whole thing, but I skimmed a bit, and didn't see a lot of advantage taken of the higher-level facilities of C++. It seems like you've picked a route that gives you all the disadvantages of using C, while simultaneously destroying the possibility of it ever being self-hosting.

Still, a fun project, and you have my upvote. :)


Not everyone wants to write the most idiomatic C++, so writing variably C-ish code in C++ isn't really all that uncommon nowadays. If you're a C programmer but you notice C++ has a convenient feature (or features) you might find useful in a pinch, hell, maybe it's worth it to bite the bullet (though using C++ in lieu of C can have its drawbacks too).


I would completely agree with you if this weren't a C compiler project. In this case, the advantage of having a self-hosting compiler is huge! Among other things, you can use the compiler source itself to test the compiler, which is pretty cool.

I have worked on projects that were essentially "C with classes," and it was fine. I'm not saying every C++ project needs to go all out using every feature from C++ 11, but I didn't even see much beyond use of the string class here. It seems not worth giving up self-hosting to me.


> the advantage of having a self-hosting compiler is huge!

I agree that it's kind of neat to compile your compiler with itself, but it's not clear to me that this constitutes a significant productivity boost. There are countless other C programs that have already been written and can be used as test cases.

So I'm left wondering what the huge advantage that self-hosting provides is. I am interested so, if you have more to say on the topic, I will certainly read it.


If we're talking about toy compilers and learning, then productivity isn't really in consideration. Having a self-hosting compiler for a non-trivial language means you've achieved the "bootstrap" stage, which is a significant milestone and really gives you a practical understanding of what's involved in developing a programming language. A compiler that is powerful enough to compile itself is on the way to exiting the "toy"/"theoretical" realm that most people who study compilers don't ever go beyond.


You could just start replacing the C++ parts with plain C until it is self-hosting.


> while simultaneously destroying the possibility of it ever being self-hosting.

...at least until it begins turning into a C++ compiler ;-)

That makes me wonder, given that there seem to be quite a lot of "toy" C/C-subset compilers being written these days, just how much harder C++ would be. For a long time it was commonly thought that even the simplest C compiler would be extremely complex and difficult, but that notion appears to have been somewhat defeated, so the next logical step in that direction is C++ or maybe C++-to-C (like Cfront) compilers. The closest I'm aware of would be the C++-subset used in TempleOS, which is self-hosting.


Horribly. C++ has so many features, and interactions between them, that you really don't want to be dragged into that as a toy project. I'd wager that the C++ language spec is longer than the source of a functional C compiler (both without the standard libraries).


Maybe the features themselves aren't actually the issue, because they are quite orthogonal and independent of each other, and the complexity is really the natural result of the interactions between them. If I wanted to write a C++ compiler I'd probably start by adding features in this order:

    - Member functions in structures; constructors; destructors
    - Single inheritance
    - Virtual functions
    - Exceptions
    - Multiple inheritance
    - Virtual inheritance
    - Templates
There's definitely more I haven't listed, but the first few don't seem too difficult...


I was wondering the same thing. Oh, and you get an upvote from me too.


Having a self-hosting compiler gives you a very handy tool for assessing the gains of a speed optimisation: if the speed improvement in compiling the compiler compensates for the more complex compilation algorithm, then it is justified.


Can you please add a free software license to your project, as it's currently proprietary? I'd recommend GPL, but go with whatever you want.

Contrary to popular opinion, copyright applies automatically to all of your works (unless you are an employee of the US Government). Due to the draconian nature of copyright laws, people have very few practical freedoms with your work unless you use a free software license (GPL, MIT, Apache, etc).

EDIT: Actually, looking through your GitHub profile it looks like most of your projects don't have free software licenses. Can you please rectify this, as it's clear you're doing cool work but it's not free software (or even "open source").


The computer science / computer programming problem I'd like to see solved is keeping projects "fresh" and open/accessible enough that people like this could feel they were learning in an unencumbered way while contributing something useful to an existing project, and at the same time pushing the capabilities of what available open source projects can provide.

"Reinventing the wheel" projects absolutely litter public source nodes; believe me, I know why people do it; but my dream is the dream of software that most of us have given up on, code reuse, "reentrancy", shared libraries, etc.

Maybe something like a "wikipedia of source code".

I'm not discounting the benefit of doing a project to learn about it; what I'm saying is, it's too bad the code won't be useful for anything else without a lot more work, and too bad the work is going into something that is not reusable.


Any nontrivial project requires lots of time just to understand its design. Even minimal contributions will likely require comparable amounts of work to completing a toy project. And as any professional programmer knows, reading other people's code is a lot less fun than writing your own.


It's like you are saying "it's a problem", and I'm saying "that's the problem I'd like to see solved".

As an example, what they teach us in school, and what large projects like NASA's have to do, is to first agree on a specification for interfaces, then write code to the interface, then iron out the kinks. Working on a project like that - and the bigger the project, the sooner - we discover that there are many local wins if we could only change the interface we agreed on, because "we didn't know enough when we agreed", etc.

As an example of what I'm saying (as a thought experiment solution) is that if a real live compiler project was written to clean specs (even if the specs came after the code), then there'd be a lexer, parser, etc. and for a little homebrew project like this one, you could write your own lexer from scratch, testing it all the while against the rest of a functioning compiler. Probably, you would not finish it because you would learn in a series of "aha" moments what "the hard parts" are, and how they are solved.

So you could abandon your own piece, but at the same time you would be now equipped to contribute to the real project.

Or you could move on to working on the parser... lather, rinse, repeat.

No need to tell me what all "the reasons that doesn't work is"... I know the reasons, and it's useful to identify the laundry list of them, but the part I'm interested in is the attitude that "hey, this is worth solving" and "hey, this could be solved..."


I've done something like that for the CPython bytecode compiler: https://github.com/darius/500lines/blob/master/bytecode-comp...

"From clean specs" didn't really hold because the bytecode VM needs better documentation. I started to address that with a version of the VM in Python (cutting down Ned Batchelder's byterun): https://github.com/darius/tailbiter and I've started a very spare-time project to redo the parser as well (in the same repo). It'd be neat to see this getting used in a compiler course -- CPython's about as simple as a popular compiler gets.

(I agree with the grandparent that it took a lot of time to learn all I needed to about CPython internals -- and for the parser there's more to learn.)


Does the LLVM introduction tutorial[0] kind of fit what you're suggesting? You learn how to implement a toy language called Kaleidoscope on top of the LLVM infrastructure with one data type (64-bit float), if/else, for loop, and a few other things.

It covers the lexer, parser, AST generation, and a few other things.

There's also one for writing a backend targeting a fake hardware architecture.

[0] http://llvm.org/docs/tutorial/index.html


thanks, that's very good.

Quick critique (wanted to contribute to this conversation while it's active rather than delve deeply into LLVM for the rest of the day :) It's (naturally and understandably) written from the perspective of "this is how it is; if you want to connect with what we do, here's what you need to do".

As a pedagogical tool (that is still a compiling tool) it could use an intro of more "here is what a lexer needs to do, here's how/why we chose to do it, here is why what is downstream belongs downstream, here is an example using a language syntax that is extremely simple" (C is not), "here is an alternative way you could try to do it", etc.

But definitely you point up a good way to start toward [mystic music] "my dream goal" in this example.

Again, though, I'm wishing there were tools and "a way" that ALL projects could be managed like this: not just one great compiler, but the several great compilers and editors, and all the types of things people keep having the urge to reinvent.


Seems like the ETH work on Oberon etc did that. The students kept learning by improving on or porting both the compilers and OS's. A2 Bluebottle is a fast, usable OS as a result. Barely usable given no ecosystem and few features. Usable, though, in a safe, simple, GC-ed language. :)

As far as C compilers go, would this previously posted one fit your requirements?

http://c9x.me/compile/

https://news.ycombinator.com/item?id=11555527

It seems to be quite similar to what you describe. Designed to be compact, easy to understand, maybe easy to extend, and useful in practice. I remember liking it more than most of these submissions for those attributes.


Well that's an interesting idea. What sort of programming language would let us build a package repo that could survive a Wikipedia-style edit war?

Wikipedia works because articles are mostly independent. There are links, but if they're broken it's not fatal. There are also a fair number of stub articles and a bureaucracy around what counts as a notable subject.

In practice, projects have owners, and they're looking for help, but not just anyone's help.


> What sort of programming language would let us build a package repo that could survive a Wikipedia-style edit war?

Really, the programming language isn't important; it's the package management and repository system that matters for that. And a repository that keeps history, plus a package/dependency management system that lets you specify the particular version of a dependency, would seem to suffice to address the particular problem you relate.


I find Kartik Agaram's writings interesting with respect to that problem.

http://akkartik.name/about


Thanks for saving me the decision of whether or not to do another shameless plug :)

I recently wrote this (in a very similar thread to this one) about the tension between real-world and teaching software: https://www.reddit.com/r/Compilers/comments/4jmb88/open_sour...


more details?


more detailed question please? I'm happy to discuss but I'm not sure whether you are looking for bottom up nitty gritty details or more top down grounded philosophizing.


Looking at the rest of the thread helped. Thanks!


Where does one go for a fully representative sample of all kinds of valid C code to test a project like this? Or -- given a formal grammar, has someone produced a tool that generates representative code?



I give +1 to the Csmith recommendation. They used it to tear up all kinds of compilers, with practical bugs found as a result. They even found a few in CompCert, despite its formal proofs, due to specification errors. The middle-end came out flawless, though.

So, such results speak for themselves. I think Csmith has to be the baseline. Not sure if it's easy to tune for the equivalent of basic, acceptance tests, though. If not, then it could be nice to have a long list of tests that each correspond to specific, C-language features to help a compiler writer gradually implement one.


Actually, the bugs they found in CompCert were not in the proved sections. The Csmith paper says:

"As of early 2011, the under-development version of CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task. The apparent unbreakability of CompCert supports a strong argument that developing compiler optimizations within a proof framework, where safety checks are explicit and machine-checked, has tangible benefits for compiler users."

https://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf


"The problem is that the 16-bit displacement field is overflowed. CompCert’s PPC semantics failed to specify a constraint on the width of this immediate value, on the assumption that the assembler would catch out-of-range values. In fact, this is what happened. We also found a handful of crash errors in CompCert."

That bug was in the backend. I believe it's verified code based on this paper's existence:

http://gallium.inria.fr/~xleroy/publi/compcert-backend.pdf

The code correctly implemented a bad spec, as far as I can tell. The crash errors aren't elaborated, so I can't say what they imply. So, for the backend, there was at least one bug found in verified code due to an inaccurate spec.

This shows the value of multiple forms of verification, which Orange Book era research already supported empirically, corroborated by Altran-Praxis and others in recent times. I've included example papers on LOCK's design/cost and Altran-Praxis' CA since they're representative of trends I've observed over time. The proving part, for common errors at least, has gotten a lot easier with tools like SPARK. I wanted CompCert derived into SPARK at one point.

http://www.cyberdefenseagency.com/publications/LOCK-An_Histo...

https://cryptosmith.files.wordpress.com/2014/10/lock-eff-acm...

http://www.anthonyhall.org/c_by_c_secure_system.pdf


For gradual implementation I would start from a test suite in tcc [1]. Eventually adopting the one from gcc and csmith.

[1] http://bellard.org/tcc/


Forgot about that one. Thank you! I remember it being small and fast but didn't know it had memory and bounds checking. That's a great combination.


Csmith is a tool that can generate random C programs that statically and dynamically conform to the C99 standard.

It sounds a lot like a fuzzer, although only for testing success cases.


For specific test cases, you could use gcc's test cases:

https://github.com/gcc-mirror/gcc/tree/master/gcc/testsuite

Not sure about the latter, but you could probably accomplish something similar without too much work using a fuzzer.


Ha, finally an opportunity to show this off! My pet parser project has a built-in "unparser" to turn a parser (specified as a formal grammar) into a randomized test-case generator.

https://github.com/Hardmath123/nearley#the-unparser
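The core idea can be hand-rolled in a few lines. Here's a hypothetical sketch of grammar-directed random generation for a toy expression grammar (this is my own illustration, not how nearley's unparser actually works):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "unparser" sketch: expand a grammar's productions at
   random to emit random-but-valid sentences.
   Grammar: expr -> NUM | '(' expr op expr ')', biased toward NUM as
   depth shrinks so the recursion always terminates. */
static void gen_expr(char *buf, size_t cap, int depth) {
    if (depth <= 0 || rand() % 2 == 0) {        /* terminal: a number */
        char num[16];
        snprintf(num, sizeof num, "%d", rand() % 100);
        strncat(buf, num, cap - strlen(buf) - 1);
    } else {                                    /* '(' expr op expr ')' */
        const char *ops = "+-*";
        char op[2] = { ops[rand() % 3], '\0' };
        strncat(buf, "(", cap - strlen(buf) - 1);
        gen_expr(buf, cap, depth - 1);
        strncat(buf, op, cap - strlen(buf) - 1);
        gen_expr(buf, cap, depth - 1);
        strncat(buf, ")", cap - strlen(buf) - 1);
    }
}
```

Every string this emits is parseable by the same grammar, which is what makes it useful as a test-case generator: feed the output back to the compiler under test and it should never reject it.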


There are several other C compilers on Github. I list a few of them here.

https://github.com/melling/ComputerLanguages/blob/master/com...

They probably deserve their own section.


Unrelated to the thread, but I opened a PR on your repo the last time you posted about it here. Right now it is still open :/


I'll have to fix it after my vacation. There's now a conflict. I didn't see it because I'm subscribed to too many github repos.


Could you add my own self-hosting C compiler: https://github.com/andrewchambers/c




