
Show HN: My own C compiler (just for fun) - rcorcs
https://github.com/rcorcs/rcc
======
rcorcs
I know there are hundreds of C compilers and that I am just reinventing the
wheel. But my main goal is just learning. I have already written another
compiler before, but it was just for a toy language. For this reason, I would
like to build a compiler for a "real-life" language, and C is an important yet
small language. I want to master the whole process of a fully working
compiler for a "real-life" language, and afterwards continue to build on top
of this knowledge, since I have been doing research on automatic
parallelisation, and I am interested in optimising compilers in general.

Even if this project doesn't turn out to be useful for a lot of people, I hope
at least to inspire a few people to tackle big problems.

~~~
userbinator
I'm curious to know your reasoning behind using a generated parser instead of
writing your own. This might be the first "just for fun" C compiler I've seen
on HN or elsewhere that doesn't use some variant of recursive-
descent/precedence, which is employed even by the "real" compilers like GCC,
Clang, ICC, and MSVC.

If you are doing this for learning then I'd recommend studying C4's parser
([https://news.ycombinator.com/item?id=8558822](https://news.ycombinator.com/item?id=8558822))
and this article:

[https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm](https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm)

The precedence-climbing algorithm is so amazingly simple and concise that I
think anyone playing with compilers should implement a parser using it at
least once, just to experience its astounding elegance. IMHO seeing an entire
programming language's AST parsed almost entirely using a single recursive
function with very little code and a table concisely describing its grammar
can be quite exhilarating.

~~~
bluetomcat
> IMHO seeing an entire programming language's AST parsed almost entirely
> using a single recursive function with very little code and a table
> concisely describing its grammar can be quite exhilarating.

I recently implemented a toy interpreter for a small language which uses a
hand-written, table-driven shift-reduce parser. The parser simply maintains a
stack of terminals and non-terminals and applies the corresponding reduction
if a suffix of the stack matches a rule in the table. Precedence rules are
resolved through an additional "should_shift" function, keeping the grammar
very understandable. I think it's a very elegant algorithm:

[https://github.com/bbu/simple-interpreter](https://github.com/bbu/simple-interpreter)

------
pmiller2
I will always upvote something like this, but I am wondering: why write it in
C++? I didn't really read the whole thing, but I skimmed a bit, and didn't see
a lot of advantage taken of the higher-level facilities of C++. It seems like
you've picked a route that gives you all the disadvantages of using C, while
simultaneously destroying the possibility of it ever being self-hosting.

Still, a fun project, and you have my upvote. :)

~~~
unsignedqword
Not everyone wants to write the most idiomatic C++, so writing variably C-ish
code in C++ isn't really all that uncommon nowadays. If you're a C programmer
and you notice C++ has a convenient feature (or features) that you might find
useful in a pinch, hell, maybe it's worth it to bite the bullet (though that's
not to say that using C++ in lieu of C doesn't have its drawbacks).

~~~
pmiller2
I would completely agree with you if this weren't a C compiler project. In
this case, the advantage of having a self-hosting compiler is huge! Among
other things, you can use the compiler source itself to test the compiler,
which is pretty cool.

I have worked on projects that were essentially "C with classes," and it was
fine. I'm not saying every C++ project needs to go all out using every feature
from C++11, but I didn't even see much beyond use of the string class here.
It seems not worth giving up self-hosting to me.

~~~
mac01021
> the advantage of having a self-hosting compiler is huge!

I agree that it's kind of neat to compile your compiler with itself, but it's
not clear to me that this constitutes a significant productivity boost. There
are countless other C programs that have already been written and can be used
as test cases.

So I'm left wondering what the huge advantage that self-hosting provides is. I
am interested so, if you have more to say on the topic, I will certainly read
it.

~~~
userbinator
If we're talking about toy compilers and learning, then productivity isn't
really in consideration. Having a self-hosting compiler for a non-trivial
language means you've achieved the "bootstrap" stage, which is a significant
milestone and really gives you a practical understanding of what's involved in
developing a programming language. A compiler that is powerful enough to
compile itself is on the way to exiting the "toy"/"theoretical" realm that
most people who study compilers don't ever go beyond.

------
cyphar
Can you please add a free software license to your project, as it's currently
proprietary? I'd recommend GPL, but go with whatever you want.

Contrary to popular opinion, copyright applies automatically to all of your
works (unless you are an employee of the US Government). Due to the draconian
nature of copyright laws, people have very few practical freedoms with your
work unless you use a free software license (GPL, MIT, Apache, etc).

EDIT: Actually, looking through your GitHub profile it looks like most of your
projects don't have free software licenses. Can you please rectify this, as
it's clear you're doing cool work but it's not free software (or even "open
source").

------
sheepleherd
the computer science / computer programming problem I'd like to see solved is
keeping projects "fresh" and open/accessible enough that people like this
could feel like they were learning in an unencumbered way, while at the same
time contributing something useful _to an existing project_ and pushing the
capabilities of what available open source projects can provide.

"Reinventing the wheel" projects absolutely litter public source nodes;
believe me, I know why people do it; but my dream is the dream of software
that most of us have given up on, code reuse, "reentrancy", shared libraries,
etc.

Maybe something like a "wikipedia of source code".

I'm not discounting the benefit of doing a project to learn about it; what I'm
saying is, too bad it's not code that will be useful for anything else without
a lot more work; and too bad work is going into something that is not
reusable.

~~~
adrianN
Any nontrivial project requires lots of time just to understand its design.
Even minimal contributions will likely require comparable amounts of work to
completing a toy project. And as any professional programmer knows, reading
other people's code is a lot less fun than writing your own.

~~~
sheepleherd
like you are saying "it's a problem", and like I'm saying "that's the problem
I'd like to see solved"

as an example, what they teach us in school, and what large projects like NASA
have to do, is to first agree on a specification for interfaces, then to write
code to the interface, then iron out the kinks. Working on a project like
that, and the bigger the project, soon we discover that there are many local
wins if we can only change the interface that we agreed on because "we didn't
know enough when we agreed" etc. etc.

As an example of what I'm saying (a thought-experiment solution): if a real
live compiler project were written to clean specs (even if the specs
came after the code), then there'd be a lexer, parser, etc. and for a little
homebrew project like this one, you could write your own lexer from scratch,
testing it all the while against the rest of a functioning compiler. Probably,
you would not finish it because you would learn in a series of "aha" moments
what "the hard parts" are, and how they are solved.

So you could abandon your own piece, but at the same time you would be now
equipped to contribute to the real project.

Or you could move on to working on the parser... lather, rinse, repeat.

No need to tell me what all "the reasons that doesn't work is"... I know the
reasons, and it's useful to identify the laundry list of them, but the part
I'm interested in is the attitude that "hey, this is worth solving" and "hey,
this could be solved..."

~~~
tmm
Does the LLVM introduction tutorial[0] kind of fit what you're suggesting? You
learn how to implement a toy language called Kaleidoscope on top of the LLVM
infrastructure with one data type (64-bit float), if/else, for loop, and a few
other things.

It covers the lexer, parser, AST generation, and a few other things.

There's also one for writing a backend targeting a fake hardware architecture.

[0]
[http://llvm.org/docs/tutorial/index.html](http://llvm.org/docs/tutorial/index.html)

~~~
sheepleherd
thanks, that's very good.

quick critique (wanted to contribute to this conversation while it's active
rather than delve deeply into LLVM for the rest of the day :) it's (naturally
and understandably) written from the perspective of "this is how it is, if you
want to connect with what we do here's what you need to do".

As a pedagogical tool (that is still a compiling tool) it could use an intro
of more "here is what a lexer needs to do, here's how/why we chose to do it,
here is why what is downstream belongs downstream, here is an example using a
language syntax that is extremely simple" (C is not), "here is an alternative
way you could try to do it", etc.

But definitely you point up a good way to start toward [mystic music] "my
dream goal" in this example.

Again tho, I'm wishing that there were tools and "a way" that ALL projects
could be managed this way, not just one great compiler, but the several great
compilers and editors, and all-the-types-of-things-people-keep-having-the-urge-to-reinvent

------
m1n1
Where does one go for a fully representative sample of all kinds of valid C
code to test a project like this? Or -- given a formal grammar, has someone
produced a tool that generates representative code?

~~~
moosingin3space
There's csmith:
[https://embed.cs.utah.edu/csmith/](https://embed.cs.utah.edu/csmith/)

~~~
nickpsecurity
I give +1 to the Csmith recommendation. They used it to tear up all kinds of
compilers, finding practical bugs as a result. They even found a few in
CompCert, despite its formal proofs, due to specification errors. The
middle-end came out flawless, though.

So, such results speak for themselves. I think Csmith has to be the baseline.
Not sure if it's easy to tune for the equivalent of basic acceptance tests,
though. If not, then it could be nice to have a long list of tests that each
correspond to a specific C-language feature, to help a compiler writer
gradually implement them one by one.

~~~
Jabbles
Actually, the bugs they found in CompCert were not in the proved sections. The
Csmith paper says:

"As of early 2011, the under-development version of CompCert is the only
compiler we have tested for which Csmith cannot find wrong-code errors. This
is not for lack of trying: we have devoted about six CPU-years to the task.
The apparent unbreakability of CompCert supports a strong argument that
developing compiler optimizations within a proof framework, where safety
checks are explicit and machine-checked, has tangible benefits for compiler
users."

[https://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf](https://www.cs.utah.edu/~regehr/papers/pldi11-preprint.pdf)

~~~
nickpsecurity
"The problem is that the 16-bit displacement field is overflowed. CompCert’s
PPC semantics failed to specify a constraint on the width of this immediate
value, on the assumption that the assembler would catch out-of-range values.
In fact, this is what happened. We also found a handful of crash errors in
CompCert."

That bug was in the backend. I believe it's verified code based on this
paper's existence:

[http://gallium.inria.fr/~xleroy/publi/compcert-backend.pdf](http://gallium.inria.fr/~xleroy/publi/compcert-backend.pdf)

The code correctly implemented a bad spec, as far as I can tell. The crash
errors aren't elaborated, so I can't say what they imply. So, for the backend,
there was at least one bug found in verified versions due to an inaccurate
spec.

Shows value of multiple forms of verification which Orange Book era research
already supported empirically. Corroborated by Altran-Praxis and others in
recent times. I've included example papers on LOCK's design/cost and Altran-
Praxis' CA since they're representative of trends I've observed over time.
Proving part, for common errors at least, has gotten a lot easier though with
tools like SPARK. I wanted CompCert derived into SPARK at one point.

[http://www.cyberdefenseagency.com/publications/LOCK-An_Histo...](http://www.cyberdefenseagency.com/publications/LOCK-An_Historical_Perspective.pdf)

[https://cryptosmith.files.wordpress.com/2014/10/lock-eff-acm...](https://cryptosmith.files.wordpress.com/2014/10/lock-eff-acmp.pdf)

[http://www.anthonyhall.org/c_by_c_secure_system.pdf](http://www.anthonyhall.org/c_by_c_secure_system.pdf)

------
melling
There are several other C compilers on Github. I list a few of them here.

[https://github.com/melling/ComputerLanguages/blob/master/com...](https://github.com/melling/ComputerLanguages/blob/master/compilers.org)

They probably deserve their own section.

~~~
pkd
Unrelated to the thread, but I opened a PR on your repo the last time you
posted about it here. Right now it is still open :/

~~~
melling
I'll have to fix it after my vacation. There's now a conflict. I didn't see it
because I'm subscribed to too many github repos.

