
C4 – C in 4 functions - petercooper
https://github.com/rswier/c4
======
abecedarius
On a first skim, this looks really nice; complaints that it's unreadable are
unfounded. The background that makes it readable is Wirth's _Compiler
Construction_
[http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf](http://www.ethoberon.ethz.ch/WirthPubl/CBEAll.pdf)
plus precedence climbing
[http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method](http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method)

~~~
dabockster
_complaints that it's unreadable are unfounded_

Not exactly. You have to remember that language and compiler design require a
LOT of work and experience to understand, and that many programmers will only
see this as, frankly, spaghetti.

I think it could have used some more block comments, but that's just me.

~~~
anigbrowl
I had the same first instinct, but given that a) it's very very tidy code and
b) if you want to understand the inner workings of a compiler then you really
do need to figure this out, I decided on review that it's basically self-
documenting.

Of course figuring out what it's doing is one thing - understanding _why_ it
is done in this particular way is another, and while I was able to find my way
around fairly quickly I'd cry if I had to re-implement it. I do love how small
it is though, that gives it great educational value.

~~~
blinks
> understanding why it is done in this particular way is another

Isn't that the reason for comments in the first place?

~~~
anigbrowl
I look to comments to tell me what a block of code is doing rather than why,
e.g. 'Performs a Discrete Cosine Transform on the contents of the buffer' or
'Bubble sort algorithm rearranges the records in at least as much time as
required to enjoy a nice cup of tea.'

The 'why' of a very low-level tool like this is the sort of thing that needs
to be explored at length in a paper or (in this case) a book; otherwise it
would swamp the actual code. Sometimes as a learning exercise I'll take
something like this and comment the hell out of _everything_, but the value
there is more in writing the comments than trying to read them again later. Of
course this is very much a matter of personal taste.

~~~
aeonsky
I am a junior dev without a ton of experience so correct me if I'm wrong, but
I strongly disagree. Comments should explain "why" something was written.
Wouldn't the function name indicate what you are doing (and comments in the
function)? This is especially true in business logic.

    reverseNaturalSortOrder(listOfItems); // case-sensitive sort of items in reverse alphabetical order

or

    reverseNaturalSortOrder(listOfItems); // sort this way because the upper layer expects items in reverse order since the user requested it

I think it is usually significantly easier to understand what something is
doing rather than why it is doing that. To answer the former it usually
requires a narrow scope of focus, but the latter requires a very broad scope.

~~~
DSMan195276
I agree with you completely - the code explains _what_ you're doing, comments
explain _why_ you did it that way. Ideally, any comments that explain what
you're doing would end up being redundant when looking at the code.

I think this code details a special case of the above though, in that it
comments _what_ the enums are instead of just naming the enums. I give that a
pass strictly because this code needs to be able to compile itself, and I
don't think it supports named enums, so the comment was necessary to make up
for that.

~~~
MrTortoise
It's not that simple, though: error fixes and edge cases often obfuscate
something that was once understandable. A 'why' comment is never bad, but a
'what' comment is often as valuable as a test.

~~~
DSMan195276
Like I noted with my special case, it's not _always_ that simple, but I
routinely find the best-commented code to be code which was written with the
comments explaining why and the code explaining what. There are definitely
times when a 'what' comment is warranted, but it's just not the general case.

------
petercooper
I just wanted to credit Reddit's /r/tinycode sub-Reddit for this link:
[http://www.reddit.com/r/tinycode](http://www.reddit.com/r/tinycode) - it's a
pretty cool place to discover minimalistic implementations of things.

~~~
whitten
Thank you for sharing that link to the tinycode subreddit. Did the original
poster (of the C in four functions) participate in that subreddit?

~~~
pmh
It would appear so.
[https://www.reddit.com/r/tinycode/comments/2la785/c4_a_c_com...](https://www.reddit.com/r/tinycode/comments/2la785/c4_a_c_compiler_in_four_functions/)
is the post from /r/tinycode and its OP[0] is the same as the github user[1]
for c4.

[0][https://www.reddit.com/user/rswier](https://www.reddit.com/user/rswier)

[1][https://github.com/rswier](https://github.com/rswier)

------
userbinator
From a cursory glance this appears to be a much-condensed recursive-descent
parser with all of the usual parsing functions moved into one, so it's not all
that difficult to understand. I think recursive-descent is one of the easiest
to intuitively understand parsing algorithms, and it also makes for some
concise code; it's also far easier to debug a recursive-descent parser than
one of the traditional table-driven ones.

Edit: OTCC ([http://www.bellard.org/otcc/](http://www.bellard.org/otcc/) ) is
another _extremely_ tiny (to the point of being obfuscated) example of a
compiler using a recursive-descent parser.

~~~
barrkel
It's an operator precedence parser, which is actually quite a bit faster than
recursive descent for expressions when operator precedence is defined using
grammar productions that would otherwise get turned into methods.

For example, a simplified expression operator precedence can be expressed like
this (where {} denotes repetition, implemented with e.g. a while loop in a
recursive descent parser (RDP)):

        expr ::= term { '+' term | '-' term } ;
        term ::= factor { '*' factor | '/' factor } ;
        factor ::= '-' factor | <number> | '(' expr ')' ;
    

Typically this would be parsed in RDP using expr(), term() and factor()
functions, where expr() calls term(), term() calls factor(), and factor()
calls expr() if it sees an opening parenthesis.

The trouble is that this means that every single expression parse needs to
call through this deep code flow before it can see a single interesting bit of
the expression being parsed. Parsing "2 + 3" will result in 5 function calls:
expr, term, factor, then term and factor again.

The operator precedence parser presented avoids this deeply nested set of
calls before it starts parsing interesting things.

This problem is far more acute in a language like C, which has lots of
precedence levels, than it is in a language like Pascal, which only has a
handful.

~~~
userbinator
I was contrasting it against table-driven parsers (e.g. typical of the
Yacc/Bison type) which I believe don't have _any_ recursive calls and use a
separate stack, while this one does make recursive calls. In some ways
operator precedence/precedence climbing is like an optimisation of RDP,
replacing a set of recursive functions with a level counter and a loop.

~~~
barrkel
I can view using the CPU stack instead of an explicit stack, and the CPU
instruction pointer instead of an index into a state table, as optimizations
too (direct encoding). There also exist recursive _ascent_ parsers, which are
the RDP analogue of table-driven LALR parsers you're talking about.

Point being, there are a lot of different parsing approaches, with pros and
cons, and it can be useful to treat them separately rather than view them
within only a couple of lenses.

------
bjornsing
This shit doesn't scale:

      $ time ./c4 c4.c c4.c hello.c 
      hello, world
      exit(0) cycle = 9
      exit(0) cycle = 22614
      exit(0) cycle = 9273075
    
      real	0m0.067s
      user	0m0.067s
      sys	0m0.000s
      $ time ./c4 c4.c c4.c c4.c hello.c 
      hello, world
      exit(0) cycle = 9
      exit(0) cycle = 22614
      exit(0) cycle = 9273075
      exit(0) cycle = 933197195
    
      real	0m5.834s
      user	0m5.827s
      sys	0m0.000s
      $ time ./c4 c4.c c4.c c4.c c4.c hello.c 
    

Just kidding. :) Amazingly cool! Does anybody have a smaller self-hosting
compiler & bytecode VM?

~~~
bjornsing
For the record:

      $ time ./c4 c4.c c4.c c4.c c4.c hello.c 
      hello, world
      exit(0) cycle = 9
      exit(0) cycle = 22614
      exit(0) cycle = 9273075
      exit(0) cycle = 933197195
      exit(0) cycle = -1428163377
    
      real	9m23.409s
      user	9m22.673s
      sys	0m0.020s
    

:)

------
0x0
Would you call this a compiler, an interpreter, a virtual machine, a scripting
engine, or a combination of those?

~~~
marktangotango
It compiles a C subset to byte code, then executes it in a virtual machine. I
think "interpreter" can generally refer to either a byte code interpreter
(i.e. a virtual machine) or an AST-walking interpreter. I didn't see a way to
embed c4 into a host language, so maybe not a scripting engine?

IMO the real value of exhibits like this is boiling the problem (lexing,
parsing, compiling, interpreting) down to its most basic parts. One could
easily imagine this same language being implemented over 10's or 100's of
class files in a more verbose language.

~~~
stcredzero
_I think generally an interpreter can refer to either a byte code interpreter
(i.e. virtual machine) or an AST walking interpreter._

This brings to mind how fuzzy "interpreter" is as terminology. What is a
virtual machine that JIT compiles the byte code or the AST? Is it a JIT
implementation of an interpreter? What of implementations that only JIT the
most frequently used functions? Wouldn't those be half interpreter and half
virtual machine?

When it comes down to it, they're all really virtual machines. The real
distinction is how we've come to think of different implementations and the
representations sub-culturally. For some reason, it makes us feel better when
we call certain things interpreters, because of some meaningless (and
sometimes factually challenged) competitive instincts concerning
implementation speed. (Also, we arbitrarily feel that byte code is somehow
more "machine-y" than an AST.)

So do I have a problem with "interpreter"? Only when people correct others, as
if they're making a correction about something fundamental and factual. In
reality, the distinction is between machines that are intended to have the
same runtime semantics and really the distinction is only around what
optimizations are present in their implementations. Furthermore, if you look
at those optimizations in detail, the distinction gets even hazier.

~~~
TazeTSchnitzel
Most interpreters (not all, but most) are actually a compiler and a VM, yes.
The difference between a "compiler" and an "interpreter", in practice, seems
to be that "compilers" lack a built-in interpreter.

~~~
barrkel
Most static language compilers include an interpreter for constant expressions
at a minimum, because otherwise statically allocating things like arrays is a
bit tricky. Handily, this interpreter can be reused for constant folding.

C++ compilers nowadays necessarily include a Turing complete interpreter.

~~~
TazeTSchnitzel
C has one in the preprocessor, to do platform-independent conditionals.

------
astrodust
That is pretty stripped down. Could probably be ported to JavaScript without
breaking a sweat.

~~~
Lerc
That was my first thought, a micro-C to asm.js compiler wouldn't be hard.

------
TazeTSchnitzel
Pretty damn impressive.

Though I must wonder: how complete is it? What does and doesn't it support?
It's at least complete enough to be self-hosting, but beyond that?
The code doesn't use that much of C.

~~~
cbhl
Judging from the comments in c4.c, it probably only supports enough of a
subset to compile itself.

Granted, while building a parser that can parse (let alone compile) the full
C language is nontrivial, any undergrad should be able to build a parser and
compiler for a sufficiently simple subset of it. (In my undergrad, we used
this subset to build a "compiler" in second year:
[https://www.student.cs.uwaterloo.ca/~cs241/wlp4/WLP4.html](https://www.student.cs.uwaterloo.ca/~cs241/wlp4/WLP4.html))

~~~
JoeAltmaier
You can build a C parser in an afternoon. It only has a few language
constructs. Declarations are the hardest. Scanners are readily available for
expressions and constants.

~~~
rui314
C is not a simple language, as the CIL authors note [1]. I wrote my own C
compiler [2] and I can say that writing a parser was harder than I thought. It
would take more than half a day at least.

[1]
[http://www.cs.berkeley.edu/~necula/cil/cil016.html](http://www.cs.berkeley.edu/~necula/cil/cil016.html)
[2] [https://github.com/rui314/8cc](https://github.com/rui314/8cc)

~~~
vidarh
That first link is almost all about language semantics, not parsing issues.

As for your example, I'll be solomonic and say that you and Joe are right, of
sorts (though I do think it'd be more than an afternoon).

It's certainly far more than a day's work if you handwrite a lexer and parser
that do the amount of additional work that yours does (AST construction; a lot
of error reporting and sanity checking). But you can get very far with C very
quickly if you use parser-generation tools, have prior experience writing
compilers, and your goal is "just" to get something to parse it as quickly as
possible - it's a tiny language.

Of course, in practice most real compilers don't use these parser-generation
tools, exactly because things like proper error reporting are far harder, and
a simple recursive descent parser is so much easier to work with.

------
jdp
This is really similar to Marc Feeley's tinyc.c[0], which implements a C
subset parser, code generator, and virtual machine in about 300 lines of C.

[0]:
[http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Comp...](http://www.iro.umontreal.ca/~felipe/IFT2030-Automne2002/Complements/tinyc.c)

~~~
abecedarius
That's a nice one. The big difference is that it doesn't compile _itself_ --
not that there's anything magical about that, but it's a kind of threshold of
seriousness.

~~~
userbinator
One of the things I've wondered about is how small one could make an ISO
C89-compliant compiler (that would be able to compile itself), and all these
tiny compiler projects have inspired me to revisit that thought now... I've
written pieces of compilers like expression parsers and tokenisers, and even
then I felt like it wouldn't be so hard (if I had the time) to write a full
compiler.

These are all great for dispelling the notion that all compilers somehow
necessarily have to be greatly complex and impenetrable to anyone but highly-
trained professionals and theoreticians. (Look at the Dragon Book, for
example.)

~~~
abecedarius
You might enjoy
[http://canonical.org/~kragen/sw/urscheme/](http://canonical.org/~kragen/sw/urscheme/)
-- it's by someone who'd just written his first compiler, with links to other
sources he found inspiring (one of them mine).

C89 I think has too much complexity for the amount of power it offers. lcc is
a nice example: Norman Ramsey said he asked one of the authors what he learned
in writing it, and got an answer like "Not to write a compiler in C." But
anyway the book about it
[https://sites.google.com/site/lccretargetablecompiler/](https://sites.google.com/site/lccretargetablecompiler/)
is very good.
[http://www.cs.princeton.edu/~appel/modern/](http://www.cs.princeton.edu/~appel/modern/)
is my favorite general text.

------
amelius
Next challenge: a fully compliant C++11 compiler written in 4 functions (1)

_Then_, I'd be impressed :)

(1) At maximum 500 lines per function, where each line is at most 200
characters long.

------
vesinisa
Neat, but this is not a general purpose C interpreter. It seems to lack the
preprocessor and is only able to execute the example program and itself
because it has wrapper implementations for a static set of standard library
functions including printf(), open(), read() and malloc().[1] Use a standard
library function it does not readily support, and you're out of luck.

[1]:
[https://github.com/rswier/c4/blob/master/c4.c#L492-L498](https://github.com/rswier/c4/blob/master/c4.c#L492-L498)

------
agumonkey
I wonder if it's faster than
[http://bellard.org/tcc/](http://bellard.org/tcc/) (in compiled lines per
second)

------
nichochar
For mac users: gcc -Wno-all -arch i386 -o c4 c4.c

------
methyl
Unfortunately, I cannot compile it on my OSX 10.10... here's what I get:
[http://pastebin.com/cVvaYFEH](http://pastebin.com/cVvaYFEH)

EDIT: just make the next() function void and it works.

EDIT2: still no fortune :(

      $ ./c4 hello.c
      [1]    33920 segmentation fault  ./c4 hello.c

~~~
cremno
I think it doesn't work on x86-64, since the code assumes

        sizeof(int) == sizeof(void*)

~~~
jparishy
Yep, you can compile it on 64 bit OS X with clang's -m32 option and it should
work:

        ➜  c4 git:(master) ✗ clang -m32 c4.c
        ...
        ➜  c4 git:(master) ✗ ./a.out hello.c 
        hello, world
        exit(0) cycle = 9

------
kgabis
Wow, nice. I once managed to write a brainfuck compiler/vm in ~100 lines of C
[1], but this is impressive.

[1]
[https://github.com/kgabis/brainfuck-c](https://github.com/kgabis/brainfuck-c)

------
ttouch
[http://git.dzervas.gr/c4](http://git.dzervas.gr/c4) An attempt to make this
readable :) (at least with the right syntax...)

------
UnclePeepingSam
that's really few points blowing up a plane!

------
marcofiset
I honestly think this is ridiculous. Sure, this is an incredible feat, and
congrats. But seriously, I would be ashamed to publish such unreadable code
under my name.

What about naming your variables with descriptive names?

What about extracting complex conditions into well-named functions to make it
clear what is going on (thus defeating the purpose of the "4 functions")?

This list could go on forever...

Writing software is not a contest for who can write the most amount of code in
the most cryptic way.

~~~
privong
> Writing software is not a contest for who can write the most amount of code
> in the most cryptic way.

It can be: [http://ioccc.org/](http://ioccc.org/)

~~~
Lerc
And indeed, that is how TCC [http://bellard.org/tcc/](http://bellard.org/tcc/)
began its life [http://bellard.org/otcc/](http://bellard.org/otcc/)

