
C4: C in Four Functions (2014) - azhenley
https://github.com/rswier/c4
======
tptacek
I unironically love this code. We used it as a starting point (or rather, the
same design; we wrote in Go) for reversing challenges at our last startup,
symbolically evaluating the stack machine bytecode to generate AVR.

The trick to reading it:

* next() is the lexer

* expr() is a precedence-climbing expression parser

* stmt() parses statements and generates code

* main() has the virtual machine loop.
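For readers who have not met the term, precedence climbing can be sketched in a few lines. This is not c4's code: it evaluates integer expressions directly instead of emitting bytecode, handles only `+ - * /` and parentheses, and every name in it (`eval_expr`, `climb`, `primary`, `prec_of`, `src`) is made up for illustration.

```c
#include <assert.h>
#include <ctype.h>

static const char *src;   /* cursor into the expression text (hypothetical name) */

static void skip_spaces(void) { while (*src == ' ') src++; }

/* Binding strength of a binary operator; 0 means "not an operator". */
static int prec_of(char op) {
    if (op == '+' || op == '-') return 1;
    if (op == '*' || op == '/') return 2;
    return 0;
}

static long climb(int min_prec);

/* A primary is a decimal number or a parenthesised sub-expression. */
static long primary(void) {
    long v = 0;
    skip_spaces();
    if (*src == '(') {
        src++;
        v = climb(1);
        skip_spaces();
        assert(*src == ')');
        src++;
        return v;
    }
    assert(isdigit((unsigned char)*src));
    while (isdigit((unsigned char)*src)) v = v * 10 + (*src++ - '0');
    return v;
}

/* Keep consuming operators whose precedence is at least min_prec. */
static long climb(int min_prec) {
    long lhs = primary();
    for (;;) {
        skip_spaces();
        char op = *src;
        int prec = prec_of(op);
        if (prec == 0 || prec < min_prec) return lhs;
        src++;
        long rhs = climb(prec + 1);   /* prec + 1 makes the operators left-associative */
        if (op == '+') lhs += rhs;
        else if (op == '-') lhs -= rhs;
        else if (op == '*') lhs *= rhs;
        else lhs /= rhs;              /* no divide-by-zero check in this sketch */
    }
}

/* Evaluate a whole expression string. */
long eval_expr(const char *text) {
    src = text;
    return climb(1);
}
```

Calling `eval_expr("1 + 2 * 3")` walks the input once, left to right, and the recursion depth tracks operator precedence rather than grammar rules, which is why a single function can cover every binary operator.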

~~~
lifthrasiir
C4 really has some good ideas of its own.

* It never bothers to produce machine code. The use of a stack-based VM simplifies the single-pass compilation, and it never makes use of unknown library functions, so no machine-specific knowledge is required (like dlsym in Bellard's OTCC [1]).

* It chose its primitives wisely. It never implements structs and returning with values, for example, but the code is structured carefully enough that the lack of them doesn't make it harder to read.

* And yet it has tons of little tricks. Switching from r-value to l-value (triggered when infix `=` or postfix `++`/`--` are read) is a single opcode fix. Reserved words are initialized from an imaginary source code. The type is represented by a single number 2n+k where n is the number of indirections.

[1] [https://bellard.org/otcc/](https://bellard.org/otcc/)
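The 2n+k type encoding is easy to play with in isolation. A minimal sketch, assuming k is 0 for char and 1 for int and each level of indirection adds 2; the helper function names are invented here and are not taken from c4.

```c
#include <assert.h>

/* Base types plus the per-indirection increment, matching the 2n+k description. */
enum { CHAR = 0, INT = 1, PTR = 2 };

/* Wrap a type in one more level of indirection. */
int pointer_to(int ty) { return ty + PTR; }

/* Strip one level of indirection; only meaningful for pointer types. */
int pointee_of(int ty) { assert(ty >= PTR); return ty - PTR; }

/* n in 2n+k: how many levels of indirection the type carries. */
int indirections(int ty) { return ty / PTR; }

/* k in 2n+k: the base type at the bottom of the pointer chain. */
int base_type(int ty) { return ty % PTR; }
```

With this scheme `char **` is `pointer_to(pointer_to(CHAR))`, i.e. 4, and "is this a pointer?" is a single comparison against `PTR`.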

~~~
ash
> Reserved words are initialized from an imaginary source code.

What does it mean? Could you point to a place in the source?

~~~
lifthrasiir
I mean this part:

    
    
        p = "char else enum if int return sizeof while "
            "open read close printf malloc free memset memcmp exit void main";
        i = Char; while (i <= While) { next(); id[Tk] = i++; } // add keywords to symbol table
        i = OPEN; while (i <= EXIT) { next(); id[Class] = Sys; id[Type] = INT; id[Val] = i++; } // add library to symbol table
        next(); id[Tk] = Char; // handle void type
        next(); idmain = id; // keep track of main
    

Throughout the entire source code p is a source code pointer, but at the very
beginning of the program it is a string containing all reserved words and
library functions, and they are read with the same lexing function `next` to
the symbol table before the memory for the actual source code is allocated.
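The same idea in a stripped-down form, as a sketch under some loud assumptions: it uses a struct-based symbol table (c4 has no structs), only recognizes alphabetic identifiers, and the token codes and the names `seed_keywords` and `first_token` are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { TK_ID = 128, TK_CHAR, TK_ELSE, TK_IF, TK_WHILE };   /* hypothetical token codes */

struct sym { const char *name; int len; int tk; };

static struct sym table[64];
static int nsyms;
static const char *p;     /* source pointer: first the keyword string, later real input */
static struct sym *id;    /* entry found or created by the most recent next() */

/* Lex one identifier at p; look it up, or add it as a plain identifier. */
static void next(void) {
    while (*p == ' ') p++;
    const char *start = p;
    while ((*p >= 'a' && *p <= 'z') || (*p >= 'A' && *p <= 'Z') || *p == '_') p++;
    int len = (int)(p - start);
    for (int i = 0; i < nsyms; i++) {
        if (table[i].len == len && memcmp(table[i].name, start, (size_t)len) == 0) {
            id = &table[i];
            return;
        }
    }
    assert(nsyms < 64);
    table[nsyms].name = start;
    table[nsyms].len = len;
    table[nsyms].tk = TK_ID;
    id = &table[nsyms++];
}

/* Seed the table by lexing a string of keywords and retagging each new entry. */
void seed_keywords(void) {
    nsyms = 0;
    p = "char else if while";
    for (int tk = TK_CHAR; tk <= TK_WHILE; tk++) { next(); id->tk = tk; }
}

/* Lex the first identifier of text and return its token code. */
int first_token(const char *text) {
    p = text;
    next();
    return id->tk;
}
```

After `seed_keywords()`, the lexer needs no special keyword branch: a later `next()` on real source finds "while" already in the table and simply reports the token stored there.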

------
skrebbel
I really like this, but I wonder why somehow "four functions" has to imply
"needlessly cryptic variable names". Is that just part of the exercise in
minimalism, i.e. art?

~~~
GuB-42
I found the code surprisingly readable.

Most variable names are what I expected them to mean despite their shortness:
pc, sp, bp are registers, a is the accumulator, fd is a file descriptor (of
the input file, what else?), tk is for the token, t is temporary, etc... For
the less obvious ones, it is usually not that hard to infer their meaning from
either the code or comments.

Because yes, there are comments, not many, but they are helpful. For example,
the VM has unusual instructions (for me) like LEV and ADJ, and they are
commented. The "obvious" ones like MUL and SHR are not.

The variable names are not "needlessly cryptic". I've seen (and written, not
proud of it) a lot of needlessly cryptic variable names, and believe me, these
are crystal clear by comparison. Here, there is a clear influence from
assembly mnemonics that really helps understanding.

Now on the why. This is minimalism, and minimizing the number of comments and
variable name length is part of it. It is actually a very interesting
exercise. The golden rule in making understandable code is making it as short
as possible. There is a limited amount of space on your screen and in your
mind, and the shorter your code is, the more you can see/understand at once.
Of course, too much is too much, you don't want to do things IOCCC style, and
striking a balance is difficult. So once in a while, reading or writing very
compact code can help you understand where shaving off characters is fine and
where it really hurts understanding.

~~~
LeifCarrotson
Why not just write out ProgramCounter, StackPointer, Accumulator,
InputFileDescriptor etc? It doesn't take significantly longer to type. It's
faster to read because you can recognize the shape of those words and don't
have to mentally substitute the actual words. Code is for reading, not for
writing.

~~~
GuB-42
Because it makes the lines longer and long lines are bad. If it results in a
horizontal scrollbar, it is terrible, but even without it, there is a reason
papers are often printed in column format and most coding rules specify a
maximum line length (often 80, though 120 is becoming popular these days, with
big wide screens and all that).

So long lines need to be split. Which is difficult to do properly and results
in more lines, and more lines mean less of the code is visible at once and
that makes it harder to see the big picture.

But to each his own I guess. Anyway, you can try it out yourself. Just take
the code, do the replacements and see for yourself.

~~~
Hnrobert42
Are you reading this on an Apple Watch? I still generally use 80 characters
out of habit, but given how monitors have grown, 120 or even 140 should be the
new norm.

~~~
clarry
Adding an extra column for code|docs|other context is so much more useful than
allowing longer lines for obese identifiers that rarely serve to make a point
more clear.

I'll take my four or five columns of 80 chars over two columns of 120-140
chars any day.

------
jacobvosmaer
Clang on my macOS doesn't like the fact that main() takes and returns long
long. But it appears GCC on Debian 8 (random Linux I have lying around)
doesn't mind.

I get it to compile on macOS if I remove the int define but then it segfaults
when you run it. I wonder if there is some magic flag to make "main does not
return int" a non-fatal error on Clang?

It's fun to read this code but running it is even more fun, you can see the VM
code it generates, with source line annotations and all.

~~~
flohofwoe
The "#define int long long" is annoying indeed. A quick hack to make it work
is:

#include <stdint.h>

...and then replace the two "int" in main() with int32_t.

~~~
psychoslave
Doesn't that make it fail to maintain the self hosting property that is most
likely behind the introduction of this define?

------
hereisdx
Can someone please explain to me what this code does? I know some C, but I
couldn't understand this.

~~~
bluetomcat
A one-pass compiler for a subset of C, relying on a recursive-descent parser,
doing the lexing, parsing and code generation in lockstep. The generated code,
consisting of abstract machine instructions, is then executed by an
instruction fetch and execute loop.

BTW, the code looks relatively short because many semicolon-separated
statements are crammed on a single line, and short variable names make that
somewhat manageable and even visually symmetric. If you were to unfold it with
each statement on its own line, I guess it would be at least 3 times the size.
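The "instruction fetch and execute loop" part can be sketched on its own. This assumes a tiny accumulator-plus-stack machine with five invented opcodes; c4's real instruction set is larger and this is not its code. The if/else chain is deliberate, since the subset of C being described has no switch.

```c
#include <assert.h>

/* Hypothetical opcode subset for illustration. */
enum { IMM, PSH, ADD, MUL, EXIT };

/* Run bytecode and return the accumulator when EXIT is reached. */
long run(const long *pc) {
    long stack[64];
    long *sp = stack + 64;   /* stack grows downward */
    long a = 0;              /* accumulator */
    for (;;) {
        long op = *pc++;                                               /* fetch */
        if      (op == IMM) a = *pc++;                                 /* load immediate operand */
        else if (op == PSH) { assert(sp > stack); *--sp = a; }         /* push accumulator */
        else if (op == ADD) { assert(sp < stack + 64); a = *sp++ + a; } /* pop and add */
        else if (op == MUL) { assert(sp < stack + 64); a = *sp++ * a; } /* pop and multiply */
        else if (op == EXIT) return a;
        else return -1;      /* unknown instruction; a sketch-level error signal only */
    }
}
```

A one-pass compiler would emit `IMM 2, PSH, IMM 3, PSH, IMM 4, MUL, ADD, EXIT` for `2 + 3 * 4`: each operand is generated as soon as it is parsed, and the operator instruction follows once its right-hand side is complete.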

~~~
epicide
> A one-pass compiler for a subset of C, relying on a recursive-descent
> parser, doing the lexing, parsing and code generation in lockstep. The
> generated code, consisting of abstract machine instructions, is then
> executed by an instruction fetch and execute loop.

This should be added to the README.

------
Iv
Am I the only one annoyed by the fact that the main loop has a series of
if..else instead of a giant switch statement?

But yes, many people rely on parsers and VMs without really knowing how they
work and assume some black magic, whereas it can be really simple and
elegant.

~~~
asdasdasdasdwd
Wouldn't the compiler optimize it either way?

~~~
Iv
Hmmm, gcc may be smart enough indeed.

------
6nf
This is awesome, less than 500 lines for a self-compiling C compiler!

~~~
TickleSteve
...and virtual-machine.

------
nighthawk7000
Seeing this made my day. There's something calming about seeing C used
so...hmm....elegantly?

Well maybe that's not the right word. But the minimalism of it is soothing.

------
michaelfeathers
It looks like a great base for a refactoring exercise.

~~~
abecedarius
I considered doing that once. Trouble is, how do you test it? It comes with
one tiny example in the C subset, plus one substantial one (itself). Its
behavior is generally undefined when the input departs from the subset. It
seemed like more work to address these issues than the fun you could have in
messing with the code.

So I sketched my refactorings without bothering to check them or to publish
them.

~~~
michaelfeathers
This is a way you can approach it.

[https://michaelfeathers.silvrback.com/characterization-testing](https://michaelfeathers.silvrback.com/characterization-testing)

------
jancsika
Why all the printf copy/pasta around the opcode enum?

Can't the author use pre-processor string concatenation to write once and then
leverage that to map enum to string and back?

At least I'm doing that in a little scripting language parser and it looks
portable.

Edit: I'm actually using pre-processor string concatenation for something
unrelated. You don't even need that to do the mapping.

~~~
dmytrish
c4 is meant to be self-hosted and it does not implement a preprocessor. It was
easier to re-use printf creatively.
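One preprocessor-free way to map an opcode number to its name is to pack fixed-width names into a single string and let a `%.4s` precision cut out the right slice. The sketch below illustrates that general technique with a four-entry table; the names `op_names` and `op_name` and the shortened opcode list are assumptions for the example, not a copy of c4's table.

```c
#include <assert.h>
#include <stdio.h>

enum { LEA, IMM, JMP, JSR, OP_COUNT };   /* short hypothetical prefix of an opcode enum */

/* Each entry is exactly 5 bytes: a 4-character name padded with spaces, then a comma. */
static const char op_names[] = "LEA ,IMM ,JMP ,JSR ,";

/* Write the 4-character mnemonic for op into buf, which must hold at least 5 bytes. */
void op_name(int op, char *buf) {
    assert(op >= 0 && op < OP_COUNT);
    /* "%.4s" stops after 4 characters, so entries need no terminator of their own. */
    snprintf(buf, 5, "%.4s", &op_names[op * 5]);
}
```

The enum order and the string order have to be kept in sync by hand, which is the price of avoiding macros; the benefit is that the lookup is one multiplication and one format call.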

------
rini17
Anything like that for C++? /s

------
schoen
(2014)?

~~~
keanzu
[https://news.ycombinator.com/item?id=8558822](https://news.ycombinator.com/item?id=8558822)

------
sedatk
I guess switch/case support is in order.

~~~
abecedarius
An earlier version had this, and basic structs too. It was only a little
longer, at least in line count, but harder to figure out. It took me four
evenings with a printout and a red pen before I was satisfied with my
understanding.

(I have mixed feelings about this code: it has both good ideas and pointless
obscurity. I guess the newer version would've taken me 'only' three evenings.)

------
jessermeyer
Faster than CPython? ;)

------
BadMrFrosty
Surprisingly succinct.

------
dntbnmpls
hello dot c my old friend, i've come to compile you again...

------
ConradKilroy
Thanks for commenting the code. _WTF sarcasm_

------
zelphirkalt
The repo says "An exercise in minimalism." – Looking at the code, there is
nothing minimalistic about it. Perhaps in terms of C it can be called that,
but in terms of programming languages in general, this cannot be called
"minimal" at all.

The code quality is terrible. Few explaining comments (actually most of those
are only 1 or 2 words, so not explaining anything at all) and almost all
variable names consist of 1 or 2 characters, which do not say what the thing
actually is. Then to achieve this arbitrary goal of doing it "in 4 functions"
(actually procedures) loads of stuff was apparently stuffed into those 4
functions, so much so, that they are longer than a whole page of code. Most of
it looks like gigantic switch statements. It's horribly written code. It looks
like what I think of as a C nightmare.

I will admit, I could not write such a thing myself. I lack the knowledge for
writing such low-level stuff, do not use C, and if I had that knowledge to
do it, my inner drive to do a cleaner job than anything remotely looking like
that code, would prevent me from ever sharing such a thing with anyone in
public. At least group cases into procedures as it makes sense. Even people
writing this low level type of code should be aware of how unreadable that
code is, right?

~~~
jackewiehose
> I lack the knowledge

And yet you have a strong opinion about how terrible it is.

It is very minimal and it is very readable. It is actually a joy to read. You
have no idea what a C-nightmare really is.

~~~
zelphirkalt
Yep, I can have a "strong" (what is strong?) opinion about it, because I know
what readable code in other programming languages looks like and it certainly
does not look like that. I did use C a few times in assignments and still my
code had more comments and better readable names for basically everything than
the code present in the repository.

Wait, you are telling me, that people write even less readable code in C?
Perhaps you are right. Perhaps people can really be that much without care.

It still does not make this code "very readable". If it was very readable, I
would have a vague idea about what each of the "functions" does from reading
its name or its docstring. Oh wait, there is no docstring at the beginning of
each of the "functions" and the name consists of one word, sometimes
abbreviated word. And the variable names don't give me hints either. I think
our definitions of readable code simply differ quite a lot. When I write code
myself, I am unwilling to accept code on that readability level, but we
probably have different standards.

Perhaps for entertainment purposes only, you could show me a real C nightmare.
I do honestly believe you, that there is worse ;)

~~~
clarry
I don't understand Russian, therefore Russian is unreadable.

I don't understand mathematics, therefore math is unreadable.

I don't understand music, therefore sheet music is unreadable.

That is your problem in a nutshell.

Meanwhile, people here take a glance at the code and immediately get it,
because it is written in a language they understand. These people tend to find
it quite readable -- not necessarily an example of most pretty code, but
nevertheless readable.

~~~
zelphirkalt
Still you do not address the simplest of points, which are: meaningful
variable names, meaningful procedure names, explaining comments. All of which
are minimum standards for software development these days.

I may be a C noob, as I already and in an honest way stated in the very first
post, but my points still stand. Those are not some subjective things. It is
very clear, that those things add to readability, yet the code does not have
them.

~~~
clarry
I addressed those points implicitly. Those names _are_ meaningful, in context,
to the people who understand the language; just as symbols in math are, in
context, understandable to those who understand mathematics. And thus there is
no need for comments. Arguably, there are too many comments, because I saw
many that said something obvious without adding anything useful. Short
identifiers aid reading for people who understand the language, because they
allow them to focus on what the code does (or, rather, how exactly it does
it) rather than what the things in it are (which is obvious to people who
understand the language).

