
9cc: A Small C Compiler - matt_d
https://github.com/rui314/9cc
======
kazinator
Regarding your Makefile; you should still pass CFLAGS through to the compiler
when linking, not only LDFLAGS. Suppose CFLAGS contains -m32 (supported by an
x86-64-targeted GCC to produce 32-bit code). You compile the .o files with that, but
then link without it, which fails trying to make a 64 bit executable out of 32
bit .o's. Some crazy distros pass a --sysroot in CFLAGS; if you don't have
that, your build finds the wrong library and header files.

Even when linking, don't call "cc", but "$(CC)". You're relying on the
implicit rule to compile your .c to .o which will use $(CC). $(CC) could be
some ARM cross-compiler supplied by a distro. When you link using "cc", you
end up using the build-machine's native compiler.

Write the build system as if this is an awesome program that major distros will
be eager to pick up, and make it easy for the package maintainer. :)

Speaking of CFLAGS, only set that conditionally; don't clobber it:

    
    
      CFLAGS ?= -O2   # Only if CFLAGS is not specified externally, use this default
    

For flags that your program needs in order to build right, use other
variables of your own:

    
    
      DIALECT_CFLAGS := -std=c11
    
      CFLAGS += $(DIALECT_CFLAGS)  # integrate external CFLAGS with our own.
    

Same deal with LDFLAGS. Both CFLAGS and LDFLAGS can include important things
that cause a bad build if you mess with them.
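Pulling those suggestions together, a distro-friendly Makefile might look something like this sketch (the target and file names here are illustrative, not 9cc's actual Makefile):

```make
# Default optimization only if the packager didn't set CFLAGS externally.
CFLAGS ?= -O2

# Flags this project itself requires live in our own variables,
# then get merged with whatever the environment supplied.
DIALECT_CFLAGS := -std=c11
CFLAGS += $(DIALECT_CFLAGS)

SRCS := $(wildcard *.c)
OBJS := $(SRCS:.c=.o)

9cc: $(OBJS)
	# Link with $(CC) and pass CFLAGS through, so -m32, --sysroot, or a
	# cross-compiler chosen by the distro still takes effect at link time.
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $(OBJS) $(LDLIBS)
```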

~~~
Zelizz
Where would one go to find more of this conventional Makefile wisdom? I've had
so many issues trying to use make the "right" way (flexible, clean, terse,
etc.).

I feel like one of the best ways to acquire this wisdom is to post a project
with lots of mistakes and let people tear it apart.

~~~
emmelaich
[https://nostarch.com/gnumake](https://nostarch.com/gnumake)

~~~
teddyh
That is not, contrary to what the name might imply, the official GNU Make
manual. That one can be read here, for free:
[https://www.gnu.org/software/make/manual/](https://www.gnu.org/software/make/manual/)

~~~
emmelaich
I don't take the name as implying that.

Anyway, it is excellent and has some good advice and tips.

Unlike the official manual which is more a reference.

------
akkartik
[https://github.com/rui314/9cc/blob/882e4b2dd8/main.c#L7](https://github.com/rui314/9cc/blob/882e4b2dd8/main.c#L7)

    
    
        int main(int argc, char **argv) {
          ...
    
          Vector *tokens = tokenize(path, true);
          Program *prog = parse(tokens);
          sema(prog);
          gen_ir(prog);
    
          if (dump_ir1)
            dump_ir(prog->funcs);
    
          optimize(prog);
          liveness(prog);
          alloc_regs(prog);
    
          if (dump_ir2)
            dump_ir(prog->funcs);
    
          gen_x86(prog);
          return 0;
        }
    

This is wonderful.

~~~
mmirate
And wonderfully devoid of error handling, too. That's the most common way
for beautiful-looking C code to look beautiful.

~~~
caf
A compiler is in the happy position where there is little point in continuing
to run after encountering an error, so it can bail right out with exit(2)
after reporting the error to the user. This means that the contract on
parse(), for example, can be that if it returns, it has succeeded.

~~~
vlovich123
Except of course LLVM has proven the value of not assuming this pattern and
building your compiler as a library, of which the executable entry point is
but one frontend.

~~~
Asooka
Exceptions are the exit() of libraries. Or if you're using plain C,
setjmp/longjmp might be a good idea, depending on what you're doing.

------
kazinator
Small nitpick. Functions in C that take no arguments are written
name(void){...}, not name(){...}. The latter form is an old-style definition
that doesn't introduce a prototype into the scope.

For functions that have prototypes, it is okay, but when they do not, you're
losing type checking (yes, even on the number of arguments).

The following program compiles with no diagnostics for me with -W -Wall -ansi
-pedantic, with GCC 7.3 on Ubuntu 18:

    
    
      int func()
      {
        return 0;
      }
    
      int main()
      {
        func(3);
        return 0;
      }
    

-std=c11 (as you're using) makes no difference.

The same is not true of C++: func() in C++ is a prototype definition. C++
supports (void) for compatibility with C, but even in nonsensical contexts:
class::class(void);

~~~
rui314
Quote from 6.7.6.3.14 of the C++11 spec ([http://www.open-
std.org/jtc1/sc22/wg14/www/docs/n1548.pdf](http://www.open-
std.org/jtc1/sc22/wg14/www/docs/n1548.pdf)):

> An identifier list declares only the identifiers of the parameters of the
> function. An empty list in a function declarator that is part of a
> definition of that function specifies that the function has no parameters.
> The empty list in a function declarator that is not part of a definition of
> that function specifies that no information about the number or types of the
> parameters is supplied.

So, my interpretation is that the following two function _definitions_ define
the same function of the exact same type

    
    
      void func() {}
      void func(void) {}
    

although the following two function _declarations_ declare two functions of
different types

    
    
      void func();
      void func(void);

~~~
kazinator
That is of course the C11 (draft) spec, not C++.

Interesting find there. The wording is also in the C99 draft; it is not new.

It is in fact saying that the empty list in a definition is a special case and
does declare that the function takes no parameters. To "specify" here can be
understood as inserting information about the type into the declaration
scope, which is what a declaration does.

So that's a bit of a bug in GCC there; it should be treating this the same as
(void) and therefore diagnosing that way. If not by default, then at least
when -pedantic is applied. But nope:

    
    
       $ gcc -Wall -W -std=c11 -pedantic proto.c
       $

~~~
nezirus
My understanding is that, unless you have a function prototype, you will lose
type info about the parameters:

"The empty list in a function declarator that is not part of a definition of
that function specifies that no information about the number or types of the
parameters is supplied."

i.e. the most important part is the end of that sentence, so GCC's behavior
should be right.

~~~
kazinator
Note the: _" that is not part of a definition"_.

This is about declarations only.

------
utam0k
I'm rewriting 9cc in Rust, starting from the first commit.
[https://github.com/utam0k/r9cc](https://github.com/utam0k/r9cc)

------
userbinator
Good to see more C/subset-C compilers being written. Besides the immense
pedagogical value, they are also useful for preventing the "trusting trust"
attack: [https://dwheeler.com/trusting-trust/](https://dwheeler.com/trusting-
trust/)

I will also suggest using precedence climbing instead of plain recursive
descent for the parser; it makes the parser even simpler and table-driven,
which is important for a language like C that has many precedence levels.

~~~
marktangotango
Interesting, do you have any useful links related to this concept?

~~~
userbinator
Here is the discussion of the original Trusting Trust attack:
[https://news.ycombinator.com/item?id=13569275](https://news.ycombinator.com/item?id=13569275)

~~~
marktangotango
Thanks, I was referring to the precedence climbing parser part of the GP reply.

~~~
catpolice
[https://eli.thegreenplace.net/2012/08/02/parsing-
expressions...](https://eli.thegreenplace.net/2012/08/02/parsing-expressions-
by-precedence-climbing)

Precedence climbing is also (intimately) related to Pratt parsing, and there's
a useful series of articles about the two here:
[https://www.oilshell.org/blog/2017/03/31.html](https://www.oilshell.org/blog/2017/03/31.html)

------
aidenn0
It lists a goal as "compiling real-world programs such as the linux kernel"

The last time I investigated, the Linux kernel had so many GCC-isms that if
you could compile the Linux kernel, you could probably compile any program
targeting GCC.

~~~
wtracy
Other "alternative" compilers like tcc have pulled it off, supposedly without
patching the kernel, so it is doable.

~~~
e12e
Does tcc compile a vanilla kernel? I know tccboot added some minor patches:

[https://bellard.org/tcc/tccboot_readme.html](https://bellard.org/tcc/tccboot_readme.html)

------
EdSchouten
Interesting to mention: the author of this compiler, Rui Ueyama, is also a
developer at the LLVM project. He's one of the most active developers of LLD,
LLVM's linker.

------
bakul
How does this compare to Nils Holm’s subc compiler that has somewhat similar
goals? There is a book describing the compiler as well (though the current
version of the compiler supports a larger subset of the C language).

[https://www.t3x.org/subc/index.html](https://www.t3x.org/subc/index.html)

------
pjmlp
I really wish people would choose better names; just add a hyphen and it
becomes a C dialect from the early '80s.

[https://en.wikipedia.org/wiki/Small-C](https://en.wikipedia.org/wiki/Small-C)

~~~
creatornator
"a small c compiler" is just the description, "9cc" is the name itself, so
there's not much similarity there...

------
csours
I wonder if this is a reference to the band 10cc -
[https://en.wikipedia.org/wiki/10cc](https://en.wikipedia.org/wiki/10cc)

~~~
rpz
G is the 7th letter of the alphabet

~~~
csours
It took me a distressingly long time to figure out what you meant by this.

------
jokoon
How does this compare with tcc?

------
rpz
kparc.com/b

------
mshockwave
> no memory management is the memory management policy in 9cc. We allocate
> memory using malloc() but never call free().

> I know that people find the policy odd, but this is actually a reasonable
> design choice for short-lived programs such as compilers.

I strongly disagree on this point. Memory management is important even for
short-lived programs. It would burden the OS if you invoke this kind of
"short-lived, memory-management-free" program multiple times.

> This policy greatly simplifies code and also eliminates use-after-free bugs
> entirely

Not facing a problem is definitely not a good way to solve it. Honestly, it's
a bad attitude for a programmer.

~~~
cperciva
> Memory management is important even for short-lived programs. It would
> burden the OS if you invoke this kind of "short-lived, memory-management-
> free" program multiple times.

All the memory is freed when the process exits. Why does it matter if you run
the program multiple times?

~~~
sifoobar
It will put more pressure on the allocator while running, and doing that a lot
will likely have some cumulative consequence down the line. I know from
experience [0] that even reusing allocated memory rather than bouncing it back
to malloc can have dramatic effects.

[0]
[https://gitlab.com/sifoo/snigl/blob/master/src/snigl/pool.h](https://gitlab.com/sifoo/snigl/blob/master/src/snigl/pool.h)

~~~
wtracy
Malloc doesn't interact with the kernel at all. The kernel sees pages, not the
data structures that malloc manages. The kernel doesn't even know whether you
free()'d the memory by the time the process exits.

There is exactly zero difference from the operating system's perspective
between freeing and not freeing the memory before program termination (except
that one might have a higher peak memory usage).

The classic implementations of several standard Unix utilities deliberately
never deallocated memory for performance reasons. If there was a problem with
this practice, I would expect Bell Labs--of all places--to know not to do it.

~~~
dexen
> _There is exactly zero difference from the operating system's perspective
> (...)_

Close, but not quite: in the case of larger allocations, malloc() tends to use
mmap() with MAP_ANON rather than brk() to request memory from the OS. For
example, glibc's malloc() uses mmap() when the requested size exceeds
MMAP_THRESHOLD, which is 128 kB by default.

The mmap() approach gives large, contiguous memory blocks that can also be
easily released via munmap() with little to no bookkeeping needed[1], and it
is not subject to the same fragmentation woes as memory allocated via brk(),
as long as your address space is significantly larger than the allocated
memory.

That aside, I fully agree with the author of 9cc.

[1] in fact a simplistic libc memory allocator could directly wrap malloc()
around mmap(), free() around munmap(), and realloc() around mremap(),
leveraging the in-kernel allocator at the cost of one syscall (with at least
two context switches) per call - i.e., _slowww_

~~~
wtracy
An implementation certainly _could_ return memory to the OS on calls to
free(), but to my knowledge none of the widely used implementations do so. (I
would be interested to learn of counterexamples!)

~~~
cperciva
jemalloc can use madvise(MADV_FREE) to return pages to the OS.

