Hand-optimizing the TCC code generator (briancallahan.net)
143 points by ingve on April 7, 2022 | 36 comments



Is it always safe to make this change without considering the surrounding code? If you start with this code:

    testl %ebx, %ebx  # Set zero flag if %ebx is zero
    movl $0, %eax     # Set %eax to zero
    jz somewhere      # Jump if %ebx was zero
then after optimization you get this:

    testl %ebx, %ebx  # Set zero flag if %ebx is zero
    xorl %eax, %eax   # Set %eax to zero *and set zero flag*
    jz somewhere      # Jump always
Maybe this is OK because TCC doesn't (currently) generate any code like this.


Great point.

Now if the program were divided into basic blocks, then any jump based on flags would necessarily be the last instruction of the current basic block. Suppose the program has the property that flags are always consumed in the same basic block in which they are set. Then control-flow info is enough: whenever we replace movl $0, %eax by xorl, we just check whether we are in a basic block that ends with a jump conditional on the Z flag.

If that doesn't hold, we need data flow info. For each instruction, data flow analysis can tell us whether there are any live flags: CPU flags whose current values are still used later in the program.

E.g. say we have:

  movl $0, %eax
  movl $0, %ecx
  movl abc, %esi
  testl %esi, %esi

Here, data flow analysis tells us that the Z flag is dead at the movl instructions, because the only possible execution path reaches the testl, which overwrites the Z flag, and nothing reads the flag before then. Thus we are free to clobber it. Moreover, when we do so, the data flow picture doesn't change: the Z flag is dead at the movl, and is still dead when that is replaced by xorl. The optimization doesn't require a recalculation of the data flow info.
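
A minimal sketch of that per-block backward scan (hypothetical instruction records and names, not TCC's IR):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical instruction record: does it read or write the Z flag? */
    struct insn {
        bool reads_zf;
        bool writes_zf;
    };

    /* Backward scan over one basic block. live_out says whether the Z flag
     * may still be consumed after the block (true if the block ends in a
     * conditional jump on Z, or if we simply don't know). On return,
     * live_after[i] says whether the Z flag is live right after
     * instruction i, i.e. whether clobbering it there would be unsafe. */
    static void zf_liveness(const struct insn *block, size_t n,
                            bool live_out, bool *live_after)
    {
        bool live = live_out;
        for (size_t i = n; i-- > 0; ) {
            live_after[i] = live;
            if (block[i].writes_zf)
                live = false;   /* i redefines Z: the value from above is dead */
            if (block[i].reads_zf)
                live = true;    /* i reads Z: the value from above is live */
        }
    }

With that, rewriting movl $0, %eax at position i into xorl %eax, %eax is allowed exactly when live_after[i] is false.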


I had a vicarious feeling of satisfaction seeing that low-hanging fruit getting plucked.


I'm thinking about using TCC as a game scripting language.

Not really sure if it's a good idea or if it's practical.

I want to believe that it is, but it might be quite verbose to use.


Sure -- the compiler is small enough to include -- and fast enough. Bellard did an experiment where TCC was used to compile the Linux kernel on boot. Remarkable.

Yes, the source may be verbose. But, the API exposed would be C, and most systems have a C FFI.

Instead of TCC, you may wish to consider QuickJS (Bellard's JavaScript interpreter).

The reason I use TCC is to validate C code. TCC evaluates arguments left to right, whereas GCC does right to left...
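
Argument evaluation order is unspecified by the C standard, so both are conforming; here's a toy example (the next() helper is just for illustration) where the two orders produce visibly different output:

    #include <stdio.h>

    static int next(void)
    {
        static int n = 0;
        return ++n;   /* returns 1, 2, 3, ... in call order */
    }

    int main(void)
    {
        /* It is unspecified which argument is evaluated first, so this
         * may print "1 2" with one compiler and "2 1" with another. */
        printf("%d %d\n", next(), next());
        return 0;
    }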


Really, GCC -O1 is usually plenty fast if you are just precompiling things at startup that your main executable would dynamically load. The bigger question is, would you really want to be using a language with manual memory management for writing extensions to a program? Maybe including the Boehm GC with it would not be terrible.
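
For what it's worth, the gcc-then-dlopen path is also short; a rough sketch (POSIX dlopen/dlsym; script.c and script_main are placeholder names, error handling trimmed):

    #include <dlfcn.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Compile the extension to a shared object at startup... */
        if (system("gcc -O1 -shared -fPIC script.c -o /tmp/script.so") != 0)
            return 1;

        /* ...then load it and call into it. Link with -ldl on older glibc. */
        void *h = dlopen("/tmp/script.so", RTLD_NOW);
        if (!h) {
            fprintf(stderr, "%s\n", dlerror());
            return 1;
        }

        int (*script_main)(void) = (int (*)(void))dlsym(h, "script_main");
        if (script_main)
            script_main();

        dlclose(h);
        return 0;
    }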


A wonderful and fun read! I've been writing my own compiler for my own language, and this may just be the motivation I needed to add optimizations to it! I was always thinking about really deep semantic analysis and optimization based on that, but this seems effective, simple, and like great fun!


The generated function also has a subtract-zero-from-rsp that could just be plain eliminated. It's probably rarer to have this in general code, though, since it means there are no local variables. Looks like there is still some low-hanging fruit in there :-)


I thought maybe it was there to set or clear flags.


It looks like a generic prologue. Ideally the code generator would detect that no stack frame is required whatsoever and just emit the barebones return. It's possible that the prologue is generated ahead of time and then patched as the codegen determines how many locals it needs.

You could do a second pass if you detect no locals (you'd need to anyway, as you'll be computing new offsets for relative branches/loads/etc.), or potentially generate two versions of each function as you compile and only commit the one that is smaller, but those are some pretty big tradeoffs.

I don't have any idea how TCC is organized internally but my money would be on a generic prologue getting fixed up after-the-fact.

EDIT: I think that it might generate the prologue by rewinding to the beginning of the function here: https://github.com/TinyCC/tinycc/blob/e2e5377e7b68d57def00ac...

I'm not sure if sub rsp, 0 is faster than just filling with a NOP, but that could always be someone's future experiment.
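
For concreteness, the patch-after-the-fact scheme I'm guessing at would look roughly like this (a hypothetical sketch, not TCC's actual internals):

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical output buffer. */
    static uint8_t code[4096];
    static size_t  pos;

    static void emit(const void *bytes, size_t n)
    {
        memcpy(code + pos, bytes, n);
        pos += n;
    }

    /* Emit "sub $0, %rsp" (REX.W 81 /5 imm32) with a zero placeholder and
     * return the offset of the 32-bit immediate so it can be patched later. */
    static size_t emit_sub_rsp_placeholder(void)
    {
        static const uint8_t sub_rsp_imm32[] = { 0x48, 0x81, 0xEC, 0, 0, 0, 0 };
        emit(sub_rsp_imm32, sizeof sub_rsp_imm32);
        return pos - 4;
    }

    /* Once the size of the locals is known, overwrite the immediate in place
     * (x86 immediates are little-endian, so memcpy is fine on an x86 host). */
    static void patch_frame_size(size_t imm_off, uint32_t frame_size)
    {
        memcpy(code + imm_off, &frame_size, 4);
    }

If the final frame size comes out as zero, that recorded offset would also be the obvious place for a later pass to drop the instruction (or swap in the NOP).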


Semi-related: besides TCC, are there any other C-ish compilers any of you can recommend that are easily ported to a new ISA?

Been playing with FPGA softcores, and while writing a compiler can be fun it kinda detracts from the main project.


There isn't really any value in doing that kind of stuff honestly unless you just like really boring and annoying work. Like, if you are designing your own ISA, you can punt that down to basically the end of the line, because all the actual important stuff comes first.

There's a kind of great irony to compiler work like this where you need some outside thing to drive the design, but if the design being driven ultimately isn't very bespoke, you eventually realize everyone else has done all the work already so everything you have to do is boring (e.g. wire up a new ABI to some existing codegen... how riveting...) Alternatively you have something unique, and find out the compiler is actually the hard part of the whole problem, because now you're on your own.

If you want a tool for early programming your own ISA I recommend using Lisp or something else as a macro assembler to exercise the system. If you simply insist on arbitrary C code, something like WebAssembly might actually be better than patching some random 3rd-party code generator (i.e. just write a simple one-shot, one-pass compiler that does no advanced regalloc or optimization and just turns it into a binary; this is relatively easy, does not require advanced algorithms, and there are already a bunch of WASM compilers available).


> If you want a tool for early programming your own ISA I recommend using Lisp or something else as a macro assembler to exercise the system.

That something else should be, and hear me out, Forth.

Yes I'm perfectly serious, if you look around you can find some interactive AVR Forth assemblers which are in the 2-4K installed range.

This is Forth's niche. It's a surprisingly capable language for many things, but interactively programming a bespoke ISA is where Forth comes from.

Firm recommendation to at least read A Problem Oriented Language if working in this space; it's a combination of memoir and a distinctive philosophy of software development.


But this is TCC's strength. It IS a simple one-shot, one-pass compiler that does no advanced regalloc or optimization! With a single-file code generator that has already been ported. Goes straight from C source to executable binary in memory.

The main problem is that TCC has been "extended" to take on other roles (like linking ELF).
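
The in-memory path through libtcc is correspondingly short; a minimal sketch against the 0.9.27 API (later versions drop tcc_relocate's second argument), built with -ltcc:

    #include <stdio.h>
    #include <libtcc.h>

    static const char *src = "int add(int a, int b) { return a + b; }";

    int main(void)
    {
        TCCState *s = tcc_new();
        if (!s) return 1;

        tcc_set_output_type(s, TCC_OUTPUT_MEMORY);      /* JIT straight into memory */
        if (tcc_compile_string(s, src) < 0) return 1;   /* C source -> machine code */
        if (tcc_relocate(s, TCC_RELOCATE_AUTO) < 0) return 1;

        int (*add)(int, int) = (int (*)(int, int))tcc_get_symbol(s, "add");
        if (add)
            printf("%d\n", add(2, 3));                  /* prints 5 */

        tcc_delete(s);
        return 0;
    }

For the game-scripting case upthread, tcc_add_symbol would be the hook for exposing engine functions to the scripts.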


Sure but if you're working on a custom softcore processor, there's a lot to do, and a lot to learn, and it's very easy to stall out on boring parts, because in a production setting there is a lot of boring stuff. A C compiler is "boring stuff." Unless your ISA design is non-traditional, for sequential scalar machines I feel you're mostly just doing tedium to emit slightly different instructions from an existing codebase. It's not that it isn't valuable to have a C compiler, it's that for the purposes of learning and absorbing information and new things, someone else has done most of the "learning" parts for you. So you could probably just... do something else.

My opinion here is broadly informed by two things, which is that 1) if you want to write a compiler, you should write a compiler, not just tweak an instruction emitter (yes I know that's actually a very common part of the job for working compiler devs, but that's not what you're doing.) And 2) people are far too eager to fall into their old ways even when it isn't productive. A C compiler is comfortable. For someone experienced, it would be easy to port an existing one -- but they probably wouldn't have been asking about it in the first place. For those other people, it's boring and a lot of work when a macro assembler like Forth or Lisp can get you going and writing higher-level code in a few hours. When you're doing things like bringing up silicon you wrote, that's way more important and much more valuable.

You shouldn't eschew things like a C compiler forever if you plan to really get serious. Sometimes you can get one for free (MIPS, RISC-V) and that's super valuable. For a production toolset, it would be absolutely essential. But when you're learning it's easy to sink a lot of time into things like that, when that's not where the real meat and potatoes are.


Yeah I figured it's a long shot. Just wanted to see if I had missed something which would allow me to get going without much effort.

> If you want a tool for early programming your own ISA I recommend using Lisp

Good point, maybe this is the time for me to learn Lisp...


I've been porting subc https://www.t3x.org/subc/ to my 36-bit RISC machine arch. It's by no means 'full' C, however:

1. It is Public Domain. I like that.

2. There is a book ($) that explains it in detail (and it's good).

I respect the design choices made, and the compiler can compile itself once bootstrapped. It's really fast, but code gen is only 'ok'. That's a tradeoff that works for me.


GCC isn't all that bad. I wrote this guide a while ago, but GCC hasn't changed too much since then: https://atgreen.github.io/ggx/



That sounds encouraging, thanks!


Perhaps the lightweight qbe compiler backend. [0] There's a C frontend for qbe called cproc. [1][2]

See also this talk [3] which discusses the relative ease of porting qbe, at 11:14.

[0] https://c9x.me/compile/

[1] https://sr.ht/~mcf/cproc/

[2] Official GitHub mirror: https://github.com/michaelforney/cproc

[3] https://spacepub.space/w/pjgvVL74xtFwqnaKuoCXZj?start=11m14s


The article itself mentions this:

https://c9x.me/compile/


One alternative is LCC:

https://en.wikipedia.org/wiki/LCC_(compiler)

It is halfway in complexity between TCC and GCC or LLVM and was created to be educational. I think there are major differences between the more recent versions and what is described in the book.


C4 comes to mind (C in 4 functions), https://github.com/rswier/c4.

Have you considered adding a backend for LLVM? Perhaps a bit heavyweight, but it could be a good way to get C/C++, Fortran, Rust, etc., if that's something you'd like!


> C4 comes to mind

Seems like an interpreter? But maybe that could be tweaked... The code is a bit messy though.

> have you considered adding a backend for LLVM?

Just to the extent that I kinda dismissed it out of the gate because it seems to be a fairly complex and dynamic target. Haven't used it personally, but a good buddy maintains a backend.


Check out PCC [1]. The last official release was in 2014, but development is still active [2].

[1] http://pcc.ludd.ltu.se/

[2] https://marc.info/?l=pcc-commit-list


Are you making your own softcores? Do you have a recommendation for getting started with that? I've recently gotten interested in making CPUs from logic gates, but I have no idea how to transfer that knowledge to FPGAs.


I'm fairly fresh to this field, so hopefully someone else can come along and give a better answer. I just picked up an iCE40[1] and Cyclone IV dev board[2] and started hacking Verilog.

I know Verilog isn't the best option around, and I'm planning on moving on to something better, but for now it was easy to get started with, and from what I could see many of those better tools end up generating Verilog, so it would be nice to be able to debug that.

If you know how to think in logic-gate terms, then FPGAs won't be that difficult, I suspect. Essentially, either you have plain logic gates where the output is purely a function of the inputs, or you have edge-triggered logic thanks to flip-flops.

Just think of it as writing code that specifies how to wire up a set of logic gates, multiplexers and flip-flops.

As a brief example, here's an excerpt from my very first ALU:

     wire [8:0] res_and;
     wire [8:0] res_xor;
     wire [8:0] res_add;
    
     assign res_and = (bus_a & bus_b);
     assign res_xor = (bus_a ^ bus_b);
     assign res_add = (bus_a + bus_b);
    
     always @(posedge clk)
       case (op_sel)
         OP_AND:  result <= res_and;
         OP_XOR:  result <= res_xor;
         OP_ADD:  result <= res_add;
         default: result <= 0;
       endcase
Here "bus_a" and "bus_b" are wires (which may come from registers), while "result" is a register.

The wires don't hold state, while registers can. The <= means to latch the right side into the left side, in this case on a positive edge of clk. The case statement is just a multiplexer.

At least to me, it feels very much like "writing a circuit".

[1]: https://www.latticesemi.com/en/Products/DevelopmentBoardsAnd...

[2]: https://www.aliexpress.com/item/32949281189.html


Have a look at this book https://cc-webshop.com/collections/books-electronics/product...

He designs a Z-80 from scratch, giving Verilog hints / good practice along the way. The design isn't 'complete' -- you have to learn what he's saying and add in the other parts.

It's not a 'trivial' book but if you want to Get Good -- recommended.


Thanks! I feel like I actually understand that Verilog code! Do you ever see the "rendered" state on the FPGA, or is that all abstracted away?

Would you recommend one of those FPGAs over another? I know iCE40 from this amazing project (https://github.com/nickmqb/fpga_craft) so maybe I'll just get that!


I also want to refer you to this http://fpgacpu.org/ especially the XSOC links. Jan builds a complete SoC-on-a-FPGA in really 'ancient' fabric (4000-series Xilinx) for Circuit Cellar (articles available on this site). He also hacks a C compiler to emit code for his 16-bit RISC machine (which is very clever).

This is historical, but a 'must read' if you're new to this stuff.


I've posted a few examples here[1]. It's from the same ALU code that I trimmed to give the example above. It just outputs some results to an RGB LED.

It shows the logical view, the allocated blocks and the routing in Lattice Radiant IDE.

I haven't done much with the Cyclone IV yet, but I found the iCE40 very easy to get rolling with, and the freely available tools have been great so far.

It's not a huge FPGA though so if you want to do large projects maybe consider something else. But as a starter I think it's been great.

[1]: https://imgur.com/a/ZaV9Wvr


I use the iCE40 with 8K LUTs. It's 'old' but the open-source chain is nice, the vendor-supplied chain is fine, and it's 'big enough' to do fairly serious stuff.

Like the subc compiler I mentioned, the tradeoffs working with an older FPGA like that are ok to me.

GL! P.S. I mostly used the TinyFPGA BX to learn about the iCE40, but it doesn't have enough pins to make even a semi-decent full machine (with a wide memory bus), so I use the Lattice HX 8K dev board. It's also easy to use.


You can certainly inspect the synthesized primitives, either graphically post-place&route, or in a netlist which is usually a tangled mess.

Typically you're doing so either because you're curious, you're debugging a timing problem, or you're trying to make sure the toolchain inferred a particular primitive you're deliberately trying to use. It's not terrifically informative otherwise.


Not a full-blown compiler but a backend similar to qbe or llvm:

https://github.com/robertmuth/Cwerg


If you are working on your own compiler, this may help a lot. Working on TCC is always good experience.



