Hacker News
A Plan 9 C compiler for RISC-V [pdf] (geeklan.co.uk)
137 points by fanf2 5 months ago | 45 comments

Just double-checking the part of the presentation where they cite Plan 9's C compiler as "predictable" because it doesn't optimize away a useless loop... that's because the compiler is missing a bunch of useful optimizations, isn't it?

Specifically, they say GCC requires this form for the busy loop to be emitted:

    for (int i = 0; i < 1000000; i++) asm volatile ("" ::: "memory");

whereas 9c will output a bunch of useless code when you tell it this:

    for (int i = 0; i < 1000000; i++);

And this is... a good thing?

I agree that it's a bit silly. They say:

>Plan 9 C implements C by attempting to follow the programmer’s instructions, which is surprisingly useful in systems programming.

It's like coding with -fno-strict-aliasing or -fwrapv in GCC: it's perfectly fine and justifiable, but that doesn't mean it makes sense as a compiler default, IMO, because you're basically lulling your devs into writing a specific dialect of C instead of the "real" language. It means that your code is effectively not portable anymore, which is probably less of an issue for low-level kernel code but could still easily cause issues as code is shared between projects. Again, there are situations where it makes sense to do so, but I strongly believe that it should be an explicit choice by the programmer, not a compiler default.
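To make the dialect point concrete, here's a minimal sketch (my own example, not from the slides or the thread): under ISO C semantics, signed overflow is undefined, so an optimizing GCC may fold the check below to a constant, while the -fwrapv dialect guarantees two's-complement wrapping and forces the check to be performed as written.

```c
#include <limits.h>

/* Hypothetical helper: returns 1 if x + 1 wrapped around.
 * In the -fwrapv dialect this is a meaningful runtime check;
 * in strict ISO C it is undefined behaviour for x == INT_MAX,
 * and GCC at -O2 may compile the whole body to "return 0;". */
int wraps(int x) {
    return x + 1 < x;
}
```

Code written against the -fwrapv dialect compiles everywhere but can silently change meaning when built without the flag, which is exactly the portability trap described above.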

Now I would argue that the for-loop example is even worse than the aliasing or wrapping-related issues, because I very rarely write busy timing loops, but I do very often write for loops that I expect the compiler to optimize correctly (drop useless code, unroll, etc.). So yeah, that really seems like a way to spin a limitation of the compiler into a "feature", and it makes very little sense.

Also, I just checked, and gcc 8.2 does output the loop code when building with -O0. I guess they could alias that to --plan9-mode.

> but I do very often write for loops that I expect the compiler to optimize (drop useless code, unroll etc...) correctly

I feel like the "Plan 9 C" author would argue that optimizations like that should be explicitly enabled using inline pragmas, where something that has an optimization pragma is requiring the compiler to optimize it (so if it can't be optimized, the compiler should generate an error) and anything without the pragma requires the compiler to not optimize it. (And then you can have an "optimize if you can" pragma, too, but its usage would be comparatively rare to either explicitly requiring or disallowing optimization.)

Whereas, with regular C compilers—unlike compilers for most other systems languages—optimizations get turned on by a compiler switch entirely outside of the code, and then what gets optimized and what doesn't is invisible, and there are both no guarantees that anything will be optimized, and no guarantees that anything won't be optimized (unless you "trick" the compiler by using things like the asm volatile() above.)

I'm not sure if I personally agree with the PoV I just stated, but I think that's what they're thinking.

Compilers, including their optimizations, are implemented using abstractions. The component to remove a chunk of code might query some other component, "are any objects within this subtree used by anything outside this subtree"? If the answer is, "no", it gets removed.

Recognizing and preserving special syntax patterns requires additional work and can add substantial complexity. This is a common dilemma in software engineering, especially high quality software that applies sophisticated algorithms. The smarter a compiler in terms of the application of state-of-the-art algorithms, the more that these rigorous (but sometimes annoying) optimizations naturally happen. On the other hand, anything that breaks abstraction boundaries results in complexity which can make comprehension and maintenance quite burdensome.

If you've ever written code to build and transform an AST it should be obvious how difficult it can be to add in ad hoc logic that leads to inconsistent treatment of nodes. Even adding pragma opt-outs can add substantial complexity. The Plan 9 compiler recognizes this because it basically does no optimizations. In that sense it behaves much like GCC in preferring simplicity over ad hoc semantics; both recognize that to "have your cake and eat it too" is too costly.

Fortunately, C does make it relatively easy to compile different source units independently, so all you really need is a single mode that disables all optimizations, and to put your special code in its own source file. But the trend is to remove this separate linking step (Go and Rust both do static linking across the application), and even C compilers are defaulting to so-called LTO, which effectively recompiles the application at link time and deliberately violates previous semantics regarding cross-unit transformations and optimizations. That's something of a shame.

GCC does permit all manner of function-level attributes, but it adds substantial complexity, which is why clang and most other compilers don't support such flexibility to the same degree, and why GCC is often reticent to support yet another option.
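As a concrete instance of those function-level attributes, GCC lets you pin the optimization level per function (a GCC extension; the function name here is illustrative, and other compilers may warn and ignore the attribute):

```c
/* Built with -O2, the rest of this file may be fully optimized, but
 * GCC is asked to compile this one function at -O0 so the delay loop
 * survives. The volatile loop variable also prevents elimination on
 * its own. */
__attribute__((optimize("O0")))
void delay_loop(void) {
    for (volatile int i = 0; i < 1000000; i++)
        ;
}
```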

> Plan 9 C implements C by attempting to follow the programmer’s instructions

Which, I might add, is a very silly thing to say. A programmer's intent and their written code are two very different things. How one maps to the other is defined only by the C standard, which says nothing about emitting specific assembly instructions, but only about the ultimate effect of code on memory.

The Plan 9 compiler deciding to pessimize your code because it assumes you actually meant for the code to be interpreted as portable assembly rather than a high-level description of a computation is kind of presumptuous. At that point it's just a different language with different (albeit compatible) semantics.

Plan 9 C is a different language than ANSI C anyway.

Not really. C99 adopted most (all?) of their extensions, including anonymous union and structure members, compound literals, long long, and designated initializers.

Interestingly, with the exception of long long, these are the features that effectively forked C and C++.

Hell of a lot cleaner too.

Compiler optimizations are one of the primary culprits in making it difficult to reason about lock-free programs. Semantics-preserving optimizations in a single-threaded context are not necessarily semantics-preserving in a multi-threaded, lock-free context.

For example, if you're writing a spin-lock, the compiler may lift a read of the lock value out of a loop because, assuming a single thread, the value will never change. This can result in a non-terminating spin-lock. For more see Linux's ACCESS_ONCE.

The example you gave is unfortunate but the consequences of optimizing loops carelessly can be serious.

Isn't this the purpose of well-defined atomic primitives?

After all, not just the compiler, but also the processor can reorder operations. So you have to annotate synchronizing memory operations regardless of whether the compiler is optimizing. e.g., a lock-free algorithm implemented using only volatile (what ACCESS_ONCE does), even with -O0, is almost certainly wrong.
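A sketch of what those primitives buy you, using C11 atomics (names are illustrative): the acquire/release orderings constrain both compiler and CPU reordering, which volatile alone does not.

```c
#include <stdatomic.h>

typedef struct { atomic_flag f; } spinlock_t;

/* Spin until we win the flag; acquire ordering keeps later memory
 * operations from moving above the lock. */
void spin_lock(spinlock_t *l) {
    while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
        ;
}

/* Release ordering keeps earlier memory operations from moving below
 * the unlock. */
void spin_unlock(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->f, memory_order_release);
}
```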

The alternative to explicit annotation is for the compiler to generate full memory barriers around every memory access. That would indeed preserve semantics in a multithreaded context, at a ridiculous performance cost.

> well-defined atomic primitives

The example I gave is simple and relates to the example of the parent but there are more complex cases for which it is a matter of ongoing research to define a semantics that also admits compiler optimizations.

For example the "well-defined" semantics of (C|C++)11's atomics admits executions where values can materialize out of thin air [1].

The broader point I was hoping to make is that optimizations are great but are not free in a multi-threaded context with data races (even benign ones). As a consequence, the choice to simply remove many of them is one that is supported by many people in the weak-memory community and even appears in newer memory models [2], for example by preventing read-write reorderings that could create causal cycles.

[1] https://www.cl.cam.ac.uk/~pes20/cpp/notes42.html

[2] http://gee.cs.oswego.edu/dl/html/j9mm.html (ruling out po U rf cycles)

So use a language with proper semantics, like later C versions. Why would you ever expect the compiler to honor a contract that was never written?

See my comment to a sibling [1]. In the case of C and the JMM, the "proper semantics" is not so proper.

[1] https://news.ycombinator.com/item?id=18312101

If the loop is so useless, why is it in the code? Probably because it isn't useless. Hence the compiler should not optimize it.

> If the loop is so useless, why is it in the code?

Because perhaps it contains a body that optimizes away based on conditions out of control of the programmer? This happens all the time with macros/templates, and with platform-agnostic code. Only the compiler can resolve what's in the body; I want to trust the compiler to remove the loop if it is useless.
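A sketch of that situation (the macro and function names are mine): in a release build the body expands to nothing, and you want the now-empty loop removed too.

```c
#include <stdio.h>

#ifdef DEBUG
#define LOG(i) printf("item %d\n", (i))
#else
#define LOG(i) ((void)0)   /* compiled out: the loop body is empty */
#endif

/* In a non-DEBUG build an optimizing compiler can delete this entire
 * loop; a strictly literal compiler would count to n for nothing. */
void log_items(int n) {
    for (int i = 0; i < n; i++)
        LOG(i);
}
```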

Those kinds of empty loops are actually used for delays, waiting for interrupts to kick in, etc. in embedded systems, where you typically fight against the compiler using the volatile keyword. Example from https://www.coranac.com/tonc/text/video.htm:

    #define REG_VCOUNT *(volatile u16*)0x04000006

    while(REG_VCOUNT < 160);

I'm curious whether there's a tool that can map the sections of code that are optimized away by the compiler and feed that back to the developer; thus code like this:

    for (int a = 0; a < 10000; a++);

would emit a message at compile time, allowing the human to take an additional look at the code and determine its usefulness. Ultimately the code would be removed or refactored just to stop the nagging.

Nice! I hope they publish their work. plan9 is a great and very portable OS for experimenting with new architectures, for the reasons outlined in the slides. You can cross-compile the entire OS for a foreign architecture simply by setting objtype=arm and running mk (plan9's take on make); less than 5 minutes later the whole OS is done compiling.

It took a minute to compile the plan9 kernel from scratch on the original raspberry pi (running plan9). You can even cross-compile an x86 kernel in similar time, and it takes 10 seconds in the 9vx emulator running on FreeBSD/amd64. I don't recall the details now, but a from-scratch Linux kernel compile was 10 or 11 hours (under Linux on the same raspberry pi). Thank goodness it wasn't written in C++; the compile time would've been so much worse!

C compiler optimizations seem like micro-optimizations when people should be looking at the bloat elsewhere. Missing the forest for the trees.

C is basically a low level language. A portable assembly language. A predictable compiler shouldn’t second guess the programmer’s intent. To put things in perspective, if all the man-years spent on gcc were spent on GNU Hurd... :-)

fwiw I can compile the Linux kernel, depending on the configuration, in 15-20 minutes. I usually give it 4 cores.

EDIT: On x86. If you don't cross-compile your raspberry pi kernel, you're in for a bad time.

I compiled linux on the raspberry pi just for kicks! Most people don't recompile the kernel, so it doesn't matter, but this just goes to show how misguided our blind quest for micro-performance has been.

* long time


This is even the officially-documented way to turn your 32-bit 9front install into a 64-bit 9front install, IIRC from doing this exact thing when I installed 9front on an old laptop of mine.

I'm pretty sure if someone sends aiju or cinap a HiFive Unleashed we'll have an official "unofficial" 9front port running in a few months.

I went to a local RISC-V meetup last night, and it seems like something interesting to play with. Does anyone know when actual chips might become affordable? The only board I could find available at the moment is the HiFive Unleashed, which is $999.

There are a handful of micros. The lowfive and a few coming from China

Here is an AI chip:


It's an interesting proposition b/c they're using RISC-V for the core, but the APUs are custom, so they can create some lock-in there for themselves (without lock-in it'll just be a race to the bottom with razor-thin margins).

And here is RISC-V-on-an-FPGA in a nice package. It's very much oriented toward Chinese hobbyists: https://www.cnx-software.com/2018/09/04/licheetang-anlogic-e...

Both those projects are by Zepan. That guy is a machine

But I'm not quite sure what's holding up general purpose CPUs (even just something crappy/good-enough).. The way I understand it CPUs aren't just beefy microcontrollers and they require some extra onchip hardware, but no one has done that yet for some reason.. Maybe someone knows better :)

> CPUs aren't just beefy microcontrollers and they require some extra onchip hardware, but no one has done that yet for some reason

For example, graphics, Bluetooth, Wi-Fi, and modems are all heavily encumbered with patents, and they are very complex subsystems. Even components with expired patents or no patents, such as an MMU, are non-trivial to create and take time. I suspect it'll take time before FOSS implementations appear.

Graphics can sit on a PCI bus. Same with Wi-Fi. The MMU is probably a blocker.

> But I'm not quite sure what's holding up general purpose CPUs (even just something crappy/good-enough).. The way I understand it CPUs aren't just beefy microcontrollers and they require some extra onchip hardware, but no one has done that yet for some reason.. Maybe someone knows better :)

There's general-purpose RISC-V CPU RTL lying around, and it's not too difficult to license the necessary peripherals, but it costs money to put together a board and fabricate at volume if you want to hit a Raspberry Pi/hobbyist price point. Unfortunately, it takes time and you need a market to justify the effort. But eventually it'll happen.

lowRISC and SiFive; there is no lowfive.

It's a breakout board for the SiFive E310


There's the HiFive1 from SiFive though, which is the Arduino-form-factor board with their Freedom E310 core.

If you want actual hardware you can get:

- Kendryte KD233

- HiFive1 (https://www.sifive.com/boards)

- GAPUINO GAP8 (https://greenwaves-technologies.com/product/gapduino/)

- HiFive Unleashed (https://www.sifive.com/boards/hifive-unleashed)

Those are the only ones that exist commercially as far as I know.

You can buy affordable FPGA boards that can be configured with open-source RISC-V chip designs, like the Arty A7-35T[0] for $119. There are a number of other FPGA development boards that would run RISC-V at a much lower cost than $999.

[0]: https://store.digilentinc.com/arty-a7-artix-7-fpga-developme...

A Parallella should be faster than that Arty, as it has 32-bit-wide DRAM; the Arty only has 16-bit.

It looks like Richard Miller, author of the article and living UNIX legend, is using a verilog implementation by Clifford Wolf [1] in this FPGA board [2].

[1] https://github.com/cliffordwolf/picorv32

[2] https://www.tindie.com/products/Folknology/blackice-ii/

The closest seem to be lowRISC and the InCore Shakti chip. lowRISC seems behind on their original timeline plan. Shakti booted Linux in August. Can't tell if they just want to make chips, though... lowRISC is going to make full RPi-type boards.

What are the advantages of using the Plan9 compiler versus TinyCC?



tcc only supports x86, and is 4-5 times bigger (lines of code) than the plan9 compiler.

Tcc has supported AMD64 and ARM for ages. It produces reasonably fast code, usable as a library, and has many other nice features. Worth looking at again if you last looked when it only supported x86.

Oh, neat, I will. Still, the main advantage of plan9's compiler is its simplicity.

Hm... a non-optimizing compiler? Nice hobby, but I don't see the point of this. Even folks doing safety-critical stuff (as in failure = dead people) use -O0 and are craving some optimizations. E.g. why no DCE? Constant propagation? With a proper representation (SSA?) some of this is near-trivial.

The Plan 9 C compiler does perform optimisations including constant folding and dead code elimination. (Actually it's the linker which eliminates dead code, so it can remove functions which are not called from any other source file.) The example loop on the slide however was not dead code or useless: it was a timing delay loop, an idiom commonly encountered in OS kernels and embedded applications.

So which is the easiest compiler to re-target to a new processor? That would certainly have some value even if it's not the most optimizing compiler.

Great to see RISC-V news make it public. Always glad to find more Plan 9 hobby projects to learn from :)
