Hacker News new | past | comments | ask | show | jobs | submit login
I wrote a 231-byte Brainfuck compiler by abusing everything (briancallahan.net)
154 points by ingve on July 12, 2021 | hide | past | favorite | 52 comments

Here's an even smaller one for 64-bit Linux: https://gist.github.com/ianh/61e6219f0a9866b31a2b864033c1812...

    $ uname -a
    Linux personal-1 4.19.0-6-cloud-amd64 #1 SMP Debian 4.19.67-2 (2019-08-28) x86_64 GNU/Linux
    $ as -g -o bf.o bf.S && ld -o bf bf.o
    $ size bf.o
       text    data     bss     dec     hex filename
        167       0       0     167      a7 bf.o
    $ size bf
       text    data     bss     dec     hex filename
        199       0       0     199      c7 bf

a 231-byte Brainfuck-to-C compiler*

It's still an impressive feat, and very informative on the assembly details - but doesn't feel as incredible as the headline makes believe as the core logic seems to be a string search-and-replace of the strings in the "reviewing brainfuck" table.

Seems to me, you could write the next brainfuck compiler in sed.

Shouldn't the point of doing something as practically useless as compiling brainfuck be to do it in the most difficult way possible?

I'm more impressed by a brainfuck compiler written in brainfuck.



Awib is a brainfuck compiler entirely written in brainfuck.

Awib implements several optimization strategies and its compiled output outperforms that of many other brainfuck compilers

Awib is itself a 4-language polyglot and can be run/compiled as brainfuck, Tcl, C and bash

Awib has 6 separate backends and is capable of compiling brainfuck source code to Linux executables (i386) and five programming languages: C, Tcl, Go, Ruby and Java

The original point of Brainfuck was actually exactly to make the smallest possible compiler for a Turing-complete language.

The original was 240 bytes, and created an executable directly.

Brainfuck interpreter written in Tcl:


Brainfuck-to-Tcl transpiler:


It seems more like a parser than a compiler. It doesn't write a binary, it just translates brainfuck to C source code. In the OPs defense, they did say "by abusing everything."

A compiler doesn't have to emit machine code. It just needs to translate one language into another, which the author's program does.

Also, a parser implies having an internal, structured representation. As the C language is neither internal nor structured, I'd say the program is more likely a compiler.

I blame the rise of the word "transpiler" for making people think that compilation is some more magic/narrow/special thing.

There is a practical, empirical distinction between compilers and transpilers. Most compilers in practice have backends full of deep and well-explored techniques specific to machine code like register allocation, window optimizations, and other stuff like that. Most transpilers you see out in the wild share more in common with compiler frontends in the techniques and the theory they apply.

The term transpiler arose when there was a need to denote a special kind of compiler whose target is another language or version of the same language.

Well then,

I can write a tiny C compiler in bash...

Isn't a compiler supposed to produce one or more of: machine code, link-able segments of machine code, or at least 'bytecode' for a 'virtual' machine?

The definition of this problem strikes me far more as either a transpiler or a pre-processor for a Domain Specific Language.

There is no fundamental difference between compilers, code formatters, preprocessors or transpilers. Anything that takes a program and outputs another program related to it by some semantics is a compiler.

That said though, I do agree that calling an assembler like this (brainfuck is a kind of assembly) a compiler stretches things a bit, for an entirely different reason: usually there is some sort of a complexity threshold of the input language. The compiler has to at least maintain a symbol table for named entities (traditional assemblies has variable-like entities, macros and subroutines, the assembler has to keep track of all that). Brainfuck is completely linear, with no named entities at all. The "compiler" looks like a dumb string processor, just iterates over one buffer to transform it into another buffer, and doesn't maintain any sort of structures on the code being translated.

By the strict "compiler : program->program" definition, this doesn't matter. But my intuition holds that dumb string processing is a bit short of "true" compilation.

> There is no fundamental difference between compilers, code formatters, preprocessors or transpilers. Anything that takes a program and outputs another program related to it by some semantics is a compiler.

That is arguing semantics and, while not wrong, I think it muddies the waters: By that logic, sed and awk are compilers.

In the end, only machine code can be executed, so you'll need something at the end of your chain that produces machine code or at least executes your DSL in an interpreter loop.

So I think a distinction between "compilers" that generate machine code and "compilers" that don't is worthwhile.

Where do you draw the line? Is javac not a compiler?

In classic compiler theory, the term "transpiler" doesn't even exist, they're all compilers.

They literally didn't exist back then though. so not just the term was unknown back then, the practice itself too...

Apparently Intel shipped a transpiler for 8080 to 8086 source code in 1978.


Likewise, the first C++ compiler (cfront) might today be considered a transpiler.

“Compiling to C” is an established strategy; for instance, GHC used to compile to C--, a subset of C.

This doesn’t seem wrong to me; after all, assembly is already an abstraction layer over machine code, and outputting assembly would hardly be “uncompilerish” behavior. I suppose it depends on whether you view C as a low-level language.

C combines the power of assembly language with the readability and maintainability of assembly language.

> GHC used to compile to C--, a subset of C.

C-- is not a subset of C.

“C++++-= is the new language that is a little more than C++ and a lot less.” -Bill Joy

Bill Joy’s Law: 2^(Year-1984) Million Instructions per Second


You’re right, and as it happens, GHC compiles first to Cmm and then (as one half-deprecated option) to C.


> The name of the language is an in-joke, indicating that C-- is a reduced form of C, in the same way that C++ is basically an expanded form of C. ("--" and "++" mean "decrement" and "increment".)

Q&A #5 of https://www.cs.tufts.edu/~nr/c--/faq.html explains why C-- is not a subset of C.

A compiler compiles something to something. For example, gcc compiles C code to assembly, which is then compiled to machine code by gas. Clang does the same: https://freecompilercamp.org/clang-basics/

Clang definitely does not output assembly code and invoke an assembler. It generates machine code directly.

It can optionally also generate assembly code if you want to look at it, but it does not do this during normal compilation.

Not saving everything to files and not doing things via multiple executables does not mean that a compiler does things in a single step, generating machine code directly from the source code. Both gcc and clang support the -save-temps switch to keep intermediate files during the compilation process. These files are created even without this argument, except that their names are randomly generated somewhere and the files are deleted afterwards.

> These files are created even without this argument

That used to be the case in the dark ages of autoconf.

clang does everything in a single address space which is a lot faster now that you aren't doing a million filesystem calls.

btw I remember seeing a talk a few years ago about integrating clang with the build system so that the same compiler process can be reused to compile multiple files. Startup time is significant when you have a lot of source files

Past: >1 process and many fs calls per cpp. Present: 1 process and a few fs calls per cpp. Future: <1 process and a few fs calls per cpp (on average)

Was doing a quick check. Looks like the temporary files are not always generated and could not even see any pipes in Process Monitor.

EDIT: clang can tell what's being done:

$ clang -ccc-print-phases -x c t.c 0: input, "t.c", c 1: preprocessor, {0}, cpp-output 2: compiler, {1}, ir 3: backend, {2}, assembler 4: assembler, {3}, object 5: linker, {4}, image

Reference: https://clang.llvm.org/docs/DriverInternals.html

-save-temps is for backwards compatibility. The files are not generated if you do not give that flag. clang goes to the extra effort of generating them for you if you request them using that flag, but they are not part of the compilation process.

Machine code is generated "directly from the source code" in the sense that there are no intermediate languages produced. There are multiple stages of compilation, of course, but these all involved in-memory data structures, not textual languages.

Clang doesn't generate LLVM IR as part of it's normal process?

Not in a text file, no. It generates LLVM data structures that can be serialised to a text format, but aren't.

The only external binary clang invokes is, optionally, the linker. Otherwise, everything is done in one executable invocation without temporary files.

Yes and no. It uses the llvm API rather than pipe to llc

Adding to thewakalix's answer, many compilers use LLVM as a backend, where LLVM's IR is higher-level-than-bytecode language too. So that would mean that e.g Clang or current GHC wouldn't qualify either.

LLVM is a static library that you link. Clang does not generate LLVM IR text files, it invokes LLVM's code generator directly.

Most researchers I've spoken to loathe the term transpiler.

Plenty of compilers target relatively high-level languages. They're no less compilers for it; “transpiler” feels fairly arbitrary to me.

With a bit of effort and the right add-on (https://github.com/JuliaComputing/llvm-cbe) you can make any LLVM-based compiler produce C code!

A transpiler is a compiler

If you'd like to see some tiny BF interpreters, have a look here: https://www.hugi.scene.org/compo/compoold.htm#compo6

98 bytes, the smallest!

The smallest compiler would copy the BF program to the end of this interpreter.


This is my favorite answer to why this should or should not qualify as a compiler.

If translating to c is a compiler then anything is a compiler and the term has no meaning at all.

The word we need here is "trivial". A compiler is a function (or possibly, nondeterministic process) from source code in a first language, to source in a second language. The languages can be the same, for example, in the case of metaprogramming. The identity function is a trivial compiler. Calling it a non-compiler only serves to complicate the definition of compiler, with no discernable benefit.

How do you define "trivial?"

I've created a new language called Brainfuck2 that is exactly the same as Brainfuck except the + symbol is replaced with a t. Clearly my compiler from Brainfuck2 => Brainfuck does not implement the identity function, and is less than 50 bytes.

Ah, well that depends. The simplest definition is that the identity function is the only trivial compiler. One might take a more expansive view, that your Brainfuck2 language is homomorphic to brainfuck through a character-replacement table (though I must ask -- does your compiler introduce bugs if the original brainfuck source contains comments?[1]), and decide that character-replacement is trivial. You could even go further, and say that context-free token-replacement is trivial, at which point TFA counts as trivial compiler. That said, I'd estimate that most brainfuck compilers are trivial under that interpretation.

[1] if you did it right, it's idempotent, which is about as close to trivial as a nontrivial function gets under the lens of symbolic dynamics, but I'm getting a bit far afield

The original Brainfuck compiler was 240 bytes, and created executables directly rather than relying on an external toolchain.

Wouldn’t outputting amd64 instructions and jumping to it create a shorter interpreter?

amd64 instructions generally have longer encodings due to having to encode prefixes and 64-bit addresses.

sure, it can be x86 as well, right now the program outputs C code, I just thought that compiling to byte code would be more fun

Would a jump table create a smaller compiler? Trade off an indirect call (and more static data) for all of those comparisons.

I find the DLang version far more easy to understand. Also, it could do compiler optimizations : https://theartofmachinery.com/2017/12/31/compile_time_brainf...

Applications are open for YC Summer 2023

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact