It's still an impressive feat, and very informative on the assembly details - but it doesn't feel as incredible as the headline makes it sound, since the core logic seems to be a search-and-replace of the strings in the "reviewing brainfuck" table.
Seems to me, you could write the next brainfuck compiler in sed.
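To make that concrete: here's a sketch of the idea (nothing to do with awib's actual implementation), a complete brainfuck-to-C "compiler" whose entire translation logic is one sed program. The tape size and the C wrapper are arbitrary choices on my part.

```shell
bf2c() {
  echo '#include <stdio.h>'
  echo 'int main(void){static char m[30000];char *p=m;'
  # Strip comments, then token-replace each brainfuck command with C.
  # Rule order matters: no replacement text contains a command character
  # that a *later* rule would match, so nothing gets double-translated.
  sed -e 's/[^][<>+.,-]//g' \
      -e 's/+/++*p;/g' -e 's/-/--*p;/g' \
      -e 's/>/++p;/g'  -e 's/</--p;/g' \
      -e 's/\./putchar(*p);/g' -e 's/,/*p=getchar();/g' \
      -e 's/\[/while(*p){/g'   -e 's/]/}/g'
  echo 'return 0;}'
}

echo '+.' | bf2c
```

Note there's no parsing, no symbol table, not even bracket matching - unbalanced `[`/`]` just becomes C that won't compile.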
Awib is a brainfuck compiler entirely written in brainfuck.
Awib implements several optimization strategies and its compiled output outperforms that of many other brainfuck compilers.
Awib is itself a 4-language polyglot and can be run/compiled as brainfuck, Tcl, C and bash.
Awib has 6 separate backends and is capable of compiling brainfuck source code to Linux executables (i386) and five programming languages: C, Tcl, Go, Ruby and Java.
It seems more like a parser than a compiler. It doesn't write a binary; it just translates brainfuck to C source code. In the OP's defense, they did say "by abusing everything."
A compiler doesn't have to emit machine code. It just needs to translate one language into another, which the author's program does.
Also, a parser implies building an internal, structured representation. Since the emitted C is neither internal nor structured, I'd say the program is more likely a compiler.
There is a practical, empirical distinction between compilers and transpilers. Most compilers in practice have backends full of deep and well-explored techniques specific to machine code, like register allocation, peephole optimizations, and other stuff like that. Most transpilers you see out in the wild share more in common with compiler frontends in the techniques and the theory they apply.
There is no fundamental difference between compilers, code formatters, preprocessors or transpilers. Anything that takes a program and outputs another program related to it by some semantics is a compiler.
That said though, I do agree that calling an assembler like this (brainfuck is a kind of assembly) a compiler stretches things a bit, for an entirely different reason: usually there is some sort of complexity threshold for the input language. The compiler has to at least maintain a symbol table for named entities (traditional assembly languages have variable-like entities, macros and subroutines; the assembler has to keep track of all that). Brainfuck is completely linear, with no named entities at all. The "compiler" looks like a dumb string processor: it just iterates over one buffer to transform it into another, and doesn't maintain any sort of structure for the code being translated.
By the strict "compiler : program->program" definition, this doesn't matter. But my intuition holds that dumb string processing is a bit short of "true" compilation.
> There is no fundamental difference between compilers, code formatters, preprocessors or transpilers. Anything that takes a program and outputs another program related to it by some semantics is a compiler.
That is arguing semantics and, while not wrong, I think it muddies the waters: By that logic, sed and awk are compilers.
In the end, only machine code can be executed, so you'll need something at the end of your chain that produces machine code or at least executes your DSL in an interpreter loop.
So I think a distinction between "compilers" that generate machine code and "compilers" that don't is worthwhile.
“Compiling to C” is an established strategy; for instance, GHC used to compile to C--, a reduced, C-like language.
This doesn’t seem wrong to me; after all, assembly is already an abstraction layer over machine code, and outputting assembly would hardly be “uncompilerish” behavior. I suppose it depends on whether you view C as a low-level language.
> The name of the language is an in-joke, indicating that C-- is a reduced form of C, in the same way that C++ is basically an expanded form of C. ("--" and "++" mean "decrement" and "increment".)
A compiler compiles something to something. For example, gcc compiles C code to assembly, which is then assembled into machine code by gas. Clang does the same: https://freecompilercamp.org/clang-basics/
The fact that a compiler doesn't save everything to files, or doesn't spread the work across multiple executables, does not mean it works in a single step, generating machine code directly from the source code.
Both gcc and clang support the -save-temps switch to keep intermediate files during the compilation process. These files are created even without this argument, except that they get randomly generated names in a temporary location and are deleted afterwards.
> These files are created even without this argument
That used to be the case in the dark ages of autoconf.
clang does everything in a single address space which is a lot faster now that you aren't doing a million filesystem calls.
btw I remember seeing a talk a few years ago about integrating clang with the build system so that the same compiler process can be reused to compile multiple files. Startup time is significant when you have a lot of source files.
Past: >1 process and many fs calls per cpp.
Present: 1 process and a few fs calls per cpp.
Future: <1 process and a few fs calls per cpp (on average)
-save-temps is for backwards compatibility. The files are not generated if you do not give that flag. clang goes to the extra effort of generating them for you if you request them using that flag, but they are not part of the compilation process.
Machine code is generated "directly from the source code" in the sense that there are no intermediate languages produced. There are multiple stages of compilation, of course, but these all involve in-memory data structures, not textual languages.
Adding to thewakalix's answer, many compilers use LLVM as a backend, and LLVM's IR is a higher-level-than-bytecode language too. So that would mean that e.g. Clang or current GHC wouldn't qualify either.
The word we need here is "trivial". A compiler is a function (or possibly a nondeterministic process) from source code in a first language to source code in a second language. The languages can be the same, for example, in the case of metaprogramming. The identity function is a trivial compiler. Calling it a non-compiler only serves to complicate the definition of compiler, with no discernible benefit.
I've created a new language called Brainfuck2 that is exactly the same as Brainfuck except the + symbol is replaced with a t. Clearly my compiler from Brainfuck2 => Brainfuck does not implement the identity function, and is less than 50 bytes.
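Concretely, one way to write such a hypothetical Brainfuck2 compiler in its entirety (the sed script body, "s/t/+/g", is 8 bytes):

```shell
# The complete Brainfuck2 -> Brainfuck "compiler": one sed substitution.
bf2_to_bf() { sed 's/t/+/g'; }

# It is not the identity function - though note it also rewrites any
# 't' that happens to appear in a comment:
printf 'tt. the end' | bf2_to_bf   # -> "++. +he end"
```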
Ah, well that depends. The simplest definition is that the identity function is the only trivial compiler. One might take a more expansive view, that your Brainfuck2 language is homomorphic to brainfuck through a character-replacement table (though I must ask -- does your compiler introduce bugs if the original brainfuck source contains comments?[1]), and decide that character replacement is trivial. You could even go further, and say that context-free token replacement is trivial, at which point TFA counts as a trivial compiler. That said, I'd estimate that most brainfuck compilers are trivial under that interpretation.
[1] if you did it right, it's idempotent, which is about as close to trivial as a nontrivial function gets under the lens of symbolic dynamics, but I'm getting a bit far afield