This project is to bootstrap a language processor from a hand-written ELF binary.
This is a hobby project and is not intended for any practical use, but constructing a language processor completely from scratch is a lot of fun.
Please try running, reading and playing with the code!
Bootstrapping proceeds as follows:
1. Implement a simple Forth interpreter (really) by handwriting ELF.
2. Grow it into a relatively expressive language through self-extension.
Since writing a program in machine language by hand is a tough work, the first
interpreter is designed to be very simple. Every built-in word is single-letter and the interpreter just repeats that reads a character, looks it up from the dictionary and executes it. Also there is no error checking.
This is the actual code for the first interpreter, which is a 136-byte implementation of the interpreter followed by a built-in dictionary of 888 bytes.
However, at the end, you can write a normal Forth program as follows.
In this way, the interpreter and the language are self-extended during the
execution of the script.
$ tail bootstrap.fs
next-arg dup argv @ !
included
else
." Welcome to PlanckForth " version type
." [" runtime type ." ]" cr
copyright
." Type 'bye' to exit." cr
s" /dev/tty" included
then
; execute
Various functionalities like compile mode, immediate mode, control-flow structures, literals, variables, constants, file I/O, heap memory, etc. will be available after running bootstrap.fs.
Also, the language and bootstrap.fs are designed to be as environment-independent
as possible, so you can implement the runtime in other languages as well.
As an example, I added an implementation of the runtime in C and Python 3.
There is no space for detailed explanation here. So please read the following source code and comments.
Also, there is a commentary article in Japanese here.
I hope you can read it with google translate.
(I will write it in English if I have time. someday.)
Don’t underestimate yourself. His interpreter just has to read a byte, look up the definition in a dictionary, and then jump to the appropriate subroutine. The one he wrote reads from stdin; if you were actually bootstrapping a new machine from scratch you would be writing it to the boot sector of a disk, so it would need to read input from later sectors of that disk. Not really difficult to arrange, though the details would differ on different types of computers. You would just have it read a fixed number of sectors, long enough to read in your bootstrap program, and then everything would be in memory from then on out.
It has a rather minimal set of commands in that dictionary to start with. The next thing you want to do is give it a program that adds new commands to the dictionary to make the language less annoying and more capable. This is in the bootstrap.fs file. The first two lines define a command called `\`, which reads and discards bytes from the program until it finds a newline byte. This allows the language to include comments. It also defines a command called ` ` which does nothing except allow the program to include spaces for formatting. :)
Commands are added by writing to memory to add new entries to the dictionary. The dictionary is a linked list, and the bootstrap program “allocates” memory by simply starting at the beginning of free memory and writing values consecutively. That’s good enough to extend the dictionary as long as it never goes backwards. I haven’t read the whole bootstrap program, but presumably it implements something more sophisticated than that. On the other hand, memory is cheap. You can do a lot even if you never deallocate anything.
Since the language is so simple, you can learn to decypher it pretty easily. You’ll pretty quickly start to notice elements that are frequently repeated, such as `h@!h@C+h!`, which stores the byte on the stack to the current address in memory, then moves the current address to point to the next location in memory. In C that would be `*ptr++ = foo`. These repeated bits are soon defined as new commands, so that the program gradually gets less repetitive.
My favorite trick is actually one he does to improve readability. The command to read input, `k`, reads a single byte as a signed integer. If you want to read in something like -5, you could include that as a single byte with the value 251 (or 0xFB). Instead, he writes it as `k0k5-`. This reads the byte `0`, which has value 48 (0x30), then the value 53 (0x35), and subtracts them. But _visually_, it looks almost exactly like `0-5` to the programmer which is indeed the same as negative 5. This is used in the `\` command to jump backwards by 5 instructions after every read that doesn’t find a newline, for example. On line 645 he finally gets around to defining a command that negates a number, on line 1153 he defines one to read in a number in either decimal or hexadecimal, and on line 1304 is the command that reads in a double–quoted string. After that he can start writing out numbers almost directly: `s" -5" >number` reads in a decimal -5, but now it can be trivially modified to contain any number you need.
No, it more or less means that a language is described in (part of) the language itself. And before you ask, no, it's not quite the same as self-hosting (good question though). The latter is more associated with compilers that can take a source file and produce another compiler. The former is often more like an interpreter that provides support for part of a language, but those part are enough to define support for the rest of the language.
So in this case, the base implementation starts with support for only a subset of the Forth language, picked for being either essential or much more performant when implemented "natively" in the binary. A subset of this subset is then used to construct support for the rest of the language. This is very typical for Forth- and Lisp-like languages.
@nineties, I'm curious why you included things like 'mul', 'divmod', 'and', 'or' in the initial instruction set? These can all be composed, so would make your initial ELF binary a bit smaller.
I’d bet it was just because those are easy; there is a machine instruction to compute them. Might as well use it and have one less subroutine to define later.
Thank you db48x.
Yes, that's the reason. I added them to the first dictionary since these instructions are very simple even when writing in machine language.
In case of mod and divmod, implementing them with add, sub and loop has a serious performance issue. Replacing it with shift operations, bit operations, add, sub and loops could be an option.
Bootstrapping proceeds as follows:
1. Implement a simple Forth interpreter (really) by handwriting ELF.
2. Grow it into a relatively expressive language through self-extension.
Since writing a program in machine language by hand is a tough work, the first interpreter is designed to be very simple. Every built-in word is single-letter and the interpreter just repeats that reads a character, looks it up from the dictionary and executes it. Also there is no error checking.
This is the actual code for the first interpreter, which is a 136-byte implementation of the interpreter followed by a built-in dictionary of 888 bytes.
https://github.com/nineties/planckforth/blob/main/planck.xxd
The first interpreter and language is so esoteric that, for example, the Hello World looks like this.
bootstrap.fs self-extends this. After running bootstrap.fs, the Hello World program can be written like this: At the beginning of bootstrap.fs, you can only use one-letter words, so the code is difficult to read as shown below. However, at the end, you can write a normal Forth program as follows. In this way, the interpreter and the language are self-extended during the execution of the script. Various functionalities like compile mode, immediate mode, control-flow structures, literals, variables, constants, file I/O, heap memory, etc. will be available after running bootstrap.fs.Also, the language and bootstrap.fs are designed to be as environment-independent as possible, so you can implement the runtime in other languages as well. As an example, I added an implementation of the runtime in C and Python 3.
There is no space for detailed explanation here. So please read the following source code and comments.
https://github.com/nineties/planckforth/blob/main/bootstrap....
Also, there is a commentary article in Japanese here. I hope you can read it with google translate. (I will write it in English if I have time. someday.)
https://qiita.com/9_ties/items/349b2ed65b7cd8a7d580
Feedbacks and pull requests such as runtime implementations in other languages are very welcome.
Thank you for reading.