This reminds me of Alan Kay's STEPS project - an attempt to reinvent personal computing in 20K lines of code.
Always happy to see efforts in this direction.
So far the only desktop OS + compiler I know that can be reasonably understood in its entirety is http://www.projectoberon.com
I'm not convinced that this approach is the best - the first-level language is a bit like "literate programming" machine code (not assembler) - but I think a level of simple substitution would help a bit (think m4 as an assembler), allowing at least mnemonics. The insistence on numbers seems somewhat arbitrary, as they still need to be translated from strings (letter sequences) to actual integers[e] - I don't think a table/variable lookup would be too onerous.
On the whole this feels a bit more convoluted than building a Forth as level one - but no matter what I applaud the effort.
[e] assuming program source code is text, not binary?
If it's binary I suppose "filtering out" comments might be an approach as opposed to "translating" the source.
I've programmed on systems where the only access was through direct memory access to the machine's memory, and other systems where boot loading was initiated by a button that started a paper tape reader. I've also used minicomputers that had to have instructions manually keyed in on binary panel switches directly into memory to start a card reader, where the cards contained machine instructions to bootstrap the rest of the boot loader off of the remaining cards.
In every case though, somewhere I had an assembler to write the code I needed. What would I do if I didn't even have an assembler? I could write one, but I don't want to do it in machine code. Nowadays, the easy way would be to simply construct a cross-compiler or cross-assembler written in, say, Python on a different machine.
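For illustration, a cross-assembler of this sort can be tiny. Here's a toy sketch in Python for a hypothetical accumulator machine (all mnemonics and opcode numbers are invented):

```python
# Toy cross-assembler for a hypothetical accumulator machine.
# Mnemonics and opcode numbers are invented for illustration.
OPCODES = {"LDA": 0x01, "ADD": 0x02, "STA": 0x03, "HLT": 0xFF}

def assemble(source):
    """Translate 'MNEMONIC [operand]' lines into a list of bytes."""
    out = []
    for line in source.splitlines():
        line = line.split("#")[0].strip()   # strip comments and blanks
        if not line:
            continue
        parts = line.split()
        out.append(OPCODES[parts[0]])       # mnemonic -> opcode byte
        if len(parts) > 1:                  # optional one-byte operand
            out.append(int(parts[1], 0))
    return out

program = """
LDA 0x10   # load from address 0x10
ADD 0x11   # add value at 0x11
STA 0x12   # store result
HLT
"""
print(assemble(program))  # [1, 16, 2, 17, 3, 18, 255]
```

Run this on any host machine, write the resulting bytes to the target, and you've sidestepped hand-assembling entirely.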
One alternative is to build a hand-assembled, really simple macro system. m4 or Calvin Mooers' TRAC are possible prototypes, but these flexible macro systems would be difficult to hand-compile into machine code, so one might get by with a really basic, more limited macro system. Then each machine instruction could be encoded as a macro, and a primitive assembler could be written as macros.
However, real assemblers do more work: they generally take two passes because they allow symbols to be used instead of literal addresses. So the next step would be to program a real assembler using the primitive assembly language.
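The two-pass structure can be sketched in a few lines (for an invented toy machine: pass 1 builds the symbol table, pass 2 emits bytes):

```python
# Two-pass assembly sketch: pass 1 records label addresses,
# pass 2 substitutes them. Mnemonics/opcodes are invented.
OPCODES = {"JMP": 0x10, "NOP": 0x00, "HLT": 0xFF}

def assemble2(lines):
    symbols, addr = {}, 0
    # Pass 1: assign an address to each label, count instruction bytes.
    for line in lines:
        if line.endswith(":"):
            symbols[line[:-1]] = addr
        else:
            addr += 2 if " " in line else 1   # opcode + optional operand
    # Pass 2: emit bytes, resolving label operands via the symbol table.
    out = []
    for line in lines:
        if line.endswith(":"):
            continue
        parts = line.split()
        out.append(OPCODES[parts[0]])
        if len(parts) > 1:
            out.append(symbols[parts[1]])
    return out

code = ["start:", "NOP", "JMP start", "HLT"]
print(assemble2(code))  # [0, 16, 0, 255]
```

The symbol table is exactly why two passes are needed: `JMP start` can't be encoded until `start`'s address is known, which may be later in the file.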
After that I suppose a compiler for a simple language like Pascal, basic Lisp, or Per Brinch Hansen's Edison language could be written in assembly language.
Per Brinch Hansen provides the basic directions for constructing a very simple operating system from scratch, including the assembler and compiler, in the book "Programming a Personal Computer".
But the text I'm thinking of was more a how-to on bootstrapping development under win 3.11 if you had no outside resources, and just a plain install with no other development tools.
We tend to think of assembly programming as using mnemonics, but the names of instructions in x86 are really more than that: there are 14 different kinds of add instructions (https://c9x.me/x86/html/file_module_x86_id_5.html). If a single name can map to this many distinct cases, it's no longer just a mnemonic.
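To make the one-name-many-encodings point concrete, here's an illustrative sketch (the opcode values are from the Intel manual; the table and the selection function are invented for illustration, not how any real assembler is structured):

```python
# A subset of x86 "add" encodings, keyed by operand shapes.
# Opcode values are from the Intel manual; the selection
# function is an illustrative sketch, not a real assembler.
ADD_FORMS = {
    ("rm32", "r32"):   0x01,   # add r/m32, r32
    ("r32", "rm32"):   0x03,   # add r32, r/m32
    ("eax", "imm32"):  0x05,   # add eax, imm32
    ("rm32", "imm32"): 0x81,   # add r/m32, imm32 (/0 in modrm)
    ("rm32", "imm8"):  0x83,   # add r/m32, imm8  (/0 in modrm)
}

def pick_add_opcode(dst, src):
    try:
        return ADD_FORMS[(dst, src)]
    except KeyError:
        raise ValueError(f"no 'add' encoding for ({dst}, {src})")

print(hex(pick_add_opcode("rm32", "imm8")))   # 0x83
try:
    pick_add_opcode("imm32", "imm32")         # two immediates: impossible
except ValueError as e:
    print(e)
```

Every gap in the table is a case an assembler must detect and explain, which is the error-message burden described below.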
When a single word can do so many different things, and when the underlying instruction set has gaps, it becomes hard to create good error messages. Oh you added two immediate values together. You can't do that. But what can you do? Any impedance mismatch between external notation and target format quickly adds to the amount of code needed to generate good, clear error messages. Since my goal was to build SubX in itself, it seemed like a good trade-off to force people to live closer to the target language. That makes the error messages easier to write. There's really only one case to address at any point.
(The alternative would be to give each opcode a distinct mnemonic. Does that seem useful? I didn't initially think it was, but I'm open to trying it.)
It certainly is a justification for exposing the different op codes.
I still think symbolic names for magic numbers make sense, though. And I'm not sure a table in one place loses out on clarity.
I do think it's good to re-think assembly notation (cue gas/masm/nasm/x86/6800 notation wars).
I guess I feel that at level 1 there is already some symbolic manipulation and meta-programming (e.g. loops, labels for computed jumps) - and that "dumb" replacement might even turn out to be clearer.
Comments go away (replaced with nothing), jump-to-labels become jump-to-offset, opcode mnemonics become binary numbers, numbers become binary numbers, etc.
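That "dumb replacement" pass might look like this in Python (the tables are invented for illustration; every token is either dropped, looked up, or parsed):

```python
# "Dumb replacement" translation sketch: each token is either
# dropped (comments), looked up (mnemonics, labels), or parsed
# (numeric literals). Tables are invented for illustration.
MNEMONICS = {"copy": 0x89, "jump": 0xEB}
LABELS = {"loop": -5}   # pretend 'loop' is 5 bytes back

def translate(tokens):
    out = []
    for tok in tokens:
        if tok.startswith("#"):        # comments: replace with nothing
            continue
        elif tok in MNEMONICS:         # mnemonics: table lookup
            out.append(MNEMONICS[tok])
        elif tok in LABELS:            # labels: become offsets
            out.append(LABELS[tok] & 0xFF)
        else:                          # numbers: string -> integer
            out.append(int(tok, 0))
    return out

print(translate(["jump", "loop", "#backwards"]))  # [235, 251]
```

No parsing beyond tokenizing, no grammar - just a substitution table, which is roughly what m4-style expansion would buy at level 1.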
I confess I wonder at the choice of 32-bit, as I find amd64 to be both clearer and more sane (a better calling convention?) - and it should be more powerful (e.g. more registers), faster, and allow addressing more memory (if/when that becomes an issue - say, mapping an entire 6 TB drive into contiguous memory?).
What should the mnemonics be called? SubX is already quite verbose. Would you be willing to put up with long names like `add_immediate`?
32- vs 64-bit is certainly a subjective choice. I outlined some reasons for my choice a year ago: https://lobste.rs/s/jkmwer/what_are_you_working_on_this_week.... But 64-bit is certainly a valid processor to target.
As for needing 32-bit in 64-bit - I guess it would depend a bit on whether you could settle for an "all 64-bit, all the time" subset or not (i.e. no 32-bit integers)?
Now, for mnemonics - I certainly don't know. I suppose I see why they're generally implied in most assemblers...
I think add_i is still a lot more readable than 0x83 or whatever. Is there a system to the numbers? Could one "mask" immediate etc. (e.g. ADD|IMMEDIATE, MUL|IMMEDIATE or some such, where the pipe implies xor or something - forgive me, I haven't touched shifts etc. in years)?
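As it happens, there is some system in the classic x86 ALU opcodes, which partially answers this: in the register/memory forms, bit 0 of the opcode selects operand size and bit 1 selects direction. A sketch (base opcode values are from the Intel manual; the flag names are invented, and the immediate forms like 0x80-0x83 don't fit this scheme):

```python
# Classic x86 ALU opcodes have some structure: bit 0 selects
# operand size (0 = 8-bit, 1 = 32-bit) and bit 1 selects
# direction (0 = r/m <- reg, 1 = reg <- r/m). Base opcodes
# below are the real Intel values; flag names are invented.
ADD, SUB, XOR, CMP = 0x00, 0x28, 0x30, 0x38
SIZE_32 = 0b01
REG_DEST = 0b10

print(hex(ADD | SIZE_32))             # 0x1:  add r/m32, r32
print(hex(ADD | SIZE_32 | REG_DEST))  # 0x3:  add r32, r/m32
print(hex(SUB | SIZE_32))             # 0x29: sub r/m32, r32
print(hex(XOR | SIZE_32 | REG_DEST))  # 0x33: xor r32, r/m32
```

(So the pipe would be plain bitwise OR, not xor - but the scheme only covers the register/memory variants, which is part of why assemblers hide it.)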
I generally prefer descriptive names to terse names - but for an assembler I don't think it fits to go overboard - not unless you expect to end up with a Forth-like quick ascent to abstraction (like building a macro/substitution that abstracts away the add_immediate back to just add).
I guess I'm starting to see why primitive text macros a la m4 made their way not just into C (the C preprocessor), but also, in various forms, into various assemblers...
Anyway, as your system needs to do some parsing and expansion (e.g. strings to binary numbers, loops, label computation) - would not exposing that as a macro/meta facility make sense?
Maybe it's more of a level 2 thing in your design?
a) Does it really make the code easier for newcomers to read? On the one hand the names are familiar. On the other hand it may make cross-referencing with the Intel manual harder. And you do still end up needing to do that to a certain extent. My online help refers to the standard names in parens: https://github.com/akkartik/mu/blob/3a2d36a93680f2e9eb2bd2a3...
b) Is it useful in enough places to be worth the lines of code it takes to implement?
In any case, this seems worth exploring. Perhaps we should continue on the project site? I've opened a ticket: https://github.com/akkartik/mu/issues/39. Perhaps the first step is to just hand-translate some function (say https://github.com/akkartik/mu/blob/master/apps/factorial.su...) and see how it looks.
One final word on 32-bit vs 64-bit. I didn't realize when I wrote the comment from a year ago how easy it is to run 32-bit code on 64-bit machines. The reverse is obviously impossible.
Your thoughts on the benefits of language slightly higher level than assembler are interesting. Others have trod this path before.
Bliss32 was the systems language for VAX/VMS back in the 1980s through the 2000s. Most of VMS was written in it. It was heavily influenced by PL/360, from the days when IBM 360 or 370 mainframes were mostly programmed in assembler.
Both languages emitted very predictable assembler from their higher level constructs. They had no optimizations. But their objectives seemed to be focused on readability and predictable output, like your system.
The source code for 99% of VMS was published on microfiche and distributed with the binary tapes. Worth finding a copy, along with the accompanying text on VMS internals and data structures. An excellent intro to non-Torvalds operating system architecture.
Mu doesn't have its own OS at the moment. It runs on Linux and on Soso (https://github.com/ozkl/soso). I don't know either well, but hopefully I'll at least be a good study partner. And you can definitely ask me questions about the language side and the interface with the OS.
Mu doesn't have every feature of modern compilers and OSs, but it should hopefully give you a view of something simple working from end to end that you can then carry with you when making sense of more complex systems.
Awesome, that's what I was hoping for: something that will lower the entry bar but still provide useful education for the greater world.
In other words, it had nothing in common with today's Linux. (-:
89/<- ebp 4/r32/esp # copy esp to ebp
89/<- 5/rm32/ebp 4/r32/esp # copy esp to ebp
89/<- 3/mod/direct 5/rm32/ebp 4/r32/esp # copy esp to ebp
89/<- %ebp 4/r32/esp
Could you point me at where you see your first two versions?
The 3rd one is here:
Can you explain in a sentence what rules are used to complete the encoding for the variant:
> 89/<- %ebp 4/r32/esp
The concrete rules are illustrated at https://github.com/akkartik/mu/blob/469a1a9ace9b290658beb9d0.... In this case `%<reg>` always translates to `3/mod <reg-code>/rm32`.
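The expansion rule can be sketched in a few lines of Python (the register numbering is the standard x86 one; the function itself is illustrative, not Mu's actual implementation):

```python
# Sketch of the sigil expansion described above: '%<reg>'
# expands to '3/mod <reg-code>/rm32'. Register codes follow
# standard x86 numbering; the code is illustrative only.
REG_CODE = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
            "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def expand(token):
    if token.startswith("%"):
        return f"3/mod {REG_CODE[token[1:]]}/rm32"
    return token                    # everything else passes through

line = "89/<- %ebp 4/r32/esp"
print(" ".join(expand(t) for t in line.split()))
# 89/<- 3/mod 5/rm32 4/r32/esp
```

That is, `%ebp` is pure shorthand: mod is always 3 (register-direct), and the register name just supplies the rm32 field.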
For the basis notation I'm using a dialect of Joy, an extremely simple and elegant purely functional, stack-based, concatenative language. It's like a combination of the best parts of Forth and Lisp. https://en.wikipedia.org/wiki/Joy_(programming_language)
(I'm pretty sure there's no simpler useful language than Joy.)
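For readers unfamiliar with Joy, its evaluation model fits in a few lines of Python. This is a toy sketch with an invented minimal vocabulary, not the actual implementation:

```python
# Minimal Joy-style evaluator sketch: a program is a list of
# items; numbers push themselves, names apply stack functions,
# and nested lists are quotations pushed as data. Only a tiny
# invented vocabulary is implemented for illustration.
def joy(program, stack=None):
    stack = [] if stack is None else stack
    for item in program:
        if isinstance(item, (int, list)):
            stack.append(item)          # literals and quotations
        elif item == "dup":
            stack.append(stack[-1])
        elif item == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif item == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif item == "i":               # execute a quotation
            joy(stack.pop(), stack)
    return stack

print(joy([3, "dup", "*"]))            # [9]  (square)
print(joy([2, [3, "+"], "i"]))         # [5]  (run a quotation)
```

Quotations-as-data plus combinators like `i` are what give Joy its Lisp-like flavor on top of a Forth-like stack.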
To bridge Joy to the CPU I took the high road: a compiler written in Prolog. It would be possible to use Forth lore to bootstrap from the metal up, but I was gobsmacked by David Warren's "Logic Programming and Compiler Writing" (https://news.ycombinator.com/item?id=17674859) and have spent the last year "moulting" from Python to Prolog programmer.
Anyhow, the compiler so far is concise but the model it implements for Joy on the CPU is still very crude. Between life, work, and learning more Prolog I haven't had much time to improve it.
The thing is, Prolog is a very simple language, the compiler is very simple and small, and Joy is also very very simple and small (especially if it's implemented in Prolog), and the underlying CPU is very simple and small. The tricky bit would be to, say, implement a Prolog interpreter in Joy. But if you do that you've closed the loop for self-hosting: Prolog-in-Joy, running on Joy-on-the-metal, implementing a compiler for Joy to the metal, bootstrapped through Prolog on the host.
(It would be simple to convert the Joy compiler in Prolog to a Joy compiler in Joy using the Prolog interpreter in Joy and partial evaluation. FWIW.)
For the OS I'm going to build a simple clone of OberonOS, but using Joy rather than the Oberon language. I made a model of this OS in Python and Tkinter to guide the reimplementation in Joy. This will also serve as a stress test for Joy: can you implement a whole (albeit simple) "OS" in a purely functional, stack-based, concatenative language? Maybe it sucks?
To sum up: a simple text-based UI/OS inspired by and loosely imitating OberonOS, targeting a simple but capable 32-bit RISC chip, with Joy as the shell and glue language and Prolog as the under-the-hood powerhouse systems language. And the whole thing should fit in under a hundred pages of code.
The Prolog code is under the ./thun subdir. I apologize for the state of the repo & docs. I need to go through and update everything.
If you look at the Python code the vast majority of it is the type inference module and the library of functions and combinators. The parser, main loop, and most of the library are trivial. Most of the joy.utils.types module is a crude re-implementation of a subset of Prolog and can be discarded. The Prolog version of Joy is even shorter than the Python version.
The Joy compiler (which is only about half done) is really tight and simple (Joy doesn't use variables, so no need for a symbol table; Joy doesn't use subroutines, so no need for a call stack), and the last half of the functionality won't require double the code.
Meta-programming in both Joy and Prolog is really easy and powerful, and Joy in particular lends itself to a kind of Huffman encoding of your graph (Joy code forms a simple DAG).
- - - -
I want to repeat, Mu is really inspiring. I'm looking forward to watching your progress. :-)