There are a few things I learned in the process that I think are relevant here:
1. Forget garbage collection in the sense of any variant of mark-and-sweep. For an operating system, latency is more important than throughput, so you really want real-time memory management. If you use reference counting, then there is a lazy variant that gives you just that; no pauses due to reclaiming free memory (see the sketch after this list). The downside of reference counting is that it doesn't work with cycles in heap-allocated memory, but if your data is all immutable then you can prevent cycles from being created in the first place.
2. Treat mutable memory as a form of I/O. Even if you just want to use memory-mapped I/O, you will need some way to read and write (some portion of) memory. Set aside a fixed block of memory (say, the first 10 MB), and only allow that memory to be read and written in an imperative fashion. Keep the managed heap completely separate from that mutable block, and don't allow any direct access to the managed memory.
3. You can actually perform preemptive multitasking entirely at compile time (or at program load time, for apps not included when you build a binary for your OS). I've done this for both my OS and for Scheme (https://github.com/ojarjur/multischeme). It works extremely well and winds up being much faster than the native primitives usually used for building multitasking support. For details on exactly how to do it, see my Scheme implementation linked above; a toy sketch of the general idea also follows this list.
4. The biggest difficulty the author will hit, and the most likely cause for performance issues, will be in code generation. Since the proposal is for a new bytecode, the author will have to implement their own interpreter and/or JIT compiler for it, and writing an efficient one is an extremely large chunk of work. I've built an operating system and a programming language, and the language is by far the harder project. This is why I went from working on a new OS written in a new language to re-implementing my compile-time multitasking technique for Scheme; it allows me to take advantage of all of the hard work people have done to make compilers for that language which produce fast code.
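A toy sketch of the lazy refcounting variant from point 1 (illustrative Haskell, not the actual OS code): a cell whose count hits zero goes on a free list untouched, and its children's counts are only decremented when the cell is reused at allocation time, so reclamation work per allocation is bounded by one cell's out-degree.

    -- Lazy (deferred) reference counting: decrements of a dead cell's
    -- children are postponed until the cell is reused by alloc.
    import qualified Data.Map.Strict as M

    type Addr = Int
    data Cell = Cell { count :: Int, children :: [Addr] }
    data Heap = Heap { cells :: M.Map Addr Cell, freeList :: [Addr], next :: Addr }

    incref, decref :: Addr -> Heap -> Heap
    incref a h = h { cells = M.adjust (\c -> c { count = count c + 1 }) a (cells h) }
    decref a h
      | count c == 1 = h { cells = M.insert a (c { count = 0 }) (cells h)
                         , freeList = a : freeList h }   -- children deferred
      | otherwise    = h { cells = M.insert a (c { count = count c - 1 }) (cells h) }
      where c = cells h M.! a

    -- Allocation pops a dead cell and pays its deferred decrements.
    alloc :: [Addr] -> Heap -> (Addr, Heap)
    alloc kids h = case freeList h of
      a : rest -> let h1 = foldr decref (h { freeList = rest }) (children (cells h M.! a))
                      h2 = foldr incref h1 kids
                  in (a, h2 { cells = M.insert a (Cell 1 kids) (cells h2) })
      []       -> let a  = next h
                      h1 = foldr incref h kids
                  in (a, h1 { cells = M.insert a (Cell 1 kids) (cells h1), next = a + 1 })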
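And a toy sketch of the compile-time multitasking idea from point 3 (the step-splitting is written by hand here; multischeme does it automatically at compile time): each task is compiled into a chain of bounded steps, and a trivial round-robin scheduler interleaves them with no timer interrupts at all.

    -- "Preemption" without hardware timers: yield points are baked in.
    data Task = Done | Step (IO Task)

    counter :: String -> Int -> Task
    counter _    0 = Done
    counter name n = Step $ do
      putStrLn (name ++ ": " ++ show n)
      return (counter name (n - 1))

    -- Round-robin scheduler: run one step of each live task, repeat.
    schedule :: [Task] -> IO ()
    schedule []             = return ()
    schedule (Done : ts)    = schedule ts
    schedule (Step io : ts) = do
      t <- io
      schedule (ts ++ [t])

    main :: IO ()
    main = schedule [counter "A" 3, counter "B" 3]   -- A and B interleave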
One interesting exception to this rule is operating systems for network packet routers. These need to be optimized for throughput over latency. I mean, the difference between 10 and 20 million packets per second is important, but the difference between 10 and 20 microseconds of processing delay is not.
This is a fun area to be in at the moment because simple academic-looking code can actually compete with highly optimized kernels like Linux because those have been optimized for something else.
The old OS/390 that is still processing your bank statement and your payroll is another counterexample showing that the above statement is too generic.
There is a resurgence of dataflow-related research, which you would probably find interesting. For example, Jack Dennis is pushing the 'Fresh Breeze' architecture: a CPU with write-once memory and in-hardware GC. This should be right up your alley.
- House: http://programatica.cs.pdx.edu/House/
- SeL4 : http://ssrg.nicta.com.au/projects/seL4/
Emphasis is on the security and verifiability that typed, pure-by-default Haskell provides, over OS design.
Instead of graph reduction, you could drive your execution in a completely immutable way by using string reduction. I don't think this has been explored much; I only found mention of it in 80s FP books (and Dybvig's thesis).
Is the assumption that mark-and-sweep is slower than real-time memory management? That has actually been shown to be false in some cases; e.g. the .NET CLR runs faster than native C code in certain situations.
> If you use reference counting, then there is a lazy variant that gives you just that; no pauses due to reclaiming free memory.
Reference counting is extremely slow because of how memory caching works. Whenever you update reference counts, you're trashing the CPU cache. It seems like there's no way refcounting could be faster than mark and sweep due to this. What am I missing?
If you were actually going to try to do tracing GC on the operating system level, you would probably have to use one of the more involved GC algorithms that require kernel support, like Azul's C4 collector.
Which shouldn't be such a problem if you're already writing an OS?
Also AIUI the only "hard part" of C4 is making sure objects don't get modified while they're being moved (that's what the barrier's there to detect) - which isn't a problem if everything's immutable.
I guess that could've used more context. I was trying to point out that in the larger landscape of GC algorithms, the one used by .NET is not that advanced, even though it's better than the GC used by common scripting languages.
> Also AIUI the only "hard part" of C4 is making sure objects don't get modified while they're being moved (that's what the barrier's there to detect) - which isn't a problem if everything's immutable.
Would absolutely everything be immutable? How would you implement lazy evaluation with the correct time complexity?
The expensive part of any garbage collector with short pause times is bound to be the read barrier, which is still needed with only immutable data. If the GC is compacting and wants to relocate an object, it has to rewrite all references to point to the new copy of the object.
C4 uses a read barrier to enforce the invariant that all references point to the up-to-date copy on a load of a reference, not just on a use of a reference. It relies on its ability to immediately update all references to reuse compacted pages during compaction, which means that gratuitous amounts of free memory are not required (a common problem with moving collectors).
You might be able to come up with an alternate collector design that uses some of the same ideas as C4 while exploiting immutability, but if you loosen the healing read barrier you weaken your ability to reuse memory and to bound memory growth, since a reference to an old copy might always escape to the next round of collection.
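For concreteness, here's a miniature self-healing read barrier (a sketch of the idea only; C4's real barrier works via virtual-memory protection, not a per-object field): the collector installs a forwarding pointer when it moves an object, and every load checks it and rewrites the reference, so an old copy is followed at most once per reference.

    import Data.IORef

    -- An object either lives here or has been relocated; the collector
    -- fills in 'forwarded' when it moves an object.
    data Obj a = Obj { payload :: a, forwarded :: IORef (Maybe (Obj a)) }

    -- Self-healing read barrier: every load checks the forwarding
    -- pointer and heals the reference to point at the current copy.
    readBarrier :: IORef (Obj a) -> IO (Obj a)
    readBarrier ref = do
      obj <- readIORef ref
      fwd <- readIORef (forwarded obj)
      case fwd of
        Nothing  -> return obj
        Just new -> do
          writeIORef ref new   -- heal the reference
          readBarrier ref      -- the new copy may itself have moved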
Exactly this. I would actually back off a bit and only claim that reference counting has more predictable latency, but the point is to sacrifice throughput in order to reduce worst case latency.
and by citation I mean example code that "proves" this.
The conclusion we kept coming back to is that it's technically not all that difficult to implement, but that to make it usable in the real world would mean that computers would have to get a lot more cautious about source vs. derived data.
The main thing is that on the one hand it seems pretty feasible to store all meaningful data a user generates over the course of using a computer and building an immutable persistent OS on top of that.
On the other hand though, when you think about doing this in the context of an actual computer as in use today you will very quickly run out of storage space.
What separates these two scenarios is that a lot of data is (deterministically) derived from source inputs, but in the current state of technology it's impossible to tell which data is derived (and can be cheaply recomputed) and which is essential (without it, the current state could not be reached).
I happened to just be reading this article on unikernels which coincidentally is something that would help get us a bit further along in that direction: http://queue.acm.org/detail.cfm?id=2566628
P.S. When thinking about this, one key realization is that you could do things like implement a 'torrent file system', where a file is just a torrent hash that can be requested, signalling when (parts of) it become available.
Can you elaborate more on that statement? I see it as an expensive problem, but not impossible.
If you have a completely deterministic VM/instruction set, you can recompute any function output from any set of inputs. If you have non-deterministic input (network, input devices, randomness, etc) you can store those inputs to replay at a later time.
I don't know if it's always necessary to store non-derived data. Do you need to save all keystrokes? All mouse moves? All network packets? Probably not. If those network packets just result in displaying a graph on the screen, they can probably be thrown away. Maybe you can throw away the headers and just compute based on the packet data. Or maybe you can simply store the fact that a stream of data that matches a cryptographic hash was received. It depends on what function receives that non-derived data.
So I think the difficulty would be designing a system that would isolate derived data and replayable/recomputable/refetchable data, and the processes that compute that data -- and do so with reasonable efficiency. I think this could be done at the process level rather than the instruction set level (though you would have to have a restricted instruction set -- i.e. limited floating-point).
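A toy sketch of what that could look like (invented Event type, Haskell): only the non-deterministic inputs get logged, and with a pure transition function the entire state is just a fold over that log, so any event can be summarized or dropped once its effect is captured elsewhere.

    import Data.Word (Word8)

    -- Non-deterministic inputs are logged; everything else is derived.
    data Event = Key Char | Packet [Word8] | RandomBits Word8

    -- A pure, deterministic transition function: same log, same state.
    step :: Int -> Event -> Int            -- toy state: a running total
    step s (Key c)        = s + fromEnum c
    step s (Packet bs)    = s + sum (map fromIntegral bs)
    step s (RandomBits b) = s + fromIntegral b

    -- Recovering the current state is just replaying the log.
    replay :: [Event] -> Int
    replay = foldl step 0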
That's why the Purely Functional OS is essentially a thought experiment in what it would take to extend this to the entire operating system.
When I said "impossible given the current state of technology" I meant exactly what you said in your last paragraph though: impossible without designing an OS that isolates derived data.
I believe such a system will come in the form of persistent data structures coupled with reactive programming but, while I believe it's inevitable in the long run, extending that concept all the way to the OS will be quite the challenge.
If the CPU has 'now' to consider, and now is Time value [t] and if [t]'s are all there is between states, then 'think about [t1 .. t2]' should be a standard op.
And isn't this the point about immutability, that it has to have had a Time value, to be of any value at all, anyway, to the user?
This might actually have significant performance benefits. When everything is immutable, GC can also be inherently incremental. You just do your defrag copy faster than new object creation. Then you can limit your stop-the-world to a small "catch-up" so your copy-from can become a copy-to. Incremental GC nirvana!
Also, you've just described existing GC systems - the Java G1 collector is very similar to that (it's a lot more complicated, though).
Immutability prevents the mutation of older generations, so no new references can occur.
So I expect generational GC to be easier to implement. I wouldn't expect it to be faster or slower though, because programs will need to be written differently to cope with the limitations on mutability. O(1) algorithms will likely be replaced by O(log(n)).
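For instance (a standard persistent-structure example, sketched in Haskell): replacing a mutable array update with a persistent tree turns an O(1) write into an O(log n) path copy, while every older version stays valid and shared for free.

    import qualified Data.Map.Strict as M

    -- A persistent "array": a write copies only the O(log n) path to
    -- the changed slot; older versions remain intact and shareable.
    type PArray a = M.Map Int a

    write :: Int -> a -> PArray a -> PArray a
    write = M.insert                       -- O(log n), old version kept

    main :: IO ()
    main = do
      let v0 = M.fromList (zip [0 ..] "hello")
          v1 = write 0 'J' v0
      print (M.elems v0, M.elems v1)       -- ("hello","Jello")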
Memory access is often over 100x the cost of an L1 cache hit. It doesn't take too many of those to make a big difference if you're CPU bound.
My comments are general, I'm sure Clojure has access to arrays where necessary for interop at least.
Not really. On the other hand, it makes the implementation of GCs much simpler.
(immutability semantics also tends to make memory churn much higher, so what little gain you get from a simpler implementation is often lost in the GC having to do more work)
Also, it's not like these things are unstudied FFS.
Unless you're using refcounting: since immutable structures can't create cycles, a refcounting GC won't need a cycle-breaker, and thus won't need to ever pause. But you'll be losing throughput (especially on allocation, I'd guess).
Also not sure what your point is re. Azul, immutability is not what enables it since Zing is first and foremost a java VM.
Perhaps I'm misunderstanding something, but doesn't "tying the knot" in Haskell do just that? (Create a cycle from immutable structures.)
Wouldn't that turn shared objects into not-shared objects? Theoretically, that doesn't make a difference in a system without mutable data (assuming you leave out 'identity check', too), but in practice, it likely will blow up memory usage tremendously.
Edit: I think the way around this is to replace the moved object with a "this object moved" object. The trick, then, will be to know when it is OK to discard that "this object moved" object. That probably is the case after a full heap scan.
And I suspect neither does the author. I looked at the previous project (OpenGL Renderer), which comes in at 250 lines of code and barely any functionality. The current project has no code at all.
I'm all for sharing ideas, but this is somewhat lacking. The design document touches on the memory model, but ignores all the other problems an OS has to solve (scheduling, networking, disk IO...).
I don't suppose this project will actually go anywhere, though I'd be pleasantly surprised to be wrong.
For example, there are discussions in the RTOS world to take a closer look at Rust (Google embedded Rust) which is why the terms "new OS", "GC" and "experimental" triggered my response to look over there for a starting point.
That way some (not all) of the potential advantages can be tested without having to start from scratch.
EDIT: Well... not really I guess. But idempotence has similar properties I think.
However if you look at the first-order similarity as not being immutable data structures, but instead the idea of an OS entirely built within a managed runtime, the Lisp Machines and Microsoft's Singularity OS (C#) are two interesting examples.
- persistent datastructures (cons cells, fp trees)
- nix package management, virtualenv, containers
- react/pedestal (pushing ux delta upgrades)
Be careful! That statement was once true of programming languages.
Over time, larger and larger numbers of people relaxed their expectations for performance in exchange for benefits like greater safety. Simultaneously, implementors invested in optimizing the performance of languages previously considered "slow."
Operating systems may well wind up with the same dynamic.
* Standard 'C' demo... Say hello to the world
* Compile command: cc hello
Even a state-of-the-art computer in 1980 (say, a Cray supercomputer) would have a hard time keeping up with a hobbyist system of today (http://www.chrisfenton.com/homebrew-cray-1a/), and it's because of this increase in computing power that we've been able to have the "toy languages" of yesterday be useful languages today.
But given that CPU speeds have leveled off, the easy performance gains are pretty much gone these days.
The Clojure library "om" is able to use reference equality (instead of a deep comparison) to check for diffs. This results in a 2x or 3x performance gain over mutable data structures.
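The trick looks roughly like this (a hand-rolled Haskell sketch using GHC's StableName for pointer identity; not om's actual code): if the old and new subtrees are literally the same object, they can't differ, so the diff skips them without looking inside.

    import System.Mem.StableName (makeStableName)

    data Tree = Leaf Int | Node Tree Tree

    -- Pointer-identity shortcut: equal stable names mean the same
    -- object, which (being immutable) cannot have changed.
    changed :: Tree -> Tree -> IO Bool
    changed old new = do
      a <- makeStableName old
      b <- makeStableName new
      if a == b
        then return False                  -- same object: skip subtree
        else case (old, new) of
          (Leaf x,   Leaf y)     -> return (x /= y)
          (Node l r, Node l' r') -> (||) <$> changed l l' <*> changed r r'
          _                      -> return True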
So I agree with you, this will likely just be a toy OS. But if his design demonstrates huge advantages, the ideas could be adopted by mainstream operating systems.
We have programming languages where everything is immutable (Haskell). This results in programming techniques that generate a lot of garbage. We have not seen that this results in programs with better performance than those that do not generate garbage to begin with. Why would it be different in an operating system?
The point is that you could possibly buy more for less in this trade off. It's not just GC. It's a close to pauseless incremental GC that might become available.
It's not so much that it has to be in an OS, though it seems it would be better in an OS.
I think that's more due to hosting providers being able to sell far more "machines" than they have hardware for.
In fact, we have some systems where things are even deployed in containers inside VMs for various reasons...
Not everything has to run a social network and not everyone is working on a twitter clone.
Also, Linux found a niche, and stopped being a toy very early, much before people started really caring about performance.
On the left sidebar, the "documentation" link is to a video. No thanks, I'm looking for a sentence of text.
The next link down is a 404 error.
The community link is for insiders... lots of comments about people leaving the building, whatever that means.
Bottom of page, how to install. Why am I installing something when I don't know what it is?
Sites that expect you to already know what they are about, and do not deign to give so much as a sentence of explanation, are extraordinarily frustrating.
Urbit is indeed a purely functional OS, not super dissimilar to the OP's project. And it works at least well enough to run a chat server on its own network (which generated all those logs with "leaves the building"). But we ship no code before its time and we probably shouldn't even have a site up.
You'll enjoy the video though - it has good music, at least.
That lack of explanation, that expectation of speaking to insiders, is unfortunately very common on web sites, not just yours.
Please add the About explanation even if you decide to take the site offline. At least that way, it will already be ready, later, when the project is further along.
I was just reading about the defunct House functional OS yesterday, coincidentally, so I'm interested in the general topic.
Do you have a link to House? I wish people naming projects would realize that the Internet is dead and all we have is the Googlenet...
Here's what I've got, pasted out of my notes:
House has been successfully implemented in Haskell, so I'd assume that Haskell would fit your criteria. Admittedly, House is an experimental OS rather than a production one. But they have managed to implement everything from the kernel through the network stack to a rudimentary GUI.
[link broken, see search results below]
House is a demo of software written in Haskell, running in a standalone environment. It is a system that can serve as a platform for exploring various ideas relating to low-level and system-level programming in a high-level functional language.
More details are available in our ICFP 2005 paper: A Principled Approach to Operating System Construction in Haskell.
terse wikipedia article on House
short LTU thread on functional systems programming
...oh, also there's a book, "Unix System Programming with Standard ML"
by Anthony L. Shipman
Was free online, link is broken, available via archive.org:
I just found that, have not read it.
Urbit is intended to run virtually in the cloud and is an OS only in the sense of "stores your data and runs arbitrary programs". As a cloud computer its "function" is simply (old state, packet in) -> (new state, packets out). For communicating with the host OS this generalizes to "event in" and "actions out." Eg, I got an HTTP connection so I send an HTTP response. So not only isn't it done, it isn't even very smart... but it is a function and it is an OS.
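Spelled out as a type, the claim is just this (hypothetical names, not Urbit's actual source):

    import Data.Word (Word8)

    -- The "OS as a function" idea as a type signature.
    type State = [String]                  -- stand-in for "your data"

    data Event  = PacketIn [Word8] | HttpRequest String
    data Action = PacketOut [Word8] | HttpResponse String

    osStep :: State -> Event -> (State, [Action])
    osStep s (HttpRequest p) = (p : s, [HttpResponse ("hello from " ++ p)])
    osStep s (PacketIn bs)   = (s, [PacketOut bs])   -- echo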
Atoms in Clojure are reference types, and they are mutable in a controlled way (they can be changed to refer to another immutable value).
In Erlang, an "atom" doesn't refer an immutable value, it is the immutable value. An Erlang atom by definition cannot ever be any other value.
If this kind of OS encourages hardware changes, that's one thing. But aside from performance, it probably won't look too different from current FP languages.
I think you would be well-served to experiment with and learn some Erlang to help inform your design. In fact, Erlang often feels to me like an operating system due to the independence of processes (and the multitude of tools built on top of it, but that's not directly relevant here). I've often dreamt of an Erlang Machine like the Lisp Machines of the past.
As an aside, Erlang's primary goal is fault-tolerance. Its other properties, such as immutable data, message passing, and functional properties, were all design decisions made to achieve this goal. My point is that your OS could be very well suited to fault-tolerant systems.
I do think message passing has a bright future, though, and my OS will not work at all in a shared-nothing manycore system.
I disagree with the statement "Erlang requires multiple machines to shine," but my motivations, and my definition of "shine," are likely different from yours. Erlang only requires >1 machines if you are creating a system with proper fault-tolerance. In my opinion, it has good SMP support and it performs well for general-purpose applications. You wouldn't use it to implement matrix multiplication, though :)
Anyway, the gist of my post is that the lessons learned from writing Erlang programs might be useful in your work.
Good luck! I look forward to hearing about your progress.
And you're right, "shine" is a generalization, I actually had fault tolerance in mind when I wrote that.
Hardware transactional memory might help here.
Copying data is slow and that's why Go's channels aren't the fastest solution, but they're less complex than dealing with the traditional concurrency issues.
I've written a toy kernel and I just can't imagine copying all process data because 1 little thing changed. However, there's no reason not to pursue this idea, as it may prove to be useful in certain cases. If real-time schedulers have their place, why wouldn't such an OS have its uses? I think Go (and other projects) have proved the concept works, so maybe it's time to see an OS?
Best of luck to you and I hope to hear good news in the future!
No. Go only copies the immediate values, if you send a pointer the pointee is shared for instance. Go's concurrency is shared-memory.
Erlang copies (almost all) the data sent between processes, and has process-local heaps; binaries above 64 bytes (IIRC) are refcounted and shared.
(FWIW, though, Clojure's novelty to me was an STM system that I could get up and running without a night's worth of tweaking, and otherwise fall safely into a great deal of attention if something were to go hilariously wrong - most of the other solutions around "back then" seemed to regard themselves as research curios lagging behind in fit and finish.)
I imagine that this would be useful if doing many calculations on readonly values, but would take a performance hit for readwrite operations.
I'm not one for purely functional programming, but this seems like an interesting concept. I'm interested to see where it goes.
It has become more and more common to model systems with append-only as the only form of write (Event Store, Datomic), so readwrite can probably be avoided in some cases. And things like IP packets certainly make sense to represent immutably - anything else is a bit of a lie :) But there is certainly a risk that the tradeoff of immutability becomes too costly for an OS.
Also, the system language will not be pure, like Haskell. Just immutable values, like Clojure :)
Isn't the main purpose of an OS to bridge between the applications and the hardware? The hardware is pretty accurately abstracted as something with a state and input/output streams. What is the advantage, or even the purpose, of abstracting the hardware as something without a state?
I don't know how an OS would qualify to get on the list other than not being DOS, Windows, OS X (or any of the Apple OSs), Linux (especially not 'ubuntu with a different DE'), or the BSDs (I guess there are exceptions if they're on a toaster).
Those are more serious kernels I think. There are also a ton of research-related kernels. I guess you can find a few for some programming languages.
The key is to make it fast. Mutable -> immutable doesn't have to mean an O(N) copy.