
How the Mill CPU does fork() in a single address space - fla
http://millcomputing.com/topic/fork-2/
======
willvarfar
(Mill team)

So every turf (aka 'process') has a special value. If this value is XORed with
the high bits of a local pointer it turns it into a global pointer, and vice-
versa.

Example: turf A has a local space register of 1010100000 (binary; let's keep
the numbers small, but Mill addresses are really 60-bit). Note that the low
five bits are 0. All local space masks will have some number of low bits 0.

A local address is 0000011000. Note that the high five bits happen to be zero.
XORing in the local-space base would give the global address 1010111000.

Another local pointer might have the bit-pattern 0000100111. Note now that
there is a bit set beyond the low zero bits in the mask; this time, the XOR
will convert it into the global address 1010000111.
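
The two translations above can be checked with a few lines of Python. This is only a sketch of the arithmetic on the 10-bit toy values from the example; the names `translate` and `bits` are made up here and are not Mill terminology.

```python
WIDTH = 10  # toy address width from the example, not the real Mill width


def translate(addr: int, turf_base: int) -> int:
    """XOR an address with the turf's local-space base.

    XOR is its own inverse, so the same call maps local to global and back.
    """
    return addr ^ turf_base


def bits(value: int) -> str:
    """Format a value as a zero-padded binary string of WIDTH digits."""
    return format(value, f"0{WIDTH}b")


base_a = 0b1010100000  # turf A's local space register from the example

for local in (0b0000011000, 0b0000100111):
    print(bits(local), "->", bits(translate(local, base_a)))
```

Applying `translate` a second time with the same base returns the original local address, which is the "and vice-versa" part.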

So as the local base mask has low bits all zero, all addresses within that
span will be sequential in memory. But local space is disjoint in global
space. This affects only the maximum continuous allocation size and not the
total number of allocations.

When turf A forks, the child has to have local base mask that preserves the
relative position of all allocations. But as the XOR mask will be different,
the global addresses will not overlap for these 'local' allocations.
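
A rough illustration of the fork case, again on 10-bit toy values. The child's base `0b0110100000` is an arbitrary value picked for this sketch (the post does not say how the kernel chooses it), and `to_global` is a hypothetical helper name.

```python
def to_global(local: int, base: int) -> int:
    """Map a local address to a global one by XORing in the turf's base."""
    return local ^ base


parent_base = 0b1010100000  # turf A's base from the example
child_base = 0b0110100000   # assumed child base, chosen only for illustration

# The same local addresses exist in parent and child after the fork.
local_allocs = [0b0000000101, 0b0000011000]

parent_globals = [to_global(a, parent_base) for a in local_allocs]
child_globals = [to_global(a, child_base) for a in local_allocs]

for local, p, c in zip(local_allocs, parent_globals, child_globals):
    print(f"local {local:010b}  parent {p:010b}  child {c:010b}")
```

Both turfs keep identical local pointer values and the same distance between the two allocations, while the global addresses land in different places.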

The vast majority of mmaps are actually shared (file buffers etc) and don't
need this XORing. The CPU is transparently doing the XOR when needed and it's
nothing normal programs need to know about; only the kernel, when giving out
new mmap regions, needs to know if it should hand out a global or local
address.

~~~
caf
_So as the local base mask has low bits all zero, all addresses within that
span will be sequential in memory. But local space is disjoint in global
space. This affects only the maximum continuous allocation size and not the
total number of allocations._

So I think what you're saying here is that if turf 10101 has an allocated
local address 0000100111, then turf 10100 can't have an allocated local
address of 0000000111, because they correspond to the same global address -
they'd clash.
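
The clash can be checked numerically. This sketch assumes the five-bit turf values above are the high five bits of a 10-bit base, as in the parent comment's example; the variable names are made up.

```python
base_x = 0b10101 << 5  # turf 10101, i.e. base 1010100000
base_y = 0b10100 << 5  # turf 10100, i.e. base 1010000000

global_x = 0b0000100111 ^ base_x  # local address in turf 10101
global_y = 0b0000000111 ^ base_y  # local address in turf 10100

print(f"{global_x:010b} {global_y:010b} clash={global_x == global_y}")
```

Both local addresses map to global 1010000111, so only one of them could be allocated at a time.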

------
hyc_symas
Very nice. Much cleaner than the memory copying we had to do on ST-Minix. (But
the Mill obviously still has an MMU, the MC68000 had nothing so we couldn't
implement COW.)

------
DiabloD3
I just want to know why I can't buy a Mill yet. Stop teasing me, damnit!

~~~
jlebrech
i guess it'll drastically lower transistor count?

~~~
sliverstorm
Big Silicon?

------
charpointer
This work of XORing and using shared mmaps is not new and it was published in
2009 and 2010 by two separate teams. See the following papers:
http://dl.acm.org/citation.cfm?id=1854288
and
http://dl.acm.org/citation.cfm?id=1640096

I am not sure how this can be patented.

------
stcredzero
_> The result of all of this is that fork() on a Mill requires a cache flush
of dirty lines from the parent local space that is not required on (some) MAS
machines, but has no other added overheads. The gain is that sharing between
processes has byte granularity, not page granularity, and is vastly cheaper to
use._

What are the implications for server processes? The cache flush seems like a
good tradeoff for things like Apache. Extremely fast communication using RAM
between processes running on multi-core machines looks like a good capability
to have. Language runtimes like Erlang's could benefit from this tremendously.

(How about something like Golang, but with Erlang-like actors and the ability
to declare that certain modules cannot contain functions with side effects?
Without lazy evaluation, it wouldn't be great for processing large datasets,
but it could be awesome for games. (Actors would be distributed amongst
processes, each of which would have their own GC.))

------
JoeAltmaier
So, instead of switching page maps (and writing new TLB entries), the new
process is allocated new TLB entries, corresponding to its process id in the
high part of each pointer. Does this also require that the high bits of every
pointer access have the local process ID inserted on use? Or are all pointers
somehow tracked and rewritten on fork?

~~~
willvarfar
The XOR is applied inside the CPU when local pointers (any pointer with a
particular high bit set) are actually dereferenced. It's transparent to the
normal program. This is how we can COW the used pages without rewriting all
the pointers (which is not a solvable problem in the face of memory-unsafe
languages such as C).
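
A toy model of that dereference-time step might look like the following. Everything specific here is an assumption made for illustration: the position of the flag bit (`LOCAL_FLAG`), the choice to strip the flag before XORing, the child's base value, and the function name `effective_address`. The real hardware details are not given in this thread.

```python
LOCAL_FLAG = 1 << 10  # hypothetical position of the "this pointer is local" bit


def effective_address(pointer: int, turf_base: int) -> int:
    """Return the global address a load or store would use."""
    if pointer & LOCAL_FLAG:
        # Assumption: the flag is stripped and the rest is XORed with the base.
        return (pointer & ~LOCAL_FLAG) ^ turf_base
    return pointer  # no flag set: treated as already global and used as-is


stored = LOCAL_FLAG | 0b0000011000  # pointer value as it sits in memory

parent = effective_address(stored, 0b1010100000)  # parent's base
child = effective_address(stored, 0b0110100000)   # assumed child base

print(f"parent {parent:010b}  child {child:010b}")
```

The stored pointer is bit-for-bit the same in both turfs; only the base applied at dereference differs, which is why nothing in memory has to be rewritten on fork.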

------
vanderZwan
> _Now that we have finally gotten the patent filings for this area of the
> Mill in_

How much more to go?

~~~
willvarfar
Really not many. We are going to be getting a whole lot more implementation
done now the filing phase is passing :D

