
Mill CPU Inter-Process Communication - Someone
https://millcomputing.com/topic/inter-process-communication/
======
cwzwarich
Overall, this seems like one of the weaker Mill talks. Since they apparently
don't yet have a real OS running real software in simulation, they probably
haven't had the ability to test the ideas that affect how software is
structured at a higher level.

They don't provide nearly enough ways to transitively grant permissions. Using
the mechanisms discussed in the talk, it doesn't seem like you can implement a
simple asynchronous queue of units of work to perform, each having their own
permissions. The belt architecture encourages these sorts of second-class
mechanisms that have to be used in a rigid way, because the details can be
hidden in the belt and not be exposed architecturally.

Unless there's something else not mentioned in the talk, it seems like you
still need to trust the OS, because when the OS is asked to allocate a page
for a spillet there is nothing stopping it from creating a virtual alias of
that page elsewhere and allowing another thread to read its data.

The mechanism to support fork() is a total kludge. Why have a single address
space if you're just going to add segmentation in such an ad-hoc way for a
single use case? Just run the original binary in emulation until exec() or
something like that.

~~~
convolvatron
the fork exposition was weak. admittedly fork() was a mistake and constrains
a lot of implementations in strange ways. i still don't understand how
local/global exactly matches the semantics of cow.

transitive permissions are capabilities.

while i'm sympathetic to the lack of market appeal of a capability-based
system, doesn't it seem like you could implement posix on top of one by
compromising it? fd transfer over unix domain sockets is already halfway there.

seems like a better alternative.

~~~
deepnotderp
> i still dont understand how local/global exactly matches the semantics of
> cow.

It's not, it's "copy on reference".

~~~
cwzwarich
They were light on the details, but what stops an OS from mapping the same
physical page into two distinct local regions of the address space and
implementing copy-on-write as usual?

~~~
deepnotderp
Presumably nothing, but it doesn't seem like that would break anything?

------
PhilWright
The Mill design is fascinating because it is genuinely very different from
anything else. But it seems that the entire team might die of old age before
they actually have any working silicon produced. Which would be a shame.

~~~
CyberDildonics
They aren't trying to produce silicon, they are trying to produce patents.

~~~
racer-v
The goal of their first funding round was to create a patent portfolio; the
second is earmarked for an FPGA implementation. Hard to do it the other way
around, although to be honest FPGA dev kits are so cheap these days I don't
know why this requires a funding round.

~~~
jacquesm
Depends on the size of the chip you are trying to model, once your chip
exceeds a certain level of complexity FPGA modeling can get expensive really
quickly.

[http://chipdesignmag.com/images/idesign/fpga/figure1.jpg](http://chipdesignmag.com/images/idesign/fpga/figure1.jpg)

And setups like that don't come cheap.

~~~
racer-v
I suppose at $10K each, buying enough Altera Stratix 10 boards for the team
could start to add up.

------
kev009
Looking around the Mill Computing, Inc website, this feels like an
(accidental?) sweat equity scam. I realize that is a very loaded charge, but
this is NOT how real companies are run: "In the beginning we were a sweat
equity organization; no one received a salary; instead, contributors received
units that converted to stock when we incorporated. At incorporation 45 people
had worked on the Mill and became shareholders. After incorporation we are
still a sweat equity organization; we now use a stock option system for sweat
equity, and we still pay no salaries. Reward for work today is comparable to
what it was before incorporation."

I was involved in something similar around 10 years ago where we were working
on a revolutionary EDA suite for analog and mixed signal circuit design. The
owner was quite technically competent but kept upping the ante and theatrics
to the point that no customers or suitors took the company seriously, only the
desperate employees. They never closed sales or sold the IP.

I advise extreme caution in dealing with the business side of this.

~~~
posterboy
I understand the sentiment but the wording is indeed too loaded. A scam would
mean the scammer gains a benefit.

~~~
__s
Drinking the sweat of your workers like a vampire drinks blood?

~~~
igodard
That's really more the VC model; a bootstrap is different.

------
WhitneyLand
On a regular basis Mill pops up here and generates some interest; I just can't
understand why.

They haven’t produced an FPGA proof of concept, after claiming they would have
one ready last year. They now say they need investors to finish it, yet they
previously claimed to not even be looking for funding.

They claim to have angel investors, but they are all secret ones. Of course
it’s an investor’s right to stay private, but the reason you often see
investors and companies shouting from roof tops is because the funding event
itself can help a company. Publicizing it generates PR, gives the company
credibility in dealing with other companies, and is a signal that can generate
demand for more investors.

Even putting that aside, the biggest issue is they haven’t made a compelling
case for how their ideas will outperform existing CPUs in practical usage
scenarios. Yes, a running FPGA would be nice, but that’s not the only way to
show potential.

They could do quantitative analysis, modeling, or start adding a lot more
detail to their talks and papers (which tend to sound about as deep as you get
in an undergrad architecture classroom), and argue very specifically and
comparatively against today’s standards, even for just a few key scenarios.

Maybe they believe even those approaches would still have
capital/labor/opportunity costs that are prohibitive for a startup? Another
option could be small
meetings with a few well respected hardware architects, who will have the best
chance of understanding the potential value. Once convinced, they will
probably be glad to write about it or just provide a reference, which will
make funding, partnerships, hiring, etc all easier.

I dislike being critical of people swinging for the fences, because it’s what
many of us here are trying to do, and it’s important that people keep doing
it. However in this case it’s not just about long odds. Because of the reasons
above and a few other details, things just don’t add up. I don’t believe the
FPGA will ever demonstrate anything compelling, and don’t think any investor
in their own backyard on sand hill road will bite.

It’s all conjecture of course, I’d be happy to be proven wrong.

~~~
ema
You brought up the question of why people are interested in the mill and then
discussed the question of whether the mill is viable. They're not completely
unrelated but still distinct.

Being a software guy I can't say much about the viability. Watching Ivan's
lectures and thinking it over however tickles the same part of my brain that
enjoys learning a new programming language. It is just fun to see how some
problem could be solved differently.

~~~
WhitneyLand
that’s a good point - i would agree that’s a very natural and healthy
viewpoint, yet it has nothing to do with the startup aspect.

------
convolvatron
watching the talk. does he not compare this to a classic segment/call gate
architecture because he doesn't expect it to be a familiar reference? i'm
certain he's seen it before :)

edit: i thought they managed to do all of this without segments, but at the
end we hear about a special local segment with offset addressing, apparently
introduced just to handle children of fork()... i lost track of how cow can be
expressed losslessly as local/global

re-edit: a question around 1:00:00 asks exactly this, and he said, erroneously
i think, that while semantically similar, this is the first time a direct
hardware implementation of a call gate has been proposed

~~~
Taniwha
I can think of a couple systems with hardware call gates

He also doesn't mention the word 'capability' anywhere either - this is all
1980s stuff

~~~
menage
Ivan definitely mentioned capabilities at some point in the evening - "I'd
love to build a capability architecture but I wouldn't be able to sell it"
(which he's also mentioned in previous talks) but it may have been after the
camera stopped rolling.

~~~
deepnotderp
He does say that, but if you watch previous talks (either that or the material
available), he says the primary difference (IIRC) is that you aren't supposed
to segment individual data objects, but rather use coarse-grained address-space
segmentation. So there's really no technical difference; it's more of a usage
difference.

~~~
convolvatron
i think the other (turf-like) segmentation strategy is pretty common in
earlier work. i wonder about the relative efficacy from a VM implementation
standpoint, particularly wrt grant/revoke on a byte range rather than a
segment.

the other thing that struck me as really strange was a bit in the question
period where he says that 'smart devices' (DMA?) are different from simple
devices (pio?) in that they are expected to be first-class multiprocessing
citizens. doesn't that imply that high-performance peripherals need to be
specially designed for the Mill? (likely with a Mill core attached)

~~~
neerajsi
Wrt smart peripherals, probably all it means is that you need an IOMMU if you
don't want to have to trust your drivers.

~~~
igodard
We have focused on the core and less on the uncore, which is why there have
been no talks on I/O. The goal is for a smart peripheral to be
indistinguishable from just another regular core; the Mill design is big on
regularity. That implies that it has its own PLB and TLB, responds to HEYU,
and supports the same IPC mechanisms, both those in the talk and those NYF.

Of course, modern peripherals don't look like that, so there will be adaptors.
IBM 360 channels and CDC6600 PPs also haven't been architecturally revisited
in a while.

------
infogulch
Of all of the Mill subjects, the pointer kludge to support fork (itself a
kludge, yes) seems to me to be the biggest "sufficiently smart compiler" red
flag.

I just have a sinking feeling about hoping a compiler can correctly identify
and track all pointers to know how to flag them. The "pointer is a native-
word-sized int" assumption may be so ingrained -- from compilers to stdlibs to
the wide variety and age of programs -- that it will be nigh impossible to rid
existing codebases of it completely.

But I'm _not_ a compiler guy (or hardware, or assembly, or C for that matter)
so I could be quite mistaken. Perhaps it's enough to fix the compiler and make
it capable of emitting warnings/errors when it detects a violation.

As far as the talk itself goes, I'm a little sad that there was so little new
information, though I understand that we're quite deep in the technical details
and there's a lot of prerequisite background that you can't reasonably expect
from a random tech audience. If there are more than a few talks still to come,
you might need to reevaluate this method altogether and use a different format.

I'm very glad that you've decided to change the wording to refer to it as an
"SSA machine" as opposed to "belt". I think many more people are familiar with
SSA or can be convinced that it works ("your current compiler uses it _right
now_ " probably helps) by describing it as "SSA where you can only reference
the last N results" as opposed to building a whole model based on a
"conceptual giant shift register" from before. I've been following the Mill
talks since the first few videos, and recently I've wondered whether even the
asm programming model should be raw SSA instead of belt numbers, especially
since genasm assumes an infinite belt anyway.

------
neerajsi
Unrelated to protection: Ivan mentioned that this is an SSA-like architecture.

How does the compiler implement PHIs connecting expressions with different
latencies? Let's say I have:

`if (cond) { x = a + b; } else { x = a * b; }`

The MUL may take a bit longer than the ADD, but the user needs to accept the
argument at a given belt position. How do you avoid having to pay the latency
cost for the MUL if `cond` is usually true?

~~~
igodard
The tool chain does hoisting and if-conversion with wild abandon. That code
becomes {x = cond ? a+b : a*b}, and both expressions are evaluated in
parallel. The conversion is a heuristic; if you have tracing data for the
branch then it might not convert. However, a mispredict is a lot more
expensive than a multiply, so the tracing has to be pretty skewed for the
branch to be worth it.

The conversion does increase the latency of getting the value of x. If there's
nothing else to do then the tool chain will insert explicit nops to wait for
the expression. The same stalls will exist on other architectures for the same
code, just not visibly in the code. It happens that making the nops explicit
is faster than a stall; you can idle through a nop with no added overhead, but
you can't restart a stall instantaneously.

------
taliesinb
Just wanted to say, it's always a real treat to watch Mill talks, and I thank
Ivan for putting in the hard work of making them so good! (The negativity I
see here on HN really disappoints me).

Also, is the thread talk close at hand? I feel I learned less from this talk
than usual; most of the material was already discussed in the security talk.

------
burner
The emperor has no clothes. The guy who claims to have written 12 compilers
hasn't turned out one in a decade. How are microarchitectural decisions being
driven without a compiler?

------
axaxs
I like the idea behind Mill, and the openness of the talks, etc. That said,
it's been a -very- long time without so much as a real demo. What gives?

------
neerajsi
I wonder how the PLB can be fast. You have a dictionary from byte range to
permission. This is harder than TLBs, which map a relatively large granule
where you can form a search key by just extracting the top bits from the
virtual address.

Intel MPX has a similar protection model, and that introduces a lot of
overhead (of course it is bolted onto an existing arch and it wasn't a high
priority feature).

~~~
willvarfar
The protection entries have ranges. The bounds are in bytes, but the range can
be _massive_.

Imagine you load a 7MP image which takes, say, 21MB of RAM. That would be 5184
4K pages in a classic TLB. In the Mill's PLB, that whole part of the address
space can be in a single protection entry.

Then, there's a big difference between how things can be organised in software
vs hardware. The hardware PLB has some number of entries, and it will check
_all_ those entries _in parallel_.

~~~
neerajsi
Yes, you will have a CAM, but the CAM for a PLB would likely be more expensive
than a TLB, right? And today's TLBs are only a few hundred entries at the
first level. The PLB would seem to necessitate many cycles for a load, longer
cycle times, or a high-power CAM. Even if you could make address queries fast
by replicating ranges at some granularity, invalidating the PLB becomes
expensive on a grant revocation.

I understand that Itanium was forced into a low-frequency, high-power L1 cache
by being at a similar in-order, statically scheduled design point, where you
need loads to be fast in order for a compiler to be able to come up with a
reasonable static schedule. Unless there's some really nice idea out there
that's radically different from TLBs as they are today, I bet PLBs will be the
major limiter of performance for general-purpose code.

~~~
willvarfar
Each protection entry has a lower and an upper bound. The entry can cover
something as small as a single byte, or as big as the whole address space, or
anything in between.

It is just a normal bounds compare to see if each entry covers the access, and
the PLB has as long as a top-level cache access takes to do the checks.

So the PLB misses far far less often than a conventional TLB.

~~~
hilmipilmi
TLBs index a tag with the higher bits and compare one retrieved value for
equality only. A PLB with arbitrary resolution would need to do 2 subtractions
for the 2 compares, and do that for all active entries in parallel. That is,
for every simple load/store in a 16-entry PLB you'd need to do 32
subtractions! Unless you come up with a novel scheme to handle this, your chip
will get hot. Maybe you can reconstruct the indexing TLB scheme on the fly in
hardware or something like that, or have a special-purpose embedded Cortex-M3
that recreates an efficient lookup structure for the Mill :-). And no, a 64-bit
subtraction is not one clock cycle when you run at 1 GHz.

~~~
willvarfar
The entry contains an upper and lower bound, and requires two comparisons but
no subtractions.

------
mar77i
I work with a lot of OO code (ORM) that regularly contains references from
objects to other objects. How would that "security model" behave wrt the map
of reachable objects in relation to the object passed... let's assume by
reference. I figure this scenario would be somewhat similar to the problem of
"/.." paths in URLs on web servers.

~~~
igodard
The grant model requires you to grant each object individually that you want
to pass. That is annoying if you have many objects. In both the caps and grant
models you can cut the overhead by thinking of the whole graph as "the
object". A typical approach is to allocate graph nodes in an arena and pass
the whole arena.

Fine granularity is expensive, which is why the monoliths have a single
granularity: the process. If you have 100,000 graph nodes and want to pass all
of them _except_ this one, then you will have to pay for the privilege in any
protection model. The Mill lets you pay less.

------
znpy
Is this Mill architecture going to hit the market anytime soon?

