
Introduction to the Mill CPU Programming Model - luu
http://ootbcomp.com/topic/introduction-to-the-mill-cpu-programming-model-2/
======
TrainedMonkey
Interesting - the architecture looks greatly simplified compared even to
standard RISC (as opposed to, let's say, x86). Thanks to that simplification it
should be power efficient while being inherently highly parallel.

Would be interesting to find out:

1. How high can that degree of parallelism be pushed? Are we talking about
tens or hundreds of pipelines?

2. What frequency will this operate at?

3. What is up with RAM? I saw nothing about memory, and with lots of pipelines
it is bound to be memory bound.

~~~
willvarfar
Hi, I'm the author of that intro. The talks which Ivan has been giving - there
are links in that intro - go into everything in much more detail. But here's a
quick overview of your specific questions:

1: we manage to issue 33 operations / sec. This is easily a world record :)
The way we do this is covered in the Instruction Encoding talk. We could
conceivably push it further, but it's diminishing returns. We can have lots of
cores too.

2: it's process agnostic; the dial goes all the way up to 11

3: the on-chip cache is much quicker than in conventional architectures, as
the TLB is not on the critical path, and we typically have ~25% fewer reads on
general-purpose code due to backless memory and implicit zero. The main memory
is conventional memory, though; if your algorithm is zig-zagging unpredictably
through main memory we can't magic that away

~~~
sp332
33 ops/sec? :)

~~~
willvarfar
Oooh, too late for me to correct that particular typo :)

33 ops / cycle, sustained. Last night we also published an example list of the
FU mix on those pipelines here: [http://ootbcomp.com/topic/introduction-to-the-mill-cpu-progr...](http://ootbcomp.com/topic/introduction-to-the-mill-cpu-programming-model-2/#post-610)

------
JoeAltmaier
Some kinds of code will benefit from this - long calculations and deeply
nested procedures. But lots of the hangups in consumer applications are in
synchronization, kernel calls, copying, and event handling.

I'd like to see an architecture address those somehow. E.g. virtualize
hardware devices instead of writing kernel-mode drivers. Create instructions
to synchronize hyperthreads instead of kernel calls (e.g. a large (128-bit?)
event register and a stall-on-event opcode). If interrupts were events, then a
thread could wait on an interrupt without entering the kernel.
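The primitive being proposed here could be sketched as a toy model. This is purely hypothetical - nothing like `EventRegister`, `signal`, or `stall_on` exists in the Mill or any shipping ISA; the names are invented for illustration:

```python
# Toy model of the proposed primitive: a wide event register plus a
# "stall until any of these bits is set" operation, so a thread could
# wait on a device interrupt without entering the kernel.

class EventRegister:
    def __init__(self, width=128):
        self.bits = 0
        self.width = width

    def signal(self, event):
        # E.g. raised directly by a device interrupt line.
        self.bits |= (1 << event)

    def stall_on(self, mask):
        """Model of a stall-on-event opcode: returns which awaited events
        fired. Real hardware would block the thread here, not poll."""
        return self.bits & mask

ev = EventRegister()
ev.signal(5)                             # device raises interrupt line 5
print(ev.stall_on((1 << 5) | (1 << 9)))  # → 32 (bit 5 fired)
```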

~~~
willvarfar
Actually, the Mill is designed to address this: it has a TLS segment for cheap
green threading, a single address space (SAS) for cheap syscalls and
microkernel architectures, cheap calls, and several details of IPC which are
not public yet.

~~~
JoeAltmaier
What about synchronization? Folks are terrified of threads because
synchronizing is so hard. But a threaded model can be the simplest one,
especially with message-passing models.

------
Mjolnir
Very, very interesting, thanks for sharing! What would the path be to using
existing code, and where would the Mill logically appear first?

Also, could something like Mill work well within the HSA/Fusion/hybrid GPGPU
paradigm? E.g. from my very amateur reading of your documents, it looks like a
much needed and very substantial improvement to single threaded code; how
would a mixed case where we have heavy matrix multiplication in some parts of
our code as part of a pipeline with sequential dependencies work? Would the
ideal case be a cluster of multi-core Mill chips (or some fast interconnect
fabric in a multi-socket system)?

Realistically, is this something that LLVM could relatively easily target? A
simple add in card that could give something like Julia an order of magnitude
improvement would be a very interesting proposition, especially in the HPC
market. I come at this mainly from an interest in how this will benefit
compute-intensive machine learning/AI applications.

Sorry for all the questions.

~~~
rcxdude
The latest talk on their website mentions the LLVM status in passing at the
end. Essentially they're moving their internal compiler over to use LLVM, but
it requires fixing/removing some assumptions in LLVM because the architecture
is so different, and the porting effort was interrupted by their emergence
from stealth mode to file patents.

~~~
Mjolnir
Thanks, I'll have a look at the talks.

------
fleitz
Great idea, but since it's all theoretical currently, I'm wondering how well
it will actually perform given the offloading to the compiler. Itanium was
capable of doing some amazing things, but the compiler tech never quite worked
out.

~~~
willvarfar
Ah, but the Mill was primarily designed by a compiler writer ;)

Here's Ivan's bio that is tagged on his talks:

"Ivan Godard has designed, implemented or led the teams for 11 compilers for a
variety of languages and targets, an operating system, an object-oriented
database, and four instruction set architectures. He participated in the
revision of Algol68 and is mentioned in its Report, was on the Green team that
won the Ada language competition, designed the Mary family of system
implementation languages, and was founding editor of the Machine Oriented
Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4
(Implementation languages) and was a member of the committee that produced the
IEEE and ISO floating-point standard 754-2011."

So actually it's been designed almost compiler-first :)

~~~
fleitz
Still interested in how it works in practice. I'm pretty sure the Itanium team
combined with Intel's compiler team have similar credentials.

I'm not saying it can't work, and I'm not saying it won't work, but we know
that most code chases pointers. While CPU and compiler design is above my pay
grade, I know that a lot of fancy CPU and compiler tricks that make things
twice as fast on some benchmark often lead to only 2 to 3% performance gains
on pointer-chasing code.

Not sure how the Mill is going to make my ruby webapp go 8 times as fast by
issuing 33 instructions instead of 4.

~~~
haberman
> Not sure how the Mill is going to make my ruby webapp go 8 times as fast by
> issuing 33 instructions instead of 4.

8x speed is not being claimed; 10x power/performance is. That could mean that
the app runs at the same speed but the CPU uses 10% of the power. A lot of the
power saving probably comes from eliminating many parts of modern CPUs, like
the out-of-order circuitry.

~~~
fleitz
Ok, so now that it's 10x power/performance I buy 10 of these things and it
still only delivers 5% more webpages.

This kind of mealymouthed microbenchmark crap is exactly what the industry
doesn't need. If I have a bunch of code that is pure in-order mul/div/add/sub,
then I put it on a GPU that I already have and it goes gangbusters. The
problem is that most code chases pointers.

Like I said, great idea; I'd love to see something that can actually serve
webpages 10x as fast or at 1/10th the power (and at a cost similar to today's
systems).

~~~
sp332
I never thought of serving webpages as being CPU-bound. Anyway, to get a 10x
speedup, you would have to buy enough of these to use as much power as
whatever you're replacing. So if one Mill CPU uses 2% as much power as a
Haswell, then you'd have to buy 50 of them to see a 10x performance
improvement over the Haswell.
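sp332's arithmetic can be checked with a quick back-of-envelope sketch. Note the 2% figure is sp332's hypothetical, not a published Mill number, and performance and power are normalized so one Haswell is 1 unit of work per 1 unit of power:

```python
# Back-of-envelope check of the reasoning above, with hypothetical numbers.
haswell_perf = 1.0
haswell_power = 1.0

mill_power = 0.02 * haswell_power   # assume one Mill draws 2% of a Haswell
perf_per_watt_gain = 10             # the claimed 10x power/performance
mill_perf = perf_per_watt_gain * (haswell_perf / haswell_power) * mill_power

n_chips = haswell_power / mill_power   # fill the same total power budget
total_perf = n_chips * mill_perf

print(round(n_chips), total_perf)   # 50 chips give 10x the performance
```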

------
cpr
Does anyone know how this compares with VLIW designs like the original
Yale/Multiflow machines? Seems very familiar.

(I ask as a survivor of Multiflow in the late 80's. ;-)

~~~
outside1234
Or more recently, how does this compare to the Itanic from Intel?

~~~
jmz92
Some of the memory ideas are similar--Itanium had some good ideas about
"hoisting" loads [1] which I think are more flexible than the Mill's solution.
In general, this is a larger departure from existing architectures than
Itanium was. Comparing it with Itanium, I doubt it will be successful in the
marketplace for these reasons:

- Nobody could write a competitive compiler for Itanium, in large part because it was just different (VLIW-style scheduling is hard). The Mill is stranger still.

- Itanium failed to get a foothold despite a huge marketing effort from the biggest player in the field.

- Right now, everybody's needs are being met by the combination of x86 and ARM (with some POWER, MIPS, and SPARC on the fringes). These are doing well enough right now that very few people are going to want to go through the work to port to a wildly new architecture.

[1]
[http://en.wikipedia.org/wiki/Advanced_load_address_table](http://en.wikipedia.org/wiki/Advanced_load_address_table)

~~~
jitl

> Right now, everybody's needs are being met by the combination of x86 and
> ARM (with some POWER, MIPS, and SPARC on the fringes). These are doing
> well enough right now that very few people are going to want to go through
> the work to port to a wildly new architecture.

That's not true at all. The biggest high-performance compute is being done on
special parallel architectures from Nvidia [1] (Tesla). Intel is trying to
bring x86 back into the race with its Xeon Phi co-processor boards [2].

[1]
[http://www.top500.org/lists/2013/11/](http://www.top500.org/lists/2013/11/)

[2]
[http://www.intel.com/content/www/us/en/processors/xeon/xeon-...](http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html)

~~~
Scaevolus
The Mill aims to be good at general purpose computation. HPC is _not_ general
purpose computation, and is a tiny fraction of the market.

------
solarexplorer
> The Mill has a 10x single-thread power/performance gain over conventional
> out-of-order (OoO) superscalar architectures

It would be nice to know how they got that number. Because it seems to be too
good to be true.

~~~
Guvante
I am pretty sure they are talking about per-cycle performance, since they can
do 33 operations per cycle. IIRC the peak performance of an Intel chip at the
moment is 6 FLOP per 2 cycles (or thereabouts).

Of course this is beyond ridiculous, since a 780 Ti can pull off 5 TFLOP/sec
on a little under a GHz clock; 5,000 FLOP per cycle is a little more than 33.

It seems like an interesting design, but comparing performance against what an
x64 chip can do is a bit silly; you can't just pick numbers at random and call
that the overall improvement.

~~~
pbsd
A Haswell core can do 2 vector multiply-adds per cycle, which results in a
peak of 32 single-precision FLOP per cycle per core or 16 double-precision
FLOP per cycle per core.
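For reference, the arithmetic behind those Haswell peak numbers works out as follows (assuming AVX2 with two FMA ports, and counting a fused multiply-add as 2 FLOP):

```python
# Peak FLOP/cycle for one Haswell core with AVX2 + FMA.
fma_units = 2      # two 256-bit FMA execution ports per core
flop_per_fma = 2   # a fused multiply-add counts as one mul + one add
sp_lanes = 8       # 256-bit vector / 32-bit single-precision floats
dp_lanes = 4       # 256-bit vector / 64-bit double-precision floats

sp_peak = fma_units * flop_per_fma * sp_lanes  # single precision
dp_peak = fma_units * flop_per_fma * dp_lanes  # double precision
print(sp_peak, dp_peak)  # → 32 16
```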

~~~
willvarfar
The instruction encoding talk starts with a comparison between the Mill, a
DSP, and Haswell, and tries to explain the basic math. The Mill is a DSP that
can run normal, "general purpose" code better - 10x better - than an OoO
superscalar. The Mill used in the comparison - one for your laptop - is able
to issue 8 SIMD integer ops and 2 SIMD FP ops each cycle, plus other logic.

~~~
pbsd
I was strictly replying to the Intel FLOPs claim of the parent comment. I have
only a faint idea how the Mill CPU works, so I can't really compare against
it.

From the little I have read, the Mill CPU looks like a cool idea, but I'm
skeptical about the claims. I'd rather see claims of efficiency on particular
kernels (this can be cherry-picked too, but at least it will be useful to
_somebody_) than pure instruction decoding/issuing numbers. Those are like
peak FLOPs: depending on the rest of the architecture they can become
effectively impossible to achieve in reality. In any case, I'm looking forward
to hearing more about this.

~~~
willvarfar
Apologies, I was replying to the thread in general and not your post in
particular.

Art has now published the 33-pipeline breakdown of the "Gold" Mill here:
[http://ootbcomp.com/topic/introduction-to-the-mill-cpu-progr...](http://ootbcomp.com/topic/introduction-to-the-mill-cpu-programming-model-2/#post-610)

A key thing generally is that vectorisation on the Mill is applicable to
almost all while loops, so it is about speeding up normal code (which is 80%
loops with conditions and flow of control) as well as classic math.
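As a rough illustration of what vectorising a while loop means, here is a sketch only - the Mill's actual mechanism (per the talks) involves per-element speculation metadata, which this toy does not model, and `vector_strlen` is an invented example:

```python
# Conceptual sketch of vectorizing a while loop: load a chunk of lanes
# speculatively, evaluate the exit condition per lane, and stop when any
# lane has hit it. Example: a strlen-style scan for a 0 terminator.

def vector_strlen(buf, width=8):
    """Count bytes before the first 0, a chunk of `width` lanes at a time."""
    n = 0
    for start in range(0, len(buf), width):
        chunk = buf[start:start + width]   # speculative "vector load"
        mask = [b != 0 for b in chunk]     # per-lane loop condition
        if all(mask):                      # no lane hit the terminator
            n += len(chunk)
        else:
            n += mask.index(False)         # lanes before the first 0 count
            return n
    return n

print(vector_strlen(b"hello world\x00garbage"))  # → 11
```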

------
sehugg
For those that are mainly software-oriented, the Lighterra overview posted
earlier is helpful background for understanding where VLIW fits into the zoo
of CPU architectures:

[http://www.lighterra.com/papers/modernmicroprocessors/](http://www.lighterra.com/papers/modernmicroprocessors/)

------
Symmetry
This whole thing is just horribly exciting for a computer architecture geek
like me. I am somewhat worried about the software side given the number of OS
changes that would have to be made to support this. But then again, there are
lots of places in the world where people are running simple RTOSes on high end
chips and the Mill probably has a good chance there. The initial plan to use
an older process and automated design means that the Mill can probably be
profitable in relatively modest volumes.

------
adamnemecek
This might be one of the most interesting things posted on HN.

------
petermonsson
There is something that I can't get to add up here. The phasing talk claims
that there are only 3 pipeline stages, compared to 5 in the textbook RISC
architecture or 14-16 in a conventional Intel processor, but this can't
possibly square with the 4-cycle division or the 5-cycle mispredict penalty.

What am I getting wrong?

~~~
willvarfar
The phase says when the op issues. It takes some number of cycles before it
retires. So a divide issues in the "op phase" in the second cycle, and if on
the particular Mill model it takes 4 cycles, then it retires on the fifth.

If there is a mispredict, there is a stall while the correct instruction is
fetched from the instruction L1 cache. If you are unlucky, it's not there and
you need to wait longer.
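The timing convention described above can be made explicit with a tiny helper. The numbers are just the ones from this thread; actual latencies vary by Mill model:

```python
# Issue/retire arithmetic from the explanation above: an op that issues in
# cycle `issue` with an N-cycle latency retires in cycle issue + N - 1.
def retire_cycle(issue, latency):
    return issue + latency - 1

# A divide issuing in the "op phase" (cycle 2) with 4-cycle latency
# retires on the fifth cycle, matching the comment above.
print(retire_cycle(issue=2, latency=4))  # → 5
```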

~~~
petermonsson
OK, so the phases aren't an apples-to-apples comparison with traditional
pipeline stages, but are more in line with the TI C6x fetch, decode, execute
pipeline, which for TI covers something like 4 fetch stages, 2 decode stages,
and between 1 and 5 execute stages. Thank you for the clarification.

~~~
willvarfar
We'll post the video of the Execution talk covering _phasing_ to the Mill
forums today or tomorrow or so: ootbcomp.com/forum/the-mill/architecture/

------
DiabloD3
I can't wait until designs like this become common.

------
the_mitsuhiko
I still want to know how to implement fork for it.

~~~
kristianp
There is some discussion of fork here:

[https://groups.google.com/forum/#!topic/comp.arch/sICkAag4ga...](https://groups.google.com/forum/#!topic/comp.arch/sICkAag4gao)

------
chmike
I'm skeptical of the belt's efficiency. Memory storage will be wasted. What do
we gain with it?

~~~
Rusky
From what I understand it would have very similar characteristics to current
register renaming. You just get direct access to the whole register file
rather than just a few ISA registers.

I think it would require some instruction scheduling to make optimal use of
it, but that means the silicon doesn't need that logic so cores can be smaller
and more efficient.
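A toy model of the belt idea may make the comparison concrete. This is a sketch based only on the public talks - the `Belt` class, its method names, and the fixed length of 8 are illustrative, not the real ISA:

```python
from collections import deque

class Belt:
    """Fixed-length queue of results, addressed by temporal position."""
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)  # oldest value falls off the end

    def drop(self, value):
        self.slots.appendleft(value)       # new results land at position 0

    def get(self, pos):
        return self.slots[pos]             # 0 = newest, 1 = next newest, ...

belt = Belt()
belt.drop(3)                          # some producer drops 3 onto the belt
belt.drop(4)                          # 4 lands at b0; 3 slides to b1
belt.drop(belt.get(0) + belt.get(1))  # an "add b0, b1" drops its result
print(belt.get(0))                    # → 7
```

The key contrast with registers: nothing is ever overwritten, so no rename logic is needed - values simply age off the end, which is why the compiler must schedule consumers before their operands expire.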

------
snorkel
Very interesting reading. Are such procs already being sold or is this still
on the workbench?

~~~
szatkus
No, it will be available in a few years :(

------
karavelov
To my eyes it seems that one of their sources of inspiration was the Transmeta
processors - a VLIW core with a software translator from some intermediate
bytecode (x86 in Transmeta's case). I hope they will do better this time.

~~~
jlouis
They don't translate. Rather, they compile code to their instruction set.

~~~
Symmetry
Well, the plan is to distribute an intermediate representation and then
specialize it to the particular Mill pipeline the first time you load the
binary. Probably a lot easier than translating something that wasn't designed
for it.

~~~
mjn
I believe IBM mainframes have traditionally used something like that: binary
code is shipped for a general mainframe architecture, and on first execution
is specialized to the hardware / performance characteristics of the particular
model within that architecture that you're running. Also allows for
transparent upgrades, since if you migrate to a new model, the binary will re-
specialize itself on the next execution, (ideally) taking advantage of
whatever fancy new hardware you bought.

------
cordite
How well could LLVM be converted to the mill intermediate language?

~~~
willvarfar
We are starting work on an LLVM back end now. The tool chain will be described
in an upcoming talk, so subscribe to the mailing list if you want to be in the
audience or watch any available live streams.

I am also going to make a doc or presentation called "A Sufficiently Smart
Compiler" to explain how easily the Mill can vectorise your normal code and so
on :)

