
“The Mill” – It just might Work - dochtman
http://jakob.engbloms.se/archives/2004
======
ChuckMcM
I am really rooting for these folks. After going to a talk on it about this
time last year and trying everything I could to pick it apart (they had good
answers for all my questions), I felt confident that it's going to be a pretty
awesome architecture. Assuming that reducing all of their great ideas to
practice doesn't reveal some amazing 'gotcha', I'm hoping to get an eval
system.

The things I'm watching most closely are the compiler work, since that was
such a huge issue on Itanium (literally a 6x difference in code execution
speed just from compiler changes), and putting some structured-data
(pointer-chasing) applications through their paces, which is always a good way
to flush out memory/CPU bottlenecks.
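
For anyone who wants to run that kind of test themselves, here's a minimal
sketch of a pointer-chasing microbenchmark; the node count and layout are
arbitrary choices of mine, nothing Mill-specific:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NODES (1 << 22)   /* ~4M nodes; big enough to defeat typical caches */

    struct node { struct node *next; };

    int main(void) {
        struct node *pool = malloc(sizeof(struct node) * NODES);
        size_t *perm = malloc(sizeof(size_t) * NODES);
        /* Build a random permutation (Fisher-Yates), then link the nodes
           into one big cycle in that order. Every load depends on the
           previous one, so the chase exposes raw memory latency rather
           than bandwidth. */
        for (size_t i = 0; i < NODES; i++) perm[i] = i;
        srand(42);
        for (size_t i = NODES - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        for (size_t i = 0; i < NODES; i++)
            pool[perm[i]].next = &pool[perm[(i + 1) % NODES]];

        struct node *p = &pool[perm[0]];
        clock_t start = clock();
        for (size_t i = 0; i < NODES; i++) p = p->next;   /* the chase */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        /* Printing p keeps the compiler from deleting the loop. */
        printf("%.1f ns per hop (%p)\n", secs * 1e9 / NODES, (void *)p);
        free(perm); free(pool);
        return 0;
    }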

~~~
pjmlp
From the posts on Reddit, they are saying their compiler is based on a JIT
architecture, similar to how mainframes work.

Basically, they do AOT compilation and optimization at installation time, but
they have postponed further details until an upcoming talk.

~~~
igodard
Not quite: only scheduling and binary creation are done at install time.
Instruction selection and optimization are done at compile time in the usual
way.

~~~
pjmlp
Thanks for the hint.

------
Symmetry
It's really exciting, but here are a few worries I have about their ability to
meet their performance claims:

1) I don't see that they'll be able to save much power on their cache
hierarchy relative to conventional machines. Sure, backless memory will save
them some traffic, but on the other hand they won't be able to take advantage
of the sort of immense optimization resources that Intel has, so that fraction
of chip power relative to performance isn't going away.

2) The bypass network on a conventional out of order chip takes up an amount
of power similar to the execution units, and I expect that the Mill's belt
will be roughly equivalent.

3) I'm worried on the software front. The differences between how they and
LLVM handle pointers are causing them trouble, and porting an OS to the Mill
looks to be a pretty complicated business compared to most other
architectures. It's certainly not impossible, but it's still a big problem if
they're worried about adoption.

All of which is to say, I think the 10x they're talking about is unrealistic.
The Mill is full of horribly clever ideas which I'm really excited about and I
do think their approach seems workable and advantageous, but I'd expect 3x at
most when they've had time to optimize. The structures in a modern CPU that
provide out of order execution and the top-level TLB are big and power hungry,
but they're not 90% of power use.

If they're going to hit it big they'll probably start out in high-end
embedded: anything where you have an RTOS running on a fast processor, and
your set of software is small enough that porting it all isn't a big problem.

Also, the metadata isn't like a None in Python, it's like a Maybe in Haskell!
You can string them together and only throw an exception on side effects, in a
way that makes speculation (and vector operations) on this machine much nicer.
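
Roughly this, as a toy software model (the names and the trap behavior here
are my invention for illustration, not the Mill's actual encoding):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* A value plus a "Not a Result" flag, loosely like Haskell's Maybe. */
    typedef struct { int64_t val; bool nar; } slot;

    static slot nar(void)       { return (slot){0, true}; }
    static slot some(int64_t v) { return (slot){v, false}; }

    /* Arithmetic just propagates NaR instead of faulting, so a whole
       speculated chain can run past a bad load without an exception. */
    static slot add(slot a, slot b) {
        if (a.nar || b.nar) return nar();
        return some(a.val + b.val);
    }

    /* Only a side effect (here, a store/print) actually traps on NaR. */
    static void store(slot a) {
        if (a.nar) { fprintf(stderr, "fault: NaR reached a side effect\n"); exit(1); }
        printf("stored %lld\n", (long long)a.val);
    }

    int main(void) {
        slot ok  = add(some(2), some(3));  /* fine: 5 */
        slot bad = add(nar(), some(1));    /* NaR flows through silently */
        store(ok);
        store(bad);                        /* first point where we fault */
        return 0;
    }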

EDIT: Whatever the result of the Mill itself, it contains a large number of
astoundingly clever ideas, some of which would be useful even without all the
other ideas. For example, you could drop Mill load semantics into most
in-order processors; you'd have to do something different with how they
interact with function calls, but it would still be pretty useful.

EDIT2: I may sound pessimistic above, but I would still totally give them
money if I were a registered investor. The outside view just says that new
processors that try to change lots of things at once fare pretty badly, even
if, on the inside view, they have a good story for how they're going to
overcome the challenges they'll face.

~~~
rst
One tricky issue in porting an existing OS to the Mill is their AS/400-style
portability strategy (as outlined by Godard in the comments): binaries are
distributed as "Mill IR" and compiled to a local binary, which gets cached for
the next execution. The problem is where to put this cache, and how the OS
deals with it.
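
To make the mechanics concrete, here's a rough sketch of what an exec-time
hook for that strategy might look like; the cache directory, the
`mill-specialize` tool name, and the hashing scheme are all made up for
illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define CACHE_DIR "/var/cache/mill-specialize"   /* hypothetical location */

    /* Stand-in content hash (FNV-1a over the file bytes). */
    static unsigned long content_hash(const char *path) {
        FILE *f = fopen(path, "rb");
        if (!f) { perror(path); exit(1); }
        unsigned long h = 1469598103934665603UL;
        int c;
        while ((c = fgetc(f)) != EOF) h = (h ^ (unsigned long)c) * 1099511628211UL;
        fclose(f);
        return h;
    }

    /* Given a binary shipped as Mill IR, exec a cached native image if one
       exists; otherwise specialise and cache it first. Who has write
       permission on CACHE_DIR is exactly the question raised above. */
    static void run_specialized(const char *ir_path, char *const argv[]) {
        char cached[256], cmd[1024];
        snprintf(cached, sizeof cached, CACHE_DIR "/%016lx.bin",
                 content_hash(ir_path));
        if (access(cached, X_OK) != 0) {
            snprintf(cmd, sizeof cmd, "mill-specialize %s -o %s", ir_path, cached);
            if (system(cmd) != 0) { fprintf(stderr, "specialise failed\n"); exit(1); }
        }
        execv(cached, argv);   /* only returns on error */
        perror("execv");
        exit(1);
    }

    int main(int argc, char *argv[]) {
        if (argc < 2) { fprintf(stderr, "usage: %s <mill-ir-binary>\n", argv[0]); return 2; }
        run_specialized(argv[1], argv + 1);
    }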

Say, for example, you put the cache in the file system. OK, who has write
permission, and can the "cached local binary" be written on first execution by
a user who doesn't have write permission on the binary being "localized"? (Or
could the translator be run on "apt-get install ..." --- and if so, who
trains apt to do that?) And so forth.

Putting the cache someplace that is hidden from the "normal" OS is possible,
but that has problems, too. At the very least, you'd need to figure out what
to do if the hidden whatever-it-is runs out of space. (And how doing I/O to it
would interfere with other OS-level performance optimizations, like scheduling
of disk seeks.)

IBM could finesse these problems on the AS/400 because they controlled
everything about it, hardware to OS to UI. And there are niches with high
performance requirements that could live with a nonstandard OS. But for
general-purpose computing, it could get awkward. (Perhaps awkward enough to
consider Transmeta's strategy of doing JIT translation to actual machine
instructions, which let them keep the "real instruction" cache entirely in
dedicated RAM --- though that has problems too.)

~~~
_wmd
Assuming their translation step is as cheap as they claim (a single-pass
rewrite that substitutes software macros for missing hardware), it's
conceivable that there isn't much value in persistently caching the result.

In that case, everything could be contained in a single ld.so patch, or
(doubtfully) a modification to the kernel ELF loader.

Finally, and although it is less common now, in prior days Linux _already had_
a post-install processing step for binaries on certain distributions:
prelink(8)
([https://en.wikipedia.org/wiki/Prelink](https://en.wikipedia.org/wiki/Prelink))

~~~
rst
Also Mac OS X "prebinding" and (in later releases) dyld caching. Though just
skipping the cache entirely is certainly the quickest way to get something
running --- and quite a few plausible server workloads will pay the penalty
mainly at startup, when it might not matter so much.

(But yeah, keep it in user space, if only to make it easier to debug!)

------
cromwellian
I actually had high hopes for Sun's Rock architecture, which had a rather
elegant hardware-scout/speculative-threading system to hide memory latencies,
and, instead of a reorder buffer, a neat checkpoint table that simultaneously
gave you out-of-order retirement as well as hardware support for software
transactional memory.

Alas, it looked good on paper but died in practice, either because the theory
was flawed (though academic simulations seemed to suggest it would be a win),
or because Sun didn't have the resources to invest in it properly and Oracle
killed it.

Claiming a breakthrough in VLIW static scheduling that yields 2.3x seems
interesting, but the reality may be different, not to mention that it depends
on what kinds of workloads would get these speedups. If you compare the way
NVidia's and AMD's GPUs work, in particular AMD's, they rely heavily on static
analysis, but in the end, extracting maximum performance is highly dependent
on structuring your workload to deal with the way the underlying architecture
executes kernels.

If it turns out you have to actually restructure your code to get this 2.3x
performance, rather than just recompiling with gcc for a different
architecture, then it's not really an apples-to-apples speedup.

~~~
bcantrill
Having been at Sun and having been (too) intimately involved with the
microprocessor side of the house for way too damn long, I can tell you that
when it came to microprocessors, Sun was all vision and no execution. The
theme that was repeated over several microprocessors: a new, big idea that
made all of the DEs horny, but that proved annoyingly tricky to implement.
Sacrifices would then be made elsewhere in order to make a tape-out date
and/or a power or die budget. But these sacrifices would be made without a real
understanding of the consequences -- and the chip would arrive severely
compromised. (Or wouldn't arrive at all.) Examples abound but include Viking,
Cheetah, UltraJava/NanoJava/PicoJava, MAJC, Millennium (cancelled), Niagara
(shared FPU!) and ROC (originally "Regatta-on-a-chip", but became "Rock" only
when it was clear that it was going to be so late that it wasn't going to be
meaningfully competing with IBM's Regatta after all). The only microprocessor
that Sun really got unequivocally right (on time, on budget, leading
performance, basically worked) was Spitfire -- but even then, on the
subsequent shrinks (Blackbird and beyond) the grievous e-cache design flaws
basically killed it.

Point is: in microprocessors, execution isn't everything -- it's the only
thing.

~~~
gonzo
Hi Brian.

Spitfire was only on-time compared to the debacle of Viking and Voyager.

Thanks for dredging up the nightmare. :-)

~~~
bcantrill
Man, Voyager -- forgot that one!

And "debacle" is really the only word for Viking. A major rite of passage in
kernel development in the 1990s was finding your first Viking bug; I found
mine within a month of joining in 1996 (a logic bug whereby psr.pil was not
honored for the three "settling" nops following wrpsr, allowing a low priority
interrupt to tunnel in -- affecting all sun4m/sun4d CPUs). Bonwick's was still
the king of the hill, though: he was the one who discovered that the i-cache
wasn't grounded out properly, causing instructions with enough zeros in them
to flip a bit (!!). The story of tracking that one down (branches would go to
the wrong place) was our equivalent of the Norse sagas, an oral tradition
handed down from engineer to engineer over the generations. Good times!

------
rayiner
This is a detailed description of the architecture:
[http://millcomputing.com/topic/introduction-to-the-mill-cpu-programming-model-2](http://millcomputing.com/topic/introduction-to-the-mill-cpu-programming-model-2).

It describes Mill's approach to specifying inter-instruction dependencies,
grouping instructions, and handling variable-latency memory instructions.

------
WhitneyLand
Who is this guy? Where can you teach post-doc computer science without ever
having taken a course in CS, let alone earned a degree?

Obviously a degree is not a necessary condition for success, and it has always
bothered me that people like Michael Faraday had to battle academic and class
prejudice before changing the world.

However, I don't think it's unreasonable to expect a bio of past
projects/companies/research papers.

"Despite having taught Computer Science at the graduate and post-doctorate
levels, he has no degrees and has never taken a course in Computer Science"

~~~
igodard
My first compiler (still in use) was for the Burroughs B6500 mainframe in
1970. During my brief and inglorious college career I did not take a CS class.
In fact, there were no CS classes. The college didn't even own a computer.
Yes, there were such times, in living memory, hard as it may be to imagine.

These days you need a union card (i.e. a CS degree) to get a job. That's a
shame. I've been refused a university position for lack of a PhD - to teach a
subject that I largely invented. There's something wrong with that.

We have no such requirements on the Mill team.

~~~
DSingularity
Indeed there is something wrong here. I'm sure it isn't easy to identify those
with scholarly authority; it's sad to see that they are missing it.

That being said, you are still having scholarly impact! Your talks have taught
me to question all my fundamental assumptions about architecture, compilers,
and computing!

I love following your people's work, and I can't wait to see its product!

------
dochtman
I wonder what the compilers would be like. If these guys contribute, say, an
LLVM backend, that would make it so much easier to support.

~~~
willvarfar
(Mill team)

We are in fact working on an LLVM backend right now.

This will generate Mill IR, which will be 'specialised' on-target so that it
will run on all Mill family members.

~~~
dochtman
Will you contribute it to upstream, or keep it closed source?

~~~
ape4
No good reason to make it closed source. Any users would need Mill hardware.

~~~
dochtman
icc isn't open source, is it?

~~~
bri3d
Intel is in a perpetual war with another vendor implementing the same
instruction set, for which some of their optimizations are generally
applicable, to the point that ICC used to intentionally cripple AMD hardware:
[http://www.agner.org/optimize/blog/read.php?i=49#49](http://www.agner.org/optimize/blog/read.php?i=49#49)

Thus there's a disincentive for Intel to release their optimizer's tricks: not
only are at least some percentage of the optimizations applicable to their
competitor's microarchitecture implementing the same ISA, but they probably
reveal various Intel CPU internals that Intel considers trade secrets (similar
to the argument against open-sourcing 3D drivers and shader compilers).

Mill is not going to be locked into a bitter head-to-head battle with someone
else trying to implement the same ISA better (at least not for a long time),
so there's no incentive for them to hide their CPU's internal optimizations
and no competition for which compiler optimizations could be generally
applicable.

------
dkhenry
So where do I buy one and test it myself? I love the theory, and some of the
claims are awesome, but I am reminded of the Cell-BE and the chatter around it
at release time. It wasn't until we got the Cell into the hands of developers
that we learned its real limitations. I want a Mill I can write programs for
and run benchmarks against. My benchmarks, on my bench.

~~~
bryanlarsen
If they raise the money they're looking for, you should be able to do that in
2 to 3 years.

[http://electronics360.globalspec.com/article/3843/startup-seeks-funds-to-realize-belt-processor](http://electronics360.globalspec.com/article/3843/startup-seeks-funds-to-realize-belt-processor)

~~~
pdq
3 years is a bare minimum. They would need to staff up hardware design,
verification, performance modeling, compiler, OS, and software teams, and
license all the necessary EDA software (simulators, waveform viewers, etc.).

My guess would be a minimum of $20 million to get to a solid FPGA prototype in
3 years. Then, if that were successful, they could spend another $25 million
and get it into silicon on a good process (20nm or below).

------
vietor
It's easy* to build a dramatically better-performing and more efficient CPU
than what's currently available if you don't have to restrict yourself to
existing code and compilers.

The exciting thing to me is that between wider availability of open source
compilers and code, and a larger amount of user level code being written in
interpreted languages (so only the language runtime needs to be rebuilt),
there might actually be a future in alternative architectures.

* As these things go...

------
Brashman
What differentiates the Mill from Itanium?

Also, what are the 2.3x power/performance improvements based on? Is there
silicon for this?

~~~
gnoway
I'm actually wondering where the 2.3x number he cites is coming from. I don't
believe the Mill team is claiming 2.3x performance advantage over Haswell
while using 2.3x less power, which is how I read that comment.

I watched the replay of the Execution talk here:

[http://millcomputing.com/docs/execution/](http://millcomputing.com/docs/execution/)

I'd recommend watching all of the talks if you have the time.

In this talk, maybe 2/3 to 3/4 of the way through, Godard made a claim about
performance relative to OOO, 'like a Haswell' or Haswell specifically - I
can't remember which, and I can't go through the video again right now. He
said something to the effect that they would approach performance for
{OOO|~Haswell|Haswell} while using less power. It was a very general
statement, which I took to mean that a Mill family member intended for
general-purpose desktop use could approach - not match or exceed - the
performance of a typical desktop processor while using less power. Which is
certainly something we've heard before. And I think the statement is coming
from theoretical calculation.

As far as differences from Itanium: I don't know anything about processor
design, but I am pretty certain the belt concept central to the Mill is not
used in Itanium/EPIC. I think it's likely that the Mill is intended to support
more operations per instruction than Itanium. The other thing is that there is
no one 'Mill processor' - it's more a design scheme and an ISA.

~~~
igodard
It would be nice if there were a single number that could be justified by
measurement, but there's no hardware yet to measure and there would not be a
single number even if the hardware existed. That's because there's not just
one "Mill", it's a family.

What we can say is that for equivalent compute capacity (i.e. number of
functional units) the Mill will give somewhat better performance at much
better power. Internally, the Mill's power budget is essentially the same as
that of a DSP with the same functional capacity, because they work in much the
same way. DSPs have been around for a long time, and their power/performance
comparisons with OOO have long been published. For equal process and equal
MIPS capacity, the core's power is 8-12x better than OOO, and we expect to do
at least as well.

That's for equal compute capacity. Every architecture has a cap on scaling
compute capacity. The cap seems to be around 8 pipelines in OOO machines; try
to add more and you just slow down everything more than you gain from the
extra pipes.

The Mill has caps too. We don't know yet where the diminishing-returns point
will be in detail, but our sims and engineering experience suggest that it
will be somewhere in the 30-40 pipe region. Such a high-end Mill would swap a
good deal - but not all - of its power advantage for more horsepower.

You have the inverse story at the low end of the family: the lowest Mill has
only five pipes, and no floating point at all. Not barn-burning performance,
but much lower power even than existing non-OOO offerings.

So there's no one number, and no hard measurements anyway. If you doubt our
projections then you are entitled to your opinion; in fact there's a fair
amount of disagreement even within the Mill team as to what we will see in the
actual chip. But the team includes quite a few who have been doing this for
years, and in several cases were involved in the creation of the chips that
you would compare the Mill against, so their considered opinion should not be
rejected out of hand.

~~~
Brashman
From what I've seen and heard in the (academic) computer architecture
community, performance and power gains often diminish when moving from theory
to simulation to RTL and into silicon (It seems the Mill team is aware of this
too). Thus, I tend to be skeptical about large performance/power gains. On the
other hand, it's not entirely unreasonable that VLIW could see these gains.
I'll be curious to see what happens with Mill. It seems to me the biggest
challenges with VLIW architectures are on the compiler side and the need to
recompile legacy code.

------
comatose_kid
I worked on a VLIW processor long ago and it had a theoretical peak of 700
MIPS (iirc) back in 2000. It was a neat architecture but required fairly low
level knowledge to get the most out of it.

~~~
shavenwarthog2
Sounds like the Intel i860 from around 1990. Frighteningly fast VLIW in
theory, but in practice not so much. I think untweaked code ran at 3% of max
speed.

~~~
DiabloD3
The problem there is, Intel intentionally kills any non-x86 Intel arch. For
example, the Itanium was brilliant, and then they repeatedly threw it under
the bus until people hated it.

~~~
gnaffle
That's not really how the story goes, though. Intel spent a lot of effort on
getting Itanium to succeed, and they really dragged their feet on x86-64. It
was the market that decided that the price/performance of Itanium wasn't worth
it.

~~~
DiabloD3
Thats how a lot of people tell the story, but I just don't agree with it.
There is zero evidence that the market killed Itanium when Intel was already
trying to kill it because it was eating into Xeon sales on high end platforms.

------
DiabloD3
I want to replace every computer in the world with this.

------
nly
My biggest concern is Intel will just buy and bury it.

~~~
DiabloD3
OOB (Out-of-the-Box Computing, the company behind the Mill) is unlikely to
ever sell.

------
Xdes
Can't wait to buy a Mill and mess around with it. Hopefully it isn't more
expensive than current desktop or server processors.

------
JoeAltmaier
How is this different from a conventional register model where the compiler
stores each result into a new register, round-robin? That would be a belt too.

~~~
axman6
That works fine until you have things like branches and function calls. Most
call conventions specify which registers arguments are in, and with what
you're proposing each function call needs to know where it is up to in the
rotation at the beginning of the call. If the ahrdware looks after it (either
through a index register that stores the position of the next write), then it
becomes easier; but no one's doing that, and afaik, the mill is the first
machine to do anything close. Stack machines are sort of similar, but word
differently because they need to remember all values on the stack untill
they're popped. The mill just forgets old results and the compiler must make
sure they're still available when they're needed later.
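
A toy model of that forgetting, with the belt width and names invented for
illustration:

    #include <stdio.h>

    #define BELT_LEN 8   /* real Mill members vary; 8 is just for the toy */

    /* The belt as a shifting window: each new result drops in at position
       0 and everything else moves back one; whatever was at the end simply
       falls off. Consumers address results by how long ago they were made,
       not by a register name. */
    typedef struct { long slot[BELT_LEN]; } belt;

    static void drop(belt *b, long result) {
        for (int i = BELT_LEN - 1; i > 0; i--) b->slot[i] = b->slot[i - 1];
        b->slot[0] = result;   /* the oldest value just fell off the far end */
    }

    int main(void) {
        belt b = {{0}};
        drop(&b, 10);          /* some earlier results... */
        drop(&b, 20);
        drop(&b, 30);
        /* "add b2, b0" in belt terms: positions count back in time. */
        drop(&b, b.slot[2] + b.slot[0]);         /* 10 + 30 = 40 */
        printf("front of belt: %ld\n", b.slot[0]);   /* prints 40 */
        /* If a value is still needed after BELT_LEN more drops, the
           compiler must have copied it to the scratchpad first. */
        return 0;
    }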

------
panduwana
If the belt operation could be changed from the current "take any two items on
the belt, process them, put the result at the front of the belt" to "take the
two frontmost items on the belt, process them, put the result anywhere on the
belt", we could save some bits and make instructions shorter (good for
mobile):

    currently: OP load-address-1 load-address-2   // output always goes to the front of the belt

    proposed:  OP store-address                   // inputs are always the two frontmost items on the belt

~~~
axman6
How do you get ILP from this? Seems like it would make scheduling much more
difficult because you now need to make sure that that the results you need for
each instruction are in the order in the belt you need, and you have to be
able to execute any order of instruction types. The mill can run fast partly
because and instruction can use any result on the belt, and similar
instructions are grouped together making decode simpler (from memory, they
have two separate decoders and each instruction has all say arith instructions
grouped and decoded by one decode, and all the others decoded by the other).

Basically, I don't see how you can use what you suggest to do, in one
instruction:

    [ add positions 7 and 5 | multiply positions 7 and 2 | call f on 4 and 5 | branch to foo if position 3 was LT else bar ]

ending up with the belt

    [ [7]+[5], [7]*[2], f([4],[5]) ... ]

or whatever you like. All you need to do to schedule the Mill is to perform as
many operations in parallel as the hardware can do, and then work out where
their results will land, to create the next instruction.
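
A sketch of that bookkeeping, under the simplifying assumption that all of an
instruction's results drop onto the belt together (the order in which
simultaneous results land is a hardware-defined detail I'm not modeling):

    #include <stdio.h>

    /* After an instruction whose ops produce n_results values, every
       prior belt position slides back by n_results. The scheduler just
       applies that shift to the positions later instructions will use. */
    static int renumber(int old_pos, int n_results) {
        return old_pos + n_results;
    }

    int main(void) {
        /* Say instruction 1 issues 3 ops in parallel, so 3 results drop.
           A value that sat at position 4 beforehand is afterwards at: */
        printf("old b4 is now b%d\n", renumber(4, 3));   /* prints b7 */
        return 0;
    }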

~~~
panduwana
In the linked article it is said that "According to prior research, only some
13% of values are used more than once". So based on the very research the Mill
is built on, your example, and the case where an "instruction can use any
result on the belt", is actually the minority case.

As for scheduling my proposed store-addressed belt: you perform as many
operations in parallel as you can, then for each operation find the operation
that depends on its result, calculate the distance between them, and assign
that as the store address. The compiler has more work to do, yes, but it's not
"much more difficult".
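
For what it's worth, here's a toy pass computing those distances over a
straight-line op list; the op names, the single-consumer assumption, and the
representation are all invented for illustration:

    #include <stdio.h>

    /* Each op records which later op (by index) consumes its result; the
       store-address is then just the index distance from producer to
       consumer, computed in one forward pass. -1 means the result is unused. */
    struct op { const char *name; int consumer; };

    int main(void) {
        struct op ops[] = {
            {"load  a", 2},   /* used by op 2 */
            {"load  b", 2},   /* used by op 2 */
            {"add",     3},   /* used by op 3 */
            {"store t", -1},
        };
        for (int i = 0; i < 4; i++) {
            if (ops[i].consumer >= 0)
                printf("%s -> store-address %d\n", ops[i].name,
                       ops[i].consumer - i);
            else
                printf("%s -> (no consumer)\n", ops[i].name);
        }
        return 0;
    }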

~~~
baking
Offhand, I would say maybe one difference is that your model is trying to
predict where the belt will be in the future while the Mill is looking
backwards to find where the belt was in the past.

Another issue is that you would have to process the entire instruction in
order to know where each operation gets its input. (How many operations in the
instruction are taking things off the belt before I get my data?) In the Mill,
the operations are parsed in parallel, and they have all the information they
need to start processing as soon as the instruction (block) is loaded into the
buffer.

The size of the belt is a very finely tuned constraint (arrived at using
simulations) that basically depends on how many cycles you have to save a
value to the scratchpad memory (if needed) before it "drops off" the belt.
There is a lecture that describes why it takes the number of cycles it does;
if you watch it, you will probably understand better why the Mill is not about
what is easy or hard for the compiler, but all about getting the silicon to
jump through hoops fast and efficiently.

------
al2o3cr
Verilog or GTFO.

------
klausnrooster
Soon with Memristor RAM/SSD!

