“The Mill” – It just might Work (engbloms.se)
274 points by dochtman on March 27, 2014 | 94 comments



I am really rooting for these folks. After going to a talk on it last year about this time and trying everything I could to pick it apart (they had good answers for all my questions), I felt confident that it's going to be a pretty awesome architecture. Assuming that reducing all of their great ideas to practice doesn't reveal some amazing 'gotcha', I'm hoping to get an eval system.

The things I'm most closely watching are the compiler story, since this was such a huge issue on Itanium (literally a 6x difference in code execution speed just from compiler changes), and putting some structured-data (pointer-chasing) applications through their paces, which is always a good way to flush out memory/CPU bottlenecks.


From the posts on Reddit they are saying their compiler is based on a JIT-style architecture, similar to how mainframes work.

Basically doing AOT compilation and optimizations on installation, but they have deferred more details to an upcoming talk.


Not quite: only scheduling and binary creation are done at install time. Instruction selection and optimization are done during compilation in the usual way.


Thanks for the hint.


It's really exciting, but here are a few worries I have about their ability to meet their performance claims:

1) I don't see that they'll be able to save much power on their cache hierarchy relative to conventional machines. Sure, backless memory will save them some traffic, but on the other hand they won't be able to take advantage of the sorts of immense optimization resources that Intel has, so that fraction of chip power relative to performance isn't going away.

2) The bypass network on a conventional out of order chip takes up an amount of power similar to the execution units, and I expect that the Mill's belt will be roughly equivalent.

3) I'm worried on the software front. The differences between how they and LLVM handle pointers are causing them trouble, and porting an OS to the Mill looks to be a pretty complicated business compared to most other architectures. It's certainly not impossible, but it's still a big problem if they're worried about adoption.

All of which is to say, I think the 10x they're talking about is unrealistic. The Mill is full of horribly clever ideas which I'm really excited about and I do think their approach seems workable and advantageous, but I'd expect 3x at most when they've had time to optimize. The structures in a modern CPU that provide out of order execution and the top-level TLB are big and power hungry, but they're not 90% of power use.

If they're going to hit it big they'll probably start out in high-end embedded: anything where you have an RTOS running on a fast processor, and your set of software is small enough that porting it all isn't a big problem.

Also, the metadata isn't like a None in Python, it's like a Maybe in Haskell! You can string them together and only throw an exception on side effects, in a way that makes speculation (and vector operations) on this machine much nicer.
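Roughly, in Python terms (my own toy illustration, not anything from the Mill team): arithmetic on a NaR just propagates it, and the deferred fault only fires when a real side effect consumes the value.

    # Toy model of NaR ("Not a Result") metadata chaining through speculative ops.
    class NaR:
        def __init__(self, reason):
            self.reason = reason              # why the value is poisoned

    def add(a, b):
        # Arithmetic propagates NaR instead of faulting, like Maybe/bind.
        if isinstance(a, NaR): return a
        if isinstance(b, NaR): return b
        return a + b

    memory = {}

    def store(addr, value):
        # The side effect is where the deferred fault finally fires.
        if isinstance(value, NaR):
            raise RuntimeError("fault on store: " + value.reason)
        memory[addr] = value

    x = NaR("speculative load from a bad address")
    y = add(add(x, 1), 2)     # no fault yet; speculation continues freely
    store(0x10, 3)            # fine
    store(0x20, y)            # the fault surfaces only here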

EDIT: Whatever the result of the Mill itself, it contains a large number of astoundingly clever ideas, some of which would be useful even without all the others. For example, you could drop Mill load semantics into most in-order processors; you'd have to do something different about how they interact with function calls, but it would still be pretty useful.

EDIT2: I may sound pessimistic above, but I would still totally give them money if I were a registered investor. The outside view just says that new processors that try to change lots of things at once tend to do pretty badly, even if, on the inside view, they have a good story for how they're going to overcome the challenges they'll face.


#1) Mill cache is relatively conventional (except with 9-bit bytes). Everybody uses the EDA tools to create caches, so everybody will get similar power numbers. However, there's a lot more to the hierarchy power budget than the caches: read buffers, write buffers, pin drivers, etc. The Mill has neither read nor write buffering, and backless lines don't drive pins. We won't have good numbers until we have gate-level sims of the hierarchy, so for now all there is to go on is skill and expertise. We're not worried; YMMV.

#2) The bypass is similar for the FU-to-FU paths, but an OOO has to also feed from the renames that the Mill doesn't have.

#3) The OS port is a largely solved problem: we expect to use the L4 microkernel as a base and the existing L4-based Linux etc. implementations on that. https://en.wikipedia.org/wiki/L4_microkernel_family. Porting L4 to the Mill is pretty easy; we designed it that way :-)


As I understand it, Mill has no privilege mode so I don't see why you need even L4! Providing a Unix compatible API (all except fork()) is basically just building a modular emulator. Mostly grunge work once you work out the access rights matrix :-)


How far along are you on the OS work? Have you booted anything on the simulators?


One tricky issue in porting an existing OS to the Mill is their AS/400-style portability strategy (as outlined by Godard in the comments): binaries are distributed as "Mill IR", and compiled to local binary, which gets cached for the next execution. The problem is where to put this cache, and how the OS deals with it.

Say, for example, you put the cache in the file system. OK, who has write permission, and can the "cached local binary" be written on first execution by a user who doesn't have write permission on the binary being "localized"? (Or could the translator be run on "apt-get install ..." --- and if so, who trains apt to do that?) And so forth.

Putting the cache someplace that is hidden from the "normal" OS is possible, but that has problems, too. At the very least, you'd need to figure out what to do if the hidden whatever-it-is runs out of space. (And how doing I/O to it would interfere with other OS-level performance optimizations, like scheduling of disk seeks.)

IBM could finesse these problems on the AS/400 because they controlled everything about it, hardware to OS to UI. And there are niches with high performance requirements that could live with a nonstandard OS. But for general-purpose computing, it could get awkward. (Perhaps awkward enough to consider TransMeta's strategy of doing JIT translation to actual machine instructions, which let them keep the "real instruction" cache entirely in dedicated RAM --- though that has problems too.)


Assuming their translation step is as cheap as they claim (a single-pass rewrite that substitutes software macros for missing hardware), it's conceivable that there isn't much value in persistently caching the result.

In that case, everything could be contained in a single ld.so patch, or (doubtfully) a modification to the kernel ELF loader.

Finally, and although it is less common now, in prior days Linux already had a post-install processing step for binaries on certain distributions - prelink(8) ( https://en.wikipedia.org/wiki/Prelink )


Also Mac OS X "prebinding" and (in later releases) dyld caching. Though just skipping the cache entirely is certainly the quickest way to get something running --- and quite a few plausible server workloads will pay the penalty mainly at startup, when it might not matter so much.

(But yeah, keep it in user space, if only to make it easier to debug!)


Sounds like Java and .NET bytecode. I don't deal with Java much, but in the .NET world IL is JITted on demand into memory and thrown away when the process is recycled, unless you run ngen, which is difficult to do in many situations (web applications).

I think many underestimate the overhead involved in compiling IL to native, including MS themselves. There are user-perceivable delays in application startup time in larger code bases that lead to a worse user experience. Why they don't cache JITted code to disk after 5-6 versions of .NET is beyond me; even Mono has an AOT compiler.

I am starting to appreciate the AOT approach of "native" code (C, C++, Go): do as much as possible one time, at compile time. Don't make the user wait because you want to distribute a single portable binary.


It's often recommended to run ngen as a part of the user install process for larger applications.


Sometimes ngen produces worse code than the JIT does, since it has no knowledge of execution state. Precompile the assemblies that can be easily optimized, but in certain cases you may actually be better off losing a couple of cycles waiting for the JIT to compile a method on first call.


I sort of assumed that they would just use fat binaries, and store the cached, translated version in the same file as the intermediate code.


Exactly. Most will specialize at install time. Code without an install step will specialize at first execution; the IR is designed for specializer speed.

The specializer can also be run free-standing, to create ROMs and for installations that want to ensure distribution uniformity.


Specialization at install time is, in a lot of ways, the most reasonable answer. (It completely sidesteps the permissioning issues involved in write permission on a global cache at first execution, for example.)

However, hacking an existing package manager to do this job may not be entirely trivial --- and, in typical server environments these days, there will often be more than one package manager to hack. Language-based package managers like Rubygems and npm, for example, build binary extensions for Ruby and Node/Javascript as part of their job. Updating any of them to deal with an additional "specialization" step may not be much of a big deal --- but dealing with all of them might be. (Particularly if you have sysadmins that like to, say, run something like debsums to verify that the installed files are exactly the same as the ones in the package --- which might fail on a "fattened" binary.)

If you've already had coffee with the maintainers of, say, apt (or rpm), npm (or Rubygems, or PIP), and a few others, hashed out the issues, and have worked out that it's no big deal, that's great! But if not --- management of servers these days is complicated in ways that you'd never guess from just looking at the hardware, and a few of those coffee chats might be enlightening.


I don't think it's that difficult for package managers to deal with. All of them have a way to hook post-install actions to run an arbitrary script. You might wind up with another step added to the scripts used to produce the package which automatically generates this hook, and the checksums might be slightly difficult (but then storing specialised binaries separately isn't hard).

I guess the biggest infrastructure change might be running the specialisation on a dedicated machine and hosting your own packages. This way you could also checksum and sign the specialised binaries as well.


I'm not sure what you're suggesting here. Adding the "specialization" hooks to the post-install script in every single package is genuinely hard --- Debian, for example, has literally thousands of packages, maintained by hundreds of packagers, some of which squirrel .so files and the like in all sorts of places you might not expect them without detailed knowledge of the package in question. And that's before you even consider software installed on Debian boxes by other systems entirely, like npm or Rubygems.

Running the specialization on build machines is easier, if you have compilers there that produce binaries directly, and not IR. But that's exactly what the Mill crew is trying to avoid by producing the IR! (Though it may actually be a better fit to an infrastructure which is already set up to produce distinct binaries for different processor architectures (x86, x86-64, ARM, and several others), and which is not set up to expect the "specialization" hook as a necessary post-install step which each package must separately provide for...)


In one of their talks they also mention saving the branch prediction table _after_ execution. Clearly that can't happen at install time.


The compiler makes the first branch prediction table. (It's actually smaller than a branch prediction table, because it only needs one exit point for each entry point.) The table can be updated (in memory) during execution if needed. The on-disk version can be modified for optimization purposes for later runs, but the mechanism for that would obviously be software dependent and not reside on chip.
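To make that concrete, a toy model (my reading of the talks, not the actual table format) of one predicted exit per entry point, seeded by the compiler and updated in memory as the program runs:

    # Toy exit-prediction table: one predicted exit per entry point (EBB).
    exit_table = {
        # entry_addr: (predicted_cycles_until_exit, predicted_target)
        0x1000: (3, 0x1040),
        0x1040: (1, 0x1000),    # loop back edge
    }

    def predict(entry_addr):
        return exit_table.get(entry_addr)

    def update(entry_addr, actual_cycles, actual_target):
        # A mispredicted exit overwrites the compiler-seeded entry in memory;
        # software could write the updated table back out for later runs.
        exit_table[entry_addr] = (actual_cycles, actual_target)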


It seems like the specializer contains a lot of a compiler back end. (I am not a compiler guy, so my understanding is probably wrong.)

Different family members have different functional unit timings and different numbers and orders of functional units. ("orders" -- the order they drop results onto the belt. What's the right term?) Therefore the specializer has to schedule instructions.

Different family members have different belt lengths. Therefore the specializer has to insert belt spills, which seems to be analogous to register allocation/spills.

Different family members have different encodings, so the specializer has to determine the size of each basic block and link-edit them together. (Probably fast, just a lot of bookkeeping.)

It looks like there's a lot of work between the Mill IR and the actual machine code.
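For a sense of what that scheduling step involves, here is a hedged sketch: greedy list scheduling of an op DAG onto a member-specific number of issue slots with member-specific latencies. The ops, latencies, and slot count are invented for illustration; belt spilling and encoding would come on top of this.

    # Greedy list scheduling of a small op DAG onto per-cycle issue slots.
    ops = {
        # op: (latency_cycles, dependencies)
        "load_a": (3, []),
        "load_b": (3, []),
        "add":    (1, ["load_a", "load_b"]),
        "mul":    (2, ["add", "load_b"]),
        "store":  (1, ["mul"]),
    }
    SLOTS_PER_CYCLE = 2            # differs per family member

    ready_at, schedule = {}, {}
    remaining, cycle = dict(ops), 0
    while remaining:
        issued = []
        for op, (lat, deps) in list(remaining.items()):
            if len(issued) < SLOTS_PER_CYCLE and all(
                    d in ready_at and ready_at[d] <= cycle for d in deps):
                issued.append(op)
                ready_at[op] = cycle + lat
                del remaining[op]
        schedule[cycle] = issued
        cycle += 1

    for c in sorted(schedule):
        print(c, schedule[c])      # which ops issue together in each cycle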


It transforms one language into another, so by definition it's a compiler. But what is so scary about that? I would be willing to bet that most people already run JIT'd code every day, in the form of Java, C#, Javascript, etc.


My question is, how slow is the specializer? It is portrayed as very fast, but it looks like it's doing a significant fraction of a real compiler's work.

Then again, if it's mostly used at program installation time, who cares?


I would like it very much if alternatives to '70s-era operating systems were to flourish on newer hardware platforms. If Mill hardware backs up the performance claims, it would open an interesting door that's been shut to non-Multics-likes for too long.


>>All of which is to say, I think the 10x they're talking about is unrealistic.

I think that is putting it mildly, and it is a little strange to claim in the first place. That can't possibly be true: there isn't room for a 10x increase in optimization over an Intel Core chip, and it would be impossible to reach based on memory bandwidth and the amount of execution resources available on a chip alone. The ideas they have concretely put forth simply don't work or don't really provide a performance increase.

Take their virtual-memoryless implementation. Getting rid of virtual memory doesn't buy you a whole lot, especially when you need to add in a protection mechanism that looks a lot like a TLB in the first place (and must have the same general properties to provide protection, so you only gain very marginal lookup costs).

If you do the math, this can't add more than 1-2% in performance in common application software, at the cost of making every modern operating system unusable and increasing memory consumption (embedded systems, anyone?). If getting rid of virtual memory was so great, why didn't someone do it in every other preceding clean-room architecture? My answer: it isn't.

Or consider how they want to do branch prediction: add a separate ISA to do static branch prediction that is added by the compiler and loaded asynchronously by another cpu component and then supplied to the main cpu.

First of all, this doesn't work. The CPU can't have performance-critical data pushed to it by another component. There is a reason the branch prediction table and branch target buffers are small, focused, and able to be accessed quickly. Secondly, static branch prediction is awful. You simply must be able to modify branch prediction data as the CPU executes to provide optimum performance. So it seems they want to be more power hungry, more complex, and have less performance than a mainstream CPU when it comes to branch prediction.

It is possible I have misinterpreted some elements of this scheme, but basic design decisions like putting branch prediction into a separate component of the CPU simply don't make sense at all from a chip layout perspective.

Finally, I'm not really sure where the performance is supposed to come from with the 'belt' in the first place. Data dependency is incredibly complex in a modern pipelined CPU, and while it is possible to reduce the cost by precompiling software for an optimized CPU, the benefits are all very low level and really don't extend beyond reduced power consumption (assuming compilation cost can be amortized). At some point, to get more instruction-level parallelism you simply have to bite the bullet and do dynamic out-of-order scheduling in the CPU to extract more performance. This has a well-defined cost and an upper limit on how much total parallelism a CPU can extract from an instruction stream. Think of it another way: a static compiler has less information than a running CPU, so one can't expect it to be able to extract more parallelism than the CPU itself.


The Instruction Encoding talk http://millcomputing.com/topic/instruction-encoding/ was the first talk, so it explained these numbers in the first few slides.

DSPs are massively faster than your Out-of-order Superscalar Monster, just ... not on general purpose code.

The Mill is a DSP-like architecture with secret sauce so it can overcome the gotchas and go DSP-fast on general purpose code.


Right, and I'm saying that the data dependency limitations alone in general purpose code pose an upper limit on how much instruction level parallelism can be extracted from the instruction stream. You simply can't make use of execution resources if they need to wait on the results of another one.

EDIT: I'm not even considering pipelining, latency or transferring data among execution units, just assume every instruction completes in one cycle and makes its result available instantly.
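To illustrate with invented numbers: under those assumptions the ILP you can ever extract is just op count divided by the depth of the dependency chain, no matter how many execution units you throw at it.

    import functools

    deps = {
        "t1": [],            # t1 = a + b
        "t2": ["t1"],        # t2 = t1 * c   (waits on t1)
        "t3": [],            # t3 = d + e
        "t4": ["t2", "t3"],  # t4 = t2 - t3  (waits on t2 and t3)
    }

    @functools.lru_cache(maxsize=None)
    def depth(op):
        return 1 + max((depth(d) for d in deps[op]), default=0)

    critical_path = max(depth(op) for op in deps)
    print("ILP bound =", len(deps) / critical_path)   # 4 ops / 3-deep chain ≈ 1.33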


True, if you go hopping all over main memory, we'll go at main memory speed just like everybody else.

Luckily there is a noticeable performance improvement between an i3 and an i7 precisely because normal app code doesn't go hopping long chains across main memory. This is the code we speed up.

An order of magnitude improvement is saying that - hand waving - you can have a hot monster at 10x OoO SS performance or a cool little chip equiv to the OoO SS but low power.

However, the faster you crunch the bits between main memory stalls, the more dominating those stalls become. It's diminishing returns. And hot is not good.

So we talk more about sweet spots like 3x performance and 3x less power, and temper them appropriately.

The numbers are based on sim and experienced estimates. Mill is faster because we can time x86 code and we can sim Mill code and we can compare them.

Am typing on a phone, apologies if brief.


You then assume your own conclusion when you ignore pipelining. If instructions are executed strictly in sequence, single file, then necessarily none will be faster than any other.

The traditional rule-of-thumb is that programs have an ILP of two. The Execution talk (millcomputing.com/docs/execution) explains how the Mill turns that into an ILP of six. Then for the 80% or so of code that is in loops, pipelining has unbounded ILP - there will be a talk on pipelines upcoming.
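A toy picture of the loop case (invented stages, not real Mill scheduling): each iteration is a short dependent chain, but iterations are independent, so once the pipeline fills every stage is busy every cycle.

    # Each iteration is a 3-deep chain (load -> add -> store), but iterations
    # are independent, so overlapping them keeps all stages busy after the prologue.
    ITERS, STAGES = 6, 3
    for cycle in range(ITERS + STAGES - 1):
        in_flight = [f"i{i}:stage{cycle - i}" for i in range(ITERS)
                     if 0 <= cycle - i < STAGES]
        print(cycle, in_flight)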


I agree, you are always as slow as your critical path of execution. General-purpose code, I imagine, often has a very long critical path that no amount of parallelism will improve.

What I want to know is just how idle the pipelines of this design would be under multi-threaded general-purpose code.


I think the advantage the Mill gives here is more room to compute things speculatively on the critical path. You can happily load from an invalid address and then realise that was wrong later without causing an exception in the CPU; same with FP arithmetic, etc. This allows you to parallelise more sequential code than, say, x86. I think the other related advantage is that they've done as much as they can to remove any idea of core global state (comparison flags and so on) so that more operations can be run in parallel: do several comparisons at once, and then process all the results together.


>>All of which is to say, I think the 10x they're talking about is unrealistic.

Bear in mind they are claiming 10x improvement in MIPS/Watt, not MIPS. So I guess what they are aiming at is a 13W chip with i7 performance.

Even if they managed a 65W i7 they would be on a winner.


> The differences between how they and LLVM handle pointers are causing them trouble

Could you please elaborate? What is the difference in the way the Mill and LLVM handle pointers?


Ivan has said the problem is that LLVM treats all pointers as integers.


Here is a longer list:

LLVM assumes a register-based target

LLVM assumes that pointers are integers

and it can only vectorize counting loops whereas the Mill does while-loops too.


I actually had high hopes for Sun's Rock architecture, which had a rather elegant hardware-scout/speculative threading system to hide memory latencies, and instead of a reorder-buffer they had a neat checkpoint table, that simultaneously gave you out of order retirement, as well as hardware support for software transactional memory.

Alas, it looked good on paper, but died in practice, either because the theory was flawed (but academic simulations seemed to suggest it would be a win), or because Sun didn't have the resources to invest in it properly and Oracle killed it.

Claiming a breakthrough in VLIW static scheduling that yields 2.3x seems interesting, but the reality may be different, not to mention it depends on what kinds of workloads would get these speedups. If you compare the way NVidia's and AMD's GPUs work, in particular AMD's, they rely heavily on static analysis, but in the end, extracting max performance is highly dependent on structuring your workload to deal with the way the underlying architecture executes kernels.

If it turns out you have to actually restructure your code to get this 2.3x performance, rather than gcc-recompile with a different architecture, then it's not really an apples-to-apples speedup.


Having been at Sun and having been (too) intimately involved with the microprocessor side of the house for way too damn long, I can tell you that when it came to microprocessors, Sun was all vision and no execution. The theme that was repeated over several microprocessors: a new, big idea that made all of the DEs horny, but that proved annoyingly tricky to implement. Sacrifices would then be made elsewhere in order to make a tape out date and/or power or die budget. But these sacrifices would be made without a real understanding of the consequences -- and the chip would arrive severely compromised. (Or wouldn't arrive at all.) Examples abound but include Viking, Cheetah, UltraJava/NanoJava/PicoJava, MAJC, Millennium (cancelled), Niagara (shared FPU!) and ROC (originally "Regatta-on-a-chip", but became "Rock" only when it was clear that it was going to be so late that it wasn't going to be meaningfully competing with IBM's Regatta after all). The only microprocessor that Sun really got unequivocally right (on time, on budget, leading performance, basically worked) was Spitfire -- but even then, on the subsequent shrinks (Blackbird and beyond) the grievous e-cache design flaws basically killed it.

Point is: in microprocessors, execution isn't everything -- it's the only thing.


ROC (originally "Regatta-on-a-chip")

Really? Ha, that is funny! I guess Sun got the codename and the fact that it was an MCM full of GPs, but apparently didn't notice why it was an MCM, or the fact that there were 4 MCMs in the full Regatta config.

I mean, like, did Sun expect to make a wafer-level chip?

It's good to know the envy went both directions; I remember a lot of talk about Sun's E10k...


Hi Brian.

Spitfire was only on-time compared to the debacle of Viking and Voyager.

Thanks for dredging up the nightmare. :-)


Man, Voyager -- forgot that one!

And "debacle" is really the only word for Viking. A major rite of passage in kernel development in the 1990s was finding your first Viking bug; I found mine within a month of joining in 1996 (a logic bug whereby psr.pil was not honored for the three "settling" nops following wrpsr, allowing a low priority interrupt to tunnel in -- affecting all sun4m/sun4d CPUs). Bonwick's was still the king of the hill, though: he was the one who discovered that the i-cache wasn't grounded out properly, causing instructions with enough zeros in them to flip a bit (!!). The story of tracking that one down (branches would go to the wrong place) was our equivalent of the Norse sagas, an oral tradition handed down from engineer to engineer over the generations. Good times!


>>Alas, it looked good on paper, but died in practice, either because the theory was flawed (but academic simulations seemed to suggest it would be a win), or because Sun didn't have the resources to invest in it properly and Oracle killed it.

I heard this never actually worked at all, and they added the ability to turn off the hardware scout entirely before canceling it. I'm not really sure how the scout was supposed to help performance. If the algorithm is indirect-heavy then speculatively running it won't help you. On the other hand, if it isn't, you might as well rely on conventional prefetch. Do you have a link to those studies?

>> If it turns out you have to actually restructure your code to get this 2.3x performance, rather than gcc-recompile with a different architecture, then it's not really an apples-to-apples speedup.

Right, I would only add that the algorithm itself has to be amenable to that architecture in the first place. Most general purpose code isn't and won't be able to take advantage of a large number of parallel execution resources.


This is a detailed description of the architecture: http://millcomputing.com/topic/introduction-to-the-mill-cpu-....

It describes Mill's approach to specifying inter-instruction dependencies, grouping instructions, and handling variable-latency memory instructions.


Who is this guy? Where can you teach post-doc computer science without ever having taken a course in CS, let alone a degree?

Obviously a degree is not a necessary condition for success and it's always bothered me that people like Michael Faraday had to battle academic and class prejudice before changing the world.

However I don't think it's unreasonable to see a bio of past projects/companies/research papers.

"Despite having taught Computer Science at the graduate and post-doctorate levels, he has no degrees and has never taken a course in Computer Science"


My first compiler (still in use) was for the Burroughs B6500 mainframe in 1970. During my brief and inglorious college career I did not take a CS class. In fact, there were no CS classes. The college didn't even own a computer. Yes, there were such times, in living memory, hard as it may be to imagine.

These days you need a union card (i.e. a CS degree) to get a job. That's a shame. I've been refused a university position for lack of a PhD - to teach a subject that I largely invented. There's something wrong with that.

We have no such requirements on the Mill team.


Indeed there is something wrong here. I am sure it isn't easy to identify those with scholarly authority. Sad to see that they are missing it.

That being said, you are still having scholarly impact! Your talks have taught me to question all my fundamental assumptions when it comes to architecture, compilers, and computing!

I love following your people's work, and I can't wait to see its product!


>These days you need a union card (i.e. a CS degree) to get a job. That's a shame. I've been refused a university position for lack of a PhD - to teach a subject that I largely invented. There's something wrong with that.

My only degree is in physics and my career has yet to be harmed by this


Speaking of the Mill team, now that Mill Computing is exiting stealth mode, will there be any positions available for recent grads in the next few months/years? Say, someone with several years serious internship experience in (of all things) working on software teams alongside CPU development teams? (not really sure of a shorter way to state that... someone who spends a lot of time single-stepping assembly? ;))


> Ivan Godard has designed, implemented or led the teams for 11 compilers for a variety of languages and targets, an operating system, an object-oriented database, and four instruction set architectures. He participated in the revision of Algol68 and is mentioned in its Report, was on the Green team that won the Ada language competition, designed the Mary family of system implementation languages, and was founding editor of the Machine Oriented Languages Bulletin. He is a Member Emeritus of IFIPS Working Group 2.4 (Implementation languages) and was a member of the committee that produced the IEEE and ISO floating-point standard 754-2011.

http://millcomputing.com/docs/encoding/


Why the downvote? It just seems to be an unusual bio and would be interesting to see the history leading up the mill: http://www.ftpress.com/authors/bio.aspx?a=DE5F140D-E5BF-4E83...


He was on THE "green team" you young whipper-snapper and before that he worked on Mary and Mary2, languages from the long lost mists of time. After a certain point you don't need to go to school and people start asking you to teach them.


I wonder what the compilers would be like. If these guys contribute, say, an LLVM backend, that would make it so much easier to support.


(Mill team)

We are in fact working on an LLVM backend right now.

This will generate Mill IR, which will be 'specialised' on-target so will run on all Mill family members.


Will you contribute it to upstream, or keep it closed source?


No good reason to make it closed source. Any users would need Mill hardware.


icc isn't open source, is it?


Intel are in a perpetual war with another vendor with the same instruction set for which some of their optimizations are generally applicable, to the point that ICC used to intentionally cripple AMD hardware: http://www.agner.org/optimize/blog/read.php?i=49#49

Thus there's a disincentive for Intel to release their optimizer's tricks: not only are at least some percentage of the optimizations applicable to their competitor's microarchitecture implementing the same ISA, but they probably reveal various Intel CPU internals that Intel consider trade secrets (similar to the argument against open-sourcing 3D drivers and shader compilers).

Mill is not going to be locked into a bitter head-to-head battle with someone else trying to implement the same ISA better (at least not for a long time), so there's no incentive for them to hide their CPU's internal optimizations and no competition for which compiler optimizations could be generally applicable.


I always assumed that the parts of icc that Intel want to keep secret aren't in the backend, they're in the optimisation phases, which could be ported to compilers for other architectures.


Contribute it.


Wait, does this mean your approach to static scheduling is in fact just-in-time compilation? Nifty, I can see how that would work. :)


How about a JVM?


Seems much more sensible as a quicker commercialization approach.

Java code does not depend on hardware memory models but on a defined memory model. It is much easier to validate that your JVM is valid than all the C programs that clients might want to run.

You could even have Mill CPUs on PCI cards in a standard x86 machine, where the Java executable passes the program to the Mills on the PCI card to run. A bit like how Azul and their Vega machines worked (although those were network attached).


Yes please. It's a stack machine anyway, so it should map nicely to the hardware, right?


The Mill is closer to a queue machine [1] than a stack machine.

[1] http://en.wikipedia.org/wiki/Queue_automaton


Can't wait for Mill on Rails as well.


So where do I buy one and test it myself? I love the theory, and some of the claims are awesome, but I am reminded of the Cell-BE and the chatter around it at release time. It wasn't until we got the Cell into the hands of developers that we learned its real limitations. I want a Mill I can write programs for and run benchmarks against. My benchmarks on my bench.


If they raise the money they're looking for, you should be able to do that in 2 to 3 years.

http://electronics360.globalspec.com/article/3843/startup-se...


3 years is a bare minimum. They would need to staff up hardware design, verification, performance modelers, compiler writers, OS and software teams, and license all the necessary EDA software (simulators, waveform viewers, etc).

My guess would be a minimum of $20 million to get it to a solid FPGA prototype in 3 years. Then if that were successful they could spend another $25 million and get it into silicon at a good process (20nm or below).


It's easy* to build a dramatically better performing and more efficient CPU than currently available if you don't have to restrict yourself to the code and compilers currently available.

The exciting thing to me is that between wider availability of open source compilers and code, and a larger amount of user level code being written in interpreted languages (so only the language runtime needs to be rebuilt), there might actually be a future in alternative architectures.

* As these things go...


What differentiates the Mill from Itanium?

Also, what are the 2.3x power/performance improvements based on? Is there silicon for this?


I'm actually wondering where the 2.3x number he cites is coming from. I don't believe the Mill team is claiming 2.3x performance advantage over Haswell while using 2.3x less power, which is how I read that comment.

I watched the replay of the Execution talk here:

http://millcomputing.com/docs/execution/

I'd recommend watching all of the talks if you have the time.

In this talk, maybe 2/3-3/4 of the way through, Godard made a claim about performance relative to OOO, 'like a Haswell' or Haswell specifically - can't remember which, and I can't go through the video again right now. He said something to the effect that they would approach performance for {OOO|~Haswell|Haswell} using less power. It was a very general statement, which I took to mean that a Mill family member intended for GP PC desktop use could approach - not match or exceed - performance of a typical GP PC desktop processor while using less power. Which is certainly not something we've never heard before. And I think the statement is coming from theoretical calculation.

As far as difference with Itanium: I don't know anything about processor design, but I am pretty certain the belt concept central to the Mill is not applied in the Itanium/EPIC. I think it's likely that the Mill is intended to support more operations per instruction than Itanium. The other thing is that there is not 'The Mill Processor' - it's more of a design scheme and ISA.


It would be nice if there were a single number that could be justified by measurement, but there's no hardware yet to measure and there would not be a single number even if the hardware existed. That's because there's not just one "Mill", it's a family.

What we can say is that for equivalent computation capacity (i.e. number of functional units) the Mill will give somewhat better performance at much better power. Internally, the Mill's power budget is essentially the same as that of a DSP with the same function capacity, because they work in much the same way. DSPs have been around for a long time, and the power/performance comparisons with OOO have been long published. For equal process and equal Mips capacity the power difference for the core is 8-12x better than OOO, and we expect to do at least as well.

That's for equal compute capacity. Every architecture has a cap on scaling compute capacity. The cap seems to be around 8 pipelines in OOO machines; try to add more and you just slow down everything more than you gain from the extra pipes.

The Mill has caps too. We don't know yet where the diminishing returns point will be in detail, but our sims and engineering expertise suggests that it will be somewhere in the 30-40 pipes region. Such a high-end Mill would swap a good deal - but not all - of its power advantage for more horsepower.

You have the inverse story at the low end of the family: the lowest Mill has only five pipes, and no floating point at all. Not barn-burning performance, but much lower power even than existing non-OOO offerings.

So there's no one number, and no hard measurements anyway. If you doubt our projections then you are entitled to your opinion; in fact there's a fair amount of disagreement even within the Mill team as to what we will see in the actual chip. But the team includes quite a few who have been doing this for years, and in several cases were involved in the creation of the chips that you would compare the Mill against, so their considered opinion should not be rejected out of hand.


From what I've seen and heard in the (academic) computer architecture community, performance and power gains often diminish when moving from theory to simulation to RTL and into silicon (It seems the Mill team is aware of this too). Thus, I tend to be skeptical about large performance/power gains. On the other hand, it's not entirely unreasonable that VLIW could see these gains. I'll be curious to see what happens with Mill. It seems to me the biggest challenges with VLIW architectures are on the compiler side and the need to recompile legacy code.


"You have the inverse story at the low end of the family: the lowest Mill has only five pipes, and no floating point at all. Not barn-burning performance, but much lower power even than existing non-OOO offerings."

Does the Mill even need an FP unit? Or rather, couldn't a VLIW architecture emulate floating point in such a way that it's nearly as fast, and/or more flexible as far as precision, and/or more optimizable for certain values?

Minimally, if you break down the FP op into its constituent integer operations, you could put all of those in flight at the same time or schedule them to hide the latencies of other operations, I would think.
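Something like this, greatly simplified (positive values only, no rounding or denormals; purely my sketch, not a Mill proposal): an FP add decomposes into an exponent compare, a shift, an integer add, and a rescale, which a wide machine could in principle spread across its integer units.

    import math

    FRAC_BITS = 52

    def split(x):
        # Represent a positive float as (integer mantissa, exponent).
        m, e = math.frexp(x)                      # x = m * 2**e, 0.5 <= m < 1
        return int(m * (1 << FRAC_BITS)), e - FRAC_BITS

    def soft_add(a, b):
        ma, ea = split(a)
        mb, eb = split(b)
        if ea < eb:                               # align to the larger exponent
            ma, ea, mb, eb = mb, eb, ma, ea
        mb >>= (ea - eb)                          # integer shift
        return (ma + mb) * 2.0 ** ea              # integer add, then rescale

    print(soft_add(1.5, 2.25))                    # 3.75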


Thanks for your reply.

I have no reason to doubt your projections. I mainly took issue with the 2.3x number in the parent blog post because I remembered you saying something different in your talk. That's all.


I worked on a VLIW processor long ago and it had a theoretical peak of 700 MIPS (iirc) back in 2000. It was a neat architecture but required fairly low level knowledge to get the most out of it.


Sounds like the Intel i860 from the late 1980s. Frighteningly fast VLIW in theory, but in practice not so much. I think untweaked code ran at 3% of max speed.


The problem there is, Intel intentionally kills any non-x86 Intel arch. For example, the Itanium was brilliant, and then they repeatedly threw it under the bus until people hated it.


That's not really how the story goes, though. Intel spent a lot of effort on getting Itanium to succeed, and they really dragged their feet on x86-64. It was the market that decided that the price/performance of Itanium wasn't worth it.


That's how a lot of people tell the story, but I just don't agree with it. There is zero evidence that the market killed Itanium when Intel was already trying to kill it because it was eating into Xeon sales on high-end platforms.


Any pointers to write ups about this?


I want to replace every computer in the world with this.


My biggest concern is Intel will just buy and bury it.


OOB is unlikely to ever sell.


Can't wait to buy a Mill and mess around with it. Hopefully it isn't more expensive than current desktop or server processors.


How is this different from a conventional register model, where the compiler stores each result into a new register round-robin? That would be a belt too.


That works fine until you have things like branches and function calls. Most calling conventions specify which registers arguments are in, and with what you're proposing each function call needs to know where the rotation is up to at the beginning of the call. If the hardware looks after it (say through an index register that stores the position of the next write), then it becomes easier; but no one's doing that, and afaik the Mill is the first machine to do anything close. Stack machines are sort of similar, but work differently because they need to remember all values on the stack until they're popped. The Mill just forgets old results, and the compiler must make sure they're still available when they're needed later.
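A rough sketch of my mental model of the difference (Python as pseudocode, not the real ISA): on a call the callee conceptually gets a fresh belt seeded with just its arguments, so it never needs to know where the caller's rotation was up to.

    from collections import deque

    class Belt:
        def __init__(self, length, initial=()):
            self.slots = deque(initial, maxlen=length)  # old values fall off the end
        def push(self, value):
            self.slots.appendleft(value)                # results drop at the front
        def get(self, pos):
            return self.slots[pos]                      # operands are named by position

    def call(caller, arg_positions, body, belt_len=8):
        # New frame: the callee sees only its arguments, at positions 0..n-1.
        callee = Belt(belt_len, [caller.get(p) for p in arg_positions])
        caller.push(body(callee))                       # result drops onto the caller's belt

    main = Belt(8)
    main.push(2)          # b
    main.push(40)         # a (b is now at position 1)
    call(main, [0, 1], lambda belt: belt.get(0) + belt.get(1))
    print(main.get(0))    # 42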


If the belt operation can be changed from the current "take any two items on the belt, process them, put the result at the front of the belt" to "take the two front-most items on the belt, process them, put the result anywhere on the belt", we can save some bits and make shorter instructions (good for mobile):

currently: OP load-address-1 load-address-2 // output is always put at belt's front

to: OP store-address // inputs are always 2 frontmost items on belt


How do you get ILP from this? It seems like it would make scheduling much more difficult, because you now need to make sure that the results you need for each instruction are in the order on the belt you need, and you have to be able to execute any order of instruction types. The Mill can run fast partly because an instruction can use any result on the belt, and similar instructions are grouped together, making decode simpler (from memory, they have two separate decoders; each instruction has, say, all arithmetic operations grouped and decoded by one decoder, and all the others decoded by the other).

Basically, I don't see how you can use what you suggest to do, in one instruction:

    [ add positions 7 and 5 | multiply positions 7 and 2 | call f on 4  and 5 | branch to foo if position 3 was LT else bar ]
ending up with the belt

    [ [7]+[5], [7]*[2], f([4],[5]) ... ]
or whatever you like. All you need to do to schedule the Mill is to perform as many operations in parallel as the hardware can do, and then find out where their results will be placed to create the next instruction.


In the linked article it is said that "According to prior research, only some 13% of values are used more than once". So based on the research the Mill draws on, your example, where "an instruction can use any result on the belt", is actually the minority case.

As for scheduling my proposed store-addressed belt: you perform as many operations in parallel as you can, then for each operation you find the other operation that depends on its result, calculate the distance between them, and assign that as the former's store address. The compiler has more work to do, yes, but it's not "much more difficult".


Offhand, I would say maybe one difference is that your model is trying to predict where the belt will be in the future while the Mill is looking backwards to find where the belt was in the past.

Another issue is that you would have to process the entire instruction in order to know where each operation gets its input. (How many operations in the instruction are taking things off the belt before I get my data?) In the Mill the operations are parsed in parallel, and they have all the information they need to start processing as soon as the instruction (block) is loaded into the buffer.

The size of the belt is a very finely tuned constraint (using simulations) that basically depends on how many cycles you have to save a value to the scratchpad memory (if needed) before it "drops off" the belt. There is a lecture that describes why it takes the number of cycles it does and if you watch it you will probably understand better why the Mill is not about what is easy or hard for the compiler but all about getting the silicon to jump through hoops fast and efficiently.


(See The Belt talk for an explanation of belt vs stack)


Verilog or GTFO.


Soon with Memristor RAM/SSD!



