This market strategy dictates the ability to produce many specialized products with small sales for each. That can't be done in the million-monkeys development approach used by the majors, which is why the majors neglect these markets. So we adopted a specification-based design strategy.
In turn, the design strategy dictates the development strategy: first the specification tools; then the assembler and simulator so we could try out various designs in small test cases and improve the architecture; then the tool chain so we could measure real quantities of code and confirm the ISA and macro-architecture. Then, and only then, write what manual RTL is left that the generators working from the specifications can't handle. The combined RTL must be verified, and it is much easier and cheaper to do that in an FPGA than with fab turns. As the message says, the FPGA is next.
Lastly, we will pick a particular market and specify the ideal product for it, run the spec through the process we have been so long building, and the fab gives us something in a flat pack.
Which won't work, of course. The first time, anyway :-)
Still, I truly wish them well. It beats the crap out of 99% of the YC startups I see. The 37th recommendation engine, a webstore for sneakerheads, ...
"After a design is created, taped-out and manufactured, actual hardware, 'first silicon', is received which is taken into the lab where it goes through bringup. Bringup is the process of powering, testing and characterizing the design in the lab."
A newcomer can't sell yet another me-too, even with evolutionary improvement. Instead the newcomer has to rethink and create from first principles if it is to have any chance in the market. Rethinks can't be scheduled; they take as long as they take. Ours has taken longer than I'd hoped, but adding resources like more people to the project would have just slowed us down, or forced us to market with a broken product.
The Mill rethink stage is over, and we now can have reasonable schedules and put in more resources; that's why we are going out for a significant funding round this year, our first over $10M.
There's plenty of it on the software side, too, with the proprietary DOS's and UNIX's. Foreign cloners kept doing it with mainframes and embedded CPU's. So, incremental stuff (esp. patented) can certainly grab market share and generate revenue. It's been going on for some time in the CPU market, even with dominant players.
There's also personal strategy involved. If we had done a better X86 we would have needed huge dollops of money to crack the front door of the market - cf. Transmeta - and lost ownership of the company. By going for the disruptive approach we still own all of it - and funding rounds now are at a valuation that will keep us making our own mistakes, not someone else's. That matters to me, enough to go without a paycheck for a decade. YMMV.
Too right! 2017 is the year we get funding to go full-time and implement what we've been plotting at our Tuesday evening meetings for 10 years :)
Not that that's a huge market at this point, but apparently there's a lot of Fortune 500 interest so maybe that'll change by the time you're in production.
I can't say for certain that they'll fail. Maybe they'd be twice as good, but anything short of that and I just don't see viability.
The Mill is definitely not simpler than many other CPUs (in-order CPUs, for example); what it is supposed to have is better performance or, more exactly, a better performance/power ratio.
Maybe that was more wishful interpretation, but if I remember correctly the idea is to put more logic into the compiler (good old "sufficiently smart compiler") to do the optimization, rather than doing it at runtime in HW. IMHO this implies less logic in the HW, which might make the implementation simpler (I am a SW guy, so my computer architecture knowledge might be skewed).
Furthermore, I believe that we might be able to infer more and more bounds on runtime behavior, which could possibly give rise to more aggressive optimizations. In particular, the Mill provides more predictable performance than traditional architectures, which is a feature in itself as it e.g. simplifies (again IMHO) real-time programming.
So that's why I believe the implementation doesn't have to be harder than that of traditional architectures, at least.
I am really looking forward to seeing some results, and I admire their effort.
Whereas for a simple architecture registers are simple: they index into the register file. This only breaks down when we bring in out-of-order & register renaming in order to have more registers than the instruction set specifies
The Mill specializer part that schedules ops is linear, while the part that assigns belt numbers and inserts spills is NlogN in the number of ops in an EBB because it does some sorts.
I found this [1] blog post interesting - that mainstream architectures have defaulted to low level C-machines and that radically new CPU designs might return to the halcyon days of lisp.
However I think that Lisp is too powerful to be executed directly on the machine (as I understand Lisp machines provide HW capabilities to deal with cons and list atoms).
I often wonder if we should try to make a non Turing complete language, but that we make "pseudo-turing-complete", by using some kind of bounded-automaton with insane worst-case upper bounds (e.g. about complexity) that we can feed to the OS/machine which then can aggressively schedule as it knows upper timing bounds etc.
Btw. is there good literature about the implementation of intermediate languages as compilation targets for static functional languages? I think the go-to book was SPJ's "The Implementation of Functional Programming Languages", but I am not sure how relevant it is today.
No of course not. But it will validate benchmarks per-clock cycle, as well as providing a gate count (or LUT count) required to achieve that benchmark result. If the architecture is anywhere near as awesome as claimed, there should be a strong indication of that in the numbers produced by the FPGA implementation.
They typically start with ("Mill team"), and then the rest sounds like it came from the future. It's all about getting the flux capacitors charged.
Can anyone post some low-level code (in assembly, or the closest equivalent if that term is no longer relevant with Mill) implementing some well-known simple algorithm on the Mill? Just to quickly get a feel of what it's all about, how it looks.
Something like computing n! for word-sized integers, or strcmp(), or whatever.
Does anyone have a quick link to something like that?
# entering factorial: argument n is b0
# drop initial accumulator to belt; bail early if n==0
con(1), cmpeq(b1, 0), brtrue(end);
# now, b0 == 1, b1 == n
loop:
# multiply accumulator by n, then decrement n, then check if n is still not 0
mul(b0, b1), sub(b1, 1), cmpeq(b1, 0), brfalse(loop);
end:
# no matter what, the most recent accumulator is b0
ret(b0);
F("fact") %0;
sub(b0 %0, 1) %1;
gtrsb(b0 %1, 1) %2,
retnfl(b2 %0),
inner("fact$1_1", b1 %1, b2 %0);
mul(b1 %3, b0 %4) %5;
sub(b0 %4, 1) %6;
gtrsb(b0 %6, 1) %7,
retnfl(b1 %5),
br("fact$1_1", b2 %6, b1 %5);
This thread, for example, talks about the instruction density of Mill programs. The Mill does somewhat poorly, with a pre-alpha quality compiler doing the codegen: https://groups.google.com/forum/#!topic/comp.arch/RY3Bk7O61u...
Comparing several ISAs on the simple program in the first post, a Mill Gold CPU binary weighs in at 337 bytes (and 33 instructions), compared to:
* powerpc: 212 bytes, 53 instructions
* aarch64: 204 bytes, 51 instructions
* arm: 176 bytes, 44 instructions
* x86_64: 135 bytes, 49 instructions
* i386: 130 bytes, 55 instructions
* thumb2: 120 bytes, 51 instructions
* thumb: 112 bytes, 56 instructions
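To put those figures on one scale, here is a quick sketch that divides bytes by instruction count for each ISA. The numbers are copied from the list above; the names `SIZES` and `bytes_per_instruction` are just made up for this example. Keep in mind that a Mill instruction can bundle several operations, so bytes per instruction is not an apples-to-apples density measure.

```python
# (bytes, instructions) per ISA, copied from the figures quoted above.
SIZES = {
    "mill gold": (337, 33),
    "powerpc": (212, 53),
    "aarch64": (204, 51),
    "arm": (176, 44),
    "x86_64": (135, 49),
    "i386": (130, 55),
    "thumb2": (120, 51),
    "thumb": (112, 56),
}


def bytes_per_instruction(sizes):
    """Average encoded size of one instruction for each ISA."""
    return {isa: nbytes / ninstr for isa, (nbytes, ninstr) in sizes.items()}


density = bytes_per_instruction(SIZES)
# Print from the widest average instruction to the narrowest.
for isa, value in sorted(density.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{isa:10s} {value:5.2f} bytes/instruction")
```

On these numbers the Mill Gold binary has the fewest instructions but by far the widest ones, which is what you would expect if each instruction carries several operations.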
label $0:
%1 = sub(%0, 1) ^0;
%2 = gtrsb(%1, 1) ^1;
br(%2, $1, $2);
%3 = phi(w, %5, $1, %0, $0);
%4 = phi(w, %6, $1, %1, $0);
%5 = mul(%3, %4) ^2;
%6 = sub(%4, 1) ^3;
%7 = gtrsb(%6, 1) ^4;
br(%7, $1, $2);
%8 = phi(w, %0, $0, %5, $1);
retn(%8) ^5;
Higher-end Mill models have more and bigger things to encode, and so a given program will occupy more bytes than it will on a lower-end member. Thus a belt reference on a Gold is 6 bits, but only 3 on a Tin.
The layer that humans and HLL compiler developers end up with at the lowest seems to resemble a dataflow graph, and the system-specific code converts that into the parallel binary code stream.
The notion of the "belt" helps arrange how long data sticks around in live outputs & buffers, but isn't a physical element in the chip.
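For anyone who wants to play with the programming model, here is a toy sketch of a belt in Python. It is only the programmer's-eye view (a fixed-length queue where new results drop in at the front and the oldest fall off the end), not how the hardware does it. The `Belt` class, the belt length of 8, and `factorial_on_belt` are all assumptions made up for illustration.

```python
from collections import deque


class Belt:
    """Toy belt: position 0 is the newest value; old values fall off the end."""

    def __init__(self, length=8):
        # A bounded deque discards from the opposite end when it is full.
        self._slots = deque(maxlen=length)

    def drop(self, value):
        """Drop a new result at the front of the belt."""
        self._slots.appendleft(value)

    def __getitem__(self, position):
        """Read a value by belt position (0 = most recent drop)."""
        return self._slots[position]


def factorial_on_belt(n, belt_length=8):
    """Compute n! while addressing operands only by belt position."""
    belt = Belt(belt_length)
    belt.drop(n)  # b0 = n
    belt.drop(1)  # b0 = accumulator, b1 = n
    while belt[1] != 0:
        acc, count = belt[0], belt[1]
        belt.drop(count - 1)    # new counter
        belt.drop(acc * count)  # new accumulator: b0 = acc, b1 = counter
    return belt[0]


print(factorial_on_belt(5))
```

Note that nothing is ever overwritten: every result gets a fresh position and older values simply age out, which is why operands are named by how recently they were produced rather than by a register number.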
load *src, bv
eql <load>, 0
smearx <eql>
pick <smearx0>,None,<load>
store *dest, <pick>
brfl smearx1, loop
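Here is my reading of that listing as a Python sketch, in case it helps. I am assuming that `smearx` is an exclusive smear (each lane is true if any earlier lane was true) with a second result saying whether any lane was set, and that storing a None lane writes nothing. The function names, the `NONE` marker, the vector width of 4, and the zero padding of a short final chunk are all stand-ins chosen for the example, not Mill definitions.

```python
NONE = object()  # stand-in for a "None" lane: storing it writes nothing


def smearx(flags):
    """Exclusive smear: out[i] is True if any flags[j] with j < i is True.

    Also returns whether any flag was set at all (the loop-exit condition).
    """
    out, seen = [], False
    for flag in flags:
        out.append(seen)
        seen = seen or flag
    return out, seen


def pick(mask, if_true, if_false):
    """Lane-wise select: take if_true where mask is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def vector_strcpy(src: bytes, width: int = 4) -> bytes:
    """Copy a NUL-terminated string one vector of `width` bytes at a time."""
    dest = bytearray()
    pos = 0
    while True:
        chunk = list(src[pos:pos + width])
        chunk += [0] * (width - len(chunk))      # pad a short final chunk
        is_nul = [byte == 0 for byte in chunk]   # eql <load>, 0
        after_nul, found = smearx(is_nul)        # smearx <eql>
        picked = pick(after_nul, [NONE] * width, chunk)
        dest.extend(b for b in picked if b is not NONE)  # store skips None lanes
        if found:                                # brfl smearx1, loop
            return bytes(dest)
        pos += width


print(vector_strcpy(b"hello\0junk"))
```

Because the smear is exclusive, the terminator itself is still copied and only the lanes after it are masked out.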
Pretty cool, and I feel like I learned something. Thanks!
If you also look a bit further into the talk they show the 'remaining' instruction, which allows you to do this same thing with fixed-count loops.
Regardless, I’m looking forward to seeing what MillComputing can deliver in the real world.
Implement algorithms? Improve FOSS tools?
@Mill team,
can you set up a list of FOSS projects that we can contribute to in order to help? I specialize in implementing algorithms, especially optimization and other stuff.
NDA and sweat-equity agreement required; you get fully vested long-term options monthly; expected to be actual stock monthly after the next round.
To help in a FOSS context: we are not yet ready to put out an SDK to the FOSS community, or to anyone really. That will wait until after our cloud environment is up - and a few more patents have been filed (the ISE exposes things we want to protect).
In the meantime, the most help would be to support existing microkernel operating system efforts such as L4 (https://en.wikipedia.org/wiki/L4_microkernel_family). The Mill will have a big impact on conventional OSs like Linux, but those suffer from built-in assumptions that open security holes and make the OS a dog that the Mill can only train a little. The big win is in microkernels, which can take advantage of the Mill's tight security and ultra-fast context switching.
Or support languages that have central micro-thread concepts, such as Go https://en.wikipedia.org/wiki/Go_(programming_language), for the same reason.
But whatever you do, don't tune the microkernels or microthreads for a conventional core; instead, do it right, and we'll be along to help eventually :-)
Joking aside, computers help us handle loops more efficiently; that's it. The 20% "non-loop operations" are just the fabric that tie the loops together.
Does anyone think this is commercially viable? Or are we looking at a new Transputer?
Commercial viability of the architecture depends on how X86-64 and ARM fare in the coming years. If they stagnate and can't produce faster single-CPU cores, and people continue writing software which lives in a single-core world, then there is room for a CPU delivering a 10-fold increase in performance per watt per dollar.
If, on the other hand, most of the heavy computation is moved into either SIMD-style GPUs or MIMD-style Adapteva (Parallella) like solutions, where you have a thousand cores, then the Mill is less likely to have success.
I'm optimistic that someone is trying to rethink how to handle CPU design from the bottom up. It is a gamble, but if we don't try, we won't really learn if there is anything better out there than the current generation of out-of-order executing chips.
I'm pessimistic because Mill smells of Itanium. It is a mix of many new, unproven, ideas inside a single solution. Some of the ideas are genuinely good and might stick if thrown at a wall. But if you are gambling on having many new ideas at the same time, it is likely that some of them will turn out to be bad ideas.
I'd place the majority of my money on simple ISA designs, such as ARM or RISC-V, preferably in a many-core design, but I'd hedge a good sum on Mill as well.
One of the most interesting things to me is the security model with the promise (unproven, maybe very over-optimistic) of the equivalent of context switches costing no more than regular function calls (for which there is a dedicated instruction). That would be HUGE for microkernel architectures.
In theory, we should be able to do something like that now on ordinary CPUs if we write our user-mode software in a memory-safe language such as Rust.
Current CPUs go to great lengths to isolate process address spaces so that one bad behaving application can't crash the computer. However, if the compiler also guarantees that the application can't access memory it doesn't own (because the language provides no mechanism by which to do that), then the hardware memory isolation is redundant.
The OS could maintain metadata about each binary and what it was compiled with. If it wasn't compiled with a trusted compiler, then it runs inside an isolated process and pays the cost of context switches or it runs inside a CPU emulator at something like 1% performance.
This assumes there are no unsafe code blocks in "trusted" user-space code; in practice, this would mean that everything that can be made more efficient by doing it in an unsafe way moves into a standard library that's trusted. So, an unsafe code block would be like an suid-root binary or a kernel module.
Having user space process and the kernel run in the same address space means that system calls are just function calls (which could be inlined by the compiler) and we could support things like Infiniband network interfaces (that map their hardware registers directly into user-space memory to avoid system call overhead) through an OS API, just like a regular network adapter instead of them being a weird special case.
No, if you want to keep security you need some kind of external memory protection system. Memory isolation gives that and takes away performance. Mill gives both.
If Mill can drop the cost of context switches down to the point where they're indistinguishable from function calls, then that's pretty awesome and it would presumably work well with the software we have now, which is a strong point in its favor. (I haven't seen any Mill talks about security/memory protection, so I don't know what the implementation details are.)
I'm just trying to make the point that reducing the cost of context switches isn't the only solution, and that we could eliminate context switches altogether by making different tradeoffs. I think both approaches are promising and will appeal to different audiences.
Doesn't Java do exactly that?
It also meant that compiler development essentially required your own multi-million dollar mainframe - there weren't a lot of languages available.
You couldn't safely import code from another machine since anyone could write any bitstream to a tape.
The 6700 was a great old machine .... but this bit sucked.
Itanium was a VLIW and the Yale Bulldog compiler solved the VLIW code generation problem with trace scheduling back in the 80s. At this point, that's standard Dragon Book stuff.
Itanium failed, to be sure, but it wasn't because of the compiler. However, I'll grant that the workload didn't match the EPIC architecture and the compiler. EPIC succeeded at scientific workloads.
As to whether Mill respects the current level of compiler tech a lot more I'd like to know more about that. They need to get their LLVM backend up and running.
The Mill's history may be different because compilers have actually advanced a lot since Itanium's launch, because it brings some arguable improvements over Itanium, or because the Mill team is trying to write those compiler improvements itself, instead of making a partnership with Microsoft.
I don't know, some of the novelties in the Mill arch would seem to simplify compilers. For instance, no need for complex register allocation.
Other optimizations applicable to traditional CPUs either may not apply to the Mill CPU, or may need significant tweaking to work properly.
The one thing that I think makes the Mill simpler is the no-value values mixed with lazy error conditions. I have no idea if those are powerful enough to compensate the rest.
For spilling: if a desired argument is less than one belt-length away then we directly reference it; between one and two away we reorder; and more than two we spill.
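Spelled out as code, that rule is a three-way classification. This is only a sketch of the sentence above: `distance` (how many drops ago the value was produced) and the function name are hypothetical, and which side the exact boundaries fall on is my assumption, since the description doesn't say.

```python
def placement(distance: int, belt_length: int) -> str:
    """Decide how an operand is reached, given how far back on the belt it is."""
    if distance < belt_length:
        return "direct"    # still on the belt: reference it by position
    if distance <= 2 * belt_length:
        return "reorder"   # between one and two belt-lengths away
    return "spill"         # more than two belt-lengths away


for d in (3, 12, 20):
    print(d, placement(d, belt_length=8))
```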
We place a load as soon as the address arguments are statically available. The compiler doesn't have to deal with aliasing, which is a large and bug-ridden part of compiling for other targets.
Lazy-error (aka NaR) means that there are no error dependencies in arithmetic expressions. Current compilers for other targets simply ignore such dependencies, relying on the "undefined behavior" rule. Mill is designed to not produce nasal demons :-)
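A rough software analogy for lazy errors, as a sketch: a failing operation yields a marker value instead of trapping, later arithmetic just passes the marker along, and the fault is raised only when the marker would reach a side effect such as a store. The `NaR` class and the `add`, `div`, and `store` helpers are invented for this illustration and say nothing about how the hardware encodes it.

```python
class NaR:
    """Marker for 'not a result'; remembers why the value is invalid."""

    def __init__(self, reason):
        self.reason = reason


def add(a, b):
    """Add two values, passing through any NaR operand unchanged."""
    for operand in (a, b):
        if isinstance(operand, NaR):
            return operand
    return a + b


def div(a, b):
    """Integer divide; a zero divisor yields a NaR instead of trapping."""
    for operand in (a, b):
        if isinstance(operand, NaR):
            return operand
    if b == 0:
        return NaR("divide by zero")
    return a // b


def store(memory, address, value):
    """A store is a side effect, so this is where a NaR finally faults."""
    if isinstance(value, NaR):
        raise ArithmeticError(value.reason)
    memory[address] = value


memory = {}
speculative = add(div(1, 0), 5)   # no fault yet, just a NaR flowing along
store(memory, 0, add(div(8, 2), 1))
print(memory, isinstance(speculative, NaR))
```

If the speculative value is never stored, nothing ever faults, which is what lets the arithmetic be hoisted freely.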
Incidentally, I don't think it's true that the Itanium compiler tech never materialized; it just materialized late. Itanium had a lot of other problems, but the later Itanium models really flew for certain workloads (SQL Server, in particular). After the first one (which was quite a dud and killed Itanium's reputation), Itanium was definitely not slow. The x86 emulator was slow, and AMD64 absolutely stole Itanium's lunch.
There are several of these already (e.g. Tilera) which haven't really taken off outside of those marketed as "graphics cards". The main problem is that memory bandwidth is now the real bottleneck for real workloads.
Basically simpler and more GHz/Watt than Parallella.
Superscalars leave a lot of the optimization process to complexity in the hardware. This is seen in stuff like the out-of-order scheduling and large cache hierarchies. In one of the Mill talks, it's guesstimated that almost 90% of the circuitry is dedicated to simply moving data around, as opposed to actually doing any work on that data.
VLIWs take the opposite extreme, leaving the complexity of optimization to the compiler. History has shown so far though, that many computing problems are (again oversimplified) too conditionals-heavy to really benefit from VLIW, and end up running much more slowly.
So the Mill is a bet that they can get more benefits from each approach without the major draw-backs of each. This isn't really ground Intel could simply "move into" without half a decade plus of work, and even then, they'd be cannibalizing their x86 ecosystem, which is not a risk most entrenched corporations are fond of.
EDIT: To explain a bit more, VLIW has historically worked great in cases like DSP workloads where the memory access patterns are very predictable and you aren't unexpectedly loading things from lower level caches very often. There's another thread on HN right now about doing deep learning with a Hexagon, which is a sort of VLIW, and it works very well. But as soon as you miss L1 in a VLIW the whole thing comes to a stop, whereas in an OoO (Out of Order) processor you can keep executing subsequent instructions that don't depend on that load and so you don't have to stall. Basically every instruction that isn't a load has deterministic or at least hard bounded latency that the compiler can easily plan for.
The other big disadvantage of VLIW is that sometimes you have stretches of code where you can only have one useful instruction at a time. VLIWs often use very RISCy encoding formats that still take up lots of space per bundle in these stretches, leading to potentially very low code density. The Mill gets around this with a very CISCy encoding that only takes a small amount of I-cache size for single instructions.
And I'm still looking at recompiling, debugging and deploying a ton of software to see advantages from the Mill.
So whatever advantages it brings, they need to be very substantial (i.e. if the next-gen of Intel x86 chips still out-perform it, I'll buy them) and quite quick - because I can go 5 years not buying the Mill while Intel promises to support me for their awesome new architecture. I mean, probably we buy some Mill machines, but how likely is it that it's game changing on my codebase?
That's where I see the problem. There's this whole huge assumption that the Mill will yield a bunch of benefits. If they're clear cut (a huge if) then they still have to beat their competitors being able to brute-force performance improvements until they come up with a new architecture themselves, which can take advantage of the very compensation they're asking from their customers ("switch architectures, it'll be great we promise").
The security features could be very useful for the big internet companies, Amazon, Google, Facebook, etc and they can afford to spend a few tens of millions of dollars on something speculative like this. And they do, with things like Arm or OpenPower servers.
There's also the high end embedded land where you don't need to run a traditional operating system. Network switches, cell towers, that sort of thing.
For instance are there big problems out there that are compute intensive, but branch-heavy enough that GPUs aren't a great fit, in applications that need low power but that don't make sense as "cloud services"? I don't know.
The optimist in me sees potential opportunity in opening up new domains. The practicalist agrees with you; it's going to be an uphill battle and a lot of stars need to align just right. But if they really can deliver 10x on general purpose computing workloads, it's hard for me to see that not being game changing.
Whereas VLIW is where you optimise ahead of time with no foreknowledge; if the compiler is good enough, you get the same performance with simpler, cheaper, more efficient hardware (or potentially more performance), but often this doesn't pan out fully.
Not sure what's the memory latency hiding story for the Mill.
Edit: Symmetry said it better.
If you issue a load that is scheduled to complete in 5 cycles and it needs to go to RAM for that data, then it's going to stall waiting for that data. But because you can issue loads sooner than on other architectures, you can somewhat alleviate the problem of stalling on a load.
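As back-of-the-envelope arithmetic: if a load is issued some number of cycles before its result is first needed, the consumer only waits for whatever latency is left over. The sketch below just encodes that subtraction; the function name and the example latencies (3 cycles for a cache hit, 200 for RAM) are illustrative assumptions, not Mill figures.

```python
def stall_cycles(cycles_hoisted: int, actual_latency: int) -> int:
    """Cycles the consumer waits when the load was issued
    `cycles_hoisted` cycles before its result is first used."""
    return max(0, actual_latency - cycles_hoisted)


# Assumed latencies, for illustration only.
CACHE_HIT, RAM = 3, 200

print(stall_cycles(5, CACHE_HIT))  # hoisting fully hides a cache hit
print(stall_cycles(5, RAM))        # a trip to RAM still stalls
print(stall_cycles(0, RAM))        # with no hoisting the full latency is exposed
```

So early issue trims the stall by however far the load could be hoisted, but a miss all the way to RAM is far longer than any realistic hoisting distance.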
EDIT: I think the main problem was that on Itanium issuing a speculative load could potentially trigger a page fault, making it potentially dangerous. On the Mill the page fault won't trigger until the load leads to a side effect outside the belt, sort of like it's been wrapped in a Haskell Maybe monad. So if you have something like
for (int i = 0; i < foo; i++) {
a[i] = a[i]+1;
}
1) Intel is very heavily invested in x86, and they're likely to be very reluctant to change.
2) The Mill folks have patents, which might be hard to work around.
3) If the Mill becomes successful, they're likely to either be bought out by one of Intel's behemoth competitors, like Samsung or Qualcomm, or perhaps even by Intel itself.
4) If Intel really likes their technology, they could license it from Mill Computing. It would be surprising for Mill to say no to a reasonable offer, considering that they don't have their own fabs and their business model is to license their designs to those that do.
Licensing is a backup plan.
5) Intel really likes their technology, but they do not want to license it so they decide to implement tweaks around the Mill patents.
If the Mill actually achieves the stated 10x improvement in performance per watt, you'd be quite stupid not to use it in highly competitive markets like smartphones, because others will. It would provide a significant competitive advantage.
For anyone considering getting involved or investing, could you fill in a couple of bio details?
>Ivan Godard has designed, implemented or led teams for...an operating system, an object-oriented database
Cool, which OS and database?
>taught computer science at graduate and post-graduate level at Carnegie-Mellon University
Which courses?
Ivan responded to the post but didn't address the questions about his bio at all, but perhaps he doesn't feel the need to.
For the curious, his name apparently was Mark Rain for some time in the 70s/80s [1] – there are a number of publications under that name.
All of us have to maintain detailed bios when seeking funding, trying to build a team, or building corporate partnerships, all of which appear to be on the table.
Instead they invest based on our technology, most of which (and eventually all) we make publicly available. You may not be able to judge it, but any potential partner has people who can. One of the things that encouraged us in our long road is that the more senior and more skilled the reviewer, the more they loved the Mill. Quotes like "This thing is the screwiest thing I've ever seen, but it would work - and I could build it".
As we grow there will be more place for beginners, but not yet. Mind, that's beginners as in concept understanding, not beginners as in age or degrees. I'm a dropout, and this year we added an intern in Tunisia who's still finishing his exams. We let people self-select, we don't try to persuade them. And we don't pay them, so only the convinced join.
In a way, at the top of engineering things work much more like they do in the arts: you are judged by your portfolio, not by your background or education.
An example of patents slowing down innovation rather than promoting it.
So it won't be useful for another 20ish years. Yay?
> It will, naturally enough, break. However, by hosting it ourselves we can capture what failed, and use that as raw feed for our test/debug team.
Wow, this is getting more and more depressing.
If they had no chance of any reward for their efforts, but expected their decades of work to make some already wealthy chip vendors a bit richer, you think they would bother?
This kind of thing is the best possible use of the patent system: it’s supporting extremely novel and creative technical invention and engineering, not preventing anyone else’s work (nobody was intending to make anything remotely similar), and putting the details of the new inventions into the public sphere for anyone’s study now and free use in 20 years.
To make use of these inventions requires billions of dollars of capital investment in chip fabrication infrastructure. Anyone with serious intentions can approach the Mill people and ask to license their technology.
Hardware is harder than software.
> The designers claim that new principles are becoming rare in instruction-set design, as the most successful designs of the last forty years have become increasingly similar. Most of these that failed, failed because their sponsoring companies failed commercially, not because the instruction-sets were technically poor. So, a well-designed open instruction set designed using well-established principles should attract long-term support by many vendors.
* * *
> Not asking this rhetorically
It’s hard to take your comment seriously, since you don’t seem to understand what the Mill people are trying to do (even in basic concept, i.e. rethinking the entire design of the CPU).
I recommend you watch a few of Ivan Godard’s lectures. That will give you a much better idea than reading discussion here.
It seems like they're doing that now mainly to get better feedback. I can't imagine that rationale would make as much sense once their stuff is more stable.