Mill Computing in 2017 (millcomputing.com)
230 points by reitzensteinm 255 days ago | 141 comments

For the uninitiated, the bringup of a new processor is very complex - from compiler to libraries to kernel, from emulation to real hardware. The process described here is very similar to the one I was able to observe at SiCortex, though that chip was based on an existing one and so there was no FPGA stage. I'll bet they even use some of the same tools, such as the emulator. Kudos to them for not rushing ahead before their foundation is ready, and I look forward to seeing how this all works out.

Exactly. As a new entrant, it is impossible for us to immediately enter the mass markets dominated by the majors. Consequently we adopted a strategy of targeting an increasing set of niche markets that have been poorly served by the majors and their products, and then going after larger markets as we grow in resources.

This market strategy dictates the ability to produce many specialized products with small sales for each. That can't be done in the million-monkeys development approach used by the majors, which is why the majors neglect these markets. So we adopted a specification-based design strategy.

In turn, the design strategy dictates the development strategy: first the specification tools; then the assembler and simulator so we could try out various designs in small test cases and improve the architecture; then the tool chain so we could measure real quantities of code and confirm the ISA and macro-architecture. Then, and only then, write what manual RTL is left that the generators working from the specifications can't handle. The combined RTL must be verified, and it is much easier and cheaper to do that in an FPGA than with fab turns. As the message says, the FPGA is next.

Lastly, we will pick a particular market and specify the ideal product for it, run the spec through the process we have been so long building, and the fab gives us something in a flat pack.

Which won't work, of course. The first time, anyway :-)

Ivan, Thank you very much for this comment. I have been struggling to communicate similar problems related to scale in the biological sciences. The methodology needed to discover something new is far too "specialized" for the techniques used by the "majors" to be economically viable. I have been pushing for a specification based approach and your account here is the perfect articulation in support of it. I was lucky enough to spot the Mill talks when they first appeared and have been following along from the sidelines. I will keep cheering and look forward to more progress. Best of luck to you and the whole Mill team!

Bringup is when you throw everything together and that means that you have everything to throw together. It's like that Johnny Cash song, One Piece At A Time.

  Now, up to now my plan went all right
  'Til we tried to put it all together one night
  And that's when we noticed that something was definitely wrong.
That's bringup. Compiler development ain't part of bringup. They could/should have been doing an LLVM backend using a software simulator (and eventually an FPGA simulator) which is the traditional approach handed down from our knuckledragger forefathers and going back decades.

Still, I truly wish them well. It beats the crap out of 99% of the YC startups I see. The 37th recommendation engine, a webstore for sneakerheads, ...

They uh, are already doing an LLVM backend for a software simulator? The FPGA work is its own project.

The bringup process includes all the stuff leading up to that final moment, just like most other processes - e.g. release process - include the stuff leading up to a conclusion. The quibble really wasn't very constructive.


"After a design is created, taped-out and manufactured, actual hardware, 'first silicon', is received which is taken into the lab where it goes through bringup. Bringup is the process of powering, testing and characterizing the design in the lab."

Well, sorry, but I worked with a whole lot of people who were designing, verifying, and building that processor, and they used "bringup" for the entire process. I was part of that process myself. Actual experience counts for more than argumentum ad wikipedia. Also, even if the word was wrong, that matters far less than the context in which it was used. Dictionary flames are the last refuge of someone with nought worthwhile to say.

The "Software" section of the article says they did exactly the things you suggest they should do.

Is it reasonable to still not be running on an FPGA after 12+ years of working on it?

It seems that they're working on it somewhat part-time, and have prioritised getting the patents first. Which makes some sense as it's their only way to avoid being instantly destroyed by Intel.

Evolutionary development you can start at once, because you are building on what you had before; think an x86 generation. Evolution works if you already dominate a market and only need to run a little faster than your competitor. Evolution can be scheduled; tick-tock.

A newcomer can't sell yet another me-too, even with evolutionary improvement. Instead the newcomer has to rethink and create from first principles if it is to have any chance in the market. Rethinks can't be scheduled; they take as long as they take. Ours has taken longer than I'd hoped, but adding resources like more people to the project would have just slowed us down, or forced us to market with a broken product.

The Mill rethink stage is over, and we now can have reasonable schedules and put in more resources; that's why we are going out for a significant funding round this year, our first over $10M.

That doesn't sound right even though I support your niche strategy. Many of the successful companies had me-too products with incremental, ecosystem, or marketing improvements. Intel did with x86, starting as an incremental improvement. AMD turned into a huge company doing it to x86. Transmeta and Centaur did well adding power efficiency among other things. Quite a few vendors implemented POWER variants, with many acquired or still doing good business.

There's plenty of it on software side, too, with the proprietary DOS's and UNIX's. Foreign cloners stayed doing it with mainframes and embedded CPU's. So, incremental stuff (esp patented) can certainly grab market share and generate revenue. It's been going on for some time even in CPU market even with dominant players.

Yes - by those with established businesses. I can't think of a startup that succeeded with an initial incremental product, at least since Amdahl. It's also hard to do an increment: the Intel teams are not dumb, and many could and have built far better processors than Intel has - but not while keeping compatibility and Intel's ROI and marketing and pricing structure. TransMeta tried - RIP.

There's also personal strategy involved. If we had done a better X86 we would have needed huge dollops of money to crack the front door of the market - c.f. TransMeta - and lost ownership of the company. By going for the disruptive approach we still own all of it - and funding rounds now are at a valuation that will keep us making our own mistakes, not someone else's. That matters to me, enough to go without paycheck for a decade. YMMV.

Aren't almost all the businesses mentioned here extinct, or at least have tiny chunks of the market?

More disruptive startups are. These were either running profitably or absolutely huge at one point. Blame AMD's and VIA's management for the rest on their end. ;)

Starting out in x86, AMD had a license from Intel.

That justifies my use of it even more. Even with the license requirement, they still succeeded quite a bit. Even led on the 64-bit part since Intel screwed that up.

(Mill team)

Too right! 2017 is the year we get funding to go full-time and implement what we've been plotting at our Tuesday evening meetings for 10 years :)

Hmm, if you had a model with 128-bit words that skipped floating point, and left out the sneaky stuff that makes some people distrust Intel and AMD processors, you'd have an almost ideal chip for Ethereum nodes. (Make it 256-bit words and it'd really be ideal.)

Not that that's a huge market at this point, but apparently there's a lot of Fortune 500 interest so maybe that'll change by the time you're in production.

Something I don't see mentioned on your site is the rest of the hardware. Specifically, is the Mill intended to be more of a co-processor with computing work handed over to it, or is it standalone in a system? If standalone, what is the situation with bootup (e.g. UEFI or equivalent) and controllers like PCIe, USB, storage (NVMe, SATA etc.), NICs etc.?

AFAIK, it's intended as a general-purpose processor.

And when you show a reference design for a smart phone with provably strong security capabilities it's going to get very real indeed. Kudos for continuing the push toward real parts.

But if you wait long enough your earlier ideas can no longer be patented due to prior art.

If they run it on an FPGA the game will be over. The first questions to ask are how it compares in terms of performance and gate count to other architectures running some benchmarks. In particular they need to compare it to RISC V which is deliberately not patent encumbered and seems to be killing it in performance per area and per watt.

I can't say for certain that they'll fail. Maybe they'd be twice as good, but anything short of that and I just don't see viability.

Perhaps you are thinking that the FPGA is a product? It's not; it's an RTL validator. Moving chip RTL from FPGA to product silicon is a well understood step that is almost routine in the industry. Time was you would do initial RTL development work in silicon, but modern FPGAs are big enough to hold a whole CPU core. Today you wouldn't develop directly on the silicon without an FPGA step, even if you own a fab; it's just too expensive to debug.

An FPGA will require an actual gate count, and it can give you a meaningful DMIPS/MHz number. Both of which are relevant metrics.

Their claim was ~10x improvement in performance / power compared to existing general-purpose CPUs.

> If they run it on an FPGA the game will be over.


That's great then. An FPGA implementation should be able to validate that. Even a fraction of that 10x improvement could change the world. So why does it take so long to benchmark an actual implementation? That's a rhetorical question, people implement CPUs in FPGAs all the time and this one is supposed to be simpler. There really is no excuse for it to take so long. They need to stop talking about their project and actually show us something.

> That's a rhetorical question, people implement CPUs in FPGAs all the time and this one is supposed to be simpler.

Huh? You're not serious. The Mill is definitely not simpler than many other CPUs (in-order CPUs, for example); what it is supposed to have is better performance, or more exactly a better performance/power ratio.

I am not sure. I heard about the Mill some years ago and watched some presentations about the belt and my feeling was that it would be simpler to map functional languages onto it.

Maybe that was more wishful interpretation, but if I remember correctly the idea is to put more logic into the compiler (good old "sufficiently smart compiler") to do the optimization than to do it at runtime in HW. IMHO this implies less logic in the HW, which might make the implementation simpler (I am a SW guy so my computer architecture knowledge might be skewed).

Furthermore, I believe that we might be able to infer more and more bounds about runtime behavior that could possibly give rise to more aggressive optimizations. In particular the Mill provides more predictable performance than traditional architectures, which is a feature in itself as it e.g. simplifies (again IMHO) real-time programming.

So that's why I believe the implementation doesn't have to be harder than for traditional architectures.

I am really looking forward to seeing some results and admire their effort.

The belt's designed to be easily compiled to (register coloring is NP-hard), but it requires the CPU to analyse data flow somewhat so it can optimize in a way where there isn't a bunch of data copying. Godard frequently explains that the belt is conceptual; the hardware should be doing optimizations underneath.

Whereas for a simple architecture registers are simple: they index into the register file. This only breaks down when we bring in out-of-order execution and register renaming in order to have more registers than the instruction set specifies.

Optimal register coloring is NP hard, but no compiler does that; heuristics are no worse than quadratic and give near-optimal.

The Mill specializer part that schedules ops is linear, while the part that assigns belt numbers and inserts spills is NlogN in the number of ops in an EBB because it does some sorts.

> my feeling was that it would be simpler to map functional languages onto it

I found this [1] blog post interesting - that mainstream architectures have defaulted to low level C-machines and that radically new CPU designs might return to the halcyon days of lisp.


Thx, that was an interesting read.

However I think that Lisp is too powerful to be executed directly on the machine (as I understand Lisp machines provide HW capabilities to deal with cons and list atoms).

I often wonder if we should try to make a non-Turing-complete language that is "pseudo-Turing-complete": some kind of bounded automaton with insane worst-case upper bounds (e.g. on complexity) that we can feed to the OS/machine, which could then schedule aggressively since it knows the upper timing bounds etc.

Btw. is there good literature about the implementation of intermediate languages as compilation targets for static functional languages? I think the go-to book was SPJ's "The Implementation of Functional Programming Languages", but I am not sure how relevant it is today.

To "execute Lisp directly" means to interpret the AST. No hardware machine does that that I know of. It can almost certainly be done, but arguably shouldn't. Interpreting the raw syntax is a bootstrapping technique to which it would seem foolish to commit a full-blown hardware implementation.

The Mill is simpler than many CPUs because it isn't OoO and it has a single address space, but it also has some magic in it (the way it zeros memory, the stacks...) which isn't free to implement.

An FPGA implementation on general-purpose fabric will not validate the performance/watt when compared to dedicated silicon. The two aren't comparable. This is part of the reason modern FPGAs have onboard ARM processors, as well as graphics and I/O peripherals.

>> An FPGA implementation on general-purpose fabric will not validate the performance/watt when compared to dedicated silicon.

No of course not. But it will validate benchmarks per-clock cycle, as well as providing a gate count (or LUT count) required to achieve that benchmark result. If the architecture is anywhere near as awesome as claimed, there should be a strong indication of that in the numbers produced by the FPGA implementation.

My only exposure to Mill (besides this page link) has been random comments in various stories here over the years.

They typically start with ("Mill team"), and then the rest sounds like it came from the future. It's all about getting the flux capacitors charged.

Can anyone post some low-level code (in assembly, or the closest equivalent if that term is no longer relevant with Mill) implementing some well-known simple algorithm on the Mill? Just to quickly get a feel of what it's all about, how it looks.

Something like computing n! for word-sized integers, or strcmp(), or whatever.

Does anyone have a quick link to something like that?

I think this would implement factorial. I can't remember some of the specific mnemonics, but this is the gist:

        # entering factorial: argument n is b0
        # drop initial accumulator to belt; bail early if n==0
        con(1), cmpeq(b1, 0), brtrue(end);
        # now, b0 == 1, b1 == n
        # multiply accumulator by n, then decrement n, then check if n still not 0
        mul(b0, b1), sub(b1, 1), cmpeq(b1, 0), brfalse(loop);
        # no matter what, the most recent accumulator is b0

ConAsm (model-dependent assembler) for "Silver" model:

F("fact") %0; sub(b0 %0, 1) %1;

        gtrsb(b0 %1, 1) %2,
          retnfl(b2 %0),
          inner("fact$1_1", b1 %1, b2 %0);
L("fact$1_1") %3 %4;

        mul(b1 %3, b0 %4) %5;

        sub(b0 %4, 1) %6;

        gtrsb(b0 %6, 1) %7,
          retnfl(b1 %5),
          br("fact$1_1", b2 %6, b1 %5);
Note that integer multiply is specified as 3 cycles in this model, so the recurrence in the loop takes three cycles. Other ops are one cycle. Each semicolon is an instruction (and cycle) boundary, ops separated by commas issue and execute together.

Ah, of course. I've watched your videos so much, how could I have forgotten that multiply is 3 cycles? Thanks for chiming in!

Well, the Mill uses a skewed, exposed pipeline VLIW ISA so reading the assembly isn't going to be a picnic. I'd recommend either listening to the talks, which are interesting whether the whole thing works out or not, or looking at the wiki.



Well, there's a couple of pieces of sample code on comp.arch in google groups / usenet. Sadly, I can't find any examples of genAsm anywhere, but a couple of fragments of conAsm. Maybe that's better, as conAsm is what's going to actually run.

This thread, for example, talks about instruction density of Mill programs. Mill does somewhat poorly, with a pre-alpha quality compiler doing the codegen: https://groups.google.com/forum/#!topic/comp.arch/RY3Bk7O61u...

Comparing several ISAs on the simple program in the first post, a Mill Gold CPU binary weighs in at 337 bytes (and 33 instructions), compared to:

* powerpc: 212 bytes, 53 instructions
* aarch64: 204 bytes, 51 instructions
* arm: 176 bytes, 44 instructions
* x86_64: 135 bytes, 49 instructions
* i386: 130 bytes, 55 instructions
* thumb2: 120 bytes, 51 instructions
* thumb: 112 bytes, 56 instructions

GenAsm for the factorial function (conAsm in a different reply, above):

define external function @fact w (in w) locals($3, %9, &0, ^6) {

label $0:

    %1 = sub(%0, 1) ^0;

    %2 = gtrsb(%1, 1) ^1;

    br(%2, $1, $2);
label $1 dominators($0) predecessors($0, $1): // loop=$1 header

    %3 = phi(w, %5, $1, %0, $0);

    %4 = phi(w, %6, $1, %1, $0);

    %5 = mul(%3, %4) ^2;

    %6 = sub(%4, 1) ^3;

    %7 = gtrsb(%6, 1) ^4;

    br(%7, $1, $2);
label $2 dominators($0) predecessors($0, $1):

    %8 = phi(w, %0, $0, %5, $1);

    retn(%8) ^5;

Higher-end Mill models have more and bigger things to encode, and so a given program will occupy more bytes than it will on a lower-end member. Thus a belt reference on a Gold is 6 bits, but only 3 on a Tin.

Mind you, the Mill's split-stream instruction decoding means each decoder only has to deal with ~168 bytes and ~16 instructions, and there's a cache for both (I think?).

Consider the CPU as a pile of ALUs, FPUs, load & store units, internal buffers, and the like. Each "instruction" is a variable-size set of parallel instructions, directing which output goes into which input for various stages of that clock cycle.

The layer that humans and HLL compiler developers end up with at the lowest seems to resemble a dataflow graph, and the system-specific code converts that into the parallel binary code stream.

The notion of the "belt" helps arrange how long data sticks around in live outputs & buffers, but isn't a physical element in the chip.


In one of their talks they show what strcpy would look like on the Mill: https://youtu.be/DZ8HN9Cnjhc?t=3416

Ah, always with the lectures. :) So, the code for strcpy() on the Mill seems to be:

    load    *src, bv
    eql     <load>, 0
    smearx  <eql>
    pick    <smearx0>,None,<load>
    store   *dest, <pick>
    brfl    smearx1, loop
This "vectorizes" a plain while-loop by loading as "byte vector". The size of the vector depends on the exact Mill family member implementation's choices, but is at least 8 according to the lecture. The lack of actual register names is of course a loud signal that the Mill is different.

Pretty cool, and I feel like I learned something. Thanks!

I refrained from posting the direct assembly, as without the talk the smearx and pick would be meaningless.

If you also look a bit further into the talk they show the 'remaining' instruction, which allows you to do this same thing with fixed-count loops.

From the talks, Mill assembly is ill-defined; it varies a lot from model to model. They have an intermediate representation that I think is more like JVM bytecode than assembly, but I don't think they ever showed what it looks like.

Silly question: does this company retain full-time people, or is their work primarily done by folks employed elsewhere? Maybe I just haven't reached whatever point in life where a salary is no longer necessary, but I was surprised by their About page.

My understanding is that it's a group of enthusiasts and volunteers who are donating sweat equity out of enjoyment and the gentlemen's agreement of full-time employment once the company really gets going.


Or a technical collective. This isn't necessarily sinister.

Both. Both full- and part-timers are on sweat-equity, no cash, although we will be switching to cash-optional with the next funding round.

I’m slightly surprised that they didn’t have an FPGA implementation running as proof of concept already.

Regardless, I’m looking forward to seeing what MillComputing can deliver in the real world.

Instead of bashing the Mill, the real question is: what can we open-source developers do to help them?

Implement algorithms? Improve FOSS tools?

@Mill team,

can you set up a list of FOSS projects that we can contribute to in order to help? I specialize in implementing algorithms, especially optimization and similar work.

See https://millcomputing.com/#JoinUs

NDA and sweat-equity agreement required; you get full-vested long-term options monthly; expected to be actual stock monthly after the next round.

To help in a FOSS context: we are not yet ready to put out an SDK to the FOSS community, or to anyone really. That will wait until after our cloud environment is up - and a few more patents have been filed (the ISE exposes things we want to protect).

In the meantime, the most help would be to support existing microkernel operating system efforts such as L4 (https://en.wikipedia.org/wiki/L4_microkernel_family). The Mill will have a big impact on conventional OSs like Linux, but those suffer from built-in assumptions that open security holes and makes the OS a dog that the Mill can only train a little. The big win is in microkernels, which can take advantage of the Mill's tight security and ultra-fast context switching.

Or support languages that have central micro-thread concepts, such as Go https://en.wikipedia.org/wiki/Go_(programming_language), for the same reason.

But whatever you do, don't tune the microkernels or microthreads for a conventional core; instead, do it right, and we'll be along to help eventually :-)

According to the CTO, 80% of software operations are loops (https://youtu.be/QGw-cy0ylCc?t=454)? I need convincing of that before I watch the rest of the talk. I suspect it would be a fundamental premise for the architecture of the chip. If it's not true, the stats are probably skewed by an unusual benchmark.

100% of software operations are loops, it's just that some of them are only run once ;)

Joking aside, computers help us handle loops more efficiently; that's it. The 20% "non-loop operations" are just the fabric that tie the loops together.

Their docs say that in most programs, 80% of executed operations are in loops, which makes sense.

I haven't followed mill after attending a few lectures years ago.

Does anyone think this is commercially viable? Or are we looking at a new Transputer?

The Mill architecture has lots of ideas which are really good on paper. We've had the same situation before, around 2000 with Itanium. There were many parts of that CPU which looked good on paper. In hindsight, the Itanium was too complex.

Commercial viability of the architecture depends on how X86-64 and ARM fares in the coming years. If they stagnate and can't produce faster single-CPU cores and people continue writing software which lives in a single-core world, then there is a room for a CPU delivering a 10-fold increase in performance per watt per dollar.

If on the other hand, most of the heavy computation are moved into either SIMD-style GPUs, or MIMD style Adapteva (Parallela) like solutions, where you have a thousand cores, then the Mill is less likely to have success.

I'm optimistic that someone is trying to rethink CPU design from the bottom up. It is a gamble, but if we don't try, we won't really learn whether there is anything better out there than the current generation of out-of-order executing chips.

I'm pessimistic because Mill smells of Itanium. It is a mix of many new, unproven, ideas inside a single solution. Some of the ideas are genuinely good and might stick if thrown at a wall. But if you are gambling on having many new ideas at the same time, it is likely that some of them will turn out to be bad ideas.

I'd place the majority of my money on simple ISA designs, such as ARM or RISC-V, preferably in a many-core design, but I'd hedge a good sum on Mill as well.

In one of the talks Godard said the Itanium concept wasn't brought to conclusion mostly because the chief architect died. Don't remember the name. Of course that should be taken with a grain of salt.

One of the most interesting things to me is the security model with the promise (unproven, maybe very over-optimistic) of the equivalent of context switches costing no more than regular function calls (for which there is a dedicated instruction). That would be HUGE for microkernel architectures.

> One of the most interesting things to me is the security model with the promise (unproven, maybe very over-optimistic) of the equivalent of context switches costing no more than regular function calls...

In theory, we should be able to do something like that now on ordinary CPUs if we write our user-mode software in a memory-safe language such as Rust.

Current CPUs go to great lengths to isolate process address spaces so that one badly behaving application can't crash the computer. However, if the compiler also guarantees that the application can't access memory it doesn't own (because the language provides no mechanism by which to do that), then the hardware memory isolation is redundant.

The OS could maintain metadata about each binary and what it was compiled with. If it wasn't compiled with a trusted compiler, then it runs inside an isolated process and pays the cost of context switches or it runs inside a CPU emulator at something like 1% performance.

This assumes there are no unsafe code blocks in "trusted" user-space code; in practice, this would mean that everything that can be made more efficient by doing it in an unsafe way moves into a standard library that's trusted. So, an unsafe code block would be like an suid-root binary or a kernel module.

Having user space process and the kernel run in the same address space means that system calls are just function calls (which could be inlined by the compiler) and we could support things like Infiniband network interfaces (that map their hardware registers directly into user-space memory to avoid system call overhead) through an OS API, just like a regular network adapter instead of them being a weird special case.

I don't see compiler writers being thrilled by having to harden the entire compiler toolchain against actively malicious input. And there would be no way to verify the binary by itself, you'd have to have every binary built on and signed by a trusted build server. Oh, and now any binary built with a compiler that has bugs can now compromise the entire OS.

No, if you want to keep security you need some kind of external memory protection system. Memory isolation gives that and takes away performance. Mill gives both.

You're right, compiler writers wouldn't want to be responsible for system security in the same way that OS kernel authors are now, and commercial closed-source software doesn't fit well with the scheme I describe.

If Mill can drop the cost of context switches down to the point where they're indistinguishable from function calls, then that's pretty awesome and it would presumably work well with the software we have now, which is a strong point in its favor. (I haven't seen any Mill talks about security/memory protection, so I don't know what the implementation details are.)

I'm just trying to make the point that reducing the cost of context switches isn't the only solution, and that we could eliminate context switches altogether by making different tradeoffs. I think both approaches are promising and will appeal to different audiences.

> And there would be no way to verify the binary by itself.

Doesn't Java do exactly that?

No, the JVM provides security in this scenario by being the trusted compiler and inserting memory safety checks itself.

This is not a new idea - Burroughs' old CPUs (B6700 etc.) did this - system integrity depended on the toolchain always producing safe code - I remember linker (aka binder) bugs bringing down the weekly payroll on our multitasking system.

It also meant that compiler development essentially required your own multi-million dollar mainframe - there weren't a lot of languages available.

You couldn't safely import code from another machine since anyone could write any bitstream to a tape.

The 6700 was a great old machine... but this bit sucked.

Isn't this basically the idea of Google's NaCl (native client)?


From the wikipedia description, I think NaCl is doing something a little different. It looks like they rely on the MMU to prevent a misbehaving application from reading/writing outside its address space. That makes sense, since it's being used to run C and C++ programs that presumably would be difficult or impossible to statically verify that they're using pointers in a way that's definitely safe.

Itanium's major sticking point was relying on compiler technology that never materialized. The Mill architecture seems to respect the current level of compiler tech a lot more.

Not sure what you mean here.

Itanium was a VLIW and the Yale Bulldog compiler solved the VLIW code generation problem with trace scheduling back in the 80s. At this point, that's standard Dragon Book stuff.

Itanium failed, to be sure, but it wasn't the compiler. However, I'll grant that the workload didn't match the EPIC architecture and the compiler. EPIC succeeded at scientific workloads.

As to whether Mill respects the current level of compiler tech a lot more I'd like to know more about that. They need to get their LLVM backend up and running.

Agreed: 'compiler technology' was a marketing promise/excuse over the fact that for any real-world problem, even the best possible code could not come near to the theoretical hardware performance. Intel had previously made the same mistake with the i860 but didn't learn from it.

So the hardware didn't match the workload and they blamed the compiler? That sounds plausible. I think they oversold the use cases for the Itanium though, they made it sound more like a general-purpose server CPU. The Mill might not be general-purpose either but they seem to be more honest about its niche (it's basically a DSP that happens to support general computation). Maybe that will be enough for it not to be a disappointment.

I bet that if AMD hadn't come out with their 64-bit extensions, this laptop would be running on an Itanium processor.

I don't see how the Mill has any less reliance on modified compilers than Itanium. Instead, it seems to require more.

The Mill's history may be different because compilers actually advanced a lot since Itanium's launch, because it brings some arguable improvements over Itanium, or because the Mill's team is trying to write those compiler improvements themselves, instead of making a partnership with Microsoft.

> I don't see how the Mill has any less reliance on modified compilers than Itanium. Instead, it seems to require more.

I don't know, some of the novelties in the Mill arch would seem to simplify compilers. For instance, no need for complex register allocation.

Other optimizations applicable to traditional CPUs either may not apply to the Mill CPU, or may need significant tweaking to work properly.

I got the impression it would be much more complex, like deciding whether saving a value is faster than reordering, and deciding the best place to request a load. It is also a large-instruction-word architecture, which means it's mostly similar from the compiler's point of view.

The one thing that I think makes the Mill simpler is the no-value values mixed with lazy error conditions. I have no idea if those are powerful enough to compensate for the rest.

Those issues are pretty easy.

For spilling: if a desired argument is less than one belt-length away then we directly reference it; between one and two away we reorder; and more than two we spill.

We place a load as soon as the address arguments are statically available. The compiler doesn't have to deal with aliasing, which is a large and bug-ridden part of compiling for other targets.

Lazy-error (aka NaR) means that there are no error dependencies in arithmetic expressions. Current compilers for other targets simply ignore such dependencies, relying on the "undefined behavior" rule. Mill is designed to not produce nasal demons :-)

Hm. I'd say Mill requires an even more ambitious compiler than Itanium. Itanium's large number of registers, for example, was friendly for existing compilers, while I don't know how compilers will handle the whole belt thing.

Incidentally, I don't think the Itanium compiler tech never materialized, it just materialized late. Itanium had a lot of other problems, but the later Itanium models really flew for certain workloads (SQL Server, in particular). After the first one (which was quite a dud and killed Itanium's reputation), Itanium was definitely not slow. The x86 emulator was slow, and AMD64 absolutely stole Itanium's lunch.

> simple ISA designs, such as ARM or RISC-V, preferably in a many-core design

There are several of these already (e.g. Tilera) which haven't really taken off outside of those marketed as "graphics cards". The main problem is that memory bandwidth is now the real bottleneck for real workloads.

That reminds me, any news from Rex Neo [1]? There's been no activity on its mailing list.

[1] http://rexcomputing.com/#neoarch Basically simpler and more GHz/watt than Parallella.

We're still alive, received our first silicon back in October, and will have news coming out soon.

The problem with the Mill is it's not clear to me what advantage it can hope to seize over the behemoth which is Intel. If the market opens up enough to justify new architectures, then it's also opening up enough to let Intel pivot into that space, and unlike everyone else they own the fabs to do it.

This is drastically over-simplified, but in short, the Mill is a sort of middle-ground between superscalar implementations (x86, ARM) and VLIW (Itanium, DSPs).

Superscalars leave a lot of the optimization process to complexity in the hardware. This is seen in stuff like the out-of-order scheduling and large cache hierarchies. In one of the Mill talks, it's guesstimated that almost 90% of the circuitry is dedicated to simply moving data around, as opposed to actually doing any work on that data.

VLIWs take the opposite extreme, leaving the complexity of optimization to the compiler. History has shown so far though, that many computing problems are (again oversimplified) too conditionals-heavy to really benefit from VLIW, and end up running much more slowly.

So the Mill is a bet that they can get more benefits from each approach without the major draw-backs of each. This isn't really ground Intel could simply "move into" without half a decade plus of work, and even then, they'd be cannibalizing their x86 ecosystem, which is not a risk most entrenched corporations are fond of.

More accurate to say, I think, that the Mill is a VLIW with certain hardware facilities to overcome the traditional weaknesses of those - code density and variable memory latency. The solution to conditional branches is exactly the same for VLIW as for Superscalar machines: use a branch predictor. And VLIWs can tolerate a slightly worse branch predictor since they tend to have shorter pipelines.

EDIT: To explain a bit more, VLIW has historically worked great in cases like DSP workloads where the memory access patterns are very predictable and you aren't unexpectedly loading things from lower-level caches very often. There's another thread on HN right now about doing deep learning with a Hexagon, which is a sort of VLIW, and it works very well. But as soon as you miss L1 in a VLIW the whole thing comes to a stop, whereas in an OoO (Out of Order) processor you can keep executing subsequent instructions that don't depend on that load and so you don't have to stall. Basically every instruction that isn't a load has deterministic, or at least hard-bounded, latency that the compiler can easily plan for.

The other big disadvantage of VLIW is that sometimes you have stretches of code where you can only have one useful instruction at a time. VLIWs often use very RISCy encoding formats that still take up lots of space per bundle in these stretches, leading to potentially very low code density. The Mill gets around this with a very CISCy encoding that only takes a small amount of I-cache space for single instructions.

But that's the point: if I want to buy non-Intel at my company, I've got contracts, lawyers fees, upper-level management etc. to convince to do it. Intel is probably sending me support personnel and sample hardware.

And I'm still looking at recompiling, debugging and deploying a ton of software to see advantages from the Mill.

So whatever advantages it brings, they need to be very substantial (i.e. if the next-gen of Intel x86 chips still out-perform it, I'll buy them) and quite quick - because I can go 5 years not buying the Mill while Intel promises to support me for their awesome new architecture. I mean, probably we buy some Mill machines, but how likely is it that it's game changing on my codebase?

That's where I see the problem. There's this whole huge assumption that the Mill will yield a bunch of benefits. If they're clear cut (a huge if) then they still have to beat their competitors being able to brute-force performance improvements until they come up with a new architecture themselves, which can take advantage of the very compensation they're asking from their customers ("switch architectures, it'll be great we promise").

There are a few places I could see Mill making inroads.

The security features could be very useful for the big internet companies, Amazon, Google, Facebook, etc and they can afford to spend a few tens of millions of dollars on something speculative like this. And they do, with things like Arm or OpenPower servers.

There's also the high end embedded land where you don't need to run a traditional operating system. Network switches, cell towers, that sort of thing.

Very good points. I agree that many established companies are going to be very wary of making the jump. My hunch is that if they have a real shot, and this is assuming that most of their promises turn out, it's not going to be going after general purpose computing head-on.

For instance are there big problems out there that are compute intensive, but branch-heavy enough that GPUs aren't a great fit, in applications that need low power but that don't make sense as "cloud services"? I don't know.

The optimist in me sees potential opportunity in opening up new domains. The practicalist agrees with you; it's going to be an uphill battle and a lot of stars need to align just right. But if they really can deliver 10x on general purpose computing workloads, it's hard for me to see that not being game changing.

My thought process on this has been that superscalar is like a JIT, in that it attempts to optimise instruction scheduling as it happens, always executing when viable. But obviously, although this is good for optimising execution, it comes at a fairly high implementation-complexity cost.

Whereas VLIW is where you optimise ahead of time with no foreknowledge: if the compiler is good enough, you get the same performance with simpler, cheaper, more efficient hardware (or potentially more performance), but often this doesn't pan out fully.

The biggest issue with VLIW is the unpredictability of memory load latency, which makes any sort of static scheduling very hard for general-purpose workloads. Itanium, it seems, had similar issues even with specialised hardware to help the compiler.

Not sure what the memory-latency-hiding story is for the Mill.

Edit: Symmetry said it better.

On the Mill, you tell it the address that you want to read and when you want to read it. So you can issue a load for XYZ as soon as you know what XYZ is, and if there's a subsequent store to XYZ it will update the result, so when the load retires you get the data that is there at retire time, not the data from when the load was issued.

If you issue a load that completes in 5 cycles and it needs to go to RAM for that data, then it's going to stall waiting for that data, but because you can issue loads sooner than on other architectures you can alleviate some of the problems with stalling on a load.

But as far as I understand and hear, Itanium had sort-of similar capabilities (ALAT and speculative loads) and it didn't work great outside of FP workloads as it is still hard to programmatically schedule load early even if you can ignore RAW hazards. What's unique in the Mill?

Basically they've used their exposed pipeline design and register metadata to fix the things that didn't work well about ALAT. Or at least that's the theory.

EDIT: I think the main problem was that on Itanium issuing a speculative load could potentially trigger a page fault, making it potentially dangerous. On the Mill the page fault won't trigger until the load leads to a side effect outside the belt, sort of like it's been wrapped in a Haskell Maybe monad. So if you have something like

  for (int i = 0; i < foo; i++) {
    a[i] = a[i]+1;
  }
you can speculatively load a[i+n] while you're working on a[i], even if allocated memory stops at the border of the array allocation.

A couple things:

1) Intel is very heavily invested in x86, and they're likely to be very reluctant to change.

2) The Mill folks have patents, which might be hard to work around.

3) If the Mill becomes successful, they're likely to either be bought out by one of Intel's behemoth competitors, like Samsung or Qualcomm, or perhaps even by Intel itself.

4) If Intel really likes their technology, they could license it from Mill Computing. It would be surprising for Mill to say no to a reasonable offer, considering that they don't have their own fabs and their business model is to license their designs to those that do.

All but #4. Our business model is to sell chips, not licenses. Why? Intel's quarterly dividend is bigger than ARM's annual revenue.

Licensing is a backup plan.

Ah, thanks for the clarification.

You forgot:

5) Intel really likes their technology, but they do not want to license it so they decide to implement tweaks around the Mill patents.

A more likely strategy for a major (not Intel specifically) is to just use whatever they want without a license, and beat us to death with lawyers.

> The problem with the Mill is its not clear to me what advantage it can hope to seize over the behemoth which is Intel.

If the Mill actually achieves the stated 10x improvement in performance per watt, you'd be quite stupid not to use it in highly competitive markets like smartphones, because others will. It would provide a significant competitive advantage.

That's why the Mill team has been stockpiling patents for a while.

You would have thought the same about the SoC space, but they let it slip away. Intel does have the ability to squash competition, but it's not strategically perfect.

Fred Chow, who's great at compilers, expressed the opinion that the Itanium added instructions for many things that were already easy for compilers to deal with. That's not exactly a ringing endorsement. Mill appears to do a lot of stuff that's actually helpful to compilers.

Sounds like another Transmeta waiting to happen.

There really hasn't been anything to follow apart from the video lectures (https://millcomputing.com/technology/docs/).

It's exciting to think about a 10x improvement in CPUs. This would literally change our lives.

For anyone considering getting involved or investing could you fill in a couple of bio details?

>Ivan Godard has designed, implemented or led teams for...an operating system, an object-oriented database

Cool, which OS and database?

>taught computer science at graduate and post-graduate level at Carnegie-Mellon University

Which courses?

Someone asked something similar a while ago: http://blog.kevmod.com/2014/07/the-mill-cpu/

Ivan responded to the post but didn't address the questions about his bio at all, but perhaps he doesn't feel the need to.

For the curious, his name apparently was Mark Rain for some time in the 70s/80s [1] – there are a number of publications under that name.

[1] http://newsgroups.derkeiler.com/Archive/Comp/comp.arch/2012-...

Why would he not feel a need to?

All of us have to maintain detailed bios when seeking funding, trying to build a team, or building corporate partnerships, all of which appear to be on the table.

Our industry is a surprisingly small pond, at the top anyway. No one would invest in the Mill based on my formal bio "College dropout; never took a CS course in his life" :-)

Instead they invest based on our technology, most of which (and eventually all) we make publicly available. You may not be able to judge it, but any potential partner has people who can. One of the things that encouraged us in our long road is that the more senior and more skilled the reviewer, the more they loved the Mill. Quotes like "This thing is the screwiest thing I've ever seen, but it would work - and I could build it".

I don't think many people care about a degree. It's just weird that you ask interested people to join you but seem reluctant to give details on your bio. A lot of people factor it in when deciding on joining or dealing with a startup.

Judge us on the tech, not on me. To be honest, if you cannot already understand what we have put out to the public well enough to know that you want to work on it, then you are probably still too junior to be really useful as we are now. We can't afford the cost of ramping people up if they are not already there.

As we grow there will be more place for beginners, but not yet. Mind, that's beginners as in concept understanding, not beginners as in age or degrees. I'm a dropout, and this year we added an intern in Tunisia who's still finishing his exams. We let people self-select, we don't try to persuade them. And we don't pay them, so only the convinced join.

In a way, at the top of engineering things work much more like they do in the arts: you are judged by your portfolio, not by your background or education.

"Judge us on the tech, not on me." Hans Reiser and ReiserFS taught everyone that the person's bio _does_ matter for the project.

I don't think anything in his bio or not in his bio conveyed that he would murder his wife and go to prison for life thus abandoning the project.

Ah, Mark Rain brings back memories of his Mary language...

You show your age :-)

I am glad this is still going - if only for the reason that different computer architectures fascinate me. IIRC they said the belt machine thing maps well to SSA - you specify your operands by position, but the destination register is implicitly the back of the queue - in kind of the same way that each "a op b" in SSA is saved to a new variable.


Ivan is a compiler guy, and I've always seen a huge impact of "compiler guy" thinking on Mill -- in a much smarter way than how "compiler guy" thinking sank Itanium.

> "We considered going out for our next funding round earlier, but decided to wait until our technology was confirmed by issued patents"

An example of patents slowing down innovation rather than promoting it.

> So far our patent experience has been excellent. We have successfully refuted all the prior art cited by the Examiners. While we have rephrased our applications to suit the Examiner in several cases, the substance of our claims have been allowed without exception. The Mill is truly novel – and we now have the USPTO imprimatur to confirm that.

So it won't be useful for another 20ish years. Yay?

> It will, naturally enough, break. However, by hosting it ourselves we can capture what failed, and use that as raw feed for our test/debug team.

Wow, this is getting more and more depressing.

Here’s a team of world-class experts who have spent 10+ years of their lives completely redesigning all the parts of the CPU from scratch, along with a completely new low-level software stack on top of it. They’ve taken no salaries during that time.

If they had no chance of any reward for their efforts, but expected their decades of work to make some already wealthy chip vendors a bit richer, you think they would bother?

This kind of thing is the best possible use of the patent system: it’s supporting extremely novel and creative technical invention and engineering, not preventing anyone else’s work (nobody was intending to make anything remotely similar), and putting the details of the new inventions into the public sphere for anyone’s study now and free use in 20 years.

To make use of these inventions requires billions of dollars of capital investment in chip fabrication infrastructure. Anyone with serious intentions can approach the Mill people and ask to license their technology.

>team of world-class experts who have spent 10+ years of their lives completely redesigning all the parts of the CPU from scratch, along with a completely new low-level software stack on top of it. They’ve taken no salaries during that time.

reminds me of gimp, or emacs

Except for, you know, the main part of your quote: "completely redesigning all the parts of the CPU from scratch, along with a completely new low-level software stack on top of it."

Hardware is harder than software.

Not asking this rhetorically; how is it that RISC-V, also a major hardware development effort, was able to get a lot more done (including commercial chip production) in a lot less time (~6 years) without being patent-encumbered?

These projects have completely different scope. Wikipedia on RISC-V:

> The designers claim that new principles are becoming rare in instruction-set design, as the most successful designs of the last forty years have become increasingly similar. Most of these that failed, failed because their sponsoring companies failed commercially, not because the instruction-sets were technically poor. So, a well-designed open instruction set designed using well-established principles should attract long-term support by many vendors.

* * *

> Not asking this rhetorically

It’s hard to take your comment seriously, since you don’t seem to understand what the Mill people are trying to do (even in basic concept, i.e. rethinking the entire design of the CPU).

I recommend you watch a few of Ivan Godard’s lectures. That will give you a much better idea than reading discussion here.

I've been reading the mill papers for the last few years, and I'm still not sure why they haven't done even a basic PoC in the 10+ years they've been working on this.

This is an example of patents being used the way they were intended. Not all patents are bad you know. The world is shades of grey, not black and white as children see it.

Sorry, but children do not see the world in black and white, neither literally nor in terms of good/bad. Some people seem to divide the world around them like that, but it has nothing to do with age.

If you want to get pedantic, I'll revise that to: young children typically have a simplistic world view where things are either good or bad. I'm all too aware that many adults are like that as well, but it's a childish attitude that speaks very poorly of them. I also expect that kind of attitude is negatively correlated with IQ, but I have no evidence to back that up.

I don't think it would ever ship if it wasn't patented. It makes it considerably easier to obtain funding.

Frankly, I'd rather have it one product not ship at all than have it prevent other people from working in the area.

You can work in the same area, you just can't sell the exact same thing. From your comments, it seems like you have no problem working for free, so what's the issue?

Not sure about the patent stuff, but "cloud"-only compiler hosting is an absolute show-stopper for me.

That's only while it's in development. Obviously, once ready, the toolchain will be available publicly. They can't release it now, though, because not all of their patents have been accepted.

> We plan to host a complete Mill development environment in the cloud for public use. It will, naturally enough, break. However, by hosting it ourselves we can capture what failed, and use that as raw feed for our test/debug team. This will give us much better feedback than releasing a SDK for user’s own machines, because users rarely take the trouble to file bug reports.

It seems like they're doing that now mainly to get better feedback. I can't imagine that rationale would make as much sense once their stuff is more stable.

At a minimum, you'd need to spend a stack of cash verifying that the EULA of the cloud compiler didn't include a "ALL UR INVENTIONS ON THIS ARE BELONG TO US" clause - would be only prudent given how intent the company is on planting its flag in IP.

Such paranoia is amply justified in our industry. However I expect that we will continue our iconoclastic approach to legalisms so long as the founders keep control. Our model for our cloud software follows the example pioneered by Greg Comeau; you can see his at http://www.comeaucomputing.com/tryitout/.
