
If they run it on an FPGA the game will be over. The first question to ask is how it compares to other architectures in performance and gate count on some standard benchmarks. In particular they need to compare it to RISC-V, which is deliberately not patent-encumbered and seems to be killing it in performance per area and per watt.

I can't say for certain that they'll fail. Maybe they'd be twice as good, but anything short of that and I just don't see viability.




Perhaps you are thinking that the FPGA is a product? It's not; it's an RTL validator. Moving chip RTL from FPGA to product silicon is a well understood step that is almost routine in the industry. Time was you would do initial RTL development work in silicon, but modern FPGAs are big enough to hold a whole CPU core. Today you wouldn't develop directly on the silicon without an FPGA step, even if you own a fab; it's just too expensive to debug.


An FPGA implementation will pin down an actual gate count, and it can give you a meaningful DMIPS/MHz number, both of which are relevant metrics.
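
For reference, DMIPS is Dhrystone iterations per second normalized to the VAX 11/780's 1757/sec, and DMIPS/MHz divides that by the clock rate. A quick sketch of the arithmetic (the example figures are made up, not anybody's real numbers):

    # DMIPS/MHz from a soft-core Dhrystone run.
    # All figures here are hypothetical, for illustration only.
    VAX_DHRYSTONES_PER_SEC = 1757  # VAX 11/780 baseline = 1 DMIPS

    def dmips_per_mhz(dhrystones_per_sec, clock_mhz):
        dmips = dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC
        return dmips / clock_mhz

    # e.g. a 50 MHz FPGA soft core completing 100,000 loops/sec:
    print(dmips_per_mhz(100_000, 50))  # ~1.14 DMIPS/MHz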


Their claim was ~10x improvement in performance/power compared to existing general-purpose CPUs.

> If they run it on an FPGA the game will be over.

Huh?


That's great then. An FPGA implementation should be able to validate that. Even a fraction of that 10x improvement could change the world. So why does it take so long to benchmark an actual implementation? That's a rhetorical question: people implement CPUs in FPGAs all the time, and this one is supposed to be simpler. There really is no excuse for it to take so long. They need to stop talking about their project and actually show us something.


> That's a rhetorical question: people implement CPUs in FPGAs all the time, and this one is supposed to be simpler.

Huh? You're not serious. The Mill is definitely not simpler than many other CPUs (in-order CPUs, for example); what it is supposed to have is better performance, or more exactly a better performance/power ratio.


I am not sure. I heard about the Mill some years ago and watched some presentations about the belt, and my feeling was that it would be simpler to map functional languages onto it.

Maybe that was wishful interpretation, but if I remember correctly the idea is to put more logic into the compiler (the good old "sufficiently smart compiler") to do the optimization, rather than doing it at runtime in HW. IMHO this implies less logic in the HW, which might make the implementation simpler (I am a SW guy, so my computer architecture knowledge might be skewed).

Furthermore, I believe we might be able to infer more and more bounds on runtime behavior, which could give rise to more aggressive optimizations. In particular, the Mill provides more predictable performance than traditional architectures, which is a feature in itself, as it e.g. simplifies (again IMHO) real-time programming.

So that's why I believe the implementation doesn't have to be harder than traditional architectures.

I am really looking forward to seeing some results, and I admire their effort.


The belt's designed to be easy to compile to (register coloring is NP-hard), but it requires the CPU to analyse data flow somewhat to avoid a bunch of data copying. Godard frequently explains that the belt is conceptual, and that the hardware should be doing optimizations underneath it.

Whereas for a simple architecture, registers are simple: they index directly into the register file. This only breaks down when we bring in out-of-order execution and register renaming, in order to have more physical registers than the instruction set specifies.
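
For anyone who hasn't watched the talks, a belt is roughly a fixed-length queue of recent results, with operands named by age instead of by register number. A toy model in Python (my own simplification; as noted above, real hardware wouldn't physically shift anything):

    from collections import deque

    # Toy belt: b0 is the newest result, b1 the one before, etc.
    # The oldest value silently falls off the far end.
    class Belt:
        def __init__(self, length=8):
            self.slots = deque(maxlen=length)

        def drop(self, value):   # every op drops its result onto the belt
            self.slots.appendleft(value)

        def read(self, pos):     # operand "b<pos>", addressed by age
            return self.slots[pos]

    belt = Belt()
    belt.drop(3)                              # belt: [3]
    belt.drop(4)                              # belt: [4, 3]
    belt.drop(belt.read(0) + belt.read(1))    # "add b0, b1" drops 7
    print(belt.read(0))                       # 7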


Optimal register coloring is NP-hard, but no compiler does that; the heuristics are no worse than quadratic and give near-optimal results.

The part of the Mill specializer that schedules ops is linear, while the part that assigns belt numbers and inserts spills is N log N in the number of ops in an EBB, because it does some sorts.
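
To make the heuristics point concrete for bystanders, here's a bare-bones greedy coloring of an interference graph; this is a toy illustration of the general technique, not the Mill specializer's algorithm or what any production compiler literally ships:

    # Greedy graph coloring: give each virtual register the lowest
    # physical register not already used by a colored neighbor.
    # Quadratic-ish and near-optimal in practice, not optimal.
    def color(interference):  # {vreg: set of conflicting vregs}
        assignment = {}
        # most-constrained-first is a common ordering heuristic
        for v in sorted(interference, key=lambda v: -len(interference[v])):
            taken = {assignment[n] for n in interference[v] if n in assignment}
            reg = 0
            while reg in taken:
                reg += 1
            assignment[v] = reg  # spill if reg exceeds the physical count
        return assignment

    g = {'a': {'b', 'c'}, 'b': {'a'}, 'c': {'a'}}
    print(color(g))  # {'a': 0, 'b': 1, 'c': 1}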


> my feeling was that it would be simpler to map functional languages onto it

I found this blog post [1] interesting: it argues that mainstream architectures have defaulted to low-level C-machines, and that radically new CPU designs might return to the halcyon days of Lisp.

[1] http://liam-on-linux.livejournal.com/43040.html


Thx, that was an interesting read.

However, I think that Lisp is too powerful to be executed directly on the machine (as I understand it, Lisp machines provided HW capabilities for dealing with cons cells and list atoms).

I often wonder if we should try to make a non-Turing-complete language that is "pseudo-Turing-complete": some kind of bounded automaton with insane worst-case upper bounds (e.g. on complexity) that we can feed to the OS/machine, which can then schedule aggressively since it knows the upper timing bounds etc.
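
Something like that can be approximated in software today with a fuel parameter: every evaluation step burns a unit of budget, so the caller gets a hard worst-case bound. A toy sketch (the expression language here is invented for illustration):

    # Fuel-bounded evaluation: total work can never exceed the
    # budget, so a scheduler knows an upper bound before running.
    # Toy expressions: ints, ('add', l, r), ('mul', l, r).
    class OutOfFuel(Exception):
        pass

    def eval_bounded(expr, fuel):
        def go(e):
            nonlocal fuel
            if fuel == 0:
                raise OutOfFuel
            fuel -= 1
            if isinstance(e, int):
                return e
            op, l, r = e
            return go(l) + go(r) if op == 'add' else go(l) * go(r)
        return go(expr)

    print(eval_bounded(('add', 1, ('mul', 2, 3)), fuel=10))  # 7
    # eval_bounded(('add', 1, 2), fuel=1) raises OutOfFuel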

Btw., is there good literature about the implementation of intermediate languages as compilation targets for statically typed functional languages? I think the go-to book was SPJ's "The Implementation of Functional Programming Languages", but I am not sure how relevant it is today.


To "execute Lisp directly" means to interpret the AST. No hardware machine does that that I know of. It can almost certainly be done, but arguably shouldn't. Interpreting the raw syntax is a bootstrapping technique to which it would seem foolish to commit a full-blown hardware implementation.



The Mill is simpler than many CPUs because it isn't OoO and it has a single address space, but it also has some magic in it (the way it zeros memory, the stacks...) which isn't free to implement.


An FPGA implementation on general-purpose fabric will not validate the performance/watt when compared to dedicated silicon. The two aren't comparable. This is part of the reason modern FPGAs have onboard ARM processors, as well as graphics and I/O peripherals.


>> An FPGA implementation on general-purpose fabric will not validate the performance/watt when compared to dedicated silicon.

No, of course not. But it will validate per-clock-cycle benchmark numbers, as well as provide the gate count (or LUT count) required to achieve that benchmark result. If the architecture is anywhere near as awesome as claimed, there should be a strong indication of that in the numbers produced by the FPGA implementation.



