That said, I got the sense that this was what Intel was going for when they did Larrabee, and they just missed because of the focus on graphics. Unlike Larrabee, I suspect OOTBC will need to build it themselves, like Chip did for the Propeller.
That said, the challenge of these bespoke architectures is the requirement for software, or first a GCC port :-). I believe Ivan said they had a port that talked to their simulator, but I don't know if that was an optimizing thing like SGI's compiler for Itanium or a proof-of-concept thing.
The weird thing is of course "Why?", and one might say "But Chuck, faster and cheaper, why not?" But I look at the explosion of ARM SoCs (cheaper, not necessarily faster than x86), look back at ARM, and think 99% of this was getting the ecosystem built, not the computer architecture. So who can afford to invest the billions to build the ecosystem? Who would risk that investment? (Google might, but that is a different spin on things.)
So playing around with the Zynq-7020 (the same chip that is on the Parallella, minus the Epiphany co-processor), I can see a nice dual-core Cortex-A9 where you have full-speed access to a bespoke DSP, if you want, for the 'tricky' bits. Will that be "good enough" for the kinds of things this would also excel at? I don't know, so I don't know how to handicap OOTBC's chances for success. But I really enjoy novel computer architectures, like this one and Chuck Moore's '1000 Forth chickens' chip (a reference to Seymour Cray's quote, "Would you have 1,000 chickens pull your plow, or an ox?").
A really interesting time will be had when 'paid off' fab capacity is sitting idle and the cost for a wafer start becomes a function of how badly the fab wants to keep the line busy.
Of course, there's always the outside chance that the experiment worked so well that Intel is keeping an architecture inspired by the Larrabee research project in reserve for when it's gone as far as it can shrinking transistors and needs something new to sell.
I read-ish the whole set of slides and it sounds pretty good (the devil is always in the details), but I got a little worried about VLIW-ish issues when he said on slide #57:
The compiler controls when ops issue
Is this architecture going to face the same issue?
Edit: I was watching the video and heard that "nearly all of what the superscalar is doing is [not calculating]". One of the other VLIW issues was that chip area was dominated by cache area, so all the stuff about [not calculating] shrank and shrank relatively as cache area grew (see: http://techreport.com/review/15818/intel-core-i7-processors). This claim concerns me.
Edit V2: but damn... Exciting stuff.
All Mill family members have a common architecture that the middle end knows about. ME output (specializer input) is a complete CFG and DFG for the abstract Mill. The specializer first replaces operations that are not present on the target with calls to functions; on a Mill, a function call is semantically equivalent to an op like an add.
There is no instruction-selection phase. The Mill in general has only one way to implement any given source-level action, and the correct (abstract) code has already been selected by the ME. The BE does instruction scheduling, introducing spill where necessary, using standard scheduling algorithms long used for in-order machines and VLIWs. The result may be listed as assembler source or emitted as binary. The scheduling algorithms have quadratic worst case but in practice are linear. The generated binaries are cached in the load module, so subsequent runs skip the specializer step.
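To give a feel for this kind of scheduling, here is a generic list-scheduler sketch of the sort long used for in-order machines (illustrative op kinds and latencies; this is not the Mill toolchain's actual algorithm):

```python
# Generic list scheduler for an in-order machine with fixed op latencies.
# Illustrative only: unlimited issue slots, made-up latencies, and no
# modeling of spill, belt, or registers.

LATENCY = {"load": 3, "mul": 2, "add": 1}

def schedule(ops):
    """ops: list of (name, kind, deps). Returns {name: issue cycle}."""
    issue, ready_at = {}, {}  # ready_at: cycle a result becomes available
    pending, cycle = list(ops), 0
    while pending:
        for op in pending:
            name, kind, deps = op
            if all(d in ready_at and ready_at[d] <= cycle for d in deps):
                issue[name] = cycle
                ready_at[name] = cycle + LATENCY[kind]
                pending.remove(op)
                break  # rescan: more ops may issue in this same cycle
        else:
            cycle += 1  # nothing was ready: advance to the next cycle
    return issue

# A dependent chain load -> mul -> add issues at cycles 0, 3, and 5.
print(schedule([("a", "load", []), ("b", "mul", ["a"]), ("c", "add", ["b"])]))
```

Because every latency is statically known, the schedule is exact; no hardware scoreboarding is needed at run time.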
Current status is that work on the new (LLVM-based) compiler is on hold while we complete the patent filings, and the old one is out of date already. When the filings are in we hope to make the tool chain and sim available on-line for those who want to play. We could do that now with the assembler and sim, except that the asm instruction set exposes some things that the patents aren't in on yet. Grrr - patents!
But perhaps that "final optimization pass" would be nearly as hard as the whole compilation problem in the first place; dunno. I wasn't on the compiler team, so this is perhaps a naive viewpoint.
Pragmatically, exactly how fast is that run-time optimization? Could you realistically JIT it, or should the more-optimal, chip-specific asm be cached between loads? Or is this so slow you'd only ever want to do it once?
If there's a bad gate in the cache then the fab process simply uses one of the spare cells inside; the user never realizes that a spare has been used. If there's a bad gate in the core proper then that chip is gone, lowering the fab yield.
The same can be done at the level of other regular structures. Your two-core chip is really a four-core chip with a couple of bad cores. That's one of the reasons why the vendors are so eager to convince you that multicore is the Pearly Gates - it increases their effective yield.
Re cache area:
To a first approximation, power cost of a cache is constant independent of size, whereas core power is superlinear. Consequently the limiting factor for increasing cache size is latency, not power. See ootbcomp.com/docs/encoding for how the Mill doubles instruction cache size without increasing latency.
But the lack of latency hiding means you have to generate code that schedules operations to start at the right time. A mul taking 2 cycles means you want to start any dependent, lower-latency instructions after it. But if at some point in the future mul is scaled down to one cycle for whatever reason, your timings will all be off and your belt occupancy will be less than ideal.
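A toy illustration of that fragility (hypothetical latencies, simplified drop semantics): belt operands are positional, so if the mul's latency changes, the completion order, and therefore every downstream belt position, changes with it.

```python
# Toy model: a mul issues at cycle 0, an add issues at cycle 1 with
# latency 1. Results "drop" onto the belt in completion order.

def belt_after(mul_latency):
    drops = [("mul_result", 0 + mul_latency), ("add_result", 1 + 1)]
    drops.sort(key=lambda d: d[1])                # completion order
    return [name for name, _ in reversed(drops)]  # position 0 = newest

print(belt_after(3))  # ['mul_result', 'add_result']: mul read at belt[0]
print(belt_after(1))  # ['add_result', 'mul_result']: belt[0] is now the add!
```

Code scheduled for one set of latencies would read the wrong belt slot on hardware with another, which is presumably why binaries are specialized per family member.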
I kind of wonder if the reality is simply that there is nothing better equipped to schedule instructions than the cpu itself. Maybe we could have better ways to explicitly hint meaningful information to it, but I'm not sure shoving all the decisions to the compiler is a long term solution.
...which (probably) explains why Google seems to be so interested in this new architecture. Besides Dalvik this new CPU architecture could also be used with the new Portable Native Client (PNaCl) which uses LLVM to compile to an intermediary bitcode. From this perspective, the Belt architecture makes a lot of sense.
Why don't you just recompile?
Some criticism however: no sign yet of how variable latencies in memory will be tolerated. Requiring fixed latency for FUs is problematic with cache misses and FU sharing between pipelines. Also his comparison between a 64-item belt and the 300+ registers of OoOE is unfair, since the 300+ registers will likely tolerate more latency than the smaller belt.
I wrote a review of what I get from his first two talks here: http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/
I think Torben Mogensen posted the same basic idea of replacing registers with 'temporal addressing' to Usenet back in the 90s -- comp.arch? comp.compilers? Boy, it's been a long time.
It's quite a common and recurrent idea, really. However, promises / dataflow tokens / I-structures / etc. are all subject to a common problem: when you receive multiple completions simultaneously, which of them do you schedule first? This choice is highly non-obvious and has a tremendous impact on data locality.
David E. Culler, Klaus E. Schauser, and Thorsten von Eicken. Two fundamental limits on dataflow multiprocessing. In PACT '93: Proceedings of the IFIP WG10.3 Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, pages 153-164. North-Holland Publishing Co., Amsterdam, Netherlands, 1993. ISBN 0-444-88464-5.
A preprint is available as tech report CSD-92-716 from Berkeley: http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6259.html
As far as the meat goes, I'm gratified that I'm not the only one to think this is mindbogglingly elegant. I see some misconceptions in this thread, but I'm finding it difficult to explain the divergences without beginning to inexpertly regurgitate large portions of this and the previous talk (on their instruction encoding).
The different VLIW execution units can only process a subset of the instruction set. That means you need the right mix of instructions in your algorithm to take advantage of the full throughput. If you have any sort of serial dependency you won't be able to take advantage of all the execution slots. It basically excels at signal processing (and even only a subset of that). That said, when you hit the sweet spot it's pretty good.
When someone like TI compares their DSP to ARM they usually tend to ignore SIMD (very conveniently). SIMD (NEON on ARM or SSE on x86) can buy you almost another order of magnitude performance on those super-scalar CPUs if you're dealing with vectors of smaller quantities (16 bit or 8 bit). So while on paper the VLIW DSP should blow the general purpose superscalar out of the water for signal processing algorithms at comparable clock rates it's not quite like that when you can use SIMD. It also takes a lot of manual optimization to get the most out of both architectures.
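To make the packed-arithmetic point concrete, here is a toy Python model of a single 128-bit SIMD add over sixteen 8-bit lanes (roughly the semantics of SSE2's `_mm_add_epi8` or NEON's `vaddq_u8`): one "instruction" does sixteen byte-wide adds at once, each wrapping modulo 256.

```python
# One simulated SIMD "instruction": sixteen independent 8-bit adds.
def simd_add_epi8(a, b):
    assert len(a) == len(b) == 16
    return [(x + y) & 0xFF for x, y in zip(a, b)]  # per-lane wraparound

a = list(range(16))   # lanes 0..15
b = [250] * 16
print(simd_add_epi8(a, b))  # lanes 0-5 reach 250..255; lanes 6+ wrap to 0..9
```

Sixteen results per instruction is where the near-order-of-magnitude claim for small-quantity vectors comes from, before accounting for loads, stores, and shuffles.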
So when you're in the VLIW's sweet spot your performance/power/price is pretty good. But the general purpose processors can get you performance in and out of that sweet spot (you're probably sucking more power but that's life).
You really can't look at "peak" instructions per second on the VLIW as any sort of reliable metric and you need any comparison to include the SIMD units...
EDIT: Another note is that for many applications the external memory bandwidth is really important. The DSPs benefit from their various DMAs and being able to parallelize processing and data transfer but generally x86 blows away all those DSPs and ARM. I guess in a modern system you may also want to throw the GPU into the comparison mix.
Decades-old compiler tech can produce near-perfect schedules for in-order machines (including VLIWs) for all operations with a statically-fixed latency, which in practice means everything but loads. A load miss will stall an in-order machine.
Load misses are of two kinds: where the load result is a data-flow ancestor of all the rest of the program, and when the load is incidental to the program dataflow. In the former case neither in-order nor out-of-order machines can do anything but wait for memory. An example is making random references into a hash table that is bigger than cache - every machine of every architecture is going to run at DRAM speed.
However, there are also incidental misses, where the program has other things to do that do not require the result from the load miss. For these, the OOO machine can keep going, while an in-order machine stalls.
The Mill does not face this issue. It is an in-order machine, yet does not suffer from miss stalls, or at least not from unnecessary ones. In general, the Mill avoids all the misses that an OOO can avoid. Or roughly so - there are a few cases where the Mill stalls but an OOO doesn't, and a few where the OOO stalls but the Mill doesn't, but it's roughly a wash.
Most of how this works will be described in the next presentation (#3) in the series, although some won't be until a later presentation.
Too bad they can't release their emulators/simulators, but I sympathize with their desire to be first-to-file to protect what sounds like years of work.
It seems like quite a reach to pick a 'latest and greatest' IA processor (why is this $885? Because it can be) as a point of contrast. We can hit half the performance envelope and/or about an eighth the price while keeping many of the features that make a desktop processor what it is (like, for example, a whopping big cache hierarchy). Picking on x86 seems like a good way of overstating the case; if we're getting a new architecture, we may as well pick on ARM or Tilera or what have you.
Am also, as an early believer in the Itanium, somewhat nervous about static scheduling for any purpose. Dynamic branch prediction is very accurate on today's architectures; it's not clear that static scheduling can emulate that accuracy and enjoy the same benefits.
Also the implementation of the belt itself could be quite nasty. Just look at the face of the guy who asked about it after he hears the answer. ;-)
Finally, they will not get anywhere near peak performance with general purpose code. The parallelism is just not there at the instruction level. They would do well on high performance computing or digital signal processing with enough floating point units and memory bandwidth.
Instead they seem to target big data. An interesting move. It will be interesting to see actual performance numbers (even from a simulator). I wish them luck.
There is an architectural minimum size for Mills: there must be at least one ALU and one memory unit, and all Mills must use a 64-bit address; the Z80 market is safe from us, and you won't see Mills in your toaster or thermostat. There's no architectural maximal Mill, but there are diminishing returns; in current process technology we feel that eight ALUs are getting close to that edge, but those decisions are made independently for each market and process.
Therefore, it's better to use a fixed-size queue structure. Hence "belt" - it's like a conveyor belt.
I'm still watching it though. The topic is interesting but man, his voice is droning.
Converting to something akin to SSA (but not quite the same thing) is actually what he's talking about superscalar architectures doing in register renaming.
That said, just because you use an SSA IR doesn't mean your code generation doesn't require register renaming. But it would be a very good basis to generate code for this architecture.
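As an illustration (a toy three-address IR, not any particular backend), renaming each register write to a fresh SSA-style version is what removes the false (write-after-write and write-after-read) dependencies on a reused name:

```python
# Each write to a register gets a fresh versioned name; reads use the
# latest version. Reuses of the same architectural register no longer
# serialize against each other.
def rename(ops):
    """ops: list of (dest, src1, src2) three-address instructions."""
    version = {}
    use = lambda r: f"{r}_{version.get(r, 0)}"
    renamed = []
    for dest, s1, s2 in ops:
        u1, u2 = use(s1), use(s2)                 # read current versions
        version[dest] = version.get(dest, 0) + 1  # fresh version for the write
        renamed.append((use(dest), u1, u2))
    return renamed

# r1 is reused for two unrelated results; after renaming (r1_1 vs r1_2)
# the first and third ops are visibly independent.
print(rename([("r1", "r2", "r3"), ("r4", "r1", "r1"), ("r1", "r5", "r6")]))
```

Hardware renamers do essentially this with a physical register file instead of string names; the belt sidesteps it by never giving results reusable names in the first place.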
I work daily with a VLIW architecture and the only place I ever see a plain "move" instruction is in loops. Everywhere else, the compiler just churns through the registers as if they were in a belt anyway – just the names are absolute instead of relative.
I can imagine this might simplify the processor logic some – results are primarily read from one of the first few belt locations, and are always written to the first location. "Register moves" aside, is this the primary benefit?
- To achieve instruction-level parallelism, traditional architectures (his example: Haswell) often employ very messy techniques like register renaming which create a huge amount of complexity increasing power-consumption.
- He has focused on one such technique, very long instruction word (VLIW), and taken it to an extreme: he proposes to throw away general-purpose registers and replace them with a "belt": a write-once linear tape of memory (implemented using stacks).
- He then points out various advantages of this model, including in-order execution (ILP traditionally requires reordering), short pipeline, and overall simplification of architectures.
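As a naive software model (not how hardware would implement it), the belt from the bullets above behaves like a fixed-length queue with relative, write-once addressing:

```python
from collections import deque

class Belt:
    """Toy belt: results drop at the front; the oldest fall off the end."""
    def __init__(self, length=8):
        self.slots = deque(maxlen=length)

    def drop(self, value):           # every result is written exactly once
        self.slots.appendleft(value)

    def read(self, pos):             # 0 = newest result, 1 = the one before...
        return self.slots[pos]

belt = Belt()
belt.drop(3)                             # result of some op
belt.drop(4)                             # result of the next op
belt.drop(belt.read(0) + belt.read(1))   # "add b0, b1" drops 7 at the front
print(belt.read(0))  # 7
```

Note there is no write port to an arbitrary slot: producers only ever append, which is the property that eliminates renaming.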
All this looks fine on paper, but I don't see a proposed instruction set, or any indication of what realization of this model will require. In short, it's a cute theoretical exercise.
So, let's look at what the unworkable problems with Haswell are, and what's being done to fix them. Yes, nobody seems to be able to figure out how to reduce power consumption on performant x86 microarchitectures beyond a point. It's a very old architecture, and I'm hopeful about the rise of ARM. The solution is not to throw away general purpose registers, but rather to cut register renaming and make ILP easier to implement by using a weak memory model (which is exactly what ARM does). ARM64 is emerging, and the successes of x86 are slowly percolating to it. Moreover, the arch/arm64 tree in linux.git is 1.5 years old and under active development; we even got virt/kvm/arm merged in recently (3 months ago), although I'm not sure how it works without processor extensions (it's not pure pvops). ARM32 already rules embedded and mobile devices, and manufacturers are interested in taking it to heavy computing. In short, the roadmap is quite clear: ARM64 is the future.
The core of the Linux memory barriers model described in Documentation/memory-barriers.txt (heavily inspired by the relaxed DEC Alpha) should tell you what you need to know about why a weak memory model is desirable.
To exploit ILP, ARM has to do the same thing as every other register architecture: move to a superscalar out-of-order architecture. There's no real alternative. The weak memory model makes cache coherency protocols simpler/faster, but won't save you from false register dependencies.
Three-address code might make it possible to defer the costs of complex register renaming, but ARM has discovered the importance of code density, and Thumb-2 (2-address code) is preferred for most functions.
What are false register dependencies? And how does a three-address code help in register renaming?
Some reading references would help. I'm currently reading .
The real cost of weak memory models is in software. Programmers that write multithreaded code need to understand the memory models. And few actually do, because these things are not trivial. So you gain a few percent of performance with a weak memory model and at the same time you dramatically restrict the number of programmers who can write correct code. A very bad deal.
Hennessy, Patterson: "Computer Architecture, A Quantitative Approach." is the standard introduction to computer architecture.
Linux supports Alpha, which provides the weakest ordering guarantees anyway, so no: any architecture stronger than Alpha puts absolutely no burden on anybody. In practice, weaker ordering constraint means more freedom for Linux.
And this particular solution only works inside the Linux kernel and only using GCC. It doesn't matter what we think a sane compiler-writer or a sane computer-architect would do. Many times we just have to live with the choices other people made.
I think you got something wrong here. Architectures with strong memory ordering put more burden on the hardware and less on the programmer. That was the point I was trying to make: weak memory ordering is not worth it, because it is better to let the hardware do the hard work and not to burden the programmer with it.
Linux may work well with relaxed memory ordering, but that does not come for free; a lot of people first had to understand what a relaxed memory model is, etc.
My question is very simple: why do we have architectures that reorder aggressively if the performance gain is not worth it? CPU manufacturers understand that weaker guarantees lead to complexity that must be handled at the software level: so, are they mad? The fact of the matter is that we have been moving away from Lisp machines, CISCs, and "human" instruction sets to RISCs, because the kernel/compiler is in a much better position to make optimization decisions than the hardware (it can see the bigger picture).
These architectures exist, and the infrastructure to handle all this complexity has already been written (yes, all major compilers too). Yes, a lot of people had to study a _lot_ to get us where we are today, and the success of our little iPhone apps stands on the shoulders of those giants.
I seriously don't get what software complexity you're whining about. Complexity is not the exception in Linux; it's the bloody rule! Have you seen the fs/ tree implementing various filesystems? Perhaps the kernel/ tree with the state-of-the-art scheduler? Or even the net/ tree implementing TCP/IP?
I'm not interested in discussing some hypothetical idealized textbook world where everything is simple and elegant: I'm interested in contributing whatever little I can to tomorrow's concurrency infrastructure.
The mammoth question in the room has still not been answered: how does ARM manage significantly lower power consumption? Aggressive reordering (=> weaker guarantees) seems to be part of the answer.
No, they are not. But in the early '90s, when most of these decisions were made, the situation was very different. Processors got twice as fast every two process generations. Shared-memory multiprocessors were an exotic niche. But then we hit the power wall, and all of a sudden shared-memory multiprocessors became the norm. Today, you cannot just wait for a new processor to make your program magically faster; you have to rewrite it. If you asked the same people who made these decisions in the '90s today, many would say (and many actually have said publicly) that they regret them.
On complexity: yes, software is incredibly complex. That's why you don't want to add even more complexity to it!
> How does ARM manage significantly lower power consumption?
Easy: ARM is targeting the low-power market, so they don't need to build the fastest chip. When you try to build the fastest chip, energy efficiency suffers a lot. Once ARM starts to compete with Intel on performance, their power usage will skyrocket.
Registers are entirely internal to a processor. They don't have to be replicated or consistent between different cores.
Memory is shared system state. To write correct parallel programs, CPUs need to define how they manage consistency between different processors. x86 does have a more conservative model with stronger consistency than ARM, and this does have (theoretical) scaling consequences-- but Intel has a huge amount of experience making it fast.
> Memory is shared system state. To write correct parallel programs, CPUs need to define how they manage consistency between different processors. x86 does have a more conservative model with stronger consistency than ARM, and this does have (theoretical) scaling consequences-- but Intel has a huge amount of experience making it fast.
Yeah, but what worries me is not raw throughput scaling; it's power consumption. The mainstream ultraportables today (the MacBook Air Mid-2012, Chromebook Pixel, and Lenovo X1 Carbon) all use the 17W i5-3427U. The MBA Mid-2013 jumped to Haswell early and uses the 15W Core i5-4250U. But we still haven't been able to make a decent quad-core machine in that class: there's a 15W Core i7-4650U (I believe you can get an MBA 2013 with this as well), but a paltry 1.7 GHz?
How many years has it been since quad-cores first came out? It looks to me like x86 has stagnated: it doesn't matter how competent Intel is; the architecture seems to have fundamental problems.
Now, I don't fully understand what the ARM A57 is, but the specs look really impressive. Moreover, AMD has confirmed that they will manufacture Opterons in Q1 2014. I'm not completely sold on ARM or anything, but there is certainly something interesting going on: and we must investigate.
As far as three-address code making register renaming easier, I'm not sure what the author had in mind -- isn't `add $5, %rax` essentially a condensed form of `add $5, %rax, %rax` (which is a 3AC)? In fact, 3AC is _more_ general and should make register renaming harder, if anything.
It's not a general or optimal solution, though.
This design choice had two motivations: 1) membar is fabulously expensive; and 2) weaker consistency models are incomprehensible to mere mortals.
It really does seem like a Lisp would map much better, with the whole caller/callee and those private data belts that looked like hardware level closures!
Considering that C has a "register" storage-class specifier keyword, I don't see how you can make that claim. Granted, modern compilers ignore it.
Certainly, the C specification states that "register" storage is non-addressable, so it does say something.
"A declaration of an identifier for an object with storage-class specifier register suggests that access to the object be as fast as possible. The extent to which such suggestions are effective is implementation-defined."
And 'non-addressable' only applies to the C 'runtime'. The standard does not say anything about the hardware.
For example, a C compiler for the 6502 could attempt to store variables of storage class 'register' in the zero page. Such storage locations have an address, but you are not allowed to take it from standard C.
For an even better example, look at http://en.wikipedia.org/wiki/TMS9900. That CPU stored all its general purpose registers in RAM.
Keep it up! Maybe tomorrow we'll see an idea to replace threads and locks.
Did you even watch his talk? I didn't see anyone arguing against Bret's scaling concerns with threads in yesterday's HN comments.
If we don’t come up with a new system, I can’t imagine how hard it will be to efficiently program CPUs when they start coming out with 128, 256, and 512 cores.