Hacker News new | past | comments | ask | show | jobs | submit login
The Belt CPU Architecture (ootbcomp.com)
183 points by momerath on Aug 1, 2013 | hide | past | web | favorite | 79 comments

I got to talk with Ivan after he gave an earlier version of this talk at Asilomar and the good news was he's the real deal, pretty much every question I could think to throw at him he had a solid answer for, the bad news was I felt the same way about the Mill architecture as I did about Intel's iapx432 architecture [1], which was elegant to a fault.

That said, I got the sense that this was what Intel was going for when they did Larrabee [2] and just missed because of the focus on Graphics. Unlike Larrabee is suspect OOTBC will need to build it themselves like Chip did for the Propeller [3].

That said, the challenge of these bespoke architectures are the requirement for software, or first a GCC port :-). I believe Ivan said they had a port that talked to their simulator, but I don't know if that was an optimizing thing like SGI's compiler for Itanium or a proof of concept thing.

The weird thing is of course "Why?", and one might say "But Chuck, faster and cheaper, why not?" and I look at the explosion of ARM SoC's (cheaper, not necessarily faster than x86) and look back at ARM and think 99% of this was getting the ecosystem built, not the computer architecture. So who can afford to invest the billions to build the eco-system? Who would risk that investment? (Google might but that is a different spin on things).

So playing around with the Zynq-7020 (same chip that is on the Parallea but not the Epiphany co-processor) I can see a nice dual core ARM9 where you have full speed access to a bespoke DSP if you want for the 'tricky' bits. Will that be "good enough" for the kinds of things this would also excel at? I don't know, so I don't know how to handicap OOTBC's chances for success. But I really enjoy novel computer architectures, like this one and Chuck Moore's '1000 forth chickens' chip [4] (it was a reference to the Seymour Cray's quote, "Would you have 1,000 chickens pull your plow or an Ox?"

A really interesting time will be had when 'paid off' fab capacity is sitting idle and the cost for a wafer start becomes a function of how badly the fab wants to keep the line busy.

[1] http://en.wikipedia.org/wiki/Intel_iAPX_432

[2] http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)#Di...

[3] http://www.parallax.com/propeller/

[4] http://www.colorforth.com/S40.htm

I'll try to address some of the questions and issues raised here; the Mill has been more fully discussed on comp.arch, and there is more material at ootbcomp.com/docs. Sign up at ootbcomp.com/mailing-list to get announcements of additional videos, white papers etc as they become available.

I just signed up. Can't wait to see more of this!

I remember reading the Larrabee white papers, press material, etc. for an article[1] I wrote some five years ago. I think the reason that Larrabee was pitched for graphics initially was just to limit the domain of "things" (tessellation, texture sampling, etc.) that had to be catered to in silicon. I've no doubt that had it come to market, and had it worked well, we would be looking at a radically different CPU architecture than Haswell on the market today.

Of course, there's always the outside chance that the experiment worked so well that Intel is keeping an architecture inspired by the Larrabee research project it in reserves for after it's gone as far as it can shrinking transistors and needs something new to sell.

[1] http://www.trustedreviews.com/opinions/intel-larrabee-an-int...

First off, I love the post! Super meaty goodness.

I read-ish the whole set of slides and it sounds pretty good (the devil is always in the details), but I got a little worried about VLIW-ish issues when [in the slides] he said on slide #57:

    The compiler controls when ops issue
One of the big issues with VLIW was that the compiler had to be intimately aware of processor architecture. So when you upgraded your '886 to a '986 you needed new binaries because the '986 had more registers or executions units. [I assume Itanium fixed some of this, but it also sunk my interest in VLIW.]

Is this architecture going to face the same issue?

Edit: I watching the video and heard that "nearly all of what the super scalar is doing is [not calculating]". One of the other VLIW issues was that chip area was dominated by cache-area, so all the stuff about [not calculating] shrank and shrank relatively as cache area grew (see: http://techreport.com/review/15818/intel-core-i7-processors). This claim concerns me.

Edit V2: but damn... Exciting stuff.

The Mill compiler is a conventional three-phase tool. The first version used the EDG front end and home-grown middle and back ends, but we are converting to a clang front end, LLVM middle, and home-grown back end. All optimization is done in the middle end; the back end (the "specializer" runs at program install (or load) time and does only scheduling and emit.

All Mill family members have a common architecture, that the middle end knows about. ME output (specializer input) is a complete CFG and DFG for the abstract Mill. The specializer first replaces operations that are not present on the target with calls to functions; on a Mill function call is semantically equivalent to an op like an add.

There is no instruction-selection phase. The Mill in general has only one way to implement any given source-level action, and the correct (abstract) code has already been selected by the ME. The BE does instruction scheduling, introducing spill where necessary, using standard schedule algorithms long used for in-order machines and VLIWs. The result may be listed as assembler source or emitted as binary. The schedule algorithms have quadratic worst case but in practice are linear. The generated binaries are cached in the load module, so subsequent runs skip the specializer step.

Current status is that work on the new (LLVM-based) compiler is on hold while we complete the patent filings, and the old one is out of date already. When the filings are in we hope to make the tool chain and sim available on-line for those who want to play. We could do that now with the assembler and sim, except that the asm instruction set exposes some things that the patents aren't in on yet. Grrr - patents!

As someone who worked on operating systems back at Multiflow (late 80's, the first VLIW start-up as a spin-off from Yale research), it struck me recently that something like an LLVM representation of binary code might solve the "compiler problem" (needing to know the exact specifics of the chip's latencies). I.e., you'd run and load LLVM binaries, and have a runtime final optimization pass that took into account the specific latencies of each opcode for a particular implementation. (The LLVM architecture is actually already set up to do optimizations at runtime on LLVM "binaries".)

But perhaps that "final optimization pass" would be nearly as hard as the whole compilation problem in the first place; dunno. I wasn't on the compiler team, so this is perhaps a naive viewpoint.

> The LLVM architecture is actually already set up to do optimizations at runtime on LLVM "binaries".

Pragmatically, exactly how fast is that run-time optimization? Could you realistically JIT it, or should the more-optimal, chip-specific asm be cached between loads? Or is this so slow you'd only ever want to do it once?

Right, forgot to say that: you'd cache the results (like Rosetta translation on the PowerPC or the DEC VAX-to-Alpha binary recompilation) so you'd only take the translation hit once.

Transmeta was a little like that, except it was x86 to VLIW.

Transmeta was doing dynamic binary translation from x86 to VLIW. I think the grandparent comment is saying that you could distribute programs in an IR and then compile them before running.

In terms of square millimeters the caches dominate chip area. In terms of fab yield (which is the cost driver) they do not, because caches are regular structures that are built with built-in sparing.

If there's a bad gate in the cache then the fab process simply uses one of the spare cells inside; the user never realizes that a spare has been used. If there's a bad gate in the core proper then that chip is gone, lowering the fab yield.

The same can be done at the level of other regular structures. Your two-core chip is really a four-core chip with a couple of bad cores. That's one of the reasons why the vendors are so eager to convince you that multicore is the Pearly Gates - it increases their effective yield. Re cache area:

To a first approximation, power cost of a cache is constant independent of size, whereas core power is superlinear. Consequently the limiting factor for increasing cache size is latency, not power. See ootbcomp.com/docs/encoding for how the Mill doubles instruction cache size without increasing latency.

I feel like the issue this faces isn't so much the belt length, since I think you'd probably just end up with a dominant choice (much like x86 managed to stick to a roughly unchanging number of registers for a very long time).

But the lack of latency hiding means you have to generate code to schedule starting operations at the right time. Mul taking 2 cycles means you want to start any dependent, but lower latency instructions after it. But if at some point in the future mul is scaled down to one cycle for whatever reason your timings will all be off and your belt occupancy will be less than ideal.

I kind of wonder if the reality is simply that there is nothing better equipped to schedule instructions than the cpu itself. Maybe we could have better ways to explicitly hint meaningful information to it, but I'm not sure shoving all the decisions to the compiler is a long term solution.

The obstacle would simply be recompiling from IR. That doesn't seem like a big challenge in today's environment for a large swath of devices -- nearly mobile devices, for instance, have nearly-complete top-to-bottom source code for all software running on the chip available. So that approach doesn't seem nearly as crazy as it would have 10-20 years ago.

Complete source code, except for the applications you want to run.

This assumes that the applications are provided fully compiled to object code. In situations where the applications are provided as bytecode (e.g. Android Dalvik) or are interpreted (e.g. Python) the applications of which you speak will run just fine.

...which (probably) explains why Google seems to be so interested in this new architecture. Besides Dalvik this new CPU architecture could also be used with the new Portable Native Client (PNaCl) which uses LLVM to compile to an intermediary bitcode. From this perspective, the Belt architecture makes a lot of sense.

Is this architecture going to face the same issue?

Why don't you just recompile?

I like this guy and his presentation style, it is good to see someone delivering seriously on an old idea: queue machines. (The "Belt" is the queue machine model revisited. I can't see a theoretical breakthrough here). But the Mill's realization that combines a queue model with VLIW and embedded spiller machinery is worth attention.

Some criticism however: no sign yet of how variable latencies in memory will be tolerated. Requiring fixed latency for FUs is problematic with cache misses and FU sharing between pipelines. Also his comparison between a 64-item belt and the 300+ registers of OoOE is unfair, since the 300+ registers will likely tolerate more latency than the smaller belt.

I wrote a review of what I get from his first two talks here: http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/

Nice post; I had the same thought of using promises for memory access. I guess you could extend that to a massively-parallel design (promises for communication).

I think Torben Mogensen posted the same basic idea of replacing registers with 'temporal addressing' to Usenet back in the 90s -- comp.arch? comp.compilers? Boy, it's been a long time.

Every successive incarnation of dataflow scheduling uses a matching store and synchronization tokens on long-latency operations. You have them in Tomasulo's reservation stations at a very small scale (considering FUs as unpredictable), in the MTA throughout the memory system, in the D-RISC core of my research group, and quite a few others.

It's quite a common and recurrent idea really. However promises / dataflow tokens / I-structures / etc all are subject to a common flaw / problem: when you receive multiple completions simultaneously, which of them are you going to schedule first? This choice is highly non-obvious and has tremendous impacts on data locality.

Thanks. Idle curiosity: do you know any introductory refs about that question and its impact? (which to schedule first)

Sure, the best reference is the paper by Culler et al. from 1992/1993:

David E. Culler, Klaus E. Schauser, and Thorsten von Eicken. Two fundamental limits on dataflow multiprocessing. In PACT ’93: Proceedings of the IFIP WG10.3. Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, pages 153–164. North-Holland Publishing Co., Amsterdam, Netherlands, 1993. ISBN 0-444-88464-5.

A preprint is available as tech report CSD-92-716 from Berkeley: http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6259.html


The memory architecture hasn't been detailed, but from his claims of latency hiding I could see a memory load succeeding in a constant number of cycles and pushing a belt item tagged as unfilled memory, and then an actual attempt to read that belt item would cause a stall if it hasn't been filled yet.

As Ivan just pointed out to me privately, I meant "is the stack machine model revisited", not "queue". Queue and stack machines are quite different things. But the rest of the argument holds.

Part of the way the Mill handles memory latency will be in the next talk. Sign up at ootbcomp.com/mailing-list for an announcement of date and venue.

The spiller machinery seems to be quite similar to SPARCs register windows. SPARC requires software handlers for the overflow though.

The spiller concept is not new with us; there have been others besides SPARC to use the idea. Ours has no software handlers.


Wow, the one submission I've made anywhere that hit the front page, and I'm kicking myself for the title. 'Belt' is really their name for the category of machine architecture in the sense of register and stack machines. 'Mill' is the name of their specific architecture/family.

As far as the meat goes, I'm gratified that I'm not the only one to think this is mindbogglingly elegant. I see some misconceptions in this thread, but I'm finding it difficult to explain the divergences without beginning to inexpertly regurgitate large portions of this and the previous talk (on their instruction encoding).

As someone who has worked implementing the same algorithms on a VLIW DSP (TI C64x) and ARM and x86 a few comments after watching the beginning of the talk. It's been a while so this is from memory...

The different VLIW execution units can only process a subset of the instructions set. That means you need the right mix of instructions in your algorithm to take advantage of the full throughput. If you have any sort of serial dependency you won't be able to take advantage of all the execution slots. It basically excels at - signal processing (and even a subset of that). That said, when you hit the sweet spot it's pretty good.

When someone like TI compares their DSP to ARM they usually tend to ignore SIMD (very conveniently). SIMD (NEON on ARM or SSE on x86) can buy you almost another order of magnitude performance on those super-scalar CPUs if you're dealing with vectors of smaller quantities (16 bit or 8 bit). So while on paper the VLIW DSP should blow the general purpose superscalar out of the water for signal processing algorithms at comparable clock rates it's not quite like that when you can use SIMD. It also takes a lot of manual optimization to get the most out of both architectures.

So when you're in the VLIW's sweet spot your performance/power/price is pretty good. But the general purpose processors can get you performance in and out of that sweet spot (you're probably sucking more power but that's life).

You really can't look at "peak" instructions per second on the VLIW as any sort of reliable metric and you need any comparison to include the SIMD units...

EDIT: Another note is that for many applications the external memory bandwidth is really important. The DSPs benefit from their various DMAs and being able to parallelize processing and data transfer but generally x86 blows away all those DSPs and ARM. I guess in a modern system you may also want to throw the GPU into the comparison mix.

Re Scheduling for latency:

Decades-old compiler tech can produce near-perfect schedules for in-order machines (including VLIWs) for all operations with a statically-fixed latency, which in practice means everything but loads. A load miss will stall an in-order machine.

Load misses are of two kinds: where the load result is a data-flow ancestor of all the rest of the program, and when the load is incidental to the program dataflow. In the former case neither in-order nor out-of-order machines can do anything but wait for memory. An example is making random references into a hash table that is bigger than cache - every machine of every architecture is going to run at DRAM speed.

However, there are also incidental misses, where the program has other things to do that do not require the result from the load miss. For these, the OOO machine can keep going, while an in-order machine stalls.

The Mill does not face this issue. It is an in-order machine, yet does not suffer from miss stalls, or at least not from unnecessary ones. In general, the Mill avoids all the misses that an OOO can avoid. Or roughly so - there are a few cases where the Mill stalls but an OOO doesn't, and a few where the OOO stalls but the Mill doesn't, but it's roughly a wash.

Most of how this works will be described in the next presentation (#3) in the series, although some won't be until a later presentation.

Both SIMD and VLIW are techniques that only work on programs with a good amount of statically accessible instruction level parallelism, but VLIW is much more general since it can be used in cases where you know in advance that you want to be executing a number of instructions in parallel, rather than one instruction multiple times. The programs that can be sped up by SIMD are a subset of the programs that can be sped up by VLIW, are a subset of the programs that can be sped up by OoO engines. Of course, SIMD has a big advantage in instruction bandwidth vs VLIW, and if you're already going OoO the SIMD instructions only use up a minimum of the OoO engine's resources. Combining VLIW and OoO, on the other hand, is very difficult to do.


Very interesting that the architecture has no condition codes, or any global state for that matter. Also, that the belts are logical, in that there really isn't a chunk of memory where the belt lives, but the members of a belt live where they live and know "which belt" they belong to and "where on that belt" they are.

Too bad they can't release their emulators/simulators, but I sympathize with their desire to be first-to-file to protect what sounds like years of work.

Exactly :-) We will be publishing as fast as we get the filings done.

This looks fascinating - and "the real deal" (not crackpottery) but I'm hesitant about a number of items.

It seems like quite a reach to pick a 'latest and greatest' IA processor (why is this $885? Because it can be) as a point of contrast. We can hit half the performance envelope and/or about an eighth the price while keeping many of the features that make a desktop processor what it is (like, for example, a whopping big cache hierarchy). Picking on x86 seems like a good way of overstating the case; if we're getting a new architecture, we may as well pick on ARM or Tilera or what have you.

Am also, as an early believer in the Itanium, somewhat nervous about static scheduling for any purpose. Dynamic branch prediction is very accurate on today's architectures; this does not mean that static scheduling can emulate this accuracy and enjoy the benefits.

This was a very nice presentation indeed. Too bad they could not tell us how they handle the memory hierarchy. Because that part is crucial for any statically scheduled architecture.

Also the implementation of the belt itself could be quite nasty. Just look at the face of the guy who asked about it after he hears the answer. ;-)

Finally, they will not get anywhere near peak performance with general purpose code. The parallelism is just not there at the instruction level. They would do well on high performance computing or digital signal processing with enough floating point units and memory bandwidth.

Instead they seem to target big data. An interesting move. It will be interesting to see actual performance numbers (even from a simulator). I wish them luck.

We consider the Mill family to be general-purpose, with the same potential markets as any other GP architecture (x86, ARM, zSeries, PowerPC, etc.). Individual family members can be configured for particular markets; the Gold mentioned in the talk is indeed somewhat Big Data oriented, with eight ALUs and only two FPUs. A High Silver, with six FPUs and a SIMD height of 256 bits, is an FP banger. A Tin, with one ALU, no FPU, and 8-byte SIMD, is at the low end.

There is an architectural minimal size to Mills; there must be at least one ALU and one memory unit, and all Mills must use a 64-bit address; the z80 market is safe from us, and you won't see Mills in your toaster or thermostat. There's no architectural maximal Mill, but there are diminishing returns; in current process technology we feel that eight ALUs are getting close to that edge, but those decisions are made independently for each market and process.


At 17 mins in, when he talked about rename registers, it reminds me of SSAs. It's quite fascinating to see that something like that happens in the hardware level as well

My first thought was "Huh, that sounds like an architecture that would map perfectly to LLVM's SSA IR". Can't wait to watch the video now!

nah it's a different thing. He says general purpose processor architectures have so many rename registers because of single assignments. He then proceed to talk about percentage of data being Single Assignment.

Therefore, it's better to use a statick-sized queue structure. Hence "belt" - it's like a conveyor belt.

I'm still watching it though. The topic is interesting but man, his voice is droning.

What he's talking about there isn't SSA. In SSA you never assign to a variable more than once. Instead you shadow the existing variable with a new version.

Converting to something akin to SSA (but not quite the same thing) is actually what he's talking about superscalar architectures doing in register renaming.

That said, just because you use an SSA IR doesn't mean your code generation doesn't require register renaming. But it would be a very good basis to generate code for this architecture.

Interesting idea. A limitation of the temporal addressing is that variables live on entry to a basic block have to be put in the same position on the belt on each path which could enter the block. This is okay for loops, but it's going to require lots of shuffling about in conditional code. Still could be worth it, of course.

They will probably do lots of if conversions to avoid this and to make use of the 33 issue slots. They could also add special instructions to push several values on the belt to keep different paths aligned.



This feels like a functional architecture. It's really cool.

I don't get how this is different than an in-order register-based VLIW architecture, except that it makes branching more difficult to deal with because you need to synchronize belt-deltas between branches (which sounds like the dual of a register move to me)?

I work daily with a VLIW architecture and the only place I ever see a plain "move" instruction is in loops. Everywhere else, the compiler just churns through the registers as if they were in a belt anyway – just the names are absolute instead of relative.

I can imagine this might simplify the processor logic some – results are primarily read from one of the first few belt locations, and are always written to the first location. "Register moves" aside, is this the primary benefit?

My takeaway from this presentation:

- To achieve instruction-level parallelism, traditional architectures (his example: Haswell) often employ very messy techniques like register renaming which create a huge amount of complexity increasing power-consumption.

- He has focused on one such technique called Very Long instruction word (VLIW), and has taken it to an extreme: the technique he proposes is to throw away general purpose registers, and replace it with a "belt": a write-once linear tape of memory (implemented using stacks).

- He then points out various advantages of this model, including in-order execution (ILP traditionally requires reordering), short pipeline, and overall simplification of architectures.

All this looks fine on paper, but I don't see a proposed instruction set, or any indication of what realization of this model will require. In short, it's a cute theoretical exercise.

So, let's look at what the unworkable problems with Haswell are, and what's being done to fix them. Yes, nobody can seem to be able to figure out how to reduce power consumption on performant x86 microarchitectures beyond a point. It's a very old architecture, and I'm hopeful about the rise of ARM. The solution is not to throw away general purpose registers, but rather to cut register renaming and make ILP easier to implement by using a weak memory model (which is exactly what ARM does). ARM64 is emerging and the successes of x86 are slowly percolating to it [1]. Moreover, the arch/arm64 tree in linux.git is 1.5 years old and is under active development; we even got virt/kvm/arm merged in recently (3 months ago), although I'm not sure how it works without processor extensions (it's not pure pvops). ARM32 already rules embedded and mobile devices, and manufacturers are interested in taking it to heavy computing. In short, the roadmap is quite clear: ARM64 is the future.

The core of the Linux memory barriers model described in Documentation/memory-barriers.txt [2] (heavily inspired by the relaxed DEC Alpha) should tell you what you need to know about why a weak memory model is desirable.

[1]: http://www.arm.com/files/downloads/ARMv8_Architecture.pdf

[2]: https://github.com/torvalds/linux/blob/master/Documentation/...

A weak memory model has almost nothing to do with ILP.

To exploit ILP, ARM has to do the same thing as every other register architecture: move to a superscalar out-of-order architecture. There's no real alternative. The weak memory model makes cache coherency protocols simpler/faster, but won't save you from false register dependencies.

Three-address code might make it possible to defer the costs of complex register renaming, but ARM has discovered the importance of code density, and Thumb-2 (2-address code) is preferred for most functions.

I see. Can you explain to me how it fixes the power consumption problem? Yes, I know that it needs to reorder instructions to achieve ILP, but isn't that the entire point of a weak architecture? To reduce the memory barriers and weaken the ordering constraints, making it easier to reorder?

What are false register dependencies? And how does a three-address code help in register renaming?

Some reading references would help. I'm currently reading [1].

[1]: http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...

The ordering constraints of memory models are foremost between instructions that are executed on different processors. As long as instructions execute on a single processor, everything appears to execute sequentially (even if it doesn't).

The real cost of weak memory models is in software. Programmers that write multithreaded code need to understand the memory models. And few actually do, because these things are not trivial. So you gain a few percent of performance with a weak memory model and at the same time you dramatically restrict the number of programmers who can write correct code. A very bad deal.

Hennessy, Patterson: "Computer Architecture, A Quantitative Approach." is the standard introduction to computer architecture.

What?! Linux provides me memory-barrier primitives: smp_mb(), smp_rmb(), smp_wmb() etc. Even an everyday Linux hacker does not need to understand the peculiarities of each architecture, because the locking peculiarities are all abstracted out in commonly used spinlock_*() interfaces. Any sane compiler-writer would use these primitives too: so, unless you're writing raw architecture-dependent assembly, or are a very hardcore Linux hacker, you do NOT need to understand this.

Linux supports Alpha, which provides the weakest ordering guarantees anyway, so no: any architecture stronger than Alpha puts absolutely no burden on anybody. In practice, weaker ordering constraint means more freedom for Linux.

Sure, if you use a portable set of primitives you only need to understand how and when to use those primitives and only the relaxed memory model they are based on. But even that is a lot to ask for, and too much for many programmers.

And this particular solution only works inside the Linux kernel and only using GCC. It doesn't matter what we think a sane compiler-writer or a sane computer-architect would do. Many times we just have to live with the choices other people made.

I think you got something wrong here. Architectures with strong memory ordering put more burden on the hardware and less on the programmer. That was the point I was trying to make: weak memory ordering is not worth it, because it is better to let the hardware do the hard work and not to burden the programmer with it.

Linux may work well with relaxed memory ordering, but that does not come for free, a lot of people had to understand first, what a relaxed memory model is, etc.

> weak memory ordering is not worth it, because it is better to let the hardware do the hard work and not to burden the programmer with it

My question is very simple: why do we have architectures that reorder aggressively if the performance gain is not worth it? CPU manufacturers do understand that weaker guarantees leads to complexity that must be handled at the software level: so, are they mad? The fact of the matter is that we have been moving away from Lisp machines, CISCs and "human" instruction sets to RISCs, because the kernel/ compiler is in a much better position to make optimization decisions than the hardware (it can see the bigger picture).

These architectures exist, and the infrastructure to handle all this complexity has already been written (yes, all major compilers too [1]). Yes, a lot of people had to study a _lot_ to get us where we are today, and the success of our little iPhone apps stands on the shoulders of those giants.

I seriously don't get what software complexity you're whining about. Complexity is not the exception in Linux; it's the bloody rule! Have you seen the fs/ tree implementing various filesystems? Perhaps the kernel/ tree with the state-of-the-art scheduler? Or even the net/ tree implementing TCP/IP?

I'm not interested in discussing some hypothetical idealized textbook world where everything is simple and elegant: I'm interested in contributing whatever little I can to tomorrow's concurrency infrastructure.

The mammoth question in the room has still not been answered: how does ARM manage significantly lower power consumption? Aggressive reordering (=> weaker guarantees) seems to be part of the answer.

[1]: http://en.wikipedia.org/wiki/Memory_ordering#Compiler_suppor...

> Are CPU manufacturers mad?

No they are not. But in the early 90ties when most of these decisions were mande, the situation was very different. Processors got twice as fast every two process generations. Shared memory multiprocessors were an exotic niche. But then we hit the power wall, and all of a sudden shared memory multiprocessors became the norm. Today, you can not just wait for a new processor to make your program magically faster, you have to rewrite it. If you asked the same people who took these decisions in the 90ies today, many would (and many actually did so publically) say that they regret this decision.

On, complexity. Yes, software is incredibly complex. That's why you don't want to add even more complexity to it!

> How does ARM manage significantly lower power consumption?

Easy, ARM is targetting the low power market. So they don't need to build the fastest chip. When you try to build the fastest chip, energy efficiency suffers a lot. Once ARM starts to compete with Intel on performance, their power usage will skyrocket.

Memory model and instruction set architecture are orthogonal.

Registers are entirely internal to a processor. They don't have to be replicated or consistent between different cores.

Memory is shared system state. To write correct parallel programs, CPUs need to define how they manage consistency between different processors. x86 does have a more conservative model with stronger consistency than ARM, and this does have (theoretical) scaling consequences-- but Intel has a huge amount of experience making it fast.

The OP was clearly talking about memory barriers in architectures, not memory models in multi-threaded programming. Don't nitpick.

> Memory is shared system state. To write correct parallel programs, CPUs need to define how they manage consistency between different processors. x86 does have a more conservative model with stronger consistency than ARM, and this does have (theoretical) scaling consequences-- but Intel has a huge amount of experience making it fast.

Yeah, but what worries me is not raw throughput scaling; it's power consumption. The mainstream netbooks today: Macbook Air Mid-2012, Chromebook Pixel, and Lenovo X1 Carbon all use the 17W i5-3427U. The MBA Mid-2013 jumped to Haswell early and uses the 15W Core i5-4250U. But we still haven't been able to make a decent quad-core netbook: there's a 15W Core i7-4650U; I believe you can get an MBA 2013 with this as well, but a paltry 1.7 Ghz?

How many years has it been since quad-cores first came out? It looks to me like x86 has stagnated: it doesn't matter how competent Intel is; the architecture seems to have fundamental problems.

Now, I don't fully understand what the ARM A57 is, but the specs look really impressive [1]. Moreover, AMD has confirmed that they will manufacture Opterons in Q1 2014 [2]. I'm not completely sold on ARM or anything, but there is certainly something interesting going on: and we must investigate.

[1]: http://www.arm.com/products/processors/cortex-a50/cortex-a57...

[2]: http://www.amd.com/us/press-releases/Pages/amd-unveils-2013j...

A "false" register dependency is a Write-after-read (WAR) [1].

As far as three address code making register renaming easier, I'm not sure what the author had in mind -- isn't `add $5, %rax` essentially a condensed form of `add $5, %rax, %rax` (which is a 3AC)? In fact, 3AC is _more_ general and should make register renaming harder if anything at all.

[1] http://en.wikipedia.org/wiki/Register_renaming

A careful compiler could avoid WARs more easily in 3AC than 2AC (by spacing reuse of registers), so an architecture without renaming could still extract ILP.

It's not a general or optimal solution, though.

The Mill is strong sequentially consistent throughout. There are no membar operations, and the Linux macroes for that purpose are empty.

This design choice had two motivations: 1) membar is fabulously expensive; and 2) weaker consistency models are incomprehensible to mere mortals.

This guy looks and is awesome! Can't wait to have them produce many-core Mill CPUs, that I'd like to throw algorithms onto. Then I'm gonna Buy Buy Buy

This reminds me a lot of SSAs. This is really really cool stuff. One of those genuine progress type things in this space.

This strikes me as a further progression of the idea of register windows; in particular the SPARC architecture.

Wow, this addresses quite a number of my concerns with aging CPU architectures, specifically with throwaway registers for fundamentally low-latency operations like chaining ALU functions.

Is there a simulator for this available somewhere?

What would be the ideal language to compile to this?

C would probably be fine. One interesting point seems to be the ease with which many parameters can be both passed and returned. So, languages that return multiple values from functions, and that allow for good, automatic software pipe lining.

Fortran or numerical code in C. It seems best suited for statically unrollable loops with few data-dependent branches or random memory accesses.

C expects addressable registers. From what I can tell this doesn't have those!

It really does seem like a Lisp would map much better, with the whole caller/callee and those private data belts that looked like hardware level closures!

C does not expect addressable registers. This architecture seems quite appropriate for C!

To elaborate on what knz42 said, C expects addressable memory but doesn't say anything about registers at all. In fact, the existence of registers is sort of vaguely awkward from the perspective of C's model of computing.

> doesn't say anything about registers at all

Considering that C has a "register" storage-class specifier keyword, I don't see how you can make that claim. Granted, modern compilers ignore it.

Certainly, the C specification states that "register" storage is non-addressable, so it does say something.

That "register" nowadays is just a sequence of characters that, to those familiar with CPU architecture, rings a bell. Modern C standards do not convey any links between the keyword and hardware registers. http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf, section 6.7.1:

"A declaration of an identifier for an object with storage-class specifier register suggests that access to the object be as fast as possible. The extent to which such suggestions are effective is implementation-defined."

And 'non-addressable' only applies to the C 'runtime'. The standard does not say anything about the hardware.

For example, a C compiler for the 6502 could attempt to store variables of storage class 'register' in the zero page. Such storage locations have an address, but you are not allowed to take it from standard C.

For an even better example, look at http://en.wikipedia.org/wiki/TMS9900. That CPU stored all its general purpose registers in RAM.

My intuition says haskell or lisp. Don't take my word for it.

After Bret Victor's inspiring talk yesterday, it's fitting to see a new CPU architecture make it to the top of HN.

Keep it up! Maybe tomorrow we'll see an idea to replace threads and locks.

You can replace threads and locks, if you use non-blocking programming techniques.


With massively parallel hardware, concurrency is the biggest challenge facing programmers today. It's a terrible programming paradigm. And your Non-blocking algorithms only add complexity to an already difficult problem. For example, I've seen NBAs lead to a priority inversion bug.

Did you even watch his talk? I didn't see anyone arguing against Bret's scaling concerns with threads in yesterday's HN comments.

In my experience, the worse bugs I have encountered are deadlocks and race conditions. Saying just use a “non-blocking algorithm” is no silver bullet to the difficultly of working with threads.

If we don’t come up with a new system, I can’t imagine how hard it will be to efficiently program CPUs when they start coming out with 128, 256, and 512 cores.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact