
The Belt CPU Architecture - momerath
http://ootbcomp.com/docs/belt/index.html
======
ChuckMcM
I got to talk with Ivan after he gave an earlier version of this talk at
Asilomar. The good news was that he's the real deal; pretty much every question
I could think to throw at him he had a solid answer for. The bad news was that
I felt the same way about the Mill architecture as I did about Intel's iAPX 432
architecture [1], which was elegant to a fault.

That said, I got the sense that this was what Intel was going for when they
did Larrabee [2] and just missed because of the focus on graphics. Unlike with
Larrabee, I suspect OOTBC will need to build it themselves, like Chip did for
the Propeller [3].

That said, the challenge with these bespoke architectures is the requirement
for software, or first a GCC port :-). I believe Ivan said they had a port
that talked to their simulator, but I don't know if that was an optimizing
thing like SGI's compiler for Itanium or a proof of concept.

The weird thing is of course "Why?", and one might say "But Chuck, faster and
cheaper, why not?" and I look at the explosion of ARM SoC's (cheaper, not
necessarily faster than x86) and look back at ARM and think 99% of this was
getting the ecosystem built, not the computer architecture. So who can afford
to invest the billions to build the eco-system? Who would risk that
investment? (Google might but that is a different spin on things).

So, playing around with the Zynq-7020 (the same chip that's on the Parallella,
though without the Epiphany co-processor), I can see a nice dual-core
Cortex-A9 where you have full-speed access to a bespoke DSP for the 'tricky'
bits if you want one. Will that be "good enough" for the kinds of things this
would also excel at? I don't know, so I don't know how to handicap OOTBC's
chances of success. But I really enjoy novel computer architectures, like this
one and Chuck Moore's '1000 Forth chickens' chip [4] (a reference to Seymour
Cray's quote, "Would you have 1,000 chickens pull your plow, or an ox?").

A really interesting time will be had when 'paid off' fab capacity is sitting
idle and the cost for a wafer start becomes a function of how badly the fab
wants to keep the line busy.

[1] [http://en.wikipedia.org/wiki/Intel_iAPX_432](http://en.wikipedia.org/wiki/Intel_iAPX_432)

[2] [http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)#Di...](http://en.wikipedia.org/wiki/Larrabee_\(microarchitecture\)#Differences_with_CPUs)

[3] [http://www.parallax.com/propeller/](http://www.parallax.com/propeller/)

[4] [http://www.colorforth.com/S40.htm](http://www.colorforth.com/S40.htm)

~~~
igodard
I'll try to address some of the questions and issues raised here; the Mill has
been more fully discussed on comp.arch, and there is more material at
ootbcomp.com/docs. Sign up at ootbcomp.com/mailing-list to get announcements
of additional videos, white papers etc as they become available.

~~~
solarexplorer
I just signed up. Can't wait to see more of this!

------
CoffeeDregs
First off, I love the post! Super meaty goodness.

I read-ish the whole set of slides and it sounds pretty good (the devil is
always in the details), but I got a little worried about VLIW-ish issues when
he said on slide #57:

    The compiler controls when ops issue

One of the big issues with VLIW was that the compiler had to be intimately
aware of processor architecture. So when you upgraded your '886 to a '986 you
needed new binaries because the '986 had more registers or executions units.
[I assume Itanium fixed some of this, but it also sunk my interest in VLIW.]

Is this architecture going to face the same issue?

Edit: I watched the video and heard that "nearly all of what the superscalar
is doing is [not calculating]". One of the other anti-VLIW arguments was that
chip area came to be dominated by cache, so all the machinery for [not
calculating] shrank and shrank in relative terms as cache area grew (see:
[http://techreport.com/review/15818/intel-core-i7-processors](http://techreport.com/review/15818/intel-core-i7-processors)).
This claim concerns me.

Edit V2: but damn... Exciting stuff.

~~~
cpr
As someone who worked on operating systems back at Multiflow (late '80s; the
first VLIW start-up, spun off from Yale research), it struck me recently
that something like an LLVM representation of binary code might solve the
"compiler problem" (needing to know the exact specifics of the chip's
latencies). I.e., you'd run and load LLVM binaries, and have a runtime final
optimization pass that took into account the specific latencies of each opcode
for a particular implementation. (The LLVM architecture is actually already
set up to do optimizations at runtime on LLVM "binaries".)
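Concretely, that final pass might amount to rescheduling generic IR against a
per-chip latency table. A toy Python sketch of the idea (the IR, the ops, and
the latency numbers are all invented for illustration, and this ignores issue
width entirely):

```python
def schedule(instructions, latency):
    """Toy final scheduling pass: given (name, op, srcs) in dependence
    order and a per-implementation latency table, compute the earliest
    issue cycle for each instruction. A different chip (a different
    latency table) yields a different schedule from the same 'IR'."""
    ready = {}                  # value name -> cycle its result is available
    issue = {}
    for name, op, srcs in instructions:
        start = max((ready[s] for s in srcs), default=0)
        issue[name] = start
        ready[name] = start + latency[op]
    return issue

ir = [("a", "load", ()),
      ("b", "load", ()),
      ("c", "mul", ("a", "b")),
      ("d", "add", ("c", "a"))]

chip1 = {"load": 3, "mul": 5, "add": 1}   # hypothetical slow implementation
chip2 = {"load": 1, "mul": 2, "add": 1}   # hypothetical fast implementation

print(schedule(ir, chip1))   # {'a': 0, 'b': 0, 'c': 3, 'd': 8}
print(schedule(ir, chip2))   # {'a': 0, 'b': 0, 'c': 1, 'd': 3}
```

Same dependence graph, two different concrete schedules; that's the part that
has to wait until the target chip is known.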

But perhaps that "final optimization pass" would be nearly as hard as the
whole compilation problem in the first place; dunno. I wasn't on the compiler
team, so this is perhaps a naive viewpoint.

~~~
arh68
> The LLVM architecture is actually already set up to do optimizations at
> runtime on LLVM "binaries".

Pragmatically, exactly how fast is that run-time optimization? Could you
realistically JIT it, or should the more-optimal, chip-specific asm be cached
between loads? Or is this so slow you'd only ever want to do it once?

~~~
cpr
Right, forgot to say that: you'd cache the results (like Rosetta's translation
of PowerPC code, or DEC's VAX-to-Alpha binary recompilation) so you'd only
take the translation hit once.

------
knz42
I like this guy and his presentation style; it is good to see someone
delivering seriously on an old idea: queue machines. (The "Belt" is the
queue-machine model revisited; I can't see a theoretical breakthrough here.)
But the Mill's realization, which combines a queue model with VLIW and
embedded spiller machinery, is worth attention.

Some criticism however: no sign yet of how variable latencies in memory will
be tolerated. Requiring fixed latency for FUs is problematic with cache misses
and FU sharing between pipelines. Also his comparison between a 64-item belt
and the 300+ registers of OoOE is unfair, since the 300+ registers will likely
tolerate more latency than the smaller belt.

I wrote a review of what I get from his first two talks here:
[http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/](http://staff.science.uva.nl/~poss/posts/2013/08/01/mill-cpu/)

~~~
abecedarius
Nice post; I had the same thought of using promises for memory access. I guess
you could extend that to a massively-parallel design (promises for
communication).

I _think_ Torben Mogensen posted the same basic idea of replacing registers
with 'temporal addressing' to Usenet back in the 90s -- comp.arch?
comp.compilers? Boy, it's been a long time.

~~~
knz42
Every successive incarnation of dataflow scheduling uses a matching store and
synchronization tokens on long-latency operations. You have them in Tomasulo's
reservation stations at a very small scale (considering FUs as unpredictable),
in the MTA throughout the memory system, in the D-RISC core of my research
group, and quite a few others.

It's quite a common and recurrent idea really. However promises / dataflow
tokens / I-structures / etc all are subject to a common flaw / problem: when
you receive multiple completions simultaneously, which of them are you going
to schedule first? This choice is highly non-obvious and has tremendous
impacts on data locality.

~~~
abecedarius
Thanks. Idle curiosity: do you know any introductory refs about that question
and its impact? (which to schedule first)

~~~
knz42
Sure, the best reference is the paper by Culler et al. from 1992/1993:

David E. Culler, Klaus E. Schauser, and Thorsten von Eicken. Two fundamental
limits on dataflow multiprocessing. In PACT ’93: Proceedings of the IFIP
WG10.3. Working Conference on Architectures and Compilation Techniques for
Fine and Medium Grain Parallelism, pages 153–164. North-Holland Publishing
Co., Amsterdam, Netherlands, 1993. ISBN 0-444-88464-5.

A preprint is available as tech report CSD-92-716 from Berkeley:
[http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6259.html](http://www.eecs.berkeley.edu/Pubs/TechRpts/1992/6259.html)

~~~
abecedarius
Thanks!

------
momerath
Wow, the one submission I've made anywhere that hit the front page, and I'm
kicking myself for the title. 'Belt' is really their name for the category of
machine architecture in the sense of register and stack machines. 'Mill' is
the name of their specific architecture/family.

As far as the meat goes, I'm gratified that I'm not the only one to think this
is mindbogglingly elegant. I see some misconceptions in this thread, but I'm
finding it difficult to explain the divergences without beginning to
inexpertly regurgitate large portions of this and the previous talk (on their
instruction encoding).

------
YZF
As someone who has implemented the same algorithms on a VLIW DSP (TI C64x),
on ARM, and on x86, a few comments after watching the beginning of the talk.
It's been a while, so this is from memory...

The different VLIW execution units can each only process a subset of the
instruction set. That means you need the right mix of instructions in your
algorithm to take advantage of the full throughput. If you have any sort of
serial dependency you won't be able to take advantage of all the execution
slots. It basically excels at one thing: signal processing (and even just a
subset of that). That said, when you hit the sweet spot it's pretty good.

When someone like TI compares their DSP to ARM they usually tend to ignore
SIMD (very conveniently). SIMD (NEON on ARM or SSE on x86) can buy you almost
another order of magnitude of performance on those superscalar CPUs if you're
dealing with vectors of smaller quantities (16-bit or 8-bit). So while on
paper the VLIW DSP should blow the general-purpose superscalar out of the
water for signal-processing algorithms at comparable clock rates, it's not
quite like that when you can use SIMD. It also takes a lot of manual
optimization to get the most out of both architectures.
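The point about small element widths is easy to demonstrate even without a
vector unit, using the SWAR ("SIMD within a register") trick. This is a hedged
Python sketch, not TI or ARM code, but the masking is exactly what lets one
wide add act as eight independent byte adds:

```python
# SWAR: one 64-bit add acts as eight 8-bit adds, with masks keeping
# carries from crossing lane boundaries -- the software analogue of a
# NEON/SSE packed-byte add.

LOW7 = 0x7F7F7F7F7F7F7F7F   # low 7 bits of every byte
HIGH = 0x8080808080808080   # high bit of every byte

def pack(lanes):
    """Pack up to eight ints (0..255) into one 64-bit word, lane 0 lowest."""
    word = 0
    for i, b in enumerate(lanes):
        word |= (b & 0xFF) << (8 * i)
    return word

def unpack(word, n=8):
    return [(word >> (8 * i)) & 0xFF for i in range(n)]

def swar_add_u8(x, y):
    """Per-byte addition modulo 256 using a single wide add."""
    # add the low 7 bits of each lane; carries stay inside the lane
    low = (x & LOW7) + (y & LOW7)
    # fold the top bit of each lane back in with xor (no carry out)
    return low ^ ((x ^ y) & HIGH)

a = pack([10, 200, 255, 3])
b = pack([20, 100, 1, 5])
print(unpack(swar_add_u8(a, b), 4))   # [30, 44, 0, 8]
```

Eight adds for the price of one is roughly where that "order of magnitude on
8-bit data" comes from, before you even count wider vector registers.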

So when you're in the VLIW's sweet spot your performance/power/price is pretty
good. But the general purpose processors can get you performance in and out of
that sweet spot (you're probably sucking more power but that's life).

You really can't look at "peak" instructions per second on the VLIW as any
sort of reliable metric and you need any comparison to include the SIMD
units...

EDIT: Another note is that for many applications the external memory bandwidth
is really important. The DSPs benefit from their various DMAs and being able
to parallelize processing and data transfer but generally x86 blows away all
those DSPs and ARM. I guess in a modern system you may also want to throw the
GPU into the comparison mix.

~~~
Symmetry
Both SIMD and VLIW are techniques that only work on programs with a good
amount of statically accessible instruction level parallelism, but VLIW is
much more general since it can be used in cases where you know in advance that
you want to be executing a number of instructions in parallel, rather than one
instruction multiple times. The programs that can be sped up by SIMD are a
subset of those that can be sped up by VLIW, which are in turn a subset of
those that can be sped up by OoO engines. Of course, SIMD has a big
advantage in instruction bandwidth vs VLIW, and if you're already going OoO
the SIMD instructions only use up a minimum of the OoO engine's resources.
Combining VLIW and OoO, on the other hand, is very difficult to do.

~~~
YZF
Agree.

------
johnpmayer
Very interesting that the architecture has no condition codes, or any global
state for that matter. Also, the belts are logical: there really isn't a chunk
of memory where the belt lives; rather, the members of a belt live where they
live and know "which belt" they belong to and "where on that belt" they are.
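To make the "logical belt" concrete, here's a toy Python model; this is my own
sketch, not OOTBC's design (real belt lengths, spiller behavior, and
multi-result drops all differ). Results drop in at the front, and operands are
named by age rather than by register number:

```python
from collections import deque

class Belt:
    """Toy model of a belt machine: a fixed-capacity queue of results.
    Every operation drops its result at the front (b0); older values
    shift back and eventually fall off the end. Operands are named by
    how long ago they were produced (temporal addressing)."""

    def __init__(self, length=8):
        self.slots = deque(maxlen=length)

    def drop(self, value):
        self.slots.appendleft(value)      # newest value becomes b0

    def __getitem__(self, age):           # belt[0] newest, belt[1] older, ...
        return self.slots[age]

belt = Belt(length=4)
belt.drop(5)                  # b0 = 5
belt.drop(7)                  # b0 = 7, b1 = 5
belt.drop(belt[0] + belt[1])  # 'add b0, b1' drops 12 at the front
```

Note how the add names its operands purely by position, and how after it
executes those same values are each one slot older.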

Too bad they can't release their emulators/simulators, but I sympathize with
their desire to be first-to-file to protect what sounds like years of work.

~~~
igodard
Exactly :-) We will be publishing as fast as we get the filings done.

------
onan_barbarian
This looks fascinating - and "the real deal" (not crackpottery) but I'm
hesitant about a number of items.

It seems like quite a reach to pick a 'latest and greatest' IA processor (why
is this $885? Because it _can_ be) as a point of contrast. We can hit half the
performance envelope and/or about an eighth the price while keeping many of
the features that make a desktop processor what it is (like, for example, a
whopping big cache hierarchy). Picking on x86 seems like a good way of
overstating the case; if we're getting a new architecture, we may as well pick
on ARM or Tilera or what have you.

I am also, as an early believer in Itanium, somewhat nervous about static
scheduling for any purpose. Dynamic branch prediction is very accurate on
today's architectures; that does not mean static scheduling can emulate this
accuracy and enjoy the same benefits.

------
solarexplorer
This was a very nice presentation indeed. Too bad they could not tell us how
they handle the memory hierarchy. Because that part is crucial for any
statically scheduled architecture.

Also, the implementation of the belt itself could be quite nasty. Just look at
the face of the guy who asked about it when he hears the answer. ;-)

Finally, they will not get anywhere near peak performance with general purpose
code. The parallelism is just not there at the instruction level. They would
do well on high performance computing or digital signal processing with enough
floating point units and memory bandwidth.

Instead they seem to target big data; an interesting move. I'd like to see
actual performance numbers (even from a simulator). I wish them luck.

~~~
igodard
We consider the Mill family to be general-purpose, with the same potential
markets as any other GP architecture (x86, ARM, zSeries, PowerPC, etc.).
Individual family members can be configured for particular markets; the Gold
mentioned in the talk is indeed somewhat Big Data oriented, with eight ALUs
and only two FPUs. A High Silver, with six FPUs and a SIMD height of 256 bits,
is an FP banger. A Tin, with one ALU, no FPU, and 8-byte SIMD, is at the low
end.

There is an architectural minimum size for Mills: there must be at least one
ALU and one memory unit, and all Mills must use 64-bit addresses. The Z80
market is safe from us, and you won't see Mills in your toaster or thermostat.
There's no architectural maximum Mill, but there are diminishing returns; in
current process technology we feel that eight ALUs are getting close to that
edge, but those decisions are made independently for each market and process.

Ivan

------
chewxy
At 17 minutes in, when he talked about rename registers, it reminded me of
SSA. It's quite fascinating to see that something like that happens at the
hardware level as well.

~~~
axman6
My first thought was "Huh, that sounds like an architecture that would map
perfectly to LLVM's SSA IR". Can't wait to watch the video now!

~~~
chewxy
Nah, it's a different thing. He says general-purpose processor architectures
have so many rename registers because of single assignment. He then proceeds
to talk about the percentage of data that is single-assignment.

Therefore, it's better to use a statically sized queue structure. Hence
"belt": it's like a conveyor belt.

I'm still watching it though. The topic is interesting, but man, his voice is
droning.

~~~
stormbrew
What he's talking about there isn't SSA. In SSA you never assign to a variable
more than once. Instead you shadow the existing variable with a new version.

Converting to something akin to SSA (but not quite the same thing) is actually
what he's talking about superscalar architectures doing in register renaming.

That said, just because you use an SSA IR doesn't mean your code generation
doesn't require register renaming. But it would be a very good basis to
generate code for this architecture.
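A tiny sketch of what that renaming does, in a hypothetical three-address form
(not any real ISA): every architectural write gets a fresh physical name, so
each physical register is assigned exactly once, SSA-style:

```python
import itertools

def rename(instructions):
    """Toy register renamer. Each instruction is (dest, op, srcs) over
    architectural names; the output uses fresh physical names so every
    destination is written exactly once (single assignment)."""
    fresh = itertools.count()
    latest = {}                       # architectural name -> newest physical
    renamed = []
    for dest, op, srcs in instructions:
        psrcs = tuple(latest[s] for s in srcs)   # read the current versions
        pdest = "p%d" % next(fresh)
        renamed.append((pdest, op, psrcs))
        latest[dest] = pdest          # later readers of dest see this version
    return renamed

code = [("r1", "load", ()),
        ("r1", "add",  ("r1",)),     # rewrites r1: a WAW with the load
        ("r2", "mul",  ("r1",))]
print(rename(code))
# [('p0', 'load', ()), ('p1', 'add', ('p0',)), ('p2', 'mul', ('p1',))]
```

After renaming, no physical name is ever written twice, which is exactly the
single-assignment property the talk leans on.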

------
ajb
Interesting idea. A limitation of the temporal addressing is that variables
that are live on entry to a basic block have to be put in the same position on
the belt on _each path_ that can enter the block. This is okay for loops, but
it's going to require a lot of shuffling about in conditional code. Still
could be worth it, of course.

~~~
solarexplorer
They will probably do a lot of if-conversion to avoid this and to make use of
the 33 issue slots. They could also add special instructions that push several
values onto the belt to keep different paths aligned.
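If-conversion in miniature (plain Python as illustration; real hardware would
use predicated ops or a select-style instruction): both arms execute
unconditionally and only a select depends on the condition, so every path
produces the same values in the same order:

```python
def select(pred, if_true, if_false):
    """Models a hardware conditional select: both inputs are already
    computed; only the choice depends on the predicate."""
    return if_true if pred else if_false

def clamp_branchy(x, lo, hi):
    # branchy form: the two sides of each 'if' leave results in
    # different places, which a belt would have to re-align
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def clamp_if_converted(x, lo, hi):
    # if-converted form: a straight-line schedule, identical on
    # every path; only the selects consume the predicates
    below = x < lo
    above = x > hi
    return select(below, lo, select(above, hi, x))

for x in (-5, 0, 3, 10, 99):
    assert clamp_branchy(x, 0, 10) == clamp_if_converted(x, 0, 10)
```

The cost is that both arms always execute; the gain is a single schedule with
no control-flow merge to re-align.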

~~~
igodard
Bingo!

Ivan

------
gmrple
This feels like a functional architecture. It's really cool.

------
colanderman
I don't get how this is different from an in-order register-based VLIW
architecture, except that it makes branching more difficult to deal with,
because you need to synchronize belt deltas between branches (which sounds
like the dual of a register move to me).

I work daily with a VLIW architecture and the only place I ever see a plain
"move" instruction is in loops. Everywhere else, the compiler just churns
through the registers as if they were in a belt anyway – just the names are
absolute instead of relative.

I can imagine this might simplify the processor logic some – results are
primarily read from one of the first few belt locations, and are always
written to the first location. "Register moves" aside, is this the primary
benefit?
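That "absolute vs. relative names" observation can be made mechanical.
Assuming single-assignment three-address code where every instruction drops
exactly one result (a simplification; a real Mill op can drop several),
destinations vanish and each operand becomes "the result from N drops ago":

```python
def to_belt_relative(instructions):
    """Rewrite (dest, op, srcs) code, each dest written exactly once in
    program order, into belt form: destinations disappear and every
    operand becomes a relative age on the belt."""
    produced_at = {}              # name -> index of producing instruction
    belt_code = []
    for i, (dest, op, srcs) in enumerate(instructions):
        # a value produced by instruction j has aged i - j - 1 drops
        # by the time instruction i reads it
        ages = tuple(i - produced_at[s] - 1 for s in srcs)
        belt_code.append((op, ages))
        produced_at[dest] = i
    return belt_code

code = [("r1", "load_a", ()),
        ("r2", "load_b", ()),
        ("r3", "add", ("r1", "r2")),
        ("r4", "mul", ("r3", "r1"))]
print(to_belt_relative(code))
# [('load_a', ()), ('load_b', ()), ('add', (1, 0)), ('mul', (0, 2))]
```

The register names never appear in the output; the relative positions carry
the same information, which is the "names are relative instead of absolute"
point exactly.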

------
artagnon
My takeaway from this presentation:

- To achieve instruction-level parallelism, traditional architectures (his
example: Haswell) often employ very messy techniques like register renaming,
which create a huge amount of complexity and increase power consumption.

- He has focused on one alternative, Very Long Instruction Word (VLIW), and
taken it to an extreme: the technique he proposes is to throw away
general-purpose registers and replace them with a "belt": a write-once linear
tape of memory (implemented using stacks).

- He then points out various advantages of this model, including in-order
execution (ILP traditionally requires reordering), a short pipeline, and
overall simplification of the architecture.

All this looks fine on paper, but I don't see a proposed instruction set, or
any indication of what realization of this model will require. In short, it's
a cute theoretical exercise.

So, let's look at what the unworkable _problems_ with Haswell are, and what's
being done to fix them. Yes, nobody seems to be able to figure out how to
reduce power consumption on performant x86 microarchitectures beyond a point.
It's a very old architecture, and I'm hopeful about the rise of ARM. The
solution is not to throw away general purpose registers, but rather to cut
register renaming and make ILP easier to implement by using a weak memory
model (which is exactly what ARM does). ARM64 is emerging and the successes of
x86 are slowly percolating to it [1]. Moreover, the arch/arm64 tree in
linux.git is 1.5 years old and is under active development; we even got
virt/kvm/arm merged in recently (3 months ago), although I'm not sure how it
works without processor extensions (it's not pure pvops). ARM32 already rules
embedded and mobile devices, and manufacturers are interested in taking it to
heavy computing. In short, the roadmap is quite clear: ARM64 is the future.

The core of the Linux memory-barriers model described in
Documentation/memory-barriers.txt [2] (heavily inspired by the relaxed DEC
Alpha) should tell you what you need to know about why a weak memory model is
desirable.

[1]:
[http://www.arm.com/files/downloads/ARMv8_Architecture.pdf](http://www.arm.com/files/downloads/ARMv8_Architecture.pdf)

[2]:
[https://github.com/torvalds/linux/blob/master/Documentation/...](https://github.com/torvalds/linux/blob/master/Documentation/memory-barriers.txt)

~~~
Scaevolus
A weak memory model has almost nothing to do with ILP.

To exploit ILP, ARM has to do the same thing as every other register
architecture: move to a superscalar out-of-order architecture. There's no real
alternative. The weak memory model makes cache coherency protocols
simpler/faster, but won't save you from false register dependencies.
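For anyone wondering what a false register dependency is: it's an ordering
constraint that comes from reusing a register's *name*, not from actual data
flow. A toy classifier over a hypothetical three-address form:

```python
def classify_dependences(instructions):
    """Label dependences between (dest, op, srcs) instructions:
    RAW (read-after-write) is a true data dependence; WAR and WAW are
    false dependences -- artifacts of name reuse that register renaming
    (or a single-assignment belt) eliminates."""
    deps = []
    for i, (di, _, si) in enumerate(instructions):
        live = True                       # di not yet overwritten
        for j in range(i + 1, len(instructions)):
            dj, _, sj = instructions[j]
            if live and di in sj:
                deps.append((i, j, "RAW"))
            if dj in si:
                deps.append((i, j, "WAR"))
            if dj == di:
                if live:
                    deps.append((i, j, "WAW"))
                live = False              # later reads see j's value
    return deps

code = [("r1", "load", ()),       # 0
        ("r2", "add",  ("r1",)),  # 1
        ("r1", "load", ()),       # 2: reuses the name r1
        ("r3", "mul",  ("r1",))]  # 3
print(classify_dependences(code))
# [(0, 1, 'RAW'), (0, 2, 'WAW'), (1, 2, 'WAR'), (2, 3, 'RAW')]
```

Renaming the second write to r1 (or never naming destinations at all) removes
the WAR and WAW edges, leaving only the true RAW chains; those are what an OoO
engine actually has to wait on.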

Three-address code _might_ make it possible to defer the costs of complex
register renaming, but ARM has discovered the importance of code density, and
Thumb-2 (2-address code) is preferred for most functions.

~~~
artagnon
I see. Can you explain to me how it fixes the power consumption problem? Yes,
I know that it needs to reorder instructions to achieve ILP, but isn't that
the entire point of a weak architecture? To reduce the memory barriers and
weaken the ordering constraints, making it easier to reorder?

What are false register dependencies? And how does a three-address code help
in register renaming?

Some reading references would help. I'm currently reading [1].

[1]:
[http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2...](http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.07.23a.pdf)

~~~
solarexplorer
The ordering constraints of memory models apply foremost between instructions
that execute on different processors. As long as instructions execute on a
single processor, everything appears to execute sequentially (even if it
doesn't).
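The classic illustration is the store-buffer litmus test. This deterministic
Python simulation is a sketch of the hardware mechanism, not real concurrent
code; it shows the outcome that surprises people, where both loads return 0,
which sequential consistency forbids:

```python
class Core:
    """Toy CPU core with a store buffer: stores sit in the buffer until
    drained to shared memory; loads check the core's own buffer first
    (store forwarding) and fall back to memory."""
    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}              # pending stores: addr -> value
    def store(self, addr, value):
        self.buffer[addr] = value
    def load(self, addr):
        return self.buffer.get(addr, self.memory[addr])
    def drain(self):                  # what a full memory barrier forces
        self.memory.update(self.buffer)
        self.buffer.clear()

memory = {"x": 0, "y": 0}
c1, c2 = Core(memory), Core(memory)

# Each core stores its own flag, then loads the other core's flag.
c1.store("x", 1); r1 = c1.load("y")
c2.store("y", 1); r2 = c2.load("x")

print(r1, r2)   # 0 0 -- both stores were still sitting in buffers
```

With a drain between each store and load (the effect of a full barrier like
smp_mb()), at least one of the two loads has to observe a 1.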

The real cost of weak memory models is in software. Programmers that write
multithreaded code need to understand the memory models. And few actually do,
because these things are not trivial. So you gain a few percent of performance
with a weak memory model and at the same time you dramatically restrict the
number of programmers who can write correct code. A very bad deal.

Hennessy, Patterson: "Computer Architecture, A Quantitative Approach." is the
standard introduction to computer architecture.

~~~
artagnon
What?! Linux provides me memory-barrier primitives: smp_mb(), smp_rmb(),
smp_wmb() etc. Even an everyday Linux hacker does not need to understand the
peculiarities of each architecture, because the locking peculiarities are all
abstracted out in commonly used spinlock_*() interfaces. Any sane compiler-
writer would use these primitives too: so, unless you're writing raw
architecture-dependent assembly, or are a very hardcore Linux hacker, you do
NOT need to understand this.

Linux supports Alpha, which provides the weakest ordering guarantees anyway,
so no: any architecture stronger than Alpha puts absolutely no burden on
anybody. In practice, weaker ordering constraint means more freedom for Linux.

~~~
solarexplorer
Sure, if you use a portable set of primitives you only need to understand how
and when to use those primitives, and only the relaxed memory model they are
based on. But even that is a lot to ask, and too much for many programmers.

And this particular solution only works inside the Linux kernel and only using
GCC. It doesn't matter what we think a sane compiler-writer or a sane
computer-architect would do. Many times we just have to live with the choices
other people made.

I think you got something wrong here. Architectures with strong memory
ordering put more burden on the hardware and less on the programmer. That was
the point I was trying to make: weak memory ordering is not worth it, because
it is better to let the hardware do the hard work and not to burden the
programmer with it.

Linux may work well with relaxed memory ordering, but that did not come for
free: a lot of people first had to understand what a relaxed memory model is,
and so on.

~~~
artagnon
> weak memory ordering is not worth it, because it is better to let the
> hardware do the hard work and not to burden the programmer with it

My question is very simple: why do we have architectures that reorder
aggressively if the performance gain is not worth it? CPU manufacturers do
understand that weaker guarantees leads to complexity that must be handled at
the software level: so, are they mad? The fact of the matter is that we have
been moving away from Lisp machines, CISCs and "human" instruction sets to
RISCs, because the kernel/ compiler is in a much better position to make
optimization decisions than the hardware (it can see the bigger picture).

These architectures exist, and the infrastructure to handle all this
complexity has already been written (yes, all major compilers too [1]). Yes, a
lot of people had to study a _lot_ to get us where we are today, and the
success of our little iPhone apps stands on the shoulders of those giants.

I seriously don't get what software complexity you're whining about.
Complexity is not the exception in Linux; it's the bloody rule! Have you seen
the fs/ tree implementing various filesystems? Perhaps the kernel/ tree with
the state-of-the-art scheduler? Or even the net/ tree implementing TCP/IP?

I'm not interested in discussing some hypothetical idealized textbook world
where everything is simple and elegant: I'm interested in contributing
whatever little I can to tomorrow's concurrency infrastructure.

The mammoth question in the room has still not been answered: how does ARM
manage significantly lower power consumption? Aggressive reordering (=> weaker
guarantees) seems to be part of the answer.

[1]:
[http://en.wikipedia.org/wiki/Memory_ordering#Compiler_suppor...](http://en.wikipedia.org/wiki/Memory_ordering#Compiler_support_for_hardware_memory_barriers)

~~~
solarexplorer
> Are CPU manufacturers mad?

No, they are not. But in the early '90s, when most of these decisions were
made, the situation was very different. Processors got twice as fast every two
process generations. Shared-memory multiprocessors were an exotic niche. But
then we hit the power wall, and all of a sudden shared-memory multiprocessors
became the norm. Today, you cannot just wait for a new processor to make your
program magically faster; you have to rewrite it. If you asked the same people
who made these decisions in the '90s, many would say today (and many actually
have said publicly) that they regret them.

On complexity: yes, software is incredibly complex. That's why you don't want
to add even more complexity to it!

> How does ARM manage significantly lower power consumption?

Easy: ARM is targeting the low-power market, so they don't need to build the
fastest chip. When you try to build the fastest chip, energy efficiency
suffers a lot. Once ARM starts to compete with Intel on performance, their
power usage will skyrocket.

------
X4
This guy looks and is awesome! Can't wait to have them produce many-core Mill
CPUs, that I'd like to throw algorithms onto. Then I'm gonna Buy Buy Buy

------
pavlov199
This reminds me a lot of SSA. This is really, really cool stuff. One of those
genuine-progress type things in this space.

------
tankenmate
This strikes me as a further progression of the idea of register windows; in
particular the SPARC architecture.

------
microcolonel
Wow, this addresses quite a number of my concerns with aging CPU
architectures, specifically with throwaway registers for fundamentally low-
latency operations like chaining ALU functions.

------
auggierose
Is there a simulator for this available somewhere?

------
gruseom
What would be the ideal language to compile to this?

~~~
tlb
Fortran or numerical code in C. It seems best suited for statically unrollable
loops with few data-dependent branches or random memory accesses.

~~~
williamcotton
C expects addressable registers. From what I can tell this doesn't have those!

It really does seem like a Lisp would map much better, with the whole
caller/callee and those private data belts that looked like hardware level
closures!

~~~
Symmetry
To elaborate on what knz42 said, C expects addressable memory but doesn't say
anything about registers at all. In fact, the existence of registers is sort
of vaguely awkward from the perspective of C's model of computing.

~~~
Mindless2112
> doesn't say anything about registers at all

Considering that C has a "register" storage-class specifier keyword, I don't
see how you can make that claim. Granted, modern compilers ignore it.

Certainly, the C specification states that "register" storage is non-
addressable, so it does say something.

~~~
Someone
That "register" nowadays is just a sequence of characters that rings a bell
for those familiar with CPU architecture. Modern C standards do not convey any
link between the keyword and hardware registers. [http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf](http://www.open-std.org/JTC1/SC22/wg14/www/docs/n1124.pdf),
section 6.7.1:

_"A declaration of an identifier for an object with storage-class specifier
register suggests that access to the object be as fast as possible. The extent
to which such suggestions are effective is implementation-defined."_

And 'non-addressable' only applies to the C 'runtime'. The standard does not
say anything about the hardware.

For example, a C compiler for the 6502 could attempt to store variables of
storage class 'register' in the zero page. Such storage locations have an
address, but you are not allowed to take it from standard C.

For an even better example, look at
[http://en.wikipedia.org/wiki/TMS9900](http://en.wikipedia.org/wiki/TMS9900).
That CPU stored all its general purpose registers in RAM.

------
slacka
After Bret Victor's inspiring talk yesterday, it's fitting to see a new CPU
architecture make it to the top of HN.

Keep it up! Maybe tomorrow we'll see an idea to replace threads and locks.

~~~
qznc
You can replace threads and locks, if you use non-blocking programming
techniques.

[http://en.wikipedia.org/wiki/Non-blocking_algorithm](http://en.wikipedia.org/wiki/Non-blocking_algorithm)

~~~
slacka
With massively parallel hardware, concurrency is the biggest challenge facing
programmers today. Threads and locks are a terrible programming paradigm, and
non-blocking algorithms only add complexity to an already difficult problem.
For example, I've seen NBAs lead to a priority-inversion bug.

Did you even watch his talk? I didn't see anyone arguing against Bret's
scaling concerns with threads in yesterday's HN comments.

