
Asynchronous CPU, MIPS - jacquesm
http://ca.olin.edu/2008/async/home.html
======
kelnos
This stuff is really cool. One of my college professors, Rajit Manohar, was
one of the guys who did a lot of the major research into async VLSI in the
90s. He and a few of his colleagues founded a company in San Jose (Achronix)
that designs and builds async FPGAs.

It's an amazingly different way of thinking about computing. Generally async
designs take up more physical die space (because of things like dual-rail
logic), but the power savings can be pretty significant since you're not
running a clock... basically the logic runs as fast or as slow as it needs to,
and doesn't do much of anything when idle.

I took a class where we designed and did a little layout for some CPU-subset
blocks. It was fun, but often very challenging. At least at the time, there
weren't many (any?) good test/validation tools. A friend of mine has been
working on one for the past 8 years or so
(<http://www.csl.cornell.edu/~fang/hackt/>), first as a part of his PhD
thesis, later as a part of his job.

It's been years since I've touched this stuff... if I'd continued down the EE
path (like many others, I switched to software post-college) I'd (ideally) be
doing async stuff now.

~~~
jacquesm
You may find that talent much in demand in the years to come.

------
sketerpot
This feels like a lot of my class projects: fascinating idea, some good work
on it, and then "we ran out of time, here's a pretty web page and a report".
Ah, memories. :-)

I've really enjoyed asynchronous design. Did you know that it's possible to
make an async n-bit adder that works in O(lg lg n) average time? Or that you
can make a low-power CPU that does all its operations one bit at a time, and
just stops sending bits when it comes to the leading zeros? It's fun stuff.

(My only contribution to the world of async design was a smaller dual-rail
completion detector. It was pretty sweet, and I'm eternally proud of it.)

~~~
jacquesm
I ran into all this because of some talks with other HN'ers about computing
fabrics; one of the big questions is 'clocked' or 'free running'.

That led me to post my ideas on using 'life' as a very basic computing
fabric. It has already been proven to be Turing complete, so the next step is
to make it run without a global clock.

That would prove that a clock-free computing fabric of arbitrary size is a
real-life possibility.

I think it can be done, I'm not 100% sure yet.

That adder sounds like a bit of magic, do you have a link for that?

I can picture the CPU that 'sleeps on 0': basically the transitions are what
clock it forward, so as long as you keep sending it 0's it simply never
changes state, right? (No state changes -> minimal power consumption.)

~~~
sketerpot
Someone's already figured out a way to run Conway's Game of Life
asynchronously. Check it out:

[http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=10298...](http://ieeexplore.ieee.org/search/wrapper.jsp?arnumber=1029886)

I didn't read the paper carefully enough to be sure how it works, but it looks
like each cell (or region of cells) maintains a count of how many generations
have passed, and it blocks until adjacent cells have caught up to it. So time
propagates across the board like waves on a pond.
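
If I had to guess at the mechanism in code, it would look something like this
rough Python sketch (my own reconstruction, not taken from the paper): each
cell carries its own generation counter and simply refuses to step until no
neighbour lags behind it.

    import random

    # Toy asynchronous Game of Life: every cell carries its own generation
    # counter and may only advance once no neighbour lags behind it. Cells
    # fire in arbitrary order; there is no global clock anywhere.
    W, H = 16, 16
    alive = [[random.random() < 0.3 for _ in range(W)] for _ in range(H)]
    prev = [[False] * W for _ in range(H)]   # state one generation back
    gen = [[0] * W for _ in range(H)]        # per-cell generation counter

    def neighbours(x, y):
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    yield (x + dx) % W, (y + dy) % H

    def neighbour_state(x, y, at_gen):
        # A neighbour that has already moved on serves its *previous* state.
        return prev[y][x] if gen[y][x] > at_gen else alive[y][x]

    def try_step(x, y):
        g = gen[y][x]
        # Block (do nothing) until every neighbour has caught up to us.
        if any(gen[ny][nx] < g for nx, ny in neighbours(x, y)):
            return False
        live = sum(neighbour_state(nx, ny, g) for nx, ny in neighbours(x, y))
        prev[y][x] = alive[y][x]
        alive[y][x] = live == 3 or (alive[y][x] and live == 2)
        gen[y][x] = g + 1
        return True

    # Fire cells in completely random order; consistency comes from the
    # blocking rule alone. Adjacent cells never drift more than one
    # generation apart, which is exactly the "waves on a pond" behaviour.
    for _ in range(50_000):
        try_step(random.randrange(W), random.randrange(H))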

> That adder sounds like a bit of magic, do you have a link for that?

Sure! Here's the paper:

[http://www1.cs.columbia.edu/async/publications/cheng-unger-t...](http://www1.cs.columbia.edu/async/publications/cheng-unger-theobald-tc2000.pdf)

I'll summarize it briefly. You know about ripple carry adders, where the carry
bits propagate across the adder and it takes O(n) time? Well, the adder for
each bit can sometimes precompute its carry output: if both inputs are 1, its
carry output will be 1, and if both are 0, its carry output will be 0, no
matter what carry comes in. So a carry only has to ripple through runs of
positions where the two input bits differ, and for random inputs the longest
such run is O(lg n) in expectation. The upshot of all this is that an
asynchronous ripple carry adder can operate in O(lg n) average time, and an
asynchronous carry-lookahead adder can work in O(lg lg n) average time, about
twice as fast as a synchronous carry-lookahead adder according to
transistor-level simulations.
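
Here's a quick simulation of that average case, a sketch of my own rather than
anything from the paper:

    import random

    def carry_chain_delay(a, b, n):
        """Completion time of an async ripple-carry adder on n-bit inputs:
        'generate' (1,1) and 'kill' (0,0) positions know their carry-out
        immediately; 'propagate' positions (bits differ) must wait for the
        carry from the position to their right."""
        longest, chain = 0, 0
        for i in range(n):
            if ((a >> i) & 1) != ((b >> i) & 1):   # propagate: must wait
                chain += 1
                longest = max(longest, chain)
            else:                                  # generate or kill: done
                chain = 0
        return longest  # completion time ~ longest carry chain

    n, trials = 64, 10_000
    avg = sum(carry_chain_delay(random.getrandbits(n),
                                random.getrandbits(n), n)
              for _ in range(trials)) / trials
    print(f"average longest carry chain for n={n}: {avg:.1f}")
    # prints about 5-6, i.e. on the order of lg(64), not 64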

> I can picture the CPU that 'sleeps on 0': basically the transitions are what
> clock it forward, so as long as you keep sending it 0's it simply never
> changes state, right? (No state changes -> minimal power consumption.)

Asynchronous design is like ideal clock gating: when part of a processor
doesn't need to do anything, it just stays idle and doesn't dissipate dynamic
power. In the example of a bit-serial length-adaptive datapath that I
mentioned, the leading zero bits are simply ignored, and no processor work
needs to be done to deal with them. Here's a paper on it:

<http://vlsi.cornell.edu/~rajit/ps/bitsnap.pdf>
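
To make the length-adaptive idea concrete, here's a toy Python model of my own
(not the actual design from the paper): operands travel LSB-first as bit
streams that simply stop once only leading zeros remain, so the work done
tracks the significant length of the data rather than the width of a fixed
datapath.

    def to_stream(x):
        """Send a non-negative number LSB-first, stopping at the leading
        zeros: small values produce short streams."""
        while x:
            yield x & 1
            x >>= 1

    def serial_add(xs, ys):
        """Bit-serial addition of two such streams; each loop iteration is
        one 'bit-time' of work."""
        xs, ys = iter(xs), iter(ys)
        carry, result, shift, bit_times = 0, 0, 0, 0
        x, y = next(xs, None), next(ys, None)
        while x is not None or y is not None or carry:
            s = (x or 0) + (y or 0) + carry
            result |= (s & 1) << shift
            carry = s >> 1
            shift += 1
            bit_times += 1
            x, y = next(xs, None), next(ys, None)
        return result, bit_times

    print(serial_add(to_stream(5), to_stream(3)))      # (8, 4): tiny job
    print(serial_add(to_stream(5), to_stream(2**40)))  # long operand, more work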

~~~
jacquesm
> it looks like each cell (or region of cells) maintains a count of how many
> generations have passed,

Hah!

That's _exactly_ what I came up with :)

I finished it at 5 am this morning and it works.

The code is here: <http://ww.com/asynclife.php.txt>

Thanks for the other papers, I'll be reading.

edit:

I've written a blog post about all this:

<http://jacquesmattheij.com/asynchronous+life>

I can't read the IEEE paper without paying for it, so I think I'll forgo the
honour of doing so. It's a pity, because I would like to see exactly what they
came up with.

Especially whether or not they clued in to the fact that if you have a
'generation' counter, you only need its lower two bits to keep going forever.

~~~
sketerpot
Oh man, that's awesome! Have you considered modeling a more "real-world"
computing fabric? Your cells would have randomly-distributed computation and
communication delays, and you could make some sort of visualization of the
time waves propagating, and it would just be really cool.

I initially didn't think this would be possible to implement in simple
physical hardware, but your insight that you only need a two-bit time counter
changes things.

Of course, ultimately, I think what you'd want to look at is amorphous
computing. It's a fascinating new branch of computer science that looks at
organizing a large number of unreliable small processors, sensors, and
actuators, with only local communication, into a working system. Here's the
big intro paper (Abelson and Sussman are among the authors, BTW):

[http://groups.csail.mit.edu/mac/projects/amorphous/papers/ai...](http://groups.csail.mit.edu/mac/projects/amorphous/papers/aim1665.pdf)

I think you'll really enjoy amorphous computing. It's already getting some use
in wireless sensor networks, especially the stuff having to do with
establishing reliable communication channels, or figuring out the spatial
distribution of the sensors based on ping times, or keeping clocks more-or-
less synchronized.

~~~
jacquesm
> Oh man, that's awesome!

Thanks :) It was a lot of fun.

> Have you considered modeling a more "real-world" computing fabric?

That's what this is leading into; the simplest unit is just a way for me to
know that it is doable. If I can't do 'life', there is no point in going any
further with it.

One of the HN members here is very far along in coming up with a higher-level
concept, but I feel that it needs to be done from the 'gate level' up.

> Your cells would have randomly-distributed computation and communication
> delays, and you could make some sort of visualization of the time waves
> propagating, and it would just be really cool.

I did that for a bit; it is easy enough to show the generations as colours.

> I initially didn't think this would be possible to implement in simple
> physical hardware, but your insight that you only need a two-bit time
> counter changes things.

Yes, that's what made me really happy :) I realized that with a 'full'
generation count you could make it work, but that would always cap the maximum
number of generations at whatever fits in the counters. So why not turn it
around and see how few bits you can get away with while handling the
'wrapping' properly? The answer is 2, and that's low enough to use a simple
4-to-1 demultiplexer to select the right state line (current or previous
generation).
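
In case it helps anyone, here's the comparison logic as a minimal Python
sketch of the idea (the real implementation is in the PHP file linked above):
since adjacent cells can never be more than one generation apart, the
difference modulo 4 is always 0, 1 or 3, and two bits are enough to tell the
cases apart.

    def neighbour_lags(my_gen, their_gen):
        """True if the neighbour is one generation behind us: we must wait."""
        return (their_gen - my_gen) % 4 == 3

    def neighbour_state(my_gen, their_gen, current, previous):
        """Select the right state line, like the 4-to-1 demultiplexer: a
        neighbour that has already advanced serves its previous state."""
        d = (their_gen - my_gen) % 4
        if d == 0:
            return current   # same generation
        if d == 1:
            return previous  # neighbour is one generation ahead
        raise RuntimeError("neighbour lags; block until it catches up")

    # The counters themselves are stored modulo 4 and wrap forever:
    my_gen, their_gen = 3, 0  # the neighbour wrapped from 3 to 0, one ahead
    print(neighbour_lags(my_gen, their_gen))                # False
    print(neighbour_state(my_gen, their_gen, True, False))  # False (previous)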

> Of course, ultimately, I think what you'd want to look at is amorphous
> computing.

I had not heard the term before, more reading to do!

> I think you'll really enjoy amorphous computing.

Yes, that sounds like it is exactly the way to go.

I will write a fair-sized article at the end of this month about where I feel
computing is headed, as a follow-up to the New Year's Day posting about what
we think the future will bring.

I thought the question was an important one, and I'm killing two birds with
one stone here: first the interest in cell computing, and second the
information I'm gathering to answer that question about the future.

Fascinating stuff.

------
boots
I worked at Achronix Semiconductor (a San Jose startup) for over a year and a
half designing asynchronous circuits. I can answer any questions about the
specifics of designing these circuits.

Anyone interested in this topic should definitely look into the Cornell
University research ( <http://vlsi.cornell.edu/> ). Unlike most groups doing
asynchronous research, they still tape out chips.

~~~
sketerpot
Cool, there are a few things I've been wondering about for a while now.

Is it possible to make practical async logic on ordinary FPGAs? I've seen some
very conservatively-designed async stuff that ran on an Altera FPGA, but I
remember hearing that it was kind of flaky. (It was a Java-to-hardware
synthesis program -- there's a remarkably direct mapping between the usual
program structures like loops, if statements, etc., and hardware if you're
willing to use a naive design.)

What are the prospects for putting asynchronous IP cores into a system-on-chip
with mostly clocked parts?

What kind of async logic style do you usually use in practice? Dual-rail
quasi-delay-insensitive, or bundled data with matched delays, or something so
exotic that I've yet to hear of it?

I've heard some people speculating about how async circuits can be used to
make fault-tolerant logic. For example, if you're using dual-rail coding, then
a single bit flip is going to be detectable as a fault, and you can trigger
some self-checking-and-repair logic without breaking your timing assumptions.
How important/plausible would you say this is?
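
To spell out what I mean by 'detectable', here's a toy illustration of my own:

    # Dual-rail coding: each logical bit travels on two wires.
    # (0, 1) encodes logical 0, (1, 0) encodes logical 1,
    # (0, 0) means "no data yet", and (1, 1) should never occur.

    def classify(rail_true, rail_false):
        if (rail_true, rail_false) == (0, 0):
            return "empty"  # not an error: data simply hasn't arrived
        if (rail_true, rail_false) == (1, 1):
            return "fault"  # illegal codeword: always detectable
        return "valid"

    # Flip any single wire of a valid codeword and you land on 00 or 11:
    # 11 is an immediate, detectable fault, while 00 just stalls the
    # handshake instead of silently delivering a wrong value.
    for codeword in [(0, 1), (1, 0)]:
        for wire in (0, 1):
            flipped = list(codeword)
            flipped[wire] ^= 1
            print(codeword, "->", tuple(flipped), ":", classify(*flipped))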

~~~
blasdel
Years ago I tried to implement a simple CPU architecture in Verilog on a
Xilinx Spartan-3, where the aim was for it to be pedagogically simple inside
and out, with extremely concise and expository HDL source.

I failed pretty miserably: It took me several rewrites before I got one to
even synthesize, and dozens before I got it to place+route onto the physical
chip (fucking traveling salesman problem!). I never got my CPU's JMP
instruction to work correctly within my self-imposed ideological constraints,
making it rather useless.

It took me a while to realize that the fundamental problem with my design was
hidden asynchrony, despite having a clock.

------
petewarden
Great stuff. I've been fascinated by the potential of async CPUs ever since I
studied under Steve Furber, who was working on the clockless version of the
ARM processor, Amulet: <http://en.wikipedia.org/wiki/AMULET_microprocessor>

~~~
jacquesm
Yes, I posted that one earlier today.

Quite the inspiration, all these designs doing the 'impossible' (according to
some).

It's always nice to see people pursue presumed dead ends and then turn up
gold.

------
kd0amg
The professor I TA'ed for last semester said he's had a student turn in a
clockless CPU for the final project in the introductory architecture class.

------
david927
very cool

~~~
jacquesm
You're the inspiration for the current crop of 'clock-free' articles; I really
got interested in that particular aspect of our conversation after M's visit
here.

I think I can see at least one way to crack the 'nut' of doing without a
global clock, one which would scale to arbitrarily large systems.

The downside is that it is a purely theoretical solution for the moment; I'll
try to do some simulations to see if it is actually viable.

The tricky bit seems to be not falling into the 'pit' of thinking that you
have it working without synchronization when the test system actually has some
hidden property that makes the likelihood of a failure vanishingly small, but
which, when duplicated 'in real life', would lead to an error almost
immediately.

The classic example is two circuits that are 'independently' clocked from some
master clock divided by different factors.
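
A tiny simulation of my own shows why that setup is treacherous: clocks
divided down from one master keep a locked, repeating phase relationship, so
whole regions of relative phase are never exercised, while truly free-running
clocks drift through all of them.

    import random

    def edge_offsets(edges_a, edges_b):
        """Offset from each A edge back to the most recent B edge."""
        offs, j = set(), 0
        for ta in edges_a:
            while j + 1 < len(edges_b) and edges_b[j + 1] <= ta:
                j += 1
            offs.add(round(ta - edges_b[j], 3))
        return offs

    # 'Independent' clocks divided from a common master by 3 and by 5:
    div_a = [3 * k for k in range(1000)]
    div_b = [5 * k for k in range(600)]

    # Free-running clocks: the same nominal periods plus a little jitter.
    free_a, free_b = [0.0], [0.0]
    for _ in range(1000):
        free_a.append(free_a[-1] + 3 + random.uniform(-0.05, 0.05))
    for _ in range(600):
        free_b.append(free_b[-1] + 5 + random.uniform(-0.05, 0.05))

    print(len(edge_offsets(div_a, div_b)))    # 5: a tiny set, repeated forever
    print(len(edge_offsets(free_a, free_b)))  # hundreds of distinct alignments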

You'd need to test this with local clocks that vary wildly, comparing the
output of the system with 'known good' output over many cycles to gain some
confidence.

It's amazing how the devil really is in the details with this stuff.

~~~
rikthevik
Agreed. Especially when timing errors happen between different clocks in a
deployed chip at different temperatures.

