
Power Complexity On The Rise - ChuckMcM
https://semiengineering.com/power-complexity-skyrockets/?1
======
ChuckMcM
This came across twitter and I found it a really interesting read. Simulation
tools make a bunch of assumptions, when the chip designer breaks those
assumptions, the tools break in unexpected ways. The first time I saw this in
action was the design of the SPARC 10 CPU which was, in places, an
asynchronous design, meaning that the signals for controlling functions in
various units happened when the previous unit finished, there wasn't a common
clock domain. That broke all of the simulation code and it took months to work
around it. I got to hear this second hand as I was just doing system software,
not chip design, but the stories were pretty amazing.

~~~
bsder
There is a reason why every attempt at an asynchronous design has been a
disaster. That's because:

Asynchronous design is a bloody stupid idea.

Non-deterministic behavior is the _LAST_ thing anybody wants. And software
people go to great lengths to mitigate it. And when they fail, we leak crypto
keys, allow people to read privileged memory, etc.

People like John Carmack even advocate adding _time_ as an input to many of
your functions so that they are purely functional and deterministic, and
easier to debug and test - i.e. doing the same thing as VLSI designers and
providing a global clock.
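
The software version of that global clock is easy to sketch. A minimal
Python illustration (the function names are mine, purely illustrative): the
first function reads the wall clock internally, so its result depends on when
you call it; the second takes time as an explicit input, so it is a pure
function you can replay exactly in a test or debugger.

    import time

    # Hidden dependency on the wall clock: two calls with identical
    # arguments can return different results, so tests are flaky.
    def blink_state_implicit(period_s: float) -> bool:
        return int(time.time() / period_s) % 2 == 0

    # Time passed in explicitly: a pure mapping from inputs to outputs,
    # so any moment in the program's life can be reproduced on demand.
    def blink_state(now_s: float, period_s: float) -> bool:
        return int(now_s / period_s) % 2 == 0

    assert blink_state(0.0, period_s=1.0) is True   # pin "now" to any instant
    assert blink_state(1.5, period_s=1.0) is False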

And, for reference, designers didn't just get stuck in a local minimum with
synchronous design. Designers did use asynchronous designs back in the day--
mostly S-R latches--and they discovered that if they added clocks and used
flip-flops instead, things were more reliable _and_ easier to design with.

(This goes back even to the Apollo guidance computers, which used lots of
NOR-gate-based latches but had a few flip-flops. Later designs prioritized
flip-flops over latches.

See:
[https://electronics.stackexchange.com/questions/318341/histo...](https://electronics.stackexchange.com/questions/318341/history-of-edge-triggered-d-flip-flop-design-using-three-s-r-latches)

One interesting bit is that those three-S-R-latch flip-flops actually have
intermediate state stabilized by the analog feedback, if you look at the
designs at the transistor level. It makes for some interesting failure modes
if you disturb the node with noise during feedback amplification. However,
outside of those small windows the feedback makes them _much_ more resistant
to noise upset.

Note: I did the analysis moons ago on the old SN7474 flip-flops, which are
NAND-based, not NOR-based, but I presume the same thing holds. I could, of
course, be _completely_ wrong, and maybe that's why NASA standardized on NOR
gates instead.)
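
To make the latch-vs-flip-flop distinction concrete, here is a toy
behavioral model in Python (my own sketch, not the Apollo logic): the S-R
latch responds to its inputs whenever they change, so any glitch propagates,
while the edge-triggered flip-flop only samples D on the rising clock edge
and ignores everything in between.

    class SRLatch:
        def __init__(self):
            self.q = 0

        def step(self, s: int, r: int) -> int:
            if s and r:                  # the forbidden input combination
                raise ValueError("S=R=1 is an invalid state")
            if s:
                self.q = 1
            if r:
                self.q = 0
            return self.q                # a glitch on S or R shows up here

    class DFlipFlop:
        def __init__(self):
            self.q = 0
            self._clk = 0

        def step(self, d: int, clk: int) -> int:
            if clk and not self._clk:    # rising edge: sample D
                self.q = d
            self._clk = clk
            return self.q                # D wiggling between edges is ignored

    latch, ff = SRLatch(), DFlipFlop()
    latch.step(s=1, r=0)                 # a momentary pulse flips the latch...
    assert latch.step(s=0, r=0) == 1     # ...and the change sticks
    ff.step(d=1, clk=0)                  # D glitches high while the clock is low...
    assert ff.step(d=0, clk=1) == 0      # ...but only the value at the edge counts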

~~~
rstuart4133
> Non-deterministic behavior is the LAST thing anybody wants. And software
> people go to great lengths to mitigate it. And when they fail, we leak
> crypto keys, allow people to read privileged memory, etc.

Let's be a little more precise here. It's the last thing any designer wants to
deal with. I don't think we humans are mentally equipped to deal with the
explosion in the number of possible states it causes.

When we do decide to deal with it, it's not because we love it, it's because
we have no choice. No matter how stupid it is, no matter how difficult it is
to get right, it can deliver one thing in spades: speed. When you've
exhausted all other options, that's where you have to go.

You are looking at it at the gate level, but it's everywhere. Don't want the
CPU waiting around for slow I/O? Simples - just add an interrupt line so the
I/O peripheral can asynchronously tell the CPU when it's done. Want the
fastest CPU? Chop up the instruction stream into little pieces, then throw
the pieces at a set of ALUs that execute them as fast as they can, doing it
asynchronously, tracking the interdependencies on the fly. That didn't get
you enough? Then put 10 CPUs on a chip and schedule tasks across them
asynchronously. Oh, that causes cache conflicts - then just have all the CPUs
snoop each other's caches, asynchronously, and try to keep everything sane
with the occasional bus lock. Want the fastest web server? Have it fire up
threads on demand, have a thread pool, distribute the work load
asynchronously. That generated too much disk I/O? No problem - use a SAN that
shards everything into random locations, hand out the jobs asynchronously and
in parallel.
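
A tiny Python illustration of the trade (the workload here is a made-up
stand-in): a thread pool finishes the batch faster than doing the jobs
serially, but the order in which results arrive is non-deterministic and
changes from run to run - exactly the state explosion designers have to
reason about.

    import random
    import time
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def job(i: int) -> int:
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for variable I/O latency
        return i

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(job, i) for i in range(8)]
        # Completion order is up to the scheduler; it differs every run.
        print([f.result() for f in as_completed(futures)])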

It just goes on and on. We do it all the time. If it really is stupid we must
be the dumbest fucks in the universe.

~~~
bsder
> Don't want the CPU waiting around for slow I/O? Simples - just add an
> interrupt line so the I/O peripheral can asynchronously tell the CPU when
> it's done.

Which we run through a synchronizer to bring back into the primary clock
domain.
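
For readers who haven't met one: the usual circuit is a two-flip-flop
synchronizer - the asynchronous signal is sampled through two flops in the
destination clock domain, so any metastability in the first flop has a full
clock period to resolve before the rest of the logic sees it. A behavioral
sketch in Python (real ones are a couple of flip-flops in RTL, of course):

    class TwoFlopSynchronizer:
        def __init__(self):
            self.meta = 0   # first flop: may go metastable on a marginal sample
            self.sync = 0   # second flop: what the synchronous logic reads

        def clock_edge(self, async_in: int) -> int:
            self.sync = self.meta   # the settled value moves forward
            self.meta = async_in    # fresh sample, possibly marginal
            return self.sync

    sync = TwoFlopSynchronizer()
    irq = [0, 0, 1, 1, 1, 0]        # asynchronous interrupt line over time
    print([sync.clock_edge(level) for level in irq])
    # [0, 0, 0, 1, 1, 1]: the level emerges after passing through both flops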

> Chop up the instruction stream into little pieces, then throw the pieces
> at a set of ALUs that execute them as fast as they can, doing it
> asynchronously, tracking the interdependencies on the fly.

Which are all so strongly wired to the primary clock that you can count the
cycles.

> Oh, that causes cache conflicts - then just have all the CPUs snoop each
> other's caches, asynchronously, and try to keep everything sane with the
> occasional bus lock.

Memory systems generally operate in a clock-forwarded manner, which has a
fixed phase and frequency relative to the primary clock of the chip
controlling the access. We are not willing to pay the synchronization penalty
to cross from an unrelated clock domain under most circumstances.

The only times we are willing to be "asynchronous" are precisely those times
when we don't care about performance.

~~~
rstuart4133
> Which we run through a synchronizer to bring back into the primary clock
> domain.

Of course we do that. To get speed out of one big synchronous domain we split
it into a whole pile of little domains (each internally synchronous) that talk
asynchronously. That allows us to keep the amount of asynchrony to a
minimum, and as you say doing otherwise would be stupid. That does not alter
the fact that those domains all operate asynchronously with respect to each
other.

And you are not wrong in saying we sometimes go to great lengths to avoid
asynchronous systems. The best example is possibly the GPU, which is
massively parallel but keeps all threads in strict lock step. But notice
GPUs have to execute at a fairly slow clock rate to pull that stunt off. An
execution unit (ALU) in a modern CPU, which executes mostly independently of
the other execution units, is about an order of magnitude faster.

By the by, there is a computer that operates at an absurdly slow 1 ms
"clock", yet is faster at most things it does than the biggest and best
computer we can build today. "Clock" is in quotes because there is no clock;
1 ms is roughly the delay through one of its elements. It gets its speed
through massive yet completely asynchronous parallelism. It is the human
brain. Such is the power of being stupidly asynchronous.

------
nullc
There have been a number of failed Bitcoin miner ASICs, basically as a
result of designers using the tools without really understanding the tools'
assumptions (or without themselves understanding the work the circuit would
be doing).

E.g. assuming that the toggle rate for gates would be much, much less than
50% - and as a result the actual part used so much more power than expected
that it wasn't viable (couldn't cool it, or couldn't get enough juice into
the chip).
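
The arithmetic behind that failure mode is the standard dynamic power
estimate, P = alpha * C * V^2 * f, where alpha is the toggle (activity)
rate. A hash core grinding random data flips nearly every node nearly every
cycle, so a typical-logic default for alpha understates power several times
over. A back-of-the-envelope sketch in Python (all numbers are made up for
illustration, not any real chip's figures):

    def dynamic_power_w(alpha: float, c_farads: float, v_volts: float,
                        f_hz: float) -> float:
        """Switching power: activity factor * capacitance * V^2 * frequency."""
        return alpha * c_farads * v_volts**2 * f_hz

    C_TOTAL = 2e-8   # hypothetical total switched capacitance (20 nF)
    VDD = 0.9        # volts
    FREQ = 500e6     # 500 MHz

    # Tool default: ordinary logic toggles a small fraction of the time.
    assumed = dynamic_power_w(0.15, C_TOTAL, VDD, FREQ)
    # Random hash data toggles nearly every node nearly every cycle.
    actual = dynamic_power_w(0.50, C_TOTAL, VDD, FREQ)

    print(f"assumed {assumed:.1f} W, actual {actual:.1f} W")
    # assumed 1.2 W, actual 4.0 W: off by more than 3x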

