
Computing with Random Pulses Promises to Simplify Circuitry and Save Power - sohkamyung
https://spectrum.ieee.org/computing/hardware/computing-with-random-pulses-promises-to-simplify-circuitry-and-save-power
======
bcatanzaro
It's likely that computing with random pulses would _increase_ power. That's
because the vast majority of power in processors today is spent _transferring_
information, not computing arithmetic operations. The actual arithmetic
operations are very cheap energy-wise compared to the cost of reading and
writing data from memory, as well as sending it around the chip.

The cost of moving data is proportional to the number of toggles you have to
push through the wire. And with this representation, the number of toggles is
immense. Bill Dally noted at SysML the other week that this kind of data
format is almost the worst possible number format you could construct, if you
care about power, because on average there must be lots of toggles.
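
For a rough sense of scale, here's a back-of-the-envelope toy (one value sent
serially over a single wire; obviously not a model of any real bus):

    import random

    # Toy comparison: toggles on one wire carrying the value 0.5 as a
    # 256-bit stochastic stream vs. as an 8-bit binary word sent serially.
    random.seed(0)
    stream = [int(random.random() < 0.5) for _ in range(256)]
    word = [1, 0, 0, 0, 0, 0, 0, 0]        # 128/256 = 0.5 in plain binary

    toggles = lambda bits: sum(a != b for a, b in zip(bits, bits[1:]))
    print(toggles(stream), toggles(word))  # ~127 on average vs. 1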

It's too bad the article doesn't include any comparisons to see why the
authors believe this is a good idea. The neuromorphic story is getting a
little threadbare at this point, without data to back it up.

~~~
darkmighty
A few points: data movement is expensive, but the reasons are not so simple:

1) x86 RAM needs to move huge amounts of data cheaply and extremely reliably.
This requires a lot of power to fight noise. Once you're allowed to have a few
bit flips, the bus voltages can be made much lower, directly lowering losses.

2) _A significant portion of the large energy cost is actually in the
computing associated with memory movement!_ (according to Sohmers from REX
Computing, about 40%). Encoding, queuing on the processor, decoding, fetching,
queuing on the memory controller, and caching again on the CPU take a lot of
logic.

I think it's quite possible this approach is superior for low precision
floating point applications that currently use CPUs/GPUs. Note this allows
lowering voltages for _the whole system_, including interfaces and memory.

Also, data movement doesn't dominate the energy cost of _all_ applications;
there are plenty of operations that are still compute-bound (e.g. some
evolutionary methods); it's just that the applications currently in vogue tend
to be memory-bound. There
are plenty of applications that are currently done in DSPs and ASICs because
of heavy computational costs (error correction decoding as mentioned in the
article) that might benefit from this.

\---

As far as I know, it is still an open question whether it is possible to
generalize this approach for data-efficiency. I have a few ideas I've been
playing around with but none that seem to work yet.

------
dfee
“An AND gate is a digital circuit with two inputs and one output that gives a
1 only if both inputs are 1. It consists of just a few transistors and
requires very little energy to operate. Being able to do multiplications with
it—rather than, say, programming a microprocessor that contains thousands if
not millions of transistors—results in enormous energy savings.”

To be fair, you don’t need thousands or millions of transistors for a
functional ALU, such as the ones we built in college.

I’d imagine the reason it’s more practical to use something with 1M
transistors rather than something with 10 or 100 is the ease of integration,
(amazingly) simplified manufacturing, and existing supply chains. I.e. why not
throw 1M transistors in when it's actually cheaper?

So then, I’m sure there is value in what’s proposed, but for a layperson, what
is the real, practical advantage of this?
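
Going back to the quoted AND-gate trick, it's easy to see in a toy software
simulation (independent Bernoulli bit streams standing in for the hardware
pulse trains; this says nothing about real hardware costs):

    import random

    rng = random.Random(42)
    n = 4096

    def stream(p):
        # Unipolar stochastic encoding: each bit is 1 with probability p.
        return [int(rng.random() < p) for _ in range(n)]

    a, b = stream(0.6), stream(0.3)
    product = [x & y for x, y in zip(a, b)]   # the "AND gate"
    print(sum(product) / n)                   # ~0.18, i.e. 0.6 * 0.3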

~~~
userbinator
_To be fair, you don’t need thousands or millions of transistors for a
functional ALU, such as the ones we built in college._

Indeed, and some of the variants of early mainframes were bit-serial in order
to reduce cost, with a corresponding (significant) decrease in performance
too. An ALU can be less than a dozen transistors or so if it's completely
serial --- which seems to be what this article is going towards, but there's a
reason bit-serial processors have basically become extinct: they're horribly
slow. Any power savings are negated by every operation taking multiple times
longer, so the total energy does not decrease. In fact, since the _frequency_
now has to increase to compensate, and power consumption increases
superlinearly with frequency, it's a net loss.

One memorable microprocessor with a bit-serial ALU is the CDP1802 --- and it
takes 16 clock cycles for a single 8-bit add instruction.

~~~
tomxor
> An ALU can be less than a dozen transistors or so if it's completely serial
> --- which seems to be what this article is going towards.

Each bit being completely independent seems to make it trivial to parallelise.
I know they describe it as "serial" but I don't see anything inherently serial
about it.

Since high-precision numbers are impractical anyway, in practical use people
would probably settle on predefined word lengths rather than actual "bit
streams". You can multiply your 256-bit word (an 8-bit-precision probabilistic
integer) in the time it takes to propagate through a single AND gate - the
shortest possible delay, giving you the fastest clock - or, if sharing the
clock with far more time-consuming blocks in a traditional CPU, it lets you do
a great many probabilistic multiplications in a single clock with the same
silicon in the same block.

Admittedly what I'm suggesting uses 256 gates, not a dozen, but the
performance will be significantly better in terms of how much clock time it
takes up. I think the real problem with this method is the cost of serialising
into the probabilistic bit stream. That needs to be avoided as much as
possible to make this useful.
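
For reference, the serialisation step usually described in the
stochastic-computing literature is just a comparator fed by an RNG (often an
LFSR). A rough software model, assuming that conventional scheme rather than
whatever the article's authors actually built:

    import random

    def encode(value, bits, rng):
        # Conventional stochastic number generator: compare the stored 8-bit
        # value against a fresh pseudo-random 8-bit number every cycle and
        # emit the comparator output.
        return [int(rng.randrange(256) < value) for _ in range(bits)]

    rng = random.Random(1)
    s = encode(64, 256, rng)    # 64/256 = 0.25
    print(sum(s) / len(s))      # ~0.25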

------
nbingham
Here is the counterargument to stochastic computing.

[http://csl.yale.edu/~rajit/ps/stochastic.pdf](http://csl.yale.edu/~rajit/ps/stochastic.pdf)

Basically, stochastic computing takes exponentially more time and energy to
perform the same computation at the same precision, and has a higher error
rate. If you want to save energy on precision, it would be better to use bit
or digit serial operators.
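
(A rough numerical illustration of the scaling, not taken from the paper: a
stochastic estimate's error falls only as 1/sqrt(N), so each extra bit of
precision roughly quadruples the required stream length.)

    import random

    rng = random.Random(0)
    p = 0.3
    for n in (2**8, 2**12, 2**16):
        est = sum(rng.random() < p for _ in range(n)) / n
        print(n, abs(est - p))  # error shrinks only like 1/sqrt(n)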

Note of disclosure: this paper was written by my adviser, and I am currently
writing two papers on digit-serial arithmetic operators.

------
avdempsey
Fascinating! I don't think the article went into what physically encodes
numbers into the probability streams to begin with, only mentioning that it's
simple. Does anyone know about that? Does this happen at the analog-to-digital
conversion?

~~~
neltnerb
I imagine the internal registers simply clock out their stored data to the
bitstream in sequence rather than carrying it in a parallel bus to adders and
the like.

I suppose that wouldn't be a random pattern, though; each iteration through it
would be the same.

~~~
seiferteric
Could you have a true random number generator that, with each clock cycle, had
a 50% chance of producing a 1 or a 0, and then XOR that stream with the data
stream?

~~~
neltnerb
How expensive are RNGs to implement? I'm not sure how they work. I think your
idea would work though.

You might make do with a very long prime value like 4073 that just clocks
around endlessly, which will rarely line up meaningfully with anything else.

It does seem memory-intensive though; maybe these minor issues aren't a big
deal in systems simple enough for this to ever work anyway.

Maybe you just use 127 bits or something to randomize it, probably more than
enough since the goal is likely mostly to avoid aliasing. Even with multiple
incoming synchronized sources, if you use different mixing patterns it might
work out fine.

~~~
pjc50
In physical hardware you can get "real" noise out of a single diode with
appropriate processing. [https://www.maximintegrated.com/en/app-notes/index.mvp/id/3469](https://www.maximintegrated.com/en/app-notes/index.mvp/id/3469)

The effect is a true quantum process whereby one electron tunneling across the
junction triggers a large number of other electrons into moving.

~~~
neltnerb
Nice! I see the noise spectra; I'm guessing the long tail out at high
frequencies means that the recovery time is short enough that subsequent bits
aren't correlated?

------
myself248
I'm not familiar with this specific technique, but it sounds like it's
conceptually related to dither:
[https://en.wikipedia.org/wiki/Dither](https://en.wikipedia.org/wiki/Dither)

Just gonna toss that out there and see if anyone with more expertise can
comment on the similarities or differences.

~~~
aidenn0
The representation they suggest sounds a lot like a dithered PDM[1] bitstream.
There is a slight difference because PDM is usually representing a
continuously varying value.

Sigma-delta ADCs internally generate something like this from the analog
signal (though they often use more than one bit) and then sum the values over
a window to get the actual value that they output.

[edit]

This actually makes me wonder if you could get decent power savings with 2-bit
add/multipliers and streams of 2-bit values; that would significantly decrease
the stream-length required for more precise calculations.

1: [https://en.wikipedia.org/wiki/Pulse-density_modulation](https://en.wikipedia.org/wiki/Pulse-density_modulation)
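
For anyone curious what that looks like, here is a toy first-order sigma-delta
loop (error-feedback form, constant input; real converters work on analog
signals and add noise shaping well beyond this sketch):

    def sigma_delta(x, n):
        # First-order sigma-delta: the running error decides each output bit,
        # so the windowed average of the 1-bit stream recovers the input.
        acc, out = 0.0, []
        for _ in range(n):
            acc += x
            bit = int(acc >= 1.0)
            acc -= bit
            out.append(bit)
        return out

    s = sigma_delta(0.3, 1000)
    print(sum(s) / len(s))      # ~0.3

Note the pattern this produces is deterministic and periodic, which is exactly
the "dithered vs. not" distinction above.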

------
modeless
I like the connection to biology. But it seems incredibly inefficient to store
or transmit this representation. A normal 8 bit integer is blown up into 32
bytes. How would you implement anything that needs a cache?

~~~
Miner49er
Somebody correct me if I'm wrong, but the stream isn't an integer; it's a
probability, i.e. a number between 0 and 1. The point is that you shouldn't
need to cache it; that would defeat the purpose of it being a stream.

~~~
mozumder
Still, if you wanted reduced power and could accept reduced precision,
wouldn't you just do the same thing in binary by using only a few bits? Not
even 8 bits, but 3 or 4?

------
visarga
They said the stochastic neural net had a 10x reduction in power usage. What
kinds of power reductions are theoretically possible with this
tech?

~~~
IIAOPSW
A factor of 1000x.

How do I get this number? Bitcoin mining. Compare the hashes per watt of a CPU
to the hashes per watt of an ASIC. The difference can be entirely attributed
to the overhead of general-purpose CPU circuit design. It is nearly impossible
to design a better ASIC, because the specification for SHA256 practically _is_
the circuit design.

------
tzahola
So basically they’re simulating analog circuits with pulse density modulation.

------
amluto
> Half the time, a bit sampled from the first input is transferred to the
> output; otherwise a bit from the second input is used, effectively averaging
> the two inputs. The circuit that accomplishes this sampling is again a very
> rudimentary one, called a multiplexer. With it, addition becomes very easy.

A multiplexer with a magic random input? If you merely alternate, then you
have a non-i.i.d. sequence, and your next gate can mess up.
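
Simulated, the quoted multiplexer trick looks like this; the select stream
here is just assumed to be an ideal i.i.d. 0.5 source, which is precisely the
part in question:

    import random

    rng = random.Random(7)
    n = 8192
    a   = [int(rng.random() < 0.8) for _ in range(n)]   # stream for 0.8
    b   = [int(rng.random() < 0.2) for _ in range(n)]   # stream for 0.2
    sel = [int(rng.random() < 0.5) for _ in range(n)]   # the "magic" 0.5 stream

    out = [x if s else y for x, y, s in zip(a, b, sel)]  # the multiplexer
    print(sum(out) / n)         # ~0.5, i.e. (0.8 + 0.2) / 2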

~~~
abstrakraft
The article takes as given that it's easy to generate such a magic random
input: it's simply a stream representing 0.5.

~~~
amluto
I would love to see the tiny, power-efficient circuit for that.

~~~
aidenn0
My first thought is an LFSR, but there are probably other ways.
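
For what it's worth, a minimal software model of the LFSR idea (the textbook
16-bit Fibonacci LFSR with taps 16/14/13/11; in hardware it's a shift register
plus a few XOR gates, and each output bit is 1 about half the time):

    def lfsr_bits(seed, n):
        # 16-bit Fibonacci LFSR, feedback polynomial x^16 + x^14 + x^13 + x^11 + 1
        state, out = seed & 0xFFFF, []
        for _ in range(n):
            bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
            state = (state >> 1) | (bit << 15)
            out.append(state & 1)
        return out

    s = lfsr_bits(0xACE1, 65535)
    print(sum(s) / len(s))      # ~0.5 over a full period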

------
deepnotderp
I'm more interested in how stochastic computing can do more interesting things
such as "quantum computing lite", similar to how some applications of chaos
analog computing can do rough approximations of what quantum annealers are
capable of.

------
dahart
What's the difference between what the article described and PWM? I didn't see
any specific justification for randomness. The math parts all work as
described with deterministic and regular PWM signals...

~~~
ballenf
I believe you need the randomness to avoid artifacts in the calculations.

But more importantly, the sources of data in these applications are analog
themselves, and they are skipping the step of computing an exact value before
doing calculations.
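
A toy illustration of one such artifact: the AND-gate multiply only works if
the streams are statistically independent, and with aligned deterministic
(PWM-style) streams it collapses to min(a, b) instead of a*b.

    import random

    def pwm(p, n):
        # Deterministic PWM-style stream: all the 1s bunched at the start.
        ones = round(p * n)
        return [1] * ones + [0] * (n - ones)

    n = 1000
    aligned = [x & y for x, y in zip(pwm(0.6, n), pwm(0.5, n))]
    print(sum(aligned) / n)     # 0.5 -> min(0.6, 0.5), not the product

    rng = random.Random(3)
    rand = lambda p: [int(rng.random() < p) for _ in range(n)]
    indep = [x & y for x, y in zip(rand(0.6), rand(0.5))]
    print(sum(indep) / n)       # ~0.3 -> the intended 0.6 * 0.5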

~~~
dahart
Makes sense. Yes, aliasing and bias can happen with PWM math. But you can
directly encode analog signals in PWM, and doing so (using an analog source)
mostly avoids the aliasing and bias problems.

------
uesjk
Could you program an FPGA as a very efficient neural network with this
technique?

~~~
3chelon
That was my first thought too. Neural networks are stochastic by their very
nature. I've been reading about binary neural networks for embedded devices
and FPGAs recently because I'm convinced machine learning is about to get a
whole lot smaller and more distributed, so this article really piqued my
interest.

------
mycall
Is the glass half full or half empty?

Nope, it's 50% ones and zeros.

