
Computing 10,000x More Efficiently (slides) - silentbicycle
http://web.media.mit.edu/~bates/Summary_files/BatesTalk.pdf
======
patrickyeon
I'm about to head out the door, but here's what I see. He's got two
fundamental changes to the way a modern CPU is designed:

* Internal representation of floats is logarithmic

* Small processing elements are connected in a mesh network _that is made explicit to the programmer_

The outcome of the second is that there is more work for the programmer, as
details of the hardware are no longer hidden. The tradeoff is that none of the
hardware overhead needed to hide those details is required anymore. This isn't
new in other application areas; I'm sure DSP folk (and probably GPU folk) can
tell us all about it.

The first gives us very fast arithmetic. What I think he's describing is
that, instead of working with x and y, you would work with log(x) and log(y).
If `x * y = z`, then `log(x) + log(y) = log(z)`. (Additionally, exponentiation
of a value becomes multiplication of its logarithm, so a square root is just a
one-bit shift!) Addition can be done in an approximate manner as well, using a
bit more work. This is all from slide 5.
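
To make the log-domain idea concrete, here is a rough Python sketch (the
base-2 encoding and the helper names are my own, not the actual hardware
format from the slides):

```python
import math

# Log-domain representation: store log2(x) instead of x (positive values only).
def to_log(x):
    return math.log2(x)

def from_log(lx):
    return 2.0 ** lx

# Multiplication becomes addition of the stored values:
#   log2(x * y) = log2(x) + log2(y)
def log_mul(lx, ly):
    return lx + ly

# Square root becomes a halving of the stored value (a one-bit right shift
# of a fixed-point log):
#   log2(sqrt(x)) = log2(x) / 2
def log_sqrt(lx):
    return lx / 2

print(from_log(log_mul(to_log(3.0), to_log(7.0))))   # ~21.0
print(from_log(log_sqrt(to_log(16.0))))              # ~4.0
```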

Releasing the hardware from the requirement that it be extremely accurate (he
sets up his logarithmic representation to allow for ~1% error) allows him to
make all this much smaller. He's shown that it can work "well enough" in some
cases, and is working on finding more applications for the tech.

This _isn't_ a free lunch (and isn't proposed as such), and it isn't
applicable across all of general-purpose computing. But I would think there
are lots of areas it could improve: huge datasets for customers that can
afford expensive, custom hardware, or low-power applications that would
benefit from approximate methods only a full-blown desktop can handle now
(pattern recognition or noise reduction on phones and cameras, for example).

~~~
rhino42
The other issue will be the I/O bottleneck. A practical application would
need on-chip memory to buffer data (not mentioned here, I think). Even so, the
applications will be severely limited to those where the data needs enough
processing to justify moving it on and off the chip.

I spend most of my time working on FPGAs, and I would imagine that in the end,
a practical implementation of this chip would involve a good deal of similar
work. Data would have to be piped from one core to adjacent cores, to keep the
inter-chip bandwidth tractable. This could result in portions on the edges
left unused because the data can't get there.

In summary, the applications for such a chip are limited, and will require a
different skillset from what we usually ascribe to a programmer, but in those
constraints these chips could be incredibly high-performance.

EDIT: I haven't played with log-scale arithmetic yet, but I think that adds /
subs will be much more computationally complex--probably more so than
multiplies in the current linear-scale numbers. Just a thought.

~~~
omaranto
I don't think addition and subtraction are too bad in this model. The slides
mention that there is a simple circuit computing F(t):=log(1+2^t). So if you
have two numbers x and y, represented by their logarithms log(x), log(y), the
log of their sum can be computed using F as follows: log(x+y) = log(x) + log(1
+ y/x) = log(x) + F(log(y) - log(x)).

Whether this is better or worse than multiplies in the current linear-scale
representation depends on how hard it is to compute F, I guess.
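
A minimal sketch of that identity (the direct evaluation of F here just
stands in for whatever "simple circuit" the slides have in mind):

```python
import math

def F(t):
    # F(t) := log2(1 + 2**t); the slides say a simple circuit computes this,
    # here it is simply evaluated directly.
    return math.log2(1.0 + 2.0 ** t)

def log_add(lx, ly):
    # Given lx = log2(x) and ly = log2(y), return log2(x + y) using
    # log2(x + y) = log2(x) + F(log2(y) - log2(x)).
    if ly > lx:                 # keep F's argument <= 0
        lx, ly = ly, lx
    return lx + F(ly - lx)

print(2.0 ** log_add(math.log2(5.0), math.log2(3.0)))   # ~8.0
```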

------
patrickyeon
It's not especially obvious, the way his site is laid out, but there's a more
conversational article at <http://web.media.mit.edu/~bates/Summary.html>

------
russell
An interesting piece of hardware that I have seen was used in ray tracing,
where high-speed, reduced-precision FP was used to sort polygons into those
that were not visible, those where the low precision hid a change, and those
that were otherwise visible. Conventional high-precision processing was needed
only for those with changes. The overall performance improvement was 100x.

We are probably close to hitting the wall for general purpose solutions on
general purpose hardware, but the future is wide open for application specific
hardware.

~~~
Palomides
are we actually going to see application-specific hardware, or are we going to
see something like an FPGA as a coprocessor in consumer hardware?

~~~
Tuna-Fish
FPGAs really wouldn't help much. The theoretical gains sound good, but
frankly, the practical speed of a CPU is measured by how fast it can run
crappy code. An FPGA in a CPU would just be a pointless waste of silicon that
almost nobody would program for.

Application-specific hardware is already being added to CPUs -- the newest
Intel ones have a hardware RNG and hardware video encoders and decoders.

~~~
rhino42
There's a team at UC-Berkeley which is working on compiling C++ to a
combination of CPU/GPU/FPGA code right now, which might change things. I think
you need a special architecture to take full advantage of it though.

------
algoshift
Scaled integers give you an instant and very significant boost in
performance. And, what's best, in most cases the error is far smaller than 1%.
I've done tons of real-time image processing work on FPGAs, where you almost
never use floating-point hardware due to how resource-intensive (and slow) it
is.

Another construct that is extremely fast, cheap, and can even represent
nonlinear and discontinuous relationships is the good old lookup table. In
hardware (FPGA) you get one result per clock cycle.
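
For what it's worth, here's a toy Python illustration of both constructs (the
Q16.16 format and the 256-entry gamma table are arbitrary choices of mine,
just to show the shape of the technique, not FPGA code):

```python
# Scaled ("fixed-point") integers: values stored as ints scaled by 2**16.
SHIFT = 16
ONE = 1 << SHIFT

def to_fix(x):
    return int(round(x * ONE))

def fix_mul(a, b):
    # The raw product carries 32 fractional bits; shift back down to 16.
    return (a * b) >> SHIFT

def from_fix(a):
    return a / ONE

print(from_fix(fix_mul(to_fix(3.25), to_fix(2.5))))   # 8.125, exact here

# Lookup table for a nonlinear relationship: build it once, then every
# evaluation is a single indexed read (one result per clock in hardware).
TABLE_SIZE = 256
gamma_lut = [int(255 * (i / TABLE_SIZE) ** 2.2) for i in range(TABLE_SIZE)]

def gamma(x):                      # x in [0, 1)
    return gamma_lut[int(x * TABLE_SIZE)]

print(gamma(0.5))                  # table read, no runtime math
```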

Now, of course, if you can deal with the error and name the tune with just a
handful of transistors it could be a good tool to have.

It might make more sense to have a hybrid: build a CPU that includes a
traditional FPU as well as a "physics-based" FPU. Then the programmer can
decide which way to go by trading off accuracy and deterministic behavior
against speed.

~~~
ippisl
How does an FPGA + scaled integers compare with a GPU?

~~~
robomartin
An FPGA can't compete with a dedicated chip. The additional hardware switches
and routing layers needed to make an FPGA programmable means that there's a
limit to how fast it can go and how much logic you can pack per square
millimeter. What you can do is use an FPGA to validate the design and then
convert it to an ASIC if the market can justify the costs involved.

------
cousin_it
> _Now each instruction (increment, load register, occasionally multiply)
> invokes >10M transistor operations_

This was the most interesting statement in the slides to me. Can someone
explain why so many?

~~~
narag
He seems to be writing about floating point operations. Considering
instruction pointer logic, buses, cache, branch prediction and all that jazz,
not so surprising, but it still seems a lot to me.

~~~
wmf
According to Wikipedia the Alpha 21264 was 6M transistors not counting caches
and the Pentium Pro was 5.5M total. 10M doesn't sound far off for today's
processors. Since these are some of the earliest out-of-order superscalar
processors, they may give an idea of the minimum cost of those features.

~~~
KaeseEs
A quad core i7 has 1.16 billion transistors, for reference. The question posed
was whether each instruction might touch 10 million, or .86%, of these.

------
mturmon
Invocations of massive parallelism are always exciting and often convincing
because of the huge speedups you can get for niche applications. But
historically, SIMD has not been successful
(<http://en.wikipedia.org/wiki/SIMD>). Or, as a NASA program manager in the
area of parallel computing once remarked more colorfully in the mid-90s, "SIMD
is dead".

You could argue that GPUs have been a successful back door pathway for SIMD.
This seems analogous to the offboard processors that are used in medical
imaging or military signal processing (radar) -- there is a SIMD unit, but it
is controlled by a general-purpose computer, and the whole thing is in a
peculiar niche.

~~~
DarkShikari
_But historically, SIMD has not been successful_

Then why does Intel keep adding larger, more expanded SIMD instruction sets
with every new CPU?

Why did ARM invent NEON and place it on all the high-end ARM chips, taking up
more space than the entire main ALU?

Why are so many DSPs, for so many purposes, basically dumb SIMD machines?
Broadcom's VideoCore, for example, is just a specialized 512-bit SIMD chip
with insanely flexible instructions and a gigantic, transposable register
file.

OpenCL/CUDA _are_ SIMD languages -- controlling them with a "general purpose
computer" isn't any different from your typical Intel chip, which is also a
big SIMD unit controlled by a general-purpose computer.

It's rather difficult to call something that is an integral part of billions
of CPUs "not successful". Without SIMD, much of modern computing would be
crippled by an order of magnitude -- more for GPUs, whose shader units are
basically SIMD engines and little more.

~~~
yvdriess
\- Because SIMD is 'easy' from the processor's point of view. Basically the
compiler is tasked with bundling parallel computations together. SIMD died out
back in the 90s because no one could get that magical parallelizing compiler
to work. Similar story with VLIW ISAs.

\- Modern GPUs are not your 80's SIMD. Programs are written with very fine
granularity. Lots of processor real estate is dedicated to scheduling/task
switching, basically bundling fine-grained computations where possible. Modern
GPUs are more like the 80's experimental dataflow machines than the SIMDs.

~~~
DarkShikari
_SIMD died out back in the 90s because no one could get that magical
parallelizing compiler to work._

That "magical parallelizing compiler" does work, and is known as a "human
being". This curious type of compiler has been used for over 20 years to
produce countless lines of SIMD code, ranging from Altivec to SSE.

------
hinathan
This concept - trading accuracy for efficiency - has always made intuitive
sense to me; it's great to see someone building it out.

------
mrb
This blew my mind. It seems such a simple idea to "move the instruction set
closer to hardware". I am amazed no one thought about it before, because it
seems so obvious.

The author provided convincing examples that 1% floating point accuracy is
sufficient in many domains.

------
kristianp
Here's the patent:
[http://www.wipo.int/patentscope/search/docservicepdf_pct/id0...](http://www.wipo.int/patentscope/search/docservicepdf_pct/id00000012558385?download)

------
kghose
People should look into neuromorphic engineering, which has been around for a
while and is basically computation using analog VLSI circuits (aVLSI).
<http://en.wikipedia.org/wiki/Neuromorphic>

As a first approximation these are the analog computers our fathers and
grandfathers used.

------
crntaylor
Here's a fun fact. There are approximately 365 * 24 ~ 10,000 hours in a year.
If he can pull this off, then problems which currently take a year of CPU time
could be solved in _one hour_ with the proviso that your results can be off by
up to 1% (assuming that the error can be tracked and doesn't compound over the
lifetime of the calculation).

I can think of a lot of data mining/machine learning applications for which
that would be a huge benefit.

------
keturn
Any news since 2010?

~~~
MaysonL
No, but the author is CEO of Singular Computing, LLC:
<http://singularcomputing.com/>

(not that there's anything there)

------
aidenn0
So it's basically a digital slide-rule. Am I the only one picturing a million
guys in neckties moving a slide rule at a rate of 1 billion times per second?

------
robomartin
Raise your hand if you want the product liability exposure caused by your code
potentially having 1% (or whatever) errors in calculations.

~~~
learc83
This isn't for general purpose computing. This would only be used for specific
domains where 1% errors don't matter. No one's going to be using this for
financial transactions.

~~~
robomartin
Is this the reason for the downvote?

First, I'd ask you to name, say, 1,000 applications where this would be
acceptable. Games might be an obvious one.

OK, what else? Because, you see, without a significant ecosystem of
applications that can tolerate these kinds of errors, the business case to
fire up a foundry and make chips just can't be made.

Military? I don't know. You can justify even the most ridiculous things under
the cover of military spending.

I would venture to say that the vast majority of the applications that can
deal with a 1% error will work just fine using integer math which, if done
correctly, can be far more accurate as someone else pointed out. And, I would
further venture to say that they already use integer math.

In other words, in the context of the vast majority of applications this is
not a problem needing a solution.

I've done lots of analog circuit designs (yes, analog...scary) where you
creatively make use of the physics of the junctions to produce various non-
linear transfer functions. It works well and can produce amazingly compact
circuits that do what would require lots of digital circuitry. The errors,
potential thermal drift and process variations can kill you though.

To give it a different spin, an FPU required an external chip 20 or 30 years
ago. Today you have processors smaller than a fingernail running at multi-GHz
speeds with pipelined FPUs. My vote is for further development in the
direction of
faster and more accurate FP calculations rather than faster and less accurate
(and full of other liabilities) FPU's.

~~~
DarkShikari
_To give it a different spin, an FPU required an external chip 20 or 30 years
ago. Today you have processors smaller than a fingernail running a multi-GHz
with pipelined FPU's._

GPUs have so many FPUs (literally thousands) that their silicon footprint is
quite large. The wide SIMD FPUs in Sandy Bridge are also large and
power-hungry, so much so that Intel actually has an (undocumented) feature
where Sandy Bridge powers down half of the FPU if it hasn't been used
recently.

With regard to the importance of accuracy, even Intel realizes the importance
of being able to improve power and performance with lower-accuracy FP
calculations: see for example their demo hardware here:
[http://www.theregister.co.uk/2012/02/19/intel_isscc_ntv_digi...](http://www.theregister.co.uk/2012/02/19/intel_isscc_ntv_digital_radio/)
.

 _"The FP unit can do 6-bit, 12-bit, and 24-bit precision and, by being smart
about doing the math, it can cut energy consumption by as much as 50 per cent
and - perhaps more importantly - you can do the math a lot faster because you
only operate on the bits you need and don't stuff registers and do
calculations on a lot of useless zeros."_

~~~
robomartin
Sure. Makes sense. The question really is about whether or not some of these
applications can tolerate errors in the order of 1%. I contend that those
applications are few and far between, particularly in this day and age. People
don't want to go backwards.

For example, I can do photorealistic rendering of mechanical models from
Solidworks. A 1% error in the output of these calculations would not be
acceptable. The quality of the images would not be the same. Play with this in
Photoshop and see what a 1% error means to images. I would have zero interest
in a graphics card that ran faster, used less power but offered a 1% error in
calculations. That's an instant deal breaker.

I can see the competitors' marketing: "Buy the other brand if you want 1%
errors".

Does a radiologist care about the processing of an MRI happening faster? Sure.
Can he/she tolerate a 1% error in the resulting images? Probably not. He
probably wants more detail and resolution rather than less. I wouldn't want
him/her to be evaluating MRI's with a baked-in 1% error. I think it is
ridiculous to think that this sort of thing is actually useful outside of a
very narrow field within which errors would be OK.

I'm still waiting for someone to rattle off 100 to 1,000 applications where
this would actually be acceptable. Without that, it is nothing more than a fun
lab experiment with lots of funding to burn but no real-world utility.

Let's check in ten years and see where it's gone.

~~~
DarkShikari
_Sure. Makes sense. The question really is about whether or not some of these
applications can tolerate errors in the order of 1%._

I think the idea here is threefold:

1\. The vast majority of computations do not need extremely high precision. A
billion times more floating-point operations go into rendering video for games
every day than for MRI processing (made-up statistic). Even now, we have
double-precision floating point for this reason: most operations only need
single-precision.

2\. Applications that need a particular level of precision can be written with
that level of precision in mind. If you know the ability of a particular
operation to provide precision, and you are aware of the numerical properties
of your algorithms, you can write faster code that is nevertheless still as
accurate as necessary. Most of this isn't done by normal programmers, but
rather by numerical libraries and such.

3\. Many performance-intensive applications _already use_ 16-bit floating
point, even though it has very little native support in modern CPUs. AAC Main
Profile even mandates its use in the specification. The world is not new to
lower-precision floating point, and it has gained adoption _despite_ its lack
of hardware support on the most popular CPUs.

The entire idea that this is "let's make ordinary computations less accurate"
is a complete straw man and red herring that nobody actually suggested.
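
For a rough sense of scale on point 3, here's a quick check (numpy and its
float16 type are my choice of stand-in for hardware half precision, not
anything from the thread) of how the worst-case rounding error of 16-bit
floats compares with the ~1% figure in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 1000.0, size=100_000)      # well inside float16 range

# Round-trip through 16-bit floats and measure the relative error.
rel_err = np.abs(x.astype(np.float16).astype(np.float64) - x) / x
print(rel_err.max())    # about 5e-4 (roughly 2**-11), i.e. well under 1%
```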

