
“It's done in hardware so it's cheap” (2012) - bibyte
http://www.yosefk.com/blog/its-done-in-hardware-so-its-cheap.html
======
Symmetry
Another benefit of specialized hardware, besides dispatch costs, is the cost
of moving data around. As your chip gets wider you want more physical (rather
than architectural) registers with more ports, meaning superlinear growth. And
your bypass network also grows quadratically in transistor terms. And as your
core gets bigger physically, you lose more power moving data longer distances.
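The quadratic bypass-network growth can be seen with a back-of-the-envelope count (a toy model, not a real design): a full forwarding network lets any of N execution-unit outputs feed any of N unit inputs.

```python
# Toy model: a complete bypass (forwarding) network connects every
# execution-unit output to every execution-unit input, so the number
# of forwarding paths grows quadratically with the number of units.
def bypass_paths(n_units: int) -> int:
    return n_units * n_units

# Doubling the width quadruples the bypass wiring:
print(bypass_paths(2), bypass_paths(4), bypass_paths(8))  # 4 16 64
```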

But in dedicated hardware you can just gang your operations into dataflows
where the output of one stage feeds into the physically adjacent next stage
with no need to make a trip through the register file or bypass network.

A lot of the benefit of hardware vector operations over scalar operations is
in the dispatch cost, but most of the benefit from hardware matrix operations
over hardware vector operations is from reduced data movement.
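A rough software analogy for that ganged dataflow is loop fusion (the stages below are made up for illustration): the unfused version makes a round trip through an intermediate buffer, the fused one feeds each stage directly into the next.

```python
def two_pass(xs):
    # stage 1 writes its result to an intermediate buffer --
    # the analogue of a trip through the register file
    tmp = [x * 3 for x in xs]
    # stage 2 reads the buffer back
    return [t + 1 for t in tmp]

def fused(xs):
    # stage 1's output feeds stage 2 directly, no intermediate storage
    return [x * 3 + 1 for x in xs]
```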

EDIT: Of course, the post is from 2012 back when nobody was doing hardware
matrix multiplication so it's understandable.

~~~
delinka
I recall a demo app from my teen years (late 1908s to early 90s) that
demonstrated, after installing your coprocessor, improved 3D performance by
rotating at "high" speed a 3D sprite on the screen. This was an Intel 80386
with an 80387 coprocessor - I thought it was doing matrix math. Nope, the 80387
was an _external_ floating-point unit that arrived two years after the
processor with which it was paired. Cringey.

~~~
Retric
Math coprocessors worked fairly well due to much slower clock speed minimizing
the latency issues. Off chip level 2/3 cache was still a thing into 486 days.
And x86 ASM still treats floating point as if it’s on a separate processor.

PS: Got a real chuckle out of that 1908s typo. I would probably keep it.

~~~
benj111
L2 cache only got packaged with the chip in the Pentium 2, and even then,
wasn't on chip, hence the weird packaging.

~~~
seanmcdirmid
They were packaged closely together in the Pentium Pro (two dies in the same
processor package). The Pentium II was a cheaper downgrade that preserved the
separate bus but just put the cache next to the CPU in the weird packaging.

------
pkaye
Reminds me of when I was doing firmware development and the ASIC team would
ask if they could toss in an extra Cortex-M3 core to solve specific control
problems. Those cores would be used as programmable state machines. For the
ASIC team, tossing in an extra core was free compared to custom logic design.
However, for the firmware team it was another job to write and test that
firmware. We had designs with upwards of 10 Cortex-M3 cores. A friend at
another employer had something like 32 such cores, and it was a pain to debug.

------
CalChris
_Why have multicore chips if it saves no energy?_

In addition to TFA's reason, bluntly, because we have the transistors. Dennard
scaling has ended which has meant that we can't continue to increase clock
frequencies. However, transistor counts have continued to increase. This has
basically forced CPU manufacturers to focus on multicore _because we have the
transistors_.

Also, big/little, gating off unused silicon and other approaches can save
energy even as they use more transistors.

~~~
jayd16
There is also a heat aspect to this. You can run many cores at a lower
frequency to save on heat. This makes performance more consistent, since you
avoid thermal throttling compared to a single bursty core.

~~~
vvanders
Yeah, to run at higher speeds (switching frequency, F) you need higher
voltages (V), so your power/heat goes up non-linearly, roughly as V^2 * F.

Depending on what your savings are for idle power, it can make sense either to
run long and slow or to burst and then drop into a deep sleep state.
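With dynamic power roughly C * V^2 * F, the many-slow-cores tradeoff can be sketched numerically (all numbers below are made up for illustration):

```python
def dynamic_power(c_eff, volts, hertz):
    # classic CMOS dynamic-power approximation: P ~ C * V^2 * f
    return c_eff * volts * volts * hertz

# Say one core at 3 GHz needs 1.2 V, while four cores at 0.75 GHz
# get by at 0.9 V (hypothetical voltage/frequency points).
one_fast  = dynamic_power(1.0, 1.2, 3.0e9)
four_slow = 4 * dynamic_power(1.0, 0.9, 0.75e9)
# Same aggregate cycles per second, but the slow cores burn ~44% less power.
```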

------
deepnotderp
Imho the article misses a _key_ point which is data locality.

If you look at specialized processors for say, convolutions, the majority of
the benefits are coming from data locality being exploited.

(And no I've never heard the term "dispatch" be used for data movement)

~~~
p1mrx
The "extract bits 3 to 7 and multiply by 13" example is about data locality,
to some extent. It's cheaper to keep data in a local circuit than to ship it
around between general-purpose registers.
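In software, that operation already takes several instructions, each paying the full fetch/decode toll; a sketch of what the software side might look like (the bit range is taken as inclusive bits 3..7):

```python
def extract_and_multiply(x: int) -> int:
    # bits 3..7 inclusive: shift right by 3, mask off 5 bits
    field = (x >> 3) & 0x1F
    # in dedicated hardware this whole thing is one small local circuit;
    # here it's a shift, a mask, and a multiply, each a full trip
    # through the general-purpose pipeline and register file
    return field * 13
```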

------
kazinator
"Done in hardware" means "done _directly_ in hardware"; the _directly_ part is
understood, because everyone knows that everything is ultimately done in
hardware.

Something not done directly in hardware is done in software. That means it's
done using _more_ hardware resources compared to directly in hardware.

QED; directly in hardware is cheaper.

Cheaper to operate, anyway; not necessarily cheaper to produce. You have to
move a decent volume before it becomes economical to optimize a solution into
hardware. Also, a mistake discovered in the field in hardware is more costly
than a mistake in upgradable software.

------
dang
Discussed at the time:
[https://news.ycombinator.com/item?id=4339024](https://news.ycombinator.com/item?id=4339024)

------
depressed
I immediately thought of Nvidia RTX when I saw the title. "Faster" does not
always mean "fast enough".

------
alain94040
I don't agree with the tone of the article. Doing complex functions in
hardware is a lot cheaper, often by 100X, compared to doing them in software.

As an extreme case, to do a simple 32-bit add, you light up tens of millions
of transistors if the addition goes through a CPU pipeline. The adder itself
of course only requires a few transistors...
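The adder itself really is tiny; a ripple-carry sketch (modeling gates with Python bitwise ops) needs only a handful of AND/OR/XOR operations per bit:

```python
def full_adder(a, b, cin):
    # one-bit full adder: two XORs, two ANDs, one OR
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add32(x, y):
    # chain 32 full adders; carry ripples from bit 0 upward
    carry, out = 0, 0
    for i in range(32):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        out |= bit << i
    return out  # 32-bit wraparound; final carry-out discarded
```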

Saying that "specialization saves dispatching costs" is minimizing the savings
by orders of magnitude. Of course, the article is correct in pointing out that
hardware doesn't make things free.

[source: my day job]

~~~
_yosefk
TFA author. How does the word "dispatching" hint at the order of magnitude of
anything? What the orders of magnitude do depend on is your competition. Outdoing
a CPU at operation X is easier than outdoing a GPU at X [assuming the GPU does
X reasonably well] which is easier than outdoing a DSP at X [assuming the DSP
does X reasonably well]. If your competition is reasonably optimized
programmable accelerators, your opportunities to beat it start shrinking.

Source: my day job

~~~
alain94040
For me, dispatching refers to a very specific piece of the CPU
microarchitecture, far from the most complex or the largest in area. So by
focusing on "dispatching", you make it sound like the overhead is small.
Maybe you meant to use another word.

My point stands that from a silicon area perspective, 99% of the CPU is
overhead when all you need is a fixed function.

Based on your background, I know you know that. A lot of details in the
article show that you know what you are talking about and have specific use
cases in mind. I can guess those and for those the article is correct.

------
sbhn
It's done in hardware because it's faster, and tech support can just switch it
on and off in case of panic; flicking the switch is a moment of anxiety
relief for those with a tendency to flick a switch in times of absolute
panic.

------
VLM
It's a good article, but the author missed the third reason hardware can be
vastly cheaper, which is lack of abstraction.

You can use google search to add 0x9 + 0x2 and get hexadecimal 0xb... however
that involves dozens of layers of abstraction and endless formatting and
parsing that are fundamentally useless in the long run for something like a
GPU display.

The 4th reason hardware is vastly cheaper is it needs less testing.

In the example above you can either trust your FPGA/ASIC software to implement
a byte-wide full adder properly, because that's kind of a basic task for that
technology, or you can whack the byte-wide adder with all possible test cases
in a couple ns on real hardware; all possible binary inputs and outputs are
quite well known and trivial. When you ask Google, or worse, Alexa, to add two
hexadecimal digits, there is an uncountable number of theoretical buffer
overflows, MITM attacks, possible spyware/virus infections, and similar
nonsense at multiple layers you probably aren't even aware of.
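Exhaustively testing a byte-wide adder really is trivial; a sketch that checks a gate-level ripple-carry model (standing in for the netlist) against Python's own `+` over all 65,536 input pairs:

```python
def gate_level_add8(x, y):
    # ripple-carry adder built only from AND/OR/XOR, one bit at a time
    carry, out = 0, 0
    for i in range(8):
        a, b = (x >> i) & 1, (y >> i) & 1
        out |= (a ^ b ^ carry) << i
        carry = (a & b) | (carry & (a ^ b))
    return out | (carry << 8)  # 9-bit result including carry-out

# the whole input space is 2**16 cases; checking every one is near-instant
assert all(gate_level_add8(x, y) == x + y
           for x in range(256) for y in range(256))
```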

The 5th reason hardware is vastly cheaper is environmentalism and energy
costs. I have trouble estimating the energy cost of a byte-wide adder in an
ASIC or a CPU; surely it can't be more than charging and discharging a couple
of sub-pF capacitors. It takes billions of transistors switching like crazy to
dump the 100 watts a server motherboard can dump, and a full adder doesn't take
many transistors. On the other hand, the infrastructure and environmental
damage required to ask Alexa to add two hex digits is very high. You can
piggyback on it by passing the buck: well, we need that environmental damage
and economic cost to enable Netflix, at which point asking Alexa questions is
a drop in the bucket. But people have polluted for centuries on the same
argument (well, it's just a little extra lost plutonium, and compared to
above-ground nuclear testing it's a drop in the bucket, etc.)

~~~
blattimwind
> The 4th reason hardware is vastly cheaper is it needs less testing.

This goes against all experience I had with hardware, and everything I have
ever heard from every single embedded/electronics engineer.

~~~
badpun
It may be because, outside of a few domains, we just accept software failing
all the time, and hence do little testing of it.

