
Domain-Specific Hardware Accelerators
https://cacm.acm.org/magazines/2020/7/245701-domain-specific-hardware-accelerators/fulltext
======
reidacdc
It used to be the conventional wisdom that domain-specific ASICs were unlikely
to take off because of the economics -- the cost of fab time for the small
feature sizes (meaning an expensive fab) combined with small batch sizes means
the unit cost would be astronomical, and you'd end up not beating the
price/performance of general-purpose chips, which don't perform as well but
have unbeatable unit costs because, being general purpose, they are produced
by the billions.

The exceptions were GPUs, where there is enough of a mass market, and maybe
super-high-value-added niches, like aerospace, where high unit costs are not a
deal-breaker.

The article seems to get there, in the TCO section, but it looks to me like
maybe genomics is another niche, where there's enough demand to make the
economics work.

I had some hope that fab technology was saturating enough that fab access was
getting cheaper -- maybe driven by SoCs for phones and all those ARM devices
and RPis and stuff -- and that a new era of awesome ASICs was at hand. This
appears not to be the case.

~~~
borramakot
Depending on your definition of cheap, fab access might be pretty cheap right
now for old technology nodes, which have totally reasonable performance if you
have an architectural advantage.

This article suggests mask tapeout costs are under $1 million in older nodes,
sometimes well under. If you have an architectural advantage in a problem
domain with tens of millions or more in costs, a simple ASIC can be very
worthwhile. That architectural advantage might be hard to find, especially
when problem domains aren't fixed for long periods of time (e.g. how many ML
accelerators only really work well for dense convolutions?), but I suspect too
few companies are making custom chips, rather than too many.

[https://www.electronicdesign.com/technologies/embedded-revol...](https://www.electronicdesign.com/technologies/embedded-revolution/article/21808278/the-economics-of-asics-at-what-point-does-a-custom-soc-become-viable)

~~~
jart
Why pay for a fab? Fabrice Bellard wrote a new kind of qemu last year that's
tiny enough to boot operating systems in web browsers:
[https://bellard.org/tinyemu/](https://bellard.org/tinyemu/)

Intel also taught us last year that 4KB of code is all it takes to decode the
entire x86 ISA (i.e. 1977-2020):
[https://github.com/jart/cosmopolitan/blob/d51409c/third_part...](https://github.com/jart/cosmopolitan/blob/d51409c/third_party/xed/x86ild.greg.c#L1223)
Thanks Mark Charney.
[https://github.com/intelxed/xed](https://github.com/intelxed/xed)

~~~
detaro
What does that have to do with accelerators?

------
glangdale
Not sure of the specifics of their acceleration (described here:
[https://dl.acm.org/doi/pdf/10.1145/3296957.3173193](https://dl.acm.org/doi/pdf/10.1145/3296957.3173193))
but a 15,000x speedup seems like... well... a lot. Does anyone have experience
in this area to the point where they can describe whether the software being
compared to is state of the art?

I've lost count of the number of hardware accelerators in the regular
expression world that were way faster than some outrageous strawman (e.g.
"running libpcre one by one over a thousand patterns") as opposed to doing a
little research to find what the state of the art is. Not sure if that's
what's happening here, but 15,000x rings some alarm bells.

~~~
alephnil
The Smith-Waterman algorithm they made an accelerator for was probably chosen
because it is highly parallelizable. On a CPU it is more accurate but a lot
slower than the BLAST algorithm, which is more commonly used to search for
similar DNA or protein sequences. One might think that this chip would make
the more accurate Smith-Waterman algorithm more attractive to use, but I
suspect the difference is not large enough to get biologists to invest in
specialized hardware. I think this is more of a showcase of what you can do
with accelerators, and they picked a problem that would give impressive
numbers.
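For context, the core of Smith-Waterman is a simple dynamic-programming
recurrence; a minimal Python sketch (with an assumed toy scoring scheme and
linear gap penalty -- real tools use substitution matrices and affine gaps)
shows why hardware can exploit it: every cell on an anti-diagonal of the
matrix depends only on earlier diagonals, so they can all be computed in
parallel.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment score via the Smith-Waterman DP recurrence.

    H[i][j] is the best score of any local alignment ending at a[i-1], b[j-1].
    The max(0, ...) lets an alignment restart anywhere, which is what makes
    the algorithm *local* (unlike Needleman-Wunsch).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,  # diagonal: match/mismatch
                          H[i - 1][j] + gap,    # gap in b
                          H[i][j - 1] + gap)    # gap in a
            best = max(best, H[i][j])
    return best
```

Each cell reads only its three neighbors, which is why a systolic array of
small processing elements maps onto it so naturally.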

~~~
glangdale
Smith-Waterman seems like it's begging for better algorithms on the CPU side
as well. In my experience, these papers often involve disparate amounts of
ingenuity and/or "risk tolerance" (i.e. do something that works pragmatically
for the inputs you have) on the h/w vs the s/w side.

The analogous thing for regex would be, say, to extract a literal factor that
_probably_ won't appear in the input and suppress the regex execution if that
nice event occurs -- but only do it in the h/w implementation. Voila, 1000x
speedup -- on the "nice" input, at least.
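The literal-prefilter trick described above can be sketched in a few lines of
Python (the pattern, literal, and inputs here are made-up illustrations). The
filter is only sound if the literal really is a required factor of every
match; the "strawman" version skips the regex even when that isn't guaranteed.

```python
import re

def prefiltered_search(pattern, literal, lines):
    """Run the full regex only on lines containing a required literal factor.

    `literal` must be a substring that every match of `pattern` contains,
    so the cheap substring check safely skips the expensive engine on
    non-matching input.
    """
    rx = re.compile(pattern)
    hits = []
    for line in lines:
        if literal in line:          # cheap filter, rejects most input
            m = rx.search(line)      # expensive engine only on candidates
            if m:
                hits.append(m.group())
    return hits
```

On input where the literal rarely occurs, nearly all lines never touch the
regex engine at all -- which is exactly how an unfair benchmark can
manufacture a huge speedup if only one side uses the filter.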

It would be interesting to see what the expected limits are for fair
comparisons in extremely hardware-acceleration friendly cases. I don't think
GPUs, for example, run 15,000x faster than pure CPU based graphics rendering,
but am not sure.

------
afwaller
Bitcoin is a pretty reasonable story for how this occurs - first you start on
CPU, then GPU, then custom ASICs.

If the financial motivation is there, companies will build custom hardware.
Companies like Apple and Google (TPUs) are building custom processors for
various purposes.

~~~
TaylorAlexander
And if we can lower costs (through openness, for example), more companies
will build custom hardware.

------
mmmBacon
Here’s a companion video where Prof. Dally discusses some of the aspects
covered in the paper.

[https://vimeo.com/423287458](https://vimeo.com/423287458)

~~~
truth_seeker
"Typical CPU spends 99.98% of the energy on overhead and only about 0.02% in
doing actual work"

Oh man, this makes me want to go back to FPGA programming, which I did a
million years ago when I was in college.

------
stimz
I think the move away from closed-source development tools and licensing
models is the key for hardware accelerators.

We have tried to accelerate different domain-specific tasks with FPGAs, but
the development cost and effort keeps being too much compared to well-written,
smarter software.

Sure, there is a case for some very computationally intensive tasks, but I
think the really interesting part is when you can safely program these things
at a high level.

~~~
TaylorAlexander
Good example of the way proprietary technology can impede innovation. I'm glad
this field is opening up.

------
vsskanth
Is it possible to achieve any speedup using domain-specific hardware
accelerators for solving stiff (but linear) differential equations with a
large number of states?
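For a linear system x' = Ax, an implicit method like backward Euler is stable
even for stiff problems, and each step reduces to a linear solve -- exactly
the dense linear algebra that accelerators target. A minimal NumPy sketch
(the matrix and step size are illustrative assumptions, not from the thread):

```python
import numpy as np

def backward_euler(A, x0, h, steps):
    """Integrate x' = A @ x with backward Euler: (I - h*A) x_{n+1} = x_n.

    With a fixed step size the matrix (I - h*A) is constant, so it can be
    factored (or here, inverted) once and reused every step -- the per-step
    work is then a matrix-vector product an accelerator could pipeline.
    """
    n = len(x0)
    M = np.eye(n) - h * A
    M_inv = np.linalg.inv(M)  # in practice: LU-factor once, solve per step
    x = np.asarray(x0, dtype=float)
    out = [x]
    for _ in range(steps):
        x = M_inv @ x
        out.append(x)
    return np.array(out)
```

For a stiff eigenvalue like -1000, explicit Euler with h = 0.01 would blow
up, while the implicit step simply damps the state toward zero -- at the cost
of that linear solve, which is where hardware could plausibly help when the
state count is large.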

