
Why Use an FPGA Instead of a CPU or GPU? - ScottWRobinson
https://blog.esciencecenter.nl/why-use-an-fpga-instead-of-a-cpu-or-gpu-b234cd4f309c
======
seliopou
One clarification on the comment about latency. FPGAs are typically clocked
much slower than a modern CPU. Typically, they run somewhere in the low 100s
of MHz, whereas an Intel CPU clocks in at around 3GHz last time I went to the
Apple Store. With a typical x86 multiply instruction having a latency of
about, say, 3 cycles, putting that workload on an FPGA would result in a ~10x
slow-down!

The real benefit of an FPGA is that you get to decide what happens in any
given cycle. So rather than being able to multiply two numbers in a single
cycle like on x86, you could make your FPGA design do, say, 20 multiplications
in a single cycle (space allowing). Which means that you can now multiply 20
numbers in 1/10th of the time it would take on x86. (In reality I think you
have something like four execution units capable of performing parallel ALU
operations per cycle, depending on the family.)

So the latency benefit of an FPGA comes from flexible, almost (almost)
unbounded potential for parallelism in a given cycle, not clock frequency.
Hardware has to be designed to exploit this potential, otherwise it's not
going to see any improvement in latency.
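
To make the arithmetic concrete, here's a toy back-of-the-envelope model in Python. Every clock rate, latency, and issue width below is an illustrative assumption, not a measurement of any real part:

```python
# Toy throughput model; all numbers are illustrative assumptions,
# not measurements of any real CPU or FPGA.

def time_for_multiplies(n_mults, clock_hz, mults_per_cycle, latency_cycles):
    """Seconds to complete n_mults: cycles needed to issue them all,
    plus the latency of the final result."""
    issue_cycles = -(-n_mults // mults_per_cycle)  # ceiling division
    return (issue_cycles + latency_cycles - 1) / clock_hz

cpu  = time_for_multiplies(20, 3e9,   1,  3)  # ~3 GHz, 1 mult/cycle
fpga = time_for_multiplies(20, 300e6, 20, 1)  # ~300 MHz, 20 mults/cycle

print(f"CPU : {cpu * 1e9:.2f} ns")   # 22 cycles at 3 GHz
print(f"FPGA: {fpga * 1e9:.2f} ns")  # 1 cycle at 300 MHz
```

With these particular numbers the FPGA's 20-wide issue more than pays for its 10x clock deficit; change the assumptions and the winner can flip, which is exactly the point about having to design the hardware to exploit the parallelism.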

Anyways, just something that is maybe obvious once you're told it, but isn't
always mentioned in discussions like this. It's certainly something that I
didn't fully appreciate before getting involved in hardware design.

~~~
fpgaminer
Another thing is that you can pipeline, for example, multiplies. So in a CPU
multiply you give the CPU inputs, wait a couple cycles, and then get the
result. In an FPGA you can build a pipelined multiply. It's built such that
you can feed it input every cycle and get an output every cycle. The only
caveat is that the outputs are delayed relative to the inputs. i.e. you may
give it (2, 3) to multiply on one cycle, but you won't see the result (6) of
that particular input on the output until a couple cycles later.

[Yup, "pipeline" here is the same term used to describe how modern CPUs get
their performance. They, too, are pipelining their instruction execution so
that many instructions can be in the process of executing at the same time.
Though what I describe is a more extreme and specific kind of pipelining.]

This is kind of like having a bunch of multiplies in parallel. But it's useful
for, for example, real-time calculations. You get to perform all the
calculations you need every single cycle, no matter how complex; your results
are just delayed by X cycles. Pipelines are usually also more efficient than
straight parallelism (i.e. an 8 deep pipelined multiplier uses less silicon
than 8 individual "serial" multiplier units).
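
A toy software model of that behavior (the 3-cycle depth is an arbitrary assumption; real multiplier pipelines vary by device family):

```python
from collections import deque

class PipelinedMultiplier:
    """Toy model of a fully pipelined multiplier: accepts a new
    operand pair every cycle and emits one product every cycle,
    delayed by `depth` cycles (None while the pipe is filling)."""

    def __init__(self, depth):
        self.stages = deque([None] * depth)

    def clock(self, a, b):
        self.stages.append(a * b)     # new result enters the pipe
        return self.stages.popleft()  # oldest result leaves it

m = PipelinedMultiplier(depth=3)
pairs = [(2, 3), (4, 5), (6, 7), (1, 1), (1, 1), (1, 1)]
outputs = [m.clock(a, b) for a, b in pairs]
print(outputs)  # [None, None, None, 6, 20, 42]
```

Note how the (2, 3) pair fed in on cycle 1 doesn't produce its 6 until cycle 4, yet the unit still accepts (and eventually answers) one pair per cycle.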

Another interesting thing: In a previous life I built an FPGA based video
processing device. It sat in an HDMI chain, so it had to be real-time. If that
had been built with a GPU, most engineers would build the system to buffer up
a frame, perform the processing, and then feed the processed frame out. That
results in at least 1 frame of delay. In contrast, because we used an FPGA, it
was simple to just pipeline the entire design and thus only needed to buffer
up the few lines that we needed. This meant A) we needed no external memory
(cheaper) and B) our latency was on the order of microseconds. In my travels with
that job I ran into tons of other companies building video processing devices.
They _all_ used frame buffers, which made their devices unacceptable for,
e.g., gaming.

~~~
ignoramous
Thanks.

Just to clarify whilst I have your attention: isn't it a common practice to
pipeline frame buffers too? For instance, Android has three frame buffers.
IIRC, one is used by user space for handing off display lists, one for
composition, and one by the driver for rasterization?

In that case, how good do you reckon the perf would be compared to an FPGA?

~~~
8note
By straight-up having a frame buffer vs. having a couple of lines buffered,
I'd imagine the FPGA still wins, except now the GPU needs even more memory to
do the same task.

------
ChuckMcM
The author isn't aware apparently of over a decade of work done in FPGA/CPU
integration. Both in more sequential languages like System C[1] or in extended
instruction set computing like the Stretch[2]. Not to mention the Zynq[3]
series where the CPU is right there next to the FPGA fabric.

For "classic" CPUs (aka x86 cpus with a proprietary frontside bus and
southbridge bus) it can be challenging to work an FPGA into the flow. But the
problem was pretty clear in 2007 in this Altera whitepaper[4] where they
postulate that not keeping up with Moore's law means you probably end up
specializing the hardware for different workloads.

The tricky bit on these systems is how much effort/time it takes to move data
and context between the "main" CPU and the FPGA and then back again (Amdahl's
law basically). When the cost of moving the data around is less than the
scaling benefit of designing a circuit to do the specific computation, then
adding an FPGA is a winning move.
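
That break-even point can be sketched with a trivial model (the link bandwidth and kernel timings below are made-up illustrative numbers):

```python
# Break-even sketch: offloading to the FPGA only wins when the
# compute saving exceeds the round-trip data-movement cost.

def offload_wins(data_bytes, link_bytes_per_s, cpu_s, fpga_s):
    transfer_s = 2 * data_bytes / link_bytes_per_s  # to the FPGA and back
    return fpga_s + transfer_s < cpu_s

# 100 MB over an ~8 GB/s PCIe-class link costs ~25 ms round trip.
print(offload_wins(100e6, 8e9, cpu_s=50e-3, fpga_s=5e-3))   # True: 30 ms < 50 ms
print(offload_wins(100e6, 8e9, cpu_s=50e-3, fpga_s=30e-3))  # False: 55 ms > 50 ms
```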

[1] [https://www.xilinx.com/products/design-tools/vivado/prod-advantage/rtl-synthesize.html](https://www.xilinx.com/products/design-tools/vivado/prod-advantage/rtl-synthesize.html)

[2] [http://www.stretchinc.com/index.php](http://www.stretchinc.com/index.php)

[3] [https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html](https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html)

[4] [https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01029.pdf](https://www.intel.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01029.pdf)

~~~
blacksmythe

> The tricky bit on these systems is how much effort/time it takes to move
> data and context between the "main" CPU and the FPGA and then back again

The sweet spot for FPGAs is when the data is streaming. You can build hardware
where the data is passed from one stage to another, and doesn't go to RAM at
all. Particularly where the data is coming in fast, but with low resolution
(huge FPGA advantage for fixed point instead of floating point).

FPGAs are much more efficient for signal processing from data collected from
an A/D converter (e.g. software defined radio) or initial video processing
from an image sensor.

~~~
tomnipotent
> The sweet spot for FPGAs is when the data is streaming

Netezza does this with their hardware. I've been dreaming of the day I could
DIY and put an OSS RDBMS on top.

------
antoineMoPa
"Intel does offer an emulator, so testing for correctness does not require
this long step, but determining and optimizing performance does require these
overnight compile phases."

In my experience with FPGAs, the emulation step is not enough to test for
correctness. Most of the problems happen while synchronizing signals with
inputs/outputs, and with other weird timing problems, glitches, and
unintuitive behavior that FPGAs provide (and the emulator behaves
differently). I was using VHDL, however. Does anybody have experience with
OpenCL on FPGAs who can explain what difficulties persist (are the timing
problems easier to solve)?

~~~
shriver
There are a number of real issues with the emulation approach even now.
Firstly, emulation isn't accurate: if you do floating point math in your
application it will give you different results (within the tolerances of the
OpenCL spec) on FPGA vs. CPU. So you can't test for correctness in the
emulator.

A second more serious issue is that getting performance that justifies using
an FPGA requires tuning very carefully to the architecture. This may mean
adopting pipeline architectures that destroy emulator performance (there are
still issues with the emulator identifying that the design patterns for shift
registers in FPGA look like pointer fun on CPU rather than mammoth memcpys).
So for a huge part of the design stage the emulator is basically useless,
because it tells you nothing about what you care about, since the performance
of the emulator is often negatively correlated with performance on FPGA. This
is made worse if you're doing hardcore FPGA tricks like mixed precision
arithmetic.

As you say though, the fact that timing is lost in the emulator also means
you don't get a true idea of whether you have buffer overflows, etc., or
lock-up. Adding debug into the actual design impacts the implementation on
FPGA in a way that it doesn't for software, and is sometimes unintuitive.

~~~
pjc50
> if you do floating point math in your application it will give you different
> results

That seems like a gross weakness in the emulator; floating point isn't
actually nondeterministic!

~~~
shriver
Actually it's not as clear cut as you'd expect. Obviously you can't represent
every number in floating point, so you have to choose a way to round numbers -
and for simple operations like add you can correctly round the results. For
transcendental operations like x^y it's actually unknown how many resources
you'd need to correctly calculate x^y for every valid value of x and y[1]. So
since you can't calculate these numbers to correct rounding, you have to
choose a level of rounding for your approximation - like 3 units of least
precision rounding at the output. Of course we all need to know how accurate
these are - so the OpenCL standard specifies it[2] - exp requires being
correct to 3ULPs for single/double and 2ULPs for half.

Now if you have 3 ULP to play with, the maker of an Intel CPU is going to
design an exp instruction to best make use of the existing Intel functional
units. But an Intel FPGA dev is going to design an exp instruction to best
make use of Lookup Tables and 18x18 multiplies - because that's what they have
on the FPGA.

So whilst you'll get the same answer for x^y on Intel CPU and Intel FPGA
within 3 ULPs those rounding errors are going to be different between the two
architectures. So now, if you want to compute a normal distribution on Intel
FPGA vs CPU you'll get 3 ULPs in your exponent, but that'll carry forward into
the rest of the equation.
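
The carry-forward effect is easy to demonstrate in plain Python (needs 3.9+ for `math.ulp`; the 3-ULP perturbation below stands in for a hypothetical second, equally spec-compliant exp implementation):

```python
import math  # math.ulp requires Python 3.9+

x = 1.5
ref = math.exp(x)              # one compliant exp implementation's answer
alt = ref + 3 * math.ulp(ref)  # a hypothetical second implementation,
                               # 3 ULPs away yet still within spec

# Carry both through a later stage of the computation:
f_ref = 1.0 / (1.0 + ref)
f_alt = 1.0 / (1.0 + alt)

print(f_ref == f_alt)                        # False: the error propagated
print(abs(f_ref - f_alt) / math.ulp(f_ref))  # downstream error, in ULPs
```

Even though both exp results are individually compliant, everything computed from them now disagrees too.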

So now you have a choice - do you use the built-in function for exp on the
Intel CPU - which is OpenCL compliant just like the FPGA, and get unknown
rounding errors in what is probably a mathematically sensitive task, or do you
emulate the actual sub-operations the FPGA does? In which case your hardy RTL
designer who wrote that exponent function RTL is going to have to write an
implementation in C that emulates the hardware. Oh and they don't only have to
do that for exp - they have to do that for 100s of mathematical functions, and
it'll run dog slow on the CPU compared to using the native functions.

[1] [https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma](https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma)

[2][https://www.khronos.org/registry/OpenCL/specs/opencl-2.1-env...](https://www.khronos.org/registry/OpenCL/specs/opencl-2.1-environment.pdf)

~~~
kjeetgill
> do you emulate the actual sub-operations the FPGA does?

Yes. It's an emulator.

> ... is going to have to write an implementation in C that emulates the
> hardware.

Makes sense. It's an emulator.

> ... and it'll run dog slow on the CPU compared to using the native
> functions.

Isn't that to be expected? It's an emulator. This isn't like games where it
just has to look close. If it's a dev tool for testing correctness, exactness
matters.

~~~
shriver
Well, last time I looked, the answer to the first question for the Intel
OpenCL SDK is actually no.

And while yes, it's expected to be slow compared to the native functions,
that's not the problem. It's slow compared to simulation.

------
neuromantik8086
> The HPC community is already used to GPUs — getting people to switch from
> GPUs to FPGAs requires larger benefits.

It's worth pointing out that some scientists haven't even made the leap to
GPGPU computing yet and are still relying upon OpenMP / multithreading on
general purpose CPUs, even when a GPU would be clearly superior. Anecdotally,
I remember hearing that climate science simulations are particularly bad about
taking advantage of new architectures for speedups, an assertion which
possibly is supported by the fact that climateprediction.net simulations took
several days using all 8 cores on my laptop to finish.

Neuroimaging is also pretty bad, since other than Broccoli [0], most toolkits
for preparing analyses don't leverage GPUs, even though I'd wager 80% of the
steps involve image manipulations / linear algebra.

[0]
[https://github.com/wanderine/BROCCOLI](https://github.com/wanderine/BROCCOLI)

~~~
cjhanks
In my experience, it's true. And it's unlikely to change - the simulation code
was largely written in FORTRAN and C over a period of 50 years.

This is in fact what makes Intel Phi so appealing to some.

~~~
nl
For C/Fortran using LAPACK it _should_ be as simple as relinking.
[https://www.olcf.ornl.gov/tutorials/gpu-accelerated-libraries/](https://www.olcf.ornl.gov/tutorials/gpu-accelerated-libraries/)

------
godber
One interesting use case for FPGAs is where hardware qualification is very
expensive, for instance for use in space. If a given FPGA is already space
qualified, you just need to load new code onto it and you can skip the
expensive qualification step for your new application. You can also
consolidate functionality of multiple chips into that one qualified FPGA.

~~~
irq-1
The expense for space is producing rad-hard parts. (The designs are done on a
standard FPGA, then a few rad-hard chips are made.)

[https://en.wikipedia.org/wiki/Radiation_hardening](https://en.wikipedia.org/wiki/Radiation_hardening)

------
Firerouge
I've always thought FPGAs would be perfect to have hardware backed video
decoding/encoding that could adapt to new codecs (like vp9) while also being
updatable for new performance improving discoveries.

It also seemed like it would go well with a generic radio subsystem, so you
could compile hardware support for new wireless standards that come out after
your hardware did (essentially an fpga sdr).

It seems like there could be lots of uses for being able to on demand enable
hardware acceleration for certain tasks a program might need lots of.

~~~
mirashii
FPGAs are indeed commonly used for SDR as well. Ettus has a range of USRPs
with FPGAs on-board, and you can find readily available IP cores for anything
from Bluetooth to DVB-S.

~~~
Firerouge
That's partially where I got my inspiration.

Seems like if integrated properly it could be a great component in a general
purpose PC to future proof it against changing needs and requirements.

------
stevesimmons
Does anyone have experience of Reconfigure.io [1] and their toolchain that
transpiles Go for execution on FPGAs [2]? I am curious how it feels from the
developer perspective, how it works in practice, and what types of
applications it is a good fit for.

[1] [https://reconfigure.io](https://reconfigure.io)

[2]
[http://docs.reconfigure.io/overview.html](http://docs.reconfigure.io/overview.html)

~~~
zymhan
Unless you have optimized your code as much as possible to run on x86 or GPU
and still find that you need a prohibitively expensive amount of processing
power to accomplish the task, then and only then is an FPGA worth it.

It is literally a DIY CPU architecture. So, unless you know exactly what it is
about your current CPU's architecture that is holding you back, you won't be
able to benefit from an FPGA-based design.

------
petra
I doubt that programming efficiency is what's holding back FPGAs for general
compute.

Why? Because we've seen decades of research in this area, so at least we have
some tools (C for FPGA isn't ideal, but still...), and a lot of the general
mapping between algorithms and FPGAs is understood.

And Amazon's FPGA instances have existed for over a year.

So in that time, if there were worthwhile services to offer with FPGAs,
people would have offered them, or at least started. And sure, programming
complexity is a barrier, but entrepreneurs and VCs love competitive barriers.
So where's all the VC funding going towards this area? Where are all those
startups?

~~~
seliopou
I disagree. FPGA design remains a highly-specialized area. Tools such as
VivadoHLS, and other high-level synthesis tools, do provide some improvement
in productivity, though with inherent tradeoffs in design quality. There have
been a lot of new hardware construction languages popping up recently (e.g.,
HardCaml, which I happen to use, Chisel, Migen, PyMTL, etc.). They may bring
something to the table in the coming years, but that's yet to be determined.

HardCaml's pretty great though, IMHO. But then again I'm already biased
towards OCaml, and I get to work with its creator!

------
jaxtellerSoA
Aren't FPGAs used mostly to test/design a circuit that you would then go on
to actually fabricate/build? I could be wrong, but I thought FPGAs were
stateless (meaning if they power off/reboot you lose everything and have to
set it up from scratch again).

~~~
LeifCarrotson
That is incorrect, or at least inaccurate, on both counts.

FPGAs are often used to test/design not a "circuit" but an ASIC (application-
specific-integrated-circuit) that you will then go on to actually build. Or,
if your application doesn't have the volume to support the ASIC engineering costs
but can support the FPGA unit cost, you just leave it as an FPGA. There are
millions of devices (industrial machines, research, testing, etc) where this
makes sense.

And not all FPGAs lose state when powered down. For those that do, there's
typically an easy way to hook up a little flash chip and bootloader to reload
the state automatically on reset. For those that don't have that option (e.g.,
I'm currently working with an ABB robot, a Mecco laser, and a Rockwell PLC,
which all contain FPGAs; only the Mecco can boot itself), you store state in
battery-backed SRAM and just leave the RAM on at all times.

~~~
joemi
> FPGAs are often used to test/design not a "circuit" but an ASIC
> (application-specific-integrated-circuit)

To be pedantic, an ASIC (and any IC really) is a circuit, as denoted by the
"C" standing for "circuit".

------
mladmon
"...programming FPGAs is still an order of magnitude more difficult than
programming instruction based systems." True statement. If you're offloading
computation from a CPU, profile your application first, and determine if the
FPGA board's architecture will allow you to improve performance. Don't
underestimate the cost of communicating data to/from the FPGA/CPU.

------
rini17
Is there high performance FPGA that does not depend on proprietary bloated
windows-only toolchain?

~~~
radarsat1
Proprietary and bloated, yes, but I believe both Xilinx and Altera have
software available for Linux, or am I wrong? At least I used Altera's suite on
Linux once for a pilot project.

~~~
jeffreyrogers
I have used Xilinx's toolchain on Linux, though it was hard to get working.
This was a few years ago.

~~~
slededit
They support Ubuntu now. Setup was just as easy as windows for me.

------
tzs
For those wanting to play around with an FPGA, this looks quite interesting:
[https://www.sparkfun.com/products/14829](https://www.sparkfun.com/products/14829)

It's about half the price of the other FPGAs I've seen commonly used for
hobbyist projects [1], doesn't need any separate programmer board to interface
to your computer for programming, and has an open source tool chain available.

I've not played with it, as it just showed up in a mailing from Sparkfun about
new products.

[1] such as
[https://www.sparkfun.com/products/11953](https://www.sparkfun.com/products/11953)

------
rwmj
I'm surprised he doesn't talk much about direct cost (rather than engineering
cost). A decent FPGA developer kit will cost several thousands of US$, more
expensive even than high end GPUs.

~~~
lightbyte
You can buy this dev board to play around with and learn FPGAs for $75, and
it has extensive documentation and resources to help you:

[https://www.sparkfun.com/products/11953](https://www.sparkfun.com/products/11953)

~~~
rwmj
But that's not a high performance FPGA. Believe me, I have a ton of FPGAs at
home (starting even cheaper than that), and I write about them all the time:
[https://rwmj.wordpress.com/?s=fpga](https://rwmj.wordpress.com/?s=fpga)

However if you're doing the sort of work which isn't just hobbyist stuff,
you're going to spend 1000s on a board. This is the sort of thing commercial
developers will be using:
[https://www.xilinx.com/products/boards-and-kits/ek-v7-vc707-g.html](https://www.xilinx.com/products/boards-and-kits/ek-v7-vc707-g.html)

~~~
lightbyte
If you're building a commercial product, a 3k dev board sounds incredibly
cheap. My company spends orders of magnitude more on licenses for an IDE
extension (ReSharper).

~~~
rwmj
Absolutely. However I guess you'll also want to deploy your FPGA application
at some point, which means buying FPGAs for all your servers too. So the cost
is more analogous to the cost of a CPU/GPU than to the cost of an
IDE/developer license.

------
dojopico
Another FPGA sweet spot is analyzing TB/PB-scale databases. Netezza programmed
FPGAs to uncompress, project, restrict (including NULLs), enforce isolation &
visibility and check CRCs at disk scan rates (>200MB/sec). When touching a
handful of columns from a 20-200 column table, CPUs spend most of their time
stalled on cache line misses. FPGA-projecting just those few columns into
memory enormously increases cache hit rates.
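
The cache argument is just arithmetic. A sketch with made-up but plausible-shaped numbers (not Netezza internals):

```python
# Bytes dragged through the cache hierarchy; table shape and column
# widths are illustrative assumptions, not Netezza specifics.

rows          = 100_000_000
total_cols    = 100   # a wide 20-200 column table
touched_cols  = 5     # columns the query actually needs
bytes_per_col = 8

row_store = rows * total_cols * bytes_per_col    # scan full rows
projected = rows * touched_cols * bytes_per_col  # scan only projected columns

print(f"full rows : {row_store / 1e9:.0f} GB")   # 80 GB
print(f"projected : {projected / 1e9:.0f} GB")   # 4 GB
print(f"reduction : {row_store // projected}x")  # 20x
```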

~~~
paulsutter
Plain C code can readily scan at a couple of GB/sec on an ordinary Intel
core, and NVMe drives can transfer 8 or 16 GB/sec.

------
saagarjha
Is it just me, or does this article seem like it's trying really hard to push
Intel? For an article about FPGAs, failing to acknowledge the other (major)
player in the market (there's no mention of Xilinx anywhere in the article,
while plenty of Intel/Altera) seems disingenuous. Really, the tone just
seemed to stick out to me as a subtle advertisement piece; maybe it's just
me?

------
make3
Is there a pipeline to compile TensorFlow computation graphs to FPGA? Seems
like this is one possible benefit of TensorFlow's fixed graphs, that they may
be much easier to compile to FPGA than other options.

Edit: Seems like some experimental work is being done with XLA (and LLVM):
[https://www.google.ca/amp/s/www.nextplatform.com/2018/07/24/clearing-the-tensorflow-to-fpga-path/amp/](https://www.google.ca/amp/s/www.nextplatform.com/2018/07/24/clearing-the-tensorflow-to-fpga-path/amp/)

------
hyperpallium
How long does it take to "program" a FPGA? i.e. compared with typing a
command, and having it load into memory.

What about a JIT that compiles hot code sections to FPGA? (Though compilation
sounds too slow)

EDIT "one or two seconds or at least 100's of milliseconds."
[https://electronics.stackexchange.com/questions/212069/how-long-to-program-a-fpga-seconds-microseconds-less](https://electronics.stackexchange.com/questions/212069/how-long-to-program-a-fpga-seconds-microseconds-less)

------
floatboth
> An exception to the rule that GPUs require a host is the NVidia Jetson, but
> this is not a high-end GPU

Well, the host is just on the same die, but it's still there

------
amelius
One reason could be to meet real-time constraints or to get predictable
timing. This is becoming more and more difficult to do with modern CPUs for
all sorts of reasons.

------
jwbensley
"A couple of FPGAs in mid-air (probably)"

------
faragon
I would love to see projects like MAME and MESS targeting FPGA hardware for
near-perfect hardware emulation of classic hardware.

~~~
IvanGoneKrazy
The Analogue Super Nt project uses an FPGA to run SNES cartridges unemulated.

[https://www.analogue.co/pages/super-nt/](https://www.analogue.co/pages/super-nt/)

~~~
faragon
But it is not open source. There are more commercial products using FPGAs
for emulating gaming console hardware.

------
foobaw
What are some cool jobs for FPGA experts besides Intel and the military? It
seems like there's not much availability.

~~~
kurtisc
I recently spoke to someone at Broadcom who was moving networking processes
from software to hardware using them; try looking into similar companies.

------
exabrial
Why does the compilation take hours? Is this an NP-hard problem?

~~~
mmirate
If my memory of a Processor Design course is correct, arranging the desired
functionality onto the finite resources of an actual FPGA chip, let alone in
such a way as to get good performance (i.e. have critical paths be as short as
possible), reduces to bin-packing.
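
For a flavor of the problem, here's the classic first-fit-decreasing heuristic for bin packing in Python. Real place-and-route is much harder still (it also has to minimize routing delay on critical paths), so treat this as a toy:

```python
# First-fit-decreasing: a classic heuristic for bin packing, the
# NP-hard flavor of problem behind FPGA placement. "Blocks" of
# logic must fit into fixed-capacity regions of the chip.

def first_fit_decreasing(block_sizes, capacity):
    bins = []  # each bin is a list of placed block sizes
    for size in sorted(block_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # no room anywhere: open a new region
    return bins

print(first_fit_decreasing([7, 5, 4, 4, 3, 2, 2, 1], 10))
# [[7, 3], [5, 4, 1], [4, 2, 2]]
```

Heuristics like this run fast but can waste capacity; the hours-long FPGA compiles come from the tools searching much harder than this for a placement that also closes timing.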

~~~
esfandia
The answer might be obvious, but I'm not a hardware guy: couldn't they use an
FPGA to do the compilation itself, or at least the bin-packing part? Is there
a way to code a general SAT solver on an FPGA?

------
nastypasty
What's that for?

