
On FPGAs as PC Coprocessors (1996) - luu
http://fpgacpu.org/usenet/fpgas_as_pc_coprocessors.html
======
JoachimS
Good writeup by Jan Gray.

Looking at it from today, some things have changed, but the conclusion still
holds, at least for general-purpose computing.

The FPGA vendors (esp. Xilinx and Altera) have released devices where the
integration between a hard CPU core and the FPGA is much tighter than over a
standardised bus like PCI. This gives much lower latency. In these devices
the CPUs are normally ARM based.

The prevalence of GPUs as PCIe-connected coprocessors, together with OpenCL,
has also made FPGA-based computing over PCIe much easier.

One thing that has not improved, but rather worsened, is the difference in
clock frequency between CPU and FPGA. In Gray's posting he talked about a PPro
running at 200 MHz. Today we have cores running at 2-3 GHz while FPGAs are
hard to push beyond 200 MHz.
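
To put a number on that clock gap, here is a rough back-of-the-envelope sketch.
It assumes the CPU retires one useful operation per cycle, which is a
simplification (real cores are superscalar and have SIMD units), and the
function name is made up for illustration.

```python
def breakeven_ops_per_cycle(cpu_hz, fpga_hz, cpu_ops_per_cycle=1.0):
    """Operations the FPGA must finish per clock to match the CPU's throughput.

    cpu_ops_per_cycle=1.0 is an assumed, simplified figure, not a measurement.
    """
    return cpu_ops_per_cycle * cpu_hz / fpga_hz


# Clock figures from the comment above: 2-3 GHz cores, ~200 MHz FPGA designs.
for cpu_ghz in (2, 3):
    need = breakeven_ops_per_cycle(cpu_ghz * 1_000_000_000, 200_000_000)
    print(f"{cpu_ghz} GHz CPU vs 200 MHz FPGA: "
          f"{need:.0f} ops per FPGA cycle just to break even")
```

Under that assumption the FPGA design needs roughly 10 to 15 operations in
flight every cycle before it even matches a single core.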

For applications that are highly parallel, take a lot of cycles, and are not
suitable for GPU processing, FPGAs might be useful as a base for custom
coprocessors, even for x86, Power, or SPARC based servers. But otherwise
probably not.
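
A toy model of why the job has to be both parallel and long-running: the
offload only wins once the work is large enough to hide the fixed cost of
getting data to the FPGA and back. Everything here is an assumption for
illustration (the function name, the 1 ms overhead, the 100 ops per cycle),
not a measurement of any real device.

```python
def offload_pays_off(n_ops, cpu_hz, fpga_hz, fpga_ops_per_cycle, overhead_s):
    """Compare CPU-only time with FPGA time plus a fixed transfer/setup overhead.

    Returns (fpga_is_faster, cpu_seconds, fpga_seconds).
    Assumes the CPU does one operation per cycle (a simplification).
    """
    cpu_time = n_ops / cpu_hz
    fpga_time = overhead_s + n_ops / (fpga_hz * fpga_ops_per_cycle)
    return fpga_time < cpu_time, cpu_time, fpga_time


# Hypothetical numbers: 2 GHz CPU, 200 MHz FPGA doing 100 ops per cycle,
# 1 ms of fixed overhead per offloaded call.
for n_ops in (1_000, 1_000_000_000):
    wins, cpu_t, fpga_t = offload_pays_off(
        n_ops, 2_000_000_000, 200_000_000, 100, 1e-3)
    print(f"{n_ops:>13} ops: cpu={cpu_t:.6f}s fpga={fpga_t:.6f}s "
          f"offload wins: {wins}")
```

With these made-up numbers the small job loses to the overhead and the large
one wins comfortably, which is the "otherwise probably not" case in practice.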

For embedded systems FPGAs make more sense, but the cost of the device and its
power consumption might make them less attractive. The integrated SoC devices
from Xilinx and Altera can do cool things like media processing and SDR
competitively.

The Novena is an example of a modern system where the CPU has access to an
FPGA over a low-latency interface, which makes it useful as a coprocessor. But
the Novena is really not built to be cost efficient.

~~~
planteen
Zynq from Xilinx uses an AXI bus between the CPU and logic. The first gen ones
are not cache coherent between the logic and CPU but I believe UltraScale will
be. I don't know much about PCIe, but how is AXI much lower latency?

~~~
reisgabrieljoao
There is a cache coherent FPGA-CPU DMA interface in Zynq-7000: the Accelerator
Coherency Port (ACP).

------
jacquesm
To me the biggest viewpoint shift was moving the CPU onto the FPGA instead of
the other way around (which was the way we thought it would go for quite a
while).

Putting one or more ARM cores on an FPGA is mostly a matter of including some
lines in a specification file. Putting an FPGA onto a CPU requires a lot of
capital and minimum orders that are simply scary.

See also: soft-core vs hard-core FPGA CPUs.

~~~
nickpsecurity
That was a surprise. I thought they'd put FPGAs onto the CPUs, too, given
they were always trying to max interconnect speed and minimize latency.
However, we might still get our dream: Intel & Altera merger w/ Intel needing
a boost in datacenter performance. They'll do it eventually if they haven't
already.

------
tkinom
I can think of two apps that can leverage this kind of FPGA+CPU setup.

Heavy-duty image processing: gigapixel image processing, probably mainly for
defence-related apps like this:
[https://youtu.be/MVFeMH3ahtw?t=34m45s](https://youtu.be/MVFeMH3ahtw?t=34m45s)

Video processing: I used to work on one 10 years ago. We had a 1U system that
took IP video feeds in and provided video decoding/transcoding services. It
had 37 Virtex Pro FPGAs, each with a 450 MHz PPC processor inside. It was a
very fun project and made some good money for the startup too. We deployed a
lot of them at Comcast and other major cable companies, until some stupid VC
imposed their own CEO and pissed off the three MIT founders. I think a
transcoding system would probably still be useful for all the internet video
datacenter apps. An FPGA transcoding app could provide a real-time transcoding
service for YouTube, to many resolutions for the various phone screen sizes.

Facebook, YouTube, Amazon, and Netflix could probably use those
devices/services.

Can anyone else think of any good/interesting applications for an FPGA+CPU
setup?

~~~
sparkie
Another obvious example that can benefit from the setup is a crypto
coprocessor, whereby private keys are side-loaded onto the coprocessor by
means of a separate hardware bus, such that the key is never exposed to the
CPU or main memory, and the coprocessor can handle all encryption/decryption
with said keys.
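
A minimal sketch of what the host-facing side of such a coprocessor could look
like. The class and method names are hypothetical, HMAC-SHA256 from the Python
standard library stands in for "all encryption/decryption", and Python cannot
actually enforce the isolation; the point is only the shape of the interface:
the host refers to a key slot and never handles key bytes.

```python
import hashlib
import hmac


class KeyIsolatedCoprocessor:
    """Toy model of a coprocessor whose keys arrive over a separate bus."""

    def __init__(self):
        # Stands in for on-chip key storage that the host cannot address.
        self._slots = {}

    def sideload_key(self, slot, key):
        """Models the dedicated hardware bus.

        In real hardware the host CPU would have no way to drive this path.
        """
        self._slots[slot] = bytes(key)

    def mac(self, slot, message):
        """Host-facing operation: the host names a slot, never the key."""
        return hmac.new(self._slots[slot], message, hashlib.sha256).digest()


cop = KeyIsolatedCoprocessor()
cop.sideload_key(0, b"secret from the side bus")  # happens off the host path
tag = cop.mac(0, b"hello")                        # host only sees the result
print(tag.hex())
```

Note there is deliberately no method that returns a key; the host API only
exposes operations performed with a key.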

~~~
tkinom
An FPGA+CPU type setup is not cheap to develop or deploy.

Crypto on a modern CPU such as a Pentium works well. Also, almost all ARM
SoCs already come with a crypto coprocessor with private memory that's not
shared with the main CPU.

Not too sure about the value add there.

~~~
sparkie
The value is not about performance, but security. Crypto done by modern CPUs
is susceptible to many side-channel attacks. Having a separate SoC is nice,
but the SoC is not upgradable with new algorithms or patches if there are any
problems found in its implementation.

Also, just having a separate memory space for the purpose of computation is
not sufficient. I'm arguing for an entirely separate hardware bus to load keys
onto the chip, such that they never exist in CPU, memory, or the storage for
the machine being used for general purpose computation, because keys there can
be obtained via side-channels if other software (such as the kernel) is
exploited.

------
antome
I think that with architecture design becoming increasingly important in
processor performance, and on the heels of Intel-Altera, CPUs or GPUs may
gradually add "FPGA-like" aspects, where the configuration of the silicon can
be dynamically shifted in order to achieve different tasks.

Personally, I would love to see a day when FPGA coprocessors are a thing, and
FPGA companies are big enough to make it happen.

~~~
wfunction
They are apparently already a thing (Xeon processors with FPGAs seem to have
come out a while ago).

~~~
listic
Xeon processors with FPGAs were _announced_ over a year ago [1] (June 2014),
and we haven't heard any details or exact dates since then.

[1] [http://www.extremetech.com/extreme/184828-intel-unveils-new-...](http://www.extremetech.com/extreme/184828-intel-unveils-new-xeon-chip-with-integrated-fpga-touts-20x-performance-boost)

------
fithisux
It is a good idea as long as: a. an openly standardized direct-to-PC bus is
used; b. an openly standardized interface to the co-processor is used (not
utilizing blobs); c. possibly, CPU and special instructions can be mixed
transparently to the user.

~~~
hamiltonkibbe
Both Xilinx and Altera are using the ARM AXI protocol for the FPGA-CPU
interface in their current SoCs.

------
enos_feedler
A company called Xtremedata Inc used to make an FPGA coprocessor module that
plugged into an x86 socket using AMD's HyperTransport bus. I thought it was
cool enough to do my MASc thesis on some potential applications for it:

[http://www.hypertransport.org/docs/wp/FPGA_Acceleration_in_H...](http://www.hypertransport.org/docs/wp/FPGA_Acceleration_in_HPC_Nov06.pdf)

------
mrmagooey
There have been a few projects to generate VHDL/Verilog using an LLVM backend;
a quick Google search brings up some interesting presentations. I imagine
constraints similar to those the linked post brings up are why these aren't
more mainstream.

~~~
sklogic
Generating HDL out of C (or OpenCL) is a very bad idea anyway.

~~~
kbeckmann
How come? I guess it would be quite fun if you could write your algorithm with
well-defined input/output in a high level language and generate HDL out of it.
Let's say you just want to crunch numbers. Today you can use CUDA for that
task, but what if you only use a subset of the CUDA features? Maybe the
compiler could use the gates in the FPGA more efficiently than the already
defined GPU.

~~~
sklogic
In a high level language? Absolutely. In C? No, thanks. There is a huge
semantic mismatch with the highly parallel nature of FPGAs.

A high level language suitable for HDL generation must expose much better
abstractions for parallelism. CUDA is just the same thing as C, too bound to
the underlying architecture.

~~~
hamiltonkibbe
I think a high level HDL would be great. C really isn't a good starting point
because as you said it doesn't really fit the parallel hardware paradigm. The
existing tools that do this feel like they're trying to shoehorn C into this
application and it doesn't work too well, because of all the extra information
you need to synthesize hardware, e.g. should every iteration of this loop
happen simultaneously, or are you making a shift register?
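
One way to read that loop question, as a rough Python model rather than real
HDL: the same loop can become N multipliers working in a single clock, or one
shared multiplier feeding a shift register over N clocks. The cycle and
multiplier counts below are a toy cost model assumed for illustration, not
synthesis results.

```python
def unrolled(xs, coeffs):
    """Every iteration at once: one multiplier per element, one clock."""
    products = [x * c for x, c in zip(xs, coeffs)]
    return products, {"multipliers": len(xs), "cycles": 1}


def time_multiplexed(xs, coeffs):
    """One iteration per clock: a single multiplier, results shifted in."""
    shift_reg = [0] * len(xs)
    cycles = 0
    for x, c in zip(xs, coeffs):
        # Each clock: drop the oldest slot, shift, append the newest product.
        shift_reg = shift_reg[1:] + [x * c]
        cycles += 1
    return shift_reg, {"multipliers": 1, "cycles": cycles}


xs, coeffs = [1, 2, 3], [4, 5, 6]
print(unrolled(xs, coeffs))
print(time_multiplexed(xs, coeffs))
```

Both compute the same values; the C source gives no hint which trade-off
between area and time was intended, which is the extra information the tools
have to get from somewhere.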

Something that started with a Verilog/VHDL paradigm but provided higher-level
abstractions would be nice; a good analogy would be something like how Python
is to C.

~~~
sklogic
This is actually something I am currently working on. An extensible HDL that
lets you gradually add abstraction layers (and operate solely on any chosen
level), while keeping the ability to express anything down to gates.

~~~
hamiltonkibbe
I'd definitely be interested in checking that out if you have a link.

~~~
sklogic
I have not published the language yet; I have to produce some nice-looking
examples first. But you can take a look at my approach to extensibility in
general (the HDL is using exactly the same thing) and some of my earlier
experiments mixing Verilog with C here:
[https://github.com/combinatorylogic](https://github.com/combinatorylogic)

------
cowardlydragon
If Intel is desperately looking for stuff to put on silicon besides yet
another core, and they've already "gone there" for graphics, why not an FPGA?

How far are we from a gigabyte of on-die cache, anyway?

