Ok, this is insanely cool. This is the kind of thing I could get lost in week after week, figuring out new ways to use this sort of technology. About 8-10 years ago, when I talked with the Stretch guys (they did dynamic compilation into FPGA "hard" logic and computational "soft" logic), I wondered whether anyone was thinking about a general-purpose approach. An argument I've had off and on with various Intel and AMD account managers has been: "What if my app doesn't use the FPU, can I turn it off entirely? Then what is my TDP? Could you give me a switch so that I could add another compute core if I didn't use the FPU? Or GPU?" Their response was Larrabee and the APU work AMD/ATI has done: "lightweight" reconfigurable cores for either graphics-type operations or integer operations.
Here is also a much clearer explanation taken from that link:
BORPH is an extended Linux kernel that treats FPGA resources as native computational resources on reconfigurable computers such as BEE2. As such, it is more than just a way to configure an FPGA. It also provides integral operating system support for FPGA designs, such as the ability for an FPGA design to read/write to the standard Linux file system.
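To make that concrete, here is a rough sketch of how a user-space program might talk to a hardware process through the file system. The /proc layout and register name below are assumptions based on the BORPH documentation, not something to copy verbatim.

    # Hedged sketch: talking to a BORPH hardware process through the file
    # system. The /proc path and register name are illustrative assumptions.
    import struct

    pid = 1234                                    # pid of the hardware process (example)
    reg = "/proc/%d/hw/ioreg/acc_len" % pid       # assumed register file

    with open(reg, "r+b", buffering=0) as f:
        f.write(struct.pack("<I", 1024))          # write a 32-bit value into the FPGA
        f.seek(0)
        print(struct.unpack("<I", f.read(4))[0])  # read it back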
If it supports partial reconfiguration (which it looks like it does), then it could be a very handy tool. Why? Hardware is significantly faster than software. While Linux is running, the ability to spawn hardware at will would be great for many applications.
At UCSB, some of the research I did related to this very problem. What I was trying to do was run a Linux web server on an FPGA that could dynamically reconfigure itself for different experiments. I ended up choosing a board whose FPGA communicates with an ARM processor. Decent-size FPGAs were (and still are) expensive. Given the Linux overhead and my budget, a hybrid FPGA-processor platform turned out to be the better solution. If anyone is interested, here is the problem we were solving: http://ece.ucsb.edu/academics/undergrad/capstone/presentatio...
Think of it this way. At the basic level, you have logic gates. CPUs are massive ensembles of these which run your program mostly sequentially. GPUs/GPGPUs are smaller ensembles that can be configured better for specific tasks, resulting in a better performance/power ratio. At the other end of the scale is using HDLs to program the gates directly for the specific task at hand, which offers the best performance/power ratio; the development process is, however, more involved. In between the last two is reconfigurable computing.
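To make the HDL end of that spectrum concrete, here is a minimal sketch in MyHDL (one of the Python-based HDL tools); the block describes registers and logic directly rather than instructions for a CPU, and the signal widths are just examples.

    # Minimal MyHDL sketch of a multiply-accumulate block: on every clock
    # edge the product a*b is folded into acc. This describes hardware,
    # not sequential software.
    from myhdl import block, always_seq, Signal, ResetSignal, modbv

    @block
    def mac(clk, reset, a, b, acc):
        @always_seq(clk.posedge, reset=reset)
        def logic():
            acc.next = acc + a * b
        return logic

    # Example instantiation (widths are arbitrary); the result can be
    # simulated or converted to Verilog/VHDL for synthesis.
    clk, reset = Signal(bool(0)), ResetSignal(0, active=1, isasync=False)
    a, b = Signal(modbv(0)[16:]), Signal(modbv(0)[16:])
    acc = Signal(modbv(0)[32:])
    dut = mac(clk, reset, a, b, acc)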
A MicroBlaze is just a regular CPU instantiated on an FPGA, most likely controlling other logic around it designed in an HDL -- the HDL portions taking over compute-intensive tasks while the CPU handles (relatively) low-speed control logic. A reconfigurable computing environment would try to make this more symmetric.
In my opinion, there is a continuum between more complex units (CPUs) that have a large number of logic gates and offer lots of features, and less complex units (logic gates directly) that offer very limited functionality. At the end of the day, it is all about finding the right architecture for this "unit", which may vary from application to application.
How do these compare to GPUs? A kind-of client is considering using this or something similar to process radio data (calculating correlations, I assume), and I'm curious what chance I have of selling him a GPU-based solution instead.
This really depends on the application, but in general FPGAs are a good choice for small- to medium-volume software-defined radio applications, including building correlators and beamformers for radio telescopes (casper.berkeley.edu). The CASPER tools used to be a major user of BORPH - it made the open hardware and software platforms easy to use for anyone with Linux experience; however, its use was eventually killed by complexity.
The biggest challenge for FPGAs is tools: the CASPER guys used Matlab Simulink, which works OK for small designs but becomes challenging as the devices and designs grow. There are also a bunch of C-to-gates tools, Matlab-to-gates tools, Haskell-inspired tools like Bluespec, and even some Python tools. In fact, for the Berkeley CASPER group, as GPU throughput (and processing power) improved and the radio telescopes grew in size, it made sense to move to a hybrid FPGA+GPU backend processor.
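For a sense of the workload being discussed, here is a rough numpy sketch of the cross-multiply ("X") step of an FX correlator -- the part that ends up on the FPGA or GPU. Array sizes and variable names are made up for illustration.

    # Rough numpy sketch of the cross-correlation ("X") step of an FX
    # correlator. Shapes and names are illustrative only.
    import numpy as np

    n_ant, n_chan, n_time = 8, 1024, 256
    # channelised voltages: antenna x frequency channel x time sample
    volt = (np.random.randn(n_ant, n_chan, n_time)
            + 1j * np.random.randn(n_ant, n_chan, n_time))

    # all antenna-pair products, averaged over time -> visibilities
    vis = np.einsum('act,bct->abc', volt, volt.conj()) / n_time
    print(vis.shape)   # (n_ant, n_ant, n_chan)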
We have used it to build an arbitrary-waveform radar transceiver and a few other things.
Of course, now there's the Xilinx Zynq-7000 family, which puts a dual-core ARM and tightly coupled Xilinx 7-series fabric on a single chip. Plus, for a few hundred dollars you get a nice development board (www.zedboard.org) and a front-end wireless digital receiver FMC card - http://goo.gl/qCnJM
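On a part like the Zynq, the ARM side typically talks to the fabric through memory-mapped AXI registers; a hedged sketch of what that looks like from Linux follows. The base address and register offset are placeholders that would come from your own block design.

    # Hedged sketch: poking an AXI-mapped register in the Zynq fabric from
    # Linux on the ARM cores via /dev/mem. Address and offset are
    # placeholders from a hypothetical block design.
    import mmap, os, struct

    BASE, SPAN, CTRL_OFF = 0x43C00000, 0x1000, 0x0   # hypothetical values

    fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
    regs = mmap.mmap(fd, SPAN, offset=BASE)
    regs[CTRL_OFF:CTRL_OFF + 4] = struct.pack("<I", 1)         # e.g. start the accelerator
    status, = struct.unpack("<I", regs[CTRL_OFF:CTRL_OFF + 4])  # read a status word back
    regs.close()
    os.close(fd)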
In fact, initially JP Morgan looked at GPUs for acceleration. They ported one of their models to the graphics architecture and were able to get a 14- to 15-fold performance boost. But they thought they could do even better with FPGAs. The problem was that it was going to take about 6 months for an initial port. That's when they went to Maxeler and initiated a proof-of-concept engagement with them.
GPUs [...] were able to get a 14- to 15-fold performance boost [...] Their Monte Carlo model, for example, was able to realize a 260- to 280-fold speedup using FPGA acceleration.
These numbers are consistent with other computations I've seen that were well-suited for FPGAs.
(Note - it might well be that the FPGA version of that particular problem would be even faster. I merely posted the GP to point out that significantly more effort went into tailoring the model to the FPGA than to the GPU, which seemed to be just a "compile for GPU" without restructuring the data model to fit.)
In general when implementing an easily parallelizable algorithm, you can expect about an order of magnitude improvement in power or speed going from a traditional CPU to a DSP or GPU, another going from that to an FPGA, and another going from that to hard silicon. In real applications you often have un-parallelizable or less-parallelizable parts that limit your gains.
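That last point is just Amdahl's law; a quick worked sketch of how the serial fraction caps the overall gain even when the accelerated part gets a huge speedup:

    # Amdahl's law: overall speedup when a fraction p of the work gets a
    # per-stage speedup s and the rest stays serial.
    def amdahl(p, s):
        return 1.0 / ((1.0 - p) + p / s)

    print(amdahl(0.95, 100))   # ~16.8x -- the 5% serial part dominates
    print(amdahl(0.99, 100))   # ~50.3x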