On FPGAs as PC Coprocessors (1996) (fpgacpu.org)
55 points by zpiman 74 days ago | 32 comments

Something I've often dreamed about is an FPGA board in PCIe card form with a sane toolset alongside it, so I can treat it as software instead of getting advanced degrees in desktop cable management and I/O pin mapping. Does something like that exist?

What you're describing is OpenCL, and yes, it exists: both Xilinx and Intel produce toolsets. No, they aren't sane by software standards, but they're fantastic compared to hardware engineering. A card will cost you ~$10k for something you'd actually get acceleration from (https://www.xilinx.com/products/boards-and-kits/alveo/u250.h...), and you'll still need a degree in electronic engineering to produce something that convincingly accelerates your task.

Most FPGAs that are viable accelerators aren't for hobbyists, but they are also not as expensive as you think. I can't find it anymore, but I once saw an online shop with a huge variety of 500k+ LUT FPGA modules (just the chip itself on a small PCB) for around 1000€ plus 500€ for a breakout board/mainboard. At those prices it makes more sense as an individual to invest in more CPU cores or a GPU (if your problem maps to it).

Edit: Maybe it's this one here. https://shop.trenz-electronic.de/de/TE0808-04-06EG-1EE-Ultra...

How much time would it take to synthesize a 500K-LUT design on a high-end workstation?

Wait, we have an FPGA-based hardware module to accelerate synthesis!

(or even more ironic: an ASIC module)

Around 5 hours.

If you run Linux and are happy with four 64-bit ARM cores, 154K logic elements and 2GB LPDDR4, $250 is the new low bar for low-cost acceleration[1]. [1] https://www.xilinx.com/products/boards-and-kits/1-vad4rl.htm...

Adding to my siblings, Lattice Semiconductor also has a line of FPGA boards with PCI-E connectors: http://www.latticesemi.com/pci-express

I caution you to not dismiss the entire field of computer hardware engineering as "cable management". If that's your view, best just stick to whatever you're doing now.

Depending on how you define "sane toolset", they do exist[1]... except they're in that class where if you have to ask the price you probably can't afford it, and they don't relieve a developer of the vendor place-and-route toolchain needed to build the application pipeline.

[1] https://www.nallatech.com/solutions/all-fpga-cards/

I'd think it would be cool to have an FPGA in my PC for various kinds of emulation. If I want to play some old games I can use it for accurate emulation (like the MiSTer project[1]) or if I'm in a DAW and want to produce audio from some old synthesizer I can do that on the FPGA to get a more authentic sound. Likely niche but I'd be all over it.

[1] https://github.com/MiSTer-devel/Main_MiSTer/wiki

Probably a stupid question: instead of 6-core or 8-core CPUs, why doesn't Intel make 4 traditional cores + 2 FPGA cores on the same die?

It's a good question: the answer is that they have done this (or at least they are doing this). https://www.nextplatform.com/2018/05/24/a-peek-inside-that-i...

What's proving to be a problem though is where does this fit? If you don't have a clear need for an FPGA then just buy a normal Xeon. If you do need an FPGA then why compromise your Xeon? Have an FPGA card, or hell a group of FPGA cards.

The only place this makes sense is if you can think of a use case where you have an FPGA task that needs low latency communication with your CPU. Even with this chip though you have an uphill struggle because the cache hierarchy of a Xeon makes access to memory non-deterministic which traditionally isn't what FPGAs are designed for. It's much more difficult to design your algorithm on FPGA to deal with arbitrary memory latency.

So the question back to you is: What would you use it for?

The TI AM335x CPU has something (sort of) like this...basically 2 microcontrollers that share memory with the ARM cpus.

People have done some pretty clever things with it. Audio processing, driving LED matrix boards, emulating old video boards, driving precision servos, software oscopes and logic analyzers, etc.

Though that's in a small dev board, like the Beaglebone Black, not a beefy Intel server.

It doesn't have a use case, at least not yet. But easy, cheap gains are running out in general-purpose computing as we near the 1nm process. Heterogeneous computing will then become more relevant, and a great way to do that is an FPGA.

My rule of thumb from a few years ago: given the same semiconductor process, you roughly have a 40x area difference between ASIC and FPGA for the same amount of random logic.

There are a few things that can be done very well on an FPGA, but most things cannot, and the market for it is tiny.

If you really have an application that's perfect for a CPU/FPGA combo, just buy a PCIe card with a beefy FPGA.

It will cost you, but the development of the FPGA logic will cost way more.

Because 2 FPGA cores don't give you the same bang for your computational buck as 2/4 general purpose cores. You're better off hanging an FPGA off a fast internal bus with an expansion card, rather than try and cram an FPGA on a CPU die.

Think of them like graphics cards, but even more niche. Trying to stick them directly into the CPU isn't going to provide the power of a dedicated add on.

Although if they are on-die, you can benefit from a shared L2/L3 cache, lower power and higher performance from the tight CPU-FPGA coupling, and a shared memory path (lower cost than a dedicated one, although the two can compete with or starve each other if there isn't good coordination at the OS level).

Yeah, the problem is that any FPGA solution that integrates directly with the CPU cache is going to be a bit underpowered because it has to fit on the same silicon. Even the integrated CPU/FPGA SoCs I've seen have the ARM core separated from the fabric by an interconnect.

What for, and how exactly would it be integrated? Those are the unsolved questions.

Because supporting different instruction sets or extensions through FPGAs would lead to the beginning of the end for x86-64.

Kinda sorta related: Novena, since it has an on-board FPGA: https://kosagi.com/w/index.php?title=Novena_Main_Page

The article seems to say that an FPGA on a high-latency bus can only accelerate workloads that are streamed via DMA, and implies that a general-purpose accelerator has to be closer to the CPU. Sounds like a coprocessor, like putting an FPGA into the slot where the 8087 used to be.

That made me think, why not get even closer? Why not have an FPGA as execution unit? Modern CPUs have multiple ALUs, multiple FPUs, multiple vector units. Wouldn't it be great if an FPGA was added to that, such that the instruction set becomes extensible?

The idea is too obvious to assume nobody ever thought of it. Why isn't it done?

It has been researched: Alessandro Forin's eMIPS research project was on integrating FPGA fabric as an execution unit.

Project page: https://www.microsoft.com/en-us/research/project/emips/

Research paper: https://www.microsoft.com/en-us/research/wp-content/uploads/...

Back then Moore's Law was still going full steam so there wasn't much interest but, who knows, maybe that will change in a few years.

>So as long as FPGAs are attached on relatively glacially slow I/O buses -- including 32-bit 33 MHz PCI

GPUs are on the PCI bus, aren't they? Has something changed in the last two decades to increase bandwidth?

PCI: 33 MHz * 32 bits = ~1 Gbit/s

PCIe 3: 16 lanes * 8 Gtransfers/s * 128/130 (encoding) : ~126 Gbit/s

So, yes, it has changed quite a bit!
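
For anyone who wants to check that arithmetic, here is a small sketch in Python. It uses only the figures quoted above (32-bit 33 MHz PCI; PCIe 3 with 16 lanes at 8 GT/s and 128b/130b encoding). The function names are made up for illustration, and these are peak raw rates that ignore protocol overhead.

```python
def pci_gbit_per_s(clock_mhz=33, bus_width_bits=32):
    """Peak bandwidth of a parallel PCI bus in Gbit/s."""
    return clock_mhz * 1e6 * bus_width_bits / 1e9


def pcie3_gbit_per_s(lanes=16, gtransfers_per_s=8, payload_bits=128, line_bits=130):
    """Peak bandwidth of a PCIe 3 link in Gbit/s after 128b/130b encoding."""
    return lanes * gtransfers_per_s * payload_bits / line_bits


if __name__ == "__main__":
    pci = pci_gbit_per_s()
    pcie = pcie3_gbit_per_s()
    print(f"PCI:    {pci:.2f} Gbit/s")
    print(f"PCIe 3: {pcie:.1f} Gbit/s")
    print(f"Ratio:  {pcie / pci:.0f}x")
```

With the defaults this works out to a bit over 1 Gbit/s for PCI and about 126 Gbit/s for a x16 PCIe 3 link, so a bit over two orders of magnitude.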

But so has everything else.

If you want performance, you'd still better do it through DMA transfers that bypass the CPU, because otherwise the CPU will still be waiting thousands of cycles to fetch data from the device on the other side of the bus.

And the transfers that are done by the CPU should be write-only to the bus as much as possible.
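
To put a rough number on "thousands of cycles", here is a sketch in Python. The 3 GHz clock, the 1 microsecond round trip for an uncached read across the bus, and the 8-byte read size are all illustrative assumptions, not measurements, and the function names are hypothetical.

```python
def stall_cycles(cpu_ghz, round_trip_ns):
    """CPU cycles spent waiting on one blocking read from a device.

    A clock of N GHz is N cycles per nanosecond, so cycles = GHz * ns.
    """
    return cpu_ghz * round_trip_ns


def blocking_read_mbyte_per_s(bytes_per_read, round_trip_ns):
    """Throughput when the CPU issues one blocking read at a time.

    bytes/ns equals GB/s; multiply by 1e3 to get MB/s.
    """
    return bytes_per_read / round_trip_ns * 1e3


if __name__ == "__main__":
    # Assumed figures: 3 GHz core, 1000 ns round trip, 8-byte reads.
    print(f"stall per read: {stall_cycles(3.0, 1000):.0f} cycles")
    print(f"read throughput: {blocking_read_mbyte_per_s(8, 1000):.1f} MB/s")
```

Under those assumptions each read stalls the core for about 3000 cycles and tops out around 8 MB/s, a tiny fraction of what the link can carry. Writes across PCIe are posted, so the CPU does not wait for a completion, which is why keeping CPU-initiated traffic write-only helps.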

The AGP bus was invented the year this article was written to remove the bottleneck for video cards, and it wasn't phased out until PCIe became common in the mid-2000s.

Data transfer from the host CPU to the GPU card can kill the performance of offloading. You need a hefty data-parallel kernel, with a high-ish work-per-element, to get speedup that's worth the data transfer costs.
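
A back-of-the-envelope model of that trade-off, as a sketch in Python. It assumes the transfer and the kernel do not overlap, that the result is the same size as the input, and that the accelerator is a fixed factor faster than the CPU. The function name and every number below are hypothetical.

```python
def offload_speedup(n_bytes, cpu_ns_per_byte, accel_factor, link_gbyte_per_s):
    """Speedup of offloading versus running on the CPU (values below 1 are a loss).

    n_bytes          -- size of the input (the result is assumed to be the same size)
    cpu_ns_per_byte  -- CPU work per byte, in nanoseconds
    accel_factor     -- how many times faster the accelerator computes
    link_gbyte_per_s -- sustained bus bandwidth in GB/s
    """
    cpu_time = n_bytes * cpu_ns_per_byte * 1e-9
    # Copy the input to the card, then copy the result back.
    transfer_time = 2 * n_bytes / (link_gbyte_per_s * 1e9)
    accel_time = cpu_time / accel_factor
    return cpu_time / (transfer_time + accel_time)


if __name__ == "__main__":
    one_gb = 1e9
    # Light kernel: 0.1 ns of CPU work per byte.
    light = offload_speedup(one_gb, 0.1, accel_factor=20, link_gbyte_per_s=10)
    # Heavy kernel: 10 ns of CPU work per byte.
    heavy = offload_speedup(one_gb, 10.0, accel_factor=20, link_gbyte_per_s=10)
    print(f"light kernel speedup: {light:.2f}x")
    print(f"heavy kernel speedup: {heavy:.2f}x")
```

With these made-up numbers the light kernel ends up slower than just running on the CPU, because the copies dominate, while the heavy kernel keeps most of the accelerator's advantage.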

GPUs worked well because you could transfer all your large art assets upfront and then only communicate your mesh and shader logic as the game ran. They don't work so well if you need frequent access to system memory.

They also do not have to send the result back to the CPU because they have a video output directly attached to them.

Increase bandwidth? Absolutely. Increase relative bandwidth? Not so much.

But PCIe 3 is a whole different beast than PCI.


The Netezza shared-nothing database appliance used FPGAs as helper cards on each of the x86 data blade servers. There is a little more about how it worked here (and via DuckDuckGo).

