Hacker News
Why Use an FPGA Instead of a CPU or GPU? (esciencecenter.nl)
284 points by ScottWRobinson on Aug 14, 2018 | hide | past | favorite | 125 comments

One clarification on the comment about latency. FPGAs are typically clocked much slower than a modern CPU. Typically, they run somewhere in the low 100s of MHz, whereas an Intel CPU clocked in at around 3GHz last time I went to the Apple Store. With a typical x86 multiply instruction having a latency of, say, 3 cycles, putting that workload on an FPGA would result in a ~10x slow-down!

The real benefit of an FPGA is that you get to decide what happens in any given cycle. So rather than being able to multiply two numbers in a single cycle like on x86, you could make your FPGA design do, say, 20 multiplications in a single cycle (space allowing). Which means that you can now multiply 20 numbers in 1/10th of the time it would take on x86. (In reality I think you have something like four execution units capable of performing parallel ALU operations per cycle, depending on the family.)

So the latency benefit of an FPGA comes from flexible, almost (almost) unbounded potential for parallelism in a given cycle, not clock frequency. Hardware has to be designed to exploit this potential, otherwise it's not going to see any improvement in latency.

Anyway, just something that is maybe obvious once you're told, but isn't always mentioned in discussions like this. It's certainly something that I didn't fully appreciate before getting involved in hardware design.

Another thing is that you can pipeline, for example, multiplies. So in a CPU multiply you give the CPU inputs, wait a couple cycles, and then get the result. In an FPGA you can build a pipelined multiply. It's built such that you can feed it input every cycle and get an output every cycle. The only caveat is that the outputs are delayed relative to the inputs. i.e. you may give it (2, 3) to multiply on one cycle, but you won't see the result (6) of that particular input on the output until a couple cycles later.

[Yup, "pipeline" here is the same term used to describe how modern CPUs get their performance. They, too, are pipelining their instruction execution so that many instructions can be in the process of executing at the same time. Though what I describe is a more extreme and specific kind of pipelining.]

This is kind of like having a bunch of multiplies in parallel. But it's useful for, for example, real-time calculations. You get to perform all the calculations you need every single cycle, no matter how complex; your results are just delayed by X cycles. Pipelines are usually also more efficient than straight parallelism (i.e. an 8 deep pipelined multiplier uses less silicon than 8 individual "serial" multiplier units).
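To make the pipelining concrete, here's a toy Python model of the idea (not real HDL, and the 3-cycle depth is made up for illustration): you can feed it a new input pair every cycle, and each product emerges a fixed number of cycles later.

```python
from collections import deque

class PipelinedMultiplier:
    """Toy model of a pipelined multiplier: a new input pair is
    accepted every cycle, and each product emerges DEPTH cycles later."""
    DEPTH = 3  # hypothetical pipeline depth

    def __init__(self):
        # Pipeline registers start empty (None = no result yet).
        self.stages = deque([None] * self.DEPTH, maxlen=self.DEPTH)

    def clock(self, a, b):
        """Advance one cycle: accept (a, b), emit the oldest result."""
        result = self.stages[0]   # value leaving the pipeline this cycle
        self.stages.append(a * b) # value entering the pipeline this cycle
        return result

m = PipelinedMultiplier()
outputs = [m.clock(a, b)
           for a, b in [(2, 3), (4, 5), (6, 7), (0, 0), (0, 0), (0, 0)]]
# outputs: [None, None, None, 6, 20, 42]
# One result per cycle once full; the (2, 3) -> 6 result appears 3 cycles late.
```

The point of the model: throughput is one multiply per cycle regardless of depth; only the delay between an input and its output changes.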

Another interesting thing: in a previous life I built an FPGA-based video processing device. It sat in an HDMI chain, so it had to be real-time. If that had been built with a GPU, most engineers would build the system to buffer up a frame, perform the processing, and then feed the processed frame out. That results in at least 1 frame of delay. In contrast, because we used an FPGA, it was simple to just pipeline the entire design, and thus we only needed to buffer up the few lines that we needed. This meant A) we needed no external memory (cheaper) and B) our latency was on the order of microseconds. In my travels with that job I ran into tons of other companies building video processing devices. They _all_ used frame buffers, which made their devices unacceptable for, e.g., gaming.

> Another thing is that you can pipeline, for example, multiplies. So in a CPU multiply you give the CPU inputs, wait a couple cycles, and then get the result.

Just to be clear: CPUs will pipeline high-latency execution units. Most arithmetic operations, vector, integer, or floating point, will have a reciprocal throughput of 1 (i.e., you can issue a new operation every cycle) even if their latency will be 3 or 4 cycles. See Agner's optimization tables for actual counts. The major exception to that rule is division, which has a reciprocal throughput of ~5 cycles or so and a latency ~3-4 times that.
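Back-of-envelope, the latency vs. reciprocal-throughput distinction works out like this (a sketch assuming full pipelining, no stalls, and made-up but typical cycle counts):

```python
def cycles_independent(n_ops, latency, recip_throughput=1):
    # Independent ops: issue one every recip_throughput cycles;
    # the final result lands `latency` cycles after the last issue.
    return (n_ops - 1) * recip_throughput + latency

def cycles_dependent(n_ops, latency):
    # A dependency chain: each op must wait for the previous result,
    # so the full latency is paid every time.
    return n_ops * latency

# 100 multiplies with latency 3, reciprocal throughput 1:
cycles_independent(100, 3)  # -> 102 cycles
cycles_dependent(100, 3)    # -> 300 cycles
```

So for independent work the 3-cycle latency is almost invisible; it only bites when each multiply feeds the next.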


Just to clarify whilst I have your attention: isn't it common practice to pipeline frame buffers too? For instance, Android has three frame buffers. IIRC, one is used by user space for handing off display lists, one for composition, and one by the driver for rasterization?

In that case, how good do you reckon the perf would be compared to an FPGA?

By straight having a frame buffer vs. having a couple lines buffered, I'd imagine the FPGA still wins, except now the GPU needs even more memory to do the same task.

> One clarification on the comment about latency. FPGAs are typically clocked much slower than a modern CPU. Typically, they run somewhere in the low 100's of MHz, whereas an Intel CPU clocks in at around 3GHz last time I went to the Apple Store. With a typical x86 multiply instruction having a latency of about, say, 3 cycles, putting that workload on an FPGA would result in a ~10x slow-down!

No, not that latency. Latency from external input to external output.

Like the latency of receiving an ethernet frame to the point you can decide what to do with it, forward, modify, etc. Or reacting to sensor input.

CPUs process interrupts slowly: x86 CPUs running a non-realtime OS can take a sweet 50+ microseconds just to get to the interrupt service routine, let alone do something useful with the data. Oh, and generally no FPU, SIMD, etc. is available in this ISR [0]. So the ISR generally queues the event (on Linux a bottom half, on Windows a DPC, deferred procedure call) to actually process it later. The queued events might execute tens of milliseconds later in bad cases.

Microcontrollers can sometimes process external interrupts pretty quickly; it's not impossible to do it in 100 nanoseconds. But it really depends, and you usually have to be pretty careful to achieve this. And even then you often have significant jitter in the processing time.

FPGAs can start to handle an external I/O event on the same clock cycle, and might very well be driving an I/O output pin a clock cycle later, if no pipelining is required.

So an FPGA might be done in a mere 10 nanoseconds in some ideal cases, with little to no jitter.

[0]: Well, you can do FPU/SIMD in interrupt service routine, but need to handle all register state storage on your own, and if you mess up, it'll cause very interesting bugs in the userland application software.

FPGA latency is also much more predictable. Since you have cycle-level control, you can know exactly how long any given operation will take. On a multi-threaded CPU, it's likely that the operation may be interrupted or delayed because of another process using the same resource.

It's also worth noting that FPGAs are often much closer to the I/O than a CPU on a motherboard. Pin to pin latencies are much reduced in a situation where direct access to I/O is available. An Ethernet stack can sit on the same silicon without having to serialize data between the CPU and the north bridge.

> In a multi-threaded CPU, it's likely that the operation may be interrupted or delayed because of another process using the same resource.

If you're considering an FPGA, isn't the alternative likely bare-metal CPU programming? In that case, you have just as much control over thread scheduling as you do over FPGA timing don't you?

Throughput and latency are opposite ends of the same tradeoff. FPGAs enable cycle-perfect timing control, while for bare-metal CPU programming, even if you just have an infinite loop running bare-metal on one CPU, looking at the assembly can't tell you anything about the timing. Modern CPUs have multiple layers of caches, with penalties for a miss running to hundreds of cycles. They do parallel and speculative execution of instructions. Instructions cost a variable number of cycles and take variably long depending on the instructions before and after them.

And then, with the SoCs that are everywhere now.. you are poking some memory mapped register, say you toggle a GPIO output value.. how long does it take to change on the actual physical pin? Is the GPIO peripheral part of the processor or some IP they bought in and then interconnected over an AXI bus? Is it buffered? It's all entirely impossible to say.

I feel like people are taking the hybrid approach of having a dedicated microcontroller on the same die as their CPU, like the PRU on TI's ARM chips: http://processors.wiki.ti.com/index.php/PRU-ICSS

This by no means replaces an FPGA, but if you just want your pins to flip and code to run at a predictable interval, this gets you that. (As for reading pins, they have another peripheral on the die that will just timestamp when an event occurs. This is much more consistent than dealing with your OS's interrupt handler, but again, it's no FPGA.)

Yeah, the TI PRUs are what you get approaching this topic from a processor manufacturer's standpoint; the Xilinx Zynq is how an FPGA manufacturer would do it.

Just to add: Microchip with its "Configurable Logic Cell" and Cypress with its "configurable blocks".

SMM code from your firmware can still take over and add random latencies at the worst times.

Not since Ring -1 came into existence.

For large calculations, the magic of an FPGA is in its throughput. Imagine that you have some mess of addition, multiplication, ... The time for the first calculation hardly matters. Even if the FPGA is slower in wall-clock time getting through the first calculation (say it takes 100 clock cycles on a CPU vs. 10 slower cycles on an FPGA), what happens on the next clock cycle? The FPGA has cranked through an entire second iteration while the CPU is only a few steps in. Next clock cycle? Now the FPGA has pushed another whole calculation through the pipeline.

My favourite personal example of this is using a GPU to perform an all-pairs nearest neighbour lookup in an image (for all pixels, find the nearest keypoint). That's something like 2 billion comparisons per image. A decent CPU (parallelised) took minutes to do that, at the time.

By far the simplest solution was to brute force it on a GPU. Probably took longer "single threaded", as there was no optimisation at all, but over ~10^6 pixels with a list of ~10^3 keypoints in shared memory it was basically instant.

It was a great lesson in premature optimisation. I could have spent days tweaking the CPU method with heuristics, sorting the inputs, etc. In the end it was less than 100 lines of OpenCL I think.
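The original was under 100 lines of OpenCL; here's the same brute-force idea sketched in pure Python with made-up coordinates (no optimisation, just the all-pairs comparison the GPU would do in parallel):

```python
def nearest_keypoint(pixels, keypoints):
    """For each pixel coordinate, return the index of the closest
    keypoint by squared Euclidean distance. Pure brute force:
    len(pixels) * len(keypoints) comparisons."""
    def d2(p, k):
        return (p[0] - k[0]) ** 2 + (p[1] - k[1]) ** 2
    return [min(range(len(keypoints)), key=lambda i: d2(p, keypoints[i]))
            for p in pixels]

# Hypothetical tiny example (a real image would be ~10^6 pixels):
pixels = [(0, 0), (10, 10), (4, 5)]
keypoints = [(1, 1), (9, 9)]
nearest_keypoint(pixels, keypoints)  # -> [0, 1, 0]
```

On a GPU each pixel's `min` loop becomes one independent work item, which is why the unoptimised version was "basically instant" there.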

The author apparently isn't aware of over a decade of work done in FPGA/CPU integration. Both in more sequential languages like SystemC [1] and in extended instruction set computing like the Stretch [2]. Not to mention the Zynq [3] series, where the CPU is right there next to the FPGA fabric.

For "classic" CPUs (i.e., x86 CPUs with a proprietary frontside bus and southbridge bus) it can be challenging to work an FPGA into the flow. But the problem was pretty clear in 2007 in this Altera whitepaper [4], where they postulate that not keeping up with Moore's law means you probably end up specializing the hardware for different workloads.

The tricky bit on these systems is how much effort/time it takes to move data and context between the "main" CPU and the FPGA and then back again (Amdahl's law basically). When the cost of moving the data around is less than the scaling benefit of designing a circuit to do the specific computation, then adding an FPGA is a winning move.

[1] https://www.xilinx.com/products/design-tools/vivado/prod-adv...

[2] http://www.stretchinc.com/index.php

[3] https://www.xilinx.com/products/silicon-devices/soc/zynq-ult...

[4] https://www.intel.com/content/dam/altera-www/global/en_US/pd...

> The tricky bit on these systems is how much effort/time it takes to move data and context between the "main" CPU and the FPGA and then back again

The sweet spot for FPGAs is when the data is streaming. You can build hardware where the data is passed from one stage to another, and doesn't go to RAM at all. Particularly where the data is coming in fast, but with low resolution (huge FPGA advantage for fixed point instead of floating point).
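To illustrate the fixed-point advantage, here's a Python sketch of a Q1.15 multiply, the sort of operation that maps directly onto an FPGA's integer DSP blocks (the format choice is mine, not from the comment):

```python
FRAC_BITS = 15  # Q1.15: one sign bit, 15 fractional bits

def to_fixed(x):
    """Convert a float in [-1, 1) to a Q1.15 integer."""
    return int(round(x * (1 << FRAC_BITS)))

def to_float(x):
    """Convert a Q1.15 integer back to a float."""
    return x / (1 << FRAC_BITS)

def fixed_mul(a, b):
    # Full-width integer product, then shift back down. In hardware
    # this is one integer multiplier plus wiring; no FPU required.
    return (a * b) >> FRAC_BITS

product = fixed_mul(to_fixed(0.5), to_fixed(0.25))
to_float(product)  # -> 0.125
```

A 12-bit A/D sample fits comfortably in such a format, which is why this kind of pipeline avoids floating point entirely.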

FPGAs are much more efficient for signal processing of data collected from an A/D converter (e.g. software-defined radio) or initial video processing from an image sensor.

> The sweet spot for FPGAs is when the data is streaming

Netezza does this with their hardware. I've been dreaming of the day I could DIY and put an OSS RDBMS on top.

Yeah, the High-Level Synthesis work goes back to the 1990s. Wikipedia has a nice list:


They didn't really pan out in terms of large uptake. They did solve some problems pretty well with shorter developer time, though. There have also been open-source ones designed. Here are a few:



Many of the other ones seem dead or don't have a software page. I'm not sure what's current in FOSS HLS with C-like languages. Anyone know of other developments at least as good as the two I linked?

Nitpick: LegUp is published under a "no commercial use" license. Synflow mentions they use a "open-source language" but it is not clear how much of the tool chain is open source.

Didn't know about LegUp restriction. Thanks!

"Intel does offer an emulator, so testing for correctness does not require this long step, but determining and optimizing performance does require these overnight compile phases."

After some experience with FPGAs: the emulation step is not enough to test for correctness. Most of the problems happen while synchronizing signals with inputs/outputs, and with the other weird timing problems, glitches, and unintuitive behavior that FPGAs provide (and the emulator behaves differently). I was using VHDL, however. Does anybody have experience with OpenCL on FPGAs to explain what difficulties persist (are the timing problems easier to solve)?

There are a number of real issues with the emulation approach even now. Firstly, emulation isn't accurate - if you do floating point math in your application it will give you different results (within the tolerances of the OpenCL spec) on FPGA vs. CPU. So you can't test for correctness in the emulator.

A second, more serious issue is that getting performance that justifies using an FPGA requires tuning very carefully to the architecture. This may mean adopting pipeline architectures that destroy emulator performance (there are still issues with the emulator identifying that the design patterns for shift registers in FPGA look like pointer fun on CPU rather than mammoth memcpys). So for a huge part of the design stage the emulator is basically useless, because it tells you nothing about what you care about, since the performance of the emulator is often negatively correlated with performance on FPGA. This is made worse if you're doing hardcore FPGA tricks like mixed precision arithmetic.

As you say though - the fact that timing is lost in the emulator also means you don't get a true idea of whether you have buffer overflows, lock-ups, etc. Adding debug logic into the actual design impacts the implementation on the FPGA in a way that it doesn't for software, and is sometimes unintuitive.

> if you do floating point math in your application it will give you different results

That seems like a gross weakness in the emulator; floating point isn't actually nondeterministic!

Actually it's not as clear-cut as you'd expect. Obviously you can't represent every number in floating point, so you have to choose a way to round numbers, and for simple operations like add you can correctly round the results. For transcendental operations like x^y it's actually unknown how many resources you'd need to correctly round x^y for every valid value of x and y [1]. So since you can't calculate these numbers to correct rounding, you have to choose a level of rounding for your approximation, like 3 units in the last place (ULP) at the output. Of course we all need to know how accurate these are, so the OpenCL standard specifies it [2]: exp is required to be correct to 3 ULP for single/double and 2 ULP for half.

Now if you have 3 ULP to play with, the maker of an Intel CPU is going to design an exp instruction to best make use of the existing Intel functional units. But an Intel FPGA dev is going to design an exp function to best make use of lookup tables and 18x18 multipliers, because that's what they have on the FPGA.

So whilst you'll get the same answer for x^y on an Intel CPU and an Intel FPGA to within 3 ULP, those rounding errors are going to be different between the two architectures. So now, if you want to compute a normal distribution on Intel FPGA vs. CPU, you'll get up to 3 ULP of disagreement in your exponent, and that'll carry forward into the rest of the equation.

So now you have a choice: do you use the built-in exp on the Intel CPU, which is OpenCL compliant just like the FPGA's, and get unknown rounding differences in what is probably a mathematically sensitive task, or do you emulate the actual sub-operations the FPGA does? In which case your hardy RTL designer who wrote that exponent RTL is going to have to write an implementation in C that emulates the hardware. Oh, and they don't only have to do that for exp; they have to do that for hundreds of mathematical functions, and it'll run dog slow on the CPU compared to using the native functions.
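A quick illustration of the point: two implementations can each be within the spec's ULP bound of the true value and still disagree with each other. (A Python sketch using `math.ulp`, available from Python 3.9; the ULP offsets are contrived, not measured from any real implementation.)

```python
import math

def ulp_distance(a, b):
    """Roughly how many units-in-the-last-place apart two values are."""
    return abs(a - b) / math.ulp(b)

exact = math.exp(1.5)
# Two hypothetical spec-compliant exp() results, each within 3 ULP of exact:
impl_cpu  = exact + 2 * math.ulp(exact)  # off by +2 ULP
impl_fpga = exact - 1 * math.ulp(exact)  # off by -1 ULP

ulp_distance(impl_cpu, impl_fpga)  # ~3 ULP apart, yet both "correct"
```

That residual disagreement is exactly what propagates forward when exp feeds into a larger computation.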



> do you emulate the actual sub-operations the FPGA does?

Yes. It's an emulator.

> ... is going to have to write an implementation in C that emulates the hardware.

Makes sense. It's an emulator.

> ... and it'll run dog slow on the CPU compared to using the native functions.

Isn't that to be expected? It's an emulator. This isn't like games where it just has to look close. If it's a dev tool for testing correctness, exactness matters.

Well last time I looked, the answer to question 1 for the Intel OpenCL SDK is actually no.

And while yes, it's expected to be slow compared to the native functions, that's not the problem. It's slow compared to simulation.

Funny how some people believe that brain simulations are relatively nearby, but we can't even simulate an FPGA well enough to trust the model.

All sorts of simulators are nearby at once if you consider quantum computers to be nearby; it's not linear development. I agree, however, since I don't consider a quantum computer to be nearby.

I wouldn't think the usual issues with glitches and synchronizing signals would happen with OpenCL. The synthesizer should have enough information to handle all of that and spit out max timing on the other side.

> The HPC community is already used to GPUs — getting people to switch from GPUs to FPGAs requires larger benefits.

It's worth pointing out that some scientists haven't even made the leap to GPGPU computing yet and are still relying upon OpenMP / multithreading on general purpose CPUs, even when a GPU would be clearly superior. Anecdotally, I remember hearing that climate science simulations are particularly bad about taking advantage of new architectures for speedups, an assertion possibly supported by the fact that climateprediction.net simulations took several days using all 8 cores on my laptop to finish.

Neuroimaging is also pretty bad, since other than BROCCOLI [0], most toolkits for preparing analyses don't leverage GPUs, even though I'd wager 80% of the steps involve image manipulations / linear algebra.

[0] https://github.com/wanderine/BROCCOLI

Massively depends on what you're doing as to whether it's possible, straightforward, or useful, to be honest. For many people it's fine, but there are plenty of people for whom the time investment is not worth it, for a variety of reasons. It's nowhere near as simple as just porting things quickly unless you have a very simple codebase.

For example, I work on coupled finite-element/boundary-element calculations in magnetics. Porting our finite element calculations to GPU is possible but pointless because the memory requirements are too high: we just got some V100s on our University supercomputer, but they're only the 16 GB versions, and the compressed boundary element matrix of the simulation I was running just yesterday was 22 GB alone, let alone all of the other data! It's nothing like people who do simulations in, for example, fluid dynamics, because the interactions they look at are generally local, whereas in our field we have a dominating non-local calculation.

If you instead use finite differences to tackle the problems we look at, then it's much less of a problem, but if you try to go multi-GPU you quickly run into the same memory issues, because one of the calculations requires repeated FFT-accelerated 3-D convolution to run in a reasonable amount of time, and in-place FFTs on multi-GPU can only be done on arrays that are < (memory on each GPU)/(number of GPUs) because there needs to be an all-to-all communication. If you have 4 x 16 GB V100s (which is $36k in my country), that means the array you need to transform must be at most 4 GB, and in practice it's less than that because we need to store other data on the GPU. That hugely limits the problems you can look at.
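The memory bound above, spelled out as a trivial sketch (the figures are the commenter's example of 4 GPUs at 16 GB each):

```python
def max_inplace_fft_gb(mem_per_gpu_gb, n_gpus):
    # The all-to-all exchange in an in-place multi-GPU FFT means the
    # whole array must fit within (memory per GPU) / (number of GPUs).
    return mem_per_gpu_gb / n_gpus

max_inplace_fft_gb(16, 4)  # -> 4.0 GB, before any other per-GPU data
```

Note the counterintuitive part: adding more GPUs of the same size shrinks, rather than grows, the largest transformable array.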

I think these scientists are the real winners. Computers just work for them, they don't need to jump through hoops - eg, rewrite their software for immature proprietary software architectures. Chasing that last 10x of performance at great cost is for the desperate!

In my experience, it's true. And it's unlikely to change - the simulation code was largely written in FORTRAN and C over a period of 50 years.

This is in fact what makes Intel Phi so appealing to some.

For C/Fortran using LAPACK it should be as simple as relinking. https://www.olcf.ornl.gov/tutorials/gpu-accelerated-librarie...

12 years ago I worked on High Level Synthesis at Xilinx converting atmospheric moisture simulations from FORTRAN...

One interesting use case for FPGAs is where hardware qualification is very expensive, for instance for use in space. If a given FPGA is already space qualified, you just need to load new code onto it and you can skip the expensive qualification step for your new application. You can also consolidate functionality of multiple chips into that one qualified FPGA.

The expense for space is producing rad-hard parts. (The designs are done on the same FPGA, then a few rad-hard chips are made.)


I've always thought FPGAs would be perfect for hardware-backed video decoding/encoding that could adapt to new codecs (like VP9) while also being updatable for new performance-improving discoveries.

It also seemed like it would go well with a generic radio subsystem, so you could compile hardware support for new wireless standards that come out after your hardware did (essentially an FPGA SDR).

It seems like there could be lots of uses for being able to on demand enable hardware acceleration for certain tasks a program might need lots of.

I spent this summer doing exactly that for an internship. I set up a Xilinx Zynq chip to run a basic video capture and streaming program (based on GStreamer), with the encoding portion handled by the FPGA. This chip has a baked-in encoder supporting AVC and HEVC, but I suppose you could implement or buy your own IP block for other codecs. Xilinx makes multiple boards based off the same chip with support for video codecs and SDR.

FPGAs are indeed commonly used for SDR as well. Ettus has a range of USRPs with FPGAs on-board, and you can find readily available IP cores for anything from Bluetooth to DVB-S.

That's partially where I got my inspiration.

Seems like if integrated properly it could be a great component in a general-purpose PC to future-proof it against changing needs and requirements.

Experience seems to show that standardizing on a codec or instruction type is the most likely path. Apple iDevices have had hardware transcoding for video for a couple releases now.

You really can't just have an FPGA that you recompile to a different architecture within a split second. They typically require compiling, downloading, and testing the architecture you're deploying.

Until someone comes up with a much faster to deploy FPGA, you'll likely not see enough efficiency gains in having a codec-specific processor given the time it takes to reconfigure the FPGA for each codec.

But isn't that a one time expense?

When a program installs, it could compile the hardware acceleration for your platform and test it. Can you not then, rather rapidly, directly activate that compiled design every time you launch the program in the future?

In that same vein, one-time compilation of a hardware video decoder when a new codec is released, and simply loading it when you want to play a video, seems infinitely easier than upgrading your desktop GPU or CPU to one that supports hardware acceleration for the new codec.

You're correct that it's a one time expense, you can quickly load the bit file onto the FPGA (< 2 seconds) once it's been generated.

Does anyone have experience of Reconfigure.io [1] and their toolchain that transpiles Go for execution on FPGAs [2]? I am curious how it feels from the developer perspective, how it works in practice, and what types of applications it is a good fit for.

[1] https://reconfigure.io

[2] http://docs.reconfigure.io/overview.html

Only once you have optimized your code as much as possible to run on x86 or GPU, and still find that you need a prohibitively expensive amount of processing power to accomplish the task, is an FPGA worth it.

It is literally a DIY CPU architecture. So, unless you know exactly what it is about your current CPU's architecture that is holding you back, you won't be able to benefit from an FPGA-based design.

I've waited 20 years for something like that, thanks for posting! FPGAs are great but I think the biggest problem with them is that they're an array of gates instead of logic units (like ALUs). So the learning curve to get to the point where you can place an array of cores and memories and have them all communicate is so high that most people never get that far.

So that's been a huge barrier to exploring the problem spaces that GPUs are currently used for, like AI and physics simulations. Hopefully using languages like Go/Elixir/Erlang/MATLAB/Octave etc. will alleviate that to some degree.

My gut feeling is that in 3-5 years we'll reach the limits of what SIMD can accomplish and we'll find it difficult to do the kinds of general-purpose MIMD computing needed to move beyond the basic building blocks of AI like neural nets and genetic algorithms. I stumbled onto these links a month and a half ago and I think something like this will make writing generalized/abstract highly parallelized code tractable again:



Original at: https://news.ycombinator.com/item?id=17419917

It's so far outside the model of FPGAs, that I wouldn't even consider putting in the effort. It's like how regular code doesn't run well on the GPU, but much worse.

I doubt that programming efficiency is what's holding back FPGAs for general compute.

Why? Because we've seen decades of research in this area, so at least we have some tools (C for FPGA isn't ideal, but still...), and much of the general mapping of which algorithms belong on FPGAs is known.

And Amazon FPGA instances have existed for over a year.

So in that time, if there were worthwhile services to offer with FPGAs, people would have offered them, or at least started. And sure, programming complexity is a barrier, but entrepreneurs and VCs love competitive barriers. So where's all the VC funding going towards this area? Where are all those startups?

I disagree. FPGA design remains a highly specialized area. Tools such as Vivado HLS and other high-level synthesis tools do provide some improvement in productivity, though with inherent tradeoffs in design quality. There have also been a lot of new hardware construction languages popping up recently (e.g., HardCaml, which I happen to use, Chisel, Migen, PyMTL, etc.). They may bring something to the table in the coming years, but that's yet to be determined.

HardCaml's pretty great though, IMHO. But then again I'm already biased towards OCaml, and I get to work with its creator!

Disagree. I worked in this space for 15 years and watched CUDA win a market for NVidia that was ripe for the picking by Xilinx or Altera. The culture of hardware design prototyping dominates the FPGA field, and all attempts at making it approachable to software developers are stunted by the proprietary ecosystem. I also think that high-level synthesis is largely misguided, since a sequential imperative style isn't the correct way to think about FPGAs, and it doesn't really address the lack of a good low-level language with an abstraction mechanism, or the fundamental need for an operating system for running processes on a reconfigurable computing system.

Aren't FPGAs used mostly to test/design a circuit that you would then go on to actually fabricate/build? I could be wrong, but I thought FPGAs were stateless (meaning if they power off/reboot you lose everything and have to set it up from scratch again).

That is incorrect, or at least inaccurate, on both counts.

FPGAs are often used to test/design not a "circuit" but an ASIC (application-specific integrated circuit) that you will then go on to actually build. Or, if your application doesn't have the volume to support the ASIC engineering costs but can support the FPGA unit cost, you just leave it as an FPGA. There are millions of devices (industrial machines, research, testing, etc.) where this makes sense.

And not all FPGAs lose state when powered down. For those that do, there's typically an easy way to hook up a little Flash chip and bootloader to reload the state automatically on reset. For those that don't have that option (e.g., currently working with an ABB robot, Mecco laser, and Rockwell PLC which all contain FPGAs, only the Mecco can boot itself), you store state in battery-backed SRAM and just leave the RAM on at all times.

> FPGAs are often used to test/design not a "circuit" but an ASIC (application-specific-integrated-circuit)

To be pedantic, an ASIC (and any IC really) is a circuit, as denoted by the "C" standing for "circuit".

Absolutely. In the worlds I work in, FPGA has been standard practice for years and years. As of late, it's becoming more common to place the CPU there for cost reduction.

In addition, you get the opportunity both to update the part and to change its behavior via (for instance) a user setting.

> Aren't FPGA's used mostly to test/design a circuit that you would then go on to actually fabricate/build?

We use FPGAs in cell sorting because we need to make decisions off of high dimensional data with low latency. The cells moving through our system have velocities higher than 1 m/s. They flow past a set of lasers and wind up in a droplet less than a millisecond later so we need to make a decision whether or not to sort a droplet within that time frame. We don't use ASICs because we don't move enough volume to justify the startup cost.

They're also really common for interconnect fabric on reasonably complex systems. Lattice does really well in this space with their low-cost FPGAs/CPLDs.

Are you doing this in a commercial context, or in a university lab or similar? I've seen a few papers come out of universities using this technique but I haven't gotten to look at any commercial gear to see how it works under the hood.

The company I work at manufactures these instruments. There are a bunch of videos on Youtube explaining how these systems work if you search for "flow sorting". I can also answer questions if you have them.

I have at least 4 FPGAs that will never be made into ASICs sitting in my office right now, and they're not even for highly technical stuff. They're retro video game things (2 Everdrives, a video upscaling device called the OSSC, and an HDMI adapter that plugs into the digital out port of the GameCube, originally made for component output).

I bought all these things off the shelf. There are definitely some markets that are well served by FPGAs but are still too small for actual fabrication processes.

A fellow retro gamer. I love my OSSC

Some FPGAs have built in flash that they will automatically load a bitstream from upon boot. If they don't, they typically can load the bitstream from an external flash chip upon powering up, or even via a microcontroller that's interfacing with the flash chip.

This is exactly how the Mega Everdrive and similar retro flashcarts work. The "OS" loaded from SD card configures an FPGA inside the cartridge.

Thanks, didn't know that. Is the boot process slow?

Yes and yes, but new uses are emerging. It's like 3D printing: there's a crossover where the extra cost you pay for flexibility becomes too much to compete with hard-wired logic, despite the latter's high initial cost.

To be fair, it's not new. People have been using FPGAs in that capacity for decades.

Very true, but mostly EEs and embedded SW/FW engineers. I think the SW tools, although still awful, are reaching the point where more SW-focused developers can take advantage of FPGA flexibility.

That said, every programmer should be made to struggle through a Verilog or VHDL project at some point - that'll put some hair on your chest and the fear of god into you. I'll see your async and raise you an "always @(posedge clk)"
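For anyone who hasn't touched an HDL: here's a toy Python model (names and the 3-stage depth are purely illustrative) of what "always @(posedge clk)" semantics amount to, and of the pipelined multiplier described upthread - one result per cycle of throughput, but each result emerging a few cycles after its inputs:

```python
from collections import deque

class PipelinedMultiplier:
    """Toy model of a 3-stage pipelined multiplier: accepts a new
    input pair every cycle, but each product only appears at the
    output 3 clock edges later (the latency described upthread)."""
    def __init__(self, stages=3):
        # Pipeline registers, initially empty (None = bubble).
        self.regs = deque([None] * stages, maxlen=stages)

    def clock(self, a=None, b=None):
        """One rising clock edge: emit the last stage, shift, accept input."""
        out = self.regs[-1]  # value leaving the final pipeline stage
        self.regs.appendleft(a * b if a is not None else None)
        return out

mul = PipelinedMultiplier()
outputs = [mul.clock(2, 3), mul.clock(4, 5), mul.clock(6, 7),
           mul.clock(), mul.clock(), mul.clock()]
# First three edges produce nothing (pipeline filling), then 6, 20, 42
# stream out one per cycle.
```

The point of the sketch: throughput is one multiply per clock even though any individual result takes several clocks to appear.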

They are used in production all over the place, along with their smaller cousin, the CPLD. If you see a part on a PCB by Xilinx, Altera, Lattice, or some MicroSemi parts, odds are that it's an FPGA. Many FPGAs lose their configuration on power-off, so you configure the bitstream using a CPU for example. As others have said, there are also flash-based ones.

I personally know of many embedded systems using FPGAs (spacecraft and medical applications), signal processing such as radar, many PCIe cards, etc.

They make it into commercial products too; for a recent example, see Nvidia using one in the G-Sync modules for 4K displays. Although a $2,600 retail-priced FPGA doesn't make for a cheap end product, even when purchased in quantity.


The Bing team at Microsoft use FPGAs in production. I was chatting recently with one of their developers who is focused on that space, and he mentioned they find they get similar performance to GPUs but with significantly less power consumption, while affording them the same flexibility for their particular use case.

Many FPGAs can load their bitstream from Flash.

Cool, didn't know that. Is the boot time fairly slow compared to booting a normal PC?

No, we're talking about microseconds.

We're often talking about hundreds of milliseconds.

No, they retain state. We used an Altera DE1 in school and it retained our burned-in program until erased or reprogrammed.

Probably burned into an external EEPROM or flash, the FPGA fabric itself still has to be configured by some external means. It's pretty fast but on-the-fly reprogramming of the FPGA logic is still an issue where you want to very quickly switch the logic on the FPGA.

Most FPGAs I've seen go out of their way to allow you to reconfigure while it's running, you just have to jump through a few extra hoops and get down and dirty with the floor plan.

Older Altera CPLDs and Actel/Microsemi antifuse based FPGA families have nonvolatile configuration within the fabric. They enjoy the benefit of near instant start up since there is no programming phase to transfer a bitstream from an external flash part.

Some have the configuration memory on-die as well, to keep part count down. It's also good for protecting your bitstream, though this isn't infallible of course.

No, FPGAs are used in production. There are also plenty of flash-based FPGAs available.

"...programming FPGAs is still an order of magnitude more difficult than programming instruction based systems." True statement. If you're offloading computation from a CPU, profile your application first and determine whether the FPGA board's architecture will actually improve performance. Don't underestimate the cost of moving data between the CPU and the FPGA.
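As a back-of-the-envelope illustration of that last point (all numbers below are hypothetical, and the model ignores overlap of transfer and compute): the offload only pays off once the FPGA's speedup outweighs the time spent shipping data over the link.

```python
def offload_wins(bytes_moved, cpu_time_s, fpga_speedup, link_bw_bytes_per_s):
    """Crude model: FPGA-path time = transfer time + accelerated compute.
    Returns True if offloading beats just staying on the CPU."""
    transfer = bytes_moved / link_bw_bytes_per_s
    fpga_compute = cpu_time_s / fpga_speedup
    return transfer + fpga_compute < cpu_time_s

# Hypothetical: 1 GB over an 8 GB/s link, 10x kernel speedup.
# A 1 s CPU kernel: 0.125 s transfer + 0.1 s compute -> offload wins.
assert offload_wins(1e9, 1.0, 10, 8e9)
# A 10 ms CPU kernel on the same data is dominated by the transfer.
assert not offload_wins(1e9, 0.01, 10, 8e9)
```

The asymmetry is the whole story: short kernels over big data sets are exactly where the transfer cost eats the speedup.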

Is there a high-performance FPGA that does not depend on a proprietary, bloated, Windows-only toolchain?

Every FPGA I have used has offered their proprietary bloated incredibly expensive toolchain on Linux.

There's IceStorm, but the supported hardware is not exactly high performance; IIRC the biggest chips supported have around 8K LUTs.

The proprietary bloated toolchains are free to use for lower end to mid range parts, even on Linux.

Yes, but as soon as you hit the limits of, e.g., Quartus Lite, it's $4k per seat per year. Then another $2k for ModelSim. And another $2k if you want the DSP package, etc.

That's on the super low end of the price spectrum.

Proprietary and bloated, yes, but I believe both Xilinx and Altera have software available for Linux, or am I wrong? At least I used Altera's suite on Linux once for a pilot project.

I recently installed Altera Quartus on Debian for a school project. It kept crashing after 2 seconds.

First result on Google says Quartus uses its own libcurl, which is broken, and that you should copy the OS libcurl to /opt/altera/quartus/lin64/lib/whatever. Lots of issues like this.

I have used Xilinx's toolchain on Linux, though it was hard to get working. This was a few years ago.

They support Ubuntu now. Setup was just as easy as on Windows for me.

Actually, for the longest time the EDA industry only had working versions of their tools for "esoteric" CPU architectures like SPARC running Solaris :). So yes, by now they support Linux very well; some of the Cadence and Synopsys tools might not even work on Windows (and you need them for simulation when ModelSim doesn't cut it). In any case, EDA has been a Unix/Linux domain because the industry is so old and used to require really expensive Unix workstations.

Supposedly some of the Xilinx 7-series FPGAs have been reverse engineered, so there is a possibility of supporting them in Yosys (which is a reasonable free toolchain). I've not been able to get much information about this, but you can check on the website: http://www.clifford.at/yosys/

From the FAQ at http://www.clifford.at/yosys/faq.html:

> 2. What synthesis targets are supported by Yosys?

> Yosys is retargetable and adding support for additional targets is not very hard. At the moment, Yosys ships with support for ASIC synthesis (from liberty cell library files), iCE40 FPGAs, Xilinx 7-Series FPGAs, Silego GreenPAK4 devices, and Gowinsemi GW1N/GW2A FPGAs.

> Note that in all this cases Yosys only performs synthesis. For a complete open source ASIC flow using Yosys see Qflow, for a complete open source iCE40 flow see Project IceStorm. Yosys Xilinx 7-Series synthesis output can be placed and routed with Xilinx Vivado.

[emphasis mine]

All vendors have Linux toolchains. But they are still proprietary and bloated (at least past a certain point).

I believe this was on hn recently (uses icestorm as mentioned by others):


Last time I checked, Vivado was available and fully functional on Linux. Unless you are aware of a change in the last few minutes?

There's Vivado on Linux, and the open source toolchain for some Lattice FPGAs (IceStorm)

For those wanting to play around with an FPGA, this looks quite interesting: https://www.sparkfun.com/products/14829

It's about half the price of the other FPGAs I've seen commonly used for hobbyist projects [1], doesn't need any separate programmer board to interface to your computer for programming, and has an open source tool chain available.

I've not played with it, as it just showed up in a mailing from Sparkfun about new products.

[1] such as https://www.sparkfun.com/products/11953

I'm surprised he doesn't talk much about direct cost (rather than engineering cost). A decent FPGA developer kit will cost several thousand US dollars, more expensive even than a high-end GPU.

You can buy this dev board to play around with and learn FPGA for $75, and it has extensive documentation and resources to help you


But that's not a high performance FPGA. Believe me, I have a ton of FPGAs at home (starting even cheaper than that), and I write about them all the time: https://rwmj.wordpress.com/?s=fpga

However if you're doing the sort of work which isn't just hobbyist stuff, you're going to spend 1000s on a board. This is the sort of thing commercial developers will be using: https://www.xilinx.com/products/boards-and-kits/ek-v7-vc707-...

If you're building a commercial product, a $3k dev board sounds incredibly cheap. My company spends orders of magnitude more on licenses for an IDE extension (ReSharper).

Absolutely. However I guess you'll also want to deploy your FPGA application at some point, which means buying FPGAs for all your servers too. So the cost is more analogous to the cost of a CPU/GPU than to the cost of an IDE/developer license.

You can get boards with a Zynq 7020 on them for less than $200.

Another FPGA sweet spot is analyzing TB/PB-scale databases. Netezza programmed FPGAs to uncompress, project, restrict (including NULLs), enforce isolation & visibility and check CRCs at disk scan rates (>200MB/sec). When touching a handful of columns from a 20-200 column table, CPUs spend most of their time stalled on cache line misses. FPGA-projecting just those few columns into memory enormously increases cache hit rates.
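A rough sketch of why upstream projection helps with those cache misses (the row/column widths here are hypothetical): if a scan touches 3 of 100 equal-width columns, having the FPGA project just those columns into memory cuts the bytes the CPU must pull through its caches to 3% of the raw table.

```python
def bytes_touched(n_rows, n_cols, col_width, cols_needed, projected):
    """With row-major storage and no projection, a scan drags every
    column through the cache hierarchy; with upstream projection,
    only the needed columns ever reach the CPU."""
    per_row = cols_needed * col_width if projected else n_cols * col_width
    return n_rows * per_row

full = bytes_touched(1_000_000, 100, 8, 3, projected=False)  # 800 MB scanned
proj = bytes_touched(1_000_000, 100, 8, 3, projected=True)   # 24 MB scanned
assert proj * 100 == full * 3  # the projected path touches 3% of the bytes
```

This is just arithmetic, but it's the arithmetic behind the cache-hit-rate claim: the stalled cycles scale with bytes dragged through the cache, not with bytes on disk.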

Plain C code can readily scan at a couple of GB/sec on an ordinary Intel core, and NVMe drives can transfer 8 or 16 GB/sec.

Is it just me, or does this article seem like it's trying really hard to push Intel? For an article about FPGAs, failing to acknowledge the other major player in the market (there's no mention of Xilinx anywhere in the article, while Intel/Altera appears plenty) seems disingenuous. Really, the tone just struck me as a subtle advertisement piece; maybe it's just me?

Is there a pipeline to compile TensorFlow computation graphs to FPGA? This seems like one possible benefit of TensorFlow's fixed graphs: they may be much easier to compile to FPGA than other options.

Edit: Seems like some experimental work is being done with XLA (and llvm) https://www.google.ca/amp/s/www.nextplatform.com/2018/07/24/...

How long does it take to "program" an FPGA, compared with typing a command and having it load into memory?

What about a JIT that compiles hot code sections to FPGA? (Though compilation sounds too slow)

EDIT "one or two seconds or at least 100's of milliseconds." https://electronics.stackexchange.com/questions/212069/how-l...

> An exception to the rule that GPUs require a host is the NVidia Jetson, but this is not a high-end GPU

Well, the host is just on the same die, but it's still there

One reason could be to meet real-time constraints or to get predictable timing. This is becoming more and more difficult to do with modern CPUs for all sorts of reasons.

"A couple of FPGAs in mid-air (probably)"

I would love to see projects like MAME and MESS targeting FPGA hardware for near-perfect hardware emulation of classic hardware.

The Analogue Super Nt project uses an FPGA to run SNES cartridges unemulated.


But it is not open source. There are more commercial products using FPGAs for emulating gaming-console hardware.

What are some cool jobs for FPGA experts besides Intel and the military? It seems like there's not much availability.

I recently spoke to someone at Broadcom who was moving networking processes from software to hardware using them, try looking into similar companies.

In the past I've seen job adverts for networking companies (routers) and hedge funds. Not sure if either of them count as cool.

Lots of companies who design ASICs will prototype them on FPGAs to catch bugs and start firmware development early.

Why does the compilation take hours? Is this an NP-hard problem?

If my memory of a processor-design course is correct, arranging the desired functionality onto the finite resources of an actual FPGA chip, let alone doing so in a way that gets good performance (i.e. making critical paths as short as possible), reduces to bin packing.
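To give a flavor of the problem (a toy analogy, not an actual place-and-route algorithm): even the stripped-down "fit these logic blocks into fixed-capacity regions" version is classic bin packing, which is NP-hard, so real tools lean on heuristics such as first-fit decreasing and still grind for hours once timing and routing are layered on top.

```python
def first_fit_decreasing(items, capacity):
    """Greedy bin-packing heuristic: place each item (largest first)
    into the first bin with room, opening a new bin if none fits.
    Real P&R also optimizes wire length and timing on top of the
    packing, which is a big part of why it takes so long."""
    bins = []
    for size in sorted(items, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])
    return bins

# Six "logic blocks" into capacity-10 "regions": FFD finds a 3-bin packing.
packed = first_fit_decreasing([7, 5, 4, 3, 2, 2], 10)
```

The heuristic is fast but not optimal in general, which mirrors why FPGA tools expose effort levels and seeds: you're trading runtime for packing quality.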

The answer might be obvious, but I'm not a hardware guy: couldn't they use an FPGA to do the compilation itself, or at least the bin-packing part? Is there a way to implement a general SAT solver on an FPGA?

What's that for?
