The real benefit of an FPGA is that you get to decide what happens in any given cycle. So rather than being able to multiply two numbers in a single cycle like on x86, you could make your FPGA design do, say, 20 multiplications in a single cycle (space allowing). Which means you can now do those 20 multiplications in 1/10th of the time it would take on x86. (In reality I think x86 has something like four execution units capable of performing parallel ALU operations per cycle, depending on the family.)
So the latency benefit of an FPGA comes from flexible, almost (almost) unbounded potential for parallelism in a given cycle, not from clock frequency. The hardware has to be designed to exploit that potential, otherwise you won't see any improvement in latency.
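To make that concrete, here's a minimal Verilog sketch of the "lots of multiplies per cycle" idea (module and signal names are made up for illustration; a real design would be constrained by DSP block count and routing):

    // Hypothetical sketch: 20 multiplications completed every clock cycle,
    // limited only by how many DSP blocks / LUTs the device has available.
    module parallel_mul20 #(
        parameter W = 18                     // operand width (DSP-friendly)
    ) (
        input  wire              clk,
        input  wire [20*W-1:0]   a_flat,     // 20 packed 'a' operands
        input  wire [20*W-1:0]   b_flat,     // 20 packed 'b' operands
        output reg  [20*2*W-1:0] p_flat      // 20 packed products
    );
        integer i;
        always @(posedge clk) begin
            for (i = 0; i < 20; i = i + 1) begin
                // Each loop iteration synthesizes to its own multiplier,
                // so all 20 products are computed in the same cycle.
                p_flat[i*2*W +: 2*W] <= a_flat[i*W +: W] * b_flat[i*W +: W];
            end
        end
    endmodule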
Anyways, just something that is maybe obvious once you're told it, but isn't always mentioned in discussions like this. It's certainly something that I didn't fully appreciate before getting involved in hardware design.
[Yup, "pipeline" here is the same term used to describe how modern CPUs get their performance. They, too, are pipelining their instruction execution so that many instructions can be in the process of executing at the same time. Though what I describe is a more extreme and specific kind of pipelining.]
This is kind of like having a bunch of multiplies in parallel. But it's useful for, for example, real-time calculations. You get to perform all the calculations you need every single cycle, no matter how complex; your results are just delayed by X cycles. Pipelines are usually also more efficient than straight parallelism (i.e. an 8-deep pipelined multiplier uses less silicon than 8 individual "serial" multiplier units).
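For anyone who hasn't seen one, a pipelined multiplier in Verilog is roughly this shape (a toy sketch with made-up widths and depth, not tuned for any particular device): you push a new operand pair in every clock, and a product falls out every clock, delayed by the pipeline depth.

    // Rough sketch of a 4-stage pipelined multiplier: one new result per
    // cycle, each result delayed by 4 clocks. Synthesis tools typically
    // retime these registers into the DSP block for you.
    module pipelined_mul #(
        parameter W      = 18,
        parameter STAGES = 4
    ) (
        input  wire           clk,
        input  wire [W-1:0]   a,
        input  wire [W-1:0]   b,
        output wire [2*W-1:0] p
    );
        reg [2*W-1:0] stage [0:STAGES-1];
        integer i;
        always @(posedge clk) begin
            stage[0] <= a * b;                 // raw product enters the pipe
            for (i = 1; i < STAGES; i = i + 1)
                stage[i] <= stage[i-1];        // shuffle results down the pipe
        end
        assign p = stage[STAGES-1];            // valid STAGES cycles after inputs
    endmodule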
Another interesting thing: In a previous life I built an FPGA based video processing device. It sat in an HDMI chain, so it had to be real-time. If that had been built with a GPU, most engineers would build the system to buffer up a frame, perform the processing, and then feed the processed frame out. That results in at least 1 frame of delay. In contrast, because we used an FPGA, it was simple to just pipeline the entire design and thus only needed to buffer up the few lines that we needed. This meant a) we needed no external memory (cheaper) and b) our latency was on the order of microseconds. In my travels with that job I ran into tons of other companies building video processing devices. They _all_ used frame buffers, which made their devices unacceptable for, e.g., gaming.
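To show what "buffer up the few lines that we needed" looks like in practice, here's a hedged sketch (the 3-line window, widths, and line length are invented for illustration, not the actual product): pixels stream through, two line buffers hold the previous rows, and a 3-pixel vertical window is available to the processing pipeline every cycle.

    // Illustrative sketch only: stream pixels in, keep the last two lines in
    // block RAM, and expose a 3-pixel vertical window every cycle. A real
    // design adds handshaking, blanking handling, etc.
    module line_window #(
        parameter PW         = 8,      // pixel width
        parameter LINE_WIDTH = 1920    // pixels per line (assumed)
    ) (
        input  wire          clk,
        input  wire          pixel_valid,
        input  wire [PW-1:0] pixel_in,
        output reg  [PW-1:0] win_row0,  // pixel from two lines ago
        output reg  [PW-1:0] win_row1,  // pixel from one line ago
        output reg  [PW-1:0] win_row2   // current pixel
    );
        reg [PW-1:0] line1 [0:LINE_WIDTH-1];   // previous line
        reg [PW-1:0] line2 [0:LINE_WIDTH-1];   // line before that
        reg [$clog2(LINE_WIDTH)-1:0] col = 0;

        always @(posedge clk) begin
            if (pixel_valid) begin
                win_row0   <= line2[col];
                win_row1   <= line1[col];
                win_row2   <= pixel_in;
                line2[col] <= line1[col];      // age the stored lines
                line1[col] <= pixel_in;
                col        <= (col == LINE_WIDTH-1) ? 0 : col + 1;
            end
        end
    endmodule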
Just to be clear: CPUs will pipeline high-latency execution units. Most arithmetic operations, vector, integer, or floating point, will have a reciprocal throughput of 1 (i.e., you can issue a new operation every cycle) even if their latency will be 3 or 4 cycles. See Agner's optimization tables for actual counts. The major exception to that rule is division, which has a reciprocal throughput of ~5 cycles or so and a latency ~3-4 times that.
Just to clarify whilst I have your attention: isn't it common practice to pipeline frame buffers too? For instance, Android has three frame buffers. IIRC, one is used by user space for handing off display lists, one is used for composition, and one is used by the driver for rasterization?
In that case, how good do you reckon the perf would be compared to an FPGA?
No, not that latency. Latency from external input to external output.
Like the latency of receiving an ethernet frame to the point you can decide what to do with it, forward, modify, etc. Or reacting to sensor input.
CPUs process interrupts slowly: x86 CPUs running a non-realtime OS can take a sweet 50+ microseconds just to get to run the interrupt service routine, let alone do something useful with the data. Oh, and generally no FPU, SIMD, etc. is available in this ISR*. So the ISR generally queues the event (on Linux a bottom half, on Windows a DPC, deferred procedure call) to actually process it. The queued events might execute tens of milliseconds later in bad cases.
Microcontrollers can sometimes process external interrupts pretty quickly; it's not impossible to do it in 100 nanoseconds. But it really depends, and you usually have to be pretty careful to achieve this. And even then you often have significant jitter in the processing time.
FPGAs can start to handle an external I/O event on the same clock cycle, and might very well be driving an I/O output pin a clock cycle later, if no pipelining is required.
So an FPGA might be done in a mere 10 nanoseconds in some ideal cases, with little to no jitter.
*: Well, you can do FPU/SIMD in an interrupt service routine, but you need to handle all register state storage on your own, and if you mess up, it'll cause very interesting bugs in the userland application software.
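To illustrate the FPGA reaction-time point above, a toy Verilog sketch (the trigger condition and clock rate are made up): the input pin is sampled on one clock edge and the output pin is driven on the next, so the whole reaction fits within a couple of clock periods plus I/O buffer delays.

    // Toy example: react to an input pin within one clock of sampling it.
    // With a 200 MHz clock that's 5 ns per cycle, plus I/O buffer delays.
    module fast_react (
        input  wire clk,
        input  wire trigger_in,   // external event pin
        output reg  action_out    // drives an external pin
    );
        always @(posedge clk)
            action_out <= trigger_in;   // real decision logic would go here
    endmodule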
It's also worth noting that FPGAs are often much closer to the I/O than a CPU on a motherboard. Pin to pin latencies are much reduced in a situation where direct access to I/O is available. An Ethernet stack can sit on the same silicon without having to serialize data between the CPU and the north bridge.
If you're considering an FPGA, isn't the alternative likely bare-metal CPU programming? In that case, you have just as much control over thread scheduling as you do over FPGA timing, don't you?
And then, with the SoCs that are everywhere now... you are poking some memory-mapped register, say you toggle a GPIO output value... how long does it take to change on the actual physical pin? Is the GPIO peripheral part of the processor, or some IP they bought in and then interconnected over an AXI bus? Is it buffered? It's entirely impossible to say.
This by no means replaces an FPGA, but if you just want your pins to flip and code to run at a predictable interval, this gets you that. (As for reading pins, they have another peripheral on the die that will just timestamp when an event occurs. This is much more consistent than dealing with your OS's interrupt handler, but again, it's no FPGA.)
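And if you do have an FPGA, that timestamping peripheral is only a few lines of logic in fabric. A hypothetical sketch (names and the 32-bit counter width are my assumptions):

    // Hypothetical input-capture sketch: a free-running counter plus an edge
    // detector that latches the count the moment the event pin rises.
    module event_timestamp (
        input  wire        clk,
        input  wire        event_in,
        output reg  [31:0] timestamp,     // counter value at the last rising edge
        output reg         captured       // pulses for one cycle per event
    );
        reg [31:0] counter = 0;
        reg        event_q = 0;

        always @(posedge clk) begin
            counter  <= counter + 1;
            event_q  <= event_in;
            captured <= event_in & ~event_q;      // rising-edge detect
            if (event_in & ~event_q)
                timestamp <= counter;             // latch the time of the event
        end
    endmodule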
By far the simplest solution was to brute force it on a GPU. Probably took longer "single threaded", as there was no optimisation at all, but over O(10e6) pixels with a list of O(10e3) keypoints in shared memory it was basically instant.
It was a great lesson in premature optimisation. I could have spent days tweaking the CPU method with heuristics, sorting the inputs, etc. In the end it was less than 100 lines of OpenCL I think.
For "classic" CPUs (aka x86 cpus with a proprietary frontside bus and southbridge bus) it can be challenging to work an FPGA into the flow. But the problem was pretty clear in 2007 in this Altera whitepaper where they postulate that not keeping up with Moore's law means you probably end up specializing the hardware for different workloads.
The tricky bit on these systems is how much effort/time it takes to move data and context between the "main" CPU and the FPGA and then back again (Amdahl's law basically). When the cost of moving the data around is less than the scaling benefit of designing a circuit to do the specific computation, then adding an FPGA is a winning move.
>> The tricky bit on these systems is how much effort/time it takes to move data and context between the "main" CPU and the FPGA and then back again
FPGAs are much more efficient for signal processing of data collected from an A/D converter (e.g. software-defined radio) or initial video processing from an image sensor.
Netezza does this with their hardware; I've been dreaming of the day I could DIY it and put an OSS RDBMS on top.
They didn't really pan out in terms of large uptake. They did solve some problems pretty well with shorter developer time. There have also been open source ones designed. Here's a few:
Many of the other ones seem dead or don't have a software page. I'm not sure what's current in FOSS HLS with C-like languages. Anyone know of other developments at least as good as the two I linked?
After some experience with FPGAs, I'd say the emulation step is not enough to test for correctness. Most of the problems happen while synchronizing signals with inputs/outputs and with the other weird timing problems, glitches, and unintuitive behavior that FPGAs provide (and the emulator behaves differently). I was using VHDL, however. Does anybody have experience with OpenCL on FPGAs who can explain which difficulties persist (are the timing problems easier to solve)?
A second, more serious issue is that getting performance that justifies using an FPGA requires tuning very carefully to the architecture. This may mean adopting pipeline architectures that destroy emulator performance (there are still issues with the emulator identifying that the design patterns for shift registers on FPGA look like pointer fun on a CPU rather than mammoth memcpys). So for a huge part of the design stage the emulator is basically useless, because it tells you nothing about what you care about, since the performance of the emulator is often negatively correlated with performance on the FPGA. This is made worse if you're doing hardcore FPGA tricks like mixed-precision arithmetic.
As you say though - the fact that timing is lost in the emulator also means you don't get a true idea of whether you have buffer overflows etc or lock-up. Adding debug into the actual design impacts the implementation on FPGA in a way that it doesn't for software - and is sometimes unintuitive.
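(For readers who haven't met the shift-register pattern referred to above, in HDL terms it's roughly the sketch below; FPGA tools map it onto a chain of registers, while a naive software emulation of the same loop turns into per-element copies every iteration. Names and sizes are made up.)

    // Sketch of the shift-register idiom: every cycle the whole window slides
    // by one element. On an FPGA this maps to a chain of registers (or SRLs);
    // emulating it in software means copying or re-indexing the array each step.
    module shift_window #(
        parameter W     = 16,
        parameter DEPTH = 64
    ) (
        input  wire         clk,
        input  wire [W-1:0] din,
        output wire [W-1:0] dout
    );
        reg [W-1:0] taps [0:DEPTH-1];
        integer i;
        always @(posedge clk) begin
            taps[0] <= din;
            for (i = 1; i < DEPTH; i = i + 1)
                taps[i] <= taps[i-1];
        end
        assign dout = taps[DEPTH-1];
    endmodule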
That seems like a gross weakness in the emulator; floating point isn't actually nondeterministic!
Now if you have 3 ULPs to play with, the designer of an Intel CPU is going to design an exp instruction to best make use of the existing Intel functional units. But an Intel FPGA dev is going to design an exp instruction to best make use of lookup tables and 18x18 multiplies - because that's what they have on the FPGA.
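To give a flavour of what that looks like, here's a toy sketch (emphatically not Intel's actual implementation; the table file, widths, and the simple linear correction are all my own assumptions): exp2 of a fixed-point fraction built from one ROM lookup plus one multiply.

    // Toy sketch: exp2 of a Q0.16 fraction via one ROM lookup plus one multiply.
    // coarse = exp2(top 8 bits), then a linear correction for the low 8 bits:
    //   exp2(f) ~= lut[f[15:8]] * (1 + ln(2) * f[7:0]/2^16)
    // Purely illustrative; real FPGA math libraries are far more careful.
    module exp2_frac (
        input  wire        clk,
        input  wire [15:0] f,        // fraction in [0,1), Q0.16
        output reg  [17:0] y         // result in [1,2), Q2.16
    );
        reg [17:0] lut [0:255];                   // exp2(i/256) in Q2.16
        initial $readmemh("exp2_lut.hex", lut);   // assumed precomputed offline

        localparam [15:0] LN2_Q16 = 16'd45426;    // ln(2) * 2^16

        reg  [17:0] coarse;
        reg  [31:0] corr;
        wire [49:0] prod = coarse * corr;         // full-width 18x32 product

        always @(posedge clk) begin
            // Stage 1: ROM lookup and raw correction term
            coarse <= lut[f[15:8]];
            corr   <= LN2_Q16 * f[7:0];
            // Stage 2: apply correction (prod >> 32 is coarse * ln2 * f_lo / 2^16)
            y      <= coarse + prod[49:32];
        end
    endmodule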
So whilst you'll get the same answer for x^y on an Intel CPU and an Intel FPGA within 3 ULPs, those rounding errors are going to be different between the two architectures. So now, if you want to compute a normal distribution on Intel FPGA vs CPU, you'll get 3 ULPs of error in your exponent, and that'll carry forward into the rest of the equation.
So now you have a choice - do you use the built-in function for exp on the Intel CPU - which is OpenCL compliant just like the FPGA - and get unknown rounding errors in what is probably a mathematically sensitive task, or do you emulate the actual sub-operations the FPGA does? In which case your hardy RTL designer who wrote that exponent function is going to have to write an implementation in C that emulates the hardware. Oh and they don't only have to do that for exp - they have to do that for 100s of mathematical functions, and it'll run dog slow on the CPU compared to using the native functions.
Yes. It's an emulator.
> ... is going to have to write an implementation in C that emulates the hardware.
Makes sense. It's an emulator.
> ... and it'll run dog slow on the CPU compared to using the native functions.
Isn't that to be expected? It's an emulator. This isn't like games where it just has to look close. If it's a dev tool for testing correctness, exactness matters.
And while yes, it's expected to be slow compared to the native functions, that's not the problem. It's slow compared to simulation.
It's worth pointing out that some scientists haven't even made the leap to GPGPU computing yet and are still relying upon OpenMP / multithreading on general-purpose CPUs, even when a GPU would be clearly superior. Anecdotally, I remember hearing that climate science simulations are particularly bad about taking advantage of new architectures for speedups, an assertion possibly supported by the fact that climateprediction.net simulations took several days using all 8 cores on my laptop to finish.
Neuroimaging is also pretty bad: other than Broccoli, most toolkits for preparing analyses don't leverage GPUs, even though I'd wager 80% of the steps involve image manipulations / linear algebra.
For example, I work on coupled finite-element/boundary-element calculations in magnetics. Porting our finite element calculations to GPU is possible but pointless because the memory requirements are too high - we just got some V100s on our university supercomputer but they're only the 16GB versions, and the compressed boundary element matrix of the simulation I was running just yesterday was 22GB alone, let alone all of the other data! It's nothing like people who do simulations in, for example, fluid dynamics, because the interactions they look at are generally local, whereas in our field we have a dominating non-local calculation.
If you instead use finite differences to tackle the problems we look at, then it's much less of a problem, but if you try and get into multi-GPU things you quickly run into the same memory issues because one of the calculations requires repeated FFT-accelerated 3-D convolution to run in a reasonable amount of time, and in-place FFTs on multi-GPU can only be done on arrays that are < (memory on each GPU)/(number of GPUs) because there needs to be an all-to-all communication. If you have 4 x 16GB V100s (which is $36k in my country), then that means the array you need to transform must be at most 4GB, and in practice it's less than that because we need to store other data on the GPU. That hugely limits the problems you can look at.
This is in fact what makes Intel's Xeon Phi so appealing to some.
It also seemed like it would go well with a generic radio subsystem, so you could compile in hardware support for new wireless standards that come out after your hardware did (essentially an FPGA SDR).
It seems like there could be lots of uses for being able to enable hardware acceleration on demand for certain tasks a program needs to do a lot of.
Seems like if integrated properly it could be a great component in a general-purpose PC to future-proof it against changing needs and requirements.
You really can't just have an FPGA that you recompile to a different architecture within a split second. They typically require compiling, downloading, and testing the architecture you're deploying.
Until someone comes up with a much faster-to-deploy FPGA, you'll likely not see enough efficiency gains from having a codec-specific processor, given the time it takes to reconfigure the FPGA for each codec.
When a program installs, it could compile the hardware acceleration for your platform and test it. Can you not, rather rapidly, directly activate that compiled design every time you launch the program in the future?
In that same vein, one-time compilation of a hardware video decoder when a new codec is released, and simply loading it when you want to play a video, seems infinitely easier than upgrading your desktop GPU or CPU to one that supports hardware acceleration for the new codec.
It is literally a DIY CPU architecture. So, unless you know exactly what it is about your current CPU's architecture that is holding you back, you won't be able to benefit from an FPGA-based design.
So that's been a huge barrier to exploring the problem spaces that GPUs are currently used for, like AI and physics simulations. Hopefully using languages like Go/Elixir/Erlang/MATLAB/Octave etc. will alleviate that to some degree.
My gut feeling is that in 3-5 years we'll reach the limits of what SIMD can accomplish and we'll find it difficult to do the kinds of general-purpose MIMD computing needed to move beyond the basic building blocks of AI like neural nets and genetic algorithms. I stumbled onto these links a month and a half ago and I think something like this will make writing generalized/abstract highly parallelized code tractable again:
Original at: https://news.ycombinator.com/item?id=17419917
Why? Because we've seen decades of research in this area, so at least we have some tools (C for FPGA isn't ideal, but still...), and a lot of the general mapping of which algorithms belong on an FPGA.
And Amazon FPGA instances have existed for over a year.
So in that time, if there were worthwhile services to offer with FPGAs, people would have offered them, or at least started. And sure, programming complexity is a barrier, but entrepreneurs and VCs love competitive barriers. So where's all the VC funding going towards this area? Where are all those startups?
HardCaml's pretty great though, IMHO. But then again I'm already biased towards OCaml, and I get to work with its creator!
FPGAs are often used to test/design not a "circuit" but an ASIC (application-specific integrated circuit) that you will then go on to actually build. Or, if your application doesn't have the volume to support the ASIC engineering costs but can support the FPGA unit cost, you just leave it as an FPGA. There are millions of devices (industrial machines, research, testing, etc.) where this makes sense.
And not all FPGAs lose state when powered down. For those that do, there's typically an easy way to hook up a little flash chip and bootloader to reload the state automatically on reset. For those that don't have that option (e.g., currently working with an ABB robot, a Mecco laser, and a Rockwell PLC which all contain FPGAs; only the Mecco can boot itself), you store state in battery-backed SRAM and just leave the RAM on at all times.
To be pedantic, an ASIC (and any IC really) is a circuit, as denoted by the "C" standing for "circuit".
In addition, you get the opportunity to both update the part and/or change its behavior via (for instance) a user setting.
We use FPGAs in cell sorting because we need to make decisions off of high dimensional data with low latency. The cells moving through our system have velocities higher than 1 m/s. They flow past a set of lasers and wind up in a droplet less than a millisecond later so we need to make a decision whether or not to sort a droplet within that time frame. We don't use ASICs because we don't move enough volume to justify the startup cost.
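To give a feel for the shape of that logic, a made-up toy version (not our actual classifier; channel count, weights, and widths are invented): detector samples arrive every clock, and a registered threshold decision comes out a fixed couple of cycles later, well inside the droplet window.

    // Toy sort-decision sketch: weight a few detector channels, compare the
    // sum against a threshold, and register the decision. Real gating logic
    // is more involved, but the fixed, few-cycle latency is the point.
    module sort_decision #(
        parameter W = 14                        // ADC sample width (assumed)
    ) (
        input  wire         clk,
        input  wire [W-1:0] ch0, ch1, ch2,      // detector channels
        input  wire [W+7:0] threshold,
        output reg          sort_pulse           // fire the deflection logic
    );
        reg [W+7:0] score;
        always @(posedge clk) begin
            // Stage 1: simple weighted sum of the channels
            score      <= ch0 + (ch1 << 1) + (ch2 << 2);
            // Stage 2: threshold compare; decision valid 2 cycles after input
            sort_pulse <= (score > threshold);
        end
    endmodule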
I bought all these things off the shelf. There are definitely some markets that are well served by FPGAs but are still too small for actual fabrication processes.
Which isn't to say every programmer shouldn't be forced to struggle through a Verilog or VHDL project - that'll put some hair on your chest and the fear of god into you. I'll see your async and raise you an "always @(posedge clk)"
I personally know of many embedded systems using FPGAs (spacecraft and medical applications), signal processing such as radar, many PCIe cards, etc.
There's icestorm but the supported hardware is not exactly high performance, IIRC the biggest chips supported have around 8K LUTs.
First result on Google says Quartus uses its own libcurl which is broken, and you should copy the OS libcurl to /opt/altera/quartus/lin64/lib/whatever. Lots of issues like this.
> 2. What synthesis targets are supported by Yosys?
> Yosys is retargetable and adding support for additional targets is not very hard. At the moment, Yosys ships with support for ASIC synthesis (from liberty cell library files), iCE40 FPGAs, Xilinx 7-Series FPGAs, Silego GreenPAK4 devices, and Gowinsemi GW1N/GW2A FPGAs.
> Note that in all this cases Yosys only performs synthesis. For a complete open source ASIC flow using Yosys see Qflow, for a complete open source iCE40 flow see Project IceStorm. Yosys Xilinx 7-Series synthesis output can be placed and routed with Xilinx Vivado.
It's about half the price of the other FPGAs I've seen commonly used for hobbyist projects, doesn't need any separate programmer board to interface to your computer for programming, and has an open source toolchain available.
I've not played with it, as it just showed up in a mailing from Sparkfun about new products.
 such as https://www.sparkfun.com/products/11953
However if you're doing the sort of work which isn't just hobbyist stuff, you're going to spend 1000s on a board. This is the sort of thing commercial developers will be using: https://www.xilinx.com/products/boards-and-kits/ek-v7-vc707-...
Edit: Seems like some experimental work is being done with XLA (and llvm) https://www.google.ca/amp/s/www.nextplatform.com/2018/07/24/...
What about a JIT that compiles hot code sections to FPGA? (Though compilation sounds too slow)
EDIT "one or two seconds or at least 100's of milliseconds." https://electronics.stackexchange.com/questions/212069/how-l...
Well, the host is just on the same die, but it's still there.