
FPGAs Have the Wrong Abstraction for Computing - matt_d
https://www.cs.cornell.edu/~asampson/blog/fpgaabstraction.html
======
nominatronic
"Vendors keep bitstream formats secret, so Verilog is as low in the
abstraction hierarchy as you can go. The problem with Verilog as an ISA is
that it is too far removed from the hardware."

I wonder if the author is aware that the bitstream formats for both Lattice
iCE40 series and Xilinx Virtex 7 series FPGAs have now been reverse
engineered, and there is a complete open source toolchain that can be used for
these. So Verilog is no longer as low as you can go.

Efforts of this type are also underway for other parts and there is a growing
movement in this direction - see talks from Clifford Wolf at recent CCC
events.

~~~
typon
Except these are not mainstream efforts supported by vendors, which means
that they're effectively off limits to serious developers at large
engineering firms, who are the typical customers of FPGA hardware.

~~~
Aissen
Yes, exactly like Linux, Apache, Docker or your favorite language were off-
limits to serious developers in large engineering firms at some point in time.
Things are evolving, so we'll see how it goes.

------
ianhowson
Two thoughts:

* Most of the algorithms that we want to work with in this domain are doing arithmetic operations on ints and floats. This isn't super difficult to do in RTL, but it's like implementing C++ objects in assembler. You _can_ do it, but you need to think harder than you should.

* FPGAs make you worry about timing. This is a massive shift in thinking for software people. It's also not a value-add; I don't _want_ to care about timing. And it enforces chip-wide dependencies (you can have separate clock domains, but not many of them).

If you simplify the model to "pipelines of arithmetic ops" and then provide an
abstraction that eliminates timing (e.g. all ops run in a fixed number of
clock cycles and the compiler automatically pipelines them where necessary)
then I think you'd have something usable. But this is basically a GPU with a
lot of SRAM. Such a constrained problem would run extremely well on any modern
GPU or SIMD machine, without the power and cost and obscurity constraints of
FPGAs.
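
To make the "fixed-latency ops" point concrete, here's a minimal sketch (my
own illustration, not from the article) of what such an abstraction would
automate. Today you insert the pipeline registers and track the latency by
hand:

    // A hand-pipelined multiply-add: the programmer adds the pipeline
    // registers and must remember the result is valid 2 cycles after
    // the inputs arrive.
    module mul_add_pipelined (
        input  wire        clk,
        input  wire [15:0] a, b,
        input  wire [31:0] c,
        output reg  [31:0] y          // valid 2 cycles after a, b, c
    );
        reg [31:0] prod, c_d;
        always @(posedge clk) begin
            prod <= a * b;            // stage 1: multiply
            c_d  <= c;                // delay c to stay aligned with prod
            y    <= prod + c_d;       // stage 2: add
        end
    endmodule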

~~~
justaaron
I DO want access to timing as most of what I want to use an FPGA to implement
directly relates to timing. Think: real-time audio.

~~~
reitzensteinm
Is this really the case? The timing the parent is talking about is fractions
of a cycle, maybe five orders of magnitude above audible frequencies.

How could you take advantage of exposed timing?

~~~
justaaron
point being: Control. Precise control over timing is required for
deterministic temporal activities.

Removing precise control over timing from the language stack one uses to
program FPGAs is removing a desirable feature for many of their use cases.

If one is interested in glossing over all of this with abstraction, why is
one wishing to use an FPGA at all?

I will reverse the question and say: "in which scenarios is someone hoping to
avoid addressing precise timing constructs in FPGA programming?"

Obviously I'm not referring to clock propagation delay or quantum
entanglement etc., LOL. I mean the intentional, macro-level stuff wrt
"timing".

~~~
reitzensteinm
Parent _was_ referring to clock propagation delay (resulting in multiple clock
domains on the same chip), hence my confusion.

If you're talking about higher level timing, nobody is arguing for taking that
away from you.

~~~
justaaron
I agree. I've seen SystemVerilog used for simulating such propagation delays,
but it's not exactly a language-abstractable concept yet, precisely because
it involves the actual hardware gate implementation: such black-box
simulations estimate average propagation delay as a fixed function of results
obtained by testing specific functionality blocks, in specific devices, a
specific gate count away from I/O pins, etc. It would be highly desirable to
have HDL-level modeling of this stuff in the abstract sense, although I
confess to not knowing how that could possibly work, given the above.

------
eebynight
Is it just me or did it sound like the author was complaining the whole time?

"FPGAs aren't evolved like modern processors with standardized programming
models, so we must throw out the current model but I have no idea what is
better."

Having worked with FPGAs, I can understand his complaints about the
toolchains; they absolutely suck and are mostly closed source. This
technology is just like microprocessors in their infancy: it's evolving.

FPGAs have made crazy progress in the last decade and are getting to the
point where it's now affordable for hobbyists and consumers to work with
them, instead of just aerospace & defense contractors with massive hardware
budgets.

 _The problem with Verilog as an ISA is that it is too far removed from the
hardware. The abstraction gap between RTL and FPGA hardware is enormous: it
traditionally contains at least synthesis, technology mapping, and place &
route—each of which is a complex, slow process. As a result, the
compile/edit/run cycle for RTL programming on FPGAs takes hours or days and,
worse still, it’s unpredictable: the deep stack of toolchain stages can
obscure the way that changes in RTL will affect the design’s performance and
energy characteristics._

Yes, RTL differs significantly from the actual hardware of the device (LUTs,
memory, peripherals, etc.), but from a design standpoint, I don't see
anything else that would make more sense to work with. FPGAs are significant
BECAUSE of the fact that you get to build and design at that level. Let's not
forget that ISAs are a higher level abstraction of RTL...

~~~
nabla9
If I understand your argument correctly, you are saying that complaining is
not justified when FPGAs have made crazy progress. Can you explain why making
crazy progress means that we should continue using the same abstractions, or
that the abstractions are right?

GPUs made crazy progress, but they changed because there was a better way to
make even crazier progress.

------
nonlinearzone
I have designed ASICs and FPGAs for nearly 30 years, and seen the evolution
of this technology first hand. To say that FPGAs have the wrong abstraction
is to not understand what an FPGA is and what it is intended to accomplish.

Transistors are abstracted into logic gates. Logic gates are abstracted into
higher-order digital functions like flip-flops, muxes, etc. It is the mapping
of algorithms/functions onto gates that is the essence of digital design.
This is difficult work that would be impossible at today's scales (5-billion+
transistors) without synthesis tools and HDLs. And, given that an ASIC mask
set costs $1MM+ for a modern geometry, it needs to be done right the first
time (or at least the 2nd). Furthermore, the mapping to gates needs to be
efficient; throwing more gates at a problem increases area, heat, and power,
all of which need to be minimized in most contexts.

My first job out of college was designing 386 motherboards. Back then we were
still using discrete 74xx ICs for most digital functions. The boards were
huge. PLDs allowed better integration and were cost effective, since a single
device could implement many different functions, reducing board area and
power consumption. CPLDs moved this further along.

FPGAs grew out of PLDs/CPLDs and allowed a significantly higher level of
integration and board area reduction. They offered a way to reduce the cost
of a system without requiring the investment and expertise required for an
ASIC. But an FPGA is itself an ASIC, implemented with the same technology as
any other ASIC. So, FPGAs are a compromise; the LUTs, routing, etc. are all a
mechanism to make a programmable ASIC. Compared to an ASIC, however, FPGAs
require more power and can implement less capability for a given die size.
But they allow a faster and lower cost development cycle. To bring this back
around, the LUTs and routing mechanisms are functions that have been mapped
to gates. To use an FPGA, algorithms still need to be mapped onto the LUTs,
and this is largely the same process as mapping to gates.

This article was pointless; even the author acknowledges: "I don’t know what
abstraction should replace RTL for computational FPGAs." And, "Practically,
replacing Verilog may be impossible as long as the FPGA vendors keep their
lower-level abstractions secret and their sub-RTL toolchains proprietary." As
I have argued above, knowing the FPGA vendors' lower-level abstractions won't
make the problem any better. The hard work is mapping onto gates/LUTs. And
that analogy is wrong: "GPU : GPGPU :: FPGA : " An FPGA is the most general
purpose hardware available.

The best FPGA/ASIC abstraction we have today is a CPU/GPU.

~~~
atoav
As an autodidact with a focus on analogue circuits and experience in
microcontroller stuff, the thing that always threw me off about FPGAs is that
there was never a real _oh-don’t-mind-me-I-am-just-looking_ kind of FPGA
environment.

When I started with MCUs I started with an Arduino. The thing it did for me
was give me a feeling for when to use a microcontroller and when to use
something else entirely.

Of course the level of control I had with an Arduino was far from optimal,
but it worked out of the box and guided me into the subject (a bit like a
children's bicycle: neither fast nor special, but it helps the learner avoid
pain and frustration).

I wish I had this kind of thing in an affordable FPGA form. Simple enough to
get me hooked, with examples and good sane defaults, etc.

 _This_ is what mainstream means: idiots like me who didn’t get a formal
education on the subject but want to try things out.

~~~
ThrowawayR2
Here you go:

Cheap FPGA boards for educational purposes:
[https://store.digilentinc.com/fpga-for-
beginners/](https://store.digilentinc.com/fpga-for-beginners/)

The software is free: [https://www.xilinx.com/products/design-tools/ise-
design-suit...](https://www.xilinx.com/products/design-tools/ise-design-
suite/ise-webpack.html)

The hard part is several semesters worth of textbooks to go through that cover
digital logic (try Mano's " _Digital Design: With an Introduction to the
Verilog HDL_ " to start with) through computer architecture in order to know
what to do with the board.

~~~
morphle
A much, much larger and better $113 FPGA development board with a free 1-year
license: [https://www.microsemi.com/existing-
parts/parts/139680](https://www.microsemi.com/existing-parts/parts/139680)
You never pay the $159 list price. Besides, the FPGA alone would cost you
>$350 retail. It arrives with a RISC-V preprogrammed and 'hello world' type
demos for wifi and USB.

~~~
lnsru
Really exotic and complex board to start with. No community support (think
about Digilent or Terasic). I also guess, no examples. I visited a seminar
about RISC-V and Microsemi presentation was really weak on this topic. Do not
recommend this for getting started. Only for experienced users looking for
pain. Cheap and affordable board is for example Max1000 from arrow.

~~~
morphle
Examples: RISC-V softcore, ADC, wifi, 12.5 Gbps SerDes(!), tic-tac-toe,
console echo. [https://github.com/Future-Electronics-Design-
Center/Avalanch...](https://github.com/Future-Electronics-Design-
Center/Avalanche-Eval-Board) [https://github.com/RISCV-on-Microsemi-
FPGA/PolarFire-Eval-Ki...](https://github.com/RISCV-on-Microsemi-
FPGA/PolarFire-Eval-Kit)

Community support is indeed just beginning, but the RISC-V community and the
HiFive1/SiFive community support these PolarFire FPGAs. Even 50 RISC-V
softcores fit on this FPGA.

------
tasty_freeze
I've done three ASICs using only schematics in the 1980s (back when 2.5 um was
the bomb). I've done ASICs, FPGAs, but mostly full custom designs since then.
And this article is wrong-headed on many levels.

Verilog is an event-driven modeling language. It easily describes large
collections of processes where a given process is triggered for reevaluation
any time one of its inputs changes. That is what it was designed for back in
the 80s. Using it for automatic synthesis of logic came later.

If your mental model is that Verilog is an ISA then things will be very
confusing. Programming language semantics and ISAs are two different things.

Yes, the Verilog language is pretty ugly. But it allows low level modeling at
the primitive level, even if that primitive is a single logic gate. And like
any programming language, hierarchy is used to build useful abstractions for
the particular problem domain.
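
For instance, a trivial sketch of those semantics: the same text is an
event-driven simulation process first, and a circuit only by later
reinterpretation.

    // An event-driven process: the body re-runs whenever a, b, or sel
    // changes. Simulation semantics came first; synthesis later learned
    // to reinterpret this same text as a 2:1 mux.
    module mux2 (input wire a, b, sel, output reg y);
        always @(a or b or sel)
            y = sel ? b : a;
    endmodule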

~~~
ncmncm
The article is not about using Verilog to model ASICs. He even says himself
it's fine for that.

He wants to compute with FPGAs. So, if you're not interested in computing with
FPGAs, you won't agree with him. Full Stop.

~~~
dang
OK, we've added "for computing" to the title above.

------
tverbeure
The premise of the post seems to be that Verilog RTL is the lowest form of
abstraction in which you can program an FPGA, and that everything lower is
guarded by proprietary FPGA tools.

That is simply not true.

You can manually instantiate FPGA primitives (LUTs/BRAMs/DSPs) with all major
FPGAs, and if you’re truly desperate you can add placement attributes to place
these primitives exactly where you want them.

That’s as close to the metal as I can imagine (other than specifying the
actual routing network), and from there one could build up any abstraction
level one desires.
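
For example, something like this (a sketch from memory of Xilinx-style
syntax; the exact primitive and attribute names vary by vendor and tool):

    // Instantiate a LUT primitive directly and pin it to a specific
    // slice with placement attributes, bypassing inference entirely.
    module lut_demo (input wire a, b, c, d, e, f, output wire o);
        (* LOC = "SLICE_X0Y0", BEL = "A6LUT" *)
        LUT6 #(
            .INIT(64'h8000_0000_0000_0000)   // truth table: o = a&b&c&d&e&f
        ) u_lut (
            .O(o), .I0(a), .I1(b), .I2(c), .I3(d), .I4(e), .I5(f)
        );
    endmodule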

~~~
Traster
That's a bit of a grey area. Some FPGA primitives work with inference only and
finding those templates is hard. Some FPGA primitives have modules you can
instantiate, but again finding all legal parameterizations will take you
forever (it's never properly documented) and the simulation models rarely
match the hardware behaviour so it's difficult to verify.

------
bisrig
I think the heart of what the article is getting at is represented well by the
following quote:

"To let GPUs blossom into the data-parallel accelerators they are today,
people had to reframe the concept of what a GPU takes as input. We used to
think of a GPU taking in an exotic, intensely domain specific description of a
visual effect. We unlocked their true potential by realizing that GPUs execute
programs."

Up until the late 2000s, there was a lot of wandering-in-the-wilderness going
on with respect to multicore processing, especially for data-intensive
applications like signal processing. What really made the GPU solution
accelerate (no pun intended!) was the recognition and then real-world
application (CUDA & OpenCL) of a programming paradigm that would best utilize
the inherent capabilities of the architecture.

I have no idea if those languages have gotten any better in the last few
years, but anything past a matrix-multiply unroll was some real "here be
dragons" stuff. But: you could take these kernels and then add sufficient
abstraction on top of them until they were actually usable by mere humans (in
a BLAS flavor or even higher). And even better if you can add in the memory
management abstraction as well.

Point being: we're still not there for FPGA computation, though there was
some hope at one time that OpenCL would lead us down a path to decent
heterogeneous computing. Until there are some real breakthroughs in this
area, though, the best computation patterns that are going to map out using
these techniques are the things we're already targeting to either CPUs or
GPUs.

------
phire
I'm annoyed this post ended with an open question. I was hoping it might at
least have some ideas, as I do agree that Verilog is a horrible abstraction
layer.

However, I don't think an ISA for FPGAs can exist, at least not one that
allows for quick synthesis.

Sure, you could drop down to a layer where everything is described as LUTs,
FFs and routing, but you'd still need to run place and route before you could
execute it, and that's the expensive part of synthesis.

~~~
tachyonbeam
IMO, if what you want is to accelerate computation, what you need is basically
an FPGA with higher-level building blocks. Instead of an array of lookup
tables, you want an array of what are essentially ALUs and registers. The
abstraction should be something more akin to a dataflow graph, where you can
route data from one unit to another.

Yes, I know FPGAs already have adders, multipliers and registers inside them.
My point is, they should have more of that. For computation, the focus needs
to be on having as many of these useful high-level building blocks as
possible, and making the routing of data between these blocks as easy as
possible. For instance, routing 32-wide or 64-wide buses between these units,
instead of individual wires.

The "problem" with FPGAs is that they are trying to be able to emulate any
ASIC. If what you want is a general-purpose computational accelerator, you
need hardware that is a bit more high-level and more specialized. More
tailored for that specific purpose.
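
As a sketch of what one such building block might look like (my own
illustration, not a description of any existing product):

    // A coarse-grained processing element: a word-wide ALU plus an
    // output register, with whole 32-bit buses routed between cells
    // instead of individual wires. The op select would come from
    // configuration, not computation.
    module cgra_pe #(parameter W = 32) (
        input  wire         clk,
        input  wire [1:0]   op,
        input  wire [W-1:0] a, b,    // word-wide inputs from neighbors
        output reg  [W-1:0] y        // word-wide output to neighbors
    );
        always @(posedge clk)
            case (op)
                2'd0:    y <= a + b;
                2'd1:    y <= a - b;
                2'd2:    y <= a * b;
                default: y <= a;     // pass-through, used for routing
            endcase
    endmodule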

~~~
gmueckl
More ALUs reminds me of the GreenArrays chips: hundreds of really tiny cores
arranged on a grid, each one running a small program. Pretty weird stuff. And
the canonical development tool is a very weird colorForth environment.

~~~
tachyonbeam
I saw a live presentation about GreenArrays by its creator and wasn't
impressed. His presentation was a long list of cool assembly-level tricks he
had come up with to program his own chip; it seemed like he was trying to
show off his skills to the audience. The chip has a very arcane design that
is impractical to use, which is why it has seen very little adoption.

~~~
gmueckl
I only saw a friend of mine hacking on that stuff once. Or at least he tried
to. I don't actually know if he ever got to a point where he had something
interesting running. I know that I would never put up with that colorForth
environment, though.

Even though the GreenArrays design may be quite flawed, there might be use
cases where the array-of-small-cores idea could be put to use. I'm also
reminded a bit of the XMOS multicore microcontrollers for some reason,
although they are quite different. Their core links can work even between
microcontroller packages, allowing the creation of decently sized grids. But
I can't think of a good use case for a large grid of those. This is more
about networking microcontrollers in an embedded context.

------
mikewarot
FPGAs have the wrong architecture. Routing fabrics are a premature
optimization. They need to be a 2D array of 4:4 lookup tables: one bit in
from each Cartesian neighbor, a latch, and one bit out to each of them. It's
Turing complete. You alternate the latch clocks, thus preventing all timing
issues. The delays are predictable. You can route almost trivially. You can
route around defects. You can prove that programs will work as designed.

See my rants for the last decade about "Bitgrid" if you want to know more.
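
A minimal sketch of one such cell as I read the description above (the module
and parameter names are mine, not from any Bitgrid spec):

    // One grid cell: four 16-entry truth tables, one per output
    // direction, all indexed by the four neighbor inputs and latched
    // on one phase of a two-phase clock.
    module bitgrid_cell #(
        parameter [15:0] LUT_N = 16'h0000,
        parameter [15:0] LUT_E = 16'h0000,
        parameter [15:0] LUT_S = 16'h0000,
        parameter [15:0] LUT_W = 16'h0000
    ) (
        input  wire phase_clk,                   // one of the two phases
        input  wire n_in, e_in, s_in, w_in,      // one bit per neighbor
        output reg  n_out, e_out, s_out, w_out   // one bit per neighbor
    );
        wire [15:0] tab_n = LUT_N, tab_e = LUT_E,
                    tab_s = LUT_S, tab_w = LUT_W;
        wire [3:0]  sel = {n_in, e_in, s_in, w_in};
        always @(posedge phase_clk) begin
            n_out <= tab_n[sel];
            e_out <= tab_e[sel];
            s_out <= tab_s[sel];
            w_out <= tab_w[sel];
        end
    endmodule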

~~~
ThrowawayR2
Already been done, IIRC. Algotronix had a similar nearest-neighbor
architecture FPGA in the form of the CAL1024 back in 1989†.

[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.152...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.152.6623&rep=rep1&type=pdf)
(See p. 16)

[https://www.algotronix.com/company/Photo%20Album.html](https://www.algotronix.com/company/Photo%20Album.html)

That architecture never really was particularly successful. I vaguely seem to
recall that the issue was that not having long routing lines meant that huge
swaths of the grid had to be wasted on routing signals.

† As an aside, Xilinx acquired Algotronix and evolved their design into the
XC6200, which was notable for being one of the earliest run-time
reconfigurable FPGAs.

~~~
mikewarot
If you have a symmetric grid, you can quickly rotate, flip, and otherwise
transform the program. Since all the outputs are independent, you can have
computation in a cell that has results flowing into the chip, and back out to
the edge at the same time. Delays are simply a matter of counting how many
cells a result passes, and thus a good signal to feed to an optimizing
algorithm.

~~~
ohazi
The general version of the thing you're describing is a 2d systolic array.
Designs like this already map reasonably well to standard FPGAs.

The problem with the 4:4 LUT you described above is that you almost always
want the compute cell part of your design to have inputs and outputs that are
wider than a single bit. For example, if you were designing an FIR filter as a
systolic array, you might want one filter coefficient to come from the top of
each cell, and your signal to pass through the cells from left to right. A
realistic design might have a 16 bit signal and 10 bit coefficients.

This would require 160 LUTs per cell _just_ to size the signals correctly.
You'd need a lot more in practice, because in addition to fitting your compute
cell logic, state machine logic, and arithmetic logic, you also need to waste
extra LUTs to route the arithmetic carry chains and state machine signals
across this LUT grid.

And even if you manage to do all that, you're still left with a bit-serial
design that needs to be clocked an order of magnitude faster than a design
with wider inter-cell transfer sizes to get similar performance.

When you start thinking about ways to get around some of these limitations
while keeping designs with high "spatial locality" efficient, you get
something that looks a lot like a modern FPGA.
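
For reference, here's roughly what one word-wide tap of that FIR looks like
in ordinary RTL (a sketch using the widths from the example above; rounding
and overflow handling omitted):

    // One tap of a word-parallel systolic FIR: the sample flows left
    // to right while the partial sum accumulates alongside it.
    module fir_tap (
        input  wire               clk,
        input  wire signed [9:0]  coeff,    // loaded from the top
        input  wire signed [15:0] x_in,     // signal, left to right
        input  wire signed [31:0] acc_in,   // sum from the previous tap
        output reg  signed [15:0] x_out,
        output reg  signed [31:0] acc_out
    );
        always @(posedge clk) begin
            x_out   <= x_in;
            acc_out <= acc_in + x_in * coeff;   // multiply-accumulate
        end
    endmodule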

~~~
mikewarot
If you had a systolic array that was generic and routless, you could very
easily compile (at run time) the filter coefficients directly into the logic,
updating as you go.

I'd feed inputs from the left, passing them through towards the right, then
reflect them back in a manner that would keep timing right. Sums would
accumulate and appear as outputs on the left.

If you clocked the whole array with a 2 phase clock, you'd get an answer out
every cycle, once all the pipelines were filled.

------
asaph
In case anyone else was wondering: FPGA stands for field-programmable gate
array[0]. Frustratingly, this is mentioned nowhere in the article, not even
after posing the question "What is an FPGA?" right at the beginning.

[0] [https://en.m.wikipedia.org/wiki/Field-
programmable_gate_arra...](https://en.m.wikipedia.org/wiki/Field-
programmable_gate_array)

------
ohyes
I worked on something like this at a startup, it was very interesting work,
but it is hard to convince people that it has value.

The biggest issue is that standard processors (i.e. x86_64) continue to get
faster and better, and in a straight line they're already much faster than an
FPGA. Secondarily, software written for an FPGA ends up being specific to
that FPGA, mostly because of the number of LUTs per chip and the routing.

So there are two things going on. One is that you need something _very_
parallel in order to have a compelling argument for implementing it on an
FPGA. The second is that, in a couple of years, the C implementation of the
same thing ends up being _faster_, because processors have improved, the
number of cores has increased, and caches have gotten better on your standard
processor. So for your code to improve, you have to rework it entirely,
possibly to the extent of rethinking the algorithm you've used. (FPGAs
advance in the same way, but a new compiler doesn't speed up your code,
because you wrote it _very_ close to the metal.)

So in reality, the issue isn't that FPGAs have the wrong abstraction, it's
that they have practically _no_ abstraction, at least when it comes to having
a real compiler that will optimize your code in a meaningful way for the chip
that you are using, the way something like GCC would. Even if you are writing
verilog or vhdl, you still need to consider the number of LUTs you have, how
the placement will work out based on the size of your different modules (and
of course, clock timings & pipeline stalls). You get some help with that stuff
from the compiler, but when you then upgrade to a bigger chip, there are
diminishing returns in the help that it provides. It is really like you are
building your own arbitrary CPU. In that regard, it is difficult to find good
people.

None of the commercial attempts at high level languages have been successful,
largely because they suck. You're stuck writing C code when you'd rather
write something that actually considers the advantages that an FPGA has (the
extremely parallel processing of bytes). Implementors need to think more in
terms of parallel graphs and tree structures that coalesce than in loops.
From that perspective, you need language primitives that actually match the
advantages of the platform (which Verilog and VHDL do, but in a low-level,
ham-fisted way). So it's an incredibly tough problem to tackle.
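
A toy example of that graphs-versus-loops mismatch (my own sketch): in
hardware, a sum is naturally a balanced tree of parallel adders, where
software habit reaches for a sequential loop.

    // Summing eight values as a tree: synthesis sees three levels of
    // adders all working in parallel, not an eight-step loop.
    module sum8 (
        input  wire [7:0]  x0, x1, x2, x3, x4, x5, x6, x7,
        output wire [10:0] sum
    );
        assign sum = ((x0 + x1) + (x2 + x3)) + ((x4 + x5) + (x6 + x7));
    endmodule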

~~~
shaklee3
This is completely incorrect. There are many serial applications that run
significantly faster on an FPGA even at one-fifth the clock of a CPU. The
reason is that you can dedicate all of the resources to performing that one
task, instead of sharing them and using instructions that are not optimized
for it. There is a reason signal processing is still predominantly done in
ASICs, FPGAs, and DSPs.

~~~
ohyes
I think you misunderstood. I said very little about the relative performance.

I know quite well that the FPGA can be faster. The point is you have to work
10-50x as hard to get there, if you want to do anything interesting. Arbitrary
FPGA application code is not a thing.

Practically speaking, by the time in your development cycle that you’ve built
your custom hardware and software, gotten it through emissions testing, and
are ready to sell it commercially, there’s a new Intel chip that is
competitive with what you’re doing, using C code. (It’s an FPGA, so you have
to deliver custom hardware with it.) FPGAs have advanced as well in this
time, but you’re not using that; you’re using a 2-4 year old one.

We were picking apart tcp packets and processing the data inside of them. What
we built worked, and it was faster than a CPU, but we had a lot of clever non-
standard custom compilers to do it.

It was really fun, challenging and interesting work, but really hard to
justify in a practical sense.

Things like signal processing are a good application mostly because they’re
well understood and the code follows the same basic pattern. Novel
applications are much more difficult.

------
amirhirsch
Reading this article and thinking to myself that I wrote this 13 years ago:
[http://fpgacomputing.blogspot.com/2006/05/methods-for-
reconf...](http://fpgacomputing.blogspot.com/2006/05/methods-for-
reconfigurable-computing.html)

------
snvzz
Somewhat related: There's now an Open Source toolchain[0] for FPGAs.

[0] [https://symbiflow.github.io/](https://symbiflow.github.io/)

~~~
analognoise
It doesn't handle DSP and BRAM primitives properly, can't even do packed
Verilog arrays, has no VHDL capability AT ALL, and supports only extremely
primitive Lattice parts (and "eventually" will support the Xilinx 7 series).

Great for blinkenlights. Useless for work.

~~~
DoingIsLearning
> Useless for work

... yet?

It's an open source project and it is trying to achieve what multiple vendors
building moats could not.

Maybe it is not production grade but maybe all it needs is time and
contributions.

~~~
lnsru
It’s a hobbyist thing. Useless in a corporate setting. I can summon the
vendor’s field application engineers when I have a problem. Guess what I can
do when I have problems with an open source toolchain and the project’s
deadline is coming. Xilinx and Altera will not support 3rd party tools, due
to this liability issue, for sure. Time and contributions will not help this
venture; the problems are not technical.

~~~
ncmncm
People used to say that about regular compilers.

Yes, when you are targeting an ASIC and a manufacturing pipeline, you need
more support. But not everybody is. In fact, 7+ billion aren't. There is no
shame in shipping a low-volume FPGA product, or even in targeting an existing
FPGA demo board. Or even a commercial SOC or PC, if it does the job.

~~~
analognoise
"People used to say that about compilers"

Yes. Very talented engineers are paid large sums of money to work on open
source compilers when the company's main concern is NOT compilers: Red Hat
wants the compilers stable on Linux, and helps out. Same for Apple. Etc.,
etc.

"Open Source" is just somebody else footing the bill because it IS NOT the
main business concern. Compilers aren't your product differentiator, but you
need them to work on your systems, so you pay talented engineers to work on
them.

This dynamic DOES NOT HOLD for FPGAs. There is no "second source" of Xilinx
FPGAs. Nobody will pitch in to help Xilinx because the only people who employ
an army of talented people with the direct niche skills needed are their
competitors.

F/OSS software is not great once you get past small, single-use tools. KiCad
became barely usable in version 5, and it's still a decade behind what we had
commercially a decade ago. And there are probably what, 500 PCB designers for
every ASIC designer? 50 for every FPGA designer?

Xilinx and Intel fund programs at schools to develop talent they can use to
further develop their systems and chips. That's remarkably NARROW.

In short: no, this tool will NEVER achieve critical mass unless Lattice
decides to support it directly, and even then it still won't work on Xilinx
chips. It is an absolute dead end, because the economics that drove the
success story of GCC will not apply.

------
tomxor
> The abstraction gap between RTL and FPGA hardware is enormous: it
> traditionally contains at least synthesis, technology mapping, and place &
> route—each of which is a complex, slow process. As a result, the
> compile/edit/run cycle for RTL programming on FPGAs takes hours or days and,
> worse still, it’s unpredictable: the deep stack of toolchain stages can
> obscure the way that changes in RTL will affect the design’s performance and
> energy characteristics.

This is basically what pushes me away from FPGAs, even though there is now
IceStorm so I can avoid Windows. As an outsider I totally agree with the
author: I love the idea of using FPGAs as accelerators, but it's way easier
to do GPGPU right now.

------
xvilka
The problems with Verilog and VHDL are widely known. This is why I
proposed[1] creating a unified LLVM-like framework, to keep the many projects
from reinventing the wheel and to let them use each other's results.

[1]
[https://github.com/SymbiFlow/ideas/issues/19](https://github.com/SymbiFlow/ideas/issues/19)

~~~
Traster
Are you aware that Intel's OpenCL compiler for FPGA uses _actual_ LLVM?

~~~
xvilka
Yes, but LLVM is not very suitable in general for FPGA/ASIC design.
FIRRTL[1][2] looks much more complete and fitting among the various
alternatives.

[1]
[https://bar.eecs.berkeley.edu/projects/firrtl.html](https://bar.eecs.berkeley.edu/projects/firrtl.html)

[2]
[https://github.com/freechipsproject/firrtl](https://github.com/freechipsproject/firrtl)

------
jgeada
This is a useless post: bunch of complaining about the author’s own ignorance
of FPGA and computational fabrics, willful misunderstanding of the differences
between FPGAs, GPUs, ASICs, and lastly, not even an attempted proposal for a
way forward past the author’s inabilities. Why did I just waste time reading
this crap?

~~~
ninjacatex
I’m not quite sure I understand your discontent.

You assert the author displays a “willful misunderstanding of the differences
between FPGAs, GPUs, and ASICs.” What differences did the author misrepresent?
And is it fair to call it “willful”?

From what I can tell, the ideas should be taken at a 10,000 ft view: Verilog
as an interface to (computational) FPGAs is not good enough because it’s
inaccessible to domain scientists in other fields (where CUDA and similar
are). The way I read the post, the author is framing a research question:
“how do we design a new abstraction for FPGAs to do for them what we’ve done
for GPUs?”

------
dooglius
> Even RTL experts probably don’t believe that Verilog is a productive way to
> do mainstream FPGA development. It won’t propel programmable logic into the
> mainstream. RTL design may seem friendly and familiar to veteran hardware
> hackers, but the productivity gap with software languages is immeasurable.

The author seems to view "mainstream" as meaning being as easy as CPU or GPGPU
programming. I don't think it makes much sense trying to accomplish this on
FPGAs; you're better off using CPUs, GPUs, or making something domain-specific
like TPUs. The benefit of FPGAs is that they allow you to build your own
architecture, and define data movement in a way that is specific to your
application, at the cost of increased development effort. The complexity
encountered in doing RTL arises from the inherent complexity in using FPGAs
effectively.

There is a case to be made for something in between a CPU and an FPGA that
allows easier development, but gives you some ability to control your data
movement to get higher performance. Processor meshes, like what Xilinx is
including with their upcoming Versal chips, might be a good solution to this
(though in typical FPGA vendor fashion this too is locked behind proprietary
tooling).

~~~
wwwigham
> I don't think it makes much sense trying to accomplish this on FPGAs; you're
> better off using CPUs, GPUs, or making something domain-specific like TPUs.

I dunno, we have a lot of libraries and frameworks that make it their purpose
to efficiently run code on either a CPU or GPU (or TPU, if we're talking
about things like TensorFlow); an FPGA is conceptually just an extension of
that, except, ofc, that the cost of on-the-fly FPGA recompilation usually
outweighs any performance benefit you'd get from the FPGA-based hardware
accelerator, so the only reasonable uses are one-and-done layouts with
occasional patches/updates. If we could get an FPGA gate pipeline that was as
efficient as the modern shader pipeline, I think what we could use them for
could expand greatly - imagine a tracing JIT that could dynamically load hot
sequential code into an FPGA to speed it up, the same way we can load matrix
ops into a GPU today.

With where FPGA toolchains are today, it just doesn't work, _because_ the
compilation process is so bad and slow, and nobody except the engineers
within Xilinx and Altera even has a reasonable shot at making it better,
since both the compiler and the compilation target are closed. Part of it is
simply that these compilation passes aren't really configurable (beyond
expressing layout and pipelining preferences) - if I could get near-instant
bitstream generation in exchange for 30% extra space used, I'd use that
setting for all but my final builds, and I'd strongly consider invoking it at
runtime.

~~~
LargoLasskhyfv
Hm. Are you aware of [1] [https://www.microsoft.com/en-
us/research/project/emips/](https://www.microsoft.com/en-
us/research/project/emips/) & [2]
[https://blog.netbsd.org/tnf/entry/support_for_microsoft_emip...](https://blog.netbsd.org/tnf/entry/support_for_microsoft_emips_extensible)
?

Especially the two papers which come up when you search like this? [3]
[https://duckduckgo.com/?q=microsoft+extensible+mips](https://duckduckgo.com/?q=microsoft+extensible+mips)

MIPS-to-Verilog, Hardware Compilation for the eMIPS Processor. Karl Meier and
Alessandro Forin, Microsoft Research, September 2007. Technical Report
MSR-TR-2007-128.

[4] [https://www.microsoft.com/en-us/research/wp-
content/uploads/...](https://www.microsoft.com/en-us/research/wp-
content/uploads/2016/02/tr-2007-128.pdf)

and

Extensible Microprocessor Without Interlocked Pipeline Stages (eMIPS), the
Reconfigurable Microprocessor. A thesis by Richard Neil Pittman, submitted to
the Office of Graduate Studies of Texas A&M University in partial fulfillment
of the requirements for the degree of Master of Science, May 2007. Major
subject: Computer Engineering.

[5]
[https://oaktrust.library.tamu.edu/bitstream/handle/1969.1/59...](https://oaktrust.library.tamu.edu/bitstream/handle/1969.1/5976/etd-
tamu-2007A-CECN-Pittman.pdf)

------
AshamedCaptain
The entire article can be summarized in this paragraph:

    
    
      The problem with Verilog as an ISA is that it is too far removed from the hardware. The abstraction gap between RTL and FPGA hardware is enormous: it traditionally contains at least synthesis, technology mapping, and place & route—each of which is a complex, slow process. 
    

But there is nothing below a netlist that is still mappable to different
types of FPGAs (different FPGAs do not even necessarily have LUTs at all,
much less the same LUT types!), so I fail to see the use of it.

A GPU ISA changes very little between generations of the GPU; however, a
rather small change in the structure of the FPGA usually implies a completely
new bitstream. So even if there were bitstream documentation, it would be
very FPGA-specific, unlike an ISA.

P&R does not look that much different from compiling, in the sense that you
can spend as much effort as you want on it in order to produce a better or
worse result for your machine.

~~~
nabla9
The solution that emerges is not just a new hardware abstraction for the
current generation of FPGAs. It would be similar to the abuse of GPUs in the
early 2000s.

You must design a completely new type of computational FPGA (CFPGA?) that can
have a good generic hardware abstraction. It might sit a little above the
logic blocks, hardware blocks, and routing.

~~~
RantyDave
This has happened (is happening). Intel's latest FPGAs have potentially
thousands of floating-point multiply/add blocks, and Xilinx has a "many cores
+ on-chip network" design on the way.

------
ChuckMcM
That was an interesting read, and Adrian has done a lot of interesting work,
but I can't agree with the theme that FPGAs are just another compute engine
with a wonky ISA.

I've met a number of software people who are now developing FPGA designs, and
hardware people who are in the same place (even looking to hire US Persons
with those skills for SDR work :-), and invariably the software folks get a
brain cramp because it "looks" like code, but it doesn't "work" like code.

I have often wondered how close this experience is to that of English
speakers listening to a conversation about a software design and
implementation, becoming frustrated because they think they understand each
word but aren't understanding any of the semantics of the conversation.

I'm one of those people who essentially double majored in EE and CS[1] which
is sort of like growing up in a house where one parent speaks one language and
the other speaks another. In those situations you tend to naturally accept
when one discipline (or language) doesn't overlap cleanly with the other.

As a result, I disagree with the author that an FPGA is just a computation
engine with a "wonky" ISA. Thinking about it that way is somewhat limiting.
That said, some of the things that give FPGAs their programmability might be
useful additions to GPUs; redefining the internal data paths of the GPU to
allocate more "bits", changing the dynamic range for different operators,
might allow some interesting things to be done.

HDLs themselves are interesting problems, because you have a target, the
switching matrix and logic elements of an FPGA, and text. Designing a way to
express the capabilities of one that translates to the other is very hard. So
hard, in fact, that there are circuits you can construct in the FPGA that you
cannot express directly in an HDL, and there are things you can write in an
HDL that cannot be synthesized into a set of linked logic elements and a
clock. On the surface this might seem like saying "Well yeah, you can write
things in assembly that the compiler won't generate," but it goes deeper than
that. From programming what the pins on the package do to changing clock
networks to get timing closure on complex designs, an FPGA is not a stored
program computation device unless you configure it as one.

[1] Computer Engineering as a major wasn't a "thing" yet so I ended up taking
all the EE major classes and nearly all the CS major classes (I missed out on
some of the logic oriented math classes of CS because of the physics and
materials science requirements on the EE side).

------
yifanlu
Instead of complaining about the quality of the article, I’m thinking about
the open question at the end. What would be a higher level abstraction for
FPGA development?

Is it possible to create a language around FSMs? Most hardware seems to have
two parts: the actual logic that implements some functionality, and then some
FSM that implements the control logic. The FSM may also have a lot of
implicit/assumed states (like a counter for some timeout). Maybe a higher
level language could expose these design patterns in a nicer way and hide all
the messy low level details (like sequential/combinational logic, connecting
ports and wires, matching signal widths, etc).
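
Something like this is the boilerplate such a language could hide (a sketch;
the module, states, and signals are made up for illustration):

    // A tiny timeout FSM in plain Verilog: explicit state encoding, an
    // explicit counter for the implicit "waiting" state, and manual
    // reset handling. An FSM-level language could express only the
    // states and transitions.
    module timeout_fsm #(parameter TIMEOUT = 100) (
        input  wire clk, rst, start, done,
        output reg  busy, timed_out
    );
        localparam IDLE = 2'd0, WAIT = 2'd1, FAIL = 2'd2;
        reg [1:0] state;
        reg [7:0] count;   // the implicit/assumed state: a timeout counter
        always @(posedge clk) begin
            if (rst) begin
                state <= IDLE; count <= 0; busy <= 0; timed_out <= 0;
            end else case (state)
                IDLE: if (start) begin state <= WAIT; count <= 0; busy <= 1; end
                WAIT: if (done) begin state <= IDLE; busy <= 0; end
                      else if (count == TIMEOUT) begin state <= FAIL; timed_out <= 1; end
                      else count <= count + 1;
                FAIL: busy <= 0;
            endcase
        end
    endmodule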

~~~
amirhirsch
I proposed a spreadsheet:
[https://dspace.mit.edu/handle/1721.1/45983](https://dspace.mit.edu/handle/1721.1/45983)

------
robomartin
Right, well, the problem with all articles of this kind is that the authors
are doing the equivalent of trying to force a square peg into a round hole.

You do not PROGRAM an FPGA in the software engineering sense of the word, you
CONFIGURE it.

You do not use a programming language to create your configuration bitstream;
you use a HARDWARE DESCRIPTION LANGUAGE.

You describe the hardware you want and how it is to be interconnected. You do
this either explicitly, by literally writing code that wires resources as you
specify, or by inference, using idioms, if you will, that you know result in
specific hardware within the device.
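
A classic example of such an idiom (a sketch): most FPGA synthesis tools
recognize this shape and infer a block RAM from it, rather than a sea of
flip-flops.

    // Synchronous-read single-port memory: the registered read is the
    // key idiom that lets the tool map this onto a block RAM primitive.
    module ram_sp #(parameter W = 8, parameter AW = 10) (
        input  wire          clk, we,
        input  wire [AW-1:0] addr,
        input  wire [W-1:0]  din,
        output reg  [W-1:0]  dout
    );
        reg [W-1:0] mem [0:(1<<AW)-1];
        always @(posedge clk) begin
            if (we) mem[addr] <= din;
            dout <= mem[addr];
        end
    endmodule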

Thinking of Verilog or VHDL as software programming languages is wrong and
can only lead to frustration. FPGAs are still very much the domain of
hardware engineering. Well, at least if what we are after is efficient,
high-performance results.

If, as a software engineer, you want to take advantage of an FPGA to
accelerate processing, you should work with a capable FPGA hardware engineer
to create a device with an “API” (using the term loosely) that exposes the
desired functionality. I’ve done just that more times than I can remember:
hanging a large FPGA off something like a small 8-bit 8051-derivative
processor, allowing the micro to access powerful computing resources exposed
through means easily accessible with simple C functions, in real time.

If you use the right tool for the job and do it correctly it can be blissful;
try to force a paradigm that does not match reality and all you get is
frustration.

~~~
ncmncm
The author is obviously deeply aware of all this. It is exactly what he is
complaining about.

Why should he have to hire you to do something he could do himself, given
better tooling?

It is the same problem as business people had, needing to explain their
problem to a system analyst and get programmers to code it. Then spreadsheets
came along, and 99% of business programming just totally disappeared into it.

Maybe you can do it better than he could. Doesn't matter, there's only one of
you, and what he could do would be good enough. Just like spreadsheets.

~~~
robomartin
Your perspective on this is flawed. The comparison to spreadsheets isn’t
applicable here. Not even close.

That said, I do not expect you to understand my perspective if you are not a
hardware engineer. This isn’t a put-down, just reality.

There have been many attempts to make FPGAs fit into a nice software
engineering paradigm. I can’t think of one, a single one, that compares well
to what a hardware engineer can do by treating it like hardware.

A simple example I can give you comes from my own work going back about 15
years. Part of my design needed a very high performance multiphase FIR filter.
The tools, even with all possible optimizations enabled, could not get the
performance we needed out of this chip. If we wanted to stay with that
approach the only option was to go up to a larger and faster FPGA as well as
up one speed grade. That would have cost a bundle.

Instead, I hand-placed, wired and routed the filters. As a result I was able
to get 2.5 times the performance the compiler could get. We were able to stay
with the smaller chip and squeezed even more out of it during the five year
product life cycle.

FPGAs are hardware, not software.

~~~
ncmncm
If you built the exact same product today, almost the cheapest FPGA you can
buy would be fantastically overprovisioned. So, today, all your extra effort
then would be superfluous, and you could treat the design like software, using
defaults.

Programming computers used to be like hardware; where you stored a variable on
the drum mattered because you couldn't afford to wait a whole rotation to
fetch it again. Now most code is just code, and Python is often fast enough
for production.

I know the difference. When I take packets from a 40Gbps link, I'm programming
hardware. I'm counting cycles. Most problems are not like that, and don't
deserve your attention. That doesn't mean they shouldn't be done, it just
means somebody else should do them, using less specialized tools.

~~~
robomartin
And that’s where you are wrong. Who builds the same product they were building
15 or 20 years ago? It never gets simpler unless what you are working on is
trivial. Technology has always been about pushing the limits.

You never have enough chip resources, speed and computing power to spin the
next step up in a design, much less keep up with the state of the art in any
non-trivial domain a decade or two later.

Put a different way: if you are using FPGAs correctly, they will never be
like software... because that would be wasteful of resources and performance.
If it’s easy and “like software,” it can probably be done in software with an
off-the-shelf CPU/GPU.

~~~
ncmncm
Not true. Many, many people have so much computing power to spare they program
in Python. In some cases Python isn't up to the job, and C++ on a
microcontroller isn't quite, either, but a cheap FPGA could do it easily.

What is true is that they don't hire _you_ for those projects. People who have
harder problems and enough money to bring you to bear do. But you are far from
representative.

Are they using FPGAs correctly? Who cares? They don't.

~~~
robomartin
It would sure help your case if you provided representative use cases. Look
through Xilinx application notes and the reality of FPGAs is very clear. It’s
fine to discuss these things academically, but as a practitioner and
businessman, the demarcation lines are very clear, both on technical and
financial grounds.

------
mikewarot
Just because something "has been done before" doesn't mean it's not useful
now; the context has changed. Electric cars were once the default, and then
internal combustion took over... and now they are back... but they were
definitely "done before".

Bulk computation is what we need. The maximum use of all of the transistors in
a chip, at the minimum necessary clock speed and voltage to get the job done.
Delays don't matter at all if you get a result each clock cycle.

------
inaccel
You are right that the FPGA vendors have so far not provided the right
abstraction for computing.

That's why at InAccel we developed an FPGA resource manager that allows you
to instantiate and deploy FPGAs in the same way you invoke typical software
functions. The FPGA manager takes care of the scheduling, the resource
management, and the configuration of the FPGAs from a pool/marketplace of
hardware accelerators.

That way it is easier than ever to use FPGAs the same way you use optimized
libraries for CPUs/GPUs.

And we have a free community edition. More info at:
[https://www.inaccel.com/](https://www.inaccel.com/)

------
rramadass
Has anybody here read "FPGAs for Software Programmers"
([https://www.amazon.com/FPGAs-Software-Programmers-Dirk-
Koch/...](https://www.amazon.com/FPGAs-Software-Programmers-Dirk-
Koch/dp/3319264060/ref=sr_1_1?keywords=FPGA+for+software&qid=1561470660&s=books&sr=1-1))
?

Any inputs/thoughts on how appropriate it might be for a Software Engineer to
learn and program FPGAs?

------
lawrenceyan
There's a reason why GPUs are the de facto standard for increased
performance.

------
TBF-RnD
Which vendors, if any, don't keep their bitstreams a secret?

------
Traster
What I find interesting about people who pick fights with FPGA programming is
that they entirely focus on the things that FPGA doesn't do. If you can solve
your problem on a CPU then it will practically _never_ be better to do it on
an FPGA. If you want to make FPGAs better, you need to figure out how to make
what we already do on an FPGA easier.

I think part of the problem here for example is

>That is, Verilog is to an FPGA as an ISA is to a CPU.

No! Not at all! For a start, I can literally write Verilog that won't work on
any FPGA ever:

    always @(posedge multiplier_result[8]) begin
        $display("This is nuts!");
    end

Verilog is _not_ an FPGA language; Verilog is a hardware language, of which
vendors implement a subset on any given FPGA.

Here would be my advice to someone who wants a better abstraction for FPGAs:
stop relating them to other things that behave very differently. Timing is
important, placement is important, mapping is important. If your abstraction
doesn't include these elements it is fundamentally flawed.

~~~
rachitnigam
The part you quoted is a part of a thought experiment. The author is
explicitly saying that Verilog is _not_ an ISA:

> By way of contradiction, let’s imagine what it would look like if RTL were
> playing each of these roles well.

The point of this section was to point out that the set of people who use
FPGAs as accelerators (a la MSFT Catapult) use them with the ISA abstraction:
they don’t want to care about the timing, they just want to get the dataflow
graph to accelerate the computation.

Disclaimer: I work with the author of the article on acceleration
architectures and languages.

~~~
lvoudour
Regarding the end of the article:

 _A new category of hardware that beats FPGAs at their own game could bring
with it a fresh abstraction hierarchy. The new software stack should dispense
with FPGAs’ circuit emulation legacy and, with it, their RTL abstraction_

Are you or the author working on something along these lines, and if so, can
you give an example? I'll tell you the truth: I was expecting a proposal, and
I felt kind of "robbed" when I reached the end and no alternative was given.

------
pishpash
That's not talking about an FPGA any more, more like a hypothetical FPPU.

