
FPGA Programming for the Masses - nkurz
http://queue.acm.org/detail.cfm?id=2443836
======
AndresNavarro
I think the last paragraph is key:

> We also need to revisit the approach of compiling a high-level language to
> VHDL/Verilog and then using vendor-specific synthesis tools to generate
> bitstreams. If FPGA vendors open up FPGA architecture details, third parties
> could develop new tools that compile a high-level description directly to a
> bitstream without going through the intermediate step of generating
> VHDL/Verilog. This is attractive because current synthesis times are too
> long to be acceptable in the mainstream.

This is both an ideological and a practical matter. Until the whole process
INCLUDING bitstream generation is open, I don't see FPGAs as a viable
alternative to general purpose processors.

~~~
DanWaterworth
I can't upvote this enough. I keep gearing up to try using an FPGA, but every
time I do, I am prevented from continuing by my revulsion at closed-source
tools.

~~~
nitrogen
DrDreams: it looks like a comment you made 41 days ago resulted in your
account being disabled (see
<https://news.ycombinator.com/threads?id=DrDreams>). I'm quoting this sibling
comment here for readability by those who do not enable showdead, as it is
relevant to the FPGA discussion.

 _DrDreams 29 minutes ago | link [dead]

Speaking as an embedded developer, I see a number of other embedded devs
hobbying around with FPGAs. However, I very rarely see convincing use cases
for FPGAs. This article seems to lean toward the belief many of my colleagues
have, that FPGAs are right around the corner in terms of general usefulness.
However, I disagree strongly. I find that they are highly-specialized devices.

Before reading the rest of my writing, consider that at this time, brilliant
hardware designers are putting similar amounts of work into both general
purpose CPUs and into FPGAs. However, CPUs are built from dense blocks of
special-purpose silicon for common operations, such as floating-point math.
FPGAs always have to match that dense silicon through configurable silicon,
which is less dense. Furthermore, the routing in CPUs is a known entity at
manufacturing time. On FPGAs, the routing is highly variable and must be re-
negotiated at nearly every compile cycle. That's a huge time sink, both in
terms of build time and in predicting performance, especially since the short
routes that you get at the beginning of a project typically end up being
longer by the end of it. Nowadays, we are seeing more FPGAs with
dedicated, pre-made hardware blocks inside of them, such as FPUs and even CPU
cores. These have more of a chance of catching on for general purpose
computing. Notice however, that on these devices, it's the general-purpose CPU
dominating, leaving the FPGA as a configurable peripheral, subordinate to the
dense, pre-designed silicon.

Although one may be able to match GPU performance with an FPGA, it's usually
just not worth it. It will take dozens of hours of FPGA coding and simulation.
Compiling, fitting, and the rest of the FPGA dev chain are very time-consuming
and resource-intensive compared to the speed and elegance of gcc.
Speaking of standard development practices, FPGA code is not nearly as
portable as C. It often carries special optimizations for the particular
device it was implemented on. <http://opencores.org> has a number of more
generic modules available, but still, FPGA code does not scale as well as C
code. There are add-on packages (code synthesizers) that help write FPGA
synthesis code, but they make matters especially complicated. The syntax of
Verilog and VHDL is not well designed for scaling. Speaking of these
languages, if you are used to languages written to be parsed easily, such as
Lisp, Python, or even C and Java to some extent, you will be appalled at the
structure of Verilog and VHDL. There are many redundant entities, lots of
excess verbiage, and all kinds of special cases. The languages have really
evolved very little since the days of Programmable Array Logic (PAL).

Another problem with FPGAs is the additional hardware on board needed to
configure them. It's one more component or interface that is not needed when
using CPUs. It's an additional software image to maintain, revision, store in
source control, etc. FPGAs also often require more power supplies and better
power supply conditioning than a regular CPU and often a separate clock
crystal. They are high-maintenance.

FPGAs do shine, though, in a few specific instances:
1. When there is a particular, uncommon high-speed bus protocol you need to
communicate with and cannot buy pre-designed silicon for. This does not mean,
e.g., USB. It means something like a non-standard digital camera interface or
embedded graphic display.
2. Software radio.
3. Obscure but computationally intensive algorithms, like Bitcoin mining.

I hope my words have convinced some people to cool their lust for FPGAs,
because I feel they're a bit of a dead-end or distraction for many who are
attracted to the idea of "executing their algorithm extremely fast" or
"becoming a chip designer." I have seen many students and professionals burn
up hours and hours of their time getting something to run on an FPGA which
could just as easily have been CPU-based. For example, one student implemented
a large number of PWM oscillators on an FPGA where it would have been much
simpler to use the debugged, verified PWM peripherals on microcontrollers.
Another guy I work with is intent on running CPU cores on FPGAs. This is an
especially perverse use of the FPGA. Unless you've got some subset of the CPU
which adds incredible value to the process, you're exchanging the density
of the VLSI/ASIC version of the chip for the flexible, less dense version on
FPGA. This may be useful in rare situations, such as adding an out-of-order
address generator to an existing core for speeding up an FFT, but it suffers
an incredible performance and developer time hit to get to this point._
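
To make DrDreams' verbosity complaint concrete, here is a minimal Verilog sketch (my own toy example, not from the thread): a registered 8-bit adder, with the module declaration, port directions, and clocked block that even one line of actual work requires.

    module add_reg (
        input             clk,  // clock
        input      [7:0]  a,    // first operand
        input      [7:0]  b,    // second operand
        output reg [7:0]  sum   // registered result
    );
        // the one line of actual work, wrapped in ceremony
        always @(posedge clk)
            sum <= a + b;
    endmodule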

~~~
DanWaterworth
nitrogen: Thank you for reposting.

DrDreams:

 _Furthermore, the routing in CPUs is a known entity at manufacturing time._

The 'routing' of a CPU is much more variable than the routing of an FPGA. Data
moves around a CPU based on the program that is executing. The control logic
of a CPU is the equivalent of the routing logic of an FPGA.

 _On FPGAs, the routing is highly variable and must be re-negotiated at nearly
every compile cycle._

The 'routing' of a CPU is the same. The compiler has to perform register
allocation afresh on every compilation. There's obviously a tradeoff between
fast compilation and the most efficient use of resources. Both problems are
NP-complete, I believe.

 _Nowadays, we are seeing more FPGAs with dedicated, pre-made hardware blocks
inside of them, such as FPUs and even CPU cores._

You just contradicted yourself. Previously, you said "FPGAs always have to
match that dense silicon through configurable silicon".

Your next paragraph talks about toolchain issues. This is hardly an
insurmountable problem. Someone just needs to design a high-level language
that can be synthesised; something akin to a Python of the FPGA world, if you
will.

 _Another problem with FPGAs is the additional hardware on board needed to
configure them._

I don't quite understand: do you mean the hardware that reads the bitstream,
etc., or the hardware that is required in order for the FPGA to be
configurable, like routing, LUTs, etc.?

 _Another guy I work with is intent on running CPU cores on FPGAs._

I do agree with you here; this is a weird perversion if the purpose is not
eventually to create an ASIC.

I also don't believe that future processors will be FPGAs, but I do believe
they will be a lot closer to FPGAs than CPUs.

~~~
caxap
_Someone just needs to design a high-level language that can be synthesised;
something akin to a Python of the FPGA world, if you will._

The advantage of FPGAs is that they allow nontrivial parallelism. On a CPU
with 4 cores, you can run 4 instructions at a time (ignoring pipelining).
On the FPGA, you can run any number of operations at the same time, as long as
the FPGA is big enough. The problem is not the low-level nature of hardware
description languages, the problem is that we still don't have a smart
compiler that can release us from the difficulty of writing nontrivial
massively-parallel code.
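
To make that concrete, here is a toy Verilog sketch (my example, nothing from the article): four multiply-accumulates that a CPU would schedule onto its execution units a few at a time, but that the FPGA fabric computes simultaneously on every clock edge, each in its own patch of logic.

    module parallel_mac (
        input              clk,
        input      [15:0]  a0, b0, a1, b1, a2, b2, a3, b3,
        output reg [31:0]  r0, r1, r2, r3
    );
        // all four multiply-accumulates update concurrently;
        // widen the port list and the fabric does eight, or eighty
        always @(posedge clk) begin
            r0 <= r0 + a0 * b0;
            r1 <= r1 + a1 * b1;
            r2 <= r2 + a2 * b2;
            r3 <= r3 + a3 * b3;
        end
    endmodule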

~~~
VLM
"The advantage of FPGAs is that they allow nontrivial parallelism."

Want a system on a chip with 2 cores leaving plenty of space for an ethernet
accelerator, or 3 cores without space for the ethernet accelerator? It's only
an include and some minor configuration away.

"the problem is that we still don't have a smart compiler that can release us
from the difficulty"

We still don't have a smart programmer... it's hard to spec. Erlang looking
elegant doesn't magically make it easy to map a non-technical description of
requirements to Erlang.

------
VLM
The article missed a VERY important modern FPGA design technique: using raw
VHDL/Verilog the way a "PC programmer" would use hand-optimized assembly. In
other words, most of the time, not very much.

So the "inner loop" which needs optimizing is a crazy deep complicated DSP
pipeline, obviously you implement that in FGPA "hardware" directly in a HDL.
On the other hand, you'd be crazy to implement your UI or a generic protocol
like TCP/IP in hardware (unless you're building a router or switch...).
Something like I2C is right about on the cusp where you're better off writing
it in plain ole C or implement it as a "hardware" peripheral in the FPGA.
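
As a minimal illustration of that split (my own sketch, purely hypothetical): the "hardware" half might be a multiply-accumulate cut into clocked pipeline stages, the kind of structure that is natural in an HDL and has no direct equivalent in C.

    module mac_pipe (
        input              clk,
        input      [15:0]  sample, coeff,
        output reg [39:0]  acc
    );
        reg [31:0] prod_s1;  // stage 1: raw product
        reg [31:0] prod_s2;  // stage 2: would hold rounding/saturation
        always @(posedge clk) begin
            prod_s1 <= sample * coeff;  // a new product every cycle
            prod_s2 <= prod_s1;         // placeholder pipeline stage
            acc     <= acc + prod_s2;   // accumulate two cycles later
        end
    endmodule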

Peripheral ... of what you ask? Well, depending on your license requirements
and personal feelings there are a zillion options like microblaze/picoblaze
from the FPGA mfgr, or visit the opencores website and download a Z80 or a
6502 or a PDP-10 or whatever floats your boat for the high level. Yes, a full
PDP-10 will fit easily in one of the bigger hobby-size Spartan FPGAs. It's not
1995 anymore; you've got enough space to put dozens (hundreds?) of picoblaze
cores on a single FPGA nowadays.

There's no point in hand-optimized HDL to output "hello world", just like
there's no point in the antique technique of software-driven "bit banged"
serial ports: just "include" an off-the-shelf opencores UART to simplify your
UI code.
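
For illustration, pulling in such a core is a single parameterized instantiation in Verilog; the module and port names below are hypothetical, and a real opencores UART will have its own interface.

    // hypothetical port names; consult the actual core's documentation
    uart #(
        .CLK_HZ (50_000_000),  // system clock frequency
        .BAUD   (115200)       // serial bit rate
    ) debug_uart (
        .clk     (clk),
        .rst     (rst),
        .tx_data (tx_byte),    // byte to transmit
        .tx_en   (tx_strobe),  // pulse high to start sending
        .tx_busy (tx_busy),    // high while shifting bits out
        .txd     (uart_txd)    // serial line to the outside world
    );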

I've been in this game a long time and this is the future of microcontrollers
and possibly general purpose computing. The engineering "skill" of searching a
feature matrix to find which PIC microcontroller has 3 I2C peripherals, 7
timers, and 2 UARTs in your favorite core family is going to be dead; you'll
just "include uart.h", instantiate it 2 times, and pick your favorite core, be
it a Z80 or a microblaze or an ARM or a SPARC.

In the future I think very few people "programming" FPGAs are going to be
writing anything other than a bunch of includes and then doing everything in
the embedded synthesized Z80. The "old timers" who actually write in HDLs are
going to look down on the noob FPGA progs much like the old assembly coders
used to look down on the visual basic noobs, etc.

~~~
robomartin
If the focus of your work is to replace discrete embedded processor blocks
with FPGAs, sure, copy, paste, and include might get you pretty far. That is
not the case for all applications, not by a long shot. For example, I had to
build a DDR memory controller from scratch in order to squeeze the last clock
cycle of performance out of the device. Off-the-shelf cores are often very
--very-- general purpose, badly written, and poorly documented. The same can
be true of real-time image processing constructs, where something like a hand-
coded polyphase FIR filter can easily run twice as fast as the plug-and-play
modules floating about.

Then there's the element of surprise. If, for example, I was developing an
FPGA-based board for a drone or a medical device, I would, more than likely,
require that 100% of the design be done in house (or that crazy extensive
testing be done on outside modules).

Anyone in software has had the experience of using some open-source module to
save time only to end up paying for it dearly when something doesn't work
correctly and help isn't forthcoming. If the software you are working on is
for a life support device it is very likely that taking this approach is
actually prohibited, and for good reason.

While I fully understand your point of view, it is one that reduces software
and hardware development to simply wiring together a bunch of includes. In my
experience this isn't reality in even the most trivial of non-trivial real-
world projects.

FPGAs are not software.

I see these "FPGA's for the masses" articles pop-up every so often. Here's
what's interesting to me. If you are an engineer schooled in digital circuit
design, developing with FPGA's is a piece of cake. There's nothing difficult
about it at all, particularly when compared to the old days of wire-wrapping
prototypes out of discrete chips. Sure, there can be a bit of tedium and
repetition here and there. At the same time, one person can be fully
responsible for a ten million logic element design...which was impossible just
a couple of decades ago.

If you don't understand logic circuits, FPGAs are voodoo. Guess what? A
carburetor is voodoo too if you don't understand it.

Let's invert the roles: Ask a seasoned FPGA engineer without (or with
superficial) web coding experience to code a website --server and client
side-- using JS, JQuery, HTML5, CSS3, PHP, Zend and MySQL. Right.

Then let's write an article about how difficult web programming is and how it
ought to be available to the masses. Then let's further suggest that you can
do nearly everything in web development via freely available includes.

I happen to be equally at home with hardware and software (web, embedded,
system, whatever) and I can't see that scenario (development-by-includes)
playing out in any of these domains.

~~~
caxap
At the moment, I am writing some computer vision code in VHDL. A part of the
circuit will perform connected component labeling (CCL) on incoming images,
because I want to extract some features from some object in the images. And
CCL is actually a union find algorithm. The algorithm can be written in a
normal programming language like Racket or even Java in a couple of hours.
However, the same algorithm will take me weeks to work out and test in VHDL! I
have done some nontrivial work with FPGAs, and every single time it was hard,
because every low-level detail has to be considered. Maybe it is so hard
because on FPGAs you are forced to optimize right from the start, whereas when
using programming languages, you can develop a prototype quickly and then
improve upon it? How is your experience with developing stuff on FPGAs?

~~~
VLM
I would talk to these guys (unless you are one of them) who are working on
extending their results:

[http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6...](http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6412129)

The Wikipedia entry also has a link to a parallelizable algo for CCL from 20+
years ago. FPGAs certainly parallelize pretty easily. I wonder if your
simplified optimum solution is to calculate one cell, replicate it into a
20x20 matrix or whatever you can fit on your FPGA, and then have a higher-
level CPU sling work units and stitch overlapping parts together.
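
The replication part is just a generate loop in Verilog; a rough sketch, where ccl_cell is a stand-in for whatever the per-cell logic turns out to be:

    // stamp out a 20x20 grid of identical processing elements
    genvar x, y;
    generate
        for (y = 0; y < 20; y = y + 1) begin : row
            for (x = 0; x < 20; x = x + 1) begin : col
                ccl_cell pe (
                    .clk   (clk),
                    .pixel (pixels[y*20 + x]),  // this cell's input
                    .label (labels[y*20 + x])   // this cell's output
                );
            end
        end
    endgenerate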

More practically, I'd suggest your quick prototype would be to slap a SoC on
an FPGA that does it in your favorite low-ish level code, since that only
takes hours, then very methodically and smoothly create an acceleration
peripheral that begins to do the grunt-iest of the grunt work one little step
at a time.

So let's start with just: are there any connections at all? That seems a
blindingly simple optimization. Well, that's a bitwise comparison, so replace
it in your code with a hardware detection and flag. Next thing you know you've
got a counter that, in hardware, automatically skips past all blank space to
the first possible pixel... But that's an optimization, maybe not the best
place to start.

Next, I suppose if you're doing 4-connected you have some kind of inner loop
that looks a lot like the Wikipedia list of 4 possible conditions. Now rather
than having the on-FPGA CPU check whether you're in the same region one
direction at a time, do all 4 dirs at once in parallel in VHDL and output the
result in hardware to your code; your code reads it all in and decides which
step (if any) was the lowest/first success.

The next step is obviously to move the "what's the first step to succeed?"
question out of the software and into the VHDL, so the embedded proc thinks:
OK, just read one register to see if it's connected and, if so, in which
direction.
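
A rough sketch of that peripheral (in Verilog rather than VHDL for brevity; the register layout is invented): all four neighbour comparisons evaluate combinationally in the same cycle, and the processor reads the outcome in one register access.

    module neigh4 (
        input  [7:0] cur,                      // current pixel's label
        input  [7:0] north, south, east, west, // neighbouring labels
        output [7:0] conn_reg                  // bits 0..3 = N,S,E,W match
    );
        // all four comparisons happen at once, not one per loop pass
        assign conn_reg = {4'b0000,
                           west  == cur,
                           east  == cur,
                           south == cur,
                           north == cur};
    endmodule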

Then you start feeding in a stream and setting up a (probably painful)
pipeline.

This is a solid bottom-up approach. One painful low-level detail at a time,
only one at a time, never more than one at a time. Often this is a method to
find a local maximum; it's never going to improve the algo (although it'll
make it faster...).

"because on FPGAs you are forced to optimize right from the start" Don't do
that. Emulate something that works from the start, then create an acceleration
peripheral to simplify your SoC code. Eventually remove your onboard FPGA cpu
if you're going to interface externally to something big, once the
"accelerator" is accelerating enough.

Imagine building your own floating-point multiplier instead of using an off-
the-shelf one... you don't write the control blocks and control code in VHDL
and do the adders later. Your first step should be writing a fast adder, only
later replacing control code and simulated pipelining with VHDL code. You
write the full adder first, not the fast carry, or whatever.
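
That is, start from the classic one-bit full adder, which in Verilog is just two continuous assignments:

    // the one-bit full adder: the primitive you write and verify first
    module full_adder (
        input  a, b, cin,
        output sum, cout
    );
        assign sum  = a ^ b ^ cin;               // parity of the three inputs
        assign cout = (a & b) | (cin & (a ^ b)); // carry out
    endmodule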

~~~
caxap
No, I am not one of them :) Thanks for the reference! I am drawing my
inspiration from Bailey, and more recently Ma et al. They label an image line
by line and merge the labels during the blanking period. If you start merging
labels while the image is still being processed, data might get lost if the
merged label occurs after the merge.

The paper that you reference divides the image into regions, so that the
merging can start earlier, because labels used in one region are independent
of the other regions. If it starts earlier, it also ends earlier, so that new
data can be processed.

In my case, there is no need for such high performance, just a real-time
requirement of 100fps for 640x480 images, where CCL is used for feature
extraction. The work by Bailey and his group is good enough, and the approach
from your reference can be adopted in the future if there is need for more
throughput!

My workflow is a lot different from the one that you describe. I don't use any
soft cores; I write everything in VHDL! I have used soft cores before, but
they were not really to my liking. I miss the short feedback loop (my PC is a
Mac and the synthesis tools run in a VM).

After trying out a couple of environments, I ended up using open source tools:
GHDL for VHDL-to-C++ compilation and simulation, and GTKWave for waveform
inspection.

Usually, I start with a testbench that instantiates my empty design under
test. The testbench reads some test image that I draw in Photoshop. It prints
some debugging values, and the wave inspection helps to figure out what's
going on.

If it works in the simulator, it usually works on the FPGA! But the biggest
advantage is that it takes just a few seconds to do all that.

I will give the softcore approach another chance once my deadline is over!

~~~
robomartin
One quick note. Sometimes in image processing you can gain advantages by
frame-buffering (to external SDR or DDR memory, not internal resources) and
then operating on the data at many times the native video clock rate.

If your data is coming in at 13.5MHz and you can run your internal evaluation
core at 500MHz, that's roughly 37 internal cycles per input pixel; there's a
lot you can do that, all of a sudden, appears "magical".

------
jbangert
While I am not an expert on FPGA design, I believe Figures 1 and 2 are
slightly exaggerated. The C program is very ad-hoc (returning a double from
main is actually illegal and will lead to interesting results) and avoids all
I/O, whereas the Verilog program seems to be quite complicated (in particular
with the use of ready flags and clocks). Please correct me if I am wrong, but
couldn't the authors just have used a continuous assignment (or maybe a simple
clock) for their conversion? Also, don't newer Verilog standards support
specifying IEEE floating-point numbers in their natural form, as opposed to
having to manually convert them to hex? This seems a little like an attention
grab, one that potentially tries to make the alternative products
(SystemVerilog, etc.) look nicer.

------
solusglobus
The code in Figure 1 should be for Figure 2 and vice-versa.

~~~
easytiger
Yea, that was a tad confusing. I had to reread to make sure I hadn't missed
something.

------
zwieback
It's been a while since I worked with FPGAs, but my experience was that using
drop-in CPU cores and peripherals is pure joy: drag and drop what you need,
build your bits, and you're good to go. Replacing an I2C with SPI or
supporting a wide range of daughterboards is trivially easy.

On the other hand, it only takes a day of writing low-level Verilog to realize
that the problem of correctly and efficiently parallelizing algorithms is a
hard one. We were using a very early C to Verilog (C2H) compiler from Altera
and it worked but was very inefficient in terms of logic element use. I'm sure
there's a lot of R&D going on in that space because without significant
progress general purpose CPUs or at least cores will remain dominant for some
time.

------
dgrnbrg
I have been working on this problem as well, by writing a system that allows
you to compile Clojure to FPGAs. I'll be giving a talk on it at Clojure West:
<http://clojurewest.org/sessions#greenberg>

</shamelessplug>

~~~
DanWaterworth
I've also been working on an HDL DSL. I've been using Idris; the type of a
circuit ensures that it implements the correct behaviour.

------
gatesphere
Having worked with Lime hands-on, I can say it is certainly going to make
splashes. It's a fabulous idea and deserves much more attention than it's
getting.

