
How many 32-bit RISC CPUs fit in an FPGA? Now vs. 1995? - luu
http://forums.xilinx.com/t5/Xcell-Daily-Blog/Jan-Gray-s-New-LUT-Math-How-many-32-bit-RISC-CPUs-fit-in-an-FPGA/ba-p/432478
======
crazygringo
So, about 15 years ago I took a college course on microprocessor design, and
our final project was to implement a simple microprocessor on an FPGA.

At the time, it seemed obvious that, over the next decade, FPGAs would work
their way into general-purpose computing, so that (for example) Photoshop
filters would simply reconfigure the FPGA to run blazingly fast. Likewise with
games, or video codecs, or 3D rendering, or whatever else was processor-
intensive.

But that clearly hasn't happened. Instead, GPUs took off as the main
computational supplement to CPUs.

Does anyone here have any insight as to why? Is there a technological reason
why FPGAs never became standard general-purpose hardware on every desktop
and laptop? Has it been a chicken-and-egg problem? A standardization
problem? Or something else? Do FPGAs still have potential for general-purpose
consumer computing? Or are they going to be forever relegated to
special-purpose roles?

~~~
pdq
Moore's law is quite amazing.

I believe there are four reasons why GPUs, rather than FPGAs, took off in
conventional computing:

1. Last I checked, the FPGA vendors will not open up their toolchains, and
won't even document the bitstream formats. They claim NDA, proprietary, etc.
This has the massive side effect that you are stuck with their bloated,
slow, crappy toolchains. If these were open, I guarantee hackers would be
inventing all kinds of interesting ways to convert their software into FPGA
bits.

2. FPGAs are VERY hard to write for and debug. You have to write your
design in an HDL (either VHDL or Verilog), and you have to prototype the
design in a software simulator first (and of course these tools are either
quite pricey or, if free, usually limited or hard to use). Then you can
synthesize the design and download it to the FPGA to run.

The next problem is debugging your design. The entire internal state of the
FPGA is only accessible through slow scan, unless you dedicate a portion of
your design to "monitors", which tap the traffic and store the values in
internal RAMs. So you may have to respin the design just to add more
monitors to pinpoint where the issue is.
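
This is essentially what the vendors' embedded logic analyzers (Xilinx
ChipScope, Altera SignalTap) give you. A toy Python sketch of the idea,
with hypothetical signal names, just to make "monitor" concrete:

    from collections import deque

    class Monitor:
        """Conceptual model of an on-chip monitor: a trigger condition
        plus a small circular capture buffer built from internal RAM."""
        def __init__(self, trigger, depth=1024):
            self.trigger = trigger             # predicate over tapped signals
            self.buffer = deque(maxlen=depth)  # models a block-RAM buffer
            self.triggered = False

        def clock(self, signals):
            """Called once per clock cycle with the tapped signal values."""
            self.buffer.append(dict(signals))
            if self.trigger(signals):
                self.triggered = True  # hardware would freeze the buffer here

    # Hypothetical example: watch a bus and trigger on an error strobe.
    mon = Monitor(trigger=lambda s: s["bus_error"] == 1, depth=8)
    for cycle in range(20):
        mon.clock({"addr": cycle * 4, "bus_error": int(cycle == 13)})
    print("triggered:", mon.triggered, "| last addr:", mon.buffer[-1]["addr"])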

3. FPGA compilation is SLOW. When I used them professionally a few years
ago, a Virtex-5 could take multiple hours to resynthesize and place & route
a medium-sized design. I believe the Virtex-7 parts they are advertising
could take over a day to respin if you change your design.

4. Most new machines already have built-in graphics with a GPU that can be
used as a general-purpose GPU. No one ships FPGAs in any conventional
computer.

~~~
JamilD
(2) can probably be addressed with OpenCL -- Altera seems to be working on
an SDK[1] that allows you to write C code which, as I understand it, would
compile to an image that you could then program onto your FPGA (or you
could just compile for execution on a processor). So fortunately, no
Verilog or VHDL necessary.

(3) is another issue, but I don't think the consumer would necessarily need to
worry about compilation. The developer would just include the compiled
programming files for different FPGAs in the application.

If you mean that it'll be slow on the developer's side, that's definitely a
valid point. I'm sure, however, that you'll see FPGA manufacturers start to
move toward remote compilation so that you're not necessarily limited by the
hardware you have in-house.
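
To give a flavor of the programming model: the same OpenCL kernel source
that runs on a GPU today is what Altera's SDK would compile (offline, with
their own toolchain) for the FPGA. A minimal vector-add using the pyopencl
bindings, targeting whatever OpenCL device happens to be available locally:

    import numpy as np
    import pyopencl as cl

    KERNEL = """
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *out) {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];
    }
    """

    ctx = cl.create_some_context()   # picks any available OpenCL device
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, KERNEL).build()

    a = np.random.rand(1 << 20).astype(np.float32)
    b = np.random.rand(1 << 20).astype(np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

    prg.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)
    out = np.empty_like(a)
    cl.enqueue_copy(queue, out, out_buf)
    assert np.allclose(out, a + b)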

[1] http://www.altera.com/products/software/opencl/opencl-index.html#sdk

~~~
sounds
More about (3):

Altera calls it "logiclock," Xilinx has a different term, but the idea is that
you don't need to re-synthesize the entire FPGA for every change. In fact, you
may not want to. If you are tweaking a certain region, you're usually happier
if the place & route doesn't send your stuff through a route that then kicks
off a line in another block so that the timings are now off in that other
block.

For an FPGA, timing is how you measure performance, and getting the best
timing can take quite a bit of work. Being able to lock that down once
you've got it right is a big plus.

~~~
prutschman
Xilinx's ISE calls it "SmartGuide".

It's kind of a mixed bag. It's worked okay for me if changes are truly minor,
but if there are large changes to the logic it doesn't seem to be very good
about "forgetting" what it learned from the previous pass. Three or four times
this week I've had a design fail to make timing with SmartGuide, but work when
doing P&R from scratch.

------
jangray
Hi, Jan here.

pdq, the last time I built this design, with more fully elaborated
processors (control units + multiplier FUs), it took three hours and 16 GB
of physical RAM on a Core i7-4960HQ rMBP.

~~~
kbenson
Do you know if the process parallelizes well? If so, this seems like something
that high-end temporary AWS instances could help quite a bit with.

~~~
jangray
With the Xilinx ISE toolset I am currently using (which Xilinx is
deprecating in favor of the new Vivado toolset), it parallelizes and
multithreads poorly. I understand that the place and route algorithm is
based upon simulated annealing: you make small random perturbations to the
current layout configuration, measure whether it is better or worse, and
sometimes retain the new configuration, sometimes roll back. This gradually
evolves the system toward a configuration that maximizes some objective
function while avoiding getting stuck in a local maximum. It has
traditionally been a challenge to parallelize this sequential algorithm
through design partitioning because of placement and routing interactions
between the partitions.
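
The core loop is simple, even though production P&R is anything but. A toy
sketch with a made-up one-dimensional placement problem and cost function:

    import math, random

    def anneal(layout, cost, neighbor, t0=1.0, cooling=0.999, steps=5000):
        """Generic simulated annealing: perturb the layout, score it,
        and sometimes keep a worse configuration so the search can
        escape local optima."""
        current, best = layout, layout
        t = t0
        for _ in range(steps):
            candidate = neighbor(current)
            delta = cost(candidate) - cost(current)
            # Always accept improvements; accept regressions with
            # probability exp(-delta/t), which shrinks as t cools.
            if delta <= 0 or random.random() < math.exp(-delta / t):
                current = candidate
                if cost(current) < cost(best):
                    best = current
            t *= cooling
        return best

    # Toy placement: order cells in a row to minimize total wirelength.
    nets = [(0, 7), (1, 2), (3, 6), (4, 5)]   # pairs of connected cells
    wirelength = lambda p: sum(abs(p[a] - p[b]) for a, b in nets)

    def swap_two(p):
        q = list(p)
        i, j = random.sample(range(len(q)), 2)
        q[i], q[j] = q[j], q[i]
        return q

    best = anneal(list(range(8)), wirelength, swap_two)
    print("placement:", best, "wirelength:", wirelength(best))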

In some flows you can do a coarse floorplan of your design and route the
submodules separately and then stitch them together. I imagine this is how the
very largest devices are implemented in manageable design iteration times.

I don't usually worry about that, though. Since my design is just so many
replicated tiles, I tend to do design iterations of 4- or 16-processor
elements to test the impact on clock period / timing slack. That usually takes
2-3 minutes per design spin. Only once in a while do I place and route the
whole chip to confirm some change doesn't impact timing closure.

------
ggreer
Fab tech has improved at an astonishing rate. The Willamette core (Pentium
4) from 2000 fit 42 million transistors in 217mm². Eight years later,
Silverthorne (Atom) fit two cores and 47 million transistors in 25mm².
That's almost nine Atom dies in the same area as one Pentium 4.

Today's quad-core Haswell is made of 1.4 billion transistors crammed into
177mm². That's almost 8 million transistors per square _millimeter_.
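
The back-of-the-envelope math, from the figures above:

    # Transistor density from the die areas and transistor counts above
    willamette   = 42e6 / 217   # ~0.19M transistors/mm^2 (2000)
    silverthorne = 47e6 / 25    # ~1.9M transistors/mm^2  (2008)
    haswell      = 1.4e9 / 177  # ~7.9M transistors/mm^2  (2013)

    print(f"Atom dies per Pentium 4 die area: {217 / 25:.1f}")        # ~8.7
    print(f"Haswell: {haswell / 1e6:.1f}M transistors/mm^2")          # ~7.9
    print(f"Density gain 2000 -> 2013: {haswell / willamette:.0f}x")  # ~41x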

~~~
anigbrowl
It seems like we're going to run into the limits of that by the end of the
decade or soon after, though. I've made quite a few submissions about this,
but nobody ever seems to read them :) I really wonder whether we are making
advances in parallelism and other technologies fast enough to offset the
oncoming barriers to shrinkage and speed.

My naive best guess is that when CPUs stop getting much faster, the next
wave of innovation will be in bus speeds. I'm not a chip guy, so perhaps my
view of the problem is overblown, but I'm very interested in learning more
about this from others.

~~~
zhemao
> when CPUs stop getting much faster

If by faster you mean clock speed, they haven't been getting faster for a
while now. Memory and IO speeds do lag behind, but we have caching to solve
the former and SSDs to solve the latter. One pressing issue now is getting the
power consumption down. This is especially important for laptops and mobile
phones. FPGAs are pretty good at using less power, but you have to sacrifice a
lot of performance and programmability.

~~~
anigbrowl
Ah sorry, I meant in terms of execution speed, not clock, and was thinking
mainly of shrinking feature sizes -> more transistors -> more operations
per cycle. I apologize for the vagueness.

------
bayesianhorse
What practical applications does this have? Could we see something like
Python's Theano? The latter is a library that turns symbolic
representations of linear algebra into parallelized, optimized code for the
CPU or GPU.
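
For reference, the Theano pattern looks like this: you build a symbolic
graph and the library compiles it into optimized native code for the
configured backend (CPU by default, GPU with the right flags):

    import numpy as np
    import theano
    import theano.tensor as T

    # Symbolic description of a computation...
    x = T.dmatrix('x')
    y = T.dmatrix('y')
    z = T.dot(x, y) + 1

    # ...compiled into an optimized callable for the chosen backend.
    f = theano.function([x, y], z)

    a = np.ones((2, 3))
    b = np.ones((3, 4))
    print(f(a, b))  # every entry is 3.0 + 1 = 4.0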

I think that these FPGAs, when put into consumer computers, will be more
like "data centers on a chip" than processing cores. In a simple example
they could run map/reduce-type operations, colocating storage and compute
on the same silicon.

~~~
zhemao
That's not really what FPGAs are good for. Unless you are using a really
high-end FPGA (read: >$10,000/unit), the raw compute power is not going to
beat a decent gaming GPU. The benefit of FPGAs is that you get very precise
timing control, which makes them very good for software-defined IO and
other latency-sensitive, hard real-time applications.
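
To illustrate "software-defined IO" (a toy sketch, names made up): a UART
transmitter is just a pin driven with a cycle-exact bit pattern, which an
FPGA can hold for a guaranteed number of clock cycles per bit, something a
CPU or GPU can't promise:

    def uart_frame(byte, ticks_per_bit=4):
        """Pin levels for one UART frame, one entry per clock tick:
        start bit (0), 8 data bits LSB-first, stop bit (1). On an FPGA
        each level is held for an exact number of clock cycles."""
        bits = [0] + [(byte >> i) & 1 for i in range(8)] + [1]
        return [level for level in bits for _ in range(ticks_per_bit)]

    # 0x55 = 0b01010101 -> alternating data bits between start and stop
    print(uart_frame(0x55))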

------
jokoon
I'm sure the future will be massively multiprocessor. I just wish it would
come faster.

Although I'm quite sure those sorts of amazing designs are already in heavy
use at the NSA. I'm sure hardware research could be the real breakthrough
in cryptanalysis.

------
aortega
I wonder why Xilinx choose to use the j32 CPU in this example when they have
better designs like the MicroBlaze, with about same size but faster and able
to boot Linux.

EDIT: Oh I see. Microblaze is propietary and didn't exist back then.

~~~
jangray
Xilinx didn't choose anything; they simply linked to a blog post (mine).
These cores are more austere than MicroBlaze: smaller and simpler.

http://www.fpgacpu.org/log/sep00.html#000919

