
VexRiscv is a quadcore, Linux-capable RISC-V softcore for FPGA - homarp
https://antmicro.com/blog/2020/05/multicore-vex-in-litex/
======
tverbeure
One of the most interesting aspects of the VexRiscv is the way it's
implemented. The VexRiscv is written with SpinalHDL, a hardware description
library in Scala. But that's not the main thing: in addition to Verilog and
VHDL, there are other ways to write RTL, from Python to Scala to Haskell.

What's really special is that the VexRiscv is constructed from a large number
of plugins that split up the design 'horizontally', per feature, instead of
the traditional 'vertical', pipeline-stage-oriented way.

It makes it possible to implement all aspects of, say, a new instruction in
one file, instead of spreading it over many different files.

I've written about that here:
[https://tomverbeure.github.io/rtl/2018/12/06/The-VexRiscV-
CP...](https://tomverbeure.github.io/rtl/2018/12/06/The-VexRiscV-CPU-A-New-
Way-To-Design.html).
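
The plugin idea can be sketched outside any HDL. The following is a
conceptual sketch in plain Python, not SpinalHDL, and all names are
hypothetical: each plugin owns one feature end-to-end and registers its logic
into whichever pipeline stages it touches, so the whole feature lives in one
place.

```python
# Conceptual sketch of per-feature ("horizontal") plugins. Each plugin
# owns one feature and contributes logic to every stage it needs.
# Hypothetical names throughout; this is not the SpinalHDL API.

class Pipeline:
    def __init__(self, stage_names):
        # stage name -> list of per-feature logic contributed by plugins
        self.stages = {name: [] for name in stage_names}

    def add(self, stage, logic):
        self.stages[stage].append(logic)

    def run(self, state):
        # Walk the stages in order, applying every plugin's contribution.
        for logics in self.stages.values():
            for logic in logics:
                logic(state)
        return state

class ShiftPlugin:
    """Everything about the (made-up) shift-left feature lives here."""
    def setup(self, pipe):
        pipe.add("decode",
                 lambda s: s.update(is_shift=(s["opcode"] == "SLL")))
        pipe.add("execute",
                 lambda s: s.update(result=s["rs1"] << s["rs2"])
                           if s["is_shift"] else None)

pipe = Pipeline(["decode", "execute", "writeback"])
ShiftPlugin().setup(pipe)
print(pipe.run({"opcode": "SLL", "rs1": 3, "rs2": 2})["result"])  # 12
```

Adding a new instruction means adding one plugin class, rather than touching
a decode file, an execute file, and a writeback file separately.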

~~~
klysm
This is somewhat orthogonal to what you are saying, but I’ve wondered for a
while if it’s possible to achieve vertical and horizontal abstraction at the
same time. When you are working on the actual implementations, horizontal
style is clearly preferable, but if you want to change the abstraction, the
vertical style is much easier. The limitation of just one being accessible at
a time seems to purely be a consequence of the fact that we use the same
representation for reading and writing code. Why can’t I switch from vertical
to horizontal mode when reading code? Maybe it’s even possible to switch when
writing?? With a plug-in like structure you certainly get some benefits but at
the same time you can let go of a nice global view - it would be nice to have
both.

~~~
tverbeure
In a way, the VexRiscv is already implemented in both directions:

While stages are specified individually, you can declare that certain stages
collapse together.

The VexRiscv can be configured with between 2 and 5 stages.

In terms of readability, you still have the 5 stages separated out within the
same file.
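
A hedged sketch of what that configurability amounts to (plain Python,
hypothetical names, not the actual VexRiscv configuration API): the five
logical stages are always written out separately, and a build-time parameter
decides how many physical stages they fold into.

```python
# Hypothetical sketch: the 5 logical stages stay separate in the source,
# but a build-time parameter folds them into 2..5 physical stages.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def build_pipeline(n_physical):
    """Group the 5 logical stages into n_physical physical stages."""
    assert 2 <= n_physical <= 5
    base, extra = divmod(len(STAGES), n_physical)
    groups, i = [], 0
    for g in range(n_physical):
        size = base + (1 if g < extra else 0)
        groups.append(STAGES[i:i + size])
        i += size
    return groups

print(build_pipeline(5))  # one logical stage per physical stage
print(build_pipeline(2))  # collapsed into two physical stages
```

The source always reads as five separate stages; only the generated hardware
changes with the parameter.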

~~~
klysm
Oh, that's really cool. It seems like a nice place in the middle which I
didn't really think was possible before.

------
gchadwick
Worth mentioning SymbiFlow:
[https://symbiflow.github.io/](https://symbiflow.github.io/), it's a fully
open-source flow for FPGAs. Xilinx support (targeting the Arty A7 that the
project in this story uses, for instance) is on the way, so hopefully it won't
be long until you can build an open-source RISC-V SoC that can run Linux
entirely on open-source tooling.

~~~
litghost
To be clear, creating a Linux-capable Artix-7 image using only open source
tools can be done today, right now! [https://github.com/SymbiFlow/symbiflow-
examples](https://github.com/SymbiFlow/symbiflow-examples)

~~~
gchadwick
Awesome! I hadn't seen that. It includes the DDR controller too, which I
thought might be one of the trickier parts to get going under SymbiFlow.

------
mleonhard
I'm interested in developing an SoC with separate DRAM controllers for
instructions and data. I did some reading about the RISC-V BOOM core and
LiteDRAM.

I also researched the process of turning an SoC design into physical chips. I
estimated the cost to be around USD $150,000 for the first handful of chips,
using TSMC's CyberShuttle.

The learning curve for this technology is extremely steep. I would probably
need to spend years learning the various skills.

The PolarFire SoC block diagram [1] shows a DRAM controller and a DRAM PHY.

The Arty A7 Reference Manual [2] talks about using the Xilinx Vivado to add
peripheral blocks into the SoC design. Is this a way to add Xilinx's
proprietary DRAM controller block, which would then need to be licensed
separately?

Does Antmicro's demo use LiteDRAM to interface with Arty A7's DRAM PHY?

What would be involved in modifying VexRiscv and its MMU to support normal
data memory and a separate read-only instruction memory? For someone with the
necessary skills, is it a 1-month project or a 1-year project?

I checked a bunch of FPGA development boards. The only boards I found with
multiple DRAM chips have the Cyclone V FPGA and cost >$1,000. Why do those
boards cost so much more than the $150 Artix 7 boards?

[1]: [https://www.microsemi.com/product-directory/soc-
fpgas/5498-p...](https://www.microsemi.com/product-directory/soc-
fpgas/5498-polarfire-soc-fpga#block-diagram)

[2]: [https://reference.digilentinc.com/reference/programmable-
log...](https://reference.digilentinc.com/reference/programmable-
logic/arty-a7/reference-manual#designing_with_the_arty_a7)

------
rkagerer
I'm disappointed today's general-purpose CPUs and microcontrollers don't come
with some integrated FPGA space, similar to how you have SRAM and other
peripherals. Intel talked about it a few years back [1] but I'm not sure
anything materialized.

The closest I've seen in popular chips is a few gates worth of programmable
logic. Are there any hidden gems I've missed out on?

[1] [https://www.nextplatform.com/2018/05/24/a-peek-inside-
that-i...](https://www.nextplatform.com/2018/05/24/a-peek-inside-that-intel-
xeon-fpga-hybrid-chip/)

~~~
londons_explore
With the transition of compute from being performance-focused to
performance-per-watt-focused (cooling usually being the limiting factor), the
niche for the FPGA has almost vanished.

There are very very very few compute tasks where an FPGA solves a problem with
better performance per watt than both a CPU and a GPU.

I would bet that emulating a RISC-V program on x64 is far more power efficient
than running a RISC-V core on an FPGA for example.

~~~
aseipp
An ECP5 will sit on the order of ~100mW and you can clock those up to dozens
of MHz. They can have multiple cores running in parallel (an ECP5 85k will fit
dozens, probably well over a hundred RISC-V cores if you do your homework.)
Even a laptop sitting at 10W is going to be orders of magnitude less power
efficient than this in terms of raw instructions-per-cycle-per-watt if you're
emulating. That is not necessarily the best metric, but there you go.

And since you mentioned perf-per-watt -- ignoring soft CPUs, any deeply
pipelined algorithm is very likely going to destroy price-comparable CPUs in
terms of throughput e.g. you can do 16-to-32 bytes per cycle of AES on a dinky
FPGA from 10 years ago for a few dollars, and at 50MHz you're doing 1.6GB/s,
and people have been achieving this, or multiple times this, for 15+ years.
Things like TDP are not a measure of "overall system design efficiency"; they
are a measure of thermal capacity and thermal budgets, nothing more. (BTW, the
only general purpose CPU that comes close to this number directly for AES is,
like, Ice Lake, since VAESNI can turn out 16 bytes per cycle or whatever IIRC,
but now you're well back into "multiple watts" territory on a multi-GHz CPU.)
The reason people still use CPUs for these tasks _isn't_ because they _don't_
want better performance: it's because software has better agility and is
easier to acquire and modify and distribute. You can have systems that are
dozens of times more efficient than commodity ones for a wide variety of
tasks, they will just be a pain in the ass to use, program, acquire, and
build. You can figure out most of this with basic napkin math.
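
That napkin math is easy to reproduce: the 1.6 GB/s figure above is just
bytes-per-cycle times clock rate, using the numbers from the comment.

```python
# Throughput of a pipelined datapath is bytes-per-cycle times clock rate.
bytes_per_cycle = 32           # upper end of the 16-to-32 range above
clock_hz = 50_000_000          # the modest 50 MHz FPGA clock in the example
throughput_gb_s = bytes_per_cycle * clock_hz / 1e9
print(throughput_gb_s)  # 1.6
```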

Stop thinking so much about individual components, and start thinking about
global system design -- because the _entire system_ has its own performance
criteria that may vary drastically compared to an individual component within
it.

> There are very very very few compute tasks where an FPGA solves a problem
> with better performance per watt than both a CPU and a GPU.

This is like stating "There are very few tasks where a car would do as well as
a snowmobile." They aren't comparable for purpose. Hacker News is pop-
culture-y so everyone thinks "the only thing that matters is a cool CPU
running in a rack with a 7nm TSMC process that can run my Go application on
Kubernetes that will disrupt The Market of Smart Toilets" or whatever they do
day to day, and extrapolate from there. But I'd guess the vast majority (like,
85% or more) of the FPGA field has literally nothing to do with this. A huge
amount of work basically revolves around "just" interfacing with analog
devices at pico/nanosecond level resolutions...

The quest for best perf-per-watt is one largely driven by datacenters and
personal consumer electronics, which have both high volume and high yield, and
where the largest challenges revolve around power, cooling, etc. Furthermore
these systems run workloads that are largely general purpose "state machines"
that use some memory and some CPU and some disk, etc, and need to try and hit
a balance among all of these. There is a large amount of resource arbitrage
going on. "A rising tide lifts all boats" in this case. But little of that
applies in this field; people use older nodes and the same chips for 5-10+
years (or longer) straight because they need to deliver latency-sensitive
solutions, customized hardware at low volume, "hardware glue" for various
analog systems, highly specialized algorithmic solutions for the lowest total
BOM cost, etc. They aren't aiming to replace the systems created by digital
Silicon Valley software programmers.

There is a push to move FPGAs into the datacenter (see: Xilinx and their
exploding revenue) but it's unclear if they will settle into specific niches
or be used as supplementary devices or whatnot.

~~~
mleonhard
Your comments will be much better without the condescending tone. I point this
out because I also talk down to people unintentionally. It's a difficult habit
to break.

~~~
RL_Quine
It’s hard to get across just how out of their depth someone is without putting
it fairly explicitly.

------
lsllc
As cool as this is, enough with the FPGAs and the RISC-V Arduino clones. A
real, relatively inexpensive (e.g. sub $100) RISC-V SoM/board that can run
Linux is desperately needed (with at least BeagleBone Black performance
levels).

I really like the idea of RISC-V and I'm willing to make the investment in
software (and in fact have done so with QEMU), I just can't get any real
hardware (for a non-silly price).

~~~
xdxdx
I think the main barrier to a cheap RISC-V board like you describe is a real
Android port. By "real" I mean it needs working ART and V8 compiler ports so
apps and web pages don't run at 10% of the speed of low-end ARM chips.

Once that exists, I think we'll see companies develop RISC-V chips cheap
enough for low-end smartphones and other IoT devices. Those are the chips that
are cheap enough to put in a <$100 board.

------
eebynight
The fact that it fits in the 35T version with room to spare is pretty huge,
especially since certain packages of the 35T start as low as $35 for a single
chip, no MOQ.

I could see myself dropping one of these on a homemade project if I ever spend
the time making a reliable reflow oven...

~~~
tverbeure
If you're happy that it fits in a 35T, you'll be ecstatic to learn that a
single VexRiscv fits comfortably in a Cyclone II EP2C5 FPGA. :-)

It's hard to find an FPGA that's too small to fit one.

------
acrossthepond10
FPGA noob here. I have two questions about FPGAs that I'm hoping someone here
can help me out with:

1. For the FPGAs I've looked at, you seem to have to initially configure them
before being able to run your programs on them, kind of like EEPROM. I feel it
would be much more interesting from a reconfigurable computing perspective if
the devices were able to programmatically reconfigure on the fly, as easily as
reading and writing DRAM or flash memory. So what are the barriers that
prevent the hardware from being able to do this?

2. It's exciting to see projects like SymbiFlow making great progress, but
after reading some expert opinions [1] it seems like an extremely difficult
challenge to reverse engineer hardware from commercial FPGA vendors, who wish
to keep their designs closed in order to protect their IP and compete. So my
question is: wouldn't it be a more feasible goal to construct a fully open
FPGA platform from scratch, just like RISC-V is doing with CPUs? What would
the obstacles be here?

Thanks!

[1]
[https://www.reddit.com/r/FPGA/comments/a5pzs5/prediction_ope...](https://www.reddit.com/r/FPGA/comments/a5pzs5/prediction_open_source_fpga_tools_will_not/)

~~~
mhh__
1. I'm not sure what you mean, but remember that FPGAs don't run programs per
se (HDL gets compiled to logic, not instructions). The bitstream can be
modified; it just gets loaded from some flash. I'm not sure where it's done,
but it's possible.

2. The obstacles are billions and billions in R&D (and you'd need similar
amounts to get a fab to pick up the phone, too). Reverse engineering the
bitstream is also difficult because of this. SymbiFlow (i.e. Trellis etc.) has
got the bulk of the bitstream done (apart from specialized blocks like those
for DSP), but you also need good algorithms to decide what to do with that
bitstream, e.g. a fully open-source flow requires intricate timing analysis.

~~~
amelius
Regarding 2: the question was about an open-source solution, so I think that
"billions and billions of R&D" will translate to just a lot of time spent and
no literal cost, e.g. just like GCC is free as in beer.

It is true that getting a wafer fabricated will cost a lot of money (in the
millions maybe?) but this may be money well spent because the resulting FPGA
design can be used over and over. I think this would be in the reach of
perhaps some universities or government technology centers, if someone could
formulate the case for it.

~~~
mhh__
Nearly all the cool stuff in GCC and LLVM is paid for by companies (paying the
salaries of developers). The software could definitely be done in this way
(Symbiflow is very very nice), but keep in mind that developing an FPGA will
require a lot of hardware and bums in seats.

The question is similar in scale to building an open-source Intel Core i7 -
it's not impossible but keep in mind that an FPGA big enough (for example) to
prototype any subsections of the CPU let alone the whole thing would cost
hundreds of thousands.

------
amelius
Now the only problem left is to find an open-source-friendly FPGA
manufacturer.

------
non-entity
Kind of off topic, but how much time is involved in building processors with
FPGAs, especially for modern architectures like RISC-V? I only have a very
basic overview knowledge of FPGAs and almost none at this time about HDLs
(I plan on learning!), but with the complexity involved in modern processors,
I can't imagine this being a few weeks or even months of work.

~~~
q3k
RISC-V as an ISA is designed to be easy to implement. You can get a simple
implementation in 3k lines of Verilog [1].

That being said, there's a huge difference between a toy / simple multi-cycle
machine-mode RISC-V core and one with a modern, performant microarchitecture
(pipelined, super-scalar, multi-issue, cache coherent across multiple cores,
with efficient branch prediction). There's also extra work to implement RISC-V
extensions that let you run any 'real' code like Linux (which requires
anything from simple ISA extensions to implementing the Privileged Instruction
spec which dictates additional things like the MMU and interrupt controller).

[1] -
[https://github.com/cliffordwolf/picorv32/blob/master/picorv3...](https://github.com/cliffordwolf/picorv32/blob/master/picorv32.v)
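
To give a flavor of why the base ISA is considered easy to implement:
decoding an RV32I I-type instruction is a handful of shifts and masks. A
small plain-Python sketch, with the field layout per the RISC-V base
instruction formats (the example word encodes `addi x1, x0, 5`):

```python
# Decode an RV32I I-type instruction word into its fields.
# Layout: imm[31:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0]
def decode_itype(word):
    opcode = word & 0x7F
    rd     = (word >> 7)  & 0x1F
    funct3 = (word >> 12) & 0x7
    rs1    = (word >> 15) & 0x1F
    imm    = word >> 20
    if imm & 0x800:            # sign-extend the 12-bit immediate
        imm -= 0x1000
    return opcode, rd, funct3, rs1, imm

# 0x00500093 is `addi x1, x0, 5`: opcode 19 (0x13, OP-IMM), rd=1, rs1=0, imm=5
print(decode_itype(0x00500093))  # (19, 1, 0, 0, 5)
```

The hard parts q3k lists (pipelining, coherence, branch prediction, the
privileged spec) are exactly the things this kind of decoder doesn't touch.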

------
dhanna
Is Dolu1990 the primary developer of everything SpinalHDL related?

I'm genuinely impressed with his effort.

------
nullobject
This project looks amazing.

For folks writing in SpinalHDL, is anyone using Quartus? Or are you using a
fully open-source toolchain? i.e. What is your workflow?

I'm interested in trying out SpinalHDL, but I'm not sure how to integrate it
into what I'm doing.

~~~
tverbeure
I'm using SpinalHDL for all my hobby projects, and I use Intel Quartus, Xilinx
ISE, or Yosys, depending on the FPGA family.

This project is an FPGA-based ray tracer, written in SpinalHDL, that uses
Xilinx ISE:
[https://github.com/tomverbeure/rt](https://github.com/tomverbeure/rt). This
project uses SpinalHDL to drive an LED cube, which uses Quartus:
[https://github.com/tomverbeure/cube](https://github.com/tomverbeure/cube) (it
also uses a VexRiscv). And here is a small project that drives an LED matrix
with WS2812B LEDs, that runs on an Upduino2 with a Lattice UP5K FPGA, which
uses open source Yosys/NextPNR:
[https://github.com/tomverbeure/led_matrix](https://github.com/tomverbeure/led_matrix).

~~~
nullobject
Nice. I was just looking through your ray-tracer code :)

It looks like a really nice abstraction. I've been working on the MiSTer
project, writing arcade cores for the Cyclone V in VHDL.

I've made a huge effort to keep things clean, but SpinalHDL could be a great
way to tame some of the code.

Will start blinking some LEDs and see how it goes...

------
lifeisstillgood
How far down the road of something like nand2tetris can I take an FPGA? Can I
design my own system in Verilog, flash it to an FPGA like this, and
effectively run my own computer?

~~~
ladberg
Yep, that's the point of FPGAs. The easiest way to see is to synthesize it in
an FPGA tool (e.g. Vivado) which you can probably do for free. Different FPGAs
have different hardware resources, but for any small designs you can probably
fit them on a cheap FPGA.

------
tasty_freeze
I didn't find what clock rate it runs at. It mentions booting Linux in 4
seconds, but that is hard to extrapolate into a core clock frequency.

100 MHz? 200 MHz? higher?

~~~
Narishma
Doesn't that depend on the FPGA you use?

~~~
tasty_freeze
Of course, but the article author did build it on a specific system and
reported the boot time, so stating the clock rate would also have been a
useful data point given that the platform is known. Someone using a different
FPGA could use a rough scaling factor to guesstimate what they might achieve
on the platform they have on hand.

~~~
thalain
Added this to the article - it's 100 MHz in this design. For some more
performance info on Vex in general, see
[https://github.com/SpinalHDL/VexRiscv](https://github.com/SpinalHDL/VexRiscv)

We have yet to do a detailed analysis of the multicore version, but in
general I'd say it's pretty decent.

