
Nyuzi: open-source processor designed for highly parallel and GPGPU applications - matt_d
http://nyuzi.org/
======
jamesbowman
Regardless of Nyuzi's merits, that demo is unimpressive.

We (RenderMorphics and then Microsoft D3D) were happily spinning teapots on 50
MHz 486s, significantly faster than that.

~~~
jeffbush
From a performance perspective, admittedly. The software 3D renderer it is
running is not highly optimized. But it's probably not an apples-to-apples
comparison, since this has programmable vertex/pixel shaders that use
floating point for parameter interpolation and color values. I assume
486-class 3D renderers were taking shortcuts (no shaders, fixed-point math,
etc.)
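
To make "parameter interpolation" concrete, the per-pixel work looks roughly
like this (a hypothetical sketch, not Nyuzi's actual librender code):

    // Hypothetical sketch: each shader parameter (color channel, texture
    // coordinate, ...) is blended from the three vertex values using
    // floating-point barycentric weights. Not Nyuzi's actual librender code.
    struct Varying { float r, g, b; };

    Varying interpolate(const Varying v[3], float w0, float w1, float w2) {
        // w0 + w1 + w2 == 1.0f for pixels inside the triangle.
        return { v[0].r * w0 + v[1].r * w1 + v[2].r * w2,
                 v[0].g * w0 + v[1].g * w1 + v[2].g * w2,
                 v[0].b * w0 + v[1].b * w1 + v[2].b * w2 };
    }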

What I think it does demonstrate is that the hardware design can run non-
trivial programs reliably on an FPGA. Maybe that's not impressive either, but
I thought it was kinda cool. :) While this doesn't mean anyone needs to go
shorting NVidia stock, it does allow anyone who is interested in CPU/GPU
architecture to experiment with and modify a full-featured design (pipelined,
vector floating point, L1 & L2 caches, hardware multithreading).

I analyzed the basic performance and scalability of the 3D renderer here:

[http://latchup.blogspot.com/2015/02/improved-3d-engine-profile.html](http://latchup.blogspot.com/2015/02/improved-3d-engine-profile.html)

And dove a bit deeper into performance using a custom Quake level renderer:

[http://latchup.blogspot.com/2015/06/not-so-fast.html](http://latchup.blogspot.com/2015/06/not-so-fast.html)

~~~
Arelius
It may still not be an apples-to-apples comparison, but since it was running
on the 486, it was, by definition, doing programmable vertex transformation
and shading.

~~~
jeffbush
What I meant was that the shader is a pluggable function in this
implementation. If you wanted to change the shader function (say, to add
texture or environment mapping), it would be a fairly minor code change with
a relatively small incremental performance hit. If you wanted to do that to a
highly optimized 90s-era fixed-pipeline software renderer, you'd most likely
need to rewrite it from scratch, and certain operations wouldn't be feasible
at all. So, this design pays a lot of the cost of flexibility up front.
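
A minimal sketch of what "pluggable" means here (hypothetical interface;
Nyuzi's actual librender classes will differ):

    #include <cstddef>

    // Hypothetical pluggable pixel shader: the rasterizer calls shade() for
    // each covered pixel with the interpolated floating-point parameters.
    struct PixelShader {
        virtual void shade(const float params[], std::size_t nParams,
                           float outColor[4]) const = 0;
        virtual ~PixelShader() = default;
    };

    // Gouraud shading: pass the interpolated vertex color straight through.
    struct GouraudShader : PixelShader {
        void shade(const float params[], std::size_t,
                   float outColor[4]) const override {
            for (int i = 0; i < 4; i++)
                outColor[i] = params[i];
        }
    };

Adding texture or environment mapping means writing another subclass; the
rasterizer's inner loop doesn't change. A fixed-pipeline renderer bakes one
shading equation into that loop instead.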

------
tromp
I wonder how well this would run my recently discussed Cuckoo Cycle proof-of-
work [1], where the bottleneck is atomically updating bit-pairs in a huge
array randomly accessed by truncated outputs of siphash_2_4, by as many
threads as possible.

The hardware multithreading should allow for many outstanding loads/stores,
hopefully saturating the memory subsystem, while the ability to add
instructions could be useful in reducing the instruction count of each
siphash round.

[1]
[https://news.ycombinator.com/item?id=10957765](https://news.ycombinator.com/item?id=10957765)
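
For reference, the inner loop described above looks roughly like this (a
sketch of the access pattern; the 2-bit counter layout and array size are
illustrative, not tromp's actual code):

    #include <atomic>
    #include <cstdint>
    #include <vector>

    static inline uint64_t rotl(uint64_t x, int b) {
        return (x << b) | (x >> (64 - b));
    }

    // One genuine SipHash round; each add/rotate/xor triple here is a
    // candidate for fusion into a single custom instruction.
    #define SIPROUND \
        do { \
            v0 += v1; v1 = rotl(v1, 13); v1 ^= v0; v0 = rotl(v0, 32); \
            v2 += v3; v3 = rotl(v3, 16); v3 ^= v2;                    \
            v0 += v3; v3 = rotl(v3, 21); v3 ^= v0;                    \
            v2 += v1; v1 = rotl(v1, 17); v1 ^= v2; v2 = rotl(v2, 32); \
        } while (0)

    // SipHash-2-4 of a single 64-bit nonce under key (k0, k1); standard
    // length padding omitted for brevity.
    uint64_t siphash24(uint64_t k0, uint64_t k1, uint64_t nonce) {
        uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
        uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
        uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
        uint64_t v3 = k1 ^ 0x7465646279746573ULL ^ nonce;
        SIPROUND; SIPROUND;                       // 2 compression rounds
        v0 ^= nonce; v2 ^= 0xff;
        SIPROUND; SIPROUND; SIPROUND; SIPROUND;   // 4 finalization rounds
        return v0 ^ v1 ^ v2 ^ v3;
    }

    constexpr uint64_t NNODES = 1ULL << 28;  // illustrative size
    // Two bits of state per node, 32 pairs per 64-bit word (zero-initialized
    // on construction as of C++20).
    std::vector<std::atomic<uint64_t>> counters(NNODES / 32);

    void touch(uint64_t k0, uint64_t k1, uint64_t nonce) {
        uint64_t node  = siphash24(k0, k1, nonce) % NNODES; // truncated hash
        uint64_t word  = node / 32;
        int      shift = int(node % 32) * 2;
        // Saturating 2-bit counter via atomic RMW; random `node` values
        // scatter these accesses across the whole array, which is the
        // memory-subsystem stress described above.
        uint64_t old = counters[word].fetch_or(1ULL << shift,
                                               std::memory_order_relaxed);
        if (old & (1ULL << shift))
            counters[word].fetch_or(2ULL << shift, std::memory_order_relaxed);
    }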

~~~
maaku
Have you looked at the Knights Landing architecture? It seems that the
3D-stacked memory design would greatly accelerate such an application.

~~~
tromp
Yes; I tried a Xeon Phi, and reported my experience as follows:

I wanted to treat the Xeon Phi like a regular Xeon with just many more cores,
so I ran my benchmark mostly unchanged. I was surprised, however, that the
single-threaded performance was roughly 20x lower than on a normal Xeon.
Using all 240 threads, I was still doing no better than with 12 cores on a
normal Xeon. It seems that the Xeon Phi memory subsystem is not really
optimized for multithreaded random access.

The provider of that system suggested that I perform computation and memory
access (including prefetching) on the SIMD units (VPUs), but I haven't gotten
around to studying the use of VPUs yet...
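
(Roughly, "memory access on the VPU" means issuing the gathers from the
vector unit. The sketch below uses AVX-512 intrinsics as a stand-in; Knights
Corner's IMCI instruction set is similar in spirit but not source-compatible,
so this is illustrative only.)

    #include <immintrin.h>
    #include <cstdint>

    // Fetch 16 randomly indexed table entries with one gather instruction,
    // exposing many outstanding loads to the memory system at once instead
    // of 16 dependent scalar loads.
    void gather16(const uint32_t* table, const uint32_t idx[16],
                  uint32_t out[16]) {
        __m512i vidx = _mm512_loadu_si512(idx);
        __m512i vals = _mm512_i32gather_epi32(vidx, table, sizeof(uint32_t));
        _mm512_storeu_si512(out, vals);
    }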

~~~
trsohmers
The Knights Corner (2012-2015) Xeon Phi was based on a modified and scaled-
down version of the Intel P54C core... the same one used in the original
Pentium. On top of that, each core runs at only 1.053 to 1.1 GHz. Why would
you expect single-threaded performance to be any good from an in-order
superscalar core that is 20 years old?

~~~
rbanffy
IIRC, it's actually a bit worse: each core could run two threads, but they'd
run on alternating clock cycles, like the Cell PPU - unless you have two
threads running, each core effectively runs at half speed. The normal
optimization trick of spreading work across as many cores as possible
backfires here.

But that's integer performance. Each core also had a beefy SIMD unit, twice
as wide as the one in then-current Xeons.

------
erichocean
Nice work! Now someone just needs to port it to Chisel.[0]

Open-source GPUs seem more "in range" these days, thanks to projects like
Vulkan and SPIR-V[1], which significantly improve the driver situation for
GPGPUs.

[0] [https://github.com/ucb-bar/chisel](https://github.com/ucb-bar/chisel)

[1] [https://www.khronos.org/vulkan/](https://www.khronos.org/vulkan/)

~~~
pjc50
Manufacturing is still the critical barrier for open source hardware. It's not
infinitely replicable at zero cost like software.

~~~
erichocean
Oh, for sure, but not all hardware is for consumers. I used to design (and
redesign, every six months or so) a specialized hardware rendering cluster
from commodity parts, and at this point the entire design is memory-bandwidth
limited.

FPGAs _might_ be cost effective, given that the total hardware cost is
currently around $2 million with commodity motherboards, CPUs, and GPUs,
since I could spend the silicon more wisely (and ideally cut power usage at
the same time) by implementing the actual software algorithms I use more
carefully. I'm keeping a close eye on RISC-V and related projects.

------
mankash666
This is juicy! [RISC-V+Nyuzi] running Linux/BSD at the same performance/watt
as [ARM+Mali] would open up both the smartphone/embedded and server markets.

~~~
sitkack
The author has plenty of other great hardware/software projects:
[https://github.com/jbush001](https://github.com/jbush001)

------
georgeg
Nyuzi is a Swahili word for threads.

