
Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor - matt_d
https://arxiv.org/abs/1906.00478
======
scottlamb
> Ara runs at 1.2 GHz in the typical corner (TT/0.80 V/25 °C), achieving a
> performance up to 34 DP-GFLOPS. In terms of energy efficiency, Ara achieves
> up to 67 DP-GFLOPS/W under the same conditions, which is 56% higher than
> similar vector processors found in literature.

I wonder if it supports vectorized 8-bit multiplies and, if so, what its
performance and efficiency are on those.

I've been looking around at AI accelerators a bit recently. AFAIK they only do
8-bit operations and are rated in TOPS (tera-operations per second). Google's
Coral USB Accelerator (with an Edge TPU chip) claims 4 TOPS, at 2 TOPS/W. The
Gyrfalcon Plai Plug (with a SPR2801S chip) claims 5.6 TOPS, at 9.3 TOPS/W.
It'd be really exciting to have an open architecture chip match or exceed
that. The Kendryte K210 is a RISC-V AI accelerator chip announced recently,
but it only claims 0.25 TOPS, or 0.8 TOPS/W.

~~~
bobajeff
Thanks. A lot of people here seem to focus on computer graphics. I'm also
wondering about potential applications in machine learning and neural networks.

Like, how would it fare against today's FPGAs, DSPs, NPUs, and TPUs?

------
ncmncm
Before I can be bullish on RISC-V, I really need to see a set of basic bitwise
instructions -- minimally, popcount and count-leading-zeros (i.e. integer log-
base-2).
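
For concreteness, here's a rough C sketch (mine, not from any spec) of the
scalar fallbacks a compiler has to emit when those two instructions are
missing, including the clz-to-integer-log2 identity:

    #include <stdint.h>

    /* Naive popcount: clears one set bit per iteration. */
    static int popcount_loop(uint32_t x) {
        int n = 0;
        while (x) { x &= x - 1; n++; }
        return n;
    }

    /* count-leading-zeros is integer log2 in disguise:
     * floor(log2(x)) == 31 - clz(x) for nonzero 32-bit x. */
    static int floor_log2(uint32_t x) {
        int lz = 0;
        while (!(x & 0x80000000u)) { x <<= 1; lz++; }  /* naive clz, x must be nonzero */
        return 31 - lz;
    }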

When I last looked at the (tabled) bitwise extensions, they had been bloated
out of all reason with really wild bit-shuffling apparatus.

I understand the latter are very useful to some people, but if they interfere
with getting a fast popcount into the core ISA, we can't afford them.

~~~
smallstepforman
The whole point of RISC is to avoid instruction complexity for rarely used
instructions. Popcount and family are most likely microcoded to expand to a
loop anyway, so why add this complexity to silicon?

~~~
vardump
> The whole point of RISC is to avoid instruction complexity for rarely used
> instructions

Yeah, that was true 20-30 years ago.

> Popcount and family are most likely microcoded to expand to a loop anyway,
> so why add this complexity to silicon?

No, you can just make a network of small adders. You could easily build a CPU
with a throughput of one popcount per cycle (or even better, if SIMD and/or
multiple execution ports supported this ALU operation). Silicon area isn't
really a problem anymore. Power and heat, on the other hand...
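
As a rough software analogue (my sketch, nothing from the paper), that adder
network is exactly what the classic SWAR popcount expresses: neighbouring
1-bit fields are summed into 2-bit fields, then 4-bit, then 8-bit, all in
parallel:

    #include <stdint.h>

    /* SWAR popcount: a software rendering of a tree of small adders. */
    static int popcount32(uint32_t x) {
        x = x - ((x >> 1) & 0x55555555u);                  /* 16 x 2-bit sums */
        x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);  /* 8 x 4-bit sums  */
        x = (x + (x >> 4)) & 0x0F0F0F0Fu;                  /* 4 x 8-bit sums  */
        return (int)((x * 0x01010101u) >> 24);             /* add the 4 bytes */
    }

In hardware the same structure is a few levels of small adders, which is why a
single-cycle popcount is cheap.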

~~~
hnuser123456
> Silicon area isn't really a problem anymore. Power and heat, on the other
> hand...

Yes, so then why are ARM chips in mobile devices approaching x86 performance
with an order of magnitude less power? Could it be that having a large chunk
of rarely-used silicon ends up consuming some significant amount of power,
becoming dominant at excessively large scales?

~~~
rayiner
That’s not the reason, especially not in this day and age of clock gating.
It’s because those ARM designs are optimized for low power, at the expense of
not having headroom to take advantage of higher power envelopes if they are
available.

------
dmos62
Commentary? What's the difference between a GPU and a vector processor? Is a
vector processor being open source a particularly desirable quality (beyond
the general attractiveness of open source)?

~~~
erik
In order to render current games / 3D with reasonable performance you need
dedicated texturing hardware. Intel's Larrabee had it, even though it never
shipped as a graphics card. And early plans for the PS3 involved adding
texture units to a Cell CPU, but Sony went with an Nvidia GPU instead.

~~~
wyldfire
Sampling/texturing helps, but I can't help thinking that the software stack is
the critical factor. You might make a great new open ecosystem, but if I can't
port my OGL/OCL/etc. benchmark to your hardware easily, I'm probably not going
to waste my time.

~~~
erik
Apparently this was the big thing that held Larrabee back as a GPU. The
hardware was capable, but they never got their drivers to the point where they
could do a good job with existing OpenGL, Direct3D, etc. code.

~~~
TomVDB
Is there any strong evidence of the claim that Larrabee HW had the potential
of being a capable GPU, and that it was just a SW issue?

Despite being increasingly programmable, GPUs still have very fine-tuned
hardware for graphics, and this goes deeper than just adding some texture
mapping units here or there.

A good example of this is the fact that both Nvidia and AMD keep on adding new
modes in HW to streamline graphics primitive processing with varying degrees
of success (see mesh shaders and primitive shaders.)

To me, this signals that a free-for-all software approach as championed by
Larrabee is simply not efficient enough in a competitive environment where
GPUs are declared winner or loser based on benchmark differences of just a few
percent.

~~~
erik
Tom Forsyth's write-up seems to make mostly reasonable claims, though even he
notes that it wasn't competitive with the high end at the time.

"So let's talk about the elephant in the room - graphics. Yes, at that we did
fail. And we failed mainly for reasons of time and politics. And even then we
didn't fail by nearly as much as people think. Because we were never allowed
to ship it, people just saw a giant crater, but in fact Larrabee did run
graphics, and it ran it surprisingly well. Larrabee emulated a fully DirectX11
and OpenGL4.x compliant graphics card - by which I mean it was a PCIe card,
you plugged it into your machine, you plugged the monitor into the back, you
installed the standard Windows driver, and... it was a graphics card. There
was no other graphics cards in the system. It had the full DX11 feature set,
and there were over 300 titles running perfectly - you download the game from
Steam and they Just Work - they totally think it's a graphics card! But it's
still actually running FreeBSD on that card, and under FreeBSD it's just
running an x86 program called DirectXGfx (248 threads of it). And it shares a
file system with the host and you can telnet into it and give it other work to
do and steal cores from your own graphics system - it was mind-bending! And
because it was software, it could evolve - Larrabee was the first fully
DirectX11-compatible card Intel had, because unlike Gen we didn't have to make
a new chip when Microsoft released a new spec. It was also the fastest
graphics card Intel had - possibly still is. Of course that's a totally unfair
comparison because Gen (the integrated Intel gfx processor) has far less power
and area budget. But that should still tell you that Larrabee ran graphics at
perfectly respectable speeds. I got very good at ~Dirt3 on Larrabee.

Of course, this was just the very first properly working chip (KNF had all
sorts of problems, so KNC was the first shippable one) and the software was
very young. No, it wasn't competitive with the fastest GPUs on the market at
the time, unless you chose the workload very carefully (it was excellent at
running Compute Shaders). If we'd had more time to tune the software, it would
have got a lot closer. And the next rev of the chip would have closed the gap
further. It would have been a very strong chip in the high-end visualization
world, where tiny triangles, super-short lines and massive data sets are the
main workloads - all things Larrabee was great at. But we never got the time
or the political will to get there, and so the graphics side was very publicly
cancelled."

[http://tomforsyth1000.github.io/blog.wiki.html#%5B%5BWhy%20d...](http://tomforsyth1000.github.io/blog.wiki.html#%5B%5BWhy%20didn%27t%20Larrabee%20fail%3F%5D%5D)

~~~
TomVDB
Thanks for that!

> it was excellent at running Compute Shaders

On one hand, that statement supports my belief that there is more to graphics
than just a lot of raw compute flops.

> It would have been a very strong chip in the high-end visualization world,
> where tiny triangles, super-short lines and massive data sets are the main
> workloads - all things Larrabee was great at.

But this goes the opposite way. :-)

Because you'd think that a lot of small primitives would make it harder to
deploy those raw compute flops.

------
phkahler
Has the vector spec been finalized yet? I think it will be critical to
bringing graphics to RISC-V.

~~~
orbifold
Graphics is much more subtle than just vector instructions, as can be seen
from the failure of Larrabee.

~~~
fluffything
Note that the RISC-V vector extension also supports matrices, tensors, etc.,
so at least for GPGPU compute applications it feels quite modern (e.g. Volta
has tensor cores).

Also, Larrabee had fixed-width 512-bit vectors (it did not support SSE 128-bit
or AVX 256-bit vectors). RISC-V has Cray-style vectors of dynamic length (need
a 128-bit vector? No problem. Need a 4096-bit one? No problem either).
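
To make the dynamic-length point concrete, here is a minimal strip-mined loop
using the RVV C intrinsics (names follow the later ratified intrinsics API, so
they postdate the draft spec Ara implements): the hardware reports how many
elements it will process each iteration, so the same binary runs unchanged on
a 128-bit or a 4096-bit implementation.

    #include <riscv_vector.h>
    #include <stddef.h>

    /* c[i] = a[i] + b[i], strip-mined: vsetvl returns the number of
     * elements (vl) this iteration will actually process. */
    void vec_add(const float *a, const float *b, float *c, size_t n) {
        while (n > 0) {
            size_t vl = __riscv_vsetvl_e32m1(n);
            vfloat32m1_t va = __riscv_vle32_v_f32m1(a, vl);
            vfloat32m1_t vb = __riscv_vle32_v_f32m1(b, vl);
            vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);
            __riscv_vse32_v_f32m1(c, vc, vl);
            a += vl; b += vl; c += vl; n -= vl;
        }
    }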

I think the main issue with programming CPUs to solve GPU problems is that our
programming models for CPUs don't expose the memory hierarchy, while the ones
for GPUs do so in a much finer-grained way.

~~~
nullwasamistake
Memory bandwidth is another issue holding CPUs back.

------
erichocean
RISC-V sponsors need to spend the money on developers getting ISPC to target
whatever the RISC-V Vector spec ends up being.

Without that, you'd have to rely on auto-vectorization, and that's not great
for anything but HPC applications using OpenMP.
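
To illustrate what relying on auto-vectorization means in practice, a minimal
sketch (my own example, not from the article) of the OpenMP-style hint a
compiler wants before it will vectorize a plain C loop; whether it actually
targets RVV is entirely up to the vectorizer:

    #include <stddef.h>

    /* saxpy here is a hypothetical example routine; the pragma merely asserts
     * the loop is safe to vectorize, the compiler still does all the work. */
    void saxpy(float a, const float *x, float *restrict y, size_t n) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }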

------
alexnewman
It's really happening

------
amelius
Does it have speculative execution vulnerabilities?

~~~
DSingularity
There doesn’t appear to be any speculation in this architecture.

