Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor (arxiv.org)
187 points by matt_d on June 7, 2019 | hide | past | favorite | 56 comments

> Ara runs at 1.2 GHz in the typical corner (TT/0.80 V/25 °C), achieving a performance up to 34 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 67 DP-GFLOPS/W under the same conditions, which is 56% higher than similar vector processors found in literature.

I wonder if it supports vectorized 8-bit multiplies and, if so, what its performance and efficiency is on those.

I've been looking around at AI accelerators a bit recently. They AFAIK only do 8-bit operations and are rated in TOPS (tera-operations-per-second). Google's Coral USB Accelerator (with an Edge TPU chip) claims 4 TOPS, at 2 TOPS/W. The Gyrfalcon Plai Plug (with a SPR2801S chip) claims 5.6 TOPS, at 9.3 TOPS/W. It'd be really exciting to have an open architecture chip match or exceed that. The Kendryte K210 is a RISC-V AI accelerator chip announced recently, but it only claims 0.25 TOPS, or 0.8 TOPS/W.

Thanks. A lot of people here seem to focus on Computer Graphics; I'm also wondering about potential applications in Machine Learning and Neural Networks.

Like how would it fare against today's FPGAs, DSPs, NPUs, and TPUs?

Before I can be bullish on RISC-V, I really need to see a set of basic bitwise instructions -- minimally, popcount and count-leading-zeros (i.e. integer log-base-2).

When I last looked at the (tabled) bitwise extensions, they had been bloated out of all reason with really wild bit-shuffling apparatus.

I understand those last are very useful to some people, but if they interfere with getting a fast popcount in the core ISA, we can't afford them.

The B extension isn't tabled, it's active again:


That is really an admirable document. I learned a lot, about way more than just RISC-V.

I would like at least popcount (or, alternatively, the "sideways add" in MMIX, which does popcount(x&~y)) and "multiple or" (another instruction in MMIX).

I think that the generalized reverse instruction could have fit just fine, because the circuit that computes it is also capable of the complete set of shifts and rotations. Not sure about that generalized zip, though.

clz/ctz are pretty useful as priority decoders that show up in a variety of applications.

Carryless multiplication can subsume all of the CRC's.

What are the users for popcount?

> What are the users for popcount?


EDIT: What might not have been obvious is that popcount combines with other operations well. E.g. popcount(A ^ B) gives the number of bits that differ. popcount(A & B) gives the number of bits set in both. popcount(A) gives the number of options set in a flag field. etc., etc.

popcnt is all over the place once you start doing any interesting bitwise stuff at all.

Can you be more specific?

I've used it for bitwise hamming distance - xor+popcnt - having a hardware instruction on x86 made the entire pipeline over 6x faster.

I’ve used popcounts for full vectorization for b-bit minwise hashing, comparing bloom filters efficiently, and bitvector traversal. It’s a standard primitive in a lot of low-level code.

What's the rush? The instructions can be emulated in the meantime at a passable speed.

If they could be emulated at passable speed, I would not need them.

Could you be more specific? What are you doing on a RISC-V board where emulated popcnt loses more than 10-20% of your performance?

The whole point of RISC is to avoid instruction complexity for rarely used instructions. Popcount and family are most likely microcoded to expand to a loop anyway, so why add this complexity to silicon?

I checked the latency and throughput of the x86 POPCNT and LZCNT instructions on a bunch of microarchitectures [1], and can't think of any way to get that kind of performance with a combination of simpler instructions. These are occasionally very useful operations, and I really do want them to be fast.

[1] https://www.agner.org/optimize/instruction_tables.pdf

I don't know why you're being downvoted, because you are correct. It's an explicit goal of RISC-V to support complex operations by using a mixture of macro-op fusion and extensions, rather than trying to encode everything in the base architecture.

Source for macro-op fusion claim: Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).

Source for extension claim: https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions (Note "B" extension for bit ops)

> The whole point of Risc is to avoid instruction complexity for rarely used instructions

Yeah, that was true 20-30 years ago.

> Popcount and family are most likely microcoded to expand to a loop anyway, so why add this complexity to silicon?

No, you can just make a network of small adders. You could easily build a CPU that has a throughput of 1 popcount per cycle (or even better, if SIMD and/or multiple execution ports supported this ALU operation). Silicon area is not a problem anymore anyway. Power and heat, on the other hand...

>Silicon area is not a problem anymore anyway. Power and heat, on the other hand...

Yes, so then why are ARM chips in mobile devices approaching x86 performance with an order of magnitude less power? C̶o̶u̶l̶d̶ ̶i̶t̶ ̶b̶e̶ ̶t̶h̶a̶t̶ ̶h̶a̶v̶i̶n̶g̶ ̶a̶ ̶l̶a̶r̶g̶e̶ ̶c̶h̶u̶n̶k̶ ̶o̶f̶ ̶r̶a̶r̶e̶l̶y̶-̶u̶s̶e̶d̶ ̶s̶i̶l̶i̶c̶o̶n̶ ̶e̶n̶d̶s̶ ̶u̶p̶ ̶c̶o̶n̶s̶u̶m̶i̶n̶g̶ ̶s̶o̶m̶e̶ ̶s̶i̶g̶n̶i̶f̶i̶c̶a̶n̶t̶ ̶a̶m̶o̶u̶n̶t̶ ̶o̶f̶ ̶p̶o̶w̶e̶r̶,̶ ̶b̶e̶c̶o̶m̶i̶n̶g̶ ̶d̶o̶m̶i̶n̶a̶n̶t̶ ̶a̶t̶ ̶e̶x̶c̶e̶s̶s̶i̶v̶e̶l̶y̶ ̶l̶a̶r̶g̶e̶ ̶s̶c̶a̶l̶e̶s̶?̶

That’s not the reason, especially not in this day and age of clock gating. It’s because those ARM designs are optimized for low power, at the expense of not having headroom to take advantage of higher power envelopes if they are available.

Die area does impact latency, and by including popcount, you would be increasing gate count by a fraction of a full multiplier.

The number of gates required for popcount scales linearly with the input size. It's much, much smaller than a multiplier.

No, it would scale logarithmically if you want to minimize latency.

It cannot scale logarithmically because you need at least a linear number of gates to represent the inputs.

It doesn't have to scale superlinearly either, because the natural circuit using a linear number of adders is already of a logarithmic depth, which is best possible asymptotically.

The gate depth scales logarithmically, and that determines how many cycles it takes, or (worse) how fast your cycles can be, if you want the answer in one cycle.

A lot of complexity could be eschewed if processors weren't faster than 100 Mhz.

Commentary? What's the difference between a GPU and a vector processor? Is vector processors being open-source a desirable quality (compared to general attractiveness of open-source)?

Open hardware is less important than open software but that doesn't mean it isn't desirable. But anyway, vector v gpu:

Think AVX2 instead of a GPU. You extend a MIMD architecture, with all the benefits of out-of-order execution and deep pipelines, by adding data-level parallelism through SIMD or "vector" operations. This can allow significant performance gains for fundamentally sequential code that still needs to perform linear algebra operations. Furthermore, a lot of the speedup relies on unrolling sequential math operations and manually breaking data dependencies in the code, which has diminishing returns for larger vector/matrix dimensions (you'd spend more time loading/storing into GPU registers even if you had no bandwidth concerns over PCIe).

Vector processors/architectures are commonly found in special purpose architectures called Digital Signal Processors (DSPs), which are important for a variety of applications like automotive, aerospace, controls, audio, IoT, or anywhere with real-time data acquisition. FPGAs are also popular for this task.

However a lot of those devices are pretty cheap (or really expensive, not a lot of middle ground due to economies of scale) and either underclocked or overspec'd - meaning you either pay out the ass for an overkill processor or pay out the ass for a complicated systems architecture to use multiple chips on the same board (with proprietary tooling, shout-out to ADI)

Cheap(ish), low(ish)-power, high(ish)-clock CPUs with 256-wide vector operations are highly attractive in a number of markets. The fact that it's open source makes it even more attractive, if you can afford to do a run of them.

From: https://theincredibleholk.wordpress.com/2012/10/26/are-gpus-...

> Under this model, each SM on an NVIDIA GPU corresponds to a more traditional CPU core. These SMs would contain some number of 32-wide vector registers. It seems that CUDA exposes operations on vector registers as a warp. They appear to be 32 threads because each instruction operates on 32 lanes at once, while the threads must proceed in lock step because they are actually a single stream of instructions.

>Is vector processors being open-source a desirable quality (compared to general attractiveness of open-source)?

Absolutely. Imagine an initiative to create a standardized open-source GPU for Linux? This vector processor could provide open source silicon IP to help build it.

This chip is just another step towards fully open-source systems.

> Imagine an initiative to create a standardized open-source GPU for Linux?

A CPU that can do number crunching like a GPU as well as general purpose computing would be amazing.

The Web really needs a good resource on answering this question. I didn't see one in quick search. These links each have pieces of the answer:




In order to render current games / 3D with reasonable performance you need dedicated texturing hardware. Intel's Larrabee had it, even though it never shipped as a graphics card. And early plans for the PS3 involved adding texture units to the Cell CPU, but Sony went with an Nvidia GPU instead.

A lot of later PS3 games started pulling GPU shading tasks onto the SPUs instead, because the shader pipeline was comparatively slow and SPUs were so well suited to the task. Here's a great presentation about occlusion on the SPE, part of a series of wonderful presentations about SPE use in Killzone 3: https://www.slideshare.net/guerrillagames/practical-occlusio...

And another about shaders on the SPE as part of a deferred rendering pipeline: http://www.dice.se/wp-content/uploads/2014/12/Christina_Coff...

The RSX was still generally used for texturing and rasterization.

Sampling/texturing helps, but I can't help thinking that the software stack is the critical factor. You might make a great new open ecosystem, but if I can't port my OGL/OCL/etc. benchmark to your hardware easily, I'm probably not going to waste my time.

Apparently this was the big thing that held Larrabee back as a GPU. The hardware was capable, but they never got their drivers to the point where they could do a good job with existing OpenGL, Direct3D, etc. code.

Is there any strong evidence of the claim that Larrabee HW had the potential of being a capable GPU, and that it was just a SW issue?

Despite being increasingly programmable, GPUs still have very fine-tuned hardware for graphics, and this goes deeper than just adding some texture mapping units here or there.

A good example of this is the fact that both Nvidia and AMD keep on adding new modes in HW to streamline graphics primitive processing with varying degrees of success (see mesh shaders and primitive shaders.)

To me, this signals that a free-for-all software approach as championed by Larrabee is simply not efficient enough in a competitive environment where GPUs are declared winner or loser based on benchmark differences of just a few percent.

Tom Forsyth's write up seems to make mostly reasonable claims. Though even he notes that it wasn't competitive with the high end at the time.

"So let's talk about the elephant in the room - graphics. Yes, at that we did fail. And we failed mainly for reasons of time and politics. And even then we didn't fail by nearly as much as people think. Because we were never allowed to ship it, people just saw a giant crater, but in fact Larrabee did run graphics, and it ran it surprisingly well. Larrabee emulated a fully DirectX11 and OpenGL4.x compliant graphics card - by which I mean it was a PCIe card, you plugged it into your machine, you plugged the monitor into the back, you installed the standard Windows driver, and... it was a graphics card. There was no other graphics cards in the system. It had the full DX11 feature set, and there were over 300 titles running perfectly - you download the game from Steam and they Just Work - they totally think it's a graphics card! But it's still actually running FreeBSD on that card, and under FreeBSD it's just running an x86 program called DirectXGfx (248 threads of it). And it shares a file system with the host and you can telnet into it and give it other work to do and steal cores from your own graphics system - it was mind-bending! And because it was software, it could evolve - Larrabee was the first fully DirectX11-compatible card Intel had, because unlike Gen we didn't have to make a new chip when Microsoft released a new spec. It was also the fastest graphics card Intel had - possibly still is. Of course that's a totally unfair comparison because Gen (the integrated Intel gfx processor) has far less power and area budget. But that should still tell you that Larrabee ran graphics at perfectly respectable speeds. I got very good at ~Dirt3 on Larrabee.

Of course, this was just the very first properly working chip (KNF had all sorts of problems, so KNC was the first shippable one) and the software was very young. No, it wasn't competitive with the fastest GPUs on the market at the time, unless you chose the workload very carefully (it was excellent at running Compute Shaders). If we'd had more time to tune the software, it would have got a lot closer. And the next rev of the chip would have closed the gap further. It would have been a very strong chip in the high-end visualization world, where tiny triangles, super-short lines and massive data sets are the main workloads - all things Larrabee was great at. But we never got the time or the political will to get there, and so the graphics side was very publicly cancelled."


Thanks for that!

> it was excellent at running Compute Shaders

On one hand, that statement supports my belief that there is more to graphics than just a lot of raw compute flops.

> It would have been a very strong chip in the high-end visualization world, where tiny triangles, super-short lines and massive data sets are the main workloads - all things Larrabee was great at.

But this goes the opposite way. :-)

Because you'd think that a lot of small primitives would make it harder to deploy those raw compute flops.

And didn't ever offer OCL IIRC.

I think they mean something like SSE/AVX, but with something like ARM Scalable Vector Extension (code that works on registers with 128 to 2048 bit size, unknown at compile time).

Has the vector spec been finalized yet? I think it will be critical to bringing graphics to RISC-V.

It's still a work-in-progress, but there should be some interesting news at the RISC-V Zurich conference next week [1]. For example, Imperas has announced an ISA simulator for the current 0.71 version of the specification [2].

[1] (https://tmt.knect365.com/risc-v-workshop-zurich/agenda/1)

[2] (http://imperas.com/articles/imperas-delivers-first-risc-v-si...)

> at the RISC-V Zurich conference next week

Curiously this session is only 15 minutes long. June 11, 17:45 - 18:00, by Krste Asanovic.

Nevertheless it is quite promising:

"Vector Extension 0.7. The vector extensions have reached a major milestone with the release of version 0.7, which is intended for widespread implementation and comment".

Excellent news.

There were a ton of talk submissions, and the schedule is incredibly tight because of it. =(

Graphics is much more subtle than just vector instructions as can be seen from the failures of Larrabee.

Note that the RISC-V vector extension also supports matrices, tensors, etc. So at least for GPGPU compute applications, it feels quite modern (e.g. Volta has tensor cores).

Also, Larrabee had fixed-width 512-bit vectors (it did not support SSE 128-bit-wide or AVX 256-bit-wide vectors). RISC-V has Cray-style vectors of dynamic length (need a 128-bit vector? no problem; need a 4096-bit one? no problem either).

I think the main issue for programming CPUs to solve GPU problems is that our programming models for CPUs don't expose the memory hierarchy, while the ones for GPUs do so in a much finer-grained way.

Memory bandwidth is another issue holding CPUs back

>> Graphics is much more subtle than just vector instructions as can be seen from the failures of Larrabee.

It is. But given there are zero other open-source solutions, I think LLVMpipe running on the vector extension would be a decent option. It's not going to compete with the big guys, but it will make SoCs possible without licensing graphics IP.

Larrabee had dedicated texturing hardware, but management decided to not expose it or try shipping it as a GPU.

I'd imagine Larrabee's failure has primarily been in being competitive with AMD and Nvidia?

Yes. But the RISC-V vector instructions will serve as the baseline, and further extensions will add register types; Esperanto is already extending them with graphics types.

RISC-V sponsors need to spend the money on developers getting ISPC to target whatever the RISC-V Vector spec ends up being.

Without that, you'd have to rely on auto-vectorization and that's not great for anything but HPC applications using OpenMP.

It's really happening

Does it have speculative execution vulnerabilities?

There doesn’t look to be any speculation in this architecture.
