I wonder if it supports vectorized 8-bit multiplies and, if so, what its performance and efficiency are on those.
I've been looking around at AI accelerators a bit recently. They AFAIK only do 8-bit operations and are rated in TOPS (tera-operations-per-second). Google's Coral USB Accelerator (with an Edge TPU chip) claims 4 TOPS, at 2 TOPS/W. The Gyrfalcon Plai Plug (with a SPR2801S chip) claims 5.6 TOPS, at 9.3 TOPS/W. It'd be really exciting to have an open architecture chip match or exceed that. The Kendryte K210 is a RISC-V AI accelerator chip announced recently, but it only claims 0.25 TOPS, or 0.8 TOPS/W.
Like how would it fare against today's FPGAs, DSPs, NPUs, and TPUs?
When I last looked at the (tabled) bitwise extensions, they had been bloated out of all reason with really wild bit-shuffling apparatus.
I understand those last are very useful to some people, but if they interfere with getting a fast popcount in the core ISA, we can't afford them.
clz/ctz are pretty useful as priority decoders that show up in a variety of applications.
Carryless multiplication can subsume all of the CRCs.
What are the uses for popcount?
EDIT: What might not have been obvious is that popcount combines well with other operations. E.g. popcount(A ^ B) gives the number of bits that differ. popcount(A & B) gives the number of bits set in both. popcount(A) gives the number of options set in a flag field. etc., etc.
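To make that concrete, here's a minimal C sketch of those combinations (assuming GCC/Clang, whose __builtin_popcountll lowers to a single POPCNT-style instruction when the ISA has one):

    #include <stdint.h>
    #include <stdio.h>

    /* Hamming distance: number of bit positions where a and b differ. */
    static int bits_differing(uint64_t a, uint64_t b) {
        return __builtin_popcountll(a ^ b);
    }

    /* Number of bit positions set in both a and b. */
    static int bits_in_common(uint64_t a, uint64_t b) {
        return __builtin_popcountll(a & b);
    }

    /* Number of options enabled in a flag field. */
    static int flags_set(uint64_t flags) {
        return __builtin_popcountll(flags);
    }

    int main(void) {
        uint64_t a = 0x2C; /* 101100 */
        uint64_t b = 0x26; /* 100110 */
        printf("differ: %d, common: %d, flags in a: %d\n",
               bits_differing(a, b), bits_in_common(a, b), flags_set(a));
        return 0;
    }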
Source for macro-op fusion claim: Celio, Christopher, et al. "The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V." arXiv preprint arXiv:1607.02318 (2016).
Source for extension claim: https://en.wikipedia.org/wiki/RISC-V#ISA_base_and_extensions (Note "B" extension for bit ops)
Yeah, that was true 20-30 years ago.
> Popcount and family are most likely microcoded to expand to a loop anyway, so why add this complexity to silicon?
No, you can just make a network of small adders. You could easily build a CPU with a throughput of 1 popcount per cycle (or even better, if SIMD and/or multiple execution ports supported this ALU operation). Silicon area isn't really a problem anymore. Power and heat, on the other hand...
Yes, so then why are ARM chips in mobile devices approaching x86 performance with an order of magnitude less power? Could it be that having a large chunk of rarely-used silicon ends up consuming some significant amount of power, becoming dominant at excessively large scales?
It doesn't have to scale superlinearly either, because the natural circuit using a linear number of adders is already of a logarithmic depth, which is best possible asymptotically.
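For a sense of how small that circuit is, here's the standard SWAR popcount in C; the pairwise widening adds mirror exactly the log-depth adder tree a hardware implementation would wire up:

    #include <stdint.h>

    /* Branch-free 64-bit popcount built from a tree of small adders:
     * each step sums adjacent fields twice as wide as the previous one,
     * so only log2(64) = 6 levels are needed end to end. */
    static int popcount64(uint64_t x) {
        x = x - ((x >> 1) & 0x5555555555555555ULL);               /* 2-bit sums */
        x = (x & 0x3333333333333333ULL)
          + ((x >> 2) & 0x3333333333333333ULL);                   /* 4-bit sums */
        x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;               /* 8-bit sums */
        return (int)((x * 0x0101010101010101ULL) >> 56);          /* fold bytes */
    }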
Think AVX2 instead of GPU. You keep the MIMD architecture with all the benefits of out-of-order execution and deep pipelines, but add data-level parallelism through SIMD or "vector" operations. This can allow significant performance gains for fundamentally sequential code that still needs to perform linear algebra operations. Furthermore, a lot of the speedup relies on unrolling sequential math operations and manually breaking data dependencies in the code, which has diminishing returns for larger vector/matrix dimensions (you'd spend more time loading/storing into GPU registers even if you had no bandwidth concerns over PCIe).
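As a rough sketch of what that looks like on the CPU side (assuming AVX2+FMA and, for brevity, n being a multiple of 16): two independent accumulators break the loop-carried dependency so an out-of-order core can keep multiple FMA ports busy.

    #include <immintrin.h>
    #include <stddef.h>

    /* Dot product with the dependency chain split across two accumulators. */
    static float dot_avx2(const float *a, const float *b, size_t n) {
        __m256 acc0 = _mm256_setzero_ps();
        __m256 acc1 = _mm256_setzero_ps();
        for (size_t i = 0; i < n; i += 16) {
            acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                                   _mm256_loadu_ps(b + i), acc0);
            acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 8),
                                   _mm256_loadu_ps(b + i + 8), acc1);
        }
        __m256 acc = _mm256_add_ps(acc0, acc1);
        /* Horizontal sum of the 8 lanes. */
        __m128 lo = _mm256_castps256_ps128(acc);
        __m128 hi = _mm256_extractf128_ps(acc, 1);
        __m128 s  = _mm_add_ps(lo, hi);
        s = _mm_hadd_ps(s, s);
        s = _mm_hadd_ps(s, s);
        return _mm_cvtss_f32(s);
    }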
Vector processors/architectures are commonly found in special-purpose chips called Digital Signal Processors (DSPs), which are important for a variety of applications like automotive, aerospace, controls, audio, IoT, or anywhere with real-time data acquisition. FPGAs are also popular for this task.
However, a lot of those devices are pretty cheap (or really expensive; there's not a lot of middle ground due to economies of scale) and either underclocked or overspec'd, meaning you either pay out the ass for an overkill processor or pay out the ass for a complicated system architecture to use multiple chips on the same board (with proprietary tooling, shout-out to ADI).
Cheap(ish), low(ish)-power, high(ish)-clock CPUs with 256-wide vector operations are highly attractive in a number of markets. The fact that it's open source makes it even more attractive, if you can afford to do a run of them.
> Under this model, each SM on an NVIDIA GPU corresponds to a more traditional CPU core. These SMs would contain some number of 32-wide vector registers. It seems that CUDA exposes operations on vector registers as a warp. They appear to be 32 threads because each instruction operates on 32 lanes at once, while the threads must proceed in lock step because they are actually a single stream of instructions.
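A toy C model of that description (not CUDA, just an illustration): one "instruction" updates all 32 lanes of a warp-sized register, and divergence is handled by an active mask rather than by independent program counters.

    #include <stdint.h>

    #define WARP_SIZE 32

    /* A warp-sized "vector register": 32 lanes driven by one instruction stream. */
    typedef struct { float lane[WARP_SIZE]; } vreg;

    /* One SIMT-style instruction: the per-lane loop is what the hardware does
     * in a single step. The "threads" stay in lock step because there is only
     * one program counter; branching is modeled by masking lanes off. */
    static void warp_fma(vreg *d, const vreg *a, const vreg *b, const vreg *c,
                         uint32_t active_mask) {
        for (int l = 0; l < WARP_SIZE; l++)
            if (active_mask & (1u << l))
                d->lane[l] = a->lane[l] * b->lane[l] + c->lane[l];
    }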
> Is vector processors being open-source a desirable quality (compared to general attractiveness of open-source)?
Absolutely. Imagine an initiative to create a standardized open-source GPU for Linux. This vector processor could provide the open-source silicon IP to help build it.
This chip is just another step towards fully open-source systems.
A CPU that can do number crunching like a GPU as well as general purpose computing would be amazing.
And another about shaders on the SPE as part of a deferred rendering pipeline: http://www.dice.se/wp-content/uploads/2014/12/Christina_Coff...
The RSX was still generally used for texturing and rasterization.
Despite being increasingly programmable, GPUs still have very fine-tuned hardware for graphics, and this goes deeper than just adding some texture mapping units here or there.
A good example of this is the fact that both Nvidia and AMD keep on adding new modes in HW to streamline graphics primitive processing with varying degrees of success (see mesh shaders and primitive shaders).
To me, this signals that a free-for-all software approach as championed by Larrabee is simply not efficient enough in a competitive environment where GPUs are declared winner or loser based on benchmark differences of just a few percent.
"So let's talk about the elephant in the room - graphics. Yes, at that we did fail. And we failed mainly for reasons of time and politics. And even then we didn't fail by nearly as much as people think. Because we were never allowed to ship it, people just saw a giant crater, but in fact Larrabee did run graphics, and it ran it surprisingly well. Larrabee emulated a fully DirectX11 and OpenGL4.x compliant graphics card - by which I mean it was a PCIe card, you plugged it into your machine, you plugged the monitor into the back, you installed the standard Windows driver, and... it was a graphics card. There was no other graphics cards in the system. It had the full DX11 feature set, and there were over 300 titles running perfectly - you download the game from Steam and they Just Work - they totally think it's a graphics card! But it's still actually running FreeBSD on that card, and under FreeBSD it's just running an x86 program called DirectXGfx (248 threads of it). And it shares a file system with the host and you can telnet into it and give it other work to do and steal cores from your own graphics system - it was mind-bending! And because it was software, it could evolve - Larrabee was the first fully DirectX11-compatible card Intel had, because unlike Gen we didn't have to make a new chip when Microsoft released a new spec. It was also the fastest graphics card Intel had - possibly still is. Of course that's a totally unfair comparison because Gen (the integrated Intel gfx processor) has far less power and area budget. But that should still tell you that Larrabee ran graphics at perfectly respectable speeds. I got very good at ~Dirt3 on Larrabee.
Of course, this was just the very first properly working chip (KNF had all sorts of problems, so KNC was the first shippable one) and the software was very young. No, it wasn't competitive with the fastest GPUs on the market at the time, unless you chose the workload very carefully (it was excellent at running Compute Shaders). If we'd had more time to tune the software, it would have got a lot closer. And the next rev of the chip would have closed the gap further. It would have been a very strong chip in the high-end visualization world, where tiny triangles, super-short lines and massive data sets are the main workloads - all things Larrabee was great at. But we never got the time or the political will to get there, and so the graphics side was very publicly cancelled."
> it was excellent at running Compute Shaders
On one hand, that statement supports my belief that there is more to graphics than just a lot of raw compute flops.
> It would have been a very strong chip in the high-end visualization world, where tiny triangles, super-short lines and massive data sets are the main workloads - all things Larrabee was great at.
But this goes the opposite way. :-)
Because you'd think that a lot of small primitives would make it harder to deploy those raw compute flops.
Curiously this session is only 15 minutes long. June 11, 17:45 - 18:00, by Krste Asanovic.
Nevertheless it is quite promising:
"Vector Extension 0.7. The vector extensions have reached a major milestone with the release of version 0.7, which is intended for widespread implementation and comment".
Also, Larrabee had fixed-width 512-bit vectors (it did not support SSE's 128-bit or AVX's 256-bit vectors). RISC-V has Cray-style vectors of dynamic length (need a 128-bit vector? no problem; need a 4096-bit one? no problem either).
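Conceptually, that dynamic length is consumed by a stripmining loop. Here's a scalar C stand-in (not real RVV intrinsics; the names below are made up): vsetvl() models the vsetvli instruction returning min(remaining, whatever the hardware supports), so the same loop runs unchanged on a 128-bit or a 4096-bit implementation.

    #include <stddef.h>

    #define VLMAX 8   /* pretend this implementation handles 8 floats per strip */

    /* Stand-in for vsetvli: hardware reports how many elements it will process. */
    static size_t vsetvl(size_t requested) {
        return requested < VLMAX ? requested : VLMAX;
    }

    /* Stand-in for a vle32/vfmacc/vse32 sequence over vl elements. */
    static void vec_axpy(float *y, const float *x, float a, size_t vl) {
        for (size_t i = 0; i < vl; i++)
            y[i] += a * x[i];
    }

    void saxpy(float *y, const float *x, float a, size_t n) {
        while (n > 0) {
            size_t vl = vsetvl(n);   /* hardware picks the strip length */
            vec_axpy(y, x, a, vl);   /* one strip of work */
            x += vl; y += vl; n -= vl;
        }
    }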
I think the main issue with programming CPUs to solve GPU problems is that our programming models for CPUs don't expose the memory hierarchy, while the ones for GPUs expose it in a much finer-grained way.
It is. But given there are zero other open-source solutions, I think LLVMpipe running on the vector extension would be a decent option. It's not going to compete with the big guys, but it will make SoCs possible without licensing graphics IP.
Without that, you'd have to rely on auto-vectorization, and that's not great for anything but HPC applications using OpenMP.
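For reference, "relying on auto-vectorization" usually means something like the following small sketch using the OpenMP 4.0 simd directive (compile with -fopenmp-simd or equivalent). It works fine for regular HPC-style loops like this one, much less so for branchy or pointer-chasing code.

    #include <stddef.h>

    /* Ask the compiler to vectorize this loop; restrict tells it the arrays
     * don't alias, which is often what blocks auto-vectorization in practice. */
    void scale_add(float *restrict y, const float *restrict x, float a, size_t n) {
        #pragma omp simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }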