EDIT: What would be cool would be a more traditional GPU architecture, married to the kind of RISC-V "minion" cores that already exist today but aren't open at all. There are a lot of very closed (even more so than the shader units) RISC processors in GPUs. AMD has a RISC core reading and processing the command lists, among other things: https://github.com/fail0verflow/radeon-tools/tree/master/f32 PowerVR has one doing thread scheduling and argument marshaling (the Programmable Data Sequencer). Nvidia has its own controller cores, which it recently switched to RISC-V.
Imagine having a custom command list tailored to your application, sort of like the benefits custom N64 RSP microcode had.
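To make that concrete, here's a hypothetical sketch (in C) of what an application-specific command list driven by a RISC-V minion core might look like. The packet layout, opcodes, and function names are all invented for illustration; no real GPU uses this exact format:

    /* Invented command-list format for a RISC-V "minion" front-end core,
     * in the spirit of custom N64 RSP microcode. Everything here is a
     * hypothetical sketch, not any vendor's actual packet layout. */
    #include <stddef.h>
    #include <stdint.h>

    enum { CMD_NOP, CMD_DRAW_SPRITE_BATCH, CMD_SET_PALETTE, CMD_END };

    typedef struct {
        uint32_t opcode;     /* one of the CMD_* values above           */
        uint32_t arg_count;  /* number of 32-bit arguments that follow  */
        uint32_t args[];     /* flexible array member: inline arguments */
    } cmd_packet;

    /* The minion core would spin on a loop like this, walking a ring
     * buffer filled by the host CPU and expanding app-level packets
     * into raw hardware register writes. */
    void minion_run(const uint32_t *ring, size_t words) {
        size_t i = 0;
        while (i < words) {
            const cmd_packet *p = (const cmd_packet *)&ring[i];
            switch (p->opcode) {
            case CMD_DRAW_SPRITE_BATCH:
                /* app-specific fast path: one packet -> many draws */
                break;
            case CMD_SET_PALETTE:
                break;
            case CMD_END:
                return;
            }
            i += 2 + p->arg_count;  /* header is two 32-bit words */
        }
    }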
Is that so bad? I would think the point of the project is to demonstrate proof of principle, not to actually build something usable. So efficiency would be pretty low on the list of priorities.
Graphics APIs (OpenGL, DirectX) are leaky abstractions in that the graphics programmer is knowingly targeting an architecture with GPU performance properties: rasterization is cheap, texture fetch and filtering are cheap, and fixed-function blending, saturation, clears, and compression are effectively free.
CPU rendering cannot hope to compete with GPU hardware, because the graphics programmer has already baked those GPU-style optimizations into how they use the APIs to render 3D scenes.
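For instance, completely ordinary OpenGL code leans on that "free" fixed function without thinking about it, while a CPU rasterizer would have to pay for each of these per pixel (plain GL 1.x-era calls, purely as illustration):

    /* Typical frame setup that assumes GPU performance properties:
     * the clear, the blend, and the bilinear filter cost the programmer
     * nothing on GPU hardware; a software renderer pays for each one. */
    #include <GL/gl.h>

    void frame_setup(void) {
        /* "free" clear: often a fast-path/metadata op on real GPUs */
        glClearColor(0.0f, 0.0f, 0.0f, 1.0f);
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

        /* "free" fixed-function alpha blending, applied per fragment */
        glEnable(GL_BLEND);
        glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

        /* "free" filtering: one texture fetch = bilinear in hardware */
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
        glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    }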
RISC-V is developing a Vector extension (V) that will allow SIMD-style programming with a variable vector length.
See this talk from the last workshop:
Project Update: https://www.youtube.com/watch?v=ESu9NI3h1Y4
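The core idea is vector-length-agnostic code: software asks for n elements, the hardware answers with how many it will process this pass, and the same binary then runs unchanged on any vector register width. A minimal C sketch of that model, where vsetvl() is a hypothetical stand-in for the real vsetvli instruction:

    /* Conceptual sketch of V-style vector-length-agnostic strip-mining.
     * vsetvl() is a stand-in for vsetvli: it returns how many elements
     * the hardware will process this iteration (at most n). */
    #include <stddef.h>

    extern size_t vsetvl(size_t n);  /* hypothetical: min(n, HW VLMAX) */

    void vec_add(float *a, const float *b, size_t n) {
        while (n > 0) {
            size_t vl = vsetvl(n);           /* hardware-chosen chunk  */
            for (size_t i = 0; i < vl; i++)  /* stands in for vadd.vv  */
                a[i] += b[i];
            a += vl;
            b += vl;
            n -= vl;
        }
    }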
Initially the standard also included a Vector Type field for each register, which would have allowed other data types, such as tensors, matrices and so on.
This has been removed from the initial Vector extension proposal, but work on it will continue. At least one company is already actively developing hardware for V with tensors (see below), sadly not open-source.
This Libre GPU project is going a slightly different route, targeting Simple-V, a slimmer version of the Vector extension. See: http://hands.com/~lkcl/simple_v_chennai_2018.pdf
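As I read the linked slides, the gist of Simple-V is that there are no new vector opcodes at all: a CSR tags an existing scalar register as a vector with a length, and ordinary scalar instructions then implicitly loop over a register group. A toy C model of that idea (the register-file size and CSR semantics here are simplified guesses, not the real spec):

    /* Toy interpreter for the Simple-V concept: scalar ops are
     * "vectorised" by CSR state, not by new instructions. */
    #include <stdint.h>

    static uint64_t xreg[128];  /* extended integer register file       */
    static uint8_t  vlen[128];  /* per-register vector-length tag (CSR) */

    /* Mark register r as a vector spanning n consecutive registers. */
    void csr_set_vector(int r, uint8_t n) { vlen[r] = n; }

    /* One plain scalar ADD rd, rs1, rs2 under Simple-V: if the
     * destination is tagged, the same scalar ALU op repeats over
     * the register group. */
    void sv_add(int rd, int rs1, int rs2) {
        int n = vlen[rd] ? vlen[rd] : 1;
        for (int i = 0; i < n; i++)
            xreg[rd + i] = xreg[rs1 + i] + xreg[rs2 + i];
    }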
Esperanto Technologies is developing a chip (and IP) that will do this, but it will be closed source (as far as we know).
See this talk by David Ditzel: https://www.youtube.com/watch?v=f-b4QOzMyfU
Imagine a processor with a couple of beefy RISC-V cores with lots of memory bandwidth and deep pipelines, sharing the system with some more power-efficient (but slower) cores, and some cores with very wide SIMD pipelines that sacrifice branch prediction and speculative execution to get that width.
I'd love to program such a beast.
One example of a fused op: A += B, which also sets the mask register K to the sign bit of the scalar float in A.
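In C, that fused op would behave roughly like this. The vaddsets name, the 16 float lanes (512-bit / float32), and the per-lane mask semantics are my guesses at the LRBni model, not actual syntax:

    /* One instruction both does the wide add and deposits each result's
     * sign bit into mask register K, ready for a predicated op or a
     * fused test-and-jump. Names and widths are illustrative. */
    #include <math.h>
    #include <stdint.h>

    typedef struct { float lane[16]; } v16f;
    typedef uint16_t mask16;  /* one bit per lane, like an LRB mask reg */

    mask16 vaddsets(v16f *a, const v16f *b) {
        mask16 k = 0;
        for (int i = 0; i < 16; i++) {
            a->lane[i] += b->lane[i];
            if (signbit(a->lane[i]))      /* sign bit of result -> K(i) */
                k |= (mask16)1u << i;
        }
        return k;
    }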
There were a few other instructions, some implemented (MADD233) and some not (a full permute), that were needed to finish out the performance profile.
In addition, LRB was designed so that (almost) every instruction retired in 4 stages. Each 'core' was a 4-way barrel processor, so, except for a wart dealing with RAW hazards on mask register ops, all instructions (including fused test-jmps) retired "the next cycle".
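A toy model of that barrel scheduling: with four hardware threads issuing round-robin into a 4-stage pipeline, each thread's previous instruction has always cleared the pipe by the time that thread issues again, so from its own point of view every instruction retired "the next cycle":

    /* Conceptual round-robin barrel core: the 4-cycle issue spacing per
     * thread exactly hides the 4-stage instruction latency. */
    typedef struct { int pc; } hw_thread;  /* per-thread arch state */

    void barrel_core(hw_thread t[4], int cycles) {
        for (int c = 0; c < cycles; c++) {
            hw_thread *cur = &t[c % 4];  /* strict round-robin slot  */
            /* issue one instruction for cur; its 4-stage latency is
             * fully hidden because cur won't issue for 4 more cycles */
            cur->pc += 4;
        }
    }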
LRB died on its shit backbone (the triple ring). If they'd had a proper message-passing cache, with a parallel scratch RAM next to the cache hierarchy, it would have knocked everyone's socks off, even in 2009, three years late.