
The MRISC32 – A vector first CPU design - mbitsnbites
http://www.bitsnbites.eu/the-mrisc32-a-vector-first-cpu-design/
======
jrk
The RISC V V-extension follows a Cray-style model, and builds on the work with
Hwacha to do mixed-precision vectors much more cleanly:

[https://pdfs.semanticscholar.org/b9e8/fcf11b662a31cecd5a08d5...](https://pdfs.semanticscholar.org/b9e8/fcf11b662a31cecd5a08d5e022dcae774df2.pdf)
[https://www.sigarch.org/simd-instructions-considered-harmful...](https://www.sigarch.org/simd-instructions-considered-harmful/)

~~~
mbitsnbites
Yes, I've seen it. It's quite neat. On the other hand it seems more complex to
implement in hardware.

In the simplest possible design, the cost of vector support in the MRISC32 is
essentially the cost of the register file (i.e. the register memory), since
the vector control logic just consists of a few adders and flip-flops, and the
scalar execution units can be reused for vector operations.
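A minimal sketch of that idea (invented names, not the actual MRISC32 RTL): the vector "control logic" amounts to an element counter stepping the same scalar execution unit, which is why its hardware cost is little more than a few adders and flip-flops on top of the register memory.

```python
# Hypothetical model of a vector-first pipeline reusing its scalar ALU.
# Names are invented for illustration; this mirrors the described design,
# not the real VHDL.

def scalar_alu(op, a, b):
    """The existing scalar execution unit."""
    return {"add": a + b, "mul": a * b}[op]

def vector_op(op, va, vb, vl):
    """Vector control: an element counter that feeds the scalar ALU once
    per cycle for vl cycles. No extra datapath is needed."""
    return [scalar_alu(op, va[i], vb[i]) for i in range(vl)]

print(vector_op("add", [1, 2, 3, 4], [10, 20, 30, 40], vl=4))
# [11, 22, 33, 44]
```

Note that with `vl=1` the same instruction degenerates into an ordinary scalar operation, which is the property a later comment in this thread relies on.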

------
petermcneeley
There are similarities between this and the PS3 SPE processors. The SPEs had
128 bit registers that could be interchangeably used for various sized and
various typed operations.
[https://en.wikipedia.org/w/index.php?title=Cell_(microproces...](https://en.wikipedia.org/w/index.php?title=Cell_\(microprocessor\)#Synergistic_Processing_Elements_\(SPE\))

------
monocasa
I was under the impression that we split the register file between the
floating and integer pipelines because the extra muxes between the register
file and the execution units, as well as on the bypass network, really cut
into the critical path. It's only on vector units, with their long pipelines,
that we don't really care.

~~~
petermcneeley
Ya, but a modern processor has virtual registers, perhaps hundreds of them,
so I'm guessing this might be historical. It should also be noted that there
is a strong connection between floating-point/SIMD operations and implicit
registers like EAX (_mm_movemask_epi8 is one example).
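For reference, the semantics of `_mm_movemask_epi8` can be modeled in a few lines of Python (behavior only, not the SIMD implementation): it gathers the top bit of each of the 16 bytes in a 128-bit register into a 16-bit integer that lands in a general-purpose register, EAX being the conventional example.

```python
# Scalar model of the SSE2 _mm_movemask_epi8 intrinsic's semantics.
# Input is 16 byte values (one 128-bit register); output is the 16-bit
# mask of their sign bits, as returned in a general-purpose register.

def movemask_epi8(bytes16):
    assert len(bytes16) == 16
    mask = 0
    for i, b in enumerate(bytes16):
        mask |= ((b >> 7) & 1) << i   # top bit of byte i -> bit i of result
    return mask

print(hex(movemask_epi8([0x80] * 8 + [0x00] * 8)))
# 0xff
```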

~~~
monocasa
They still keep the physical register files split between integer and
floating/vector, despite the renaming that happens on each of the pipes.
Ironically enough, the integer file has more physical registers than the
vector file on most uarchs (although fewer bits).

~~~
_chris_
If your processor uses register renaming, the frequency of your core limits
how many physical registers you can have in the register file. Therefore, you
are incentivized to split integer and floating point register files so you can
size up each register file to its limit.

You can also play other games, like internally recoding the FP format, if you
split the register files.

On the other hand, you can remove the need for IntToFP and FPToInt move
instructions if you use one register file.

At the end of the day, though, I suspect the decision to keep the two
register files separate largely comes down to having more addressable
architectural state. Why have only 32 scalar registers when you can have 32
scalar int and 32 scalar fp registers?

~~~
mbitsnbites
I suspect that what you say is true. One thing worth considering though is
that on x86_64 you actually use the vector register file for floating point
(and it can do integer too!).

On the MRISC32 you have a vector register file with 32 registers that can
easily be configured to do scalar operations (integer or floating point) by
setting the vector length to 1.

------
kbob
Great project.

Have you looked at the Convex C-Series architecture? They copied the Cray
vector idea. Starting with the C2 (I think) there was also a Vector Mask (VM)
register which described which vector elements had valid data. Since the
C-Series (and the Cray, I think) processed vectors serially, vector ops using
the VM would only spend time on the valid elements. It was targeted at codes
like this:

    
    
            for (i = 0; i < n; i++) {
                if (A[i] < 34) {
                    D[i] = A[i] * B[i] + C[i];
                }
            }
    

That would translate into code like this. (Actual instructions and mnemonics
lost to the mists of time; this is pseudo-assembler.)

    
    
            ld       vl, #N      ; assume N <= 128 (-:
            ld.v     v0, A       ; load A
        ld       s0, #34
            less.v   v0, s0      ; stores boolean vector into VM
            ld.v.t   v1, B       ; load B where mask == true
            mul.v.t  v3, v1, v0  ; calc A*B under mask, store in v3
            ld.v.t   v2, C       ; load C under mask
            add.v.t  v3, v3, v2  ; calc A*B+C under mask, store in v3
            st.v.t   v3, D
    

I am ignoring the difference between I32, I64, F32 and F64 because I don't
remember how those were coded into the mnemonics, sorry.

There were also instructions to load and store the VM to a scalar register
pair.
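A rough Python model of the masked sequence above, matching the C loop rather than any real Convex encoding (the threshold parameter and names are invented for illustration): the VM is a boolean vector written by the compare, and the serial element loop spends time only on lanes where it is true.

```python
# Sketch of Cray/Convex-style masked vector execution: compare writes the
# Vector Mask (VM), and subsequent .t ("true") ops touch only masked lanes.

def masked_loop(A, B, C, D, n, threshold=34):
    vm = [a < threshold for a in A[:n]]    # less.v: boolean vector into VM
    for i in range(n):                     # elements processed serially
        if vm[i]:                          # time spent only on valid lanes
            D[i] = A[i] * B[i] + C[i]      # ld.v.t / mul.v.t / add.v.t / st.v.t
    return D

print(masked_loop([10, 50, 20], [2, 2, 2], [1, 1, 1], [0, 0, 0], 3))
# [21, 0, 41]
```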

The Mill architecture, if I understood the lectures, only has a masked store
instruction; other vector instructions calculate the values for all elements
and maintain an "invalid result" bitvector. An exception is only triggered if
an invalid result is actually stored.
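That Mill scheme, as described in the lectures (names here are invented; this is a behavioral sketch, not a spec), can be modeled as: compute every lane speculatively, record invalid results in a bitvector, and fault only when an invalid lane would actually be stored.

```python
import math

# Hedged model of Mill-style deferred vector exceptions: all lanes run,
# invalid results are tracked, and only a masked store of an invalid lane
# triggers a fault.

def speculative_div(va, vb):
    results, invalid = [], []
    for a, b in zip(va, vb):
        if b == 0:                      # would have faulted eagerly
            results.append(math.nan)    # placeholder value
            invalid.append(True)        # remember the lane is invalid
        else:
            results.append(a / b)
            invalid.append(False)
    return results, invalid

def masked_store(dest, results, invalid, mask):
    for i, m in enumerate(mask):
        if m:
            if invalid[i]:
                raise FloatingPointError(f"lane {i}: invalid result stored")
            dest[i] = results[i]

dest = [0.0, 0.0]
res, inv = speculative_div([6.0, 3.0], [2.0, 0.0])
masked_store(dest, res, inv, mask=[True, False])  # faulting lane never stored
print(dest)
# [3.0, 0.0]
```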

~~~
childintime
The Mill is far ahead of everybody else, and even vectorizes regular for
loops. A most beautiful architecture; what it needs is an implementation, and
to get out of its bubble. Intel and the Mill are destined for each other, but
they don't seem to know how to join forces.

~~~
gbrown_
> but they don't seem to know how to join forces.

I don't understand what you mean by this. The Mill team seem intent on
producing a chip of their own.

~~~
childintime
Indeed. Intel sees the Mill as insignificant, and itself as the gorilla in
the room. But in terms of the architecture, the Mill is on the order of 10x
better. That's incredible. While Intel tends to rely on strong-arming, it is
the Mill's #1 enemy, and the Mill has to go it alone. But it definitely
shouldn't be so.

------
gbrown_
Fun fact about the Cray-1: originally it didn't obey the commutative law. It
was "fixed" by ensuring the inputs were sorted.

------
lisk1
Excellent project; I hope it will keep developing in the future. We need
projects like these to detach from the grip of certain vendors. Another
interesting project, based on expired SuperH CPU patents, is j-core.org;
they also have a nice and informative video presentation on YouTube.

------
justaaron
This is very cool! Any idea what sort of gate count we are looking at for
synthesizing the VHDL? I'm not familiar with the Intel FPGA you refer to in
the docs for the VHDL implementation:

[https://github.com/mbitsnbites/mrisc32/tree/master/mrisc32-a...](https://github.com/mbitsnbites/mrisc32/tree/master/mrisc32-a1)

~~~
mbitsnbites
I'm still learning FPGAs myself, so I'm not sure if I can give you a relevant
figure. Currently the A1 pipeline (no memory subsystem) synthesises to
2000-3000 ALMs (or about 15% of the logic in the device I'm targeting). That
will increase as I implement more instructions (the FPU in particular), but on
the other hand it will probably decrease when I switch to using BRAM for the
scalar register file.

~~~
mbitsnbites
Update: I just moved the scalar register file to block RAM, and now the
pipeline uses ~1800 ALMs, or 9% of the chip logic.

