
Arm Helium: New vector extension for the M-Profile Architecture - flagada
https://community.arm.com/processors/b/blog/posts/arm-helium-the-new-vector-extension-for-arm-m-profile-architecture
======
jedharris
The blog post [https://community.arm.com/arm-research/b/articles/posts/maki...](https://community.arm.com/arm-research/b/articles/posts/making-helium-why-not-just-add-neon) has more useful
content.

~~~
p1mrx
So, Helium is similar to Neon, but optimized for simpler CPUs with fewer
gates.

------
snvzz
Meanwhile, on the royalty-free side, RISC-V's work on the V extension continues:
[https://www.embecosm.com/2018/09/09/supporting-the-risc-v-ve...](https://www.embecosm.com/2018/09/09/supporting-the-risc-v-vector-extension-in-gcc-and-llvm/)

~~~
tyingq
I wonder what level of concern RISC-V raises internally at ARM. Are they
panicked, or just adjusting strategy in a more routine way?

It seems the moves from Western Digital would be concerning to them.

Though I suppose on the other side, there aren't yet any credible/volume RISC-V
general-purpose CPUs (just MCUs), so maybe they aren't scrambling yet.

~~~
monocasa
Given their public reactions, I imagine they see it as an existential threat.

[https://www.theregister.co.uk/2018/07/10/arm_riscv_website/](https://www.theregister.co.uk/2018/07/10/arm_riscv_website/)

~~~
vardump
ARM has done an incredible job legitimizing RISC-V.

------
brucehoult
I downloaded the spec and took a look.

TLDR: it's pretty much a traditional short-register SIMD, but with the
addition of predication, including handling the tail of arbitrary-length loops
in the main vector loop body (as in RISC-V and Cray), rather than in an extra
scalar loop as previously needed.
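
To illustrate the tail handling, here is a rough Python model of a tail-predicated loop. It is only a sketch: the names (`LANES`, `tail_mask`, `load`, `store`, `vector_add`) are made up for illustration and are not MVE mnemonics or intrinsics, and it assumes 32-bit elements, so four lanes per 128-bit register.

```python
LANES = 4  # assumption: four 32-bit lanes in one 128-bit Q register


def tail_mask(remaining):
    """One flag per lane: True while there are still elements left."""
    return [lane < remaining for lane in range(LANES)]


def load(src, base, mask):
    # Predicated load: inactive lanes touch no memory and read back as 0.
    return [src[base + i] if mask[i] else 0 for i in range(LANES)]


def store(dst, base, vec, mask):
    # Predicated store: inactive lanes leave memory untouched.
    for i in range(LANES):
        if mask[i]:
            dst[base + i] = vec[i]


def vector_add(a, b):
    """Add two equal-length lists of 32-bit values, LANES at a time."""
    n = len(a)
    out = [0] * n
    for base in range(0, n, LANES):
        # On the last iteration fewer than LANES elements may remain;
        # the mask switches off the surplus lanes instead of a scalar loop.
        mask = tail_mask(n - base)
        va = load(a, base, mask)
        vb = load(b, base, mask)
        vsum = [(x + y) & 0xFFFFFFFF for x, y in zip(va, vb)]
        store(out, base, vsum, mask)
    return out
```

With six elements the loop runs twice: once with all four lanes active and once with only two, and no separate scalar cleanup loop is needed.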

\- provides 8 "Q" vector registers, always exactly 128 bits each

\- overlays the FP register file (32 "S" registers of 32 bits each / 16 "D"
registers of 64 bits each)

\- MVE-I (8/16/32 bit integer) and MVE-F (16 and 32 bit FP) subsets.

\- architecturally defined to execute each vector instruction in 4 beats

\- an implementation can execute 1, 2, or 4 beats per "architecture tick", and
the number can vary during execution. An "architecture tick" might or might
not be 1 clock cycle.

\- two forms of predication, each with its own mask: "loop tail predication",
which is like RISC-V/Cray "vl" (but described as a mask), and "VPT
predication" for data-dependent conditions. The two masks are ANDed together.

\- A VPT block is defined as the n instructions following a VPT or VPST
instruction, where n <= 4

\- each instruction in a VPT block can be predicated on the condition or on the
inverse of the condition, similar to the existing If/Then/Else predicated
execution. e.g. VPT, VPTT, VPTE, VPTTE, VPTEE, VPTEEE variants.

\- "VPT can be considered as the vectorized combination of CMP and IT"

\- predication is per-byte regardless of the element size.

\- loads set predicated-off bytes to 0, other instructions leave them
untouched

\- VLD2/VLD4 and VST2/VST4 are provided for interleaving/deinterleaving. Each
instruction always loads/stores exactly 128 bits to/from 2 or 4 consecutive Q
registers.

\- there is also scatter/gather

\- there are some fancy operations. e.g. VCADD: Vector Complex Add with
Rotate. This instruction performs a complex addition of the first operand with
the second operand rotated in the complex plane by the specified amount,
either 90 or 270 degrees. Also VCMLA: Vector Complex Multiply Accumulate.
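
Going back to the predication bullets, here is a small Python model of how I read the two masks as combining. Treat it as a sketch of the behaviour described above, not as the spec's pseudocode: all function names are hypothetical, and a Q register is modelled as a plain list of 16 bytes with one mask flag per byte.

```python
Q_BYTES = 16  # a 128-bit Q register modelled as 16 bytes


def lane_mask_to_bytes(lane_bits, elem_bytes):
    """Expand one flag per element into one flag per byte."""
    return [bit for bit in lane_bits for _ in range(elem_bytes)]


def effective_mask(tail, vpt, invert_vpt=False):
    """AND the loop-tail mask with the VPT mask.

    invert_vpt=True models an 'E' (else) slot, which uses the inverse
    of the VPT condition; the tail mask is never inverted.
    """
    return [t and (v != invert_vpt) for t, v in zip(tail, vpt)]


def write_result(old, new, mask):
    """Non-load instructions: masked-off bytes keep their old value."""
    return [n if m else o for o, n, m in zip(old, new, mask)]


def load_result(loaded, mask):
    """Loads: masked-off bytes are set to 0."""
    return [b if m else 0 for b, m in zip(loaded, mask)]


# Example: 32-bit elements, three lanes left in the loop tail,
# and a data-dependent condition that is true for lanes 0 and 2.
tail = lane_mask_to_bytes([True, True, True, False], 4)
vpt = lane_mask_to_bytes([True, False, True, False], 4)
then_mask = effective_mask(tail, vpt)                   # lanes 0 and 2
else_mask = effective_mask(tail, vpt, invert_vpt=True)  # lane 1 only
```

Note that lane 3 is off in both the "then" and the "else" mask, because the tail mask disables it regardless of the VPT condition.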

------
ChuckMcM
Fun times, can't wait to have these in the Cortex M line. Of course the chips
start to look more and more like the Pentium line of old.

------
bfrog
The current DSP instructions are pretty limited; I'm interested in seeing these!

