
SIMD Instructions Considered Harmful (2017) - nuclx
https://www.sigarch.org/simd-instructions-considered-harmful/
======
glangdale
This argument is less effective given that SIMD is not always a
straightforward substitute for vector processing. Sometimes we _want_ 128, 256
or 512 bits of processing as a unit and will follow it up with something
different, not a repeated instance of that same process.

We had numerous different examples of this in the Hyperscan project and I
broke out something similar on my blog:
[https://branchfree.org/2018/05/30/smh-the-swiss-army-
chainsa...](https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsaw-of-
shuffle-based-matching-sequences/)
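
To make the flavor of that concrete, here is a minimal sketch in the spirit of
the shuffle-based matching described in that post (hypothetical C with SSE
intrinsics, not Hyperscan's actual code; needs SSSE3): classify 16 input bytes
against a small character class using two nibble lookup tables and a PSHUFB
each.

    #include <immintrin.h>
    #include <stdint.h>

    /* Which of 16 input bytes are ASCII digits, returned as a 16-bit mask.
     * A byte b is accepted when lo_tbl[b & 0xf] AND hi_tbl[b >> 4] != 0;
     * for '0'..'9' the high nibble is 0x3 and the low nibble is 0x0..0x9. */
    static uint32_t digit_mask16(const uint8_t *in) {
        const __m128i lo_tbl = _mm_setr_epi8(1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0);
        const __m128i hi_tbl = _mm_setr_epi8(0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0);
        __m128i bytes = _mm_loadu_si128((const __m128i *)in);
        __m128i lo = _mm_and_si128(bytes, _mm_set1_epi8(0x0f));
        __m128i hi = _mm_and_si128(_mm_srli_epi16(bytes, 4), _mm_set1_epi8(0x0f));
        __m128i hit = _mm_and_si128(_mm_shuffle_epi8(lo_tbl, lo),
                                    _mm_shuffle_epi8(hi_tbl, hi));
        /* movemask of "hit == 0" marks the non-members; flip to get members. */
        return 0xffffu ^ (uint32_t)_mm_movemask_epi8(
                   _mm_cmpeq_epi8(hit, _mm_setzero_si128()));
    }

A single 128-bit shuffle acts as sixteen parallel 4-bit table lookups here,
which is exactly the kind of 'mixed' use a DAXPY-style long-vector machine
isn't aimed at.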

We also used SIMD quite extensively as a 'wider GPR' - not doing stuff over
tons of input characters but instead using the superior size of SIMD registers
to implement things like bitwise string and NFA matchers.

A SIMD instruction can be a reasonable proxy for a wide vector processor but
the reverse is not true - a specialized vector architecture is unlikely to be
very helpful for this kind of 'mixed' SIMD processing. Almost any "argument
from DAXPY" fails for the much richer uses of SIMD processing among active
practitioners using modern SIMD.
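
For contrast, the canonical loop behind the "argument from DAXPY" is just this
(plain C sketch); every iteration is independent, which is exactly the shape a
long-vector machine is built for:

    /* DAXPY: y = a*x + y over double-precision arrays. */
    void daxpy(long n, double a, const double *x, double *y) {
        for (long i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }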

~~~
creato
I agree. It seems like this strategy makes only the most braindead
applications of SIMD better (simple loops that can be vectorized by an
arbitrary factor), but doesn't really do anything to help the meatier SIMD
workloads. Most SIMD code isn't as simple as this, and the code that is simple
usually isn't a significant factor in either developer experience or runtime.

~~~
glangdale
Seriously. A lot of these proposals go veering off into second-order
considerations ("Easier to decode!" "A few picojoules less energy!") and I'd be
very surprised if the bottlenecks come down to SIMD-vs-vector-architecture ISA
issues rather than, say, memory bandwidth or multiply-add bandwidth.

~~~
CoolGuySteve
A few years ago I tried to buy a liquid cooled overclocked server for trading.
Enabling AVX cost extra due to the concentrated heat output from the MMU and
each core's vector unit.

It was along the lines of being able to get a server that was tested stable at
5 GHz without AVX vs 4.5 GHz with AVX for the same price.

So at least on Intel, these vector units are apparently limiting clock speeds
and yields due to power consumption.

~~~
kllrnohj
Yes, but not due to instruction decode costs, which is really all this article
is talking about.

The real heat comes from actually doing the work, not decoding what work to
do.

------
ajayjain
In some recent work from my group [1], we reduce the complexity of keeping up
with new SIMD ISAs by retargeting code between generations. For example, a
compiler pass can take code written to target SSE2 (with intrinsics) and emit
AVX-512 - it auto-vectorizes hand-vectorized code. With a more capable
compiler, if the ISA grows in complexity, programmers and users of libraries
get speedups without rewriting their code or relying on scalar auto-
vectorization. However, the x86 ISA growth certainly pushed some complexity on
us as compiler writers - we had to write a pass to retarget instructions!

[1] [https://www.nextgenvec.org/#revec](https://www.nextgenvec.org/#revec)
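
A toy illustration of the kind of rewrite such a pass performs (hypothetical
code, not Revec's actual output; tail handling omitted in both versions):

    #include <immintrin.h>

    /* Hand-vectorized against SSE2: 4 floats per iteration. */
    void add_sse2(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }

    /* What a retargeting pass could emit for AVX-512: 16 floats per iteration. */
    void add_avx512(float *dst, const float *a, const float *b, int n) {
        for (int i = 0; i + 16 <= n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(dst + i, _mm512_add_ps(va, vb));
        }
    }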

~~~
jabl
Recently a patch was contributed to GCC that converts MMX intrinsics to SSE.
Also the GCC POWER target supports x86 vector intrinsics, converting them to
the POWER equivalents.

It's not as ambitious as your approach, though; it's more like a 1:1
translation and thus cannot take advantage of wider vectors.

~~~
glangdale
That patch primarily is there to avoid the pitfalls of MMX on modern
architectures; it is gradually becoming deprecated. On SKX, operations that
are available on both ports 0 and 1 for SSE or AVX are only available on port
0 for MMX. So code that uses MMX is getting half the throughput (which may or
may not matter, but still).

~~~
jabl
Thanks for the explanation, I wasn't aware of the reasoning behind it. I would
guess by now all actively maintained performance-critical code has been
rewritten in something more modern, so it certainly makes sense for Intel to
minimize the number of gates they dedicate to MMX.

------
chx
There is probably a lot of merit in the advantages of vectors, but it weakens
the article to set them up against SIMD when the presented facts are dubious
at best:

> An architect partitions the existing 64-bit registers

> The IA-32 instruction set has grown from 80 to around 1400 instructions
> since 1978, largely fueled by SIMD.

Wait, what. IA-32 started in 1985, not 1978. It didn't have any existing 64-bit
registers. It was called IA-32 because of the 32-bit registers, like EAX and
EBX. And then looking at the 1986 reference manual
[https://css.csail.mit.edu/6.858/2014/readings/i386.pdf](https://css.csail.mit.edu/6.858/2014/readings/i386.pdf)
I count 96 instructions under 17.2.2.11. The IA-32 instruction set didn't grow
much over the years; IA-64 did, to the best of my knowledge, but please let me
know if I am wrong here. As for IA-64, I looked at
[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-
software-developer-instruction-set-reference-manual-325383.pdf) and it's hard
to get an accurate count because some instructions are grouped together: it's
either 627 or 996 (and I may have made a counting mistake given I started from
a PDF, but it should be close). That is indeed very high, but even our best
attempt only finds a tenfold growth (and perhaps only 6.5x) instead of the
17.5x the article suggested.

~~~
CalChris
Small nit. IA-64 refers to Itanium. I think you meant Intel 64.

[https://en.wikipedia.org/wiki/IA-64](https://en.wikipedia.org/wiki/IA-64)

~~~
chx
You are correct.

------
NL807
Roll my eyes every time I see a "Considered Harmful" headline.

As for SIMD, it's a huge benefit when used in the right context. I applied it
to image processing and video compression algorithms in the past, with
significant performance gains.

~~~
tom_mellior
> Roll my eyes every time I see a "Considered Harmful" headline.

Me too.

> As for SIMD, it's a huge benefit when used in the right context.

Sure, but as the article points out, that benefit could be even huger when
used on a "proper" vector architecture with veeeery wide vector registers that
do not also double as not-very-wide scalar registers. "The SIMD instructions
execute 10 to 20 times more instructions than RV32V because each SIMD loop
does only 2 or 4 elements instead of 64 in the vector case."

I think adding SIMD instructions to x86 was a good trade-off at the time, but
I also think the authors are correct that new ISAs designed now are better off
with a vector architecture like they propose. In the end it's apples vs.
oranges because the two contexts are not comparable.

~~~
dkersten
I feel like even though a full vector architecture might perform a lot better,
the use cases may be much narrower than SIMD's, especially on typical desktop
or server workloads (web applications etc., not scientific computing, deep
learning or image processing -- many of which are already vectorised on GPUs). As
others have mentioned, SIMD allows you to do a little bit of vectorisation in
an otherwise non-vector workload, or use it for the wider registers or
whatever. I don't know enough about it personally to be able to judge either
way, though. I just know that I've attempted to vectorise some hobby game code
for fun a few times and typically found it much harder to achieve than it
first seemed, even though the data _seemed_ trivially vectorisable at first.
Perhaps that's just lack of experience.

------
DarkWiiPlayer
So, if I understand it correctly, the text argues in favor of the GPU approach
of pipelining independent vector operations instead of the current SIMD
approach.

I see how this could be beneficial, especially when writing code, as it's way
closer to just a normal loop.

Then again, why not combine both ideas and pipeline chunks of SIMD type? Say
we have 4 execution stages and 32-bit SIMD types (unrealistic, I know) and want
to process 8-bit numbers. Wouldn't we be able to process 16 of them at the
same time? Actually, isn't that kind of what GPUs already do?

I'm sure smarter people than I have reasoned about this, maybe someone can
link a good article. I only know of this one [1] and one about GPGPU that I
just can't find any more (but which was also very interesting)

[1]
[http://www.lighterra.com/papers/modernmicroprocessors/](http://www.lighterra.com/papers/modernmicroprocessors/)

~~~
arundemeure
By coincidence I started a new blog a few days ago and my first article is
about the SIMD Instructions Considered Harmful post from a power efficiency
perspective... maybe I should post it separately on HN? :)

[https://massivebottleneck.com/2019/02/17/vector-vs-simd-
dyna...](https://massivebottleneck.com/2019/02/17/vector-vs-simd-dynamic-
power-efficiency/)

I think I'm kinda explaining how it's similar (and different) to what modern
GPUs do but I'm not sure I understand what you mean by "wouldn't we be able to
process 16 of them at the same time" - do you mean a throughput of 16/clock,
or just that 16 are "in flight" through the pipeline with a throughput of
4/clock?

I'm not sure I'm clear enough about it for those without a GPU HW background.
If it's not clear I'm happy to write down a more detailed explanation here!

------
etaioinshrdlu
Maybe CPU architectures should just have data-parallel loop support of
arbitrary width. The CPU can implement it in microcode however it feels like,
or perhaps a kernel can trap it and send it off to a GPU transparently.

Strikes me as much cleaner design-wise than stuff like CUDA or OpenCL or the
SIMD of today.

~~~
vnorilo
This sounds quite optimistic. How would microcode deal with allocating
registers, or nested data parallelism? You are describing transformations that
usually happen fairly early in compiler optimization pipelines, and pushing
that down to microcode would bring huge complexity.

~~~
magicalhippo
IIRC the Mill CPU handles this by performing a translation at install time.

For Mill CPU variants with wide vector units the CPU could execute certain
instructions in one go, while for variants with narrow units it might have to
issue multiple instructions.

Their idea is to handle this by basically doing ahead of time compilation of a
generic program image, turning it into a specialized version for the installed
CPU.

Sounds neat, proof is in the pudding.

~~~
wolfgke
This sounds like the claims that Intel made for the Itanium and its EPIC
instruction set when Itanium did not yet exist. The rest is history.

> Their idea is to handle this by basically doing ahead of time compilation of
> a generic program image, turning it into a specialized version for the
> installed CPU.

To quote
[https://en.wikipedia.org/w/index.php?title=Itanium&oldid=884...](https://en.wikipedia.org/w/index.php?title=Itanium&oldid=884142590):

"EPIC implements a form of very long instruction word (VLIW) architecture, in
which a single instruction word contains multiple instructions. With EPIC, the
compiler determines in advance which instructions can be executed at the same
time, so the microprocessor simply executes the instructions and does not need
elaborate mechanisms to determine which instructions to execute in parallel."

The problem with all the approaches that depend on AOT compilation is that no
such "magic" compiler exists. And no, machine learning or AI is not the
solution. ;-)

~~~
magicalhippo
As I understood it, and as far as I can remember, the Mill AOT compiler has an
easier job than that. The generic image already contains the parallelized
instructions; the AOT just has to split those that are too wide for the given
CPU.

Been a while since I saw the AOT talk tho. And as mentioned, so far it's all
talk anyway.

~~~
wolfgke
> As I understood it, and as far as I can remember, the Mill AOT compiler has
> an easier job than that. The generic image already contains the parallelized
> instructions; the AOT just has to split those that are too wide for the given
> CPU.

In my opinion, this just moves the problem to a meta level. For the EPIC
instructions of Itanium, one could encode multiple (parallel) instructions
into one VLIW instruction. It was a huge problem to parallelize existing, say,
C or C++ code so that this capability could be used. The fact that such a
"smart compiler" turned out so hard to write was one of the things that broke
Itanium's neck.

I honestly have no idea by what magic a "sufficiently smart compiler" that can
create such a "generic image [that] already contains the parallelized
instructions" suddenly appears. How is it possible that compilers can suddenly
parallelize the program, when that turned out to be nigh impossible for the
Itanium?!

~~~
magicalhippo
It's been too long since I watched the videos, so unfortunately I don't
remember the specifics. For reference, here's[1] the relevant one on the
compiler aspect.

I do seem to recall that they had studied the failures of Itanium, and
supposedly designed their architecture not to fall into the same pitfalls as
EPIC/Itanium.

One aspect I recall is that while they have VLIW, different operations within
the (compound) instruction are issued in such a way that lets them be
interdependent. Like, a single VLIW instruction could have an add and a
multiply, where the result of the add is used as input for the multiplication.
So while the operations are grouped in a single instruction, they're not
executed strictly in parallel. There's a lot of other aspects too, that's just
the one I remember.

But yeah, really curious to know how that pudding will turn out.

[1]:
[https://www.youtube.com/watch?v=D7GDTZ45TRw](https://www.youtube.com/watch?v=D7GDTZ45TRw)

------
yxhuvud
The discussion in the comments beneath the article was more interesting than
the article itself.

------
phkahler
And yet the current RISC-V approach is not as good as MXP:

[https://www.youtube.com/watch?v=gFrMcRqNH90](https://www.youtube.com/watch?v=gFrMcRqNH90)

It's an entirely different approach than what the RISC-V folks are pushing
for. It's great that this guy is working with them on the vector instructions,
but I'm afraid it's too soon to claim a "right" way to go.

It's also not fair to compare instructions executed between SIMD and some huge
vector register implementation. Most common RISC-V CPUs are likely to have
smaller vector registers, from 256 to 512 bits wide.

~~~
jabl
> And yet the current RISC-V approach is not as good as MXP

I watched that presentation a while ago, and while the figures shown look
nice, I suspect the crux is whether MXP is practically implementable. I'm not
at all an expert on this topic, so take this with a large grain of salt.
Anyway:

1) With MXP instead of a vector register file you have a scratchpad memory,
i.e. a chunk of byte-addressable memory in the CPU. Now, if you want multiple
vector ALUs (lanes), that scratchpad then needs to be multi-ported, which
quickly starts to eat up a lot of area and power. In contrast, a vector
regfile can be split into single-ported per-lane chunks, saving power and
area.

2) MXP seems to be dependent on these shuffling engines to align the data and
feed it to the ALUs. What's the overhead of those? It seems far from trivial.

As for other potential models, I have to admit I'm not entirely convinced by
their dismissal of the SIMT style model. Sure, it needs a bit more micro-
architectural state, a program counter per vector lane, basically. But there's
also a certain simplicity in the model, no need for separate vector
instructions, for the really basic stuff you need only the fork/join type
instructions to switch from scalar execution to SIMT and back. And there's no
denying that SIMT has been an extremely successful programming model in the
GPU world.

> It's also not fair to compare instructions executed between SIMD and some
> huge vector register implementation. Most common RISC-V CPUs are likely to
> have smaller vector registers, from 256 to 512 bits wide.

True; the more interesting part is the overhead. Does your ISA require
vector load/stores to be aligned on a vector-size boundary? Well, then when
vectorizing a loop you need a scalar pre-loop to handle the first elements
until you hit the right alignment and can use the vectorized stuff. Similarly,
how do you handle the tail of the loop if the number of elements is not a
multiple of the vector length? If you don't have a vector length register or
such, you need a separate tail loop. Or is the data in memory even contiguous?
If not, without scatter-gather and strided load/store you have to choose
between not vectorizing and packing the data.
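
A rough sketch of that difference in plain C (hypothetical; the inner loops
stand in for single SIMD or vector instructions):

    /* Fixed-width SIMD style: a 4-wide main loop plus a scalar tail loop
     * (an alignment pre-loop would look much the same). */
    void scale_simd(float *x, float s, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4)
            for (int j = 0; j < 4; ++j)   /* stands in for one 4-wide SIMD op */
                x[i + j] *= s;
        for (; i < n; ++i)                /* leftover 0..3 elements */
            x[i] *= s;
    }

    /* Vector-length-register style strip mining: the last trip simply runs
     * with a shorter length, so there is no separate tail loop. */
    void scale_vector(float *x, float s, int n) {
        for (int i = 0; i < n; ) {
            int vl = (n - i < 64) ? (n - i) : 64;  /* stands in for setvl */
            for (int j = 0; j < vl; ++j)           /* one vector op of length vl */
                x[i + j] *= s;
            i += vl;
        }
    }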

That bloats the code and is one of the reasons why autovectorizing for SIMD
ISAs is difficult for compilers: often the compiler doesn't know how many
iterations a loop will execute, and due to the above a large number of
iterations is necessary to amortize the overhead. With a "proper" vector ISA
the overhead is very small and it's profitable to vectorize every loop the
compiler is able to.

------
Symmetry
Comparing dynamic instructions between a SIMD architecture with a 32-byte
vector width and a vector architecture with 8*64 = 512-byte vectors is
laughably misleading. Of course you can use fewer instructions if you're
willing to throw hugely more transistors at the problem and carry around so
much more architectural state.

There are reasons to prefer SIMD or vector machines or, put another way,
packed or unpacked vectors. But this is a very one-sided presentation. Also,
some SIMD ISAs like Arm's SVE can handle different widths pretty nicely.

~~~
jabl
I'd guess in the classification of the authors, SVE would qualify as a "real"
vector ISA. SVE resembles the RISC-V vector extension quite a lot.

------
hsivonen
Prediction: In order to be performance-competitive with other ISAs for
software written for SIMD, RISC-V will get a SIMD extension. However, because
it wasn't there from the start, Linux distros will not compile their packages
with SIMD enabled and the result will be sad, like NEON on 32-bit ARM.

------
tomxor
> While a simple vector processor might execute one vector element at a time,
> element operations are independent by definition, and so a processor could
> theoretically compute all of them simultaneously. The widest data for RISC-V
> is 64 bits, and today’s vector processors typically execute two, four, or
> eight 64-bit elements per clock cycle.

So does this argument boil down to an inversion of control which in turn
removes unnecessary instructions? It certainly sounds more elegant to my naive
ISA understanding.

Can I ask someone with hands-on SIMD experience: does relinquishing control
over exactly what and how many "vector" operations occur in a single clock
make any real-world difference?

------
CalChris
Why doesn't this comparison include ARMv8 and ARMv8 NEON? ARMv8 NEON does
support double precision and that can help DAXPY. I believe this has been the
case since 2011 when AArch64 was announced (well, at least 2015).

[http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc....](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/CEGDJGGC.html)

~~~
wmu
Well, there is also no AVX-512 mentioned.

------
zackmorris
I've been saying this for nearly 20 years. My first experience with it was
Altivec on PowerPC:

[https://en.wikipedia.org/wiki/AltiVec](https://en.wikipedia.org/wiki/AltiVec)

I have a computer engineering degree from UIUC and my very first thought upon
seeing MMX/SSE/Altivec/etc was "why didn't they make this arbitrarily
scalable?" I was excited to be able to perform multiple computations at once,
but the implementation seemed really bizarre to me (hardcoding various
arithmetic operations in 128 bits or whatnot).

If it had been up to me, I would have probably added an instruction that
executed a certain number of other instructions as a single block and let the
runtime or CPU/cluster divvy them up to its cores/registers in microcode
internally.

It turns out that something like this is conceptually what happens in vector
languages like MATLAB (and Octave, Scilab, etc), which I first used around
2005. Its implementation is not terribly optimized, but in practice it
doesn't need to be, because all personal computers since the mid 1990s are
limited by memory bandwidth, not processing power.

For what it's worth, we're seeing similar ideas in things like graphics
shaders, where the user writes a loop that appears to be serial and
synchronous, but is parallelized by the runtime. I'm saddened that they had to
evolve via graphics cards with their unfortunate memory segmentation (inspired
by DOS?) but IMHO the future of programming will look like general-purpose
shaders that abstract away caching so that eventually CPU memory, GPU memory,
mass storage, networks, even the whole internet looks like a software-
transactional memory or content-addressable memory.

We'll also ditch frictional abstractions like asynchronous promises in favor
of something like the Actor model from Erlang/Go or a data graph that is
lazily evaluated as each dependency is satisfied so it can be treated as a
single synchronous serial computation. I've never found a satisfactory name
for that last abstraction, so if someone knows what it's called, please let us
know thanks!

P.S. the point of all this is to provide an efficient transform between
functional programming and imperative programming so we can begin dealing in
abstractions and stop prematurely optimizing our programs (which limits them
to running on specific operating systems or hardware configurations).

------
qwerty456127
> IA-32 instruction set has grown from 80 to around 1400 instructions since
> 1978, largely fueled by SIMD.

Holy quack! I didn't even know there were 80 (that feels like too much
already; I barely used a tiny portion when exercising in assembly), and 1400
sounds really insane.

~~~
londons_explore
We're past the time that a human needs to understand assembly instructions.

In the future, instructions will be designed by machine, for example by
considering millions of permutations of possible instructions: "combined add
with shift with multiply by 8 and set the 6th bit", "double indirect program
counter jump with offset 63", etc.

Each permutation will be added to various compilers and simulated by running
benchmarks on complete architecture simulators to find out which new
instruction adds the most to the power / die area / performance / code size
tradeoff.

I predict there will be many more future instructions with 'fuzzy' effects
which don't affect correctness, only performance. Eg. 'Set the branch
predictor state to favor the next branch for the number of times in register
EAX', or 'go start executing the following code speculatively, because you'll
jump to it in a few hundred clock cycles and it would be handy to have it all
pre-executed'.

~~~
cameronh90
"We're past the time that a human needs to understand assembly instructions."

Until you're debugging broken compiler/JIT output, which I've had to do
multiple times in the last year while using .NET Core.

~~~
copperx
And how did you fix it? Did you patch the compiler?

------
hohohmm
Isn't the whole point of SIMD to be as similar to the original x86
instructions as possible, reusing as much of the existing CPU as possible?
Otherwise you would end up with something like the PS3?

~~~
petermcneeley
It was primarily the memory architecture that made the PS3 unique.

~~~
CoolGuySteve
If you care about latency, a modern 8-or-more-core x86 with its L1/L2 cache
segmentation and penalized-but-shared L3 cache is almost as complex. It
becomes even more complex if you use the CPU topology to make inferences about
hyperthread-shared caches, or need to deal with the shared FPU on older AMD
processors.

My understanding is that the largest difference is that some of the Cell cores
had different opcodes, which meant you could schedule some threads on some
cores but not any thread on any core.

~~~
petermcneeley
I have written quite a bit of SPE code. The primary issue is that the SPE
processor could only read/write its 256 kB of local memory (without doing a
DMA). So object-oriented code literally doesn't even work (because of
vtables). The C/C++ model is not designed for this type of architecture. Yes,
there were also limitations like vector-only registers and memory alignment,
but the biggest issue was the local memory.

~~~
justrobert
Yep, on the SPU you end up spending so much time managing memory.

No cache; they are just dumb processors. I find it funny they thought they
could take the PS2's VU0/VU1 and make it a processor.

------
auggierose
I feel if you have a strong need for vectors then you should consider running
(part of) your code on the GPU.

~~~
_chris_
Too far away and requires huge thread counts to make up for its overheads. A
good vector unit should work well even for very short vectors (e.g., any size
of memcpy).

------
etjossem
Considered Harmful Clickbait Considered Harmful (2019)

