We had numerous different examples of this in the Hyperscan project and I broke out something similar on my blog: https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsa...
We also used SIMD quite extensively as a 'wider GPR' - not doing stuff over tons of input characters but instead using the superior size of SIMD registers to implement things like bitwise string and NFA matchers.
A SIMD instruction can be a reasonable proxy for a wide vector processor but the reverse is not true - a specialized vector architecture is unlikely to be very helpful for this kind of 'mixed' SIMD processing. Almost any "argument from DAXPY" fails for the much richer uses of SIMD processing among active practitioners using modern SIMD.
For a while, it seemed like a set of instructions based on a rank-2 CLOS network (aka butterfly network) would get the job done. It scales at O(log(N)) in gate delay+ and O(N * log(N)) in area, and is very capable. Fanout is O(1). You can do all kinds of inter- and intra-lane swaps, rotations, and shifts with it. You can even do things like expand packed 16-bit RGB into 32-bit uniform RGBA.
But things like that SMH algorithm are definitely out of scope: each input bit can appear in at most one output location with the butterfly. So the cost to replicate a prefix scales at O(repetitions), which is unfortunate. Some algorithms built on the general shuffle also rely on PSHUFB's ability to act as a complete lookup table, which the butterfly network can't do either.
My conclusion was that you're basically stuck with a general permute instruction for a modern SIMD ISA, scaling be damned.
+ The latency scaling is somewhat misleading thanks to wire delay - they are both O(N) in wire delay.
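For readers who haven't seen one: a butterfly network is log2(N) stages of conditional pairwise exchanges at power-of-two distances. Here's a toy software model (a hypothetical 8-lane unit, not any real ISA) that shows the structure, and why each input can land in at most one output lane:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Toy 8-lane butterfly: log2(8) = 3 stages of conditional exchanges at
// distances 4, 2, 1. ctrl[stage][pair] decides whether that aligned pair
// swaps. Every stage is a permutation of the lanes, so each input value
// ends up in exactly one output lane -- replication is impossible.
std::vector<uint32_t> butterfly8(std::vector<uint32_t> v,
                                 const std::vector<std::vector<int>>& ctrl) {
    std::size_t stage = 0;
    for (std::size_t dist = 4; dist >= 1; dist /= 2, ++stage) {
        std::size_t pair = 0;
        for (std::size_t base = 0; base < 8; base += 2 * dist)
            for (std::size_t i = base; i < base + dist; ++i, ++pair)
                if (ctrl[stage][pair]) std::swap(v[i], v[i + dist]);
    }
    return v;
}
```

Setting every control bit swaps at every stage and reverses the vector; clearing them all gives the identity. Rotations and shifts are other routable patterns, but nothing in the structure lets one input fan out to two outputs.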
However the RISC-V Vector Extensions basically let you use the processor in "SIMD mode" by setting the vector length to a small value. It will depend on the processor architecture whether a vector length of e.g. 4 is efficient or not, but I expect for many implementations it will be relatively efficient (and you can definitely just use it as a "wider GPR" if you want to).
The only catch at the ISA level is it costs an instruction to change the vector size. So if you keep switching between e.g. 128-bit and 512-bit instructions at a very fine granularity, that might add overhead... I'm not sure that's a very common case, though?
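The usual RVV idiom is strip-mining: each trip through the loop you request up to the machine's maximum vector length and advance by however many lanes you were granted. A scalar C++ sketch of the pattern (VLMAX here is an arbitrary stand-in for whatever the implementation provides; the inner loop stands for one vector instruction):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Strip-mining in the style of RVV's vsetvli: each trip, the "hardware"
// grants up to VLMAX lanes and the loop advances by however many it got.
// The same code runs unchanged whatever VLMAX the implementation has,
// and the tail needs no scalar epilogue. VLMAX = 4 is an arbitrary pick.
constexpr std::size_t VLMAX = 4;

std::vector<float> axpy(float a, const std::vector<float>& x,
                        std::vector<float> y) {
    const std::size_t n = x.size();
    for (std::size_t i = 0; i < n;) {
        const std::size_t vl = std::min(VLMAX, n - i);  // the "vsetvli" step
        for (std::size_t lane = 0; lane < vl; ++lane)   // one vector op's worth
            y[i + lane] += a * x[i + lane];
        i += vl;
    }
    return y;
}
```

Note how a 5-element input is handled as one full strip plus a 1-lane tail, with no separate cleanup loop.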
It was along the lines of being able to get a server that was tested stable at 5GHz without AVX vs 4.5 GHz with AVX for the same price.
So at least on Intel, these vector units are apparently limiting clock speeds and yields due to power consumption.
The real heat comes from actually doing the work, not decoding what work to do.
With a SIMD architecture, as registers get ever wider you need to recompile your code, and binary releases require multiple code paths.
With a vector architecture, the hardware designers can just increase the machine's internal vector size and existing code will benefit immediately.
Furthermore, it seems reasonable to count GPU users among "active practitioners of modern SIMD". All code that is written in a GPU style -- which admittedly isn't too common yet on CPUs, but has certainly been feasible since ispc came along -- immediately becomes a beneficiary of a well-designed vector instruction architecture.
Then shipping this binary for mobile phones, for instance, would take advantage of the wider registers available on high-end devices while working well on more modest CPUs.
Advanced compute shaders very often need to synchronize local data within a warp. When you program such shaders you have to be aware of the SIMD width (32 on NVidia and Intel, 64 on AMD), and design both algorithms and data structures accordingly. Failing to do so often has significant performance costs; I saw up to a 10x speedup after implementing a cooperative algorithm instead of a straightforward one.
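For the curious, the classic warp-cooperative pattern is a shuffle-based tree reduction. A scalar model of it, with plain array indexing standing in for CUDA's __shfl_down_sync and a warp width of 32 assumed:

```cpp
#include <array>
#include <cstddef>

// Scalar model of the usual warp-cooperative sum: at each step, lane i
// adds the value lane i+offset held before the step, the way CUDA's
// __shfl_down_sync is typically used. WARP = 32 mirrors NVidia/Intel
// hardware; on AMD it would be 64.
constexpr std::size_t WARP = 32;

float warp_reduce_sum(std::array<float, WARP> lanes) {
    for (std::size_t offset = WARP / 2; offset > 0; offset /= 2) {
        const std::array<float, WARP> before = lanes;  // the "shuffle" snapshot
        for (std::size_t i = 0; i + offset < WARP; ++i)
            lanes[i] += before[i + offset];
    }
    return lanes[0];  // lane 0 ends up holding the sum of all lanes
}

// Convenience: lanes holding 0, 1, ..., 31.
std::array<float, WARP> iota_lanes() {
    std::array<float, WARP> a{};
    for (std::size_t i = 0; i < WARP; ++i) a[i] = static_cast<float>(i);
    return a;
}
```

On real hardware each "step" is one warp-wide instruction, so the whole 32-lane sum takes log2(32) = 5 shuffles; a straightforward version would loop 32 times through shared memory instead.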
What I really want is lots of different types of vector & permute ops, and the ability to reconfigure the SIMD unit on a dime when I am done with a specific type of compute (like crypto) and switch to another type (like signal processing).
However, there's a case to be made that, in order to utilize our hardware better, we should be using vector units much more often. To make that feasible, we need a good programming paradigm that doesn't have to be rewritten for a different architecture. If that ends up not utilizing the hardware perfectly, that's okay: using a 256-bit vector unit even at 50% of the potential performance is still many times faster than scalar code.
PTX Assembly from NVidia still runs on today's architectures. I think this variable-length issue they focus on so much is a bit of a red-herring: NVidia always was 32-way SIMD but the PTX Code remains portable nonetheless.
The power is that PTX Assembly (and AMD's GCN Assembly) has a scalar-model of programming, but its execution is vectorized. So you write scalar code, but the programmer knows (and assumes it to be) in a parallel context. EDIT: I guess PTX is technically interpreted: the number of registers is not fixed, etc. etc. Nonetheless, the general "SIMD-ness" of PTX is static, and has survived a decade of hardware changes.
There are a few primitives needed for this to work: OpenCL's "Global Index" and "Local Index" for example. "Global Index" is where you are in the overall workstream, while "Local Index" is useful because intra-workgroup communications are VERY VERY FAST.
And... that's about it? Really. I guess there are a bunch of primitives (the workgroup swizzle operations, "ballot", barrier, etc. etc.), but the general GPU model is actually kinda simple.
I see a lot of these CPU architecture changes, but none of them really seem to be trying to learn from NVidia or AMD's model. A bit of PTX assembly or GCN assembly would probably do the next generation of CPU architects some good.
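To make the "kinda simple" claim concrete, here's a toy scalar model of that index space. run_ndrange and square_all are made-up names, and a real GPU of course executes the work-items in parallel rather than in this sequential loop:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Toy model of OpenCL's 1-D index space: a global range split into
// workgroups. The whole model really is just two primitives:
// get_global_id = group * local_size + local, and get_local_id = local.
void run_ndrange(std::size_t global_size, std::size_t local_size,
                 const std::function<void(std::size_t, std::size_t,
                                          std::size_t)>& kernel) {
    for (std::size_t group = 0; group < global_size / local_size; ++group)
        for (std::size_t local = 0; local < local_size; ++local)
            kernel(group * local_size + local,  // get_global_id(0)
                   local,                       // get_local_id(0)
                   group);                      // get_group_id(0)
}

// A trivial "kernel": each work-item squares the element at its global id.
std::vector<int> square_all(std::vector<int> data, std::size_t local_size) {
    run_ndrange(data.size(), local_size,
                [&](std::size_t gid, std::size_t, std::size_t) {
                    data[gid] *= data[gid];
                });
    return data;
}
```

Everything else (barriers, ballot, swizzles) layers on top of this mapping; the local id matters because work-items in the same group can communicate cheaply.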
If you're happy with these limitations, write OpenCL code and run it on CPU. Will work much faster than scalar code but likely slower than a GPU would.
Those are fundamental design tradeoffs of the GPU architecture. Not of PTX Assembly language.
> inefficient branching
Ehh? I disagree strongly. GPUs branch very efficiently, considering it's within a SIMD environment. Modern GPUs can arbitrarily diverge execution paths and reconverge as necessary. It's far more advanced than any SIMD I've seen implemented on a CPU.
GPUs have the most efficient way to branch when you're managing 32 or 64 work items at a time.
> limited memory model (can't allocate RAM, write access is very limited).
You can definitely allocate RAM on a GPU. That's what the CudaMalloc function does. What do you mean by "limited write access" ??
Programmers don't want to "think" in SIMD. They want to think in terms of scalar code. Per-lane Gather/Scatter is far easier to think about than vpshufb (even though vpshufb is effectively a gather instruction over an AVX register)
That's the difference. GPU assembly language is almost equivalent to CPU stuff, it's just "seen" in a different light.
CPUs are missing very, very few assembly instructions before they can run like a GPU. CPUs simply need a hardware accelerated branching-mechanism to help handle the divergent flow graphs (a simple advancement over bitmasks... AMD GCN is a good example. NVidia's Volta/Turing fully divergent flow would be nicer, but that requires a program-counter per SIMD-lane).
AMD's GCN architecture implements that mostly in just two assembly instructions: S_CBRANCH_FORK and S_CBRANCH_JOIN.
Intel is missing the "scatter-equivalent" of vpshufb. GPUs can gather/scatter arbitrarily between SIMD lanes, but Intel AVX512 only has "SIMD-gather" in the "vpshufb" instruction. Add "SIMD-scatter" and AVX512 would be complete. AMD GCN calls the two instructions "permute" and "bpermute" (backwards permute). So maybe a "backwards-vpshufb" or "vpbshufb" instruction is all AVX512 needs.
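To spell out the two directions on a small toy (8 lanes, function names made up): "gather over a register" reads through the index vector, "scatter over a register" writes through it. As I read the GCN docs, the gather direction is what vpshufb/vpermps give you (and what ds_bpermute_b32 does), while the scatter direction is ds_permute_b32, the one claimed missing from AVX512:

```cpp
#include <array>
#include <cstdint>

// The two directions of lane-crossing movement, on a toy 8-lane register.
constexpr int LANES = 8;

// Gather: lane i READS from the lane named by idx[i] (vpshufb-style).
std::array<uint32_t, LANES> reg_gather(const std::array<uint32_t, LANES>& src,
                                       const std::array<int, LANES>& idx) {
    std::array<uint32_t, LANES> out{};
    for (int i = 0; i < LANES; ++i) out[i] = src[idx[i]];
    return out;
}

// Scatter: lane i WRITES to the lane named by idx[i] (the missing direction).
std::array<uint32_t, LANES> reg_scatter(const std::array<uint32_t, LANES>& src,
                                        const std::array<int, LANES>& idx) {
    std::array<uint32_t, LANES> out{};
    for (int i = 0; i < LANES; ++i) out[idx[i]] = src[i];
    return out;
}
```

With the same index vector the two produce inverse permutations, which is why one can't generally substitute for the other without first inverting the indices -- and inverting an index vector is itself a scatter.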
Finally, allow SIMD-instructions on CPUs to pipeline in more complex manners, to increase utilization. Only guarantee that they complete through the use of a barrier instruction. This one is an architectural shift, but GPUs are aware of "larger" workgroup sizes. (AMD GCN natively executes 64-at-a-time. But it can grow thanks to the barrier-mechanism).
The "variable length" discussed in the article here is backwards. GPUs grow their workgroups bigger, and then the programmer creates synchronization points with barrier instructions. It's actually a very simple model to work with.
Bam. Those 5 or 6 assembly instructions would make CPU-programming effectively on the same level as GPU programming.
PTX is not general purpose, it was specifically designed for that architecture, and it incorporates these tradeoffs.
> far more advanced than any SIMD I've seen implemented on a CPU.
Less advanced than normal non-SIMD branching available to CPU cores. A lot of practical algorithms need both SIMD compute and scalar branching.
> That's what the CudaMalloc function does.
CudaMalloc can't be called from GPU code. You can't allocate RAM at all from inside your algorithm: there's no stack, no heap, nothing. You have to know in advance how much RAM you need. For some practical problems this is a showstopper, e.g. try to implement unzip in CUDA.
> What do you mean by "limited write access"?
You only have fixed-length arrays/textures. Global memory barriers are very slow. There are producer-consumer buffers, but practically speaking they're merely thread-safe cursors over a statically sized buffer.
GPUs can multiply dense matrices very fast, but many practical problems are different, e.g. for sparse matrices GPUs don't deliver value performance-wise. SIMD on CPU is often very useful for such problems, but yes, the programming model is different, lower level and more complex. No free lunches.
> Programmers don't want to "think" in SIMD.
I'm a programmer and I like SIMD.
> even though vpshufb is effectively a gather instruction over an AVX register
On a GPU it would be, because you don't care about latency: you just spawn more threads.
On a CPU it's not, because shuffle_epi8 is a 1-cycle-latency instruction and RAM access is much slower; if you think they're equivalent you'll miss the performance difference.
> CPUs are missing very, very few assembly instructions before they can run like a GPU
Even if you add these few instructions, GPUs will still be much faster, by orders of magnitude. The hardware is too different, and it's not the instructions: the CPU spends transistors minimizing latency (caches and their sync, branch prediction, speculative execution, etc.); GPUs don't care about latency.
It's not instruction set that allowed simple programming model on GPUs. It's fundamentally different tradeoffs in hardware.
1. Barriers and Workgroups -- Scale SIMD UP, not downwards. The variable-length vector (discussed in this article) is backwards to the current programming model. GPU Programmers combine lanes with the concept of a OpenCL workgroup or CUDA Thread Block, and it works pretty well in my experience.
2. Implement AMD GCN-style branching with S_CBRANCH_FORK and S_CBRANCH_JOIN. This will accelerate branching when SIMD-lanes diverge in execution paths.
3. Implement "backwards vpshufb". GPUs can gather or scatter values between lanes, while CPUs can only gather data between lanes (with vpshufb). Intel AVX512 is missing an obvious and very important instruction for high-speed communication between SIMD lanes.
I'm just not sure it's worth it. People who are OK with the GPU programming model are already using GPUs because they're way more powerful. AVX-512's theoretical max is 64 FLOPs/cycle; a modern $600 CPU (i7-7820X) with good enough cooling is capable of 1.8 TFlops single precision. A generation-old $600 GPU (1080Ti) is capable of 10.6 TFlops. Huge difference.
> GPUs can gather or scatter values between lanes
Can they? AFAIK they can only permute values between lanes but not scatter/gather.
__shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.
GPUs have a crossbar between OpenCL __local memory and every work-item in a workgroup. It's so innate to the GPU that it's almost implicit. In terms of OpenCL, the code looks like this:
    __local uint32_t gatherFoo[64];  // one slot per work-item
    gatherFoo[get_local_id(0)] = fooBar();
    barrier(CLK_LOCAL_MEM_FENCE);    // make the writes visible group-wide
    myFoo = gatherFoo[generate_index()];

    __local uint32_t scatterFoo[64];
    scatterFoo[generate_index()] = fooBar();
    barrier(CLK_LOCAL_MEM_FENCE);
    myFoo = scatterFoo[get_local_id(0)];
GPUs don't really "need" a dedicated instruction to do this, because __local memory is a full-speed crossbar when workgroups are of size 32 (NVidia) or 64 (AMD), the native SIMD size. Its performance characteristics are not quite equivalent to vpshufb, but it's still very, very, very fast in practice.
> __shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.
There's little reason to use __shfl_sync(), because it goes through CUDA Shared memory anyway. Ditto with AMD's __amdgcn_ds_permute() and __amdgcn_ds_bpermute() intrinsics.
EDIT: I guess __shfl_sync() and the __amdgcn_ds_permute / bpermute intrinsics save a step. They're fewer assembly instructions and more concise. But I expect the overall performance to not be much different from using LDS / Shared Memory explicitly.
> I'm just not sure it's worth it. People who are OK with the GPU programming model are already using GPUs because they're way more powerful. AVX-512's theoretical max is 64 FLOPs/cycle; a modern $600 CPU (i7-7820X) with good enough cooling is capable of 1.8 TFlops single precision. A generation-old $600 GPU (1080Ti) is capable of 10.6 TFlops. Huge difference.
People don't program CPUs for FLOPs. They program on CPUs for minimum latency.
SIMD Compute is useful on a CPU because it stays in L1 cache. L1 cache is 64kB, more than enough to have a good SIMD processor accelerate some movement. CPUs even have full bandwidth to L2 cache, which is huge these days (512kB on EPYC to 1MB on Skylake-server)
CPU-based SIMD won't ever be as big or as broad as GPU-based SIMD. But... CPU-based SIMD should become easier as Intel figures out how to adopt OpenCL or CUDA programming paradigms.
There are already many problems implemented in CPU AVX512 which execute faster than the 15.8GB/s that a PCIe x16 bus will give you. Therefore, it's more efficient to execute the whole problem on the CPU, rather than transfer the data to the GPU.
NVidia says the opposite is true. Here's a link: https://devblogs.nvidia.com/using-cuda-warp-level-primitives...
The data exchange is performed between registers, and more efficient than going through shared memory, which requires a load, a store and an extra register to hold the address.
If you count shared memory scatter/gather, CPU SIMD already has both. Scatter very recently so: it only appeared in AVX512. Gather has been available for 5 years now; _mm256_i32gather_ps was introduced in AVX2, albeit it's not particularly fast.
> They program on CPUs for minimum latency.
Not just that. I code for CPU SIMD very often, and only occasionally for GPGPU. Even for code that would work very well on GPUs. The main reason for me is compatibility. I mostly work on desktop software, and picking CUDA decreases the userbase by a factor of 2, which is often not an option. But yeah, another reason is that CPU SIMD is fast enough already and spending time on PCIe IO doesn't pay off.
Update: another reason why I don't code GPGPU more is the different programming model. The GPU programming model makes writing device code easy, and like you mentioned earlier it even has good scalability built in, i.e. in many cases compute shaders need not be aware of the warp size.
But the downside is upfront engineering costs.
I have to keep my data in a very small number of contiguous buffers. I have to upload these buffers to the GPU. I have to know in advance how much VRAM I need for output data.
I find this part much easier for CPU SIMD: on CPU I only need to design the lowest level of my data structures accordingly, but I can use anything at all on the higher levels of the structures: hash maps, trees, linked graphs all work just fine, as long as their lower-level nodes are not too small, aligned, dense, and composed of these 128/256-bit SIMD vectors.
You can say that again. It's still not very fast, btw.
VPSCATTERDD on a ZMM register is measured at 17 clock cycles (!!!), while VPGATHERDD on a ZMM register is 9 clock cycles. Gather/scatter on CPUs is very, very slow!
It's actually faster to gather/scatter through a for-loop than to use the VPGATHERDD or VPSCATTERDD instructions.
In contrast, Shared Memory on GPUs is a full crossbar on NVidia and AMD.
I think my details are getting a bit AMD specific. Lemme do a citation:
> As a previous post briefly described, GCN3 includes two new instructions: ds_permute_b32 and ds_bpermute_b32 . They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.
The important tidbit is that the load/store units of AMD's GPU core can support a gather/scatter to separate LDS memory banks at virtually no cost. This is a full crossbar that allows GPUs to swap values between SIMD lanes.
Intel CPUs do NOT have this feature, outside of vpshufb. I'm arguing that GPU Shared memory is of similar performance to vpshufb (a little bit slower, but still way faster than even a CPU's Gather/Scatter).
So yes, the "bpermute" and "permute" instructions on AMD are a full-crossbar and can execute within 2-clock cycles (if there are no bank conflicts). That's 64-dwords that can be shuffled around in just 2-clock cycles.
In contrast, Intel's Gather/scatter to L1 cache is 9-clocks or 17-clocks respectively.
The important thing here is to do a high-speed permute. The programmer can choose to use vpgatherdd, vpshufb, GPU permute, GPU bpermute, GPU LDS Memory, etc. etc. It doesn't matter for program correctness: it all does the same thing.
But GPUs have the highest-performance shuffle and permute operators. Even if you go through LDS Memory. In fact, the general permute operators of AMD GPUs just go through LDS memory, that's their underlying implementation!
Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.
The reason why it's OK on a GPU is massive parallelism hiding latency, i.e. the hardware computes other threads instead of just waiting for data to arrive. But even on CPUs, if the requested data is not too far away (in L1 or L2), hyperthreading in most modern CPUs does an acceptable job in this situation.
> Intel CPUs do NOT have this feature, outside of vpshufb
Yes they have. If you prefer assembly, look at the vpermps instruction. It can permute lanes arbitrarily, even across 128-bit lanes (unlike vpshufb/vshufps/etc.), with permutation indices taken from another vector register. Quite fast: specifically on Haswell/Broadwell/Skylake it's 3 cycles latency, 1 cycle throughput, a single micro-op.
> Yes they have.
No they don't, but it's very tricky to see why. You're blinded by vpshufb, and can't see how it can fail to solve some problems.
Let's take stream compaction as an example problem.
Stream compaction can be used to remove redundant whitespace from strings, or to "compress" raytracer rays so that they can all be read by a simple load (as opposed to a gather/scatter operation). How do you perform stream compaction?
Well, it's a straightforward scatter operation: https://i.imgur.com/aIoO8dm.png
Now, how do you do this using vpshufb? You can't. vpshufb is backwards, and won't efficiently solve this stream compaction problem. GPUs can solve this problem very efficiently, but Intel's current implementation of AVX512 is missing the "backwards" vpshufb instruction needed to perform this operation.
Or as I've been trying to say: vpshufb is equivalent to a "gather" over SIMD Registers. But Intel is MISSING a scatter over SIMD Registers. The instruction is just... not there. I've looked for it, and it doesn't exist. As such, GPU-code (such as the stream compaction algorithm) CANNOT be implemented efficiently on a CPU.
I mean, you can use vpscatterdd to implement it, but again... vpscatterdd is 17-clock cycles. That's way too slow.
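For reference, the scalar shape of that compaction: an exclusive prefix sum over the keep-mask assigns each survivor its destination lane, and the final per-lane write is the scatter step under discussion. (A sketch, not any particular ISA's implementation; the function name is made up. AVX512's vcompressps performs this whole pattern in one instruction.)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stream compaction as scatter: prefix-sum the mask to get destinations,
// then write each kept element to its destination lane.
std::vector<uint32_t> compact(const std::vector<uint32_t>& src,
                              const std::vector<int>& keep) {
    std::vector<std::size_t> dest(src.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        dest[i] = count;          // exclusive prefix sum of the mask
        count += keep[i] ? 1 : 0;
    }
    std::vector<uint32_t> out(count);
    for (std::size_t i = 0; i < src.size(); ++i)
        if (keep[i]) out[dest[i]] = src[i];  // the scatter step
    return out;
}
```

Note there's no way to phrase that final write as a gather without first inverting the destination mapping, which is why a gather-only shuffle doesn't help here.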
> Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.
The L1 cache has the bandwidth to do it, it just isn't wired up correctly for this mechanism. Intel's load/store units can read/write 32-bytes at a time, in parallel across 2xload units + 1x store unit.
But the thing is: writing many small "chunks" of 4 bytes here and there (i.e. a vpgatherdd / vpscatterdd operation) is not a contiguous group of 32 bytes. Therefore, Intel's cores lose a LOT of bandwidth in this case.
GPUs, on the other hand, have a great many load/store units: effectively one per SIMD unit. As such, reading/writing to LDS "shared" memory on AMD GPUs can be done 32-at-a-time.
So the equivalent "vpgatherdd" over LDS cache will execute in something like 2-clock ticks on AMD GPUs (assuming no bank conflicts), while it'd take 9-clock ticks on Intel cores.
Again, LDS cache is so fast on GPUs that it effectively functions as if any LDS load/store were as fast as a vpshufb instruction. (Not quite: vpshufb is 1 clock tick and doesn't have to worry about bank conflicts. So vpshufb is still faster... but GPU gather/scatter capabilities are downright incredible.)
How long before you think Intel will implement a true crossbar so that the vpgatherdd and vpscatterdd instructions can actually execute quickly on a CPU?
GPUs actually implement a richer language for data-movement. Intel could very easily fix this problem by writing a "backwards vpshufb" instruction, but I'm not aware of anything that exists like that... or any plans to implement something like that.
You keep mentioning vpshufb despite it being unable to move data across 128-bit lanes.
Here's how to do that with vpermps, there's some overhead but not much, very likely much faster than RAM access even when in L1: https://stackoverflow.com/a/36951611/126995
Besides, new CPUs have AVX-512 that has vcompressps instruction just for that use case.
> The L1 cache has the bandwidth to do it
It has bandwidth, but there's extra latency involved. Registers are faster.
It's the same on GPU, that's why nVidia recommends these CUDA permute lanes intrinsics over scatter/gather RAM access.
Here's a recent article about people actually using these permute intrinsics achieving quite good results: https://news.ycombinator.com/item?id=19018240
Thanks for the discussion!
CPU architectures are quite stable. SSE2 is almost 20 years old now. You can't even run modern Windows on a system which doesn't support it.
Vectorize to SSE and you'll get your 50% of potential performance. You can do it without any new paradigms; C and C++ have supported SSE intrinsics for decades already, and other languages are catching up.
If you wrote SSE2 code 20 years ago, then yes, it's still going to run. But it's not going to benefit from the advances made with AVX and AVX2.
Compare this to the GPU model, or ispc or vector instructions, where you automatically benefit from a wider machine without a rewrite of your code (and in the case of vector instructions even without a recompile).
This is orthogonal to the fact that these are warp-size aware I believe.
Are you getting hung up on the term 'vector'? That doesn't assume you're just doing linear algebra.
It's possible to have a chip implement specialized transforms, or have a many-to-many crossbar when N is small (ex: 16), but when N is large (ex: 1024 elements), it's no longer easy to see how to build a high-speed permute operator.
That's the thing. People only use vpshufb because it's WAY faster than L1 cache. If Intel made a faster gather/scatter, there wouldn't be much point to the vpshufb instruction. But the vpshufb instruction is so fast because it's so specialized and small: it only has to worry about a 16-byte permute.
In short: we ALREADY have an arbitrary permute instruction. It's called gather/scatter. That's not what programmers want, however. (I mean, programmers want a faster gather/scatter... but vpshufb programmers use that operator only because it's faster than L1 cache.)
There's a front-end operating cost to the bookkeeping instructions, and a complexity cost to all the variants of the same instructions. For short vectors, the CPU can use the same hardware it uses today for SIMD, just with the SIMD work in microcode instead of asm. The cost of fetching the permutation vector arg out of L1 isn't terribly high compared to the cost of fetch/decode on the bookkeeping instructions. And the cost of supporting all those instruction variants could be replaced with more functionality in the front-end.
There's a reason that gather is hard to do; I think if you rocked up and asked the architecture guys for a gather that was competitive with small-scale permute they would reply with the time-honored Intel putdown ("You are overpaid for whatever it is you do").
And now I'm distracted looking for 6-bit lookup tables that will enjoy that instruction. DES had 6-bit S-boxes, for example.
Hmmmmm... 6-bit lookup tables. Yum. I wonder what else is out there that would benefit?
Mass availability of the VBMI goodies looks to be bottlenecked behind Icelake/Sunny Cove, so you'll have plenty of time to think through the implications of fast 6-bit lookup. :-)
(And, of course, that's all assuming there's enough registers, and I don't remember enough about JPEG to make a guess.)
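As a concrete model of the 6-bit-lookup idea: a vpermb-style selection over one 64-byte register uses only the low 6 bits of each index, which is exactly an S-box-shaped operation per lane. A scalar sketch (the function name is made up and the tables are toys, not real DES S-boxes):

```cpp
#include <array>
#include <cstdint>

// Scalar model of a VBMI vpermb-style lookup on a 64-byte register:
// out[i] = table[idx[i] & 63], i.e. a 64-entry (6-bit-in, 8-bit-out)
// lookup table applied to all 64 lanes at once.
std::array<uint8_t, 64> permb(const std::array<uint8_t, 64>& table,
                              const std::array<uint8_t, 64>& idx) {
    std::array<uint8_t, 64> out{};
    for (int i = 0; i < 64; ++i) out[i] = table[idx[i] & 63];
    return out;
}

// Toy helpers for a quick check.
std::array<uint8_t, 64> iota64() {
    std::array<uint8_t, 64> a{};
    for (int i = 0; i < 64; ++i) a[i] = static_cast<uint8_t>(i);
    return a;
}

std::array<uint8_t, 64> rev64() {
    std::array<uint8_t, 64> a{};
    for (int i = 0; i < 64; ++i) a[i] = static_cast<uint8_t>(63 - i);
    return a;
}
```

On hardware with VBMI this whole loop would be a single instruction, which is what makes 6-bit tables suddenly interesting.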
Why? AFAICS, a "real" vector ISA is mostly a superset of a SIMD ISA. What do you think is missing?
Of course, if you want to eke out the absolute maximum performance then you need to be aware of the microarchitecture you're targeting, and a length-agnostic vector ISA like SVE or RVV doesn't buy you that much. But I don't see how that's worse than having to redo your code whenever the vendor introduces new HW with a new SIMD ISA.
I guess one could argue that an implementation of a vector ISA targeting, say, linear algebra, would be different than an implementation focusing on maximum short-vector performance. Say, by having a vector length >> execution width, and using tricks like vector pipelining etc. to get performance rather than focusing on minimizing short-vector latency. But, that's a question of what the implementation is optimized for rather than saying what the ISA is good for, no?
It's not as ambitious as your approach though, more like a 1:1 translation and thus cannot take advantage of wider vectors.
> An architect partitions the existing 64-bit registers
> The IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.
Wait, what. IA-32 started in 1985, not 1978. It didn't have any existing 64-bit registers. It was called IA-32 because of the 32-bit registers, like EAX and EBX. And then looking at the 1986 reference manual https://css.csail.mit.edu/6.858/2014/readings/i386.pdf I count 96 instructions. The IA-32 instruction set didn't grow much all these years; IA-64 did, to the best of my knowledge, but please let me know if I am wrong here. As for IA-64, I looked at https://www.intel.com/content/dam/www/public/us/en/documents... and it's hard to get an accurate count because some instructions are grouped together; it's either 627 or 996 (and I may have made a counting mistake given I started from a PDF, but it should be close), which is indeed very high, but even our best attempt only finds a tenfold growth (and perhaps only 6.5x) instead of the 17.5x the article suggested.
As for SIMD, it's a huge benefit when used in the right context. I applied it to image processing and video compression algorithms in the past, with significant performance gains.
> As for SIMD, it's a huge benefit when used in the right context.
Sure, but as the article points out, that benefit could be even huger when used on a "proper" vector architecture with veeeery wide vector registers that do not also double as not-very-wide scalar registers. "The SIMD instructions execute 10 to 20 times more instructions than RV32V because each SIMD loop does only 2 or 4 elements instead of 64 in the vector case."
I think adding SIMD instructions to x86 was a good trade-off at the time, but I also think the authors are correct that new ISAs designed now are better off with a vector architecture like they propose. In the end it's apples vs. oranges because the two contexts are not comparable.
It all started with the famous Edsger Dijkstra paper titled “Go To Statement Considered Harmful”, which led to an endless series of subsequent papers patterned after its title.
I agree with the sentiment that this title pattern is overused currently, to the point of being cliche — with perhaps a bit of presumptuousness as well, due to the implicit suggestion that the author’s claim of “Considered Harmful” will withstand the test of time as well as Dijkstra’s paper (though perhaps I’m reading too much into it, in this case).
Same. Please, when you write a title for your article, be specific about what you think is wrong. "X Considered Harmful" is lazy writing and devalues any substantive argument you make because it's become a cliche to the point where it's almost anti-clickbait.
GUILTY ADMISSION: a related trap I started falling into when writing content for work (email, documents, whatever) and I was in a hurry for a subject line or title was "Thoughts on X". That's right, "thoughts": I communed with the creator and distilled them from on high. Yeah, ain't nobody got time to read that.
I'm always dubious of claims like this. I think in general the architecture would need to turn it back into the highest SIMD code it supports at instruction decode.
I see how this could be beneficial, especially when writing code, as it's way closer to just a normal loop.
Then again, why not combine both ideas and pipeline chunks of SIMD type? Say we have 4 execution stages and 32bit SIMD types (unrealistic, I know) and want to process 8-bit numbers. Wouldn't we be able to process 16 of them at the same time? Actually, isn't that kind of what GPUs already do?
I'm sure smarter people than I have reasoned about this, maybe someone can link a good article. I only know of this one and one about GPGPU that I just can't find any more (but which was also very interesting).
I think I'm kinda explaining how it's similar (and different) to what modern GPUs do but I'm not sure I understand what you mean by "wouldn't we be able to process 16 of them at the same time" - do you mean a throughput of 16/clock, or just that 16 are "in flight" through the pipeline with a throughput of 4/clock?
I'm not sure I'm clear enough about it for those without a GPU HW background. If it's not clear I'm happy to write down a more detailed explanation here!
I disagree. GPUs have a fixed vector width. AMD GPUs have 64x32-bit vectors, while NVidia GPUs have 32x32-bit vectors.
What's being discussed here is a variable-length vector supported at the hardware level, which is very, very different from how GPUs work.
Yes you would if you had 4 execution ports available and no data dependencies. Of course, those execution ports could also be processing 256 bit wide SIMD registers instead of just 32 bits. So it's a bad idea.
Instruction count is also higher, which is never a good thing.
> Actually, isn't that kind of what GPUs already do?
Strikes me as much cleaner design-wise than stuff like CUDA or OpenCL or SIMD of today.
For Mill CPU variants with wide vector units the CPU could execute certain instructions in one go, while for variants with narrow units it might have to issue multiple instructions.
Their idea is to handle this by basically doing ahead of time compilation of a generic program image, turning it into a specialized version for the installed CPU.
Sounds neat, proof is in the pudding.
All of the following quotes are taken from
> Their idea is to handle this by basically doing ahead of time compilation of a generic program image, turning it into a specialized version for the installed CPU.
To quote https://en.wikipedia.org/w/index.php?title=Itanium&oldid=884...:
"EPIC implements a form of very long instruction word (VLIW) architecture, in which a single instruction word contains multiple instructions. With EPIC, the compiler determines in advance which instructions can be executed at the same time, so the microprocessor simply executes the instructions and does not need elaborate mechanisms to determine which instructions to execute in parallel."
The problem with all the approaches that depend on AOT compilation is that no such "magic" compiler exists. And no, machine learning or AI is not the solution. ;-)
Been a while since I saw the AOT talk tho. And as mentioned, so far it's all talk anyway.
In my opinion, this just moves the problem to a meta level. With Itanium's EPIC instructions, one could encode multiple (parallel) instructions into one VLIW instruction. It was a huge problem to parallelize existing, say, C or C++ code so that this capability could be used. The fact that such a "smart compiler" turned out to be so hard to write was one of the things that broke Itanium's neck.
I honestly have no idea by what magic a "sufficiently smart compiler" that can create such a "generic image [that] already contains the parallelized instructions" suddenly appears. How could compilers suddenly parallelize programs, when that turned out to be nigh impossible for the Itanium?!
I do seem to recall that they had studied the failures of Itanium, and supposedly designed their architecture to avoid the same pitfalls as EPIC/Itanium.
One aspect I recall is that while they have VLIW, different operations within the (compound) instruction are issued in such a way that lets them be interdependent. Like, a single VLIW instruction could have an add and a multiply, where the result of the add is used as input for the multiplication. So while the operations are grouped in a single instruction, they're not executed strictly in parallel. There's a lot of other aspects too; that's just the one I remember.
But yeah, really curious to know how that pudding will turn out.
It's an entirely different approach than what the RISC-V folks are pushing for. It's great that this guy is working with them on the vector instructions, but I'm afraid it's too soon to claim a "right" way to go.
It's also not fair to compare instruction counts between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector registers, from 256 to 512 bits wide.
I watched that presentation a while ago, and while the figures shown look nice, I suspect the crux is whether MXP is practically implementable. I'm not at all an expert on this topic, so take this with a large grain of salt. Anyway:
1) With MXP, instead of a vector register file you have a scratchpad memory, i.e. a chunk of byte-addressable memory in the CPU. Now, if you want multiple vector ALUs (lanes), that scratchpad needs to be multi-ported, which quickly starts to eat up a lot of area and power. In contrast, a vector regfile can be split into single-ported per-lane chunks, saving power and area.
2) MXP seems to depend on these shuffling engines to align the data and feed the ALUs. What's their overhead? It seems far from trivial.
As for other potential models, I have to admit I'm not entirely convinced by their dismissal of the SIMT style model. Sure, it needs a bit more micro-architectural state, a program counter per vector lane, basically. But there's also a certain simplicity in the model, no need for separate vector instructions, for the really basic stuff you need only the fork/join type instructions to switch from scalar execution to SIMT and back. And there's no denying that SIMT has been an extremely successful programming model in the GPU world.
> It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector register from 256 to 512 bits wide.
True; the more interesting part is the overhead stuff. Does your ISA require vector loads/stores to be aligned on a vector-size boundary? Then when vectorizing a loop you need a scalar pre-loop to handle the first elements until you hit the right alignment and can use the vectorized code. Similarly, how do you handle the tail of the loop if the number of elements is not a multiple of the vector length? If you don't have a vector length register or such, you need a separate tail loop. And is the data in memory contiguous? Without scatter/gather and strided loads/stores you have to choose between not vectorizing and repacking the data.
All that bloats the code, and it's one of the reasons autovectorization for SIMD ISAs is difficult for compilers: often the compiler doesn't know how many times a loop will execute, and due to the above, a large number of iterations is needed to amortize the overhead. With a "proper" vector ISA the overhead is very small and it's profitable to vectorize every loop the compiler can handle.
There are reasons to prefer SIMD or vectors machines or, put another way, packed or unpacked vectors. But this is a very one-sided presentation. Also, some SIMD ISAs like Arm's SVE can handle different widths pretty nicely.
So does this argument boil down to an inversion of control which in turn removes unnecessary instructions? It certainly sounds more elegant to my naive ISA understanding.
Can I ask, someone with hands on SIMD experience: does relinquishing control over exactly what and how many "vector" operations occur in a single clock make any real world difference?
I have a computer engineering degree from UIUC and my very first thought upon seeing MMX/SSE/Altivec/etc was "why didn't they make this arbitrarily scalable?" I was excited to be able to perform multiple computations at once, but the implementation seemed really bizarre to me (hardcoding various arithmetic operations in 128 bits or whatnot).
If it had been up to me, I would have probably added an instruction that executed a certain number of other instructions as a single block and let the runtime or CPU/cluster divvy them up to its cores/registers in microcode internally.
It turns out that something like this is conceptually what happens in vector languages like MATLAB (and Octave, Scilab, etc.), which I first used around 2005. Its implementation is not terribly optimized, but in practice it doesn't need to be, because all personal computers since the mid-1990s have been limited by memory bandwidth, not processing power.
For what it's worth, we're seeing similar ideas in things like graphics shaders, where the user writes a loop that appears to be serial and synchronous, but is parallelized by the runtime. I'm saddened that they had to evolve via graphics cards with their unfortunate memory segmentation (inspired by DOS?) but IMHO the future of programming will look like general-purpose shaders that abstract away caching so that eventually CPU memory, GPU memory, mass storage, networks, even the whole internet looks like a software-transactional memory or content-addressable memory.
We'll also ditch frictional abstractions like asynchronous promises in favor of something like the Actor model from Erlang/Go, or a data graph that is lazily evaluated as each dependency is satisfied so it can be treated as a single synchronous serial computation. I've never found a satisfactory name for that last abstraction, so if someone knows what it's called, please let me know, thanks!
P.S. the point of all this is to provide an efficient transform between functional programming and imperative programming so we can begin dealing in abstractions and stop prematurely optimizing our programs (which limits them to running on specific operating systems or hardware configurations).
Holy quack! I didn't even know there were 80 (that already feels like a lot; I barely used a tiny portion when exercising in assembly), 1400 sounds really insane.
Much in the same way, AVX adds only 8 completely new instructions, but adds new 256-bit variants of many pre-existing SSE instructions. This generates an enormous number of opcodes without actually increasing complexity _that_ much.
Long answer: Intel and AMD loooove length prefixes that basically move the base instruction into a new space. Because x86/x86-64 is a variable-length ISA, they can get away with this. When people talk about ISA opcode bloat, they deliberately count all the prefixed forms as separate opcodes, whereas most users would see them as the same opcode with modifiers that aren't really separate instructions, since the core opcode didn't change. Historically, however, there are some quirky opcodes because of how things were implemented, as mentioned elsewhere. So some instructions may have one mnemonic but multiple core opcodes due to things like r/m vs reg/reg. This is because traditionally x86 didn't support three-operand opcodes, and still doesn't for many general non-vector instructions.
AVX introduced three-operand instructions (essentially rA = rB op rC) instead of the normal two-operand instructions (rA = rA op rC) that is typical for x86. AVX-512 introduced vector masks as well (and per-instruction rounding modes).
Some quick overview of the x86 ISA: each instruction is essentially an opcode (of 1-3 bytes) followed by a "modR/M" byte. These bytes encode a 3-bit register number and a second input operand which is either an immediate, a single register, or a memory operand that has 2 registers, an immediate, and a scale (multiply by 1/2/4/8) parameter. Legacy prefixes provide a segment override, address size override, and operand size override capability--essentially controlling if the register should be referred to as dx or edx. Knowing if the register is eax or xmm0 is dependent on what the opcode is; they're both encoded as 0.
Extending the ISA to 64-bit added a REX prefix, which provides a bit to indicate if its 64-bit or 32-bit, as well as three extra register selector bits that get prepended to the modR/M results. The VEX prefix, introduced for AVX, includes all of the bits in the REX prefix, as well as another 4-bit register selector (the three-operand form as mentioned above), a 1-bit for 128-bit or 256-bit vectors, and another 2 bits of opcode table extension. The EVEX prefix (for AVX-512) includes all of the bits from the VEX prefix, as well as another vector length bit, 3 bits for accessing zmm16-31 registers, 3-bits for a vector mask register, another bit for masking mode, and another bit for rounding mode.
It sounds complicated, but most of these bits are actually just providing either an extension to the size of the register set, the size of the operation being done, or the introduction of a few new operands into the operation. By the time you leave the decoder, there's not really any new data being passed onto later hardware stages.
The problem with this is that when the processor stalls on an instruction fetch for the next cacheline of code, it just sits there idle for the entire time. This greatly elongates your tails when looking at performance in terms of latency percentiles.
It makes me wonder if Intel or AMD have investigated a MicroVAX-like trimming or compression of the opcodes so that the most common/useful codes fit in the fewest bytes. In particular it seems like SIMD lengths are inverted, the longest vectors should have shorter opcodes since they're more useful. It might even be worth deprecating MMX/128-bit SSE.
AMD64 came out in 2003, a new decoder might be appropriate by 2023.
In the future, instructions will be designed by machine, for example by considering millions of permutations of possible instruction "combined add with shift with multiply by 8 and set the 6th bit", "double indirect program counter jump with offset 63", etc.
Each permutation will be added to various compilers and simulated by running benchmarks on complete architecture simulators to find out which new instruction adds the most to the power to die area to performance to code size tradeoff.
I predict there will be many more future instructions with 'fuzzy' effects which don't affect correctness, only performance. Eg. 'Set the branch predictor state to favor the next branch for the number of times in register EAX', or 'go start executing the following code speculatively, because you'll jump to it in a few hundred clock cycles and it would be handy to have it all pre-executed'.
Until you're debugging broken compiler/JIT output, which I've had to do multiple times in the last year while using .NET Core.
You don't normally need to write assembly, but you do need to use compiler intrinsics, which map 1-1 with assembly.
Really neat idea, though!
The x86 SIMD extensions such as SSE and AVX, on the other hand, aim to integrate that concept with x86 and are therefore pretty similar.
My understanding is that the largest difference is that some of the Cell cores had different opcodes that meant you could schedule some threads on some cores but not any thread on any core.
No cache; they were just dumb processors. I find it funny that they thought they could take the PS2's VU0/VU1 and make a processor out of them.