I write a lot of SIMD and I don't really agree with this.
Flaw 1: fixed width
I prefer fixed width as it makes the code simpler to write: the size is known at compile time, so we know the size of our structures. Swizzle algorithms are also customized based on the size.
Flaw 2: pipelining
No CPU I care about is in-order, so this is mostly irrelevant; besides, even scalar instructions are pipelined.
Flaw 3: tail handling
I code with SIMD as the target, and have special containers that pad memory to SIMD width, no need to mask or run a scalar loop. I copy the last valid value into the remaining slots so it doesn't cause any branch divergence.
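Roughly what that looks like (a hypothetical C++ sketch, assuming an 8-lane float target like AVX; the names are made up):

    #include <cstddef>
    #include <vector>

    // Hypothetical padded container: storage is rounded up to a multiple of the
    // SIMD width and the tail slots are filled with the last valid element, so
    // kernels always process full vectors with no masks and no scalar epilogue.
    template <typename T, std::size_t Lanes = 8>   // 8 x float = one AVX register
    struct PaddedVec {
        std::vector<T> data;

        explicit PaddedVec(const std::vector<T>& src) : data(src) {
            const std::size_t padded = (src.size() + Lanes - 1) / Lanes * Lanes;
            const T fill = src.empty() ? T{} : src.back();
            data.resize(padded, fill);   // duplicate the last value into the padding
        }
        std::size_t padded_size() const { return data.size(); }
        const T* ptr() const { return data.data(); }
    };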
In AVX-512 we have a platform that rewards the assembly language programmer like few platforms have since the 6502. I see people doing really clever things that are specific to the system; on one level that is really cool, but on another level it means SIMD is the domain of the specialist. Intel puts out press releases about the really great features they have for the national labs and for Facebook, whereas the rest of us are 5-10 years behind the curve on SIMD adoption because the juice isn't worth the squeeze.
Just before libraries for training neural nets on GPUs became available I worked on a product that had a SIMD based neural network trainer that was written in hand-coded assembly. We were a generation behind in our AVX instructions so we gave up half of the performance we could have got, but that was the least of the challenges we had to overcome to get the product in front of customers. [1]
My software-centric view of Intel's problems is that they've been spending their customers' and shareholders' money to put features in chips that are fused off or might as well be fused off because they aren't widely supported in the industry. And they didn't see this as a problem, and neither did their enablers in the computing media and software industry. Just for example, Apple used to ship the MKL libraries, which were like a turbocharger for matrix math, back when they were using Intel chips. For whatever reason, Microsoft did not do this with Windows and neither did most Linux distributions, so "the rest of us" are stuck with a fraction of the performance that we paid for.
AMD did the right thing in introducing double pumped AVX-512 because at least assembly language wizards have some place where their code runs and the industry gets closer to the place where we can count on using an instruction set defined 12 years ago.
[1] If I'd been tasked with updating it to the next generation I would have written a compiler (if I take that many derivatives by hand I'll get one wrong). My boss would have ordered me not to, I would have done it anyway and not checked it in.
AVX-512 also has a lot of wonderful facilities for autovectorization, but I suspect its initial downclocking effects plus getting yanked out of Alder Lake killed a lot of the momentum in improving compiler and library usage of it.
Even the Steam Hardware Survey, which is skewed toward upper end hardware, only shows 16% availability of baseline AVX-512, compared to 94% for AVX2.
It will be interesting to see what happens now that AMD is shipping good AVX-512. It really just makes Intel seem incompetent (especially since they're theoretically bringing AVX-512 back next year anyway).
No proof, but I suspect that AMD's AVX-512 support played a part in Intel dumping AVX10/256 and changing plans back to shipping a full 512-bit consumer implementation again (we'll see when they actually ship it).
The downside is that AMD also increased the latency of all formerly cheap integer vector ops. This removes one of the main advantages against NEON, which historically has had richer operations but worse latencies. That's one thing I hope Intel doesn't follow.
Also interesting is that Intel's E-core architecture is improving dramatically compared to the P-core, even surpassing it in some cases. For instance, Skymont finally has no penalty for denormals, a long standing Intel weakness. Would not be surprising to see the E-core architecture take over at some point.
> For instance, Skymont finally has no penalty for denormals, a long standing Intel weakness.
Yeah, that's crazy to me. Intel has been so completely dysfunctional for the last 15 years. I feel like you couldn't have a clearer sign of "we have 2 completely separate teams that are competing with each other and aren't allowed to/don't want to talk to each other". It's just such a clear sign that the chicken is running around headless.
Not really, to me it more seems like Pentium-4 vs Pentium-M/Core again.
The downfall of the Pentium 4 was that they had been stuffing things into longer and longer pipes to keep up the frequency race (with horrible branch latencies as a result). They threw all that away by "resetting" to the P3/P-M/Core architecture and scaling up from that again.
Pipes today are even _longer_, and if the E-cores have shorter pipes at a similar frequency then "regular" JS, Java, etc. code will be far more performant, even if you lose a bit of perf in the "performance" cases where people vectorize. (Did the HPC computing crowd influence Intel into a ditch AGAIN? Wouldn't be surprising!)
Thankfully, the P-cores are nowhere near as bad as the Pentium 4 was. The Pentium 4 had such a skewed architecture that it was annoyingly frustrating to optimize for. Not only was the branch misprediction penalty long, but all common methods of doing branchless logic like conditional moves were also slow. It also had a slow shifter such that small left shifts were actually faster as sequences of adds, which I hadn't needed to do since the 68000 and 8086. And an annoying L1 cache that had 64K aliasing penalties (guess which popular OS allocates all virtual memory, particularly thread stacks, at 64K alignment.....)
The P-cores have their warts, but are still much more well-rounded than the P4 was.
You mentioned "initial downclocking effects", yet (for posterity) I want to emphasize that in 2020 Ice Lake (Sunny Cove core) and later Intel processors, the downclocking is really a nothingburger. The fusing off debacle in desktop CPU families like Alder Lake you mentioned definitely killed the momentum though.
I'm not sure why OS kernels couldn't have become partners in CPU capability queries (where a program starting execution could request a CPU core with 'X' such as AVX-512F, for example) -- but without that the whole P-core/E-core hybrid concept was DOA for capabilities which were not least-common denominators. If I had to guess, marketing got ahead of engineering and testing on that one.
Sure, but any core-wide downclocking effect at all is annoying for autovectorization, since a small local win easily turns into a global loss. Which is why compilers have "prefer vector width" tuning parameters so autovec can be tuned down to avoid 512-bit or even 256-bit ops.
This is also the same reason that having AVX-512 only on the P-cores wouldn't have worked, even with thread director support. It would only take one small routine in a common location to push most threads off the P-cores.
I'm of the opinion that Intel's hybrid P/E-arch has been mostly useless anyway and only good for winning benchmarks. My current CPU has a 6P4E configuration and the scheduler hardly uses the E-cores at all unless forced, plus performance was better and more stable with the E-cores disabled.
Noob question! What about AVX-512 makes it unique to assembly programmers? I'm just dipping my toes in, and have been doing some chemistry computations using f32x8, Vec3x8 etc (AVX-256). I have good workflows set up, but have only been getting 2x speedup over non-SIMD. (Was hoping for closer to 8). I figured AVX-512 would allow f32x16 etc, which would be mostly a drop-in. (I have macros to set up the types, and you input num lanes).
AVX-512 has a lot of instructions that just extend vectorization to 512-bit and make it nicer with features like masking. Thus, a very valid use of it is just to double vectorization width.
But it also has a bunch of specialized instructions that can boost performance beyond just the 2x width. One of them is VPCOMPRESSB, which accelerates compact encoding of sparse data. Another is GF2P8AFFINEQB, which is targeted at specific encryption algorithms but can also be abused for general bit shuffling. Algorithms like computing a histogram can benefit significantly, but it requires reshaping the algorithm around very particular and peculiar intermediate data layouts that are beyond the transformations a compiler can do. This doesn't literally require assembly language, though, it can often be done with intrinsics.
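For a rough flavor of the compress idea (hedging: this sketch uses the 32-bit VPCOMPRESSD form from baseline AVX-512F rather than VPCOMPRESSB, which needs VBMI2), left-packing the positive elements of a 16-int block looks something like:

    #include <immintrin.h>

    // Left-pack the strictly positive ints from src[0..15] into dst and return
    // how many were kept. One compare + one compress per 16 elements.
    // Note: the store writes a full 64 bytes, so dst needs room for 16 ints.
    static int keep_positive_16(const int* src, int* dst) {
        __m512i v   = _mm512_loadu_si512(src);
        __mmask16 m = _mm512_cmpgt_epi32_mask(v, _mm512_setzero_si512());
        _mm512_storeu_si512(dst, _mm512_maskz_compress_epi32(m, v));
        return _mm_popcnt_u32((unsigned)m);
    }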
SIMD only helps where you're arithmetic-limited; you may be limited by memory bandwidth, or perhaps float division if applicable; and if your scalar baseline got autovectorized you'd see roughly no benefit.
AVX-512 should be just fine via intrinsics/high-level vector types, not different from AVX2 in this regard.
It is kind of a bummer that MKL isn’t open sourced, as that would make inclusion in Linux easier. It is already free-as-in-beer, but of course that doesn’t solve everything.
Baffling that MS didn’t use it. They have a pretty close relationship…
Agree that they are sort of going after hard-to-use niche features nowadays. But I think it is just that the real thing we want—single threaded performance for branchy code—is, like, incredibly difficult to improve nowadays.
And web browsers, at the very least, spend a lot of cycles decoding HTML and JavaScript, which are UTF-8 encoded. It turns out AVX-512 is good at a lot of things you wouldn't think SIMD would be good at. Intel's got the problem that people don't want to buy new computers because they don't see much benefit from buying a new computer, but a new computer doesn't have the benefit it could have because of lagging software support, and the software support lags because there aren't enough new computers to justify the work to do the software support. Intel deserves blame for a few things, one of which is that they have dragged their feet at getting really innovative features into their products while turning people off with various empty slogans.
They really do have a new instruction set that targets plain ordinary single-threaded branchy code.
If you pay attention this isn't a UTF-8 decoder. It might be some other encoding, or a complete misunderstanding of how UTF-8 works, or an AI hallucination. It also doesn't talk about how to handle the variable number of output bytes or the possibility of a continuation sequence split between input chunks.
I paid attention and I don't see where Daniel claimed that this is a complete UTF-8 decoder. He's illustrating a programming technique using a simplified use case, not solving the world's problems. And I don't think Daniel Lemire lacks an understanding of the concept or needs an AI to code it.
I don't understand the push for variable width SIMD. Possibly due to ignorance, but I see it as an abstraction that can be specialized for different hardware, so tradeoffs similar to those between low-level and high-level languages apply. Since I already have to be aware of hardware-level concepts, such as 256-bit shuffles not working across 128-bit lanes and different instructions having very different performance characteristics on different CPUs, I'm already knee deep in hardware specifics. While in general I like abstractions, I've largely given up waiting for a 'sufficiently advanced compiler' that would properly auto-vectorize my code; I think AGI is more likely to happen first. At a guess, the idea is that such SIMD code could also work on GPUs, but GPU code has different memory access costs, so the code there would be completely different anyway.
So my view is either create a much better higher level SIMD abstraction model with a sufficiently advanced compiler that knows all the tricks or let me work closely at the hardware level.
As an outsider who doesn't really know what is going on, it does worry me a bit that WASM appears to be pushing for variable width SIMD instead of supporting the ISAs generally supported by CPUs. I guess it's a portability vs performance tradeoff; I worry that it may be difficult to make variable width as performant as fixed width, and I would prefer to deal with portability by having alternative branches at the code level.
>> Finally, any software that wants to use the new instruction set needs to be rewritten (or at least recompiled). What is worse, software developers often have to target several SIMD generations, and add mechanisms to their programs that dynamically select the optimal code paths depending on which SIMD generation is supported.
Why not marry the two and have variable width SIMD as one of the ISA options and if in the future variable width SIMD become more performant then it would just be another branch to dynamically select.
Part of the motive behind variable width SIMD in WASM is that there's intentionally-ish no mechanism to do feature detection at runtime in WASM. The whole module has to be valid on your target, you can't include a handful of invalid functions and conditionally execute them if the target supports 256-wide or 512-wide SIMD. If you want to adapt you have to ship entire modules for each set of supported feature flags and select the correct module at startup after probing what the target supports.
So variable width SIMD solves this by making any module using it valid regardless of whether the target supports 512-bit vectors, and the VM 'just' has to solve the problem of generating good code.
Personally I think this is a terrible way to do things and there should have just been a feature detection system, but the horse fled the barn on that one like a decade ago.
It would be very easy to support 512-bit vectors everywhere, and just emulate them on most systems with a small number of smaller vectors. It's easy for a compiler to generate good code for this. Clang does it well if you use its built-in vector types (which can be any length). Variable-length vectors, on the other hand, are a very challenging problem for compiler devs. You tend to get worse code out than if you just statically picked a size, even if it's not the native size.
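For instance, with the Clang/GCC vector extensions you can just declare a 512-bit type and let the backend split it into whatever the target actually has (a sketch, not tied to any particular library):

    // A 512-bit float vector type; on an AVX2-only target the compiler simply
    // emits two 256-bit operations per statement, on NEON four 128-bit ones.
    typedef float f32x16 __attribute__((vector_size(64)));

    f32x16 fma16(f32x16 a, f32x16 b, f32x16 c) {
        return a * b + c;   // elementwise, lowered to the native width
    }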
The risk of 512-bit vectors everywhere is that many algorithms will spill the registers pretty badly if implemented in e.g. 128-bit vectors under the hood. In such cases you may be better off with a completely different algorithm implementation.
> It would be very easy to support 512-bit vectors everywhere, and just emulate them on most systems with a small number of smaller vectors. It's easy for a compiler to generate good code for this
Wouldn’t that be suboptimal if/when CPUs that support 1024-bit vectors come along?
> Variable-length vectors, on the other hand, are a very challenging problem for compiler devs. You tend to get worse code out than if you just statically picked a size, even if it's not the native size.
Why would it be challenging? You could statically pick a size on a system with variable-length vectors, too. How would that be worse code?
Optimal performance in a vector algorithm typically requires optimizing around things like the number of available registers, whether the registers in use are volatile (mandating stack spills when calling other functions like a comparer), and sizes of sequences.
If you know you're engineering for 16-byte vectors you can 'just' align all your data to 16 bytes. And if you know you have 8 vector registers where 4 of them are non-volatile you can design around that too. But without information like that you have to be defensive, like aligning all your data to 128 bytes instead Just In Case (heaven forbid native vectors get bigger than that), minimizing the number of registers you use to try and avoid stack spills, etc. (I mention this because WASM also doesn't expose any of this information.)
It's true that you could just design for a static size on a system with variable-length vectors. I suspect you'd see a lot of people do that, and potentially under-utilize the hardware's capabilities. Better than nothing, at least!
> Wouldn’t that be suboptimal if/when CPUs that support 1024-bit vectors come along?
Is that likely or on anyone's roadmap? It makes a little less sense than 512 bits, at least for Intel, since their cache lines are 64 bytes, i.e. 512 bits. Any more than that and they'd have to mess with multiple cache lines all the time, not just on unaligned accesses. And they'd have to support crossing more than 2 cache lines on unaligned accesses. They could increase the cache line size too, but that seems terrible for compatibility since a lot of programs assume it's a compile time constant (and it'd have performance overhead to make it a run-time value). Somehow it feels like this isn't the way to go, but hey, I'm not a CPU architect.
Variable length vectors seem to be made for closed-source manually-written assembly (nobody wants to unroll the loop manually and nobody will rewrite it for new register width).
You can write code that runs on many processors or code that takes advantage of the capabilities of one specific processor - not both. Is portability (write once, run anywhere) no longer a goal of WASM? Or will every SIMD instruction be slowly emulated when run on "wrong" processors? What if the interpreter is too old to support the instruction at all?
> I code with SIMD as the target, and have special containers that pad memory to SIMD width...
I think this may be domain-specific. I help maintain several open-source audio libraries, and wind up being the one to review the patches when people contribute SIMD for some specific ISA, and I think without exception they always get the tail handling wrong. Due to other interactions it cannot always be avoided by padding. It can roughly double the complexity of the code [0], and requires a disproportionate amount of thinking time vs. the time the code spends running, but if you don't spend that thinking time you can get OOB reads or writes, and thus CVEs. Masked loads/stores are an improvement, but not universally available. I don't have a lot of concrete suggestions.
I also work with a lot of image/video SIMD, and this is just not a problem, because most operations happen on fixed block sizes, and padding buffers is easy and routine.
I agree; I would have picked other things for the other two in my own top-3 list.
> Masked loads/stores are an improvement, but not universally available.
Traditionally we’ve worked around this with pretty idiomatic hacks that efficiently implement “masked load” functionality in SIMD ISAs that don’t have them. We could probably be better about not making people write this themselves every time.
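One common shape of that hack, as a sketch for SSE-class ISAs without masked loads: copy the tail into a zero-padded local buffer and do a full-width load from there, so the vector access never touches memory past the end.

    #include <cstddef>
    #include <cstring>
    #include <xmmintrin.h>

    // Load the last n (1..3) floats safely as a full __m128, padding with zeros.
    // No out-of-bounds access; the memcpy is cheap since the tail runs once.
    static inline __m128 loadu_partial_ps(const float* p, std::size_t n) {
        float buf[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        std::memcpy(buf, p, n * sizeof(float));
        return _mm_loadu_ps(buf);
    }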
I think that SIMD code should not be written by hand but rather in a high-level language, so dealing with the tail becomes the compiler's problem and not the programmer's. Or do people still prefer to write assembly by hand? It seems so, judging by the code you link.
What I wanted is to write code in a more high-level language like this. For example, to compute a scalar product of a and b you write:
1..n | a[$1] * b[$1] | sum
Or maybe this:
x = sum for i in 1 .. n: a[i] * b[i]
And the code gets automatically compiled into SIMD instructions for every existing architecture (and for large arrays, into a multi-thread computation).
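For what it's worth, for a loop this simple current GCC/Clang already do this at -O3 (for float sums you additionally need something like -ffast-math so the compiler may reassociate the additions). A plain scalar version like the sketch below gets vectorized; the hard part is everything that doesn't look like this:

    #include <cstddef>

    // Plain scalar dot product; simple enough that the autovectorizer handles it.
    float dot(const float* a, const float* b, std::size_t n) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }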
Zig exposes a Vector type to use for SIMD instructions, which has been my first introduction to SIMD directly. Reading through this thread I was immediately mapping what people were saying to Vector operations in Zig. It seems to me like SIMD can reasonably be exposed in high level languages for programmers to reach to in contexts where it matters.
Of course, the compiler vectorizing code when it can as a general optimization is still useful, but when it's critical that some operations must be vectorized, explicit SIMD structures seem nice to have.
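C++ has been growing a similar facility: std::experimental::simd from the Parallelism TS (shipped in libstdc++, availability varies by toolchain) exposes the same kind of explicit vector type. A rough sketch:

    #include <cstddef>
    #include <experimental/simd>
    namespace stdx = std::experimental;

    // Explicit, width-agnostic SIMD in standard(ish) C++ with a scalar tail.
    float dot_simd(const float* a, const float* b, std::size_t n) {
        using V = stdx::native_simd<float>;      // whatever width the target has
        V acc = 0.0f;
        std::size_t i = 0;
        for (; i + V::size() <= n; i += V::size())
            acc += V(&a[i], stdx::element_aligned) * V(&b[i], stdx::element_aligned);
        float sum = stdx::reduce(acc);
        for (; i < n; ++i) sum += a[i] * b[i];   // scalar tail
        return sum;
    }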
It depends on how integrated your SIMD strategy is into the overall technical design. Tail handling is much easier if you can afford SIMD-friendly padding so a full vector load/store is possible even if you have to manually mask. That avoids a lot of the hassle of breaking down memory accesses just to avoid a page fault or setting off the memory checker.
Beyond that -- unit testing. I don't see enough of it for vectorized routines. SIMD widths are small enough that you can usually just test all possible offsets right up against a guard page and brute force verify that no overruns occur.
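A sketch of that kind of test (POSIX-only; vector_sum is a placeholder for whatever routine is under test, and max_len floats are assumed to fit in one page): map two pages, make the second one PROT_NONE, and run every length with the input ending exactly at the guard.

    #include <cassert>
    #include <cstddef>
    #include <sys/mman.h>
    #include <unistd.h>

    float vector_sum(const float* p, std::size_t n);   // routine under test

    void test_against_guard_page(std::size_t max_len) {
        const std::size_t page = sysconf(_SC_PAGESIZE);
        void* mem = mmap(nullptr, 2 * page, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(mem != MAP_FAILED);
        char* base = static_cast<char*>(mem);
        mprotect(base + page, page, PROT_NONE);          // guard page

        for (std::size_t n = 0; n <= max_len; ++n) {
            // Input ends exactly at the guard page: any overrun faults instantly.
            float* data = reinterpret_cast<float*>(base + page) - n;
            for (std::size_t i = 0; i < n; ++i) data[i] = 1.0f;
            assert(vector_sum(data, n) == static_cast<float>(n));
        }
        munmap(base, 2 * page);
    }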
I agree; and the article also seems to have quite a few technical flaws:
- Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know any implementation with ~~>256~~ >512 bits). Larger bit widths are not beneficial for most applications and the few applications that are (e.g., some HPC codes) are nowadays served by GPUs.
- The post mentions available opcode space: while opcode space is limited, a reasonably well-designed ISA (e.g., AArch64) has enough holes for extensions. Adding new instructions doesn't require ABI changes, and while adding new registers requires some kernel changes, this is well understood at this point.
- "What is worse, software developers often have to target several SIMD generations" -- no way around this, though, unless auto-vectorization becomes substantially better. Adjusting the register width is not the big problem when porting code, making better use of available instructions is.
- "The packed SIMD paradigm is that there is a 1:1 mapping between the register width and the execution unit width" -- no. E.g., AMD's Zen 4 does double pumping, and AVX was IIRC originally designed to support this as well (although Intel went directly for 256-bit units).
- "At the same time many SIMD operations are pipelined and require several clock cycles to complete" -- well, they are pipelined, but many SIMD instructions have the same latency as their scalar counterpart.
- "Consequently, loops have to be unrolled in order to avoid stalls and keep the pipeline busy." -- loop unroll has several benefits, mostly to reduce the overhead of the loop and to avoid data dependencies between loop iterations. Larger basic blocks are better for hardware as every branch, even if predicted correctly, has a small penalty. "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling.
In my experience, dynamic vector sizes make code slower, because they inhibit optimizations. E.g., spilling a dynamically sized vector is like a dynamic stack allocation with a dynamic offset. I don't think SVE delivered any large benefits, both in terms of performance (there's not much hardware with SVE to begin with...) and compiler support. RISC-V pushes further into this direction, we'll see how this turns out.
Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.
Also there are processors with larger vector length, e.g. 1024-bit: Andes AX45MPV, SiFive X380, 2048-bit: Akeana 1200, 16384-bit: NEC SX-Aurora, Ara, EPI
> no way around this
You rarely need to rewrite SIMD code to take advantage of new extensions, unless somebody decides to create a new one with a larger SIMD width.
This mostly happens when very specialized instructions are added.
> In my experience, dynamic vector sizes make code slower, because they inhibit optimizations.
Do you have more examples of this?
I don't see spilling as much of a problem, because you want to avoid it regardless, and codegen for dynamic vector sizes is pretty good in my experience.
> I don't think SVE delivered any large benefits
Well, all Arm CPUs except for the A64FX were built to execute NEON as fast as possible.
X86 CPUs aren't built to execute MMX or SSE or the latest, even AVX, as fast as possible.
> Performance was a lot better than I expected, giving between 14 and 63% uplift. Larger block sizes benefitted the most, as we get higher utilization of the wider vectors and fewer idle lanes.
> I found the scale of the uplift somewhat surprising as Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so in terms of data-width the two should work out very similar.
> Also there are processors with larger vector length
How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.
> Do you have more examples of this?
I ported some numeric simulation kernel to the A64Fx some time ago, fixing the vector width gave a 2x improvement. Compilers probably/hopefully have gotten better in the mean time and I haven't redone the experiments since then, but I'd be surprised if this changed drastically. Spilling is sometimes unavoidable, e.g. due to function calls.
> How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.
The NEC is an attached accelerator, but IIRC it can run an OS in host mode.
It's hard to tell how the others perform, because most don't have hardware available yet or only they and partner companies have access.
It's also hard to compare, because they don't target the desktop market.
> I ported some numeric simulation kernel to the A64Fx some time ago, fixing the vector width gave a 2x improvement.
Oh, wow. Was this autovectorized or handwritten intrinsics/assembly?
Any chance it's of a small enough scope that I could try to recreate it?
> I was specifically referring to dynamic vector sizes.
Ah, sorry, yes you are correct.
It still shows that supporting VLA mechanisms in an ISA doesn't mean it's slower for fixed-size usage.
I'm not aware of any proper VLA vs VLS comparisons. I benchmarked a VLA vs VLS mandelbrot implementation once where there was no performance difference, but that's too simple an example.
No. Because Intel is full of absolute idiots, Intel Atom didn't support AVX1 until Gracemont. Tremont is missing AVX1, AVX2, FMA, and basically the rest of x86-64-v3, and shipped in CPUs as recently as 2021 (Jasper Lake).
Intel also shipped a bunch of Pentium-branded CPUs that have AVX disabled, leading to oddities like a Kaby Lake based CPU that doesn't have AVX, and even worse, they also shipped a few CPUs that have AVX2 but not BMI2.
> Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.
This is the wrong approach. You should be writing your code in a high-level language like this:
x = sum i for 1..n: a[i] * b[i]
And let the compiler write the assembly for every existing architecture (including multi-threaded version of a loop).
I don't understand what the advantage of writing the SIMD code manually is. At least have an LLM write it if you don't like my imaginary high-level vector language.
This is the common argument from proponents of compiler autovectorization. An example like what you have is very simple, so modern compilers would turn it into SIMD code without a problem.
In practice, though, the cases that compilers can successfully autovectorize are very limited relative to the total problem space that SIMD is solving. Plus, if I rely on that, it leaves me vulnerable to regressions in the compiler vectorizer.
Ultimately for me, I would rather write the implementation myself and know what is being generated versus trying to write high-level code in just the right way to make the compiler generate what I want.
> "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling.
No, it actually is super common in HPC code. If you unroll a loop N times you need N times as many registers. For normal memory-bound code I agree with you, but most HPC kernels will exploit as much of the register file as they can for blocking/tiling.
I think the variable length stuff does solve encoding issues, and RISC-V takes big strides with the ideas around chaining and the vl/lmul/vtype registers.
I think they would benefit from having 4 vtype registers, though. It's wasted scalar space, but how often do you actually rotate between 4 different vector types in main loop bodies? Pretty rarely. And you'd greatly reduce the swapping between vtypes. I think they needed to find 1 more bit, but it's tough; the encoding space isn't that large for RVV, which is a perk for sure.
Can't wait to see more implementations of RVV to actually test some of its ideas.
If you had two extra bits in the instruction encoding, I think it'd make much more sense to encode element width directly in instructions, leaving LMUL multiplier & agnosticness settings in vsetvl; only things that'd suffer then would be if you need tail-undisturbed for one instr (don't think that's particularly common) and fancy things that reinterpret the vector between different element widths (very uncommon).
Will be interesting to see if longer encodings for RVV with encoded vtype or whatever ever materialize.
Thanks, I misremembered. However, the microarchitecture is a bit "weird" (really HPC-targeted), with very long latencies (e.g., ADD (vector) 4 cycles, FADD (vector) 9 cycles). I remember that it was much slower than older x86 CPUs for non-SIMD code, and even for SIMD code, it took quite a bit of effort to get reasonable performance through instruction-level parallelism due to the long latencies and the very limited out-of-order capacities (in particular the just 2x20 reservation station entries for FP).
> - Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know any implementation with >256 bits). Larger bit widths are not beneficial for most applications and the few applications that are (e.g., some HPC codes) are nowadays served by GPUs.
Just to address this, it's pretty evident why scalar values have stabilized at 64-bit and vectors at ~512 (though there are larger implementations). Tell someone they only have 256 values to work with and they immediately see the limit, it's why old 8-bit code wasted so much time shuffling carries to compute larger values. Tell them you have 65536 values and it alleviates a large set of that problem, but you're still going to hit limits frequently. Now you have up to 4294967296 values and the limits are realistically only going to be hit in computational realms, so bump it up to 18446744073709551615. Now even most commodity computational limits are alleviated and the compiler will handle the data shuffling for larger ones.
There was naturally going to be a point where there was enough static computational power on integers that it didn't make sense to continue widening them (at least, not at the previous rate). The same goes for vectorization, but in even more niche and specific fields.
Do you have examples for problems that are easier to solve in fixed-width SIMD?
I maintain that most problems can be solved in a vector-length-agnostic manner. Even if it's slightly more tricky, it's certainly easier than restructuring all of your memory allocations to add padding and implementing three versions for all the differently sized SIMD extensions your target may support.
And you can always fall back to using a variable-width SIMD ISA in a fixed-width way, when necessary.
I also prefer fixed width. At least in C++, all of the padding, alignment, etc is automagically codegen-ed for the register type in my use cases, so the overhead is approximately zero. All the complexity and cost is in specializing for the capabilities of the underlying SIMD ISA, not the width.
The benefit of fixed width is that optimal data structure and algorithm design on various microarchitectures is dependent on explicitly knowing the register width. SIMD widths aren't perfectly substitutable in practice; there is more at play than stride size. You can also do things like explicitly combine separate logic streams in a single SIMD instruction based on knowing the word layout. Compilers don't do this work in 2025.
The argument for vector width agnostic code seems predicated on the proverbial “sufficiently advanced compiler”. I will likely retire from the industry before such a compiler actually exists. Like fusion power, it has been ten years away my entire life.
> The argument for vector width agnostic code seems predicated on the proverbial “sufficiently advanced compiler”.
A SIMD ISA having a fixed size or not is orthogonal to autovectorization.
E.g. I've seen a bunch of cases where things get autovectorized for RVV but not for AVX512. The reason isn't fixed vs variable, but rather the supported instructions themselves.
There are two things I'd like from a "sufficiently advanced compiler”, which are sizeless struct support and redundant predicated load/store elimination.
Those don't fundamentally add new capabilities, but they make working with/integrating into existing APIs easier.
> All the complexity and cost is in specializing for the capabilities of the underlying SIMD ISA, not the width.
Wow, it almost sounds like you could take basically the same code and run it with different vector lengths.
> The benefit of fixed width is that optimal data structure and algorithm design on various microarchitectures is dependent on explicitly knowing the register width
Optimal to what degree? Like sure, fixed-width SIMD can always turn your pointer increments from a register add to an immediate add, so it's always more "optimal", but that sort of thing doesn't matter.
The only difference you usually encounter when writing variable instead of fixed size code is that you have to synthesize your shuffles outside the loop.
This usually just takes a few instructions, but loading a constant is certainly easier.
The interplay of SIMD width and microarchitecture is more important for performance engineering than you seem to be assuming. Those codegen decisions are made at a layer above anything being talked about here, and they operate on explicit awareness of things like register size.
It isn’t “same instruction but wider or narrower” or anything that can be trivially autovectorized, it is “different algorithm design”. Compilers are not yet rewriting data structures and algorithms based on microarchitecture.
I write a lot of SIMD code, mostly for database engines, little of which is trivial “processing a vector of data types” style code. AVX512 in particular is strong enough of an ISA that it is used in all kinds of contexts that we traditionally wouldn’t think of as a good for SIMD. You can build all kinds of neat quasi-scalar idioms with it and people do.
There's a category of autovectorization known as Superword-Level Parallelism (SLP) which effectively scavenges an entire basic block for individual instruction sequences that might be squeezed together into a SIMD instruction. This kind of vectorization doesn't work well with vector-length-agnostic ISAs, because you generally can't scavenge more than a few elements anyways, and inducing any sort of dynamic vector length is more likely to slow your code down as a result (since you can't do constant folding).
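For illustration, this is roughly the shape of straight-line code the SLP vectorizer goes after:

    // Four independent scalar statements on adjacent elements; the SLP
    // vectorizer can fuse them into a single 128-bit load/add/store.
    void add4(float* a, const float* b) {
        a[0] += b[0];
        a[1] += b[1];
        a[2] += b[2];
        a[3] += b[3];
    }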
There's other kinds of interesting things you can do with vectors that aren't improved by dynamic-length vectors. Something like abseil's hash table, which uses vector code to efficiently manage the occupancy bitmap. Dynamic vector length doesn't help that much in that case, particularly because the vector length you can parallelize over is itself intrinsically low (if you're scanning dozens of elements to find an empty slot, something is wrong). Vector swizzling is harder to do dynamically, and in general, at high vector factors, difficult to do generically in hardware, which means going to larger vectors (even before considering dynamic sizes), vectorization is trickier if you have to do a lot of swizzling.
In general, vector-length-agnostic is really only good for SIMT-like codes, which you can express the vector body as more or less independent f(index) for some knowable-before-you-execute-the-loop range of indices. Stuff like DAXPY or BLAS in general. Move away from this model, and that agnosticism becomes overhead that doesn't pay for itself. (Now granted, this kind of model is a large fraction of parallelizable code, but it's far from all of it).
A number of the cool string processing SIMD techniques depend a _lot_ on register widths and instruction performance characteristics. There’s a fair argument to be made that x64 could be made more consistent/legible for these use cases, but this isn’t matmul—whether you have 128, 256, or 512 bits matters hugely and you may want entirely different algorithms that are contingent on this.
The SLP vectorizer is a good point, but I think it's, in comparison with x86, more a problem of the float and vector register files not being shared (in SVE and RVV).
You don't need to reconfigure the vector length; just use it at the full width.
> Something like abseil's hash table
If I remember this correctly, the abseil lookup does scale with vector length, as long as you use the native data path width. (albeit with small gains)
There is a problem with vector length agnostic handling of abseil, which is the iterator API.
With a different API, or compilers that could eliminate redundant predicated load/stores, this would be easier.
> good for SIMT-like codes
Certainly, but I've also seen/written a lot of vector length agnostic code using shuffles, which don't fit into the SIMT paradigm, which means that the scope is larger than just SIMT.
---
As a general comparison, take AVX10/128, AVX10/256 and AVX10/512, overlap their instruction encodings, remove the few instructions that don't make sense anymore, and add a cheap instruction to query the vector length. (probably also instructions like vid and viota, for easier shuffle synthesization)
Now you have a variable-length SIMD ISA that feels familiar.
(For other readers:) This is what our Highway library does - wrapper functions around intrinsics, plus a (constexpr if possible) Lanes() function to query the length.
For very many cases, writing the code once for an 'unknown to the programmer' vector length indeed works.
One example that doesn't work so well is a sorting network; its size depends on the vector length. (I see you mention this below.)
:)
Yes, vqsort recently tickled a bug in clang. I've seen a steady stream of issues, many caused by SLP or the seeming absence of CI. You might try re-enabling it on GCC.
Yes, the issue with the sorting network is that it is limited to 16x16 to reduce code explosion. With uint16_t, XMM are sufficient for the 8-column case; your Godbolt link does have some YMM for the 16-column case. When changing the type to sort to uint32_t, we see ZMM as expected.
It has a few more instructions than the VLS version, but the critical dependency chain is the same.
It's also slightly less optimal on x86, because it always uses lane-crossing permutes. For AVX-512 that is 5 out of 15 permutations that are vperm but could've been vshuf (if the loop isn't unrolled and optimized by the compiler).
I wasn't able to figure out how to implement the multi vector register sort in a VLA way.
Nice work :) Clang x86 indeed unrolls, which is good. But setting the CC and AA mask constants looks fairly expensive compared to fixed-pattern shuffles.
Yes, the 2D aspect of the sorting network complicates things. Transposing is already harder to make VLA and fusing it with the other shuffles certainly doesn't help.
I advised the Abseil design and regret not pointing this out earlier: changing the interface to insert/query batches of items would be considerably more efficient, especially for databases. Whenever possible, 'vertical' algorithms (independent SIMD elements) usually scale better than 'horizontal' (pick one element within a vector).
That's probably true.
Last time I looked at it, it seemed like parts of vectorscan could be vectorized VLA, but from my, very limited, understanding of the main matching algorithm, it does seem to require specialization on vector length.
It should be possible to do VLA in some capacity, but it would probably be slower and it's too much work to test.
Keccak on the other hand was optimized for fast execution on scalar ISAs with 32 GPRs. This is hard to vectorize in general, because GPR "moves" are free and liberally applied.
Another example where it's probably worth specializing is quicksort, specifically the leaf part.
I've written a VLA version, which uses bitonic sort to sort within vector registers. I wasn't able to meaningfully compare it against a fixed size implementation, because vqsort was super slow when I tried to compile it for RVV.
On vqsort: yes, the current RVV set of shuffles is awfully limited and several implementations produce one element per cycle. We also saw excessive VSETVLI, though I understand that has been fixed by an extra compiler pass. Could be interesting to retry with a uarch having O(1) shuffles.
Another reason to prefer fixed width: compilers may pass vectors to functions in SIMD registers. When the register size is unknown at compile time, they have to pass the data in memory. For complicated SIMD algorithms the performance overhead is going to be huge.
Back in the day, you had Cray-style vector registers, and you had CDC-style[1] 'vector pipes' (I think I remember that's what they called them) that you fed from main memory. So you would (vastly oversimplifying) build your vectors in consecutive memory locations (up to 64k as I recall), point to a result destination in memory, and execute a vector instruction. This works fine if there's a close match between CPU speed and memory access speed. The compilers were quite good, and took care of handling variable sized vectors, but I have no idea what was going on under the hood except for some high-level undergrad compiler lectures. As the memory speed vs CPU speed divergence became more and more pronounced, it quickly became obvious that vector registers were the right performance answer; basically everyone jumped that way, and I don't think anyone has adopted a memory-memory vector architecture since the '80s.
[1] From the CDC STAR-100 and follow-ons like the CDC Cyber 180/990, Cyber 200 series & ETA-10.
AFAIK just about every modern CPU uses an out-of-order von Neumann architecture. The only people who don't are the handful of researchers and people working on government research into non-von Neumann designs.
Alpha went out of order starting with EV6, but most importantly the entire architecture was designed with an eye for both pipeline hazards and out-of-order execution, unlike the VAX it replaced, which made that pretty much impossible.
Lisp is so powerful, but without static types you can't even do basic stuff like overloading, and you have to invent a way to even check the type (for custom types) so you can branch on type.
Which the inferencer probably can't do because of how dynamic standard-class can be. Also, if we want to get pedantic, method-dispatch does not dispatch on types in Common Lisp, but rather via EQL or the class of the argument. Since sub-classes can be made (and methods added or even removed) after a method invocation is compiled, there is no feasible way in a typical lisp implementation[1] to do compile-time dispatch of any type that is a subtype of standard-object.
Now none of this prevents you from extending lisp in such a way that lets you freeze the method dispatch (see e.g. https://github.com/alex-gutev/static-dispatch), but "a modern compiler will jmp past the type checks" is false for all of the CLOS implementations I'm familiar with.
1: SICL is a research implementation that has first-class global environments. If you save the global-environment of a method invocation, you can re-compile the invocation whenever a method is defined (or removed) and get static-dispatch. There's a paper somewhere (probably on metamodular?) that discusses this possibility.
Missed your reply when it happened. Note that you cannot specialize a method on "fixnum" since it's not a class (though you can on "number" or "integer" since they are classes).
I mean, they did engage in gaslighting about the lab leak being impossible, though Trump was president during much of that, so I guess he is blaming himself?
Yeah I had a pretty high opinion of Lua when I first used it, then I came back to code I'd written years earlier, and the lack of types just made it a nightmare.
It really could use a fully statically typed layer that compiles down to Lua, and also fixes some of the stupid stuff such as 1-based indexing and the lack of increment ops, etc.
I have my doubts about Jai; the fact that Blow & co. seem to have major misunderstandings with regard to RAII doesn't lend much confidence.
Also a 19,000 line C++ program (this is tiny) does not take 45 minutes unless something is seriously broken; it should be a few seconds at most for a full rebuild, even with a decent amount of template usage.
This makes me suspect this author doesn't have much C++ experience, as this should have been obvious to them.
I do like the build script being in the same language, CMake can just die.
The metaprogramming looks more confusing than C++, why is "sin"/"cos" a string?
Based on this article I'm not sure what Jai's strength is, I would have assumed metaprogramming and SIMD prior, but these are hardly discussed, and the bit on metaprogramming didn't make much sense to me.
> Also a 19,000 line C++ program (this is tiny) does not take 45 minutes unless something is seriously broken
Agreed, 45 minutes is insane. In my experience, and this does depend on a lot of variables, 1 million lines of C++ ends up taking about 20 minutes. If we assume this scales linearly (I don't think it does, but let's imagine), 19k lines should take about 20 seconds. Maybe a little more with overhead, or a little less because of less burden on the linker.
There's a lot of assumptions in that back-of-the-envelope math, but if they're in the right ballpark it does mean that Jai has an order of magnitude faster builds.
I'm sure the big win is having a legit module system instead of plaintext header #include
Yeah it's weird, but the author of this post claiming that defer can replace RAII kinda suggests that. RAII isn't just about releasing, in the same scope, a resource you acquired in the current scope. You can pass the resource across multiple boundaries with move semantics, and only at the end, when it's no longer needed, will the resource be released.
The author of the post claims that defer eliminates the need for RAII.
Well, goto also eliminates the "need" but language features are about making life easier, and life is much easier with RAII compared to having only defer.
It makes things easier. Usually the move constructor (or move assignment operator) will cause the moved-from object to stop being responsible for releasing a resource, moving the responsibility to the moved-to object. Simplest example: move-construct unique_ptr X from unique_ptr Y. When X is destroyed it will free the memory; when Y is destroyed it will do nothing.
So you can allocate resource in one function, then move the object across function boundaries, module boundaries, into another object etc. and in the end the resource will be released exactly once when the final object is destroyed. No need to remember in each of these places along the path to release the resource explicitly if there's an error (through defer or otherwise).
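A minimal sketch of that path with unique_ptr (the names are just illustrative):

    #include <memory>
    #include <utility>

    struct Resource { /* e.g. a file handle or a buffer */ };

    std::unique_ptr<Resource> make_resource() {
        return std::make_unique<Resource>();        // acquired here
    }

    void consume(std::unique_ptr<Resource> r) {
        // used here; released exactly once when r goes out of scope
    }

    int main() {
        auto r = make_resource();
        consume(std::move(r));   // ownership moved; r's destructor now does nothing
    }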
I agree that it makes some things easier (at the expense of managing constructors/destructors), I'm disputing the blanket assertion that it's superior to manual management, in the context of Jai (and Odin). You're also introducing a reference count, but that's besides the point.
In Jai/Odin, every scope has default global and temp allocators, there's nothing stopping you from transferring ownership and/or passing pointers down the callstack. Then you either free in the last scope where the pointer lives or you pick a natural lifetime near the top of the callstack, defer clear temp there, and forget about it.
You may also want to pass a resource through something like a channel, promise/future pair or similar. So it's not just down/up the callstack, sometimes it's "sideways". In those cases RAII is a life savior. Otherwise you have to explicitly remember about covering all possibilities:
- what if resource never enters the channel
- what if it enters the channel but never gets retrieved on the other side
- what if the channel gets closed
- what if other side tries to retrieve but cancels
Honestly, I concur. Out of interest in what sort of methods they came up with to manage memory, I checked out the language's wiki, and I'm not sure that going back to 1970s C (with the defer statement on top) is an improvement. You have to write defer everywhere, and if your object outlives the scope of the function, even that is useless.
I'm sure having to remember to free resources manually has caused so much grief that they decided to come up with RAII, so that an object going out of scope (either on the stack, or when its owning object gets destroyed) would clean up its resources.
Compared to a lot of low-level people, I don't hate garbage collection either, with a lot of implementations reducing to pointer bumping for allocation, which is an equivalent behavior to these super-fast temporary arenas, with the caveat that once you run out of memory, the GC cleans up and defragments your heap.
If for some reason, you manage to throw away the memory you allocated before the GC comes along, all that memory becomes junk at zero cost, with the mark-and-sweep algorithm not even having to look at it.
I'm not claiming either GC or RAII are faultless, but throwing up your hands in the air and going back to 1970s methods is not a good solution imo.
That being said, I happen to find a lot that's good about Jai as well, which I'm not going to go into detail about.
This take is equally bizarre. Most languages have an addition semantic. Most languages do not have RAII. That's, by and large, a C++ thing. Jai does NOT have RAII. So, again, why would anybody care what his opinion on RAII is?
Creating strong types for currency seems like common sense, and isn't hard to do. Even the Rust code shouldn't be using basic types.