Real world benchmarks also wouldn't be great because they would be showing how well it works in someone else's program, rather than yours.
This at least gives you an idea of what the relative cost of different operations is, so you can consider which operations are frequent in your program and then benchmark a couple from there, rather than everything.
> Not sure why HN removes indentation between a leading > and the start of the text, makes quoting unnecessarily hard
Because HN’s markup is the most dreary, half-assed, and useless one in the history of lightweight markups.
HN will remove a bunch of characters it doesn’t like because they might be used for fun, parse two+ newlines as a paragraph break, 4-indent as a literal block, and mis-parse * as emphasis.
That’s it. So you get all the drawbacks of HTML with almost none of the tools provided by even just the original markdown. No quoting, no line breaks, no lists, no inline code, …
2-indent is a literal block. 4-indent also is, but only because it's a 2-indent.
HN markup also supports \-escapes (so you can write \*hello\* to get *hello*), but they grow linearly, not exponentially, with repeated escapes (\\\* becomes \\*). It also truncates long URLs in display, and it handles a single pair of parentheses in them (so you can write https://example.com/print(%22Hello,%20world%21%22)). But I think that's all.
but HN doesn't provide me with an HTML input field but a text input field, and HTML has stuff like <blockquote> that is absent from the HN input format, and properly converting user input is a bread-and-butter task of anything displaying user content
so HTML collapsing whitespace is really meaningless for the argument IMHO
Do you let the shelters become drug dens? Because many chronically homeless people in SF are addicts, and they would rather live on the street than give up drugs to sleep indoors.
Addicts should have a choice: shelter, treatment, or jail. If you bring drugs in the shelter, your choice becomes treatment or jail. Drug encampments on city sidewalks should simply not be an option.
Some chronically homeless people in SF also suffer from mental illness and cannot look after themselves. They may also not do well in shelters. But leaving them outside is not humane. Institutions had a reputation for poor living conditions, but leaving them to suffer in the street is no better. And institutions can be improved.
That’s the first I’ve heard of the S III Neo. What an odd product. Came out 2 years after the original S III, in practically the same chassis, but with a few parts swapped.
IMO for a graphical program that's fine, but in general I really hate hard requirements for a GPU which I've seen in the wild multiple times. Just simulate the darn thing in software, I don't care if it takes 10x longer, I have all the time in the world.
yes. it's working fine for me on 9 y/o integrated intel graphics.
but it's kind of still a weird statement to make. i thought it was generally the OS's job to supply the vulkan layer, and that mesa -- which just about every linux OS will be using -- provides pretty robust software implementations of those things as fallback. what would cause them to require a "physical" anything?
The Program Optimization course I followed did something similar with regular CPUs / GPUs. It started out as "don't do stupid things", but it quickly went into SoA vs AoS, via hand-writing AVX intrinsics, into eventually running OpenCL code on the GPU.
Part of your grade depended on the performance of your code, with a bonus for the fastest submission. It led to a very nice semi-competitive semi-collaborative atmosphere.
Does anyone have insight into why arm CPU vendors seem so hesitant about implementing SVE2? ~They seem~ *Apple seems to have no issue with SSVE2 or SME.
Edit: Only Apple has implemented SSVE and SME I think.
What is the measurable benefit to implementing 128b SVE2? Like, ARM has CPUs that implement that, and it's not even disabled on some chips. So there must be benchmarks somewhere showing how worthwhile it is.
And implementing 256b SVE has different issues depending on how you do it. 4x256b vector ALUs are more power hungry than generally useful. 2x256b is only beneficial over 4x128b if you're limited by decode width, which isn't an issue now that A32/T32 support has been dropped. 3x256b would probably imply 3x128b which would regress existing NEON code. And little cores don't really want to double the transistors spent on vector code, but you can't have a different vector length than the big cores...
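To put rough numbers on the decode-width point above, here is a minimal sketch. It assumes each vector unit can accept one instruction per cycle, which is a simplification and not a model of any real core; the helper name is made up for illustration.

```python
def peak_config(units, width_bits):
    """Return (ALU bits processed per cycle at peak,
    vector instructions that must issue per cycle to reach that peak).

    Assumption: one instruction per unit per cycle.
    """
    return units * width_bits, units


print(peak_config(2, 256))  # (512, 2)
print(peak_config(4, 128))  # (512, 4)
```

Both layouts have the same peak ALU throughput; the 2x256b layout only needs half as many instructions per cycle to get there, which matters only if the frontend cannot supply four.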
I'd say that the theoretical ability to gang units together would be appealing.
If you have four 128-bit packed SIMD, you must execute 4 different instructions at once or the others go to waste. With SVE, you could (in theory) use all 4 as a single, very wide vector for common operations if there weren't a lot of instructions competing for execution ports. You could even dynamically allocate them based on expected vector size or amount of vector instructions coming down the pipeline.
Additionally, adding two 2048-bit vectors using NEON (128-bit packed SIMD) would require 16 add instructions while SVE would require just one. That's a massive code size reduction which matters for I-cache and the frontend throughput.
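The instruction-count arithmetic can be sketched like this. It assumes the hardware really implements the full register width being asked for, which is a big assumption for SVE; the function name is hypothetical.

```python
def adds_needed(total_bits, vector_bits):
    """Number of vector add instructions needed to cover total_bits
    when each instruction handles vector_bits (ceiling division)."""
    return -(-total_bits // vector_bits)


print(adds_needed(2048, 128))   # 16
print(adds_needed(2048, 2048))  # 1
```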
I don't see how this would work out beneficially. Let's say your hardware can join 4x128b units as a virtual 512-bit SVE SIMD unit. This means you have to advertise VL as 512bit for reasons of consistency. Yes, you will save some entries in the reorder buffer if you encounter a single SVE instruction, but if the code contains independent SVE streams, you will be stalled. Moreso, not all operations will utilize all 512 register bits, so your occupancy might suffer. The only scenario I see this feature working out is if you are decode or reorder buffer limited. Neither is a problem for modern high-performance ARM cores. With x86, it might be a different story. From what I understand, AVX512 instructions can be quite large.
Modern out-of-order cores are already good at superscalar execution, so why not let them do their job? 4x128b units give you much more flexibility and better execution granularity.
On x86 at least, the cost of OoO is astonishing - more pJ per instruction dispatch than the operation itself. Amortizing that over more operations is the whole point of SIMD. I have not yet seen such data for Arm.
That aside, see the "cmp" sibling thread for a major (4x penalty) downside to 4x128.
Yes, OoO is expensive — after all, that is the cost of performance. Very wide SIMD is great for energy efficiency if that is what your compute patterns require (there is a good reason why GPUs are in-order very wide SMT SIMD processors). Is this the best choice for a general-purpose CPU? That I am not so sure about. A CPU needs to be able to run all kinds of code. A single wide SIMD unit is great for some problems, but it won't deliver good performance if you need more flexibility.
Could you point me to the "cmp" thread you mentioned? I don't know where to look for it.
> I agree with you we do not only want "very wide SIMD", and it seems to me that 2x512-bit (Intel) or 4x256 (AMD) are actually a good middle ground.
I'd already classify this as "very wide". And the story is far from being that simple. Intel's 512-bit implementation is very area- and power-hungry, so much so that Intel is dropping the 512-bit SIMD altogether. AMD has 4x add units, but only two are capable of multiplication. So if your code mostly does FP addition, you get good performance. If your workflows are more complex, not so much.
The thing is that on many real-world SIMD workloads, Apple's 4x128bit either matches or outperforms either Intel's or AMD's implementation. And that is on a core that runs at a lower clock and has less L1D bandwidth. Flexibility and symmetric ALU capabilities seem to be king here.
Ah, that is what you meant. Thank you for linking the post! My comment would be that this is not about 128b or 256b SIMD per se but about implementation details.
There is nothing stopping ARM from designing a core with more mask write ports. Apparently, they felt this was not worth the cost. Other vendors might feel differently. I'd say this is similar to AMD shipping only two FMA units instead of four.
For very wide, I'm thinking of Semidynamic's 2048-bit HW, which with LMUL=8 gives 2048 byte vectors, or the NEC vector machines.
AFAIK it has not been publicly disclosed why Intel did not get AVX-512 into their e-cores, and I heard surprise and anger over this decision. AMD's version of them (Zen4c) is proof that it is achievable.
I am personally happy with the performance of AMD Genoa e.g. for Gemma.cpp; f32 multipliers are not a bottleneck.
> The thing is that on many real-world SIMD workloads, Apple's 4x128bit either matches or outperforms either Intel's or AMD's implementation
Perhaps, though on VQSort it was more like 50% the performance. And if so, it's more likely due to the astonishingly anemic memory BW on current x86 servers. Bolting on more cores for ever more imbalanced systems does not sound like progress to me, except for poorly optimized, branch-heavy code.
> Perhaps, though on VQSort it was more like 50% the performance.
I looked at the paper and my interpretation is that the performance delta between M1 (Neon) and the Xeon (AVX2) can be fully explained by the difference in clock (3.7 vs 3.3 GHz) and the difference in L1D bandwidth (48 bytes/cycle vs. 128 bytes/cycle). I don't see any evidence here that narrow SIMD is less efficient.
The AVX-512 is much faster, but that is because it has hardware features (most importantly, compact) that are central to the algorithm. On AVX2 and Neon these are emulated with slower sequences.
Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it both to the left and right sides, as opposed to compressing twice and writing those individually.
Isn't the L1d bandwidth tied to the SIMD width, i.e., it would be unachievable on Skylake if also only using 128-bit vectors there?
> Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it both to the left and right sides, as opposed to compressing twice and writing those individually.
That is interesting! So do I understand you correctly that the 512b vectors allow you to implement the algorithm more efficiently? That would indeed be a nice argument for longer SIMD
> Isn't the L1d bandwidth tied to the SIMD width, i.e., it would be unachievable on Skylake if also only using 128-bit vectors there?
It's a hardware detail. Intel does tie it to SIMD width, but it doesn't have to be the case. For example, Apple has 4x128b units but can only load up to 48 bytes (I am not sure about the granularity of the loads) per cycle.
Right, longer vectors let us write more elements at a time.
I agree that the number of L1 load ports (or issue width) is also a parameter: that times the SIMD width gives us the bandwidth. It will be interesting to see what AMD Zen5 brings to the table here.
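As a rough sketch of that "ports times width" relation: the port counts and per-load widths below are my assumptions, picked so that they reproduce the 128 and 48 bytes/cycle figures quoted earlier in the thread. They are not taken from vendor documentation.

```python
def l1_load_bytes_per_cycle(load_ports, bytes_per_load):
    """Peak L1D load bandwidth per cycle, assuming every load port
    can deliver one full-width load each cycle."""
    return load_ports * bytes_per_load


# Assumed: 2 load ports x 64-byte (512-bit) loads
print(l1_load_bytes_per_cycle(2, 64))  # 128
# Assumed: 3 load ports x 16-byte (128-bit) loads
print(l1_load_bytes_per_cycle(3, 16))  # 48
```

Note that with the same two ports but only 16-byte loads, the first configuration would drop to 32 bytes/cycle, which is the point of the question about using 128-bit vectors on Skylake.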
If you do streaming-type operations on long arrays, yes. If your data sizes are small, however, four smaller units might be more flexible. As a naive example, let's take the popular SIMD acceleration of hash tables. Since the key is likely to be found close to its optimal location, long SIMD will waste compute. With small SIMD however you could do multiple lookups in parallel courtesy of OoO.
This is why I like the ARM/Apple design with "regular SIMD" and "streaming SIMD". The regular SIMD is latency-optimized and offers versatile functionality for more flexible data swizzling, while the streaming SIMD uses long vectors and is optimized for throughput.
You can't do 2048 bits of addition in one SVE instruction; not portably, at least (and definitely not on any existing hardware). While the maximum SVE register size is 2048 bits, the minimum is 128 bits, and the hardware chooses the supported register size, not the programmer. For portable SVE, your code needs to work for all of those widths, not just the smallest or largest. (of related note is RISC-V RVV, which allows you to group up to 8 registers together, allowing a minimum portable operation width of 128×8 = 1024 bits in a single instruction (and up to 65536×8 = 64KB for hypothetical crazy hardware with max VLEN), but SVE/SVE2 don't have any equivalent)
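A scalar model of what "portable SVE" means in practice, as a sketch only. Here `lanes` stands in for the hardware-chosen vector length in elements (4 to 64 for 32-bit elements, given the 128 to 2048 bit range above), and the inner loop stands in for one predicated vector instruction. None of this is real intrinsics code.

```python
def vla_add(a, b, lanes):
    """Add two equal-length lists the way a vector-length-agnostic loop would.

    The code never hard-codes the vector length; it only asks how many
    lanes are available and masks off the tail.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    out = [0] * len(a)
    i = 0
    while i < len(a):
        # Predicate: number of lanes in this iteration that are still in bounds
        active = min(lanes, len(a) - i)
        for lane in range(active):  # models one predicated vector add
            out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes
    return out


print(vla_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], lanes=4))
# [11, 22, 33, 44, 55]
```

The same function gives the same result for any `lanes`; only the number of loop iterations changes, and the programmer does not get to pick it.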
A for() loop does the same thing at the cost of like 3 instructions. 4x128b has the flexibility that you don't need 512b wide operations on the same data to keep the ALUs fed. If you have 512b wide operations being split to 4x128b instructions, great, otherwise the massive OoOE window of modern chips can decode the next few loop iterations to keep the ALUs fed, or even pull instructions from a completely different kernel.
> What is the measurable benefit to implementing 128b SVE2
Probably not much, SVE2 has some nicer instructions, but neon already is quite solid.
> And implementing 256b SVE has different issues depending on how you do it
For in-order and not very aggressively out-of-order cores, having a larger vector length can be very useful to still get a lot of throughput out of your design. It also helps hide memory latency.
For aggressively out-of-order cores it should, for the most part, just be about decode, and somewhat about memory latency hiding.
> 2x256b is only beneficial over 4x128b if you're limited by decode width [...] 3x256b would probably imply 3x128b which would regress existing NEON code.
I agree, that's why I don't get why people are "excited" for Zen5 to have 512b execution units, instead of 256b ones. At best there won't be a performance improvement for avx/avx2 code, at worst a regression.
Anyone interested in getting such numbers could run github.com/google/gemma.cpp on Arm hardware with hwy::DisableTargets(HWY_ALL_NEON) or HWY_ALL_SVE to compare the two :) I'd be curious to see the result.
Calling hwy::DispatchedTarget indicates which target is actually being used.
What is the percentage gain of using masked instructions on any benchmark/task of your choice? It can be negative on weird kernels that do lots of vector cmp since even ARM decided the cost of more than one write port in the predicate register file wasn't worth it, or if the masking adds lots of unnecessary and possibly false dependencies on the destination registers.
> This is only true if we ignore more complex instructions and focus on things like adding two vectors.
ARM implemented a CPU that had 2x256b SVE and 4x128b NEON. Literally the only benchmarks that benefitted from SVE were because they were limited by the 5-wide decode in NEON.
It's great you bring up cmp, helps to understand why 4x128 is not necessarily as good as 1x512. Quicksort, hardly a 'weird kernel', does comparisons followed by compaction. Because comparisons return predicates, and the predicate register file has only a single write port, we can only do 128 bits of comparisons per cycle. Ouch.
However, masking can still help our VQSort [1], for example when writing the rightmost partition right to left without stomping on subsequent elements, or in a sorting network, only updating every second element.
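The predicate-port argument above can be put into numbers with a toy model. It assumes each vector compare writes exactly one predicate register per cycle and that nothing else limits throughput; the function and parameter names are made up for illustration.

```python
def compare_bits_per_cycle(units, width_bits, predicate_write_ports):
    """Bits of vector comparison that can complete per cycle when every
    compare needs a predicate write port to deliver its result."""
    compares_per_cycle = min(units, predicate_write_ports)
    return compares_per_cycle * width_bits


narrow = compare_bits_per_cycle(units=4, width_bits=128, predicate_write_ports=1)
wide = compare_bits_per_cycle(units=1, width_bits=512, predicate_write_ports=1)
print(narrow, wide, wide // narrow)  # 128 512 4
```

Under these assumptions three of the four 128-bit units sit idle during compares, which is where the 4x penalty comes from.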
I think it's somewhat unfair to ask for real world examples when there really aren't many people writing optimized SVE code right now. Probably because there are hardly any devices with the extension.
I think the transition from AVX2 to AVX512 is comparable in that it provided not only larger vectors, but also a much nicer ISA. There were certainly a few projects that benefited significantly from that move. simdjson is probably the most famous example [0].
>I think it's somewhat unfair to ask for real world examples when there really aren't many people writing optimized SVE code right now. Probably because there are hardly any devices with the extension.
Ironically, on the RISC-V side, RVV 1.0 hardware is readily available and cheap. BananaPI BPI-F3 (spacemiT K1) is RVA22+RVV, as well as some C908-based MCUs.
CPUs with SVE have been generally available for two years now. SME and AVX-512 got benchmarks written showing them off before the CPUs were even available. Seems fair to me.
simdjson specifically benefitted from Intel's hardware decision to implement a 512b permute from 2x 512b registers with a throughput of 1/cycle. That's area-expensive, which is (probably) why ARM has historically skimped on tbl performance, only changing as of the Cortex-X4.
Anyway simdjson is an argument for 256b/512b vector permute, not 128b SVE.
Having written a lot of NEON and investigated SVE... I disagree that SVE is a nicer ISA. The set of what's 2-operand destructive, what instructions have maskable forms vs. needing movprfx that's only fused on A64FX, and dealing with the intrinsics issues that come from sizeless types are all unneeded headaches. Plus I prefer NEON's variable shift to SVE's variable shifts.
Fair point about movprfx, I understand they were short on encoding space. This can be mitigated by using *_x versions of intrinsics where masks are not used.
The sizeless headache is anyway there if you want to support RISC-V V, which we do.
One other data point in favor of SVE: its backend in Highway is only 6KLOC vs NEON's 10K, with a similar ratio of #if (indicating less fragmentation, more orthogonal).
It’s been a while since I looked, but I remember SVE2 being much more usable than SVE. A64FX was SVE IIRC. I think SVE did not do a great job of fully replacing NEON.
AVX512 is all around a nice addition as JIT-based runtimes like .NET (8+) can use it for most common operations: text search, zeroing, copying, floating point conversion, more efficient forms of V256 idioms with AVX512VL (select-like patterns replaced with vpternlog).
SVE2 is an extension on top of SVE which some stuff already implements. The issue is more likely to be the politics of moving to ARMv9 than anything else.
As to SVE though, I'd guess variable execution time makes the implementation require a bit of work. Normally, multi-cycle tasks have a fixed number. Your scheduler knows that MUL takes N cycles and plans accordingly.
SVE seems like it should require anywhere from N to M cycles depending on what is passed. That must be determined and scheduled around. This would affect the OoO parts of the core all the way from ordering through to the end of the pipeline.
That's definitely bordering on new uarch territory and if that is the case, it would take 4-5 years from start to finish to implement. This would explain why all the ARMv8 guys never got around to it. ARMv9 makes it mandatory, but that was released in 2021 or so which means non-ARM implementors probably have a ways to go.
SVE doesn't need variable-execution-time instructions, outside of perhaps masked load/store, but those are already non-constant. Everything else is just traditional instructions (given that, from the perspective of the hardware, it has a fixed vector size), with a blend.