Towards fearless SIMD (raphlinus.github.io)



> Code written for too high a SIMD capability will generally crash (it’s undefined behavior), while code written for too low a SIMD capability will fall short of the performance available.

I can't help but think that SIMD intrinsics are the wrong level of abstraction. They're literally assembly moved into a high level language, but without the compiler being able to optimize them well. Hand written intrinsics can really be fine tuned for a given microarchitecture, but I think XLA's approach of having a higher level of abstraction, then JIT compiling to the device in question (whether it be SIMD of X width, a GPU, or a TPU), is the right way to go.

Whenever I see hand coded assembly, I can't help but think to myself:

* Did the author actually have instruction scheduling information on hand when writing this?

* For what generation of chips was this found to be optimal, and is it still? Invariably, you wind up with hand coded assembly routines that were written a long time ago still in use, without anyone revisiting the code or its performance.


The compiler can optimize intrinsics, though. Why shouldn't it be able to?

I agree somewhat that with AVX2, and especially with AVX-512, there's probably less reason to write intrinsics if you have a vector language. But for now SSE2, SSSE3, and SSE4 are still the bread and butter for SIMD on x86, and there are some important instructions (Fabien Giesen goes into detail at [1]) that you really have to think about how to use effectively. For example (this is mentioned at the end), all horizontal adds on x86 are awful except for PSADBW, which is really limited, as Intel designed it in a fit of myopia to target only motion estimation in contemporary codecs, and it requires you to basically design your whole algorithm around it.

[1]: https://fgiesen.wordpress.com/2016/04/03/sse-mind-the-gap/
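
For the curious, the PSADBW horizontal byte sum looks roughly like this with Rust's std::arch names (a hedged sketch, helper name mine; SSE2 is baseline on x86_64): sum absolute differences against zero, then fold the two 64-bit halves together.

    use std::arch::x86_64::*;

    unsafe fn sum_bytes(v: __m128i) -> u32 {
        // PSADBW against zero leaves the byte sum of each 8-byte half
        // in the low 16 bits of the corresponding 64-bit lane.
        let sad = _mm_sad_epu8(v, _mm_setzero_si128());
        // Fold the high half onto the low half and pull out the total.
        let hi = _mm_unpackhi_epi64(sad, sad);
        _mm_cvtsi128_si32(_mm_add_epi32(sad, hi)) as u32
    }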


I find intrinsics to be a useful level of abstraction because it allows me to control and automate the code generation in C++ without having to know the details of register scheduling, etc. It also means I don't have to hand write intrinsic algorithms for every architecture -- I have template libraries that can take a rough specification of the architecture characteristics and generate approximately appropriate intrinsic code. It is neither hand writing all the architecture-specific intrinsic code nor trusting the compiler to do a good job. I started working this way when I used to target code at a really weird mix of exotic vector/parallel computing hardware, and it worked nicely.

The real problem is that, in my experience, compilers universally do a poor job of auto-vectorization/parallelization unless the code pattern is blindingly obvious. All of the "higher level abstractions" I've seen are no better at finding vectorizable patterns than the compiler does -- which makes sense since they are both just software. It ends up being syntactic sugar. The amount of effort required to trick compilers into doing the right thing, plus the things they won't do at all, is such that the complexity is worse than just writing the intrinsic code directly. Writing a lot of intrinsic code directly is far from optimal, but fortunately those intrinsics are available in languages like C++ that can abstract much of that.

Compilers will not consistently generate competent auto-vectorized code for the foreseeable future. I've seen very little forward progress in the decade I've been exposed to it. There are still some relatively simple non-vectorized code patterns that compilers currently have a difficult time detecting and doing optimized code generation for.


> All of the "higher level abstractions" I've seen are no better at finding vectorizable patterns than the compiler does

I think a lot of that comes from C/C++ language corners.

Have you seen XLA (https://www.tensorflow.org/performance/xla/) or Glow (https://facebook.ai/developers/tools/glow)? These are very high level abstractions called from C++ that can do a great job vectorizing high level operations for the given device.


As far as I know from reading papers, Fortran compilers do a much better job thanks to the language semantics.


For the simple SIMD cases where the computation doesn't have any horizontal dependencies (except for a total reduction step), there's little need for any sort of intrinsic usage. (Although you may need some compiler hints to push it to autovectorize).

The real problem is that SIMD instruction sets tend to have lots of instructions that mix horizontal lanes. The scope and performance of these mixing instructions varies greatly from implementation to implementation, and these kinds of instructions are hard for compilers to pick up automatically. It's the latter case where you need the SIMD intrinsics to be exposed in order to use the hardware effectively.
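
For a concrete (Rust-flavored, hedged) illustration, with names mine: the first loop below is purely vertical and autovectorizes readily, while the second carries a dependency across iterations and generally stays scalar, even though shuffle-based SIMD versions exist.

    // Vertical: each output depends only on the same index of the inputs,
    // so the compiler can turn the body into packed multiplies and adds.
    fn scale_add(out: &mut [f32], a: &[f32], b: &[f32], k: f32) {
        for ((o, &x), &y) in out.iter_mut().zip(a).zip(b) {
            *o = x * k + y;
        }
    }

    // Horizontal: each element depends on the previous one, so autovectorizers
    // typically give up even though a shuffle-based prefix sum is possible.
    fn prefix_sum(data: &mut [f32]) {
        let mut acc = 0.0f32;
        for x in data.iter_mut() {
            acc += *x;
            *x = acc;
        }
    }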


I think so too, that SIMD is too low-level to be utilized effectively. Luckily joe_the_user enlightened me with this little gem on a previous post: https://news.ycombinator.com/item?id=17419917

Or for the lazy:

I don't know about TensorFlow in particular, but there are little-known methods of running "general purpose" parallel programs on GPUs. Specifically, H. Dietz' MOG, "Mimd on GPU". It's a shame the project hasn't gotten more attention imo.

http://aggregate.org/MOG/

See: https://en.wikipedia.org/wiki/Flynn%27s_taxonomy for explanations of terms.


Cool!

I have "evangelized" one person at least.

The thing about MOG is it's a demonstrated technique that Dietz has not yet developed into a fully releasable product. His latest comment says it's six months from release but lacks funding.

I think the problem might be that most "Mimd" programming is things like weather-simulations where access to a MIMD supercomputer is standard and price isn't that much of an issue.

That said, cheap Simd parallel computing jump-started today's deep learning advances and so cheap MIMD might do things no one anticipated.

Still, very niche. But thanks for the mention.

Edit: Note, MOG is specifically a GPU. Dietz did earlier work on other machines in the 90s but this is GPU specific (though a less flexible architecture than what Nvidia has evolved now).


What do you mean by effectively? Too hard for the programmer? Because what most of us find when we try to do SIMD stuff is that anything higher level than intrinsics is too hard to use effectively, where effective means good runtime performance.

I agree there should be something higher level we could use but it doesn't exist yet, to my knowledge.


Ya too hard for the programmer. What it comes down to is that there are a few main categories of multiprocessing from lowest to highest level:

1. SIMD - manually deal with packed elements (like in SSE, MMX etc)

2. DSP - Maybe someone knows a better term for this, but treating each slice of the data array as an independent serial stream (shaders, OpenCL, CUDA)

3. MIMD - freeform vector/matrix operations that get compiled down to the first two categories (MATLAB, GNU Octave, Scilab)

I'm not sure if TensorFlow fits best in 2 or 3 but my gut feeling is that it's closest to 2. The problem with the lower level abstractions is that it's more work to format the data for the problem space.

So with MATLAB, everything is a vector and operations applied to each vector happen in parallel across all elements. Then if you need to, you can drop down to less efficient code and operate on elements of the vector manually with C-like code.

Unfortunately that becomes more difficult in shaders, because you can't just magically access a neighboring element. And getting general computation to work with SIMD is often infeasible because you have to rewrite your code and potentially alter the layout of the data in memory to achieve better performance.

I also agree that currently nothing really exists to give us general-purpose vector math akin to MATLAB within a language like C.
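
To make the data-layout point concrete, here's a hedged Rust sketch (type and function names mine) of the usual array-of-structs vs. struct-of-arrays rewrite; the second shape is what SIMD wants, but switching to it ripples through the rest of the program.

    // Array-of-structs: natural to write, but x/y/z are interleaved in memory,
    // so filling a vector register with four `x` values needs gathers/shuffles.
    struct ParticleAos {
        x: f32,
        y: f32,
        z: f32,
    }

    // Struct-of-arrays: each field is contiguous, so one 128-bit load grabs
    // four `x` values and a plain vertical loop over the field vectorizes.
    struct ParticlesSoa {
        x: Vec<f32>,
        y: Vec<f32>,
        z: Vec<f32>,
    }

    fn advance_x(p: &mut ParticlesSoa, dx: &[f32]) {
        for (x, d) in p.x.iter_mut().zip(dx) {
            *x += d;
        }
    }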


The Eigen C++ library comes somewhat close. But remember, Matlab is terrible at using multiple processors, so most people using Matlab won't see anywhere near the performance people get from C intrinsics. This isn't because Matlab isn't capable, but I found that virtually no Matlab users try to write parallel code.


Half the point of using SIMD intrinsics rather than embedding assembly is that register allocation and instruction scheduling are performed by the compiler. Using intrinsics certainly does not guarantee that exactly the corresponding instructions will be emitted in exactly the given order.


> Using intrinsics certainly does not guarantee that exactly the corresponding instructions will be emitted in exactly the given order.

I hope they will fix the compilers to add such guarantees. Currently, every time compilers try to mess with manually-written intrinsics, they decrease performance.

See this bug about LLVM failing to emit the exact instructions, decreasing the performance: https://bugs.llvm.org/show_bug.cgi?id=26491

See this bug about VC++ failing to emit them in the given order, again decreasing performance: https://developercommunity.visualstudio.com/content/problem/...


As usual, things aren't quite that simple. While it certainly can happen that the compiler can pessimize carefully crafted SIMD code by applying undesirable transformations, simply not optimizing intrinsics would lead to other problems.

If you consider SIMD code that spans more than just one small function, then it is quite useful if the compiler can SROA (e.g. you pass around an array of vectors instead of having 8 arguments to each function, but still want them to end up in registers eventually), LICM (you're calling a function that needs a constant vector in a loop) or otherwise optimize your code.

If you want the compiler to leave alone your intrinsics, link in an assembly file.


> it is quite useful if the compiler can SROA

__attribute__((always_inline)) / __forceinline usually help.

> LICM (you're calling a function that needs a constant vector in a loop)

I can calculate that vector outside of the loop.

> link in an assembly file.

Way more complex: I'd need to write the outer scalar code in assembly too and manually allocate registers; also, C++ templates are sometimes very useful for that kind of code.

In 99% of cases intrinsics are good enough for me, but I would love the compilers to leave alone my intrinsics.
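
For instance, the manual hoisting I mean looks roughly like this (an x86-64-only sketch with std::arch names, function name mine): the constant vector is built once outside the loop, and I'd rather the compiler not rearrange the body.

    use std::arch::x86_64::*;

    unsafe fn add_bias(data: &mut [f32], bias: f32) {
        // Hoisted by hand: build the constant vector once, outside the loop.
        let vbias = _mm_set1_ps(bias);
        let mut chunks = data.chunks_exact_mut(4);
        for chunk in &mut chunks {
            let v = _mm_loadu_ps(chunk.as_ptr());
            _mm_storeu_ps(chunk.as_mut_ptr(), _mm_add_ps(v, vbias));
        }
        // Scalar tail for the last few elements.
        for x in chunks.into_remainder() {
            *x += bias;
        }
    }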


Can’t you just isolate those in a compilation unit with -O0?


Interesting idea, will try next time.

I usually want to optimize the scalar code outside of the manually-vectorized body of the loops. A function call to that external compilation unit will be slower than the inlining I get when everything is in the same unit. However, the call overhead may be small enough; obviously I need to profile.


On GCC-like compilers, you could just use inline assembly. That definitely won’t get rewritten into some other instruction sequence, and it can handle things like register allocation and loads/stores for you. Downsides include that the compiler won’t be able to estimate instruction timings, the ease of screwing up the input/output notation, and that MSVC doesn’t support inline assembly on x64 at all.


As someone coming from CUDA I think you’re on the right path. CUDA unifies multicore and vectorization parallelism. Crucial here is hardware support for branching (including early returns), even if it degrades performance. This allows a CUDA kernel to be programmed in a programmer-friendly way that is often already fairly performant, and the actual mapping to hardware is deferred to kernel invocation, where it’s straightforward to specialize. I think this is a concept that CPU programming could learn from GPUs. Why not adopt the concept of kernels?


OpenCL on CPU is exactly this and does very well. Last real world test I did, an 8 core Haswell and a GTX 970 were similar in performance, for the kernel I was running. Caveats apply of course but it was refreshing to have a single programming model.


I did some evaluation of (Linux based) OpenCL implementations two years ago (but I don't think anything changed significantly since then).

I had a few takeaways:

- OpenCL on GPUs and CPUs have little to do with each other (in terms of performance characteristics) and if you tune well for one of them, the other one will suffer.

- Vectorization of work items doesn't really work well unless your kernel is so simple that normal compiler auto vectorization with a loop would have probably worked just as well if not better.

- Nvidia intentionally makes OpenCL a second class citizen vs. CUDA. I had nearly identical (simple) kernels running on both platforms and only the CUDA one managed to saturate memory throughput.

- The whole ecosystem is mostly more effort than it's worth. Portability between different OpenCL implementations is a gamble, some will even silently compute invalid results (I'm looking at you Intel...). I had kernel hangs with both Nvidia and AMD.


> if you tune well for one of them, the other one will suffer.

This is one of the things that bothers me the most about OpenCL. It attempts to offer this uniform abstraction over a generic compute accelerator, which can be CPU vector extensions, GPUs, or FPGAs, but these things are different enough that you have to develop for a specific type if you want reasonable performance. So you get none of the benefits of an accelerator-specific abstraction while still writing accelerator-specific code.

There is a real cost to a generic abstraction, and distinct languages/platforms would in my mind be better than different "dialects" of the same language/platform that pretend to be compatible but really aren't.

I like that CUDA is very clearly designed to run only on GPUs - it provides a clarity that OpenCL lacks.


The Intel OpenCL driver vectorized work items really well, better than the same code provided to ICC as C. It was still more fragile than CUDA.

The other points are spot on and I would add that debugging code in OpenCL is a bad experience.


I agree that this space needs to be explored more. ISPC is basically this, and it had some great ideas, it just needs to be more broadly available and integrated into build systems.


Would a better example be OpenCL (at least the vision of it)? OpenCL kernels can run on both CPU and GPU, and unlike CUDA, it's not vendor locked. Though I've only really used CUDA, I don't know how comparable OpenCL is these days.


One problem is that although OpenCL can run on both CPU and GPU, OpenCL code that performs well on CPU and OpenCL code that performs well on GPU can be different.


I think what you're looking for is ISPC.[0][1] It distinguishes between 'uniform' (scalar) vs 'varying' (vector) variables[2] and applies SIMD instructions pretty much transparently. It defaults to varying, whereas I think for a general-purpose C-like compiler you'd want scalar by default and explicitly declare eg:

  vector float add2(float base,vector float offs)
    {
    return base + 2*offs;
    }
  double sum(double* ar,size_t ln)
    {
    vector double x = (vector)0.0;
    while(ln >= sizeof vector) /* == # of simd lanes */
      {
      x += *ar; /* maybe this needs some kind of cast? */
      ln -= sizeof vector; /* == 1 on arches with no simd support */
      }
    /* this part's definitely not done properly */
    double buf[sizeof vector]; *buf = x;
    double ret = 0.0; /* slurp up the last few */
    for(size_t i=0; i<ln ;i++) ret += buf[i] + ar[i];
    return ret;
    }
0: https://pharr.org/matt/blog/2018/04/18/ispc-origins.html

1: https://ispc.github.io/

2: https://pharr.org/matt/blog/2018/04/26/ispc-volta-more-on-pe...


Yes, SIMD intrinsic programming is often worthless unless implemented for every target architecture. Let me give you an example case; my target production architecture rarely changes, but I want my program to run on every developer computer where CPUs are heterogeneous.

Now, I think the problem is that it is tying optimizations to instructions rather than CPU architectures. When I ported code from an I7 to an I9 it seemed that instruction latency changed. Still, if you are only supporting limited computational targets any hooks at all help.

But why intrinsic rather than assembly? Because usually vector extensions in GCC work well enough, and when they don't intrinsics are easier for me (a man who rarely needs to write assembly and doesn't have the time to beat the compiler in 99.99% of cases).

And a minor point... I really wish GCC had a pragma which allowed me to specify a function should be optimized to multiple targets and then install the thunks automatically.


GCC has function multiversioning. See: https://gcc.gnu.org/onlinedocs/gcc-4.9.2/gcc/Function-Multiv... and since release 6 quite comfortably: https://lwn.net/Articles/691932/


Thank you so much for sharing this, I did not know. I guess I will have to look for the GCC version which supported x86_64. I might have to upgrade distros for GCC 6 (older versions were i386).

Btw, this was a nice presentation of the idea: https://blog.linuxplumbersconf.org/2016/ocw/system/presentat...


Clang supports vector types for C and they're still nowhere near as fast as using intrinsics. For whatever various reasons, compilers tend to do pretty poorly at producing properly vectorized code. I think part of this is that it's hard to encode all the information you need to vectorize properly (this loop will run some power of 2 number of times, these pointers won't alias, etc)

Even using intrinsics is dicey sometimes - compilers spill registers more often than makes sense. I think the ideal situation would be if you could somehow hint at register scheduling without writing asm yourself, but compilers are still a long way from that.
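
Some of that information can at least be written down directly in Rust (a hedged sketch, function name mine): the `&mut` output provably doesn't alias the shared-borrow inputs, and `chunks_exact` pins the inner trip count to a known constant.

    // `out` is a unique borrow, so it can't alias `a` or `b`; `chunks_exact(8)`
    // tells the optimizer the inner loop runs exactly 8 times per chunk.
    fn add_chunks(out: &mut [f32], a: &[f32], b: &[f32]) {
        for ((o, x), y) in out
            .chunks_exact_mut(8)
            .zip(a.chunks_exact(8))
            .zip(b.chunks_exact(8))
        {
            for i in 0..8 {
                o[i] = x[i] + y[i];
            }
        }
        // Any leftover elements would need a scalar pass (omitted here).
    }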


I wonder how it would look with Fortran code though.


Right, I've run into this a number of times, especially when porting code originally written for 32 bit to 64 bit. Besides compilers getting better, I think there's another factor - it used to be that a compelling reason to write asm is to get good utilization of a limited number of registers. But especially on aarch64 (and, in the not too distant future, AVX-512), there are a lot more.

Higher level abstractions are a good idea, but the problem is, are you able to exploit the capabilities that the chip exposes? For simple things like computing a scalar function elementwise over a vector, no problem, but a lot of the more interesting problems don't fit such simple templates.


Another Rust project with similar goals, different approach:

https://github.com/jackmott/simdeez

full disclosure, that's mine.


Your library is actually mentioned in the linked article

> Another crate, with considerable overlap in goals, is simdeez. This crate is designed to facilitate runtime detection, and writing the actual logic without duplication, but leaves the actual writing of architecture specific shims to the user, and still requires nontrivial unsafe code.


Cool, he updated it; it wasn't in there originally. We have chatted a bit and exchanged ideas.


Nice! This is much simpler than faster. Thanks for dropping the link here.

I really need to revive my multithreaded, SIMD-ified Rust PNG decoder someday with something like this…


Weren't you working on a PNG decoder running on the GPU?


Yeah, I was, but branch divergence made it slower than just using SIMD.


I’ve found template metaprogramming to be an efficient, generic way to generate SIMD-accelerated code. (See [0].) By giving the type functions associated with each operation, the same interface can support 128, 256, or 512-bit SIMD, depending on hardware.

It’s C++, so while it’s outside of the Rust ecosystem, it’s still a workable solution I’ve used in a half dozen projects.

[0] https://github.com/dnbaker/vec


I don't know if you've ever seen Kokkos[1], but that's one of the big C++ frameworks for taking parallel loops and using templates to generate code for SIMD and multiple cores. They also support GPU code generation to some degree. The main issue is compile time, which gets pretty bad when you start using it in non-toy applications.

I can't find the reference right now, but there have been some attempts to augment C++ compilers to understand the semantics of the library directly, basically treating the template metaprogramming as a DSL instead of general-purpose. This can speed up the compile times dramatically, but of course you lose generality and the compiler has to be customized to understand every library it wants to optimize. Overall it seems like there is more research to be done on doing this in a safe way without paying through the nose with compile time.

[1]: https://github.com/kokkos/kokkos


That doesn't give you run-time dispatch based on available hardware capabilities. So you either need virtual functions (and take the associated perf hit), or some other solution like the linker tricks mentioned, or a cached JIT, or....


It’s not runtime, it’s compile-time. That means it needs to be separately compiled for each hardware backend.


How do you combine the separately compiled libraries into a single deliverable library/executable?

Because that is what the original article is aiming for.


It doesn’t do the runtime dispatch this person wants, but it eliminates the worry of running into an illegal instruction error.

I think there are use cases for a fat binary, but I’m less certain as to its necessity.


It can: you generate all desired versions of the SIMD-ified function at compile time, then write a function to select the right one at runtime.

That is what the Rust libs are doing, just using traits or macros instead of templates.
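
In Rust that pattern looks roughly like this today (a hedged sketch; the AVX2 body is a placeholder for real 256-bit code, and the names are mine):

    #[cfg(target_arch = "x86_64")]
    #[target_feature(enable = "avx2")]
    unsafe fn sum_avx2(data: &[f32]) -> f32 {
        // Placeholder: a real version would use 256-bit intrinsics here.
        data.iter().sum()
    }

    fn sum_scalar(data: &[f32]) -> f32 {
        data.iter().sum()
    }

    pub fn sum(data: &[f32]) -> f32 {
        #[cfg(target_arch = "x86_64")]
        {
            if is_x86_feature_detected!("avx2") {
                // Safe: we just checked that the running CPU supports AVX2.
                return unsafe { sum_avx2(data) };
            }
        }
        sum_scalar(data)
    }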


The simplest target-independent fearless SIMD is autovectorization[1]. But taking maximum advantage of that probably means writing some code that feels a little unnatural. Also, IIRC bounds checks thwart some autovectorization.

[1] https://llvm.org/docs/Vectorizers.html
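
On the bounds-check point, a hedged illustration (function names mine): slicing everything to a common length up front lets the checks be hoisted out of the loop, whereas the naive indexed version keeps a potential panic branch per element.

    // Naive: each `a[i]`, `b[i]`, `out[i]` may panic, which can block vectorization.
    fn mul_naive(out: &mut [f32], a: &[f32], b: &[f32]) {
        for i in 0..out.len() {
            out[i] = a[i] * b[i];
        }
    }

    // Sliced: after the one-time slicing, the optimizer knows all three slices
    // have length `n`, so the loop body is a branch-free elementwise multiply.
    fn mul_sliced(out: &mut [f32], a: &[f32], b: &[f32]) {
        let n = out.len();
        let (a, b) = (&a[..n], &b[..n]);
        for i in 0..n {
            out[i] = a[i] * b[i];
        }
    }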


I think autovectorization is differently-fearful rather than fearless; you live in perpetual dread of the wind changing and suddenly your code doesn't autovectorize any more.


I've spent quite a bit of time looking at autovectorization, but didn't write about it much here, as it's only good for a pretty small subset of problems. One subtle gotcha I ran into is that `round` doesn't autovectorize, but `floor` does, even though today's chips have perfectly good vector round instructions. See rust issue #55107 for a deep dive into that problem.


Autovectorization is quite a long way away from being as or more performant than using intrinsics or asm.

This makes sense when you think about it. The compiler would either need to be extremely capable of generalizing about some kinds of arithmetic, or it would need millions of special cases to recognize. By writing your own vectorization, you are basically covering your singular special case on your own.


In an LLVM context it also means you don't get runtime feature detection. You would need to build N dlls and then write code to load the proper one at runtime based on feature detection.

JITs could solve that problem, but few JITs currently do very much auto vectorization, because they don't have time.


> In an LLVM context it also means you don't get runtime feature detection. You would need to build N dlls and then write code to load the proper one at runtime based on feature detection.

Does LLVM not support GCC-style function multiversioning?


LLVM does (quite recently!), but Rust currently does not. I think it's likely that Rust should add it, but a large part of my post was exploring how far we could go with Rust as of today.

Edit adding citation: http://lists.llvm.org/pipermail/llvm-announce/2018-September...


Does GCC use that when auto-vectorizing? It's been a while, but in the past when I built binaries via gcc with autogenerated SSE/AVX I don't think they fell back to the C code for CPUs that lacked those SIMD instructions. They just crashed.


You have to tell gcc it should also build a generic version. Then, it should work.


Intel has made quite a few contributions to Hotspot, including AVX support.

ART started supporting SIMD on Oreo, and it was further expanded on Pie. Naturally very few devices have gotten those improvements, thanks to the state of Android's updates.

So in Android's case, RenderScript is still the best way to get a JIT-like approach for SIMD.


As far as I know the JVM will only auto-vectorize integers. Is there a way to tag a function such that you want floats vectorized too now?

ART is a good point, compile on install lets you do this.


You can get the information here. It also refers to floating point calculations.

http://cr.openjdk.java.net/~vlivanov/talks/2017_Vectorizatio...

https://software.intel.com/sites/default/files/managed/19/ae...

Also the Vector API development is ongoing and there is a talk at this week's Oracle ONE about the current state.

ART no longer compiles on install since Android 7; that behaviour is specific to Android 5 and 6.

Since 7 it is a multi-stage runtime with an interpreter hand-written in assembly, JIT + PGO, and AOT + PGO on an idle device. And as of 9, PGO data gets uploaded to the store and shared across devices on installation.


> Note that this is a performance of approximately 470 picoseconds per sample. Modern computers are fast when running optimized code.

Or about half the time it takes a photon from your phone screen to hit your eyeball


We’ve been working on this problem for over a decade and have some support for Rust. Let’s work together if that makes sense to you! I’m John at ArrayFire.


In Rust this looks rather painful. Might I suggest trying this in Julia? It might be a good comparison of performance, ease of use, and readability. Julia does a very nice job of compiling directly to SIMD instructions and lets you inspect the low-level code generated.

    @inline function sin9_shaper(x)
        c0 = 6.28308759
        c1 = -41.33318707
        c2 = 81.39900205
        c3 = -74.66884436
        c4 = 33.15324345
        a = abs(x - round(x)) - 0.25
        a2 = a * a
        ((((a2 * c4 + c3) * a2 + c2) * a2 + c1) * a2 + c0) * a
    end

    function gen_sinwave(freq, init=0.0, step=0.1)
        wave = [sin9_shaper(x) for x = init:step:freq]
    end

    julia> @code_native gen_sinwave(1113.0);
        .section    __TEXT,__text,regular,pure_instructions
    ; Function gen_sinwave {
    ; Location: REPL[60]:2
        pushl   %ebx
        decl    %eax
        subl    $48, %esp
        vmovaps %xmm0, %xmm2
    ; Function gen_sinwave; {
    ; Location: REPL[60]:2
        decl    %eax
        movl    $769501344, %eax    ## imm = 0x2DDDA8A0
        addl    %eax, (%eax)
        addb    %al, (%eax)
        decl    %eax
        movl    $773805120, %ecx    ## imm = 0x2E1F5440
        addl    %eax, (%eax)
        addb    %al, (%eax)
        vmovsd  (%ecx), %xmm1       ## xmm1 = mem[0],zero
        decl    %eax
        movl    %esp, %ebx
        vxorps  %xmm0, %xmm0, %xmm0
        decl    %eax
        movl    %ebx, %edi
        calll   %eax
        decl    %eax
        movl    $769544608, %eax    ## imm = 0x2DDE51A0
        addl    %eax, (%eax)
        addb    %al, (%eax)
        decl    %eax
        movl    %ebx, %edi
        calll   %eax
    ;}
        decl    %eax
        addl    $48, %esp
        popl    %ebx
        retl
        nopw    %cs:(%eax,%eax)
    ;}


Your formatting didn't work, but no, it doesn't "look painful."

What you wrote does not guarantee vectorization, it just relies on autovectorization.

Rust already does autovectorization magically behind the scenes thanks to LLVM (which Julia also uses), but explicit SIMD makes it a guarantee.


Here is how to do explicit SIMD in julia: https://github.com/eschnett/SIMD.jl


Your code only shows that

    gen_sinwave(1113.0)
calls

    gen_sinwave(1113.0, 0.0, 0.1)
(which are the default arguments). It is usually better to use @code_llvm because the LLVM IR is typically easier to read.

    julia> @code_llvm gen_sinwave(1113.0);
    
    ; Function gen_sinwave
    ; Location: REPL[2]:2
    define nonnull %jl_value_t addrspace(10)* 
    @julia_gen_sinwave_345638059(double) {
    top:
      %1 = call nonnull %jl_value_t addrspace(10)* @julia_gen_sinwave_345638060(double %0, double 0.000000e+00, double 1.000000e-01)
      ret %jl_value_t addrspace(10)* %1
    }
However, defining

    function gen_sinwave(freq, init=0.0, step=0.1)
        data = collect(init:step:freq)
        sin9_shaper.(data)
    end
and looking at

    @code_llvm gen_sinwave(1113.0, 0.0, 1.0)
we can see that there is a lot of auto-vectorization going on


From the asm you posted I have no idea whether it's optimized or not; it looks like two virtual function calls (calll %eax). Also, while I'm sure Julia is nice, the GC makes me worry about whether it'll work well in real-time audio synthesis (my primary use case).


Why is SIMD considered unsafe? I thought that safe code was permitted to cause panics, and the worst thing that will happen if unsupported SIMD is used is a panic.


No, it's not about panicking. It's undefined behavior to run code compiled with CPU features that aren't supported by the current CPU. See: https://github.com/rust-lang/rfcs/blob/master/text/2045-targ...

There are some other ideas for making it easier to reason about safety at this level: https://github.com/rust-lang/rfcs/pull/2212

Can you point to where you heard about unsupported SIMD causing a panic? I'd like to fix that because it's really wrong!


I mean that it really shouldn’t be UB to call a function compiled for an unsupported target feature. Following that link, I see two arguments that it’s UB:

1. A multibyte NOP might be used. Supposedly there might be a multibyte NOP that older CPUs will decode as a jump. I am not sure I believe this. Is there an example?

2. int3 might happen, causing SIGTRAP. I see no explanation of how this would occur.

So I think that, if LLVM really has UB if the wrong target is used, it should be fixed. Arguably the old Knights Landing instructions are an exception, but those are basically dead. Maybe non-x86 targets are different.

Also, I have a suggestion for a potentially much nicer way to deal with safety: use the type system instead of magic annotations. Have a function like GetAVX2() -> Option<AVX2>. Teach Rust that code that statically has a live AVX2 object (as a parameter, say) can use AVX2. Other than the code generation, this could be done in stable rust right now.
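
Sketched in today's Rust (hypothetical names, x86-64 only; this only covers the detection half, and codegen would still need the per-function attribute):

    use std::arch::x86_64::*;

    /// Zero-sized proof that AVX2 was detected at runtime (hypothetical type).
    #[derive(Clone, Copy)]
    pub struct Avx2(());

    pub fn get_avx2() -> Option<Avx2> {
        if is_x86_feature_detected!("avx2") {
            Some(Avx2(()))
        } else {
            None
        }
    }

    // Taking an `Avx2` token means detection already happened, so this can't
    // be reached on an unsupported CPU. The attribute is still needed so the
    // compiler is allowed to emit AVX2 encodings inside the function.
    #[target_feature(enable = "avx2")]
    pub unsafe fn add8(_proof: Avx2, a: __m256, b: __m256) -> __m256 {
        _mm256_add_ps(a, b)
    }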


_This_ is why I linked to my undefined behavior post. Your comment is about as clear an example of being in the semi-portable camp as any I've seen. And I'm not blaming you, because in C you _can't_ do SIMD in the standard camp, so it's actually one of the more compelling reasons to remain in semi-portable. Rust is different though.

Also: the linked crate does use the type system in pretty much this way so that client code can be safe. However, there are limitations; it's not just whether a particular instruction can be used, which remains immutable once it's detected, but also which _registers_ (and, by extension, calling convention) can be used. That varies from function to function, and requires the `#[target_feature(enable)]` annotation to control, so just having an `Avx` type in hand is not quite enough to ensure that you're in a context where using the ymm registers is ok, and the intrinsic will be inlined to a "V" variant asm instruction. This is discussed in some detail in the "caveats" section.


I'm not experienced enough in compilers at this level to make a prescriptive argument here. My comment was just intended to be descriptive. I think it would be well worth the effort to dig into the LLVM side of things here to root out the specific reasons for UB. Intuitively though, it makes sense to me that it would be UB. I'd be surprised if it weren't. It seems like it should be reasonable for compilers to assume that the execution of an instruction implies that instruction is supported on the current CPU.

Your type system idea works great for simple cases. It was the very first thing I did in my own SIMD code.


There is the case of LZCNT and BSR. On processors that do not support the former, LZCNT is interpreted as (REP) BSR, but the instructions have different behavior that could result in silent failures.


What is SIMD?


Single Instruction Multiple Data

https://en.wikipedia.org/wiki/SIMD

ELI5: Your processor can process more data in parallel.

If you have, for example, a loop with something like this in the body:

c[i] = a[i] + b[i]

Using SIMD, the processor can do all of these operations at the same time:

c[i] = a[i] + b[i]

c[i+1] = a[i+1] + b[i+1]

c[i+2] = a[i+2] + b[i+2]

c[i+3] = a[i+3] + b[i+3]

Normally, in one iteration you would only get one of these additions done; thanks to SIMD, you can have multiple.
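
In code, a hedged sketch of those four additions happening as one SSE instruction (four f32 lanes per 128-bit register; x86-64 only, function name mine):

    use std::arch::x86_64::*;

    unsafe fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
        // One ADDPS adds all four lane pairs at the same time.
        let vc = _mm_add_ps(_mm_loadu_ps(a.as_ptr()), _mm_loadu_ps(b.as_ptr()));
        let mut c = [0.0f32; 4];
        _mm_storeu_ps(c.as_mut_ptr(), vc);
        c
    }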


Thanks! I get that if you don't already know the acronym, you're not the audience, but it is helpful to not have to search externally for a reference. Could you imagine if that was in a general programming magazine before global internet search engines were popular? You'd have to cross your fingers that your local encyclopedia has an entry!

ETA: My bad - I did not think to click that. Embarrassing. Thanks for pointing it out!


In fairness the first word of the article is SIMD with a hyperlink to the Wikipedia page.


> Could you imagine if that was in a general programming magazine before global internet search engines were popular?

Style manuals of the time encouraged expanding acronyms on first use, but I can imagine this wasn't consistently done in a technical field!


What is "ETA"? /s :-)


> The next-level challenge is compiling multiple alternates into the same binary, and selecting the best at runtime.

What are the use cases for having multiple alternates in the same binary? Why not decide at compile time for the given architecture?

If portability is the concern here, wouldn't it lead to sub-optimal code anyway?


If you're shipping binaries, you don't know the exact architecture in advance (because there are many extensions to x86 and you don't know if the end user is running a new enough processor to use all of them). If you don't use them you are likely leaving performance on the table. So you want to select the fastest option supported by the processor you happen to be running on. You can do this with fairly minimal performance impact by linking in different versions of the function at runtime, but this requires some support from your compiler and runtime environment.


I think the question here is which one is better:

1. A portable binary where only individual SIMD operations are optimized for all targets.

2. Building an optimized binary for every target architecture when needed (either by the user or by the binary distributor).

The concern with (1) is that, as the number of dynamically called functions (or if-else dispatch nests) increases, the quality of the generated code degrades for every architecture. Basically, the compiler will be left with opaque, unrecognizable functions, which restricts even the target-independent optimizations (like GVN, CSE, constant propagation, etc.).

Let's say the user writes a SIMD program full of dynamically called functions (which are opaque to the compiler); doesn't that impact performance heavily?

Isn't compiler support for optimizing the SIMD operations necessary, rather than writing wrapper libraries? For example, lowering the SIMD operation calls to existing vectorized math libraries that the compilers recognize (example: sin(), cos(), pow() in libm).


Nice write up!



