Hacker News
Authoring a SIMD enhanced WASM library with Rust (nickb.dev)
103 points by lukastyrychtr 31 days ago | 8 comments



> Since no alternative has presented itself, every time I revisit wasm-pack I’m pained by new potholes that have risen due to neglect

I've had open PRs for a year now: https://github.com/rustwasm/wasm-pack/pull/937 https://github.com/rustwasm/wasm-pack/pull/1089

Very straightforward changes. I emailed Ashley about helping with wasm-pack (in response to https://github.com/rustwasm/wasm-pack/issues/928), but got no response.

Updates to wasm-bindgen shouldn't require a change to wasm-pack; wasm-pack should expose a way to pass arbitrary flags down.

While having a webpack plugin is nice, I have since given up on wasm-pack and added a build-wasm script to my package.json: https://github.com/serprex/openEtG/commit/9997fb098d168920bb...

This way, if someone wants to contribute to openetg, they don't need to install my wasm-pack fork. Ideally, wasm-pack-plugin would skip wasm-pack and use wasm-bindgen directly.

I do hope wasm-bindgen can be adequately resourced. It's a pleasure to write wasm modules in Rust.


> The [NEON] Wasm SIMD implementation is 65% faster than native! But what is perhaps more interesting is that the Wasm scalar implementation is only half as fast as the Wasm SIMD version instead of the 3x seen on x86. Perhaps v8 doesn’t have enough optimizations on the Wasm SIMD to Neon front.

That's almost exactly what I'd expect from an optimal compiler.

Graviton2 has 3 scalar integer ALUs and 2 128-bit vector units. Scalar code can do 3 int ops per cycle; 4-lane vector code can do 2 × 4 = 8, a ratio of 8/3 ≈ 2.7x. Intel processors typically have 4 scalar ALUs and 3 vector units: 12/4 = 3x.

Zen has 4 units for 128-bit vectors, though until Zen 3 not all of those units could do all operations, so the speedup in AMD land would be 2x-8x depending on the application (although code doing brief stretches of 128-bit vector work would be limited by Zen having only one vector write port).
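For what it's worth, that back-of-envelope arithmetic can be written out as a tiny sketch (the unit counts are the figures assumed above, not measurements, and real code rarely saturates every port):

```rust
// Rough peak integer throughput ratio: execution units x lanes per unit,
// divided by scalar ALU count. Unit counts are assumptions from the
// comment above, not measured values.
fn simd_speedup(scalar_alus: u32, vector_units: u32, lanes: u32) -> f64 {
    (vector_units * lanes) as f64 / scalar_alus as f64
}

fn main() {
    println!("Graviton2: {:.2}x", simd_speedup(3, 2, 4)); // 8/3
    println!("Intel:     {:.2}x", simd_speedup(4, 3, 4)); // 12/4
}
```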


If you don't mind working with a Rust nightly toolchain, there is a library that tries to provide a portable layer above all the architecture-specific SIMD APIs:

https://github.com/rust-lang/portable-simd


> A good example of this is _mm_mul_epu32. The below is the Wasm equivalent

Pretty sure the "equivalent" is at least 10 times slower.

AMD64 CPUs don't have SIMD instructions for multiplying 64-bit integers (a vector 64-bit multiply only arrived with AVX-512DQ's vpmullq). The wasm32::u64x2_mul function must therefore be emulated somehow, and the emulation is going to take many instructions and cycles.
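As a scalar model (the helper name is mine, just for illustration): per 64-bit lane, _mm_mul_epu32 multiplies the low 32 bits of each operand as unsigned into a full 64-bit product, which is the semantics the wasm mask-then-u64x2_mul sequence has to reproduce:

```rust
// Scalar model of one 64-bit lane of _mm_mul_epu32 (and of the wasm
// mask + u64x2_mul idiom): keep only the low 32 bits of each lane,
// then multiply as unsigned. (2^32 - 1)^2 fits in a u64, so this
// cannot overflow.
fn mul_epu32_lane(a: u64, b: u64) -> u64 {
    (a as u32 as u64) * (b as u32 as u64)
}

fn main() {
    // The high halves are ignored, exactly as on x86.
    assert_eq!(mul_epu32_lane(0xdead_beef_0000_0002, 3), 6);
    assert_eq!(mul_epu32_lane(u64::MAX, u64::MAX), 0xffff_fffe_0000_0001);
}
```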


Is it emulated, or does the engine have a peephole optimization for the mask/multiply idiom? If it's hitting the emulation path in this case, that can easily be fixed.


> that can easily be fixed

Theoretically, yes. Practically, I think that’s a “sufficiently smart compiler” class of problem, insanely hard to solve. Especially given that WASM is JIT-compiled, the compiler simply doesn’t have time for expensive optimizations.

Integer SIMD is weird on AMD64. Even state-of-the-art C++ compilers fail to emit optimal code for rather simple use cases. A trivial example is computing the sum of bytes: I have yet to see a compiler that optimizes that code into _mm[256]_sad_epu8 / _mm[256]_add_epi64 instructions.
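To make the example concrete, here is a sketch with names I made up; the intrinsic path is hand-written rather than anything a compiler emits today. _mm_sad_epu8 against zero yields two 64-bit lanes, each holding the sum of 8 bytes:

```rust
#[cfg(target_arch = "x86_64")]
fn sum_bytes(bytes: &[u8; 16]) -> u64 {
    // psadbw against zero: each absolute difference is just the byte
    // itself, so the result holds two 64-bit lanes, each the sum of
    // 8 bytes. SSE2 is baseline on x86-64, so no feature check needed.
    unsafe {
        use std::arch::x86_64::*;
        let v = _mm_loadu_si128(bytes.as_ptr() as *const __m128i);
        let sums = _mm_sad_epu8(v, _mm_setzero_si128());
        let lo = _mm_cvtsi128_si64(sums) as u64;
        let hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(sums, sums)) as u64;
        lo + hi
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn sum_bytes(bytes: &[u8; 16]) -> u64 {
    // The plain loop compilers are given -- and, per the comment above,
    // rarely turn into psadbw on their own.
    bytes.iter().map(|&b| b as u64).sum()
}

fn main() {
    assert_eq!(sum_bytes(&[1u8; 16]), 16);
}
```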


> Theoretically, yes. Practically, I think that’s a “sufficiently smart compiler” class of problem, insanely hard to solve. Especially given that WASM is JIT-compiled, the compiler simply doesn’t have time for expensive optimizations.

Detecting every way of doing a 32-bit multiply with a 64-bit mul operator is impossible, yes. But there only needs to be one way of doing it that the compiler knows about, and then people can use that idiom.

It's not pretty, but it works. Compare the common scalar integer rotate: x86 can do it in one instruction, but C doesn't have an operator for it. The way to do it in C is to use an idiom that optimizers are known to recognize[1].

1: https://blog.regehr.org/archives/1063
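The same idiom in Rust terms (the linked post uses C; rotl32_idiom is a made-up name), checked against the operation Rust exposes directly, much like C++20's std::rotl:

```rust
// The rotate idiom optimizers are taught to recognize: on x86 this
// whole expression compiles down to a single `rol` instruction. The
// `& 31` masks keep both shifts in range even when n == 0.
fn rotl32_idiom(x: u32, n: u32) -> u32 {
    (x << (n & 31)) | (x >> ((32 - n) & 31))
}

fn main() {
    for n in 0..32 {
        // Agrees with the built-in rotate for every shift count.
        assert_eq!(rotl32_idiom(0xdead_beef, n), 0xdead_beefu32.rotate_left(n));
    }
}
```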


> there only needs to be one way of doing it that the compiler knows about

Too much magic for my taste. If the compiler will be doing that anyway, why not expose an intrinsic we can use? The SSE instruction in question is rather efficient to emulate on NEON; it only takes two instructions, vmovn_u64 and vmull_u32.

It’s the same with scalar code. When I need to rotate an integer, I normally use intrinsics instead of relying on the compiler to optimize the code. Recently, the C++ language even added these operations to its standard library, in the <bit> header in C++20.

IMO, relying on such compiler optimization is fragile in the long run, for 2 reasons.

1. These are undocumented implementation details. Compiler developers don’t make any guarantees that they will continue to support these patterns in exactly the same way.

2. Most real-life software is developed by multiple people. It’s too easy for a developer to overlook a comment and slightly change the code in a way that no longer hits the shortcut in the compiler.



