Huh! I was expecting adding u128 integers to be slower because of the cast, but it looks like LLVM is (correctly) realising that the upcast + downcast has no effect and replacing it with a single u64 add in release mode.

It will also happily vectorize and all the rest:

https://rust.godbolt.org/z/hn888ezj4
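
For anyone who doesn't want to click through, here's a minimal sketch of the pattern I mean. The function names are made up for illustration and this isn't necessarily the exact code in the link; the assumption is just "widen two u64s to u128, add, truncate back to u64":

```rust
// Hypothetical names, for illustration only.

/// Widen both operands to u128, add, then truncate back to u64.
/// The u128 sum can never overflow, and the final cast keeps the low 64 bits.
pub fn add_via_u128(a: u64, b: u64) -> u64 {
    (a as u128 + b as u128) as u64
}

/// The plain wrapping u64 add that the version above is equivalent to.
pub fn add_wrapping(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

/// Same pattern over slices: a simple element-wise loop of the kind
/// the auto-vectorizer can work with.
pub fn add_slices_via_u128(xs: &[u64], ys: &[u64], out: &mut [u64]) {
    for ((o, x), y) in out.iter_mut().zip(xs).zip(ys) {
        *o = (*x as u128 + *y as u128) as u64;
    }
}
```

Semantically `add_via_u128` and `add_wrapping` always return the same value, which is why the compiler is free to drop the widening. Whether it actually does so on a given target is something to confirm by looking at the generated assembly.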

I want to do some additional testing to check whether it also optimizes correctly for wasm and in 32-bit contexts, but generally I'm shocked that it works so well. Thanks!