I'm guessing the 128-bit multiplication implementation on the ARM architecture isn't as well done as it is on the Intel platform?

You might be able to reclaim the performance if you manually implement the multiplication using 64-bit variables instead.

No, the compiler is generating good code. If you use a smaller word size, you just end up doing more multiplications: with schoolbook multiplication, cutting the word size in half means 4x as many multiplications.
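To make the 4x figure concrete, here is a rough sketch of a full 64x64 -> 128-bit product built only from 32-bit halves. The names `u128_parts` and `mul64_by_halves` are made up for illustration and aren't from any library. This is plain schoolbook multiplication, not necessarily what any particular compiler emits.

```c
#include <stdint.h>

/* Hypothetical result type: a 128-bit value as two 64-bit words. */
typedef struct { uint64_t hi, lo; } u128_parts;

/* Hypothetical helper: 64x64 -> 128-bit product using 32-bit halves. */
static u128_parts mul64_by_halves(uint64_t a, uint64_t b)
{
    /* Split each operand into 32-bit halves, held in 64-bit variables. */
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    /* One full-width multiply becomes four half-width multiplies. */
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;

    /* Middle column: three values below 2^32 each, so no 64-bit overflow. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;

    u128_parts r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
    return r;
}
```

Besides the four multiplies, you also pay for the shifts, masks, and carry handling, which is why dropping to a smaller word size rarely wins anything back.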

