As for loading large constants: if you read the post and follow the link at "reuse my benchmark" (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...) you will see that these functions, as measured, sit inside hot loops. Constant loading is therefore very likely to be hoisted out of those loops on both architectures.
This makes the considerably slower UMULH stick out like a sore thumb. Also note that the measurement loop allows most of the work of each iteration to proceed in parallel: the RNG's output is a long dependency chain within each iteration, but the update of the seed is quick and independent of that chain.
I would guess that the Ampere box has a wretchedly slow multiply. In a comment on the post, Daniel finds an ugly performance corner on A57 (possibly related, possibly not): "On a Cortex A57 processor, to compute the most significant 64 bits of a 64-bit product, you must use the multiply-high instructions (umulh and smulh), but they require six cycles of latency and they prevent the execution of other multi-cycle instructions for an additional three cycles."
It turns out that the Nth wyhash64_x doesn't depend on any of the multiplies in the (N-1)th iteration; it only depends on the addition of a constant to the state.
So, with a sufficiently deep pipeline, the instruction scheduler can effectively be in the middle of several of those wyhash iterations at the same time, hiding nearly all of the hash's latency behind the other iterations.
Such are the perils of micro-benchmarking.