
I think Daniel's use of the word "separate" in "separate and expensive" is ill-advised, as it implies a critique of ARM's ISA design in a way that isn't relevant for this case. One might be concerned if you needed all 128 bits in some other use, but not here.

As for loading large constants, if you read the post and follow the link at "reuse my benchmark" (https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...) you will see that the functions being measured sit inside hot loops. As such, constant loading is very likely hoisted out of these loops on both architectures.

With the constant loads out of the way, the considerably slower UMULH will stick out like a sore thumb. Also note that the measurement loop allows most of the work of each iteration to be done in parallel: the RNG's calculation is a long dependency chain, but the update of the seed is quick and does not depend on that chain.

I would guess that the Ampere box has a wretchedly slow multiply. In a comment on the post, Daniel finds an ugly performance corner on A57 (possibly related, possibly not): "On a Cortex A57 processor, to compute the most significant 64 bits of a 64-bit product, you must use the multiply-high instructions (umulh and smulh), but they require six cycles of latency and they prevent the execution of other multi-cycle instructions for an additional three cycles."
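For anyone wondering where UMULH comes from in the first place, here is a minimal sketch of the "high 64 bits of a 64x64 product" operation in C. The function name `mulhi64` is my own, and `__uint128_t` is a GCC/Clang extension rather than standard C. On AArch64 I would expect compilers to lower this to a single `umulh`, and on x86-64 to a `mul` whose high half lands in RDX, but check the generated assembly for your own compiler rather than taking my word for it.

```c
#include <stdint.h>

/* Hypothetical helper: the upper 64 bits of the full 128-bit product a * b.
   Relies on the __uint128_t extension provided by GCC and Clang. */
static inline uint64_t mulhi64(uint64_t a, uint64_t b) {
    __uint128_t product = (__uint128_t)a * b;  /* full 128-bit product */
    return (uint64_t)(product >> 64);          /* keep only the high half */
}
```

The low half is just the ordinary `a * b` in 64-bit arithmetic, so a function that wants both halves needs two multiply instructions on ARM, and that is the case the quoted A57 latency figures apply to.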




There could be an instruction scheduler impact here as well. Intel processors are known for having an uncommonly deep execution window.

It turns out that the Nth value of wyhash64_x doesn't depend on any of the multiplies in the (N-1)th iteration. It depends only on the addition of the constant to the state.

So, with a sufficiently deep execution window, the instruction scheduler can effectively be in the middle of several of those wyhash iterations at the same time, hiding nearly all of the hash's latency behind the work of the other iterations.
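To make the dependency structure concrete, here is a rough sketch of the generator and a measurement-style loop. The constants are reproduced from memory of the post and should be treated as an assumption; the names `wyhash64_step` and `sum_outputs` are mine, not the benchmark's, and `__uint128_t` needs GCC or Clang. The point is only the shape: the state update is a single add, and both multiplies hang off that add without feeding back into it.

```c
#include <stdint.h>

/* Sketch of a wyhash64-style step. Constants are assumed, not verified
   against the original benchmark source. */
static inline uint64_t wyhash64_step(uint64_t *state) {
    /* The only loop-carried dependency: one cheap 64-bit add. */
    *state += UINT64_C(0x60bee2bee120fc15);

    /* Everything below depends on the new state but never writes it back,
       so iteration N+1 can start before these multiplies finish. */
    __uint128_t t = (__uint128_t)*state * UINT64_C(0xa3b195354a39b70d);
    uint64_t m1 = (uint64_t)(t >> 64) ^ (uint64_t)t;
    t = (__uint128_t)m1 * UINT64_C(0x1b03738712fad5c9);
    return (uint64_t)(t >> 64) ^ (uint64_t)t;
}

/* Hypothetical measurement-style loop: sums n outputs and reports the
   final state so the caller can inspect it. */
static uint64_t sum_outputs(uint64_t seed, int n, uint64_t *final_state) {
    uint64_t state = seed;
    uint64_t sum = 0;
    for (int i = 0; i < n; i++) {
        sum += wyhash64_step(&state);
    }
    *final_state = state;
    return sum;
}
```

After n calls the state is simply seed + n * increment (mod 2^64), whatever the multiplies produced. That is why an out-of-order core with a deep enough window can overlap several iterations.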

Such are the perils of micro-benchmarking.


Indeed. Of course, the idea that this is invalid implies that "real" application code (whatever that is) would be designed to have a sequential dependency on a single wyhash64 result and to sit on its thumbs waiting. Maybe, and maybe not. One can make up any argument one likes.



