
-Os helps there, but it sequences umulh just after mul, making that mul cost 3 cy. ouchy.

EDIT, in reply to the comment below: you are right, I misread.

I don't think that's the case. I read the optimization manual as saying that once a umulh enters the pipe, no other multiply can enter it for 3 more cycles. That would be to ensure that there isn't contention on the functional unit's result bus. Therefore, the preceding mul should be able to enter without being delayed by the umulh that follows it.

If you look at the whole section, every multiply whose result takes more than 3 cycles (call it N) stalls the multiplier pipe for N-3 cycles.

Amendment: Clang's decision to schedule the mul right after the umulh would also appear to be terrible. But in fact, I think that if the umulh enters on cycle 0, then the mul enters on cycle 3, the umulh's result appears on cycle 5, and the mul's result appears on cycle 6. So it has the same total latency through the pipe as mul followed by umulh: 7 cycles.
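To sanity-check the cycle counting above, here is a toy model of a single multiply pipe in Python. It is only a sketch: the names `RESULT_OFFSET`, `ISSUE_BLOCK`, `schedule` and `total_cycles` are made up for illustration, and the numbers are read straight off this comment (umulh result 5 cycles after entry and holding the pipe entrance for 3 cycles, mul result 3 cycles after entry), not taken from the optimization manual. It also assumes the two multiplies are independent and that a mul does not hold up the multiply behind it.

```python
# Toy model of one multiply pipe. All numbers are assumptions taken from
# the comment above, not from any vendor documentation.
RESULT_OFFSET = {"mul": 3, "umulh": 5}  # result appears on cycle entry + offset
ISSUE_BLOCK = {"mul": 1, "umulh": 3}    # cycles before the next multiply may enter


def schedule(ops):
    """Return (op, entry_cycle, result_cycle) for independent multiplies issued in order."""
    next_free = 0
    timeline = []
    for op in ops:
        enter = next_free
        timeline.append((op, enter, enter + RESULT_OFFSET[op]))
        next_free = enter + ISSUE_BLOCK[op]
    return timeline


def total_cycles(ops):
    """Cycles from cycle 0 through the last result cycle, inclusive."""
    return max(done for _, _, done in schedule(ops)) + 1


if __name__ == "__main__":
    for order in (["umulh", "mul"], ["mul", "umulh"]):
        print(order, schedule(order), total_cycles(order))
```

Under those assumptions both orderings come out the same: the last result lands on cycle 6 either way, which is 7 cycles counting cycle 0.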
