> Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers ~= to their execution time
I don't think that's accurate. That latency exists because the execution unit is pipelined. If it were not pipelined, there would be no latency. The latency corresponds to the fact that "doing division" is distributed across multiple clock cycles.
Division is complicated by the fact that it is a complex microcoded operation made up of many component micro-operations. Many or all of those micro-operations may in fact be pipelined (e.g., 3/1 latency/inverse throughput), but the overall effect of executing a large number of them looks not very pipelined at all (e.g., 20 of them on a single EU would have 22/20 lat/itput, basically not pipelined when examined at that level).
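To make the 22/20 figure concrete, here's a rough sketch of the arithmetic. It assumes the micro-ops are independent of each other and all issue back-to-back on one execution unit; the function and parameter names are made up for illustration, and real microcoded division will not behave this neatly.

```python
def microcoded_op_timing(n_uops, uop_latency=3, uop_inv_throughput=1):
    """Estimate (latency, inverse throughput) of an instruction made of
    n_uops pipelined micro-ops that all run on a single execution unit.

    Assumption: the micro-ops are independent, so a new one can issue
    every uop_inv_throughput cycles.
    """
    # Cycle at which the last micro-op issues (the first issues at cycle 0).
    last_issue = (n_uops - 1) * uop_inv_throughput
    # The instruction's result is ready once the last micro-op completes.
    latency = last_issue + uop_latency
    # The unit's issue slots are occupied for this long, so the next
    # instruction of the same kind cannot start any sooner.
    inv_throughput = n_uops * uop_inv_throughput
    return latency, inv_throughput


print(microcoded_op_timing(1))   # a single micro-op: (3, 1)
print(microcoded_op_timing(20))  # twenty of them: (22, 20)
```

Each micro-op is pipelined at 3/1, but the instruction as a whole can only start about once per 20 cycles while taking 22 to finish, which is why it looks unpipelined from the outside.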
Sorry, correcting myself here: it's cut across multiple cycles but not pipelined. Maybe I confused this with multiplication?
If it were pipelined, you'd expect to be able to schedule DIV every cycle, but I don't think that's the case. Plus, 99% of the time the pipeline would just be doing nothing because normal programs aren't doing 18 DIV instructions in a row :^)