I could just be wrong, but I thought that on Intel, ports 2 and 3 handle the address calculation separately from the fetch, and thus technically do two operations per cycle. Or at least, this is the impression I got from the fact that IACA splits the number of operations for each into two separate columns, and from some diagrams that split those ports into LD and STA.
I think this may be related to the simple vs complex address calculations, although I'm fuzzy on the details. When I said "I wasn't sure" I was if anything understating my uncertainty. Searching now, it looks like Peter and Percy talk about it here: https://stackoverflow.com/questions/47701898/store-forwardin...
Oh, and some niggling typos: determing, differnet, limtis, reoganization, eliminat, soure, Essnentially, doens’t, intersting, itration, particpate, somethign, implicity, transforamtion, architecturs, depeneding, StackOveflow, unecessary, differnet, independnet, adressing, adressing, transfomations, instructoins, instructon, broaded, regsiter, unecessary, multi-predictate, subsequet, analagous, limtiation, uncontional, unecessary, ephemism, posssible, endianess. To be fair, I only noticed about half of these while reading, but then pasted it into TextEdit for a more complete list. Other than 'broaded', none affected clarity.
Stores are different: they are two uops, STA (store address) and STD (store data), and they need different ports (p237 for STA and p4 for STD), so they play within the rules as well. Basically, the two uops are needed for the separate inputs they use, the store address and the store data, and they can execute independently. Loads OTOH have one input (the address) and one output.
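To make the "one port, one op" picture concrete, here is a toy sketch of the split described above. It is not a simulator: the p237/p4 port sets come from the comment itself, the p23 set for loads is my assumption based on the ports 2 and 3 discussion, and `UOPS` and `schedule` are hypothetical names made up for illustration.

```python
# Toy model only. Port sets for STA/STD follow the comment above;
# the load port set (p23) is an assumption.
UOPS = {
    "load": [("LD", {2, 3})],
    "store": [("STA", {2, 3, 7}), ("STD", {4})],
}

def schedule(instr, busy=frozenset()):
    """Give each uop of `instr` its own free eligible port for one cycle.

    Returns a dict {uop_name: port}. Each port takes at most one uop.
    """
    taken = set(busy)
    assignment = {}
    for name, ports in UOPS[instr]:
        free = sorted(ports - taken)
        if not free:
            raise RuntimeError(f"no free port for {name} this cycle")
        assignment[name] = free[0]  # lowest-numbered free port
        taken.add(free[0])
    return assignment

print(schedule("store"))               # STA and STD land on different ports
print(schedule("store", busy={2, 3}))  # STA falls back to port 7
```

The point of the sketch is only that a store occupies two ports with two separate uops, while a load occupies one.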
Note that loads _can_ actually end up dispatching multiple uops, in the case of a cache miss!
If the load uop misses in L1 when it executes, it gets replayed 7 cycles later, with the idea that the data will be arriving from L2 if there is an L2 hit. If that also misses, the load will be replayed a final time when the data arrives from L3 or DRAM (there is no additional replay for an L3 miss, because the latency is variable, so the load just goes to sleep waiting for the result, whether from L3, L4, DRAM or wherever). The replayed uops still have to play by the one port, one op rule.
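As a rough sketch of the replay behaviour described above (a simplification, not a description of any specific microarchitecture): the function below lists the cycles, relative to the first dispatch, at which the same load uop would dispatch. The 7-cycle delay is taken from the comment; `load_dispatches` and its `data_arrival` parameter are hypothetical, and the arrival cycle for a miss beyond L2 has to be supplied by the caller because that latency is variable.

```python
L2_REPLAY_DELAY = 7  # cycles after the first dispatch, per the comment above

def load_dispatches(hit_level, data_arrival=None):
    """Return the cycles (relative to first dispatch) at which the load uop dispatches.

    hit_level: "L1", "L2", or anything else for a miss beyond L2 (L3, L4, DRAM, ...).
    data_arrival: assumed cycle at which the data shows up for a miss beyond L2.
    """
    cycles = [0]                    # initial dispatch
    if hit_level == "L1":
        return cycles
    cycles.append(L2_REPLAY_DELAY)  # speculative replay, hoping for an L2 hit
    if hit_level == "L2":
        return cycles
    if data_arrival is None:
        raise ValueError("data_arrival is required for a miss beyond L2")
    cycles.append(data_arrival)     # single final replay once the data arrives
    return cycles

print(len(load_dispatches("L1")))         # 1 dispatch
print(len(load_dispatches("L2")))         # 2 dispatches
print(len(load_dispatches("DRAM", 200)))  # 3 dispatches, wherever the data came from
```

So in this model a load costs one, two, or three dispatches on a load port, and never more than three, since there is no extra replay for an L3 miss.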