Bitcoin's proof of work uses SHA-256(SHA-256(x)). Combining that with your figures reduces the differences to well within the minutiae of how you count bit operations and exactly which tradeoffs the circuits make.
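For concreteness, here is a minimal sketch of the double hash using Python's standard hashlib. The name sha256d is just a label chosen for this example, not something from the thread.

```python
import hashlib


def sha256d(data: bytes) -> bytes:
    """Double SHA-256: hash the input, then hash the 32-byte digest again."""
    first = hashlib.sha256(data).digest()
    return hashlib.sha256(first).digest()


if __name__ == "__main__":
    # Arbitrary example input; prints the 32-byte result as hex.
    print(sha256d(b"hello").hex())
```

Note that the second pass always hashes a 32-byte input, which fits in a single compression-function call after padding.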
Interestingly enough, the Square attack (otherwise more generally known as integral cryptanalysis) is much more powerful than regular linear or differential cryptanalysis when applied to the AES.
Antoine Joux was on the side of classical cryptanalysis in a 2014 bet. This was right after the small-characteristic discrete log advances, so that might no longer be the bet if it were made today.
Jasmin is something like this. It is essentially a high-level assembler: it handles register allocation (but not spills) for you and has some basic control-flow primitives that map 1-to-1 to assembly instructions. There is also an optional formal verification component to prove that a function is equivalent to its reference, is side-channel free, etc.
While the observation has previously focused on latency, it also affects throughput: whereas before you could run 2 independent shifts per cycle, each going to either p0 or p6, the anomaly lowers this to a single shift per cycle.
Besides the shifts and rotations, BZHI and BEXTR are also affected. While they are not shifts per se, it makes sense that they would be implemented with the same circuitry (e.g., BZHI is dst = arg1 & ~(-1 << arg2)). BEXTR in particular goes from 2 to 6 cycle latency!
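To make the "these are shifts in disguise" point concrete, here is a rough Python model of the 64-bit forms. It is a sketch of the architectural behavior only (flags are ignored), and bextr64 takes start and length as separate arguments for readability, whereas the real instruction packs them into one control operand. The function names are made up for this example.

```python
MASK64 = (1 << 64) - 1


def bzhi64(src: int, index: int) -> int:
    """Zero the bits of src from position `index` upward (64-bit BZHI model)."""
    n = index & 0xFF            # only the low 8 bits of the index are used
    if n >= 64:
        return src & MASK64     # index past the operand width: src unchanged
    # Same expression as in the comment above: arg1 & ~(-1 << arg2)
    return src & ~(-1 << n) & MASK64


def bextr64(src: int, start: int, length: int) -> int:
    """Extract `length` bits of src beginning at bit `start` (simplified BEXTR)."""
    start &= 0xFF
    length &= 0xFF
    if start >= 64:
        return 0
    # A right shift followed by a mask that is itself built from a left shift.
    return ((src & MASK64) >> start) & ((1 << length) - 1)
```

Both reduce to a variable shift plus a mask, which fits with them sharing the shifter.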
Another thing I'm noticing is that the affected instructions are all p06 shift ops. You can alternatively implement rotation using SHLD r,r,c, but this is a p1 operation and I have not seen any slowdown from this issue with it.
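The SHLD trick works because a double shift with both operands set to the same register is exactly a rotate. A small Python model of the 64-bit case (flags ignored, function names invented for the example) checks this against an independent, string-based rotate:

```python
MASK64 = (1 << 64) - 1


def shld64(dst: int, src: int, count: int) -> int:
    """Model of 64-bit SHLD: shift dst left, filling from the top bits of src."""
    dst &= MASK64
    src &= MASK64
    c = count & 63              # the count is taken modulo 64
    if c == 0:
        return dst
    return ((dst << c) | (src >> (64 - c))) & MASK64


def rotl64_reference(x: int, count: int) -> int:
    """Independent rotate-left: literally rotate the 64-character bit string."""
    c = count & 63
    bits = format(x & MASK64, "064b")
    return int(bits[c:] + bits[:c], 2)


if __name__ == "__main__":
    x = 0x0123456789ABCDEF
    # SHLD r,r,c should agree with a plain rotate for every count.
    print(all(shld64(x, x, c) == rotl64_reference(x, c) for c in range(64)))
```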
Really interesting. Normal uops don't work like that: they are always pipelined, so a p06 op with 3-cycle latency would always be 3/0.5, not 3/1.
So the 1-throughput strikes me as a renamer limit, not an execution limit. I.e., these instructions can only flow through the renamer at 1 per cycle.
This could perhaps be tested by interleaving unrelated p06 ops in the same ratio as SHLX and seeing whether they + SHLX are able to saturate p06. It would also be worth trying different interleaving granularities like 1:1 and 5:5, since I would expect those to behave differently in the renamer (but not much differently in execution).
Interleaving CQO and SHLX results in ~1.33 throughput with the anomaly, ~2.0 without. This ratio is more or less constant whether it's 1:1 or 2:2 or 4:4 or 8:8 (with 1:1 it's slightly lower at ~1.28).
This may or may not be consistent with one CQO uop being executed once a cycle as expected, and one SHLX uop taking a spot (stalling for one cycle?) for 2 cycles, resulting in a runtime of (x/2 * 1 + x/2 * 2)/2 ~ x/1.33 cycles.
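The arithmetic above, spelled out as a tiny Python model. The assumptions are the ones stated in the comment: 2 ports, CQO occupies a port for 1 cycle, and an anomalous SHLX occupies it for 2. The function and parameter names are made up for illustration.

```python
def uops_per_cycle(shlx_fraction: float,
                   shlx_port_cycles: int = 2,
                   other_port_cycles: int = 1,
                   ports: int = 2) -> float:
    """Steady-state throughput if each uop simply blocks a port for N cycles."""
    # Average number of port-cycles consumed by one uop of the mix.
    port_cycles_per_uop = (shlx_fraction * shlx_port_cycles
                           + (1.0 - shlx_fraction) * other_port_cycles)
    return ports / port_cycles_per_uop


if __name__ == "__main__":
    print(uops_per_cycle(0.5))                      # 50/50 CQO and SHLX, with anomaly
    print(uops_per_cycle(0.5, shlx_port_cycles=1))  # same mix, no anomaly
    print(uops_per_cycle(1.0))                      # SHLX only, with anomaly
```

Under these assumptions the 50/50 mix comes out at 4/3 uops per cycle with the anomaly and 2 without, and an all-SHLX stream at 1 per cycle.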
Well, very interesting. I don't really read that as supporting my idea of a renamer limit: it seems like the throughput would either be higher (assuming the CQO just goes "for free" in the same rename cycle) or lower, and would also depend heavily on the interleaving (assuming the SHLX uses a whole rename cycle by itself or something like that).
So maybe it's like you say: SHLX with the anomaly is only semi-pipelined: it takes 2 cycles before the next SHLX can dispatch. Perhaps it needs to use 1 cycle to handle the unfolding of the folded immediates, then the shift happens in the second cycle, and then the latency just has to be rounded up to 3 since all uops on those ports are 1 or 3, never 2 (which simplifies the scheduler, I believe).
That would also explain the original 1/cycle throughput for pure SHLX with the anomaly: 2 non-pipelined cycles per SHLX on each port, across 2 ports, gives the observed throughput.
So it's sort of like a 2 uop instruction but can't _actually_ be 2 uops because "rename" is too late to crack something into 2 uops (that's already happened), so it just does the doubling up internally?
Noise from the surrounding code aside, we see the same number of uops issued. However, in the anomaly case, ~1/4 of the cycles are spent with no uops being executed, ~1/2 with only 1 uop being executed, and ~1/4 with 2 uops being executed. I expected 0 and 2 to be split 50/50, consistent with there being a one-cycle stall, but if the uops are desynced and issued one cycle apart, that would also explain the 1 being so prominent.
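A toy model of that reasoning, in Python. It assumes (and this is only an assumption for illustration) that each of the two ports starts one SHLX every other cycle, and that the two ports are either in lockstep or offset by one cycle. Names are invented for the example.

```python
from collections import Counter


def executed_histogram(offset: int, cycles: int = 1000) -> dict:
    """Fraction of cycles with 0, 1 or 2 uops starting execution.

    Each port starts a uop every other cycle; `offset` is the phase
    difference between the two ports (0 = lockstep, 1 = one cycle apart).
    """
    hist = Counter()
    for t in range(cycles):
        p0 = 1 if t % 2 == 0 else 0
        p6 = 1 if (t + offset) % 2 == 0 else 0
        hist[p0 + p6] += 1
    return {k: hist.get(k, 0) / cycles for k in (0, 1, 2)}


def mix(a: dict, b: dict, weight: float = 0.5) -> dict:
    """Histogram of a run spending `weight` of its time in pattern a, rest in b."""
    return {k: weight * a[k] + (1.0 - weight) * b[k] for k in (0, 1, 2)}


if __name__ == "__main__":
    synced = executed_histogram(offset=0)
    desynced = executed_histogram(offset=1)
    print(synced)                 # lockstep: only 0-uop and 2-uop cycles
    print(desynced)               # one cycle apart: every cycle has 1 uop
    print(mix(synced, desynced))  # half the time in each pattern
```

In this model lockstep gives the 50/50 split between 0 and 2, fully desynced gives all 1s, and spending about half the time in each pattern would reproduce the roughly 1/4, 1/2, 1/4 distribution.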
To confirm this I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works:
An interesting data point is that Kahn's The Codebreakers, from 1967, uses "encipher" everywhere except in various US government agency quotes, which use "encrypt."
[1] https://nigelsmart.github.io/MPC-Circuits/sha256.txt