More

pbsd · 2025-07-05T03:00:18 1751684418

The Pentium 4 had branch hints in the form of taken/not taken prefixes. They were not found to be useful and basically ignored in every subsequent Intel microarchitecture, until Redwood Cove brought back the branch taken prefix in 2023.

pbsd · 2025-06-22T03:44:52 1750563892

This circuit [1] puts it at <=135k bit operations. Bitcoin uses SHA-256, not SHA-1.

[1] https://nigelsmart.github.io/MPC-Circuits/sha256.txt

nullc · 2025-06-22T06:44:13 1750574653

Bitcoin's proof of work uses SHA-256(SHA-256(x)). Combining that with your figures reduces the differences to well within minutia of how you count bit operations and exactly which tradeoff the circuits make.

pbsd · 2025-05-16T17:52:16 1747417936

Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.

pbsd · 2025-03-21T20:59:46 1742590786

In page 10, should the ring R be GF(2)[X]/(X^32-1) and the map p be from {0,1}^{32} to R?

cperciva · 2025-03-22T15:51:28 1742658688

I think so yes. I emailed my coauthors to confirm.

pbsd · 2025-03-04T21:49:43 1741124983

Interestingly enough, the Square attack (otherwise more generally known as integral cryptanalysis) is much more powerful than regular linear or differential cryptanalysis when applied to the AES.

pbsd · 2025-02-19T22:31:21 1740004281

Antoine Joux was on the side of classical cryptanalysis on a 2014 bet. This was right after the small-characteristic discrete log advances, so that might no longer be the bet if it was made today.

https://x.com/hashbreaker/status/494867301435318273

pbsd · 2025-02-17T20:22:05 1739823725

Jasmin is something like this. It is essentially a high-level assembler, will handle register allocation (but not spills) for you, has some basic control flow primitives that map 1-to-1 to assembly instructions. There is also an optional formal verification component to prove some function is equivalent to its reference , is side-channel free, etc.

[1] https://github.com/jasmin-lang/jasmin/wiki

kibwen · 2025-02-17T21:37:00 1739828220

Thanks, this looks quite interesting.

pbsd · 2025-01-28T20:21:31 1738095691

It goes way back; check the work of the likes of Thorsten Holz or Christof Paar. TU Graz is another one.

pbsd · 2025-01-04T21:04:25 1736024665

While the observation has previously focused on latency it also affects throughput: whereas you could run 2 independent shifts per cycle before, each shift going to either p0 or p6, this anomaly lowers this to a single shift per cycle.

Besides the shifts and rotations BZHI and BEXTR are also affected. While they are not a shift per se, it makes sense that it would be implemented with the same circuitry (e.g., BZHI is dst = arg1 & ~(-1 << arg2)). BEXTR in particular goes from 2 to 6 cycle latency!

Another thing I'm noticing is that the affected instructions are all p06 shift ops. You can alternatively implement rotation using SHLD r,r,c but this is a p1 operation and I have not seen any slowdown from this issue.

tavianator · 2025-01-04T23:35:38 1736033738

Very interesting, thanks! If I get time I'll try to measure every instruction to see the whole set of affected instructions.

BeeOnRope · 2025-01-06T03:10:10 1736133010

How did you test the throughout? Seems quite weird. Does it go to both ports still?

pbsd · 2025-01-06T13:45:03 1736171103

Same framework but instead of, say, SHLX RAX, RAX, RCX I do SHLX R[i], R[i], RCX for 8 consecutive registers. Yes, it still does go to both ports.

BeeOnRope · 2025-01-07T16:57:54 1736269074

Really interesting. Normal uops don't work like that, they are always pipelined, so a p06 op with 3-cycle latency would always be 3/0.5, not 3/1.

So the 1-throughput strikes me as a renamer limit, not an execution limit. I.e., these instructions can only flow through the renamer at 1 per cycle.

This could perhaps be tested by interleaving unrelated p06 ops in the same ratio as SHLX, and see if they + SHLX are able to saturate p06, and also trying different interleaving granularities like 1:1 and 5:5 since I would expect those to behave differently in the renamer (but not much different in execution).

pbsd · 2025-01-07T20:23:14 1736281394

Interleaving CQO and SHLX results in ~1.33 throughput with the anomaly, ~2.0 without. This ratio is more or less constant whether it's 1:1 or 2:2 or 4:4 or 8:8 (with 1:1 it's slightly lower at ~1.28).

This may or may not be consistent with one CQO uop being executed once a cycle as expected, and one SHLX uop taking a a spot (stalling for one cycle?) for 2 cycles, resulting in a runtime of (x/2 * 1 + x/2 * 2)/2 ~ x/1.33 cycles.

BeeOnRope · 2025-01-08T02:51:46 1736304706

Well very interesting. I don't really read that as supporting my idea of a renamer limit: seems like the throughput would either be higher (assuming the CQO just goes "for free" in the same rename cycle) or lower and also depend heavy on the interleaving (assuming the SHLX uses a whole rename cycle by itself or something like that).

So maybe it's like you say: SHLX with the anomaly is only semi-pipelined: it takes 2 cycles before the next SHLX can dispatch. Perhaps it needs to use 1 cycle to handle the unfolding of the folded immediates, then the shift happens in the second cycle, and then the latency just has to be rounded up to 3 since all uops on those ports are 1 or 3, never 2 (which simplifies the scheduler, I believe).

That would also explain the original 1/cycle throughput for pure SHLX with anomaly: as 2 non-pipelined cycles on the port, 2 ports = the observed throughput.

So it's sort of like a 2 uop instruction but can't _actually_ be 2 uops because "rename" is too late to crack something into 2 uops (that's already happened), so it just does the doubling up internally?

pbsd · 2025-01-08T21:34:38 1736372078

Trying some perf events confirms that there is no extra inserted uop. Going back to the SHLX R[i], R[i], RCX loop, we have:

No anomaly:

     2,190,954,207      cpu_core/cycles:u/                                                      ( +-  0.14% )
     4,412,790,656      cpu_core/uops_issued.any:u/                                             ( +-  0.11% )
        39,386,389      cpu_core/exe_activity.1_ports_util:u/                                        ( +- 11.57% )
     2,121,401,346      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.11% )
         6,015,432      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  8.87% )
       593,599,670      cpu_core/uops_retired.stalls:u/                                         ( +-  0.85% )

Anomaly:

     4,357,567,336      cpu_core/cycles:u/                                                      ( +-  0.15% )
     4,448,899,140      cpu_core/uops_issued.any:u/                                             ( +-  0.26% )
     2,107,051,688      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  0.14% )
     1,106,699,503      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.13% )
     1,129,497,409      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.42% )
     2,502,226,997      cpu_core/uops_retired.stalls:u/                                         ( +-  0.38% )

Noise from the surrounding code aside, we see the same number of uops issued. However in the anomaly case, ~1/4th of the cycles are spent with no uops being executed, 1/2 are spent with only 1 uop being executed, and around 1/4 of the cycles have 2 uops being executed. I expected 0 and 2 being 50/50, consistently with there being one cycle stall, but if the uops are desynched and issued one cycle apart it would also explain the 1 being so prominent.

To confirm this I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works:

     4,581,269,346      cpu_core/cycles:u/                                                      ( +-  0.10% )
     4,556,347,404      cpu_core/uops_issued.any:u/                                             ( +-  0.12% )
       133,363,872      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  7.73% )
     2,082,838,530      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.24% )
     2,165,817,614      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.06% )
     3,090,362,239      cpu_core/uops_retired.stalls:u/                                         ( +-  0.16% )

Now the uops are split between 0 and 2 executed per cycle, as theorized.

pbsd · 2024-10-19T00:25:53 1729297553

An interesting data point is that Kahn's The codebreakers, from 1967, uses "encipher" everywhere except for various US goverment agency quotes, which use "encrypt."