
This circuit [1] puts it at <=135k bit operations. Bitcoin uses SHA-256, not SHA-1.

[1] https://nigelsmart.github.io/MPC-Circuits/sha256.txt


Bitcoin's proof of work uses SHA-256(SHA-256(x)). Combining that with your figures reduces the differences to well within the minutiae of how you count bit operations and exactly which tradeoffs the circuits make.
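For concreteness, a sketch of the double hash with Python's hashlib (the 80-byte header below is a placeholder, not a real block header):

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """Bitcoin-style proof-of-work hash: SHA-256 applied twice."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

# An 80-byte header pads out to two 64-byte SHA-256 blocks, and the
# outer hash of the 32-byte inner digest adds one more, so each
# proof-of-work evaluation costs 3 compression function calls.
header = b"\x00" * 80  # placeholder input
digest = double_sha256(header)
```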

Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.
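For reference, a minimal Karatsuba sketch over Python integers; the cutoff value is arbitrary here, and real libraries tune it carefully:

```python
def karatsuba(x: int, y: int, cutoff: int = 1 << 64) -> int:
    """Multiply nonnegative integers using Karatsuba's 3-multiplication split."""
    if x < cutoff or y < cutoff:
        return x * y  # small enough: fall back to the builtin product
    n = max(x.bit_length(), y.bit_length()) // 2
    x1, x0 = x >> n, x & ((1 << n) - 1)
    y1, y0 = y >> n, y & ((1 << n) - 1)
    z2 = karatsuba(x1, y1)
    z0 = karatsuba(x0, y0)
    # One multiply recovers the cross term instead of two:
    z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0
    return (z2 << (2 * n)) + (z1 << n) + z0
```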


On page 10, should the ring R be GF(2)[X]/(X^32-1) and the map p be from {0,1}^{32} to R?
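Assuming that reading is right, the appeal of GF(2)[X]/(X^32-1) would be that multiplication by X^k cyclically rotates the 32 coefficients, so p transports 32-bit rotations to ring multiplications. A quick sanity check of that identity (my own sketch, not taken from the paper):

```python
MASK32 = (1 << 32) - 1

def mul_by_xk(v: int, k: int) -> int:
    """In GF(2)[X]/(X^32-1), multiplication by X^k rotates the coefficient
    vector; with bit i of v holding the coefficient of X^i, that is a
    rotate-left by k."""
    k %= 32
    if k == 0:
        return v
    return ((v << k) | (v >> (32 - k))) & MASK32
```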


I think so, yes. I emailed my coauthors to confirm.


Interestingly enough, the Square attack (more generally known as integral cryptanalysis) is much more powerful than regular linear or differential cryptanalysis when applied to the AES.


Antoine Joux was on the side of classical cryptanalysis in a 2014 bet. This was right after the small-characteristic discrete log advances, so the bet might not look the same if it were made today.

https://x.com/hashbreaker/status/494867301435318273


Jasmin [1] is something like this. It is essentially a high-level assembler: it handles register allocation (but not spills) for you and has some basic control-flow primitives that map 1-to-1 to assembly instructions. There is also an optional formal verification component to prove that a function is equivalent to its reference implementation, is side-channel free, etc.

[1] https://github.com/jasmin-lang/jasmin/wiki


Thanks, this looks quite interesting.


It goes way back; check the work of the likes of Thorsten Holz or Christof Paar. TU Graz is another one.


While the discussion has so far focused on latency, the anomaly also affects throughput: whereas before you could run 2 independent shifts per cycle, each shift going to either p0 or p6, this anomaly lowers that to a single shift per cycle.

Besides the shifts and rotations, BZHI and BEXTR are also affected. While they are not shifts per se, it makes sense that they would be implemented with the same circuitry (e.g., BZHI is dst = arg1 & ~(-1 << arg2)). BEXTR in particular goes from 2- to 6-cycle latency!
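A model of that BZHI formula in Python, assuming 64-bit operands (the index comes from the low 8 bits of the second operand, and an index of 64 or more passes the source through unchanged):

```python
MASK64 = (1 << 64) - 1

def bzhi64(src: int, n: int) -> int:
    """BZHI zeroes the bits of src at positions n and above."""
    n &= 0xff  # only the low byte of the index operand is consulted
    if n >= 64:
        return src & MASK64  # out-of-range index: src is returned as-is
    return src & ~(-1 << n) & MASK64  # ~(-1 << n) == (1 << n) - 1
```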

Another thing I'm noticing is that the affected instructions are all p06 shift ops. You can alternatively implement rotation using SHLD r,r,c, but this is a p1 operation, and I have not seen any slowdown for it from this issue.
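For anyone unfamiliar with the trick: SHLD shifts the destination left and fills the vacated bits from the top of the source, so with both operands set to the same register it is exactly a rotate-left. A quick model (32-bit operands, count taken mod 32):

```python
MASK32 = (1 << 32) - 1

def shld32(dst: int, src: int, c: int) -> int:
    """SHLD: shift dst left by c, filling vacated bits from the top of src."""
    c &= 31
    if c == 0:
        return dst
    return ((dst << c) | (src >> (32 - c))) & MASK32

def rotl32(v: int, c: int) -> int:
    """Plain 32-bit rotate-left, for comparison."""
    c &= 31
    return ((v << c) | (v >> (32 - c))) & MASK32 if c else v
```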


Very interesting, thanks! If I get time I'll try to measure every instruction to see the whole set of affected instructions.


How did you test the throughput? Seems quite weird. Does it still go to both ports?


Same framework but instead of, say, SHLX RAX, RAX, RCX I do SHLX R[i], R[i], RCX for 8 consecutive registers. Yes, it still does go to both ports.


Really interesting. Normal uops don't work like that: they are always pipelined, so a p06 op with 3-cycle latency would always be 3/0.5, not 3/1.

So the 1-throughput strikes me as a renamer limit, not an execution limit. I.e., these instructions can only flow through the renamer at 1 per cycle.

This could perhaps be tested by interleaving unrelated p06 ops in the same ratio as SHLX, and see if they + SHLX are able to saturate p06, and also trying different interleaving granularities like 1:1 and 5:5 since I would expect those to behave differently in the renamer (but not much different in execution).


Interleaving CQO and SHLX results in ~1.33 throughput with the anomaly, ~2.0 without. This ratio is more or less constant whether it's 1:1 or 2:2 or 4:4 or 8:8 (with 1:1 it's slightly lower at ~1.28).

This may or may not be consistent with one CQO uop being executed each cycle as expected, and one SHLX uop taking a spot (stalling for one cycle?) for 2 cycles, resulting in a runtime of (x/2 * 1 + x/2 * 2)/2 ~ x/1.33 cycles.
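Plugging numbers into that model (a back-of-the-envelope check, nothing more):

```python
# Model: half the uops (CQO) occupy a port for 1 cycle, the other half
# (anomalous SHLX) occupy a port for 2 cycles, spread evenly over 2 ports.
x = 1_000_000  # total uops
cycles = (x / 2 * 1 + x / 2 * 2) / 2
throughput = x / cycles  # uops per cycle; works out to 4/3 ~ 1.33
```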


Well, very interesting. I don't really read that as supporting my idea of a renamer limit: it seems like the throughput would either be higher (assuming the CQO just goes "for free" in the same rename cycle) or lower and also depend heavily on the interleaving (assuming the SHLX uses a whole rename cycle by itself, or something like that).

So maybe it's like you say: SHLX with the anomaly is only semi-pipelined: it takes 2 cycles before the next SHLX can dispatch. Perhaps it needs to use 1 cycle to handle the unfolding of the folded immediates, then the shift happens in the second cycle, and then the latency just has to be rounded up to 3 since all uops on those ports are 1 or 3, never 2 (which simplifies the scheduler, I believe).

That would also explain the original 1/cycle throughput for pure SHLX with anomaly: as 2 non-pipelined cycles on the port, 2 ports = the observed throughput.

So it's sort of like a 2 uop instruction but can't _actually_ be 2 uops because "rename" is too late to crack something into 2 uops (that's already happened), so it just does the doubling up internally?


Trying some perf events confirms that there is no extra inserted uop. Going back to the SHLX R[i], R[i], RCX loop, we have:

No anomaly:

     2,190,954,207      cpu_core/cycles:u/                                                      ( +-  0.14% )
     4,412,790,656      cpu_core/uops_issued.any:u/                                             ( +-  0.11% )
        39,386,389      cpu_core/exe_activity.1_ports_util:u/                                        ( +- 11.57% )
     2,121,401,346      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.11% )
         6,015,432      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  8.87% )
       593,599,670      cpu_core/uops_retired.stalls:u/                                         ( +-  0.85% )
Anomaly:

     4,357,567,336      cpu_core/cycles:u/                                                      ( +-  0.15% )
     4,448,899,140      cpu_core/uops_issued.any:u/                                             ( +-  0.26% )
     2,107,051,688      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  0.14% )
     1,106,699,503      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.13% )
     1,129,497,409      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.42% )
     2,502,226,997      cpu_core/uops_retired.stalls:u/                                         ( +-  0.38% )
Noise from the surrounding code aside, we see the same number of uops issued. However, in the anomaly case, ~1/4 of the cycles are spent with no uops being executed, ~1/2 with only 1 uop, and around 1/4 with 2 uops. I expected 0 and 2 to split 50/50, consistent with a one-cycle stall, but if the uops are desynced and issued one cycle apart, that would also explain the 1 being so prominent.
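As a sanity check on those fractions, dividing the measured counters by the cycle count:

```python
# Counter values from the anomaly run above.
cycles     = 4_357_567_336
zero_ports = 1_129_497_409  # exe_activity.exe_bound_0_ports
one_port   = 2_107_051_688  # exe_activity.1_ports_util
two_ports  = 1_106_699_503  # exe_activity.2_ports_util

fractions = [c / cycles for c in (zero_ports, one_port, two_ports)]
# Roughly 1/4, 1/2, 1/4 of cycles execute 0, 1, and 2 uops respectively.
```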

To confirm this I add an LFENCE at the start of each loop iteration to serialize the pipeline and try to ensure that each SHLX pair is issued in the same cycle. And it works:

     4,581,269,346      cpu_core/cycles:u/                                                      ( +-  0.10% )
     4,556,347,404      cpu_core/uops_issued.any:u/                                             ( +-  0.12% )
       133,363,872      cpu_core/exe_activity.1_ports_util:u/                                        ( +-  7.73% )
     2,082,838,530      cpu_core/exe_activity.2_ports_util:u/                                        ( +-  0.24% )
     2,165,817,614      cpu_core/exe_activity.exe_bound_0_ports:u/                                        ( +-  0.06% )
     3,090,362,239      cpu_core/uops_retired.stalls:u/                                         ( +-  0.16% )
Now the uops are split between 0 and 2 executed per cycle, as theorized.


An interesting data point is that Kahn's The Codebreakers, from 1967, uses "encipher" everywhere except in various US government agency quotes, which use "encrypt."


Normal for loops can't really make that work, since they're too general, but range-for loops plausibly could. Something like

    for(auto&& e : range) {
        std::print("{}", e);
        join {
            std::print(", ");
        }
    }
where join {} is effectively syntax sugar for if(std::next(__first) != __last) {}.

The fmt library makes this sort of task easier by providing the join adaptor; this example would become

    fmt::print("{}", fmt::join(range, ", "));

