This can be translated to the discrete domain pretty easily, just like the NTT. Pick a sufficiently large prime with order 15k, say, p = 2^61-1. 37 generates the whole multiplicative group, and 37^((2^61-2)/3) and 37^((2^61-2)/5) are appropriate roots of unity. Putting it all together yields
This involves 6 exponentiations by n with constant bases. Because in fizzbuzz the inputs are sequential, one can further precompute c^(2^i) and c^(-2^i) and, having c^n, one can go to c^(n+1) in average 2 modular multiplications by multiplying the appropriate powers c^(+-2^i) corresponding to the flipped bits.
The SIKE comparison is not particularly inconsistent since Bernstein has been banging the drum that structured lattices may not be as secure as thought for years now.
Currently the best attacks on NTRU, Kyber, etc, are essentially the same generic attacks that work for something like Frodo, which works on unstructured lattices. And while the resistance of unstructured attacks is pretty well studied at this point, it is not unreasonable to suspect that the algebraic structure in the more efficient lattice schemes can lead to more efficient attacks. How efficient? Who knows.
Vector ALU instruction latencies are understandably listed as 2 and higher, but this is not strictly the case. From AMD's Zen 5 optimization manual [1], we have
The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
There is no penalty for operations in the slow region that depend on longer latency operations or loads.
There is no penalty for any operations in the fast region.
To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
The latency test could interleave NOPs to prevent the scheduler from filling up.
Basically, short vector code sequences that don't fill up the scheduler will have better latency.
So if you fill up the scheduler with a long line of dependent instructions, you experience a significant slowdown? I wonder why they decided to make it do that instead of limiting size/fill by a bit. What all the tradeoffs were.
>Even after the static variable has been initialised, the overhead of accessing it is still considerable: a function call to __cxa_guard_acquire(), plus atomic_load_explicit(&__b_guard, memory_order::acquire) in __cxa_guard_acquire().
No. The lock calls are only done during initialization, in case two threads run the initialization concurrently while the guard variable is 0. Once the variable is initialized, this will always be skipped by "je .L3".
Right, I was scratching my head exactly for that reason too. Even if the analysis was correct it would still be a solution for the problem that doesn't exist.
The Pentium 4 had branch hints in the form of taken/not taken prefixes. They were not found to be useful and basically ignored in every subsequent Intel microarchitecture, until Redwood Cove brought back the branch taken prefix in 2023.
Bitcoin's proof of work uses SHA-256(SHA-256(x)). Combining that with your figures reduces the differences to well within minutia of how you count bit operations and exactly which tradeoff the circuits make.
reply