Hacker News | pbsd's comments

This can be translated to the discrete domain pretty easily, just like the NTT. Pick a sufficiently large prime whose multiplicative group order is divisible by 15, say p = 2^61-1. 37 generates the whole multiplicative group, and 37^((2^61-2)/3) and 37^((2^61-2)/5) are appropriate roots of unity. Putting it all together yields

    f(n) = 5226577487551039623 + 1537228672809129301*(1669582390241348315^n + 636260618972345635^n) + 3689348814741910322*(725554454131936870^n + 194643636704778390^n + 1781303817082419751^n + 1910184110508252890^n) mod (2^61-1).
This involves 6 exponentiations by n with constant bases. Because in fizzbuzz the inputs are sequential, one can further precompute c^(2^i) and c^(-2^i); then, having c^n, one can get to c^(n+1) in 2 modular multiplications on average by multiplying in the powers c^(+-2^i) corresponding to the flipped bits.
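A toy-scale sketch of the same idea: the parameters below (p = 31, generator 3) are made up for illustration, not the 2^61-1 instance above, and the fizzbuzz value is recovered via roots-of-unity divisibility indicators rather than the single combined formula.

```python
p = 31                         # prime; |Z_p^*| = 30 is divisible by 15
g = 3                          # a primitive root mod 31

w3 = pow(g, (p - 1) // 3, p)   # primitive cube root of unity
w5 = pow(g, (p - 1) // 5, p)   # primitive 5th root of unity

def ind(n, w, m):
    # (1/m) * sum_{k=0}^{m-1} w^(k*n) is 1 if m | n, else 0:
    # the geometric series of roots of unity cancels unless m divides n.
    return sum(pow(w, k * n, p) for k in range(m)) * pow(m, -1, p) % p

def fizzbuzz(n):
    return [str(n), "Fizz", "Buzz", "FizzBuzz"][ind(n, w3, 3) + 2 * ind(n, w5, 5)]

# Sequential-input trick: given c^n, reach c^(n+1) by multiplying in
# precomputed c^(2^i) or c^(-2^i) for each bit flipped between n and n+1
# (2 multiplications on average over sequential n).
pos = [pow(w3, 1 << i, p) for i in range(8)]   # c^(2^i)
neg = [pow(x, -1, p) for x in pos]             # c^(-2^i)

def step(c_pow_n, n):
    for i in range(8):
        if (n ^ (n + 1)) >> i & 1:             # bit i flips from n to n+1
            c_pow_n = c_pow_n * (pos[i] if (n + 1) >> i & 1 else neg[i]) % p
    return c_pow_n
```

For n = 1..15 this produces the usual fizzbuzz sequence, and step(pow(w3, n, p), n) agrees with pow(w3, n+1, p).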

Integer exponentiation is still really, really expensive. 3-4 modulus operations and a few branches is a lot cheaper.

No, the generated code seems to be mostly the same as the manual version: https://gcc.godbolt.org/z/aK8orbKE8

The main difference there seems to be that GCC treats the if() as unlikely to be taken, while treating the for() as likely.


The SIKE comparison is not particularly inconsistent, since Bernstein has been banging the drum for years now that structured lattices may not be as secure as thought.

Currently the best attacks on NTRU, Kyber, etc., are essentially the same generic attacks that work on something like Frodo, which is built on unstructured lattices. And while resistance to these unstructured attacks is pretty well studied at this point, it is not unreasonable to suspect that the algebraic structure in the more efficient lattice schemes could lead to more efficient attacks. How much more efficient? Who knows.


Without wanting to engage much more deeply on this topic let me just say I concede any cryptography point 'pbsd makes.


I thought this was gonna be about the actual Scream stream cipher: https://eprint.iacr.org/2002/019


Cimino's Heaven's Gate (1980) is usually pointed to as the movie that caused the "no animals were harmed" disclaimer to be added to subsequent movies.


Vector ALU instruction latencies are understandably listed as 2 and higher, but this is not strictly the case. From AMD's Zen 5 optimization manual [1], we have

    The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
    There is no penalty for operations in the slow region that depend on longer latency operations or loads.
    There is no penalty for any operations in the fast region.
    To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
    The latency test could interleave NOPs to prevent the scheduler from filling up.
Basically, short vector code sequences that don't fill up the scheduler will have better latency.

[1] https://www.amd.com/content/dam/amd/en/documents/processor-t...


So if you fill up the scheduler with a long chain of dependent instructions, you experience a significant slowdown? I wonder why they decided to make it do that instead of limiting the size/fill by a bit, and what all the tradeoffs were.


>Even after the static variable has been initialised, the overhead of accessing it is still considerable: a function call to __cxa_guard_acquire(), plus atomic_load_explicit(&__b_guard, memory_order::acquire) in __cxa_guard_acquire().

No. The lock calls are only done during initialization, in case two threads run the initialization concurrently while the guard variable is 0. Once the variable is initialized, this will always be skipped by "je .L3".


Right, I was scratching my head for exactly that reason too. Even if the analysis were correct, it would still be a solution to a problem that doesn't exist.


The Pentium 4 had branch hints in the form of taken/not-taken prefixes. They were not found to be useful and were basically ignored in every subsequent Intel microarchitecture, until Redwood Cove brought back the branch-taken prefix in 2023.


This circuit [1] puts it at <=135k bit operations. Bitcoin uses SHA-256, not SHA-1.

[1] https://nigelsmart.github.io/MPC-Circuits/sha256.txt


Bitcoin's proof of work uses SHA-256(SHA-256(x)). Combining that with your figures reduces the difference to well within the minutiae of how you count bit operations and exactly which tradeoffs the circuits make.
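For concreteness, the double application with hashlib (arbitrary sample input, not a real block header):

```python
import hashlib

def pow_hash(header: bytes) -> bytes:
    # Bitcoin-style proof-of-work hash: SHA-256 applied twice
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

digest = pow_hash(b"example header")   # 32-byte (256-bit) result
```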


Karatsuba is definitely faster than schoolbook multiplication at practical sizes. You presumably mean Strassen.
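For reference, a minimal sketch of the Karatsuba recursion (decimal split for readability; real implementations split on machine words and fall back to schoolbook below a tuned threshold):

```python
def karatsuba(x: int, y: int) -> int:
    # Three recursive half-size products instead of the four of schoolbook.
    if x < 10 or y < 10:               # base case: single digit, multiply directly
        return x * y
    m = max(len(str(x)), len(str(y))) // 2
    B = 10 ** m
    xh, xl = divmod(x, B)              # x = xh*B + xl
    yh, yl = divmod(y, B)              # y = yh*B + yl
    hi = karatsuba(xh, yh)
    lo = karatsuba(xl, yl)
    mid = karatsuba(xh + xl, yh + yl) - hi - lo   # the Karatsuba trick
    return hi * B * B + mid * B + lo

# karatsuba(1234, 5678) == 7006652
```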

