> Intel® Stratix® 10 devices deliver up to 23 TMACs of fixed-point performance and up to 10 TFLOPS of IEEE-754 single-precision floating-point performance.
You can't really compare them on a FLOPS basis, firstly because our algorithm and Jäkel's are completely different. Within the FPGA accelerator itself we don't use a single multiply; all our operations are Boolean logic and counting. Jäkel, on the other hand, was able to exploit the GPU's strong preference for matrix multiplication: all his operations were integer multiplications.
In fact, in terms of raw operation count, Jäkel actually did more. There appears to be a tradeoff within the Jumping Formulas, where any jump you pick keeps the same fundamental complexity, with even a slight preference towards smaller jumps. It is just that GPU development is several decades ahead of FPGA development, thanks to the ML and rendering hype, and that more than compensates for the slightly worse fundamental complexity.
As a side note, raw FLOP counts from FPGA vendors are wildly inflated. The issue with FPGA designs is that reaching this theoretical FLOP count is nigh-impossible, because getting all components running at the theoretical clock frequency limit is incredibly difficult. Compare that with GPUs, where at least your processing frequency is a given.
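To illustrate what "Boolean logic and counting" can look like in this setting, here is a toy Python sketch, purely for illustration (it is not the FPGA kernel or the P-coefficient computation): a brute-force count of the small Dedekind numbers using nothing but bitwise operations and counting.

    # A Boolean function f on n variables is stored as a (2^n)-bit mask,
    # where bit x holds f(x). Monotone means flipping any input variable
    # from 0 to 1 never flips the output from 1 to 0.
    def is_monotone(f, n):
        for x in range(1 << n):
            for i in range(n):
                if not x & (1 << i):
                    if (f >> x) & 1 and not (f >> (x | (1 << i))) & 1:
                        return False
        return True

    def dedekind(n):
        # check all 2^(2^n) functions; only feasible for n <= 4
        return sum(is_monotone(f, n) for f in range(1 << (1 << n)))

    print([dedekind(n) for n in range(5)])   # [2, 3, 6, 20, 168]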
Hello, thanks for taking the time to reply here! My own Master's thesis was also about optimising and implementing algorithms to count certain (graphical) mathematical objects, but you picked a much more famous problem than I did. I'm very surprised I didn't know the definition of the Dedekind numbers, even though they're related to things I touched on.
I'm not too familiar with FPGAs but hope to have a use for them some day. Measuring their performance in FLOPS seems strange. How close to those theoretical limits does one typically get? Are there a lot of design constraints that conspire against you, or is it just that whatever circuit you want can't be mapped densely to the gate topology?
I'm curious how credit works in these situations. Is it first to upload/publish somewhere (arXiv), or is it "here are the logs from when we found it"? If it's the former, I'd build a pipeline that auto-generates a quick paper and uploads it immediately when the number is spit out. If it's the latter, how do we make sure the logs are not tampered with?
Either way, it's fun for me that one of the teams used the Noctua 2 cluster; I was in the server room a couple of months ago. It looked very underwhelming, but it's quite interesting because I've never seen an FPGA cluster before. They are also doing some interesting ML things there (optimizing for power consumption).
Edit: never mind the first part, I read the Van Hirtum paper and they registered the number on GitHub. I can't find a corresponding timestamp mentioned in the Jäkel paper.
While Noctua 2 is certainly not one of the larger clusters out there, it does stand out as one of the only FPGA superclusters in Europe. This is the unique capability that made our project possible there.
And yes, assigning unique credit in this case isn't trivial. I think the only good answer here is: "it was discovered simultaneously, independently".
Hi Christian and Lennart - if you are still listening in here...
Did you know the other guy was working on the same problem - have you ever met before?
Lennart has a track record of working on this problem through his master's thesis;
Christian, I am not sure; I haven't seen any previous publication yet.
It is sort of funny that the number was computed practically simultaneously some 450 km apart
(though it happens in science, as if there is something in the air sometimes ;-)
Also, it has now been stated that D(9) "may be the last in the sequence that is feasible to discover" (The New Scientist)
- is this a view you can share?
(Never say never, I know :-)
> The problem: The calculation of these terms on normal processors is slow and also the use of GPUs as currently the fastest hardware accelerator technology for many AI applications is not efficient for this algorithm.
> https://arxiv.org/pdf/2304.03039.pdf
> At this rate, a single FPGA processes about 5.2 α values per second, taking 47'000 FPGA hours to compute D(9)
and
> https://arxiv.org/pdf/2304.00895.pdf
> The only thing left to say is that we run the algorithm on Nvidia A100 GPUs. 5311 GPU hours and 4257682565 matrix multiplications later, we got the following value for the ninth Dedekind number
Does this mean that Christian Jäkel developed a new technique? Or was this extending the existing P-coefficient formula work and making it work on GPUs?
It does seem so; both seem to compute the Dedekind number using sums over equivalence classes of functions of fewer variables. Before [0][1], it was only possible to compute D(n+2) from Rn; Jäkel gave a way to compute D(n+3) and D(n+4) from Rn, and that seems to be the major speed improvement. The D(n+3) formula computed D(8) in 17.56 s; he couldn't use the same method, presumably because R(7) is 30,000 times the size of R(6), and since he is using matrix multiplication it would take stupidly long.
Edit: I didn't read carefully enough; it is not new at all, see [2] or [3] for examples.
[0]: T. J. Yusun, "Dedekind numbers and related sequences" (2008), page 19.
[1]: Lennart Van Hirtum, "A path to compute the 9th Dedekind number using FPGA supercomputing," Master's thesis, KU Leuven, 2021, https://hirtum.com/thesis.pdf, page 51: "While it would be nice to pick k = 3 and just compute D(9) from the structure of 6, at this time no efficient methods exist for computing P(n,k,α,β) for k ≥ 3. So we have to use k = 2"
[2]: F. a Campo, "Relations between powers of Dedekind numbers and exponential sums related to them," J. Integer Seq. 21 (2018), article 18.4.4, https://cs.uwaterloo.ca/journals/JIS/VOL21/Campo/campo3.pdf and https://cs.uwaterloo.ca/journals/JIS/VOL21/Campo/campo3-corrigendum.pdf
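As a quick sanity check of the "D(n+2) from monotone functions on n variables" idea referenced above, here is a short Python sketch of the classical unoptimized identity obtained by splitting on the last two variables (the function names are just illustrative, and none of the Rn/equivalence-class or P-coefficient machinery from the papers appears here):

    # A monotone f on n+2 variables is determined by its four restrictions
    # f00, f01, f10, f11 (each monotone on n variables), with
    # f00 <= f01 <= f11 and f00 <= f10 <= f11. Fixing a = f00 and b = f11,
    # the middle two range independently over the interval [a, b], so
    #   D(n+2) = sum over pairs a <= b of |[a, b]|^2
    def monotone_functions(n):
        # functions as (2^n)-bit masks, bit x = f(x); brute force, small n only
        return [f for f in range(1 << (1 << n))
                if all((f >> x) & 1 <= (f >> (x | (1 << i))) & 1
                       for x in range(1 << n) for i in range(n))]

    def leq(f, g):
        return f & ~g == 0          # pointwise f <= g

    def dedekind_from_two_below(n):
        ms = monotone_functions(n)
        total = 0
        for a in ms:
            for b in ms:
                if leq(a, b):
                    interval = sum(1 for c in ms if leq(a, c) and leq(c, b))
                    total += interval * interval
        return total

    print([dedekind_from_two_below(n) for n in range(4)])   # [6, 20, 168, 7581]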
I'm Christian, the one-man team. I want to clarify that the symmetries I used are not based on Rn. I developed methods to utilize equivalent lattice intervals. That's a generalization, since the Rn values can be seen as a special kind of interval. This new approach, based on intervals, made the reformulation in terms of matrix multiplication possible.
Both fundamental techniques were known beforehand, and in both cases the major innovation was adapting the algorithm to the specific compute platform. For us it was exploiting FPGAs to compute these P-coefficients; for Jäkel, the major contribution was restructuring the summation as large matrix multiplications, which could then be executed very efficiently on GPUs.
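To give a flavour of the second trick in isolation, here is a toy Python/NumPy sketch of the generic "turn a double sum into matrix products" pattern (this is only the pattern, not Jäkel's actual formula; the arrays are random stand-ins):

    # Generic pattern only: evaluate sum_{i,j} u[i] * A[i, j] * v[j]
    # with dense linear algebra instead of an explicit double loop.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 300
    u = rng.integers(0, 10, size=n)
    A = rng.integers(0, 10, size=(n, n))
    v = rng.integers(0, 10, size=n)

    # naive double loop over all index pairs
    naive = sum(int(u[i]) * int(A[i, j]) * int(v[j])
                for i in range(n) for j in range(n))

    # same sum as one matrix-vector product plus one dot product
    fast = int(u @ A @ v)

    assert naive == fast

The point is just that the right-hand form hands the entire summation to a dense linear-algebra kernel, which is exactly the kind of work GPUs are optimized for.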
Someone always asks this, and it's been answered many times before.
In short: this specific result probably has no practical applications just now.
But there are several points to consider:
0: The techniques developed to answer the question might be useful for other applications;
1: It can be fun to chase down problems like this;
2: Sometimes results like this end up being useful decades, even centuries later;
3: The skills the researchers develop might be applicable to other problems;
4: Not everything has to be useful. Consider music, art, hobbies.
As examples of some of these:
a: Elliptic curves were initially studied purely out of curiosity. Now they form the basis of modern cryptosystems;
b: Factorising Integers was regarded as the pursuit of cranks, until it became an important tool in breaking cryptosystems;
c: Latin Squares and Lattices were studied well before they were suggested as a possibility for post-quantum cryptography;
d: Graph Theory was chased purely because it was interesting, and now forms the foundation of a gazillion algorithms;
e: Eigenvectors were studied purely out of curiosity, then suddenly they became the basis of the Page Rank algorithm;
f: Non-Euclidean Geometry was created purely to show that classic geometry was probably right, then showed it was actually "wrong", then turned out to be useful in relativity.
And so on.
There are more examples, and thousands, possibly millions, of words have been written on the subject of how pure research, pursued simply because it was interesting, has turned out (in some cases!) to be useful.
The 9th Dedekind number is unlikely to be useful, but its pursuit may be valuable in ways we can't imagine.
Thanks for the wall of text, I don’t doubt basic research is valuable and never wrote anything like that, so you could have saved that typing.
> The 9th Dedekind number is unlikely to be useful, but its pursuit may be valuable in ways we can't imagine.
The fact that you can’t imagine it doesn’t mean somebody else can’t imagine it. If you can’t imagine it, save the typing and let somebody that can imagine it reply, next time.
I spent some time in the past searching for an algebraic expression or straightforward recursive formula for the Dedekind numbers, because such a formula would make it easy to construct an efficient encoding function that maps a monotone Boolean function to an integer optimally.
Why was I interested in optimally encoding monotone Boolean functions? Because they're exactly equivalent to the set of credible multi-party signing policies. So, for example, you have some N parties that can participate in signing to control some Bitcoin: which combinations of them should be allowed? People commonly use thresholds, M of N, but sometimes not all participants are equal. The full set of policies you could possibly have is the monotone functions (since it doesn't make sense to have a policy that says a party CAN'T participate, as a party could always refrain in order to satisfy the policy).
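As a toy illustration (hypothetical policy, illustrative names only): "monotone" here just means that adding a signer can never turn a sufficient set of signers into an insufficient one.

    # A signing policy as a predicate on subsets of N parties (subsets as
    # bitmasks). Example policy: 2-of-3, but party 0 must always be a signer.
    N = 3

    def policy(signers):                      # signers: bitmask of participating parties
        return bool(signers & 1) and bin(signers).count("1") >= 2

    def is_monotone(pred, n):
        for s in range(1 << n):
            for i in range(n):
                if not s & (1 << i) and pred(s) and not pred(s | (1 << i)):
                    return False              # adding party i broke a valid set
        return True

    print(is_monotone(policy, N))             # True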
Knowing the next Dedekind number, even if not through an efficient expression, provides extra information that could help understand the structure of the problem.
For my application, though, having an optimal encoding isn't strictly necessary (and in any case may not be possible, given the amount of thought that has gone into the problem of finding the Dedekind numbers) -- though it would be neat.
The naive encoding is to just store the full truth table of the Boolean function, which takes 2^n bits, but already at 9 inputs an optimal encoding of monotone functions would be about 26.8% of that size, and the gap will increase as the number of inputs increases.
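For concreteness, the arithmetic, using the D(9) value announced in the linked papers:

    # naive truth-table encoding vs. indexing into the list of all monotone
    # functions on 9 inputs
    D9 = 286386577668298411128469151667598498812366   # D(9), as announced
    naive_bits = 2 ** 9                    # 512 bits: one bit per input combination
    optimal_bits = (D9 - 1).bit_length()   # 138 bits: enough to index any of the D9 functions
    print(naive_bits, optimal_bits, optimal_bits / naive_bits)   # 512 138 ~0.27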
More interesting to me would be an encoding that efficiently (if not optimally) expressed the monotone functions in terms of how many bits were allowed to be false, because I'd come up with a commitment scheme that allows the policy setter to set the policy as a single hash (which has a hidden bound on the number of missing parties), while the signers can collaboratively produce a signature of size proportional to the number of missing parties. In the case that the policy is a simple threshold, this construct alone is enough, but if it isn't, additional information about which missing-sets aren't allowed is needed, and to maximize policy privacy it would be best if that information were hierarchically encoded. E.g. if none are missing then no information is needed; if 1 is missing, which parties it is not allowed to be; if 2 are missing, which pairs; and so on. (This structure also has more promise for an efficient encoding, since it's easy to optimally encode an n choose m selection.)
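A rough sketch of that hierarchical view (a hypothetical helper, not the commitment scheme itself): for each count of missing parties, list exactly which missing-sets are disallowed.

    from itertools import combinations

    N = 3
    def policy(signers):                       # same toy policy as above
        return bool(signers & 1) and bin(signers).count("1") >= 2

    def disallowed_missing_sets(pol, n):
        # level k holds the k-subsets of parties whose absence is not allowed;
        # level 0 is always empty for a satisfiable policy, and a plain
        # threshold needs nothing beyond the bound itself
        full = (1 << n) - 1
        return {k: [m for m in combinations(range(n), k)
                    if not pol(full & ~sum(1 << i for i in m))]
                for k in range(n + 1)}

    print(disallowed_missing_sets(policy, N))
    # {0: [], 1: [(0,)], 2: [(0, 1), (0, 2), (1, 2)], 3: [(0, 1, 2)]}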
For these more generalized encoding questions, particularly if they don't seek to encode optimally, the next Dedekind number is less useful, but exploration of it could still yield helpful insights towards more efficient, if not optimal, encodings, or somewhat efficient encodings with special properties like the hierarchical one I wanted above.
https://arxiv.org/abs/2304.00895