Romu – Fine random number generators 61 points by nkurz on Feb 29, 2020 | hide | past | favorite | 68 comments

 I'd stay away from this work. There are ample math and reasoning errors in the paper, and the claims are dubious.One prime example: the algo for ROMU is of the form new_state64 = ROTATE_LEFT(old_state64,32) * constant, which decomposes the space {1,2,..,2^64-1} into a permutation. From a given start seed, your cycle may be short, and you won't know this until it bites you.The author incorrectly addresses this using the theory of random permutations, but his is most definitely NOT a random permutation chosen uniformly from S_n. His analysis most does not follow. The proofs he cites are not applicable to his since he has a fixed permutation.Among all cycles, among all permutations, the results hold. They do not hold for a fixed permutation. For example his results fail for the n-cycle and they fail for the identity permutation. They similarly fail for many, many other permutations, and he has no proof that his is not a badly chosen one. Based on the structure of it (rotate and mult), I suspect from past experience that his permutation has lots of structure that leads to all sorts of pathologies for this use.This is a basic math error, and the paper is full of such basic errors.For why to listen (somewhat) to me, check out my name - I've done a bit of writing (and a lot of teaching) on PRNGs for decades. I've followed the field for quite some time, and am well aware of much of the research in it.There is no need with all the good choices out there to choose a PRNG that cannot provide a solid proof of cycle length (or at least correct bounds) for any seed value. This one fails this basic test.If I get time (and remember) later, I've written some SAT solver code to explicitly find holes in PRNGs, and I'll try to dig it out and analyze the cycles in this fixed permutation. If anyone else wants to try it the idea is you can assign variables to bit positions in the state, represent the permutation as boolean expressions in these bits, and find cycles using SAT solvers without having to run through steps incrementally. I suspect in this case it will break this PRNG pretty quickly.
 Suppose you generated two permutations, one using true random numbers and the other using pseudo-random numbers. Now you have two fixed permutations. The cycles in both will obey the probabilities about such cycles, unless the pseudo-random generator was poor. But Romu generators pass PractRand, so their high quality numbers and cycles will obey those probabilities. But you brought up a good point: Each kind of Romu generator relies on one permutation and thus one set of cycles. The paper needs to discuss that fact. Thanks for pointing that out.Important request: Please post every specific error in reasoning or math you saw in the paper. You said there are several. I expect that you are not like many internet posters who reply with vagueness or discouragement when pressed for specifics. I need the specific errors so I can fix all of them before submitting the paper to a peer-reviewed journal.Also, a SAT solver that finds cycles in these generators would be very helpful!
 It also dawned on me your method is basically the same as vonNeumann's 1946 middle square method, which turned out to be terrible from short cycles. His was base 10; yours is base 2. His dropped the bottom and top "bits". Yours drops the top (during the mult). His permuted each time via a different number; yours uses a fixed number.Both use a mult to scramble, and then retain a subset of the bits. Adding more rotates and mults does not change the underlying issues.This type of PRNG has been pretty thoroughly mined over the past 70+ years. You should dig through this work and the following literature on why PRNGs of this structure fell out of favor. I suspect exactly yours appears somewhere following this work.
 You cannot fix this error. Your proofs do not apply to such limited set of random permutations. You have to prove the cycle structure for the specific ones you chose, no matter how you choose them, otherwise you are playing with a time bomb. As such there’d be no reason to choose your method over many other PRNGs that do provide good cycle length guarantees and equidistribution metrics at the same CPU speeds.In fact, I think your method is a weaker subset of PCG generators, which are as fast and have all the good theoretical guarantees.What do you mean by “true random numbers”?There are 64! permutations on 64 bits. All Roma permutations are a vanishingly small subset of these with very specific structure. How can you prove that structure doesn’t introduce all sorts of bad cases?Don’t worry about all the errors. I doubt this one is fixable.
 I have given this question (and my paper) to a PhD who works heavily with probability and has a track-record of accomplishments. Also, I have another idea on empirical verification. So hang on...
 > The proofs he cites are not applicable to his since he has a fixed permutation.This is something I'm concerned about (hence https://news.ycombinator.com/item?id=22451307), but I lack the know-how to demonstrate the difference.> I get time (and remember) later, I've written some SAT solver code to explicitly find holes in PRNGs, and I'll try to dig it out and analyze the cycles in this fixed permutation.Thanks, I would love to see this. Can I expect to see the results on your blog or here or somewhere else?Presumably you've done this for other PRNGs in the past. If you've got links to share, I'd like to take a look, and hopefully learn a thing or two. :-)
 I’ve done it for crypto work, and to find or prove impossible certain mixing functions for PRNG work. Knuth fascicles on Boolean Functions and SAT solvers give a good foundation for this type of work. Do all or most of his related exercises to get going.Unfortunately, I rarely write stuff up. Too busy doing :) Writing up well takes considerable time away from making things.
 You can easily get over 1 GB/s of cryptographic randomness with modern CSPRNGs, for example an AES-NI accelerated CTR_DRBG (basically encrypting zeros with AES-128 in CTR mode).I don't think there is any reason to use non-cryptographic PRNGs. (CTR_DRBG is deterministic if seeded with a fixed seed, so that's not an excuse either).
 In many Monte-Carlo algorithms random number generation is responsible for a significant chunk of total runtime. Modern PRNGs like Xoshiro256+ implemented with AVX are much faster than any CSPRNG.
 Yes indeed. In fact people often use things like Sobol Sequences
 Not so much because they are fast, though, but because they produce more evenly distributed (less clumpy) numbers.
 Yes but this (sometimes/often) means you actually need less paths in your MC simulation and therefore makes it faster overall. Obviously depending on what you're trying to simulate and what the underlying stochastic processes are etc.I guess my point was there are often legitimate uses for things that are random-like but far from crypto random.
 I found a figure of 0.4 cpb for xoshiro256++. AES-128-CTR using AES-NI runs at over 10 GB/s per core on a 3900X. That's approximately the same throughput. Of course, performance embedded in a real simulation may differ wildly, I expect xoshiro is much, much friendlier to frequent invocations generating a block each than AES, so to reach that throughput with AES you'd likely need a not-so-small buffer, which eats into L1 and L2 cache budgets.
 I suspect that cpb is without using vector operations. AVX-512 lets you run 8 generators in parallel, and you can do even better with CUDA.
 .25 cpb (1cycle/32bit) for xorshift and pcg with avx512 in here: https://lemire.me/blog/2018/06/07/vectorizing-random-number-...
 There are parallel AES ops on new CPUs too though. Not sure if that figure is already using those.
 `````` // Romu generators, by Mark Overton, 2020-2-7. // // This code is not copyrighted and comes with no warranty of any kind, so it is as-is. // You are free to modify and/or distribute it as you wish. You are only required to give // credit where credit is due by: // (1) not renaming a generator having an unmodified algorithm and constants; // (2) prefixing the name of a generator having a modified algorithm or constants with "Romu"; // (3) attributing the original invention to Mark Overton. `````` So, is it copyrighted or not? If not, then you can't require credit as described. Also, it's the first time I see a license that dictates how I should name my variables.
 Schroedinger's license. It's both copyrighted and not, and you won't know until a court decides.The more likely outcome though is that it's copyrighted, because things just are copyrighted unless very clearly stating otherwise. CC0 needs three pages to explain just how "not copyrighted" the associated work is.
 "This code is not copyrighted" sure sounds like it is clearly stating otherwise
 But then the author makes demands, which they could only make if it -was- copyrighted. So it does seem like it’d be up to a court to decide.
 God I wish programmers would atop trying to reinvent copyright law. The FSF has done a ton of work examining copyright law in various jurisdictions. Many places have no notion of "public domain"
 I'm easy-going about this. I want anybody to be able to copy and modify the code, but not claim credit for inventing it. - Mark Overton
 All copyright licenses (except CC0) require attribution for the work in all copies (verbatim and derivative). Please just use something like MIT, BSD, or Apache-2.0 (or the GPL family if that's your preference) which have clearly understood legal language and are known free software licenses.Personally I would never use or depend on code that has a license its author wrote themselves. It's just too risky and other people will be less inclined to use whatever project I wrote that uses the weirdly-licensed code.
 I've changed the license to Apache version 2.0. Thanks for suggesting it!
 Thank you!
 Boost requires attribution in the source form (you can't remove the copyright notice from the source repo when redistributing). The second paragraph of the license:> The copyright notices in the Software and this entire statement, including the above license grant, this restriction and the following disclaimer, must be included in all copies of the Software, in whole or in part, and all derivative works of the Software, unless such copies or derivative works are solely in the form of machine-executable object code generated by a source language processor.It's similar to MIT or Apache-2.0 in that respect. As for Unlicensed and WTFPL, sure they're both effectively the same as CC0 (though I would personally be cautious about removing copyright notices from a WTFPL codebase).
 Interesting, but I can't think of any situations where I'd take chances with probabilistic periods in exchange for a slight speed boost over something with a known period and good output (e.g. a large enough mcg/lcg where you drop the low bits on output, or pcg -- these are all plenty fast and can pass the same tests).
 For Monte Carlo eg in finance, you’d definitely take the speed improvement. This thing passes Big Crush and PractRand - it’s going to compute the approximate value of your three factor Power Reverse Dual note just fine.
 > This thing passes Big Crush and PractRand... for the seeds that were tested.In theory, you can get unlucky and end up in a short cycle, and it will not pass these tests.That's why I'm not comfortable with it, when there are generators with a known, guaranteed period. They simply do not have this failure mode.
 That's why I pointed out that the probability of a too-short cycle or sequence-overlap is no higher than that of randomly selecting one snowflake out of all snowflakes in Earth's history. Also, the paper has a graph and accompanying discussion of what happens with the shorter cycles.
 The probability of the problem occurring is vanishingly small - much smaller than a bit flip due to a cosmic ray in the machine computing it, leading to a wrong answer or a crash. And even that doesn’t matter, because the computations are repeated so often.Again, the application determines what the most suitable PRNG is.
 This looks pretty interesting and may well be an improvement over PCG[1].For non-simulation purposes, however, using a fast-key-erasure[2] ChaCha20 or AES-256-CTR RNG has served me well enough, and doesn't leave me worrying if maybe someone could abuse the RNG for whatever purposes (e.g. cheating in a game by using RNG manipulation).
 > This looks pretty interesting and may well be an improvement over PCG[1].Like this website, the PCG website also makes rather bold claims and has no peer review to back it up. What's the deal with this?
 Good question.I wish, as in the romantic ideal of science, a consensus would emerge of what “the best” PRNG are, and how to choose among them. Instead, you seem to have many academics promoting their own ones (Vigna with the xoroshiro family, O’Neill with PCG, now this guy Overton with Romu).And in effect every programming language has to go and make sensible default choices, while the experts in PRNG refuse to make unambiguous recommendations.(FWIW, I’d advocate to have, by default, a cryptographically secure PRNG, and offer a choice of faster ones (like the three mentioned above) that have to be chosen explicitly, maybe naming them InsecureXYZ, for applications where speed is paramount.)
 I doubt there's ever going to be consensus. That's because requirements are generally ambiguous, and you make tradeoffs based on them.
 Agreed, but even a simple flowchart/decision diagram is not given (if you require this, choose that). At its most basic, we have cryptographically secure & unpredictable, vs. fast: those two should be offered in the standard library of any language, IMHO - not the Meraner Twister.
 I agree with you, and would say that the situation is even a bit worse. We're not really seeing individual PRNG's vying with each other, but rather families, each with a wide range of configuration parameters and tradeoffs.I would like to see a systematic comparison, some way of arriving at an answer to the question of which of these is "best."
 > Instead, you seem to have many academics promoting their own onesI suspect that large part of the problem is that there are not all that many academics in the field, promoting their own or otherwise. Ultra high speed PRNGs are just a bit of a niche, and the widely used stuff is already good enough for many industry uses.
 The PCG website is peer review. There are explanations and analyses of numerous algorithms.O'Neill also uses the same methods to study PCG.
 No peer review.There are many bold claims on this website. But you don't screw around with RNGs: they need to be right. Until this author's claims have been verified, I would not use this algorithm.
 Right?> In effect, Romu generators are infinitely fast when inlined.What the hell does "infinitely fast" even MEAN in this context? They can be parallelized easily? The computational cost amortizes well? That it literally requires zero operations to generate output?This may be something interesting if it packs a lot of the features claimed. But there are a LOT of ill-defined claims on the website, and a fair bit of time spent pointing to the state of current research in OTHER algorithms/proposals as if open questions are an automatic disqualification ("[H]ow can you know whether such a generator has enough capacity for your large job? You don’t know.").I don't think the guy is a crank, per se-- just excited. But this is the sort of stuff that makes me think that he hasn't done the requisite level of research required before claiming a breakthrough.
 > What the hell does "infinitely fast" even MEAN in this context?It's nonsense, but their argument seems to be that instruction level parallelism allows the RNG's user to keep executing simultaneously while the RNG pre-computes its next output. So the application doesn't have to delay execution to wait for the RNG to spit something out.
 Well that clearly assumes that the user code is not utilizing most of the issue bandwidth or rather that the unutilized issue bandwidth is sufficient to run the RNG and that there are enough EUs left over to actually run it. In that case you do indeed get operations "for free" (in terms of wall-clock time), because your extra operations are exactly contained within the increased IPC.It doesn't mean that it is infinitely fast. It does however mean that in this particular case it adds no latency for adding random number generation; an infinitely fast algorithm would also add no latency for random number generation. Doesn't mean one implies the other.
 Yes, the output latency is zero clock cycles. The paper contains ILP tables that detail what happens in each clock cycle. The generator computes its next output while the application is running. But you are correct in that the statement (and ILP table) assumes that the application is not using all available issue-slots.
 What's the advantage compared to this:`````` static uint128_t state = 1; // can be seeded to any odd number static inline uint64_t next(){ return (state *= 0xda942042e4dd58b5ULL) >> 64; } `````` from http://www.pcg-random.org/posts/on-vignas-pcg-critique.html?
 RomuDuo (same state size) appears to be slightly faster.
 I am not aware of any architecture which allows for 128-bit multiplication.
 Both CLang and GCC do it just fine. CLang even optimizes away unnecessary steps when upper half of one multiplicand is all zeroes like in this example.
 15 years ago I have observed that gcc on i386 would optimize away unnecessary steps of 64bit multiplication so one would assume that it is capable of doing the same for 128bit multiplication on 64bit platform and that thus it is not exclusively llvm/clang feature.
 SSE2 introduced native 128-bit types (__m128i), and GCC and Clang both have their own implementations that may be emulated in software or SSE instructions, depending on support.
 Are you sure? AFAIU, __m128i isn't for 128-bit arithmetic, it's simply a convenience type for packing 8-bit, 32-bit, or 64-bit values for SIMD operations.GCC and clang do have a 128-bit integral type, __int128, but arithmetic operations are synthesized inline, and judging by the generated assembly don't even make use of SIMD at all.Note that an unintended consequence of C99's uintmax_t made the introduction of a standard 128-bit integral type impossible on platforms that cared about ABI compatibility, an effect somewhat contradictory to the original intention of improving portability and code integration. A future C standard will loosen some language and introduce some new facilities to help promote the introduction of larger integral types. At that point Unix platforms may begin typedef'ing __int128 to int128_t.
 There aren't many compelling reasons to use insecure RNGs. In very few applications is the RNG your bottleneck. In many applications there are negative consequences to your RNG being compromised.
 > In many applications there are negative consequences to your RNG being compromised.Witness the security researcher's mindset. RNGs aren't just used in security applications.A large chunk of RNG needs are in simulation. Simulation has no need for security guarantees. It does need to be as fast as possible. A great many secure generators are pokey.
 There’s always the danger people misuse rng output unfortunately, see math.random() being used for cryptography almost incessantly.
 Yes, but I don't see this as a good reason to drop all fast but not cryptogaphycally secure prng from which simulations can gain an advantage. Every exising tool can be misused, not providing the tool at all is an extreme measure that in this specific case I think wouldn't even solve the problem: if a programmer does the mistake to use a non crypto secure prng for cryptography is very likely that he will also do other mistakes like reusing nonces or leaking memory or other less trivial errors. What I think is important is whenever somebody proposes a new prng to clearly state if it's cryptograpycally secure or not.
 No, there’s not for most applications where you don’t need determinism.
 re the ILP argument ("app continues running with no delay because it already has the random number it needs"): can't any RNG take advantage of that by caching the next value to return? ie. turning this`````` state_t s; uint rng_next() { X; return Y; } `````` into`````` state_t s; uint next; uint rng_next() { uint n = next; X; next = Y; return n; }``````
 I mention this idea in the paper (which is linked at the top of the website). The problem is that this technique increases register pressure when the generator is inlined, which in turn increases spills, which reduces performance.
 Nice. RNGs are getting better and better.Only minor nit: would be nice if they had ready to compile libraries. The ROTL macro is missing, and useful wrappers are missing too (like correct limiting to a smaller range) Of course it's trivial to implement, and I like the simplicity of the algorithm to copy it into programs.
 > The ROTL macro is missingIt's missing in the landing page, but the paper [1] linked at the bottom of it provides two implementations:`````` // C #define ROTL(d,lrot) ((d<<(lrot)) | (d>>(8*sizeof(d)-(lrot)))) // C++ template inline uDataT rotl (uDataT d, unsigned lrot) { return (d<>(8*sizeof(d)-lrot)); } `````` [1] http://www.romu-random.org/romupaper.pdf
 If you need the speed of this PRNG, you probably don't want the overhead of calling a library function. Instead you would implement it in place with AVX-512 or equivalent for the platform you're running on.
 Function calls are very cheap on modern CPUs. The same ILP argument the author makes applies in most cases.And both vectorization and inlining works fine with a function with modern tool chains that do LTO.
 For a second I thought this was a "random number generator in your USB port!" ala TOMU/FOMU...
 the wyhash rnd uses the similar trick. LCG are only for old 32 compact, on 64bit larger state and hash-like mixers are needed/much better. Just look at the graphics.
 Interesting

Search: