(Similar incentives exist for hash functions, by the way.)
Popular RNG-bound benchmarks:
* "fasta" from the Benchmarks Game  (thankfully, this one mandates use of its own RNG, though it's vulnerable to fast-forwarding as demonstrated in )
* "Perlin noise"  (almost entirely RNG performance bound)
* From SunSpider (this one being particularly relevant to V8): string-validate-input , string-base64 
That was for primitive types; it may well be different for types where comparison is more costly. And of course, if you need protection from algorithmic complexity attacks, you will need to take that into account.
Using associative arrays implemented via chaining, one could have a bit in the associative array header that indicates if the associative array uses the fast keyed hash or siphash (or another secure hash). When inserting an item, if one finds that the chain length exceeds a certain limit without the load exceeding the resize threshold (indicating poor hash distribution), one could rehash all of the keys using siphash.
An associative array using open addressing could do something similar, switching hash functions if the probe sequence got too long (analogous to a chain being too long).
Alternatively, if using open addressing, one could use something like cuckoo hashing. A very fast keyed hash could be computed, and upon a collision, siphash could be used, with a probe sequence similar to the one used by Python dicts. Though, if your use case has lots of missed lookups, then performance will be better just using siphash. Of course, one could use counters to detect this condition, set a bit in the associative array header to indicate all future lookups should just use siphash, and rehash all of the existing keys.
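To make the chaining idea above concrete, here's a rough sketch (JavaScript, purely illustrative; fastHash(seed, key) and sipHash(seed, key) are stand-ins for real keyed hash functions returning unsigned integers):

    // A chained table that flips a header bit and rehashes with SipHash when a
    // chain gets suspiciously long at low load. Sketch only, not real code.
    const MAX_CHAIN = 8;        // a chain this long at low load looks like flooding
    const LOAD_FACTOR = 0.75;   // the normal resize threshold

    class AdaptiveChainedMap {
      constructor(seed, capacity = 16) {
        this.buckets = Array.from({ length: capacity }, () => []);
        this.size = 0;
        this.seed = seed;        // should come from a CSPRNG in practice
        this.secure = false;     // the header bit: false = fast hash, true = SipHash
      }
      bucketOf(key) {
        const h = this.secure ? sipHash(this.seed, key) : fastHash(this.seed, key);
        return this.buckets[h % this.buckets.length];
      }
      set(key, value) {
        const chain = this.bucketOf(key);
        chain.push([key, value]);
        this.size += 1;
        const overloaded = this.size / this.buckets.length > LOAD_FACTOR;
        if (!this.secure && !overloaded && chain.length > MAX_CHAIN) {
          this.secure = true;                     // poor distribution at low load
          this.rebuild(this.buckets.length);      // rehash every key with SipHash
        } else if (overloaded) {
          this.rebuild(this.buckets.length * 2);  // ordinary growth
        }
      }
      rebuild(capacity) {
        const old = this.buckets;
        this.buckets = Array.from({ length: capacity }, () => []);
        for (const chain of old)
          for (const [k, v] of chain) this.bucketOf(k).push([k, v]);
      }
    }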
But yeah, data structuring is ultimately really workload specific. Any solution a language provides will always be suboptimal for tons of use cases. The solution you describe sounds good for a "never think about it" solution that works for 85% of cases. Rust's solution kinda necessitates more thinking more often, but I think it makes it more applicable for more use cases if you're willing to do that little bit of thinking.
While SipHash perf is generally quite good, it's still not as fast as Fnv on small inputs or XxHash on large inputs. As in, 4x slower . FarmHash has great perf if you can invoke it in the way it wants, but doesn't do well in a streaming context.
It is intimately tied up with VMAC. The title of the 2006 paper that introduced them is "Message authentication on 64-bit architectures". Google scholar says it has been cited 32 times, while "SipHash: a fast short-input PRF" has been cited 43 times.
> As far as I know, the state of the art for "fast and secure" is still SipHash 2-4 or 2-3.
SipHash seems to me like a good choice.
SipHash has some security properties that Vhash does not, although Vhash-based-VMAC has similar security properties to SipHash, I believe. However, for hash flooding attacks, if the hash values themselves are not directly exposed to the attackers and there is enough noise in the response latency that timing attacks are difficult, I do not know what benefits SipHash has.
"Faster 64-bit universal hashing using carry-less multiplications" has some benchmarks that show Vhash as faster than SipHash, substantially so on long input.
Speed-wise, an iterated string hash using Dietzfelbinger-style multiply-shift hashing is also substantially faster (3-10x) than SipHash on both large and small inputs in my testing, and it needs about 20 lines of code to implement. I haven't written any Rust in a while, but I'll try and send you a patch to the repo you linked.
The benchmarks from the "Faster 64-bit ..." paper are available at https://github.com/lemire/StronglyUniversalStringHashing. The iterated string hash I referenced is https://github.com/lemire/StronglyUniversalStringHashing/blo....
The repo's a bit of a commented-out mess (I was trying to get a bunch of broken impls working before I had to get back to real work), so let me know if you have any trouble. Always happy to help people learn the language. :)
I'm Gankro on github and the #rust IRC channel.
"but who cares about complex type structures, right? Ima just go ahead and post this as if that didn't matter."
Part of the inheritance ;-)
>> though it's vulnerable to fast-forwarding <<
Is that a vulnerability given the way it is used in the benchmarks game?
Someone then asked whether it is also possible in Chrome and Node: https://github.com/fta2012/ReplicatedRandom/issues/2. After examining the MWC1616 code I saw the same “two concatenated sub-generators” problem explained in this post. The implication of it is that you can brute force the top and bottom 16 bits independently. So it is also easy since that’s just doing 2^16 possibilities twice! Code: https://gist.github.com/fta2012/57f2c48702ac1e6fe99b
GUIDs and UUIDs are guaranteed to be unique not simply because of their length, but also how they're constructed (MAC address, timestamp, and a random element).
So even if the random number generator does loop back around, it still won't ever generate the same GUID/UUID even on the same hardware, and on other hardware it is more unlikely still (due to the MAC address).
* UUID generation still requires a good (CS)PRNG and should be vetted the same way you'd vet a (CS)PRNG.
* UUIDs are one-size-fits-all 36 character base-16 encoded strings. If you want a short random identifier that you can use in a URL (that people might need to type on a phone or something) they're not ideal. Re-encoding them and/or truncating them is as hard and error prone as just generating the randoms yourself.
* Researching a good UUID library and maintaining a dependency is harder than vetting the 10 lines of code required to do it yourself. Without a trusted standard library solution it's unclear which library to use (maybe not as true today as it was a couple years ago when this happened)
I'm guessing that UUIDs aren't standardized because the standards process is browser-centric and it's low priority in that context... but that's just a guess.
> we like using randomly generated identifiers.
Why? Why do your identifiers need to be cryptographically secure at all? Why do they even need to be random? Unless you're using them AS KEYS (your use of Math.random() suggests you aren't), "cryptographically secure" doesn't even have a valid context.
It seems like UNIQUE would suffice, which brings me to:
> Researching a good UUID library and maintaining a dependency is harder than vetting the 10 lines of code required to do it yourself
Well.....except that if you had done the former, you wouldn't have had to analyze collisions because, you know, they wouldn't have happened.
And if you're not using identifiers AS KEYS (which would be horrible practice anyway, since it's apparently output to the logs), why does the library need to be kept "up to date"? Seems as long as it's generating unique IDs, you're good to go.
I mean the math was definitely an interesting read, but it seems like it was all for naught.
Again, I didn't want to come off as rude or anything, I just happen to agree with the top level comment.
> Why? Why do your identifiers need to be cryptographically secure at all?
They don't have to be cryptographically secure. They just need to be unique. Random identifiers are easier to generate than deterministic/semi-sequential identifiers, which would require some form of coordination, which is a pain in the ass and a point of failure. We switched to a cryptographically secure PRNG because it is higher quality than the non-CS alternative.
> Well.....except that if you had done the former, you wouldn't have had to analyze collisions because, you know, they wouldn't have happened.
You're assuming a UUID library that didn't just use Math.random() under the hood. Many libraries at the time of this incident did. Either way it would have been work to find one we could trust, vs. 10 lines. Also, UUIDs are 36 characters and not suitable for some use cases like short identifiers that will appear in URLs (which might need to be typed on phones, etc).
> why does the library need to be kept "up to date"?
We would have had to update it when someone realized it was using Math.random() under the hood, for instance. In general there's a certain amount of maintenance involved with any dependency.
> why couldn't you just build a dead simple add-on/module in C++
> it was all for naught.
The random identifier generation was relatively irrelevant / just a fun narrative around the real story, which was about the subtleties of PRNGs, doing your homework, and Math.random() being crappy. If you don't think that use case is legitimate, that's fine. There are lots of other use cases where Math.random() is problematic, like shuffling an array (for an unbiased shuffle of a length n array you need a cycle length of n!).
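For the shuffle case, the fix is cheap if your platform exposes a CSPRNG. A minimal Fisher-Yates sketch in Node, assuming crypto.randomInt is available (it returns an unbiased integer, unlike scaling Math.random()):

    // Fisher-Yates shuffle driven by Node's CSPRNG-backed crypto.randomInt.
    // Just an illustration of the point above, not the code from the article.
    const crypto = require('crypto');

    function shuffle(array) {
      for (let i = array.length - 1; i > 0; i--) {
        const j = crypto.randomInt(i + 1);           // unbiased index in [0, i]
        [array[i], array[j]] = [array[j], array[i]]; // swap
      }
      return array;
    }

    console.log(shuffle([1, 2, 3, 4, 5]));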
UUIDv1 is pretty much your best bet here because it's the least pain in the ass. 36 chars are hardly any less suitable for the use cases you mention than 22 chars. UUIDv1 can survive on a poor PRNG, so your reason for updating library isn't really valid. Sure there's a small amount of maintenance, but I'd expect a developer with solid experience and knowledge to almost immediately understand that (even) a 10 line homebaked solution is likely to cause more issues than a standardised algorithm developed specifically for this purpose.
As for your other use cases, CSPRNG! If you need something to be unbiased, a PRNG just doesn't work. In fact, PRNGs pretty much never work for anything that needs anything. They exist for convenience only.
"To put these numbers into perspective, the annual risk of a given person being hit by a meteorite is estimated to be one chance in 17 billion, which means the probability is about 0.00000000006 (6 × 10−11), equivalent to the odds of creating a few tens of trillions of UUIDs in a year and having one duplicate. In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%."
There seems to be some sort of collective delusion that UUIDs magically appear somehow. Most UUID libraries are very small packages that contain code nearly identical to the code we're using. Many of the libraries that existed at the time this problem occurred actually used Math.random() under the hood. Moreover, we need our code to produce variable length identifiers for different use cases. Since we already need it, it's sensible to re-use it for a case where UUIDs might also be used.
In some places (when we're producing much shorter IDs, for instance, or in extremely critical sections of code) we do check the database for duplicates. In our Java-based accounting system we even use UUID1s for transaction IDs. I understand the trade offs. The code in question was being used to generate correlation identifiers for log tracking. The performance hit of checking for dupes would have been a waste.
PRNGs work fine for many use cases. Please cite some source or at least give me some reasonable argument to believe otherwise.
Most solutions inject the HTTP header in the frontend reverse-proxy to avoid that kind of issue.
Feed the whole thing into sha256 and take the last X bytes where X is the amount you need for your shortener. If you can show that the last X bytes of sha256 have a chance for collision due to input, there is a lot of money in it for you. One of the nice things about a cryptographic hash is that all of the output has to be completely unpredictable based on the input.
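Something like this, as a Node sketch (the number of bytes kept and the output encoding are arbitrary choices here):

    // Hash the input with SHA-256 and keep only the last few bytes for the
    // short identifier, as described above. Sketch only.
    const crypto = require('crypto');

    function shortId(input, bytes = 6) {
      const digest = crypto.createHash('sha256').update(input).digest();
      return digest.subarray(digest.length - bytes).toString('hex');
    }

    console.log(shortId('https://example.com/some/long/url'));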
What makes you believe this is the case?
Hashing a counter is actually a variety of CSPRNG. Basically you're re-seeding a hash-based PRNG with whatever PRNG the UUID code uses at each run. It's well known that a PRNG cannot have more entropy than its seed. Hence, it will at best have the same entropy, and probably have slightly less.
Google's shortener is also not very good as an implementation of a unique short identifier for URLs.
Any solution that can produce both an I (capital i) and an l (lowercase L) in the same string is flawed, because of the variety of fonts in the apps where a URL can appear (SMS, Facebook, WhatsApp, browsers, a Nokia 3310...).
Do not assume that UUIDs are hard to guess; they should not be used as security capabilities (identifiers whose mere possession grants access), for example. A predictable random number source will exacerbate the situation.
Who uses UUID1? The vast majority are UUID4 and they are pure random.
Specifically with Postgres I use the v1mc version as it provides the added benefit of using "a random multicast MAC address instead of the real MAC address of the computer".
True, but in crypto that's not the only thing you care about. Usually you want random numbers that are both hard to replicate and (this was what was missing here) hard to predict. MAC and timestamp do nothing to reduce the predictability of the RNG.
Also, going back to the speed debate: there are use cases where you don't care about these aspects at all, and the cited Perlin noise, often used in graphics, is one of them. If anything I probably want the noise to be predictable so my preview in Cinema 4D or whatever looks exactly like the rendered result.
Stuff gets dangerous when implementors ignore the tradeoffs and want to make things go fast without knowing the implications.
Full commit message: "Tweak RNG."
Reviewer comment: "LGTM"
No reason for the change, no source code comment, no discussion?
* The state size is 2kb
* It fails some rudimentary statistical tests
* The output is trivially predicted
* It isn't particularly fast
* It lacks multi-stream support
If you need a CSPRNG use ChaCha20 or AES. For simulation use PCG.
MT is far from a "bad generator". It doesn't pass a few stringent TestU01 tests, which is the case for a number of very well regarded generators.
Simple PCG PRNGs are fast and pass TestU01 BigCrush. Someone could conceivably cook up a new battery of tests that it fails on, but we'd need to reevaluate the whole zoology of RNGs at this point.
Edit: Sorry, I was playing loose classifying MT's BigCrush failures as rudimentary. That said, I think failing any part of BigCrush should disqualify a PRNG from use.
Would that include the PCG family, then?
When it comes down to it, PRNG quality gets assessed empirically. Why would you ever use a generator with 2kb of state when faster & smaller generators like PCG or Xorshift* pass the same statistical battery?
You can also write proofs about their behavior to ferret out the lousy ones. For example, we can demonstrate that PCG is full period. Consider the inner LCG:
x1 = (ax0 + c) % m
For a power-of-two modulus m, this recurrence is full period exactly when c is odd and a ≡ 1 (mod 4) (the Hull-Dobell conditions), which PCG's constants satisfy, so the underlying state walks through every value before repeating.
If you look at the permutation function, it's also provably unbiased by counting. Since all possible bit sequences are fairly represented in the PCG state we know that all combinations of permuting bits appear with all possible output bits.
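You can sanity-check the full-period claim exhaustively for a toy-sized LCG (this is not PCG itself, just an illustration of "proof by counting"):

    // Count how many steps it takes an LCG to return to its seed, or report
    // Infinity if it never does within m steps (seed not on a full cycle).
    function stepsToReturn(a, c, m, seed = 0) {
      let x = seed;
      for (let steps = 1; steps <= m; steps++) {
        x = (a * x + c) % m;
        if (x === seed) return steps;
      }
      return Infinity;
    }

    console.log(stepsToReturn(5, 3, 256)); // 256: full period (c odd, a % 4 === 1)
    console.log(stepsToReturn(6, 3, 256)); // Infinity: 0 isn't even on a cycle here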
contains a nice comparison table.
Wikipedia entry for MT specifies some of the problems.
Comparing to ChaCha20, which is a CSPRNG, is a little apples/oranges.
But I know that MT is a very good algorithm and using it has a certain pragmatic appeal - you make a lot of people's lives easier because when they Google to learn more about your generator they'll find good information quickly.
PRNG design is an empirical science. We can implement a generator and throw it into a statistical battery. If one passes, we've got a constructive proof of goodness. Any of us can run BigCrush and verify the results. Unlike CSPRNGs, we don't need to sit around and wait for the results of cryptanalysis.
This specific RNG follows a traditional approach (as in the C or Java standard libraries): have one algorithm in the standard library that is "as fast as possible, as bad as possible." Historically that meant even "just multiply by a constant, add another constant, return. Bad? Yes, we know." So it's still at the level of "two multiplies and two or three adds."
Traditionally, those who expected specific properties, like the author of the article, would use an implementation they added to their own code (for example MT, if it fits) when they cared about speed and wanted exactly the properties they knew they needed, or, if they didn't care about speed, they would use some much "slower" but "much better" library.
Maybe the tradition can be broken in this specific case; I really don't know which benchmarks define what's acceptable here. Note also that any implementation must be verified: the mere fact that some implementation is available somewhere doesn't mean it does what it is expected to do. The devil is in the details here.
One should use a combo of a sequential component and a random component. When you have multiple servers, make them combine a sequential component, a unique system identifier and a random number. In this case, you can now use the faster PRNGs which will also give you speed while ensuring uniqueness.
Another takeaway from this problem is that the usual multiply-and-floor method has a serious flaw: it uses only the highest bits, which may not be good enough for the default fast PRNGs. A better method is to carry forward the remaining bits from the floor using a modulo function. The default fast PRNGs rely on all of their bits, and simply throwing away a big chunk is not going to help. Multiply-and-floor might be more useful in a scenario where you are using better PRNGs. Even then, those PRNGs are slower, and it is not wise to simply throw away the hard-earned randomness.
Good PRNGs have equivalent entropy at each bit. With a good PRNG (even a non-CS PRNG) you shouldn't need to mix entropy to do scaling. You should still do rejection sampling if you care about bias. It looks like a good scaling method might be added to the ECMA spec as part of the standard library thanks to some awesome people at Google.
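For reference, rejection sampling is only a few lines; a sketch using a CSPRNG as the bit source (the 32-bit draw size is an arbitrary choice):

    // Unbiased integer in [0, n) via rejection sampling over 32-bit draws.
    // Draws in the "ragged" tail above the largest multiple of n are rejected
    // and redrawn, so every result is exactly equally likely.
    const crypto = require('crypto');

    function unbiasedInt(n) {
      const range = 0x100000000;                  // 2^32 possible draws
      const limit = range - (range % n);          // largest multiple of n <= 2^32
      let x;
      do {
        x = crypto.randomBytes(4).readUInt32BE(0);
      } while (x >= limit);                       // reject the biased tail
      return x % n;
    }

    console.log(unbiasedInt(52));  // e.g. picking a card without modulo bias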
Sorry, that's just not sensible. You always want to add node IDs and timestamps (provided you hash the final output so as not to leak details about your system) in case your generator fails. Why would you not want another layer of safety? It also helps protect against the case where an attacker might gain something by being able to predict the next ID in the sequence.
It absolutely is sensible. Adding node IDs and timestamps leaks information. If you add a hash function now you have two problems -- the hash of a random value is actually a _new_ random value with entirely different characteristics. You're falling into another trap. Which is why you might not want another layer of safety -- you're introducing another layer of complexity and another place to fuck up. You had good intentions, but in the scenario you described you've just introduced an additional point of failure with limited upside. Why wouldn't you do the math and implement the simpler solution using a generator that won't fail?
As I've said elsewhere, the likelihood of collision with our identifiers is lower than the uncorrectable error bit rate of a HDD. In other words, it's more likely for a perfect deterministic method to generate a collision because there was a hardware failure persisting it to disk. Or, more pragmatically, the risk is far below the level that any sensible person should ever be worried about.
It seems your original function made the same assumption of V8's Math.random. All PRNGs fail at some point, even CSPRNGs. You may as well write your code accordingly, with less optimistic assumptions.
If you're not going to be adding layers of safety, and if you're going to keep insisting that your PRNG "won't fail" then I guess it's only a matter of time before you will have to repeat the same mistake.
As other commenters have pointed out, you should have written your function in such a way that it does not place a critical reliance on any single component.
I don't see the need for belt-and-suspenders here and there are legit reasons not to add host/time to an identifier. That's why we have UUID1 and UUID4.
It's one thing to trust "peer reviewed formal mathematical proofs".
It's another to assume that these are perfectly implemented.
On the other hand, enough people make this mistake that APIs should just use a CSPRNG for "random()" and offer a "fastRandom()" for those who need speed but not "secure" random.
1. Generate a pool of 2048 bytes or so of entropy at startup using window.crypto.getRandomValues or crypto.pseudoRandomBytes. To this, append Date.now() and Math.random().toString() just in case your crypto method fails badly. Then call SHA256 on this to get your starting entropy distilled into 32 bytes. If some of the 2048 bytes of entropy are not high quality, it won't matter as much since you compress it into 32 bytes (i.e. it's worse if you just ask for 32 bytes from getRandomValues).
2. Initialize a counter to 0.
3. Each time you need an ID, increment the counter (handle wrap-around if necessary), and take a SHA256 hash of (counter, 32 bytes entropy hash obtained in 1. above, Date.now(), Math.random().toString()). Then truncate and encode this using whatever character set as needed.
This way you don't drain out your cryptographic entropy pool every time you generate an ID. You also don't leak any details as to your system time, startup time, or your current position in Math.random() (which would allow someone to predict the next Math.random() result) since the final ID is hashed.
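A rough Node-flavoured sketch of the scheme above, with the sizes and output encoding picked arbitrarily for illustration:

    // Sketch of the counter + hashed-entropy-pool scheme described above.
    const crypto = require('crypto');

    // 1. Distill a startup entropy pool into 32 bytes.
    const pool = Buffer.concat([
      crypto.randomBytes(2048),                   // main entropy source
      Buffer.from(Date.now().toString()),         // belt-and-suspenders extras
      Buffer.from(Math.random().toString()),
    ]);
    const startupEntropy = crypto.createHash('sha256').update(pool).digest();

    // 2. A counter, incremented per ID.
    let counter = 0;

    // 3. Hash (counter, startup entropy, time, Math.random()) and truncate.
    function newId(length = 22) {
      counter = (counter + 1) >>> 0;              // handle 32-bit wrap-around
      return crypto.createHash('sha256')
        .update(String(counter))
        .update(startupEntropy)
        .update(String(Date.now()))
        .update(Math.random().toString())
        .digest('base64url')                      // needs a reasonably recent Node
        .slice(0, length);
    }

    console.log(newId());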
If you need a predictable stream of uncorrelated bits, this has the benefits of being simple and trivially seedable.
In case the entropy pool is drained, one would still want to get more than 128 bits from urandom and then hash this with SHA2 to get the 128 bit key, right?
I had in mind something portable to the browser without requiring AES there, but will try it out on the server.
The primary reason CSPRNGs rekey themselves periodically is for forward security, so that if your machine gets hacked, an attacker can't snarf the RNG state and predict all future numbers the machine generates.
If an attacker has access to your computer on a level where he can inspect the CSPRNG's state, you've probably lost completely and no reseeding will help you.
I would concatenate utc time in ms to the random number.
The UTC time + random number idea is interesting, but I'm not sure it's any better than totally random. Say you want a 64-bit unique id. It would take about 42 bits to store a millisecond Unix timestamp, leaving 22 bits for randomness. If you consider the probability that a given id will collide with a previously-generated one: with this scheme, you only need to consider ids generated in the same millisecond, but you only have 22 bits of randomness that can be used to avoid a collision. If you use 64 random bits, you can theoretically collide with ids generated across all of time, but the odds of collision are many orders of magnitude lower. By constraining the first 42 bits to store a millisecond timestamp, every millisecond that goes by removes 2^22 values from the possible id space (regardless of whether they're used). (Besides that, this scheme removes from the space of possible ids all of the millisecond values from before the system was created, which seems like about 2/3 of them, assuming about a 68-year span of values for Unix times, which started 45 years ago.)
That's kind of handwavy, so let's do the math. With 22 bits of randomness, the odds of any randomly selected pair of ids generated in the same millisecond colliding is 1/4194304. Assuming independently generated ids, the expected number of collisions after 4194304 milliseconds is 1. So if you have just two threads generating ids once per millisecond, you'd expect a collision in just 70 minutes. That's not great.
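For what it's worth, the arithmetic checks out:

    // Two IDs per millisecond gives one candidate pair per millisecond, each
    // colliding with probability 2^-22, so the expected count hits 1 after:
    const ms = 2 ** 22;                 // 4,194,304 ms
    console.log(ms / 1000 / 60);        // ~69.9 minutes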
It's late, though. Maybe my math is wrong?
(My argument here, really, is that using a PRNG at all sounds pretty crazy for ID generation, if you don't take steps to prevent or detect collisions)
The chances of a collision must be vanishingly small within even a second (let alone a millisecond), even if the random number generation is defective.
I didn't check your maths.
Correction - 21 chars.
With a second resolution its 17 chars.
Counting from now, the addition would only be five chars.
echo -n 0 | base64 | wc -c
Both are created for the same purpose, and GUIDs are even possible as keys in MS SQL Server (next to the standard autoincremented integer). So it's safe to say that it's pretty reliable.
Betable's scheme: 132 bits
Essentially the same size. They ended up with 132 bits because they use a 6 bit alphabet; 128 is not divisible by 6, so you would need 22 characters of a 6 bit alphabet to represent a UUID, exactly the same as them.
(note, the 6 bit alphabet is so identifiers can go in urls; the commonly used uuid hexadecimal representation is a 4-bit alphabet, so there are 32 characters and usually 4 hyphens; so 36 characters in the URL vs. only 22)
npm install node-uuid
1. Math.random, despite his insistence, isn't broken. There are different degrees of randomness that a programmer might need for their program and Math.random simply didn't suit his.
2. Their initial implementation of a critical portion of their infrastructure was so naïve, I almost couldn't believe it. It just seemed like such an anti-pattern to try to build your own UUIDs, discover (in production!) that they're not so good, and then spend days figuring out why a built-in random number generator that ships in every browser isn't so great.
2. I've addressed the standard solution / UUID question elsewhere. The way we were generating identifiers is not an anti-pattern. It's a pattern. It's the standard way to produce a random string from an alphabet. If you look at the source code of any website, from HN to Google, I guarantee you will find an almost identical piece of code somewhere. The code is simple, but that doesn't make it bad.
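For context, the pattern in question is basically this (sketched here with an unbiased CSPRNG-backed index instead of Math.random(); the alphabet and length are arbitrary):

    // The standard "random string from an alphabet" pattern.
    const crypto = require('crypto');
    const ALPHABET =
      'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

    function randomId(length = 22) {
      let id = '';
      for (let i = 0; i < length; i++) {
        id += ALPHABET[crypto.randomInt(ALPHABET.length)];  // unbiased index
      }
      return id;
    }

    console.log(randomId());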
A safe approach is to run multiple uncorrelated sources of probable uniqueness/[pseudo]randomness through a modern cryptographic hash function, and generate your "unique" string from the hash output.
I've acknowledged the incorrect assumptions / taken blame for not doing proper diligence on the PRNG. What I'm saying here is that the identifier generation technique is not, in principle, flawed. It is a common technique and it has many legitimate use cases. It's simple enough that pulling in a dependency to solve it for you is not an obviously better alternative.
True randomness and a crypto-strength PRNG are good options here, but they're not necessary. There are peer reviewed proofs that show something like MT19937 is suitable (perhaps even better suited) for this sort of task. See the deeply nested comment thread ITT for more of that debate.
The cryptographic hash acts sort of like a blender for all your random/unique-ish information sources. Counters/MACs/urandom/times/IPaddrs/etc go in, a nicely mixed value comes out.
Anyways, if you decide you want to learn more about secure hash functions and cryptography in general, you can't go wrong with Bruce Schneier's "Applied Cryptography".
I'm not sure how your link is relevant. If there's not enough entropy on the system for the kernel, there's not enough for you either.
I know MongoDB is not popular around here, but their ObjectIds are (imho) a much better approach than yours. Much more robust and less prone to errors when planets align just right. That said, I have seen worse solutions than yours and made my share of bad decisions too, so I don't judge you. This is just a technical opinion, given that this particular problem wouldn't even exist if you avoided randomness.
The point of the comparison to HDD error rates is meant to indicate that the collision probability is well within what is typically considered an acceptable risk level. There's a higher probability of much worse things happening, and I accept those risks for similar pragmatic reasons. Why should I behave any differently in this case? I do encourage anyone implementing this sort of system to do the math themselves and take whatever precautions they deem necessary. I've done so and decided that the additional complexity and space trade off of the alternative approaches make them inferior for our use cases. That doesn't mean ours is universally the best choice. That's why we have UUID1 and UUID4, for instance.
I am quite sure it would be easier to find cause for duplicates if you were not using random. Anyway, at least you caught a nice bug. :)
The probability of a collision is orders of magnitude higher than if you just grab some random bytes out of /dev/urandom and generate a type-4 UUID.
For a discussion on the benefits of your approach (a type-1 UUID) vs a random id (a type-4 UUID), check out the wikipedia article: https://en.wikipedia.org/wiki/Universally_unique_identifier
Randomness is great because it doesn't require any synchronization or access to some special token.
Also, see the warning about entropy in your linked article. There is a difference between mathematical random() and CS random().
With 2¹³² possible values, if identifiers were randomly generated at the rate of one million per second for the next 300 years the chance of a collision would be roughly 1 in six billion.
I tried using the formula for the Birthday problem, but the values are too large.
Although I got ~43 years using the one from https://en.wikipedia.org/wiki/Birthday_attack#Source_code_ex...
julia> birthday(prob, vals) = sqrt(2 * vals * -log1p(-prob))
birthday (generic function with 1 method)
julia> iters = birthday(1/6e9, 2.0^132)
julia> iters / 1e6 / 60 / 60 / 24 / 365
As a side benefit you can get by with shorter (more human readable, typeable, fits-in-a-tweet) keys because you correct the few collisions you get.
If the extra keyspace seek were ever a serious performance bottleneck I would just skip the collision check in that one spot and use a proper RNG as OP is suggesting. But how often does that really happen? Very few systems need to generate millions of unique keys a second.
At those astronomical possibilities, there are many things that can go wrong before collisions become viable. Think about it: Why bother checking, when you have a somewhat similarly egregious probability that a random bit flip occurs right after your check?
In the situations you encountered, it's probably simple and cheap, but in the cases where it's not (the situation in the article being a probable example), it can be expensive, error-prone complexity without gain.
rand() in C,
rand() in PHP,
Math.random() in JS.
Any use case where you want a poor random number generator is a niche use case and you can write your own. We should be secure by default.
How much faster do you need your RNG to be in any non-niche situation?
Interestingly according to https://bocoup.com/weblog/random-numbers Firefox is just going to use its CSPRNG once the crypto api is ready. (Bug 322529 is still in NEW state, though.)
CSRNGs make good general-purpose-use guarantees, because those who don't demand much of their RNG will enjoy "living in ignorance" of RNG trade-offs while getting the right set of guarantees so as to never accidentally generate a collision; while those who don't want the particular set of guarantees a CSRNG makes will be domain experts (i.e. people who have profiled their code and noticed that the RNG is the hotspot) who know what they need instead.
The same cannot be said if you make a non-CS PRNG the default. There, "living in ignorance" will result in disaster (as in the article); and just knowing that the provided PRNG isn't right for you doesn't imply you know what you do need.
I totally agree! Very few applications will notice if the random number generation takes 100 times more/less time. My point wasn't that CSPRNGs shouldn't be the default, it was that there definitely exist real situations where CSPRNGs are both not fast enough, and don't maximize the right kind of (domain-specific) quality.
The more vexing aspect is that many default random number generators are both slower and much, much worse than some really simple (we're talking 10 lines of C) PRNGs that out score them in every measure of quality. This is a choice that makes absolutely zero sense.
> CSPRNGs actually have slightly worse distributional properties than some much faster PRNGs, because the CSPRNG is more focused on being unpredictable.
What does this even mean? Non-uniform distribution for CSRNG is fatal. What is an RNG that has "better" distribution than CSRNG?
The actually relevant aspect is what percentage of a simulation's time is spent generating random numbers if it's using a CSPRNG. I have had this become a bottleneck in my work, and it's very rare to use CSPRNGs in scientific simulations for this reason.
> What does this even mean? Non-uniform distribution for CSRNG is fatal. What is an RNG that has "better" distribution than CSRNG?
There are distributional properties other than uniformity, and depending on how you use these numbers (e.g. Monte Carlo kernel form) these can be far more important than linear uniformity. The author briefly mentioned this in the article ("CSPRNGs emphasize unpredictability over all other measures of quality, some of which might be more important for your use case"). Look into the work of George Marsaglia, he wrote the test suites that generate the quality numbers most engineers use to evaluate both CSPRNG and normal PRNGs.
By the way, I think I've said this before, but I'll say it again. I really dislike the usage of the term 'CSPRNG' in these discussions. A CSPRNG is typically understood to be /dev/urandom---an algorithm that not only generates a long sequence of numbers, but also harvests entropy and tries to achieve forward and backward security. But it is also sometimes overloaded to mean a stream cipher, which is a conceptually simpler construct, and also able to be much faster. Stream ciphers can definitely compete in performance with things like Xorshift.
These characteristics don't just make Monte Carlo simulations converge quicker; they also do things like guarantee that your 52-card deck shuffle can actually produce the one shuffle that gives player two a royal flush on the flop in a round of Texas Hold 'em. Or, more generally, they guarantee that your Array.shuffle() method is unbiased for a reasonably sized array. In general you need true randomness and/or a long-cycle linear-transformation PRNG to make those sorts of guarantees. It's also nice that they're fast. That said, I do also agree that the scenarios in which these things matter are rather rare, and most people should probably just use a CSPRNG by default.
Do you really think a CSPRNG is fast enough for your standard library's array shuffle though? Wouldn't you rather have something like xorshift1024* that's probably > an order of magnitude faster? Maybe a CSPRNG is fast enough... I don't have all the numbers. So this is an honest question.
edit: By in theory I mean that with enough computation resources you could break them even if you didn't find some new, clever weakness. Not that you could break them in theory because a weakness could always be found.
Not really. In general, the statistical tests are chosen such that when the law of large numbers comes into force, a perfectly random sequence would pass them with flying colors. If repeated runs of a perfect random number generator were failing these with a frequency that were improbable (i.e. far outside what the central limit theorem would predict), I think it would be safe to say the universe was playing a trick on you. Many of these are very closely related to predictability in practice as well. Uniformity is a statistical test, and I think it's probably the lowest bar you can set for "unpredictable".
The fundamental issue with measuring "randomness", especially for a PRNG, is that there is no set definition of information entropy suitable for calculation. Shannon entropy requires you to have a (possibly numerically sampled) PDF. The problem with that is that if you want to find the entropy of a sequence, you can't use the PDF of all the bytes you've gotten out of the generator, since that PDF will have 1 for the exact sequence you obtained, and zero everywhere else. If you do it by multiple runs, there is no way you'd get enough data to construct an accurate PDF with a suitably large sample size. You have to use something based on a highly subdivided sequence of values, but if you only look at each value in isolation you will only determine the uniformity with some small bin size, which ignores patterns due to proximity of values in the output.
So generally entropy is calculated using a more complex probability kernel, like the ones used in these statistical tests. But none of these can possibly be perfect. The "perfect" measure of entropy is Kolmogorov complexity, which captures the minimum length of a program you would need to be able to print the given output. On top of being undecidable in general, and also unobtainable in practice, you can show an upper bound of the Kolmogorov complexity on a CSPRNG that is actually quite low. Just take the algorithm the CSPRNG uses, plus any external entropy it got (from hardware interrupts, etc.). So in the end, various kinds of statistical tests are the only ones that make sense.
The actual goal of a CSPRNG is far more specific than just simulating a truly random sequence, which is why it differs from those used in simulations. It is unpredictability, defined by trying to make it so this short program which can generate all the output is impossibly hard to obtain from just the output in practice. In a vague sense, it is like (some of) the desired properties of a hash or cipher; we can't use a perfect random source (resp. one time pad), so let's get something that will take an impossible amount of computational power/measured output to break (resp. ciphertext using the same key), and mix it in with some as true random as we can get seed (resp. IV), and then mix some more in from time to time to make it even harder (resp. changing keys). That is the reason stream ciphers, while generally not as good as /dev/urandom-style CSPRNGs, are actually closely related enough that I think it's fair to call them CSPRNGs as well. They're just ones with a far worse algorithm/entropy source.
A statistical test _is_ a distinguisher! This seems self-evident to me. When the test repeatedly fails (with an extreme p-value or whatever criteria you want to apply here for 'failure'), you have successfully distinguished the sequence from random. Also you seem to misunderstand what I meant by equidistribution---uniformity is a different property, and of course crucial. I am puzzled, then, as to why you claim some generators are more 'uniform' than cryptographic ones.
Any pseudorandom bit generator whose next bit cannot be predicted in under 2^n "operations" also passes _all_ statistical tests that require less than this computational effort. This is a well-known result by Yao. In other words, unpredictability and indistinguishability are equivalent. In the case of, e.g., AES-CTR, n = 64 (birthday-bound distinguishers force it to be 64 instead of 128).
Your philosophizing about what true randomness really is does not strike me as very useful for practical purposes. Given unbounded computational power, obviously no such algorithm is actually indistinguishable. But who cares?
I'll repeat once again that I do think people should generally default to CSPRNGs, but I really don't understand why everyone involved in crypto insists that there are no valid use cases for non-CS PRNGs / that CSPRNGs are uniformly better for all use cases when it seems very obvious to me that they are not. I don't see any reason why a good PRNG and a good CSPRNG shouldn't exist in every standard library. Am I missing something?
As to why we insist that cryptographic primitives are uniformly better? Well, that is more subjective. I personally believe that they have gotten fast enough that the set of cases where they are unacceptably slow is vanishingly small, and the trend is clearly on the side of crypto getting faster. And I can sleep at night knowing that my results are not wrong due to some unknown, unexpected, property of my generator.
Another case where a long cycle length is useful would be the sort of scenario in my post, at much larger scale. If you have thousands of machines re-seeding at every service restart and then generating sequences of millions or billions of random numbers, and you want a vanishingly small probability that those sequences will ever overlap, could you use a CSPRNG?
Re: the generator properties, do you _really_ believe that? I mean, good non-CS PRNGs usually have fairly rigorous presentations and pass statistical tests. They aren't as sexy as crypto stuff, and maybe don't get the eyeballs, but they're relatively straightforward proven math whereas all of the CSPRNG results are based on conjecture. It seems at least as likely that some unexpected property of a CSPRNG will be discovered, a la RC4.
I do believe that. RC4 is a good example, actually; the unexpected properties that make RC4 be considered completely broken for cryptography are considered irrelevant for simulations (I won't swear by this, but I'll bet that RC4 passes all of TestU01's tests). The standards are just so much higher for cryptographic primitives. I'm sure unsexy generators can be suitable to this or that application. But due to the ad hoc nature of the statistical tests used, you can't be sure that your generator is not oddly correlated with a concrete application. I've written a comment a few days ago---on a thread very similar to this one---about a real-world case of this happening .
From a cryptanalysis standpoint I think that's consistent -- the probability of generating any particular single shuffle is astronomically low, so the difference between the actual probability and 0% is probably outside the realm of what's computationally feasible. But with something like MT19937 I can show that it's at least possible for every outcome to have a non-zero probability. Perhaps this is just interesting academically, but maybe not if you're actually playing a card game, for instance?
You have already covered the distinction between guaranteed equidistribution and not, so I don't think there is much to add there. I do think the concern is mostly academic, since the difference between a subsequence that doesn't exist and one that will never be reached before the universe collapses is irrelevant to the user. I will note that Richard Brent, someone who is much better qualified than I to discuss this, has argued against the usefulness of equidistribution: http://maths-people.anu.edu.au/~brent/pub/pub240.html
In particular, if there were some 0 <= y < 2^128 such that AES_k(x) != y for all 0 <= x < 2^128, then there must also be some z such that two different values x1 and x2 encrypt to the same value, which can't happen if AES is reversible.
We fucked up by not vetting the algorithm, that's definitely the primary lesson here. Mea culpa. I'm sure you would have done things differently, but V8 is a modern system and I don't think assuming a modern PRNG was completely unreasonable while quickly getting to MVP at a new startup.
I heavily criticized you in another post for 1) your initial choice of substandard algorithm and 2) your rationalizations (e.g. Google's "good reputation").
But that is ancient history. Here you're admitting that you fucked up, which IMO you didn't admit previously.
Also, kudos for your article which kicked off this entire discussion. In that article you showed that you carefully analyzed what was happening, how you went wrong, and how you could improve.
Even more important, you did this all publicly, both your article and your responses here on HN. Everyone learns from this. I commend you for your openness.
Seems like V8's Math.random was only fine for extremely trivial cases (Google doodle Pacman I guess). Meanwhile Webkit has been using a CSPRNG and nobody seems to be complaining.
V8 is a great piece of work but it was never intended for anything beyond speeding up web pages. Why are people using node.js on the server side at all, when there are more industrial strength runtimes with better performance available?
Depends on how you're using them. If you only care about collisions, and you're not worried about an attacker trying to manipulate your system, then using a PRNG whose output is well-distributed (read: not necessarily a language's random function) is fine. The problem is most language's default PRNGs are complete crap :(
When in doubt, use unix, and /dev/urandom
Also, here is a command (to my knowledge secure) which generates a nice random 20 length alphanumeric:
LC_CTYPE=C < /dev/urandom tr -dc A-Za-z0-9 | head -c20
/dev/urandom is the same cryptographically secure PRNG as /dev/random, except that /dev/random will superstitiously block when it thinks entropy has been "consumed". (CSPRNGs do not meaningfully "consume" entropy.)
If you use /dev/random for the mythical warm fuzzies, you might also be misunderstanding other things. Don't be superstitious. Use /dev/urandom.
Technically, it is correct to say that we do consume entropy in an information-theoretic sense. However, this is not particularly relevant, as the CSPRNG is still (believed to be) secure in a complexity-theoretic sense. If it turns out that the CSPRNG is not secure in a complexity-theoretic sense, then we have bigger issues to worry about than a broken RNG.
Also, you mixed up random and urandom, but no matter. People should not code for the case where their distribution's boot sequence totally fucks up.
There are no security concerns with /dev/urandom
In brief, wifi routers need to generate unique crypto keys. The programmers started off using /dev/random, observed that blocking, and switched to /dev/urandom. The problem was that they were generating the keys at first boot, before any entropy had accumulated, with the result that many, many devices wound up with similar or identical keys.
The right answer is Linux's new getrandom syscall, whose default mode "blocks if the entropy pool has not yet been initialized", but thereafter acts like /dev/urandom. But on systems where that syscall is not available, I'm skeptical that using /dev/urandom is sane design.
To put it another way: /dev/urandom is effectively a service that can be in a "starting" or "started" state. When you, as a service, depend on another service, the idiomatic thing to do is to teach your init/supervisor daemon about that dependency. Then, instead of your service coming up and sitting around doing voodoo to your pipes/sockets to try to figure out whether your dependencies are up, your supervisor can just delay starting your service until it knows the dependent service is up.
(Or, to get really clever about it, you could just make /dev/urandom itself a socket attached to a "service" whose job is to, on startup, do a blocking-seed of the "real" /dev/urandom device, and thereafter become a dumb pipe over to the "real" /dev/urandom device. Then all your services just block on reading /dev/urandom [the socket] the first time, with the first one of them to try reading from it in fact causing the reseeding to happen. Socket activation!)
Random numbers - Wikipedia
1) Do NOT use a non-deterministic generator, you'll thank me when debugging
2) find what uniquely identifies the object and deterministically generate a string from those values
For example to identify a request to a service on the internet you can use microtime, IP and a counter and so on.
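A minimal sketch of that approach (ipAddress is assumed to be supplied by the caller; whether to hash the result so the components don't leak is a separate question raised elsewhere in the thread):

    // Deterministic request ID built from microtime, IP, and a counter.
    let counter = 0;

    function requestId(ipAddress) {
      counter += 1;
      const microtime = process.hrtime.bigint();  // nanosecond-resolution clock
      return `${microtime}-${ipAddress}-${counter}`;
    }

    console.log(requestId('192.0.2.10'));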
Arguably it's usually going to be preferable to know you're getting the same kind of random numbers across platforms anyway, so a BYO Mersenne Twister is a decent remedy.
One is linked in the post, and I have a CoffeeScript one here: https://github.com/jawj/mtwist/blob/master/mtwist.coffee
This said, I see that often only v4 is implemented, so maybe there are issues with the other variants.
If you go and look at the node-uuid commit history, it, too, used Math.random() back in the day.
Of course, on Node.js that is a different issue.
We did make an incorrect assumption that the PRNG (created by Google, who has a good reputation, and in what is probably the most popular software in the world) was high quality. That was a mistake, and we should have done more homework. However, there's no reason why the Math.random() implementation in V8 should _not_ be good enough to not have to worry about.
The analysis in this article was great engineering - something that in my experience often happens after poor engineering... YMMV
It's frustrating when you get there first and someone else hits the jackpot, but it evens out in the long run if you submit enough good stories.
I am only a little frustrated that I missed out on so much karma love (I am in my 30's now and find that these things matter less as I get older :), it was more the fact that Medium naturally appends this, so literally all submissions of a Medium article will result in reposts.
It's not a simple problem as some URLs will rely on the hash to work correctly, e.g. hxxp://www.example.com/#!/foo/bar, so you can't just strip it. I think it requires some special cases for sites known to do this kind of thing (not sure why Medium do it.)
Glad to hear duplicates is an issue you're looking at, has certainly been a problem, though I agree it makes sense to allow it in certain cases. For example I love the XV6 OS and like to see it periodically submitted again from time-to-time to see new discussion. It might even be nice to have an auto-generated list of previous submissions in this case?
As an amusing aside, this whole issue has a relationship to the story - perhaps the generated hash uses Math.random() and therefore might actually result in duplicate URLs after not so many resubmissions? ;)
Wait, let me restate that. You're Engineering the Disruption of Real Money Gaming and yet you say of some PRNG "no reason the ... implementation ... should not be good enough to not have to worry about"? You're not just wrong, in fact you're "not even wrong", that's how wrong you are. Plus that sentence has three negatives!
Jules Winnfield said it better than I could. Riffing from him, what you were doing was, compared to what you should have been doing: ain't the same fuckin' ballpark, it ain't the same league, it ain't even the same fuckin' sport.
Here's something I came up with in 60 seconds. It's far better than what you didn't want to worry about:
ID Quantique hardware RNG 
mixed with Intel RdRand 
Once again, I can't believe your defense of how badly you failed was the "reputation" that Google has. Real money is involved and you thought it was OK to rely on some crappy PRNG created by Google? For something that is at the heart of, the quintessence of what you are attempting to accomplish?
We detached this subthread from https://news.ycombinator.com/item?id=10601984 and marked it off-topic.
Not a badge of honor to have a subthread detached.
Good article. I do have a question for the author: after spending a huge amount of time writing this post, why didn't you submit a patch to V8?
Good question. As the author noted, k bits of state can support a cycle length of 2^k. You can look at that from the other direction and state it as: a cycle length of L requires at least log2(L) bits of state.
If your cycle length is L and you are using k bits of state, and k > log2(L), then in theory with a cleverer encoding of state you could save k - log2(L) bits of memory.
Whether or not I'd characterize k - log2(L) excess bits as "wasted" memory depends on just how big it is. For instance, suppose someone implemented Mersenne Twister and used twice as many bits as theoretically needed for the cycle length. Typical MT has a cycle length of around 2^20000, so needs 20000 bits of state. Someone doubling that is using an excess of 2500 bytes.
On the other hand, the generator the article is about has a cycle length of around 2^60, and so in theory needs 60 bits of state. If an implementation doubled that, it would have an excess of less than 8 bytes.
2500 bytes is enough that if I were implementing I'd probably take a good look at seeing if I could eliminate it, even if it meant making the implementation a little more obtuse. I'd be inclined to consider that 2500 bytes as wasted memory.
8 bytes? That's small enough that even on most embedded systems I'd probably consider clean and clear code to be more important than saving that memory. I'd not be inclined to consider it wasted memory.
Maximal-cycle-length generators don't produce degenerate sequences. I seem to remember generators that would have, say, one sequence of length 2^k - 349, one of 340, and one of 7, which means they fail badly if seeded incorrectly.
Noting an old article I read on FPGA design that said it "helps to design your state machines so that illegal states transition to legal ones", I assume sub-2^k generators exist that don't have such sub-sequences.
Re: submitting a patch, there were several previous issues and discussions about Math.random() being problematic that were closed-wontfix or otherwise dismissed. It seemed like the general perception was that it was "good enough." It didn't make sense to me to open another similar ticket without a good analysis and whatnot.
I meant to file another issue before posting, but I forgot. There is an open issue now, however!