
Math.random in V8 is broken - r0muald
https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d#.7p3lk9bvk
======
pcwalton
My experience with compiler development is that the incentives all align
toward making the "default RNG that people go to" _as fast as possible_ to the
exclusion of all else, including the quality of the generated numbers. That's
because people frequently write benchmarks gated on the speed of random number
generation, and those benchmarks usually don't care about the quality of the
random numbers. Sometimes those benchmarks get popular, which encourages a
race to the bottom in RNG quality.

(Similar incentives exist for hash functions, by the way.)

Popular RNG-bound benchmarks:

* "fasta" from the Benchmarks Game [1] (thankfully, this one mandates use of its own RNG, though it's vulnerable to fast-forwarding as demonstrated in [2])

* "Perlin noise" [3] (almost entirely RNG performance bound)

* From SunSpider (this one being particularly relevant to V8): string-validate-input [4], string-base64 [5]

[1]: [http://benchmarksgame.alioth.debian.org/u64q/fasta-
descripti...](http://benchmarksgame.alioth.debian.org/u64q/fasta-
description.html#fasta)

[2]: [https://github.com/TeXitoi/benchmarksgame-
rs/blob/master/src...](https://github.com/TeXitoi/benchmarksgame-
rs/blob/master/src/fasta.rs#L121)

[3]: [https://github.com/nsf/pnoise](https://github.com/nsf/pnoise)

[4]: [https://www.webkit.org/perf/sunspider-0.9/string-validate-
in...](https://www.webkit.org/perf/sunspider-0.9/string-validate-input.html)

[5]: [https://www.webkit.org/perf/sunspider-0.9/string-
base64.html](https://www.webkit.org/perf/sunspider-0.9/string-base64.html)

~~~
im2w1l
Fast hash functions can be really important for performance of hash sets and
maps. In my experience high speed is more important than a low collision rate.

That was for primitive types, it may well be different for types where
comparison is more costly. And of course if you need protection from
algorithmic complexity attacks, then you will need to take that into account.

~~~
Gankro
As long as your use case isn't worried about hash-flooding [0]. Unfortunately
too few people know this is even an attack, so it may be reasonable for
languages to default to a stronger hash algorithm for safety, but this isn't
the common case... language design is hard.

[0]:
[https://www.youtube.com/watch?v=wGYj8fhhUVA](https://www.youtube.com/watch?v=wGYj8fhhUVA)

~~~
KMag
There isn't much complexity or performance cost in optimistically using a
faster hash and dynamically falling back to siphash (or another secure keyed
hash function).

Using associative arrays implemented via chaining, one could have a bit in the
associative array header that indicates if the associative array uses the fast
keyed hash or siphash (or another secure hash). When inserting an item, if one
finds that the chain length exceeds a certain limit without the load exceeding
the resize threshold (indicating poor hash distribution), one could rehash all
of the keys using siphash.
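
A minimal sketch of that chaining variant, to make the switching logic concrete. The hash functions here are stand-ins (FNV-1a for the fast hash, a keyed FNV-1a as a placeholder for siphash), so treat it as an illustration rather than a production design:

    // Stand-in hashes: fnv1a is the "fast" hash; the keyed variant is only a
    // placeholder for siphash or another secure keyed hash.
    function fnv1a(str, seed = 0x811c9dc5) {
      let h = seed >>> 0;
      for (let i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0;
      }
      return h;
    }
    const KEY = (Math.random() * 0xffffffff) >>> 0; // per-process secret (illustrative)
    const fastHash = (k) => fnv1a(k);
    const strongHash = (k) => fnv1a(k, KEY);

    const MAX_CHAIN = 8;
    const LOAD_FACTOR = 0.75;

    class AdaptiveMap {
      constructor(nBuckets = 64) {
        this.buckets = Array.from({ length: nBuckets }, () => []);
        this.useStrong = false; // the "one bit in the header"
        this.size = 0;
      }
      _bucket(key) {
        const h = this.useStrong ? strongHash(key) : fastHash(key);
        return this.buckets[h % this.buckets.length];
      }
      set(key, value) {
        const chain = this._bucket(key);
        const hit = chain.find((e) => e[0] === key);
        if (hit) { hit[1] = value; return; }
        chain.push([key, value]);
        this.size++;
        const overloaded = this.size / this.buckets.length > LOAD_FACTOR;
        if (!this.useStrong && chain.length > MAX_CHAIN && !overloaded) {
          // A long chain at a low load factor suggests a poor (possibly
          // adversarial) distribution: flip the bit and rehash every key once.
          this.useStrong = true;
          const entries = this.buckets.flat();
          this.buckets = this.buckets.map(() => []);
          for (const [k, v] of entries) this._bucket(k).push([k, v]);
        }
      }
      get(key) {
        const hit = this._bucket(key).find((e) => e[0] === key);
        return hit && hit[1];
      }
    }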

An associative array using open addressing could do something similar,
switching hash functions if the probe sequence got too long (analogous to a
chain being too long).

Alternatively, if using open addressing, one could use something like cuckoo
hashing. A very fast keyed hash could be computed, and upon a collision,
siphash could be used, with a probe sequence similar to the one used by Python
Though, if your use case has lots of missed lookups, then performance will be
better just using siphash. Of course, one could use counters to detect this
condition, set a bit in the associative array header to indicate all future
lookups should just use siphash, and rehash all of the existing keys.

~~~
Gankro
Rust's solution is to provide the hash function as a generic parameter so you
can just pick Fnv/Xx/Sip based on your workload, with Sip as the default if you
don't pick. Unfortunately this is mostly incompatible with the adaptive scheme
you suggest, because you definitely don't want the adaptive logic if you've
already picked Xx/Fnv. Making it possible to disable all that if you pick
something other than Sip _would_ make things quite complicated.

But yeah, data structuring is ultimately really workload-specific. Any
solution a language provides will always be suboptimal for tons of use cases.
The solution you describe sounds good for a "never think about it" solution
that works for 85% of cases. Rust's solution kinda necessitates more thinking
more often, but I think it makes it applicable to more use cases if you're
willing to do that little bit of thinking.

------
nilknarf
I have previously written about how to predict the next Math.random in Java
(and hence Firefox also): [http://franklinta.com/2014/08/31/predicting-the-
next-math-ra...](http://franklinta.com/2014/08/31/predicting-the-next-math-
random-in-java/) The TL;DR is that it is easy since you only have to brute
force around 2^22 possibilities (which runs in < 1 second).

Someone then asked whether it is also possible in Chrome and Node:
[https://github.com/fta2012/ReplicatedRandom/issues/2](https://github.com/fta2012/ReplicatedRandom/issues/2).
After examining the MWC1616 code I saw the same “two concatenated sub-
generators” problem explained in this post. The implication of it is that you
can brute force the top and bottom 16 bits independently. So it is also easy
since that’s just doing 2^16 possibilities twice! Code:
[https://gist.github.com/fta2012/57f2c48702ac1e6fe99b](https://gist.github.com/fta2012/57f2c48702ac1e6fe99b)
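
For reference, an MWC1616-style generator has roughly the shape below. The multipliers A0/A1 here are generic Marsaglia-style constants used purely for illustration (V8's exact constants are in its source), but the structural point holds: the top 16 output bits depend only on one sub-generator and the bottom 16 only on the other, which is what makes the two halves independently brute-forceable.

    // Illustrative MWC1616-style generator (not V8's exact constants).
    const A0 = 18273;
    const A1 = 36969;
    let state0 = 1;
    let state1 = 2;

    function mwc1616() {
      state0 = (A0 * (state0 & 0xffff) + (state0 >>> 16)) >>> 0;
      state1 = (A1 * (state1 & 0xffff) + (state1 >>> 16)) >>> 0;
      // Top half of the output comes from state0, bottom half from state1,
      // so each 16-bit half can be searched independently (2^16 tries each).
      return (((state0 & 0xffff) << 16) | (state1 & 0xffff)) >>> 0;
    }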

------
UnoriginalGuy
This is actually a solved problem already on most underlying operating
systems, it just hasn't filtered down into the Javascript-world. On Windows
you want a GUID and on UNIX-like OSs you want a UUID.

GUIDs and UUIDs are guaranteed to be unique not simply because of their
length, but also how they're constructed (MAC address, timestamp, and a random
element).

So even if the random number generator does loop back around, it still won't
ever generate the same GUID/UUID even on the same hardware, and on other
hardware it is more unlikely yet still (due to MAC address).

So the question is: Why isn't GUID/UUID generation not available to
Javascript? It works extremely well and is used for exactly this type of
scenario.

~~~
mmalone
Great point / good question. Here's why we didn't just use UUIDs:

* UUID generation still requires a good (CS)PRNG and should be vetted the same way you'd vet a (CS)PRNG.

* UUIDs are one-size-fits-all 36 character base-16 encoded strings. If you want a short random identifier that you can use in a URL (that people might need to type on a phone or something) they're not ideal. Re-encoding them and/or truncating them is as hard and error prone as just generating the randoms yourself.

* Researching a good UUID library and maintaining a dependency is harder than vetting the 10 lines of code required to do it yourself. Without a trusted standard library solution it's unclear which library to use (maybe not as true today as it was a couple years ago when this happened)

If UUIDs were available in the Javascript standard library we probably would
have used them in some of the places that we currently use our own
identifiers. We do use UUID4s in our Java services, for instance.

I'm guessing that UUIDs aren't standardized because the standards process is
browser-centric and it's low priority in that context... but that's just a
guess.

Edit: formatting.

~~~
hueving
>Re-encoding them and/or truncating them is as hard and error prone as just
generating the randoms yourself.

Feed the whole thing into SHA-256 and take the last X bytes, where X is the
amount you need for your shortener. If you can show that the last X bytes of
SHA-256 collide more often than chance because of the input, there is a lot of
money in it for you. One of the nice things about a cryptographic hash is that
all of the output has to be completely unpredictable based on the input.
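
In Node, that suggestion is only a few lines; whether it actually buys anything over drawing random bytes directly is what the replies below dispute:

    const crypto = require('crypto');

    // Hash the UUID (or any input) with SHA-256 and keep the last X bytes.
    function shorten(input, numBytes) {
      const digest = crypto.createHash('sha256').update(input).digest();
      return digest.slice(-numBytes).toString('hex');
    }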

~~~
mmalone
Exactly, it's as hard and error prone as generating the randoms yourself. The
solution you described requires an analysis of the hash function to determine
collision probability, diligence to find a sha-256 implementation (circa a few
years ago), review of said implementation, another analysis to figure out your
truncation, then a proper implementation resulting in ~the same amount of
code, except this time slower and with less entropy.

~~~
michaelmior
> with less entropy

What makes you believe this is the case?

~~~
mmalone
You're generating a random number (the UUID) then you're hashing it. It can't
possibly have more entropy, and there's a chance two inputs to the hash will
collide thus reducing entropy.

Hashing a counter is actually a variety of CSPRNG. Basically you're re-seeding
a hash-based PRNG with whatever PRNG the UUID code uses at each run. It's well
known that a PRNG cannot have more entropy than its seed. Hence, it will at
best have the same entropy, and will probably have slightly less.

------
pygy_
The v8 Math.random() code was changed yesterday[0], maybe in response to this.
However, the update appears to be misguided[1]...

0\.
[https://github.com/v8/v8/commit/623cbdc5432713badc9fe1d605c5...](https://github.com/v8/v8/commit/623cbdc5432713badc9fe1d605c585aabb25876c)

1\.
[https://codereview.chromium.org/1462293002/#msg12](https://codereview.chromium.org/1462293002/#msg12)

~~~
je42
I don't understand why they don't go for Mersenne Twister?!

~~~
bhickey
I haven't looked closely enough at their choice to evaluate it, but there's a
good reason not to use Mersenne Twister. MT is a bad generator.

* The state size is 2kb

* It fails some rudimentary statistical tests

* The output is trivially predicted

* It isn't particularly fast

* It lacks multi-stream support

If you need a CSPRNG use ChaCha20 or AES. For simulation use PCG.

~~~
SeanLuke
I wouldn't touch PCG yet. It's based on a single paper submitted to a journal
which hasn't even been reviewed yet. Has there been any independent testing of
the algorithm? Not that I know of. And the entire PCG website appears to have
been built by the paper author as a promotional tool, which is a bit spooky
given that the paper hasn't even been published.

MT is _far_ from a "bad generator". It doesn't pass a _few_ stringent TestU01
tests, which is the case for a number of very well regarded generators.

~~~
bhickey
What failure mode are you worried about? Yes, the author has a definite
incentive to self-promote. At the same time she's produced a testable
artifact. I think her claims about predictability are bogus, but the rest of
it we can trivially verify on the merits.

Simple PCG PRNGs are fast and pass TestU01 BigCrush. Someone could conceivably
cook up a new battery of tests that they fail, but we'd need to reevaluate
the whole zoology of RNGs at that point.

Edit: Sorry, I was playing loose classifying MT's BigCrush failures as
rudimentary. That said, I think failing any part of BigCrush should disqualify
a PRNG from use.

~~~
SeanLuke
> That said, I think failing any part of BigCrush should disqualify a PRNG
> from use.

Would that include the PCG family then?

~~~
bhickey
No, it isn't a reasonable comparison: PCG is a family, while MT is a particular
Generalized Feedback Shift Register. Despite my suspicions I can't tell you
off the top of my head if the entire GFSR class is dodgy, but we can point out
flaws in a single generator. I buy O'Neill's arguments (other than those about
security, which are a load of hooey) and I think added scrutiny will bear her
out. Usually what I see is people flocking to MT because of its comically
large period.

When it comes down to it, PRNG quality gets assessed empirically. Why would
you ever use a generator with 2kb of state when faster & smaller generators
like PCG or Xorshift* pass the same statistical battery?

You can also write proofs about their behavior to ferret out the lousy ones.
For example, we can demonstrate that PCG is full period. Consider the inner
LCG:

    
    
        x1 = (a * x0 + c) % m
    

The modulus term is equivalent to the size of `x`, so in practice you can skip
it and rely on overflow. By requiring that `c` is odd and `m` is a power of
two, we know that `c` and `m` are coprime. If you select `a` such that (a - 1)
% 4 = 0, the inner LCG is full period.

If you look at the permutation function, it's also provably unbiased by
counting. Since all possible bit sequences are fairly represented in the PCG
state we know that all combinations of permuting bits appear with all possible
output bits.
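
For concreteness, a PCG32-style step (the "XSH RR" output permutation on top of the 64-bit inner LCG) fits in a few lines. This sketch uses BigInt for the 64-bit state and the commonly published multiplier, and is meant as an illustration rather than a reference implementation:

    // Illustrative PCG32 (XSH RR): 64-bit LCG state, 32-bit permuted output.
    const MASK64 = (1n << 64n) - 1n;
    const MULTIPLIER = 6364136223846793005n;

    function makePcg32(seed, streamSelector = 1n) {
      let state = BigInt(seed) & MASK64;
      const increment = ((BigInt(streamSelector) << 1n) | 1n) & MASK64; // c is odd
      return function next() {
        const old = state;
        state = (old * MULTIPLIER + increment) & MASK64; // inner LCG, m = 2^64
        // Output permutation: xorshift the high bits, then rotate by the top 5 bits.
        const xorshifted = Number((((old >> 18n) ^ old) >> 27n) & 0xffffffffn);
        const rot = Number(old >> 59n);
        return ((xorshifted >>> rot) | (xorshifted << ((32 - rot) & 31))) >>> 0;
      };
    }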

------
bytesandbots
A default PRNG's primary use case is generating session IDs for websites with
up to thousands of simultaneous users, not assigning UUIDs.

One should use a combination of a sequential component and a random component.
When you have multiple servers, make them combine a sequential component, a
unique system identifier, and a random number. In that case you can use the
faster PRNGs, which gives you speed while ensuring uniqueness.

Another takeaway from this problem is that the usual multiply-and-floor method
has a serious flaw: it uses only the highest bits, which may not be good enough
for the default fast PRNGs. A better method is to carry forward the remaining
bits from the floor using a modulo. The default fast PRNGs rely on all of their
bits, and simply throwing away a big chunk is not going to help.
Multiply-and-floor might be more useful when you are using better PRNGs. Even
then, those are slower and it is not wise to simply throw away the hard-earned
randomness.

~~~
mmalone
Using node IDs and timestamps is just an additional safety factor.
Statistically it's not necessary if you have a good generator. Even without
those things our target collision probability is less than the expected
uncorrectable bit error rate of an HDD.

Good PRNGs have equivalent entropy at each bit. With a good PRNG (even a non-
CS PRNG) you shouldn't need to mix entropy to do scaling. You should still do
rejection sampling[1] if you care about bias. It looks like a good scaling
method might be added to the ECMA spec as part of the standard library thanks
to some awesome people at Google.[2]

[1]
[https://gist.github.com/mmalone/d710793137ed0d6b8cb4](https://gist.github.com/mmalone/d710793137ed0d6b8cb4)

[2]
[https://twitter.com/mjmalone/status/667806963976134656](https://twitter.com/mjmalone/status/667806963976134656)
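
The rejection-sampling idea behind [1], sketched (this is not the gist's actual code): draw a full 32-bit value from the CSPRNG and retry on the small sliver of values that would bias the modulo.

    function randomInt(maxExclusive) {
      const limit = Math.floor(0x100000000 / maxExclusive) * maxExclusive;
      const buf = new Uint32Array(1);
      let x;
      do {
        crypto.getRandomValues(buf); // window.crypto in browsers, webcrypto in Node
        x = buf[0];
      } while (x >= limit);
      return x % maxExclusive;
    }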

~~~
jorangreef
"Using node IDs and timestamps is just an additional safety factor.
Statistically it's not necessary if you have a good generator."

Sorry, that's just not sensible. You always want to add node IDs and
timestamps (provided you hash the final output so as not to leak details about
your system) in case your generator fails. Why would you not want another
layer of safety? It also helps protect against the case where an attacker
might gain something by being able to predict the next ID in the sequence.

~~~
mmalone
If the attacker might gain something by being able to predict the next ID in
the sequence then you should be using a CSPRNG. That's not a problem here.
There's nothing for them to gain.

It absolutely is sensible. Adding node IDs and timestamps leaks information.
If you add a hash function now you have two problems -- the hash of a random
value is actually a _new_ random value with entirely different
characteristics. You're falling into another trap. Which is why you might not
want another layer of safety -- you're introducing another layer of complexity
and another place to fuck up. You had good intentions, but in the scenario you
described you've just introduced an additional point of failure with limited
upside. Why wouldn't you do the math and implement the simpler solution using
a generator that won't fail?

As I've said elsewhere, the likelihood of collision with our identifiers is
lower than the uncorrectable error bit rate of a HDD. In other words, it's
more likely for a perfect deterministic method to generate a collision because
there was a hardware failure persisting it to disk. Or, more pragmatically,
the risk is far below the level that any sensible person should ever be
worried about.

~~~
jorangreef
"using a generator that won't fail"

It seems your original function made the same assumption about V8's Math.random.
All PRNGs fail at some point, even CSPRNGs. You may as well write your code
accordingly, with less optimistic assumptions.

If you're not going to be adding layers of safety, and if you're going to keep
insisting that your PRNG "won't fail" then I guess it's only a matter of time
before you will have to repeat the same mistake.

As other commenters have pointed out, you should have written your function in
such a way that it does not place a critical reliance on any single component.

~~~
mmalone
I generally trust peer reviewed formal mathematical proofs that show something
won't "fail" in a particular, relevant, way. If you don't then you probably
shouldn't be on a computer. The code I'm relying on makes the same sorts of
assumptions that keep your data secure. It is inconsistent to trust it in one
place but not in another.

I don't see the need for belt-and-suspenders here and there are legit reasons
not to add host/time to an identifier. That's why we have UUID1 _and_ UUID4.

~~~
jorangreef
You should read more Colin Percival then. :)

It's one thing to trust "peer reviewed formal mathematical proofs".

It's another to assume that these are perfectly implemented.

------
tlrobinson
JavaScript's Math.random (and most languages' default RNGs) is not intended to
be cryptographically secure, so I'm not sure "broken" is correct. It's
probably random enough for most uses that don't require a CSPRNG. Are there
any scenarios where this isn't the case?

On the other hand, enough people make this mistake that APIs should just use a
CSPRNG for "random()" and offer a "fastRandom()" for those who need speed but
not "secure" random.

~~~
panic
Statistical simulations require good randomness (or your simulation might give
a wrong result) and high speed but not cryptographic security. See, for
example, xorshift*
([http://xorshift.di.unimi.it](http://xorshift.di.unimi.it)), which is
efficient and high-quality but not cryptographically secure.

------
jorangreef
If you need good quality random IDs cheaply, you can do the following:

1\. Generate a pool of 2048 bytes or so of entropy at startup using
window.crypto.getRandomValues or crypto.pseudoRandomBytes. To this, append
Date.now() and Math.random().toString() just in case your crypto method fails
badly. Then call SHA256 on this to get your starting entropy distilled into 32
bytes. If some of the 2048 bytes of entropy are not high quality, it won't
matter as much since you compress it into 32 bytes (i.e. it's worse if you
just ask for 32 bytes from getRandomValues).

2\. Initialize a counter to 0.

3\. Each time you need an ID, increment the counter (handle wrap-around if
necessary), and take a SHA256 hash of (counter, 32 bytes entropy hash obtained
in 1. above, Date.now(), Math.random().toString()). Then truncate and encode
this using whatever character set as needed.

This way you don't drain out your cryptographic entropy pool every time you
generate an ID. You also don't leak any details as to your system time,
startup time, or your current position in Math.random() (which would allow
someone to predict the next Math.random() result) since the final ID is
hashed.
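
A Node-flavored sketch of steps 1-3 (the names and the hex encoding are illustrative choices, not a prescribed implementation):

    const crypto = require('crypto');

    // 1. Distill a startup entropy pool, plus time and Math.random() as a hedge
    //    against a badly failing crypto source, into 32 bytes.
    const pool = Buffer.concat([
      crypto.randomBytes(2048),
      Buffer.from(String(Date.now())),
      Buffer.from(Math.random().toString()),
    ]);
    const seed = crypto.createHash('sha256').update(pool).digest();

    // 2. A counter.
    let counter = 0;

    // 3. Per ID: hash (counter, seed, time, Math.random()), then truncate/encode.
    function nextId(length = 22) {
      counter = (counter + 1) >>> 0; // handle wrap-around
      return crypto.createHash('sha256')
        .update(String(counter))
        .update(seed)
        .update(String(Date.now()))
        .update(Math.random().toString())
        .digest('hex')
        .slice(0, length);
    }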

~~~
tptacek
Since what you're proposing is pretty close to just running SHA2 in counter
mode, you could simplify this by generating 128 bits of random data, using it
as an AES key, and just using your crypto library's AES-CTR function to
generate a keystream.

If you need a predictable stream of uncorrelated bits, this has the benefits
of being simple and trivially seedable.
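
With Node's crypto module that might look like the sketch below (the random key and zero IV are chosen here only for illustration). Reusing a fixed key reproduces the same stream, which is what makes it trivially seedable for tests.

    const crypto = require('crypto');

    const key = crypto.randomBytes(16);          // 128-bit random key
    const iv = Buffer.alloc(16);                 // CTR counter starts at zero
    const ctr = crypto.createCipheriv('aes-128-ctr', key, iv);

    // Encrypting zeros yields the raw AES-CTR keystream.
    function keystreamBytes(n) {
      return ctr.update(Buffer.alloc(n));
    }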

~~~
jorangreef
Thanks, that's an elegant solution.

In case the entropy pool is drained, one would still want to get more than 128
bits from urandom and then hash this with SHA2 to get the 128 bit key, right?

I had in mind something portable to the browser without requiring AES there,
but will try it out on the server.

~~~
tptacek
In a practical sense, there is no such thing as draining an entropy pool. The
period for AES-CTR is 2^128.

The primary reason CSPRNGs rekey themselves periodically is for forward
security, so that if your machine gets hacked, an attacker can't snarf the RNG
state and predict all future numbers the machine generates.

~~~
Tomte
Forward security is obviously a desirable design feature in a CSPRNG (as a
building block that's evaluated and reviewed on its own merits), but I can't
help but feel that it's often distracting people from a whole system view.

If an attacker has access to your computer on a level where he can inspect the
CSPRNG's state, you've probably lost completely and no reseeding will help
you.

~~~
tptacek
I agree, of course.

------
emmelaich
If you want IDs without collisions you should use a monotonically increasing
number instead (or as well).

I would concatenate UTC time in ms to the random number.

~~~
dap
You can't use monotonically increasing numbers from multiple threads (or
servers) without either synchronizing (which is terrible for scalability) or
using separate prefixes for different threads or servers (in which case you
have problems related to sizing and allocating the prefixes). Right?

The UTC time + random number idea is interesting, but I'm not sure it's any
better than totally random. Say you want a 64-bit unique id. It would take
about 42 bits to store a millisecond Unix timestamp, leaving 22 bits for
randomness. If you consider the probability that a given id will collide with
a previously-generated one: with this scheme, you only need to consider ids
generated in the same millisecond, but you only have 22 bits of randomness
that can be used to avoid a collision. If you use 64 random bits, you can
theoretically collide with ids generated across all of time, but the odds of
collision are many orders of magnitude lower. By constraining the first 42
bits to store a millisecond timestamp, every millisecond that goes by removes
2^22 values from the possible id space (regardless of whether they're used).
(Besides that, this scheme removes from the space of possible ids all of the
millisecond values from before the system was created, which seems like about
2/3 of them, assuming about a 68-year span of values for Unix times, which
started 45 years ago.)

That's kind of handwavy, so let's do the math. With 22 bits of randomness, the
odds of any randomly selected pair of ids generated in the same millisecond
colliding is 1/4194304. Assuming independently generated ids, the expected
number of collisions after 4194304 milliseconds is 1. So if you have just two
threads generating ids once per millisecond, you'd expect a collision in just
70 minutes. That's not great.

It's late, though. Maybe my math is wrong?
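
A quick check of that arithmetic, for what it's worth:

    const pCollision = 1 / 2 ** 22;       // one pair per millisecond
    const expectedMs = 1 / pCollision;    // 4,194,304 ms until 1 expected collision
    console.log(expectedMs / 1000 / 60);  // ~69.9 minutes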

~~~
0x0
You can't use a PRNG from multiple threads without synchronizing either. And
interlocked increment is probably a thousand times cheaper than mutexing a
PRNG...

~~~
michaelmior
Sure you can. A different seed in each thread and you're done.

~~~
0x0
But you still have to generate a seed for each thread. Then you could just
generate a random starting point for each thread and run with incrementing
counters instead of hitting the PRNG for each ID.

~~~
michaelmior
True. But I don't think it's hard to come up with a seed which is guaranteed
to be unique. If you use counters you have a ton of management overhead. You
now need to keep track of where the counter for each thread should start and
what happens when a thread crashes. Also, what happens when you want to add a
new thread? Now you have to change the increment for your counters.

~~~
0x0
Why do you have to care about all that? Just pick a random number and start
incrementing++ from there. I don't see how this would yield any more
collisions than a PRNG? Just because you move across the ID space sequentially
instead of in an obfuscated pattern doesn't really mean there's a higher
chance of picking the same ID twice?

(My argument here, really, is that using a PRNG _at all_ sounds pretty crazy
for ID generation, if you don't take steps to prevent or detect collisions)

~~~
michaelmior
You have to care about that if you don't want any synchronization. How do you
generate IDs across multiple servers otherwise?

------
NicoJuicy
I understand why they want uniquely generated IDs; I don't understand why they
just didn't use a UUID/GUID, which is way more performant (string vs. int) and
smaller in size (22 bytes vs. 16 bytes).

Both are created for the same purpose, and GUIDs are even possible as keys in
MS SQL Server (next to the standard auto-incremented integer). So it's safe to
say that it's pretty reliable.

~~~
cowsandmilk
UUID : 128 bits

Betable's scheme : 132 bits

Essentially the same size. They ended up with 132 bits because they use a 6
bit alphabet; 128 is not divisible by 6, so you would need 22 characters of a
6 bit alphabet to represent a UUID, exactly the same as them.

(note, the 6 bit alphabet is so identifiers can go in urls; the commonly used
uuid hexadecimal representation is a 4-bit alphabet, so there are 32
characters and usually 4 hyphens; so 36 characters in the URL vs. only 22)

------
crabasa

        npm install node-uuid 
    

As a programmer, I quite enjoyed reading this post. But I can't believe the
amount of time the author wasted learning what many crypto library authors
already know: Math.random isn't any good. I mean this sincerely. You'd think
they have so many more pressing problems to solve for their users.

~~~
blahedo
Perhaps, but the mindset here is fundamentally academic: here is an
interesting problem, I wonder what I can learn about its properties? And
having looked into those properties, the author decides to write them up and
disseminate them. (Very well, I might add. I filed the link in a bookmark
folder I keep for "articles I'd like to assign my students to read, next time
I teach a relevant class." This one is well-written _and_ well-referenced.)

~~~
crabasa
Sure, but I still see two problems:

1\. Math.random, despite his insistence, isn't broken. There are different
degrees of randomness that a programmer might need for their program and
Math.random simply didn't suit his.

2\. Their initial implementation of a critical portion of their infrastructure
was so naïve, I almost couldn't believe it. It just seemed like such an
anti-pattern to try to build your own UUIDs, discover (in production!) that
they're not so good, and then spend days figuring out why a built-in random
number generator that ships in every browser isn't so great.

~~~
mmalone
1\. In this case we're arguing semantics. Fine, it's not broken, but it's
still bad and there are better alternatives with no drawbacks. Arguing to keep
it is sort of like arguing to keep an O(n) algorithm that has an O(log n)
alternative that's also more intuitive and easier to code. If you still don't
believe me, here's what Brendan Eich thinks[1].

2\. I've addressed the standard solution / UUID question elsewhere[2]. The way
we were generating identifiers is not an anti-pattern. It's a pattern. It's
the standard way to produce a random string from an alphabet. If you look at
the source code of any website, from HN to Google, I guarantee you will find
an almost identical piece of code somewhere. The code is simple, but that
doesn't make it bad.

[1]
[https://twitter.com/BrendanEich/status/667735502691373056](https://twitter.com/BrendanEich/status/667735502691373056)

[2]
[https://news.ycombinator.com/item?id=10605977](https://news.ycombinator.com/item?id=10605977)

~~~
ss95060
Your code used the naive algorithm and the system PRNG to generate a set of
"random" and thus probably "unique" strings. The PRNGs supplied with many
languages are widely known to be of poor quality. The results were bad, for
reasons you later discovered. "Anti-pattern" sounds correct to me.

A safe approach is to run multiple uncorrelated sources of probable
uniqueness/[pseudo]randomness through a modern cryptographic hash function,
and generate your "unique" string from the hash output.

~~~
mmalone
In other words the code itself is, in principle, fine. The PRNG it used was
not fine. Had it used a CSPRNG there would not be a problem, and generating
identifiers in this manner is perfectly safe and not an anti-pattern. Hashing
the output of an already good (CS)PRNG is the real anti-pattern. It can only
reduce entropy and make things worse.

I've acknowledged the incorrect assumptions / taken blame for not doing proper
diligence on the PRNG. What I'm saying here is that the identifier generation
technique is not, in principle, flawed. It is a common technique and it has
many legitimate use cases. It's simple enough that pulling in a dependency to
solve it for you is not an obviously better alternative.

True randomness and a crypto-strength PRNG are good options here, but they're
not necessary. There are peer reviewed proofs that show something like MT19937
is suitable (perhaps even better suited) for this sort of task. See the deeply
nested comment thread ITT for more of that debate.

~~~
ss95060
Suppose, somehow, you seed your super-duper PRNG with the same value on
multiple servers. If you are not incorporating other forms of entropy into
your unique IDs, you are hosed.

The cryptographic hash acts sort of like a blender for all your random/unique-
ish information sources. Counters/MACs/urandom/times/IPaddrs/etc go in, a
nicely mixed value comes out.

~~~
mmalone
Meh. You should just use all of that random unique-ish stuff to properly seed
your PRNG. You're trying to improve the generator "randomly," which is
generally dangerous and can be counterproductive. Better to rely on sound
theory and good practice / keep it simple. Leave all that craziness to the
kernel entropy pool and just use it as a seed.

~~~
ss95060
You might want to read this:

[http://security.stackexchange.com/questions/89813/managing-k...](http://security.stackexchange.com/questions/89813/managing-
keys-generated-with-insufficient-entropy/89817)

Anyways, if you decide you want to learn more about secure hash functions and
cryptography in general, you can't go wrong with Bruce Schneier's "Applied
Cryptography".

Good luck!

~~~
mmalone
Applied Cryptography is a good book. I first read it more than 15 years ago
:). Schneier actually designed the CSPRNG that is used by OS X. It's called
Yarrow and is SHA based. It does more or less what you suggest, but with many
subtle improvements to mitigate attacks. It has features that you're unlikely
to properly reproduce in user-space. Trying to improve its output by re-
hashing in user-space is, even if done properly, unnecessary. If you screw
something up and do it improperly you could easily reduce the quality of the
generator.

I'm not sure how your link is relevant. If there's not enough entropy on the
system for the kernel, there's not enough for you either.

------
Drdrdrq
I think the whole idea of generating random ids is flawed. Why take a chance
when there is no need? Just take some machine id (seq. number, MAC), timestamp
and a big enough per-machine counter and there can be no collision,
guaranteed. No need to use random.voodoo().

~~~
sgk284
Hardware companies have definitely been lazy and reused MAC addresses. But
what's worse is that the total space is as low as 48 bits.

The probability of a collision is orders of magnitude higher than if you just
grab some random bytes out of /dev/urandom and generate a type-4 UUID.

For a discussion on the benefits of your approach (a type-1 UUID) vs a random
id (a type-4 UUID), check out the wikipedia article:
[https://en.wikipedia.org/wiki/Universally_unique_identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier)

Randomness is great because it doesn't require any synchronization or access
to some special token.
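
Generating a type-4 UUID really is just 16 bytes from the CSPRNG with the version and variant bits pinned, as in this sketch (modern runtimes also expose crypto.randomUUID() directly, though not at the time of this thread):

    const crypto = require('crypto');

    function uuid4() {
      const b = crypto.randomBytes(16);
      b[6] = (b[6] & 0x0f) | 0x40;   // version 4
      b[8] = (b[8] & 0x3f) | 0x80;   // RFC 4122 variant
      const hex = b.toString('hex');
      return [hex.slice(0, 8), hex.slice(8, 12), hex.slice(12, 16),
              hex.slice(16, 20), hex.slice(20)].join('-');
    }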

~~~
michaelmior
If you get identical MAC addresses on machines operating the same service,
presumably on the same network, you have bigger problems than ID generation.

------
adrae5df
Can someone explain how the following calculation was done?

 _With 2¹³² possible values, if identifiers were randomly generated at the
rate of one million per second for the next 300 years the chance of a
collision would be roughly 1 in six billion._

I tried using the formula for the birthday problem, but the values are too
large.

~~~
dmit
Which formula did you use? There are various approximations.

Although I got ~43 years using the one from
[https://en.wikipedia.org/wiki/Birthday_attack#Source_code_ex...](https://en.wikipedia.org/wiki/Birthday_attack#Source_code_example)

    
    
      julia> birthday(prob, vals) = sqrt(2 * vals * -log1p(-prob))
      birthday (generic function with 1 method)
      
      julia> iters = birthday(1/6e9, 2.0^132)
      1.3471597122821932e15
    
      julia> iters / 1e6 / 60 / 60 / 24 / 365
      42.718154245376496

~~~
adrae5df
I found an approximation (x^2/2m) and got a result on the same order of
magnitude as 1 in a billion.

------
erikpukinskis
Huh. I _always_ check for collisions if I'm using a random number as a unique
key. I guess it's faster to just not check, but it's just a single seek to
check for existence. It's pretty fast, and you almost never have to do it
twice.

As a side benefit you can get by with shorter (more human readable, typeable,
fits-in-a-tweet) keys because you correct the few collisions you get.

If the extra keyspace seek were ever a serious performance bottleneck I would
just skip the collision check in that one spot and use a proper RNG as OP is
suggesting. But how often does that really happen? Very few systems need to
generate millions of unique keys a second.

~~~
anyfoo
Checking for collisions becomes practically impossible in distributed systems
pretty quickly (take git as an extreme example), and provides virtually no
benefit if you can trust your RNG.

At probabilities that astronomically small, there are many things that can go
wrong before collisions become a realistic concern. Think about it: why bother
checking, when there is a somewhat similarly egregious probability that a
random bit flip occurs right after your check?

In the situations you encountered, it's probably simple and cheap, but in the
cases where it's not (the situation in the article being a probable example),
it can be expensive, error-prone complexity without gain.

------
admax88q
No platform in 2015 should even have a poor PRNG. It's pretty sad really.

rand() in C, rand() in PHP, Math.random() in JS.

Any use case where you want a poor random number generator is a niche use case
and you can write your own. We should be secure by default.

~~~
baudehlo
If you want a random entry from an array, Math.random is fine. Anyone who
doesn't know that PRNGs are broken for things like UUIDs shouldn't be in
charge.

~~~
mmalone
If you want to do something as trivial as an unbiased Fisher-Yates shuffle of
an array, however, Math.random() is broken. And Math.random() doesn't have to
be broken for things like UUIDs. Python and Ruby both have PRNGs that are
suitable for such things.
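
For instance, a shuffle driven by the browser CSPRNG instead of Math.random(), with rejection sampling for the index draw (a sketch, not Betable's code):

    function secureShuffle(arr) {
      const buf = new Uint32Array(1);
      for (let i = arr.length - 1; i > 0; i--) {
        const bound = i + 1;
        const limit = Math.floor(0x100000000 / bound) * bound;
        let x;
        do {
          crypto.getRandomValues(buf);
          x = buf[0];
        } while (x >= limit);
        const j = x % bound;
        [arr[i], arr[j]] = [arr[j], arr[i]];
      }
      return arr;
    }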

We fucked up by not vetting the algorithm, that's definitely the primary
lesson here. Mea culpa. I'm sure you would have done things differently, but
V8 is a modern system and I don't think assuming a modern PRNG was completely
unreasonable while quickly getting to MVP at a new startup.

~~~
PhantomGremlin
_We fucked up by not vetting the algorithm, that's definitely the primary
lesson here. Mea culpa._

Bingo. Exactly.

I heavily criticized you in another post for 1) your initial choice of
substandard algorithm and 2) your rationalizations (e.g. Google's "good
reputation").

But that is ancient history. Here you're admitting that you fucked up, which
IMO you didn't admit previously.

Also, kudos for your article which kicked off this entire discussion. In that
article you showed that you carefully analyzed what was happening, how you
went wrong, and how you could improve.

Even more important, you did this all publicly, both your article and your
responses here on HN. Everyone learns from this. I commend you for your
openness.

~~~
Dylan16807
Well Google fucked up too, in a way that affects many more people. They used a
broken form of an obsolete random number generator with a 2^30 cycle. There
are faster and simpler generators that perform overwhelmingly better on
statistical tests.

------
nodesocket
The challenge of "randomness" is overlooked and leads to security issues more
than most people realize. For example in PHP, the `rand()` function
documentation (to their credit) has a big warning that the function is not
cryptographically secure.

When in doubt, use Unix and /dev/urandom.

Also, here is a command (to my knowledge secure) which generates a nice random
20-character alphanumeric string:

    
    
        LC_CTYPE=C < /dev/urandom tr -dc A-Za-z0-9 | head -c20

~~~
admax88q
Use /dev/urandom, not /dev/random.

There are no security concerns with /dev/urandom

~~~
rst
Here's a writeup of a serious security flaw traced directly to the use of
/dev/urandom instead of /dev/random:

[https://factorable.net/](https://factorable.net/)

In brief, wifi routers need to generate unique crypto keys. The programmers
started off using /dev/random, observed that blocking, and switched to
/dev/urandom. The problem was that they were generating the keys at first
boot, before any entropy had accumulated, with the result that many, many
devices wound up with similar or identical keys.

~~~
derefr
Pedantic version, then: use /dev/urandom unless your software is itself part
of the OS's boot sequence.

~~~
comex
What is the definition of "part of the OS's boot sequence"? In the
factorable.net paper[1], a test system didn't generate 192 bits of entropy
until 66 seconds(!) after boot. The problem was apparently exacerbated by
Linux not mixing the entropy into the /dev/urandom pool _at all_ until that
threshold was reached, but even if it started doing so immediately, standard
server processes like ssh would still start up long before there was enough
entropy to make urandom sufficiently unpredictable. Sure, this is something of
an extreme case... and as mentioned in the paper, there are various tools used
on Linux to save a random seed to disk, so maybe the OS should ensure that
happens before it starts any other services... but software that responds to
an easy-to-make configuration mistake by _silently behaving insecurely_ is the
polar opposite of good security design.

The right answer is Linux's new getrandom syscall, whose default mode "blocks
if the entropy pool has not yet been initialized", but thereafter acts like
/dev/urandom. But on systems where that syscall is not available, I'm
skeptical that using /dev/urandom is sane design.

[1]
[https://factorable.net/weakkeys12.conference.pdf](https://factorable.net/weakkeys12.conference.pdf)

~~~
derefr
Yes, getrandom(3) is good. It's just a patch for a broken architecture,
though; the _right_ answer, architecturally, is that userland services (like
sshd) should not be started at all until /dev/urandom has been seeded.

To put it another way: /dev/urandom is effectively a _service_ that can be in
a "starting" or "started" state. When you, as a service, depend on another
service, the idiomatic thing to do is to teach your init/supervisor daemon
about that dependency. Then, instead of your service coming up and sitting
around doing voodoo to your pipes/sockets to try to figure out whether your
dependencies are up, your supervisor can just delay starting your service
until _it_ knows the dependent service is up.

(Or, to get really clever about it, you could just make /dev/urandom itself a
socket attached to a "service" whose job is to, on startup, do a blocking-seed
of the "real" /dev/urandom device, and thereafter become a dumb pipe over to
the "real" /dev/urandom device. Then all your services just block on reading
/dev/urandom [the socket] the first time, with the first one of them to try
reading from it in fact _causing_ the reseeding to happen. Socket activation!)

~~~
abortz
No need to get clever about it, or rather it's already been done. ;-) Except
for the reseeding bit, getrandom() does that for you by default: blocks if
/dev/urandom hasn't been seeded yet. At least according to the man page.

------
dzhiurgis
Hey, has anyone tried to Google for CSPRNG? What Google returns for the
Wikipedia article's title is totally off:

    
    
      https://www.google.com/search?q=CSPRNG
      Random are numbers - Wikipedia
      https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator

~~~
cooper12
I got the same result. Usually google gives a different title than the
Wikipedia title when it's closer to the search term used. However, these
titles are taken from redirects to the article and "Random are numbers" wasn't
ever an article or a redirect. The phrase also doesn't appear anywhere in the
wikitext or the generated html. Searching for the exact string also yields
only two other results, one from a reddit thread, and another from a paper.
Very strange.

------
andreapaiola
If you need to identify, to name, something uniquely and dynamically:

1) Do NOT use a non-deterministic generator, you'll thank me when debugging

2) Find what uniquely identifies the object and deterministically generate a
string from those values.

For example to identify a request to a service on the internet you can use
microtime, IP and a counter and so on.
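
A sketch of that kind of deterministic identifier (the fields are illustrative):

    // Identify a request by time, client address, and a per-process counter;
    // no PRNG involved.
    let counter = 0;
    function requestId(clientIp) {
      counter = (counter + 1) >>> 0;
      return `${Date.now()}-${clientIp}-${counter}`;
    }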

------
antirez
Great post. I wrote this after reading it where a simple and reliable solution
to the ID generation problem is described:
[https://news.ycombinator.com/item?id=10606910](https://news.ycombinator.com/item?id=10606910)

~~~
mmalone
Good idea! But it looks like you re-invented a SHA/counter based CSPRNG :).
That's not necessarily a bad thing, but there are probably library CSPRNGs
that work as well and hide some details. Also, it's not generally any more or
less likely to produce collisions than the method we're using. The probability
of collision is dependent on the hash function / PRNG being used.

------
gmac
I have to say I'm surprised — I thought better of V8 than this.

Arguably it's usually going to be preferable to know you're getting the same
kind of random numbers across platforms anyway, so a BYO Mersenne Twister is a
decent remedy.

One is linked in the post, and I have a CoffeeScript one here:
[https://github.com/jawj/mtwist/blob/master/mtwist.coffee](https://github.com/jawj/mtwist/blob/master/mtwist.coffee)

------
bytesandbots
For those wondering why not use UUIDs: most UUID implementations are guaranteed
to be unique _only when running on a single machine_. When working across loads
of servers without putting any load on a central resource, you cannot rely on
either UUIDs or sequential IDs.

~~~
tolmasky
UUID v4 seems to be based on random numbers, so should be "as good" as
whatever scheme they were trying to hand roll here. Additionally, the uuid npm
library uses crypto.randomBytes, which is cryptographically secure.

[https://github.com/broofa/node-uuid](https://github.com/broofa/node-uuid)

~~~
mmalone
The biggest problem is that they take up a lot of space. Plus you have to find
a good library and vet it (there's nothing in the standard library). Or just
write 10 lines of straightforward code.

If you go and look at node-uuid commit history it, too, used Math.random()
back in the day.

------
netheril96
Given that browser JS has no way to interface with /dev/urandom, I think it is
a mistake for V8 not to make the only random number generator available to
browser JS secure by default.

Of course, on Node.js that is a different issue.

~~~
asdfaoeu
What about [https://developer.mozilla.org/en-
US/docs/Web/API/RandomSourc...](https://developer.mozilla.org/en-
US/docs/Web/API/RandomSource/getRandomValues) ?

~~~
netheril96
Learned a new thing today. Thanks.

------
swang
Curious why they aren't using hardware RNGs?

~~~
pcwalton
Too slow.

------
altern8
Again..?

------
mmalone
tl;dr: the algorithm powering V8's Math.random() is very poor quality. For
many use cases you can't safely pretend its output is actually random. Don't
use it for anything non-trivial that you care about. It should probably be
fixed. In the meantime, use crypto.randomBytes() or crypto.getRandomValues()
instead.

~~~
kayamon
I think the real lesson here is that if your application is dependent on the
exact behavior of an undefined algorithm, you probably should fix your
application.

~~~
mmalone
It wasn't dependent on exact behavior. It was dependent on a sensible general
contract. We rely on sensible implementations of general contracts all over
the place in software development.

We did make an incorrect assumption that the PRNG (created by Google, who has
a good reputation, and in what is probably the most popular software in the
world) was high quality. That was a mistake, and we should have done more
homework. However, there's no reason why the Math.random() implementation in
V8 should _not_ be good enough to not have to worry about.

~~~
asdfaoeu
It was dependent on it being a CSPRNG which it wasn't. Math.random is designed
for things like games and animations where randomness isn't critical and
performance is important.

~~~
seba_dos1
No, it wasn't. Read the article again. The assumption made was that Math.random
is a high-quality PRNG. Being cryptographically secure is completely
unrelated.

~~~
Dylan16807
Medium quality, really. But it was neglected and bad for no reason, no
tradeoffs.

------
lolo_
META: Maybe the mods need to look into how certain popular sites treat #, as
Medium seems to generate a new #<hash> on each refresh. I submitted this 1 day
ago, intentionally removing the #<hash> -
[https://news.ycombinator.com/item?id=10598335](https://news.ycombinator.com/item?id=10598335)
\- r0muald clearly caught better timing for this article, but it's frustrating
that an exact duplicate submitted in very close proximity becomes a totally
separate submission!

~~~
dang
HN's dupe detector deliberately allows reposts if an article hasn't had
significant attention yet, so the added hash didn't make a difference in this
case. That said, there are definitely too many duplicates right now and we're
working on a new approach that will (as a side effect) privilege the original
submitter more often.

It's frustrating when you get there first and someone else hits the jackpot,
but it evens out in the long run if you submit enough good stories.

~~~
lolo_
Thanks for the quick and reasonable reply!

I am only a little frustrated that I missed out on so much karma love (I am in
my 30's now and find that these things matter less as I get older :); it was
more the fact that Medium naturally appends this, so literally all submissions
of a Medium article will result in reposts.

It's not a simple problem, as some URLs rely on the hash to work
correctly, e.g. hxxp://www.example.com/#!/foo/bar, so you can't just strip it.
I think it requires some special cases for sites known to do this kind of
thing (not sure why Medium does it).

Glad to hear duplicates are an issue you're looking at; it has certainly been a
problem, though I agree it makes sense to allow them in certain cases. For
example, I love the XV6 OS and like to see it submitted again from time to time
to see new discussion. It might even be nice to have an auto-generated list of
previous submissions in this case?

As an amusing aside, this whole issue has a relationship to the story -
perhaps the generated hash uses Math.random() and therefore might actually
result in duplicate URLs after not so many resubmissions? ;)

------
PhantomGremlin
I can't believe you're defending "an incorrect assumption" regarding your
previous generator, especially after you wrote a very good article showing how
badly wrong you were.

Wait, let me restate that. You're Engineering the Disruption of Real Money
Gaming and yet you say of some PRNG "no reason the ... implementation ...
should not be good enough to not have to worry about"? You're not just wrong,
in fact you're "not even wrong", that's how wrong you are. Plus that sentence
has three negatives!

Jules Winnfield said it better than I could. Riffing from him, what you were
doing was, compared to what you _should_ have been doing: _ain't the same
fuckin' ballpark, it ain't the same league, it ain't even the same fuckin'
sport._

Here's something I came up with in 60 seconds. It's far better than what you
didn't want to worry about:

    
    
       ID Quantique hardware RNG [1]
       mixed with Intel RdRand [2]
    

Yes, you're right. That's not seed-able, so you can't create a reproducible
sequence of values (e.g., for testing). That's an advantage. You want to test,
you replace hardware with software. But if you want to disrupt real money
gaming you don't use some crappy software in some random library that wasn't
designed for the purpose.

Once again, I can't believe your defense of how badly you failed was the
"reputation" that Google has. Real money is involved and you thought it was OK
to rely on some crappy PRNG created by Google? For something that is at the
heart of, the quintessence of what you are attempting to accomplish?

EPIC FAIL!!!

[1] [http://www.idquantique.com/random-number-
generation/](http://www.idquantique.com/random-number-generation/) [2]
[https://en.wikipedia.org/wiki/RdRand](https://en.wikipedia.org/wiki/RdRand)

~~~
dang
Your comments here are breaking the HN guidelines. HN comments need to be
civil and substantive.

We detached this subthread from
[https://news.ycombinator.com/item?id=10601984](https://news.ycombinator.com/item?id=10601984)
and marked it off-topic.

~~~
PhantomGremlin
Sorry.

Not a badge of honor to have a subthread detached.

------
elcct
func rand() float64 { return 0; }

FTFY

------
mckoss
Why no mention of window.crypto.getRandomValues?

~~~
Too
Read more carefully...

~~~
mckoss
Skimming on phone and I missed it-thanks!

------
serge2k
> Good PRNGs are designed so that their cycle length is close to this upper
> bound. Otherwise you’re wasting memory.

Why?

Good article. I do have a question for the author: after spending a huge amount
of time writing this post, why not submit a patch to V8?

~~~
tzs
> Why?

Good question. As the author noted, k bits of state can support a cycle length
of at most 2^k. You can look at that from the other direction and state it as:
a cycle length of L requires at least log2(L) bits of state.

If your cycle length is L and you are using k bits of state, and k > log2(L),
then in theory with a cleverer encoding of state you could save k - log2(L)
bits of memory.

Whether or not I'd characterize k - log2(L) excess bits as "wasted" memory
depends on just how big it is. For instance, suppose someone implemented
Mersenne Twister and used twice as many bits as theoretically needed for the
cycle length. Typical MT has a cycle length of around 2^20000, so needs 20000
bits of state. Someone doubling that is using an excess of 2500 bytes.

On the other hand, the generator the article is about has a cycle length of
around 2^60, and so needs in theory 60 bits of state. If an implementation
doubled that, it would have an excess of less than 8 bytes.

2500 bytes is enough that if I were implementing I'd probably take a good look
at seeing if I could eliminate it, even if it meant making the implementation
a little more obtuse. I'd be inclined to consider that 2500 bytes as wasted
memory.

8 bytes? That's small enough that even on most embedded systems I'd probably
consider clean and clear code to be more important than saving that memory.
I'd not be inclined to consider it wasted memory.

~~~
Gibbon1
I have another reason, based on some mucking with simple PRNGs a long time ago.
It might not apply to more complicated ones[1].

Maximal cycle length generators don't produce degenerate sequences. Meaning I
seem to remember non-maximal ones where they'd have, say, a sequence of
2^k - 349, a sequence of 340, and a sequence of 7. Which means they fail badly
if seeded incorrectly.

[1] Noting an old article I read on FPGA design that said 'it helps to design
your state machines so that illegal states transition to legal ones'. So I
assume sub-2^k generators exist that don't have such sub-sequences.

------
omarforgotpwd
I can't b3liwv3 I just read that entire thing drunk on a Friday night

~~~
mmalone
Hahahahaha. I can't believe that either.

