

Random in the wild - donmcc
http://www.tedunangst.com/flak/post/random-in-the-wild

======
colmmacc
My own favourite random() in the wild bug is one that I've come across many
many times:

    
    
        /* Generate a random number 0...5 inclusive */
        int r = random() % 6;  
    

The problem is that this results in a bias towards certain values, because the
random() return space is probably not a whole multiple of the number you are
modding. It's easier if you think what would happen if RAND_MAX were "10".
Then the results 0,1,2,3,4 each have two opportunities of being selected for
r, but "5" only has one. So 5 is half as likely as any other number. Terrible!
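The toy case above can be checked by brute force. This is a minimal sketch: `tally_residues` and its parameters are names made up for illustration, with `max_value` standing in for RAND_MAX so the small example can be enumerated exactly.

```c
#include <assert.h>

/*
 * Count how often each residue 0..n-1 appears when every generator
 * output 0..max_value (inclusive) is reduced modulo n.
 * counts must have room for n entries. Intended for small max_value only.
 */
void tally_residues(unsigned max_value, unsigned n, unsigned *counts)
{
    assert(n > 0 && counts != 0);

    for (unsigned i = 0; i < n; i++)
        counts[i] = 0;

    /* Loop written with an explicit break so v never needs to exceed max_value. */
    for (unsigned v = 0; ; v++) {
        counts[v % n]++;
        if (v == max_value)
            break;
    }
}
```

Called as `tally_residues(10, 6, counts)`, this gives a count of 2 for each of the residues 0 to 4 and a count of 1 for residue 5, matching the description above.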

Using a float/double random() doesn't always help either, because
floating-point multiplication has non-uniform errors. Instead, what you really
need is:

    
    
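        #include <stdlib.h>  /* random(), RAND_MAX */

        int random_n(int top) {
            if (top <= 0) {
                return -1;
            }

            /* Largest multiple of top not above RAND_MAX. This assumes
               random() returns 0...RAND_MAX; anything at or above the
               limit is rejected and redrawn. */
            long limit = RAND_MAX - (RAND_MAX % top);

            while (1) {
                long r = random();
                if (r < limit) {
                    return (int)(r % top);
                }
            }
        }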
    

Although it's better to replace random() itself with something that uses real
entropy.

~~~
conistonwater
On my machine RAND_MAX is 2^31-1. So getting the probability wrong in the way
you describe means a relative error of one in two billion. That's really quite
small for non-critical applications.

~~~
colmmacc
It's not so simple. The larger the modulus, the larger the error, up to a
limit. If the modulus is 2^30 + 2^29 for example, then values of r in the
range 0 ... (2^29 - 1) will be represented twice as often as values in the
range 2^29 ... (2^30 + 2^29 - 1). This is a much more significant error than
one in two billion (it's close to 1 in 3, nearly ten orders of magnitude more
significant).
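The arithmetic behind this can be sketched in a few lines of C. The function name `hits_for_residue` is made up for illustration, and `range` is assumed to be the number of equally likely generator outputs (2^31 when RAND_MAX is 2^31-1).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of generator outputs in 0..range-1 that land on residue r
 * after reduction modulo m. Every residue gets range / m hits, and the
 * lowest (range % m) residues get one extra.
 */
uint64_t hits_for_residue(uint64_t range, uint64_t m, uint64_t r)
{
    assert(m > 0 && r < m);
    return range / m + (r < range % m ? 1 : 0);
}
```

With `range` = 2^31 and `m` = 6, residues 0 and 1 get 357913942 hits and the rest get 357913941, so the skew is tiny. With `m` = 2^30 + 2^29, residues below 2^29 get 2 hits and all others get 1.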

Subtleties like this are hard to detect on code review, and even in testing
... the real lesson may be that the very interface behind random(void) is just
broken. It really should be random(int) and take care of all of this for you,
as many other languages/libraries do.

~~~
conistonwater
Yes, you are right, I didn't think of really large moduluses of magnitude
comparable to RAND_MAX.

------
sjolsen
This sort of code is one of the motivations behind improving the random number
generation facilities available in C++. If you're using C++ or you're using C
and have the option to link with C++, I strongly recommend looking at the
random number library introduced in C++11 [1]. In addition to letting you
specify a statistical distribution (including a proper uniform distribution
for both integral and floating-point types), it lets you choose between
various PRNG engines with various trade-offs. It also provides a way to source
hardware entropy with which to seed an engine, and it's all pretty easy to
use.

*[1] en.cppreference.com/w/cpp/numeric

~~~
tptacek
It's nice to make it clear what distribution you're going to get from a PRNG,
or what the performance characteristics are going to be. But for cryptographic
randomness (the subtext of 'tedunangst's post), what you _don't_ want are a
bunch of knobs and dials. What you generally need is a guarantee that your RNG
interface is feeding you data from the system secure random number generator
--- either urandom or CryptGenRandom.

I'm also not sure what "hardware" has to do with that guarantee.

One example I can think of where a standard library got this right is Golang:

[http://golang.org/pkg/crypto/rand/](http://golang.org/pkg/crypto/rand/)

~~~
pbsd
FWIW, the C++ random number library is very much _not_ meant for cryptography.
This is obvious when you see the weak guarantees given by std::random_device:
on Windows, libstdc++ will actually use the Mersenne Twister with a hardcoded
seed, _and this is standard-compliant_! Additionally, every engine included in
the library is not cryptographic at all.

The Boost random library actually does the right thing, and states that
boost::random_device must not be implemented where a proper source of random
bits is not available. The standard tries to be all things to all people, and
std::random_device ends up being useless for any serious application.

As a mathematical random number generation library, the C++ library is much
superior to its Go counterpart, math/rand.

------
clarry
I would've appreciated if you'd annotated each snippet with the source so as
to make it easier for us to find the program it came from. One interesting
question to ask next would be, what do these programs do with their random
numbers? And so, does the quality of the stream or the non-repeatability of it
matter at all?

~~~
waterhouse
Since he's mocking and snarking at each snippet, I think adding pointers to
the sources would make it come off as an incitement to make fun of the
authors, and rather more mean than he intended. An alternative would be to add
direct citations but tone down the snark. I think Ted wanted to focus on the
code examples themselves, and to make an overarching point about how "rand" is
actually used in practice.

That said, if you want to figure it out, he did say they were "selected
examples" from this list of projects:
[http://marc.info/?l=openbsd-tech&m=141776286105814&w=2](http://marc.info/?l=openbsd-tech&m=141776286105814&w=2)
Also, you may be able to google for exact text
matches with the code. (You might find _other_ projects that have the exact
same problem--but that is likely good enough.)

~~~
moron4hire
There was a particularly egregious example near the end for which a simple
Google search turned up exactly one copy.

------
mijoharas
My favourite bit is: "Take 16 bytes of random data. No, wait, make that 15
bytes. Then hash it to four bytes to really squeeze the entropy in. Then
seed."

Good article

------
viraptor
With all the comments the author makes about nonstandard and crazy behaviour,
he actually misses some practical solutions and makes fun of them.

"The one operation that was not observed was subtracting the pid from the
time. More research into this subject is warranted."

It's actually simple (even if still not effective on pid wraparound) - pid
numbers grow, at a rate of at least 1 per program execution. Time grows at
around 1 second per second. If you subtracted pid from time, there's a good
chance you would get the same seed by running the app twice in a row.

So it's added instead, so that it always grows. And we pretend the wraparound
happens very rarely.

Broken behaviour? Sure. Practical solution that works for 99% of cases where
non-critical randomness is required? Definitely.
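The collision described above is easy to reproduce. This is only an illustrative sketch: `seed_by_sum` and `seed_by_difference` are hypothetical helpers, and the time and pid values in the usage note are invented rather than taken from any real program.

```c
#include <assert.h>
#include <sys/types.h>
#include <time.h>

/* Seed formed by adding the pid to the time, the pattern seen in the wild. */
unsigned seed_by_sum(time_t now, pid_t pid)
{
    assert(pid > 0);  /* pids of real processes are positive */
    return (unsigned)now + (unsigned)pid;
}

/* Seed formed by subtracting the pid from the time. */
unsigned seed_by_difference(time_t now, pid_t pid)
{
    assert(pid > 0);
    return (unsigned)now - (unsigned)pid;
}
```

If a second run starts one second later with the next pid, for example (1000, 500) followed by (1001, 501), `seed_by_difference` returns 500 both times, while `seed_by_sum` moves from 1500 to 1502. Neither is suitable where unpredictability matters.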

------
sarciszewski
For cryptography, there's really no reason to use rand(), mt_rand(), or the
other insecure variants. No excuse, I should say.

Just use urandom. Or getentropy() if your OS supports it.

If you're not using it for cryptographic purposes, then I don't see why it
matters. :)
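For reference, a minimal sketch of "just use urandom" in C might look like the following. It assumes a Unix-like system that provides /dev/urandom, and `fill_random` is a made-up helper name, not an existing API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Fill buf with len bytes read from /dev/urandom.
 * Returns 0 on success and -1 if the device cannot be opened or read.
 */
int fill_random(void *buf, size_t len)
{
    assert(buf != NULL);

    FILE *f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;

    /* Disable stdio buffering so only the requested bytes are consumed. */
    setvbuf(f, NULL, _IONBF, 0);

    size_t got = fread(buf, 1, len, f);
    fclose(f);
    return got == len ? 0 : -1;
}
```

A caller would check the return value and refuse to continue on failure, rather than falling back to a weaker source.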

~~~
jsnell
There are all kinds of apps that might not need crypto quality randomness, but
do at least need the random number streams to be different between invocations
of the program! A Monte Carlo simulation that always gets the same results
isn't going to be too hot. A game that seeds the RNG with 16 bits of entropy
will be fine for a single player, but not for a community. (Think of FreeCell
or Minesweeper in Windows, though those might even have been just 15 bits).

~~~
conistonwater
> A Monte Carlo simulation that always gets the same results isn't going to be
> too hot.

A Monte Carlo simulation absolutely needs to be able to produce the same
numbers every time if you want to have a hope of debugging it. In a release
version, yes, sure. But even there, people will still have concerns about
replicating other people's results, so being careful with random seeds is
important.

~~~
smeyer
There's a difference between "always uses the same seed" and "can be seeded
for reproducibility". You probably want the latter and probably don't want the
former.
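One way to read that distinction, sketched in C with the standard `srand`/`rand` pair: accept a seed when the caller supplies one, otherwise pick one, and always report which seed was used so a run can be replayed. The names `seed_generator` and `draw` are hypothetical, and the `time(NULL)` fallback is only a placeholder for a better entropy source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/*
 * Seed rand(). If requested is non-NULL its value is used, which makes
 * the run reproducible; otherwise a seed is chosen from the clock.
 * Returns the seed actually used so it can be logged.
 */
unsigned seed_generator(const unsigned *requested)
{
    unsigned seed = requested ? *requested : (unsigned)time(NULL);
    srand(seed);
    return seed;
}

/* Fill out with the next n values from rand(). */
void draw(int *out, size_t n)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = rand();
}
```

Seeding twice with the same value and calling `draw` each time produces identical arrays, while a NULL seed gives runs that differ whenever the clock value differs.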

------
anders
> Perhaps there is a reason why software like Lua, Python, and Ruby all
> include their own implementation of a Mersenne Twister.

As far as I know, Lua does not include Mersenne Twister. math.random() is just
C rand().

~~~
mbq
Also MT still has to be seeded, it is not crypto-safe and its
quality/computational load ratio is not too good.

~~~
Scaevolus
It's far more complex than it needs to be. If you want a fast, high quality
deterministic RNG, use something from the xorshift family.
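As a rough illustration of how small such a generator is, here is Marsaglia's 32-bit xorshift with the (13, 17, 5) shift triple. It is a sketch of one member of the family, not a recommendation of this particular variant, and it is not suitable for cryptography.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Advance a 32-bit xorshift generator and return the new value.
 * The state must be nonzero: zero is a fixed point and would produce
 * zeros forever.
 */
uint32_t xorshift32(uint32_t *state)
{
    assert(state != 0 && *state != 0);

    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}
```

Starting from a state of 1, the first value returned is 270369. The same starting state always yields the same sequence, so seeding is still the caller's problem.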

~~~
edwintorok
This is a good article describing fast non-cryptographic RNGs:
[http://www.drdobbs.com/tools/fast-high-quality-parallel-rand...](http://www.drdobbs.com/tools/fast-high-quality-parallel-random-number/229625477)
[http://www.drdobbs.com/tools/fast-high-quality-parallel-rand...](http://www.drdobbs.com/tools/fast-high-quality-parallel-random-number/231000484)

CMRES is quite impressive, I get ~1500 MB/s out of it. Quite useful for
generating large unique files for various testcases.

