Violating randomization standards (gmane.org)
57 points by pdw on Dec 8, 2014 | 41 comments



This seems like a very bad idea to me.

It's highly ingrained in my mind that rand() provides a deterministic sequence, and I use deterministic pseudorandom sequences all the time.

If you are changing this just in OpenBSD isn't that going to be a portability nightmare? I would find this behaviour in OpenBSD extremely surprising.

I'm very sceptical of the concept that developers expect true randomness out of rand(). Unless the way people are taught this stuff has changed drastically, the very concept of randomness in computing was always introduced right alongside the idea that they don't produce true randomness.

In fact, needing to call srand() (or else getting the same sequence every time) can't help but teach you that it's a deterministic sequence.
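
A minimal sketch of that determinism (the exact values are implementation-specific, but a conforming libc repeats the same sequence for the same seed):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        /* Same seed, same sequence: both loops print identical numbers
           on a conforming implementation (the values themselves differ
           between libcs). */
        srand(42);
        for (int i = 0; i < 3; i++) printf("%d ", rand());
        printf("\n");

        srand(42);
        for (int i = 0; i < 3; i++) printf("%d ", rand());
        printf("\n");
        return 0;
    }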


If people wanted deterministic sequences, they wouldn't be calling srand(time(NULL)). Or srand((time(NULL) + buf_len) ^ rand()). Or, my personal favorite, srand(getpid() * time(NULL) % ((unsigned int) -1)). Though I will admit it is very difficult for me to properly imagine what that developer was expecting.


I tend to agree. The determinism of random() is a feature, but it's a feature that almost nobody actually needs. It's just that the alternatives are either more trouble than they're worth or platform-dependent.


I use the determinism with test cases generating random input data. Every run of a test case generates a random seed based off the system time and reports the seed used. When a failure happens, the seed is fixed using an environment variable. Then the failed case can be reproduced until the bug is tracked down.
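
A minimal sketch of that pattern, assuming a hypothetical TEST_SEED environment variable (the name is illustrative, not a standard):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Hypothetical helper: use the seed from the environment if set,
       otherwise derive one from the clock and report it so a failing
       run can be replayed later. */
    static unsigned int test_seed(void) {
        const char *env = getenv("TEST_SEED");
        unsigned int seed = env ? (unsigned int)strtoul(env, NULL, 10)
                                : (unsigned int)time(NULL);
        fprintf(stderr, "using seed %u (set TEST_SEED=%u to reproduce)\n",
                seed, seed);
        return seed;
    }

    /* ... then: srand(test_seed()); before generating test input. */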

This isn't an unusual practice.

The determinism in standard PRNG functions is an important feature, not a weakness. Theo is being foolish. If he's unhappy with the way the *rand functions are being used, he should change the code calling into the library to use random seeds derived from a suitable entropy source, not change the library itself.


> This isn't an unusual practice.

I would even say it's common within programs that generate random sets, but that those use cases make up a tiny minority of all the uses of rand(), and it shouldn't be difficult to determine them from context.


Why are things like Mersenne Twister or linear congruential generators platform dependent?

To me, there are three classes of reasons you would need random numbers:

1. Simulation. Things like Monte Carlo. In this case, only the statistical quality of the numbers matters.

2. Repeatable simulation. Things like fuzzing tests or benchmarks, where you want them to be repeatable. In this case, you care about determinism and the statistical quality of the numbers.

3. Resource allocation. In this case, you have a set of things that you want to be unique, and either unpredictable (process numbers, invoice numbers) or unique without shared state (GUIDs). You care about the statistical quality and unpredictability here.

The problem is that 2 and 3 are at odds with each other. 3 requires a constant source of entropy, and 2 requires there to be no entropy.

I think the following slides are relevant: http://www.openbsd.org/papers/hackfest2014-arc4random/mgp000...


I would add a #4: cryptography.

You don't need to add entropy for #3. If a CSPRNG with n bits of initial entropy as a seed can satisfy #4, then it's also good enough for #3.


My understanding is that the way cryptography is done is by having a hardware entropy source in the CPU fed into some sort of hashing algorithm. A CSPRNG with only n bits of initial entropy has the problem that if an attacker can somehow get the state of the CSPRNG, they can predict future random numbers (and previous ones, depending on the implementation). Constantly reseeding with entropy means that it's not possible to do that.


I meant that "true" random numbers are platform dependent.


It's in fact half of a feature, because the generation method is not specified. So you can't use it for terrain generation in portable code, for example.


I'm not going to argue that rand() isn't misused by programmers who don't know any better, but at the same time you have legions of coders who understand rand() very well and those expectations will be totally broken here.

I wouldn't have even bothered to check the man pages for rand before assuming it was deterministic for a given seed to srand().


I imagine they wanted a different set of numbers each time, and didn't really care what they were, provided the sequence was vaguely unpredictable. On the face of it this doesn't sound to me like a great reason for breaking standardized behaviour - but you know what they say about people who comment from the sidelines.


It's already not a terribly good idea to use rand() for deterministic pseudorandom numbers, as the generated sequence isn't portable between systems.


The sequence isn't necessarily even portable within a system, as the libraries used by your application may themselves call rand() in unpredictable ways. For instance, consider a DNS resolver that generates query IDs using rand(). If the resolver library generates a new ID for retransmitted queries (e.g., when a packet is dropped), any DNS resolution may desynchronize the state of the random number generator.

And, just as if that weren't bad enough, consider multithreading. The order in which threads execute is unpredictable, so, if more than one thread calling rand() is executing at once, the state of the RNG may become dependent on the order threads run in. Long story short, setting a global RNG to a known state isn't a sufficient way to make a program deterministic.
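
A contrived sketch of the library-call problem: a single "hidden" rand() call anywhere in the process shifts every value your own code sees afterwards.

    #include <stdlib.h>

    /* Stand-in for a library routine that happens to use rand()
       internally, e.g. to pick a retry delay or a query ID. */
    static void library_call_that_uses_rand(void) {
        (void)rand();
    }

    int main(void) {
        srand(12345);
        int a = rand();                  /* first value of the sequence */
        library_call_that_uses_rand();   /* silently consumes one value */
        int b = rand();                  /* now the third value, not the
                                            second you might have expected */
        (void)a; (void)b;
        return 0;
    }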


The use cases are more about being able to repeat a run on a given system - a particular simulation or a unit test that triggered an error (e.g. a sequence of values in a hash table). I could probably come up with a story where I'd want rand to be portable across systems, but it wouldn't be plausible.


For a relatively common example of where you want portable pseudo-randomness, many games generate "random" worlds with a known PRNG algorithm and expose the seeds to the player. Players can then share interesting worlds with other players by simply sharing the seed. They can also generate "personalized" worlds by using their name or other meaningful text as the seed.

Of course, it's completely reasonable to expect such games to bring their own PRNG implementation for that.
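
A minimal sketch of "bring your own PRNG": a self-contained xorshift64 generator seeded from a player-supplied string via an FNV-1a hash. The function names are illustrative, not from any particular game.

    #include <stdint.h>

    /* Map a player-supplied string to a 64-bit seed (FNV-1a hash),
       so "my name" always produces the same world. */
    static uint64_t seed_from_string(const char *s) {
        uint64_t h = 0xcbf29ce484222325ULL;
        while (*s) {
            h ^= (uint8_t)*s++;
            h *= 0x100000001b3ULL;
        }
        return h ? h : 1;   /* xorshift state must not start at zero */
    }

    /* Self-contained xorshift64 step: same seed, same stream, on
       every platform, regardless of what libc's rand() does. */
    static uint64_t next_random(uint64_t *state) {
        uint64_t x = *state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return *state = x;
    }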


> Of course, it's completely reasonable to expect such games to bring their own PRNG implementation for that.

Indeed. I was going to point out that rand() and random() might be horrible options even if you want a deterministic, repeatable stream of random numbers. I am pretty sure celeron55 initially tried using these functions in world generation for minetest, and seeing the rng's output as cubes in a 3D image made it hilariously clear how predictable and patterned the output is.


You don't always need good random numbers but "random looking" numbers. Examples:

- Map generation for games

- AI for games

- Random looking texture generation (like Perlin noise)

- Shuffle a playlist in your music player

- Clicking "random" on a webcomic's page

I'm sure there are other examples. I think that rand() is much faster than any good pseudorandom algorithm too.
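
For the playlist case above, a Fisher-Yates shuffle driven by rand() is all you need; the naive rand() % n here has a slight modulo bias, which is harmless for this purpose.

    #include <stdlib.h>

    /* Shuffle n track indices in place (Fisher-Yates). */
    static void shuffle_playlist(int *tracks, int n) {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int tmp = tracks[i];
            tracks[i] = tracks[j];
            tracks[j] = tmp;
        }
    }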


Actually if you try using rand() for map or noise generation, you'll find that most implementations of it produce utter shit. It's not even random looking. It produces very obvious patterns. Really, it's shit.

So despite that, it could still be useful in some situations, but I don't think that's a good enough reason to require its existence in standards and standard libraries. If your application happens to need something that bears a remote resemblance to randomness, and it needs to be fast, your application could just provide its own implementation of that special functionality. rand is just a few lines of code...
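
For illustration, here is roughly the sample generator shown in the C standard (reproduced from memory, so treat it as a sketch), renamed so an application can carry it around without clashing with libc:

    /* Roughly the C standard's sample LCG, privately named. */
    static unsigned long my_next = 1;

    int my_rand(void) {                  /* returns values in [0, 32767] */
        my_next = my_next * 1103515245 + 12345;
        return (unsigned int)(my_next / 65536) % 32768;
    }

    void my_srand(unsigned int seed) {
        my_next = seed;
    }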


> I think that rand() is much faster than any good pseudorandom algorithm too.

This is simply not true. It's often _slower_, since it uses the integer div instruction.

For an example of a good fast PRNG, see Tyche: https://eden.dei.uc.pt/~sneves/pubs/2011-snfa2.pdf

It's faster and produces infinitely better distributions.


> This is simply not true. It's often _slower_, since it uses the integer div instruction.

Actually I think most implementations reduce modulo a power of two (e.g. 2^31 or 2^32):

    0000000000048ee0 <rand_r>:
       48ee0:       8b 07                   mov    (%rdi),%eax
       48ee2:       69 c0 6d 4e c6 41       imul   $0x41c64e6d,%eax,%eax
       48ee8:       05 39 30 00 00          add    $0x3039,%eax
       48eed:       89 07                   mov    %eax,(%rdi)
       48eef:       25 ff ff ff 7f          and    $0x7fffffff,%eax
       48ef4:       c3                      retq   
https://en.wikipedia.org/wiki/Linear_congruential_generator#...
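
For reference, that disassembly is just the following LCG step in C (0x41c64e6d is 1103515245 and 0x3039 is 12345):

    /* C rendering of the rand_r step shown above: multiply, add, then
       mask to 31 bits instead of an explicit modulo. */
    int rand_r_step(unsigned int *state) {
        *state = *state * 1103515245u + 12345u;
        return (int)(*state & 0x7fffffff);
    }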


Whoops, I totally forgot about that; I think I was confusing it with range reduction on the final result. Nevertheless, Tyche-i is also blazing fast and small:

    0000000000000000 <_Z6tycheiv>:
       0:   8b 0d 0c 00 00 00       mov    0xc(%rip),%ecx        
       6:   c4 e3 7b f0 05 07 00    rorx   $0x7,0x7(%rip),%eax   
       d:   00 00 07
      10:   c4 e3 7b f0 15 ff ff    rorx   $0x8,-0x1(%rip),%edx  
      17:   ff ff 08
      1a:   44 8b 05 04 00 00 00    mov    0x4(%rip),%r8d        
      21:   31 ca                   xor    %ecx,%edx
      23:   44 31 c0                xor    %r8d,%eax
      26:   41 29 d0                sub    %edx,%r8d
      29:   c4 e3 7b f0 d2 10       rorx   $0x10,%edx,%edx
      2f:   29 c1                   sub    %eax,%ecx
      31:   c4 e3 7b f0 c0 0c       rorx   $0xc,%eax,%eax
      37:   44 31 c0                xor    %r8d,%eax
      3a:   31 ca                   xor    %ecx,%edx
      3c:   29 c1                   sub    %eax,%ecx
      3e:   89 05 08 00 00 00       mov    %eax,0x8(%rip)        
      44:   41 29 d0                sub    %edx,%r8d
      47:   89 15 00 00 00 00       mov    %edx,0x0(%rip)        
      4d:   44 89 05 04 00 00 00    mov    %r8d,0x4(%rip)        
      54:   89 0d 0c 00 00 00       mov    %ecx,0xc(%rip)        
      5a:   c3                      retq


It's slightly disappointing that he didn't look through at least a sample of the apps using something like `srand(silly_value)` to see what the numbers are used for (at least, that's what I understand from his post). They're using silly pseudo-random numbers, but what if all of them expect nothing more? They're breaking a known interface in a way that may catch some people off guard, and there's a chance the "good side" of the change won't even be noticed.

I mean, use cases like choosing a tip of the day to display, or choosing a starting colour for displaying a number of objects, or a million other ideas probably wouldn't care if 0 was chosen 10% of the time as a result.


> All three subsystems produce deterministic results. The reasons for this are historical and not known to me, but it might even be linked to Dual_EC_DRBG.

Seriously? I mean rand/srand have been around since what, at least 1980?


And then he goes and replaces it with something that calls itself RC4 (but isn't!)? April fools must have come early.


Maybe he just meant that it was designed to be deterministic for the same reasons.


There's been a warning in 'man rand' for as long as I can remember. The man page on OS X has "bad random number generator" in the NAME section.

The amazing thing is that they're still being used.


What's the current recommendation for a better random number generator? (C please)


It depends on what you mean by "better". You might want something deterministic, fast, and with less predictable patterns (Mersenne Twister); deterministic and indistinguishable from true randomness, but slower (a CSPRNG); or non-deterministic (which depends on the platform).


Maybe arc4random(), from OpenBSD. There were some slides linked earlier about it [1], it might be worth looking into.

[1] http://www.openbsd.org/papers/hackfest2014-arc4random/mgp000...
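
A minimal usage sketch, assuming a system that provides arc4random(3) and arc4random_uniform(3) (OpenBSD and the other BSDs, OS X, or Linux via libbsd):

    #include <stdlib.h>   /* arc4random() on BSD/OS X; <bsd/stdlib.h> with libbsd */

    int roll_die(void) {
        /* arc4random_uniform avoids the modulo bias of arc4random() % 6
           and requires no seeding at all. */
        return (int)arc4random_uniform(6) + 1;
    }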


Is it really wise to choose a random number generator based on RC4, which is known to have remarkably strong biases? Surely it would be better to use something based on ChaCha20, Keccak or Spritz if you have the choice?


It does use ChaCha20, and the implementation will most likely use something else in another 20 years.

http://www.openbsd.org/papers/hackfest2014-arc4random/mgp000...


Serves me right for not reading through all the slides. I would argue that it's one of the more misleading function names out there, though.


Yes, it is misleading; it used to use RC4. POSIX is proposing to standardise it under a new, generic name, so it is unlikely to be renamed until that happens.


rand(3) recommends random(3) and arc4random(3)

rand48(3) recommends random(3)

random(3) recommends arc4random(3)

So no excuses.


For most systems, open /dev/urandom and read from it.
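
A sketch of the /dev/urandom route in plain C stdio, with the caveat that the device may not exist on non-Unix systems:

    #include <stdio.h>
    #include <stddef.h>

    /* Fill buf with len random bytes from /dev/urandom.
       Returns 0 on success, -1 on failure. */
    static int urandom_bytes(void *buf, size_t len) {
        FILE *f = fopen("/dev/urandom", "rb");
        if (!f) return -1;
        size_t got = fread(buf, 1, len, f);
        fclose(f);
        return got == len ? 0 : -1;
    }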

For pure C, use a secure hash function such as SHA-256.

For a simpler approach, just fail to initialize the variable for which you want random data, especially if you frequently want the value to just be zero when debugging.


Those first two suggestions are good. The last one, not so good.

First, it's not very random. As you point out, it will often just be zero. If zero is acceptable to you, then just initialize it to zero. If you need something reliably different then you can't count on uninitialized memory to deliver it.

Second, it will trip diagnostics in tools like Valgrind, which you'll have to either silence or ignore.

Third, and this is a real fun one, any C program which reads from uninitialized memory is non-conforming, which means that the compiler is allowed to assume that it never happens. For an example, let's say you wrote some code to flip a coin "randomly":

    int uninitialized;
    if((uninitialized % 2) == 0) printf("heads");
    else printf("tails");
The compiler would be within its rights to eliminate both branches of the if statement. It would, in fact, be within its rights to start a game of nethack upon encountering such code (as early versions of gcc did when encountering a #pragma statement) or make demons fly out of your nose (a common example of what C compilers are allowed to do with such things, albeit one that's somewhat difficult to implement).

Don't read uninitialized variables. It's the law!



What is your use case for random numbers?


Considering an RTL-SDR radio is $15, why not just make it standard to include a radio which samples atmospheric noise to generate random numbers?


First, sampling atmospheric noise is not a good general entropy source, because it needs careful setup to ensure that what you are sampling actually is atmospheric noise and not some signal. Also, such noise may well be unpredictable, but it certainly is not secret (i.e. not readily usable for cryptographic purposes).

Second, there are significantly cheaper ways to get quality entropy in a modern digital system. The relative timing of various internal events in a system with multiple clock sources is one meaningful source. Also, a purpose-built hardware RNG integrated into some piece of silicon is essentially free (and usually cheaper than its interface circuitry).

The problem is not that the system does not have usable entropy, but that applications tend not to use the good sources that are already there (i.e. /dev/urandom, CryptGenRandom/rand_s and such).



