Ask for opinion: changing rand(3) to random(3) in awk(1)

Thu Aug 28 15:43:37 UTC 2014

Chenguang Li <horus.li at gmail.com> writes:

> The problem I was trying to describe was its "one-shot" randomness. Take these two as examples (where it matters):
>
> 1. You wrote a script[1] that simulates rolling a die; it would
> produce the same result if executed within, say, 5 seconds.
> [1] BEGIN { srand(); print int(1+rand()*6); } or BEGIN { srand(); } { print int(1+rand()*6); }, won't matter.

One second, not 5. Calling srand() without a parameter seeds the random
number generator with the current time in seconds, so the value changes
once per second.

> 2. You have a CGI script which will show different content based on the number generated by rand().
>
> In the first situation, you can generate all the outcomes in a single
> run by using for-loop, but the first outcome will be the same. OSX's
> awk(1) will produce a reasonable number every time I run it. In the
> latter one, you could call rand() once and throw away the result, and
> call it again to get another number. Both are practical workarounds,
> but we do have a better choice: applying the modification I suggested
> before.

You are still misunderstanding the relationship between srand() and
rand(), in a way that will not be fixed by changing awk's implementation
from rand(3) to random(3). srand() "seeds" the random number generator
with a particular value, and the sequence of numbers is completely
determined afterwards. This isn't a bug; the ability to exactly
reproduce a sequence of "random" numbers is an essential feature in a
lot of simulation uses. This is also why we refer to these algorithms as
"pseudo-random" rather than just "random."

In your cases, you really do want a different sequence every time. The
way that is handled is by using a different seed each time. The normal
use of srand() uses the current time, so as long as it isn't called
twice within one second, it will always use a different sequence of
numbers. If it *is* called twice within the same second, it will produce
the same sequence of numbers (not just the same first number, but the
second, third, etc. number will be the same also). This is just as true
on OSX as on FreeBSD. Your use of srand() in your first script is buggy
because it calls srand() for *every* call to rand(); your second version
fixes this problem.
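
A minimal sketch of that determinism, using an explicit seed as a stand-in
for two runs that land in the same clock second. The helper name roll5 and
the seed value 42 are made up for illustration:

```shell
# Hypothetical helper: print five die rolls for a given seed.
roll5() {
    awk -v seed="$1" 'BEGIN {
        srand(seed)                      # explicit seed instead of the clock
        for (i = 1; i <= 5; i++)
            print int(1 + rand() * 6)    # integer in the range 1..6
    }'
}

first=$(roll5 42)
second=$(roll5 42)

# Same seed, same sequence, however far apart the two runs are.
if [ "$first" = "$second" ]; then
    echo "identical sequences"
else
    echo "sequences differ"
fi
```

The actual numbers depend on which awk is installed, but any given awk
should print the same five numbers for the same seed.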

How do we deal with the one-second window? Well, most of the time we
ignore it. For a CGI script, it won't matter. If you really do need to
run separate copies of an awk script more often, you'll need a better
seed. Reading one from /dev/random would be one way for your awk
script to get it. An important point that you may have missed is that
when your script calls srand(), it can provide a parameter, which will
be used instead of the current time.
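
One possible way to do that, sketched under a few assumptions: it reads
from /dev/urandom rather than /dev/random so that it will not block, it
uses od(1) to turn four bytes into a decimal number, and it reduces the
seed to 31 bits as a cautious guess at what every awk will accept. The
variable names are made up for illustration:

```shell
# Read 4 bytes from the kernel's random device as an unsigned decimal integer.
seed=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')

roll=$(awk -v seed="$seed" 'BEGIN {
    srand(seed % 2147483648)     # explicit seed replaces the current time
    print int(1 + rand() * 6)    # integer in the range 1..6
}')

echo "$roll"
```

With this, two copies started within the same second will almost always
get different seeds, so their rolls are no longer tied to the clock.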

> If others are not affected by the problem I described above, then I am
> okay with that. The other reason why I suggest this is, I see no loss,
> only to make it better.

The problem you described is caused by your calling srand() multiple
times. This is a bug on your part, not a problem with awk that would
affect other people. Changing awk to use random(3) instead of rand(3)
will not fix your problem, because repeatedly calling srandom(3) with
the same seed will give you the same values from random(3), just as
doing the same with srand(3) and rand(3) will. In your example:
> BEGIN { srand(); print int(1+rand()*6); } or BEGIN { srand(); } { print int(1+rand()*6); }
the first one is broken and the second one works (try them and compare
the output).
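
The effect of reseeding can also be seen inside a single awk process. This
is only an illustrative sketch: the fixed seed 7 stands in for a clock
value that has not changed between calls, and the variable names are made
up:

```shell
# Reseed before every call: each rand() restarts the same sequence.
reseeded=$(awk 'BEGIN {
    for (i = 1; i <= 5; i++) {
        srand(7)                     # same seed every time
        print int(1 + rand() * 6)
    }
}')

# Seed once: successive rand() calls walk through the sequence.
seeded_once=$(awk 'BEGIN {
    srand(7)
    for (i = 1; i <= 5; i++)
        print int(1 + rand() * 6)
}')

# Unquoted expansion joins the five lines into one for display.
echo "reseeded each time:" $reseeded
echo "seeded once:       " $seeded_once
```

The first line should show one value repeated five times. The second
starts with that same value and will normally go on to other values.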

Although it may not fix the problem you thought it would, you're right
that there's no loss in making the change, so I think it's a good idea.

Be well.
        Lowell