Unevenly-Spread Results Using $RANDOM – Why and How to Fix

random

I read about RNGs on Wikipedia and $RANDOM function on TLDP but it doesn't really explain this result:

$ max=$((6*3600))
$ for f in {1..100000}; do echo $(($RANDOM%max/3600)); done | sort | uniq -c
  21787 0
  22114 1
  21933 2
  12157 3
  10938 4
  11071 5

Why are the values above about 2x more inclined to be 0, 1, 2 than 3, 4, 5 but when I change the max modulo they're almost equally spread over all 10 values?

$ max=$((9*3600))
$ for f in {1..100000}; do echo $(($RANDOM%max/3600)); done | sort | uniq -c
  11940 0
  11199 1
  10898 2
  10945 3
  11239 4
  10928 5
  10875 6
  10759 7
  11217 8

Best Answer

To expand on the topic of modulo bias, your formula is:

max=$((6*3600))
$(($RANDOM%max/3600))

And in this formula, $RANDOM is a random value in the range 0-32767.

   RANDOM Each time this parameter is referenced, a random integer between
          0 and 32767 is generated.

It helps to visualize how this maps to possible values:

0 = 0-3599
1 = 3600-7199
2 = 7200-10799
3 = 10800-14399
4 = 14400-17999
5 = 18000-21599
0 = 21600-25199
1 = 25200-28799
2 = 28800-32399
3 = 32400-32767

So in your formula, the probability for 0, 1, 2 is twice that of 4, 5. And probability of 3 is slightly higher than 4, 5 too. Hence your result with 0, 1, 2 as winners and 4, 5 as losers.

When changing to 9*3600, it turns out as:

0 = 0-3599
1 = 3600-7199
2 = 7200-10799
3 = 10800-14399
4 = 14400-17999
5 = 18000-21599
6 = 21600-25199
7 = 25200-28799
8 = 28800-32399
0 = 32400-32767

1-8 have the same probability, but there is still a slight bias for 0, and hence 0 was still the winner in your test with 100'000 iterations.

To fix the modulo bias, you should first simplify the formula (if you only want 0-5 then the modulo is 6, not 3600 or even crazier number, no sense in that). This simplification alone will reduce your bias by a lot (32766 maps to 0, 32767 to 1 giving a tiny bias to those two numbers).

To get rid of bias altogether, you need to re-roll, (for example) when $RANDOM is lower than 32768 % 6 (eliminate the states that do not map perfectly to available random range).

max=6
for f in {1..100000}
do
    r=$RANDOM
    while [ $r -lt $((32768 % $max)) ]; do r=$RANDOM; done
    echo $(($r%max))
done | sort | uniq -c | sort -n

Test result:

  16425 5
  16515 1
  16720 0
  16769 2
  16776 4
  16795 3

The alternative would be using a different random source that does not have noticable bias (orders of magnitude larger than just 32768 possible values). But implementing a re-roll logic anyway doesn't hurt (even if it likely never comes to pass).

Related Question