Postgresql – Using Postgres to return a selection based on frequency distribution

postgresqlstatistics

I am not even sure if this is a possibility so my apologies if I'm barking up the wrong tree; I am ignorant of Postgres's limitations as I am self-taught.

This is hypothetical, but if I had a table like so:

dice_rolls

roll_1 roll_2 sum
1 1 2
1 2 3
1 3 4
1 4 5
1 5 6
1 6 7
2 1 3
2 2 4
2 3 5
2 4 6
2 5 7
2 6 8
3 1 4
3 2 5
3 3 6
3 4 7
3 5 8
3 6 9
4 1 5
4 2 6
4 3 7
4 4 8
4 5 9
4 6 10
5 1 6
5 2 7
5 3 8
5 4 9
5 5 10
5 6 11
6 1 7
6 2 8
6 3 9
6 4 10
6 5 11
6 6 12

we can see that the value '7' for sum appears most frequently. If we plotted the frequency distribution of the sum, it would look something like this:

               x
            x  x  x
         x  x  x  x  x  
      x  x  x  x  x  x  x 
   x  x  x  x  x  x  x  x  x
x  x  x  x  x  x  x  x  x  x  x

2  3  4  5  6  7  8  9  10 11 12

Therefore, we can assume that a truly random selection of this dataset will follow this distribution pattern. What I'm interested in is psuedo-random DISTINCT selection of N rows that attempts to normalize the distribution based on the sum column.

E.g.,

SELECT * FROM dice_rolls ORDER BY RANDOM() LIMIT 5;

is what I have now. This query is most likely to contain rows where the sum is 5, 6, 7, 8, 9. How would I go about using a custom ORDER BY function that would be significantly more likely (or even strictly) to, return rows where the sum has an even chance of being 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12?

Best Answer

Get a CTE with the different sums, pick one and return a matching row:

WITH sums AS (
   SELECT DISTINCT dice_sum FROM dice_rolls
)
SELECT * FROM dice_rolls
WHERE dice_sum = (SELECT dice_sum FROM sums ORDER BY random() LIMIT 1)
ORDER BY random() LIMIT 1;