Postgresql – PSQL: create random data sets with the same statistic distrbution as another table

postgresqlrandom

For test purposes, I want to create data sets (tables) where the statistic distribution of the values in the columns is the same as my actual data, but randomized.

Example : consider the following table ('ARTICLES' for a clothes shop):

| id | type    | colour | size | price |
----------------------------------------
| 1 | shirt    | yellow | M    | 14.99 |
| 2 | trousers | brown  | L    | 20.00 |
| 3 | shirt    | red    | L    | 14.99 |
| 4 | trousers | red    | L    | 20.00 |
| 5 | cap      | yellow | M    | 5.00  |
| 6 | cap      | brown  | S    | 5.00  |
| 7 | shirt    | red    | M    | 14.99 |
| 8 | trousers | red    | L    | 20.00 |

So, out of these 8 rows, I have the following distributions for each column:

  • type : 3 shirts, 3 trousers and 2 cap
  • colour : 4 red, 2 yellow, 2 brown
  • size : 1 S, 3 M, 4 L
  • price : 2 x 5.00 , 3 x 14.99 , 3 x 20.00

What I want is to create another table, with the same columns and the same number of rows, in which I have for each separate column the same statistic distribution of values, but assigned randomly and independently from each other.

Doing the stats for each column is pretty easy :

SELECT column_name, COUNT(1) AS number_of_rows 
FROM ARTICLES GROUP BY column_name ;

I can also easily record these stats in a dedicated table :

CREATE TABLE articles_stats ( column_name varchar(255), value varchar(255), number_of_rows integer );

INSERT INTO articles_stats (column_name, value, number_of_rows)
SELECT 'type', type, COUNT(1) AS number_of_rows 
    FROM ARTICLES GROUP BY type 
UNION
SELECT 'colour', colour, COUNT(1) AS number_of_rows 
    FROM ARTICLES GROUP BY colour  
UNION
SELECT 'size', size, COUNT(1) AS number_of_rows 
    FROM ARTICLES GROUP BY size  
UNION
SELECT 'price', price::varchar(255), COUNT(1) AS number_of_rows 
    FROM ARTICLES GROUP BY price ;

In this example, it would create the following articles_stats table :

| column_name | value    | number_of_rows | 
-------------------------------------------
| type        | shirt    |        3       |
| type        | trousers |        3       |
| type        | cap      |        2       |
| colour      | red      |        4       |
| colour      | yellow   |        2       |
| colour      | brown    |        2       |
| size        |    S     |        1       |
| size        |    M     |        3       |
| size        |    L     |        4       |
| price       |    5.00  |        2       |
| price       |   14.99  |        3       |
| price       |   20.00  |        3       |

But how do I then create inserts into the target table (lets' call it 'RANDOM_ARTICLES') ?

PS: I have to do this many times so I hope to create a PSQL function for this.

Best Answer

If you only need the distribution of cardinalities to be the same, and not the ordering, or the length of the text strings, then you can use a hashing function.

create new_table as select 
    md5('Aiw7Pa6y'||col1) as col1, 
    md5('coh9ahMo'||col2) as col2,
    md5('ii5eFohj'||col3::text) as col3
from old_table;

For non-text columns, you would have to decide if they need to be randomized at all, and if so whether conversion to text (keeping the cardinality of each distinct value, but not the order or relative gaps) would be suitable.