PostgreSQL – Sampling Techniques for Random Data

Tags: postgresql, random

I am looking for possible ways of random sampling in PostgreSQL. I found a couple of methods to do that with different advantages and disadvantages. The naive way to do that is:

select * from Table_Name
order by random()
limit 10;

Another faster method is:

select * from Table_Name
WHERE random() <= 0.01
order by random()
limit 10;

(Although that 0.01 depends on the table size and the sample size; this is just an example.)

In both of these queries a random number is generated for each row, and the rows are sorted by those random numbers. The first 10 rows of the sorted result are then selected, so I believe these queries sample without replacement.

Now what I want to do is to somehow turn these sampling methods into sampling with replacement. How is that possible? Or is there any other random sampling method with replacement in PostgreSQL?

I have to say that I do have an idea how this might be possible but I don't know how to implement it in Postgres. Here is my idea:

If, instead of generating one random value per row, we generate S random values per row (where S is the sample size) and then order all of the generated values, the result should be sampling with replacement, since the same row can contribute several of the smallest values. (I don't know if I am right.)
At this point I am not concerned about the performance of the query.
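For concreteness, the idea above could be sketched like this (a sketch only: Table_Name is a placeholder, and S = 10 is the sample size). Note that because the 10 · N row copies are themselves drawn without replacement, this only approximates true with-replacement sampling:

```sql
-- Give each row S = 10 copies, each carrying its own random sort key,
-- then take the 10 copies with the smallest keys overall. A row whose
-- copies draw several small keys is picked more than once.
select t.*
from
    Table_Name as t
  cross join
    generate_series(1, 10) as s(i)  -- S copies of every row
order by random()                   -- a fresh random value per copy
limit 10;
```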

Best Answer

In Postgres 9.3+, you can use the following LATERAL cross join:

select t.*
from 
  generate_series(1, 10) as x(i)
    cross join lateral
  ( select *, x.i
    from Table_Name
    -- where random() < 0.01
    order by random()
    limit 1 
  ) as t ;

which basically chooses 1 random row, 10 times.
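As a self-contained illustration (the demo table and its data are made up for the example; any table works in its place), you can see that duplicate rows appear, which confirms the sampling is with replacement:

```sql
-- Hypothetical 5-row demo table.
create temp table demo (id int primary key);
insert into demo select g from generate_series(1, 5) as g;

-- Draw 10 rows with replacement from the 5-row table:
-- by the pigeonhole principle, the output must contain duplicates.
select t.id
from
  generate_series(1, 10) as x(i)
    cross join lateral
  ( select d.id
    from demo as d
    order by random()
    limit 1
  ) as t;
```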


In older versions, you can use a simple cross join (no lateral):

select t.*
from 
  ( select tn.*
    from
        generate_series(1, 1000) as x(i)
      cross join 
        Table_Name as tn
    -- where random() < 0.01
    order by random()
    limit 10
  ) as t ;

which creates a 1000-fold copy of the table (so each row appears 1000 times) and then chooses 10 rows with the same method as your query. If the number of copies (1000) is large enough compared to the number of wanted rows (10), the probabilities are almost equal to the probabilities you would get with replacement.

Performance of this second query will of course be horrible, even with small tables.