PostgreSQL – Sampling Techniques for Random Data

Tags: postgresql, random

I am looking for possible ways of random sampling in PostgreSQL. I found a couple of methods to do that with different advantages and disadvantages. The naive way to do that is:

select * from Table_Name
order by random()
limit 10;

Another faster method is:

select * from Table_Name
WHERE random() <= 0.01
order by random()
limit 10;

(Although that 0.01 depends on the table size and the sample size; this is just an example.)

In both of these queries a random number is generated for each row, and the rows are sorted by those random numbers. The first 10 rows of the sorted result are then selected, so I believe these queries sample without replacement.

Now what I want to do is to somehow turn these sampling methods into sampling with replacement. How is that possible? Or is there any other random sampling method with replacement in PostgreSQL?

I have to say that I do have an idea how this might be possible but I don't know how to implement it in Postgres. Here is my idea:

If, instead of generating one random value per row, we generate S random values per row (where S is the sample size) and then order all of the generated values, the result should be sampling with replacement, since the same row can contribute several of the smallest values. (I don't know if I am right.)
At this point I am not concerned about the performance of the query.
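For concreteness, the idea above could be sketched like this (a sketch only: Table_Name is a placeholder, and S = 10 is the sample size). Note that because the 10 · N row copies are themselves drawn without replacement, this only approximates true with-replacement sampling:

```sql
-- Give each row S = 10 copies, each carrying its own random sort key,
-- then take the 10 copies with the smallest keys overall. A row whose
-- copies draw several small keys is picked more than once.
select t.*
from
    Table_Name as t
  cross join
    generate_series(1, 10) as s(i)  -- S copies of every row
order by random()                   -- a fresh random value per copy
limit 10;
```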

Best Answer

In Postgres 9.3+, you can use the following LATERAL cross join:

select t.*
from 
  generate_series(1, 10) as x(i)
    cross join lateral
  ( select *, x.i
    from Table_Name
    -- where random() < 0.01
    order by random()
    limit 1 
  ) as t ;

which basically chooses 1 random row, 10 times.
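As a self-contained illustration (the demo table and its data are made up for the example; any table works in its place), you can see that duplicate rows appear, which confirms the sampling is with replacement:

```sql
-- Hypothetical 5-row demo table.
create temp table demo (id int primary key);
insert into demo select g from generate_series(1, 5) as g;

-- Draw 10 rows with replacement from the 5-row table:
-- by the pigeonhole principle, the output must contain duplicates.
select t.id
from
  generate_series(1, 10) as x(i)
    cross join lateral
  ( select d.id
    from demo as d
    order by random()
    limit 1
  ) as t;
```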


In older versions, you can use a simple cross join (no lateral):

select t.*
from 
  ( select tn.*
    from
        generate_series(1, 1000) as x(i)
      cross join 
        Table_Name as tn
    -- where random() < 0.01
    order by random()
    limit 10
  ) as t ;

which creates a 1000-fold copy of the table (so each row appears 1000 times) and then chooses 10 rows with the same method as your query. If the number of copies (1000) is large enough compared to the number of wanted rows (10), the probabilities are almost equal to the probabilities you would get with replacement.

Performance of this second query will of course be horrible, even with small tables.