Postgresql – Using Postgres to return a selection based on frequency distribution

postgresqlstatistics

I am not even sure if this is a possibility so my apologies if I'm barking up the wrong tree; I am ignorant of Postgres's limitations as I am self-taught.

This is hypothetical, but if I had a table like so:

dice_rolls

roll_1	roll_2	sum
1	1	2
1	2	3
1	3	4
1	4	5
1	5	6
1	6	7
2	1	3
2	2	4
2	3	5
2	4	6
2	5	7
2	6	8
3	1	4
3	2	5
3	3	6
3	4	7
3	5	8
3	6	9
4	1	5
4	2	6
4	3	7
4	4	8
4	5	9
4	6	10
5	1	6
5	2	7
5	3	8
5	4	9
5	5	10
5	6	11
6	1	7
6	2	8
6	3	9
6	4	10
6	5	11
6	6	12

we can see that the value '7' for sum appears most frequently. If we plotted the frequency distribution of the sum, it would look something like this:

               x
            x  x  x
         x  x  x  x  x  
      x  x  x  x  x  x  x 
   x  x  x  x  x  x  x  x  x
x  x  x  x  x  x  x  x  x  x  x

2  3  4  5  6  7  8  9  10 11 12

Therefore, we can assume that a truly random selection of this dataset will follow this distribution pattern. What I'm interested in is psuedo-random DISTINCT selection of N rows that attempts to normalize the distribution based on the sum column.

E.g.,

SELECT * FROM dice_rolls ORDER BY RANDOM() LIMIT 5;

is what I have now. This query is most likely to contain rows where the sum is 5, 6, 7, 8, 9. How would I go about using a custom ORDER BY function that would be significantly more likely (or even strictly) to, return rows where the sum has an even chance of being 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, or 12?

Best Answer

Get a CTE with the different sums, pick one and return a matching row:

WITH sums AS (
   SELECT DISTINCT dice_sum FROM dice_rolls
)
SELECT * FROM dice_rolls
WHERE dice_sum = (SELECT dice_sum FROM sums ORDER BY random() LIMIT 1)
ORDER BY random() LIMIT 1;

Related Solutions

PostgreSQL – Clustering Rows Without Exclusive Lock and Logging Overhead

You can do this without using the cluster command and having the table locked or generating WAL for the whole table. The cost is that you need to full-scan the table regularly.

The basic idea is:

turn off autovacuum for the table
check each block to determine the degree of clustering
delete and re-insert all the rows from blocks below a clustering threshold
manually vacuum to free those (complete) blocks
repeat steps 2-4 as regularly as necessary

test schema sample data initially 'part-clustered':

create schema stack;
set search_path=stack;
create type t_tid as (blkno bigint, rowno integer);
create table foo(host_id integer, bar text default repeat('a',400)) with (autovacuum_enabled=false);
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g order by mod(g,10);
insert into foo(host_id) select mod(g,10) from generate_series(1,500000) g;
create index nu_foo on foo(host_id);

initial clustering statistics:

select cn, count(*)
from ( select count(*) cn
       from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
       group by blkno ) z
group by cn
order by cn;
/*
 cn | count
----+-------
  1 | 27769  <---- half clustered
  2 |     8
  5 |     1
 10 | 27778  <---- half un-clustered
*/
select count(distinct (ctid::text::t_tid).blkno) from foo where host_id=1;
/*
 count
-------
 30558  <--------- lots of blocks to read for `host_id=1`
*/

initial analyze (2146.503 ms):

explain analyze select count(bar) from foo where host_id=1;
/*
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=15097.30..15097.31 rows=1 width=32) (actual time=2146.157..2146.158 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=95.17..15084.80 rows=5000 width=32) (actual time=21.586..2092.379 rows=100000 loops=1)
         Recheck Cond: (host_id = 1)
         Rows Removed by Index Recheck: 286610
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..93.92 rows=5000 width=0) (actual time=19.232..19.232 rows=100000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 2146.503 ms
*/

delete and re-insert the un-clustered rows:

with w as ( select blkno
            from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
            group by blkno
            having count(*)>2 )
   , d as ( delete from foo
            where (ctid::text::t_tid).blkno in (select blkno from w)
            returning * )
insert into foo(host_id,bar) select host_id,bar from d order by host_id;
--
vacuum foo;

new clustering statistics:

select cn, count(*)
from ( select count(*) cn
       from (select distinct (ctid::text::t_tid).blkno, host_id from foo) z
       group by blkno ) z
group by cn
order by cn;
/*
 cn | count
----+-------
  1 | 55541  <---- fully clustered
  2 |    16
*/
select count(distinct (ctid::text::t_tid).blkno) from foo where host_id=1;
/*
 count
-------
  5558  <--------- far fewer blocks to read for `host_id=1`
*/

new analyze (48.804 ms):

explain analyze select count(bar) from foo where host_id=1;
/*
                                                          QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16110.64..16110.65 rows=1 width=32) (actual time=48.760..48.761 rows=1 loops=1)
   ->  Bitmap Heap Scan on foo  (cost=131.18..16098.14 rows=5000 width=32) (actual time=8.402..32.439 rows=100000 loops=1)
         Recheck Cond: (host_id = 1)
         ->  Bitmap Index Scan on nu_foo  (cost=0.00..129.93 rows=5000 width=0) (actual time=7.636..7.636 rows=100000 loops=1)
               Index Cond: (host_id = 1)
 Total runtime: 48.804 ms
*/

clean up:

drop schema stack cascade;

The above is workable now, but is a bit quirky (needing to turn off auto-vacuum for the table) and requires regular full-scanning the table. I think something similar without the disadvantages could built into postgres. You'd need:

A space efficient index to cluster on (this is coming in 9.4 with GIN compression, or better still in 9.5 with the new BRIN index type)
A 'vacuum-like' process that would scan that index to detect which blocks need to be deleted/reinserted (this would ideally be able to reinsert the rows into fresh blocks so auto-vacuum can be left at default)

PostgreSQL – Slow Queries and Autovacuum Frequency

vacuum: Clears out deleted records from disk

Not exactly. VACUUM marks rows that aren't visible to any active transaction any more as dead (and ready for re-use). It does not shrink the physical file size that represents the table, except for completely dead / empty pages at the physical end. The manual explains everything and probably better than I could recap here.

analyse: Updates the query planner

It's officially ANALYZE, with Z, but ANALYSE is accepted as alternative spelling, too. And it updates statistics used by the query planner. Again, the manual already provides the best explanation.

In postgres.conf ... my presumption is that even though it's commented out, it should still be on?

That's correct. Again, consider details in the manual.

With that many write operations (a few thousand records a minute) your system is in bad need of regular VACUUM / ANALYZE runs. You have read the manual by now, so you understand the consequences. If your table ...

hadn't been autovacuumed for more than a week.

... then that's bad for multiple reasons. Also consider @Daniel's answer how this may have transpired with a huge table like yours. Or maybe the high load constantly locks the table and never lets VACUUM do its work. Again, it's all documented in the manual. Here is a related case with good answers how to tune settings:

Aggressive Autovacuum on PostgreSQL

Remember that you can have per-table settings (STORAGE parameters) to fine tune for special needs of a special table and leave the rest of the system alone.

If you are mostly updating recently inserted rows, a FILLFACTOR below 100 may be very helpful. You can compact the table (once) with CLUSTER or pg_repack an then set the FILLFACTOR below 100. And for huge tables it may also help to set a higher STATISTICS target for key columns with irregular data distribution.

Also, if old rows are updated seldom, partitioning might be a good solution, to treat old sections differently. It really depends on the complete picture ...

Also, don't forget indices, those can get bloated, too. Only keep indices that you actually need.

To see a count of dead and live tuples:

What is the meaning of n_live_tup and n_dead_tup in pg_stat_user_tables

Best Answer

Related Solutions

PostgreSQL – Clustering Rows Without Exclusive Lock and Logging Overhead

PostgreSQL – Slow Queries and Autovacuum Frequency

Related Question