PostgreSQL – Optimizing “random” choice in PostgreSQL

Tags: cte, execution-plan, postgresql, postgresql-9.4, random

I have the following table:

CREATE TABLE articles (
    id bigserial PRIMARY KEY,
    text TEXT NOT NULL,
    category_id bigint REFERENCES categories (id),
    status TEXT NOT NULL,
    locale TEXT NOT NULL,
    source TEXT NOT NULL
);

from which I want to select, say, 10 random records spread across as many distinct categories as possible (hence the ORDER BY rn ASC clause in the CTE). I'm using the following query to achieve this:

WITH "partitioned_articles" AS
(
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY "articles"."category_id" 
        ORDER BY RANDOM() ASC) AS "rn"
    FROM
        "articles"
    WHERE
        "articles"."status" = 'PUBLISHED' AND 
        LOWER("articles"."locale") = LOWER('en_US') AND 
        "articles"."source" = 'WIKIPEDIA'
    ORDER BY
        rn ASC
)
SELECT
    *
FROM
    "partitioned_articles"
LIMIT
    10

EXPLAIN ANALYZE originally gave me the following output:

Limit  (cost=45587.17..45587.37 rows=10 width=473) (actual time=795.552..795.557 rows=10 loops=1)
  CTE partitioned_articles
    ->  Sort  (cost=45586.04..45587.17 rows=452 width=306) (actual time=795.545..795.547 rows=10 loops=1)
          Sort Key: (row_number() OVER (?))
          Sort Method: external merge  Disk: 21984kB
          ->  WindowAgg  (cost=45555.93..45566.10 rows=452 width=306) (actual time=574.822..673.524 rows=89190 loops=1)
                ->  Sort  (cost=45555.93..45557.06 rows=452 width=306) (actual time=574.808..623.847 rows=89190 loops=1)
                      Sort Key: articles.category_id, (random())
                      Sort Method: external merge  Disk: 20368kB
                      ->  Seq Scan on articles  (cost=0.00..45536.00 rows=452 width=306) (actual time=0.190..364.287 rows=89190 loops=1)
                            Filter: ((status = 'PUBLISHED'::text) AND (source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
                            Rows Removed by Filter: 310810
  ->  CTE Scan on partitioned_articles  (cost=0.00..9.04 rows=452 width=473) (actual time=795.550..795.554 rows=10 loops=1)
Planning time: 0.462 ms
Execution time: 805.519 ms

I tried adding

CREATE INDEX articles__locale ON articles (locale);
CREATE INDEX articles__source ON articles (source);
CREATE INDEX articles__status ON articles (status);

and ended up with the following execution plan:

Limit  (cost=43393.70..43393.90 rows=10 width=473) (actual time=686.982..686.991 rows=10 loops=1)
  CTE partitioned_articles
    ->  Sort  (cost=43392.57..43393.70 rows=452 width=306) (actual time=686.976..686.978 rows=10 loops=1)
          Sort Key: (row_number() OVER (?))
          Sort Method: external merge  Disk: 21976kB
          ->  WindowAgg  (cost=43362.47..43372.64 rows=452 width=306) (actual time=462.080..562.038 rows=89190 loops=1)
                ->  Sort  (cost=43362.47..43363.60 rows=452 width=306) (actual time=462.064..511.409 rows=89190 loops=1)
                      Sort Key: articles.category_id, (random())
                      Sort Method: external merge  Disk: 20368kB
                      ->  Bitmap Heap Scan on articles  (cost=3102.54..43342.54 rows=452 width=306) (actual time=35.149..216.145 rows=89190 loops=1)
                            Recheck Cond: (status = 'PUBLISHED'::text)
                            Filter: ((source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
                            Rows Removed by Filter: 44388
                            Heap Blocks: exact=12897
                            ->  Bitmap Index Scan on articles__status  (cost=0.00..3102.42 rows=135200 width=0) (actual time=30.421..30.421 rows=133578 loops=1)
                                  Index Cond: (status = 'PUBLISHED'::text)
  ->  CTE Scan on partitioned_articles  (cost=0.00..9.04 rows=452 width=473) (actual time=686.981..686.988 rows=10 loops=1)
Planning time: 0.621 ms
Execution time: 696.945 ms

which, from what I understand, is only marginally better. Is there anything else I can do to further optimize this query? I'm using PostgreSQL 9.4 (planning to migrate to 9.5 as soon as possible, but unfortunately that is still not an option) and the articles table currently has about 400,000 records.

PS – These names have been made up for the example; I'm not really scraping anything from Wikipedia or anywhere else 😉

Best Answer

Adding indexes isn't going to help much: ordering by a random value gives the engine no choice but to scan the whole table and sort on disk[1], whether the ordering is done directly in an ORDER BY clause or via a window function.

[1] though for small data sets it might not actually touch disk

Without sampling functions being available, there is very little you can do in a simple view. It might be possible, for tables with integer keys, to hack something efficient using a recursive CTE as you might in MS SQL Server, but CTEs are optimisation fences in Postgres, so I doubt this would help here (without actually trying it, my gut reaction is that it would perform far worse).
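
As an aside, the 9.5 migration mentioned in the question would make sampling available via the TABLESAMPLE clause. A minimal sketch, assuming matching rows are spread roughly evenly across the table's pages:

-- PostgreSQL 9.5+ only: SYSTEM (1) reads roughly 1% of the table's pages
-- at random instead of scanning all of them; tune the percentage to the
-- table size. Page-based sampling makes no attempt to cover distinct
-- categories, so the window-function trick from the question would still
-- be needed on top of the sampled subset.
SELECT *
FROM articles TABLESAMPLE SYSTEM (1)
WHERE status = 'PUBLISHED'
  AND LOWER(locale) = LOWER('en_US')
  AND source = 'WIKIPEDIA'
LIMIT 10;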

If you can use a stored procedure, or business logic in another layer of your application, it would be more efficient to read the highest and lowest IDs, then loop picking random values until you have 10 that exist, and select out those rows; a sketch follows below. This would not be suitable if your data has many large gaps in the IDs (i.e. if the sequence is more gap than used number).
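
A minimal PL/pgSQL sketch of that approach (the function name and the attempt cap are illustrative, not from the original post):

-- Picks random ids between min(id) and max(id) until enough existing,
-- matching rows are found. Note it makes no attempt to spread results
-- across categories, and it can exhaust its attempt budget if fewer
-- than `wanted` qualifying rows exist.
CREATE OR REPLACE FUNCTION random_articles(wanted int)
RETURNS SETOF articles AS $$
DECLARE
    lo        bigint;
    hi        bigint;
    picked    bigint[] := '{}';
    candidate bigint;
    attempts  int := 0;
    r         articles%ROWTYPE;
BEGIN
    SELECT min(id), max(id) INTO lo, hi FROM articles;
    WHILE coalesce(array_length(picked, 1), 0) < wanted
          AND attempts < 10000 LOOP
        attempts  := attempts + 1;
        candidate := lo + floor(random() * (hi - lo + 1))::bigint;
        SELECT * INTO r
        FROM articles
        WHERE id = candidate
          AND id <> ALL (picked)   -- no duplicates
          AND status = 'PUBLISHED'
          AND LOWER(locale) = LOWER('en_US')
          AND source = 'WIKIPEDIA';
        IF FOUND THEN
            picked := picked || candidate;
            RETURN NEXT r;
        END IF;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Usage: SELECT * FROM random_articles(10);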

A final option that may reduce the amount of IO is to select just the id in the CTE, ordering by random, then join back to the base table to get the rest of the row. This way you will still scan the whole index and potentially write it to disk for sorting, but you won't scan and write the whole table.
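
A sketch of that rewrite against the query from the question; only id and category_id pass through the sort, and the wide rows are fetched afterwards via the primary key:

WITH "picked" AS
(
    SELECT
        "articles"."id",
        ROW_NUMBER() OVER (PARTITION BY "articles"."category_id"
        ORDER BY RANDOM() ASC) AS "rn"
    FROM
        "articles"
    WHERE
        "articles"."status" = 'PUBLISHED' AND
        LOWER("articles"."locale") = LOWER('en_US') AND
        "articles"."source" = 'WIKIPEDIA'
    ORDER BY
        rn ASC
    LIMIT
        10
)
SELECT
    "articles".*
FROM
    "picked"
    JOIN "articles" ON "articles"."id" = "picked"."id";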