Postgresql – Can PostgreSQL use multiple partial indexes per query

index-tuningperformancepostgresqlpostgresql-performance

I have read that PostgreSQL can use multiple indexes in general, but in the specific case of a query that ranges across two indexes, will it use both? If so, will they be loaded sequentially or together?

For example, if this query spans across two partial indexes by column_1, how will partial indexes be used if at all, and how is the index data loaded and discarded:

SELECT 1 FROM sample_table WHERE column_1 > 50 AND column_2 < 50000

Best Answer

Very short version: Yes, sometimes.

PostgreSQL can use bitmap index scans to combine multiple indexes.

A predicate like

WHERE a > 50 AND a < 50000

is a specialisation of the more general form:

wHERE a > 50 and b < 50000

for a = b.

PostgreSQL can use two indexes here, one for each part of the predicate, and then bitmap AND them. It doesn't matter if they happen to be on different ranges of the same column.

This is much less efficient than a single index, and may not be useful for some queries, but it's possible.

The bigger problem is that PostgreSQL's partial index support is not very bright. Irrespective of whether there's one or two indexes it might just not figure out that it can use the index at all.

Demonstration setup:

CREATE TABLE partial (x integer, y integer);
CREATE INDEX xs_above_50 ON partial(x) WHERE (x > 50);
CREATE INDEX xs_below_50000 ON partial(x) WHERE (x < 5000);
INSERT INTO partial(x,y) SELECT a, a FROM generate_series(1,100000) a;

OK, what will Pg prefer for given queries?

regress=> EXPLAIN SELECT y FROM partial WHERE x > 50 AND x < 50000;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Index Scan using xs_above_50 on partial  (cost=0.29..1788.47 rows=50309 width=4)
   Index Cond: ((x > 50) AND (x < 50000))
(2 rows)

regress=> EXPLAIN SELECT y FROM partial WHERE x > 20 AND x < 50000;
                          QUERY PLAN                          
--------------------------------------------------------------
 Seq Scan on partial  (cost=0.00..1943.00 rows=50339 width=4)
   Filter: ((x > 20) AND (x < 50000))
(2 rows)

regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 50000;
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Index Scan using xs_above_50 on partial  (cost=0.29..1787.45 rows=50258 width=4)
   Index Cond: ((x > 100) AND (x < 50000))
(2 rows)

regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 20000;
                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Index Scan using xs_above_50 on partial  (cost=0.29..710.71 rows=19921 width=4)
   Index Cond: ((x > 100) AND (x < 20000))
(2 rows)

What if we try to force a bitmap index scan just to find out if Pg can use one, even if it's not worth doing for this particular simple case and small sample?

Try:

regress=> SET enable_seqscan  = off;
SET
regress=> SET enable_indexscan  = off;
SET
regress=> SET enable_indexonlyscan  = off;
SET
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 AND x < 20000;
                                   QUERY PLAN                                   
--------------------------------------------------------------------------------
 Bitmap Heap Scan on partial  (cost=424.48..1166.30 rows=19921 width=4)
   Recheck Cond: ((x > 100) AND (x < 20000))
   ->  Bitmap Index Scan on xs_above_50  (cost=0.00..419.50 rows=19921 width=0)
         Index Cond: ((x > 100) AND (x < 20000))
(4 rows)

Hm. Nope. Not combining the indexes there. It might be able to but simply not think it's worth scanning a second index, though.

What about a query that ORs two predicates instead?

regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 OR x < 200;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Bitmap Heap Scan on partial  (cost=1905.29..3848.29 rows=99908 width=4)
   Recheck Cond: ((x > 100) OR (x < 200))
   ->  BitmapOr  (cost=1905.29..1905.29 rows=100000 width=0)
         ->  Bitmap Index Scan on xs_above_50  (cost=0.00..1849.60 rows=99908 width=0)
               Index Cond: (x > 100)
         ->  Bitmap Index Scan on xs_below_50000  (cost=0.00..5.73 rows=193 width=0)
               Index Cond: (x < 200)
(7 rows)

Here PostgreSQL has ORed both indexes to find a match, then done a heap scan and recheck.

So yes, PostgreSQL can combine multiple partial indexes, at least for some queries, where it is useful to do so.

But if I RESET the planner overrides...

regress=> RESET enable_seqscan;
RESET
regress=> RESET enable_indexscan ;
RESET
regress=> RESET enable_indexonlyscan ;
RESET
regress=> EXPLAIN SELECT y FROM partial WHERE x > 100 OR x < 200;
                          QUERY PLAN                          
--------------------------------------------------------------
 Seq Scan on partial  (cost=0.00..1943.00 rows=99908 width=4)
   Filter: ((x > 100) OR (x < 200))
(2 rows)

... Pg will realise it's faster to just sequentially scan the table.

EDIT

I've found Jeremiah Peschka's terrific article about table and index storage. With information from there, I could answer (2) as well. I set up a small test, so you can check these yourself.

-- we will need two extensions
CREATE EXTENSION pg_buffercache;
CREATE EXTENSION pageinspect;


-- a very simple test table
CREATE TABLE index_cache_test (
      id serial
    , blah text
);


-- I am a bit megalomaniac here, but I will use this for other purposes as well
INSERT INTO index_cache_test
SELECT i, i::text || 'a'
FROM generate_series(1, 1000000) a(i);


-- let's create the index to be cached
CREATE INDEX idx_cache_test ON index_cache_test (id);


-- now we can have a look at what is cached
SELECT c.relname,count(*) AS buffers
FROM 
    pg_class c 
    INNER JOIN pg_buffercache b ON b.relfilenode = c.relfilenode 
    INNER JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
GROUP BY c.relname
ORDER BY 2 DESC LIMIT 10;

             relname              | buffers
----------------------------------+---------
 index_cache_test                 |    2747
 pg_statistic_relid_att_inh_index |       4
 pg_operator_oprname_l_r_n_index  |       4
... (others are all pg_something, which are not interesting now)

-- this shows that the whole table is cached and our index is not in use yet

-- now we can check which row is where in our index
-- in the ctid column, the first number shows the page, so 
-- all rows starting with the same number are stored in the same page
SELECT * FROM bt_page_items('idx_cache_test', 1);

 itemoffset |  ctid   | itemlen | nulls | vars |          data
------------+---------+---------+-------+------+-------------------------
          1 | (1,164) |      16 | f     | f    | 6f 01 00 00 00 00 00 00
          2 | (0,1)   |      16 | f     | f    | 01 00 00 00 00 00 00 00
          3 | (0,2)   |      16 | f     | f    | 02 00 00 00 00 00 00 00
          4 | (0,3)   |      16 | f     | f    | 03 00 00 00 00 00 00 00
          5 | (0,4)   |      16 | f     | f    | 04 00 00 00 00 00 00 00
          6 | (0,5)   |      16 | f     | f    | 05 00 00 00 00 00 00 00
...
         64 | (0,63)  |      16 | f     | f    | 3f 00 00 00 00 00 00 00
         65 | (0,64)  |      16 | f     | f    | 40 00 00 00 00 00 00 00

-- with the information obtained, we can write a query which is supposed to
-- touch only a single page of the index
EXPLAIN (ANALYZE, BUFFERS) 
    SELECT id 
    FROM index_cache_test 
    WHERE id BETWEEN 10 AND 20 ORDER BY id
;

 Index Scan using idx_test_cache on index_cache_test  (cost=0.00..8.54 rows=9 width=4) (actual time=0.031..0.042 rows=11 loops=1)
   Index Cond: ((id >= 10) AND (id <= 20))
   Buffers: shared hit=4
 Total runtime: 0.094 ms
(4 rows)

-- let's have a look at the cache again (the query remains the same as above)
             relname              | buffers
----------------------------------+---------
 index_cache_test                 |    2747
 idx_test_cache                   |       4
...

-- and compare it to a bigger index scan:
EXPLAIN (ANALYZE, BUFFERS) 
SELECT id 
    FROM index_cache_test 
    WHERE id <= 20000 ORDER BY id
;


 Index Scan using idx_test_cache on index_cache_test  (cost=0.00..666.43 rows=19490 width=4) (actual time=0.072..19.921 rows=20000 loops=1)
   Index Cond: (id <= 20000)
   Buffers: shared hit=4 read=162
 Total runtime: 24.967 ms
(4 rows)

-- this already shows that something was in the cache and further pages were read from disk
-- but to be sure, a final glance at cache contents:

             relname              | buffers
----------------------------------+---------
 index_cache_test                 |    2691
 idx_test_cache                   |      58

-- note that some of the table pages are disappeared
-- but, more importantly, a bigger part of our index is now cached

All in all, this shows that indexes and tables can be cached page by page, therefore the answer for (2) is NO.

And a final one to illustrate temporary tables being non-cached here:

CREATE TEMPORARY TABLE tmp_cache_test AS 
SELECT * FROM index_cache_test ORDER BY id FETCH FIRST 20000 ROWS ONLY;

EXPLAIN (ANALYZE, BUFFERS) SELECT id FROM tmp_cache_test ORDER BY id;

-- checking the buffer cache now shows no sign of the temp table

Postgresql – Query across multiple schemas, unknown until execution time

If you just need results from multiple schemas, you can re-use the same query string and set the search_path in between:

SET search_path = u111, public;
SELECT * FROM foo;
SET search_path = u222, public;
SELECT * FROM foo;
...

The schema search path search_path in Postgres works much like the search path a file system. Related:

If you need to combine results from multiple schemas (probably your use-case), you can either build the statement in your client or use a plpgsql function with dynamic SQL and EXECUTE. That's what I would do. Plain SQL does not allow parametrized identifiers (schema, table, column, ...).

CREATE OR REPLACE FUNCTION foo(_schemas text[])
  RETURNS TABLE (bar int, baz text) AS  -- matching return type
$func$
BEGIN
   RETURN QUERY EXECUTE (
   SELECT string_agg(format('SELECT bar, baz FROM %I.foo', sch)  -- your query here
                          , E'\nUNION ALL\n')
   FROM   unnest(_schemas) sch
   );
END
$func$  LANGUAGE plpgsql;

Builds and executes a query of the following form dynamically:

SELECT bar, baz FROM u111.foo
UNION ALL
SELECT bar, baz FROM u222.foo
UNION ALL
SELECT bar, baz FROM u333.foo;

Schema names are escaped as identifiers properly to defend against SQL injection.

db<>fiddle here (returning query string as error msg instead of executing it)
Old sqlfiddle

Best Answer

Related Solutions

PostgreSQL Index Caching

EDIT

Postgresql – Query across multiple schemas, unknown until execution time

Related Question