PostgreSQL – Query Plan with Two Indexes of Different Types on the Same Column

execution-planindexoperatorpostgresql

How does PostgreSQL behave if I create two (or more) indexes of different type on the same column?

In my case I want to compare B-Tree indexes against GIN on a tsvector column.
I know that GIN are specifically there for tsvector. What's curious is that PostgreSQL doesn't complaint if I create a B-Tree index, but I see that the query planner doesn't use it. I can also create a GIN index (without removing the B-Tree index) and now the planner uses the newly created index. The column now has two indexes but only one of them is used.

What's the criteria used for selecting an index type against another, even in presence of more than two indexes? Why doesn't PostgreSQL tell me that a B-Tree index on a tsvector is useless and won't be never used by the planner?

Update
The GIN index is used only on certain check conditions like my_tsvector IS NOT NULL but (obviously) not on conditions like my_tsvector @@ '...'::tsquery.

Best Answer

An index is going to be used if it's part of the fastest plan the query planner can come up with. Many factors are involved in the cost estimation. Conditio sine qua non: the index must be applicable. Indexes are bound to operators, not data types or functions or anything else. And operators are defined by the operator and data type of left and right operand. The manual about btree indexes:

B-trees can handle equality and range queries on data that can be sorted into some ordering. In particular, the PostgreSQL query planner will consider using a B-tree index whenever an indexed column is involved in a comparison using one of these operators:
<
<=
=
>=
>

There is more, read the manual.

Related case with more explanation:

Can PostgreSQL index array columns?

A btree index on tsvector is not useless. It can be used to sort rows (implicitly using >= or <= operators):

SELECT * FROM tbl ORDER BY my_tsvector;

Or it can be used for equality predicates:

DELETE FROM tbl WHERE my_tsvector = 'foo bar'::tsvector;

But it cannot be used for a predicate my_tsvector @@ '...'::tsquery. You need a GIN or GiST index for that. Both use the the operator class tsvector_ops by default (no need to declare it explicitly):

Then the index is applicable. But Postgres may still decide to use a different query plan, based on table statistics and cost settings ...

You can have many indexes involving the same index columns. There are various kinds, types and various operator classes, which can be combined in many ways. The query planner will only consider applicable indexes and estimate the cost based on many factors ...

And you can even create the same index any number of times under different names. Then Postgres will take an arbitrary pick. It's your responsibility not to create pointless, duplicated objects.

Related Solutions

PostgreSQL – Index Not Used with = ANY() but Used with IN

Internally, there are two separate forms of IN, as well as for the ANY construct.

One of each, taking a set, is equivalent to the other and expr IN (<set>) also leads to the same query plan as expr = ANY(<set>) that can use a plain index. Details:

IN vs ANY operator in PostgreSQL

Consequently, the following two queries are equivalent and both can use the plain index t_a_b_idx (which can also be the solution if you are trying to get your query to use the index):

EXPLAIN ANALYZE
SELECT *
FROM t
WHERE (a,b) = ANY(VALUES (1,1),(1,2));

Or:

...
WHERE (a,b) IN (VALUES (1,1),(1,2));

Identical for both:

                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.33..16.71 rows=1 width=8) (actual time=0.101..0.101 rows=0 loops=1)
   ->  Unique  (cost=0.04..0.05 rows=2 width=8) (actual time=0.068..0.070 rows=2 loops=1)
         ->  Sort  (cost=0.04..0.04 rows=2 width=8) (actual time=0.067..0.068 rows=2 loops=1)
               Sort Key: "*VALUES*".column1, "*VALUES*".column2
               Sort Method: quicksort  Memory: 25kB
               ->  Values Scan on "*VALUES*"  (cost=0.00..0.03 rows=2 width=8) (actual time=0.005..0.005 rows=2 loops=1)
   ->  Index Only Scan using t_plain_idx on t  (cost=0.29..8.32 rows=1 width=8) (actual time=0.009..0.009 rows=0 loops=2)
         Index Cond: ((a = "*VALUES*".column1) AND (b = "*VALUES*".column2))
         Heap Fetches: 0
 Planning time: 4.080 ms
 Execution time: 0.202 ms

However, this cannot easily be passed to a function, since there are no "table variables" in Postgres. Which leads to the problem that started this topic:

Pass multiple sets or arrays of values to a function

There are various workarounds for that problem. One being the alternative answer I added there. Some others:

The second form of each is different: ANY takes an actual array, while IN takes a comma separated list of values.

This has different consequences for typing the input. As we can see in the EXPLAIN output of the question, this form:

WHERE (a,b) = ANY(ARRAY[(1,1),(1,2)]);

is seen as shorthand for:

ROW(a, b) = ANY (ARRAY[ROW(1, 1), ROW(1, 2)])

And actual ROW values are compared. Postgres is not currently smart enough to see that the index on the composite type t_row_idx is applicable. Nor does it realize that the simple index t_a_b_idx should be applicable as well.

An explicit cast helps to overcome this lack of smarts:

WHERE (a,b)::int_pair = ANY(ARRAY[(1,1),(1,2)]::int_pair[]);

Casting the right operand (::int_pair[]) is optional (though preferable for performance and to avoid ambiguities). Once the left operand has a well-known type, the right operand is coerced from "anonymous record" to a matching type. Only then, the operator is defined unambiguously. And Postgres picks applicable indexes based on the operator and the left operand. For many operators that define a COMMUTATOR, the query planner can flip operands to bring the indexed expression to the left. But that's not possible with the ANY construct.

Is there a way to usefully index a text column containing regex patterns?
```
WHERE (a,b) IN ((1,1),(1,2));
```

.. values are taken as elements and Postgres is able to compare individual integer values as we can see in the EXPLAIN output once more:

Filter: ((b = 1) OR (b = 2))

Hence Postgres finds that the simple index t_a_b_idx can be used.

Consequently, there would be another solution for the particular case in the example: since the custom composite type int_pair in the example happens to be equivalent to the row type of the table t itself, we could simplify:

CREATE INDEX t_row_idx2 ON t ((t));

Then this query would use the index without any more explicit casting:

EXPLAIN ANALYZE
SELECT *
FROM   t
WHERE  t = ANY(ARRAY[(1,1),(1,2)]);

                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=40.59..496.08 rows=1000 width=8) (actual time=0.19
1..0.191 rows=0 loops=1)
   Recheck Cond: (t.* = ANY (ARRAY[ROW(1, 1), ROW(1, 2)]))
   ->  Bitmap Index Scan on t_row_idx2  (cost=0.00..40.34 rows=1000 width=0) (actual time=0.188..0.188 rows=0 loops=1)
         Index Cond: (t.* = ANY (ARRAY[ROW(1, 1), ROW(1, 2)]))
 Planning time: 2.575 ms
 Execution time: 0.267 ms

But typical use cases won't be able to utilize the implicitly existing type of the table row.

Postgresql – How to create some GIN index concurrently in Postgresql

To follow up on my comment, PostgreSQL absolutely can build gin indexes on different tables in parallel to each other. If you don't see a benefit, it is because you don't have enough of some resource. Probably random disk IO capacity, as building large gin indexes can be quite intensive on this.

If using a different table space didn't speed it up, then perhaps the IO bottleneck is in the kernel or FS code, such that it fails to take full advantage of the underlying hardware. One way to more-definitively determine the bottleneck would be to run two separate PostgreSQL servers on the same machine (listening to different ports) and then put one table in each database server, and try to create the indexes simultaneously on that set up.

Best Answer

Related Solutions

PostgreSQL – Index Not Used with = ANY() but Used with IN

Postgresql – How to create some GIN index concurrently in Postgresql

Related Question