PostgreSQL – If I index two columns (independently), do I get the same performance as putting an index on the two?

index, index-tuning, postgresql

I want to put an index on two columns to improve my query performance for each. I also want to perform queries where both columns are used.

From my understanding, if you had a BTREE index on each column, you could theoretically get the same performance by querying both indices and then combining the range you got – if that makes sense?

e.g.

SELECT * FROM tbl WHERE f_name = 'first' AND l_name = 'last';

Instead of checking an index for ('first', 'last'), you'd check the f_name index for 'first' and the l_name index for 'last'. You can then keep only the rows that appear in both index results, which gives the result of the initial query. Obviously, this is slightly slower than a composite index, but still much faster than no index.

With this idea, you get performance benefits not only when querying both columns together, but either column independently.

My main question is: if I have these 2 independent indices, would I need another index on both, to improve the performance of queries using both columns?

Best Answer

Your question is a bit jumbled. No, you can't get theoretically the same performance, unless one of the two columns is unique or nearly unique. But yes, you can get performance that, while slower than the composite index, is still much faster than no index at all. Will it be fast enough that you don't need the composite index? No one can answer that but you. We know neither how fast you need it to be nor how selective the condition against each column will be (both independently and jointly).
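One way to find out for your own data is to build the two single-column indexes and look at the plan. The sketch below assumes a hypothetical table people with the question's f_name and l_name columns; the table and index names are illustrative only. PostgreSQL can combine several indexes for one query through bitmap scans, but whether it does so depends on the statistics, so the plan you see may differ.

```sql
-- Hypothetical table standing in for the one in the question.
CREATE TABLE people (
    id     bigserial PRIMARY KEY,
    f_name text NOT NULL,
    l_name text NOT NULL
);

-- Two independent single-column B-tree indexes.
CREATE INDEX people_f_name_idx ON people (f_name);
CREATE INDEX people_l_name_idx ON people (l_name);

-- Load representative data here, then refresh the statistics.
ANALYZE people;

-- Inspect the plan for a query that filters on both columns.
-- The planner may AND two bitmap index scans together, or use only the
-- more selective index and filter the remaining rows.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   people
WHERE  f_name = 'first'
AND    l_name = 'last';
```

Comparing the timing and buffer counts with and without an additional index on (f_name, l_name) shows whether the composite index is worth it for your workload.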

But if you are going to have indexes on (a) and on (b), you might as well just change the first one to be on (a,b) or change the second to be on (b,a). The extra cost will be nearly negligible (unless the added column contains wide data values) while the extra value at least has the potential to be large. The cost might be slightly higher in v13, because having the composite index might defeat the new deduplication logic compared to the one with just the prefix column.
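As a rough sketch of that swap, assuming a hypothetical table tbl with integer columns a and b (all table and index names here are made up for illustration):

```sql
-- Hypothetical starting point: one single-column index per column.
CREATE TABLE tbl (a int, b int);
CREATE INDEX tbl_a_idx ON tbl (a);
CREATE INDEX tbl_b_idx ON tbl (b);

-- Build the wider index first so queries on a stay indexed throughout.
CREATE INDEX tbl_a_b_idx ON tbl (a, b);

-- The composite index can also serve conditions on a alone,
-- so the single-column index on a becomes redundant.
DROP INDEX tbl_a_idx;

-- Can use tbl_a_b_idx (leading column only):
SELECT * FROM tbl WHERE a = 1;

-- Can use tbl_a_b_idx (both columns):
SELECT * FROM tbl WHERE a = 1 AND b = 2;

-- Can still use tbl_b_idx:
SELECT * FROM tbl WHERE b = 2;
```

The total number of indexes stays at two, so write overhead changes little unless b holds wide values.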

Doing a detailed analysis of whether it is worthwhile is probably not worthwhile.

Related Solutions

Postgresql – Multicolumn index and performance

Answer

Since you refer to the website use-the-index-luke.com, consider the chapter:

Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Rule of thumb: index for equality first — then for ranges.
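A minimal sketch of that rule, assuming a hypothetical orders table where customer_id is tested for equality and created_at for a range (the table, columns, and literal values are illustrative, not from the original question):

```sql
-- Hypothetical table for illustration.
CREATE TABLE orders (
    id          bigserial   PRIMARY KEY,
    customer_id int         NOT NULL,
    created_at  timestamptz NOT NULL
);

-- Equality column first, range column second: all matching entries
-- form one contiguous range in the index.
CREATE INDEX orders_customer_created_idx ON orders (customer_id, created_at);

SELECT *
FROM   orders
WHERE  customer_id = 42
AND    created_at >= '2024-01-01'
AND    created_at <  '2024-02-01';
```

With the columns reversed, the range on created_at would come first and entries for customer 42 would be scattered across that range instead of sitting next to each other.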

Also good for just one column?

What to do for queries on just one column seems to be clear. More details and benchmarks on that can be found in a related question.

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
     , (random() * 4)::int     AS few
FROM generate_series (1, 100000);

DELETE FROM t WHERE random() > 0.9;  -- create some dead tuples, more "real-life"

ANALYZE t;

SELECT count(distinct lots)   -- 9999
     , count(distinct few)    --    5
FROM   t;

Query:

SELECT *
FROM   t
WHERE  lots = 2345
AND    few = 2;

EXPLAIN ANALYZE output (Best of 10 to exclude caching effects):

Seq Scan on t  (cost=0.00..5840.84 rows=2 width=8)
               (actual time=5.646..15.535 rows=2 loops=1)
  Filter: ((lots = 2345) AND (few = 2))
  Buffers: local hit=443
Total runtime: 15.557 ms

Add index, retest:

CREATE INDEX t_lf_idx ON t(lots, few);

Index Scan using t_lf_idx on t  (cost=0.00..3.76 rows=2 width=8)
                                (actual time=0.008..0.011 rows=2 loops=1)
  Index Cond: ((lots = 2345) AND (few = 2))
  Buffers: local hit=4
Total runtime: 0.027 ms

Add other index, retest:

DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx  ON t(few, lots);

Index Scan using t_fl_idx on t  (cost=0.00..3.74 rows=2 width=8)
                                (actual time=0.007..0.011 rows=2 loops=1)
  Index Cond: ((few = 2) AND (lots = 2345))
  Buffers: local hit=4
Total runtime: 0.027 ms
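To see where column order does make a difference, the same demo can be extended with a condition on only one column. This is a sketch that reuses table t and the index names from above; no timings are given because this step was not part of the original test run.

```sql
-- With only t_fl_idx ON t(few, lots) in place:

-- Condition on the leading column: the index is directly usable, although
-- with only 5 distinct values the planner may still prefer a sequential scan.
EXPLAIN ANALYZE
SELECT * FROM t WHERE few = 2;

-- Condition on the trailing column only: the index cannot be used as
-- efficiently, because matching entries are not contiguous in it.
EXPLAIN ANALYZE
SELECT * FROM t WHERE lots = 2345;

-- Swap the index back and compare the second query again.
DROP INDEX t_fl_idx;
CREATE INDEX t_lf_idx ON t(lots, few);

EXPLAIN ANALYZE
SELECT * FROM t WHERE lots = 2345;
```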

Sql-server – the best index implementation for a really large database

Can I replace the non-clustered indexes with one covering index?

No. Suppose you sometimes find people by LastName and sometimes by FirstName. An index on (LastName, FirstName) won't help you find people by FirstName.
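A short T-SQL sketch of that situation, assuming a hypothetical dbo.People table (the table, index names, and literal values are illustrative only):

```sql
-- Hypothetical table and index names, for illustration only.
CREATE TABLE dbo.People (
    PersonId  int           NOT NULL,
    LastName  nvarchar(100) NOT NULL,
    FirstName nvarchar(100) NOT NULL
);

CREATE NONCLUSTERED INDEX IX_People_LastName_FirstName
    ON dbo.People (LastName, FirstName);

-- Can seek: the predicate is on the leading key column.
SELECT LastName, FirstName FROM dbo.People WHERE LastName = N'Smith';

-- Cannot seek on this index: FirstName is not the leading key column,
-- so at best the whole index is scanned.
SELECT LastName, FirstName FROM dbo.People WHERE FirstName = N'John';

-- A second index is needed to seek by FirstName.
CREATE NONCLUSTERED INDEX IX_People_FirstName
    ON dbo.People (FirstName);
```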

If not, is there any way to get rid of the RID Lookup other than a covering index?

Not in a way that's particularly useful.

Does each query require a different covering index depending on the columns in the select list and the search conditions?

Pretty much. Your job is to come up with a compromise.
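As a sketch of what one such compromise can look like, a nonclustered index can carry extra columns at its leaf level with INCLUDE, so that a specific query is answered from the index alone. The table, columns, and index name below are hypothetical.

```sql
-- Hypothetical heap table (no clustered index), for illustration only.
CREATE TABLE dbo.Customers (
    CustomerId int           NOT NULL,
    LastName   nvarchar(100) NOT NULL,
    FirstName  nvarchar(100) NOT NULL,
    Email      nvarchar(255) NULL
);

-- Key column for the search condition, included columns for the select list.
CREATE NONCLUSTERED INDEX IX_Customers_LastName_Covering
    ON dbo.Customers (LastName)
    INCLUDE (FirstName, Email);

-- Covered: every referenced column is in the index, so no RID Lookup is needed.
SELECT LastName, FirstName, Email
FROM   dbo.Customers
WHERE  LastName = N'Smith';

-- Not covered: CustomerId is not in the index, so using this index
-- still requires a lookup into the heap.
SELECT CustomerId, LastName
FROM   dbo.Customers
WHERE  LastName = N'Smith';
```

Each included column makes the index larger and more expensive to maintain, which is why covering every query separately is rarely practical.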

I have no clustered indexes in the table. Does adding a unique column and setting it as a primary key help in getting rid of RID Lookups?

A PK doesn't necessarily mean a clustered index. And having a CIX just means your RID Lookups will become Key Lookups, which are potentially worse. But without CIXs you have heaps which can fragment when you change or delete data. So CIXs are fine, but won't improve performance of your Lookups.
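The distinction between a primary key and a clustered index can be sketched in T-SQL as follows, using a hypothetical dbo.Orders table (all names are illustrative only):

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE dbo.Orders (
    OrderId   int IDENTITY(1,1) NOT NULL,
    OrderDate date              NOT NULL
);

-- A primary key does not have to be clustered.
-- With this constraint alone, the table remains a heap.
ALTER TABLE dbo.Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderId);

-- A clustered index can be created separately, on the same or another column.
-- Once it exists, lookups from nonclustered indexes become Key Lookups
-- instead of RID Lookups.
CREATE CLUSTERED INDEX CIX_Orders_OrderDate
    ON dbo.Orders (OrderDate);
```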
