Postgresql – Multicolumn index and performance

indexindex-tuningpostgresql

I have a table with a multicolumn index, and I have doubts about the proper sorting of the indexes to get the maximum performance on the queries.

The scenario:

  • PostgreSQL 8.4, table with about one million rows

  • Values in column c1 can have about 100 different values. We can assume the values are evenly distributed, so we have about 10000 rows for every possible value.

  • Column c2 can have 1000 different values. We have 1000 rows for every possible value.

When searching data, the condition always includes values for these two columns, so the table has a multicolumn index combining c1 and c2. I have read about the importance of properly ordering the columns in a multicolumn index if you have queries using just one column for filtering. This is not the case in our scenario.

My question is this one:

Given the fact that one of the filters selects a much smaller set of data, could I improve performance if the first index is the most selective one (the one which allows a smaller set)? I had never considered this question until I saw the graphics from the referenced article:

enter image description here

Image taken from the referenced article about multicolumn indexes.

The queries use values from the two columns for filtering. I have no queries using just one column for filtering. All of them are: WHERE c1=@ParameterA AND c2=@ParameterB. There are also conditions like this: WHERE c1 = "abc" AND c2 LIKE "ab%"

Best Answer

Answer

Since you refer to the website use-the-index-luke.com, consider the chapter:

Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Rule of thumb: index for equality first — then for ranges.

Also good for just one column?

What to do for queries on just one column seems to be clear. More details and benchmarks concerning that under these related question:

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
     , (random() * 4)::int     AS few
FROM generate_series (1, 100000);

DELETE FROM t WHERE random() > 0.9;  -- create some dead tuples, more "real-life"

ANALYZE t;

SELECT count(distinct lots)   -- 9999
     , count(distinct few)    --    5
FROM   t;

Query:

SELECT *
FROM   t
WHERE  lots = 2345
AND    few = 2;

EXPLAIN ANALYZE output (Best of 10 to exclude caching effects):

Seq Scan on t  (cost=0.00..5840.84 rows=2 width=8)
               (actual time=5.646..15.535 rows=2 loops=1)
  Filter: ((lots = 2345) AND (few = 2))
  Buffers: local hit=443
Total runtime: 15.557 ms

Add index, retest:

CREATE INDEX t_lf_idx ON t(lots, few);
Index Scan using t_lf_idx on t  (cost=0.00..3.76 rows=2 width=8)
                                (actual time=0.008..0.011 rows=2 loops=1)
  Index Cond: ((lots = 2345) AND (few = 2))
  Buffers: local hit=4
Total runtime: 0.027 ms

Add other index, retest:

DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx  ON t(few, lots);
Index Scan using t_fl_idx on t  (cost=0.00..3.74 rows=2 width=8)
                                (actual time=0.007..0.011 rows=2 loops=1)
  Index Cond: ((few = 2) AND (lots = 2345))
  Buffers: local hit=4
Total runtime: 0.027 ms