Postgresql – Multicolumn index and performance

indexindex-tuningpostgresql

I have a table with a multicolumn index, and I have doubts about the proper sorting of the indexes to get the maximum performance on the queries.

The scenario:

PostgreSQL 8.4, table with about one million rows
Values in column c1 can have about 100 different values. We can assume the values are evenly distributed, so we have about 10000 rows for every possible value.
Column c2 can have 1000 different values. We have 1000 rows for every possible value.

When searching data, the condition always includes values for these two columns, so the table has a multicolumn index combining c1 and c2. I have read about the importance of properly ordering the columns in a multicolumn index if you have queries using just one column for filtering. This is not the case in our scenario.

My question is this one:

Given the fact that one of the filters selects a much smaller set of data, could I improve performance if the first index is the most selective one (the one which allows a smaller set)? I had never considered this question until I saw the graphics from the referenced article:

enter image description here

Image taken from the referenced article about multicolumn indexes.

The queries use values from the two columns for filtering. I have no queries using just one column for filtering. All of them are: WHERE c1=@ParameterA AND c2=@ParameterB. There are also conditions like this: WHERE c1 = "abc" AND c2 LIKE "ab%"

Best Answer

Answer

Since you refer to the website use-the-index-luke.com, consider the chapter:

Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN

It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:

Rule of thumb: index for equality first — then for ranges.

Also good for just one column?

What to do for queries on just one column seems to be clear. More details and benchmarks concerning that under these related question:

Less selective column first?

Apart from that, what if you have only equality conditions for both columns?

It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.

Consider this demo, or reproduce it yourself. I create a simple table of two columns with 100k rows. One with very few, the other one with lots of distinct values:

CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
     , (random() * 4)::int     AS few
FROM generate_series (1, 100000);

DELETE FROM t WHERE random() > 0.9;  -- create some dead tuples, more "real-life"

ANALYZE t;

SELECT count(distinct lots)   -- 9999
     , count(distinct few)    --    5
FROM   t;

Query:

SELECT *
FROM   t
WHERE  lots = 2345
AND    few = 2;

EXPLAIN ANALYZE output (Best of 10 to exclude caching effects):

Seq Scan on t  (cost=0.00..5840.84 rows=2 width=8)
               (actual time=5.646..15.535 rows=2 loops=1)
  Filter: ((lots = 2345) AND (few = 2))
  Buffers: local hit=443
Total runtime: 15.557 ms

Add index, retest:

CREATE INDEX t_lf_idx ON t(lots, few);

Index Scan using t_lf_idx on t  (cost=0.00..3.76 rows=2 width=8)
                                (actual time=0.008..0.011 rows=2 loops=1)
  Index Cond: ((lots = 2345) AND (few = 2))
  Buffers: local hit=4
Total runtime: 0.027 ms

Add other index, retest:

DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx  ON t(few, lots);

Index Scan using t_fl_idx on t  (cost=0.00..3.74 rows=2 width=8)
                                (actual time=0.007..0.011 rows=2 loops=1)
  Index Cond: ((few = 2) AND (lots = 2345))
  Buffers: local hit=4
Total runtime: 0.027 ms

Related Solutions

Postgresql – Run query faster, what index to use

I think the main problem is in your table design (many-many redundant values which are prone to typos and other problems). If you were referencing these values via foreign keys, the table rows would be much shorter - therefore the aggregation could be done on a much smaller dataset and the actual values be added via joins. (The optimizer takes care of doing this well, you don't need additional magic. The design change is your task, though :) - see also ypercube's comment above.)

Aside this, in your configuration the work_mem value seems to be very low. If this is true or not depends on how many rows you get from this query - but as grouping and aggregating involves sorting and work_mem is the place where the data is being sorted, there is a chance that simply increasing this to a (much) bigger value would improve the performance a lot.

Furthermore, an index like

CREATE INDEX idx_emissions_totales_residentiel
ON emissions_totales (schema_source, numreg, annee_ref, type_source_n1, unites)
WHERE schema_source = 'residentiel';

will also help this particular query. As edze suggested, going to PostgreSQL 9.2 and including 'emi' in the index (as the last column) could also help.

Sql-server – Will a nonclustered index with a unique column always address all queries filtering that column first

Yes, given the constraints in the question, particularly that the primary key column is the leading column in the indexes. Also assuming the primary key never changes.
Not necessarily.

The optimizer can indeed infer uniqueness without marking the nonclustered index unique.

Marking the index unique may introduce a Split-Sort-Collapse combination in execution plans that change an index key. The extra Sort in particular has the potential to be performance-affecting.

On the other hand, not marking the index unique risks data integrity if the primary key is ever changed.

Example

CREATE TABLE dbo.Test
(
    pk integer PRIMARY KEY NONCLUSTERED,
    c1 integer NOT NULL,
    c2 integer NOT NULL
);

-- Not unique on pk, c1
CREATE NONCLUSTERED INDEX ic1 ON dbo.Test (pk, c1);

-- Unique on pk, c2
CREATE UNIQUE NONCLUSTERED INDEX ic2 ON dbo.Test (pk, c2);

Uniqueness

-- Neither plan has an aggregate
SELECT DISTINCT T.c1 FROM dbo.Test AS T WHERE T.pk = 1;
SELECT DISTINCT T.c2 FROM dbo.Test AS T WHERE T.pk = 1;

Split, Sort, Collapse

-- No split, sort, collapse
UPDATE dbo.Test SET c1 = CHECKSUM(NEWID());

-- Split, sort, collapse updating unique key
UPDATE dbo.Test SET c2 = CHECKSUM(NEWID());

Note the split-sort-collapse plan is also a wide (per-index) update.

Uniqueness is a huge topic though. I would normally mark something that is unique as unique, unless there is a good reason not to. Some further reading from my blog:

To anticipate comments about heap tables: Most tables benefit from being clustered. You need good reasons to choose a heap structure, especially from a space management point of view, if the table ever experiences deletes. Updates can also introduce performance impacts if columns expand beyond the space available on the original page (forwarded records).