It is best to use any combination of these techniques, matched to your actual queries. Some queries may only need a single-column index, while others may need a multi-column index. The great thing about indexes is that you can add them as you need them and remove those that no longer serve a purpose, so you have exactly what your queries need.
Answer
Since you refer to the website use-the-index-luke.com, consider the chapter:
Use The Index, Luke › The Where Clause › Searching For Ranges › Greater, Less and BETWEEN
It has an example that matches your situation perfectly (two-column index, one is tested for equality, the other for range), explains (with more of those nice index graphics) why @ypercube's advice is accurate and sums it up:
Rule of thumb: index for equality first — then for ranges.
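To illustrate the rule of thumb, consider a hypothetical `sales` table (table and column names are made up for this sketch, not taken from the question):

```sql
-- Equality column first: all qualifying rows sit in one contiguous
-- range of the index, which can be scanned in a single pass.
CREATE INDEX sales_status_date_idx ON sales (status, sale_date);

SELECT *
FROM   sales
WHERE  status    =  'shipped'          -- equality condition
AND    sale_date >= '2024-01-01'       -- range condition
AND    sale_date <  '2024-02-01';
```

With the reversed order `(sale_date, status)`, the index scan would have to cover the whole date range and filter on `status` row by row.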
Also good for just one column?
What to do for queries on just one column seems clear. More details and benchmarks concerning that under this related question:
Less selective column first?
Apart from that, what if you have only equality conditions for both columns?
It doesn't matter. Put the column first that is more likely to receive conditions of its own, which actually matters.
Consider this demo, or reproduce it yourself. I create a simple table with two columns and 100k rows: one column with very few distinct values, the other with lots of them:
CREATE TEMP TABLE t AS
SELECT (random() * 10000)::int AS lots
, (random() * 4)::int AS few
FROM generate_series (1, 100000);
DELETE FROM t WHERE random() > 0.9; -- create some dead tuples, more "real-life"
ANALYZE t;
SELECT count(distinct lots) -- 9999
, count(distinct few) -- 5
FROM t;
Query:
SELECT *
FROM t
WHERE lots = 2345
AND few = 2;
EXPLAIN ANALYZE output (best of 10 to exclude caching effects):
Seq Scan on t (cost=0.00..5840.84 rows=2 width=8)
(actual time=5.646..15.535 rows=2 loops=1)
Filter: ((lots = 2345) AND (few = 2))
Buffers: local hit=443
Total runtime: 15.557 ms
Add index, retest:
CREATE INDEX t_lf_idx ON t(lots, few);
Index Scan using t_lf_idx on t (cost=0.00..3.76 rows=2 width=8)
(actual time=0.008..0.011 rows=2 loops=1)
Index Cond: ((lots = 2345) AND (few = 2))
Buffers: local hit=4
Total runtime: 0.027 ms
Add other index, retest:
DROP INDEX t_lf_idx;
CREATE INDEX t_fl_idx ON t(few, lots);
Index Scan using t_fl_idx on t (cost=0.00..3.74 rows=2 width=8)
(actual time=0.007..0.011 rows=2 loops=1)
Index Cond: ((few = 2) AND (lots = 2345))
Buffers: local hit=4
Total runtime: 0.027 ms
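Both column orders perform the same here because both columns are tested for equality. Where the order does start to matter is when one column is also queried on its own. As a hypothetical follow-up to the demo (not part of the original measurements), only a condition on the leading column of the remaining index `t_fl_idx` can be used as an index condition by itself:

```sql
-- Condition on the leading column alone: t_fl_idx (few, lots) is usable.
EXPLAIN ANALYZE SELECT * FROM t WHERE few = 2;

-- Condition on the trailing column alone: the planner typically cannot
-- use t_fl_idx as an index condition and falls back to a sequential
-- scan (or, at best, a full scan of the whole index with a filter).
EXPLAIN ANALYZE SELECT * FROM t WHERE lots = 2345;
```

This is why the column more likely to receive conditions of its own should go first.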
Best Answer
If the "cube" needs to process all rows, indexes won't speed up the process as the querying engine would just do a table scan.
I think the best advice I can give is to familiarize yourself with query plan output so that you can see how the database satisfies query requests. This lets you see which access methods are used to retrieve data. It also lets you answer your own question: if you create the proposed index, does SQL Server actually use it to fulfill your query?
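For example, one way to get plan text in SQL Server is the `SET STATISTICS PROFILE` option (a sketch; the table name is a placeholder, and in SSMS you can instead use "Include Actual Execution Plan"):

```sql
-- Emit the executed plan for each statement as an extra result set:
SET STATISTICS PROFILE ON;

SELECT DISTINCT columnB FROM dbo.YourTable;

SET STATISTICS PROFILE OFF;
```

The plan output shows whether a scan or seek was used, and on which index, which answers the "would it use my proposed index?" question directly.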
I think the answer to your question is going to be "No". If your "cube" is materialized by executing "SELECT DISTINCT columnB FROM table", and you currently have indexes on "columnA" and "columnB", then "consolidating" those separate indexes into a single index on "(columnA, columnB)" is unlikely to help your "cube" queries on "columnB". Your mileage may vary; without access to the schema or the query source, these kinds of questions are difficult to answer definitively.
Having said that, I think the correct response is to point you towards the SQL Server data warehousing functionality instead of rolling your own solution.