How many B-tree indexes do I need to support lookups by any subset of n columns

btreeindexperformance

First, some background ideas

Generally, an SQL database which supports multi-column B-tree indexes also supports doing a lookup by a subset of the columns in the index if and only if they're the first columns in the index. For example, if I have an index on columns (a, b, c, d) and would like to execute:

SELECT * FROM my_table
WHERE b = 7 AND a = 'foo';

then this will use the index and be fast, because the pair (a, b) is located at the start of the index and so the database can navigate the tree looking for index records starting with ('foo', 7, ... ). However, if I run

SELECT * FROM my_table
WHERE b = 7 AND c = 'bar';

then the index will not be used*, because matching records will be distributed all over the index depending upon the value they have in column a.

* (except perhaps by doing a full or partial scan of the index, as described in Evan's answer below – but since a full index scan still has the same time complexity as a full table scan, and a partial index scan potentially does too, this doesn't help much.)

The problem

I have a table with n columns and a potentially huge number of rows. I also have a front-end GUI that allows the user to filter by the exact value of an arbitrary combination of those columns and view a table of results. It is not acceptable for any query generated by this front end to cause a full table scan to get its results; every possible filter has to be supported by an index.

With n columns, what is the minimum number of B-tree indexes I need to create to ensure that every possible combination of columns is covered by some index?

Example with n=4

Suppose my table has 4 columns: a, b, c, and d. There are then 15 different combinations of columns my user might filter by:

a b c d
a b c  
a b d  
a c d  
b c d  
a b    
a c    
a d    
b c    
b d    
c d    
a      
b      
c      
d      

However, I can support lookups from any arbitrary subset of these 4 columns using only the following 6 B-tree indexes:

(a, b, c, d)
(b, c, d)
(c, d, a)
(d, b, a)
(a, c)
(a, d)

To illustrate this, below are the 15 possible subsets alongside the index that can be used to do a lookup by that subset:

a b c d   (a, b, c, d)
a b c     (a, b, c, d)
a b d     (d, b, a)
a c d     (c, d, a)
b c d     (b, c, d)
a b       (a, b, c, d)
a c       (a, c)
a d       (a, d)
b c       (b, c, d)
b d       (d, b, a)
c d       (c, d, a)
a         (a, b, c, d)
b         (b, c, d)
c         (c, d, a)
d         (d, b, a)

I'm unsure how to generalise this idea to tables with large numbers of columns, though, or how the number of indexes required scales with n. Hence my question: how many indexes are needed to extend this approach to n columns, and how can I determine what they are?

(I'm primarily interested in the specific theoretical problem of how to support these lookups using B-tree indexes – but I'm open to other solutions, if for example there's an index type that neatly solves this precise problem better than a bunch of B-tree indexes would.)

Best Answer

How many B-tree indexes do I need to support lookups by any subset of n columns

At least for Oracle (12.x) and Postgres the answer is: n

That is: one index for each column. Both DBMS are able to combine multiple indexes for a single query. In older Oracle versions you would need to use a single bitmap index on all columns. Oracle 12.1 can do a bitmap "scan" on the fly (don't know if that was already possible in 11.x - I don't have an 11.x installation to test).

Test setup:

create table idx_test (a int, b int, c int, d int, e int);
create index idx_a on idx_test(a);
create index idx_b on idx_test(b);
create index idx_c on idx_test(c);
create index idx_d on idx_test(d);

Now populate the table with million rows and random values for each column. I did this in Postgres using:

insert into idx_test (a,b,c,d,e)
select (random() * 10000 + 1)::int, 
       (random() * 100000 + 1)::int, 
       (random() * 1000000 + 1)::int, 
       (random() * 100000 + 1)::int, 
       (random() * 10000 + 1)::int
from generate_series(1,1000000);

Then copied the rows over to the Oracle server.

The following query

select *
from idx_test
where b = 42 or a = 24 or c = 100;

shows the following execution plan in Postgres:

QUERY PLAN                                                                                                                    
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on idx_test  (cost=6.25..166.36 rows=111 width=20) (actual time=0.032..0.244 rows=113 loops=1)               
  Recheck Cond: ((b = 42) OR (a = 24) OR (c = 100))                                                                           
  Heap Blocks: exact=113                                                                                                      
  ->  BitmapOr  (cost=6.25..6.25 rows=111 width=0) (actual time=0.020..0.020 rows=0 loops=1)                                  
        ->  Bitmap Index Scan on idx_test_b_idx  (cost=0.00..1.96 rows=11 width=0) (actual time=0.009..0.009 rows=11 loops=1) 
              Index Cond: (b = 42)                                                                                            
        ->  Bitmap Index Scan on idx_test_a_idx  (cost=0.00..2.27 rows=98 width=0) (actual time=0.007..0.007 rows=100 loops=1)
              Index Cond: (a = 24)                                                                                            
        ->  Bitmap Index Scan on idx_test_c_idx  (cost=0.00..1.93 rows=2 width=0) (actual time=0.003..0.003 rows=2 loops=1)   
              Index Cond: (c = 100)                                                                                           
Planning time: 0.077 ms                                                                                                       
Execution time: 0.270 ms                                                                                                      

You can see that the corresponding index for each column was used. The execution plan for Oracle 12.1 is nearly identical:

PLAN_TABLE_OUTPUT                                                                     
--------------------------------------------------------------------------------------
Plan hash value: 2332290563                                                           

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | E-Rows |E-Bytes| Cost (%CPU)|
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |     43 |  2795 |   740   (0)|
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| IDX_TEST |     43 |  2795 |   740   (0)|
|   2 |   BITMAP CONVERSION TO ROWIDS       |          |        |       |            |
|   3 |    BITMAP OR                        |          |        |       |            |
|   4 |     BITMAP CONVERSION FROM ROWIDS   |          |        |       |            |
|*  5 |      INDEX RANGE SCAN               | IDX_C    |        |       |     3   (0)|
|   6 |     BITMAP AND                      |          |        |       |            |
|   7 |      BITMAP CONVERSION FROM ROWIDS  |          |        |       |            |
|*  8 |       INDEX RANGE SCAN              | IDX_A    |        |       |     3   (0)|
|   9 |      BITMAP CONVERSION FROM ROWIDS  |          |        |       |            |
|* 10 |       INDEX RANGE SCAN              | IDX_B    |        |       |     3   (0)|
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   

   5 - access("C"=100)                                                                
   8 - access("A"=24)                                                                 
  10 - access("B"=42)                                                                 

A query with only two columns will only use two indexes:

select *
from idx_test
where a = 1001 and b = 45877;

Postgres:

QUERY PLAN                                                                                                                   
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on idx_test  (cost=4.48..5.99 rows=1 width=20) (actual time=0.118..0.118 rows=0 loops=1)                    
  Recheck Cond: ((b = 45877) AND (a = 1001))                                                                                 
  ->  BitmapAnd  (cost=4.48..4.48 rows=1 width=0) (actual time=0.117..0.117 rows=0 loops=1)                                  
        ->  Bitmap Index Scan on idx_test_b_idx  (cost=0.00..1.96 rows=11 width=0) (actual time=0.075..0.075 rows=11 loops=1)
              Index Cond: (b = 45877)                                                                                        
        ->  Bitmap Index Scan on idx_test_a_idx  (cost=0.00..2.27 rows=98 width=0) (actual time=0.038..0.038 rows=93 loops=1)
              Index Cond: (a = 1001)                                                                                         
Planning time: 0.150 ms                                                                                                      
Execution time: 0.157 ms                                                                                                     

And Oracle:

PLAN_TABLE_OUTPUT                                                                     
--------------------------------------------------------------------------------------
Plan hash value: 3618970768                                                           

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | E-Rows |E-Bytes| Cost (%CPU)|
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |     78 |  5070 |    22   (0)|
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| IDX_TEST |     78 |  5070 |    22   (0)|
|   2 |   BITMAP CONVERSION TO ROWIDS       |          |        |       |            |
|   3 |    BITMAP AND                       |          |        |       |            |
|   4 |     BITMAP CONVERSION FROM ROWIDS   |          |        |       |            |
|*  5 |      INDEX RANGE SCAN               | IDX_A    |     43 |       |     3   (0)|
|   6 |     BITMAP CONVERSION FROM ROWIDS   |          |        |       |            |
|*  7 |      INDEX RANGE SCAN               | IDX_B    |     43 |       |     3   (0)|
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   

   5 - access("A"=1001)                                                               
   7 - access("B"=45877)                                                              

The indexes are also used for an OR condition:

select *
from idx_test
where a = 1001 or b = 45877;

Postgres:

QUERY PLAN                                                                                                                   
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on idx_test  (cost=4.29..161.31 rows=109 width=20) (actual time=0.053..0.240 rows=104 loops=1)              
  Recheck Cond: ((a = 1001) OR (b = 45877))                                                                                  
  Heap Blocks: exact=104                                                                                                     
  ->  BitmapOr  (cost=4.29..4.29 rows=109 width=0) (actual time=0.038..0.038 rows=0 loops=1)                                 
        ->  Bitmap Index Scan on idx_test_a_idx  (cost=0.00..2.27 rows=98 width=0) (actual time=0.030..0.030 rows=93 loops=1)
              Index Cond: (a = 1001)                                                                                         
        ->  Bitmap Index Scan on idx_test_b_idx  (cost=0.00..1.96 rows=11 width=0) (actual time=0.007..0.007 rows=11 loops=1)
              Index Cond: (b = 45877)                                                                                        
Planning time: 0.111 ms                                                                                                      
Execution time: 0.273 ms                                                                                                     

Oracle:

PLAN_TABLE_OUTPUT                                                                     
--------------------------------------------------------------------------------------
Plan hash value: 2853963857                                                           

--------------------------------------------------------------------------------------
| Id  | Operation                           | Name     | E-Rows |E-Bytes| Cost (%CPU)|
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |          |     62 |  4030 |   879   (0)|
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| IDX_TEST |     62 |  4030 |   879   (0)|
|   2 |   BITMAP CONVERSION TO ROWIDS       |          |        |       |            |
|   3 |    BITMAP OR                        |          |        |       |            |
|   4 |     BITMAP CONVERSION FROM ROWIDS   |          |        |       |            |
|*  5 |      INDEX RANGE SCAN               | IDX_A    |        |       |     3   (0)|
|   6 |     BITMAP CONVERSION FROM ROWIDS   |          |        |       |            |
|*  7 |      INDEX RANGE SCAN               | IDX_B    |        |       |     3   (0)|
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):                                   
---------------------------------------------------                                   

   5 - access("A"=1001)                                                               
   7 - access("B"=45877)                                                              

A quick test with SQL Server 2016 shows that it does essentially the same thing:

StmtText                                                                                                                             | StmtId | NodeId | Parent | PhysicalOp   | LogicalOp  | Argument                                                                                                  
-------------------------------------------------------------------------------------------------------------------------------------+--------+--------+--------+--------------+------------+------------------------------------------------------------------------------------------------------
select * from idx_test where b = 42 and a = 24                                                                                       |      1 |      1 |      0 |              |            | 1                                                                                                         
  |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]))                                                                          |      1 |      2 |      1 | Nested Loops | Inner Join | OUTER REFERENCES:([Bmk1000])                                                                              
       |--Hash Match(Inner Join, HASH:([Bmk1000])=([Bmk1000]), RESIDUAL:([Bmk1000] = [Bmk1000]))                                     |      1 |      4 |      2 | Hash Match   | Inner Join | HASH:([Bmk1000])=([Bmk1000]), RESIDUAL:([Bmk1000] = [Bmk1000])                                            
       |    |--Index Seek(OBJECT:([TestDB].[dbo].[idx_test].[idx_b]), SEEK:([TestDB].[dbo].[idx_test].[b]=(42)) ORDERED FORWARD)     |      1 |      5 |      4 | Index Seek   | Index Seek | OBJECT:([TestDB].[dbo].[idx_test].[idx_b]), SEEK:([TestDB].[dbo].[idx_test].[b]=(42)) ORDERED FORWARD 
       |    |--Index Seek(OBJECT:([TestDB].[dbo].[idx_test].[idx_a]), SEEK:([TestDB].[dbo].[idx_test].[a]=(24)) ORDERED FORWARD)     |      1 |      6 |      4 | Index Seek   | Index Seek | OBJECT:([TestDB].[dbo].[idx_test].[idx_a]), SEEK:([TestDB].[dbo].[idx_test].[a]=(24)) ORDERED FORWARD 
       |--RID Lookup(OBJECT:([TestDB].[dbo].[idx_test]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)                          |      1 |     16 |      2 | RID Lookup   | RID Lookup | OBJECT:([TestDB].[dbo].[idx_test]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD                  

Is this the most efficient index access possible? Probably not, but it is most probably the best compromise between indexing overhead and query flexibility.