First, some background ideas
Generally, an SQL database that supports multi-column B-tree indexes can use such an index for a lookup on a subset of its columns if and only if those columns form a leading prefix of the index. For example, if I have an index on columns (a, b, c, d) and would like to execute:
SELECT * FROM my_table
WHERE b = 7 AND a = 'foo';
then this will use the index and be fast, because the pair (a, b)
is located at the start of the index and so the database can navigate the tree looking for index records starting with ('foo', 7, ... )
. However, if I run
SELECT * FROM my_table
WHERE b = 7 AND c = 'bar';
then the index will not be used*, because matching records will be distributed all over the index depending upon the value they have in column a.
* (except perhaps by doing a full or partial scan of the index, as described in Evan's answer below – but since a full index scan still has the same time complexity as a full table scan, and a partial index scan potentially does too, this doesn't help much.)
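To make this concrete, the behaviour can be checked with EXPLAIN against a single composite index (a sketch; the index name is assumed):

```sql
CREATE INDEX idx_abcd ON my_table (a, b, c, d);

-- Fast: {a, b} is a prefix of (a, b, c, d), so the planner can
-- seek directly to entries starting with ('foo', 7, ...).
EXPLAIN SELECT * FROM my_table WHERE b = 7 AND a = 'foo';

-- Slow: no index has b or c as its leading columns, so at best
-- the planner can scan the whole index (or the whole table).
EXPLAIN SELECT * FROM my_table WHERE b = 7 AND c = 'bar';
```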
The problem
I have a table with n columns and a potentially huge number of rows. I also have a front-end GUI that allows the user to filter by the exact value of an arbitrary combination of those columns and view a table of results. It is not acceptable for any query generated by this front end to cause a full table scan to get its results; every possible filter has to be supported by an index.
With n columns, what is the minimum number of B-tree indexes I need to create to ensure that every possible combination of columns is covered by some index?
Example with n=4
Suppose my table has 4 columns: a, b, c, and d. There are then 15 different combinations of columns my user might filter by:
a b c d
a b c
a b d
a c d
b c d
a b
a c
a d
b c
b d
c d
a
b
c
d
However, I can support lookups from any arbitrary subset of these 4 columns using only the following 6 B-tree indexes:
(a, b, c, d)
(b, c, d)
(c, d, a)
(d, b, a)
(a, c)
(a, d)
To illustrate this, below are the 15 possible subsets alongside the index that can be used to do a lookup by that subset:
a b c d (a, b, c, d)
a b c (a, b, c, d)
a b d (d, b, a)
a c d (c, d, a)
b c d (b, c, d)
a b (a, b, c, d)
a c (a, c)
a d (a, d)
b c (b, c, d)
b d (d, b, a)
c d (c, d, a)
a (a, b, c, d)
b (b, c, d)
c (c, d, a)
d (d, b, a)
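This coverage can be checked mechanically. An equality filter on a subset of columns can use an index exactly when the subset equals the set formed by the index's first few columns (order within the prefix doesn't matter for equality lookups). A quick Python sketch verifying all 15 subsets:

```python
from itertools import combinations

# The six proposed indexes for columns a, b, c, d.
indexes = [
    ("a", "b", "c", "d"),
    ("b", "c", "d"),
    ("c", "d", "a"),
    ("d", "b", "a"),
    ("a", "c"),
    ("a", "d"),
]

def covered(subset, index):
    # An equality filter on `subset` can use `index` iff the filtered
    # columns are exactly the first len(subset) columns of the index.
    return set(subset) == set(index[:len(subset)])

columns = ("a", "b", "c", "d")
missing = [
    s
    for size in range(1, len(columns) + 1)
    for s in combinations(columns, size)
    if not any(covered(s, i) for i in indexes)
]
print(missing)  # empty: every one of the 15 subsets is covered
```

Note that each index covers exactly one subset of each size (its prefixes), and the six indexes above have six distinct 2-prefixes covering all C(4, 2) = 6 two-column subsets, so at least 6 indexes are needed for n = 4: the list above is minimal.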
I'm unsure how to generalise this idea to tables with large numbers of columns, though, or how the number of indexes required scales with n. Hence my question: how many indexes are needed to extend this approach to n columns, and how can I determine what they are?
(I'm primarily interested in the specific theoretical problem of how to support these lookups using B-tree indexes – but I'm open to other solutions, if for example there's an index type that neatly solves this precise problem better than a bunch of B-tree indexes would.)
Best Answer
At least for Oracle (12.x) and Postgres the answer is: n
That is: one index for each column. Both DBMSs are able to combine multiple single-column indexes to answer a single query. In older Oracle versions you would have needed a single bitmap index covering all columns, but Oracle 12.1 can do a bitmap "scan" on the fly (I don't know whether that was already possible in 11.x - I don't have an 11.x installation to test).
Test setup:
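A setup of the kind described might look like this (a sketch; table, column, and index names are assumed):

```sql
CREATE TABLE mytable
(
  col1 integer,
  col2 integer,
  col3 integer,
  col4 integer
);

-- One single-column B-tree index per column.
CREATE INDEX idx1 ON mytable (col1);
CREATE INDEX idx2 ON mytable (col2);
CREATE INDEX idx3 ON mytable (col3);
CREATE INDEX idx4 ON mytable (col4);
```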
Now populate the table with a million rows and random values for each column. I did this in Postgres using:
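In Postgres, generate_series and random() make this straightforward (a sketch, assuming a table mytable with integer columns col1 through col4):

```sql
INSERT INTO mytable (col1, col2, col3, col4)
SELECT (random() * 1000)::int,
       (random() * 1000)::int,
       (random() * 1000)::int,
       (random() * 1000)::int
FROM generate_series(1, 1000000);
```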
Then copied the rows over to the Oracle server.
A query that filters on all four columns with equality uses the corresponding index for each column: the Postgres execution plan combines four bitmap index scans with a BitmapAnd, and the execution plan for Oracle 12.1 is nearly identical.
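Such a query might look like this (a sketch; table and column names are assumed):

```sql
SELECT *
FROM mytable
WHERE col1 = 42
  AND col2 = 42
  AND col3 = 42
  AND col4 = 42;
```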
A query filtering on only two of the columns uses only the two corresponding indexes; both the Postgres and the Oracle plans combine just those two index scans.
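For example (a sketch; table and column names are assumed):

```sql
SELECT *
FROM mytable
WHERE col1 = 42
  AND col4 = 42;
```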
The single-column indexes are also used for an OR condition: both the Postgres and the Oracle execution plans combine the individual index scans with a bitmap OR.
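A sketch of such a query (table and column names are assumed):

```sql
SELECT *
FROM mytable
WHERE col1 = 42
   OR col2 = 42;
```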
A quick test with SQL Server 2016 shows that it does essentially the same thing:
Is this the most efficient index access possible? Probably not, but it is likely the best compromise between indexing overhead and query flexibility.