Cardinality rule for bitmap indexes

bitmap-index data-warehouse index-tuning oracle star-schema

The Oracle documentation includes the following advice:

A bitmap index should be built on each of the foreign key columns of
the fact table or tables

In that reference, there is even a bitmap index on the date column. Whatever happened to the cardinality rule for bitmap indexes? Date columns defy that rule the most, but other columns such as customer_key also have too many distinct values to be considered candidates for bitmap indexes. I can sort of understand putting one on item_key if you don't have thousands of items.

If not a bitmap index, then what should be used, especially for a date column that has a foreign key to a time dimension (the typical month, year, day, and so on)? Obviously, it's queried often.

I asked this question on Stack Overflow a couple days ago, but I'm going to delete it, since it received no replies.

Best Answer

There was never a rule that bitmap indexes were only useful on columns that had relatively few distinct values. That was a myth that derived from the fact that bitmap indexes aren't appropriate for columns that are unique or mostly unique and that a lot of the columns that you would want to put bitmap indexes on happen to have relatively few distinct values.

Richard Foote (who probably knows more about indexes in Oracle than any other person on the planet) has a nice article on bitmap indexes with many distinct values that walks through why this is perfectly reasonable and appropriate in much more detail. A followup article comparing bitmap and b-tree indexes on columns with many distinct values is also well worth reading.
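As a rough illustration of why bitmap indexes on several fact-table foreign keys combine well, here is a toy Python model. It is only a sketch: it keeps one uncompressed bitmap per distinct value as a Python integer, which is not how Oracle actually stores or compresses bitmap indexes, and the column names and data are made up for the example.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)

def rows_from_bitmap(bitmap):
    """Expand a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical fact table with six rows; each list is one foreign key column.
date_key     = [20240101, 20240101, 20240102, 20240102, 20240103, 20240103]
customer_key = [7, 8, 7, 9, 8, 7]

date_idx = build_bitmap_index(date_key)
cust_idx = build_bitmap_index(customer_key)

# "date in (2 Jan, 3 Jan) AND customer = 7": OR the date bitmaps, then AND with the customer bitmap.
matching = (date_idx[20240102] | date_idx[20240103]) & cust_idx[7]
result = rows_from_bitmap(matching)
print(result)  # [2, 5]
```

The point of the model is that predicates on different columns are merged with cheap bitwise operations before any fact rows are visited, and that works the same way whether a column has three distinct values or thousands.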

Related Solutions

Sql-server – SQL server indexing foreign keys, covering indexes included columns

If FKs do not have a dedicated index but are part of wider indexes used for covering queries, should they have a dedicated index created?

It depends on the table's access patterns. If the column is being searched a lot (and, ideally, is highly selective), then yes, you absolutely should have an index on that column, with the column as the first key column in the definition.

Should I be removing some of these indexes and combining them with included columns instead, and then have dedicated indexes for my foreign keys?

What was given in the question is somewhat unclear, and the question you've asked is a bit... confused, so let's take a step back for a second.

In SQL Server 2005+, the three most important parts of an index definition are:

The key columns, which determine the index sort order. This means the order of the key columns is very important, because SQL Server uses an index by searching for a value in the first key column, then in the second key column, etc.
The included columns, which are copies of row data tagged onto the index structure. The order in which included columns are specified is irrelevant.
Whether the index is unique. A unique index's key can contain only unique combinations of column values.

(While this is not relevant to the discussion at hand, for completeness I will mention it here: SQL Server 2008+ introduced the concept of filtered indexes, which include only the rows that satisfy a predicate.)
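To make the point about key column order concrete, here is a toy Python model of an index on two key columns. It is a sketch under simplifying assumptions: a sorted list stands in for the index, the column names c1 and c2 are hypothetical, and nothing here reflects SQL Server's real B-tree pages.

```python
from bisect import bisect_left, bisect_right

# Hypothetical index on (c1, c2): entries are kept sorted by c1, then by c2.
index_entries = sorted([
    (1, "b"), (1, "a"), (2, "c"), (2, "a"), (3, "b"), (3, "a"),
])
# Leading key values in index order, used for the binary search.
leading = [c1 for c1, _ in index_entries]

def seek_leading(c1):
    """Seek: binary-search the contiguous range whose first key column equals c1."""
    lo = bisect_left(leading, c1)
    hi = bisect_right(leading, c1)
    return index_entries[lo:hi]

def scan_second(c2):
    """Scan: matches on c2 alone are scattered, so every entry has to be checked."""
    return [entry for entry in index_entries if entry[1] == c2]

print(seek_leading(2))   # [(2, 'a'), (2, 'c')]
print(scan_second("a"))  # [(1, 'a'), (2, 'a'), (3, 'a')]
```

A search on the leading column lands on one contiguous slice, while a search on the second column alone has to walk the whole structure. That is the difference between a seek and a scan in this simplified picture.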

The first thing you should do is index consolidation. This involves using the points above to combine indexes that share commonalities.

For example, consider the following two indexes:

CREATE INDEX IX_1 ON [dbo].[t1](C1) INCLUDE(C3, C4);
CREATE INDEX IX_2 ON [dbo].[t1](C1, C2) INCLUDE(C5);

These indexes share the leading key column, C1. Included columns can be specified in any order, so these two indexes could be combined as follows:

CREATE INDEX IX_3 ON [dbo].[t1](C1, C2) INCLUDE(C3, C4, C5);

Where index keys differ in their composition or other properties, you have to be very careful. Consider these indexes:

CREATE INDEX IX_4 ON [dbo].[t1](C1, C3) INCLUDE(C4);
CREATE UNIQUE INDEX IX_5 ON [dbo].[t1](C1, C4) INCLUDE(C5);

Now the decision is not as easy. You have to determine what to do based on your workload, which queries hit the table, and the selectivity of the data itself.
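The consolidation reasoning above can be sketched as a small Python helper. This is only an illustration under assumptions of my own: the IndexDef type and try_consolidate function are hypothetical, and the rule "never merge automatically when either index is unique or the keys diverge" is a deliberately conservative simplification, not a SQL Server feature.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexDef:
    name: str
    keys: tuple                       # key columns, in order
    includes: frozenset = frozenset() # included columns, order irrelevant
    unique: bool = False

def try_consolidate(a, b, new_name):
    """Return a merged IndexDef, or None when the pair needs a workload-based decision."""
    if a.unique or b.unique:
        return None  # merging would change what the unique index enforces
    short, long_ = sorted((a, b), key=lambda ix: len(ix.keys))
    if long_.keys[:len(short.keys)] != short.keys:
        return None  # key columns diverge, so neither key is a prefix of the other
    # Union the included columns, dropping any that are already key columns.
    includes = (a.includes | b.includes) - set(long_.keys)
    return IndexDef(new_name, long_.keys, frozenset(includes))

ix1 = IndexDef("IX_1", ("C1",), frozenset({"C3", "C4"}))
ix2 = IndexDef("IX_2", ("C1", "C2"), frozenset({"C5"}))
ix4 = IndexDef("IX_4", ("C1", "C3"), frozenset({"C4"}))
ix5 = IndexDef("IX_5", ("C1", "C4"), frozenset({"C5"}), unique=True)

merged = try_consolidate(ix1, ix2, "IX_3")
print(merged.keys, sorted(merged.includes))  # ('C1', 'C2') ['C3', 'C4', 'C5']
print(try_consolidate(ix4, ix5, "IX_6"))     # None
```

IX_1 and IX_2 collapse into the same shape as IX_3, while IX_4 and IX_5 are flagged for a manual decision based on workload and selectivity.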

So to answer the question more directly: if you currently have one or more indexes where the column of interest is the first key column in those indexes, you don't have to add more indexes, because the indexes you have are useful.

If the column is searched frequently and there isn't an index with that column as the first key column, you should create an index with that column as the first key column. (Depending on query requirements, you may want to specify other columns as well, for either the key or the included columns.)

If the column is not searched frequently, you can potentially get away with having it contained in another index (not the first key column): the query may be satisfied by scanning the index that contains the column. This is not as efficient as an index seek (for many reasons), but if this operation doesn't happen too often, and the performance in this case is acceptable, you may be okay.

Remember that creating indexes isn't free -- they take up data space, log space, cache memory, and can potentially slow down INSERT/UPDATE/DELETE activity (having said that, there can be other advantages to creating indexes). It's a balance you have to strike for your environment.

Sql-server – Non-Clustered Index key columns to cover a variable where clause

That's a great question.
And there are good answers.

The engine definitely will use the index even if you don't use every key column.
That's especially so if they are in order, as you are talking about.
(can anyone else speak to different orders of key columns?)

You will benefit whether you filter on the first key column alone or on several key columns.
What will make a difference - for any index - is staying within the key and INCLUDEd columns.
No matter how many key columns you use in your WHERE clause, the performance hit for having to go back to the table (via the primary key) for additional columns can be huge, as it roughly doubles the "operations".
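To show where the extra "operations" come from, here is a toy Python model that counts trips back to the base table. It is a sketch with made-up data: dictionaries stand in for the table and the nonclustered index, and the function names are hypothetical rather than anything SQL Server exposes.

```python
# Hypothetical table keyed by primary key; each row has columns a, b and c.
table = {
    1: {"a": 10, "b": "x", "c": 1.5},
    2: {"a": 10, "b": "y", "c": 2.5},
    3: {"a": 20, "b": "z", "c": 3.5},
}

def build_index(table, key_col, include_cols=()):
    """Toy nonclustered index: key value -> list of (pk, copies of key and included columns)."""
    index = {}
    for pk, row in table.items():
        stored = {c: row[c] for c in (key_col, *include_cols)}
        index.setdefault(row[key_col], []).append((pk, stored))
    return index

def query(index, table, key_value, wanted_cols):
    """Return (rows, lookups); lookups counts trips back to the base table."""
    rows, lookups = [], 0
    for pk, stored in index.get(key_value, []):
        if all(c in stored for c in wanted_cols):
            rows.append({c: stored[c] for c in wanted_cols})     # answered from the index alone
        else:
            lookups += 1                                         # one extra lookup per matching row
            rows.append({c: table[pk][c] for c in wanted_cols})
    return rows, lookups

narrow = build_index(table, "a")                # key column only
covering = build_index(table, "a", ("b", "c"))  # key column plus included columns

rows_narrow, lookups_narrow = query(narrow, table, 10, ["b", "c"])
rows_covering, lookups_covering = query(covering, table, 10, ["b", "c"])
print(lookups_narrow, lookups_covering)  # 2 0
```

Both queries return the same rows, but the narrow index pays one extra lookup for every matching row, while the covering index pays none.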

When it comes to dealing with performance vs. size, you have the same problem as with any index.
Since you know you want the same columns returned in all cases, if you are READ focused, you will probably want the single index with all 6 key columns, provided you INCLUDE everything else you return.
It will certainly save you db size compared to making both indexes.

On WRITE, you obviously have a bigger burden with a larger index. That is a significant additional amount of sorting.
If you insert just one row at a time, it may hardly matter at all.
If you do bulk inserts, you'll definitely want to test the two indexes to see the write performance for your actual inserts.
