Sql-server – Clustered Index and Primary Key on table with unique identifier and date

clustered-index, sql-server, sql-server-2005

I have an Orders table with a unique incrementing ID and an (order creation) Date. The table is quite large and wide (5 million rows, 50 columns, 10 of which are FK IDs). Example:

CREATE TABLE Orders (
    ID int IDENTITY(1,1) NOT NULL UNIQUE,  -- auto-incrementing order ID
    DATE datetime,  -- This is SQL Server 2005, so no [date] data type
    OrderStatus char(2),
    ClientID int,
    ...
)

I get this table as a copy of yesterday's production table. The table (like the whole DB) is used for reporting purposes and is therefore read-only.

Part 1

I have 2 very common use cases:

80% (or more) of queries have the DATE column in the WHERE clause, as users want data for a specific business date.

Here I want to create a clustered index on DATE.

40%-60% of queries use OrderID to JOIN the Orders table to other tables with information about order details (Product, Supplier, Payments, Reservations, etc.).

Here I want to create a clustered index on OrderID.

Part 2

Can I have one CI for both cases?
An index on (ID, DATE) will not work for a WHERE clause on DATE.
An index on (DATE, ID) will not work for joins on ID only, or for ID alone in the WHERE clause.

Catch. We know that DATE is incremental as well as ID. An order with a higher ID can never have an earlier date, because ID is auto-incrementing.

Question. Is there a way to tell SQL Server that, in a CI on (DATE, ID), the ID values are sequentially ordered across all dates?

My only solution at the moment is creating a nonclustered covering index on (ID, DATE), but that is suboptimal.

I've searched for some time but could not find anything. If there is a solution in later versions of SQL Server, I would be interested in it too.

Update

I know the basics of clustered indexes. Please note that the database is put in a read-only state for users.

Logically, you could just ignore the Date column in the (Date, ID) index without any harm at all. Possibly this is a very specific use case that is not yet covered by SQL Server functionality.

Best Answer

The main characteristic of the clustered index is that the data is physically stored in that order. That allows SQL Server to "read ahead". So, for example, if you created your CI on the Date column and ran queries pulling summary info for a week or a month, SQL Server could pull the data more quickly.

If, on the other hand, you are doing seeks (for a single Id, for example), then the CI is no different from a covering NCI. Note the "covering" part of that.

Here is something you could try. Place your CI on the Date, OrderId combination. Do this because you are more likely to pull range data on a date than an Id. Also because you said that 80% of your queries use the date. Some portion of these will presumably also use the OrderId.

Then see which columns are used by the queries that filter or join on OrderId alone, and add those columns to an NCI on OrderId using the INCLUDE clause. And, just to save time: if the answer is "they are using all of the columns on these queries", then a) you need to look at those queries and see whether they really need all that data, and b) yes, you could INCLUDE all of the columns in the table.
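As a rough sketch of the two indexes described above, assuming the column names from the question's example table: the index names are made up for illustration, and the INCLUDE list is only a placeholder for whatever columns the OrderId-only queries actually read.

```sql
-- Clustered index on (DATE, ID): rows for one business date are stored together,
-- which suits range scans on DATE. ID is unique, so the pair is unique too.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_Date_ID   -- hypothetical name
    ON Orders ([DATE], ID);

-- Nonclustered index for joins and seeks on ID alone.
-- The INCLUDE columns are placeholders: list the columns those queries read.
-- DATE does not need to be listed, since the clustering key is carried
-- in every nonclustered index automatically.
CREATE UNIQUE NONCLUSTERED INDEX IX_Orders_ID      -- hypothetical name
    ON Orders (ID)
    INCLUDE (OrderStatus, ClientID);
```

If the table already has a UNIQUE constraint on ID, that constraint is backed by its own unique index, so it may be worth checking whether both indexes on ID are really needed.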

I realize that would be a HUGE INCLUDE. In fact it would double the size of your table and you'll want to carefully test that you aren't going to have a serious negative impact on write operations. However for reads it should work just fine.

Related Solutions

Sql-server – Using unique non clustered index with unique clustered index

Yes, having a column in multiple unique keys is sometimes perfectly reasonable. In the case that you gave above, I'm not sure I would bother, since the ProductId key is unique regardless. But let's say that you have a product table like this:

ProductVendor  PK
ProductCode  PK
ProductDescription
.....

In this particular case, ProductVendor and ProductCode are together unique and form your primary key and clustered index. However, there is an additional business rule that ProductDescription must also be unique within a ProductVendor. In this case you could create a unique non-clustered index on (ProductVendor, ProductDescription).
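A minimal sketch of that layout follows. The table name, data types, and constraint names are assumptions made for illustration; only the three column names come from the example above.

```sql
-- Hypothetical product table: the composite primary key is also the clustered index.
CREATE TABLE Product (
    ProductVendor      int          NOT NULL,
    ProductCode        varchar(20)  NOT NULL,
    ProductDescription varchar(200) NOT NULL,
    CONSTRAINT PK_Product PRIMARY KEY CLUSTERED (ProductVendor, ProductCode)
);

-- Second unique key: a description may not repeat within one vendor.
-- ProductVendor appears in both unique keys, which is fine.
CREATE UNIQUE NONCLUSTERED INDEX UQ_Product_Vendor_Description
    ON Product (ProductVendor, ProductDescription);
```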

Sql-server – Is a clustered index on a child table in a parent/child relationship the most optimal index

Your assumption about adjacency is correct.

If we use TPC-H as an example: clustering the LINEITEMS table on ORDERID will place all order lines belonging to the same order physically adjacent on disk. This speeds up queries that fetch all order lines for a given ORDERID. Clustering on the foreign key to the parent also allows fast merge joins between the child and the parent.

There are a few downsides to the clustering approach:

The entire table must be kept sorted on disk. If you are expecting a great many inserts with ORDERID not being sequentially generated, page splits will be more expensive. This is something you can throw hardware at.
If ORDERID is generated sequentially, you will create a hotspot at the end of the table. In some database engines (for example, SQL Server) this is a problem at high insert rates. In SQL Server, this typically kicks in at around 5K-10K inserts/sec.
The clustered index keys either have to be unique (e.g. ORDERID, LINENUMBER) or are padded with a hidden column to make them unique. Since the composite clustering key must be present in all other indexes, this makes the secondary, non-clustered indexes larger.
Storing the table clustered will force a B-tree traversal when you want to locate data via a secondary index (unless the secondary index is covering the query). If you instead kept the table as a heap, all other indexes would only have an 8B overhead and your B-tree traversals are cut in half.

In the vast majority of cases, you will want to cluster both the parent and the child on the same leading key. But if you expect the child table to be accessed via many different indexes, it may be worth considering the alternatives.
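To illustrate clustering parent and child on the same leading key, here is a small sketch. The table and key names follow the answer's wording; the remaining columns, data types, and constraint names are assumptions, not the actual TPC-H schema.

```sql
-- Parent table clustered on ORDERID.
CREATE TABLE ORDERS (
    ORDERID   int      NOT NULL,
    ORDERDATE datetime NOT NULL,   -- illustrative column
    CONSTRAINT PK_ORDERS PRIMARY KEY CLUSTERED (ORDERID)
);

-- Child table clustered on (ORDERID, LINENUMBER): all lines of one order
-- are stored next to each other, and the key is unique without padding.
CREATE TABLE LINEITEMS (
    ORDERID    int NOT NULL,
    LINENUMBER int NOT NULL,
    QUANTITY   int NOT NULL,       -- illustrative column
    CONSTRAINT PK_LINEITEMS PRIMARY KEY CLUSTERED (ORDERID, LINENUMBER),
    CONSTRAINT FK_LINEITEMS_ORDERS FOREIGN KEY (ORDERID) REFERENCES ORDERS (ORDERID)
);

-- Fetching all lines for one order touches a single contiguous range
-- of the child's clustered index.
SELECT o.ORDERID, l.LINENUMBER, l.QUANTITY
FROM ORDERS AS o
JOIN LINEITEMS AS l ON l.ORDERID = o.ORDERID
WHERE o.ORDERID = 42;
```

Because both tables are sorted on ORDERID, the optimizer can also consider a merge join between them without an extra sort, although the plan it actually picks depends on statistics and row counts.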
