Sql-server – Is a clustered index on a child table in a parent/child relationship the most optimal index

clustered-indexindexindex-tuningperformancesql server

A clustered index will sort row data in a table according to the ordering of the fields in the index. If so, why would you (or would you not) want to create a clustered index on the child table in a parent/child relationship?

Lets assume the classical order-with-orderlines example where you never fetch the orderlines without first going through the associated order. I would think that having a clustered index on both the orderid and the orderlineid column (as clustered indexes must be unique) forces all orderlines for a single order to be physically located adjacent to each other, which would make reads very performant. Is this correct? And if so, are there downsides to this approach?

Best Answer

You assumption about adjacency is correct.

If we use TPC-H as an example: Clustering the LINEITEMS table on on ORDERID will locate all order lines belonging to the same LINEITEM physically adjacent on disk. This speeds up queries that fetch all order lines for a given ORDERID. Clustering on the foreign key to the parent also allow fast merge joins between the child and parent.

There are a few downsides to the clustering approach:

  • The entire table must be kept sorted on disk. If you are expecting a great many inserts with ORDERID not being sequentially generated, page splits will be more expensive. This is something you can throw hardware at.
  • If ORDERID is generated sequentially, you will create a hotspot at the end of the table. In some database engines (For example SQL Server) this is a problem at high insert speed. In SQL Server, this typically kicks in around 5K-10K inserts/sec.
  • The cluster index keys either have to be unique (ex: ORDERID, LINENUMBER) or padded with some hidden column to make them unique. Since the composite cluster key must be present in all other indexes, this makes the secondary, non-clustered indexes larger.
  • Storing the table clustered will force a B-tree traversal when you want to locate data via a secondary index (unless the secondary index is covering the query). If you instead kept the table as a heap, all other indexes would only have an 8B overhead and your B-tree traversals are cut in half.

The vast majority of cases, you will want to cluster both the parent and the child on the same leading key. But if you expect the child table to be accessed via many different indexes - it may be worth considering the alternatives.