Sql-server – SQL Server – is it ok to store link table as a heap

heapsql-server-2008

I'm trying to understand if it's a good practice to have non-clustered PK on link table:

CREATE TABLE table1_table2(table1_id INT NOT NULL, table2_id INT NOT NULL,
CONSTRAINT PK_T1_T2 PRIMARY KEY NONCLUSTERED(table1_id, table2_id),
CONSTRAINT FK_T1_T2_T1ID FOREIGN KEY(table1_id) REFERENCES table1(id),
CONSTRAINT FK_T1_T2_T2ID FOREIGN KEY(table2_id) REFERENCES table2(id)
);
CREATE INDEX IDX_T1_T2 ON table1_table2 (table2_id);

My point here that heap is ok because

No particular order during INSERT, so in case of clustered PK I'll have many page splits.
All SELECTs will use just index anyway

In my understanding, adding identity clustered PK will add more overhead for such tables (absolutely no data except foreign keys).

Could you please clarify if that makes sense ?

Thanks.

Best Answer

I agree that there is no point in creating an IDENTITY column clustered index here but I would just create the composite PK as CLUSTERED rather than NONCLUSTERED.

There is no point having 2 copies of the data, One in the heap that never gets used and one in the NCI. In fact it is downright counter productive as shown below.

Your NCI will still get page splits anyway if you are not inserting into the table in order of primary key so you have not avoided the problem. You might also consider a UNIQUE constraint or index on the reversed key order as well depending upon what queries you run against that table.

In general NCIs can be slightly be more compact than the equivalent NCI as they do not have the status bits B or NULL_BITMAP (if no nullable columns) but in this case this is more than outweighed by the necessity to store the 8 byte RID pointer to the row in the heap as can be seen from the below.

Script

CREATE TABLE T_NCI
  (
     table1_id INT NOT NULL,
     table2_id INT NOT NULL,
     CONSTRAINT NCI_PK PRIMARY KEY NONCLUSTERED(table1_id, table2_id)
  );

CREATE TABLE T_CI
  (
     table1_id INT NOT NULL,
     table2_id INT NOT NULL,
     CONSTRAINT CI_PK PRIMARY KEY CLUSTERED(table1_id, table2_id)
  );  

INSERT INTO T_CI
OUTPUT inserted.* INTO T_NCI
SELECT DISTINCT number, number
FROM master..spt_values


SELECT LEFT(OBJECT_NAME(object_id),5) AS [Name],
       page_count,
       record_count,
       min_record_size_in_bytes,
       max_record_size_in_bytes
FROM   (SELECT *
        FROM   sys.dm_db_index_physical_stats(db_id(), OBJECT_ID('T_CI'), NULL, NULL, 'DETAILED')
        UNION ALL
        SELECT *
        FROM   sys.dm_db_index_physical_stats(db_id(), OBJECT_ID('T_NCI'), NULL, NULL, 'DETAILED')
) T  

DROP TABLE T_NCI
DROP TABLE T_CI

Results

Name  page_count           record_count         min_record_size_in_bytes max_record_size_in_bytes
----- -------------------- -------------------- ------------------------ ------------------------
T_CI  5                    2163                 15                       15
T_CI  1                    5                    15                       15
T_NCI 5                    2163                 15                       15
T_NCI 6                    2163                 17                       17
T_NCI 1                    6                    15                       15

The record length for the non leaf pages in both the CI and the NCI is 15 bytes (1 byte status bits, 8 bytes for the composite key and 6 for the down page pointer) but for the leaf page NCI the rows take up 17 bytes (1 byte status bits, 8 bytes for the composite key and 8 for the row pointer) compared to 15 for the CI (2 bytes status bits, 1 byte column count offset, 8 bytes data, 2 bytes column count, 1 byte null bitmap). And as well as this less compact index structure you also have all the additional pages for the heap itself on top.

Related Solutions

Sql-server – SQL Server equivalent of Oracle USING INDEX clause

Is there a SQL Server 2008 equivalent of the USING INDEX clause in Oracle?

No. When you create a primary key or unique constraint in SQL Server, a unique index to support that constraint is created automatically, with the same keys.

Which seems to imply that there is a way of specifying what index should be used for a Primary Key.

No. The documentation is only attempting to explain whether the automatic supporting index will be created as clustered or nonclustered, if you do not specify. It is confusingly worded, I agree.

To clarify, when you add a primary key constraint to an existing table without expressing a preference, the supporting index will be clustered if there is no pre-existing clustered index on the table. The supporting index will be created as nonclustered if there is already a clustered index

You can specifically request a clustered or nonclustered primary key using: PRIMARY KEY CLUSTERED or PRIMARY KEY NONCLUSTERED.

In fairness, the documentation is much clearer on the subject at:

table_constraint (Transact-SQL)

Sql-server – How to lower HEAP Fragmentation in SQL Server

What does Fragmentation Means in a Heap

The fragmentation value in Heap which you get from column avg_fragmentation_in_percent by querying sys.dm_db_index_physical_stats DMV states that

Logical fragmentation for indexes, or extent fragmentation for heaps in the IN_ROW_DATA allocation unit.

Further the same BOL says that

This is the percentage of out-of-order extents in the leaf pages of a heap. An out-of-order extent is one for which the extent that contains the current page for a heap is not physically the next extent after the extent that contains the previous page.

So you can see it is not the free space present in pages allocated to Heap but the varying sequence of pages that creates the fragmentation.

This can be demonstrated by small test. Let us create a Heap Table and insert some records in it and then check the fragmentation.

create table dbo.HeapTest
(
Id INT not NULL Default (1),
Col1   char(5000) Not null Default ('Heaps Are Cool')
)

SET NOCOUNT ON

Insert into dbo.Heaptest default values
go 50

select index_type_desc,avg_fragmentation_in_percent,fragment_count,
avg_page_space_used_in_percent,record_count
from sys.dm_db_index_physical_stats(db_id(),object_id('dbo.HeapTest','U'),0,default,'detailed')

So Heap table is created with 50 records in it. Below is what fragmentation looks like after query DMV sys.dm_db_index_physical stats

You can see avg_fragmentation_in_percent column value is 33 %. Now let us see how are pages arranged. This can be done by using undocumented query %%lockres%%. The query would be

SELECT  %%lockres%%, * FROM dbo.HeapTest;

And below is what output looks like. Attaching only relevant portion of it. The query produced 50 rows since we inserted 50 rows in our dbo.HeapTest table.

What it says is the first page has ID 197 the next page has ID 242 subsequent pages has continuous ID till we reach page ID 264 because after that we get page ID 280. So this jump in page ID numbers is what actually causing fragmentation.

Now lest rebuild the heap and run The command again to see the fragmentation and how pages are arranged. We get fragmentation like

You can see fragmentation is now 14%.

Let us see page numbers allocated

We only have one jump rest all pages are allocated page ID serially. Since just one jump fragmentation decreased considerably.

I rebuild the Heap again and now when I checked fragmentation it was completely gone. And page ID allocation is like

Why Fragmentation Increased

Now regarding what could have caused fragmentation to rise we can corroborate it to fact that when pages were getting allocated to the heap they would not be continuous, as you saw above what caused fragmentation value to increase was jump in the PAGE ID's allocated to pages.

At the back of head you should also keep in mind that the word fragmentation for HEAP does not have any meaning, how would you define fragmentation for bunch of un-ordered pages.

Really Worried about Fragmentation

If you really face a scenario where heap table is fragmented and slowing queries it would be better creating a clustered index on table than rebuilding it. The reason is when you rebuild heap all underlying Non Clustered indexes are also rebuilt causing the rebuild process to take much longer time, utilizing lot of resources and bloating transaction log. On a production system one would always try to avoid this. Paul covered this in his Myth Section about heap.

PS: Please don't use undocumented command on production system. This was just for demonstration.