Sql-server – SQL Server one-to-many and index fragmentation

entity-frameworkfragmentationsql serversql-server-2016

I'm currently writing code to update an entities child collection (one-to-many relationship), and while thinking about how to write code to determine which entities have been added/removed/modified, I realized that not only is it easier to just recreate the whole list, but probably better for SQL server performance?

Let's say I have a table Student with a one-to-many relationship to Course, and I have there 2 courses Math and Physics, with primary keys 1 and 2 respectively, and obviously a foreign key to Student which is 1.

If I one day decide to update the first course, remove the other, and add a new one, I would end up with the following entities

Math (upated), primary key is: 1
~~Physics (deleted), primary key was: 2~~
History (added), primary key is: 5000

Now the 2 courses are no longer next to each other, and this probably causes big performance problems in the long run since my keys are now fragmented, is this correct?

A related question is whether I should have a primary key on the Course table at all? I usually add it by default to all of my (except association) tables, but in this case I never query courses individually, but always in the context of a student. Does it make sense to keep the primary key, even if I know I'll never reference it from anywhere? Would the lack of a primary key hurt performance?

EDIT: I've decided to keep the primary key, in-case in the future I do need to query this child table without the parent.

My primary keys are integers and have a clustered index (default on SQL Server) and I'm using SQL Server 2016, and Entity Framework 6 as my ORM.

UPDATE:

Here's my CREATE TABLE from SSMS. I'm using EF Code-First so I didn't write any of this by hand.

CREATE TABLE [dbo].[Courses](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Description] [nvarchar](max) NOT NULL,
    [StudentId] [uniqueidentifier] NOT NULL,
    [Fk1Id] [int] NOT NULL,
    [Fk2Id] [int] NOT NULL,
    [Fk3Id] [int] NULL,
 CONSTRAINT [PK_dbo.Courses] PRIMARY KEY CLUSTERED 
 (
     [Id] ASC
 )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
 ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[Courses]  WITH CHECK ADD  CONSTRAINT [FK_dbo.Courses_dbo.Students_StudentId] FOREIGN KEY([StudentId])
REFERENCES [dbo].[Students] ([Id])
GO

ALTER TABLE [dbo].[Courses] CHECK CONSTRAINT [FK_dbo.Courses_dbo.Students_StudentId]
GO
--Omitted other alter tables for Fk1Id,Fk2Id, and Fk3Id

StudentId is the foreign key that I will always use to fetch these rows, they are not relevant in any other context (at least for now).

I have renamed the table and column names for privacy reasons, and removed 3 foreign key columns for simplicity! I hope they're not too relevant.

Best Answer

Let's suppose student 1 takes Math. We get a Course.Id value of 1. What if student 2 also wants to become more numerate? Should she study "Math", "Maths" or "Mathematics"? Is there a list somewhere of the agreed available subjects? Conversely if the last student stops studying History does your history department disappear? I believe this is a hole in your analysis. The tables should be

Student
  StudentId
  StudentName
  .. other values

Subject
  SubjectId,
  SubjectName,
  .. other values

Enrolment
  EnrolmentId,
  StudentId,
  SubjectId,
  .. others

Enrolment is the intersection of Student and Subject. It is what you have called Course(s). As an intersection it could use the two foreign keys (StudentId and SubjectId) as the primary key. There is no problem with having multi-column primary keys. There are a lot of good reasons to design the key this way. It would be my preferred implementation.

As with any table, it could stick with it's natural key or also be given a surrogate key, which is what EnrolmentId is. While useful for many reasons it is not a requirement to have a surrogate primary key on any table. It is certainly not a requirement for the surrogate key to be the clustered key.

For performance reasons you do, indeed, want to keep together on disk those rows which are read together. Since you're using SQL Server you can achieve this by a clustered index. So which rows do you read together? From your question it would seem that the most important query is to list courses for a student. Therefore clustering Enrolment by StudentId will be the best idea. When a student drops or adds a course the new row will be beside that student's existing rows, because that's how a clusterd index works.

Maybe in future it will become important to list the students which are in a given course. Although the clustered index has been assigned, a second, covering index could be defined.

create unique nonclustered index IX_by_subject on Enrolment
(
  SubjectId,
  StudentId
);

Although not defined as clustered it holds all the values required to satisfy a large class of important queries. As students take up and drop subjects SQL Server will keep together on disk all the rows associated with one subject. This will be a different bit of disk than where the base table data (i.e. the clustered index) is held.

For more complex tables SQL Server supports the INCLUDE clause. Using this one can create what are effectively further clustered indexes that are updated by SQL Server to remain in sync with the base data.

With students studying a few subjects, and the tables indexed as above, fragmentation will not be a concern. Each student's rows will most likely be on one page, and never more than on two pages. Performing sensible, measured index maintenance is good practice.

Related Solutions

Sql-server – Index fragmentation increase as more row being deleted

This isn't fragmentation.

Fragmentation is generated of course, but deletes will simply create "islands" of remaining pages, which is less evil then GUID/clustered key INSERT fragmentation.

If you're PK is an IDENTITY, then CreationDate should roughly track this so you're actually deleting chunks of contiguous rows anyway.

Do you have an index on CreationDate
Do you have delete cascades?
Is the TOP 1000 in a single transaction?

For point 3, doing a loop inside a transaction is pointless: is this it?

At some point, a statistics update may be needed if you delete enough rows but I don't think it's that.

Other options:

why not use TRUNCATE TABLE, wrapped in a stored procedure with EXECUTE AS OWNER
use SYNONYMs for poor man's partitioning

Sql-server – Index fragmentation while continuously processing

What you're doing is you're using a table as a queue. Your update is the dequeue method. But the clustered index on the table is a poor choice for a queue. Using tables as Queues actually impose quite stringent requirements on the table design. Your clustered index must be the dequeue order, in this case likely ([DataType], [DataStatus], [ProcessDate]). You can implement the primary key as a nonclustered constraint. Drop the non-clustered index Idx, as the clustered key takes its role.

Another important piece of the puzzle is to keep the row size constant during processing. You have declared the ProcessThreadId as a VARCHAR(100) which implies the row grows and shrinks as is being 'processed' because the field value changes from NULL to non-null. This grow-and-shrink pattern on the row causes page splits and fragmentation. I can't possibly imagine a thread ID that is 'VARCHAR(100)'. Use a fixed length type, perhaps an INT.

As a side note, you do not need to dequeue in two steps (UPDATE followed by SELECT). You can use OUTPUT clause, as explained in the article linked above:

/**************************************/
CREATE TABLE [Table] 
(
    [PrimaryKeyId] [INT] IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
    [ForeignKeyId] [INT] NOT NULL,
    /* more columns ... */
    [DataType] [CHAR](1) NOT NULL,
    [DataStatus] [DATETIME] NULL,
    [ProcessDate] [DATETIME] NOT NULL,
    [ProcessThreadId] INT NULL
);

CREATE CLUSTERED INDEX [Cdx] ON [Table] 
(
    [DataType],
    [DataStatus],
    [ProcessDate]
);
/**************************************/

declare @BatchSize int, @ProcessThreadId int;

/**************************************/
WITH cte AS (
    SELECT TOP (@BatchSize) [PrimaryKeyId], [ProcessThreadId] , ... more columns 
    FROM [Table] WITH ( ROWLOCK, UPDLOCK, READPAST )
    WHERE [DataType] = 'X'
    AND [DataStatus] IS NULL
    AND [ProcessDate] < DATEADD(m, -2, GETDATE()) -- older than 2 months
    AND [ProcessThreadId] IS NULL
)
UPDATE cte
SET [ProcessThreadId] = @ProcessThreadId
OUTPUT DELETED.[PrimaryKeyId] , ... more columns ;
/**************************************/

In addition I would consider moving successfully processed items into a different, archive, table. You want your queue tables to hover near zero size, you do no want them to grow as they retain 'history' from unneeded old entries. You may also consider partitioning by [ProcessDate] as an alternative (ie. one current active partition that acts as the queue and stores entries with NULL ProcessDate, and another partition for everything non-null. Or multiple partitions for non-null if you want to implement efficient deletes (switch out) for data that has passed the mandated retention period. If things get hot you can partition in addition by [DataType] if it has enough selectivity, but that design would be really complicated as it requires partitioning by persisted computed column (a composite column that glues together [DataType] and [ProcessingDate]).

Best Answer

Related Solutions

Sql-server – Index fragmentation increase as more row being deleted

Sql-server – Index fragmentation while continuously processing

Related Question