Sql-server – SQL Server one-to-many and index fragmentation

entity-frameworkfragmentationsql serversql-server-2016

I'm currently writing code to update an entities child collection (one-to-many relationship), and while thinking about how to write code to determine which entities have been added/removed/modified, I realized that not only is it easier to just recreate the whole list, but probably better for SQL server performance?

Let's say I have a table Student with a one-to-many relationship to Course, and I have there 2 courses Math and Physics, with primary keys 1 and 2 respectively, and obviously a foreign key to Student which is 1.

If I one day decide to update the first course, remove the other, and add a new one, I would end up with the following entities

  • Math (upated), primary key is: 1
  • Physics (deleted), primary key was: 2
  • History (added), primary key is: 5000

Now the 2 courses are no longer next to each other, and this probably causes big performance problems in the long run since my keys are now fragmented, is this correct?

A related question is whether I should have a primary key on the Course table at all? I usually add it by default to all of my (except association) tables, but in this case I never query courses individually, but always in the context of a student. Does it make sense to keep the primary key, even if I know I'll never reference it from anywhere? Would the lack of a primary key hurt performance?

EDIT: I've decided to keep the primary key, in-case in the future I do need to query this child table without the parent.

My primary keys are integers and have a clustered index (default on SQL Server) and I'm using SQL Server 2016, and Entity Framework 6 as my ORM.

UPDATE:

Here's my CREATE TABLE from SSMS. I'm using EF Code-First so I didn't write any of this by hand.

CREATE TABLE [dbo].[Courses](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [Description] [nvarchar](max) NOT NULL,
    [StudentId] [uniqueidentifier] NOT NULL,
    [Fk1Id] [int] NOT NULL,
    [Fk2Id] [int] NOT NULL,
    [Fk3Id] [int] NULL,
 CONSTRAINT [PK_dbo.Courses] PRIMARY KEY CLUSTERED 
 (
     [Id] ASC
 )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
 ) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

ALTER TABLE [dbo].[Courses]  WITH CHECK ADD  CONSTRAINT [FK_dbo.Courses_dbo.Students_StudentId] FOREIGN KEY([StudentId])
REFERENCES [dbo].[Students] ([Id])
GO

ALTER TABLE [dbo].[Courses] CHECK CONSTRAINT [FK_dbo.Courses_dbo.Students_StudentId]
GO
--Omitted other alter tables for Fk1Id,Fk2Id, and Fk3Id

StudentId is the foreign key that I will always use to fetch these rows, they are not relevant in any other context (at least for now).

I have renamed the table and column names for privacy reasons, and removed 3 foreign key columns for simplicity! I hope they're not too relevant.

Best Answer

Let's suppose student 1 takes Math. We get a Course.Id value of 1. What if student 2 also wants to become more numerate? Should she study "Math", "Maths" or "Mathematics"? Is there a list somewhere of the agreed available subjects? Conversely if the last student stops studying History does your history department disappear? I believe this is a hole in your analysis. The tables should be

Student
  StudentId
  StudentName
  .. other values

Subject
  SubjectId,
  SubjectName,
  .. other values

Enrolment
  EnrolmentId,
  StudentId,
  SubjectId,
  .. others

Enrolment is the intersection of Student and Subject. It is what you have called Course(s). As an intersection it could use the two foreign keys (StudentId and SubjectId) as the primary key. There is no problem with having multi-column primary keys. There are a lot of good reasons to design the key this way. It would be my preferred implementation.

As with any table, it could stick with it's natural key or also be given a surrogate key, which is what EnrolmentId is. While useful for many reasons it is not a requirement to have a surrogate primary key on any table. It is certainly not a requirement for the surrogate key to be the clustered key.

For performance reasons you do, indeed, want to keep together on disk those rows which are read together. Since you're using SQL Server you can achieve this by a clustered index. So which rows do you read together? From your question it would seem that the most important query is to list courses for a student. Therefore clustering Enrolment by StudentId will be the best idea. When a student drops or adds a course the new row will be beside that student's existing rows, because that's how a clusterd index works.

Maybe in future it will become important to list the students which are in a given course. Although the clustered index has been assigned, a second, covering index could be defined.

create unique nonclustered index IX_by_subject on Enrolment
(
  SubjectId,
  StudentId
);

Although not defined as clustered it holds all the values required to satisfy a large class of important queries. As students take up and drop subjects SQL Server will keep together on disk all the rows associated with one subject. This will be a different bit of disk than where the base table data (i.e. the clustered index) is held.

For more complex tables SQL Server supports the INCLUDE clause. Using this one can create what are effectively further clustered indexes that are updated by SQL Server to remain in sync with the base data.

With students studying a few subjects, and the tables indexed as above, fragmentation will not be a concern. Each student's rows will most likely be on one page, and never more than on two pages. Performing sensible, measured index maintenance is good practice.