Sql-server – Calendar Event table – best practice setup for range queries and individual retrieval

Tags: azure-sql-database, date, index-tuning, query-performance, sql-server

This seems like a generic problem that should have been solved already, but I can't find anything about it. In general, the question is: given a table where data is read by date range, what is the most efficient setup?

We have a calendar event table that will quickly grow to millions of records.

The schema is something like:

CREATE TABLE [dbo].[CalendarEvent](
    [Id] [uniqueidentifier] NOT NULL,
    [DtStart] [datetime] NULL,
    [DtEnd] [datetime] NULL,
    [Created] [datetime] NULL,
    [LastModified] [datetime] NULL,
    [CalendarEventType] [nvarchar](255) NULL,
    [CalendarId] [uniqueidentifier] NULL,
    PRIMARY KEY CLUSTERED
    (
        [Id] ASC
    ) WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]

Forget about recurring events, etc. as that doesn't bear on our problem.

Most queries will be of the type:

select * from CalendarEvent where CalendarId = 'b5d6338f-805f-4717-9c0a-4600f95ac515' AND dtStart > '01/01/2020' AND dtStart < '10/22/2020'

Notice no joins, etc.

But we will also have some that select for individual events, and include joins:

select * from CalendarEvent ce join tags t on ce.Id = t.CalendarEventId where ce.Id = '17606330-5486-496a-a91c-f5d0e123bfff'

Questions and ideas:

  1. Should we keep the Id as the PK, but make the start date the clustered index?
  2. Should we just make an index on dtStart?
  3. Should we partition by month?
  4. Should we denormalize a little and duplicate the dtStart data by adding year and month columns that we can index and use in our range queries?

In general, when you do your querying on a table by date range, what is the best setup for this type of table?

Note: If you think this question could be improved to help more people, make it more generic and widely applicable, such as removing references to a Calendar Event table specifically, and making this just about date range querying in any type of table, please help me do that.

Best Answer

There are probably a few different things you can try, but these are what stand out to me from your schema and examples:

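  1. Avoid using UniqueIdentifiers if possible, and try to use Int or BigInt instead. (Int has over 4 billion distinct values, about 2.1 billion of them positive; BigInt has over 18 quintillion.) A UniqueIdentifier is 16 bytes, versus 4 for Int and 8 for BigInt, and the clustered key is carried in every nonclustered index, so a narrower key makes every index smaller and cheaper to read.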

  2. Create a unique clustered index on your PK field (which it looks like you're already doing, so you can keep it as is).

  3. Create a nonclustered index on CalendarId and dtStart.

  4. Avoid SELECT * in your queries. Unless you're using every column every time, it's faster to select only the columns you need. Either way, it's best practice to list your columns explicitly instead of using *, for reliability reasons.

  5. If you're able to do #4, then when you create the nonclustered index in #3 you can use the INCLUDE keyword to list the columns you'll normally be selecting. This eliminates key lookups against the clustered index and will be a little more performant.

  6. You can look into creating an indexed view on top of:

    select * from CalendarEvent ce join tags t on ce.Id = t.CalendarEventId

You just have to replace * with the actual columns you'll be using, and you need a unique value (likely the primary key of one of the tables) on which to create a unique clustered index. This stores the view's result set as if it were a table, which can save you the cost of the join at query time. You also get the added benefit of being able to create additional nonclustered indexes on the view if necessary.

This Brent Ozar article is also worth reading for the limitations of indexed views: What You Can (and Can’t) Do With Indexed Views
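As a rough sketch of the indexed-view idea, the following assumes the tags table has an Id primary key and a Name column. Neither column appears in the question, so treat both, along with the view and index names, as hypothetical placeholders. Indexed views must be created WITH SCHEMABINDING and must use two-part object names.

```sql
-- Assumption: dbo.tags has columns Id (unique) and Name; adjust to the real schema.
-- CREATE VIEW must be the first statement in its batch, hence the GO separators
-- (GO is understood by SSMS and sqlcmd, not by the server itself).
CREATE VIEW dbo.vCalendarEventTag
WITH SCHEMABINDING
AS
SELECT ce.Id      AS CalendarEventId,
       ce.CalendarId,
       ce.DtStart,
       ce.DtEnd,
       t.Id       AS TagId,
       t.Name     AS TagName
FROM dbo.CalendarEvent AS ce
INNER JOIN dbo.tags AS t
    ON ce.Id = t.CalendarEventId;
GO

-- Each tag row joins to exactly one event, so TagId is unique within the view.
CREATE UNIQUE CLUSTERED INDEX IX_vCalendarEventTag
    ON dbo.vCalendarEventTag (TagId);

-- An additional nonclustered index to support lookups by event.
CREATE NONCLUSTERED INDEX IX_vCalendarEventTag_CalendarEventId
    ON dbo.vCalendarEventTag (CalendarEventId);
GO

-- NOEXPAND asks the optimizer to read the view's own indexes directly.
SELECT CalendarEventId, DtStart, DtEnd, TagName
FROM dbo.vCalendarEventTag WITH (NOEXPAND)
WHERE CalendarEventId = '17606330-5486-496a-a91c-f5d0e123bfff';
```

Because this is an inner join, events without any tags will not appear in the view, so it only fits if that is acceptable for the individual-event query.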

  7. You can also look into table / index compression, but whether it helps depends on how much data is in your tables and indexes and on whether your bigger hardware bottleneck is I/O or CPU.
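To make the nonclustered index, explicit column list, and INCLUDE suggestions above concrete, here is a minimal sketch. The index name is made up, and the INCLUDE list assumes the range query only needs DtEnd and CalendarEventType in addition to the key columns, so adjust it to whatever your queries actually select.

```sql
-- Hypothetical index name. CalendarId comes first because it is an equality
-- predicate; DtStart comes second so the range can be read as one seek.
-- Id does not need to be listed: the clustered key is stored in every
-- nonclustered index automatically.
CREATE NONCLUSTERED INDEX IX_CalendarEvent_CalendarId_DtStart
    ON dbo.CalendarEvent (CalendarId, DtStart)
    INCLUDE (DtEnd, CalendarEventType);

-- The range query with an explicit column list, fully covered by the index.
-- 'YYYYMMDD' literals are read the same way regardless of language settings.
SELECT Id, DtStart, DtEnd, CalendarEventType
FROM dbo.CalendarEvent
WHERE CalendarId = 'b5d6338f-805f-4717-9c0a-4600f95ac515'
  AND DtStart > '20200101'
  AND DtStart < '20201022';

-- Optional, for the compression idea: estimate the savings before committing.
EXEC sys.sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'CalendarEvent',
    @index_id         = NULL,
    @partition_number = NULL,
    @data_compression = 'PAGE';

-- If the estimate looks worthwhile, rebuild the index with page compression.
ALTER INDEX IX_CalendarEvent_CalendarId_DtStart
    ON dbo.CalendarEvent
    REBUILD WITH (DATA_COMPRESSION = PAGE);
```

If a query later needs a column that is neither in the key nor in the INCLUDE list, the plan will fall back to key lookups, so it is worth checking the actual execution plan after any change to the select list.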