SQL Server – SQL Azure "hot" table query slowdown over time

azure, locking, performance, sql-server

I have an Azure SQL Database table that acts as a "prioritized queue"; as such, it is heavily written to and deleted from over time. What's happening is that after a week or two the DB size increases, CPU increases, and reads from this table slow almost to a halt.

If I check the count, it's zero. If I check the table size, there are a lot of pages assigned to this table.
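For anyone who wants to reproduce the check, something along these lines (a sketch using sys.dm_db_partition_stats; only the table name comes from my schema below) is what shows zero rows next to a large reserved page count:

-- Rows vs. pages still reserved, per index on the queue table
SELECT i.[name] AS index_name,
       ps.row_count,
       ps.reserved_page_count,
       ps.used_page_count
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
    ON i.[object_id] = ps.[object_id]
   AND i.index_id = ps.index_id
WHERE ps.[object_id] = OBJECT_ID(N'dbo.PrioritizedQueue');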

If I TRUNCATE the table, the count is still zero but the table size is back to minimal levels and performance is acceptable again.

To me this obviously indicates that the table accumulates a lot of pages due to the high insert/delete activity, and that is why the DB size increases. That part is fine: when Azure cleanup occurs (presumably transaction log backups), the DB size drops back to normal levels.

What I don't understand is why the performance drops off so drastically when there are still zero rows in the table.

Can someone provide some suggestions as to what may be going on?

NB: The logic dictates that the overall write is atomic; i.e. the business data update and the queue write happen in one transaction. Additional logic dictates that if an item (identified via its business key) already exists, the queue item is overwritten.
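To illustrate what that implies, here is a hypothetical sketch of the enqueue side (the real proc isn't shown here; parameter names are mine, column names are taken from the DDL below):

-- Hypothetical enqueue upsert: overwrite by business key if present,
-- otherwise insert; runs in the same transaction as the business update.
BEGIN TRANSACTION;

UPDATE [dbo].[PrioritizedQueue] WITH (UPDLOCK, SERIALIZABLE)
SET [Priority] = @priority,
    [ReceivedTimestamp] = @receivedTimestamp,
    [Operation] = @operation,
    [Content] = @content
WHERE [ItemId] = @itemId
  AND [Region] = @region;

IF @@ROWCOUNT = 0
BEGIN
    INSERT INTO [dbo].[PrioritizedQueue]
        ([Region], [ItemId], [Priority], [ReceivedTimestamp], [Operation], [Content])
    VALUES
        (@region, @itemId, @priority, @receivedTimestamp, @operation, @content);
END

-- ... business data update happens here, inside the same transaction ...

COMMIT TRANSACTION;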

The Table is defined thus:

CREATE TABLE [dbo].[PrioritizedQueue](
    [Id] [bigint] IDENTITY(1,1) NOT NULL,
    [Region] varchar(10) NOT NULL,
    [ItemId] varchar(10) NOT NULL,
    [Priority] tinyint NOT NULL,
    [ReceivedTimestamp] datetime2(7) NOT NULL,
    [Operation] varchar(7168) NOT NULL,
    [Content] varchar(max) NOT NULL,
 CONSTRAINT [PK_IndexerQueue] PRIMARY KEY NONCLUSTERED 
(
    [Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
GO
CREATE UNIQUE INDEX [IX_PrioritizedQueue_ItemId] ON [dbo].[PrioritizedQueue]
(
    [ItemId] ASC,
    [Region] ASC
)
INCLUDE ([Priority])
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO
CREATE CLUSTERED INDEX [IX_IndexerQueue_Priority] ON [dbo].[PrioritizedQueue]
(
    [Region] ASC,
    [Priority] ASC,
    [ReceivedTimestamp] ASC
)
WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF)
GO

FYI: The "content" field is a varchar(max) due to history, it now only contains a file pointer so AFAIK the row should never need to use data pages off the current page

The stored procedure that reads from the table, and appears to be where all the problems occur, is:

CREATE PROCEDURE [dbo].[usp_indexer_dequeue]
    @batchSize AS int,
    @region varchar(10)
AS
BEGIN
    SET NOCOUNT ON;
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    DECLARE @deletedbatch table
    (
        [Operation] varchar(7168) NOT NULL,
        [Content] varchar(max) NOT NULL,
        [ItemId] varchar(10) NOT NULL
    );
    WITH cte AS 
        (
            SELECT TOP(@batchSize) [Operation], [Content], [ItemId]
            FROM [PrioritizedQueue] WITH (rowlock, updlock, readpast) 
            WHERE [Region] = @region
            ORDER BY [Priority], [ReceivedTimestamp] ASC
        )
    DELETE FROM cte
    -- deleted.* would expand to every column of the base table, which doesn't
    -- match @deletedbatch, so the output columns are listed explicitly
    OUTPUT deleted.[Operation], deleted.[Content], deleted.[ItemId]
    INTO @deletedbatch;
    SELECT * FROM @deletedbatch
END
GO

This proc is designed to fetch the top available N rows (typically 25), return them to the app, and delete them from the table. Since it runs as one of many processes calling it at once, it uses (rowlock, updlock, readpast) to skip over anything currently locked.
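For context, each worker process calls it along these lines (the batch size and region values here are purely illustrative):

-- Typical call from one of the competing worker processes
EXEC [dbo].[usp_indexer_dequeue] @batchSize = 25, @region = 'EU';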

Best Answer

The problem might be cardinality estimation: index fragmentation and stale statistics can cause performance issues that build up over time. If you're not rebuilding your indexes and updating your statistics, this can affect the optimiser's plan choices and cause performance to degrade over time. It would also explain why TRUNCATE seems to correct the issue (see Kendra Little's article; in short, the optimiser tracks TRUNCATE so that it makes sensible decisions on empty tables even without a statistics update).

You need to do a few things to track down the root cause:

  1. Enable Query Store to identify regressions in query performance over time. If a new plan is being selected because of cardinality issues or stale statistics, Query Store will highlight it (see the first sketch after this list).
  2. Implement regular index and statistics maintenance if you haven't already. If you have stale statistics or index fragmentation, this should help alleviate it (second sketch below).
  3. Track your wait statistics over the one-to-two-week period to see whether a particular wait type is increasing over time and causing the degradation (third sketch below).
  4. Next time it occurs, don't truncate to fix it; you need to collect as much data as you can to determine the root cause. Get the statistics details for the table, the index fragmentation details, the current execution plan, and sp_BlitzCache output; these will all help you identify the root cause (fourth sketch below).
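
For point 1, a minimal sketch (Query Store is on by default in Azure SQL Database, but this makes sure it's capturing, then looks for queries that have picked up more than one plan):

ALTER DATABASE CURRENT SET QUERY_STORE = ON;
ALTER DATABASE CURRENT SET QUERY_STORE (OPERATION_MODE = READ_WRITE);

-- Queries with multiple plans are the usual suspects for a regression
SELECT q.query_id,
       COUNT(DISTINCT p.plan_id) AS plan_count
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p
    ON p.query_id = q.query_id
GROUP BY q.query_id
HAVING COUNT(DISTINCT p.plan_id) > 1;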
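
For point 2, a starting point (a sketch; a scheduled solution such as Ola Hallengren's maintenance scripts is the usual long-term answer):

-- Refresh statistics on the queue table
UPDATE STATISTICS [dbo].[PrioritizedQueue] WITH FULLSCAN;

-- Rebuild its indexes online if fragmentation is high
ALTER INDEX ALL ON [dbo].[PrioritizedQueue] REBUILD WITH (ONLINE = ON);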
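
For point 3, Azure SQL Database exposes database-scoped waits through sys.dm_db_wait_stats; snapshot this periodically and diff the snapshots:

-- Top waits accumulated since the last reset/failover; capture on a schedule
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms DESC;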
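
For point 4, these two queries (a sketch) cover the fragmentation and statistics details worth capturing before anyone truncates:

-- Fragmentation and ghost record counts per index level
SELECT index_id, index_level, avg_fragmentation_in_percent,
       page_count, ghost_record_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.PrioritizedQueue'), NULL, NULL, 'DETAILED');

-- Statistics freshness and modification counters
SELECT s.[name] AS stats_name, sp.last_updated, sp.[rows], sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.[object_id], s.stats_id) AS sp
WHERE s.[object_id] = OBJECT_ID(N'dbo.PrioritizedQueue');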