Query runs fast but occasionally never completes

azure-sql-database

I have messed things up at work (oh no!). Everything was running fine until about two days ago; queries executed instantly, and the database tended to sit around 20% load. Then I was told that some of our historical data had been entered incorrectly and needed to be deleted from the following tables and re-entered:

CREATE TABLE TrackerStates(
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [TrackerId] [int] NOT NULL,
    [DateRecorded] [datetime] NOT NULL,
    [Latitude] [float] NOT NULL,
    [Longitude] [float] NOT NULL,
    [Altitude] [float] NOT NULL,
    -- Some other fields, including [Processed] (referenced by the indexes and query below)
    CONSTRAINT [PK_dbo.TrackerStates] PRIMARY KEY NONCLUSTERED  (Id)
)

CREATE CLUSTERED INDEX IX_TrackerId_DateRecorded_Processed ON TrackerStates (TrackerId, DateRecorded, Processed)
CREATE NONCLUSTERED INDEX IX_TrackerId_Processed ON TrackerStates (TrackerId) WHERE (Processed = 0)

CREATE TABLE TrackingPoints(
    [DatabaseId] [int] NOT NULL,
    [TrackerStateId] [int] NOT NULL,
    [RoadId] [int] NOT NULL,
    [Distance] [float] NOT NULL,
    PRIMARY KEY CLUSTERED (DatabaseId, TrackerStateId),
    FOREIGN KEY (DatabaseId) REFERENCES RammDatabases (Id),
    FOREIGN KEY (TrackerStateId) REFERENCES TrackerStates (Id)
)
CREATE NONCLUSTERED INDEX IX_DatabaseId ON TrackingPoints (DatabaseId)
CREATE NONCLUSTERED INDEX IX_TrackerStateId ON TrackingPoints (TrackerStateId)

CREATE TABLE TrackingSegments(
    [DatabaseId] [int] NOT NULL,
    [StartingStateId] [int] NOT NULL,
    [EndingStateId] [int] NOT NULL,
    [EntityId] [nvarchar](max) NULL,
    PRIMARY KEY CLUSTERED (DatabaseId, StartingStateId, EndingStateId),
    FOREIGN KEY (DatabaseId) REFERENCES RammDatabases (Id),
    FOREIGN KEY (StartingStateId) REFERENCES TrackerStates (Id),
    FOREIGN KEY (EndingStateId) REFERENCES TrackerStates (Id),
    FOREIGN KEY (DatabaseId, StartingStateId) REFERENCES TrackingPoints (DatabaseId, TrackerStateId),
    FOREIGN KEY (DatabaseId, EndingStateId) REFERENCES TrackingPoints (DatabaseId, TrackerStateId)
)
CREATE NONCLUSTERED INDEX IX_DatabaseId ON TrackingSegments (DatabaseId)
CREATE NONCLUSTERED INDEX IX_DatabaseId_StartingStateId ON TrackingSegments (DatabaseId, StartingStateId)
CREATE NONCLUSTERED INDEX IX_DatabaseId_EndingStateId ON TrackingSegments (DatabaseId, EndingStateId)

My first attempt to delete the data simply took far too long; it's historical data, so there are a few million rows. After that failed, I asked around and someone suggested disabling the check constraints, deleting, and then re-enabling the constraints on those specific tables (which I will NEVER do again, terrible idea). The disable and delete ran fairly quickly; I had to leave the re-enable running all night, but it did succeed.
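For reference, the pattern was along these lines (a generic sketch with table names from the schema above, not the exact script I ran). The important detail, which I only learned afterwards, is that re-enabling with `WITH CHECK CHECK CONSTRAINT` re-validates every row and marks the constraints as trusted again, which is why that step ran all night:

```sql
-- Disable the foreign key constraints on the affected tables:
ALTER TABLE TrackingSegments NOCHECK CONSTRAINT ALL;
ALTER TABLE TrackingPoints   NOCHECK CONSTRAINT ALL;

-- ... DELETE the incorrect historical rows here ...

-- Re-enable AND re-validate. The extra CHECK makes SQL Server scan the
-- tables to verify every row, so the constraints become trusted again
-- (a plain CHECK CONSTRAINT ALL would leave them untrusted):
ALTER TABLE TrackingPoints   WITH CHECK CHECK CONSTRAINT ALL;
ALTER TABLE TrackingSegments WITH CHECK CHECK CONSTRAINT ALL;
```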

From then on, the database has been at 100% CPU, and one of the queries occasionally never completes. It's the same query every time:

SELECT TOP 500 * FROM TrackerStates WHERE TrackerId = @TrackerId AND Processed = 0 ORDER BY DateRecorded

This query uses the index IX_TrackerId_DateRecorded_Processed, and usually takes no time at all to execute. Occasionally though, a query will just sit there executing until it times out (after 30 seconds).

So far, I have tried:

  • Scaling the database up to 5 times the size (fewer timeouts, but still 100% CPU usage)
  • Rebuilding the indexes, as they were fragmented to about 80% (no difference)
  • Changing the query to isolation level READ UNCOMMITTED (no difference)
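For context, the fragmentation check and the rebuild were done with something like this (a generic sketch, not the exact script used; `sys.dm_db_index_physical_stats` is the standard DMV for this on SQL Server / Azure SQL):

```sql
-- Check fragmentation of the TrackerStates indexes:
SELECT i.name, ps.index_type_desc, ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.TrackerStates'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;

-- Rebuild everything on the table (this also refreshes the statistics
-- on the rebuilt indexes with the equivalent of a full scan):
ALTER INDEX ALL ON dbo.TrackerStates REBUILD;
```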

What could I possibly do to fix this?

Best Answer

It looks like the statistics are off. The plan you posted estimates that it's going to read 64,000 rows, but it's actually reading zero; that's a very wide disparity. I'd suggest a few things.

First, update the statistics with a full scan. The index rebuilds ought to have taken care of that, but with a disparity this wide I wonder if something is still wrong there.

Next, make sure the constraints are all in place and trusted (although this particular plan isn't relying on constraints, since it's a straight index seek with a TOP operator).

Finally, capture the wait statistics for the system to see what's actually causing things to run slowly. Better still, use extended events to capture the wait metrics for just this query.

Also, a 64,000-row range scan from a seek is a little excessive unless you have millions of rows, though that could itself be another symptom of the statistics being off.
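A sketch of those first steps, using the object names from the question (treat it as a template; the constraint re-validation line needs your actual constraint name):

```sql
-- 1. Update statistics on the table with a full scan:
UPDATE STATISTICS dbo.TrackerStates WITH FULLSCAN;

-- 2. Look for foreign keys left disabled or untrusted after the
--    disable/re-enable cycle; the optimizer ignores untrusted constraints:
SELECT name, is_disabled, is_not_trusted
FROM sys.foreign_keys
WHERE is_disabled = 1 OR is_not_trusted = 1;

-- Re-validate one so it becomes trusted again, e.g.:
-- ALTER TABLE dbo.TrackingPoints WITH CHECK CHECK CONSTRAINT [FK_name_here];

-- 3. A quick look at the heaviest waits database-wide (extended events
--    can then narrow this down to the one problem query):
SELECT TOP 10 wait_type, wait_time_ms, waiting_tasks_count
FROM sys.dm_db_wait_stats   -- Azure SQL Database's version of sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;
```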