SQL Server – How to Retrieve Next Queue Item

Tags: performance, query-performance, queue, sql-server, sql-server-2012

I have a simple table in SQL Server 2012 which implements a processing queue. As data has been inserted, the query to retrieve the next item has gone from under 100 ms to a constant 5–6 seconds. I would be hugely grateful if someone could point me at the cause of this sudden dip in performance (it seems to have been an almost overnight drop).

Here's the table definition:

CREATE TABLE [dbo].[messagequeue] (
    [id]             INT             IDENTITY (1, 1) NOT NULL,
    [testrunident]   VARCHAR (255)   NOT NULL,
    [filesequence]   INT             NOT NULL,
    [processed]      BIT             NOT NULL,
    [dateentered]    DATETIME        NULL,
    [filedata]       VARBINARY (MAX) NULL,
    [retries]        INT             NOT NULL,
    [failed]         BIT             NOT NULL,
    [msgobject]      VARBINARY (MAX) NULL,
    [errortext]      VARCHAR (MAX)   NULL,
    [sourcefilename] VARCHAR (MAX)   NULL,
    [xmlsource]      VARCHAR (MAX)   NULL,
    [messagetype]    VARCHAR (255)   NULL
);

CREATE NONCLUSTERED INDEX [messagequeue_sequenc_failed_idx]
    ON [dbo].[messagequeue]([processed] ASC, [failed] ASC)
    INCLUDE([id], [testrunident], [filesequence]);

CREATE NONCLUSTERED INDEX [messagequeue_sequence_idx]
    ON [dbo].[messagequeue]([testrunident] ASC, [processed] ASC)
    INCLUDE([filesequence]);

CREATE UNIQUE NONCLUSTERED INDEX [IXd_testrun_sequence]
    ON [dbo].[messagequeue]([testrunident] ASC, [filesequence] ASC);

And here's the query used to retrieve the next row to be processed:

select messagequeue.id, messagequeue.testrunident, messagequeue.filesequence,
    messagequeue.processed, messagequeue.filedata, messagequeue.retries, messagequeue.failed, 
    messagequeue.msgobject, messagequeue.xmlsource 
    from messagequeue where id = (
        select top 1 id from messagequeue mqouter
        where processed = 0
        AND failed = 0
        AND (filesequence = 0 OR
        filesequence = (
                select max (filesequence) + 1
                from messagequeue mqinner 
                where mqinner.testrunident = mqouter.testrunident
                and mqinner.processed = 1
            )
        )
        order by testrunident, filesequence
        )

There are multiple rows with the same testrunident, and each has a filesequence which should be sequential, although some values may be missing. The query should therefore only return the NEXT row: one where the previous row has processed = 1, or where filesequence = 0, which denotes the first row within a testrunident group.

Here's an SQLFiddle to give an idea: SQL Fiddle

Query plan: Query Plan XML

Is there a better way to write the query?

EDIT 1 – Example of ensuring previous row was processed before selecting a row:

Where `id` = testrunident and `fs` = filesequence

id | fs | processed 
1  | 0  |  1
1  | 1  |  1
1  | 2  |  1
1  | 4  |  0 -- this shouldn't be next as there is no row with sequence = 3 and processed = 1
2  | 0  |  0 -- this should be the next row
2  | 1  |  0

SQLFiddle highlighting this

Best Answer

Taking just the core part that identifies the id of the row to return, the following query encapsulates the logic needed:

SELECT TOP (1)
    MQ.id
FROM dbo.messagequeue AS MQ
WHERE
    -- Current row
    MQ.processed = 0
    AND MQ.failed = 0
    AND 
    (
        EXISTS
        (
            -- Previous row in strict sequence
            SELECT * 
            FROM dbo.messagequeue AS MQ2
            WHERE
                MQ2.testrunident = MQ.testrunident
                AND MQ2.processed = 1
                AND MQ2.failed = 0
                AND MQ2.filesequence = MQ.filesequence - 1
        )
        OR MQ.filesequence = 0
    )
ORDER BY 
    MQ.testrunident ASC,
    MQ.filesequence ASC;

Executing this query efficiently needs a small change to an existing index, whose definition is currently:

CREATE NONCLUSTERED INDEX [messagequeue_sequenc_failed_idx]
    ON [dbo].[messagequeue]([processed] ASC, [failed] ASC)
    INCLUDE([id], [testrunident], [filesequence]);

The change involves moving testrunident and filesequence from the INCLUDE list to the index keys. The new index still supports all the queries the old one did, and as a side-effect of the change, the redefined index can now be marked as UNIQUE. The following script will perform this change (you can perform this operation ONLINE if you are running Enterprise Edition):

CREATE UNIQUE NONCLUSTERED INDEX [messagequeue_sequenc_failed_idx]
ON [dbo].[messagequeue]
(
    [processed] ASC, 
    [failed] ASC, 
    [testrunident] ASC, 
    [filesequence] ASC
)
INCLUDE 
(
    [id]
)
WITH 
(
    DROP_EXISTING = ON
    --, ONLINE = ON
);

With this index in place, the execution plan for the revised query is:

Execution plan

To return data from the identified row, the final query is a simple extension:

SELECT
    MQ3.id,
    MQ3.testrunident,
    MQ3.filesequence,
    MQ3.processed,
    MQ3.filedata,
    MQ3.retries,
    MQ3.failed,
    MQ3.msgobject,
    MQ3.xmlsource
FROM dbo.messagequeue AS MQ3
WHERE
    MQ3.id =
    (
        SELECT TOP (1)
            MQ.id
        FROM dbo.messagequeue AS MQ
        WHERE
            MQ.processed = 0
            AND MQ.failed = 0
            AND 
            (
                EXISTS
                (
                    SELECT * 
                    FROM dbo.messagequeue AS MQ2
                    WHERE
                        MQ2.testrunident = MQ.testrunident
                        AND MQ2.filesequence = MQ.filesequence - 1
                        AND MQ2.processed = 1
                        AND MQ2.failed = 0
                )
                OR MQ.filesequence = 0
            )
        ORDER BY 
            MQ.testrunident ASC,
            MQ.filesequence ASC
    );

Final execution plan

Second option

There is another option because you are using SQL Server 2012, which introduced the LAG and LEAD window functions:

SELECT TOP (1)
    ML.id
FROM 
(
    SELECT 
        M.id, 
        M.testrunident,
        M.filesequence,
        M.processed,
        M.failed,
        PreviousProcessed = LAG(M.processed) OVER (
            ORDER BY M.testrunident, M.filesequence),
        PreviousFailed = LAG(M.failed) OVER (
            ORDER BY M.testrunident, M.filesequence),
        PreviousFileSequence = LAG(M.filesequence) OVER (
            ORDER BY M.testrunident, M.filesequence)
    FROM dbo.messagequeue AS M
) AS ML
WHERE
    -- Current row
    ML.processed = 0
    AND ML.failed = 0
    -- Previous row in strict order
    AND ML.PreviousProcessed = 1
    AND ML.PreviousFailed = 0
    AND ML.PreviousFileSequence = ML.filesequence - 1
ORDER BY 
    ML.testrunident, 
    ML.filesequence;

This query also needs an adjustment to an existing index, this time by adding processed and failed as included columns:

CREATE UNIQUE NONCLUSTERED INDEX [IXd_testrun_sequence]
ON [dbo].[messagequeue]
(
    [testrunident] ASC, 
    [filesequence] ASC
)
INCLUDE 
(
    [processed],
    [failed]
)
WITH 
(
    DROP_EXISTING = ON
    --, ONLINE = ON
);

With that index in place, the execution plan is:

LAG execution plan

I should also mention that your general approach would be unsafe if more than one process were ever working on the same queue at once. For more information about this and general queue table design see Remus Rusanu's excellent article Using Tables As Queues.
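For reference, the usual safe pattern described in that article combines the state change and the read into a single atomic statement, using locking hints so that concurrent workers skip rows already claimed by someone else rather than blocking on them or processing them twice. The sketch below is illustrative only: it dequeues in simple (testrunident, filesequence) order rather than implementing the strict-sequence logic above, and it assumes marking processed = 1 is how a row is claimed.

```sql
-- Illustrative sketch only: atomically claim the next unprocessed row.
-- ROWLOCK, UPDLOCK take an update lock on the chosen row; READPAST makes
-- concurrent workers skip locked rows instead of blocking on them.
WITH NextItem AS
(
    SELECT TOP (1) *
    FROM dbo.messagequeue WITH (ROWLOCK, UPDLOCK, READPAST)
    WHERE processed = 0 AND failed = 0
    ORDER BY testrunident, filesequence
)
UPDATE NextItem
SET processed = 1
OUTPUT
    inserted.id,
    inserted.testrunident,
    inserted.filesequence,
    inserted.filedata,
    inserted.msgobject,
    inserted.xmlsource;
```

Because the claim and the read happen in one statement, there is no window in which two workers can both see the same row as available.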

Further analysis

The original execution plan showed a big difference between the number of rows SQL Server expected to process, versus the number actually encountered during execution. Opening the plan with SQL Sentry Plan Explorer shows these differences clearly:

Plan Explorer Plan Tree View

SQL Server chose an execution strategy that would have worked well if the number of rows really had been as small as it had estimated:

Estimated row counts

Unfortunately, the chosen strategy did not scale well when the estimates turned out to be far too low. Parts of the execution plan were actually executed 458,260 times, which is not a recipe for instant results. Had the SQL Server optimizer known the true numbers, it would likely have chosen a different strategy.

Actual row counts

It is tempting to think that a discrepancy between estimated and actual row counts must be due to inaccurate statistical information, but chances are you have reasonably up-to-date single-column automatic statistics on this table. Improvements might be made by providing additional statistical information, but the root cause of the inaccurate estimates is something quite different in this case.

By default, when SQL Server chooses an execution plan for a query it assumes all potential result rows will be returned to the client. However, when SQL Server sees a TOP operator, it rightly takes the number of rows specified into account when estimating cardinality. This behaviour is known as setting a row goal.

Essentially, a row goal means the optimizer scales back estimates under the Top operator to reflect the fact that fewer rows will be needed than the full potential size of the result set. This scaling implicitly assumes that values of interest are evenly distributed within the set.

For example, say you have a table with 1,000 rows, of which 10 satisfy some query predicate. Using statistics, SQL Server knows that 1% of the rows meet the criteria. If you write a query to return the first matching row, SQL Server therefore assumes it will need to read around 100 rows (the reciprocal of the 1% selectivity) before finding that first match. In the worst case, the ten rows that meet the criteria might appear last (in whatever order they are searched), so 990 rows that do not match will be read before SQL Server encounters the one you want.
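The scaling arithmetic is simple enough to sketch. Under the even-distribution assumption, the optimizer scales the expected scan size by the row goal divided by the selectivity; the helper functions below are hypothetical (nothing SQL Server exposes) and just make the estimate and the worst case explicit:

```python
def row_goal_scan_estimate(table_rows: int, matching_rows: int, row_goal: int = 1) -> float:
    """Rows the optimizer expects to scan before meeting the row goal,
    assuming matching rows are evenly distributed through the scan order."""
    selectivity = matching_rows / table_rows
    return row_goal / selectivity


def worst_case_scan(table_rows: int, matching_rows: int) -> int:
    """Worst case: every non-matching row is read before the first match."""
    return (table_rows - matching_rows) + 1


# 1,000-row table, 10 matching rows (1% selectivity), TOP (1):
print(row_goal_scan_estimate(1000, 10))  # estimate: 100.0 rows
print(worst_case_scan(1000, 10))         # worst case: 991 rows
```

The gap between the 100-row estimate and the 991-row worst case is exactly the risk the row goal takes: the plan is cheap only if the matching rows really are spread evenly.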

The improved query (using the improved index) still suffers from this problem to a certain extent, but the effects are much less marked:

Improved Query Plan Tree

The fundamental question we are asking of the optimizer's row goal logic here is: how many rows do we need to check before we find the first one (according to the ORDER BY specification) that is in sequence, unprocessed, and not marked as failed? This is a tough question to answer based on the limited statistics SQL Server keeps.

In fact there is very little we can do to communicate all nuances of this complex problem to the optimizer effectively. The best we can do is to provide it with an efficient index that we expect to locate the qualifying row as early as possible.

In this case, that means providing an index that returns unprocessed, un-failed entries in ORDER BY order. We expect to check relatively few of these before finding the first one that is in sequence (i.e. the preceding row exists and is marked as processed and not failed).

Both solutions shown above eliminate the Key Lookup operations seen in the original query because the new index now includes (covers) all needed columns. Also, the new index is likely to find the target row more quickly. The original execution plan scanned index IXd_testrun_sequence in (testrunident, filesequence) order, meaning it would encounter old test runs first, most of which will be marked processed. Since we are looking for unprocessed rows in the outer query, this strategy was inefficient: the sequence check ended up being performed on 458,260 rows.

Finally, checking for a particular sequence value is much more efficient than finding the maximum of a potentially large set. This is the difference I have highlighted in the code above as looking for the previous row in strict sequence. There is also a semantic difference between the two queries shown here and the MAX solution in the question: my understanding is that you are interested in the first matching row, and that row need not have the highest filesequence for its testrunident.