I have a simple table in SQL Server 2012 which implements a processing queue. As data has been inserted, the query to retrieve the next item has gone from under 100 ms to a constant 5-6 seconds. I would be hugely grateful if someone could point me at the cause of this sudden dip in performance (it seems to have been an almost overnight drop).
Here's the table definition:
```sql
CREATE TABLE [dbo].[messagequeue] (
    [id]             INT IDENTITY (1, 1) NOT NULL,
    [testrunident]   VARCHAR (255)       NOT NULL,
    [filesequence]   INT                 NOT NULL,
    [processed]      BIT                 NOT NULL,
    [dateentered]    DATETIME            NULL,
    [filedata]       VARBINARY (MAX)     NULL,
    [retries]        INT                 NOT NULL,
    [failed]         BIT                 NOT NULL,
    [msgobject]      VARBINARY (MAX)     NULL,
    [errortext]      VARCHAR (MAX)       NULL,
    [sourcefilename] VARCHAR (MAX)       NULL,
    [xmlsource]      VARCHAR (MAX)       NULL,
    [messagetype]    VARCHAR (255)       NULL
);

CREATE NONCLUSTERED INDEX [messagequeue_sequenc_failed_idx]
ON [dbo].[messagequeue]([processed] ASC, [failed] ASC)
INCLUDE([id], [testrunident], [filesequence]);

CREATE NONCLUSTERED INDEX [messagequeue_sequence_idx]
ON [dbo].[messagequeue]([testrunident] ASC, [processed] ASC)
INCLUDE([filesequence]);

CREATE UNIQUE NONCLUSTERED INDEX [IXd_testrun_sequence]
ON [dbo].[messagequeue]([testrunident] ASC, [filesequence] ASC);
```
And here's the query used to retrieve the next row to be processed:
```sql
select messagequeue.id, messagequeue.testrunident, messagequeue.filesequence,
    messagequeue.processed, messagequeue.filedata, messagequeue.retries, messagequeue.failed,
    messagequeue.msgobject, messagequeue.xmlsource
from messagequeue where id = (
    select top 1 id from messagequeue mqouter
    where processed = 0
    AND failed = 0
    AND (filesequence = 0 OR
        filesequence = (
            select max (filesequence) + 1
            from messagequeue mqinner
            where mqinner.testrunident = mqouter.testrunident
            and mqinner.processed = 1
        )
    )
    order by testrunident, filesequence
)
```
There are multiple rows with the same `testrunident`; each has a `filesequence`, which should be sequential, although some may be missing. The query should only return the NEXT row: one where the previous row has `processed = 1`, or where `filesequence = 0`, which denotes the first row within a `testrunident` group.
Here's an SQLFiddle to give an idea: SQL Fiddle
Query plan: Query Plan XML
Is there a better way to write the query?
EDIT 1 – Example of ensuring previous row was processed before selecting a row:
Where `id` = testrunident and `fs` = filesequence
```
id | fs | processed
 1 |  0 | 1
 1 |  1 | 1
 1 |  2 | 1
 1 |  4 | 0  -- this shouldn't be next as no row with sequence = 3 and processed = 1
 2 |  0 | 0  -- this should be the next row
 2 |  1 | 0
```
Best Answer
Taking just the core part that identifies the `id` of the row to return, the logic needed is: find the first unprocessed, un-failed row (in `testrunident`, `filesequence` order) that either starts a test run (`filesequence = 0`) or whose immediate predecessor (`filesequence - 1`) has been processed. Executing such a query efficiently needs a small change to an existing index, `messagequeue_sequenc_failed_idx`, whose current definition is shown in the question.
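A query along those lines might look like this (a sketch reconstructing the described logic; aliases and formatting are mine, not necessarily the answer's original code):

```sql
-- Sketch: find the id of the next row to process by checking for the
-- previous row in strict sequence (rather than MAX over the whole run)
SELECT TOP (1)
    MQ.id
FROM dbo.messagequeue AS MQ
WHERE
    MQ.processed = 0
    AND MQ.failed = 0
    AND
    (
        -- first row of a test run
        MQ.filesequence = 0
        -- or the immediately preceding row exists, processed and not failed
        OR EXISTS
        (
            SELECT 1
            FROM dbo.messagequeue AS PREV
            WHERE
                PREV.testrunident = MQ.testrunident
                AND PREV.filesequence = MQ.filesequence - 1
                AND PREV.processed = 1
                AND PREV.failed = 0
        )
    )
ORDER BY
    MQ.testrunident,
    MQ.filesequence;
```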
The change involves moving `testrunident` and `filesequence` from the `INCLUDE` list to the index keys. The new index still supports all the queries the old one did, and as a side-effect of the change, the redefined index can now be marked as `UNIQUE` (since `(testrunident, filesequence)` is already unique). The rebuild can be performed `ONLINE` if you are running Enterprise Edition. With this index in place, the execution plan for the revised query becomes far more efficient.
To return data from the identified row, the final query is a simple extension:
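A sketch of that extension, wrapping the core `id`-finding query (again a reconstruction, not the answer's original):

```sql
-- Sketch: fetch the full row for the id identified by the core query
SELECT
    MQ.id, MQ.testrunident, MQ.filesequence, MQ.processed,
    MQ.filedata, MQ.retries, MQ.failed, MQ.msgobject, MQ.xmlsource
FROM dbo.messagequeue AS MQ
WHERE MQ.id =
(
    SELECT TOP (1)
        MQ2.id
    FROM dbo.messagequeue AS MQ2
    WHERE
        MQ2.processed = 0
        AND MQ2.failed = 0
        AND
        (
            MQ2.filesequence = 0
            OR EXISTS
            (
                SELECT 1
                FROM dbo.messagequeue AS PREV
                WHERE
                    PREV.testrunident = MQ2.testrunident
                    AND PREV.filesequence = MQ2.filesequence - 1
                    AND PREV.processed = 1
                    AND PREV.failed = 0
            )
        )
    ORDER BY
        MQ2.testrunident,
        MQ2.filesequence
);
```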
Second option
There is another option because you are using SQL Server 2012, which introduced the `LAG` and `LEAD` window functions. This approach also needs an adjustment to an existing index, this time adding `processed` and `failed` to `IXd_testrun_sequence` as included columns, so that the index covers the query. With that index in place, the execution plan is again efficient.
I should also mention that your general approach would be unsafe if more than one process were ever working on the same queue at once. For more information about this and general queue table design see Remus Rusanu's excellent article Using Tables As Queues.
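For illustration only (this pattern is not from the answer, and it omits the strict-sequence check for brevity), a common way to let concurrent workers claim queue rows safely combines the `UPDLOCK` and `READPAST` table hints inside a transaction:

```sql
-- Hypothetical simplified pattern: READPAST skips rows locked by other
-- workers; UPDLOCK holds the claim on the selected row until commit
BEGIN TRANSACTION;

SELECT TOP (1)
    MQ.id
FROM dbo.messagequeue AS MQ
    WITH (UPDLOCK, READPAST, ROWLOCK)
WHERE
    MQ.processed = 0
    AND MQ.failed = 0
ORDER BY
    MQ.testrunident,
    MQ.filesequence;

-- ...process the message, then mark it processed...

COMMIT TRANSACTION;
```

See the linked article for queue-table designs that get this right in detail.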
Further analysis
The original execution plan showed a big difference between the number of rows SQL Server expected to process and the number actually encountered during execution. Opening the plan with SQL Sentry Plan Explorer shows these differences clearly.
SQL Server chose an execution strategy that would have worked well if the number of rows really had been as small as it had estimated.
Unfortunately, the chosen strategy did not scale well when the estimates turned out to be far too low. Parts of the execution plan were actually executed 458,260 times - not a recipe for instant results. Had the SQL Server optimizer known the true numbers, it likely would have chosen a different strategy.
It is tempting to think that a discrepancy between estimated and actual row counts must be due to inaccurate statistical information, but chances are you have reasonably up-to-date single-column automatic statistics on these tables. Improvements might be made by providing additional statistical information, but the root cause of the inaccurate estimates is something quite different in this case.
By default, when SQL Server chooses an execution plan for a query, it assumes all potential result rows will be returned to the client. However, when SQL Server sees a `TOP` operator, it rightly takes the number of rows specified into account when estimating cardinality. This behaviour is known as setting a *row goal*.

Essentially, a row goal means the optimizer scales back estimates under the Top operator to reflect the fact that fewer rows will be needed than the full potential size of the result set. This scaling implicitly assumes that values of interest are evenly distributed within the set.
For example, say you have a table with 1000 rows, of which 10 satisfy some query predicate. Using statistics, SQL Server knows that 1% of the rows meet the criteria. If you write a query to return the first row that matches, SQL Server assumes that it will need to read 1% of the table (= 10 rows) in order to find that first match. In the worst case, the ten rows that meet the criteria might appear last (in whatever order they are searched) so 990 rows that do not match will be read before SQL Server encounters the one you want.
The improved query (using the improved index) still suffers from this problem to a certain extent, but the effects are much less marked.
The fundamental question we are asking of the optimizer's row goal logic here is: how many rows do we need to check before we find the first one (according to the `order by` clause specification) that is in sequence, unprocessed, and not marked as having failed? This is a tough question to answer based on the limited statistics SQL Server keeps.
In fact there is very little we can do to communicate all nuances of this complex problem to the optimizer effectively. The best we can do is to provide it with an efficient index that we expect to locate the qualifying row as early as possible.
In this case, that means providing an index that returns unprocessed, un-failed entries in `order by` clause order. We expect to have to check relatively few of these before finding the first one that is in sequence (i.e. the preceding row exists and is marked as processed and not marked as failed).

Both solutions shown above eliminate the Key Lookup operations seen in the original query, because the new index now includes (covers) all needed columns. Also, the new index is likely to find the target row more quickly. The original execution plan scanned index `IXd_testrun_sequence` in `(testrunident, filesequence)` order, meaning it would encounter old test runs first, most of which will be marked processed. We are looking for unprocessed rows in the outer query, so this strategy was inefficient (we end up performing the sequence check on 458,260 rows).

Finally, checking for a particular sequence value is much more efficient than finding the maximum of a potentially large set. This is the difference highlighted above as looking for the previous row in strict sequence. There is a semantic difference between these two approaches and the solution using `MAX` shown in the question: my understanding is that you are interested in the first matching row, and that row need not have the highest `filesequence` for its `testrunident`.