I have a SQL Server table with about 2 million rows in, this table stores short textual documents. Here's my schema:
CREATE TABLE Documents (
documentId bigint IDENTITY NOT NULL,
content nvarchar(MAX) NOT NULL,
isIndexed bit NOT NULL
)
I have a separate Lucene-based indexer process that does this:
SELECT TOP 1 documentId, content FROM Documents WHERE isIndexed = 0
It then performs the indexing operation before executing this:
UPDATE Documents SET isIndexed = 1 WHERE documentId = @documentId
Initially this worked fine. When every document in the table was not indexed (i.e. all isIndexed
values were 0) then each retrieval took about 5ms.
However as more and more documents were indexed then the retrieval time slowly increased. Currently it's about 150ms – a 30 time speed reduction. I note that the UPDATE statement always seems to run in under 2ms, so I know that's not a problem.
From the start the table has always had a non-clustered index on the isIndexed column, but the Actual Execution Plan shows the SQL Server uses an Index Scan (rather than Seek).
What can I do to speed up the system?
I am aware that the "isIndexed" column existing is bad in itself, but due to how the indexer works it can't request documents by documentId directly. For this and other reasons I can't accept any answers that don't solve the immediate problem at hand.
Best Answer
An index on the bit column isn't going to help at all because of selectivity. You should consider a filtered index:
You should also include an
ORDER BY documentId
in yourSELECT
.Though you should test this in a staging environment because it may help your
SELECT
but may offset that with a more expensiveUPDATE
.