SQL Server – Stuck optimizing a NOT IN clause for a medium-sized query across two tables

performance, performance-tuning, query-performance, sql-server, sql-server-2014

I have a system to teach English. For each word someone learns, I record:

LearnedWords 
   -- Id
   -- UserId
   -- WordId

And for each new text that I give to a learner, I want to indicate which words are new for him.

Imagine a learner has learned 15000 words, so there are 15000 records for him in LearnedWords alone. Now he chooses to read a new text that contains over 2000 words. All I have here is textual data, that is, words.

Of these 2000 words, let's say only 300 are new. How can I find them? The only way I can think of is set operations, Venn-diagram style, using a NOT IN clause. So I created a procedure that takes the 2000 words as input, declares a table variable and populates it with the given text's words, selects the learned words from the database, and then executes the NOT IN query:

declare @textWords table
(
    Word nvarchar(100)
)

-- populate @textWords with 2000 given words here

declare @learnedWords table
(
    Word nvarchar(100)
)

declare @newWords table 
(
    Word nvarchar(100)
)

insert into @learnedWords (Word)
select Words.[Text]
from LearnedWords
inner join Words
on LearnedWords.WordId = Words.Id
where LearnedWords.UserId = 1 --> user with 15000 learned words

insert into @newWords (Word) -- filters 300 new words for the given user
select Word 
from @textWords -- has 2000 records
where Word not in 
(
    select Word 
    from @learnedWords -- has 15000 records for the given user
)

However, this procedure takes 14 seconds to complete. It's a nightmare. I ran the Database Engine Tuning Advisor and applied its recommendations, but that only brought it down to 13 seconds. I'm stuck on how to meet this business requirement in the data layer with good performance. Any help?

Best Answer

Consider the database that you're querying against. It contains a data model that someone designed. It likely contains indexes that someone added to improve query performance. It contains statistics maintained by the engine to improve query performance. You effectively throw all of that away when you join table variables to each other. SQL Server doesn't allow for statistics on table variables and those queries won't benefit from all of the work that went into improving your data model. The best use cases for table variables are to retain data after a transaction is rolled back or to avoid unnecessary recompiles in stored procedures that are executed hundreds of times per second or more.
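You can see the effect directly by comparing the optimizer's estimate against the actual row count. A quick demonstration (this reflects SQL Server 2014 behavior; run it with the actual execution plan enabled):

-- Without statistics, the optimizer guesses one row for a table
-- variable no matter how many rows it actually holds.
DECLARE @demo TABLE (Id INT);

INSERT INTO @demo (Id)
SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;

-- In the actual plan, the scan of @demo shows an estimate of
-- 1 row against 10000 actual rows.
SELECT COUNT(*) FROM @demo;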

Fundamentally, comparing 15000 rows to 2000 rows can be a very fast operation in SQL Server as long as you avoid fighting with the database. I mocked up some data similar to the quantities mentioned in the question:

CREATE TABLE dbo.Words (
    Id INT IDENTITY (1, 1),
    [Text] NVARCHAR(100),
    PRIMARY KEY (Id)
);

INSERT INTO dbo.Words WITH (TABLOCK)
SELECT DISTINCT LEFT(Text, 25)
FROM sys.messages;
-- (203013 rows affected)

CREATE TABLE dbo.LearnedWords (
    UserId INT,
    WordId INT
);

CREATE CLUSTERED INDEX CI ON dbo.LearnedWords (UserId);

INSERT INTO dbo.LearnedWords WITH (TABLOCK)
SELECT 1, Id
FROM dbo.Words
WHERE ID % 100 <= 7;
-- (16247 rows affected)

Suppose that you were able to create a table that contains the distinct words used in each "text". I used the following:

CREATE TABLE dbo.TextWords (
    TextId INT,
    WordId INT
);

CREATE CLUSTERED INDEX CI ON dbo.TextWords (TextId);

INSERT INTO dbo.TextWords WITH (TABLOCK)
SELECT 1, Id
FROM dbo.Words
WHERE ID % 1000 <= 9;
-- (2039 rows affected)

Finding new words can now be expressed with a single SELECT query without any table variables:

SELECT w.[Text]
FROM dbo.TextWords t
INNER JOIN dbo.Words w ON t.WordId = w.Id
WHERE t.TextId = 1
AND NOT EXISTS (
    SELECT 1
    FROM dbo.LearnedWords l
    WHERE t.WordId = l.WordId
    AND l.UserId = 1
);

That query finishes in less than 20 ms on my machine. If I run your original code it takes 3800 ms in total. If you need to use the table variables then at the very least try adding a RECOMPILE hint to the final query. That drops the execution time to 1600 ms on my machine.
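For reference, here is the question's final query with the hint applied (same logic as before; the hint just lets the optimizer compile against the real row counts in the table variables at run time):

INSERT INTO @newWords (Word) -- filters 300 new words for the given user
SELECT tw.Word
FROM @textWords tw
WHERE tw.Word NOT IN
(
    SELECT lw.Word
    FROM @learnedWords lw
)
OPTION (RECOMPILE); -- recompile with actual table variable cardinalities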

With the right data model you can meet your business requirement in a simple and efficient way. There's a bit of ambiguity in your statement about texts. Perhaps you aren't able to store the unique words for each text in the database ahead of time. In that case the time taken by your procedure should depend entirely on how long it takes to populate a temporary structure with the unique words used in that text. You haven't posted that code here, so it's impossible for us to help you with it.
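If the words really do only exist in the application at query time, one option on SQL Server 2014 is to send them as a table-valued parameter, so the server receives the whole word list as an indexed set in a single round trip. A minimal sketch, assuming the type and procedure names below (they're my own invention, not from the question):

CREATE TYPE dbo.WordList AS TABLE (
    Word NVARCHAR(100) PRIMARY KEY -- also deduplicates the text's words
);
GO

CREATE PROCEDURE dbo.GetNewWords
    @UserId INT,
    @Words dbo.WordList READONLY
AS
BEGIN
    SELECT tw.Word
    FROM @Words tw
    WHERE NOT EXISTS (
        SELECT 1
        FROM dbo.LearnedWords l
        INNER JOIN dbo.Words w ON l.WordId = w.Id
        WHERE w.[Text] = tw.Word
          AND l.UserId = @UserId
    )
    OPTION (RECOMPILE); -- table-valued parameters also lack statistics
END;

The application then fills a DataTable (or the equivalent in your client library) with the 2000 words and passes it as @Words, avoiding 2000 separate inserts into a table variable. An index on dbo.Words([Text]) would help the lookup by word text stay cheap.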