I have a system to teach English. For each word someone learns, I record:
LearnedWords
-- Id
-- UserId
-- WordId
And for each new text that I give to a learner, I want to indicate which words are new for him.
So imagine a learner has learned 15000 words; that means there are 15000 records for him in the LearnedWords table alone. Now he chooses to read a new text that has over 2000 words. All I have here is textual data, that is, words.
Of these 2000 words, let's say only 300 are new. How can I find them? The only way I can think of is to do set operations, Venn-diagram style, using a not in clause. So I create a procedure that takes the 2000 words as input, creates a table variable and populates it with the given text's words, selects the learned words from the database, and then executes the not in query:
declare @textWords table
(
Word nvarchar(100)
)
-- populate @textWords with 2000 given words here
declare @learnedWords table
(
Word nvarchar(100)
)
declare @newWords table
(
Word nvarchar(100)
)
insert into @learnedWords (Word)
select Words.[Text]
from LearnedWords
inner join Words
on LearnedWords.WordId = Words.Id
where LearnedWords.UserId = 1 --> user with 15000 learned words
insert into @newWords (Word) -- filters 300 new words for the given user
select Word
from @textWords -- has 2000 records
where Word not in
(
select Word
from @learnedWords -- has 15000 records for the given user
)
However, this procedure takes 14 seconds to complete. It's a nightmare. I ran the Tuning Advisor and applied its recommendations, but that only improved it to 13 seconds. I'm stuck on how to meet this business requirement in the data layer with good performance. Any help?
Best Answer
Consider the database that you're querying against. It contains a data model that someone designed. It likely contains indexes that someone added to improve query performance. It contains statistics maintained by the engine to improve query performance. You effectively throw all of that away when you join table variables to each other. SQL Server doesn't allow for statistics on table variables and those queries won't benefit from all of the work that went into improving your data model. The best use cases for table variables are to retain data after a transaction is rolled back or to avoid unnecessary recompiles in stored procedures that are executed hundreds of times per second or more.
Fundamentally, comparing 15000 rows to 2000 rows can be a very fast operation in SQL Server as long as you avoid fighting with the database. I mocked up some data similar to the quantities mentioned in the question:
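The answer's original mock-up script isn't shown here. A sketch that reproduces the quantities from the question might look like the following; the table and index names, and the row-generation trick, are my assumptions, not necessarily what the answerer used:

```sql
-- Assumed schema: Words holds the dictionary, LearnedWords links users to words.
CREATE TABLE Words
(
    Id INT IDENTITY(1, 1) PRIMARY KEY,
    [Text] NVARCHAR(100) NOT NULL
);

CREATE TABLE LearnedWords
(
    Id INT IDENTITY(1, 1) PRIMARY KEY,
    UserId INT NOT NULL,
    WordId INT NOT NULL
);

-- Index that supports "all words learned by this user" lookups.
CREATE INDEX IX_LearnedWords_UserId_WordId
    ON LearnedWords (UserId, WordId);

-- Generate 15000 distinct words using a classic cross-join row source.
INSERT INTO Words ([Text])
SELECT TOP (15000)
       N'word' + CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS NVARCHAR(10))
FROM sys.all_objects a
CROSS JOIN sys.all_objects b;

-- Mark all 15000 words as learned by user 1, as in the question.
INSERT INTO LearnedWords (UserId, WordId)
SELECT 1, Id
FROM Words;
```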
Suppose that you were able to create a table that contains the distinct words used in each "text". I used the following:
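The exact table isn't shown here; one plausible shape for it, with hypothetical names, is a two-column table keyed by text and word:

```sql
-- Assumed table: one row per distinct word appearing in each text.
-- The composite primary key both enforces uniqueness and indexes lookups by text.
CREATE TABLE TextWords
(
    TextId INT NOT NULL,
    WordId INT NOT NULL,
    CONSTRAINT PK_TextWords PRIMARY KEY (TextId, WordId)
);
```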
Finding new words can now be expressed with a single SELECT query without any table variables. That query finishes in less than 20 ms on my machine. If I run your original code it takes 3800 ms in total. If you need to use the table variables, then at the very least try adding a RECOMPILE hint to the final query. That drops the execution time to 1600 ms on my machine.

With the right data model you can meet your business requirements in a simple and efficient way. There's a bit of ambiguity in your statement about texts. Perhaps you aren't able to store the unique words for each text in the database ahead of time. In that case the time taken by your procedure should depend almost entirely on how long it takes to populate a temporary structure with the unique words used in that text. You haven't posted that code here, so it's impossible for us to help you with that.
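For concreteness, here is one way that single-SELECT query might look. The TextWords table and the parameter names are my assumptions (matching the sketch of distinct words per text described above), not necessarily the answerer's exact code:

```sql
-- Assumed: @TextId identifies the text being read, @UserId the learner.
-- Returns the words in the text that the user has not yet learned.
SELECT w.[Text]
FROM TextWords tw
INNER JOIN Words w
    ON w.Id = tw.WordId
WHERE tw.TextId = @TextId
  AND NOT EXISTS
  (
      SELECT 1
      FROM LearnedWords lw
      WHERE lw.UserId = @UserId
        AND lw.WordId = tw.WordId
  );
```

If you must keep the table variables, the recompile hint mentioned above goes on the final statement, e.g. appending OPTION (RECOMPILE) to the insert...select that filters the new words, so the optimizer can see the actual row counts at execution time.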