I have a stored procedure in Microsoft SQL Server 2008 R2 (SP3) which searches a table for anything matching a @Keywords
parameter that I pass in. It was too slow using LIKE
as the table has 155,000 rows, so I created a fulltext catalog and switched to using CONTAINS
instead. I am not very familiar with this feature, but after researching it seemed all I needed to do was process the user's choice of words with:
SET @Keywords = '"' + @Keywords + '*"'
Adding the double quotes and asterisk worked well until someone discovered they had no results when searching for a string they knew definitely existed. It seems that the *
is stopping CONTAINS
from finding rows where the @Keywords
are right at the very end of the string.
Analysis with very simple t-SQL:
--first I prove that I can find the one row which the user was searching for
--this column is a NVARCHAR(200)
SELECT * FROM News
WHERE Headline = '1120 days at sea and still sailing' --1 row returned (OK)
SELECT * FROM News
WHERE Headline LIKE '1120 days at sea and still sailing%' --1 row returned (OK)
--using the full headline
SELECT * FROM News
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing*"') --no rows, just like my proc (WRONG)
SELECT * FROM News
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing"') --1 row returned if asterisk omitted (OK)
--using part of the headline so we don't touch the end of the
SELECT * FROM News
WHERE CONTAINS(Headline, '"days at sea and still"') --lots of rows, since it is anything with 'days' and then 'sea' present (NOT USEFUL)
SELECT * FROM News
WHERE CONTAINS(Headline, '"days at sea and still*"') --no rows (WRONG)
I note that CONTAINS
considers at/and/still to be 'noise' words and ignores them. Fair enough. The word sailing is the troublesome one at the end of the string.
My question:
How should I modify the incoming keywords to make sure than CONTAINS
can find results regardless of the position of the string of keywords within the column value?
Best Answer
This has to do with how stoplist (or noise words) is/are treated when adding the asterix (*) in combination with
CONTAINS()
.A longer explanation on that here.
In short: Default <> Prefix searching
By default, when using the
SYSTEM
stoplist some words are not indexed and classified as stoplist words.When adding the * +
CONTAINS()
they are not seen as stoplist words, and the exact phrase has to match. But when using the defaultSYSTEM
stoplist these 'stoplist words' are not added to the index, resulting in no matches found.Question
Prefix matching in combination with contains from microsoft docs
Source
Keywords here are:
Meaning that we have to have precise matches when using prefix matching and
CONTAINS()
.Workaround #1 Disabling the stoplist
One workaround for this would be disabling the 'stoplist'
You do have to know that this will increase the words stored in the fulltext index, as the words 'and', 'or', 'still', ... are added to the index.
Performance might suffer from disabling the stoplist.
To reenable the system stoplist
Alternatively, you could create your own stoplist
Testing this on your query:
Result
For other workarounds you could look into using
CONTAINS()
orFREETEXT()
without prefix matching, or prefix matching +CONTAINS()
with one (non noise) word to get more accurate resultsIn your case, enbling
transform noise words
does not seem to work with prefix-matchingAgain due to how prefix matching + contains works.
-- No results
Inside the fulltext index (with stoplist = system).
result
'Still','at','and' are filtered out, when disabling the stoplist they will be added to the index
Testing without stopwords
More on the asterix
Source