Sql-server – Why does CONTAINS not find matches at the end of the string

full-text-searchsql server

I have a stored procedure in Microsoft SQL Server 2008 R2 (SP3) which searches a table for anything matching a @Keywords parameter that I pass in. It was too slow using LIKE as the table has 155,000 rows, so I created a fulltext catalog and switched to using CONTAINS instead. I am not very familiar with this feature, but after researching it seemed all I needed to do was process the user's choice of words with:

SET @Keywords = '"' + @Keywords + '*"'

Adding the double quotes and asterisk worked well until someone discovered they had no results when searching for a string they knew definitely existed. It seems that the * is stopping CONTAINS from finding rows where the @Keywords are right at the very end of the string.

Analysis with very simple t-SQL:

--first I prove that I can find the one row which the user was searching for
--this column is a NVARCHAR(200)
SELECT * FROM News 
WHERE Headline = '1120 days at sea and still sailing' --1 row returned (OK)
SELECT * FROM News 
WHERE Headline LIKE '1120 days at sea and still sailing%' --1 row returned (OK)

--using the full headline
SELECT * FROM News 
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing*"') --no rows, just like my proc (WRONG)
SELECT * FROM News 
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing"') --1 row returned if asterisk omitted (OK)

--using part of the headline so we don't touch the end of the
SELECT * FROM News 
WHERE CONTAINS(Headline, '"days at sea and still"') --lots of rows, since it is anything with 'days' and then 'sea' present (NOT USEFUL)
SELECT * FROM News 
WHERE CONTAINS(Headline, '"days at sea and still*"') --no rows (WRONG)

I note that CONTAINS considers at/and/still to be 'noise' words and ignores them. Fair enough. The word sailing is the troublesome one at the end of the string.

My question:

How should I modify the incoming keywords to make sure than CONTAINS can find results regardless of the position of the string of keywords within the column value?

Best Answer

This has to do with how stoplist (or noise words) is/are treated when adding the asterix (*) in combination with CONTAINS().

A longer explanation on that here.

In short: Default <> Prefix searching

By default, when using the SYSTEM stoplist some words are not indexed and classified as stoplist words.

When adding the * + CONTAINS()they are not seen as stoplist words, and the exact phrase has to match. But when using the default SYSTEM stoplist these 'stoplist words' are not added to the index, resulting in no matches found.


Question

Are you saying that the intended behaviour, when an asterisk is present, is to return zero matches for a user who includes a noise word? This seems surprising, when you consider what most humans are likely to type!

Prefix matching in combination with contains from microsoft docs

Specifies a match of words or phrases beginning with the specified text. Enclose a prefix term in double quotation marks ("") and add an asterisk (*) before the ending quotation mark, so that all text starting with the simple term specified before the asterisk is matched

Source

Keywords here are:

so that all text starting with the simple term specified before the asterisk is matched

Meaning that we have to have precise matches when using prefix matching and CONTAINS().


Workaround #1 Disabling the stoplist

One workaround for this would be disabling the 'stoplist'

ALTER FULLTEXT INDEX ON News
SET STOPLIST OFF;

You do have to know that this will increase the words stored in the fulltext index, as the words 'and', 'or', 'still', ... are added to the index.

Performance might suffer from disabling the stoplist.

To reenable the system stoplist

ALTER FULLTEXT INDEX ON News
SET STOPLIST SYSTEM;

Alternatively, you could create your own stoplist

Testing this on your query:

ALTER FULLTEXT INDEX ON News
SET STOPLIST OFF;
INSERT INTO News(Headline)
VALUES('1120 days at sea and still sailing')

INSERT INTO News(Headline)
VALUES('zzz 1120 days at sea and still sailing')

SELECT * FROM News 
WHERE CONTAINS(Headline, '"days at sea and still*"') 

Result

id  Headline
1   1120 days at sea and still sailing
2   zzz 1120 days at sea and still sailing

For other workarounds you could look into using CONTAINS() or FREETEXT() without prefix matching, or prefix matching + CONTAINS() with one (non noise) word to get more accurate results


In your case, enbling transform noise words does not seem to work with prefix-matching

sp_configure 'show advanced options', 1;
RECONFIGURE;
GO
sp_configure 'transform noise words', 1;
RECONFIGURE;
GO

Again due to how prefix matching + contains works.

SELECT * FROM News 
WHERE CONTAINS(Headline, '"1120 days at sea and still sailing*"')

-- No results


Inside the fulltext index (with stoplist = system).

SELECT display_term,column_id,document_count
FROM sys.dm_fts_index_keywords(DB_ID('MNGDB'), OBJECT_ID('dbo.News'));

result

display_term    column_id   document_count
1120                 2           2
days                 2           2
nn1120               2           2
sailing              2           2
sea                  2           2
zzz                  2           1
END OF FILE          2           2 

'Still','at','and' are filtered out, when disabling the stoplist they will be added to the index

Testing without stopwords

ALTER FULLTEXT INDEX ON News
SET STOPLIST SYSTEM;

INSERT INTO dbo.News
VALUES('days sea sailing')
SELECT * FROM dbo.News 
WHERE CONTAINS(Headline, '"days sea sailing*"') 

id  Headline
4   days sea sailing

More on the asterix

The presence of the asterisk enables the prefix-matching mode... If a phrase is provided, matches are detected if the column contains all the specified words with zero or more other characters following the final word

Source