Sql-server – SQL Server Full-Text Indexer with stoplists/stopwords

full-text-searchindexsql serversql-server-2012t-sql

As I'm working on a graduate project (Textmining with SQL Server 2012 Semantic Search) I run into a situation where I need to post a question on this website, hoping someone can help me.

This question is about stoplists and stopwords in SQL Server 2012. I have set up a Proof of Concept where I'm trying to index documents and list statistically relevant key phrases with the new Semantic Search functionality. Because I don't want some words to be indexed, and therefore statistically relevant key phrases, I'm creating a stoplist to exclude those words.

Stoplist/stopwords for the English language (lcid 1033):

/* Create stoplist and add words */ 
CREATE FULLTEXT STOPLIST [naam van de stoplist];
   ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'beeten' LANGUAGE 'English';
   ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'centimeter' LANGUAGE 'English';
   ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'info' LANGUAGE 'English';
   ALTER FULLTEXT STOPLIST [naam van de stoplist] ADD 'ruud' LANGUAGE 'English';
GO

Creating a Full-Text Catalog, Full-Text Index with the custom stoplist and semantics:

/* Full-Text catalog */
CREATE FULLTEXT CATALOG [ft] WITH ACCENT_SENSITIVITY = ON AS DEFAULT;
GO

/* Full-Text Index */
CREATE FULLTEXT INDEX ON [dbo].[Documents]
    (   file_stream Language 1033 STATISTICAL_SEMANTICS )
    KEY INDEX DocumentsFt
    WITH STOPLIST = [naam van de stoplist];
GO

I tried everything I can think of to check if i missed something:

    /*Select all words in the stoplist, with some debug information*/
    SELECT sys.fulltext_stoplists.stoplist_id AS [Stoplist id]
        ,  sys.fulltext_stoplists.name AS [Stoplist]
        ,  sys.database_principals.name AS [Owner]
        ,  sys.fulltext_languages.lcid AS [LCID]
        ,  sys.fulltext_languages.name AS [Taal]
        ,  sys.fulltext_stopwords.stopword AS [Stopwoord] 
    FROM sys.fulltext_languages
    INNER JOIN sys.fulltext_stopwords 
        ON sys.fulltext_stopwords.language_id = sys.fulltext_languages.lcid
    INNER JOIN sys.fulltext_stoplists 
        ON sys.fulltext_stopwords.stoplist_id = sys.fulltext_stoplists.stoplist_id
    INNER JOIN sys.database_principals ON sys.database_principals.principal_id = sys.fulltext_stoplists.principal_id
    WHERE sys.fulltext_stoplists.name = 'naam van de stoplist';

/* List with all Full-Text Indexes (with statistical_semantics) */
SELECT sys.fulltext_catalogs.name [Full-Text catalog]   
    , sys.indexes.name AS [Index] 
    , sys.indexes.type_desc AS [Index type]
    , sys.fulltext_indexes.is_enabled AS [Index in use]
    , sys.fulltext_stoplists.name AS [Stoplist]
    , sys.tables.name AS [Table]
    , sys.columns.name AS [Column]
    , sys.fulltext_index_columns.language_id AS [LCID]
    , sys.fulltext_languages.name AS [Language]
    , sys.fulltext_index_columns.statistical_semantics [Semantic]
FROM sys.fulltext_catalogs
INNER JOIN sys.fulltext_indexes 
    ON sys.fulltext_catalogs.fulltext_catalog_id = sys.fulltext_indexes.fulltext_catalog_id
INNER JOIN sys.fulltext_index_columns 
    ON sys.fulltext_indexes.object_id = sys.fulltext_index_columns.object_id
INNER JOIN sys.indexes 
    ON sys.fulltext_indexes.object_id = sys.indexes.object_id 
    AND sys.fulltext_indexes.unique_index_id = sys.indexes.index_id
INNER JOIN sys.index_columns 
    ON sys.indexes.object_id = sys.index_columns.object_id 
    AND sys.indexes.index_id = sys.index_columns.index_id
INNER JOIN sys.columns 
    ON sys.index_columns.object_id = sys.columns.object_id 
    AND sys.index_columns.column_id = sys.columns.column_id
INNER JOIN sys.tables 
    ON sys.fulltext_indexes.object_id = sys.tables.object_id
INNER JOIN sys.fulltext_languages 
    ON sys.fulltext_index_columns.language_id = sys.fulltext_languages.lcid
LEFT JOIN sys.fulltext_stoplists 
    ON sys.fulltext_indexes.stoplist_id = sys.fulltext_stoplists.stoplist_id    
WHERE sys.fulltext_index_columns.statistical_semantics = 1
ORDER BY sys.fulltext_catalogs.name
        ,sys.indexes.name
        ,sys.index_columns.key_ordinal;

/* Rebuild catalog */
ALTER FULLTEXT CATALOG [ft] REBUILD;
GO

/* Check status of the catalog rebuild */
/*  0 = Idle.
1 = Full population is in progress.
2 = Incremental population is in progress.
3 = Propagation of tracked changes is in progress.
4 = Background update index is in progress, such as automatic change tracking.
5 = Full-text indexing is throttled or pause    
*/
SELECT FULLTEXTCATALOGPROPERTY('ft', 'PopulateStatus') AS Status;
GO

/* Repopulate Full-Text Index */
ALTER FULLTEXT INDEX ON dbo.Documents START UPDATE POPULATION;
GO

All the commands above indicate things are set up correct.

When I look in the indexed words I still see the words in the stoplist, for example 'beeten'.

SELECT * 
FROM sys.dm_fts_index_keywords(DB_ID('SQLServerArticles'), OBJECT_ID('Documents'))
WHERE display_term = 'beeten';

I've even tried if the Full-Text parser is not working properly with the following statement.

SELECT special_term, display_term
FROM sys.dm_fts_parser
(' "testing for fruit and nuts centimeter, any type of Beeten" ', 1033, 8, 0)

This statement returns the following result:

Exact Match testing
Exact Match for
Exact Match fruit
Exact Match and
Exact Match nuts
Noise Word  centimeter
Exact Match any
Exact Match type
Exact Match of
Noise Word  beeten

This result indicates that the word 'beeten' is an Noise Word. Should this word be skipped while indexing? What did I miss?

Again: Because I don't want some words to be indexed, and therefore statistically relevant key phrases, I'm creating a stoplist to exclude those words.

Best Answer

If your System Locale setting is different from English, then there is a known bug ( Microsoft Connect Item 753596 ) where the system locale stopwords are used instead of the fulltext index stopwords for documents stored in filetables.