Sql-server – Combining full-text and scalar index

full-text-searchindexsql serversql-server-2012

Let's say we have a database of 12 million names and addresses that need to be searchable using full-text, but each row contains an integer value too, let's say COMPANYID. The table contains around 250 distinct COMPANYIDs over those 12 million rows.

Is it possible, when defining the full-text indices, to give each COMPANY its own "branch" in the tree?

Best Answer

No is the short answer, and you don't really need this. Full-text indexes are inverted indexes so they store the split words by the unique doc_id that you have to specify when you create the full-text index. This must be a "unique, single-key, non-nullable column" ideally an integer. What is essentially a foreign key does not figure and there is no easy way to partition them on that basis.

You could spoof something like this it with a table per company and full-text index per table. You would need some kind of code logic sat in front to determine which table to insert to / fetch from. This would be a considerable headache to manage at is almost certainly not worth it.

If you had some serious volume ( eg more like 23 billion records ) then you could look at a sharding solution, eg something like an Azure VM per company with an app sat in front of them to determine which machine to connect to. But clearly you don't need that either.

There were also a number of improvements in SQL 2008 to full-text which is now more integrated into the database engine. One scenario, where you specify a WHERE clause against a normal column and use the full-text functions, is referred to as a 'Mixed Query' and discussed here. This is still a great article even though the information is for SQL 2008.

If you're generally concerned about performance and plans, why not spin up some test data, introduce some skew and try it out. I knocked up this script with ~2 million rows in a few minutes:

!!TODO introduce some skew
USE master
GO

SET NOCOUNT ON
GO

DBCC TRACEON(610)   -- Minimal logging
GO

GO

IF EXISTS ( SELECT * FROM sys.databases WHERE name = 'fullTextDemo' )
BEGIN
    ALTER DATABASE fullTextDemo SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    DROP DATABASE fullTextDemo
END
GO

IF NOT EXISTS ( SELECT * FROM sys.databases WHERE name = 'fullTextDemo' )
CREATE DATABASE fullTextDemo
GO

ALTER DATABASE fullTextDemo SET RECOVERY SIMPLE
GO

USE fullTextDemo
GO

IF OBJECT_ID('dbo.yourAddresses') IS NOT NULL DROP TABLE dbo.yourAddresses
IF OBJECT_ID('dbo.companies') IS NOT NULL DROP TABLE dbo.companies
GO

CREATE TABLE dbo.companies (
    companyId       INT IDENTITY NOT NULL,
    companyName     NVARCHAR(50) NOT NULL,

    CONSTRAINT PK_companies PRIMARY KEY ( companyId )
)
GO

CREATE TABLE dbo.yourAddresses (
    rowId           INT IDENTITY,
    companyId       INT NOT NULL FOREIGN KEY REFERENCES dbo.companies ( companyId ),
    searchTerms     NVARCHAR(2048) NOT NULL

    CONSTRAINT PK_yourAddresses PRIMARY KEY ( rowId )
)
GO

-- Populate the companies
;WITH cte AS (
SELECT TOP 250 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
FROM master.sys.columns c1
    CROSS JOIN master.sys.columns c2
    CROSS JOIN master.sys.columns c3
)
INSERT INTO dbo.companies ( companyName )
SELECT NEWID()
FROM cte
GO

-- Generate 2,636,000 records
INSERT dbo.yourAddresses ( companyId, searchTerms )
SELECT c.companyId, m.[text]
FROM dbo.companies c
    CROSS JOIN ( SELECT * FROM sys.messages ) m
WHERE m.language_id = 1033
AND m.[text] Like '[a-z]%'
GO

CREATE INDEX _idx ON dbo.yourAddresses ( companyId ) INCLUDE ( searchTerms )
GO

-- !!TODO look at compression
--ALTER INDEX PK_yourAddresses ON dbo.yourAddresses REBUILD WITH ( DATA_COMPRESSION = PAGE )
--GO

-- Create the catalog
IF NOT EXISTS ( SELECT * FROM sys.fulltext_catalogs WHERE name = N'ftc_yourAddresses' )
CREATE FULLTEXT CATALOG ftc_yourAddresses
GO

-- Create the full-text index
CREATE FULLTEXT INDEX ON dbo.yourAddresses ( searchTerms ) KEY INDEX PK_yourAddresses ON ftc_yourAddresses WITH CHANGE_TRACKING MANUAL  -- CHANGE_TRACKING OFF, NO POPULATION
GO

SELECT 'before' ft, * FROM sys.fulltext_indexes
GO

ALTER FULLTEXT INDEX ON dbo.yourAddresses START FULL POPULATION;
GO


DECLARE @i INT 
SET @i = 0

WHILE EXISTS ( SELECT * FROM sys.fulltext_indexes WHERE has_crawl_completed = 0 )
BEGIN

        SELECT outstanding_batch_count, *
        FROM sys.dm_fts_index_population
        WHERE database_id = DB_ID()

        --SELECT *
        --FROM sys.dm_fts_outstanding_batches
        --WHERE database_id = DB_ID()

    WAITFOR DELAY '00:00:05'

    SET @i = @i + 1
    IF @i > 60 BEGIN RAISERROR( 'Too many loops!', 16, 1 ) BREAK END

END

SELECT 'after' ft, * FROM sys.fulltext_indexes
GO



SELECT TOP 1000 *
FROM dbo.yourAddresses ft
WHERE companyId = 42
 AND CONTAINS ( searchTerms, 'data' )
GO

SELECT TOP 1000 *
FROM dbo.yourAddresses a
    INNER JOIN CONTAINSTABLE ( dbo.yourAddresses, searchTerms, 'data' ) ct ON a.rowId = ct.[key]
WHERE a.companyId = 42
GO

SELECT TOP 1000 *
FROM dbo.yourAddresses a
    INNER JOIN CONTAINSTABLE ( dbo.yourAddresses, searchTerms, 'data' ) ct ON a.rowId = ct.[key]
WHERE a.companyId = 42
OPTION ( MERGE JOIN )
GO

SELECT TOP 100 *
FROM sys.dm_fts_index_keywords (DB_ID(), OBJECT_ID('dbo.yourAddresses') )

SELECT TOP 100 *
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('dbo.yourAddresses') )
ORDER BY document_id
GO