Sql-server – Combining full-text and scalar index

Tags: full-text-search, index, sql-server, sql-server-2012

Let's say we have a database of 12 million names and addresses that need to be searchable using full-text, but each row contains an integer value too, let's say COMPANYID. The table contains around 250 distinct COMPANYIDs over those 12 million rows.

Is it possible, when defining the full-text indices, to give each COMPANY its own "branch" in the tree?

Best Answer

No is the short answer, and you don't really need this. Full-text indexes are inverted indexes, so they store the split words against the unique doc_id that you have to specify when you create the full-text index. This key must be a "unique, single-key, non-nullable column", ideally an integer. A column like COMPANYID, which is essentially a foreign key, does not figure in that structure, and there is no easy way to partition the index on that basis.

You could spoof something like this with a table per company and a full-text index per table. You would need some kind of code logic sat in front to determine which table to insert into / fetch from. This would be a considerable headache to manage and is almost certainly not worth it.

If you had some serious volume (e.g. more like 23 billion records) then you could look at a sharding solution, e.g. something like an Azure VM per company with an app sat in front of them to determine which machine to connect to. But clearly you don't need that either.

There were also a number of improvements to full-text in SQL 2008, which is now more integrated into the database engine. One scenario, where you specify a WHERE clause against a normal column and also use the full-text functions, is referred to as a 'Mixed Query' and discussed here. This is still a great article even though the information is for SQL 2008.

If you're generally concerned about performance and plans, why not spin up some test data, introduce some skew and try it out? I knocked up this script with ~2 million rows in a few minutes:

-- !!TODO introduce some skew
USE master
GO

SET NOCOUNT ON
GO

DBCC TRACEON(610)   -- Minimal logging
GO


IF EXISTS ( SELECT * FROM sys.databases WHERE name = 'fullTextDemo' )
BEGIN
    ALTER DATABASE fullTextDemo SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    DROP DATABASE fullTextDemo
END
GO

IF NOT EXISTS ( SELECT * FROM sys.databases WHERE name = 'fullTextDemo' )
CREATE DATABASE fullTextDemo
GO

ALTER DATABASE fullTextDemo SET RECOVERY SIMPLE
GO

USE fullTextDemo
GO

IF OBJECT_ID('dbo.yourAddresses') IS NOT NULL DROP TABLE dbo.yourAddresses
IF OBJECT_ID('dbo.companies') IS NOT NULL DROP TABLE dbo.companies
GO

CREATE TABLE dbo.companies (
    companyId       INT IDENTITY NOT NULL,
    companyName     NVARCHAR(50) NOT NULL,

    CONSTRAINT PK_companies PRIMARY KEY ( companyId )
)
GO

CREATE TABLE dbo.yourAddresses (
    rowId           INT IDENTITY,
    companyId       INT NOT NULL FOREIGN KEY REFERENCES dbo.companies ( companyId ),
    searchTerms     NVARCHAR(2048) NOT NULL,

    CONSTRAINT PK_yourAddresses PRIMARY KEY ( rowId )
)
GO

-- Populate the companies
;WITH cte AS (
SELECT TOP 250 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
FROM master.sys.columns c1
    CROSS JOIN master.sys.columns c2
    CROSS JOIN master.sys.columns c3
)
INSERT INTO dbo.companies ( companyName )
SELECT NEWID()
FROM cte
GO

-- Generate 2,636,000 records
INSERT dbo.yourAddresses ( companyId, searchTerms )
SELECT c.companyId, m.[text]
FROM dbo.companies c
    CROSS JOIN ( SELECT * FROM sys.messages ) m
WHERE m.language_id = 1033
AND m.[text] Like '[a-z]%'
GO

CREATE INDEX _idx ON dbo.yourAddresses ( companyId ) INCLUDE ( searchTerms )
GO

-- !!TODO look at compression
--ALTER INDEX PK_yourAddresses ON dbo.yourAddresses REBUILD WITH ( DATA_COMPRESSION = PAGE )
--GO

-- Create the catalog
IF NOT EXISTS ( SELECT * FROM sys.fulltext_catalogs WHERE name = N'ftc_yourAddresses' )
CREATE FULLTEXT CATALOG ftc_yourAddresses
GO

-- Create the full-text index
CREATE FULLTEXT INDEX ON dbo.yourAddresses ( searchTerms ) KEY INDEX PK_yourAddresses ON ftc_yourAddresses WITH CHANGE_TRACKING MANUAL  -- CHANGE_TRACKING OFF, NO POPULATION
GO

SELECT 'before' ft, * FROM sys.fulltext_indexes
GO

ALTER FULLTEXT INDEX ON dbo.yourAddresses START FULL POPULATION;
GO


DECLARE @i INT 
SET @i = 0

WHILE EXISTS ( SELECT * FROM sys.fulltext_indexes WHERE has_crawl_completed = 0 )
BEGIN

        SELECT outstanding_batch_count, *
        FROM sys.dm_fts_index_population
        WHERE database_id = DB_ID()

        --SELECT *
        --FROM sys.dm_fts_outstanding_batches
        --WHERE database_id = DB_ID()

    WAITFOR DELAY '00:00:05'

    SET @i = @i + 1
    IF @i > 60 BEGIN RAISERROR( 'Too many loops!', 16, 1 ) BREAK END

END

SELECT 'after' ft, * FROM sys.fulltext_indexes
GO



SELECT TOP 1000 *
FROM dbo.yourAddresses ft
WHERE companyId = 42
 AND CONTAINS ( searchTerms, 'data' )
GO

SELECT TOP 1000 *
FROM dbo.yourAddresses a
    INNER JOIN CONTAINSTABLE ( dbo.yourAddresses, searchTerms, 'data' ) ct ON a.rowId = ct.[key]
WHERE a.companyId = 42
GO

SELECT TOP 1000 *
FROM dbo.yourAddresses a
    INNER JOIN CONTAINSTABLE ( dbo.yourAddresses, searchTerms, 'data' ) ct ON a.rowId = ct.[key]
WHERE a.companyId = 42
OPTION ( MERGE JOIN )
GO

SELECT TOP 100 *
FROM sys.dm_fts_index_keywords (DB_ID(), OBJECT_ID('dbo.yourAddresses') )

SELECT TOP 100 *
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('dbo.yourAddresses') )
ORDER BY document_id
GO
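
The script leaves the skew as a TODO. What follows is only a sketch of one way to add it, assuming the demo database has been built and populated exactly as above. The cut-off of ten "big" companies and the 1-in-100 thinning are arbitrary choices for illustration, not part of the original script. Because the full-text index was created with CHANGE_TRACKING MANUAL, the deletes have to be pushed to it with an update population before the queries are re-run.

```sql
USE fullTextDemo
GO

-- Assumption: companies 1-10 keep every row; all others keep roughly 1 row in 100.
-- This is a single large delete; batch it if log growth is a concern.
DELETE FROM dbo.yourAddresses
WHERE companyId > 10
  AND rowId % 100 <> 0
GO

-- Change tracking is MANUAL, so tracked changes must be applied explicitly
ALTER FULLTEXT INDEX ON dbo.yourAddresses START UPDATE POPULATION
GO

-- Wait until the catalog reports idle ( PopulateStatus = 0 )
DECLARE @i INT = 0

WHILE FULLTEXTCATALOGPROPERTY( 'ftc_yourAddresses', 'PopulateStatus' ) <> 0
BEGIN
    WAITFOR DELAY '00:00:05'

    SET @i = @i + 1
    IF @i > 60 BEGIN RAISERROR( 'Too many loops!', 16, 1 ) BREAK END
END
GO

-- Check the resulting distribution
SELECT companyId, COUNT(*) AS records
FROM dbo.yourAddresses
GROUP BY companyId
ORDER BY records DESC
GO

-- Compare a "big" company with a "small" one
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO

SELECT TOP 1000 *
FROM dbo.yourAddresses
WHERE companyId = 1
 AND CONTAINS ( searchTerms, 'data' )

SELECT TOP 1000 *
FROM dbo.yourAddresses
WHERE companyId = 42
 AND CONTAINS ( searchTerms, 'data' )
GO

SET STATISTICS IO OFF
SET STATISTICS TIME OFF
GO
```

Running the two queries with the actual execution plan switched on should show whether the optimizer treats the heavily and lightly populated companies differently.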

Related Solutions

Sql-server – SQL Server Full-Text Search: combining proximity term and thesaurus

I do not believe that is a supported feature of SQL Server full text searching. That would require a wildcard resolution to words and then a thesaurus lookup of each matching word to gather the thesaurus terms.

This basically maps to a pretty complex query: any one of a group of prefixed words is very near to any one of another group of prefixed words, all of which then go through a thesaurus lookup to provide even more words. Based on previous experience, that is just not supported.

(I see online that you have asked this elsewhere in the last few months, but without any answers, so I hope that this helps.)

I believe that you can create something useful for your query, but it probably requires externalizing the thesaurus entries by doing something like the following:

DECLARE @FirstGroup TABLE (AGroup NVARCHAR(50));
INSERT INTO @FirstGroup VALUES ('hdd');
INSERT INTO @FirstGroup VALUES ('hard disk');
INSERT INTO @FirstGroup VALUES ('hard disc');
INSERT INTO @FirstGroup VALUES ('harddisk');

DECLARE @SecondGroup TABLE (BGroup NVARCHAR(50));
INSERT INTO @SecondGroup VALUES ('sata');
INSERT INTO @SecondGroup VALUES ('msata');
INSERT INTO @SecondGroup VALUES ('esatap');

DECLARE @SearchString NVARCHAR(2000);
SET @SearchString = ''

SELECT @SearchString = @SearchString + '("'+AGroup+'"~"'+BGroup+'") OR '
FROM @FirstGroup CROSS JOIN @SecondGroup 

SELECT @SearchString = LEFT(@SearchString, LEN(@SearchString) - 3);

SELECT * FROM dbo.TBL_NAME
 WHERE contains(field_name,@SearchString);

So if you load the tables with the candidate thesaurus entries and make the OR connections in the code, it should work just fine.

You did not mention which version of SQL Server you are running, but if you are using SQL Server 2012, you might want to use the newer and more powerful syntax. However, this article also has hints about the legacy NEAR context.
http://technet.microsoft.com/en-us/library/ms142568(v=sql.110).aspx
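
For illustration, a minimal sketch of that newer custom proximity syntax, reusing the dbo.TBL_NAME and field_name placeholders from the query above. The distance of 5 and the chosen terms are assumptions made for the example; this does not add any thesaurus expansion, so the word groups would still need to be combined with OR as in the earlier code.

```sql
-- SQL Server 2012 and later: custom proximity term.
-- Matches rows where both terms occur with at most 5 non-search terms
-- between them, in either order.
SELECT *
FROM dbo.TBL_NAME
WHERE CONTAINS ( field_name, 'NEAR((hdd, sata), 5)' );

-- Same idea with prefix terms, and TRUE to require the terms in the listed order
SELECT *
FROM dbo.TBL_NAME
WHERE CONTAINS ( field_name, 'NEAR(("hard*", "sata*"), 5, TRUE)' );
```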

Sql-server – Full Text Index and database structure

Did you try to replace the insert..union statements with separate insert statements?

For example:

insert into @containsTable select [key],[rank] from ContainsTable(File2010,Content,@keywords)
insert into @containsTable select [key],[rank] from ContainsTable(File2000,Content,@keywords)
insert into @containsTable select [key],[rank] from ContainsTable(File1990,Content,@keywords)
insert into @containsTable select [key],[rank] from ContainsTable(File1980,Content,@keywords)
insert into @containsTable select [key],[rank] from ContainsTable(File1970,Content,@keywords)

It could be as simple as the increased processing cost of the union, plus the increased memory usage from having all 5 full-text queries open at the same time to feed that one insert, slowing down your UDF.

(Also, since you're querying from - supposedly - unrelated data tables, the 'distinct' part of union should be unnecessary.)
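
If you would rather keep a single statement, UNION ALL skips the duplicate-removal step. This is only a sketch: the table and column names come from the example above, while the INT key type and the sample search term are assumptions that need to match your own schema.

```sql
-- Assumptions: every FileNNNN table has an INT full-text key, and @keywords
-- holds a valid CONTAINS search condition (the value here is a placeholder).
DECLARE @keywords NVARCHAR(4000) = N'"report"';
DECLARE @containsTable TABLE ( [key] INT NOT NULL, [rank] INT NOT NULL );

-- UNION ALL concatenates the five result sets without the sort / distinct
-- that a plain UNION would add
INSERT INTO @containsTable ( [key], [rank] )
SELECT [key], [rank] FROM CONTAINSTABLE ( File2010, Content, @keywords )
UNION ALL
SELECT [key], [rank] FROM CONTAINSTABLE ( File2000, Content, @keywords )
UNION ALL
SELECT [key], [rank] FROM CONTAINSTABLE ( File1990, Content, @keywords )
UNION ALL
SELECT [key], [rank] FROM CONTAINSTABLE ( File1980, Content, @keywords )
UNION ALL
SELECT [key], [rank] FROM CONTAINSTABLE ( File1970, Content, @keywords );

SELECT COUNT(*) AS matches FROM @containsTable;
```

Note that keys from different tables can collide in @containsTable, so a source-table column may be worth adding if the rows need to be traced back afterwards.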
