Sql-server – Full-text Search XML Properties

full-text-searchsql-server-2012

I'm trying to determine whether I can retrieve a property (attribute) value from an xml blob stored in a FILESTREAM-enabled column using Full-text search. Here is a sample where message is a varbinary(max) field that stores xml messages:

SELECT message FROM AuditTable
WHERE CONTAINS ( PROPERTY ( message, 'AccountNo' ), '123456');

On the MSDN page Search Document Properties with Search Property Lists, the summary at the top of the page includes this statement: "On the other hand, the XML IFilter does not emit properties."

I'm confused because it appears that this should work for an XML document. I'm starting to prep an environment to test, but I wanted to ask first just in case someone already knew.

Best Answer

If you have say a Word or Excel document, then these have documents well-known properties such as Author, Title, Keywords etc which you can associate with a search property list and then a full-text index. Properties available depending on their IFilter. You can use the PROPERTY keyword to query these properties.

However with an .xml file, it's unlikely the document has these custom properties, and what you really want to do is search the xml content. AccountNo sounds like just an attribute in the xml rather than a document property. You can search your xml using ordinary full-text search methods, however xml markup and attribute names are ignored in full-text indexes. You can query attribute and element content. You could also cast the file content to xml and use the built-in xml methods eg .query, .nodes, .value and .exist. Work through this demo and see if it makes sense:

USE master
GO

IF EXISTS ( SELECT * FROM sys.databases WHERE name = 'filestreamDemo' )
BEGIN
    ALTER DATABASE filestreamDemo SET SINGLE_USER WITH ROLLBACK IMMEDIATE
    DROP DATABASE filestreamDemo
END
GO

CREATE DATABASE filestreamDemo
ON PRIMARY (name = filestreamDemo_File, filename = N'c:\temp\filestreamDemo.mdf' ),
FILEGROUP FG_filestreamDemo CONTAINS FILESTREAM ( NAME = filestreamDemo_FS_File, FILENAME = N'c:\temp\filestreamDemo_FS' )
WITH FILESTREAM ( 
    non_transacted_access = FULL,
    directory_name = N'filestreamDemo' )
GO


USE filestreamDemo
GO

-- DROP TABLE dbo.AuditTable
IF OBJECT_ID('dbo.AuditTable') IS NULL 
CREATE TABLE dbo.AuditTable
(
    rowId           INT IDENTITY CONSTRAINT PK_AuditTable PRIMARY KEY,
    Id              UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE,
    [message]       VARBINARY(MAX) FILESTREAM NULL,
    fileExt         VARCHAR(5) NOT NULL
)
GO

IF NOT EXISTS ( SELECT * FROM sys.fulltext_catalogs WHERE name = N'ft' ) 
CREATE FULLTEXT CATALOG ft AS DEFAULT
GO

CREATE FULLTEXT INDEX ON dbo.AuditTable
    ( [message] TYPE COLUMN fileExt )
    KEY INDEX PK_AuditTable ON ft
    WITH ( CHANGE_TRACKING = MANUAL, STOPLIST = SYSTEM )
GO

INSERT INTO dbo.AuditTable ( Id, [message], fileExt )
SELECT
    NEWID(),
    CAST( '<root AccountNo="123456"/>' AS VARBINARY(MAX) ),
    'xml'

UNION ALL

SELECT
    NEWID(),
    CAST( '<root><child AccountNo="123456"/>elementValue</root>' AS VARBINARY(MAX) ),
    'xml'

UNION ALL

SELECT
    NEWID(),
    CAST( '<dummy/>' AS VARBINARY(MAX) ),
    'xml'
GO


ALTER FULLTEXT INDEX ON dbo.AuditTable START FULL POPULATION;
GO

SELECT 'before' ft, * FROM sys.fulltext_indexes
GO

DECLARE @i INT = 0

WHILE EXISTS ( SELECT * FROM sys.fulltext_indexes WHERE has_crawl_completed = 0 )
BEGIN
    WAITFOR DELAY '00:00:05'

    SET @i += 1
    IF @i > 60 BEGIN THROW 50001, 'Too many loops!', 1 BREAK END

END
GO

SELECT 'after' ft, * FROM sys.fulltext_indexes
GO


-- Have a look in the full-text index
SELECT *
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('dbo.AuditTable'))
ORDER BY document_id


-- Search using full-text; WON'T WORK
SELECT [message], CAST( [message] AS VARCHAR(MAX) ) 
FROM AuditTable
WHERE CONTAINS ( [message], 'AccountNo');


-- will work
SELECT [message], CAST( [message] AS VARCHAR(MAX) ) 
FROM AuditTable
WHERE CONTAINS ( [message], '123456');


-- will work: numbers are prefixed with nn in full-text indexes
SELECT [message], CAST( [message] AS VARCHAR(MAX) ) 
FROM AuditTable
WHERE CONTAINS ( [message], 'nn123456');


-- Cast to xml and search using normal xml methods
SELECT *
FROM ( SELECT rowId, ID, CAST( [message] AS XML ) [message] FROM dbo.AuditTable ) a
WHERE a.message.exist('//@AccountNo[.="123456"]') = 1
GO

I also did a test and could not get an xml doc to emit any properties, as suggested by the documentation. I could however get this to work with a standard Word document.