Sql-server – SQL Server Full Text Search – .rtf files incorrectly indexed by rtf tags

full-text-searchsql serversql-server-2012

I've set up a full-text index on my SQL Server DB on a varbinary(max) column. I have a type column specified, which contains the extension of the file, such a ".doc", ".pdf" etc.

However, I have noticed that when any .rtf files are indexed, SQL is including all of the meta information in the file (such as the RTF tag "listoverridecount0").

This is bloating the indexes quite a bit, and also means that searches will match on these tags (i.e. I can search on "listoverridecount0" and get EVERY .rtf returned).

Is there any reason that the iFilter for .rtf would not be removing the RTF Tags?

When I run this:

SELECT * FROM sys.fulltext_document_types WHERE document_type = '.rtf';

I get this:

document_type  .rtf                                     
class_id       C7310720-AC80-11D1-8DF3-00C04FB6EF4F 
path       c:\Program Files\Microsoft SQL Server\MSSQL11.MSSQLSERVER\MSSQL\Binn\msfte.dll
version        12.0.6828.0
manufacturer   Microsoft Corporation

I have filed a Microsoft Connect bug as I don't seem to be able to find any workaround. Presumably this is either a) Error with the RTF iFilter not removing tags, or b) Full Text index issue.

My SQL Server version is:

Microsoft SQL Server 2012 (SP1) - 11.0.3393.0 (X64) 
Oct 25 2013 19:04:40 
Copyright (c) Microsoft Corporation
Developer Edition (64-bit) on Windows NT 6.2  (Build 9200: ) (Hypervisor)

Best Answer

I can reproduce this but I think you have a few options:

  • not worry too much about full-text index bloat. Not many people are going to be searching for 'rtf1', 'pard' or 'wmetafile0' and I don't think it will affect your performance that much. You might get the odd clash if someone is searching for 'red' or 'blue' but I'd say this is fairly low risk.
  • Clean the string up eg with RegEx I copied from here. Seemed to make a good go of it. Store it in a separate table with the stream_id ( which is the unique id for the document ) and full-text index that table instead:
-- Convert rtf to plain text
SELECT 
    CAST( file_stream AS VARCHAR(MAX) ) original
    , MDS.mdq.RegExReplace( 
        CAST( file_stream AS VARCHAR(MAX) ), 
        '\{\*?\\[^{}]+}|[{}]|\\\n?[A-Za-z]+\n?(?:-?\d+)?[ ]?', '', 1 )
FROM dbo.Documents
  • Similar approach with a CLR function I adapted from here:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Text;
using System.Linq;  // for TakeWhile

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlChars rtfToText( SqlChars inputRtf )
    {
        // RTF function lifted from here and adapted for SQL CLR:
        // https://stackoverflow.com/questions/23277178/richtextbox-throws-outofmemory-on-azure

        bool slash = false; //indicates if backslash followed by the space
        bool figure_opened = false; //indicates if opening figure brace followed by the space
        bool figure_closed = false; //indicates if closing brace followed by the space
        bool first_space = false; //the else spaces are in plain text and must be included to the result

        if (inputRtf.Length < 4) return new SqlChars ( string.Empty );

        int i = 0;
        i = inputRtf.ToString().IndexOf("\\pard");
        if (i < 1) return new SqlChars(string.Empty);

        var builder = new StringBuilder();
        for (int j = i; j < inputRtf.Length - 1; j++)
        {
            char ch = inputRtf[j];
            char nextCh = inputRtf[j + 1];

            if (ch == '\\' && nextCh == 'p') // appends \n if \pard, except for first
            {
                if (j > i && j < inputRtf.Length - 4)
                {
                    string fiveChars = inputRtf.ToString().Substring(j, 5);
                    if (fiveChars.Equals("\\pard"))
                    {
                        builder.Append("\n");
                    }
                }
            }

            if (ch == '\\' && nextCh == 'u') // to deal correctly with special characters
            {
                string fourChars = inputRtf.ToString().Substring(j + 2, 4);
                string digits = new string(fourChars.TakeWhile(char.IsDigit).ToArray());
                char specialChar = (char)int.Parse(digits);
                builder.Append(specialChar);
                j += digits.Length + 5;
                continue;
            }

            if (ch == '\\' && nextCh == '{') // if the text contains symbol '{'
            {
                slash = false;
                figure_opened = false;
                figure_closed = false;
                first_space = false;
                builder.Append('{');
                j++;
                continue;
            }
            else if (ch == '\\' && nextCh == '}') // if the text contains symbol '}'
            {
                slash = false;
                figure_opened = false;
                figure_closed = false;
                first_space = false;
                builder.Append('}');
                j++;
                continue;
            }
            else if (ch == '\\' && nextCh == '\\') // if the text contains symbol '\'
            {
                slash = false;
                figure_opened = false;
                figure_closed = false;
                first_space = false;
                builder.Append('\\');
                j++;
                continue;
            }
            else if (ch == '\\') // we are looking at the backslash
            {
                first_space = true;
                slash = true;
            }
            else if (ch == '{')
            {
                first_space = true;
                figure_opened = true;
            }
            else if (ch == '}')
            {
                first_space = true;
                figure_closed = true;
            }
            else if (ch == ' ')
            {
                slash = false;
                figure_opened = false;
                figure_closed = false;
            }

            if (!slash && !figure_opened && !figure_closed)
            {
                if (!first_space)
                {
                    builder.Append(ch);
                }
                else
                {
                    first_space = false;
                }
            }
        }

        // Return
        return new SqlChars(builder.ToString());
    }
};

I'm sure this isn't the world's most efficient CLR function but it did an ok job on some 100 documents I tested, with a few fails.

Let us know if you hear anything back on the connect item.