Sql-server – How to return the full text of a document indexed in SQL Server Full-Text

filetablefull-text-searchsql serversql server 2014

I'm using SQL Server 2014 with FileTables to store a large number of documents in different formats. The iFilters are working great, and everything is getting indexed with FTS + Semantic Search. Now I'd like to run some additional processing on the text of those documents, but don't see a reason to have the pipeline redo the work of decoding, extracting, etc. the text from the files.

It seems there should be an obvious solution … but I've been running in circles without any luck.

So the question is:

How can I query to return the full plaintext of a file in T-SQL?
If that's not possible, can it be done in SSIS or SSAS after the
normal FTS parser has run?
If that's not possible, is there a way to hook into the FTS pipeline
(via a trigger perhaps) so I can split the plain text into another
table?

Alternate solutions are appreciated as well if you've got good examples for me to reference. The only immediate idea I had was to use a different network share for dropoff, have SSIS pick up the file and extract the text (no idea how to do that), and then to move the file + text to SQL server … but that seems wonky for a lot of reasons.

[Edited to clarify "why"]

If SQL Server has already pulled out the text in order to chunk it & do the base NLP for the semantic index … I'd rather use that than reinvent the wheel. Specific uses I'm looking into are post-processing with other NLP utilities (e.g. NLTK, GenSim, Stanford NLP NER, etc.) so that I can generate extractive document summarizations, store n-gram statistics for my corpus, and include NER for more effective faceted search.

If I have to extract the text out of files before storing them in SQL Server (either using SSIS/.NET so I can keep the iFilters OR by using a different tool altogether) there's limited usefulness in SQL Server's ability to perform that work on FileTables for anything but the most basic tasks.

Consider the number of document formats already supported – and it's a major task to recreate the feature. Similarly, having to go back to the actual file afterwards and redo that indexing work is inefficient and it'd seem sensible to disable FTS on FileTables, skip using them altogether, or scrap SQL Server for document-based FTS altogether.

Best Answer

I am curious about what you will accomplish by getting the plain text. You already have the documents in the FileTable and you can open the files when needed using the appropriate tools.

For example: If you are looking at a PDF, a Word document, an Excel spreadsheet, and so forth you probably have the tools to look at the data. Most tools will even allow you to save the 'plain text' manually. But I suppose it depends on how you define "plain text". Save of Word document to a non-Unicode .TXT file? Or something more complex?

Of course, I realize that you would likely want a more automated approach, which I comment on further down.

One thing to know is that not all files will necessarily copy to "plain text". (At least I had trouble with plain text from Chinese.)

Regarding the Full Text Indexes:

Although the SQL Server Full Text Index is aware of the relative positions of 'words' in the index so that it can search for phrases, it has no interface to reveal the serial sequence of words in the text. Thus, there is no way (currently) to build plain text from the Full Text Index.

Even if those details were available, the Full Text Index still does not have everything needed to fully represent a document.

If you have any Stop Words configured, then they will not appear in the Full Text Index.
Punctuation is generally not included except when tokenized within a word, such as "pre-apocalypse", "F.B.I." and so forth.
Formatting is lost.
The language being used for Full Text Indexing may affect the results.

The Full Text Indexes are solving a different problem from creating plain text. However, my experience is that large text bases usually benefit from Full Text Indexing anyway, since someone is always trying to find something

I understand that once you get the text you will redundantly store it in a SQL Server table so as to expedite your use of the plain text data.

Automating the Extract and Load

This will require some work. But, since your documents are currently in a FileTable you should be able to access them from the file system using file tools. The "off-topic" answer to plain text extraction below includes several tools that are being used by others. Perhaps some of these tools would be useful to you.

http://stackoverflow.com/questions/5671988/how-to-extract-just-plain-text-from-doc-docx-files-unix

Since you are using SQL Server 2014, once you have plain text files you should be able to import them in a number of ways: BCP, SSIS, and other tools are available to load data.

Depending on your approach, you might choose to use a staging table to further prepare the results before moving the data into the destination table. And if you are also doing document versioning you will likely need to create some metadata that allows you to track different versions.

Related Solutions

Sql-server – SSIS SQL Server 2008 errors

You've come across a fairly well-known problem. When SSIS is trying to import the data, it provides locale-aware conversions and the format YYYYMMDD doesn't convert nicely. You can use the "import as string and convert to actual datetime" approach you've outlined in your solution but that's slower and consumes more resources than using the native approach of telling SSIS to quit being so damn smart.

Right click on your flat file source and select the Show Advanced Editor. In the Input and Output Properties, expand Output Columns, find your column (BirthDate) and change the FastParse property from False to True.

Fast parse settings

With only that change, the package will execute successfully

Success

Also, an excellent answer from SO on the same issue

Sql-server – SQL Server Filetable document properties

It looks like (at this time) the best you are going to be able to do is use the keywords on the property, join them up to the doc and cross your fingers it is enough.

SELECT
    keyword,
    display_term,
    column_id, document_id, property_id
FROM sys.dm_fts_index_keywords_by_property
(
    DB_ID('FileTableDB'), 
    OBJECT_ID('FileTableTb')
);

MSDN on that DMV - http://msdn.microsoft.com/en-us/library/ee677646.aspx

Here is a discussion with Bob Beauchemin and I on the topic: https://twitter.com/bobbeauch/status/275101491084292096

Bob told me about the DMV mentioned above and pretty much confirmed that it is the most likely way.

Here is the Connect ticket Bob created for this topic: https://connect.microsoft.com/SQLServer/feedback/details/773212/provide-complete-fts-property-values-in-a-dmv-side-table

We also go on to talk about how it might be better to have a preprocessor of sorts to pull the meta info out, store it, (not sure if you normalize format of your docs or not), and then reprocess if modified. Bob suggests perhaps being able to pull this off with a SQLCLR trigger/proc which would be a rather clever solution and get your hands all kinds of dirty.

Best Answer

Related Solutions

Sql-server – SSIS SQL Server 2008 errors

Sql-server – SQL Server Filetable document properties

Related Question