I am curious about what you will accomplish by getting the plain text. You already have the documents in the FileTable and you can open the files when needed using the appropriate tools.
For example: If you are looking at a PDF, a Word document, an Excel spreadsheet, and so forth you probably have the tools to look at the data. Most tools will even allow you to save the 'plain text' manually. But I suppose it depends on how you define "plain text". Save of Word document to a non-Unicode .TXT file? Or something more complex?
Of course, I realize that you would likely want a more automated approach, which I comment on further down.
One thing to know is that not all files will necessarily copy to "plain text". (At least I had trouble with plain text from Chinese.)
Regarding the Full Text Indexes:
Although the SQL Server Full Text Index is aware of the relative positions of 'words' in the index so that it can search for phrases, it has no interface to reveal the serial sequence of words in the text. Thus, there is no way (currently) to build plain text from the Full Text Index.
Even if those details were available, the Full Text Index still does not have everything needed to fully represent a document.
- If you have any
Stop Words
configured, then they will not appear in the Full Text Index.
- Punctuation is generally not included except when tokenized within a word, such as "pre-apocalypse", "F.B.I." and so forth.
- Formatting is lost.
- The language being used for Full Text Indexing may affect the results.
The Full Text Indexes are solving a different problem from creating plain text. However, my experience is that large text bases usually benefit from Full Text Indexing anyway, since someone is always trying to find something
I understand that once you get the text you will redundantly store it in a SQL Server table so as to expedite your use of the plain text data.
Automating the Extract and Load
This will require some work. But, since your documents are currently in a FileTable you should be able to access them from the file system using file tools. The "off-topic" answer to plain text extraction below includes several tools that are being used by others. Perhaps some of these tools would be useful to you.
http://stackoverflow.com/questions/5671988/how-to-extract-just-plain-text-from-doc-docx-files-unix
Since you are using SQL Server 2014, once you have plain text files you should be able to import them in a number of ways: BCP
, SSIS
, and other tools are available to load data.
Depending on your approach, you might choose to use a staging table to further prepare the results before moving the data into the destination table. And if you are also doing document versioning you will likely need to create some metadata that allows you to track different versions.
When you want to add words involving special characters that you want treated as a single word, you will want to create a custom dictionary.
Creating Custom Dictionaries for special terms to be indexed ‘as-is’ in SQL Server 2008 Full-Text Indexes
Be aware that custom dictionaries are limited in size, plus the larger the dictionary the slower fulltext can become.
Custom dictionaries are great when you have a limited set of words to add like AT&T or M&Ms which are words with special characters that might be used over and over that you wanted treated as a single word instead of the default word breaking behavior.
Custom dictionaries are not great for for words like part numbers (ex. JY-14562, PTW-14432, etc.) where every row in a table would result in a new entry in the custom dictionary. You could potentially have several hundred thousand or more unique part numbers. That would not be a good use for custom dictionary. If you need this type of behavior, you are better off writing your own word breaker. Creating word breakers is not well documented nor easy to create, but it does allow you to decide how you want words broken when you encounter special characters.
Best Answer
You can use
unaccent
extension, either by pre-processing the text withunaccent
function, or creating your own text search configuration. For instance (based on the example onunaccent
's doc):So you can match it:
See the last query, I'm matching combinations of
cañon
andcanon
, when usingenglish
configuration (first 4 columns) will only match if both are the same (first 2 ones), butmy_conf
configuration (last 4 columns) will match in all of them.