I am curious about what you will accomplish by getting the plain text. You already have the documents in the FileTable and you can open the files when needed using the appropriate tools.
For example: if you are looking at a PDF, a Word document, an Excel spreadsheet, and so forth, you probably already have the tools to view the data. Most such tools will even let you save the 'plain text' manually. But I suppose it depends on how you define "plain text": saving a Word document to a non-Unicode .TXT file? Or something more complex?
Of course, I realize that you would likely want a more automated approach, which I comment on further down.
One thing to know is that not all files will necessarily convert cleanly to "plain text". (At least I had trouble extracting plain text from Chinese documents.)
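In my experience those failures are often encoding problems rather than extraction problems. A minimal Python sketch of a tolerant decode; the list of encodings to try is my assumption and should be tuned to your documents (gb18030 is there for the Chinese case):

```python
def read_plain_text(raw: bytes) -> str:
    """Decode extracted bytes, trying a few encodings in order.
    The encoding list is an assumption; tune it to your documents.
    gb18030 covers Chinese sources; cp1252 is a catch-all for
    legacy Windows text, so it goes last."""
    for encoding in ("utf-8-sig", "gb18030", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # last resort: keep what is decodable, replace the rest
    return raw.decode("utf-8", errors="replace")
```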
Regarding the Full Text Indexes:
Although the SQL Server Full Text Index is aware of the relative positions of 'words' in the index so that it can search for phrases, it has no interface to reveal the serial sequence of words in the text. Thus, there is no way (currently) to build plain text from the Full Text Index.
Even if those details were available, the Full Text Index still does not have everything needed to fully represent a document.
- If you have any Stop Words configured, they will not appear in the Full Text Index.
- Punctuation is generally not included except when tokenized within a word, such as "pre-apocalypse", "F.B.I." and so forth.
- Formatting is lost.
- The language being used for Full Text Indexing may affect the results.
The Full Text Indexes are solving a different problem from creating plain text. However, my experience is that large text bases usually benefit from Full Text Indexing anyway, since someone is always trying to find something.
I understand that once you get the text you will store it redundantly in a SQL Server table to expedite access to the plain text data.
Automating the Extract and Load
This will require some work. But since your documents are currently in a FileTable, you should be able to access them from the file system with ordinary file tools. The "off-topic" answer on plain-text extraction linked below includes several tools that others are using; perhaps some of them would be useful to you.
http://stackoverflow.com/questions/5671988/how-to-extract-just-plain-text-from-doc-docx-files-unix
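A small Python sketch of automating that extraction step. The tool names (pdftotext from poppler-utils for PDFs, antiword for .doc files) and the dispatch-by-extension scheme are my assumptions, not part of your setup; both tools must be installed and on the PATH:

```python
import os
import shutil
import subprocess

def build_command(path):
    """Pick a plain-text extractor based on file extension.
    The extractor choices here are assumptions taken from common
    command-line tools; substitute whatever you settle on."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return ["pdftotext", path, "-"]   # '-' writes text to stdout
    if ext == ".doc":
        return ["antiword", path]         # antiword writes to stdout
    raise ValueError("no extractor configured for " + ext)

def extract_text(path):
    """Run the external extractor and return its stdout as text."""
    cmd = build_command(path)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(cmd[0] + " not found on PATH")
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout
```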
Since you are using SQL Server 2014, once you have plain text files you should be able to import them in a number of ways: BCP, SSIS, and other tools are all available to load the data.
Depending on your approach, you might choose to use a staging table to further prepare the results before moving the data into the destination table. And if you are also doing document versioning you will likely need to create some metadata that allows you to track different versions.
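For the versioning metadata, one option is to fingerprint each extracted text so that identical re-extractions do not create new versions. A minimal Python sketch; using a SHA-256 hash as the version key is my assumption, not something your schema requires:

```python
import hashlib

def version_fingerprint(plain_text: str) -> str:
    """Return a stable fingerprint for one extracted document.
    Two extractions of unchanged content hash to the same value,
    so the loader can skip inserting a duplicate version row."""
    return hashlib.sha256(plain_text.encode("utf-8")).hexdigest()
```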
If you can access the DAC (Dedicated Administrator Connection), you can inspect the value of the identity column, for INT columns, by looking at the idtval column in sys.syscolpars.
Thanks to Martin Smith for directing me to that table via this very useful answer by Roi Gavish on a related question here.
Take, for instance, the following temporary table:
USE tempdb;
CREATE TABLE #d
(
ID INT NOT NULL IDENTITY(1,1)
);
TRUNCATE TABLE #d;
DBCC CHECKIDENT ('#d',RESEED, 2147483635);
INSERT INTO #d DEFAULT VALUES;
Let's see what the table contains:
SELECT *
FROM #d;
+------------+
|     ID     |
+------------+
| 2147483635 |
+------------+
The identity value can be inspected by this code:
DECLARE @idtval VARBINARY(64);

-- idtval holds the column's current identity value, stored little-endian
SELECT @idtval = scp.idtval
FROM sys.syscolpars scp
INNER JOIN sys.objects o ON scp.id = o.object_id
WHERE o.name LIKE '#d____%';

DECLARE @LittleEndian NVARCHAR(10);
-- keep '0x' plus the four value bytes (eight hex characters)
SET @LittleEndian = LEFT(sys.fn_varbintohexstr(@idtval), 10);
SELECT @LittleEndian;

DECLARE @BigEndian NVARCHAR(10) = '0x';
DECLARE @Loop INT = 0;
-- reverse the byte order so the value reads big-endian
WHILE @Loop < 4
BEGIN
    SET @BigEndian = @BigEndian + SUBSTRING(@LittleEndian, ((4 - @Loop) * 2) + 1, 2);
    SET @Loop += 1;
END;
SELECT CurrentIdentityValue = CONVERT(INT,
    CONVERT(VARBINARY(32), @BigEndian, 1), 2);
+----------------------+
| CurrentIdentityValue |
+----------------------+
|           2147483635 |
+----------------------+
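For what the loop is actually doing: idtval stores the value in little-endian byte order, and the string manipulation simply reverses the bytes so CONVERT can read them big-endian. The same arithmetic in a short Python sketch, using the bytes that correspond to the results above:

```python
# idtval bytes for the INT example, as stored: little-endian
int_hex = "f3ffff7f"             # reversed, this is 0x7FFFFFF3
print(int.from_bytes(bytes.fromhex(int_hex), byteorder="little"))   # 2147483635

# the BIGINT case is identical, just eight bytes instead of four
big_hex = "98ffffffffffff7f"     # reversed, this is 0x7FFFFFFFFFFFFF98
print(int.from_bytes(bytes.fromhex(big_hex), byteorder="little"))   # 9223372036854775704
```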
For BIGINT identity columns, we need to expand the size of some of the variables used in the code, such as:
CREATE TABLE #dBig
(
ID BIGINT NOT NULL IDENTITY(1,1)
);
TRUNCATE TABLE #dBig;
DBCC CHECKIDENT ('#dBig',RESEED, 9223372036854775704);
INSERT INTO #dBig DEFAULT VALUES;
SELECT *
FROM #dBig;
DECLARE @idtval VARBINARY(64);

-- same approach, but BIGINT needs eight bytes instead of four
SELECT @idtval = scp.idtval
FROM sys.syscolpars scp
INNER JOIN sys.objects o ON scp.id = o.object_id
WHERE o.name LIKE '#dBig____%';

DECLARE @LittleEndian NVARCHAR(18);
SET @LittleEndian = LEFT(sys.fn_varbintohexstr(@idtval), 18);

DECLARE @BigEndian NVARCHAR(18) = '0x';
DECLARE @Loop INT = 0;
WHILE @Loop < 8
BEGIN
    SET @BigEndian = @BigEndian + SUBSTRING(@LittleEndian, ((8 - @Loop) * 2) + 1, 2);
    SET @Loop += 1;
END;
SELECT CurrentIdentityValue = CONVERT(BIGINT,
    CONVERT(VARBINARY(32), @BigEndian, 1), 2);
Results for the BIGINT identity column:
+----------------------+
| CurrentIdentityValue |
+----------------------+
|  9223372036854775704 |
+----------------------+
Best Answer
It seems like there are several possible choices, but without going back to the original developers of your system, there's no way to be sure. Some of the possibilities include: