I am curious about what you will accomplish by getting the plain text. You already have the documents in the FileTable and you can open the files when needed using the appropriate tools.
For example: if you are looking at a PDF, a Word document, an Excel spreadsheet, and so forth, you probably already have the tools to view the data. Most such tools will even let you save the 'plain text' manually. But I suppose it depends on how you define "plain text": saving a Word document to a non-Unicode .TXT file? Or something more complex?
Of course, I realize that you would likely want a more automated approach, which I comment on further down.
One thing to know is that not all files will necessarily convert cleanly to "plain text". (At least I had trouble extracting plain text from Chinese documents.)
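In my experience those failures are often encoding problems rather than extraction problems. A minimal Python sketch of a tolerant decode; the list of encodings to try is my assumption and should be tuned to your documents (gb18030 is there for the Chinese case):

```python
def read_plain_text(raw: bytes) -> str:
    """Decode extracted bytes, trying a few encodings in order.
    The encoding list is an assumption; tune it to your documents.
    gb18030 covers Chinese sources; cp1252 is a catch-all for
    legacy Windows text, so it goes last."""
    for encoding in ("utf-8-sig", "gb18030", "cp1252"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # last resort: keep what is decodable, replace the rest
    return raw.decode("utf-8", errors="replace")
```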
Regarding the Full Text Indexes:
Although the SQL Server Full Text Index is aware of the relative positions of 'words' in the index so that it can search for phrases, it has no interface to reveal the serial sequence of words in the text. Thus, there is no way (currently) to build plain text from the Full Text Index.
Even if those details were available, the Full Text Index still does not have everything needed to fully represent a document.
- If you have any Stop Words configured, they will not appear in the Full Text Index.
- Punctuation is generally not included except when tokenized within a word, such as "pre-apocalypse", "F.B.I." and so forth.
- Formatting is lost.
- The language being used for Full Text Indexing may affect the results.
The Full Text Indexes are solving a different problem from creating plain text. However, my experience is that large text bases usually benefit from Full Text Indexing anyway, since someone is always trying to find something.
I understand that once you get the text you will store it redundantly in a SQL Server table to expedite access to the plain text data.
Automating the Extract and Load
This will require some work. But since your documents are currently in a FileTable, you should be able to access them from the file system with ordinary file tools. The "off-topic" answer on plain-text extraction linked below includes several tools that others are using; perhaps some of them would be useful to you.
http://stackoverflow.com/questions/5671988/how-to-extract-just-plain-text-from-doc-docx-files-unix
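A small Python sketch of automating that extraction step. The tool names (pdftotext from poppler-utils for PDFs, antiword for .doc files) and the dispatch-by-extension scheme are my assumptions, not part of your setup; both tools must be installed and on the PATH:

```python
import os
import shutil
import subprocess

def build_command(path):
    """Pick a plain-text extractor based on file extension.
    The extractor choices here are assumptions taken from common
    command-line tools; substitute whatever you settle on."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return ["pdftotext", path, "-"]   # '-' writes text to stdout
    if ext == ".doc":
        return ["antiword", path]         # antiword writes to stdout
    raise ValueError("no extractor configured for " + ext)

def extract_text(path):
    """Run the external extractor and return its stdout as text."""
    cmd = build_command(path)
    if shutil.which(cmd[0]) is None:
        raise RuntimeError(cmd[0] + " not found on PATH")
    return subprocess.run(cmd, capture_output=True, text=True,
                          check=True).stdout
```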
Since you are using SQL Server 2014, once you have plain text files you should be able to import them in a number of ways: BCP, SSIS, and other tools are all available to load the data.
Depending on your approach, you might choose to use a staging table to further prepare the results before moving the data into the destination table. And if you are also doing document versioning you will likely need to create some metadata that allows you to track different versions.
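For the versioning metadata, one option is to fingerprint each extracted text so that identical re-extractions do not create new versions. A minimal Python sketch; using a SHA-256 hash as the version key is my assumption, not something your schema requires:

```python
import hashlib

def version_fingerprint(plain_text: str) -> str:
    """Return a stable fingerprint for one extracted document.
    Two extractions of unchanged content hash to the same value,
    so the loader can skip inserting a duplicate version row."""
    return hashlib.sha256(plain_text.encode("utf-8")).hexdigest()
```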
If you can access the DAC (Dedicated Administrator Connection), you can inspect the value of the identity column, for INT columns, by looking at the idtval column in sys.syscolpars.
Thanks to Martin Smith for directing me to that table via this very useful answer by Roi Gavish on a related question here.
Take, for instance, the following temporary table:
USE tempdb;
CREATE TABLE #d
(
ID INT NOT NULL IDENTITY(1,1)
);
TRUNCATE TABLE #d;
DBCC CHECKIDENT ('#d',RESEED, 2147483635);
INSERT INTO #d DEFAULT VALUES;
Let's see what the table contains:
SELECT *
FROM #d;
+------------+
|     ID     |
+------------+
| 2147483635 |
+------------+
The identity value can be inspected by this code:
DECLARE @idtval VARBINARY(64);

-- idtval holds the column's current identity value, stored little-endian
SELECT @idtval = scp.idtval
FROM sys.syscolpars scp
INNER JOIN sys.objects o ON scp.id = o.object_id
WHERE o.name LIKE '#d____%';

DECLARE @LittleEndian NVARCHAR(10);
-- keep '0x' plus the four value bytes (eight hex characters)
SET @LittleEndian = LEFT(sys.fn_varbintohexstr(@idtval), 10);
SELECT @LittleEndian;

DECLARE @BigEndian NVARCHAR(10) = '0x';
DECLARE @Loop INT = 0;
-- reverse the byte order so the value reads big-endian
WHILE @Loop < 4
BEGIN
    SET @BigEndian = @BigEndian + SUBSTRING(@LittleEndian, ((4 - @Loop) * 2) + 1, 2);
    SET @Loop += 1;
END;
SELECT CurrentIdentityValue = CONVERT(INT,
    CONVERT(VARBINARY(32), @BigEndian, 1), 2);
+----------------------+
| CurrentIdentityValue |
+----------------------+
|           2147483635 |
+----------------------+
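For what the loop is actually doing: idtval stores the value in little-endian byte order, and the string manipulation simply reverses the bytes so CONVERT can read them big-endian. The same arithmetic in a short Python sketch, using the bytes that correspond to the results above:

```python
# idtval bytes for the INT example, as stored: little-endian
int_hex = "f3ffff7f"             # reversed, this is 0x7FFFFFF3
print(int.from_bytes(bytes.fromhex(int_hex), byteorder="little"))   # 2147483635

# the BIGINT case is identical, just eight bytes instead of four
big_hex = "98ffffffffffff7f"     # reversed, this is 0x7FFFFFFFFFFFFF98
print(int.from_bytes(bytes.fromhex(big_hex), byteorder="little"))   # 9223372036854775704
```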
For BIGINT identity columns, we need to expand the size of some of the variables used in the code, such as:
CREATE TABLE #dBig
(
ID BIGINT NOT NULL IDENTITY(1,1)
);
TRUNCATE TABLE #dBig;
DBCC CHECKIDENT ('#dBig',RESEED, 9223372036854775704);
INSERT INTO #dBig DEFAULT VALUES;
SELECT *
FROM #dBig;
DECLARE @idtval VARBINARY(64);

-- same approach, but BIGINT needs eight bytes instead of four
SELECT @idtval = scp.idtval
FROM sys.syscolpars scp
INNER JOIN sys.objects o ON scp.id = o.object_id
WHERE o.name LIKE '#dBig____%';

DECLARE @LittleEndian NVARCHAR(18);
SET @LittleEndian = LEFT(sys.fn_varbintohexstr(@idtval), 18);

DECLARE @BigEndian NVARCHAR(18) = '0x';
DECLARE @Loop INT = 0;
WHILE @Loop < 8
BEGIN
    SET @BigEndian = @BigEndian + SUBSTRING(@LittleEndian, ((8 - @Loop) * 2) + 1, 2);
    SET @Loop += 1;
END;
SELECT CurrentIdentityValue = CONVERT(BIGINT,
    CONVERT(VARBINARY(32), @BigEndian, 1), 2);
Results for the BIGINT identity column:
+----------------------+
| CurrentIdentityValue |
+----------------------+
|  9223372036854775704 |
+----------------------+
Best Answer
It seems like there are several possible choices, but without going back to the original developers of your system, there's no way to be sure. Some of the possibilities include: