SQL Server – Dynamic Length Obfuscation Method in T-SQL

data-maskingSecuritysql servert-sql

I would like to obfuscate (scramble) sensitive data from a SQL Server database, but in the way which will provide:

irreversibility (the plaintext can't be derived from the obfuscated data),
obfuscated data length needs to be the same as a length of data before obfuscation.
obfuscated value does not need to be unique for repeated obfuscations of the same input value. To be honest, I rather like getting the same value for the same input which can used (e.g. some matching data in different tables, probably useful in test cases).

Example:

Abc -> zyx (lenght: 3)
StackOverflow -> a65vr4doqjd (lenght: 11)

Usually I avoid "home made" algorithms, so are you aware of some MS builtin solution which could provide this kind of obfuscation?

I hope I expressed my problem clearly, otherwise let me know and I'll try to add as much info as needed.

Best Answer

No, I am not aware of any built-in function that does exactly this. But, you can still accomplish this without doing anything too complicated.

You could use the built-in CRYPT_GEN_RANDOM function (introduced in SQL Server 2008 R2) which generates random values based on a supplied length. The output is in hex/binary values so each byte returned is represented as two alphanumeric characters (hence the / 2 + 1 part below).

DECLARE @InputString NVARCHAR(4000) = 'hello';
SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];

SET @InputString = 'test';
SELECT SUBSTRING(CONVERT(VARCHAR(8000),
                         CRYPT_GEN_RANDOM((LEN(@InputString) / 2) + 1),
                         2),
                 1,
                 LEN(@InputString)) AS [Obfuscated];

Returns something along the lines of:

8C108

9A7A

The only real downside here is that this needs to be done inline as CRYPT_GEN_RANDOM cannot be used in a User-Defined Function (UDF: Scalar or Table-Valued). However, it can still be applied in a set-based approach using a CTE as shown here (just set @MaxLength to the max length of the column being obfuscated):

DECLARE @MaxLength INT = 10;

;WITH cte AS
(
    SELECT CONVERT(VARCHAR(8000),
                   CRYPT_GEN_RANDOM((@MaxLength / 2) + 1),
                2) AS [Random]

)
SELECT tmp.[String],
       cte.[Random],
       SUBSTRING(cte.[Random], 1, LEN(tmp.[String])) AS [Obfuscated]
FROM   (VALUES (N'test'), (N'Hello')) tmp(String)
CROSS JOIN  cte;

Returns something along the lines of:

String    Random          Obfuscated
------    ------------    ----------
test      F99B3888F993    F99B
Hello     D3250E74F0A3    D3250

As you can see, CRYPT_GEN_RANDOM returns a different value for each row.

Also, not sure if this is acceptable or not, but the only alpha characters returned are A - F.

OR, if you want the obfuscation to be repeatable for the same input value, or at least don't mind it being repeatable and prefer that this code be in a function so that it is easier to apply to multiple columns, you can use the HASHBYTES function which, like CRYPT_GEN_RANDOM, returns hex/binary bytes. Unlike CRYPT_GEN_RANDOM, the output length is fixed (in this case at 64 characters since I am using SHA2_256), so I used REPLICATE to repeat the hashed valued if the length of the input string is more than 64 characters. Also unlike CRYPT_GEN_RANDOM, HASHBYTES can be used in a User-Defined Function (UDF) :-).

CREATE FUNCTION dbo.Obfuscate(@InputString NVARCHAR(4000))
RETURNS TABLE
WITH SCHEMABINDING
AS RETURN
  SELECT SUBSTRING(REPLICATE(CONVERT(VARCHAR(8000),
                                     HASHBYTES('SHA2_256', @InputString),
                                     2),
                             (LEN(@InputString) / 64) + 1),
                   1,
                   LEN(@InputString)) AS [Obfuscated];
GO

And that can be used as follows:

SELECT tmp.[String],
       LEN(tmp.[String]) AS [InputLength],
       ob.[Obfuscated],
      LEN(ob.[Obfuscated]) AS [OutputLength]
FROM   (VALUES (N'test'), (N'Hello'), (REPLICATE(N'A', 63)),
               (REPLICATE(N'B', 64)), (REPLICATE(N'C', 65)),
               (REPLICATE(N'D', 4000))) tmp(String)
CROSS APPLY dbo.Obfuscate(tmp.[String]) ob;

Returns something along the lines of:

String                       InputLength    Obfuscated                    OutputLength
------                       -----------    ----------                    ------------
test                               4        FE52                                 4
Hello                              5        A07E4                                5
AAAAAAAAAAAAAAAAAAAAAA...         63        4B589C85DE74E76487730F3...          63
BBBBBBBBBBBBBBBBBBBBBB...         64        79813FB6480F354F1C6017A...          64
CCCCCCCCCCCCCCCCCCCCCC...         65        FB4B38FBA41ECC24B5B0F68...          65
DDDDDDDDDDDDDDDDDDDDDD...       4000        5D01CC6508C164E652B5C77...        4000

PLEASE NOTE: If you need alpha characters beyond A - F and/or need to have distinct obfuscated values for distinct input values (i.e. reduce chances of collisions), then either method above can be adapted easily enough to do that.

Related Solutions

Sql-server – Data obfuscation in SQL Server

I wish I could upvote you 100 points just for thinking about this! I have seen this subject overlooked so many times it's untrue - so well done. From what I understand you actually want to scramble the data within the fields themselves, and although I understand what you are trying to achieve it might not be quite necessary to do so - although it should be considered on a case-by-case basis.

Most data protection laws revolve around the ability to correctly associate a piece of data with an individual - for example a date of birth or a phone number. You can meet the requirements of the law by ensuring that when you move your data out of production into UAT it is jumbled up so it is not easily re-mapped to the original person - especially when you jumble forename and surnames.

However, this does not address the issue for instance of let's say contact details. You can meet the requirements of the law by jumbling the data but the phone numbers are still real, the emails still real etc... they are just not assigned to the correct person. For this I recommend if at all possible clearing that data before passing it into UAT, Red Gate do a piece of software called Data Generator that can create random test data for you so that you can repopulate the fields with data that can be tested against.

As for data scrambling: there exists many applications that do this for you and honestly you are correct in not wanting to reinvent the wheel. The one that we use at our company is a product called Data Masker by a company called Net2000. The license is pretty cheap, it works extremely fast and you don't have to worry about having to disable all your constraints before scrambling the database.

You can of course roll your own solution should you not find anything that meets your requirements - if you do decide to do this I would strongly recommend using CLR procedures to do it as it is much more flexible than pure TSQL (not to say that you can't use TSQL see here).

Once you have chosen an application to perform this for you the next thing you need to decide is what is it you actually want/need to scramble? Honestly your best resource for this is your company legal team and or the company auditors. I know that sometimes we may not like working with them but they will be much nicer to you for approaching them and asking them the question rather than trying to do it on your own and get it wrong, there is absolutely nothing wrong with asking for help - especially when it is as important as this.

I hope this helps you and I wish you good luck in your quest... ;-)

Sql-server – ny way to tell which derived column failed

You have to manually identify the failures.

Redirect the rows (as you currently are) into a table that matches your columns, with the single addition of an IDENTITY column. This table should be all VARCHAR data types, so that you retain all original values.

Now you can identify your failures using TRY_CAST when querying the table.

For example:

SELECT
Identity,
TRY_CAST(Winter As INT) As WinterConverted,
Winter
--------------------------------------
Identity  |  WinterConverted   | Winter
   1      |    NULL            | NaN

TRY_CAST failed to cast the value in Winter column to an int leaving you with a null. You can do this for all columns.

If it's worth your time, you could make a more complex query to return each column name that contains a NULL. If not, you should be able to relatively easily visually inspect.

Of course, if most of your errors are 'NaN' or 'N/A', there's a good chance you're wasting time identifying errors you already knew about. Cleanup the issues before they become one:

Winter == "NaN" || Winter == "N/A" ? -1 : (DT_I4)Winter

As a side note, you can add a TRY_CAST to 100 columns very easily by using vertical editing: hold Alt+Shift in SSDT or SSMS (some other apps too) and then ↑ or ↓, then edit hundreds of lines at the same time.

Best Answer

Related Solutions

Sql-server – Data obfuscation in SQL Server

Sql-server – ny way to tell which derived column failed

Related Question