I wish I could upvote you 100 points just for thinking about this! I have seen this subject overlooked so many times it's untrue - so well done. From what I understand you actually want to scramble the data within the fields themselves, and although I understand what you are trying to achieve it might not be quite necessary to do so - although it should be considered on a case-by-case basis.
Most data protection laws revolve around the ability to correctly associate a piece of data with an individual - for example a date of birth or a phone number. You can meet the requirements of the law by ensuring that when you move your data out of production into UAT it is jumbled up so it is not easily re-mapped to the original person - especially when you jumble forename and surnames.
However, this does not address the issue for instance of let's say contact details. You can meet the requirements of the law by jumbling the data but the phone numbers are still real, the emails still real etc... they are just not assigned to the correct person. For this I recommend if at all possible clearing that data before passing it into UAT, Red Gate do a piece of software called Data Generator that can create random test data for you so that you can repopulate the fields with data that can be tested against.
As for data scrambling: there exists many applications that do this for you and honestly you are correct in not wanting to reinvent the wheel. The one that we use at our company is a product called Data Masker by a company called Net2000. The license is pretty cheap, it works extremely fast and you don't have to worry about having to disable all your constraints before scrambling the database.
You can of course roll your own solution should you not find anything that meets your requirements - if you do decide to do this I would strongly recommend using CLR procedures to do it as it is much more flexible than pure TSQL (not to say that you can't use TSQL see here).
Once you have chosen an application to perform this for you the next thing you need to decide is what is it you actually want/need to scramble? Honestly your best resource for this is your company legal team and or the company auditors. I know that sometimes we may not like working with them but they will be much nicer to you for approaching them and asking them the question rather than trying to do it on your own and get it wrong, there is absolutely nothing wrong with asking for help - especially when it is as important as this.
I hope this helps you and I wish you good luck in your quest... ;-)
You have to manually identify the failures.
Redirect the rows (as you currently are) into a table that matches your columns, with the single addition of an IDENTITY column. This table should be all VARCHAR
data types, so that you retain all original values.
Now you can identify your failures using TRY_CAST when querying the table.
For example:
SELECT
Identity,
TRY_CAST(Winter As INT) As WinterConverted,
Winter
--------------------------------------
Identity | WinterConverted | Winter
1 | NULL | NaN
TRY_CAST failed to cast the value in Winter
column to an int
leaving you with a null
. You can do this for all columns.
If it's worth your time, you could make a more complex query to return each column name that contains a NULL. If not, you should be able to relatively easily visually inspect.
Of course, if most of your errors are 'NaN' or 'N/A', there's a good chance you're wasting time identifying errors you already knew about. Cleanup the issues before they become one:
Winter == "NaN" || Winter == "N/A" ? -1 : (DT_I4)Winter
As a side note, you can add a TRY_CAST
to 100 columns very easily by using vertical editing: hold Alt+Shift in SSDT or SSMS (some other apps too) and then ↑ or ↓, then edit hundreds of lines at the same time.
Best Answer
No, I am not aware of any built-in function that does exactly this. But, you can still accomplish this without doing anything too complicated.
You could use the built-in CRYPT_GEN_RANDOM function (introduced in SQL Server 2008 R2) which generates random values based on a supplied length. The output is in hex/binary values so each byte returned is represented as two alphanumeric characters (hence the
/ 2 + 1
part below).Returns something along the lines of:
The only real downside here is that this needs to be done inline as
CRYPT_GEN_RANDOM
cannot be used in a User-Defined Function (UDF: Scalar or Table-Valued). However, it can still be applied in a set-based approach using a CTE as shown here (just set@MaxLength
to the max length of the column being obfuscated):Returns something along the lines of:
As you can see,
CRYPT_GEN_RANDOM
returns a different value for each row.Also, not sure if this is acceptable or not, but the only alpha characters returned are
A
-F
.OR, if you want the obfuscation to be repeatable for the same input value, or at least don't mind it being repeatable and prefer that this code be in a function so that it is easier to apply to multiple columns, you can use the HASHBYTES function which, like
CRYPT_GEN_RANDOM
, returns hex/binary bytes. UnlikeCRYPT_GEN_RANDOM
, the output length is fixed (in this case at 64 characters since I am usingSHA2_256
), so I usedREPLICATE
to repeat the hashed valued if the length of the input string is more than 64 characters. Also unlikeCRYPT_GEN_RANDOM
,HASHBYTES
can be used in a User-Defined Function (UDF) :-).And that can be used as follows:
Returns something along the lines of:
PLEASE NOTE: If you need alpha characters beyond
A
-F
and/or need to have distinct obfuscated values for distinct input values (i.e. reduce chances of collisions), then either method above can be adapted easily enough to do that.