Sql-server – SSIS Data Flow Task Violates Unique Constraint But Linked Server Insert Succeeds

collation, sql-server, ssis, unique-constraint

Ultimately, what I am trying to do is create an SSIS package in the QA environment for another team which truncates the destination tables and inserts from the source tables. However, on one table, ITEMMAST, I am receiving an error due to a unique constraint violation when trying to copy the data using a data flow task.

What is odd is that an INSERT...SELECT using a linked server succeeds! However, I can't use linked servers because this other team wants to be able to specify the environment for the source, so I have to parameterize the connection strings.

Here is the definition of the primary key constraint (which is what enforces the uniqueness):

ALTER TABLE [dbo].[ITEMMAST]
  ADD CONSTRAINT [ITESET1]
  PRIMARY KEY CLUSTERED 
  (
      [ITEM_GROUP] ASC,
      [ITEM] ASC 
  ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
        IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
        ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
  ON [PRIMARY];

I've tried dropping the primary key constraint, copying the data, and then recreating the primary key constraint but that also produces an error.

Msg 1505, Level 16, State 1, Line 46
The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.ITEMMAST' and the index name 'ITESET1'. The duplicate key value is (SALO , ).
Msg 1750, Level 16, State 0, Line 46
Could not create constraint. See previous errors.
The statement has been terminated.

A developer gave me the following query to run which helped me identify the culprits, one row with 32 spaces in the ITEM column and another row with the ASCII character of 0, a null character, in the ITEM column.

SELECT TOP 10
    PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, ITEM) AS [Position],
    SUBSTRING(ITEM, PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, ITEM), 1) AS [InvalidCharacter],
    ASCII(SUBSTRING(ITEM, PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, ITEM), 1)) AS [ASCIICode],
    *
FROM Lawson.dbo.ITEMMAST
WHERE ITEM_GROUP = 'SALO';

[Screenshot #1: query results showing the rows with invalid ITEM values]

Here is more information about the QA environment. These settings match our PROD environment, so I'm unable to modify QA to make this work; I've thought about altering the unique constraint definition or changing the destination database collation, but both would make QA diverge from PROD.

Source Database Collation: Latin1_General_BIN
Collation for the ITEM Column in the Source Table: Latin1_General_BIN

Destination Database Collation: SQL_Latin1_General_CP1_CI_AS
Collation for the ITEM Column in the Destination Table: Latin1_General_BIN

I checked that the codepage for the Latin1_General_BIN collation is 1252 which matches the codepage for the SQL_Latin1_General_CP1_CI_AS collation.
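(For reference, the codepages can be confirmed from T-SQL; both of these calls should return 1252.)

SELECT COLLATIONPROPERTY('Latin1_General_BIN', 'CodePage')          AS [SourceCodePage],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS [DestCodePage];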

Does anyone have any suggestions for other things that I can try? I'm new to SSIS so there might be a setting that I did not set correctly.

[UPDATE #1]

Screenshot showing the result of the following query against the source table.

SELECT TOP 2
    CONVERT(VARBINARY(500), [ITEM]) AS [ITEM_Binary],
    ITEM COLLATE SQL_Latin1_General_CP1_CI_AS AS [ITEM_Collated],
    *
FROM PROD.dbo.ITEMMAST
WHERE ITEM_GROUP = 'SALO';

[Screenshot #2: binary values showing one ITEM beginning with 0x00]

Best Answer

I had the O.P. check for duplicates in the destination table using the following query:

SELECT ITEM_GROUP, ITEM, COUNT(*)
FROM Lawson.dbo.ITEMMAST
GROUP BY ITEM_GROUP, ITEM
HAVING COUNT(*) > 1;

The duplicate ITEM value, converted to VARBINARY, was:

0x2020202020202020202020202020202020202020202020202020202020202020

The row starting with 7 "nul" characters (i.e. CHAR(0) or 0x00, see image #2 in the question) was replaced with a row of all spaces.

The problem is that the first character is 0x00, which is the null terminator for strings. When SSIS (i.e. .NET) pulled that value in, it probably treated it as a simple empty string. And inserting an empty string into a CHAR() column (i.e. a blank-padded datatype) will naturally leave you with all spaces. Hence your duplicate row.
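A minimal repro of that blank-padding behavior (the table variable and column width here are illustrative, not taken from ITEMMAST):

DECLARE @demo TABLE (ITEM CHAR(32) COLLATE Latin1_General_BIN);

-- One row whose value is the 0x00 character, and one empty string
INSERT INTO @demo (ITEM) VALUES (CHAR(0)), ('');

-- Both rows pad out to 32 bytes with spaces; only the first byte
-- differs (0x00 vs. 0x20), so as plain strings they collide
SELECT CONVERT(VARBINARY(32), ITEM) AS [ITEM_Binary] FROM @demo;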

At this point, I would say that it's simply bad data, unless you feel a need to preserve an invalid string (no string should ever have anything after a 0x00, nor should it really contain a 0x00 at all; but if it does, it should be the last character).

So, you can correct that row at the source, but it can't be an empty string since that would get the same PK violation that you are running into now. Is there a default dummy value?

Or, if that row is truly invalid / not used, then I would remove it, or maybe find out how that row got there in the first place. Did someone enter it manually? Did the app create it? Was it a botched import? You might take this opportunity to add a CHECK constraint ensuring that no row contains the 0x00 character in the future (this would help identify the culprit next time). Then delete the row from QA and move on... :)
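Such a CHECK constraint might look like this (the constraint name is illustrative; since ITEM is already Latin1_General_BIN, CHARINDEX should find the 0x00 byte, but test it against your data first, as CHAR(0) comparisons can be collation-sensitive):

ALTER TABLE dbo.ITEMMAST WITH CHECK
    ADD CONSTRAINT CK_ITEMMAST_ITEM_NoNulChar
    -- reject any ITEM value containing an embedded 0x00 character
    CHECK (CHARINDEX(CHAR(0), ITEM) = 0);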

I suppose you can always try to exclude that row if need be, but it would be best to prevent it from ever existing. It will never appear correctly when selected. And there might be other negative impacts downstream in the future. How much time have you (and I ;) spent on finding / fixing this? Do you really want to do it again, or have someone else hunt this down? If that row could never show up in prod, then it should be removed from QA.