Sql-server – SSIS Data Flow Task Violates Unique Constraint But Linked Server Insert Succeeds

collation, sql-server, ssis, unique-constraint

Ultimately, what I am trying to do is create an SSIS package in the QA environment for another team which truncates the destination tables and inserts from the source tables. However, on one table, ITEMMAST, I am receiving an error due to a unique constraint violation when trying to copy the data using a data flow task.

What is odd is that an INSERT...SELECT using a linked server succeeds! However, I can't use linked servers because this other team wants to be able to specify the environment for the source, so I have to parameterize the connection strings.

Here is the definition of the primary key constraint (which is what enforces the uniqueness):

ALTER TABLE [dbo].[ITEMMAST]
  ADD CONSTRAINT [ITESET1]
  PRIMARY KEY CLUSTERED 
  (
      [ITEM_GROUP] ASC,
      [ITEM] ASC 
  ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
        IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
        ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
  ON [PRIMARY];

I've tried dropping the primary key constraint, copying the data, and then recreating the primary key constraint but that also produces an error.

Msg 1505, Level 16, State 1, Line 46
The CREATE UNIQUE INDEX statement terminated because a duplicate key was found for the object name 'dbo.ITEMMAST' and the index name 'ITESET1'. The duplicate key value is (SALO , ).
Msg 1750, Level 16, State 0, Line 46
Could not create constraint. See previous errors.
The statement has been terminated.

A developer gave me the following query to run which helped me identify the culprits, one row with 32 spaces in the ITEM column and another row with the ASCII character of 0, a null character, in the ITEM column.

SELECT TOP 10
    PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, ITEM) AS [Position],
    SUBSTRING(ITEM, PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, ITEM), 1) AS [InvalidCharacter],
    ASCII(SUBSTRING(ITEM, PATINDEX('%[^ !-~]%' COLLATE Latin1_General_BIN, ITEM), 1)) AS [ASCIICode],
    *
FROM Lawson.dbo.ITEMMAST
WHERE ITEM_GROUP = 'SALO';

[Screenshot #1: query results showing the rows with invalid ITEM values]

Here is more information about the QA environment. These settings match our PROD environment, so I'm unable to modify QA to make this work; I've thought about altering the unique constraint definition or changing the destination database collation, but both would make QA diverge from PROD.

Source Database Collation: Latin1_General_BIN
Collation for the ITEM Column in the Source Table: Latin1_General_BIN

Destination Database Collation: SQL_Latin1_General_CP1_CI_AS
Collation for the ITEM Column in the Destination Table: Latin1_General_BIN

I checked that the codepage for the Latin1_General_BIN collation is 1252 which matches the codepage for the SQL_Latin1_General_CP1_CI_AS collation.
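(For reference, the codepages can be confirmed from T-SQL; both of these calls should return 1252.)

SELECT COLLATIONPROPERTY('Latin1_General_BIN', 'CodePage')          AS [SourceCodePage],
       COLLATIONPROPERTY('SQL_Latin1_General_CP1_CI_AS', 'CodePage') AS [DestCodePage];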

Does anyone have any suggestions for other things that I can try? I'm new to SSIS so there might be a setting that I did not set correctly.

[UPDATE #1]

Screenshot showing the result of the following query against the source table.

SELECT TOP 2
    CONVERT(VARBINARY(500), [ITEM]) AS [ITEM_Binary],
    ITEM COLLATE SQL_Latin1_General_CP1_CI_AS AS [ITEM_Collated],
    *
FROM PROD.dbo.ITEMMAST
WHERE ITEM_GROUP = 'SALO';

[Screenshot #2: binary values showing one ITEM beginning with 0x00]

Best Answer

I had the O.P. check for duplicates in the destination table using the following query:

SELECT ITEM_GROUP, ITEM, COUNT(*)
FROM Lawson.dbo.ITEMMAST
GROUP BY ITEM_GROUP, ITEM
HAVING COUNT(*) > 1;

The duplicate ITEM value, converted to VARBINARY, was:

0x2020202020202020202020202020202020202020202020202020202020202020

The row starting with 7 "nul" characters (i.e. CHAR(0) or 0x00, see image #2 in the question) was replaced with a row of all spaces.

The problem is that the first character is 0x00, which is the null terminator for strings. When SSIS (i.e. .NET) pulled that value in, it probably treated it as a simple empty string. And inserting an empty string into a CHAR() column (i.e. a blank-padded datatype) will naturally leave you with all spaces. Hence your duplicate row.
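A minimal repro of that blank-padding behavior (the table variable and column width here are illustrative, not taken from ITEMMAST):

DECLARE @demo TABLE (ITEM CHAR(32) COLLATE Latin1_General_BIN);

-- One row whose value is the 0x00 character, and one empty string
INSERT INTO @demo (ITEM) VALUES (CHAR(0)), ('');

-- Both rows pad out to 32 bytes with spaces; only the first byte
-- differs (0x00 vs. 0x20), so as plain strings they collide
SELECT CONVERT(VARBINARY(32), ITEM) AS [ITEM_Binary] FROM @demo;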

At this point, I would say that it's simply bad data, unless you feel a need to preserve an invalid string (no string should ever have anything after a 0x00, nor should it really contain a 0x00 at all; but if it does, it should be the last character).

So, you can correct that row at the source, but it can't be an empty string since that would get the same PK violation that you are running into now. Is there a default dummy value?

Or, if that row is truly invalid / not used, then I would remove it, or maybe find out how that row got there in the first place. Did someone enter it manually? Did the app create it? Was it a botched import? You might take this opportunity to add a CHECK constraint ensuring that no row contains the 0x00 character in the future (this would help identify the culprit next time). Then delete the row from QA and move on... :)
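Such a CHECK constraint might look like this (the constraint name is illustrative; since ITEM is already Latin1_General_BIN, CHARINDEX should find the 0x00 byte, but test it against your data first, as CHAR(0) comparisons can be collation-sensitive):

ALTER TABLE dbo.ITEMMAST WITH CHECK
    ADD CONSTRAINT CK_ITEMMAST_ITEM_NoNulChar
    -- reject any ITEM value containing an embedded 0x00 character
    CHECK (CHARINDEX(CHAR(0), ITEM) = 0);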

I suppose you can always try to exclude that row if need be, but it would be best to prevent it from ever existing. It will never appear correctly when selected. And there might be other negative impacts downstream in the future. How much time have you (and I ;) spent on finding / fixing this? Do you really want to do it again, or have someone else hunt this down? If that row could never show up in prod, then it should be removed from QA.