Sql-server – Why does the database structure and SELECT operations generate NULLs

csvimportsql serversql-server-2008trigger

I'm not too experienced with database problems, so please excuse the wall of text, I believe the context is important to understand the problem and goal.


I receive a CSV file from another system every day that I want to retain in a database running on Microsoft SQL Server. To accomplish this, I have a PowerShell-script scheduled daily that uses the bcp utility to "load" the CSV (accounts.csv) into a table (accounts_stage) that has the "exact" layout of the CSV file. In addition, I "write" the timestamp into another table (import_stage). The accounts_stage table has a trigger set, and bcp is configured to fire that trigger.

After the import is finished, I want to "normalize" the data. For a basic example, assume this concerns only the contract number (in the CSV file a field called "Vnr"), which is a string. I have another table (contracts), which gets filled by the trigger:

INSERT INTO dzp.contractid(vnr)
SELECT DISTINCT
    a.Vnr
FROM dzp.accounts_stage a
WHERE NOT EXISTS(
    SELECT id.vnr
    FROM dzp.contractid id
)

This should, and seems to accomplish, add all contract numbers that are ever imported to the table contractids. contractids.contractId is a BIGINT PRIMARY KEY, IDENTITY, and auto increment.

I add the import-metadata to my table like this:

   CREATE TABLE #IMPORTMETADATA
            ([importid] bigint
            ,[generated_timestamp] datetime
            ,[imported_timestamp] datetime
            )
    INSERT INTO #IMPORTMETADATA
        SELECT TOP 1 [importid], [timestamp_import], [generated_timestamp]         
    FROM
            [QGTAA].[dzp].[import]
        ORDER BY
            [timestamp_import] DESC

Next, the trigger should INSERT the data FROM accounts_stage INTO accounts, such that the contractId is replaced (from the original string representation to the PRIMARY KEY of the contracts table).

INSERT INTO [dzp].[accounts]
      ([Udnr]
      ,[ContractId]
      -- snip
      ,[generated_timestamp]
      ,[importid]
      )
SELECT 
    a.[Udnr], 
    cid.contractId, 
    -- snip
    meta.[generated_timestamp], 
    meta.[importid]
FROM [QGTAA].[dzp].[accounts_stage] a
LEFT JOIN dzp.contractid cid ON cid.vnr = a.Vnr
CROSS JOIN #IMPORTMETADATA meta

However, this last step fails, because the SELECT operation in the last block yields rows where contractId IS NULL, which I did not expect. Now I believe the issue is somewhere with the LEFT JOIN, but I'm not sure why.

Questions

  1. Can someone please point me in the right direction on this specific problem?

  2. I am not too experienced with databases, is this even a sane way of accomplishing what I intend, or is this a completely insane approach? Is there a better way?

Thank you!

Best Answer

There's a problem with your first NOT EXISTS. Let's take a look at your current query:

INSERT INTO dzp.contractid(vnr)
SELECT DISTINCT
    a.Vnr
FROM dzp.accounts_stage a
WHERE NOT EXISTS(
    SELECT id.vnr
    FROM dzp.contractid id
)

So, what's happening here is that the NOT EXISTS that you are using will return true only when the dzp.contractid table is empty. Once it has one row (or more), the NOT EXISTS will return false for every row of your main table, since SELECT id.vnr FROM dzp.contractid has a result set, and you end up inserting duplicated values in your table.

What you need to do is correlate this subquery with the main table:

INSERT INTO dzp.contractid(vnr)
SELECT DISTINCT
    a.Vnr
FROM dzp.accounts_stage a
WHERE NOT EXISTS(
    SELECT id.vnr
    FROM dzp.contractid id
    WHERE id.vnr = a.vnr
);

Also, the CROSS JOIN with the data from the latest import doesn't make much sense to me, and neither the fact that you aren't checking if the data exists already in the accounts table.