Sql-server – Doing a comparison on two csv files with no Primary Key

csv migration sql-server ssis

I'm wondering whether SSIS offers a way to compare two CSV files (with the same structure) when there is no reference key that makes a record unique.

To give you a better picture of what I mean: in the initial load we load data from a CSV file into a table. In subsequent loads we get another CSV file (same format) that may contain different records. We need to compare what is in the table with the new version of the CSV file and load only the changed subset: new records should be inserted, changed records should be updated, and records deleted in the source should be marked as inactive in the table.

Now my questions are:

Is there any way to compare two CSV files in SSIS when there is no key to make the records unique?
How can we compare two tables in SQL Server when there is no key to make the records unique?

The volume of data in the CSV file is quite high: more than 20 million records!

Any idea is appreciated.

Thank you,

Nazila

Best Answer

If you have your first CSV loaded into a table, you can just as easily load the other one into a staging table (presumably with the same structure as the 'real' one). Then you can get the new rows by

SELECT * FROM staging_table
EXCEPT
SELECT * FROM real_table
;

Rows missing from the new CSV can be found by swapping the two sides of the EXCEPT. However, given the lack of a key on the staging table (hopefully the real table does have one; it's not clear from your question), deleting rows based on this comparison can be painful, especially with so many rows.

You can drop the staging table once you have finished.

(As far as I can see, this approach will work in any RDBMS that supports EXCEPT; Oracle has traditionally spelled the same operator MINUS.)
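
To try the two-way comparison without a SQL Server instance, here is a minimal sketch that runs the same EXCEPT queries against an in-memory SQLite database from Python. SQLite is only a stand-in, and the columns (name, city, amount) and sample rows are made up for illustration; only the table names come from the answer above.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server; columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE real_table    (name TEXT, city TEXT, amount INTEGER);
    CREATE TABLE staging_table (name TEXT, city TEXT, amount INTEGER);
""")
conn.executemany("INSERT INTO real_table VALUES (?, ?, ?)", [
    ("Ann", "Oslo", 10),
    ("Bob", "Rome", 20),
    ("Cid", "Lima", 30),
])
conn.executemany("INSERT INTO staging_table VALUES (?, ?, ?)", [
    ("Ann", "Oslo", 10),  # unchanged
    ("Bob", "Rome", 25),  # amount changed
    ("Dee", "Kyiv", 40),  # new row
])

# Rows in the new file that the real table does not contain (new or changed).
new_or_changed = sorted(conn.execute("""
    SELECT * FROM staging_table
    EXCEPT
    SELECT * FROM real_table
""").fetchall())

# Rows in the real table that the new file no longer contains (deleted or old versions).
missing_or_old = sorted(conn.execute("""
    SELECT * FROM real_table
    EXCEPT
    SELECT * FROM staging_table
""").fetchall())

print(new_or_changed)
print(missing_or_old)
```

Note that the changed row ("Bob") appears once in each result. Without a key there is nothing to tie the old version to the new one, so an update cannot be told apart from a delete plus an insert.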

Related Solutions

Sql-server – Duplicate primary key in bulk insert after truncate in SQL Server 2008

If you've truncated the table, then a primary key violation must be coming from duplicate data in the file. Try bulk inserting into a new table, without the PK constraint, and then check the table for duplicates (probably easier than writing some tool or script to parse the file directly). You can create a mimic table that won't have constraints this way:

SELECT * INTO dbo.new_bulk_source
  FROM dbo.old_source
  WHERE 1 = 0;

Then change your package to reference this table, do the insert, then run:

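-- [key] stands for your primary key column(s); KEY is a reserved word in T-SQL, hence the brackets
SELECT [key] FROM dbo.new_bulk_source
  GROUP BY [key]
  HAVING COUNT(*) > 1;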

I bet a donut the call is coming from inside the house (or the truncate is not succeeding).
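
The same duplicate hunt can be rehearsed on a small scale. The following is a rough sketch using Python's built-in sqlite3 module, not SQL Server: CREATE TABLE ... AS SELECT plays the role of SELECT ... INTO, and the id and payload columns and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical target table with a primary key, standing in for dbo.old_source.
conn.execute("CREATE TABLE old_source (id INTEGER PRIMARY KEY, payload TEXT)")

# SQLite's counterpart to SELECT ... INTO with WHERE 1 = 0:
# the columns are copied, the rows and the PK constraint are not.
conn.execute("CREATE TABLE new_bulk_source AS SELECT * FROM old_source WHERE 1 = 0")

# Pretend these rows came from the file; ids 2 and 3 are repeated.
file_rows = [(1, "a"), (2, "b"), (2, "b-again"), (3, "c"), (3, "c2"), (3, "c3")]
conn.executemany("INSERT INTO new_bulk_source VALUES (?, ?)", file_rows)

# Key values that occur more than once, with how many copies each has.
duplicates = conn.execute("""
    SELECT id, COUNT(*) AS copies
    FROM new_bulk_source
    GROUP BY id
    HAVING COUNT(*) > 1
    ORDER BY id
""").fetchall()

print(duplicates)
```

Because the copy has no primary key, the insert succeeds even with repeated ids, and the GROUP BY then shows exactly which key values would have violated the constraint.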

Mysql – Dealing with empty strings while loading a table from a CSV

According to the documentation, you can use SET statements to transform the data on the way in.

 [SET col_name = expr,...]

The expr expression can include the column name, which will be interpreted as the data being read from the file and destined for that column... so, for example, at the end of your LOAD DATA INFILE statement you might use:

SET latitude = IF(latitude + 0 = 0,NULL,latitude),
    area_code = IF(area_code = '',NULL,area_code)

This example transforms 2 columns. If latitude + 0 is 0, latitude gets set to NULL, and otherwise it gets set to the value from the file as the data is inserted; if area_code contains an empty string, it gets set to NULL, otherwise to the data from the file. The more appropriate choice will depend on how MySQL handles casting the data, but I suspect either of these constructs would work in your situation.

You do not have to reference columns you don't intend to transform. They'll be inserted as-is.
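
To see what the two expressions do to typical input, here is a small sketch that applies equivalent logic to a few raw text values. It uses SQLite through Python as a stand-in, so it approximates rather than reproduces MySQL's casting rules; CASE replaces IF, NULLIF is a shorter spelling of the empty-string test, and the table raw_rows and its sample values are invented for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw values as they would arrive from a CSV file: everything is text.
conn.execute("CREATE TABLE raw_rows (latitude TEXT, area_code TEXT)")
conn.executemany("INSERT INTO raw_rows VALUES (?, ?)", [
    ("", ""),        # both fields empty
    ("0", "212"),    # a genuine zero latitude
    ("59.91", ""),   # real latitude, empty area code
])

cleaned = conn.execute("""
    SELECT
        -- same idea as IF(latitude + 0 = 0, NULL, latitude)
        CASE WHEN latitude + 0 = 0 THEN NULL ELSE latitude END,
        -- same idea as IF(area_code = '', NULL, area_code)
        NULLIF(area_code, '')
    FROM raw_rows
    ORDER BY rowid
""").fetchall()

print(cleaned)
```

The second row shows the trade-off between the two constructs: the numeric test also turns a genuine 0 into NULL, whereas the empty-string test only touches fields that were actually empty.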
