MySQL – identify duplicates in a large MySQL insert


I currently have a PHP script that inserts thousands to millions of rows into an InnoDB table. To prevent duplicates, the table has a UNIQUE index set up on the combination of four columns, and I use INSERT IGNORE to allow the insert to continue regardless of duplicates.
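
For concreteness, a minimal sketch of that setup; the table name (queue_data), the four unique columns (c1 through c4), and the payload column are placeholders, not from the original post:

    CREATE TABLE queue_data (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT,
        c1 INT NOT NULL,
        c2 INT NOT NULL,
        c3 VARCHAR(64) NOT NULL,
        c4 DATE NOT NULL,
        payload TEXT,
        PRIMARY KEY (id),
        UNIQUE KEY uq_four (c1, c2, c3, c4)   -- blocks duplicate combinations
    ) ENGINE=InnoDB;

    -- IGNORE silently skips any row that violates uq_four
    INSERT IGNORE INTO queue_data (c1, c2, c3, c4, payload)
    VALUES (1, 2, 'a', '2024-01-01', 'first row'),
           (1, 2, 'b', '2024-01-01', 'second row');   -- ... up to 2000 tuples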

For performance reasons, rather than inserting one row at a time, I batch them up into 2000 rows per query. I want to know which individual rows are ignored due to a key violation. The only way I can think of to do it is to insert one row at a time and check the value of mysqli_affected_rows after each insert, but that feels inefficient, and I would lose the advantage of batching my inserts. Alternatively, I could remove the UNIQUE index and retrospectively check for duplicates with some SQL at the end.
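
For what it's worth, that retrospective check could be as simple as a GROUP BY over the four columns (placeholder names as above), run after the UNIQUE index has been dropped so duplicates can actually land in the table:

    -- List every four-column combination that appears more than once
    SELECT c1, c2, c3, c4, COUNT(*) AS occurrences
    FROM queue_data
    GROUP BY c1, c2, c3, c4
    HAVING COUNT(*) > 1;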

Any other suggestions?

Many thanks.

Best Answer

This can probably be solved with a more complex set of steps. You still get the benefit of the bulk INSERT.

1. Build a staging table. (A permanent table is usually better than a true TEMPORARY table; just TRUNCATE it between uses.)

2. INSERT the 2000 rows into it.

3. Identify the duplicate rows via a JOIN between the staging table and the real table.

4. INSERT INTO real ... SELECT ... JOIN ... to copy only the "new" rows across (see the sketch after this list).
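
A minimal sketch of the four steps, reusing the placeholder names from above (queue_data as the real table, staging as the staging table):

    -- One-time setup: clone the structure of the real table, then
    -- drop the UNIQUE index so a batch loads even when it collides
    -- with rows already in queue_data
    CREATE TABLE staging LIKE queue_data;
    ALTER TABLE staging DROP INDEX uq_four;

    -- Per batch:
    TRUNCATE TABLE staging;

    INSERT INTO staging (c1, c2, c3, c4, payload)
    VALUES (1, 2, 'a', '2024-01-01', 'first row'),
           (1, 2, 'b', '2024-01-01', 'second row');   -- the 2000-row batch

    -- Step 3: rows that collide with data already in the real table,
    -- i.e. the rows INSERT IGNORE would silently have dropped
    SELECT s.*
    FROM staging AS s
    JOIN queue_data AS q USING (c1, c2, c3, c4);

    -- Step 4: copy only the genuinely new rows into the real table
    INSERT INTO queue_data (c1, c2, c3, c4, payload)
    SELECT s.c1, s.c2, s.c3, s.c4, s.payload
    FROM staging AS s
    LEFT JOIN queue_data AS q USING (c1, c2, c3, c4)
    WHERE q.id IS NULL;

One caveat: the LEFT JOIN only catches duplicates against rows already in queue_data, not duplicates within the batch itself; keeping IGNORE on the final INSERT covers that case without losing the reporting from step 3.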

More details are in http://mysql.rjweb.org/doc.php/staging_table, especially the section on 'Normalization', which uses two queries in a slightly different way: one to insert the 'new' rows, another to pull back the id for every row.
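
A rough outline of that two-query pattern (the linked page is the authoritative version; the names here are the same placeholders as above):

    -- Query 1: insert only the rows that don't exist yet;
    -- IGNORE also guards against concurrent inserters racing us
    INSERT IGNORE INTO queue_data (c1, c2, c3, c4, payload)
    SELECT c1, c2, c3, c4, payload
    FROM staging;

    -- Query 2: pull back the id for every row in the batch,
    -- whether it was just inserted or already existed
    SELECT q.id, s.c1, s.c2, s.c3, s.c4
    FROM staging AS s
    JOIN queue_data AS q USING (c1, c2, c3, c4);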

Your app may also benefit from the "flip-flop" technique described there.
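
As I understand the flip-flop idea from that page, it uses two staging tables so one can be filled while the other is processed, swapped with an atomic multi-table RENAME. A hedged sketch, again with placeholder names:

    -- Writers always load into `staging`; when a batch is ready,
    -- atomically swap it with the (already-emptied) processing table
    RENAME TABLE staging      TO staging_tmp,
                 staging_proc TO staging,
                 staging_tmp  TO staging_proc;
    -- Now process and TRUNCATE staging_proc while new rows
    -- keep arriving in the fresh `staging`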