Mysql – Identify duplicates and apply logic

MySQL

I have a largish 27mil row table. The table contains a large number of duplicate string values due to case variance.

All rows have the string column and also one additional numerical column, and I am trying to identify duplicates, and then ignore the row with the numerically highest secondary value within each duplicate set and export the remaining rows to a new table.

I have tried a number of approaches, but performance has been so poor as to make them nonviable even as a one time event.

What is the most performant approach for this kind of duplicate search with logic?

Platform is MySQL.

Best Answer

I would try locating all duplicates and their respective highest secondary values like this:

SELECT
  StringColumn,
  MAX(SecondaryColumn) AS HighestValue
FROM
  atable
GROUP BY
  StringColumn
HAVING
  COUNT(*) > 1

and then using the results to produce the final set like this:

SELECT
  t.StringColumn,
  t.SecondaryColumn
FROM
  atable AS t
  INNER JOIN
  (
    SELECT
      StringColumn,
      MAX(SecondaryColumn) AS HighestValue
    FROM
      atable
    GROUP BY
      StringColumn
    HAVING
      COUNT(*) > 1
  ) AS s ON t.StringColumn = s.StringColumn
        AND t.SecondaryColumn <> s.HighestValue
;

I believe an index on (StringColumn, SecondaryColumn) should help the performance.

Related Solutions

MySQL Trigger vs Stored Procedure Performance Comparison

For the sake of simplicity, triggers are the way to go for implementing any kind of tracking of database changes. However, you need to be aware of what happens under the hood when you use triggers.

According to MySQL Stored Procedure Programming, page 256 under the head "Trigger Overhead" says the following:

It is important to remember that, by necessity, triggers add overhead to the DML statement to which they apply. the actual amount of overhead will depend upon the nature of the trigger, but --- as all MySQL triggers execute FOR EACH ROW --- the overhead can rapidly accumulate for statements that process large numbers of rows. You should therefore avoid placing any expensive SQL statements or procedural code in triggers.

An expanded explanation of trigger overhead is given on pages 529-531. The conclulding point from that section states the following:

The lesson here is this: since the trigger code will execute once for every row affected by a DML statement, the trigger can easily become the most significant factor in DML performance. Code inside the trigger body needs to be as lightweight as possible and -- in particular -- any SQL statements in the trigger should be supported by indexes whenever possible.

Not mentioned in the book is another factor when using triggers: When it comes to audit logging, please be aware of what you log data into. I say this because should you choose to log to a MyISAM table, each INSERT into a MyISAM table produces a full table lock during the INSERT. This can become a serious bottleneck in a high-traffic, high-transaction environment. Additionally, if the trigger is against an InnoDB table and you log changes in MyISAM from within the trigger, this will secretly disabled ACID compliance (i.e., reduce block transactions to autocommit behavior), which is cannot be rolled back.

When using triggers on InnoDB tables and logging changes

The table you log to is also InnoDB
You have autocommit turned off
You setup START TRANSACTION...COMMIT/ROLLBACK blocks thoroughly

In this way, audit logs can benefit from COMMIT/ROLLBACK as would main tables.

Concerning using stored procedures, you would have to painstakingly call the stored procedure at every point of DML against the table being tracked. One could easily miss logging changes in the face of tens of thousands of lines of application code. Placing such code in a trigger eliminates finding all those DML statements.

CAVEAT

Depending on how complex the trigger is, it can still be a bottleneck. If you want to reduce bottlenecks in audit logging, there is something you can do. However, it will require a little infrastructure change.

Using commodity hardware, create two more DB Servers

Distribution master (DM) to relay audit info only
MySQL Server for Auditing (AU)

This will server to reduce write I/O on the main database (MD) due to audit logging. Here is how you can accomplish it:

Step 01) Turn on binary logging in the main database.

Step 02) Using an inexpensive server, setup MySQL (same version as MD) with binary logging enabled. This will be DM. Setup replication from MD to DM.

Step 03) Using a second inexpensive server, setup MySQL (same version as MD) with binary logging disabled. Setup each audit table to use --replicate-do-table. This will be AU. Setup replication from DM to AU.

Step 04) mysqldump the table structures from MD and load it into DM and AU.

Step 05) Convert all audit tables in MD to use the BLACKHOLE storage engine

Step 06) Convert all tables in DM and AU to use the BLACKHOLE storage engine

Step 07) Convert all audit tables in AU to use the MyISAM storage engine

When done

DM will replicate from MD and record stuff in its binary log only
With --replicate-do-table filter on all audit tables, AU will replicate from DM

What this does is store audit info on a separate DB server and also reduce any write I/O degradation that MD would normally have.

Merge and Remove Duplicates in MySQL – How to Guide

The query for populating the users_club_meta table looks good to me except for two issues:

I am not entirely convinced about HAVING COUNT(*) > 1. If you want to create entries for all combinations of (user_id, name, function, description), remove that condition.

If you want to omit the groups consisting only of empty school_grade values, you can use this condition instead:
```
HAVING COUNT(school_grade) > 0
```
That is assuming that by "empty" you mean NULL. If you mean to exclude empty strings ('') as well, you could modify the above condition like this:
```
HAVING COUNT(NULLIF(school_grade, '')) > 0
```
You are selecting id in that query but the column is not in GROUP BY. It is true that MySQL allows you to do that but the manual also discourages that you do that in cases where the non-aggregated non-GROUP BY column has multiple values per group.

If you want predictable results, use an aggregate function, for instance MIN():
```
INSERT INTO
  user_clubs_meta (user_clubs_id, meta_key, meta_value) 
SELECT
  MIN(id),
  'school_grades',
  CONCAT('[', GROUP_CONCAT(CONCAT('"', school_grade, '"')), ']')
FROM
  user_clubs 
GROUP BY
  user_id, name, function, description 
HAVING
  COUNT(school_grade) > 0
;
```

For the DELETE query, take the final draft of the INSERT statement's SELECT query keeping only MIN(id) in the SELECT clause and removing the HAVING clause:

SELECT
  MIN(id)
FROM
  user_clubs 
GROUP BY
  user_id, name, function, description
;

That gives you the first ID of every unique combination of (user_id, name, function, description). That will be the rows you want to keep. All the others you want to remove. This is an anti-join, and there are various ways to implement it. In this case you can go with:

NOT IN:

DELETE FROM
  user_clubs
WHERE
  id NOT IN
  (
    SELECT MIN(id)
    FROM user_clubs 
    GROUP BY user_id, name, function, description
  )
;

LEFT JOIN + WHERE IS NULL:

DELETE
  uc
FROM
  user_clubs AS uc
  LEFT JOIN
  (
    SELECT MIN(id) AS min_id
    FROM user_clubs 
    GROUP BY user_id, name, function, description
  ) AS filter
  ON uc.id = filter.min_id
WHERE
  filter.min_id IS NULL
;

The left join method is known to work fast in MySQL but you should test for yourself in your environment to choose what works better for you.

Best Answer

Related Solutions

MySQL Trigger vs Stored Procedure Performance Comparison

Merge and Remove Duplicates in MySQL – How to Guide

Related Question