Mysql – Identify duplicates and apply logic

MySQL

I have a largish 27mil row table. The table contains a large number of duplicate string values due to case variance.

All rows have the string column and also one additional numerical column, and I am trying to identify duplicates, and then ignore the row with the numerically highest secondary value within each duplicate set and export the remaining rows to a new table.

I have tried a number of approaches, but performance has been so poor as to make them nonviable even as a one time event.

What is the most performant approach for this kind of duplicate search with logic?

Platform is MySQL.

Best Answer

I would try locating all duplicates and their respective highest secondary values like this:

SELECT
  StringColumn,
  MAX(SecondaryColumn) AS HighestValue
FROM
  atable
GROUP BY
  StringColumn
HAVING
  COUNT(*) > 1

and then using the results to produce the final set like this:

SELECT
  t.StringColumn,
  t.SecondaryColumn
FROM
  atable AS t
  INNER JOIN
  (
    SELECT
      StringColumn,
      MAX(SecondaryColumn) AS HighestValue
    FROM
      atable
    GROUP BY
      StringColumn
    HAVING
      COUNT(*) > 1
  ) AS s ON t.StringColumn = s.StringColumn
        AND t.SecondaryColumn <> s.HighestValue
;

I believe an index on (StringColumn, SecondaryColumn) should help the performance.