Mariadb – Optimize MariaDB query with table that self-joins

greatest-n-per-groupmariadboptimization

I have a query that I think is taking way too long (as in I've not yet seen it finish!) but I can't figure out why.

The query (which is generated by an application that would be difficult but not impossible to change) is:

SELECT t1.id id1, t2.id id2, 5 weight
FROM ar1 t1 
INNER JOIN ar1 t2 ON (t2.contact_type = 'Individual' and t1.first_name = t2.first_name) 
WHERE t1.first_name IS NOT NULL
 AND t1.first_name <> ''
 AND t1.contact_type = 'Individual'
 AND t1.id < t2.id GROUP BY id1, id2, weight

And a table with 100k rows.

[EDIT: added:] The puropse of the query is to provide a list of unique pairs of records that could be duplicates. Could be because the table has other fields, too, but this query is only looking at the first name field.

CREATE TABLE `ar1` (                                                                            
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,                       
  `contact_type` varchar(64) DEFAULT NULL,
  `first_name` varchar(64) DEFAULT NULL,       
  PRIMARY KEY (`id`),                                                                           
  KEY `k1` (`first_name`,`contact_type`),                                                       
  KEY `k2` (`contact_type`,`first_name`)                                                        
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci                              

Table status:

Name            | ar1
Engine          | InnoDB
Version         | 10
Row_format      | Dynamic
Rows            | 102463
Avg_row_length  | 46
Data_length     | 4734976
Max_data_length | 0
Index_length    | 5275648
Data_free       | 0
Auto_increment  | <null>
Create_time     | 2020-04-21 09:22:1
Update_time     | <null>
Check_time      | <null>
Collation       | utf8mb4_unicode_ci
Checksum        | <null>
Create_options  |

EXPLAIN says this:

***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | t1
type          | ref
possible_keys | PRIMARY,k1,k2
key           | k2
key_len       | 259
ref           | const
rows          | 51232
Extra         | Using where; Using index; Using temporary; Using filesort
***************************[ 2. row ]***************************
id            | 1
select_type   | SIMPLE
table         | t2
type          | ref
possible_keys | PRIMARY,k1,k2
key           | k1
key_len       | 518
ref           | od_civicrm.t1.first_name,const
rows          | 5
Extra         | Using where; Using index
2 rows in set

Notes:

  • There are only 8647 distinct first names in the table.
  • contact_type is Individual for 99.9% of the rows.
  • There's no need to group on a constant (weight), either. But ommitting that does not magically fix it either.
  • I've tried various index combinations.
  • I'm on MariaDB 10.1
  • The servers I've tried it on are not massive – but to my mind this is not a huge recordset?

The ANALYZE result can be downloaded

Best Answer

What the heck is the intent? It seems to deliver the ids of all pairs of individuals with the same first_name.

It looks like a "group-wise" max, but not quite.

weight seems to be irrelevant since it is the constant "5".

INDEX(contact_type, first_name, id)

may speed it up, even with 99.9% of contact_types being the desired value. The speedup is due to the index being "covering". The order of the columns is important.

It is likely to return a million rows. That, in itself, will take time. And possibly choke the client.

Just looking for dups?

Perhaps you want

SELECT GROUP_CONCAT(id), first_name
    FROM t
    WHERE first_name != ''
      AND contact_type = 'Individual'
    GROUP BY first_name
    HAVING COUNT(*) > 1;

Alternatively, SELECT MIN(id), MAX(id), first_name ..., except that it does not give you all the ids when there are more than 2.

Optimal:

INDEX(contact_type, first_name, id)