I have a query that I think is taking way too long (as in I've not yet seen it finish!) but I can't figure out why.
The query (which is generated by an application that would be difficult but not impossible to change) is:
SELECT t1.id id1, t2.id id2, 5 weight
FROM ar1 t1
INNER JOIN ar1 t2 ON (t2.contact_type = 'Individual' and t1.first_name = t2.first_name)
WHERE t1.first_name IS NOT NULL
AND t1.first_name <> ''
AND t1.contact_type = 'Individual'
AND t1.id < t2.id GROUP BY id1, id2, weight
And a table with 100k rows.
[EDIT: added:] The puropse of the query is to provide a list of unique pairs of records that could be duplicates. Could be because the table has other fields, too, but this query is only looking at the first name field.
CREATE TABLE `ar1` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`contact_type` varchar(64) DEFAULT NULL,
`first_name` varchar(64) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `k1` (`first_name`,`contact_type`),
KEY `k2` (`contact_type`,`first_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci
Table status:
Name | ar1
Engine | InnoDB
Version | 10
Row_format | Dynamic
Rows | 102463
Avg_row_length | 46
Data_length | 4734976
Max_data_length | 0
Index_length | 5275648
Data_free | 0
Auto_increment | <null>
Create_time | 2020-04-21 09:22:1
Update_time | <null>
Check_time | <null>
Collation | utf8mb4_unicode_ci
Checksum | <null>
Create_options |
EXPLAIN says this:
***************************[ 1. row ]***************************
id | 1
select_type | SIMPLE
table | t1
type | ref
possible_keys | PRIMARY,k1,k2
key | k2
key_len | 259
ref | const
rows | 51232
Extra | Using where; Using index; Using temporary; Using filesort
***************************[ 2. row ]***************************
id | 1
select_type | SIMPLE
table | t2
type | ref
possible_keys | PRIMARY,k1,k2
key | k1
key_len | 518
ref | od_civicrm.t1.first_name,const
rows | 5
Extra | Using where; Using index
2 rows in set
Notes:
- There are only 8647 distinct first names in the table.
contact_type
isIndividual
for 99.9% of the rows.- There's no need to group on a constant (
weight
), either. But ommitting that does not magically fix it either. - I've tried various index combinations.
- I'm on MariaDB 10.1
- The servers I've tried it on are not massive – but to my mind this is not a huge recordset?
The ANALYZE
result can be downloaded
Best Answer
What the heck is the intent? It seems to deliver the
ids
of all pairs of individuals with the samefirst_name
.It looks like a "group-wise" max, but not quite.
weight
seems to be irrelevant since it is the constant "5".may speed it up, even with 99.9% of contact_types being the desired value. The speedup is due to the index being "covering". The order of the columns is important.
It is likely to return a million rows. That, in itself, will take time. And possibly choke the client.
Just looking for dups?
Perhaps you want
Alternatively,
SELECT MIN(id), MAX(id), first_name ...
, except that it does not give you all the ids when there are more than 2.Optimal: