Mariadb – Optimize MariaDB query with table that self-joins

greatest-n-per-groupmariadboptimization

I have a query that I think is taking way too long (as in I've not yet seen it finish!) but I can't figure out why.

The query (which is generated by an application that would be difficult but not impossible to change) is:

SELECT t1.id id1, t2.id id2, 5 weight
FROM ar1 t1 
INNER JOIN ar1 t2 ON (t2.contact_type = 'Individual' and t1.first_name = t2.first_name) 
WHERE t1.first_name IS NOT NULL
 AND t1.first_name <> ''
 AND t1.contact_type = 'Individual'
 AND t1.id < t2.id GROUP BY id1, id2, weight

And a table with 100k rows.

[EDIT: added:] The puropse of the query is to provide a list of unique pairs of records that could be duplicates. Could be because the table has other fields, too, but this query is only looking at the first name field.

CREATE TABLE `ar1` (                                                                            
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,                       
  `contact_type` varchar(64) DEFAULT NULL,
  `first_name` varchar(64) DEFAULT NULL,       
  PRIMARY KEY (`id`),                                                                           
  KEY `k1` (`first_name`,`contact_type`),                                                       
  KEY `k2` (`contact_type`,`first_name`)                                                        
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Table status:

Name            | ar1
Engine          | InnoDB
Version         | 10
Row_format      | Dynamic
Rows            | 102463
Avg_row_length  | 46
Data_length     | 4734976
Max_data_length | 0
Index_length    | 5275648
Data_free       | 0
Auto_increment  | <null>
Create_time     | 2020-04-21 09:22:1
Update_time     | <null>
Check_time      | <null>
Collation       | utf8mb4_unicode_ci
Checksum        | <null>
Create_options  |

EXPLAIN says this:

***************************[ 1. row ]***************************
id            | 1
select_type   | SIMPLE
table         | t1
type          | ref
possible_keys | PRIMARY,k1,k2
key           | k2
key_len       | 259
ref           | const
rows          | 51232
Extra         | Using where; Using index; Using temporary; Using filesort
***************************[ 2. row ]***************************
id            | 1
select_type   | SIMPLE
table         | t2
type          | ref
possible_keys | PRIMARY,k1,k2
key           | k1
key_len       | 518
ref           | od_civicrm.t1.first_name,const
rows          | 5
Extra         | Using where; Using index
2 rows in set

Notes:

There are only 8647 distinct first names in the table.
contact_type is Individual for 99.9% of the rows.
There's no need to group on a constant (weight), either. But ommitting that does not magically fix it either.
I've tried various index combinations.
I'm on MariaDB 10.1
The servers I've tried it on are not massive – but to my mind this is not a huge recordset?

The ANALYZE result can be downloaded

Best Answer

What the heck is the intent? It seems to deliver the ids of all pairs of individuals with the same first_name.

It looks like a "group-wise" max, but not quite.

weight seems to be irrelevant since it is the constant "5".

INDEX(contact_type, first_name, id)

may speed it up, even with 99.9% of contact_types being the desired value. The speedup is due to the index being "covering". The order of the columns is important.

It is likely to return a million rows. That, in itself, will take time. And possibly choke the client.

Just looking for dups?

Perhaps you want

SELECT GROUP_CONCAT(id), first_name
    FROM t
    WHERE first_name != ''
      AND contact_type = 'Individual'
    GROUP BY first_name
    HAVING COUNT(*) > 1;

Alternatively, SELECT MIN(id), MAX(id), first_name ..., except that it does not give you all the ids when there are more than 2.

Optimal:

INDEX(contact_type, first_name, id)

Related Solutions

Mysql – Help optimizing MySQL slow query

I would like to get rid of "Using temporary; Using filesort"

One of the problems I see is that you're using different GROUP BY and ORDER BY clauses. From the manual on how MySQL uses temporary tables:

If there is an ORDER BY clause and a different GROUP BY clause, or if the ORDER BY or GROUP BY contains columns from tables other than the first table in the join queue, a temporary table is created.

As soon as you create a temporary table, it will need to be sorted according to your ORDER BY clause, indicated by 'using filesort'.

This execution plan at leasts uses the indexes to appropriately limit the number of rows found.

I would also look through the docs on ORDER BY optimization.

Mysql – Filesort while using primary key for orderby

KEY `index_dedupe_variables`
(`deleted_at`,`state`,`reference_type`,`reference`,`id`,`client_identifier`)

Try removing that index and replacing it with this one:

KEY `new_index`
(`deleted_at`,`state`,`reference_type`,`id`)

Even when an index is being used, only the useful portion of the index is used; columns to the right of the first instance where the index cannot be used are not considered in that index.

In your query, 'reference' uses a wildcard on the left side, making that column of the index not useful, and eliminating the use of `id` in the index.

Your explain plan tells you this is happening:

ref: const,const,const

From this, you can tell that only the first three columns of the index are actually being used by the query.

By redefining the index to eliminate the unusable "reference" column, we can use all 4 of the columns of the index, and since the index is already sorted in the order you want, we can then do this:

WHERE `deleted_at` IS NULL
  AND state = 'Maharashtra'
  AND reference_type = 4 
  AND reference LIKE '%8858' 
ORDER BY deleted_at, state, reference_type, id

We don't really want them ordered by those first three columns, but since all they are all equality comparisons, ordering by them won't change the results and it should be "obvious" to the optimizer that it can simply return the results in index order... no filesort needed.

Your EXPLAIN should change to "Using Where" and possibly "Using index" (since this can mean "using index for lookups" even when it's not a covering index.)

It's possible that the optimizer might catch on and not use a filesort even if your ORDER BY were only `id`\ so you might try it both ways after you change the index.

Also, remove the use index from your query. It shouldn't be needed.

Best Answer

Related Solutions

Mysql – Help optimizing MySQL slow query

Mysql – Filesort while using primary key for orderby

Related Question