MySQL query comparing table against itself is not utilizing index

indexMySQLselect

I have a table (civicrm_contact) with the following (relevant) columns:

+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| Field                          | Type             | Null | Key | Default           | Extra                       |
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| id                             | int(10) unsigned | NO   | PRI | NULL              | auto_increment              |
| contact_type                   | varchar(64)      | YES  | MUL | NULL              |                             |
| first_name                     | varchar(64)      | YES  |     | NULL              |                             |
| last_name                      | varchar(64)      | YES  | MUL | NULL              |                             |

I'm looking to optimize a query that does a partial match comparison of the first/last name fields.

SELECT t1.id id1, t2.id id2, 2 weight
FROM civicrm_contact t1
JOIN civicrm_contact t2
  ON (SUBSTR(t1.first_name, 1, 4) = SUBSTR(t2.first_name, 1, 4))
  AND (SUBSTR(t1.last_name, 1, 6) = SUBSTR(t2.last_name, 1, 6))
WHERE t1.contact_type = 'Individual'
  AND t2.contact_type = 'Individual'
  AND t1.first_name IS NOT NULL
  AND t1.first_name <> ''
  AND t1.last_name IS NOT NULL
  AND t1.last_name <> ''
  AND t1.id < t2.id

I created an index to match the partial matches. There are other existing indexes as well.

CREATE INDEX idx_ct_last6_first4_name ON civicrm_contact (contact_type, last_name(6), first_name(4));

However, the query remains very slow, and when I EXPLAIN it, I can see that the first table does not use that index and the second table does not use any index.

+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys                                                                                                                                                 | key                | key_len | ref   | rows  | filtered | Extra                                              |
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+
|  1 | SIMPLE      | t1    | NULL       | ref  | PRIMARY,index_contact_type,index_first_name,index_last_name,dedupe_index_first_name_4,dedupe_index_last_name_6,idx_last6_first4_name,idx_ct_last6_first4_name | index_contact_type | 195     | const | 46968 |    25.00 | Using where                                        |
|  1 | SIMPLE      | t2    | NULL       | ALL  | PRIMARY,index_contact_type,idx_ct_last6_first4_name                                                                                                           | NULL               | NULL    | NULL  | 93936 |    16.66 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+------------+------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------+---------+-------+-------+----------+----------------------------------------------------+

I'm trying to understand why it's not making better use of the indexes and what I can do to improve/optimize the query. It is intended to do a simple comparison against itself to identify duplicate values. Is the lack of index use because I'm joining the table against itself?

If so, I'm guessing my best option is to create a temp table for the second table reference so that I am joining two separate tables, which should make better use of the indexes.

Best Answer

I think it cannot pick up the index because of the substr expression. It isn't smart enough to get, that this would exactly match the indexed keys. You'll have the same problems with a temporary table. The problem is not, that you're self joining civicrm_contact.

If you're on version 8.0 you could use stored generated columns, that generate the substrings.

ALTER TABLE civicrm_contact
            ADD (last_name_6 varchar(6)
                             AS (substr(last_name, 1, 6))
                                STORED,
                 first_name_4 varchar(4)
                              AS (substr(first_name, 1, 4))
                                 STORED);

Stored generated columns can be indexed and you can then use these columns instead of the original ones in the substr().

For 5.7 to support generated columns you'd need to use NDB Cluster (but I don't think, that's the case). Versions below 5.7 don't support it at all. However, as a general workaround for lower versions than 8.0, you could also add regular columns and fill them with the substrings initially with an UPDATE and subsequently with a trigger.

Then, for further improvement I'd try to move contact_type to the end of the index as I guess it's selectivity is comparatively low.

CREATE INDEX idx_last6_first4_name_ct
             ON civicrm_contact (last_name_6,
                                 first_name_4,
                                 contact_type);

(Assuming, that last_name_6 and first_name_4 are the generated columns from above.)

Possibly also try to include id at the very end as this is a ranged condition rather than a pointed one.

CREATE INDEX idx_last6_first4_name_ct_id
             ON civicrm_contact (last_name_6,
                                 first_name_4,
                                 contact_type,
                                 id);

Or maybe play a little more with the order of the columns and see if there's anything better.

However one question is: Is this a one-off thing or do you plan to run this query frequently? If it's just a one-off, the effort might outweigh the gain. Maybe just running the query and letting it take it's time, is the overall more "efficient" way (in terms of human work) in such a case.