How to Structure an Index for Counting Distinct Values and Grouping by Columns in MySQL

Tags: distinct, index, MySQL

I have a single reporting table of sales data with about 4 million rows of data:

CREATE TABLE reporting_sales (
  customer_id bigint(20) DEFAULT NULL,
  effective_date date DEFAULT NULL,
  expiration_date date DEFAULT NULL,
  license_type_id int(11) DEFAULT NULL,
  residency varchar(10) DEFAULT NULL,
  gender varchar(10) DEFAULT NULL,
  age_range varchar(10) DEFAULT NULL,
  KEY ndx_reporting_sales (license_type_id,
    effective_date,
    expiration_date,
    customer_id,
    residency,
    gender,
    age_range) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
;

And this is the statement I want to run to summarize the data as of a particular day:

SELECT COUNT(DISTINCT customer_id),
  license_type_id,
  residency,
  gender,
  age_range
FROM reporting_sales
WHERE license_type_id in (1, 2, 3, 4, 5)
  AND effective_date <= '2021-01-01'
  AND expiration_date >= '2021-01-01'
GROUP BY license_type_id, residency, gender, age_range

I'm not sure how the index should be structured, specifically with respect to the customer_id field and the grouping.

Here's the explain for the index I have created, as shown above:

           id: 1
  select_type: SIMPLE
        table: reporting_sales
   partitions: NULL
         type: range
possible_keys: ndx_reporting_sales
          key: ndx_reporting_sales
      key_len: 4
          ref: NULL
         rows: 1829784
     filtered: 16.66
        Extra: Using where; Using index; Using filesort

How can I improve the performance of this statement and/or what would be a more suitable index?

Best Answer

A general rule for multi-column indexes is that you can have N leading columns in the index that are used in equality conditions.

Then you can have one more column in the index after those equality columns, to use for either inequality/range conditions, or grouping, or sorting. But not more than one.

Any further columns are not used for searching, sorting, or grouping. At best, they make the index a covering index, so the query can be answered from the index alone without reading the table rows.
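As a rough illustration of that rule, here is a simplified variant of the query run against the existing ndx_reporting_sales index. The single-value filter on license_type_id is an assumption made for the example and is not part of the original query.

```sql
-- Sketch only: a simplified query, not the one from the question.
-- license_type_id = 3 is an equality condition on the first index column.
-- effective_date <= ... is the one range condition the index can still use.
-- expiration_date >= ... is a second range condition, so it is applied
-- only as a filter on the index entries that were already found.
EXPLAIN
SELECT COUNT(*)
FROM reporting_sales
WHERE license_type_id = 3
  AND effective_date <= '2021-01-01'
  AND expiration_date >= '2021-01-01';
```

If the rule holds here, key_len should cover license_type_id plus effective_date but not expiration_date. The exact number depends on the column types and on whether the columns are nullable, and the optimizer may still pick a different plan.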

In the query you show, you have three range conditions: the IN list on license_type_id and the inequalities on effective_date and expiration_date. Only one of these conditions can make use of the index.

You can tell from the EXPLAIN report's key_len column that it is only using 4 bytes of your index. That's for the first column, license_type_id, which is a 4-byte integer. The other columns of the index are ignored for this query. They don't help narrow down the examined rows, nor do they help the GROUP BY.
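If counting bytes in key_len feels error-prone, one way to cross-check is the JSON form of EXPLAIN, which names the index columns actually used. This is a sketch that assumes MySQL 5.7 or later and that the query runs against reporting_sales, as the EXPLAIN output above suggests.

```sql
-- In the JSON output, look at the "used_key_parts" array and the
-- "key_length" value reported for the table.
EXPLAIN FORMAT=JSON
SELECT COUNT(DISTINCT customer_id),
  license_type_id,
  residency,
  gender,
  age_range
FROM reporting_sales
WHERE license_type_id IN (1, 2, 3, 4, 5)
  AND effective_date <= '2021-01-01'
  AND expiration_date >= '2021-01-01'
GROUP BY license_type_id, residency, gender, age_range;
```

A used_key_parts list containing only license_type_id would match the 4-byte key_len shown in the tabular output.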

In the query you show, that's the best you can do.


Possible exception to the above rule: MySQL 8.0.13 implemented the skip scan range access method, which might help in some cases, but there are a lot of limitations. Read the section https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-access-skip-scan for details.
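To see whether skip scan is even available on your server, a quick check along these lines may help. It is a sketch that assumes MySQL 8.0.13 or later, where the skip_scan flag exists in optimizer_switch.

```sql
-- Confirm the server version.
SELECT VERSION();

-- Returns 1 if the skip_scan optimizer flag is currently on.
SELECT @@optimizer_switch LIKE '%skip_scan=on%' AS skip_scan_enabled;
```

When the optimizer does choose skip scan, the Extra column of EXPLAIN shows "Using index for skip scan". As I read the linked page, one of the listed limitations is that the query must not use GROUP BY or DISTINCT, so the query in the question probably does not qualify.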