MySQL: Optimize query with a range and distinct count

index-tuningMySQLoptimizationperformancequery-performance

I have a query on a large table with millions of rows that looks like this:

SELECT
    COUNT(
        DISTINCT clicks.tx_id
    ) AS unique_clicks_count ,
    clicks.offer_id
FROM
    clicks
WHERE
    clicks.offer_id = 5
    AND created_at > '2014-11-27 18:00:02'
;

Created_at is a timestamp. I have a compound index on (offer_id, created_at) which gets used. The following is the explain:

|  1 | SIMPLE      | clicks | range | clicks_created_at_index,clicks_offer_id_created_at_index | clicks_offer_id_created_at_index | 8       | NULL | 215380 | Using index condition; Using MRR |
  1. Keeping in mind the range, what kind of index would I need to be able to count the distinct tx_id's efficiently, most likely which covers tx_id as well?

  2. What would the index look like without specifying clicks.offer_id = 5, and instead doing GROUP BY offer_id?

Best Answer

You have the best index there is. It is in the right order, and the EXPLAIN says "Using index", which means that it read the index to get the answer, and did not have to reach into the data.

(To further address all the comments...)

Note that it needed to read about 200K rows (of the index) to do the count. That many rows takes time.

INDEX(offer_id, created_at) versus INDEX(offer_id, created_at, tx_id) -- Apparently you are using InnoDB and tx_id is the PRIMARY KEY. The PK is included in every secondary key, so these two index specifications are virtually identical.

Order of the columns in an INDEX usually matters. And it does matter here. The fields must be in this order: (1) all the the "=" conditions (offer_id), (2) one range (created_id), and (3) all the other fields to make it "Using index", in any order (tx_id).

If you did not have offer_id = 5, follow the above pattern and get (1) (empty set), (2) (created_id), and (3) (tx_id) -- That is, INDEX(created_at, tx_id). Note that neither index works well for the other query.

No kind of PARTITIONing would help performance at all. You don't need a 2-dimensional index (as in two ranges); you have "=" and "range", so a 'compound index' works best.

I suspect that "Using MRR" (Multi-Range Read Optimization) effectively replaces "Using temporary" and "Using filesort" would might normally be used for DISTINCT.