Mysql – Indexes/optimization for IN, JOIN, GROUP BY, ORDER BY query

MySQLperformancequery-performance

I'm working on a query where I need to use IN, BETWEEN, GROUP BY, JOIN, ORDER BY all in one query. I am struggling with performance for that query, so I need help to choose indexes or to make changes to table structures if indexes won't help.

Some considerations

Number of rows for both below tables are in millions.
There is functionality where user can filter list by name, age, gender etc.
There is functionality where user can sort list by some metrics like age, visits_count etc.
Need Pagination for list.

Table Structures

Table 1

CREATE TABLE `table_1` (
  `visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
  `name` varchar(200) NOT NULL,
  `gender` varchar(1) NOT NULL DEFAULT 'M',
  `mobile_number` int(10) unsigned DEFAULT NULL,
  `age` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `visits_count` mediumint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`visitor_id`),
  KEY `indx_t1_test` (`visitor_id`,`visits_count`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Table 2

CREATE TABLE `table_2` (
  `company_id` bigint(20) unsigned NOT NULL,
  `visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
  `time_duration` mediumint(5) unsigned NOT NULL DEFAULT '0',
  `visited_on` date NOT NULL,
  PRIMARY KEY (`company_id`,`visitor_id`,`visited_on`),
  KEY `indx_t2_test` (`visited_on`,`company_id`,`visitor_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Most Basic Data I want to retreive

Want to get 20 (Pagination) unique visitors (GROUP BY / DISTINCT) that visited particular group of companies (IN part) between selected time (BETWEEN part) period order by their age (ORDER BY part).

Query 1

First query if I write down for this then it would be:

SELECT
    t1.visitor_id
FROM table_1 AS t1
INNER JOIN table_2 AS t2 ON t2.visitor_id = t1.visitor_id
WHERE
    t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455) AND
    t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
GROUP BY t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;

When I run this query for any single company, it returns data fast enough (when the amount of matching rows are in small number, query performance is good).

The issue is when number of companies increases in IN part of query (I need to support 100 companies for this part of query), it takes time about 36 seconds to return result.

Explain output of this query is:

Query 2

Second query I can think of for same case, then it would be something like this:

SELECT
(
    SELECT  
        t2.visitor_id
    FROM table_2 AS t2
    WHERE 
        t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)
        AND t2.visitor_id = t1.`visitor_id`
        AND t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
    LIMIT 1
) AS visitor_id
FROM `table_1` AS t1
HAVING visitor_id IS NOT NULL
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20

Behaviour of this query is opposite to the first one. If I run query for company which have few visitors, performance of this query is very low (it took about 38 seconds) (only one company in IN part, and that company have only 3-4 visitors). When number of companies in IN part is high, it returns results faster compared to one company (it took approx 13 seconds), but still not usable performance.

Explain output of this query is:

Query 3

To eliminate use of IN part of query, I created temporary table and added company ids in that table and then used JOIN:

SELECT
    DISTINCT
    t1.visitor_id
FROM `table_1` AS t1
INNER JOIN `table_2` AS t2 ON t1.`visitor_id` = t2.visitor_id
INNER JOIN temp_table AS t3 ON t3.company_id = t2.company_id
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20;

This query is also taking time up to 22 seconds. I need performance up to 2-3 seconds for this listing.

Additional information

innodb_buffer_pool_size is 12GB
RAM is 30 GB
I'm using AWS RDS db.r3.xlarge instance

SHOW TABLE STATUS output is as follows:

The query SELECT COUNT(*) FROM table_2 WHERE company_id IN (...) AND visited_on BETWEEN '2015-01-01' AND '2017-01-31' returns 2660123
For the first time only it is taking time. If I run the same query again it is very much faster (0.2 seconds). But, if I change the limit to LIMIT 20, 20 then again it repeats 24 seconds for first time, and same query second time is faster. It may be because of innodb_buffer_pool_size.

Output of EXPLAIN FORMAT=JSON SELECT ...; is as follows.

{
"query_block": {
"select_id": 1,
"ordering_operation": {
  "using_filesort": true,
  "grouping_operation": {
    "using_temporary_table": true,
    "using_filesort": false,
    "nested_loop": [
      {
        "table": {
          "table_name": "t2",
          "access_type": "range",
          "possible_keys": [
            "PRIMARY",
            "indx_t2_test"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "company_id"
          ],
          "key_length": "8",
          "rows": 17301,
          "filtered": 100,
          "using_index": true,
          "attached_condition": "((`db`.`t2`.`company_id` in (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)) and (`db`.`t2`.`visited_on` between '2015-01-01' and '2017-01-31'))"
        }
      },
      {
        "table": {
          "table_name": "t1",
          "access_type": "eq_ref",
          "possible_keys": [
            "PRIMARY",
            "indx_t1_test"
          ],
          "key": "PRIMARY",
          "used_key_parts": [
            "visitor_id"
          ],
          "key_length": "34",
          "ref": [
            "db.t2.visitor_id"
          ],
          "rows": 1,
          "filtered": 100
        }
      }
    ]
  }
}
}
}

The output of the query suggested by Rick James:

SELECT
    t2.visitor_id
FROM (
    SELECT
        DISTINCT visitor_id
    FROM table_2
    WHERE 
        company_id IN (528,211,1275,521,1299,493,492,852, 868,
                        869,1235,486,485,1238,855,1237,651,538,1241,1240, 548,
                        543,1247,1253,490,468,582,583,569,477,488,802,1294, 518,
                        1274,476,545,1267,556,479,1266,1265,541,1189,1263, 1152,
                        1260,478,1257,885,1139,1256,804,708,547,561,1239, 1142,
                        1226,1148,1230,529,1223,1192,1191,874,830,822,818, 817,
                        794,718,487,709,706,705,669,513,455)
        AND visited_on BETWEEN '2015-01-01' AND '2017-01-31'
) AS t2
INNER JOIN table_1 AS t1 ON t2.visitor_id = t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;

Explain output of the query is as follows:

This query is taking 58 seconds

Explain output of the inner subquery is as follows

The query:

SELECT
    COUNT(DISTINCT company_id, visited_on, visitor_id), 
    COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id), 
    COUNT(*) 
FROM table_2;

returns:

COUNT(DISTINCT company_id, visited_on, visitor_id) = 7607938.
COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id) = 5194480
COUNT(*) = 7607938

Note this output is with latest data, so number of rows in count(*) might have increased.

Best Answer

age int(3) unsigned - That allows you to store ages up to 4 billion and wastes 4 bytes. Change to TINYINT UNSIGNED (1 bytes).

Ascii for names? Limited to the US? Even so, disallowing some odd names.

I'm puzzled by t2's PRIMARY KEY. Since the PK is Unique, this disallows recording more than one visit to a company for a person. If the restriction is OK, add this (in case the Optimizer decides that the data range is the best filter):

INDEX(visited_on, conpany_id, visitor_id)

If my hunch is correct then change the PK and add an index:

PRIMARY KEY(`company_id`, `visitor_id`, visited_on),
INDEX(visited_on, conpany_id, visitor_id)

Then check out your various Queries.

UPDATE 2012-02-21 17:06 EDT

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM
TAB_EVENT e LEFT JOIN
(
    SELECT * FROM
    (SELECT id,event_id,datetime FROM TAB_MESSAGE
    WHERE read IS NULL OR read = 0
    ORDER BY datetime DESC) mm
    LIMIT 10
) m
ON e.id=m.event_id
ORDER BY m.datetime DESC, e.id ASC;

@Sebastian, I put the query back in the original join order. Please try this as well !!!

UPDATE 2012-02-21 17:11 EDT

Make sure the datetime field is indexed

ALTER TABLE TAB_MESSAGE ADD INDEX read_datetime_ndx (read,datetime);

MySQL looking up more rows than needed (indexing issue)

Your indexes are fine for the two types of queries you mentioned.

This query will be satisfied by traversing the clustered index on the primary key...

[...] WHERE participant_id = x AND question_id = y AND given_answer_id = z;

...and this one is satisfied by the index on 'question_id':

[...] WHERE question_id = x;

The output of EXPLAIN SELECT is not telling you what you think it is telling you, because the value shown in rows is an estimate of the number of rows the server will need to consider, not the actual rows it will examine. For InnoDB these are based on index statistics.

rows

The rows column indicates the number of rows MySQL believes it must examine to execute the query.

For InnoDB tables, this number is an estimate, and may not always be exact.

^{— http://dev.mysql.com/doc/refman/5.5/en/explain-output.html#explain_rows}

The optimizer gathers information about different possible query plans, and chooses the one with the lowest cost. The information shown in EXPLAIN is the information the optimizer gathered about the plan it selected.

When type is ref and key is not NULL, this means that the name listed in the key column is the name of the index that the optimizer has chosen to use to find the desired rows, so your query plan looks exactly as it should.

Note, sometimes you will see Using index in the Extra column and a lot of people assume that this means an index is being used, or that no index is being used when that doesn't appear, but that's not correct, either. Using index describes a special case called a "covering index" -- it does not indicate whether an index is being used to locate the rows of interest.

It's possible that running ANALYZE [LOCAL] TABLE would cause the numbers in rows shown by EXPLAIN to differ, but this is a simple query and selecting this index is an obvious choice for the optimizer to make, so ANALYZE TABLE is unlikely to make any actual difference in performance.

It is possible, however, that your overall performance might see some marginal improvement with an occasional OPTIMIZE [LOCAL] TABLE, because you are not inserting rows in primary key order (as would be the case with an auto_increment primary key)... but on large tables this can be time-consuming because it rebuilds a new copy of the table... but, again, I wouldn't expect any significant change.