Mysql – Indexes/optimization for IN, JOIN, GROUP BY, ORDER BY query

MySQLperformancequery-performance

I'm working on a query where I need to use IN, BETWEEN, GROUP BY, JOIN, ORDER BY all in one query. I am struggling with performance for that query, so I need help to choose indexes or to make changes to table structures if indexes won't help.

Some considerations

  1. Number of rows for both below tables are in millions.
  2. There is functionality where user can filter list by name, age, gender etc.
  3. There is functionality where user can sort list by some metrics like age, visits_count etc.
  4. Need Pagination for list.

Table Structures

Table 1

CREATE TABLE `table_1` (
  `visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
  `name` varchar(200) NOT NULL,
  `gender` varchar(1) NOT NULL DEFAULT 'M',
  `mobile_number` int(10) unsigned DEFAULT NULL,
  `age` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `visits_count` mediumint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`visitor_id`),
  KEY `indx_t1_test` (`visitor_id`,`visits_count`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Table 2

CREATE TABLE `table_2` (
  `company_id` bigint(20) unsigned NOT NULL,
  `visitor_id` varchar(32) CHARACTER SET ascii NOT NULL,
  `time_duration` mediumint(5) unsigned NOT NULL DEFAULT '0',
  `visited_on` date NOT NULL,
  PRIMARY KEY (`company_id`,`visitor_id`,`visited_on`),
  KEY `indx_t2_test` (`visited_on`,`company_id`,`visitor_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

Most Basic Data I want to retreive

Want to get 20 (Pagination) unique visitors (GROUP BY / DISTINCT) that visited particular group of companies (IN part) between selected time (BETWEEN part) period order by their age (ORDER BY part).

Query 1

First query if I write down for this then it would be:

SELECT
    t1.visitor_id
FROM table_1 AS t1
INNER JOIN table_2 AS t2 ON t2.visitor_id = t1.visitor_id
WHERE
    t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455) AND
    t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
GROUP BY t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;

When I run this query for any single company, it returns data fast enough (when the amount of matching rows are in small number, query performance is good).

The issue is when number of companies increases in IN part of query (I need to support 100 companies for this part of query), it takes time about 36 seconds to return result.

Explain output of this query is:

enter image description here

enter image description here

Query 2

Second query I can think of for same case, then it would be something like this:

SELECT
(
    SELECT  
        t2.visitor_id
    FROM table_2 AS t2
    WHERE 
        t2.company_id IN (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)
        AND t2.visitor_id = t1.`visitor_id`
        AND t2.visited_on BETWEEN '2015-01-01' AND '2017-01-31'
    LIMIT 1
) AS visitor_id
FROM `table_1` AS t1
HAVING visitor_id IS NOT NULL
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20

Behaviour of this query is opposite to the first one. If I run query for company which have few visitors, performance of this query is very low (it took about 38 seconds) (only one company in IN part, and that company have only 3-4 visitors). When number of companies in IN part is high, it returns results faster compared to one company (it took approx 13 seconds), but still not usable performance.

Explain output of this query is:

enter image description here

Query 3

To eliminate use of IN part of query, I created temporary table and added company ids in that table and then used JOIN:

SELECT
    DISTINCT
    t1.visitor_id
FROM `table_1` AS t1
INNER JOIN `table_2` AS t2 ON t1.`visitor_id` = t2.visitor_id
INNER JOIN temp_table AS t3 ON t3.company_id = t2.company_id
ORDER BY t1.`visits_count` DESC
LIMIT 0, 20;

This query is also taking time up to 22 seconds. I need performance up to 2-3 seconds for this listing.

Additional information

  • innodb_buffer_pool_size is 12GB
  • RAM is 30 GB
  • I'm using AWS RDS db.r3.xlarge instance
  • SHOW TABLE STATUS output is as follows:
    enter image description here

    1. The query SELECT COUNT(*) FROM table_2 WHERE company_id IN (...) AND visited_on BETWEEN '2015-01-01' AND '2017-01-31' returns 2660123

    2. For the first time only it is taking time. If I run the same query again it is very much faster (0.2 seconds). But, if I change the limit to LIMIT 20, 20 then again it repeats 24 seconds for first time, and same query second time is faster. It may be because of innodb_buffer_pool_size.

    3. Output of EXPLAIN FORMAT=JSON SELECT ...; is as follows.

      {
      "query_block": {
      "select_id": 1,
      "ordering_operation": {
        "using_filesort": true,
        "grouping_operation": {
          "using_temporary_table": true,
          "using_filesort": false,
          "nested_loop": [
            {
              "table": {
                "table_name": "t2",
                "access_type": "range",
                "possible_keys": [
                  "PRIMARY",
                  "indx_t2_test"
                ],
                "key": "PRIMARY",
                "used_key_parts": [
                  "company_id"
                ],
                "key_length": "8",
                "rows": 17301,
                "filtered": 100,
                "using_index": true,
                "attached_condition": "((`db`.`t2`.`company_id` in (528,211,1275,521,1299,493,492,852,868,869,1235,486,485,1238,855,1237,651,538,1241,1240,548,543,1247,1253,490,468,582,583,569,477,488,802,1294,518,1274,476,545,1267,556,479,1266,1265,541,1189,1263,1152,1260,478,1257,885,1139,1256,804,708,547,561,1239,1142,1226,1148,1230,529,1223,1192,1191,874,830,822,818,817,794,718,487,709,706,705,669,513,455)) and (`db`.`t2`.`visited_on` between '2015-01-01' and '2017-01-31'))"
              }
            },
            {
              "table": {
                "table_name": "t1",
                "access_type": "eq_ref",
                "possible_keys": [
                  "PRIMARY",
                  "indx_t1_test"
                ],
                "key": "PRIMARY",
                "used_key_parts": [
                  "visitor_id"
                ],
                "key_length": "34",
                "ref": [
                  "db.t2.visitor_id"
                ],
                "rows": 1,
                "filtered": 100
              }
            }
          ]
        }
      }
      }
      }
      

The output of the query suggested by Rick James:

SELECT
    t2.visitor_id
FROM (
    SELECT
        DISTINCT visitor_id
    FROM table_2
    WHERE 
        company_id IN (528,211,1275,521,1299,493,492,852, 868,
                        869,1235,486,485,1238,855,1237,651,538,1241,1240, 548,
                        543,1247,1253,490,468,582,583,569,477,488,802,1294, 518,
                        1274,476,545,1267,556,479,1266,1265,541,1189,1263, 1152,
                        1260,478,1257,885,1139,1256,804,708,547,561,1239, 1142,
                        1226,1148,1230,529,1223,1192,1191,874,830,822,818, 817,
                        794,718,487,709,706,705,669,513,455)
        AND visited_on BETWEEN '2015-01-01' AND '2017-01-31'
) AS t2
INNER JOIN table_1 AS t1 ON t2.visitor_id = t1.visitor_id
ORDER BY t1.`visits_count` DESC
LIMIT 20;

Explain output of the query is as follows:

enter image description here

This query is taking 58 seconds

enter image description here

Explain output of the inner subquery is as follows

enter image description here

enter image description here


The query:

SELECT
    COUNT(DISTINCT company_id, visited_on, visitor_id), 
    COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id), 
    COUNT(*) 
FROM table_2;

returns:

  • COUNT(DISTINCT company_id, visited_on, visitor_id) = 7607938.
  • COUNT(DISTINCT company_id, LEFT(visited_on, 7), visitor_id) = 5194480
  • COUNT(*) = 7607938

Note this output is with latest data, so number of rows in count(*) might have increased.

Best Answer

age int(3) unsigned - That allows you to store ages up to 4 billion and wastes 4 bytes. Change to TINYINT UNSIGNED (1 bytes).

Ascii for names? Limited to the US? Even so, disallowing some odd names.

I'm puzzled by t2's PRIMARY KEY. Since the PK is Unique, this disallows recording more than one visit to a company for a person. If the restriction is OK, add this (in case the Optimizer decides that the data range is the best filter):

INDEX(visited_on, conpany_id, visitor_id)

If my hunch is correct then change the PK and add an index:

PRIMARY KEY(`company_id`, `visitor_id`, visited_on),
INDEX(visited_on, conpany_id, visitor_id)

Then check out your various Queries.