MySQL – large table – index performance and query issue

index | MySQL | query-performance

I have simplified a much larger table below:

CREATE TABLE `core` (
  `id` int NOT NULL,
  `loc_country` enum('United States','Colombia','United Kingdom',       
       'Australia','India','Germany','Canada','Korea','Netherlands',
       '200 more')  CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  `loc_city` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci DEFAULT NULL,
  `job` enum('a','b','c','d') CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `loc_country_2` (`loc_country`,`job`,`loc_city`(6))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
          ROW_FORMAT=COMPRESSED

explain format=json
    SELECT id FROM core
        WHERE id!=518601449
          AND loc_country='Mongolia'
          AND id < 518601449
          AND job='a'
        LIMIT 151\G

*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "14002.99"
    },
    "table": {
      "table_name": "core",
      "access_type": "range",
      "possible_keys": [
        "PRIMARY",
        "loc_country_2"
      ],
      "key": "loc_country_2",
      "used_key_parts": [
        "loc_country",
        "job",
        "loc_city",
        "id"
      ],
      "key_length": "34",
      "rows_examined_per_scan": 45657,
      "rows_produced_per_join": 45657,
      "filtered": "100.00",
      "using_index_for_skip_scan": true,
      "cost_info": {
        "read_cost": "9437.29",
        "eval_cost": "4565.70",
        "prefix_cost": "14002.99",
        "data_read_per_join": "1G"
      },
      "used_columns": [
        "id",
        "loc_country",
        "job"
      ],
      "attached_condition": "((`api`.`core`.`job` = 'a') and (`api`.`core`.`loc_country` = 'Mongolia') and (`api`.`core`.`id` <> 518601449) and (`api`.`core`.`id` < 518601449))"
    }
  }
}

The query took 14 seconds to run; I need it to finish in 0.01 seconds.

The biggest problem seems to be the use of id < XXX and ordering by id; I thought this should be "free" given that id is the primary key.
I need the id < condition and the sort because I need to fetch a different portion of the data with each query; without it, I would receive the same rows for each country+job combination on every query.

I cannot partition the table, as I have dozens of such queries using different columns; this is just one example.

I believe the compression has a big impact; it might be responsible for most of the issues I have, but I don't have the storage on my NVMe disks to run without compression.

Would it help to add the primary key to the indexes I have, at the end? I fear it would waste a lot of storage space.
Any ideas?

Best Answer

ASPECT #1: Indexing

There is no need to add the PRIMARY KEY column id to the index. Why?

According to the MySQL Documentation on Clustered and Secondary Indexes, Paragraphs 1 and 2 under the Subheading How Secondary Indexes Relate to the Clustered Index say the following:

All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index.

If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key.

Thus, the primary key is automatically added to a secondary index.

The proof of this is in the JSON output of your EXPLAIN:

  "used_key_parts": [
    "loc_country",
    "job",
    "loc_city",
    "id"
  ],

and

  "used_columns": [
    "id",
    "loc_country",
    "job"
  ],
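
If you want to convince yourself that declaring id explicitly buys nothing, here is a small experiment you could run on a copy of the table (the index name loc_country_2_with_id is made up for the test; mysql.innodb_index_stats reports index sizes in pages when persistent statistics are enabled, which is the default):

-- Add an explicit copy of the index with id appended (on a test copy of the table)
ALTER TABLE core ADD KEY loc_country_2_with_id (loc_country, job, loc_city(6), id);

-- Compare the on-disk size of both indexes
SELECT index_name, stat_value * @@innodb_page_size AS size_bytes
FROM mysql.innodb_index_stats
WHERE database_name = 'api'
  AND table_name = 'core'
  AND stat_name = 'size';

-- Drop the experimental index again
ALTER TABLE core DROP KEY loc_country_2_with_id;

The two indexes should come out at essentially the same size, because the original one already stores id at the end of every entry.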

ASPECT #2: Table Compression

Compression might be greatly contributing to the slowness. After all, InnoDB must decompress data coming from that table for it to reside in the InnoDB Buffer Pool. In addition, the InnoDB Buffer Pool holds both compressed and uncompressed data and index pages. How so ???

According to the MySQL Documentation on How Compression Works for InnoDB Tables, Paragraphs 1 and 2 under the Subheading Compression and the InnoDB Buffer Pool say the following:

In a compressed InnoDB table, every compressed page (whether 1K, 2K, 4K or 8K) corresponds to an uncompressed page of 16K bytes (or a smaller size if innodb_page_size is set). To access the data in a page, MySQL reads the compressed page from disk if it is not already in the buffer pool, then uncompresses the page to its original form. This section describes how InnoDB manages the buffer pool with respect to pages of compressed tables.

To minimize I/O and to reduce the need to uncompress a page, at times the buffer pool contains both the compressed and uncompressed form of a database page. To make room for other required database pages, MySQL can evict from the buffer pool an uncompressed page, while leaving the compressed page in memory. Or, if a page has not been accessed in a while, the compressed form of the page might be written to disk, to free space for other data. Thus, at any given time, the buffer pool might contain both the compressed and uncompressed forms of the page, or only the compressed form of the page, or neither.
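
If you want to measure how much (de)compression work the server is actually doing, the INFORMATION_SCHEMA compression counters give a rough picture (the counters are cumulative, broken down by compressed page size):

SELECT page_size, compress_ops, compress_ops_ok, compress_time,
       uncompress_ops, uncompress_time
FROM INFORMATION_SCHEMA.INNODB_CMP;

SELECT page_size, buffer_pool_instance, pages_used, pages_free
FROM INFORMATION_SCHEMA.INNODB_CMPMEM;

A high uncompress_ops count relative to your query volume is a sign that uncompressed copies of pages keep getting evicted and have to be decompressed again.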

SUGGESTION #1

Since the Buffer Pool holds both compressed and uncompressed pages, increasing the InnoDB Buffer Pool size might slightly improve things. How?

InnoDB has to evict unused pages from the Buffer Pool. A larger buffer pool reduces the number of times pages have to be evicted to make room for new ones.
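
Checking and resizing it is straightforward; on MySQL 5.7.5 and later the buffer pool can even be resized online (the 8 GB below is only a placeholder, size it to the RAM you can actually spare):

-- current size in bytes
SELECT @@innodb_buffer_pool_size;

-- resize online; MySQL rounds this up to a multiple of
-- innodb_buffer_pool_chunk_size * innodb_buffer_pool_instances
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;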

SUGGESTION #2

Perhaps changing the size of the compressed pages might slightly improve things. This would require rebuilding the table with the new compression size. Please see my old post innodb_file_format Barracuda on why changing the KEY_BLOCK_SIZE would be needed.
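
If you are on MySQL 8.0 (your 0900 collations suggest you are), innodb_file_format is gone and Barracuda behavior is always in effect, so the rebuild boils down to one ALTER. The value 8 below is only an example (valid KEY_BLOCK_SIZE values are 1, 2, 4, 8, and 16), and the statement rewrites the whole table, so budget time and temporary disk space for it:

ALTER TABLE core ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;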

UPDATE 2020-12-12 19:52

Here is a crazy trick to try

Instead of your query

SELECT id FROM core
WHERE id!=518601449
  AND loc_country='Mongolia'
  AND id < 518601449
  AND job='a'
LIMIT 151\G

Try refactoring as follows

SELECT id FROM (SELECT id FROM core WHERE loc_country='Mongolia' AND job='a') A 
WHERE id < 518601449 LIMIT 151\G

This will force the loc_country_2 index to gather the ids in the subquery first. Then, all ids >= 518601449 are dismissed. Finally, the LIMIT 151 is imposed.

I do not guarantee the result will be better. It could be worse.

You won't know until you try it out and see, or at least look at the EXPLAIN plan!!!
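
If you are on MySQL 8.0.18 or later, EXPLAIN ANALYZE is the quickest way to compare the two forms, since it actually executes the statements and reports where the time went (so run it only when an extra execution of the query is acceptable):

EXPLAIN ANALYZE
SELECT id FROM core
WHERE id!=518601449
  AND loc_country='Mongolia'
  AND id < 518601449
  AND job='a'
LIMIT 151;

EXPLAIN ANALYZE
SELECT id FROM (SELECT id FROM core WHERE loc_country='Mongolia' AND job='a') A
WHERE id < 518601449 LIMIT 151;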