MySQL – BETWEEN query on two indexed columns takes exponentially longer with minimal changes to LIMIT

innodb, MySQL, optimization, performance, query-performance

This is a minimal example, but I have reproduced the problem with this exact structure. I have the following table:

| column name    | type        |
|----------------+-------------|
| id             | PRIMARY KEY |
| id_start       | INT         |
| id_end         | INT         |
|----------------+-------------|

This table contains roughly one billion (1,000,000,000) rows. The id_start and id_end columns are indexed. The table uses InnoDB with B-tree indexes.

Some background information: this is sensor data with start and end timestamps. The timestamps themselves are kept in another table, which is referenced from this one, so I only work with the id columns here.

The following query returns around 15,000 rows:

SELECT * FROM table WHERE 310000 BETWEEN id_start AND id_end;

The problem is that the query takes very long to return anything (MySQL Workbench reports a duration of 1 sec and a fetch time of 220 sec).

If I append LIMIT 14000 to the query, it returns almost immediately. The closer the LIMIT parameter gets to the expected 15,000 resulting rows, the longer the query takes.

What could cause fetching 15,000 rows to take so much longer than fetching 14,000?

Some more information which may or may not be relevant to this question:

  • The DB Server is well configured (InnoDB buffers etc.).
  • The DB Server has way more CPU power and memory than required. The whole dataset fits into memory.
  • None of the columns in the table are nullable.

As requested here is additional information regarding the table structure and the query.

CREATE TABLE `table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `id_start` int(11) NOT NULL,
  `id_end` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_end` (`id_end`),
  KEY `idx_start` (`id_start`),
  KEY `idx_start_end` (`id_start`,`id_end`),
  KEY `idx_end_start` (`id_end`,`id_start`)
) ENGINE=InnoDB;

And here the EXPLAIN statement for the query:

+----+-------------+-------+------------+-------+-------------------------------------------------+---------------+---------+------+----------+----------+---------------+
| id | select_type | table | partitions | type  |                  possible_keys                  |      key      | key_len | ref  |   rows   | filtered |     Extra     |
+----+-------------+-------+------------+-------+-------------------------------------------------+---------------+---------+------+----------+----------+---------------+
|  1 | SIMPLE      | table | NULL       | range | "idx_start,idx_end,idx_start_end,idx_end_start" | idx_end_start |       9 | NULL | 20509120 |    50.00 | "Using where" |
+----+-------------+-------+------------+-------+-------------------------------------------------+---------------+---------+------+----------+----------+---------------+

Here is the EXPLAIN FORMAT=JSON result for the query with a low LIMIT:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "27795648.61"
    },
    "table": {
      "table_name": "table",
      "access_type": "range",
      "possible_keys": [
        "idx_start",
        "idx_end",
        "idx_start_end",
        "idx_end_start"
      ],
      "key": "idx_end_start",
      "used_key_parts": [
        "id_end"
      ],
      "key_length": "9",
      "rows_examined_per_scan": 19854034,
      "rows_produced_per_join": 9927018,
      "filtered": "50.00",
      "index_condition": "(310300 between `table`.`id_start` and `table`.`id_end`)",
      "cost_info": {
        "read_cost": "25810244.97",
        "eval_cost": "1985403.64",
        "prefix_cost": "27795648.61",
        "data_read_per_join": "4G"
      },
      "used_columns": [
        "id",
        "id_start",
        "id_end"
      ]
    }
  }
}

And here is the EXPLAIN FORMAT=JSON result with a high LIMIT:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "27203639.01"
    },
    "table": {
      "table_name": "table",
      "access_type": "range",
      "possible_keys": [
        "idx_start",
        "idx_end",
        "idx_start_end",
        "idx_end_start"
      ],
      "key": "idx_end_start",
      "used_key_parts": [
        "id_end"
      ],
      "key_length": "9",
      "rows_examined_per_scan": 19431170,
      "rows_produced_per_join": 9715586,
      "filtered": "50.00",
      "index_condition": "(310300 between `table`.`id_start` and `table`.`id_end`)",
      "cost_info": {
        "read_cost": "25260521.78",
        "eval_cost": "1943117.23",
        "prefix_cost": "27203639.01",
        "data_read_per_join": "4G"
      },
      "used_columns": [
        "id",
        "id_start",
        "id_end"
      ]
    }
  }
}

Best Answer

I see two questions:

  • What's with LIMIT affecting performance?
  • (The real question.) How to speed up this kind of query on this kind of table.

This is a classic "tough task" in MySQL -- finding elements between a 'start' and 'end' value.

I address it with some extra complication in http://mysql.rjweb.org/doc.php/ipranges . However, it requires that the ranges be non-overlapping. (It is unclear whether your ranges will overlap.)
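The core of the non-overlapping-ranges trick can be sketched directly against the question's schema. If at most one range can contain the target value, it must be the range with the largest id_start that is still <= the target, so a single-row index probe replaces the huge range scan. This is only a sketch under that non-overlap assumption; the table and column names follow the question:

    -- Valid ONLY if ranges never overlap: then at most one row can contain
    -- 310000, namely the one with the largest id_start <= 310000.
    SELECT t.*
    FROM ( SELECT *
           FROM   `table`
           WHERE  id_start <= 310000
           ORDER  BY id_start DESC
           LIMIT  1                    -- idx_start: touches O(1) rows
         ) AS t
    WHERE t.id_end >= 310000;          -- verify the candidate actually contains it

With overlapping ranges this returns at most one of possibly many matches, which is why the linked article needs the extra machinery.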

You say that id_start is really a pointer to another table where the actual timestamp is held? And yet you use WHERE 310000 BETWEEN id_start AND id_end. This implies that id_start and id_end are chronologically ordered, though not actual representations of "time"? And ditto for 310000? Perhaps this is done to use 4-byte INTs instead of larger TIMESTAMP(6) values?

With EXPLAIN FORMAT=JSON SELECT ..., we might be able to figure out why changing the LIMIT value affects performance.

How evenly are the times distributed? Here's a guess at the difference you are seeing: With a low value for LIMIT, the Optimizer's statistics suggest that a certain query plan will find the 14K quickly, but with the larger value, it will deduce that it may need to search far to find 15K rows. The Optimizer has no concept of "time ranges". Hence it can only use INDEX(start) or INDEX(end); it does not know that start<end and other things that you know and could use to optimize the query.
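One way to hand the Optimizer the knowledge it lacks: if the ranges have a known maximum length, an extra lower bound on id_start turns the open-ended half of the BETWEEN into a finite index range. A sketch, where the maximum length (10,000 here) is an assumption you would have to verify against your data:

    -- @max_len is an assumed upper bound on (id_end - id_start) across all rows.
    -- If any range is longer, this query silently misses rows.
    SET @max_len := 10000;

    SELECT *
    FROM   `table`
    WHERE  id_start BETWEEN 310000 - @max_len AND 310000  -- finite scan on idx_start_end
      AND  id_end >= 310000;

Instead of scanning all ~20M rows with id_end >= 310000, the range scan is now bounded by @max_len, independent of the LIMIT value.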

Do you need id for anything else? Are there other columns in the table? Both of these are important questions, since the following optimization may fail, depending on your answers.

For this 3-column table, I would consider one of the following alternatives:

CREATE TABLE x (
    start_id INT UNSIGNED NOT NULL,
    end_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY(start_id, end_id),  -- assumed unique
    INDEX      (end_id, start_id)
) ENGINE=InnoDB;
-- assumes start-end pairs are unique
-- with a billion rows, fewer indexes is beneficial

CREATE TABLE x (
    start_id INT UNSIGNED NOT NULL,
    end_id   INT UNSIGNED NOT NULL,
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY(start_id, end_id, id),  -- id assures uniqueness
    INDEX      (end_id, start_id),
    INDEX(id)  -- sufficient to keep AUTO_INCREMENT happy
) ENGINE=InnoDB;
-- does not assume start-end pairs are unique
-- allows for FK references to `id`

Notes:

  • Having the PK start with start_id may help with performance.
  • The fundamental performance problem of start..end queries is not solved by these schemas.
  • I have not assessed the benefit of these schemas if there are more columns.
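One concrete benefit of keeping the table to these three columns: every column lives in the secondary index (InnoDB secondary indexes implicitly carry the PK columns), so range queries can be answered from the index alone, with no lookups into the clustered index. A sketch against the second schema above:

    -- With PRIMARY KEY(start_id, end_id, id) and INDEX(end_id, start_id),
    -- the secondary index contains all three columns, so this query should
    -- show "Using index" (a covering index) in EXPLAIN:
    SELECT start_id, end_id, id
    FROM   x
    WHERE  310000 BETWEEN start_id AND end_id;

The scan itself is still large, but each examined row is a narrow index entry rather than a full clustered-index record fetch, which matters at a billion rows.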