MySQL – BETWEEN query on two indexed columns takes exponentially longer with minimal changes to LIMIT

innodb, MySQL, optimization, performance, query-performance

This is a minimal example, but I have reproduced the problem with this exact structure. I have the following table:

| column name    | type        |
|----------------+-------------|
| id             | PRIMARY KEY |
| id_start       | INT         |
| id_end         | INT         |
|----------------+-------------|

This table contains roughly one billion (1,000,000,000) rows. The id_start and id_end columns are indexed. The table uses InnoDB with B-tree indexes.

Some background information: this is sensor data with start and end timestamps. The timestamps themselves are kept in another table, which is referenced from this one, so I only work with the id columns here.

The following query returns around 15,000 rows:

SELECT * FROM table WHERE 310000 BETWEEN id_start AND id_end;

The problem is that the query takes very long to return anything (MySQL Workbench reports a duration of 1 sec and a fetch time of 220 sec).

If I append LIMIT 14000 to the query, it returns almost immediately. The closer the LIMIT parameter gets to the expected 15,000 resulting rows, the longer the query takes.

What could cause fetching 15,000 rows to take so much longer than fetching 14,000?

Some more information which may or may not be relevant to this question:

  • The DB Server is well configured (InnoDB buffers etc.).
  • The DB Server has way more CPU power and memory than required. The whole dataset fits into memory.
  • None of the columns in the table are nullable.

As requested here is additional information regarding the table structure and the query.

CREATE TABLE `table` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `id_start` int(11) NOT NULL,
  `id_end` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_end` (`id_end`),
  KEY `idx_start` (`id_start`),
  KEY `idx_start_end` (`id_start`,`id_end`),
  KEY `idx_end_start` (`id_end`,`id_start`)
) ENGINE=InnoDB;

And here the EXPLAIN statement for the query:

+----+-------------+-------+------------+-------+-------------------------------------------------+---------------+---------+------+----------+----------+---------------+
| id | select_type | table | partitions | type  |                  possible_keys                  |      key      | key_len | ref  |   rows   | filtered |     Extra     |
+----+-------------+-------+------------+-------+-------------------------------------------------+---------------+---------+------+----------+----------+---------------+
|  1 | SIMPLE      | table | NULL       | range | "idx_start,idx_end,idx_start_end,idx_end_start" | idx_end_start |       9 | NULL | 20509120 |    50.00 | "Using where" |
+----+-------------+-------+------------+-------+-------------------------------------------------+---------------+---------+------+----------+----------+---------------+

Here is the EXPLAIN FORMAT=JSON result for the query with a low LIMIT:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "27795648.61"
    },
    "table": {
      "table_name": "table",
      "access_type": "range",
      "possible_keys": [
        "idx_start",
        "idx_end",
        "idx_start_end",
        "idx_end_start"
      ],
      "key": "idx_end_start",
      "used_key_parts": [
        "id_end"
      ],
      "key_length": "9",
      "rows_examined_per_scan": 19854034,
      "rows_produced_per_join": 9927018,
      "filtered": "50.00",
      "index_condition": "(310300 between `table`.`id_start` and `table`.`id_end`)",
      "cost_info": {
        "read_cost": "25810244.97",
        "eval_cost": "1985403.64",
        "prefix_cost": "27795648.61",
        "data_read_per_join": "4G"
      },
      "used_columns": [
        "id",
        "id_start",
        "id_end"
      ]
    }
  }
}

And here is the EXPLAIN FORMAT=JSON result with a high LIMIT:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "27203639.01"
    },
    "table": {
      "table_name": "table",
      "access_type": "range",
      "possible_keys": [
        "idx_start",
        "idx_end",
        "idx_start_end",
        "idx_end_start"
      ],
      "key": "idx_end_start",
      "used_key_parts": [
        "id_end"
      ],
      "key_length": "9",
      "rows_examined_per_scan": 19431170,
      "rows_produced_per_join": 9715586,
      "filtered": "50.00",
      "index_condition": "(310300 between `table`.`id_start` and `table`.`id_end`)",
      "cost_info": {
        "read_cost": "25260521.78",
        "eval_cost": "1943117.23",
        "prefix_cost": "27203639.01",
        "data_read_per_join": "4G"
      },
      "used_columns": [
        "id",
        "id_start",
        "id_end"
      ]
    }
  }
}

Best Answer

I see two questions:

  • What's with LIMIT affecting performance?
  • (The real question.) How to speed up this kind of query on this kind of table.

This is a classic "tough task" in MySQL -- finding elements between a 'start' and 'end' value.

I address it with some extra complication in http://mysql.rjweb.org/doc.php/ipranges . However, it requires that the ranges be non-overlapping. (It is unclear whether your ranges will overlap.)
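The core of the non-overlapping-ranges trick can be sketched directly against the question's schema. If at most one range can contain the target value, it must be the range with the largest id_start that is still <= the target, so a single-row index probe replaces the huge range scan. This is only a sketch under that non-overlap assumption; the table and column names follow the question:

    -- Valid ONLY if ranges never overlap: then at most one row can contain
    -- 310000, namely the one with the largest id_start <= 310000.
    SELECT t.*
    FROM ( SELECT *
           FROM   `table`
           WHERE  id_start <= 310000
           ORDER  BY id_start DESC
           LIMIT  1                    -- idx_start: touches O(1) rows
         ) AS t
    WHERE t.id_end >= 310000;          -- verify the candidate actually contains it

With overlapping ranges this returns at most one of possibly many matches, which is why the linked article needs the extra machinery.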

You say that id_start is really a pointer to another table where the actual timestamp is held? And yet you use WHERE 310000 BETWEEN id_start AND id_end. This implies that id_start and id_end are chronologically ordered, though not actual representations of "time"? And ditto for 310000? Perhaps this is done to use 4-byte INTs instead of larger TIMESTAMP(6) values?

With EXPLAIN FORMAT=JSON SELECT ..., we might be able to figure out why changing the LIMIT value affects performance.

How evenly are the times distributed? Here's a guess at the difference you are seeing: With a low value for LIMIT, the Optimizer's statistics suggest that a certain query plan will find the 14K quickly, but with the larger value, it will deduce that it may need to search far to find 15K rows. The Optimizer has no concept of "time ranges". Hence it can only use INDEX(start) or INDEX(end); it does not know that start<end and other things that you know and could use to optimize the query.
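One way to hand the Optimizer the knowledge it lacks: if the ranges have a known maximum length, an extra lower bound on id_start turns the open-ended half of the BETWEEN into a finite index range. A sketch, where the maximum length (10,000 here) is an assumption you would have to verify against your data:

    -- @max_len is an assumed upper bound on (id_end - id_start) across all rows.
    -- If any range is longer, this query silently misses rows.
    SET @max_len := 10000;

    SELECT *
    FROM   `table`
    WHERE  id_start BETWEEN 310000 - @max_len AND 310000  -- finite scan on idx_start_end
      AND  id_end >= 310000;

Instead of scanning all ~20M rows with id_end >= 310000, the range scan is now bounded by @max_len, independent of the LIMIT value.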

Do you need id for anything else? Are there other columns in the table? Both of these are important questions, since the following optimization may fail, depending on your answers.

For this 3-column table, I would consider one of the following alternatives:

CREATE TABLE x (
    start_id INT UNSIGNED NOT NULL,
    end_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY(start_id, end_id),  -- assumed unique
    INDEX      (end_id, start_id)
) ENGINE=InnoDB;
-- assumes start-end pairs are unique
-- with a billion rows, fewer indexes is beneficial

CREATE TABLE x (
    start_id INT UNSIGNED NOT NULL,
    end_id   INT UNSIGNED NOT NULL,
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT,
    PRIMARY KEY(start_id, end_id, id),  -- id assures uniqueness
    INDEX      (end_id, start_id),
    INDEX(id)  -- sufficient to keep AUTO_INCREMENT happy
) ENGINE=InnoDB;
-- does not assume start-end pairs are unique
-- allows for FK references to `id`

Notes:

  • Having the PK start with start_id may help with performance.
  • The fundamental performance problem of start..end queries is not solved by these schemas.
  • I have not assessed the benefit of these schemas if there are more columns.
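One concrete benefit of keeping the table to these three columns: every column lives in the secondary index (InnoDB secondary indexes implicitly carry the PK columns), so range queries can be answered from the index alone, with no lookups into the clustered index. A sketch against the second schema above:

    -- With PRIMARY KEY(start_id, end_id, id) and INDEX(end_id, start_id),
    -- the secondary index contains all three columns, so this query should
    -- show "Using index" (a covering index) in EXPLAIN:
    SELECT start_id, end_id, id
    FROM   x
    WHERE  310000 BETWEEN start_id AND end_id;

The scan itself is still large, but each examined row is a narrow index entry rather than a full clustered-index record fetch, which matters at a billion rows.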