MySQL not using index on large WHERE IN clause

indexMySQLperformance

I am trying to retrieve chunks of records from my table via a generated SELECT WHERE IN statement. If the statement is small, it will use the index, if the statement is large (1000+ IN items) it performs a full table scan. The table is 17 million rows long, so this gets expensive when I only want to look for 10k records.

I have a unique index covering (in this order) date, world, player, city_id. I also have an index for the date and player. The table is also partitioned by world.

Example query:

SELECT * FROM mytable 
WHERE 
    (date, world, player, city_id) 
IN ( 
    ('2017-05-30',4,'player',15466761),
    ('2017-05-30',4,'player',567599046),
    ('2017-05-30',4,'player2',34567866),
    ('2017-05-30',4,'player',2342467),
    ('2017-05-30',4,'player3',92341268),
    ... etc
)

If the query is small, it will use the index, if the query is large it performs a full table scan. Even adding FORCE INDEX (unique_index) doesn't do anything for the large queries.

Execution plan output:

 "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "20928166.60"
    },
    "table": {
      "table_name": "city_data",
      "partitions": [
        "w3",
        "w4",
        "w5",
        "w6",
        "w7"
      ],
      "access_type": "ALL",
      "possible_keys": [
        "unique_index"
      ],
      "rows_examined_per_scan": 17440138,
      "rows_produced_per_join": 8720069,
      "filtered": "50.00",
      "cost_info": {
        "read_cost": "19184152.80",
        "eval_cost": "1744013.80",
        "prefix_cost": "20928166.60",
        "data_read_per_join": "3G"
      }
    }

SHOW CREATE TABLE for city_data:

CREATE TABLE `city_data` (
   `date` date NOT NULL,
   `world` int(3) NOT NULL,
   `player` varchar(45) NOT NULL,
   `alliance` varchar(45) NOT NULL,
   `score` int(11) NOT NULL,
   `continent` int(11) NOT NULL,
   `castle` tinyint(1) NOT NULL,
   `water` tinyint(1) NOT NULL,
   `temple` tinyint(1) NOT NULL,
   `x_coord` int(11) NOT NULL,
   `y_coord` int(11) NOT NULL,
   `city_name` varchar(45) NOT NULL,
   `city_id` int(20) NOT NULL,
   `entry_id` int(55) NOT NULL AUTO_INCREMENT,
   PRIMARY KEY (`entry_id`,`world`),
   UNIQUE KEY `unique_index` (`date`,`world`,`player`,`city_id`) USING BTREE,
   KEY `world_index` (`world`) USING BTREE,
   KEY `date_index` (`date`) USING BTREE,
   KEY `player_index` (`player`(8))
 ) ENGINE=MyISAM AUTO_INCREMENT=17999069 DEFAULT CHARSET=utf8
 /*!50100 PARTITION BY LIST (`world`)
 (PARTITION w3 VALUES IN (3) ENGINE = MyISAM,
  PARTITION w4 VALUES IN (4) ENGINE = MyISAM,
  PARTITION w5 VALUES IN (5) ENGINE = MyISAM,
  PARTITION w6 VALUES IN (6) ENGINE = MyISAM,
  PARTITION w7 VALUES IN (7) ENGINE = MyISAM) */

How can I get it to use the index instead of relying on a full table scan?

I have read this post: https://stackoverflow.com/questions/586381/mysql-not-using-indexes-with-where-in-clause

Best Answer

I think there is an issue with large IN clauses and PARITIONing -- Beyond some not-so-large-number, the Optimizer say "punt, I'll just send this query to every partition". (This may be part of what you are seeing.)

But more than that. Picking partitions is costly, not cheap. If you end up with all the partitions, the query will be slower than if you did not have partitioning. (I recommend removing partitioning. I have found no use case where BY LIST helps performance.)

Most importantly... "Tuples" are not optimized. That is, the Optimizer does a terrible job with WHERE (a,b) IN ((1,2), (3,4)), regardless of the indexes. Don't use it if you want performance. (Or you could plead for the optimization feature at bugs.mysql.com).) (This is the real answer to the question.)

So, what to do? One approach is to create a temp table with all those tuples. Then JOIN from that tmp table to the real table and have a composite index (as you have with unique_index) to make the lookups fast.

Some other notes.

  • The number in INT(55) means nothing. INT is always 4 bytes, with a range up to a few billion. world int(3) should probably be TINYINT UNSIGNED (0..255) or SMALLINT UNSIGNED (0..65535).
  • "Prefix" indexing is rarely beneficial: KEY player_index (player(8)); remove the (8).
  • KEY date_index (date) is redundant since there is another index (unique_index) starting with the same column.
  • Recommend moving to InnoDB before MyISAM is deprecated.