Mysql – Why index on low selectivity column hurts query performance

indexMySQLperformanceselect

The table is simple:

CREATE TABLE `t1` (
  `ID` int NOT NULL AUTO_INCREMENT,
  `name` int DEFAULT '1',
  `LastName` int DEFAULT '0',
  UNIQUE KEY `idx_ID` (`ID`) USING BTREE,
  KEY `idx_name` (`name`)
) ENGINE=InnoDB;

DELIMITER $$
DROP PROCEDURE IF EXISTS populate_t1;
CREATE PROCEDURE populate_t1(IN num INT)
BEGIN
    DECLARE COUNT INT DEFAULT 0;
    WHILE COUNT < num DO
        INSERT INTO t1(`name`,`LastName`) VALUES(round(rand()+5, 0), round(rand()*10000, 0));
        SET COUNT = COUNT + 1;
    END WHILE;
END $$
DELIMITER ;

CALL populate_t1(1000000);
CREATE INDEX idx_name ON t1 (`name`);
OPTIMIZE TABLE t1;

The column name has very low selectivity (i.e., only 2 distinct values). Query

> SELECT * FROM t1 WHERE Name=5;
> OK, 499845 records, Time: 0.840s

would be slower than the query which use a full table scan

> DROP INDEX idx_name ON t1;
> OK 
> SELECT * FROM t1 WHERE Name=5;
> OK, 499845 records, Time: 0.258s

This performance drop is expected, but I want to figure out why.

I know that the index is like an English dictionary index in the first several pages of an English dictionary. And the index is stored in a B+tree in my example I suppose. So with an index on column name, MySQL would know which rows satisfy name=5. Then MySQL fetches these rows into the result set. I think the process above would be quicker than scanning the whole table (but I am wrong). So what I want to ask is which process makes (e.g., loading index from disk to memory, searching in the B+tree, etc.) the index scan slower than the full table scan.

Further, I know that if the table is larger, the performance difference would be larger (I still want to know why).

I would be very grateful for your help!
Thanks!

Best Answer

Access to a row using a non-clustered index, which idx_name is, requires extra random I/O: b-tree lookup finds the clustered (primary key) index value, then you need to go and fetch the actual row from the clustered index.

The alternative is a sequential scan of the clustered index itself, which does not incur that extra I/O cost and is also simply more efficient by being sequential, not random.