MySQL Index Creation Internals

indexinnodbMySQLmysql-5.6

Right now there are two ways to build an index of a table in MySQL:

  1. Create the table structure first, then import the data and add indexes later.
  2. Create the table structure with indexes and then import the data.

In the first process we will have contiguous data (all fields) pages and then we have index pages. So when we query using index, MySQL has to first load the index pages and find out the matching keys and have to look up those primary key on the data pages. For this it has to then again load the data pages to get the data. This is useful when we have larger index scans as we have all index loaded contiguously.

In the second way of index creation the filtered index page will most probably contain the data page near to it as they were created at the same time. So I guess look up will be faster for a small range scans.

Is my understanding correct?

UPDATE:

I should have mentioned that the "PRIMARY KEY is enabled" (an auto increment id column) in the first way of importing data. So internal rowid is not generated and lot of IO is saved as we are not going to add a PRIMARY KEY.

As you noted there is fragmentation when we import data using the Second method.

Considering my requirements is for larger range scans (scanning ~100M rows) I guess I will go with the first way of importing data.

update on Jun 8 11:30

CREATE TABLE `table_dummy` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `column1` bigint(20) DEFAULT NULL,
  `column2` bigint(20) DEFAULT NULL,
  `column3` bigint(20) DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,
  `column4` tinyint(1) DEFAULT NULL,
  `column5` tinyint(4) DEFAULT NULL,
  `column6` bigint(20) DEFAULT NULL,
  `column6_created_at` datetime DEFAULT NULL,
  `column7` int(11) DEFAULT NULL,
  `column8` tinyint(1) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `twtaccount_id_2` (`column1`,`column2`),
  KEY `twtaccount_id` (`column1`,`created_at`,`column5`),
  KEY `twt_user_id` (`column3`,`created_at`,`column5`),
  KEY `original_status_id` (`column6`,`column3`,`followers_count`)
)

ENGINE is INNODB

this table has about 800M records and I took a dump and imported it in two ways.

Method 1: table was created with just the primary key and entire data was imported. Then I gave an alter statement to add the remaining indexes.

Method 2: The table was created with indexes and the dump was done. This took much larger time.

P.S The size of the resulting table through 'Method 1' was lesser by ~30GB when compared to size of table dumped through 'Method 2'. Method 1 took very less time than 'Method 2', almost twice the speed.

My primary concern is the performance of the table when I make a select scanning a wide range of index ('index only scan').

Best Answer

There are a few things I can clarify for you here:

  • Yes, it is a good practice to delay secondary index creation until after you import the data (starting from MySQL 5.5 - not before). Mysqlpump does this by default.

  • When you delay secondary index creation, internally MySQL will read, sort and then create the index (reducing fragmentation). For MySQL 5.7 there are additional optimizations here.

  • When you trickle load indexes, they may have more page splits and a lower page fill efficiency (fragmentation). This depends a little on the data, and if it is in order. You are not giving me many clues with column1, column2, but say for example column1 was a timestamp it would probably be in order.

  • Pages in an index are logically in order, not necessarily physically. This distinction may matter if you can not fit your working set for ranges in memory and are using spinning disks for backing storage. It's also a little hard to answer because with spinning disks you are probably using RAID with some stripe size, and the blocks at the filesystem level may also be non-contiguous.

  • Be aware also that the optimizer may not consider the range scan for large ranges over a tablescan (since prior to 5.7 the cost model assumed pages were not in memory). If you can rely on this being the case, you may want to FORCE INDEX and compare actual run times. (More info on new cost model.)