MySQL InnoDB – Improving Primary Key Efficiency

innodbMySQL

I've read a number of articles regarding efficiency of primary keys depending on the storage engine, and I am left confused.

Given a simple many-to-many table with two fields, storeId and zoneId, which one of the following designs is the most efficient with InnoDB and why?

  • using the two fields as a composite primary key:
CREATE TABLE store_zone(
    storeId INT(10) UNSIGNED NOT NULL,
    zoneId  INT(10) UNSIGNED NOT NULL,
    PRIMARY KEY(storeId, zoneId)
);
  • Using a specific auto increment primary key:
CREATE TABLE store_zone(
    id      INT(10) UNSIGNED NOT NULL AUTO_INCREMENT,
    storeId INT(10) UNSIGNED NOT NULL,
    zoneId  INT(10) UNSIGNED NOT NULL,
    PRIMARY KEY(id),
    UNIQUE KEY(storeId, zoneId)
);

Notes:

  • I need a unique key on the (storeId, zoneId) pair anyway
  • I have foreign keys to the store and zone tables, not shown here for readability, so in both cases there's an extra required index on zoneId as well

Best Answer

Analysis

Since we are talking about InnoDB, let's zero in on the gen_clust_index. It's a special index that keeps the PRIMARY KEY and associated row data accessible from the same InnoDB pages.

According to MySQL Documentation on the gen_clust_index

  • If you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index.

  • If you do not define a PRIMARY KEY for your table, MySQL picks the first UNIQUE index that has only NOT NULL columns as the primary key and InnoDB uses it as the clustered index.

  • If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.

This being the case, you would really feel efficiency or deficiency in the following areas:

DiskSpace

Having the single auto_increment column as the PRIMARY KEY keeps the overall size of the PRIMARY KEY smaller that that of having two columns. Why? BTREE pages would be twice as big if the PRIMARY KEY was two INTs instead of one. This becomes even more painful if you use foreign key constraints and secondary indexes as they must subsequently blow up in size as well.

In this instance, you would choose the second schema for better index usage.

Insert Performance

Inserting row data into an InnoDB table that has two UNIQUE indexes calls for twice as much BTREE management and unique checks.

  • One for the gen_clust_index (dictated by auto_increment)
  • One for the other UNIQUE index (dictated by (storeId, zoneId))

Additional microseconds for each additional UNIQUE index would add up in CPU time when inserting millions of rows.

In this instance, you would choose the first schema for faster INSERTs.

Conclusion

The diskspace and insert performance issues pretty much force you to choose. Obviously, you choose which one to live with. If you use foreign key constraints that represent (storeId, zoneId), then you need the second schema using the auto_increment as the reference in outside tables.