MySQL InnoDB – Performance for B+Tree Index with Many Duplicate Values

Tags: b-tree, index, innodb, mysql

I'm trying to diagnose seemingly random performance issues with our database server. Below is a simplified scenario, hopefully generic enough to serve as a useful future reference for anyone looking for the same answer.

Suppose I have a (MySQL 5.6 w/ InnoDB) table like

CREATE TABLE Example (
    id INT NOT NULL AUTO_INCREMENT,
    secondary_id INT DEFAULT NULL,
    some_data TEXT NOT NULL,
    PRIMARY KEY (id),
    KEY (secondary_id)
) ENGINE=InnoDB;

with about 15 million rows. However, the secondary_id column is NULL for almost all rows, so the index on secondary_id has very low cardinality (roughly 30k non-NULL values in our case). When we experience the performance issue I'm investigating, the process list for the server shows many (100+) queries of the form:

UPDATE Example SET secondary_id = NULL, some_data = '...' WHERE id = 123;

that are taking ~90+ seconds to complete, during which they are in the "updating" state. (These queries would be running in separate transactions.)

I'm specifically wondering if the transition from a not-null secondary_id to a null secondary_id is causing performance slowdowns in the above UPDATE. That is, is it possible that updating the index in this case takes a significant amount of time since there are so many rows (~15mil) that have the same value for that column (NULL)?

I guess this question stems from my not understanding how a B+Tree index stores row pointers for rows with duplicate index values. I'd guess that the node holds a linked list (or something similar) with a fast insertion time, so my guess at the answer to my question is "No". But I'd like to confirm that with the experts, i.e. you all.

I've tried to do a fair amount of research here but have come up pretty empty-handed. Probably the most comprehensive post is this one, which explains several different techniques for handling duplicate keys, but I'm specifically looking for InnoDB/MySQL's approach.

Best Answer

90 seconds for a single UPDATE is far too long. There is probably some blocking involved, and it should be investigated.
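To check for that blocking, you can inspect the InnoDB transaction and lock-wait tables that MySQL 5.6 exposes in information_schema. A sketch (run it while the slow UPDATEs are in the "updating" state):

```sql
-- Which transactions are waiting, and which transactions block them
SELECT r.trx_mysql_thread_id AS waiting_thread,
       r.trx_query           AS waiting_query,
       b.trx_mysql_thread_id AS blocking_thread,
       b.trx_query           AS blocking_query
FROM information_schema.innodb_lock_waits w
JOIN information_schema.innodb_trx r
  ON r.trx_id = w.requesting_trx_id
JOIN information_schema.innodb_trx b
  ON b.trx_id = w.blocking_trx_id;
```

If this returns rows, the 90 seconds is time spent waiting on locks, not time spent maintaining the index. (`SHOW ENGINE INNODB STATUS` gives similar information in free-text form.)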

Apart from that, having a column where 98% of the rows share the same (NULL) value does not sound good either. You should consider moving that column to a separate table (which would have only ~30K rows). It would complicate your INSERT/DELETE/UPDATE procedures a bit, but you would probably gain from the much smaller index. Suggested design:

CREATE TABLE Example (
    id INT NOT NULL AUTO_INCREMENT,
    some_data TEXT NOT NULL,
    PRIMARY KEY (id)
) ENGINE = InnoDB ;

CREATE TABLE Example_secondary (
    id INT NOT NULL,
    secondary_id INT NOT NULL,
    PRIMARY KEY (id),
    INDEX (secondary_id),
    FOREIGN KEY (id)
      REFERENCES Example (id)
) ENGINE = InnoDB ;

Then your UPDATE:

UPDATE Example 
SET secondary_id = NULL, 
    some_data = '...' 
WHERE id = 123 ;

would become:

BEGIN ;
    UPDATE Example 
    SET some_data = '...' 
    WHERE id = 123 ;

    DELETE FROM Example_secondary 
    WHERE id = 123 ;
COMMIT ;
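For completeness, the opposite transition (setting secondary_id to a non-NULL value) would become an upsert into the side table under this design. A sketch, with the value 456 chosen arbitrarily for illustration:

```sql
BEGIN ;
    UPDATE Example
    SET some_data = '...'
    WHERE id = 123 ;

    -- insert the side-table row, or update it if one already exists
    INSERT INTO Example_secondary (id, secondary_id)
    VALUES (123, 456)
    ON DUPLICATE KEY UPDATE secondary_id = VALUES(secondary_id) ;
COMMIT ;
```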