MySQL Index Optimization – Why Adding a Secondary Index Didn’t Improve Select Performance

indexMySQLoptimization

I have a table in MySQL called messages, which looks like this :

id (primary key)   | description    |  created_at(timestamp) 

This table is supposed to hold chat messages between users of my application. As a result, the number of write operations on it will be high.

The application has an API that returns all the messages between two time stamps. This query will be fairly common too, but less than the number of write operations.

I ran mysqlslap with 100 concurrent connections and around 15000 rows in that table, and got a total time of around 8.6 seconds.

Then I added a secondary index on created_at, hoping to get the results in much less time for my search between two times, but I got an increase of 0.3 seconds for the same inputs.

Why did I not see a significant performance increase?

Edit:

this is what my table looks likes :

DROP TABLE IF EXISTS `mssg`;

CREATE TABLE `mssg` (
  `id` INTEGER NULL AUTO_INCREMENT DEFAULT NULL,
  `body` MEDIUMTEXT NULL DEFAULT NULL,
  `length` VARCHAR NULL DEFAULT NULL,
  `created_at` TIMESTAMP NULL DEFAULT NULL,
  PRIMARY KEY (`id`)
);

and this is how i added the index:

ALTER TABLE testalter_tbl ADD INDEX (created_at);

Typically, there will be around 150 to 350 messages returned from the select statements

Best Answer

I'll skip my comments about that CREATE TABLE statement that would be classified from wrong to insulting, like having all columns declared NULL (what would a row with (NULL, NULL, NULL, NULL) mean?) and concentrate on the question asked.

The difference between 8.6 and 8.9 seconds is negligible. For all we know, the running times of the 2 tests are almost identical. If you want to test properly, you should run longer tests and with larger size of the table (try with 15K, 150K, 1500K, 15M rows) and see whether the efficiency changes or stays the same between having or not the specific index.

You should also examine the execution plans created for the queries in the 2 cases (at various table sizes), to see what is happening behind the scenes, what indexes are used, if any, etc.

My prediction would be that as the table grows larger, the index will get more useful, as the queries will require an index seek to find the (IDs of the) rows in the wanted time range and then some additional seeks from the clustering PK index (I assume you use InnoDB.)

Without the index, a whole table scan will have to be done each time. The larger the table gets, the biggest the difference you'll see between the 2 test cases (with and without the index).


If the vast majority of your queries is similar to the following, getting rows from a small time range:

SELECT <columns_wanted>
FROM mssg
WHERE created_at >= @start_timestamp
  AND created_at < @end_timestamp ;

then I would suggest (assuming you use InnoDB) an alternative design: make the (created_at) the clustering key of the table. Since you may have 2 or more rows with exact same timestamp, you can set the primary key to (create_at, id) for this effect (as InnoDB allows only unique keys for clustering). You will also need an additional index on (id) for the AUTO_INCREMENT to be allowed. This can be declared UNIQUE or not, it's your choice:

CREATE TABLE mssg (
  mssg_id INTEGER NOT NULL AUTO_INCREMENT,
  body MEDIUMTEXT NULL,
  length VARCHAR(20) NULL,
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (created_at, mssg_id),
  -- UNIQUE
    INDEX mssg_id_ix (mssg_id)             -- this is needed for the AutoIncrement
);

This way, all the rows that your queries need will be stored in consecutive pages of the clustering (PK) index and the efficiency of the queries - as long as they ask for a small time range - will not be affected by the size of the table.


Further comments, partially related to the question;

  • Why are all columns declared as NULL? I suggest you change them all to NOT NULL except for those columns that there is a specific reason to allow nulls.
  • Why mediumtext for the message? Are you going to store so large messages? It might be better for performance, if you could reduce the size, to say VARCHAR(250).
  • Why is the length declared as VARCHAR and not INTEGER?
  • Further, if it's going to be storing the length of the text, it is not needed at all. You can get the length at any time using the proper text function.
  • You could declare the TIMESTAMP column to get a default CURRENT_TIMESTAMP if you want (if it is not already provided by the application.)