Mysql – Innodb primary key advice

indexinnodbMySQLprimary-key

We have a database table growing at a rate of ~20 million records per month, we're expecting this to double shortly.

The existing schema has a uuid4 as a primary key, and then two additional indexes on two other uuid4 columns. These columns are searched frequently for diagnosing issues. We can't change the UUID4 to something else as they are populated by an application we can't change.

We're planning on creating a new table and then back populating the data required from the old table into this one, fortunately, there's no real-time requirement for historic data.

The basic structure of the table is as follows

Column Type
Started Date Time ( will always be increasing )
app_id_one uuid version 4
app_id_two uuid version 4
app_id_three uuid version 4

My questions are as follows:

  1. As we don't control the uuid generation process does (started, app_id_one ) make sense as a primary key for the above structure, are there similar performance issues using uuid version 4 as the second column in a composite key?
  2. We regularly have to search on app_id_two/app_id_three but I assume even have a secondary index of type uuid4 is bad for performance, so I am thinking we should also index these (started, app_id_two), (started, app_id_three )?
  3. If we do option 2 this leads to a duplicate of the date time in the secondary index ( due to the chosen primary key ), space isn't too much of an issue we've got lots available on the servers, but would it be better just to generate a more suitable primary key instead?
  4. We're considering populating this new table via a trigger as changing the application that populates the existing table is also not without its challenges. I'm worried about this adding to our already struggling write performance, but if I remove the two indexes from the existing table ( app_id_two / app_id_three ) I assume we'll get more than enough performance back from doing so?

Thank you!

Best Answer

Simply put, as the table (or index) grows, the performance approaches a minimum of one disk hit per lookup.

This is because UUIDs (other than a version 1 UUID with the bits suitably shuffled) are effectively "random". That is whatever is in the buffer_pool cache is approaches being useless for the next lookup.

You seem to have 3 such indexes; this means that even inserting one row is threatening to require 3 disk hits! 20M new rows per month = 8 rows per second = upwards of 24 disk hits per second just for the insert traffic. You could tackle this with a huge amount of RAM or by avoiding the 3 UUIDs.

Similarly, any SELECT that uses an INDEX starting with one of those uuids will incur 1 disk hit.

Keep in mind (for InnoDB) that the PRIMARY KEY is a big BTree with all the data, sorted by the PK. And a secondary index is a smaller BTree with the key column(s) plus the PK's column(s). So, some things stated about the PK and the seocndary key are interchangeable.

Of note,... When looking up a row by a secondary key, two BTree lookups are performed, one per BTree involved. If both are UUIDs that could mean 2 disk hits. If you have started, and used composite indexes starting with started, and most queries reached for data around the same time, then you get "locality of reference benefits. That is, everything might be cached and you won't need disk hits.

All of this leads to the beauty of using UUID, version 1, and rearrangin the bits so that the "time" part is in the high order bits.

(For your Q2 and Q3) Assuming you now have

PRIMARY KEY(started),
INDEX(app_id_one),
INDEX(app_id_two),
INDEX(app_id_thre),

And assuming you are using InnoDB, then you already have started implicitly tacked onto the end of the 3 secondary indexes. If, instead, you had

PRIMARY KEY(started),
INDEX(app_id_one),
INDEX(started, app_id_two),
INDEX(started, app_id_thre),

then the indexes would take exactly the same amount of space. (OK building a BTreee leads to some unpredictables variations in "fragmentation", etc.)

Using a DATETIME as the PK is risky unless you are very sure you won't have any duplicates. Changing the PK to INT or BIGINT (presumably AUTO_INCREMENT) would not change the principles here; just a slight difference in space needed.

pt-online-schema-change or gh-ost or MySQL 8.0 can change the existing table's indexes with minimal interruption.

I don't understand how you can change from the current WHERE app_id_two = ... to WHERE started = ... AND app_id_two = .... If you currently have started available, then just use it and get rid of the other columns??