MySQL indexes maintenance

fragmentationindex-maintenanceMySQL

I made a lot of research about how to maintain indexes in MySQL to prevent fragmentation and to optimize somehow the execution of some queries.

I am familiar with that formula that calculates the ratio between the max space available for a table VS the space used by data and indexes.

However my main questions are still unanswered. Perhaps this is due to the fact that I am familiar with index maintenance in SQL Server, and I tend to think that in MySQL it should be somehow similar.

In SQL server, you can have several indexes, and each one of it can have different levels of fragmentation. Then you can pick up one and perform a 'REORGANIZE' or 'REBUILD' operation in that particular index, without affecting the rest.

To the best of my knowledge, there is no 'table fragmentation' as such, and SQL Server doesn't provide any tool to fix the 'table fragmentation'. What it does provide are tools to check index fragmentation (understood like the ratio between the number of pages used by an index VS the fullness of that page and contiguity), as well as the internal and external fragmentation.

All of that is quite straightforward to understand, at least for me.

Now, when it comes the turn to maintain indexes in MySQL, there only exist the concept of 'table fragmentation, as mentioned above.

A table in MySQL can have several indexes, but when I check the 'fragmentation ratio' with that famous formula, I don't see the fragmentation of each index, but the table as a whole.

When I want to optimize the indexes in MySQL, I don't choose a particular index to operate on (as in SQL Server). Instead, I do an 'OPTIMIZE' operation in the whole table, which presumably affects all the indexes.

When the table is optimized in MySQL, the ratio between the space used by data + indexes VS the overall space is reduced, which suggest some kind of physical re-organization in the hard drive, which translates into a reduction of the physical space. However, index fragmentation is not only about physical space, but the structure of the tree that has been changed over the time due to inserts and updates.

Finally, I got a table in InnoDB/MySQL. That table has 3 million records, 105 columns and 55 indexes. It is 1.5GB excluding indexes, which are 2.1GB.

That table is being hit thousands of times ever day for updating, insertion (we don't actually delete records).

That table has been created years a go and I know for sure that nobody is maintaining indexes whatsoever.

I was expecting to find a huge fragmentation in there, but when I perform the fragmentation calculation as prescribed

free_space / (data_length + index_length)

it turns out that I have only a 0.2% fragmentation. IMHO that is quite unrealistic.

So the big questions are:

  1. How do I check fragmentation of a particular index in MySQL, not the table as a whole
  2. Does OPTIMIZE TABLE actually fix the internal / external fragmentation of an index as in SQL Server?
  3. When I optimize a table in MySQL, does it actually rebuilds all the indexes on the table?
  4. Is it realistic to think that reducing the physical space of an index (without rebuilding the tree itself) actually translates into a better performance?

Best Answer

Index fragmentation is much overrated. Do not worry about it.

Two adjacent, somewhat-empty, blocks are merged together by InnoDB as the natural processing.

Random actions on a BTree cause it to naturally gravitate toward an average of 69% full. Sure, this is not 100%, but the overhead of "fixing" it is not worth it.

SHOW TABLE STATUS gives you some metrics, but they are flawed -- "Data_free" includes certain "free" space, but not other "free" space.

There is unused space in each block; free 16KB blocks; free "extents" (nMB chunks); MVCC rows waiting to be reaped; non-leaf nodes have their own fragmentation; etc.

Percona and Oracle have different ways of looking at how big (number of blocks) an index is. I find neither of them to be useful because of the limited definition of "free". It seems that blocks (16KB each) are allocated in chunks (several MB), thereby leading one to believe that there is all sorts of fragmentation. In reality, it is usually just most of one of these multi-MB chunks. And OPTIMIZE TABLE does not necessarily recoup any of the space.

If SQL Server is using BTrees, then it is lying to say that there is "no fragmentation". Think of what happens on a "block split". Or think of the overhead of continually defragmenting. Either way you lose.

Further note that a table and an index are essentially identical structures:

  • B+Tree, based on some index
  • The "data" is based on the PRIMARY KEY; each secondary index is a B+Tree based on its index.
  • The leaf node of the "data" contains all the columns of the table.
  • The leaf node of a secondary index contains the columns of that secondary index, plus the columns of the PRIMARY KEY.

If you have innodb_file_per_table = ON, you can clearly see the shrinkage (if any) after OPTIMIZE TABLE by looking at the .ibd file's size. For OFF, the info is buried in ibdata1, but SHOW TABLE STATUS may be reasonably accurate since all "free" space belongs to every table. Well, except for the pre-allocated chunks.

You may notice that a freshly optimized file-per-table table has exactly 4M, 5M, 6M, or 7M of Data_free. Again, this is the pre-allocation, and the failure to give you the minute details.

I have worked with InnoDB for over a decade; I have worked with thousands of different tables, large and small. I say that only one table in a thousand really needs OPTIMIZE TABLE. Using it on other tables is a waste.

105 columns is a lot, but perhaps not too many.

Do you have 55 indexes on one table? That is bad. That is 55 updates per INSERT. Let's discuss that further. Keep in mind that INDEX(a) is useless if you also have INDEX(a,b). And INDEX(flag) is useless because of low cardinality. (But INDEX(flag, foo) may be useful.)

Q1: There is no good way to check for all forms of fragmentation in either the data or the secondary indexes.

Q2, Q3: OPTIMIZE TABLE rebuilds the table by CREATEing a new table and INSERTing all the rows, then RENAMEing and DROPping. The re-inserting of the data in PK order assures that the data is well-defragmented. The indexes are another matter.

Q4: You could DROP and reCREATE each index to clean it up. But this is an extremely slow process. 5.6 has some speedups, but I don't know if they help with defragmentation.

It is also possible to ALTER TABLE ... DISABLE KEYS, then ENABLE them. This may to a more efficient rebuild of all the secondary indexes at once.