MySQL index maintenance

fragmentation, index-maintenance, MySQL

I have done a lot of research on how to maintain indexes in MySQL, both to prevent fragmentation and to optimize the execution of some queries.

I am familiar with the formula that calculates the ratio between the maximum space available for a table and the space used by its data and indexes.

However, my main questions are still unanswered. Perhaps this is because I am familiar with index maintenance in SQL Server, and I tend to think that it should be similar in MySQL.

In SQL Server, a table can have several indexes, and each of them can have a different level of fragmentation. You can then pick one and perform a 'REORGANIZE' or 'REBUILD' operation on that particular index, without affecting the rest.

To the best of my knowledge, there is no 'table fragmentation' as such in SQL Server, and it doesn't provide any tool to fix 'table fragmentation'. What it does provide are tools to check index fragmentation (understood as the ratio between the number of pages used by an index and the fullness and contiguity of those pages), as well as internal and external fragmentation.

All of that is quite straightforward to understand, at least for me.

Now, when it comes to maintaining indexes in MySQL, only the concept of 'table fragmentation' exists, as mentioned above.

A table in MySQL can have several indexes, but when I check the 'fragmentation ratio' with that famous formula, I don't see the fragmentation of each index, but the table as a whole.

When I want to optimize the indexes in MySQL, I don't choose a particular index to operate on (as in SQL Server). Instead, I run an 'OPTIMIZE' operation on the whole table, which presumably affects all the indexes.

When the table is optimized in MySQL, the ratio between the space used by data + indexes and the overall space is reduced, which suggests some kind of physical reorganization on the hard drive that reduces the physical space used. However, index fragmentation is not only about physical space, but also about the structure of the tree, which changes over time due to inserts and updates.

Finally, I have an InnoDB table in MySQL. That table has 3 million records, 105 columns and 55 indexes. It is 1.5GB excluding indexes, which take another 2.1GB.

That table is hit thousands of times every day for updates and inserts (we don't actually delete records).

That table was created years ago, and I know for sure that nobody has been maintaining its indexes at all.

I was expecting to find a huge fragmentation in there, but when I perform the fragmentation calculation as prescribed

free_space / (data_length + index_length)

it turns out that I have only 0.2% fragmentation. IMHO, that is quite unrealistic.
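To make the arithmetic concrete, here is a minimal Python sketch of that ratio. It assumes the sizes quoted above are binary units, and the 7 MiB figure is a hypothetical Data_free value chosen only to show the order of magnitude; nothing here queries a real server.

```python
def fragmentation_ratio(data_free: int, data_length: int, index_length: int) -> float:
    """Return data_free / (data_length + index_length), the ratio quoted above."""
    used = data_length + index_length
    if used == 0:
        return 0.0  # empty table: nothing to compare against
    return data_free / used

GIB = 1024 ** 3
MIB = 1024 ** 2

# Sizes from the question, treated as binary units (an assumption).
data_length = int(1.5 * GIB)
index_length = int(2.1 * GIB)

# Free space that a 0.2% ratio corresponds to for a table of this size.
implied_free = 0.002 * (data_length + index_length)
print(f"implied free space: {implied_free / MIB:.1f} MiB")

# A hypothetical Data_free of 7 MiB lands in the same range.
ratio = fragmentation_ratio(7 * MIB, data_length, index_length)
print(f"ratio for 7 MiB free: {ratio:.4%}")
```

In other words, for a table of about 3.6GB a 0.2% ratio means only a few megabytes are being reported as free, so the figure depends entirely on what the server counts as "free".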

So the big questions are:

1. How do I check the fragmentation of a particular index in MySQL, not of the table as a whole?
2. Does OPTIMIZE TABLE actually fix the internal / external fragmentation of an index, as in SQL Server?
3. When I optimize a table in MySQL, does it actually rebuild all the indexes on the table?
4. Is it realistic to think that reducing the physical space of an index (without rebuilding the tree itself) actually translates into better performance?

Best Answer

Index fragmentation is much overrated. Do not worry about it.

InnoDB merges two adjacent, somewhat-empty blocks as part of its normal processing.

Random actions on a BTree cause it to naturally gravitate toward an average of 69% full. Sure, this is not 100%, but the overhead of "fixing" it is not worth it.
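The 69% figure can be checked with a toy simulation. The Python sketch below is not InnoDB's algorithm: it assumes uniformly random keys, insert-only traffic, leaf pages that hold a fixed number of keys, and a plain 50/50 split when a page overflows. The function name and parameters are made up for the illustration.

```python
import bisect
import math
import random

def simulate_leaf_fill(n_keys: int, capacity: int, seed: int = 0) -> float:
    """Insert random keys into leaf pages that split in half when full.

    Returns the average fill factor: keys stored / (pages * capacity).
    """
    rng = random.Random(seed)
    pages = [[]]                      # leaf pages in key order, each a sorted list
    lower_bounds = [float("-inf")]    # smallest key each page is responsible for
    for _ in range(n_keys):
        key = rng.random()
        # Find the page whose key range contains the new key.
        i = bisect.bisect_right(lower_bounds, key) - 1
        page = pages[i]
        bisect.insort(page, key)
        if len(page) > capacity:
            # Block split: move the upper half into a new page to the right.
            mid = len(page) // 2
            right = page[mid:]
            del page[mid:]
            pages.insert(i + 1, right)
            lower_bounds.insert(i + 1, right[0])
    return n_keys / (len(pages) * capacity)

fill = simulate_leaf_fill(n_keys=100_000, capacity=50)
print(f"average leaf fill: {fill:.3f}  (ln 2 = {math.log(2):.3f})")
```

The value to compare against is ln 2, which is where the "69%" comes from. Sequential inserts, deletes and page merges all shift the real number, so treat this only as a picture of why a randomly updated BTree does not stay 100% full.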

SHOW TABLE STATUS gives you some metrics, but they are flawed -- "Data_free" includes certain "free" space, but not other "free" space.

There is unused space in each block; free 16KB blocks; free "extents" (nMB chunks); MVCC rows waiting to be reaped; non-leaf nodes have their own fragmentation; etc.

Percona and Oracle have different ways of looking at how big (number of blocks) an index is. I find neither of them to be useful because of the limited definition of "free". It seems that blocks (16KB each) are allocated in chunks (several MB), thereby leading one to believe that there is all sorts of fragmentation. In reality, it is usually just most of one of these multi-MB chunks. And OPTIMIZE TABLE does not necessarily recoup any of the space.

If SQL Server is using BTrees, then it is lying to say that there is "no fragmentation". Think of what happens on a "block split". Or think of the overhead of continually defragmenting. Either way you lose.

Further note that a table and an index are essentially identical structures:

B+Tree, based on some index
The "data" is based on the PRIMARY KEY; each secondary index is a B+Tree based on its index.
The leaf node of the "data" contains all the columns of the table.
The leaf node of a secondary index contains the columns of that secondary index, plus the columns of the PRIMARY KEY.
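As a small illustration of the last point, this Python sketch lists what a secondary index leaf entry carries. The column names are hypothetical, and the sketch assumes that PRIMARY KEY columns already present in the index are not stored a second time.

```python
from typing import List, Sequence

def leaf_columns(primary_key: Sequence[str], index_columns: Sequence[str]) -> List[str]:
    """Columns in a secondary index leaf: the index columns, then any PK columns not already there."""
    extra = [col for col in primary_key if col not in index_columns]
    return list(index_columns) + extra

# Hypothetical table with PRIMARY KEY (id) and two secondary indexes.
print(leaf_columns(("id",), ("customer_id", "created_at")))
print(leaf_columns(("id",), ("status", "id")))
```

The practical consequence is that every secondary index pays for the width of the PRIMARY KEY, which matters when a table carries dozens of indexes.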

If you have innodb_file_per_table = ON, you can clearly see the shrinkage (if any) after OPTIMIZE TABLE by looking at the .ibd file's size. For OFF, the info is buried in ibdata1, but SHOW TABLE STATUS may be reasonably accurate since all "free" space belongs to every table. Well, except for the pre-allocated chunks.

You may notice that a freshly optimized file-per-table table has exactly 4M, 5M, 6M, or 7M of Data_free. Again, this is the pre-allocation, and the failure to give you the minute details.

I have worked with InnoDB for over a decade; I have worked with thousands of different tables, large and small. I say that only one table in a thousand really needs OPTIMIZE TABLE. Using it on other tables is a waste.

105 columns is a lot, but perhaps not too many.

Do you have 55 indexes on one table? That is bad. That is 55 updates per INSERT. Let's discuss that further. Keep in mind that INDEX(a) is useless if you also have INDEX(a,b). And INDEX(flag) is useless because of low cardinality. (But INDEX(flag, foo) may be useful.)
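With 55 indexes it is tedious to spot the INDEX(a) versus INDEX(a,b) case by eye. A minimal Python sketch of the check follows; the index names and column lists are hypothetical and would in practice be copied from SHOW CREATE TABLE. It only flags strict left prefixes, and it knows nothing about UNIQUE constraints, which are an assumption you must check yourself before dropping anything.

```python
from typing import Dict, Tuple

def redundant_prefix_indexes(indexes: Dict[str, Tuple[str, ...]]) -> Dict[str, str]:
    """Map each index whose columns are a strict left prefix of another index to that other index."""
    redundant: Dict[str, str] = {}
    for name, cols in indexes.items():
        for other, other_cols in indexes.items():
            if other == name or len(cols) >= len(other_cols):
                continue
            if other_cols[:len(cols)] == cols:
                redundant[name] = other   # 'other' can serve every lookup 'name' serves
                break
    return redundant

# Hypothetical index definitions: name -> ordered column list.
example = {
    "idx_a": ("a",),
    "idx_a_b": ("a", "b"),
    "idx_b": ("b",),
    "idx_flag_foo": ("flag", "foo"),
}
print(redundant_prefix_indexes(example))
```

Note that idx_b is not flagged: (b) is not a left prefix of (a, b), so that index is still needed for lookups on b alone.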

Q1: There is no good way to check for all forms of fragmentation in either the data or the secondary indexes.

Q2, Q3: OPTIMIZE TABLE rebuilds the table by CREATEing a new table and INSERTing all the rows, then RENAMEing and DROPping. The re-inserting of the data in PK order assures that the data is well-defragmented. The indexes are another matter.

Q4: You could DROP and reCREATE each index to clean it up. But this is an extremely slow process. 5.6 has some speedups, but I don't know if they help with defragmentation.

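It is also possible to ALTER TABLE ... DISABLE KEYS, then ENABLE them. This may do a more efficient rebuild of all the secondary indexes at once, although DISABLE KEYS only applies to non-unique indexes on MyISAM tables, so it does not help with InnoDB.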

Related Solutions

Sql-server – How to identify fragmentation level of the table data itself not the table indexes, and then defrag

Tables in SQL Server can be either organised with a clustered index or have no CI in which case they are a heap.

You need to look at sys.dm_db_index_physical_stats. Despite the name, this also analyses heaps (though logical fragmentation does not apply to these: pages cannot be out of logical order, as there is no "correct" ordering in a heap).

For tables with a CI the clustered index has an index_id of 1. The leaf level of the CI is the table. For a heap this is given an index_id of 0.

Sql-server – Different results rebuilding an index online and offline

This is by no means a full answer but may move things along a bit if you were to try something similar and report your results.

I couldn't reproduce your results. With the following test table

CREATE TABLE [dbo].[Table]
(
Col BIGINT
)

CREATE NONCLUSTERED INDEX IX ON [dbo].[Table](Col)

INSERT INTO [dbo].[Table]
SELECT top 12000 ROW_NUMBER() OVER (ORDER BY @@SPID)
FROM master..spt_values v1, master..spt_values v2

And multiple runs of the following script

USE FragTest;

DECLARE @DbccPage TABLE (
  ParentObject VARCHAR(255),
  Object       VARCHAR(255),
  Field        VARCHAR(255),
  VALUE        VARCHAR(255))

DECLARE @sp_index_info TABLE (
  PageFID         TINYINT,
  PagePID         INT,
  IAMFID          TINYINT,
  IAMPID          INT,
  ObjectID        INT,
  IndexID         TINYINT,
  PartitionNumber TINYINT,
  PartitionID     BIGINT,
  iam_chain_type  VARCHAR(30),
  PageType        TINYINT,
  IndexLevel      TINYINT,
  NextPageFID     TINYINT,
  NextPagePID     INT,
  PrevPageFID     TINYINT,
  PrevPagePID     INT,
  PRIMARY KEY (PageFID, PagePID));

DECLARE @I INT = 0

WHILE @I < 2
  BEGIN
      DECLARE @Online VARCHAR(3) = CASE
          WHEN @I = 0 THEN 'OFF'
          ELSE 'ON'
        END

      EXEC('ALTER INDEX [IX] ON [dbo].[Table]
REBUILD WITH
(
    PAD_INDEX  = OFF, 
    STATISTICS_NORECOMPUTE  = OFF, 
    ALLOW_ROW_LOCKS  = ON, 
    ALLOW_PAGE_LOCKS  = ON, 
    ONLINE = ' + @Online + ', 
    SORT_IN_TEMPDB = ON
);')

      INSERT INTO @sp_index_info
      EXEC ('DBCC IND ( FragTest, ''[dbo].[Table]'', 2)' );

      ; WITH T
           AS (SELECT *,
                      PagePID - ROW_NUMBER() OVER (PARTITION BY PageType, IndexLevel ORDER BY PagePID) AS Grp
               FROM   @sp_index_info)
      SELECT PageType,
             MIN(PagePID) AS StartPID,
             MAX(PagePID) AS EndPID,
             COUNT(*)     AS [count],
             IndexLevel
      FROM   T
      GROUP  BY Grp,
                PageType,
                IndexLevel
      ORDER  BY PageType DESC,
                StartPID

      DECLARE @DynSQL NVARCHAR(4000)

      SELECT @DynSQL = N'DBCC PAGE (FragTest, ' + LTRIM(PageFID) + ',' + LTRIM(PagePID) + ',3) WITH TABLERESULTS'
      FROM   @sp_index_info
      WHERE  PageType = 10

      INSERT INTO @DbccPage
      EXEC(@DynSQL)

      SELECT VALUE AS SinglePageAllocations
      FROM   @DbccPage
      WHERE  VALUE <> '(0:0)'
             AND Object LIKE '%IAM: Single Page Allocations%'

      SELECT avg_page_space_used_in_percent,
             avg_fragmentation_in_percent,
             fragment_count,
             page_count,
             @Online                                                   AS [Online],
             (SELECT COUNT(*)
              FROM   @DbccPage
              WHERE  VALUE <> '(0:0)'
                     AND Object LIKE '%IAM: Single Page Allocations%') AS SinglePageAllocations
      FROM   sys.dm_db_index_physical_stats(db_id(), object_id('[dbo].[Table]'), 2, NULL, 'DETAILED')
      WHERE  index_level = 0

      DELETE FROM @sp_index_info

      DELETE FROM @DbccPage

      SET @I = @I + 1
  END

I consistently got results like

Online = OFF

PageType StartPID    EndPID      count       IndexLevel
-------- ----------- ----------- ----------- ----------
10       119         119         1           NULL
2        2328        2351        24          0
2        2352        2352        1           1
2        2384        2392        9           0


SinglePageAllocations
----------------------

(0 row(s) affected)


avg_page_space_used_in_percent avg_fragmentation_in_percent fragment_count       page_count           Online SinglePageAllocations
------------------------------ ---------------------------- -------------------- -------------------- ------ ---------------------
98.8139362490734               0                            2                    33                   OFF    0

Online = ON

PageType StartPID    EndPID      count       IndexLevel
-------- ----------- ----------- ----------- ----------
10       115         115         1           NULL
2        114         114         1           0
2        118         118         1           1
2        2416        2449        34          0



SinglePageAllocations
-----------------------
(1:114)
(1:118)


avg_page_space_used_in_percent avg_fragmentation_in_percent fragment_count       page_count           Online SinglePageAllocations
------------------------------ ---------------------------- -------------------- -------------------- ------ ---------------------
97.4019644180875               2.85714285714286             2                    35                   ON     2

At least in the test I did, the differences between the two balanced out fragmentation-wise (though, similarly to your test, I did find that rebuilding the index online led to a higher page count).

I found that the Online = OFF version always used uniform extents and had zero single page allocations, whereas the Online = ON version always seemed to put the index root page and the first index leaf page in mixed extents.

Putting the first index leaf page in a mixed extent and the rest in contiguous uniform extents causes a fragment count of 2.

The Online = OFF version avoids the fragment caused by the lone index leaf page, but the contiguity of the leaf pages is broken by the index root page that shares the same extents, and this too has a fragment count of 2.
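The two fragment counts can be reproduced from the page ranges in the result sets above. This Python sketch groups leaf page IDs into runs of consecutive IDs, which is the same grouping the CTE in the script performs. It is a simplification: it looks only at page IDs within one file and ignores logical ordering.

```python
from typing import Iterable, List, Tuple

def contiguous_runs(page_ids: Iterable[int]) -> List[Tuple[int, int]]:
    """Group page IDs into (start, end) runs of consecutive IDs."""
    runs: List[Tuple[int, int]] = []
    for pid in sorted(set(page_ids)):
        if runs and pid == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], pid)   # extend the current run
        else:
            runs.append((pid, pid))         # gap found: start a new run
    return runs

# Leaf-level (IndexLevel 0) page ranges copied from the two result sets above.
offline_leaf = list(range(2328, 2352)) + list(range(2384, 2393))
online_leaf = [114] + list(range(2416, 2450))

print("OFF:", len(offline_leaf), "pages,", contiguous_runs(offline_leaf))
print("ON: ", len(online_leaf), "pages,", contiguous_runs(online_leaf))
```

Both lists collapse into two runs, matching the fragment_count of 2 reported for each rebuild, and the list lengths match the reported page_count values of 33 and 35.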

I was running my test on a newly created database with 1 GB of free space and no concurrent activity. Perhaps the Online = OFF version is more vulnerable to concurrent allocations causing it to be given non-contiguous uniform extents.