Mysql – happening under the hood in MySQL 8 when I apply an auto-increment primary key to a raw table of data

MySQLmysql-8.0

I'm responsible for a production database using MySQL 8.0 (InnoDb engine). It has become necessary for me to archive data from a sizeable table (the data file takes up approximately 50% of the hard drive).

I've learnt that using DELETE FROM x WHERE y is a horrible idea and doesn't necessarily reduce disk space usage. My new strategy is to:

  1. Export the data I need to archive to CSV (I was doing this before)
  2. Select the data I want to keep in the DB and place it in a new table (using CREATE TABLE ... SELECT)
  3. Apply the PK AI NN BIGINT primary key that I have in the existing table to the new table
  4. Delete the non-FK indices from the current table (to reduce disk usage)
  5. Apply FK indices to the new table
  6. Apply non-FK indices to the new table
  7. Rename the old table and new table
  8. Verify new table is working
  9. Drop old table

My problem / question really begins at step 3: my raw table consumes 44.6 GB before applying the primary key, and as I am typing this the replacement file is pushing 52.2 GB. That seems like a lot of space for a primary key that's technically based on data that's already there.

Could anyone explain why it's taking so much space? I'm running out of hard drive space now and don't know how to estimate how large the final data file might be.

Here's the CREATE statement I used to generate the temporary table:

CREATE TABLE ts_temp
SELECT * FROM timeseries
WHERE
        (type_id IN (1, 4, 5) AND date > CURDATE() - INTERVAL 24 MONTH)
    OR (type_id IN (8) AND date > CURDATE())
    OR (type_id IN (10, 11) AND date > CURDATE() - INTERVAL 9 MONTH)
    OR (type_id IN (12, 13, 14, 15, 22) AND date > CURDATE() - INTERVAL 6 MONTH)
    OR (type_id IN (16, 17, 19, 20, 21) AND date > CURDATE() - INTERVAL 2 MONTH)
    OR (type_id IN (2, 6) AND date > CURDATE() - INTERVAL 1 MONTH);

Getting a subsequent CREATE statement from MySQL Workbench for this table would look like this:

CREATE TABLE `ts_temp` (
  `id` bigint unsigned NOT NULL DEFAULT '0',
  `item_id` int DEFAULT NULL,
  `date` date DEFAULT NULL,
  `type_id` int DEFAULT NULL,
  `node_id` int DEFAULT NULL,
  `value` float DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Then, I applied the PK using this:

ALTER TABLE `ts_temp` 
CHANGE COLUMN `id` `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT ,
ADD PRIMARY KEY (`id`);

I should perhaps mention that this is a timeseries table and we ran out of AI numbers when it was a (default) signed INT field, which I subsequently converted to BIGINT unsigned.

Best Answer

Do you care whether you keep the old auto_increment values? Your process seems to loose them.

Did you have innodb_file_per_table turned on when the original table was created? If yes, there are easier ways; if no, you need to dump all tables and rebuild.

Be sure to have innodb_file_per_table=ON

What percentage of the table is to be kept? If it is a small percentage, don't bother with the CSV steps, simply rebuild the new table as InnoDB. Also you won't need to rebuild the AI.

While you are at it, consider shrinking the datatypes. INT takes 4 bytes and BIGINT takes 8. Look into smaller datatypes. And UNSIGNED. (Eg, SMALLINT UNSIGNED takes 2 bytes had has a range of 0..64K.)

Adding and dropping secondary indexes (but not the PK) is efficient in 8.0. Do you have any other indexes, perhaps even composite indexes? Is any of them 'unique'? If so, could you get rid of id and use that index as the PRIMARY KEY?

Can each of the columns really be NULL? Consider fixing that if necessary.

Do the rename in these steps. This minimizes downtime:

RENAME TABLE real TO old,
             temp TO real;
DROP TABLE old;

More tips on big deletes and related things: http://mysql.rjweb.org/doc.php/deletebig