MySQL 8.0 – Steps for Deleting Large Number of Rows and Changing Index

Tags: delete, index, MySQL, mysql-8.0

We collect data every day, at multiple records per second. The table my_xpa_data currently has close to 100 million rows and is growing fast.

  • table size around 8 GB (used to be 4 GB a couple of days ago)
  • index size around 10 GB (used to be 4 GB as well)

I executed a few delete commands affecting only a couple of thousand rows. Then I got greedy and tried to delete around a million rows at once with DELETE FROM myTable WHERE ID < 100276673. After 45 minutes I aborted the statement.

The table looks like this:

| ID  |    id_external    | date_collected      | prg | spotName  | ID_Machine | q_result |
|=====|===================|=====================|=====|===========|============|==========|
| 127 | 20201229231513000 | 2020-12-29 23:15:13 | 110 | 008-04232 |     95     | 0.986879 |
| 128 | 20201229231515000 | 2020-12-29 23:15:15 | 109 | 008-04232 |     95     | 0.986879 |

and

CREATE TABLE `my_xpa_data` (
    `id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
    `id_external` BIGINT UNSIGNED NOT NULL DEFAULT '0',
    `date_collected` DATETIME NOT NULL,
    `prg` INT UNSIGNED NOT NULL,
    `spotName` VARCHAR(50) NOT NULL,
    `ID_Machine` INT UNSIGNED NOT NULL,
    `q_result` DOUBLE NULL DEFAULT NULL,
    PRIMARY KEY (`id`),
    INDEX `ID_Machine` (`ID_Machine`, `spotName`),
    INDEX `idx_qresult` (`q_result`),
    INDEX `idx_xpa_idmachine_prg` (`ID_Machine`, `prg`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB
AUTO_INCREMENT=223323358

Based on Rick James's post MySQL Big DELETEs, I am aware that a big delete requires some preparation.

To reduce the space needed and to improve query speed, I wonder what should be changed:

  1. Get rid of indexes that are not needed (for example idx_qresult).
  2. Maybe split date_collected into three columns: year+month (YYYYMM), day (DD) and time (HH:MM:SS), or drop it altogether because id_external likely carries the same information.
  3. The column id_external is a timestamp (YYYYMMDDHHMMSSSSS, e.g. 2020 12 29 23 15 13 000); as far as I can tell the last three digits (SSS) are always zero.
  4. Move the combination of spotName, ID_Machine and possibly prg into a different table and reference it with an id (see the sketch below).
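
A rough sketch of item 4, assuming there are at most a few thousand distinct combinations (table and column names here are made up for illustration):

CREATE TABLE machine_spot (
    ms_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,   -- hypothetical surrogate key
    ID_Machine INT UNSIGNED NOT NULL,
    spotName VARCHAR(50) NOT NULL,
    prg INT UNSIGNED NOT NULL,
    PRIMARY KEY (ms_id),
    UNIQUE KEY (ID_Machine, spotName, prg)             -- one row per distinct combination
) ENGINE=InnoDB;

-- the big table would then carry only ms_id instead of the three columns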

Old records can be deleted but some aggregates (average, median) must be kept for some time.

Requirements

  • Keep every record from the last weeks (ideally the last 12 weeks)
  • After 12 weeks, calculate and keep hourly and daily aggregates (average, median, quartiles)

Question

I am at a loss how to proceed / where to start:

  • Switch off binary logs (anything else?)
  • Create a new table or tables to receive the records from my_xpa_data
  • Regarding columns: does splitting date_collected into three columns (YYYYMM, DD, HH:MM:ss) make it faster to delete all records for a certain month (e.g. 202006) without using partitioning? (See the sketch after this list.)
  • Does it make sense to use helper tables for date_collected that contain year and month (id, YYYYMM), day (id, DD) and time (id, hhmmssSSS), and store only the ids as references?
  • Copy the rows from my_xpa_data into the new table(s) with INSERT ... SELECT
  • Drop the old table my_xpa_data
  • Create a view named my_xpa_data so that our source code does not need to change much (unless using a view to insert records is a bad idea?)
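
On the month question above: a range condition on the existing DATETIME column already handles a month-wise delete without splitting the column. A rough sketch (this assumes an index on date_collected and is not a substitute for partitioning):

-- delete June 2020 in small chunks; repeat until 0 rows are affected
DELETE FROM my_xpa_data
WHERE date_collected >= '2020-06-01'
  AND date_collected <  '2020-07-01'
LIMIT 1000;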

Best Answer

Overflow...

Not relevant to the question, but you have disaster coming in a few weeks: INT UNSIGNED will overflow at about 4 billion. DELETEing rows will not prevent this.

Plan A: Get rid of the AUTO_INCREMENT PK if you have some other column(s) that is unique and could be used as the PRIMARY KEY.

Plan B: Change it to BIGINT. Better now than when the table gets bigger. (That is 4 bytes bigger, but do it if you must.)

Note: You can make the change as part of the rebuild described below (copying into a new table) rather than altering in place; that way the old table stays intact and a mistake does not lose data. (Eventually DROP the old table.)
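
A minimal sketch of the in-place variant of Plan B, assuming you can afford a long ALTER on ~100 million rows and have the free disk space for the table copy it performs:

-- widen the surrogate key before it overflows (test on a copy first)
ALTER TABLE my_xpa_data
    MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;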

Deletes...

Deleting a million rows, especially while the table is active, will take a long time because of needing to save them in case of ROLLBACK or crash recovery. The simple solution is to delete only 1000 rows at a time. See http://mysql.rjweb.org/doc.php/deletebig
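
A sketch of the chunked approach, using the same cutoff as the aborted statement; each iteration is a small, fast transaction:

-- repeat until it reports 0 rows affected
DELETE FROM my_xpa_data
WHERE id < 100276673
ORDER BY id
LIMIT 1000;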

Since you want to delete "old" rows, DROP PARTITION is an excellent option. But it takes some setup. See that link (and another page that it points to). Since you want 12 weeks, I recommend 14 weekly partitions.
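
A trimmed-down sketch of weekly RANGE partitioning (column list shortened for brevity; note that the partitioning column must be part of every unique key, so the PRIMARY KEY becomes (id, date_collected)):

CREATE TABLE my_xpa_data_new (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    date_collected DATETIME NOT NULL,
    ID_Machine INT UNSIGNED NOT NULL,
    q_result FLOAT NULL,
    PRIMARY KEY (id, date_collected)
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(date_collected)) (
    PARTITION p2021w01 VALUES LESS THAN (TO_DAYS('2021-01-11')),
    PARTITION p2021w02 VALUES LESS THAN (TO_DAYS('2021-01-18')),
    PARTITION pfuture  VALUES LESS THAN MAXVALUE
);

-- dropping an expired week later is nearly instantaneous compared to DELETE
ALTER TABLE my_xpa_data_new DROP PARTITION p2021w01;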

Aggregates...

Build and maintain a "summary table" by hour. See http://mysql.rjweb.org/doc.php/summarytables

That will make the 'report' queries much faster.
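
A sketch of what an hourly summary table could look like (names are illustrative). Storing COUNT and SUM rather than AVG lets the hourly rows be rolled up into daily figures; medians and quartiles cannot be derived from these and need their own handling:

CREATE TABLE my_xpa_summary_hour (
    ID_Machine INT UNSIGNED NOT NULL,
    hr DATETIME NOT NULL,                  -- start of the hour
    cnt INT UNSIGNED NOT NULL,
    sum_q DOUBLE NULL,
    min_q DOUBLE NULL,
    max_q DOUBLE NULL,
    PRIMARY KEY (ID_Machine, hr)
) ENGINE=InnoDB;

-- fill one hour at a time from a scheduled job
INSERT INTO my_xpa_summary_hour
SELECT ID_Machine,
       '2020-12-29 23:00:00',
       COUNT(*), SUM(q_result), MIN(q_result), MAX(q_result)
FROM my_xpa_data
WHERE date_collected >= '2020-12-29 23:00:00'
  AND date_collected <  '2020-12-30 00:00:00'
GROUP BY ID_Machine;

The average for any period is then SUM(sum_q) / SUM(cnt) over the matching hours.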

Size...

Here are some suggestions on shrinking the disk space for the 'Fact' table. That will indirectly help performance some.

spotName VARCHAR(50) NOT NULL looks bulky. Can it be normalized?

DOUBLE takes 8 bytes; do you need 16 digits of precision? Or would a 4-byte FLOAT with 7 digits of precision suffice?

Get rid of most of the indexes once you build the summary table(s).
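
For example, dropping the single-column index the question already flags as unneeded (check your actual SELECTs first):

ALTER TABLE my_xpa_data DROP INDEX idx_qresult;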

With rare exceptions, do not split up a date into its component parts. It leads to extra hassles in the code, performance hits, and may even increase disk usage. (Note that there is overhead for each column.)

YYYY MM DD HH MM ss SSS stored as a string looks like 17(?) bytes. DATETIME(3) is only 7 bytes. If SSS is always 0, then use a 5-byte DATETIME. If you don't want to switch to a temporal type, then consider DECIMAL(17,0), which takes 8 bytes and still looks like a 17-char string or 17-digit number.
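
If you do convert, the 17-digit value can be parsed into a DATETIME along these lines (a sketch that assumes the trailing SSS really is always zero):

-- drop the always-zero milliseconds, then parse the remaining 14 digits
SELECT id_external,
       STR_TO_DATE(CAST(id_external DIV 1000 AS CHAR), '%Y%m%d%H%i%s') AS as_datetime
FROM my_xpa_data
LIMIT 5;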

INT takes 4 bytes. Check out the smaller options. (I suspect you don't need millions of machine ids.)
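
Illustrative only; verify the real value ranges in your data before shrinking anything:

-- possible slimmer datatypes (check ranges first)
ALTER TABLE my_xpa_data
    MODIFY ID_Machine SMALLINT UNSIGNED NOT NULL,   -- 2 bytes, up to 65,535 machines
    MODIFY prg        SMALLINT UNSIGNED NOT NULL,   -- 2 bytes
    MODIFY q_result   FLOAT NULL DEFAULT NULL;      -- 4 bytes, ~7 significant digits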

VIEW...

In MySQL, a VIEW is essentially syntactic sugar. No space is saved. No performance is saved. Etc.

Rebuild...

To delete rows and add partitioning:

  1. Create a new table with partitioning, changed indexes, different datatypes, normalization, etc.
  2. `INSERT INTO new-table SELECT revised-columns FROM existing-table WHERE id > ...`

Note that the WHERE indirectly does the DELETE (by copying over only the rows that are to be kept). That makes things run faster, shrinks the resulting table, and defragments it.
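
Once the copy has caught up, the cut-over itself is a quick rename; keeping the old table around for a while is the safety net mentioned above (table names here are placeholders):

-- atomic swap; the application keeps using the name my_xpa_data
RENAME TABLE my_xpa_data     TO my_xpa_data_old,
             my_xpa_data_new TO my_xpa_data;

-- once you are confident nothing is missing:
-- DROP TABLE my_xpa_data_old;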

More...

After you have dealt with most of my suggestions, start a new question and we can discuss the index(es) and summary table(s) further. Do include the main SELECTs.