MySQL – How to optimize a MySQL database of 250 million rows for bulk inserts and selects

database-design innodb MySQL optimization

I am a student who has been given the task of designing a database for sensor data. My university currently has a large database that is being filled with this data, but much of what is being stored is not necessary. They want me to extract some of the fields from the existing database and insert them into a new one, which will hold only the 'essentials'. I will need to extract every row from the old database, as well as fetch new data once a day.

  • There are 1500 sensors.
  • They generate a reading every minute.
  • That is approximately 2.1 million readings every day.
  • The current database has about 250 million rows.

The typical query will select sensor readings for a set of sensors within a given time span.

I was initially naive about the added complexity that large amounts of data introduce, so I grossly underestimated the time needed for this task. Because of this, and because I don't have access to the server from home, I am asking here for help and input.

The initial design looks like this:

CREATE TABLE IF NOT EXISTS SENSORS (
    ID smallint UNSIGNED NOT NULL AUTO_INCREMENT,
    NAME varchar(500) NOT NULL UNIQUE,
    VALUEFACETS varchar(500) NOT NULL,
    PRIMARY KEY (ID)
); 

CREATE TABLE IF NOT EXISTS READINGS (
    ID int UNSIGNED AUTO_INCREMENT,
    TIMESTAMP int UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    PRIMARY KEY (ID),
    INDEX (TIMESTAMP),
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);

Design Question

My first question is whether I should keep an auto-incremented key for the readings, or whether it would be more beneficial to have a composite key on TIMESTAMP (UNIX epoch) and SENSOR_ID?

This question applies both to the fact that I have to insert 2.1 million rows per day, as well as the fact that I want to optimize for the aforementioned queries.
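
For concreteness, the composite-key variant I am weighing would look something like the sketch below (illustrative only; the order of the two key columns is exactly the part I am unsure about):

CREATE TABLE IF NOT EXISTS READINGS (
    TIMESTAMP int UNSIGNED NOT NULL,
    SENSOR_ID smallint UNSIGNED NOT NULL,
    VALUE float NOT NULL,
    STATUS int NOT NULL,
    PRIMARY KEY (TIMESTAMP, SENSOR_ID),
    FOREIGN KEY (SENSOR_ID) REFERENCES SENSORS(ID)
);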

Initial Bulk insert:

After a lot of trial and error, and after finding a guide online, I have concluded that inserting with LOAD DATA INFILE will best suit this purpose. I have written a script that selects 500 000 rows at a time from the old database and writes them (all 250 million) to a CSV file, which will look like this:

TIMESTAMP,SENSOR_ID,VALUE,STATUS
2604947572,1399,96.434564,1432543
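
In SQL terms, each 500 000-row export boils down to something like the following sketch (OLD_READINGS and its ID column are placeholders for the real source table, the path assumes the default secure_file_priv directory, and the actual script may export by other means):

SELECT TIMESTAMP, SENSOR_ID, VALUE, STATUS
    INTO OUTFILE '/var/lib/mysql-files/readings_chunk_000.csv'
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    FROM OLD_READINGS
    WHERE ID BETWEEN 1 AND 500000;  -- the script advances this range in 500 000-row steps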

My plan is then to sort it with GNU sort, and split it into files containing 1 million rows.

Before inserting these files, I will remove the index on TIMESTAMP, as well as running these commands:

SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SESSION transaction_isolation = 'READ-UNCOMMITTED';
SET sql_log_bin = 0;

After inserting, I will of course revert these changes.
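
Concretely, the plan is to load each chunk with something along these lines (a sketch; the file path again assumes the default secure_file_priv directory, and an IGNORE 1 LINES clause would only be needed for a chunk that still carries the header row):

LOAD DATA INFILE '/var/lib/mysql-files/readings_chunk_001.csv'
    INTO TABLE READINGS
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    (TIMESTAMP, SENSOR_ID, VALUE, STATUS);

-- once every chunk has been loaded, revert the session changes
SET FOREIGN_KEY_CHECKS = 1;
SET UNIQUE_CHECKS = 1;
SET SESSION transaction_isolation = 'REPEATABLE-READ';
SET sql_log_bin = 1;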

  • Is this plan at all viable?

  • Can the inserts be quickened if I sort the CSV based on SENSOR_ID and TIMESTAMP instead of TIMESTAMP and SENSOR_ID?

  • After turning the index back on following the bulk insert, will inserting 2.1 million rows each day still be possible?

  • Is it possible to do the daily inserts with regular INSERT statements, or will I have to use LOAD DATA INFILE to keep up
    with the input load?

my.cnf

All settings are at their defaults except for these:

  innodb_flush_log_at_trx_commit=2
  innodb_buffer_pool_size=5GB
  innodb_flush_method=O_DIRECT
  innodb_doublewrite = 0

Are there any other optimizations I need for this particular purpose?

The server has 8 GB of RAM.
mysqld Ver 8.0.22
Ubuntu 20.04

Any thoughts, ideas or inputs would be greatly appreciated.

Best Answer

General recommendations for a "sensor" dataset:

  • Minimize datatypes
  • Minimize indexes
  • Batch inserts
  • 25 rows per second (1500 sensors each minute) is brisk, but does not call for drastic measures

Specifics:

  • STATUS int NOT NULL -- 4 bytes? What values might it have? (Make it smaller if practical.)
  • I suggest that PRIMARY KEY(sensor_id, timestamp) would be unique and adequate. Then get rid of id completely. The result is one fewer secondary index to update. I recommend this order for the columns. But the real choice needs to be based on the SELECTs that will be performed.
  • Gather all 1500 rows into a single INSERT (see the sketch after this list). Or use LOAD DATA INFILE (except that this requires more disk hits). That is, you will have one INSERT per minute. There should be no problem for a single thread to keep up. Load 'continually'; I see no benefit in waiting until the end of the day. (Or am I missing something?)
  • Debug your code, then get rid of the FOREIGN KEY; FK checking involves extra effort.
  • The initial LOAD DATA to put millions of rows into the table -- This will probably hit some timeout and/or buffer limit. In that case, break it into chunks - but not 1M rows; instead perhaps 10K rows. And do it single-threaded. I hesitate to think about the hiccups you might encounter if you tried to do it multi-threaded. Furthermore, it might be mostly I/O-bound, thereby not benefiting from multi-threading much.
  • Have autocommit ON; this way each chunk will be committed. Otherwise, the redo log will get huge, taking extra disk space and slowing things down.
  • Pre-sorting data -- This helps some. In your original schema, sort by the secondary key (timestamp); the AUTO_INCREMENT will take care of itself. If you remove the auto_inc, then sort by the PRIMARY KEY(sensor_id, timestamp), if you take my suggestion.
  • Have the PRIMARY KEY in place as you load the data. Else it will need to copy the table when building the PK.
  • If there are secondary key(s), ALTER TABLE .. ADD INDEX .. after the initial loading.
  • The settings look fine. However I would leave innodb_doublewrite on -- this protects against a rare, but catastrophic, data corruption.
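
For example, the per-minute batched INSERT could look like this (the values are made up and only three of the 1500 rows are shown):

INSERT INTO READINGS (SENSOR_ID, TIMESTAMP, VALUE, STATUS) VALUES
    (   1, 1609459200, 96.434564, 0),
    (   2, 1609459200, 12.519999, 0),
    -- ... one row per sensor, 1500 in total ...
    (1500, 1609459200, 47.113000, 1);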

Comments related to the link you provided:

  • The link is 8 years old. But most of it is still valid. (However, I disagree with some details.)
  • If you need to eventually delete old data, plan for it now. See this page on time series, which greatly speeds up the monthly deletes of old data by using PARTITION BY RANGE(): http://mysql.rjweb.org/doc.php/partitionmaint (a sketch follows this list). Note: The PK I suggest is the correct one for partitioning by day (or week or month).
  • The only performance benefit of partitioning (based on what has been discussed so far) is when it comes to DELETEing via DROP PARTITION. No SELECTs are likely to run faster (or slower).
  • If you are using mysqldump, simply use the defaults. That will produce manageable INSERTs with lots of rows. TSV/CSV is not necessary unless you are coming from some other source.
  • I cannot speak to that guide's debate between LOAD DATA INFILE and bulk INSERT, or its choice of rows per chunk. Perhaps the author had secondary indexes that were slowing things down. The "Change buffer" is a kind of write cache for secondary indexes. Once it is full, inserting necessarily slows down, waiting for updates to be flushed. By getting rid of secondary indexes, this problem goes away. Or, by using ADD INDEX after loading the data, the cost is postponed.
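
If you do eventually partition, daily partitions on the epoch column might look roughly like this (a sketch; partition names and boundary values are illustrative, and it assumes the suggested PRIMARY KEY(sensor_id, timestamp) and that the FOREIGN KEY has been dropped, since partitioned InnoDB tables do not support foreign keys):

ALTER TABLE READINGS
    PARTITION BY RANGE (TIMESTAMP) (
        PARTITION p20210101 VALUES LESS THAN (1609545600),  -- midnight UTC starting 2021-01-02
        PARTITION p20210102 VALUES LESS THAN (1609632000),  -- midnight UTC starting 2021-01-03
        PARTITION pfuture   VALUES LESS THAN MAXVALUE
    );

-- dropping a day of old data then becomes a quick metadata operation
ALTER TABLE READINGS DROP PARTITION p20210101;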

Catching a broken cron job...

If you have a cron job capturing the data for "yesterday" based on the timestamp in the source, there is still a risk of the cron job failing, or not running at all (for example, because your server happens to be down when the job is scheduled).

With the PK(sensor, ts) that I recommend, the following is reasonably efficient:

SELECT sensor, MAX(ts) AS max_ts
    FROM dest_table GROUP BY sensor;

The efficiency comes from hopping through the table, hitting only 1500 spots.

At the start of the cron job, run this (on the destination) to "find out where you left off":

SELECT MAX(max_ts) AS left_off FROM
       ( SELECT sensor, MAX(ts) AS max_ts
            FROM dest_table GROUP BY sensor ) AS x;

Then fetch new rows from the source:

    WHERE ts > left_off
      AND ts < CURDATE()

Normally, left_off will be shortly before the most recent midnight. When there has been a hiccup, it will be about a day earlier.
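
Putting that together, the catch-up pull might look roughly like this (a sketch only; source_readings is a placeholder for the old table, both schemas are assumed reachable over one connection, and UNIX_TIMESTAMP(CURDATE()) is used because ts holds epoch seconds):

-- find out where you left off
SELECT MAX(max_ts) INTO @left_off
    FROM ( SELECT sensor, MAX(ts) AS max_ts
               FROM dest_table GROUP BY sensor ) AS x;

-- copy everything newer than that, up to (but not including) today
INSERT INTO dest_table (sensor, ts, value, status)
    SELECT sensor, ts, value, status
        FROM source_readings
        WHERE ts > @left_off
          AND ts < UNIX_TIMESTAMP(CURDATE());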

You could also use that subquery to see if any sensors have gone offline and when.