MySQL – Resolving LOAD DATA with Subquery Performance Issues


I'm migrating data into a MySQL (5.7.26) database (32 GB RAM), running as a managed service on AWS. While importing the data, I need to map one of the CSV columns to another value using a MEMORY table lookup, so my LOAD DATA resembles the following:

LOAD DATA LOCAL INFILE 'file.csv'
INTO TABLE table_1 (col_1, @var1)
SET col_2 = (SELECT mapped_value FROM table_2 WHERE id = @var1);

table_2 is a two-column (id, mapped_value) MEMORY table with 3.4 million rows.

When I import the CSV without the subquery, I get several million inserts per minute. However, when I run the same import with the subquery, LOAD DATA performance degrades to near zero (~100 inserts per minute). Is this to be expected with a subquery, or is there something I'm doing wrong in the example above?

Best Answer

without the subquery ... several million inserts per minute ... with the subquery ... ~100 inserts per minute

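Of course. The server executes the subquery separately for every imported record!

Please specify the data types of table_1.col_2 and table_2.id. They matter because a missing index or a type mismatch on the lookup column can turn each of those lookups into a scan of table_2.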
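
One way to collect that information, and to see whether a single lookup can use an index, is sketched below. It uses only standard MySQL statements; the literal 42 is a placeholder id rather than a value from the question, and it should be written in the same type as table_2.id.

```sql
-- Show column types, storage engine and indexes of both tables.
SHOW CREATE TABLE table_1;
SHOW CREATE TABLE table_2;

-- Show how one lookup is executed (42 is a placeholder id).
-- A "type" of ALL in the output means table_2 is scanned in full
-- for every lookup, which would explain the slow import.
EXPLAIN SELECT mapped_value FROM table_2 WHERE id = 42;
```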

If the values of table_2.id can be stored in table_1.col_2, import in two steps:

LOAD DATA LOCAL INFILE 'file.csv'
INTO TABLE table_1 (col_1, col_2);

UPDATE table_1, table_2
SET table_1.col_2 = table_2.mapped_value
WHERE table_2.id = table_1.col_2;

If the types are not compatible, I'd recommend importing the data into a temporary table and then copying it into the working table with the needed substitution. In any case, it will be faster than a one-step import with a subquery.
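
The temporary-table variant could look roughly like the sketch below. The names staging_1 and raw_id are hypothetical, and the column types are assumptions that need adjusting to the real CSV and to table_2.id.

```sql
-- Hypothetical staging table; the column types are assumptions.
-- Declare raw_id with the same type as table_2.id so that the
-- join below can use the index on table_2.id.
CREATE TEMPORARY TABLE staging_1 (
    col_1  VARCHAR(255),
    raw_id INT
);

-- Step 1: plain bulk load, no per-row subquery.
LOAD DATA LOCAL INFILE 'file.csv'
INTO TABLE staging_1 (col_1, raw_id);

-- Step 2: copy into the working table, substituting the mapped value.
-- LEFT JOIN keeps rows without a match and stores NULL for them,
-- which is what the original scalar subquery would also produce.
INSERT INTO table_1 (col_1, col_2)
SELECT s.col_1, t.mapped_value
FROM staging_1 AS s
LEFT JOIN table_2 AS t ON t.id = s.raw_id;

DROP TEMPORARY TABLE staging_1;
```

The substitution then happens once, as a set-based join, instead of once per imported row.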

Related Solutions

MySQL – Query performance with subquery and IN clause

Refactor the query as follows:

SELECT
    readings.*
FROM
    (
        SELECT boxsn FROM readings
        WHERE (time >= 1325404800) 
        AND (time < 1326317400) 
        ORDER BY `time` ASC
    ) readings_keys
    LEFT JOIN
    (
        SELECT id AS boxsn FROM boards WHERE siteId = '1'
    ) boards
    USING (boxsn)
    LEFT JOIN readings
    USING (boxsn)
;

Make sure you have the following indexes:

ALTER TABLE boards ADD INDEX siteId_id_ndx (siteId,id);
ALTER TABLE readings ADD INDEX time_boxsn_ndx (time,boxsn);

You can then drop the other index:

ALTER TABLE readings DROP INDEX boxsn_time_ndx;

You should definitely see a dramatic improvement in performance as the tables grow.
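
To check whether the optimizer actually picks up the new indexes, the two inner SELECTs of the refactored query can be run through EXPLAIN. This is only a sketch; the hope, not a guarantee, is that the key column of the output names the index mentioned in each comment.

```sql
-- Key lookup on readings; ideally served by time_boxsn_ndx.
EXPLAIN
SELECT boxsn FROM readings
WHERE time >= 1325404800
  AND time < 1326317400;

-- Key lookup on boards; ideally served by siteId_id_ndx.
EXPLAIN
SELECT id AS boxsn FROM boards WHERE siteId = '1';
```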

In your case,

The first EXPLAIN plan says you have to perform a lookup of SerialNumber for each row in readings against a list of values in memory.
The second EXPLAIN plan says you have to perform a lookup of SerialNumber for each row in readings against a table.

UPDATE 2012-01-12 14:03 EDT

I refactored it again to make sure the readings keys and boards keys are combined correctly before retrieving the data from the readings table:

SELECT 
    readings.* 
FROM 
    ( 
        SELECT A.* FROM
        (
            SELECT boxsn FROM readings 
            WHERE (time >= 1325404800)  
            AND (time < 1326317400)  
            ORDER BY `time` ASC
        ) A
        LEFT JOIN
        (
            SELECT id AS boxsn
            FROM boards
            WHERE siteId = '1'
        ) B
        USING (boxsn)
        WHERE B.boxsn IS NOT NULL
    ) readings_keys 
    LEFT JOIN readings 
    USING (boxsn) 
;

MySQL – slave issue with load data

The LOAD DATA INFILE statement was not always replicated correctly to a slave running MySQL 5.1.42 or earlier from a master running MySQL 4.0 or earlier. With statement-based replication, the CONCURRENT option of LOAD DATA INFILE was not replicated. This issue was fixed in MySQL 5.1.43. It does not affect CONCURRENT option handling with row-based replication in MySQL 5.1 or later (see bug report http://bugs.mysql.com/bug.php?id=34628).

Also, in MySQL 5.1.52 and later, LOAD DATA INFILE is considered unsafe: it causes a warning when using the statement-based logging format, and is logged in row-based format when using mixed-format logging.

http://dev.mysql.com/doc/refman/5.1/en/replication-rbr-safe-unsafe.html
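
Which of these cases applies depends on the logging format in effect. A minimal sketch for checking it, and for switching the current session to row-based logging, follows; it assumes the account has the privileges needed to change the variable.

```sql
-- Logging format for this session: STATEMENT, ROW or MIXED.
SELECT @@SESSION.binlog_format;

-- Assumption: the account is allowed to change the format.
-- With ROW, the loaded rows themselves are logged rather than
-- the LOAD DATA INFILE statement.
SET SESSION binlog_format = 'ROW';
```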
