MySQL – Performance issues on large CSV import with BEFORE INSERT trigger

import, MySQL, performance, trigger

I am importing CSV files using LOAD DATA INFILE. The files are big, containing a minimum of 500K+ lines of records, and the table grows day by day. Each line is just a phone number and nothing else. My table has now grown to more than 200+ million records. I am also using a trigger to update a few columns before INSERT. A bulk import naturally takes a very long time because of the trigger.

Query used to import the records:

load data local infile '/tmp/file.csv' IGNORE into table contacts FIELDS TERMINATED BY ',' (contact) set phonebook_id = 1

Trigger

CREATE TRIGGER `contacts_before_insert`
BEFORE INSERT ON `contacts`
FOR EACH ROW
BEGIN

   DECLARE vTimezone varchar(4);

   SELECT IF((timezone = '' OR count(timezone) = 0), 'ERR', timezone)
     INTO vTimeZone
     FROM timezones
    WHERE area_code = (SELECT LEFT(NEW.contact , 3));

   SET NEW.timezone = vTimezone;

END

Timezone table structure

  CREATE TABLE `timezones` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `state` varchar(45) NOT NULL,
    `area_code` varchar(3) NOT NULL,
    `timezone` varchar(4) NOT NULL,
    PRIMARY KEY (`id`)
  ) ENGINE=InnoDB AUTO_INCREMENT=327 DEFAULT CHARSET=latin1      

Contacts table structure

  CREATE TABLE `contacts` (
    `id` bigint(20) NOT NULL AUTO_INCREMENT,
    `phonebook_id` int(11) NOT NULL,
    `campaign_id` int(11) DEFAULT NULL,
    `contact` varchar(45) NOT NULL,
    `status` tinyint(1) unsigned NOT NULL DEFAULT '1',
    `admin` tinyint(1) unsigned NOT NULL DEFAULT '0',
    `call_status` tinyint(1) unsigned NOT NULL DEFAULT '1',
    `active_call` tinyint(1) unsigned NOT NULL DEFAULT '0',
    `test_call` tinyint(1) unsigned NOT NULL DEFAULT '0',
    `v_node_name` varchar(45) DEFAULT NULL,
    `timezone` varchar(4) DEFAULT NULL,
    `updated_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    `created_date` timestamp NULL DEFAULT NULL,
    `contact_name` varchar(45) DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `phonebook_id` (`phonebook_id`,`contact`),
    KEY `idx_001` (`active_call`,`campaign_id`,`test_call`),
    KEY `idx_002` (`phonebook_id`,`call_status`,`test_call`,`campaign_id`,`active_call`,`dnc_contact`),
    KEY `idx_003` (`phonebook_id`,`test_call`,`call_status`,`dnc_contact`),
    KEY `FK_campaign_id_idx` (`campaign_id`),
    KEY `idx_testcall` (`test_call`),
    KEY `idx_004` (`campaign_id`,`call_status`,`test_call`),
    KEY `idx_active_call` (`active_call`)
  ) ENGINE=InnoDB AUTO_INCREMENT=136600117 DEFAULT CHARSET=latin1

My question is: if I break the CSV file into smaller pieces, say 2 MB each or something like that, will that speed up the INSERT? Or should I create some sort of stored procedure which runs after the import and updates the newly imported data in the contacts table? Not to mention, I absolutely want to avoid row/table locking, because the table is under heavy SELECT/UPDATE load at the same time.

What would be the best practice here? How can I speed up large CSV imports while avoiding table locking, etc.?

Best Answer

My suggestions are roughly in order of the performance benefit they will provide.

(1) (Similar to what a_vlad says) Rethink the use of a Trigger. I would load the CSV into a temp table, then do the timezone lookup in a single query, and only then copy the data into the main table. Other tips on High speed ingestion. Tips on chunking. Yes, 500K rows at once is likely to cause some kind of trouble.
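
A minimal sketch of that approach, assuming the BEFORE INSERT trigger has been dropped; the staging table name contacts_stage is a placeholder, and the file path and phonebook_id are taken from the question:

    -- Staging table: just the raw numbers, no secondary indexes, no trigger.
    CREATE TABLE contacts_stage (
      contact varchar(45) NOT NULL
    ) ENGINE=InnoDB;

    -- Fast load: nothing fires per row on this table.
    LOAD DATA LOCAL INFILE '/tmp/file.csv'
    INTO TABLE contacts_stage
    FIELDS TERMINATED BY ',' (contact);

    -- One set-based timezone lookup instead of 500K+ per-row SELECTs,
    -- then copy into the real table. IGNORE keeps the current behaviour
    -- of skipping duplicates on the (phonebook_id, contact) unique key.
    INSERT IGNORE INTO contacts (phonebook_id, contact, timezone)
    SELECT 1,
           s.contact,
           COALESCE(NULLIF(t.timezone, ''), 'ERR')
      FROM contacts_stage s
      LEFT JOIN timezones t ON t.area_code = LEFT(s.contact, 3);

    DROP TABLE contacts_stage;

If holding locks on contacts for the whole batch is a concern, the final INSERT ... SELECT can itself be run in chunks (for example, by ranges of staging-table rows), per the chunking tips above.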

(2) Add to timezones:

INDEX(area_code, timezone)
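
Spelled out as an ALTER TABLE (the index name is an arbitrary choice):

    -- Covering index: the area_code -> timezone lookup can be answered
    -- from the index alone, without touching the table rows.
    ALTER TABLE timezones ADD INDEX idx_area_code_tz (area_code, timezone);

The timezones table is tiny (its AUTO_INCREMENT is only 327), so this is a modest gain compared with removing the trigger, but it makes each lookup index-only.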

(3) This is inefficient:

WHERE area_code = (SELECT LEFT(NEW.contact , 3));

change it to

WHERE area_code = LEFT(NEW.contact , 3);

(4) The first of these is redundant (campaign_id is already the leftmost column of idx_004) and can be dropped:

 KEY `FK_campaign_id_idx` (`campaign_id`),
 KEY `idx_004` (`campaign_id`,`call_status`,`test_call`),
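
For example (if a foreign key constraint still depends on this particular index and no other index can serve it, MySQL will refuse the drop):

    -- campaign_id is already the leftmost column of idx_004,
    -- so this single-column index adds nothing for lookups.
    ALTER TABLE contacts DROP INDEX FK_campaign_id_idx;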

(5) What is the distinction between contact and contact_name?

(6) Low-cardinality indexes (e.g., idx_active_call) are essentially never chosen by the Optimizer, so they could be dropped.