I am importing CSV files using LOAD DATA INFILE. The files are big, containing a minimum of 500K+ records each, and the table grows day by day. Each line is just a phone number and nothing else. My table has now grown to more than 200+ million records. I am also using a trigger to update a few items BEFORE INSERT. A bulk import naturally takes a very long time because of the trigger.
Query to import the records:
load data local infile '/tmp/file.csv' IGNORE into table contacts FIELDS TERMINATED BY ',' (contact) set phonebook_id = 1
Trigger
CREATE TRIGGER `contacts_before_insert`
BEFORE INSERT ON `contacts`
FOR EACH ROW
BEGIN
DECLARE vTimezone varchar(4);
SELECT IF((timezone = '' OR count(timezone) = 0), 'ERR', timezone)
INTO vTimeZone
FROM timezones
WHERE area_code = (SELECT LEFT(NEW.contact , 3));
SET NEW.timezone = vTimezone;
END
Timezones table structure
CREATE TABLE `timezones` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`state` varchar(45) NOT NULL,
`area_code` varchar(3) NOT NULL,
`timezone` varchar(4) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=327 DEFAULT CHARSET=latin1
Contacts table structure
CREATE TABLE `contacts` (
`id` bigint(20) NOT NULL AUTO_INCREMENT,
`phonebook_id` int(11) NOT NULL,
`campaign_id` int(11) DEFAULT NULL,
`contact` varchar(45) NOT NULL,
`status` tinyint(1) unsigned NOT NULL DEFAULT '1',
`admin` tinyint(1) unsigned NOT NULL DEFAULT '0',
`call_status` tinyint(1) unsigned NOT NULL DEFAULT '1',
`active_call` tinyint(1) unsigned NOT NULL DEFAULT '0',
`test_call` tinyint(1) unsigned NOT NULL DEFAULT '0',
`v_node_name` varchar(45) DEFAULT NULL,
`timezone` varchar(4) DEFAULT NULL,
`updated_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`created_date` timestamp NULL DEFAULT NULL,
`contact_name` varchar(45) DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `phonebook_id` (`phonebook_id`,`contact`),
KEY `idx_001` (`active_call`,`campaign_id`,`test_call`),
KEY `idx_002` (`phonebook_id`,`call_status`,`test_call`,`campaign_id`,`active_call`,`dnc_contact`),
KEY `idx_003` (`phonebook_id`,`test_call`,`call_status`,`dnc_contact`),
KEY `FK_campaign_id_idx` (`campaign_id`),
KEY `idx_testcall` (`test_call`),
KEY `idx_004` (`campaign_id`,`call_status`,`test_call`),
KEY `idx_active_call` (`active_call`)
) ENGINE=InnoDB AUTO_INCREMENT=136600117 DEFAULT CHARSET=latin1
My question is: if I break the CSV file into smaller pieces, say 2 MB each or something like that, will that speed up the INSERTs? Or should I create some sort of stored procedure that runs after the import and updates the data in the contacts table where my contacts were imported? On top of that, I absolutely want to avoid row/table locking, because the table is under heavy SELECT/UPDATE load at the same time.
What would be the best practice here? How can I speed up large CSV imports while avoiding table locking, etc.?
Best Answer
My suggestions are roughly in order of the performance benefit they will provide.
(1) (Similar to what a_vlad says) Rethink the use of the Trigger. I would load the CSV into a temp table, do the timezone lookup there in a single query, and only then copy the data into the main table. (See also the general tips on high-speed ingestion and on chunking.) Yes, 500K rows at once is likely to cause some kind of trouble.
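A minimal sketch of that staging pipeline, assuming the trigger has been dropped per point (1); the table name contacts_stage and the hard-coded phonebook_id are illustrative, not from the original post:

```sql
-- 1. Stage the raw file; no trigger fires because the staging table has none.
CREATE TABLE contacts_stage (
  contact  varchar(45) NOT NULL,
  timezone varchar(4)  DEFAULT NULL
) ENGINE=InnoDB;

LOAD DATA LOCAL INFILE '/tmp/file.csv'
  INTO TABLE contacts_stage
  FIELDS TERMINATED BY ','
  (contact);

-- 2. Resolve all timezones in one set-based UPDATE instead of 500K per-row
--    trigger invocations; IFNULL reproduces the trigger's 'ERR' fallback.
UPDATE contacts_stage s
LEFT JOIN timezones t ON t.area_code = LEFT(s.contact, 3)
SET s.timezone = IFNULL(t.timezone, 'ERR');

-- 3. Copy into the main table; INSERT IGNORE preserves the de-duplication
--    that the original LOAD DATA ... IGNORE relied on (unique key on
--    phonebook_id, contact).
INSERT IGNORE INTO contacts (phonebook_id, contact, timezone)
SELECT 1, contact, timezone
FROM contacts_stage;

DROP TABLE contacts_stage;
```

Because step 3 is a single set-based statement rather than a row-at-a-time trigger, the expensive lookup work happens once, in step 2, against the small staging table.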
(2) Add an index to timezones, e.g. INDEX(area_code, timezone), so the trigger's per-row lookup by area code is not a full table scan.
(3) This is inefficient:
WHERE area_code = (SELECT LEFT(NEW.contact , 3))
change it to
WHERE area_code = LEFT(NEW.contact, 3)
(4) An index whose columns form a prefix of another index is redundant and can be dropped: idx_active_call (active_call) duplicates the start of idx_001, and FK_campaign_id_idx (campaign_id) duplicates the start of idx_004.
(5) What is the distinction between contact and contact_name?
(6) Low-cardinality indexes (eg, idx_active_call) are essentially never chosen by the Optimizer, so they could be dropped.
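Taken together, the index-related points amount to DDL along these lines; this is a sketch (the index name idx_area_code is an assumption), so verify against your own schema and foreign keys before running it:

```sql
-- (2) Let the trigger's per-row lookup use a covering index
--     instead of scanning timezones.
ALTER TABLE timezones ADD INDEX idx_area_code (area_code, timezone);

-- (3) In the trigger, the needless subquery
--       WHERE area_code = (SELECT LEFT(NEW.contact, 3))
--     should become
--       WHERE area_code = LEFT(NEW.contact, 3)

-- (4)/(6) Drop single-column indexes that duplicate the leading column of a
-- wider index; the Optimizer will use the wider index instead. Dropping
-- FK_campaign_id_idx is safe for any FOREIGN KEY on campaign_id only because
-- idx_004 also begins with campaign_id.
ALTER TABLE contacts
  DROP INDEX idx_active_call,     -- prefix of idx_001 (active_call, ...)
  DROP INDEX FK_campaign_id_idx,  -- prefix of idx_004 (campaign_id, ...)
  DROP INDEX idx_testcall;        -- another low-cardinality candidate
```

Fewer secondary indexes also directly speeds up the bulk INSERTs, since each index is one more structure InnoDB must update per inserted row.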