MySQL – Large dataset of 400+ million rows to compare

Tags: MySQL, query-performance

I am trying to solve this logic problem, but my current MySQL schema gives very poor performance with a large dataset.

I have a table which contains almost half a billion telephone records. This is the Do Not Call list, or simply the DNC.

When I start a campaign to dial numbers, I have to check my phone numbers against this list. If a number is on the list, I scrub it from the campaign and do not dial it.

Here is my dnc table:

CREATE TABLE `dnc_contacts` (
  `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
  `group_id` INT(11) NOT NULL,
  `contact` VARCHAR(45) NOT NULL,
  `status` TINYINT(1) NOT NULL DEFAULT '1',
  `created_date` TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
  `admin` TINYINT(1) NOT NULL DEFAULT '0',
  `updated_date` TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`,`group_id`),
  KEY `fk_dnc_contact_dnc_group1_idx` (`group_id`),
  KEY `idx-group_id-contact` (`group_id`,`contact`),
  CONSTRAINT `dnc_contacts_copy_ibfk_1` FOREIGN KEY (`group_id`) REFERENCES `dnc_groups` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

The query I am using to compare the dnc table is:

UPDATE contacts c
LEFT JOIN dnc_contacts d
       ON d.contact = c.contact
      AND d.group_id = '9'
SET campaign_id = '525',
    call_status = 1,
    active_call = 0,
    dnc_contact = IF(d.contact IS NULL, 0, 1)
WHERE phonebook_id IN (920)

My contacts table looks like:

CREATE TABLE `contacts` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `phonebook_id` int(11) NOT NULL,
  `campaign_id` int(11) DEFAULT NULL,
  `contact` varchar(45) NOT NULL,
  `status` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `admin` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `call_status` tinyint(1) unsigned NOT NULL DEFAULT '1',
  `active_call` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `test_call` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `dnc_contact` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `v_node_name` varchar(45) DEFAULT NULL,
  `updated_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `created_date` timestamp NULL DEFAULT NULL,
  PRIMARY KEY (`id`,`phonebook_id`),
  KEY `fk_contact_phonebook1_idx` (`phonebook_id`),
  KEY `idx-phonebook_id-contact` (`id`,`contact`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Running this query on small datasets of 100K to 200K contacts works fine, but on a large dataset of more than 300 million records the performance collapses. I can't think of a better way to achieve the goal.

Appreciate any advice to improve the logic/schema/query.


The contact is not unique: the same contact can appear under a different group or phonebook. It gets even messier when there are multiple phonebooks. active_call and call_status keep changing as the numbers are dialled.

Best Answer

Isn't contact unique in both tables? If so, get rid of id and have PRIMARY KEY(contact). There are multiple reasons why this is likely to speed things up.
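A sketch of what that slimmed-down DNC table could look like. Since the follow-up notes that contact repeats across groups, the composite key (group_id, contact) stands in for a key on contact alone; the retained columns and constraint name are assumptions based on the original DDL:

```sql
-- Hypothetical reworked dnc_contacts: the natural composite key replaces the
-- AUTO_INCREMENT id, so the scrub lookup runs directly against the clustered PK.
CREATE TABLE `dnc_contacts` (
  `group_id` INT NOT NULL,
  `contact`  VARCHAR(45) NOT NULL,
  `status`   TINYINT(1) NOT NULL DEFAULT '1',
  `created_date` TIMESTAMP NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`group_id`, `contact`),  -- one B-tree probe per (group, number)
  CONSTRAINT `fk_dnc_group` FOREIGN KEY (`group_id`) REFERENCES `dnc_groups` (`id`)
    ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
```

Note that both original secondary indexes become redundant here (the PK's leading column covers the FK, and the PK itself covers the (group_id, contact) lookup), roughly halving index storage and insert cost on a half-billion-row table.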

active_call and call_status sound like things that change often. That interferes with setting dnc_contact, especially when done in large chunks.

I can think of no good reason to even have dnc_contact in contacts; it is redundant information. That is, when you are about to make a call, do the JOIN to verify that the phone number is not in dnc_contacts.
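A dial-time check could be as simple as the following (the literal group id and phone number are placeholders); with the existing `idx-group_id-contact` index this is a single point lookup, so it costs almost nothing per call:

```sql
-- Returns 1 if the number is on the DNC list for this group, else 0.
SELECT EXISTS(
    SELECT 1
    FROM dnc_contacts
    WHERE group_id = 9
      AND contact  = '5551234567'
) AS on_dnc;
```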

If you really need to do the UPDATE, here are tips on how to do it in a relatively non-invasive way -- by chunking.
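One way to chunk, as a sketch: walk the contacts primary key in fixed-size id ranges so each UPDATE touches a bounded number of rows, commits on its own, and holds locks only briefly. The chunk size and bounds below are illustrative, not tuned values:

```sql
-- Run repeatedly, advancing the range by the chunk size (here ~100K ids)
-- until it passes SELECT MAX(id) FROM contacts.
SET @lo := 1;
SET @hi := @lo + 100000 - 1;

UPDATE contacts c
LEFT JOIN dnc_contacts d
       ON d.contact = c.contact
      AND d.group_id = 9
SET c.campaign_id = 525,
    c.call_status = 1,
    c.active_call = 0,
    c.dnc_contact = IF(d.contact IS NULL, 0, 1)
WHERE c.phonebook_id IN (920)
  AND c.id BETWEEN @lo AND @hi;
```

Because each chunk is a separate transaction, the concurrent active_call/call_status updates from the dialer are blocked for only a moment at a time, and the undo log stays small instead of growing across hundreds of millions of rows.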