Mariadb – Slow lookup of part of primary key

mariadb

I am struggling with query performance. I'm working with historical records of who has voted (or not) in an election. I have data for nearly 40 million ballots over 22 election. When I get additional data in the future, the oldest elections will drop off of the new data set, but I want to retain that information in my own database. So I need a system that will let me merge new data into the existing data.

I have a table 'elections' with 22 rows:

CREATE TABLE `elections` (
  `election_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `date` date NOT NULL,
  `election_type` enum('primary','general','special','board') NOT NULL,
  `whose` enum('state','other') NOT NULL,
  PRIMARY KEY (`election_id`),
  UNIQUE KEY `date` (`date`,`whose`),
  KEY `whose` (`whose`)
) ENGINE=InnoDB AUTO_INCREMENT=32 DEFAULT CHARSET=latin1;

And a table 'ballots' which is initially empty:

CREATE TABLE `ballots` (
  `voter_id` int(10) unsigned NOT NULL,
  `election_id` int(10) unsigned NOT NULL,
  `status` enum('unreturned','invalid','valid') NOT NULL,
  PRIMARY KEY (`voter_id`,`election_id`),
  KEY `election_id` (`election_id`),
  KEY `status` (`status`),
  CONSTRAINT `ballots_ibfk_1` FOREIGN KEY (`election_id`) REFERENCES `elections` (`election_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

My raw data obviously does not contain an election_id. I need to look up the election_id based on the date. I load my raw data initially into a 'temp_ballots' table which has ~40M rows:

CREATE TABLE `temp_ballots` (
  `voter_id` int(10) unsigned NOT NULL,
  `date` date NOT NULL,
  `status` enum('unreturned','invalid','valid') NOT NULL,
  KEY `date` (`date`),
  KEY `voter_id` (`voter_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

And now, the part I struggle with. I want to take the data from temp_ballots, look up the correct election_id, and insert it into ballots. (Then I will drop temp_ballots.) The following operation is taking hours, and I don't know when it might finish, and I don't understand why this is slow. I think I've indexed the appropriate columns. What am I doing wrong?

INSERT INTO `ballots` (voter_id, election_id, status)
  SELECT voter_id, election_id, status
  FROM elections NATURAL JOIN temp_ballots
  WHERE whose = 'state'
  ON DUPLICATE KEY UPDATE ballots.status = temp_ballots.status;

I am using MariaDB 10.0.36, so I do not have the ANALYZE statement. But here is the output from EXPLAIN:

+------+-------------+--------------+------+---------------+-------+---------+-----------------------+------+-----------------------+
| id   | select_type | table        | type | possible_keys | key   | key_len | ref                   | rows | Extra                 |
+------+-------------+--------------+------+---------------+-------+---------+-----------------------+------+-----------------------+
|    1 | SIMPLE      | elections    | ref  | date,whose    | whose | 1       | const                 |   22 | Using index condition |
|    1 | SIMPLE      | temp_ballots | ref  | date          | date  | 3       | voters.elections.date | 1123 |                       |
+------+-------------+--------------+------+---------------+-------+---------+-----------------------+------+-----------------------+

Best Answer

Instead of depending on IODKU to update rows, do the messy work in the SELECT:

INSERT INTO `ballots` (voter_id, election_id, status)
    SELECT t.voter_id, e.election_id, MAX(t.status)
        FROM elections AS e
        JOIN temp_ballots AS t  USING(date)
        WHERE e.whose = 'state'
        GROUP BY t.voter_id, e.election_id

Or...

INSERT INTO `ballots` (voter_id, election_id, status)
    SELECT  voter_id,
            ( SELECT election_id FROM elections
                     WHERE date = t.date
                       AND whose = 'state' ) AS election_id,
            MAX(status)
        FROM temp_ballots AS t
        GROUP BY voter_id, election_id

Notes:

  • I think the first one might be better.
  • The first one designed to produce the rows in PK order.
  • I picked MAX to choose among the status values, since you did not have any way of picking among them.
Related Question