InnoDB – How to Perform Rolling Update to All Rows Without Rollback if Aborted

innodblockingoptimizationtransactionupdate

I have the following table, which stores the Social Security Death Master File database (~86M rows):

CREATE TABLE `death_master_files` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ssn_an` smallint(5) unsigned DEFAULT NULL,
  `ssn_gn` tinyint(3) unsigned DEFAULT NULL,
  `ssn_sn` smallint(5) unsigned DEFAULT NULL,
  `as_of` date DEFAULT NULL,
  `change_type` varchar(1) DEFAULT NULL,
  `last_name` varchar(20) DEFAULT NULL,
  `first_name` varchar(15) DEFAULT NULL,
  `middle_name` varchar(15) DEFAULT NULL,
  `name_suffix` varchar(4) DEFAULT NULL,
  `verify_proof_code` varchar(1) DEFAULT NULL,
  `death_year` smallint(4) unsigned DEFAULT NULL,
  `death_month` tinyint(3) unsigned DEFAULT NULL,
  `death_day` tinyint(3) unsigned DEFAULT NULL,
  `birth_year` smallint(4) unsigned DEFAULT NULL,
  `birth_month` tinyint(3) unsigned DEFAULT NULL,
  `birth_day` tinyint(3) unsigned DEFAULT NULL,
  `state_of_residence` varchar(2) DEFAULT NULL,
  `last_known_zip_residence` varchar(5) DEFAULT NULL,
  `last_known_zip_payment` varchar(5) DEFAULT NULL,
  `extra` varchar(7) DEFAULT NULL,
  `lifespan` mediumint(8) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_ssn_as_of` (`ssn_an`,`ssn_gn`,`ssn_sn`,`as_of`),
  KEY `index_death_master_files_on_last_name_and_first_name` (`last_name`,`first_name`),
  KEY `idx_dob_ssn` (`birth_year`,`birth_month`,`ssn_an`)
) ENGINE=InnoDB AUTO_INCREMENT=85952194 DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED;

I tried populating lifespan using update death_master_files set lifespan = datediff(concat(death_year, "-", death_month, "-", ifnull(death_day,1)), concat(birth_year, "-", birth_month, "-", ifnull(birth_day,1)));

Unfortunately, there was an unrelated issue which caused the server to restart, thus killing the transaction midstream. Now I'm looking at ROLLING BACK 1 lock struct(s), heap size 376, 1 row lock(s), undo log entries 32937515 in my SHOW ENGINE INNODB STATUS, which is blocking all further actions on that table until it finishes. 🙁

One speed improvement would seem to be using update death_master_files set lifespan = datediff(MAKEDATE(death_year, 1) + INTERVAL (death_month-1) MONTH + INTERVAL (ifnull(death_day,1)-1) DAY, MAKEDATE(birth_year, 1) + INTERVAL (birth_month-1) MONTH + INTERVAL (ifnull(birth_day,1)-1) DAY); instead of the string manipulation.

There's no need for any transactions on this. The table is not in use yet, and the lifespan column is either NULL or the value populated by this update. The update is multi-run-invariant. There are currently no slaves on the database either.

I'm going to let it finish the rollback (which'll take hours…). The next time I run it though, I'd like it to run such that it just writes, without rollback/transactions, so that if something dies, it can resume where it left off.

So: any suggestions for how to run UPDATE on the entire table, with rolling updates per row (no locking / transactions)?

(Bonus: it'd be nice if I don't have to do an OPTIMIZE TABLE after this either, since that also takes a while with an 8 GB table. This is the only change since the db was populated [by 1krow bulk inserts from the raw file parser].)

Best Answer

I ended up doing this, to do it in batches of 1krow (which takes about .2s per batch if there's nothing to update, 2s otherwise; basically it has the same total speed per row if it's in batches of 10k, 5k or 1k):

SET SESSION sql_mode = 'strict_all_tables';
SET SESSION sql_warnings = 1;
\W
DELIMITER //
DROP PROCEDURE IF EXISTS update_dmf_lifespan//
CREATE PROCEDURE update_dmf_lifespan(INOUT i INT)
BEGIN
 SET @i = i;
 SET @max = (select id from death_master_files order by id desc limit 1);
 setloop: LOOP
  UPDATE death_master_files SET lifespan = DATEDIFF(MAKEDATE(death_year, 1) + INTERVAL (death_month-1) MONTH + INTERVAL (IFNULL(IFNULL(death_day,birth_day),1)-1) DAY, MAKEDATE(birth_year, 1) + INTERVAL (birth_month-1) MONTH + INTERVAL (IFNULL(birth_day,1)-1) DAY) WHERE DATEDIFF(MAKEDATE(death_year, 1) + INTERVAL (death_month-1) MONTH + INTERVAL (IFNULL(IFNULL(death_day,birth_day),1)-1) DAY, MAKEDATE(birth_year, 1) + INTERVAL (birth_month-1) MONTH + INTERVAL (IFNULL(birth_day,1)-1) DAY) >= 0 and id BETWEEN @i AND @i+1000;
  SELECT @i, ROW_COUNT(), LAST_INSERT_ID();
  SELECT *, DATEDIFF(MAKEDATE(death_year, 1) + INTERVAL (death_month-1) MONTH + INTERVAL (IFNULL(IFNULL(death_day,birth_day),1)-1) DAY, MAKEDATE(birth_year, 1) + INTERVAL (birth_month-1) MONTH + INTERVAL (IFNULL(birth_day,1)-1) DAY) as diff from death_master_files WHERE DATEDIFF(MAKEDATE(death_year, 1) + INTERVAL (death_month-1) MONTH + INTERVAL (IFNULL(IFNULL(death_day,birth_day),1)-1) DAY, MAKEDATE(birth_year, 1) + INTERVAL (birth_month-1) MONTH + INTERVAL (IFNULL(birth_day,1)-1) DAY) < 0 and id BETWEEN @i AND @i+1000;
  SET @i = @i + 1000;
  IF @i + 1000 > @max THEN
   LEAVE setloop;
  END IF;
 END LOOP setloop;
END
//
SET @i = 1//
CALL update_dmf_lifespan(@i)//
DROP PROCEDURE IF EXISTS update_dmf_lifespan//
DELIMITER ;

… so this'll probably take 2 days to run, plus more for the OPTIMIZE TABLE afterwards. :-/

Also, I haven't figured out how to get both SHOW WARNINGS and ROW_COUNT(), since they don't seem to be joinable in a single statement and both output results based on the previous statement (including each other).

I've set it to strict mode with a resumable point, which lets me manually inspect and resume after warnings.

That's surely not the best option, though I'm not sure what would be better. For cases where the deathdate is before the birthdate (the data is not exactly 100% reliable), and thus it raises on lifespan being unsigned, it doesn't update and instead outputs with select. This won't catch other sources of warnings, though.