MySQL – How to coordinate two sessions writing the same row to a MySQL table

MySQL

I am using MySQL 5.6 and have an append-only "lookup" table that's used to store reference data:

CREATE TABLE my_table (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  my_column VARCHAR(32) NOT NULL,

  PRIMARY KEY (id),
  UNIQUE KEY my_column_uk (my_column)
) ENGINE=INNODB DEFAULT CHARSET=utf8;

Rows are inserted, but never updated or deleted. The table is small (thousands of rows at most), so query performance is not a concern.

I have two different processes (running identical code) that insert rows into this table. Each process has an in-memory write-through cache of the my_column -> id mapping. Whenever an entry that doesn't already exist is put into the cache, the row is first synchronously inserted into the database table, then added to the cache.

How do I handle the scenario where 2 (or more) processes try to insert the same my_column value at the same time?

Since the table has a unique key constraint on my_column, my thought is to have both processes try to insert the new row and get the new PK id value. One of them will succeed, and can add the new entry to its cache. The other one will fail, but then it can fall back to querying for the id of the row that the other process inserted. I call this the "write-read" approach.
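In SQL terms, the "write-read" flow sketched above would look something like this (the 'America/Los_Angeles' value is just a placeholder):

```sql
-- Attempt the insert first; thanks to the unique key on my_column, a
-- concurrent duplicate fails with error 1062 (ER_DUP_ENTRY).
INSERT INTO my_table (my_column) VALUES ('America/Los_Angeles');

-- On success, the new PK:
SELECT LAST_INSERT_ID();

-- On a duplicate-key error, fall back to reading the row the other
-- process inserted:
SELECT id FROM my_table WHERE my_column = 'America/Los_Angeles';
```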

Is there a better way of handling this contention between the two processes? Would a "read-write" approach be better, potentially with a different transaction isolation level (e.g. "read uncommitted")?

The one potential downside I see with the "write-read" approach is that it involves catching a duplicate key exception, and using that as an indicator to query for the existing row. This is perhaps a misuse of an exception, because this scenario is expected to occasionally happen, and is not really exceptional.

The problem with the "read-write" approach is that I'm not sure there's a good way to have process 2 "see" the row that process 1 added, until it finishes its transaction. So this might need to be a "read-write-read", in which case I'm better off just starting with the INSERT like in the "write-read" approach.
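For comparison, that "read-write-read" variant needs up to three statements; a sketch with the same placeholder value:

```sql
-- 1. Read first; if a row exists, use its id and stop here.
SELECT id FROM my_table WHERE my_column = 'America/Los_Angeles';

-- 2. No row found: insert. This can still race and fail with a
--    duplicate-key error if another process inserted in the meantime.
INSERT INTO my_table (my_column) VALUES ('America/Los_Angeles');

-- 3. On a duplicate-key error, read again to get the winning row's id.
SELECT id FROM my_table WHERE my_column = 'America/Los_Angeles';
```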


Some more constraints:

  1. I need to get the resulting PK ID, regardless of whether the row was inserted or already exists. However I don't want to burn through auto-increment values, so I'm not going to use the UPDATE id = LAST_INSERT_ID(id) kludge. I'd rather just execute a second SQL statement to SELECT the id value, because I'm only inserting a row every ~5 minutes into a table with ~100k rows, so query performance is not a major concern.

  2. The reason why I'm not using the VARCHAR my_column column as the PRIMARY KEY column is that this lookup table has a FK reference from another much bigger (500M rows, 20 cols) table. I'm using this lookup table pattern to store reference data – e.g. failure codes ("eligibility.failure.not_enough_foobar_data") or time zones ("America/Los_Angeles") – so that the much larger main table can use a short FK id to reference that value (e.g. failure_code_id or time_zone_id), instead of repeating the long string.
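To illustrate the pattern, here is a sketch of how the bigger table might reference the lookup table (the table and column names here are hypothetical, not from the original schema):

```sql
CREATE TABLE big_table (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  -- short FK instead of repeating the long string in 500M rows
  time_zone_id INT UNSIGNED NOT NULL,
  -- ... the other ~20 columns ...

  PRIMARY KEY (id),
  CONSTRAINT big_table_time_zone_fk
    FOREIGN KEY (time_zone_id) REFERENCES my_table (id)
) ENGINE=INNODB DEFAULT CHARSET=utf8;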

Best Answer

Here's the best solution I've been able to come up with, using a write-read approach:

  1. INSERT using ON DUPLICATE KEY UPDATE (without first checking whether the row exists);
  2. if the generated-key value returned for that statement is non-zero, then that's the PK of the new row;
  3. else a row must already exist with that value, so SELECT it;
  4. if the SELECT can't find the row, then throw an exception because this is unrecoverable.
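Concretely, steps 1–3 look something like the following, assuming the client reads the statement's generated key via its driver (e.g. JDBC's getGeneratedKeys); note that the session-level LAST_INSERT_ID() function is only updated when a row is actually inserted, so the per-statement driver value is the reliable signal here:

```sql
-- Self-assignment makes the UPDATE clause a no-op: for an existing row the
-- statement changes nothing (affected-rows = 0) and generates no new key.
INSERT INTO my_table (my_column) VALUES ('America/Los_Angeles')
ON DUPLICATE KEY UPDATE my_column = my_column;

-- The driver-level generated key for this statement is the new PK on the
-- insert path and 0 on the duplicate path; in the latter case, read the
-- existing row's id:
SELECT id FROM my_table WHERE my_column = 'America/Los_Angeles';
```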

This has the following advantages:

  1. it doesn't require any transaction handling or a special isolation level, because INSERT ... ON DUPLICATE KEY UPDATE is a single atomic statement;
  2. it executes at most 2 SQL statements (insert-select) instead of 3 (select-insert-select), making the calling code simpler;
  3. since we're not using INSERT IGNORE, if the INSERT statement throws an exception it will be for a legitimate reason (e.g. the auto_increment counter has reached the maximum value of the id column), so we don't have to worry about needlessly logging duplicate-key exceptions or ignoring actual database problems.