I am using MySQL 5.6 and have an append-only "lookup" table that's used to store reference data:
CREATE TABLE my_table (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
my_column VARCHAR(32) NOT NULL,
PRIMARY KEY (id),
UNIQUE KEY my_column_uk (my_column)
) ENGINE=INNODB DEFAULT CHARSET=utf8;
Rows are inserted, but never updated or deleted. The table is small (thousands of rows at most), so query performance is not a concern.
I have 2 different processes (running identical code) which insert rows into this table. Each process has an in-memory write-through cache of the my_column
-> id
value. Whenever a new entry is put into the cache that doesn't already exist, it's synchronously inserted into the database table, then added to the cache.
How do I handle the scenario where 2 (or more) processes try to insert the same my_column
value at the same time?
Since the table has a unique key constraint on my_column
, my thought is to have both processes try to insert the new row and get the new PK id
value. One of them will succeed, and can add the new entry to its cache. The other one will fail, but then it can fall back to querying for the id
of the row that the other process inserted. I call this the "write-read" approach.
Is there a better way of handling this contention between the two processes? Would a "read-write" approach be better, potentially with a different transaction isolation level ("read uncommitted" e.g.)?
The one potential downside I see with the "write-read" approach is that it involves catching a duplicate key exception, and using that as an indicator to query for the existing row. This is perhaps a misuse of an exception, because this scenario is expected to occasionally happen, and is not really exceptional.
The problem with the "read-write" approach is that I'm not sure there's a good way to have process 2 "see" the row that process 1 added, until it finishes its transaction. So this might need to be a "read-write-read", in which case I'm better off just starting with the INSERT like in the "write-read" approach.
Some more constraints:
-
I need to get the resulting PK ID, regardless of whether the row was inserted or already exists. However I don't want to burn through auto-increment values, so I'm not going to use the
UPDATE id = LAST_INSERT_ID(id)
kludge. I'd rather just execute a second SQL statement to SELECT the id value, because I'm only inserting a row every ~5 minutes into a table with ~100k rows, so query performance is not a major concern. -
The reason why I'm not using the VARCHAR
my_column
column as thePRIMARY KEY
column is that this lookup table has a FK reference from another much bigger (500M rows, 20 cols) table. I'm using this lookup table pattern to store reference data – e.g. failure codes ("eligibility.failure.not_enough_foobar_data") or time zones ("America/Los_Angeles") – so that the much larger main table can use a short FK id to reference that value (e.g.failure_code_id
ortime_zone_id
), instead of repeating the long string.
Best Answer
Here's the best solution I've been able to come up with, using a write-read approach:
This has the following advantages:
id
column - and thus we don't have to worry about needlessly logging duplicate key exceptions or ignoring actual database problems.