Proper Way to Map IDs in MySQL – Additional Columns vs New Rows

MySQL

I am working on a new database structure in which we get external IDS from a few different sources and map them against our internal ID in one table.

Method 1

In the first version the table structure is shown below by the following "create database query":

create table external_id_mapping(
id  int not null auto_increment,  
internal_id int,  
external_id varchar(150),
external_source_name varchar(60),
varchar(70),
primary key(id))

In which case we would be able to look our own internal_ID dependant on the external through these types of queries:

select internal_id 
from external_id_mapping
where external_id = 4214 
and external_source_name = 'source_name_x'

Each new source would have a new row with an external_id and a source_name. External_id cannot be integer as it needs to cover every possible variation of external_ids we are using, some of which are integers others which are strings.

Method 2

In the other method the database structure would look like this:

create table external_id_mapping(
internal_id int, 
external_id_x int,
external_id_z int,
external_id_y varchar(150),
primary key(internal_id))

Thus query would look like this:

select internal_id 
from external_id_mapping
where external_id_x = 4214    

Each internal_id represents its own row and gets matched for different columns against different external_ids. If we in the future wants to use additional external sources we would create new columns.

In terms of inserting new data in the table, I highly prefer the latter method sicne I can use "replace into" and overwrite the existing primary key ID.

Whereas in the first method, the ID is auto-incremented and it needs to be looked up before we know whether it needs to be inserted or updated.

However, which method is generally considered best practice?

Best Answer

INDEX(external_source, external_id) in that order. This would facilitate looking up the external id to find the internal one. Now, assuming the main use of this table is to go from "external" to "internal", then I would do

PRIMARY KEY (external_source, external_id)
INDEX(...)   -- for either `internal_id` or `id`, whichever you keep.

You have two internal ids -- internal_id and id. Why both? id could be AUTO_INCREMENT to take care of inventing new numbers. internal_id would need some mechanism for inventing new numbers.

The external_* may as well be VARCHAR, since you can't trust them to stick to numbers.

I would not do the extra 'normalization' even if you have a million rows in this mapping table. I might do it for a billion rows. (I claim there is such a thing as "over-normalization".)