Postgresql – Distinguishing data from different sources

postgresql

I want to store data that comes from multiple data sources, the issue is that the row identifiers have no guarantee to be unique across sources, which wouldn't satisfy the unique key constraint.

I'm not really familiar with databases, and if I had to model this in an imperative programming language I would have probably just stored data from each source apart in different arrays, this (storing in different tables) doesn't look at all like a correct approach for databases, though.

What is the usual approach for storing data that shares the exact same column types, but that we want to be separated, for example because the IDs might conflict? Should I instead put the original IDs in a different column, add another column to specify the "row origin" (ie: what source does it come from) and yet another column for the original ID that it would have had?

Thank you

Best Answer

1) Add a column to the table which holds the source system. Change the primary key of the table to a compound key including source system and source id. Then value ('System A', row 1) is distinct from ('System B', row 1).

There are a few downsides to this. If the source system were ever to revert to previous values (say a DB was restored, or a sequence reset, or a programming error or ... ) our uniqueness would be compromised.

It is generally advantageous to make keys (and hence indexes and foreign keys) narrow for performance reasons. So a one-column key is, other things being equal, preferable to a two-column key. If you're not running your hardware to its limits this is unlikely to be a problem in real life.

Similarly having the system name - a character string - in the key makes it wider. That can be mitigated by introducing a table SourceSystem(SystemName varchar(<whatever>), SystemId smallint) and using SystemId in the actual data table. Now you have to join to the new table should you ever need the human-readable name.

The advantage is simplicity. The loading process need only hard-code a simple string (or a table look-up) and everything else remains the same. It is easy to reference back to the source since all the IDs are intact.

2) Add a surrogate key to your table. You allocate values to this key as you insert rows. This surrogate key becomes the primary key. Say Source A sent rows 1 & 2 and Source B sent rows 1 & 5 the table becomes

SurrogateId  SourceId
-----------  --------
1            1
2            2
3            1
4            5

Advantages are we have complete control over the primary key values in our system. Whatever changes the developers of the source choose to deploy, and however their ID scheme fluctuates, it will not affect us.

Unfortunately we have no direct reference telling us which was the source for each row, if that is important in your world. That can be addressed by adding a new non-key column to hold that information.

3) Put a offset for each source. Source A's rows all get 10,000 (or 10 million, or 100 million or whatever) added, Source B's have 20,000, Source C's have 30,000 and so on. The obvious problem is that each range could be exhausted. If that happens you can move to a new interval for that source.

Option 2 - the surrogate key - is the my preferred solution.