Postgresql – Distinguishing data from different sources

postgresql

I want to store data that comes from multiple data sources, the issue is that the row identifiers have no guarantee to be unique across sources, which wouldn't satisfy the unique key constraint.

I'm not really familiar with databases, and if I had to model this in an imperative programming language I would have probably just stored data from each source apart in different arrays, this (storing in different tables) doesn't look at all like a correct approach for databases, though.

What is the usual approach for storing data that shares the exact same column types, but that we want to be separated, for example because the IDs might conflict? Should I instead put the original IDs in a different column, add another column to specify the "row origin" (ie: what source does it come from) and yet another column for the original ID that it would have had?

Thank you

Best Answer

1) Add a column to the table which holds the source system. Change the primary key of the table to a compound key including source system and source id. Then value ('System A', row 1) is distinct from ('System B', row 1).

There are a few downsides to this. If the source system were ever to revert to previous values (say a DB was restored, or a sequence reset, or a programming error or ... ) our uniqueness would be compromised.

It is generally advantageous to make keys (and hence indexes and foreign keys) narrow for performance reasons. So a one-column key is, other things being equal, preferable to a two-column key. If you're not running your hardware to its limits this is unlikely to be a problem in real life.

Similarly having the system name - a character string - in the key makes it wider. That can be mitigated by introducing a table SourceSystem(SystemName varchar(<whatever>), SystemId smallint) and using SystemId in the actual data table. Now you have to join to the new table should you ever need the human-readable name.

The advantage is simplicity. The loading process need only hard-code a simple string (or a table look-up) and everything else remains the same. It is easy to reference back to the source since all the IDs are intact.

2) Add a surrogate key to your table. You allocate values to this key as you insert rows. This surrogate key becomes the primary key. Say Source A sent rows 1 & 2 and Source B sent rows 1 & 5 the table becomes

SurrogateId  SourceId
-----------  --------
1            1
2            2
3            1
4            5

Advantages are we have complete control over the primary key values in our system. Whatever changes the developers of the source choose to deploy, and however their ID scheme fluctuates, it will not affect us.

Unfortunately we have no direct reference telling us which was the source for each row, if that is important in your world. That can be addressed by adding a new non-key column to hold that information.

3) Put a offset for each source. Source A's rows all get 10,000 (or 10 million, or 100 million or whatever) added, Source B's have 20,000, Source C's have 30,000 and so on. The obvious problem is that each range could be exhausted. If that happens you can move to a new interval for that source.

Option 2 - the surrogate key - is ~~the~~ my preferred solution.

Related Solutions

PostgreSQL insert into table (not origin) based on a condition on fields on different tables

You seem to be under the impression that some kind of automatic "row number" would exist. That is not the case. Unlike rows in a spreadsheed, tables in a relational database have no natural order.

This query should do the job, but it relies on the the contents of name_1 and name_2 to make the connection. If you rely on a row number, you have to add an actual column for that.

INSERT INTO table2 (id1, id2, score1, score2, comment_string)
SELECT t1.id1, t1.id2, t1.score1, t1.score2
      ,CASE WHEN t3.name_1 IS NULL
            AND  t4.name_2 IS NULL THEN 'removed_because ...'
       ELSE END AS comment_string
FROM   table_1 t1
LEFT   JOIN table_3 t3 USING (name_1)
LEFT   JOIN table_4 t4 USING (name_2)
ORDER  BY id1; -- undeclared in Q

Based on the assumption that table_3.name_1 and table_4.name_2 are unique. Else, the query could create a "proxy cross join", possibly multiplying rows, if there are several matches.
More about this caveat in this related answer on SO.

Postgresql – Append elements to array only if the element does’t exist

It's not super efficient, but you can define something like your solution as a function:

create or replace function set_append(anyarray, anyelement)
  returns anyarray
  immutable
  PARALLEL safe
  language sql as
$$
  select case when $2=any($1) then $1 else $1 || $2 end;
$$;

You can also use that to define an aggregate:

CREATE AGGREGATE set_agg( ANYELEMENT ) (
  SFUNC = set_append,
  STYPE = ANYARRAY,
  INITCOND = '{}');

This will let you select set_agg(x order by y) to control the order of the distinct elements where you can't do select array_agg(distinct x order by y).

Best Answer

Related Solutions

PostgreSQL insert into table (not origin) based on a condition on fields on different tables

Postgresql – Append elements to array only if the element does’t exist

Related Question