PostgreSQL – How to Get Unique Sets of IDs from Table

database-designmany-to-manypostgresqlunique-constraint

I want to represent mixture of arbitrary numbers of chemicals. Since the relationship between chemicals and mixtures here is many-to-many, I thought I'd implement it like this (simplified):

CREATE TABLE chemicals (
    name text PRIMARY KEY,
    chem_id SERIAL UNIQUE NOT NULL
);
CREATE TABLE mixtures (
    mixture_id SERIAL PRIMARY KEY,
);
CREATE TABLE mixture_chems (
    mixture_id INTEGER REFERENCES mixtures (mixture_id),
    chem_id INTEGER REFERENCES chemicals (chem_id)
);

But I also would like to enforce that there is only one (unique) mixture_id that is referred to by any particular combination of chem_id (via the rows in the mixture_chems table).

How could I implement this in PostgreSQL?

One person suggested I might need to use triggers to compute some new value, that uniquely identifies a mixture, and then enforce uniqueness on that. Thoughts on how to implement that, or whether it'd be appropriate here?

Best Answer

I agree with that person. Here is an implementation.

Basically, add a UNIQUE array column of all chem_ids to table mixtures and keep it current with triggers. Arrays must be sorted consistently, I use the additional module intarray for that and to optimize performance.

For lack of definition I assume frequent multi-row writes, making it a perfect candidate for transition tables introduced with Postgres 10. See:

Important note: Technically, this works in Postgres 10. But while testing I ran into a bug of intarray functions that seemed oddly familiar: empty arrays would not compare equal due to incorrect internal array dimensions. Tom Lane found and fixed this for Postgres 11, but it was not backported to Postgres 10. I strongly advise to use Postgres 11 with this.

Turns out to be another instance of a bug I reported myself earlier. See here and here. Took me a while to reproduce and get the full picture.

This uses a variety of advanced features. Not recommended for beginners.

Code

CREATE TABLE chemicals (
  chem_id serial UNIQUE NOT NULL
, name text PRIMARY KEY
);

CREATE TABLE mixtures (
  mixture_id serial PRIMARY KEY
, chem_ids int[] UNIQUE  -- default NULL !
);

CREATE TABLE mixture_chems (
  mixture_id int REFERENCES mixtures (mixture_id)
, chem_id int    REFERENCES chemicals (chem_id)
);
INSERT trigger
CREATE OR REPLACE FUNCTION trg_mixture_chems_insaft()
  RETURNS trigger AS
$func$
BEGIN
   UPDATE mixtures AS m
   SET    chem_ids = sort(COALESCE(m.chem_ids, '{}') + n.chem_ids)
   FROM  (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   new_table
      GROUP  BY 1
      ) n
   WHERE m.mixture_id = n.mixture_id;

   RETURN NULL;
END
$func$  LANGUAGE plpgsql;


CREATE TRIGGER mixture_chems_insaft
AFTER INSERT ON mixture_chems
REFERENCING NEW TABLE AS new_table
FOR EACH STATEMENT
EXECUTE PROCEDURE trg_mixture_chems_insaft();
UPDATE trigger
CREATE OR REPLACE FUNCTION trg_mixture_chems_upaft()
  RETURNS trigger AS
$func$
BEGIN
   UPDATE mixtures AS m
   SET    chem_ids = sort(COALESCE(m.chem_ids, '{}')
                        - COALESCE(o.chem_ids, '{}')
                        + COALESCE(n.chem_ids, '{}'))
   FROM  (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   new_table
      GROUP  BY 1
      ) n
   FULL  JOIN (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   old_table
      GROUP  BY 1
      ) o USING (mixture_id)
   WHERE m.mixture_id = COALESCE(n.mixture_id, o.mixture_id)
   AND   m.chem_ids IS DISTINCT FROM sort(COALESCE(m.chem_ids, '{}')
                                        - COALESCE(o.chem_ids, '{}')
                                        + COALESCE(n.chem_ids, '{}'));

   RETURN NULL;
END
$func$  LANGUAGE plpgsql;


CREATE TRIGGER mixture_chems_upaft
AFTER UPDATE ON mixture_chems
REFERENCING NEW TABLE AS new_table
            OLD TABLE AS old_table
FOR EACH STATEMENT
EXECUTE PROCEDURE trg_mixture_chems_upaft();
DELETE trigger
CREATE OR REPLACE FUNCTION trg_mixture_chems_delaft()
  RETURNS trigger AS
$func$
BEGIN
   UPDATE mixtures AS m
   SET    chem_ids = m.chem_ids - o.chem_ids  -- assuming this does not upset sort order!
   FROM  (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   old_table
      GROUP  BY 1
      ) o
   WHERE m.mixture_id = o.mixture_id
   AND   m.chem_ids IS DISTINCT FROM (m.chem_ids - o.chem_ids);

   RETURN NULL;
END
$func$  LANGUAGE plpgsql;


CREATE TRIGGER mixture_chems_delaft
AFTER DELETE ON mixture_chems
REFERENCING OLD TABLE AS old_table
FOR EACH STATEMENT
EXECUTE PROCEDURE trg_mixture_chems_delaft();

db<>fiddle here

This implementation is strict: a mixture with no chemicals (chem_ids = '{}') is just another case that is only allowed once. You may want to allow that multiple times instead. This state is only reached after deleting all existing components, newly inserted row in mixtures start out with chem_ids IS NULL to dodge this UNIQUE constraint.

And you may want to add a PRIMARY KEY constraint to disallow adding the same chemical to a mixture multiple times:

CREATE TABLE mixture_chems (
  mixture_id INTEGER REFERENCES mixtures (mixture_id)
, chem_id INTEGER REFERENCES chemicals (chem_id)
, PRIMARY KEY (mixture_id, chem_id)
);

But my implementation works either way.

Related: