PostgreSQL – How to Get Unique Sets of IDs from Table

database-design many-to-many postgresql unique-constraint

I want to represent mixtures of an arbitrary number of chemicals. Since the relationship between chemicals and mixtures is many-to-many, I thought I'd implement it like this (simplified):

CREATE TABLE chemicals (
    name text PRIMARY KEY,
    chem_id SERIAL UNIQUE NOT NULL
);
CREATE TABLE mixtures (
    mixture_id SERIAL PRIMARY KEY
);
CREATE TABLE mixture_chems (
    mixture_id INTEGER REFERENCES mixtures (mixture_id),
    chem_id INTEGER REFERENCES chemicals (chem_id)
);

But I would also like to enforce that any particular combination of chem_id values (via the rows in the mixture_chems table) is referred to by only one (unique) mixture_id.

How could I implement this in PostgreSQL?

One person suggested I might need to use triggers to compute some new value, that uniquely identifies a mixture, and then enforce uniqueness on that. Thoughts on how to implement that, or whether it'd be appropriate here?

Best Answer

I agree with that person. Here is an implementation.

Basically, add a UNIQUE array column of all chem_ids to table mixtures and keep it current with triggers. Arrays must be sorted consistently. I use the additional module intarray for that and to optimize performance.
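To illustrate why the sort order matters, here is a minimal sketch (not part of the original answer's code). Postgres compares arrays element by element, so the same set of IDs stored in a different order would count as a different value for a UNIQUE constraint. The sort() function from intarray gives every set one canonical spelling.

```sql
-- intarray provides sort() and the + / - operators for integer arrays
CREATE EXTENSION IF NOT EXISTS intarray;

-- Arrays compare element by element, so element order matters:
SELECT '{1,2}'::int[] = '{2,1}'::int[] AS same_unsorted;            -- false

-- After sorting, both arrays spell the same set identically:
SELECT sort('{2,1}'::int[]) = sort('{1,2}'::int[]) AS same_sorted;  -- true
```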

For lack of definition I assume frequent multi-row writes, making it a perfect candidate for transition tables introduced with Postgres 10. See:

What is a "transition table" in Postgres?

Important note: Technically, this works in Postgres 10. But while testing I ran into a bug in intarray functions that seemed oddly familiar: empty arrays would not compare equal due to incorrect internal array dimensions. Tom Lane found and fixed this for Postgres 11, but the fix was not backported to Postgres 10. I strongly advise using Postgres 11 for this.

(Turns out to be another instance of a bug I reported myself earlier. Took me a while to reproduce and get the full picture.)

This uses a variety of advanced features. Not recommended for beginners.

Code

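-- required for sort() and the array operators + and - used in the triggers
CREATE EXTENSION IF NOT EXISTS intarray;

CREATE TABLE chemicals (
  chem_id serial UNIQUE NOT NULL
, name text PRIMARY KEY
);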

CREATE TABLE mixtures (
  mixture_id serial PRIMARY KEY
, chem_ids int[] UNIQUE  -- default NULL !
);

CREATE TABLE mixture_chems (
  mixture_id int REFERENCES mixtures (mixture_id)
, chem_id int    REFERENCES chemicals (chem_id)
);

INSERT trigger

CREATE OR REPLACE FUNCTION trg_mixture_chems_insaft()
  RETURNS trigger AS
$func$
BEGIN
   UPDATE mixtures AS m
   SET    chem_ids = sort(COALESCE(m.chem_ids, '{}') + n.chem_ids)
   FROM  (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   new_table
      GROUP  BY 1
      ) n
   WHERE m.mixture_id = n.mixture_id;

   RETURN NULL;
END
$func$  LANGUAGE plpgsql;


CREATE TRIGGER mixture_chems_insaft
AFTER INSERT ON mixture_chems
REFERENCING NEW TABLE AS new_table
FOR EACH STATEMENT
EXECUTE PROCEDURE trg_mixture_chems_insaft();

UPDATE trigger

CREATE OR REPLACE FUNCTION trg_mixture_chems_upaft()
  RETURNS trigger AS
$func$
BEGIN
   UPDATE mixtures AS m
   SET    chem_ids = sort(COALESCE(m.chem_ids, '{}')
                        - COALESCE(o.chem_ids, '{}')
                        + COALESCE(n.chem_ids, '{}'))
   FROM  (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   new_table
      GROUP  BY 1
      ) n
   FULL  JOIN (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   old_table
      GROUP  BY 1
      ) o USING (mixture_id)
   WHERE m.mixture_id = COALESCE(n.mixture_id, o.mixture_id)
   AND   m.chem_ids IS DISTINCT FROM sort(COALESCE(m.chem_ids, '{}')
                                        - COALESCE(o.chem_ids, '{}')
                                        + COALESCE(n.chem_ids, '{}'));

   RETURN NULL;
END
$func$  LANGUAGE plpgsql;


CREATE TRIGGER mixture_chems_upaft
AFTER UPDATE ON mixture_chems
REFERENCING NEW TABLE AS new_table
            OLD TABLE AS old_table
FOR EACH STATEMENT
EXECUTE PROCEDURE trg_mixture_chems_upaft();

DELETE trigger

CREATE OR REPLACE FUNCTION trg_mixture_chems_delaft()
  RETURNS trigger AS
$func$
BEGIN
   UPDATE mixtures AS m
   SET    chem_ids = m.chem_ids - o.chem_ids  -- assuming this does not upset sort order!
   FROM  (
      SELECT mixture_id, array_agg(chem_id) AS chem_ids
      FROM   old_table
      GROUP  BY 1
      ) o
   WHERE m.mixture_id = o.mixture_id
   AND   m.chem_ids IS DISTINCT FROM (m.chem_ids - o.chem_ids);

   RETURN NULL;
END
$func$  LANGUAGE plpgsql;


CREATE TRIGGER mixture_chems_delaft
AFTER DELETE ON mixture_chems
REFERENCING OLD TABLE AS old_table
FOR EACH STATEMENT
EXECUTE PROCEDURE trg_mixture_chems_delaft();
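A short usage sketch of how the tables and triggers above are expected to behave. The chemical names are made up, and the hard-coded IDs assume freshly created tables, so that the serial columns start at 1.

```sql
-- Assumption: fresh tables, so chem_id and mixture_id are numbered from 1
INSERT INTO chemicals (name) VALUES ('water'), ('ethanol'), ('acetone');

INSERT INTO mixtures DEFAULT VALUES;  -- mixture_id 1, chem_ids IS NULL
INSERT INTO mixtures DEFAULT VALUES;  -- mixture_id 2, chem_ids IS NULL

-- The statement-level INSERT trigger aggregates both new rows at once
-- and sets mixtures.chem_ids to '{1,2}' for mixture 1
INSERT INTO mixture_chems (mixture_id, chem_id) VALUES (1, 1), (1, 2);

SELECT mixture_id, chem_ids FROM mixtures ORDER BY mixture_id;

-- Same set of chemicals in a different order for mixture 2:
-- the trigger computes the sorted array '{1,2}' again, which violates
-- the UNIQUE constraint on mixtures.chem_ids, so the whole INSERT fails
INSERT INTO mixture_chems (mixture_id, chem_id) VALUES (2, 2), (2, 1);
```

Because the trigger runs as part of the same statement, the failed INSERT leaves no rows behind in mixture_chems.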


This implementation is strict: a mixture with no chemicals (chem_ids = '{}') is just another case that is only allowed once. You may want to allow that multiple times instead. This state is only reached after deleting all existing components. Newly inserted rows in mixtures start out with chem_ids IS NULL, which dodges the UNIQUE constraint because NULL values are not considered equal.
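One possible way to allow any number of empty mixtures is to replace the UNIQUE constraint on the column with a partial unique index that skips empty arrays. This is a sketch under that assumption, not part of the original answer and not tested together with the triggers. The index name is hypothetical.

```sql
CREATE TABLE mixtures (
  mixture_id serial PRIMARY KEY
, chem_ids   int[]              -- no UNIQUE constraint on the column itself
);

-- Enforce uniqueness only for non-empty arrays.
-- Rows with chem_ids IS NULL or chem_ids = '{}' do not satisfy the
-- predicate, so they are not indexed and may occur any number of times.
CREATE UNIQUE INDEX mixtures_chem_ids_nonempty_uni
ON mixtures (chem_ids)
WHERE chem_ids <> '{}';
```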

And you may want to add a PRIMARY KEY constraint to disallow adding the same chemical to a mixture multiple times:

CREATE TABLE mixture_chems (
  mixture_id INTEGER REFERENCES mixtures (mixture_id)
, chem_id INTEGER REFERENCES chemicals (chem_id)
, PRIMARY KEY (mixture_id, chem_id)
);

But my implementation works either way.

PostgreSQL – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows at a time. That would solve most of it, even if it means executing the whole query every time.

If you go the route with a materialized view like I proposed, you could add a serial number without gaps to the table, plus an index on it, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete, because that needs to compute the whole table before it can sort and pick 100 rows.
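The materialized view itself might be built roughly like this. It is only a sketch: the tables customer and site, their columns, the sort order, and the rule for picking one site per customer are all assumptions, since the question's schema is not shown here.

```sql
-- Hypothetical base tables, only to make the sketch self-contained
CREATE TABLE customer (
  customer_id serial PRIMARY KEY
, name        text NOT NULL
);

CREATE TABLE site (
  site_id     serial PRIMARY KEY
, customer_id int NOT NULL REFERENCES customer
, address     text
);

CREATE MATERIALIZED VIEW materialized_view AS
SELECT row_number() OVER (ORDER BY name, customer_id) AS mv_id, *
FROM  (
   SELECT DISTINCT ON (c.customer_id)
          c.customer_id, c.name, s.site_id, s.address
   FROM   customer c
   LEFT   JOIN site s USING (customer_id)
   ORDER  BY c.customer_id, s.site_id  -- assumption: show the site with the lowest site_id
   ) sub;

-- Index on the gapless number, so range queries on mv_id can use it
CREATE UNIQUE INDEX materialized_view_mv_id_idx ON materialized_view (mv_id);

-- Re-run whenever the underlying data has changed
REFRESH MATERIALIZED VIEW materialized_view;
```

Since mv_id is recomputed on every refresh, it is only good for paging, not as a stable identifier.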

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:

Total runtime: 0.269 ms
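A self-contained statement to try this with, using generate_series() so that no table is needed:

```sql
-- Executes the query on the server and reports timing measured there,
-- without the time needed to transfer the result to the client
EXPLAIN ANALYZE
SELECT count(*) FROM generate_series(1, 100000);
```

The label of the final timing line differs between Postgres versions, so it may not read exactly "Total runtime" on your server.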

PostgreSQL – What happens to the index of a primary key after a DROP CONSTRAINT

Disclaimer

This is experimental and only tested rudimentarily. Proceed at your own risk. I would not use it myself and just drop / recreate constraints with standard DDL commands. If you break entries in the catalog tables you could easily mess up your database.

For all I know, there are only two differences between a PRIMARY KEY and a UNIQUE constraint in the catalog tables (the index itself is identical):

pg_index.indisprimary:
For PRIMARY KEY constraint ... TRUE
For UNIQUE constraint ... FALSE

pg_constraint.contype:
PRIMARY KEY constraint ... 'p'
UNIQUE constraint ... 'u'
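To look at these two flags side by side, a read-only query like the following can help. The table my_tbl is a hypothetical example. The constraint names in the comments are the defaults Postgres generates for it.

```sql
-- Hypothetical demo table with one PRIMARY KEY and one UNIQUE constraint
CREATE TABLE my_tbl (
  id   int  PRIMARY KEY   -- constraint and index named my_tbl_pkey
, code text UNIQUE        -- constraint and index named my_tbl_code_key
);

SELECT c.conname, c.contype, i.indisprimary, i.indisunique
FROM   pg_constraint c
JOIN   pg_index      i ON i.indexrelid = c.conindid
WHERE  c.conrelid = 'my_tbl'::regclass
ORDER  BY c.conname;
-- my_tbl_code_key | u | f | t
-- my_tbl_pkey     | p | t | t
```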

You could convert constraint and index in place, from PRIMARY KEY constraint to UNIQUE constraint, my_idx being the (optionally schema-qualified) index name:

UPDATE pg_index SET indisprimary = FALSE WHERE indexrelid = 'my_idx'::regclass;
UPDATE pg_constraint SET contype = 'u' WHERE conindid = 'my_idx'::regclass;

Or upgrade from UNIQUE to PRIMARY KEY:

UPDATE pg_index SET indisprimary = TRUE WHERE indexrelid = 'my_idx'::regclass;
UPDATE pg_constraint SET contype = 'p' WHERE conindid = 'my_idx'::regclass;
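For comparison, the standard DDL route mentioned in the disclaimer could look like the sketch below. Table, column, and constraint names are hypothetical. Dropping the PRIMARY KEY constraint also drops its index, so the new UNIQUE constraint builds a fresh one.

```sql
-- Hypothetical demo table
CREATE TABLE my_tbl (
  id int NOT NULL
, CONSTRAINT my_tbl_pkey PRIMARY KEY (id)
);

-- Swap the constraint in a single transaction,
-- so the column is never left without a uniqueness guarantee
BEGIN;
ALTER TABLE my_tbl DROP CONSTRAINT my_tbl_pkey;               -- drops the index, too
ALTER TABLE my_tbl ADD  CONSTRAINT my_tbl_id_uni UNIQUE (id); -- builds a new index
COMMIT;
```

Unlike the catalog updates, this rebuilds the index, which takes time and a lock on big tables. In exchange, it never touches the system catalogs by hand.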