PostgreSQL Optimization – Best Way to Select a Matching Subset of Rows

Tags: array, many-to-many, optimization, postgresql

I have a many-to-many table like this:

CREATE TABLE house_to_cats (
  id SERIAL PRIMARY KEY,
  house_id INTEGER,
  cat_id INTEGER
);

-- house with cats 1 and 2: too small
INSERT INTO house_to_cats (house_id, cat_id) VALUES (1, 1), (1, 2);
-- house with cats 1 2 3 4: too big
INSERT INTO house_to_cats (house_id, cat_id) VALUES (2, 1), (2, 2), (2, 3), (2, 4);
-- house with cats 1 2 3: just right
INSERT INTO house_to_cats (house_id, cat_id) VALUES (3, 1), (3, 2), (3, 3);

I need a query that takes an arbitrary list of cats and returns the house whose cats match that list exactly, if one exists. I came up with this:

SELECT
    house_id
FROM (
    SELECT
          house_id
        , ARRAY_AGG(cat_id) as cat_id_agg
    FROM house_to_cats
    JOIN (
        SELECT DISTINCT
              house_id
        FROM house_to_cats
        JOIN (SELECT * FROM UNNEST(ARRAY[1, 2, 3]) cat_id) inn USING (cat_id)
    ) filter USING (house_id)
    GROUP BY house_id
) agg
WHERE cat_id_agg <@ ARRAY[1, 2, 3]
  AND cat_id_agg @> ARRAY[1, 2, 3];

Is there a better way to do this?

The idea behind my query: in filter, get the house_ids that contain at least one of our cats. In agg, build a cat_id_agg array for each of those house_ids. In the outermost query, filter out the groups that don't match our set.

Best Answer

If I understood you correctly, your query can be simplified to:

select house_id
from house_to_cats
group by house_id
having array_agg(cat_id order by cat_id) = array[1,2,3]

Note the order by in the array_agg() call: the array [3,2,1] is not equal to the array [1,2,3]. To avoid incorrect results when rows are aggregated in a different order, the aggregated array must list its values in the same order as the comparison value.

Online example: https://rextester.com/IKGHWL59301
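If the list of cats can arrive in any order, one option is to sort the input inside the query as well. This is only a sketch, assuming the house_to_cats table from the question and no duplicate (house_id, cat_id) rows; the aliases t and c are made up for illustration:

```sql
-- Compare two sorted arrays, so the order of the input list no longer matters.
SELECT house_id
FROM   house_to_cats
GROUP  BY house_id
HAVING array_agg(cat_id ORDER BY cat_id)
     = (SELECT array_agg(c ORDER BY c)          -- sort the input list once
        FROM   unnest(ARRAY[3, 1, 2]) AS t(c));
```

With the sample data from the question this returns house_id 3. If the input list may contain duplicates, array_agg(DISTINCT c ORDER BY c) in the subquery would remove them first.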

Related Solutions

PostgreSQL Performance – Efficiently Return Aggregated Arrays from m:n Table

Query

Creating a temporary table and looping are expensive overkill for the purpose. You don't even need plpgsql in the first place - though it may be slightly faster for repeated calls in the same session. Radically simplify:

CREATE OR REPLACE FUNCTION get_users_by_ids(_uids text[])
  RETURNS JSON AS
$func$
   SELECT json_agg(sub)
   FROM (
      SELECT u.id, u.username
           , ARRAY (SELECT followid FROM followers WHERE userid   = u.id) AS following
           , ARRAY (SELECT userid   FROM followers WHERE followid = u.id) AS followers
      FROM   users u
      WHERE  u.id = ANY (_uids)
      ) sub
$func$  LANGUAGE sql SECURITY DEFINER;

I use json_agg() on a subquery instead of json_build_object(). That should be a bit faster still. Related:

Select columns inside json_agg

And it conveniently allows cheap ordering of array elements if you should need that: add ORDER BY in the subquery. You might want to preserve original order of elements. See:

PostgreSQL unnest() with element number
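Both ordering ideas can be sketched roughly as follows. This assumes the users and followers tables implied by the function above, with text IDs; the sample IDs and the aliases t and ord are hypothetical:

```sql
-- Sorted array elements: ORDER BY inside the correlated subquery.
SELECT u.id, u.username
     , ARRAY (SELECT followid
              FROM   followers
              WHERE  userid = u.id
              ORDER  BY followid) AS following
FROM   users u
WHERE  u.id = ANY ('{alice,bob}'::text[]);

-- Preserve the order of the input array:
-- WITH ORDINALITY numbers the unnested elements (Postgres 9.4 or later).
SELECT u.id, u.username
FROM   unnest('{bob,alice}'::text[]) WITH ORDINALITY AS t(id, ord)
JOIN   users u USING (id)
ORDER  BY t.ord;
```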

If you need SECURITY DEFINER (do you really?), make sure it cannot be abused. See this Postgres Wiki page:

A Guide to CVE-2018-1058: Protect Your Search Path

Correlated subqueries should be fastest here; with the ARRAY constructor you get an empty array (not NULL) for following and followers if none are found. Alternatively, a LATERAL join might serve.
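For comparison, a LATERAL variant might look like the sketch below. It assumes the same users and followers tables; the aliases f1 and f2 and the sample IDs are illustrative only:

```sql
SELECT u.id, u.username, f1.following, f2.followers
FROM   users u
LEFT   JOIN LATERAL (
   -- array_agg() over zero rows yields NULL;
   -- wrap it in COALESCE(..., '{}') if an empty array is preferred.
   SELECT array_agg(followid) AS following
   FROM   followers
   WHERE  userid = u.id
   ) f1 ON true
LEFT   JOIN LATERAL (
   SELECT array_agg(userid) AS followers
   FROM   followers
   WHERE  followid = u.id
   ) f2 ON true
WHERE  u.id = ANY ('{alice,bob}'::text[]);
```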

If you need to nest everything in a 'data' key, you can add that easily, but that seems to be just noise.
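One way to add such a key, as a sketch only, is to wrap the aggregate in json_build_object(). The subquery is shortened here and the sample IDs are hypothetical:

```sql
-- Nest the aggregated rows under a single 'data' key.
SELECT json_build_object('data', json_agg(sub))
FROM  (
   SELECT u.id, u.username
   FROM   users u
   WHERE  u.id = ANY ('{alice,bob}'::text[])
   ) sub;
```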

A VARIADIC parameter for _uids may be convenient:

How to use an array as argument to a VARIADIC function in PostgreSQL?

(But the list input only allows up to 100 parameters. You can still pass arrays of any length.)
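A VARIADIC version could be sketched like this. The function name get_users_by_ids_variadic is made up, the body is shortened to two columns, and SECURITY DEFINER is left out on purpose:

```sql
CREATE OR REPLACE FUNCTION get_users_by_ids_variadic(VARIADIC _uids text[])
  RETURNS json
  LANGUAGE sql STABLE AS
$func$
   SELECT json_agg(sub)
   FROM  (
      SELECT u.id, u.username
      FROM   users u
      WHERE  u.id = ANY (_uids)
      ) sub
$func$;

-- Call with a plain list of arguments ...
SELECT get_users_by_ids_variadic('alice', 'bob');

-- ... or pass an existing array of any length with the VARIADIC keyword.
SELECT get_users_by_ids_variadic(VARIADIC '{alice,bob}'::text[]);
```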

Index

To allow index-only scans, create the secondary index followers_followid_idx on (followid, userid) instead of just (followid). Related:

Does PostgreSQL use an index-only scan in this JOIN?
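The index change might be applied as below. This sketch assumes an existing single-column index with that name, and that the other lookup direction is already covered by a primary key or unique index on (userid, followid), which the text does not show:

```sql
-- Replace the single-column index with a two-column one so that
-- "WHERE followid = ..." can be answered from the index alone.
DROP INDEX IF EXISTS followers_followid_idx;
CREATE INDEX followers_followid_idx ON followers (followid, userid);
```

Index-only scans also depend on the visibility map being reasonably current, so a recently vacuumed table benefits most.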

DB design

The normalized design is a good idea. It helps write speed a lot and prevents extensive table bloat and locking contention when working on followings. And it is superior in a number of other respects.

I would strongly suggest working with integer IDs, though. They are smaller and faster, and they give you the optimum size for your indexes. Related:

Is a composite index also good for queries on the first field?

You can always output text IDs additionally.
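A layout with integer IDs could look roughly like the following. The original schema is not shown, so every column definition here is an assumption; only the table and column names used by the function above are taken from the text:

```sql
CREATE TABLE users (
  id       serial PRIMARY KEY,        -- integer surrogate key
  username text NOT NULL UNIQUE       -- text identifier, still available for output
);

CREATE TABLE followers (
  userid   integer NOT NULL REFERENCES users (id),
  followid integer NOT NULL REFERENCES users (id),
  PRIMARY KEY (userid, followid)      -- also serves lookups by userid
);
```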
