Filter table lookup by array value, but returning groupby unfiltered

Tags: many-to-many, postgresql, postgresql-13, postgresql-performance

I have multiple lookup tables, and each needs to be filtered by an array of values:

select 
array_agg(joined_properties.name) as properties,
items.name
from items
left join (
  select properties.name, item_property.item_id, item_property.property_id
  from item_property
  join properties on properties.id = item_property.property_id) as joined_properties on joined_properties.item_id = items.id
group by items.name;

properties	name
property1	item2
property1,property2,property4	item1
property2,property3	item3
property2	item5
property3,property4	item4

I would like to filter items by an array of properties, but return each matching item's complete, unfiltered property list (just like above) in the same query.
What is the performant/correct way of doing this without re-running the joins once the matching rows are found? Whenever I filter inside the join, or outside it before grouping, the query of course returns only the matching properties:

select 
array_agg(joined_properties.name) as properties,
items.name
from items
inner join (
  select properties.name, item_property.item_id, item_property.property_id
  from item_property
  join properties on properties.id = item_property.property_id
  where properties.name = ANY('{"property1","property2"}')
) as joined_properties on joined_properties.item_id = items.id
group by items.name;

properties	name
property1,property2	item1
property1	item2
property2	item3
property2	item5

Expected output:

properties	name
property1	item2
property1,property2,property4	item1
property2,property3	item3
property2	item5

Fiddle below:

View on DB Fiddle

Edit 1:

Approximate hardware specs: 4-core Xeon Gold VM, 16GB RAM, SSD
Relevant DB cardinalities:
- ~100k records in items,
- 6 properties_i text tables with their respective intermediary tables,
- most properties_i tables range from 10-30 records,
- largest properties_i will have 200 records indexed,
- each item in items has 0-10 associated properties for any property_i

Best Answer

The best solution heavily depends on data distribution and cardinalities - and available indexes.

This query should be generally good:


SELECT sub.properties, i.name AS item
FROM  (
   SELECT item_id, array_agg(p.name) AS properties
   FROM  (
      SELECT DISTINCT item_id
      FROM   item_property
      WHERE  property_id = ANY (
         SELECT id FROM properties
         WHERE  name = ANY ('{"property1","property2"}')
         )
      ) ip0
   JOIN   item_property ip USING (item_id)
   JOIN   properties    p  ON p.id = ip.property_id
   GROUP  BY item_id
   ) sub
JOIN   items i ON i.id = sub.item_id
ORDER  BY i.name;  -- my optional addition

db<>fiddle here

The innermost subquery ip0 identifies items (or rather: item_id's) that have one of the filtered properties.
The next subquery sub joins to all properties and aggregates per item - not involving table items, yet!
The outer query replaces item_id with actual item name to arrive at your desired result. If item names are not defined UNIQUE NOT NULL, I would still return item_id additionally to avoid ambiguities.

Other query styles may be faster for particular data distributions / search arguments. There are many ways. I added one alternative with IN to the fiddle.
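As an illustration of another query style (this is not the IN variant from the fiddle, which is not reproduced here), the sketch below aggregates once and filters whole groups with HAVING bool_or(...). The table definitions and ids are assumptions reconstructed from the sample output in the question; the real DDL lives in the fiddle. Because it reads every row of item_property before filtering, it is likely to suit non-selective filters better than selective ones.

```sql
-- Assumed minimal schema and data, reconstructed from the question's output.
CREATE TEMP TABLE items      (id int PRIMARY KEY, name text NOT NULL);
CREATE TEMP TABLE properties (id int PRIMARY KEY, name text NOT NULL);
CREATE TEMP TABLE item_property (
   item_id     int REFERENCES items
 , property_id int REFERENCES properties
 , PRIMARY KEY (item_id, property_id)
);

INSERT INTO items VALUES
  (1,'item1'), (2,'item2'), (3,'item3'), (4,'item4'), (5,'item5');
INSERT INTO properties VALUES
  (1,'property1'), (2,'property2'), (3,'property3'), (4,'property4');
INSERT INTO item_property VALUES
  (1,1), (1,2), (1,4), (2,1), (3,2), (3,3), (4,3), (4,4), (5,2);

-- Aggregate all properties per item, then keep only the groups in which
-- at least one property matches the filter array.
SELECT array_agg(p.name ORDER BY p.name) AS properties, i.name AS item
FROM   items         i
JOIN   item_property ip ON ip.item_id = i.id
JOIN   properties    p  ON p.id = ip.property_id
GROUP  BY i.id, i.name
HAVING bool_or(p.name = ANY ('{property1,property2}'::text[]))
ORDER  BY i.name;
```

With the sample data, item4 (property3, property4) has no matching property, so its group is dropped, while the other four items keep their full property arrays.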

Indexes

For big tables and selective filters, matching indexes make all the difference. For your given data distribution:

- You don't need any indexes on properties as it's tiny.
- You'll already have a UNIQUE or PK constraint on item_property (item_id, property_id). Add another index on the same two columns in reversed order, so (property_id, item_id) in my example. Why? See: "Is a composite index also good for queries on the first field?" If item_property is vacuumed enough, you'll even see index-only scans.
- Obviously, you need an index on items(id), but the PK already covers that.
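The additional index could be created roughly as follows. The index name is a hypothetical choice, and the statement assumes item_property has the two columns used in the queries above.

```sql
-- Hypothetical index name; column order is the reverse of the PK (item_id, property_id),
-- so lookups by property_id can use the leading column.
CREATE INDEX item_property_property_id_item_id_idx
    ON item_property (property_id, item_id);

-- Refresh statistics and the visibility map so index-only scans become possible.
VACUUM ANALYZE item_property;
```

Running EXPLAIN (ANALYZE, BUFFERS) on the query before and after is the usual way to check whether the planner actually picks the new index.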

Related Solutions

Postgresql – Filtering UNION ALL result is much slower than filtering each subquery

What you're asking the DB to do in query one is:
- give me ALL from table A, FILTERED;
- give me ALL from table B, FILTERED;
- give me ALL from table C, FILTERED;
- give me ALL from table D, FILTERED;
and then UNION the results.

In the second query, you first fetch all the data and only afterwards apply the join and the filter. A JOIN and WHERE on top of a UNION query, which doesn't really let you use any index, obviously runs slower. (It has nothing to do with the server variant or the OS.)
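The two query shapes being contrasted can be sketched as below. The original queries are not shown in this excerpt, so the tables a and b, their columns, and the filter value are purely hypothetical stand-ins.

```sql
-- Hypothetical tables standing in for the ones in the original question.
CREATE TEMP TABLE a (id int PRIMARY KEY, val text);
CREATE TEMP TABLE b (id int PRIMARY KEY, val text);
INSERT INTO a VALUES (1,'a1'), (2,'a2');
INSERT INTO b VALUES (2,'b2'), (3,'b3');

-- Shape 1: each branch is filtered on its own, so each can use its own index.
SELECT id, val FROM a WHERE id = 2
UNION ALL
SELECT id, val FROM b WHERE id = 2;

-- Shape 2: the filter is applied to the combined result.
SELECT id, val
FROM  (
   SELECT id, val FROM a
   UNION ALL
   SELECT id, val FROM b
   ) u
WHERE  id = 2;
```

Both shapes return the same rows; whether the planner can push the outer filter down into the branches depends on the actual query, so comparing the two plans with EXPLAIN ANALYZE is the way to confirm where the time goes.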

Postgresql – SELECT multiple values from one table having matching record in another table in one row

Actually, since

there can be n number of values for a single record of table first.

the number of columns to return is not exactly arbitrary. There is a maximum number of columns, and this has a clean solution - unless you have more columns than Postgres allows for a table:

250 - 1600 depending on column types

BTW, typically, you would also have a third table listing all possible values of value, the whole of it implementing a classical many-to-many relationship.

How to implement a many-to-many relationship in PostgreSQL?

You can use CASE statements, or more elegantly, the crosstab() function of the additional module tablefunc. If you are unfamiliar with it, read basic instructions here first:

PostgreSQL Crosstab Query

Columns to the right of actual values are filled with NULL. Assuming a maximum of 5 possible values and building on this setup:

CREATE TABLE first  (id int, name text);
CREATE TABLE second (id int, value "char", fid int);  

INSERT INTO first  VALUES
  (1,'Ahmad')
, (2,'Sami')
, (3,'Khan')
, (4,'Nobody');  -- Added to demonstrate difference
INSERT INTO second VALUES
  (1,'a',1)
, (2,'b',1)
, (3,'c',2)
, (4,'d',1)
, (5,'e',2)
, (6,'f',3);

Either use crosstab(text) (1 parameter form) and join to table first another time:

SELECT id, f.name, value1, value2, value3, value4, value5
FROM   crosstab(
        'SELECT f.id, 1 AS dummy_category, s.value
         FROM   first  f
         JOIN   second s ON s.fid = f.id
         ORDER  BY f.id, s.value'
       ) ct (id int
           , value1 "char", value2 "char", value3 "char", value4 "char", value5 "char")
JOIN first f USING (id);

Or use crosstab(text, text) (2 parameter form) and generate dummy categories for your values:

SELECT *
FROM   crosstab(
        'SELECT f.id, f.name
              , row_number() OVER (PARTITION BY f.id ORDER BY s.value) AS dummy_category
              , s.value
         FROM   first  f
         JOIN   second s ON s.fid = f.id
         ORDER  BY f.id, s.value'
      , ('SELECT generate_series(1,5)')
       ) ct (id int, name text
           , value1 "char", value2 "char", value3 "char", value4 "char", value5 "char");

Result is the same either way:

 id | name  | value1 | value2 | value3 | value4 | value5
----+-------+--------+--------+--------+--------+--------
  1 | Ahmad | a      | b      | d      |        |
  2 | Sami  | c      | e      |        |        |
  3 | Khan  | f      |        |        |        |

If you want to include all rows from table first, make it a LEFT [OUTER] JOIN:

         ...
         LEFT JOIN second s ON s.fid = f.id
         ...

Then we get one additional result row for the above example:

  4 | Nobody |        |        |        |        |
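For comparison, the CASE route mentioned earlier could look like the following sketch, which needs no additional module. It assumes the first and second tables from the setup above and the same cap of 5 values. The column alias rn and the cast to text are my additions; the cast is there so that max() resolves unambiguously, which means the value columns come back as text rather than "char".

```sql
SELECT f.id, f.name
     , max(CASE WHEN s.rn = 1 THEN s.value END) AS value1
     , max(CASE WHEN s.rn = 2 THEN s.value END) AS value2
     , max(CASE WHEN s.rn = 3 THEN s.value END) AS value3
     , max(CASE WHEN s.rn = 4 THEN s.value END) AS value4
     , max(CASE WHEN s.rn = 5 THEN s.value END) AS value5
FROM   first f
JOIN  (
   -- Number the values per row of "first", in the same order crosstab() sees them.
   SELECT fid, value::text AS value
        , row_number() OVER (PARTITION BY fid ORDER BY value) AS rn
   FROM   second
   ) s ON s.fid = f.id
GROUP  BY f.id, f.name
ORDER  BY f.id;
```

As with the crosstab() variants, switching the JOIN to a LEFT JOIN should include rows of first that have no values at all.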
