PostgreSQL – Correct Way for Subset Calculations

postgresql

I have a huge table (around 10 million items). For simplicity, let's say it has only 2 columns: user_id and activity_id like this

user_id | activity_id
---------------------
1       | 1
1       | 2
1       | 3
2       | 1
2       | 2

I want to select all user_id with activity_id = 1, 2 NOT 3. In the case above it will be just one result: user_id = 2. I can do it using SELECT DISTINCT combined with INTERSECT and EXCEPT operators, but it seems to be extremely slow.

From what I know about databases, it can be improved with GIN and table partitioning, however I feel like it's not correct solution in the case of PostgreSQL (because subsets operators are slow by their own).

Best Answer

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
   and not 3 = any(array_agg(activity_id));

The condition array_agg(activity_id) @> array[1,2] only returns those that have activity_ids 1 and 2 and the condition not 3 = any(array_agg(activity_id)) removes all those that contain activity_id = 3

If the table contains more than just those two columns, an index on (user_id, activitiy_id) will help as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If there are only very users that have activity_ids 1 and two, an additional condition that only returns rows with either one of them (e.g. using a where exists condition) might help as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) to enable Postgres to remove unwanted rows efficiently.

On a table with 100.000 rows this ran in about 100ms on my laptop with Postgres 11 and a SSD.

Online example: https://rextester.com/YLN7221