PostgreSQL – Correct Way for Subset Calculations

postgresql

I have a huge table (around 10 million rows). For simplicity, let's say it has only 2 columns, user_id and activity_id, like this:

user_id | activity_id
---------------------
1       | 1
1       | 2
1       | 3
2       | 1
2       | 2

I want to select every user_id that has activity_id 1 and 2 but not 3. In the case above there is just one result: user_id = 2. I can do it using SELECT DISTINCT combined with the INTERSECT and EXCEPT operators, but it seems to be extremely slow.
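For reference, the set-operator approach described above would look roughly like the sketch below. The table name users is an assumption (it matches the name used in the answer further down). INTERSECT and EXCEPT already remove duplicates, so an extra DISTINCT should not be needed.

```sql
-- Sketch of the INTERSECT / EXCEPT variant; "users" is an assumed table name.
-- INTERSECT binds tighter than EXCEPT, so this reads as (1 AND 2) minus 3.
select user_id from users where activity_id = 1
intersect
select user_id from users where activity_id = 2
except
select user_id from users where activity_id = 3;
```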

From what I know about databases, it can be improved with a GIN index and table partitioning. However, I feel that's not the right solution in the case of PostgreSQL (because the set operators are slow on their own).

Best Answer

You can easily do this with arrays in Postgres:

select user_id, array_agg(activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
   and not 3 = any(array_agg(activity_id));

The condition array_agg(activity_id) @> array[1,2] keeps only those users that have both activity_ids 1 and 2, and the condition not 3 = any(array_agg(activity_id)) removes all those that also have activity_id = 3.
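To try the query on the sample rows from the question, a minimal setup could look like this. The column types are an assumption (integer, so that the aggregated array compares cleanly with array[1,2]), and the order by inside array_agg is only there to make the displayed array deterministic.

```sql
-- Hypothetical setup mirroring the five sample rows from the question.
create table users (user_id integer, activity_id integer);

insert into users (user_id, activity_id)
values (1, 1), (1, 2), (1, 3), (2, 1), (2, 2);

-- Same query as above; user 1 is dropped because it also has activity 3.
select user_id, array_agg(activity_id order by activity_id) as activities
from users
group by user_id
having array_agg(activity_id) @> array[1,2]
   and not 3 = any(array_agg(activity_id));
```

On this data the query should return a single row, for user_id 2.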

If the table contains more than just those two columns, an index on (user_id, activity_id) will help, as it enables Postgres to use an "Index Only Scan" instead of a full table scan. If only very few users have activity_ids 1 and 2, an additional condition that keeps only users having at least one of them (e.g. using a where exists condition) might help, as it reduces the number of rows that need to be aggregated. In that case the index should be on (activity_id, user_id) so that Postgres can discard unwanted rows efficiently.
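A sketch of that pre-filtered variant follows. The index names and table aliases are made up for illustration, and whether it actually beats the plain query depends on the data distribution, so it is worth checking with explain analyze. Note that the filter has to be an exists on the user rather than a plain where activity_id in (1, 2): the latter would throw away the rows with activity_id = 3 before aggregation, and the exclusion would no longer work.

```sql
-- Hypothetical index names; pick whichever matches the query you end up using.
create index users_user_activity_idx on users (user_id, activity_id);
create index users_activity_user_idx on users (activity_id, user_id);

-- Only users with at least one of the wanted activities reach the aggregation,
-- but all of their rows (including activity 3) are still aggregated.
select u.user_id
from users u
where exists (select 1
              from users f
              where f.user_id = u.user_id
                and f.activity_id in (1, 2))
group by u.user_id
having array_agg(u.activity_id) @> array[1,2]
   and not 3 = any(array_agg(u.activity_id));
```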

On a table with 100,000 rows this ran in about 100ms on my laptop with Postgres 11 and an SSD.

Online example: https://rextester.com/YLN7221

Update

"You'll have to check that out for each install by querying the catalog."—exactly, can you tell me such a query? – John Frazer Jun 17 at 8:03

Sure. Run psql -E, then run \?. This gives you far more than what you've asked for, viz. commands to list: tables, views, and sequences; aggregates; access methods; tablespaces; conversions; casts; default privileges; domains; foreign tables; foreign servers; user mappings; foreign-data wrappers; functions; text search configurations; text search dictionaries; text search parsers; text search templates; roles; indexes; large objects (same as \lo_list); procedural languages; materialized views; schemas; operators; collations; table, view, and sequence access privileges; sequences; tables; data types; views; extensions; event triggers; and databases. Next pick one of them, like \df, which lists functions. Because of the -E flag, psql echoes the SQL it runs behind that command.

You'll get a query like:

SELECT n.nspname as "Schema",
  p.proname as "Name",
  pg_catalog.pg_get_function_result(p.oid) as "Result data type",
  pg_catalog.pg_get_function_arguments(p.oid) as "Argument data types",
 CASE
  WHEN p.proisagg THEN 'agg'
  WHEN p.proiswindow THEN 'window'
  WHEN p.prorettype = 'pg_catalog.trigger'::pg_catalog.regtype THEN 'trigger'
  ELSE 'normal'
 END as "Type"
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
ORDER BY 1, 2, 4;

This is the query that psql uses internally to get this information. You can trim it if you only want the name and don't care whether the function is an aggregate, window, or trigger function, and you can trim it further if you don't care about the result and argument types:

SELECT n.nspname as "Schema",
  p.proname as "Name"
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
ORDER BY 1, 2;

And so on. Alternatively, you can query the slower information_schema interface, which is standardized as part of the SQL spec:

SELECT routine_catalog, routine_schema, routine_name
FROM information_schema.routines
WHERE routine_schema NOT IN ('pg_catalog', 'information_schema');
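One caveat about the full \df query shown earlier: the proisagg and proiswindow columns were removed from pg_proc in PostgreSQL 11 and replaced by a single prokind column, so that query fails on newer servers. A sketch of an equivalent for PostgreSQL 11 and later is below. It is modeled on the query above rather than copied from psql output, and the "Type" labels are illustrative, so running psql -E against your own server remains the authoritative way to see what psql sends.

```sql
-- Sketch for PostgreSQL 11+, where prokind replaces proisagg / proiswindow.
-- prokind values: 'f' function, 'p' procedure, 'a' aggregate, 'w' window.
SELECT n.nspname AS "Schema",
       p.proname AS "Name",
       pg_catalog.pg_get_function_result(p.oid) AS "Result data type",
       pg_catalog.pg_get_function_arguments(p.oid) AS "Argument data types",
       CASE p.prokind
         WHEN 'a' THEN 'agg'
         WHEN 'w' THEN 'window'
         WHEN 'p' THEN 'proc'
         ELSE 'func'
       END AS "Type"
FROM pg_catalog.pg_proc p
     LEFT JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE pg_catalog.pg_function_is_visible(p.oid)
      AND n.nspname <> 'pg_catalog'
      AND n.nspname <> 'information_schema'
ORDER BY 1, 2, 4;
```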
