Deduplicate SELECT statements in relational division

redshiftrelational-division

I have a query that does a lot of duplicate work:

SELECT visitor_id, '1'::text AS filter    
FROM events    
WHERE id IN (SELECT event_id FROM params
             WHERE key = 'utm_campaign' AND value = 'campaign_one')
  AND id IN (SELECT event_id FROM params
             WHERE key = 'utm_source'   AND value = 'facebook')
GROUP BY visitor_id

UNION ALL

SELECT visitor_id, '2'::text AS filter
FROM events    
WHERE id IN (SELECT event_id FROM params
             WHERE key = 'utm_campaign' AND value = 'campaign_two')
  AND id IN (SELECT event_id FROM params
             WHERE key = 'utm_source'   AND value = 'facebook')
GROUP BY visitor_id

As you can see, it performs different filtering on the params table 4 times. I'm using Redshift and although it scans this table very quickly, I have quite a few of these statements UNIONed together. Is there a way to rewrite the SQL using CASE/IF statements?

The example uses key = 'utm_source' AND value = 'facebook' in both, but this is not necessarily true for all selects.

Best Answer

I see a possible minor simplification with a CTE (available in Redshift) for the repeated subquery with identical predicates:

WITH p2 AS (
   SELECT event_id
   FROM   params 
   WHERE  key = 'utm_source' AND value = 'facebook'
   )
SELECT e.visitor_id, '1'::text AS filter   
FROM          p2 
JOIN   params p1 USING (event_id)
JOIN   events e ON e.id = p2.event_id
WHERE  p1.key = 'utm_campaign' AND p1.value = 'campaign_one'
GROUP  BY e.visitor_id

UNION ALL
SELECT e.visitor_id, '2'::text AS filter   
FROM          p2 
JOIN   params p1 USING (event_id)
JOIN   events e ON e.id = p2.event_id
WHERE  p1.key = 'utm_campaign' AND p1.value = 'campaign_two'
GROUP  BY e.visitor_id;

Plain joins may also be faster than multiple IN semi-joins.

This multicolumn index should allow index-only scans on params:

CREATE INDEX foo_idx ON params (key, value, event_id)

Add another index on just (event_id) if you don't have one, yet.

Arsenal of available query techniques for relational division under this related question on SO:

How to filter SQL results in a has-many-through relation

Avoid multiple SELECTs

As commented by @Andriy, we can squeeze out some more:

WITH p2 AS (  -- repeated, immutable filter
   SELECT event_id
   FROM   params 
   WHERE  key = 'utm_source' AND value = 'facebook'
   )
, p3 (value, filter) AS (  -- values for variable filter
   SELECT text 'campaign_one', text '1'
   UNION ALL SELECT 'campaign_two', '2'
   )
SELECT e.visitor_id, p3.filter   
FROM          p3
JOIN   params p1 USING (value)
JOIN          p2 USING (event_id)
JOIN   events e ON e.id = p2.event_id
WHERE  p1.key = 'utm_campaign'  -- repeated for p1
GROUP  BY 1, 2;

In Postgres we could use a shorter, faster VALUES expression, but that feature is not currently supported in Redshift:

...
, p3 (value, filter) AS (
     VALUES
     (text 'campaign_one', text '1')
   , (     'campaign_two',      '2')
   )
...

For just two SELECT in the UNION, this won't buy much. But it should be a substantial improvement for more - like you mentioned.

2nd query without CTE

The CTE is not necessary for the second query. You can simplify to:

SELECT e.visitor_id, p3.filter   
FROM  (
   SELECT text 'campaign_one' AS value, text '1' AS filter
   UNION ALL SELECT 'campaign_two', '2'
   ) p3                            -- values for variable filter
JOIN   params p1 USING (value)
JOIN   params p2 USING (event_id)
JOIN   events e ON e.id = p2.event_id
WHERE  p1.key = 'utm_campaign'     -- repeated, immutable filters
AND    p2.key = 'utm_source'
AND    p2.value = 'facebook'
GROUP  BY 1, 2;

About CTEs

Resources for Common Table Expressions (as requested in the comment):

Data-modifying CTEs are particularly useful. Example:

How do I insert a row which contains a foreign key?

Basics with an added advanced example on this related answer on SO:

Are SELECT type queries the only type that can be nested?

Related Solutions

PostgreSQL – Select Column Values Linked to Exact Set of Values in Another Column

Assuming a PRIMARY KEY on (chat_id, user_id) - or a UNIQUE constraint plus NOT NULL on both columns.

To allow any number of IDs, use a 1-dimensional array parameter:

SELECT chat_id
FROM   chats_users c
WHERE  user_id = ANY ('{1,2,3}'::int[])
GROUP  BY 1
HAVING count(*) = array_length('{1,2,3}'::int[], 1)
AND    NOT EXISTS (
   SELECT 1
   FROM   chats_users
   WHERE  chat_id = c.chat_id
   AND    user_id <> ALL ('{1,2,3}'::int[])
   );

Typically, in function or prepared statements, you provide the array parameter only once. Example for prepared statement:

PREPARE foo(int[]) AS
SELECT chat_id
FROM   chats_users c
WHERE  user_id = ANY ($1)
GROUP  BY 1
HAVING count(*) = array_length($1, 1)  -- all required user_id
AND    NOT EXISTS (
   SELECT 1
   FROM   chats_users
   WHERE  chat_id = c.chat_id
   AND    user_id <> ALL ($1)  -- no additional user_id
   );

Tests:

EXECUTE foo('{1,2,3}');  -- 1
EXECUTE foo('{1,3}');    -- 2
EXECUTE foo('{2,3}');    -- 3
EXECUTE foo('{1,2}');    -- empty

db<>fiddle here
Old sqlfiddle.

That's basically the first solution by Martin in this related answer on SO (with some modifications to fit your case):

How to filter SQL results in a has-many-through relation?

There are faster solutions for a small, given number of IDs.

Redshift – Grant SELECT on All Tables

You need the USAGE privilege (at least) for the schema as well:

GRANT USAGE ON SCHEMA something TO GROUP data_viewers;