PostgreSQL Query – Select Rows Where at Least One Row Per Set Meets a Condition

postgresqlpostgresql-9.4

I have following table:

create table test (
  company_id integer not null, 
  client_id integer not null, 
  client_status text,
  unique (company_id, client_id)
);

insert into test values
  (1, 1, 'y'),    -- company1

  (2, 2, null),   -- company2

  (3, 3, 'n'),    -- company3

  (4, 4, 'y'),    -- company4
  (4, 5, 'n'),

  (5, 6, null),   -- company5
  (5, 7, 'n')
;

Basically, there are 5 different companies, each one has one or more clients and each client has status: 'y' or 'n' (might be null as well).

What I have to do is to select all pairs (company_id, client_id) for all companies for which there is at least one client whose status is not 'n' ('y' or null). So for the example data above, the output should be:

company_id;client_id
1;1
2;2
4;4
4;5
5;6
5;7

I tried something with window functions but I can't figure out how to compare the number of ALL clients with the number of clients with STATUS = 'n'.

select company_id,
count(*) over (partition by company_id) as all_clients_count
from test
-- where all_clients_count != ... ?

I figured out how to do this, but I am not sure if it's the right way:

select sub.company_id, unnest(sub.client_ids)
from (
  select company_id, array_agg(client_id) as client_ids
  from test
  group by company_id
  having count(*) != count( (case when client_status = 'n' then 1 else null end) )
) sub

Best Answer

Basically you are looking for the expression:

client_status IS DISTINCT FROM 'n'

The column client_status should really be data type boolean, not text, which would allow the simpler expression:

client_status IS NOT FALSE

The manual has details in the chapter Comparison Operators.

Assuming your actual table has a UNIQUE or PK constraint, we arrive at:

CREATE TABLE test (
  company_id    integer NOT NULL, 
  client_id     integer NOT NULL, 
  client_status boolean,
  PRIMARY KEY (company_id, client_id)
);

Queries

All of these do the same (what you asked), which is the fastest depends on data distribution:

SELECT company_id, client_id
FROM   test t
WHERE  EXISTS (
   SELECT FROM test
   WHERE  company_id = t.company_id
   AND    client_status IS NOT FALSE
   );

Or:

SELECT company_id, client_id
FROM   test t
JOIN  (
   SELECT company_id
   FROM   test t
   GROUP  BY 1
   HAVING bool_or(client_status IS NOT FALSE)
   ) c USING (company_id);

Or:

SELECT company_id, client_id
FROM   test t
JOIN  (
   SELECT DISTINCT company_id, client_status 
   FROM   test t
   ORDER  BY company_id, client_status DESC
   ) c USING (company_id)
WHERE  c.client_status IS NOT FALSE;

Boolean values sort FALSE -> TRUE -> NULL in ascending sort order. So FALSE comes last in descending order. If there's is any other value available, then that one's picked first ...

Sorting null values after all others, except special

The added PK is implemented with a useful index for these queries. If you want faster, yet, add a partial index for query 1:

CREATE INDEX test_special_idx ON test (company_id, client_id)
WHERE  client_status IS NOT FALSE;

You could use window functions, too, but that would be slower. Example with first_value():

SELECT company_id, client_id
FROM  (
   SELECT company_id, client_id
        , first_value(client_status) OVER (PARTITION BY company_id
                                           ORDER BY client_status DESC) AS stat
   FROM   test t
   ) sub
WHERE  stat IS NOT FALSE;

For lots of rows per company_id, one of these techniques may be faster, still:

Optimize GROUP BY query to retrieve latest record per user

Related Solutions

Postgresql – Rotate a table in PostgreSQL

What about ALTER TABLE RENAME followed by CREATE TABLE?

Postgresql – how to chain postgres RULEs

Next time, please include the EXPLAIN output rather than making us dig for it in your scripts. There's no guarantee my system is using the same plan as yours (although with your test data it is likely).

The rule system here is working properly. First, the I want to include my own diagnostic queries (note I did not run EXPLAIN ANALYSE since I was just interested in what query plan was generated):

rulestest=# explain DELETE FROM user_hits WHERE day = '2013-03-16';
                                              QUERY PLAN                        

--------------------------------------------------------------------------------
----------------------
 Delete on application_hits  (cost=0.00..3953181.85 rows=316094576 width=24)
   ->  Nested Loop  (cost=0.00..3953181.85 rows=316094576 width=24)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..128.53 rows=6352 width=22)
               ->  Nested Loop  (cost=0.00..96.78 rows=6352 width=22)
                     ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 wi
dth=10)
                           Filter: (day = '2013-03-16'::date)
                     ->  Materialize  (cost=0.00..2.49 rows=16 width=16)
                           ->  Nested Loop  (cost=0.00..2.41 rows=16 width=16)
                                 ->  Seq Scan on application_hits  (cost=0.00..1
.10 rows=4 width=10)
                                       Filter: (day = '2013-03-16'::date)
                                 ->  Materialize  (cost=0.00..1.12 rows=4 width=
10)
                                       ->  Seq Scan on client_hits  (cost=0.00..
1.10 rows=4 width=10)
                                             Filter: (day = '2013-03-16'::date)

 Delete on client_hits  (cost=0.00..989722.41 rows=79023644 width=18)
   ->  Nested Loop  (cost=0.00..989722.41 rows=79023644 width=18)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..43.83 rows=1588 width=16)
               ->  Nested Loop  (cost=0.00..35.89 rows=1588 width=16)
                     ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 wi
dth=10)
                           Filter: (day = '2013-03-16'::date)
                     ->  Materialize  (cost=0.00..1.12 rows=4 width=10)
                           ->  Seq Scan on client_hits  (cost=0.00..1.10 rows=4 
width=10)
                                 Filter: (day = '2013-03-16'::date)

 Delete on project_hits  (cost=0.00..248851.80 rows=19755911 width=12)
   ->  Nested Loop  (cost=0.00..248851.80 rows=19755911 width=12)
         ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=10)
               Filter: (day = '2013-03-16'::date)
         ->  Materialize  (cost=0.00..16.91 rows=397 width=10)
               ->  Seq Scan on project_hits  (cost=0.00..14.93 rows=397 width=10
)
                     Filter: (day = '2013-03-16'::date)

 Delete on user_hits  (cost=0.00..1887.00 rows=49763 width=6)
   ->  Seq Scan on user_hits  (cost=0.00..1887.00 rows=49763 width=6)
         Filter: (day = '2013-03-16'::date)
(39 rows)

rulestest=# select distinct day from application_hits;
    day     
------------
 2013-03-15
 2013-03-16
(2 rows)

rulestest=# select count(*), day from application_hits group by day;
 count |    day     
-------+------------
     4 | 2013-03-15
     4 | 2013-03-16
(2 rows)

rulestest=# select count(*), day from client_hits group by day;
 count |    day     
-------+------------
     4 | 2013-03-15
     4 | 2013-03-16
(2 rows)

rulestest=# select count(*), day from project_hits group by day;
 count |    day     
-------+------------
   397 | 2013-03-15
   397 | 2013-03-16
(2 rows)

If your data is anything like your existing data, neither rules nor triggers will work very well. Better will be a stored procedure which you pass a value and it deletes everything you want.

First let's note that indexes here will get you nowhere because in all cases you are pulling half of the tables (I did add indexes on day on all tables to help the planner but this made no real difference).

You need to start with what you are doing with RULEs. RULEs basically rewrite queries and they do so using ways that are as robust as possible. Your code also doesn't match your example though it matches your question better. You have rules on tables which cascade to rules on other tables which cascade to rules on other tables

Therefore when you delete from user_hits where [criteria], the rules transform this into a set of queries:

DELETE FROM application_hits 
 WHERE day IN (SELECT day FROM client_hits 
               WHERE day IN (SELECT day FROM user_hits WHERE [condition]));
DELETE FROM client_hits
  WHERE day IN (SELECT day FROM user_hits WHERE [condition]);
DELETE FROM user_hits WHERE [condition];

Now, you might think we could skip the scan on client_hits in the first, but that isn't what happens here. The problem is that you could have days in user_hits and application_hits that are not in client_hits so you really have to scan all tables.

Now here there is no magic bullet. A trigger isn't going to work much better because, while it gets to avoid scanning every table, it gets fired every row that gets deleted so you basically end up with the same nested loop sequential scans that are currently killing performance. It will work a bit better because it will delete rows along the way rather than rewriting the query along the way, but it isn't going to perform very well.

A much better solution is to just define a stored procedure and have the application call that. Something like:

CREATE OR REPLACE FUNCTION delete_stats_at_date(in_day date) RETURNS BOOL 
LANGUAGE SQL AS
$$
DELETE FROM application_hits WHERE day = $1;
DELETE FROM project_hits WHERE day = $1;
DELETE FROM client_hits WHERE day  = $1;
DELETE FROM user_hits WHERE day = $1;
SELECT TRUE;
$$;

On the test data this runs in 280 ms on my laptop.

One of the hard things regarding RULEs is remembering what they are and noting that the computer cannot, in fact, read your mind. This is why I would not consider them a beginner's tool.

Best Answer

Queries

Related Solutions

Postgresql – Rotate a table in PostgreSQL

Postgresql – how to chain postgres RULEs

Related Question