PostgreSQL Query – Select Rows Where at Least One Row Per Set Meets a Condition

postgresqlpostgresql-9.4

I have following table:

create table test (
  company_id integer not null, 
  client_id integer not null, 
  client_status text,
  unique (company_id, client_id)
);

insert into test values
  (1, 1, 'y'),    -- company1

  (2, 2, null),   -- company2

  (3, 3, 'n'),    -- company3

  (4, 4, 'y'),    -- company4
  (4, 5, 'n'),

  (5, 6, null),   -- company5
  (5, 7, 'n')
;

Basically, there are 5 different companies, each one has one or more clients and each client has status: 'y' or 'n' (might be null as well).

What I have to do is to select all pairs (company_id, client_id) for all companies for which there is at least one client whose status is not 'n' ('y' or null). So for the example data above, the output should be:

company_id;client_id
1;1
2;2
4;4
4;5
5;6
5;7

I tried something with window functions but I can't figure out how to compare the number of ALL clients with the number of clients with STATUS = 'n'.

select company_id,
count(*) over (partition by company_id) as all_clients_count
from test
-- where all_clients_count != ... ?

I figured out how to do this, but I am not sure if it's the right way:

select sub.company_id, unnest(sub.client_ids)
from (
  select company_id, array_agg(client_id) as client_ids
  from test
  group by company_id
  having count(*) != count( (case when client_status = 'n' then 1 else null end) )
) sub

Best Answer

Basically you are looking for the expression:

client_status IS DISTINCT FROM 'n'

The column client_status should really be data type boolean, not text, which would allow the simpler expression:

client_status IS NOT FALSE

The manual has details in the chapter Comparison Operators.


Assuming your actual table has a UNIQUE or PK constraint, we arrive at:

CREATE TABLE test (
  company_id    integer NOT NULL, 
  client_id     integer NOT NULL, 
  client_status boolean,
  PRIMARY KEY (company_id, client_id)
);

Queries

All of these do the same (what you asked), which is the fastest depends on data distribution:

SELECT company_id, client_id
FROM   test t
WHERE  EXISTS (
   SELECT FROM test
   WHERE  company_id = t.company_id
   AND    client_status IS NOT FALSE
   );

Or:

SELECT company_id, client_id
FROM   test t
JOIN  (
   SELECT company_id
   FROM   test t
   GROUP  BY 1
   HAVING bool_or(client_status IS NOT FALSE)
   ) c USING (company_id);

Or:

SELECT company_id, client_id
FROM   test t
JOIN  (
   SELECT DISTINCT company_id, client_status 
   FROM   test t
   ORDER  BY company_id, client_status DESC
   ) c USING (company_id)
WHERE  c.client_status IS NOT FALSE;

Boolean values sort FALSE -> TRUE -> NULL in ascending sort order. So FALSE comes last in descending order. If there's is any other value available, then that one's picked first ...

The added PK is implemented with a useful index for these queries. If you want faster, yet, add a partial index for query 1:

CREATE INDEX test_special_idx ON test (company_id, client_id)
WHERE  client_status IS NOT FALSE;

You could use window functions, too, but that would be slower. Example with first_value():

SELECT company_id, client_id
FROM  (
   SELECT company_id, client_id
        , first_value(client_status) OVER (PARTITION BY company_id
                                           ORDER BY client_status DESC) AS stat
   FROM   test t
   ) sub
WHERE  stat IS NOT FALSE;

For lots of rows per company_id, one of these techniques may be faster, still: