Postgresql – Finding and storing groups of rows PK

postgresql

My problem is for Postgresql 10 but it probably is also relevant for other DB systems.

I have several tables in which I need to identify groups based on several criteria (some of them are PostGIS-geospatial criteria, some other are shared values in certain columns, … that don't matter).

I am going to take a simple example. Let's say I have I have first table with people:

CREATE TABLE employees
(
id serial PRIMARY KEY,
name varchar(255),
desk_number varchar(20)
)

and that I want to group all employees that share the same desk (field desk_number).
Illustration :

employees
--------------
id | name | desk_number
1  | Bill | 314
2  | Joe  | 200
3  | Bob  | 314
4  | Matt | 189
5  | Sam  | 314
6  | Anne | 150

In this example, what I want to have is the list of employees sharing desk number 314 : Bill, Bob and Sam (id: 1, 3, 5).

My current query for finding this is :

WITH findpairs AS
(   -- part 1 of the query : selecting pairs
    SELECT s1.id AS id1, s2.id AS id2
    FROM employees AS s1 
    JOIN employees AS s2
    ON s1.desk_number = s2.desk_number
    WHERE s1.id < s2.id
)
-- part2 of the query : filtering
SELECT s1.id1
FROM findpairs AS s1
LEFT JOIN findpairs AS s2
ON s1.id1 = s2.id2
WHERE s2.id2 IS NULL
;

Explanations :
In the first part of the query, I find all pairs of rows that for employees that share the same desk.

Without the WHERE clause, I would obtain as a result:

id1 | id2
 1  | 3
 3  | 1
 1  | 5
 5  | 1
 1  | 1
 2  | 2
 3  | 3
 4  | 4
 5  | 5
 6  | 6

The WHERE clause "s1.id < s2.id" prevents returning matches for a row with itself and makes sure that for two rows matching each other, only one row will be returned.

With this WHERE clause, what the part 1 of the query returns is :

id1 | id2
 1  | 3
 1  | 5
 3  | 5

I only two pairs of rows (1-3 and 1-5) to define the group I need ; the row 3-5 is unnecessary. That is why the second part of the query also does some filtering : because of the previous "s1.id < s2.id" clause, I know the lowest of the id values is only present in s1.id and never in s2.id. By using a LEFT JOIN, I can identify these rows from the set returned by part 1 and discard the other ones.

The final result returned is :

id1 | id2
 1  | 3
 1  | 5

And I can store this as such in a specific table :

CREATE TABLE matched_employees
(  id1 integer, id2 integer )

Question 1 : is there a less tedious way to do this ?
This takes a hell of a time to run on my server with my dataset.

Question 2 : instead to store the result in a table with pairs ( matched_employees ). Is there a better way to store this information (and easily access it later) ?

Best Answer

If you don't need real "pairs" you can get the list of employees sharing a desk using this:

select e1.*
from employees e1
where exists (select *
              from employees e2
              where e1.desk_number = e2.desk_number
                and e1.id <> e2.id);

If you want that as a single "list of IDs" you can use string_agg()

select string_agg(e1.id::text, ',') as id_list
from employees e1
where exists (select *
              from employees e2
              where e1.desk_number = e2.desk_number
                and e1.id <> e2.id);

Or use array_agg() to get those IDs as an array

Online example: http://rextester.com/AXVGA22292