Postgresql – Finding and storing groups of rows PK

postgresql

My problem is for PostgreSQL 10, but it is probably also relevant for other database systems.

I have several tables in which I need to identify groups based on several criteria (some are PostGIS geospatial criteria, others are shared values in certain columns; the details don't matter here).

I am going to take a simple example. Let's say I have a first table with people:

CREATE TABLE employees
(
id serial PRIMARY KEY,
name varchar(255),
desk_number varchar(20)
)

and that I want to group all employees that share the same desk (field desk_number).
Illustration :

employees
--------------
id | name | desk_number
1  | Bill | 314
2  | Joe  | 200
3  | Bob  | 314
4  | Matt | 189
5  | Sam  | 314
6  | Anne | 150

In this example, what I want to have is the list of employees sharing desk number 314 : Bill, Bob and Sam (id: 1, 3, 5).

My current query for finding this is :

WITH findpairs AS
(   -- part 1 of the query : selecting pairs
    SELECT s1.id AS id1, s2.id AS id2
    FROM employees AS s1 
    JOIN employees AS s2
    ON s1.desk_number = s2.desk_number
    WHERE s1.id < s2.id
)
-- part2 of the query : filtering
SELECT s1.id1, s1.id2
FROM findpairs AS s1
LEFT JOIN findpairs AS s2
ON s1.id1 = s2.id2
WHERE s2.id2 IS NULL
;

Explanations :
In the first part of the query, I find all pairs of rows for employees that share the same desk.

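Without the WHERE clause, the result would also include every row matched with itself (1-1, 2-2, ...) and each matching pair in both orders (1-3 and 3-1, 1-5 and 5-1, 3-5 and 5-3).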

The WHERE clause "s1.id < s2.id" prevents returning matches for a row with itself and makes sure that for two rows matching each other, only one row will be returned.

With this WHERE clause, what the part 1 of the query returns is :

id1 | id2
 1  | 3
 1  | 5
 3  | 5

I only need two pairs of rows (1-3 and 1-5) to define the group; the row 3-5 is unnecessary. That is why the second part of the query also does some filtering: because of the previous "s1.id < s2.id" clause, I know the lowest id value of each group only ever appears in id1 and never in id2. By using a LEFT JOIN, I can identify the rows whose id1 never appears as an id2 in the set returned by part 1 and discard the other ones.

The final result returned is :

id1 | id2
 1  | 3
 1  | 5

And I can store this as such in a specific table :

CREATE TABLE matched_employees
(  id1 integer, id2 integer )

Question 1 : is there a less tedious way to do this ?
This takes a hell of a time to run on my server with my dataset.

Question 2 : instead of storing the result in a table with pairs ( matched_employees ), is there a better way to store this information (and easily access it later) ?

Best Answer

If you don't need real "pairs" you can get the list of employees sharing a desk using this:

select e1.*
from employees e1
where exists (select *
              from employees e2
              where e1.desk_number = e2.desk_number
                and e1.id <> e2.id);

If you want that as a single "list of IDs" you can use string_agg()

select string_agg(e1.id::text, ',') as id_list
from employees e1
where exists (select *
              from employees e2
              where e1.desk_number = e2.desk_number
                and e1.id <> e2.id);

Or use array_agg() to get those IDs as an array
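
A sketch of the array_agg() variant, mirroring the string_agg() query above. The second query is an assumption that goes beyond the original answer: grouping by desk_number returns one array per shared desk, which avoids mixing the IDs of several desks into a single list.

```sql
-- Same filter as the string_agg() query, but the IDs come back as an integer array.
select array_agg(e1.id order by e1.id) as id_list
from employees e1
where exists (select *
              from employees e2
              where e1.desk_number = e2.desk_number
                and e1.id <> e2.id);

-- Assumed variant: one row per desk that is shared by at least two employees.
select desk_number,
       array_agg(id order by id) as ids
from employees
where desk_number is not null      -- NULL desks would otherwise form one group
group by desk_number
having count(*) > 1;
```

With the sample data from the question, the first query returns {1,3,5} and the second returns a single row for desk 314 with the same array.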

Online example: http://rextester.com/AXVGA22292
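
If the pair layout from the question is still wanted, the following is a minimal sketch of one way to fill matched_employees without the self-join. It is an alternative technique (a min() window function), not the query from the question: every employee is paired with the lowest id found at the same desk.

```sql
-- Pair each employee with the lowest id sharing the same desk.
-- Assumes the employees and matched_employees tables from the question.
insert into matched_employees (id1, id2)
select first_id, id
from (
    select id,
           min(id) over (partition by desk_number) as first_id
    from employees
    where desk_number is not null   -- the original join never matches NULL desks
) as grouped
where id <> first_id;               -- drops the anchor row and single-occupant desks
```

On the sample data this inserts the pairs (1, 3) and (1, 5). Whether it is faster than the self-join on a large table depends on the data and is worth checking with EXPLAIN ANALYZE.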

Related Solutions

Postgresql – Filtering data that could have more than one category

If I understood you correctly, one way to achieve this is getting those rows where category = 1, then check for the non-existence of non-category1 rows:

SELECT document, term
FROM Annotation ann
WHERE 
    ann.category = 1
    AND NOT EXISTS (
        -- any row for the same document/term with a different category
        SELECT 1
        FROM Annotation a
        WHERE 
            a.document = ann.document
            AND a.term = ann.term
            AND a.category <> 1
    )
;

(Thanks to DavideChicco.it for showing that this was overly complicated).
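
A possible shorter formulation, sketched here with the bool_and() aggregate. The table definition and the rows are hypothetical, made up only to exercise the query; the real Annotation table may have other columns and types.

```sql
-- Hypothetical sample table and data (not from the original question).
CREATE TEMP TABLE Annotation (document integer, term text, category integer);
INSERT INTO Annotation (document, term, category) VALUES
    (1, 'alpha', 1),
    (1, 'beta',  1),
    (1, 'beta',  2),   -- 'beta' in document 1 also has category 2
    (2, 'alpha', 1);

-- Keep the (document, term) pairs whose non-NULL categories are all 1.
SELECT document, term
FROM Annotation
GROUP BY document, term
HAVING bool_and(category = 1);
```

For these rows the query keeps (1, 'alpha') and (2, 'alpha') and drops (1, 'beta'). Unlike the NOT EXISTS version, it returns each pair once even if the category 1 row is duplicated.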

Postgresql – Take advantage of monotonic columns in PostgreSQL

if in a row col1 is greater than or equal to col1 in another row, then the same relation is valid between the two corresponding col2 entries

In which case you can reformulate your query to look like:

SELECT * FROM table WHERE col2 >= val1 AND col2 <= val2;

because you can find the lower bound for col2 from the lower bound for col1, like this:

schema:

create schema stack;
set search_path=stack;
--
create table t(foo integer, bar integer);
insert into t(foo,bar) select 10*g, 20*g from generate_series(1,100000) g;
create index on t(foo);
create index on t(bar);

method:

explain analyse select min(foo) from t where foo>500; -- assuming val1=500

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                              QUERY PLAN                                                               │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Result  (cost=0.38..0.39 rows=1 width=0) (actual time=0.063..0.064 rows=1 loops=1)                                                    │
│   InitPlan 1 (returns $0)                                                                                                             │
│     ->  Limit  (cost=0.29..0.38 rows=1 width=4) (actual time=0.059..0.060 rows=1 loops=1)                                             │
│           ->  Index Only Scan using t_foo_idx on t  (cost=0.29..2803.63 rows=33167 width=4) (actual time=0.058..0.058 rows=1 loops=1) │
│                 Index Cond: ((foo IS NOT NULL) AND (foo > 500))                                                                       │
│                 Heap Fetches: 1                                                                                                       │
│ Total runtime: 0.087 ms                                                                                                               │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

select min(foo) from t where foo>500;

┌─────┐
│ min │
├─────┤
│ 510 │
└─────┘

select min(bar) from t where foo=510;

┌──────┐
│ min  │
├──────┤
│ 1020 │
└──────┘

explain analyse select * from t where bar>=1020 and bar<= 1100; -- assuming val2=1100

┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                     QUERY PLAN                                                      │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Bitmap Heap Scan on t  (cost=13.42..485.00 rows=500 width=8) (actual time=0.011..0.013 rows=5 loops=1)              │
│   Recheck Cond: ((bar >= 1020) AND (bar <= 1100))                                                                   │
│   ->  Bitmap Index Scan on t_bar_idx  (cost=0.00..13.29 rows=500 width=0) (actual time=0.008..0.008 rows=5 loops=1) │
│         Index Cond: ((bar >= 1020) AND (bar <= 1100))                                                               │
│ Total runtime: 0.030 ms                                                                                             │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

select * from t where bar>=1020 and bar<= 1100;

┌─────┬──────┐
│ foo │ bar  │
├─────┼──────┤
│ 510 │ 1020 │
│ 520 │ 1040 │
│ 530 │ 1060 │
│ 540 │ 1080 │
│ 550 │ 1100 │
└─────┴──────┘
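
The lookup steps above can also be folded into one statement. This is a sketch that assumes the table t and the indexes created above, val1 = 500 and val2 = 1100, and it relies on the monotonic relation between foo and bar holding for every row.

```sql
-- The subquery finds bar for the smallest foo above val1 (500);
-- by monotonicity that value is the lower bound for bar.
select *
from t
where bar >= (select bar
              from t
              where foo > 500
              order by foo
              limit 1)
  and bar <= 1100;
```

On the generated data this returns the same five rows as the two-step version (foo 510 to 550). If no row has foo above 500, the subquery yields NULL and the query returns no rows.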

clean up:

drop schema stack cascade;
