Postgresql – Select rows with repeat data within certain timeframe

join;postgresqlselect

As a follow-up question to my previous question on difficult row categorisation, I have another problem which I'm sure will be easily solvable. After this I plan on reading a good book or two on databases and SQL but for the time being I need to overcome this brick wall.

The table in question has the following layout:

TaskID             -  -  -  1. On-site
DateStart         |         2. Return-to-base
DateEnd           |         3. Chargeable
Type -  -  -  -  -          4. Complaint
Reference                   5. Rebook
Requester
Approver
Cost
...

I want to select all "repeat" TaskID. TaskID is a unique ID but a "repeat" is classed as any two tasks that share the same Reference and where the DateEnd of each task is within a 30 day period, so, I'm thinking something along the lines of:

SELECT BOTH TaskID, Reference
FROM   Tasks
WHERE  Row1.Reference = Row2.Reference AND ABS(Row1.DateEnd - Row2.DateEnd) <= 30;

Obviously the above isn't valid but hopefully it illustrates how I want to select the data. I'm guessing it will involve a self-join but I have no clue where to start… having the query return both TaskIDs with the reference is ideal but if it only returns one TaskID then that's no problem.

I'm version of PostgreSQL that I'm using is 9.1.3, if that helps.

Thanks for your help.

Best Answer

SELECT *
FROM   task t1
WHERE  EXISTS (
   SELECT 1
   FROM   task t2
   WHERE  t2.reference = t1.reference  -- same reference
   AND    t2.task_id  <> t1.task_id    -- different task_id (exclude self-join)
   AND    t2.end_date BETWEEN t1.end_date - 30
                      AND     t1.end_date + 30 
   )

An EXISTS semi-join should be the fastest way to arrive at DISTINCT rows, because you don't get duplicates to begin with. Also, in the event of multiple matching rows (many matching tasks withing 30 days) evaluation can stop after the first match is found. I would expect this query to beat anything you have so far. Test with EXPLAIN ANALYZE.

In a WHERE clause or JOIN condition don't use an expression like:

abs(t1.end_date - t2.end_date) < 30

If you can avoid it. As the left hand expression is derived from values of two tables, the only way for the query planner is to form a limited CROSS JOIN (after applying other conditions) and compute a value for every possible combination. This is a well known anti-pattern for good performance!

As long as there are only few rows per reference, it won't matter much. But the cost grows with O(N²) with more rows per reference. I rewrote the condition to:

t2.end_date BETWEEN t1.end_date - 30 AND t1.end_date + 30

In my tests, the first form went from 5x slower to 500x slower quickly when I narrowed down the id-space for reference (-> more matching rows).

The second form can also more easily use indexes. If you want to select a small sample of the table, with conditions like:

reference > 450
date_end > now()::date

And an index like:

CREATE INDEX test_idx ON task (reference, date_end);
CREATE INDEX test_idx2 ON task (date_end);

The performance gap becomes even more overwhelming.