Postgresql – Select rows with repeat data within certain timeframe

join;postgresqlselect

As a follow-up question to my previous question on difficult row categorisation, I have another problem which I'm sure will be easily solvable. After this I plan on reading a good book or two on databases and SQL but for the time being I need to overcome this brick wall.

The table in question has the following layout:

TaskID             -  -  -  1. On-site
DateStart         |         2. Return-to-base
DateEnd           |         3. Chargeable
Type -  -  -  -  -          4. Complaint
Reference                   5. Rebook
Requester
Approver
Cost
...

I want to select all "repeat" TaskID. TaskID is a unique ID but a "repeat" is classed as any two tasks that share the same Reference and where the DateEnd of each task is within a 30 day period, so, I'm thinking something along the lines of:

SELECT BOTH TaskID, Reference
FROM   Tasks
WHERE  Row1.Reference = Row2.Reference AND ABS(Row1.DateEnd - Row2.DateEnd) <= 30;

Obviously the above isn't valid but hopefully it illustrates how I want to select the data. I'm guessing it will involve a self-join but I have no clue where to start… having the query return both TaskIDs with the reference is ideal but if it only returns one TaskID then that's no problem.

I'm version of PostgreSQL that I'm using is 9.1.3, if that helps.

Thanks for your help.

Best Answer

SELECT *
FROM   task t1
WHERE  EXISTS (
   SELECT 1
   FROM   task t2
   WHERE  t2.reference = t1.reference  -- same reference
   AND    t2.task_id  <> t1.task_id    -- different task_id (exclude self-join)
   AND    t2.end_date BETWEEN t1.end_date - 30
                      AND     t1.end_date + 30 
   )

An EXISTS semi-join should be the fastest way to arrive at DISTINCT rows, because you don't get duplicates to begin with. Also, in the event of multiple matching rows (many matching tasks withing 30 days) evaluation can stop after the first match is found. I would expect this query to beat anything you have so far. Test with EXPLAIN ANALYZE.

In a WHERE clause or JOIN condition don't use an expression like:

abs(t1.end_date - t2.end_date) < 30

If you can avoid it. As the left hand expression is derived from values of two tables, the only way for the query planner is to form a limited CROSS JOIN (after applying other conditions) and compute a value for every possible combination. This is a well known anti-pattern for good performance!

As long as there are only few rows per reference, it won't matter much. But the cost grows with O(N²) with more rows per reference. I rewrote the condition to:

t2.end_date BETWEEN t1.end_date - 30 AND t1.end_date + 30

In my tests, the first form went from 5x slower to 500x slower quickly when I narrowed down the id-space for reference (-> more matching rows).

The second form can also more easily use indexes. If you want to select a small sample of the table, with conditions like:

reference > 450
date_end > now()::date

And an index like:

CREATE INDEX test_idx ON task (reference, date_end);
CREATE INDEX test_idx2 ON task (date_end);

The performance gap becomes even more overwhelming.

Related Solutions

Postgresql – Difficult categorising of rows

Ok, so my answer here is conditioned on the following assumptions:

This categorization needs to be relatively stable and
It may be re-used in various ways, and
It needs to perform well.

The obvious, simple answer is to use table methods in PostgreSQL. What you do is create a SQL language function which does not hit the table and returns the value you want. Make it immutable so you can index the output if you want to query against it, etc. Note the specific rules aren't really clear from your post but this should get you started.

CREATE OR REPLACE FUNCTION subcategory(task) -- task is table name
RETURNS text LANGUAGE sql IMMUTABLE AS
$$
    SELECT CASE WHEN $1.type = 3 THEN
                CASE WHEN $I.reference LIKE 'IX#%' 
                     THEN 'Company IX Chargeable'
                     ELSE 'Other Chargeable'
                END
                WHEN $1.reference LIKE 'IX#%' THEN 
                CASE WHEN $1.type = 5 THEN 'Company IX Rebool'
                     ELSE 'Company IX Other'
                END
                WHEN $1.reference LIKE 'THR#%' AND $1.type IN (1, 2) THEN
                CASE WHEN $1.type = 2 THEN 'THR Return to Base'
                     WHEN $1.type = 1 THEN 'THR On Site'
                END
                WHEN $1.type = 2 THEN 'Return to Base'
                ELSE 'Other Nonchargeable'
           END;
$$;

You can then query this using:

SELECT t.subcategory FROM task t;

Note that the table name is non-optional here. The parser converts this to:

SELECT subcategory(t) FROM task t;

You can use this however in any part of the select statement including the where clause. If it proves slow, you can add cost estimates (but I think this should be fast), and you can even index the output using PostgreSQL's functional indexes.

On the performance side, LIKE and substring() don't seem to be significantly different:

postgres=# select count(*) from generate_series(1, 10000000);
  count   
----------
 10000000
(1 row)

Time: 9007.618 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE s::text like '1%';
  count  
---------
 1111112
(1 row)

Time: 13653.000 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE substring(s::text from 1 for 1) = '1';
  count  
---------
 1111112
(1 row)

Time: 16681.860 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE s::text like '1%';
  count  
---------
 1111112
(1 row)

Time: 17163.470 ms
postgres=# select count(*) from generate_series(1, 10000000) s WHERE substring(s::text from 1 for 1) = '1';
  count  
---------
 1111112
(1 row)

Time: 17052.004 ms

So sometimes one is faster than the other, but they don't seem out of line with eachother.

Mysql – How to select items from a table where a single column must contain two (or more) values

This is the relational division problem and there is a question about it at SO, with a lot of ways to write this query, plus performance analysis for PostgreSQL: How to filter SQL results in a has-many-through relation

Shamelessly copying code form there and removing/changing code for answers that have features lacking from MySQL, like CTEs, EXCEPT, INTERSECT, etc, here are a few ways to do this.

Assumptions:

the table is called factors
there is a UNIQUE constraint on (wordid, docid)
there is a documents and a words table:

Easy to write, medium efficiency:

-- Query 1 -- by Martin
SELECT d.docid, d.docname
FROM   document d
JOIN   factors f USING (docid)
WHERE  f.wordid IN (2, 4, 5)
GROUP  BY d.docid
HAVING COUNT(*) = 3 ;           -- number of words

Easy to write, medium efficiency:

-- Query 2 -- by Erwin
SELECT d.docid, d.docname
FROM   documents d
JOIN   (
   SELECT docid
   FROM   factors
   WHERE  wordid IN (2, 4, 5)
   GROUP  BY docid
   HAVING COUNT(*) = 3
   ) f USING (docid) ;

More complex to write, very good efficiency in Postgres - probably lousy in MySQL:

-- Query 4 -- by Derek
SELECT d.docid, d.docname
FROM   documents d
WHERE  d.docid IN (SELECT docid FROM factors WHERE wordid = 2)
AND    d.docid IN (SELECT docid FROM factors WHERE wordid = 4);
AND    d.docid IN (SELECT docid FROM factors WHERE wordid = 5);

More complex to write, very good efficiency in Postgres - and probably the same in MySQL:

-- Query 5 -- by Erwin
SELECT d.docid, d.docname
FROM   documents d
WHERE  EXISTS (SELECT * FROM factors 
               WHERE  docid = d.docid AND wordid = 2)
AND    EXISTS (SELECT * FROM factors 
               WHERE  docid = d.docid AND wordid = 4)
AND    EXISTS (SELECT * FROM factors 
               WHERE  docid = d.docid AND wordid = 5) ;

More complex to write, very good efficiency in Postgres - and probably the same in MySQL:

-- Query 6 -- by Sean
SELECT d.docid, d.docname
FROM   documents d
JOIN   factors x ON d.docid = x.docid
JOIN   factors y ON d.docid = y.docid
JOIN   factors z ON d.docid = z.docid
WHERE  x.wordid = 2
AND    y.wordid = 4
AND    z.wordid = 5 ;

Easy to write and extend to an arbitrary set of words but not as efficient as the JOIN and EXISTS solutions:

-- Query 7 -- by ypercube
SELECT d.docid, d.docname
FROM   documents d
WHERE  NOT EXISTS (
   SELECT *
   FROM   words AS w 
   WHERE  w.wordid IN (2, 4, 5)
   AND    NOT EXISTS (
      SELECT *
      FROM   factors AS f 
      WHERE  f.docid = d.docid 
      AND    f.wordid = w.wordid 
      )
   );

Easy to write, not good efficiency:

-- Query 8 -- by ypercube
SELECT d.docid, d.docname
FROM   documents d
WHERE  NOT EXISTS (
   SELECT *
   FROM  (
      SELECT 2 AS wordid UNION  ALL
      SELECT 4 UNION ALL
      SELECT 5
      ) AS w
   WHERE NOT EXISTS (
      SELECT *
      FROM   factors AS f 
      WHERE  f.docid = d.docid 
      AND    f.wordid = w.wordid 
      )
   );

Enjoy testing them :)

Best Answer

Related Solutions

Postgresql – Difficult categorising of rows

Mysql – How to select items from a table where a single column must contain two (or more) values

Related Question