Get qualifying rows only
One way ..
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
, y AS (
SELECT *
FROM x
WHERE x.rn = 1
AND date_out >= '2012-10-01'
AND date_out < '2012-11-01'
)
, z AS (
SELECT x.*
FROM x
JOIN y USING (reference)
WHERE x.rn = 2
AND x.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
JOIN z USING (reference)
UNION ALL
SELECT taskid, reference, centreid, date_out
FROM z
ORDER BY reference, date_out;
Another way:
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT x.*, y.taskid AS taskid2, y.centreid AS centreid2, y.date_out AS date_out2
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT y.taskid, y.reference, y.centreid, y.date_out
FROM y
UNION ALL
SELECT y.taskid2, y.reference, y.centreid2, y.date_out2
FROM y
ORDER BY reference, date_out;
I'd expect the second one to be faster. Depends on your data distribution. Test with EXPLAIN ANALYZE
.
Get all rows for qualifying references
WITH x AS (
SELECT *
,row_number() OVER (PARTITION BY reference ORDER BY date_out DESC) AS rn
FROM tbl
)
,y AS (
SELECT reference
FROM x
JOIN x y USING (reference)
WHERE x.rn = 1
AND x.date_out >= '2012-10-01'
AND x.date_out < '2012-11-01'
AND y.rn = 2
AND y.centreid = 1
)
SELECT *
FROM tbl
JOIN y USING (reference)
ORDER BY reference, date_out;
-> sqlfiddle
Answer to follow-up in comment
Separate groups if more than 30 days between entries.
WITH a AS (
SELECT *
,lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) AS last_date
,CASE WHEN date_out >
(lag(date_out) OVER (PARTITION BY reference ORDER BY date_out DESC) - 30)
THEN 0 ELSE 1
END AS step
FROM tbl
)
,b AS (
SELECT *
,sum(step) OVER (PARTITION BY reference ORDER BY date_out DESC) AS grp
FROM a
)
,c AS (
SELECT *
,row_number() OVER (PARTITION BY reference, grp ORDER BY date_out DESC) AS rn
FROM b
)
,d AS (
SELECT reference, grp
FROM c
JOIN c d USING (reference, grp)
WHERE c.rn = 1
AND c.date_out >= '2012-10-01'
AND c.date_out < '2012-11-01'
AND d.rn = 2
AND d.centreid = 1
)
SELECT b.taskid, b.reference, b.centreid, b.date_out
FROM b
JOIN d USING (reference, grp)
ORDER BY reference, date_out
-> sqlfiddle
But while pure SQL is a beaut .. I would solve this procedurally in a plpgsql function. Very similar to this recent answer on SO. Would be faster, because it can be done in a single table scan.
I derive from your example that you only want to match tables with the same structure (compatible row types).
Base query
First of all, your base query is needlessly complex and possibly incorrect. Consider instead:
(TABLE a EXCEPT ALL TABLE b) -- ALL keeps duplicate rows and is faster
UNION ALL
(TABLE b EXCEPT ALL TABLE a);
I doubt you want to fold duplicates. Typically, tables do not hold complete duplicates to begin with, and omitting the ALL
key word would instruct Postgres to try and fold duplicates in every step, which is just a waste of time. Details:
Even if completely duplicate rows are possible, it might be best to return all of them to avoid misleading answers. If you want to fold duplicates in the result anyway, a single UNION
does the job:
(TABLE a EXCEPT ALL TABLE b)
UNION
(TABLE b EXCEPT ALL TABLE a);
However, a query with NATURAL FULL OUTER JOIN
is more efficient, doing almost the same:
SELECT *
FROM a NATURAL FULL OUTER JOIN b
WHERE a IS NULL OR b IS NULL;
There are two subtle, exotic corner cases:
- This returns rows that are all NULL to begin with. It would be an odd table design that allows such rows, though.
- This does not return duplicate rows that are not matched with the same number of duplicate rows in the other table: a single match is enough to eliminate all - which is subtly different from folding duplicates in the result! Have a look at the fiddle below. Again, it would be an odd table design that allows completely duplicate rows.
Function
What you have so far does not work for several reasons.
To return actual table rows dynamically (and not just a count or a text representation), you need to use a polymorphic type.
Since the second table is bound to have a compatible row type (as per my assumption) it's enough to hand in just the table name for that.
CREATE OR REPLACE FUNCTION f_tbl_diff(_tbl1_type ANYELEMENT, _tbl2 text)
RETURNS SETOF ANYELEMENT AS
$func$
BEGIN
RETURN QUERY EXECUTE format('
SELECT *
FROM %1$s NATURAL FULL OUTER JOIN %2$I
WHERE %1$s IS NULL OR %2$I IS NULL'
, pg_typeof(_tbl1_type), _tbl2);
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_tbl_diff(NULL::a, 'b');
Note the special syntax for the first argument! We hand in an actual row defining the return type, not just a table name. Related answer on SO with ample details (scroll to the last chapter):
This returns all rows that have no exact, complete match in the other table
SQL Fiddle also demonstrating exotic corner cases.
About the dynamic SQL:
Best Answer
You can use the solution with the window function, you just need to filter out the "other" row. This can e.g. be done using Postgres'
distinct on ()
This will always pick exactly one row per name (the one with the latest
timestamp_received
), even if there are more then two rows per name.