Suppose a table data
recording values of some foo
and bar
over time:
time | foo | bar
-----+-----+----
1 | a | a
2 | a | a
3 | a | a
4 | B | a
5 | B | a
6 | a | a
7 | a | a
8 | a | X
...
We would like to select the duplicate rows in order to get rid of them. Duplicate rows are, in this example, those with times 2, 3, 5 and 7, because the tuple (foo, bar)
did not change from the previous time. Also, while the row 6 seems to be duplicate of 1, 2 and 3, there is a change in between, thus 6 is not a duplicate (but 7 is).
My simple solution is this:
SELECT
*
FROM
data AS candidate
WHERE
EXISTS (
SELECT
*
FROM
data AS original
WHERE
ROW(candidate.foo, candidate.bar) IS NOT DISTINCT FROM ROW(original.foo, original.bar)
AND original.time < candidate.time
AND NOT EXISTS (
SELECT
*
FROM
data AS other
WHERE
ROW(candidate.foo, candidate.bar) IS DISTINCT FROM ROW(other.foo, other.bar)
AND original.time < other.time
AND other.time < candidate.time
)
)
In human language: looking for rows where we can find a row above it with the same (foo, bar)
, but where there is no other row with some other (foo, bar)
in between.
Is there some feature of PostgreSQL that could be used for this purpose?
Best Answer
There are many ways for doing this.
When measurements can be trusted to be sequential (
time
), then one can use a simple self left join, which should perform very fast:Another one is using window functions (PG version >= 8.4):
And another option is using a recursive CTE (PG version >= 8.4):