PostgreSQL – How to Select Redundant Rows

postgresql

Suppose a table data recording values of some foo and bar over time:

time | foo | bar
-----+-----+----
1    | a   | a
2    | a   | a
3    | a   | a
4    | B   | a
5    | B   | a
6    | a   | a
7    | a   | a
8    | a   | X
...

We would like to select the duplicate rows in order to get rid of them. Duplicate rows are, in this example, those with times 2, 3, 5 and 7, because the tuple (foo, bar) did not change from the previous time. Also, while the row 6 seems to be duplicate of 1, 2 and 3, there is a change in between, thus 6 is not a duplicate (but 7 is).

My simple solution is this:

SELECT 
  *
FROM 
  data AS candidate
WHERE 
  EXISTS (
  SELECT 
    * 
  FROM 
    data AS original
  WHERE 
    ROW(candidate.foo, candidate.bar) IS NOT DISTINCT FROM ROW(original.foo, original.bar)
    AND original.time < candidate.time
    AND NOT EXISTS (
      SELECT
        *
      FROM
        data AS other
      WHERE
        ROW(candidate.foo, candidate.bar) IS DISTINCT FROM ROW(other.foo, other.bar)
        AND original.time < other.time
        AND other.time < candidate.time
    )
  )

In human language: looking for rows where we can find a row above it with the same (foo, bar), but where there is no other row with some other (foo, bar) in between.

Is there some feature of PostgreSQL that could be used for this purpose?

Best Answer

There are many ways for doing this.

When measurements can be trusted to be sequential (time), then one can use a simple self left join, which should perform very fast:

SELECT *
FROM data d
LEFT JOIN data prev ON (prev.time = d.time - 1)
WHERE (d.foo, d.bar) IS DISTINCT FROM (prev.foo, prev.bar)

Another one is using window functions (PG version >= 8.4):

SELECT time, foo, bar
FROM (
  SELECT *, lag(foo) over () AS prev_foo, lag(bar) over () AS prev_bar
  FROM data
  ) d
WHERE (foo, bar) IS DISTINCT FROM (prev_foo, prev_bar)

And another option is using a recursive CTE (PG version >= 8.4):

WITH RECURSIVE d(time, foo, bar) AS (
  SELECT * FROM (SELECT * FROM data ORDER BY time LIMIT 1) a
  UNION ALL
  SELECT *
  FROM (
    SELECT data.*
    FROM data, d
    WHERE data.time > d.time AND (data.foo, data.bar) IS DISTINCT FROM (d.foo, d.bar)
    ORDER BY data.time
    LIMIT 1) a
)
SELECT * FROM d