PostgreSQL – How to Select Redundant Rows

Suppose a table `data` records the values of some foo and bar over time:

time | foo | bar
-----+-----+----
1    | a   | a
2    | a   | a
3    | a   | a
4    | B   | a
5    | B   | a
6    | a   | a
7    | a   | a
8    | a   | X
...

We would like to select the duplicate rows in order to get rid of them. In this example, the duplicate rows are those with times 2, 3, 5 and 7, because the tuple (foo, bar) did not change from the previous time. Note that while row 6 looks like a duplicate of rows 1, 2 and 3, there is a change in between, so 6 is not a duplicate (but 7 is).

My simple solution is this:

SELECT *
FROM data AS candidate
WHERE EXISTS (
  SELECT *
  FROM data AS original
  WHERE ROW(candidate.foo, candidate.bar) IS NOT DISTINCT FROM ROW(original.foo, original.bar)
    AND original.time < candidate.time
    AND NOT EXISTS (
      SELECT *
      FROM data AS other
      WHERE ROW(candidate.foo, candidate.bar) IS DISTINCT FROM ROW(other.foo, other.bar)
        AND original.time < other.time
        AND other.time < candidate.time
    )
);

In plain language: we look for rows for which an earlier row with the same (foo, bar) exists, and no row with a different (foo, bar) lies between the two.

Is there some feature of PostgreSQL that could be used for this purpose?

Best Answer

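There are several ways to do this.

If `time` can be trusted to be sequential (consecutive integers without gaps), a simple left self-join works and should be very fast. Note that this query, like the other two in this answer, returns the rows to keep, that is, the first row of each run of identical (foo, bar) values; the redundant rows are the ones it leaves out: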

SELECT d.*  -- only the columns of the current row, not those of prev
FROM data d
LEFT JOIN data prev ON (prev.time = d.time - 1)
WHERE (d.foo, d.bar) IS DISTINCT FROM (prev.foo, prev.bar);
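
To list the redundant rows themselves, as the question asks, the comparison can be inverted. This is a minimal sketch under the same assumption that `time` is a gapless integer sequence. An inner join is used so that the first row, which has no predecessor, is never reported:

```sql
-- Sketch: rows whose (foo, bar) equals that of the immediately preceding row.
-- Assumes time is a gapless integer sequence, as in the query above.
SELECT d.*
FROM data d
JOIN data prev ON prev.time = d.time - 1
WHERE (d.foo, d.bar) IS NOT DISTINCT FROM (prev.foo, prev.bar);
```

With the sample data from the question this should return the rows with times 2, 3, 5 and 7.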

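Another option uses window functions (PostgreSQL 8.4 or later). The window needs an explicit ORDER BY, otherwise the row that lag() treats as "previous" is not guaranteed to be the chronologically previous one: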

SELECT time, foo, bar
FROM (
  -- ORDER BY time makes lag() look at the chronologically previous row
  SELECT *,
         lag(foo) OVER (ORDER BY time) AS prev_foo,
         lag(bar) OVER (ORDER BY time) AS prev_bar
  FROM data
  ) d
WHERE (foo, bar) IS DISTINCT FROM (prev_foo, prev_bar);
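
Since the stated goal is to get rid of the redundant rows, the window-function approach can also drive a DELETE. The following is a sketch rather than a tested migration: it assumes that `time` is unique in `data`, and `row_number()` is added only to exclude the first row, which has no predecessor.

```sql
-- Sketch: delete every row that repeats the (foo, bar) of the row before it.
-- Assumes time uniquely identifies a row in data.
DELETE FROM data
WHERE time IN (
  SELECT time
  FROM (
    SELECT time, foo, bar,
           lag(foo)     OVER w AS prev_foo,
           lag(bar)     OVER w AS prev_bar,
           row_number() OVER w AS rn
    FROM data
    WINDOW w AS (ORDER BY time)
  ) d
  WHERE rn > 1
    AND (foo, bar) IS NOT DISTINCT FROM (prev_foo, prev_bar)
);
```

Running it inside a transaction and checking the remaining rows before committing is a reasonable precaution.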

A third option is a recursive CTE (PostgreSQL 8.4 or later), which jumps from each kept row to the next row with a different (foo, bar):

WITH RECURSIVE d(time, foo, bar) AS (
  SELECT * FROM (SELECT * FROM data ORDER BY time LIMIT 1) a
  UNION ALL
  SELECT *
  FROM (
    SELECT data.*
    FROM data, d
    WHERE data.time > d.time AND (data.foo, data.bar) IS DISTINCT FROM (d.foo, d.bar)
    ORDER BY data.time
    LIMIT 1) a
)
SELECT * FROM d

Related Solutions

PostgreSQL – How to Ensure a Trigger is Fired After Variable Number of Inserts

You could route all inserts into those tables through stored procedures. That would let you run your check_percentage process at the end of the procedure, and it would also let you impose additional conditions on when it runs.

PostgreSQL Performance Optimization – How to Speed Up SELECT DISTINCT

You probably don't want to hear this, but the best option to speed up SELECT DISTINCT is to avoid DISTINCT to begin with. In many cases (not all!) it can be avoided with better database design or better queries.

Sometimes, GROUP BY is faster, because it takes a different code path.
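
The original query is not shown here, so the following only illustrates the two equivalent forms. The table `events` and the columns `project_id`, `"time"` and `user_id` are taken from this answer; the literal `project_id = 1` is a hypothetical placeholder, and the time bounds are borrowed from the nitpick at the end:

```sql
-- Form 1: DISTINCT
SELECT DISTINCT user_id
FROM events
WHERE project_id = 1                     -- hypothetical value
  AND "time" >= '2015-01-11 8:00:00'
  AND "time" <  '2015-02-10 8:00:00';

-- Form 2: GROUP BY, same result set
SELECT user_id
FROM events
WHERE project_id = 1                     -- hypothetical value
  AND "time" >= '2015-01-11 8:00:00'
  AND "time" <  '2015-02-10 8:00:00'
GROUP BY user_id;
```

Comparing both with EXPLAIN ANALYZE on real data is the only reliable way to tell whether the planner treats them differently.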

In your particular case, it doesn't seem like you can get rid of DISTINCT (well, see below). But you can support the query with a special index if you have many queries of that kind:

CREATE INDEX foo ON events (project_id, "time", user_id);

Adding user_id is only useful if you get index-only scans out of this. That would remove the expensive Bitmap Heap Scan from your query plan, which consumes 90% of the query time.

Your EXPLAIN shows 2,491 distinct users out of half a million qualifying rows. This won't become super-fast, no matter what you do, but it can be substantially faster. With around 200 rows per user, emulating an index skip scan on the above index might pay. The range condition on time complicates matters, and 200 rows per user is still a moderate number, so I am not sure.
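
For illustration, an index skip scan is commonly emulated with a recursive CTE that fetches one user_id per step. This is a sketch only: `project_id = 1` is a hypothetical placeholder, the time bounds are borrowed from the nitpick at the end, and rows with a NULL user_id are ignored. Whether it actually beats a plain DISTINCT with this index and this range condition is exactly the open question above and has to be measured.

```sql
-- Sketch: walk through the distinct user_id values one at a time.
WITH RECURSIVE u AS (
  (
    -- smallest qualifying user_id
    SELECT user_id
    FROM events
    WHERE project_id = 1                     -- hypothetical value
      AND "time" >= '2015-01-11 8:00:00'
      AND "time" <  '2015-02-10 8:00:00'
    ORDER BY user_id
    LIMIT 1
  )
  UNION ALL
  -- next larger qualifying user_id; yields NULL when none is left
  SELECT (
    SELECT e.user_id
    FROM events e
    WHERE e.project_id = 1                   -- hypothetical value
      AND e."time" >= '2015-01-11 8:00:00'
      AND e."time" <  '2015-02-10 8:00:00'
      AND e.user_id > u.user_id
    ORDER BY e.user_id
    LIMIT 1
  )
  FROM u
  WHERE u.user_id IS NOT NULL
)
SELECT user_id
FROM u
WHERE user_id IS NOT NULL;
```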

Either way, if time intervals in your queries are always the same, a MATERIALIZED VIEW folding user_id per (project_id, <fixed time interval>) would go a long way. No chance there with varying time intervals, though. Maybe you could at least fold users per hour or some other minimum time unit, and that would buy enough performance to warrant the considerable overhead. Can be combined with either query style.
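
Folding users per hour could look roughly like this. It is a sketch under assumptions: the view name `events_users_per_hour` and the column `hour_start` are made up for illustration, `"time"` is assumed to be a timestamp type, `project_id = 1` is a placeholder, and materialized views require PostgreSQL 9.3 or later.

```sql
-- Sketch: one row per (project, hour, user) instead of one row per event.
CREATE MATERIALIZED VIEW events_users_per_hour AS   -- hypothetical name
SELECT project_id,
       date_trunc('hour', "time") AS hour_start,
       user_id
FROM events
GROUP BY 1, 2, 3;

CREATE INDEX ON events_users_per_hour (project_id, hour_start, user_id);

-- Query the folded data. The bounds must fall on full hours,
-- otherwise the result differs from the query on the base table.
SELECT DISTINCT user_id
FROM events_users_per_hour
WHERE project_id = 1                        -- hypothetical value
  AND hour_start >= '2015-01-11 8:00:00'
  AND hour_start <  '2015-02-10 8:00:00';

-- The view is a snapshot; refresh it after new events arrive.
REFRESH MATERIALIZED VIEW events_users_per_hour;
```

The refresh is part of the overhead mentioned above, so this only pays off if the view is queried much more often than it has to be rebuilt.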

Nitpick:
Most probably, the predicates on "time" should really be:

AND "time" >= '2015-01-11 8:00:00'
AND "time" <  '2015-02-10 8:00:00';

Aside:
Don't use time as identifier. It's a reserved word in standard SQL and a basic type in Postgres.