PostgreSQL – Combining Two Event Tables into a Single Timeline

join;postgresqlwindow functions

Given two tables:

CREATE TABLE foo (ts timestamp, foo text);
CREATE TABLE bar (ts timestamp, bar text);

I wish to write a query that returns values for ts, foo, and bar that represents a unified view of the most recent values. In other words, if foo contained:

ts | foo
--------
1  | A
7  | B

and bar contained:

ts | bar
--------
3  | C
5  | D
9  | E

I want a query that returns:

ts | foo | bar
--------------
1  | A   | null
3  | A   | C
5  | A   | D
7  | B   | D
9  | B   | E

If both tables have an event at the same time, the order does not matter.

I have been able to create the structure needed using union all and dummy values:

SELECT ts, foo, null as bar FROM foo
UNION ALL SELECT ts, null as foo, bar FROM bar

which will give me a linear timeline of new values, but I'm not quite able to work out how to populate the null values based on the previous rows. I've tried the lag window function, but AFAICT it will only look at the previous row, not recursively backward. I've looked at recursive CTEs, but I'm not quite sure how to set up the start and termination conditions.

Best Answer

Use a FULL [OUTER] JOIN, combined with two rounds of window functions:

SELECT ts
     , min(foo) OVER (PARTITION BY foo_grp) AS foo
     , min(bar) OVER (PARTITION BY bar_grp) AS bar
FROM (
   SELECT ts, f.foo, b.bar
        , count(f.foo) OVER (ORDER BY ts) AS foo_grp
        , count(b.bar) OVER (ORDER BY ts) AS bar_grp
   FROM   foo f
   FULL   JOIN bar b USING (ts)
   ) sub;

Since count() does not count NULL values it conveniently only increases with every non-null value, thereby forming groups that will share the same value. In the outer SELECT, min() (or max()) likewise ignores NULL values, thereby picking the one non-null value per group. Voilá.

Related FULL JOIN case:

It's one of those cases where a procedural solution might just be faster, since it can get the job done in a single scan. Like this plpgsql function:

CREATE OR REPLACE FUNCTION f_merge_foobar()
  RETURNS TABLE(ts int, foo text, bar text)
  LANGUAGE plpgsql AS
$func$
#variable_conflict use_column
DECLARE
   last_foo text;
   last_bar text;
BEGIN
   FOR ts, foo, bar IN
      SELECT ts, f.foo, b.bar
      FROM   foo f
      FULL   JOIN bar b USING (ts)
      ORDER  BY 1
   LOOP
      IF foo IS NULL THEN foo := last_foo;
      ELSE                last_foo := foo;
      END IF;

      IF bar IS NULL THEN bar := last_bar;
      ELSE                last_bar := bar;
      END IF;

      RETURN NEXT;
   END LOOP;
END
$func$;

Call:

SELECT * FROM f_merge_foobar();

db<>fiddle here, demonstrating both.
Old sqlfiddle.

Related answer explaining the #variable_conflict use_column: