PostgreSQL – Combining Two Event Tables into a Single Timeline

join; postgresql; window functions

Given two tables:

CREATE TABLE foo (ts timestamp, foo text);
CREATE TABLE bar (ts timestamp, bar text);

I wish to write a query that returns values for ts, foo, and bar that represent a unified view of the most recent values. In other words, if foo contained:

ts | foo
--------
1  | A
7  | B

and bar contained:

ts | bar
--------
3  | C
5  | D
9  | E

I want a query that returns:

ts | foo | bar
--------------
1  | A   | null
3  | A   | C
5  | A   | D
7  | B   | D
9  | B   | E

If both tables have an event at the same time, the order does not matter.

I have been able to create the structure needed using union all and dummy values:

SELECT ts, foo, null as bar FROM foo
UNION ALL SELECT ts, null as foo, bar FROM bar

which will give me a linear timeline of new values, but I'm not quite able to work out how to populate the null values based on the previous rows. I've tried the lag window function, but AFAICT it will only look at the previous row, not recursively backward. I've looked at recursive CTEs, but I'm not quite sure how to set up the start and termination conditions.

Best Answer

Use a FULL [OUTER] JOIN, combined with two rounds of window functions:

SELECT ts
     , min(foo) OVER (PARTITION BY foo_grp) AS foo
     , min(bar) OVER (PARTITION BY bar_grp) AS bar
FROM (
   SELECT ts, f.foo, b.bar
        , count(f.foo) OVER (ORDER BY ts) AS foo_grp
        , count(b.bar) OVER (ORDER BY ts) AS bar_grp
   FROM   foo f
   FULL   JOIN bar b USING (ts)
   ) sub;

Since count() does not count NULL values, the running count only increases with each non-null value, thereby forming groups of rows that share the same value. In the outer SELECT, min() (or max()) likewise ignores NULL values and so picks the single non-null value per group. Voilà.
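To make the grouping trick concrete, here is a small Python model of the two window passes, run on the sample data from the question. It is only a sketch of the logic, not how PostgreSQL executes the query. The function name fill_forward and the list layout are assumptions made for this illustration, and it assumes each ts appears only once after the FULL JOIN.

```python
from itertools import accumulate

def fill_forward(values):
    """Model count(x) OVER (ORDER BY ts), then min(x) OVER (PARTITION BY grp).

    `values` is one column of the FULL JOIN result, already sorted by ts,
    with None standing in for NULL. (Hypothetical helper, for illustration.)
    """
    # Pass 1: running count of non-NULL values.
    # The counter only moves when a real value appears.
    groups = list(accumulate(0 if v is None else 1 for v in values))

    # Pass 2: every group contains at most one non-NULL value.
    # min() ignores NULLs in SQL, so it would return exactly that value
    # (or NULL for group 0, the rows before the first value).
    value_of_group = {}
    for grp, v in zip(groups, values):
        if v is not None:
            value_of_group[grp] = v
    return groups, [value_of_group.get(grp) for grp in groups]

# Columns of the FULL JOIN result for the sample data, ordered by ts.
ts  = [1, 3, 5, 7, 9]
foo = ["A", None, None, "B", None]
bar = [None, "C", "D", None, "E"]

foo_grp, foo_filled = fill_forward(foo)   # foo_grp == [1, 1, 1, 2, 2]
bar_grp, bar_filled = fill_forward(bar)   # bar_grp == [0, 1, 2, 2, 3]

for row in zip(ts, foo_filled, bar_filled):
    print(row)
```

The group numbers are what foo_grp and bar_grp hold in the subquery; the filled lists correspond to the foo and bar columns of the final result.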

Related FULL JOIN case:

Add up conditional counts on multiple columns of the same table

It's one of those cases where a procedural solution might just be faster, since it can get the job done in a single scan. Like this plpgsql function:

CREATE OR REPLACE FUNCTION f_merge_foobar()
  RETURNS TABLE(ts timestamp, foo text, bar text)  -- ts type must match the ts column of foo and bar
  LANGUAGE plpgsql AS
$func$
#variable_conflict use_column
DECLARE
   last_foo text;
   last_bar text;
BEGIN
   FOR ts, foo, bar IN
      SELECT ts, f.foo, b.bar
      FROM   foo f
      FULL   JOIN bar b USING (ts)
      ORDER  BY 1
   LOOP
      IF foo IS NULL THEN foo := last_foo;
      ELSE                last_foo := foo;
      END IF;

      IF bar IS NULL THEN bar := last_bar;
      ELSE                last_bar := bar;
      END IF;

      RETURN NEXT;
   END LOOP;
END
$func$;

Call:

SELECT * FROM f_merge_foobar();

db<>fiddle here, demonstrating both. (Old sqlfiddle.)

Related answer explaining the #variable_conflict use_column:

Naming conflict between function parameter and result of JOIN with USING clause

Related Solutions

MySQL and PostgreSQL – Resolving Invalid Byte Sequence for UTF8 Encoding

One or more of those character/text fields MAY have 0x00 for its content.

Try the following:

SELECT * FROM rt3 where some_text_field = 0x00 LIMIT 1;

If this returns a row, update those character/text fields with:

UPDATE rt3 SET some_text_field = '' WHERE some_text_field = 0x00;

Afterwards, run mysqldump ... again (and retry the PostgreSQL import).

MySQL – Limit WHERE to MAX() & COUNT()

Here is your original query from the question

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM TAB_EVENT e 
LEFT JOIN TAB_MESSAGE m ON e.id=m.event_id 
WHERE ( m.`read` IS NULL OR m.`read` = 0) 
GROUP BY e.id 
ORDER BY m.datetime DESC, e.id ASC 
LIMIT 10;

Maybe try refactoring the query so that it executes in this sequence:

1. collect only the necessary columns from TAB_MESSAGE
2. apply LIMIT 10 against the collected rows from TAB_MESSAGE
3. run the JOIN
4. apply MAX() and COUNT() last

Here is what I am proposing

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM
(
    SELECT * FROM
    (SELECT id,event_id,datetime FROM TAB_MESSAGE
    WHERE `read` IS NULL OR `read` = 0
    ORDER BY datetime DESC) mm
    LIMIT 10
) m
LEFT JOIN TAB_EVENT e 
ON e.id=m.event_id
GROUP BY e.id
ORDER BY unread_last DESC, e.id ASC;

Give it a try!

UPDATE 2012-02-21 17:06 EDT

SELECT e.*, MAX(m.datetime) AS unread_last, COUNT(m.id) AS unread 
FROM
TAB_EVENT e LEFT JOIN
(
    SELECT * FROM
    (SELECT id,event_id,datetime FROM TAB_MESSAGE
    WHERE `read` IS NULL OR `read` = 0
    ORDER BY datetime DESC) mm
    LIMIT 10
) m
ON e.id=m.event_id
GROUP BY e.id
ORDER BY unread_last DESC, e.id ASC;

@Sebastian, I put the query back in the original join order. Please try this as well!

UPDATE 2012-02-21 17:11 EDT

Make sure the datetime field is indexed:

ALTER TABLE TAB_MESSAGE ADD INDEX read_datetime_ndx (`read`,datetime);