Postgresql – Finding earliest connected value over two columns

I have a, perhaps slightly esoteric, use case.

Given a table like the following:

 id1 | id2 |      timestamp      
-----+-----+---------------------
   3 |   4 | 2016-03-22 09:52:15
   1 |   2 | 2016-03-22 12:05:28
   2 |   3 | 2016-03-22 11:55:56
   5 |   6 | 2016-03-22 11:58:23
   6 |   7 | 2016-03-22 09:51:54

I would like to "trace" each id in id1 to its oldest id2 where connection can span over id1 and id2. For instance in the above dataset I would like the resultset to contain:

1 |   4 | 2016-03-22 09:52:15
5 |   7 | 2016-03-22 09:51:54

The results should be ordered by the timestamp column.
And ideally no other occurrences of "1" and "5".

I am using Amazon Redshift, which imposes some limitations.

So far I have dabbled with using window functions combined with a self join:

SELECT t1.id1,
FIRST_VALUE(t2.id2) OVER (
   PARTITION BY t1.id1, (t1.id1=t2.id1 or t1.id2=t2.id2)
   ORDER BY t2.timestamp ASC
   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
FROM    test_decision t1
LEFT JOIN test_decision t2 ON t1.id1 = t2.id2

The general idea is to group based on logical expression (e.g. contains either id1 or id2), rather than partitioning only on rows with both. I've tried SELECT ... GROUP BY in a similar manner, but without getting the results I want.

I hope this use case makes sense to some of you.

In an effort to eliminate ambiguity: The rows are connected by having a value of id1 or id2 in either id1 or id2 columns of other rows. For instance, 1 leads to 4 through rows having (id1,id2) = (1,2),(2,3) and (3,4) in my example.

Regarding recursiveness: I think I can solve it by joining multiple times and coalescing the "outermost" value, my problem here is that the problem can be arbitrarily deep, e.g. I can solve this case by joining twice, but is there a way in SQL to get arbitrary recursion depth? And is that a good idea?

Best Answer

It's a recursive problem by nature. SQL introduced recursive CTEs for that purpose quite some time ago. Since Redshift does not seem to support this, your only remaining option to solve this within the database is a user defined function (UDF). It seems like the only server-side language supported by Redshift is plpythonu.

Here is an old answer with a PL/pgSQL function and an rCTE implementing the same solution side-by-side. But it's using the default procedural language of Postgres: PL/pgSQL:

Loop in function does not work as expected

No idea why Redshift chose not to support PL/pgSQL.

Best Answer

Related Solutions

Postgresql – Set a column’s default value to the concatenation of two other columns’ values

Mysql – Count where two or more columns in a row are over a certain value [basketball, double double, triple double]

Related Question