Postgresql – query for each row from another arbitrary query, and union results

postgresql

I'm trying to build a query that takes the results of another query, runs a second query based on those results, and unions the two so that there are no duplicates, all on the same table. I've seen some similar posts, but in my case the "input" query could be almost anything and the number of columns may change.

Here's a sample dataset:

  num  |  sub  |                      valid_time                      | col1 | col2  |  ...
-------+-------+------------------------------------------------------+------+-------+------
   1   |   1   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   X  |  ...  |  ...
   1   |   1   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   Y  |  ...  |  ...
   1   |   2   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   Z  |  ...  |  ...
   2   |   3   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   X  |  ...  |  ...
   2   |   4   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   Y  |  ...  |  ...
   2   |   3   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   Z  |  ...  |  ...

Based on some of the examples I was seeing with JOINs, a subset of columns were typically picked out. I'm not sure how to return all columns with a JOIN. The number of columns may change over time, and all the columns need to be returned. I would rather not have to update this query if a new column is added.

This does not work (I think because I need to JOIN somehow), but it shows where my head is at.

with cte as (
    select *
    from table
    where col1 = 'X' and 
    now() <@ valid_time
)
select * from cte
union
select * from table
    where table.num = cte.num and 
    table.sub = cte.sub and 
    now() <@ valid_time;

I mentioned earlier that the "input" query could be almost anything. I plan to templatize it, similar to:

with cte as (%(query)s)
select * from cte
union
select * from table
    where table.num = cte.num and 
    table.sub = cte.sub and 
    %(query_time_part)s;

Finally, based on the example query, I would expect a result like this:

  num  |  sub  |                      valid_time                      | col1 | col2  |  ...
-------+-------+------------------------------------------------------+------+-------+------
   1   |   1   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   X  |  ...  |  ...
   1   |   1   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   Y  |  ...  |  ...
   2   |   3   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   X  |  ...  |  ...
   2   |   3   | ["2018-12-31 01:00:00-00","2019-12-31 01:00:00-00")  |   Z  |  ...  |  ...

Best Answer

How about:

with cte as (%(query)s)
-- DISTINCT (PostgreSQL has no SELECT UNIQUE) removes the duplicate rows
select distinct table.* from table join cte using (num,sub)
    where %(query_time_part)s;

I'm assuming here that "%(query_time_part)s" is written so that it applies to the literal "table", while "%(query)s" already contains the analogous condition for the "table" underlying "cte".
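As a rough sketch of how the two fragments could be spliced into the template, here is plain Python string formatting. The table name "t", the function name build_query, and the sample fragments are all made up for illustration. Note that the time condition is qualified as t.valid_time, because after the join both "t" and "cte" expose a valid_time column.

```python
# Template mirroring the answer's query; "t" is a hypothetical table name.
TEMPLATE = """
with cte as (%(query)s)
select distinct t.* from t join cte using (num, sub)
    where %(query_time_part)s;
"""


def build_query(query: str, query_time_part: str) -> str:
    """Splice two raw SQL fragments into the template via string formatting."""
    return TEMPLATE % {"query": query, "query_time_part": query_time_part}


sql = build_query(
    query="select * from t where col1 = 'X' and now() <@ valid_time",
    # Qualified with "t." so it refers to the outer table, not to the cte.
    query_time_part="now() <@ t.valid_time",
)
print(sql)
```

The fragments are inserted as raw SQL, so this only makes sense when they come from trusted code rather than user input. Passing them to the database driver as bind parameters would not work here, since they would be quoted as string literals instead of being treated as SQL.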

The rows that make up the cte automatically satisfy the join condition when joined back to the table, so no special step, such as a UNION, is needed for them to be emitted in their own right. That is, the version of each row that gets emitted is the one that comes from "table", not the one from "cte", but since the two are identical it doesn't matter.
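To check the idea against the sample data from the question, here is a small runnable sketch. It rests on several assumptions: SQLite (via Python's sqlite3) stands in for PostgreSQL, the table is called "t" because "table" is a reserved word, and the valid_time column and its condition are left out since SQLite has no range types.

```python
import sqlite3

# Hypothetical stand-in for the question's table (no valid_time column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (num INTEGER, sub INTEGER, col1 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, "X"), (1, 1, "Y"), (1, 2, "Z"), (2, 3, "X"), (2, 4, "Y"), (2, 3, "Z")],
)

# The cte picks the "input" rows; the join pulls in every row of t that
# shares (num, sub) with one of them, including the input rows themselves.
rows = conn.execute(
    """
    WITH cte AS (SELECT * FROM t WHERE col1 = 'X')
    SELECT DISTINCT t.*
    FROM t JOIN cte USING (num, sub)
    ORDER BY 1, 2, 3
    """
).fetchall()

for row in rows:
    print(row)
```

This returns the (1, 1) rows with X and Y and the (2, 3) rows with X and Z, which matches the expected result in the question, without a UNION.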

Related Solutions

Postgresql – Query for a table with paging and filtering vs. CTE (common table expression)

We have a similar issue with CTEs. From what I gather from researching the question, and from testing on our own queries, indexes that would have been used to filter the results if the WHERE clause were applied directly are not used when the WHERE clause sits outside the CTE, because the CTE acts as an optimization fence. (That is the behaviour before PostgreSQL 12; from version 12 on, the planner can inline a non-recursive, side-effect-free CTE that is referenced only once.) This means that, for performance reasons, you may want to refactor queries using CTEs to use subqueries instead.

We had a bunch of queries that used CTEs where we gained one or two orders of magnitude in performance when we refactored them to subqueries. In one case, we dropped query time from approx. 2 minutes to just under a second. So keep that in mind when building queries.

So, in your example, you would use the query inside the CTE as a subquery instead:

SELECT * FROM (
    SELECT
      customer.name,
      customer.surname,
      (SELECT aggregate_whatever(foo) FROM bar) past_purchases,
      baz.aaa bbb,
      gaz.ccc ddd,
      ...
    FROM 
      customers customer
    LEFT JOIN
      baz ON ...
    LEFT JOIN
      gaz ON ...
) encapsulated
WHERE
  past_purchases = 5
  AND <other conditions>

(In our case, some of our queries performed abysmally because they used XPath column expressions and scanned upwards of 30k rows, so we spent seconds calculating those XPaths for rows that would ultimately be discarded anyway. Removing the CTEs and using subqueries sped things up considerably, as the XPath columns were then calculated only for the rows actually returned.)
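The rewrite itself is mechanical, as the following sketch suggests. It uses SQLite through Python's sqlite3 as a stand-in for PostgreSQL, with a made-up customers table and data, and it only shows that the CTE form and the subquery form return the same rows. It says nothing about performance, which depends on the PostgreSQL planner and version.

```python
import sqlite3

# Hypothetical table and data; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, past_purchases INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", 5), ("Bob", 2), ("Cy", 5)],
)

# Original shape: the inner query lives in a CTE, the filter sits outside it.
cte_form = """
    WITH encapsulated AS (SELECT name, past_purchases FROM customers)
    SELECT name FROM encapsulated WHERE past_purchases = 5 ORDER BY name
"""
# Refactored shape: the same inner query as a subquery in FROM.
subquery_form = """
    SELECT name FROM (SELECT name, past_purchases FROM customers) encapsulated
    WHERE past_purchases = 5 ORDER BY name
"""

cte_rows = conn.execute(cte_form).fetchall()
subquery_rows = conn.execute(subquery_form).fetchall()
print(cte_rows == subquery_rows, subquery_rows)
```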

PostgreSQL – How to Filter and Append Data to Row

You could use an inner join to keep only the rows that share a (p_id, k_id) value with a row whose (t_id, pos) is (1,1):

select  base.p_id
,       base.k_id
,       json_agg(base order by base.p_id, base.k_id, base.t_id, base.pos) as stats
from    (
        select  p_id
        ,       k_id
        ,       t_id
        ,       pos
        from    two
        ) base
join    two as filter
on      filter.t_id = 1
        and filter.pos = 1
        and filter.p_id = base.p_id
        and filter.k_id = base.k_id
group by
        base.p_id
,       base.k_id;

Or, equivalently, use an exists subquery:

select  base.p_id
,       base.k_id
,       json_agg(base order by base.p_id, base.k_id, base.t_id, base.pos) as stats
from    (
        select  p_id
        ,       k_id
        ,       t_id
        ,       pos
        from    two
        ) base
where   exists
        (
        select  *
        from    two as filter
        where   filter.t_id = 1
                and filter.pos = 1
                and filter.p_id = base.p_id
                and filter.k_id = base.k_id
        )
group by
        base.p_id
,       base.k_id;
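A quick way to convince yourself that the two forms agree is to run both on a tiny dataset. The sketch below makes several assumptions: SQLite (via Python's sqlite3) instead of PostgreSQL, invented sample rows for "two", the alias "flt" instead of "filter", and count(*) in place of json_agg so that it stays portable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE two (p_id INTEGER, k_id INTEGER, t_id INTEGER, pos INTEGER)")
conn.executemany(
    "INSERT INTO two VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 1),  # group (1, 1) contains a (t_id, pos) = (1, 1) row: kept
        (1, 1, 2, 3),
        (2, 1, 1, 2),  # group (2, 1) has no (1, 1) row: filtered out
        (2, 1, 2, 1),
    ],
)

join_sql = """
    SELECT base.p_id, base.k_id, count(*) AS n
    FROM two AS base
    JOIN two AS flt
      ON flt.t_id = 1 AND flt.pos = 1
     AND flt.p_id = base.p_id AND flt.k_id = base.k_id
    GROUP BY base.p_id, base.k_id
"""
exists_sql = """
    SELECT base.p_id, base.k_id, count(*) AS n
    FROM two AS base
    WHERE EXISTS (
        SELECT 1 FROM two AS flt
        WHERE flt.t_id = 1 AND flt.pos = 1
          AND flt.p_id = base.p_id AND flt.k_id = base.k_id
    )
    GROUP BY base.p_id, base.k_id
"""

join_rows = conn.execute(join_sql).fetchall()
exists_rows = conn.execute(exists_sql).fetchall()
print(join_rows, exists_rows)
```

Both queries keep only the (1, 1) group with its two rows. The equivalence relies on there being at most one (t_id, pos) = (1, 1) row per (p_id, k_id) group; if that row could occur more than once, the join would repeat the base rows while the exists version would not.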

Updated to use json_agg from @ErwinBrandstetter's answer.
