PostgreSQL – Subquery to Return Multiple Columns or JSON Object

postgresql

Is there any way for me to return multiple columns from a subquery or at least return a column that is a JSON object of all of the columns I need?

This is a working query that works:

select r.*,
       (select status
        from builds
        where repo = r.id
          and branch = 'master'
        order by id desc
        limit 1)
from repos r
where id = any($1)

Where $1 is an array of ids

But I would like to return more than just the status column from the builds table:

select r.*,
       (select status, start_time, end_time
        from builds
        where repo = r.id
          and branch = 'master'
        order by id desc
        limit 1)
from repos r
where id = any($1)

After looking around it seems that row_to_json should work for me. However, I'm not sure how it should work in this given case:

select r.*,
       row_to_json(select status,
                   start_time, end_time
                   from builds
                   where repo = r.id
                     and branch = 'master'
                   order by id desc
                   limit 1) as build
from repos r
where id = any($1)

I am getting

syntax error at or near "select"

Best Answer

Don't make life hard for yourself, use a lateral join:

SELECT r.*, 
       b.status, b.start_time, b.end_time
FROM repos AS r
   LEFT JOIN LATERAL
      (SELECT status, start_time, end_time
       FROM builds
       WHERE repo = r.id
         AND branch = 'master'
       ORDER BY id DESC
       LIMIT 1
      ) AS b ON TRUE
WHERE id = ANY($1);

Related Solutions

PostgreSQL – Return Unique Column Combinations Based on Where Clause

The execution plan shown does not seem to match the big SELECT DISTINCT query because the Sort and Unique steps are missing. Anyway you are correct than when retrieving ~50% of a table, index don't help. The best strategy is a big sequential scan of the main table and only fast hardware helps with that.

For the 2nd part of the question:

How would I go about selecting only the unique combinations of adjacent columns? Is this too complicated a task to perform through a database query? Would it speed up the query?

To remove duplicate combinations of adjacent columns, the structure of the resultset should be changed so that each output row has only one couple of adjacent columns along with their corresponding dimensions in the parallel coordinates graph. Well, except that the dimension for the 2nd column is not necessary since it's always the dimension for the other column plus one.

In one single query, this could be written like this:

WITH logs as (
  SELECT log_time_mapped, syslog_priority_mapped, 
     operation_mapped, message_code_mapped, protocol_mapped, 
     source_ip_mapped, destination_ip_mapped, 
     source_port_mapped, destination_port_mapped, 
     destination_service_mapped, direction_mapped, 
     connections_built_mapped, connections_torn_down_mapped, 
     hourofday_mapped, meridiem_mapped 
  FROM firewall_logs_mapped 
  WHERE operation = 'Built')
SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped FROM logs
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped FROM logs
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped FROM logs
UNION ALL
...etc...
SELECT DISTINCT 14,  hourofday_mapped, meridiem_mapped FROM logs
;

The first SELECT DISTINCT subquery extracts the lines to draw between dimensions 1 and 2, the next subquery between dimensions 2 and 3, and so on. DISTINCT eliminates duplicates, so the client side doesn't have to do it. The UNION ALL concatenates the results without any further processing.

However it's a heavy query and it's dubious that it would be any faster than what you're already doing.

The WITH subquery is likely to gets slowly materialized on disk, so it might be interesting to compare the execution time with this other form repeating the same condition:

SELECT DISTINCT 1, log_time_mapped, syslog_priority_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 2, syslog_priority_mapped, operation_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
UNION ALL
SELECT DISTINCT 3, operation_mapped, message_code_mapped
   FROM firewall_logs_mapped WHERE operation = 'Built'
...etc...
;

Postgresql – Postgres pg_trgm JOIN multiple columns with large tables (~50 million rows)

This type of join cannot effectively use an index-to-index join. One of the tables is going to need to be seq scanned. You might want to try forcing it to reverse which table gets seq scanned, for example by dropping the gin index on the larger table so it can't be used, or appending the empty string to each column of the large table, such as on ((b.name||'') % a.track_title and .... Doing this will probably only work if you are limited by IO, not CPU.

Also, should I expect this to run in a reasonable amount of time (minutes) on tables this large, even with indexes?

No, I don't think so. There might be some things you can do to improve performance, but you are fundamentally asking the database to do a massive amount of work and it will take a long time to do it.

Make sure you are using the latest version of pg_trgm, which is 1.3. This can help a lot for LIKE queries, but don't expect miracles for % queries.

Increase the value of pg_trgm.similarity_threshold as much as you can while still obtaining the number of results you want. The default value of 0.3 is really very low.

It is not clear that the multi-column index is actually going to be helpful. Try dropping it and see how it does with the individual column indexes.

Also, just be patient. This looks to me like a data-cleaning exercise, not something you need to run interactively. Maintain a materialized view, or a join table updated by triggers, to record the matches if need to be able to retrieve them interactively.

Best Answer

Related Solutions

PostgreSQL – Return Unique Column Combinations Based on Where Clause

Postgresql – Postgres pg_trgm JOIN multiple columns with large tables (~50 million rows)

Related Question