PostgreSQL – Limit crosstab query based on join from outer query

Tags: performance, postgresql, query-performance

In my database I have two tables, assessments and responses, in a 1:N relationship.

I have a query that looks a little something like this:

  WITH "cross_responses" AS (select * from crosstab(
       $$select
         assessments.id as id,
         i.name as name,
         r.value as response
       from assessments
         left join users u on u.id = assessments.user_id
         inner join responses r on r.assessment_id = assessments.id
         inner join items i on i.id = r.item_id
       where i.name = ANY ('{E-mail Address, First Name, Last Name}')
       order by 1,2$$,
       $$VALUES('E-mail Address'), ('First Name'), ('Last Name')$$
     ) as ct(id int, email text, first_name text, last_name text)
) SELECT  ...blah blah...  
ORDER BY "assessments"."id" desc, assessments.end_time DESC NULLS LAST, 
assessments.start_time DESC NULLS LAST LIMIT 1000 OFFSET 4000

So basically, I'm creating a CTE made up of crosstabbed results for key responses from the assessment, then joining it to the assessments table and selecting to get a nice flat representation mixing metadata from the assessment with the answers to individual questions.

As you can see at the bottom, this query gets run in batches of 1000 while we stream a CSV file of all the results over HTTP.

I believe this creates a potential performance issue: for each batch, I'm assuming crosstab runs against the entire assessments and responses tables, not just against the batch defined by the outer query. Running EXPLAIN ANALYZE seems to confirm this, as the crosstab line in the output says it produces 5080 rows, which is every assessment in my development database (production has a lot more).

What I'm asking:

1) Is my understanding of how crosstab behaves correct?
2) Is there a way to make the query passed into crosstab "aware" of limits or conditions applied by the outer query?
3) Is there a different approach I should be taking entirely to run this query performantly in a batched fashion and stream the data out to CSV?

EDIT: Adding the results of EXPLAIN ANALYZE:

 Limit  (cost=1813.08..1813.08 rows=1 width=124) (actual time=71.913..72.036 rows=1000 loops=1)
  CTE cross_responses
    ->  Function Scan on crosstab ct  (cost=0.00..10.00 rows=1000 width=100) (actual time=63.593..64.074 rows=5080 loops=1)
  ->  Sort  (cost=1800.58..1803.08 rows=1000 width=124) (actual time=71.568..71.817 rows=5000 loops=1)
        Sort Key: assessments.id, assessments.end_time, assessments.start_time
        Sort Method: quicksort  Memory: 861kB
        ->  Hash Join  (cost=1712.00..1750.75 rows=1000 width=124) (actual time=66.358..69.842 rows=5080 loops=1)
              Hash Cond: (cross_responses.id = assessments.id)
              ->  CTE Scan on cross_responses  (cost=0.00..20.00 rows=1000 width=100) (actual time=63.595..65.204 rows=5080 loops=1)
              ->  Hash  (cost=1645.89..1645.89 rows=5289 width=28) (actual time=2.751..2.751 rows=5289 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 289kB
                    ->  Seq Scan on assessments  (cost=0.00..1645.89 rows=5289 width=28) (actual time=0.008..1.879 rows=5289 loops=1)
Total runtime: 72.384 ms

Best Answer

If all the export can happen through a single database connection, or even better, through a single transaction, it would be simpler and more efficient to just use a server-side CURSOR instead of a pagination based on successive queries with LIMIT/OFFSET.

For instance:

BEGIN;

DECLARE cursor_name CURSOR FOR
WITH "cross_responses" AS (select * from crosstab(
       $$select...
        ...
) SELECT  ...blah blah...  
ORDER BY "assessments"."id" desc, assessments.end_time DESC NULLS LAST, 
assessments.start_time DESC NULLS LAST

Then issue FETCH 1000 FROM cursor_name, stream the results to the web client, and repeat until FETCH returns no rows. Then close the transaction, and the cursor will disappear with it.

It's also possible to avoid keeping a transaction open if the cursor is declared WITH HOLD.
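For illustration, the complete cursor workflow could look like the sketch below (the cursor name assessment_export is made up for this example, and the elided parts stand in for the original query):

```
BEGIN;

-- Declaring WITH HOLD instead would let the cursor outlive COMMIT,
-- at the cost of materializing the full result set at commit time.
DECLARE assessment_export CURSOR FOR
WITH "cross_responses" AS (select * from crosstab(
       $$select...
        ...
) SELECT  ...blah blah...
ORDER BY "assessments"."id" DESC, assessments.end_time DESC NULLS LAST,
assessments.start_time DESC NULLS LAST;

-- Repeat this, streaming each batch out, until it returns zero rows:
FETCH 1000 FROM assessment_export;

CLOSE assessment_export;  -- optional; ending the transaction releases it anyway
COMMIT;
```

Unlike LIMIT/OFFSET pagination, each FETCH simply continues the same executing query, so the server never re-sorts or re-pivots rows it has already delivered.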


If multiple passes on the query must be run by independent connections (for instance, if there are multiple invocations by a web server), another possibility is to not use the crosstab() function at all, and do the pivot operation in the main query itself. This is not difficult, since there are only 3 static columns.

For instance:

 select
  assessments.id as id,
  max(case when i.name='Last Name' then r.value end) as "Last Name",
  max(case when i.name='First Name' then r.value end) as "First Name",
  max(case when i.name='E-mail Address' then r.value end) as "Email"
 from assessments
     left join users u on u.id = assessments.user_id
     inner join responses r on r.assessment_id = assessments.id
     inner join items i on i.id = r.item_id
 where i.name = ANY ('{E-mail Address, First Name, Last Name}')
 group by assessments.id
 order by 1 LIMIT 1000 OFFSET 4000;
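As a side note, on PostgreSQL 9.4 and later the same pivot can be expressed with the standard FILTER clause on aggregates rather than max(case ...); this is just a stylistic variant of the query above, not a behavioral change:

```
 select
  assessments.id as id,
  max(r.value) filter (where i.name = 'Last Name')      as "Last Name",
  max(r.value) filter (where i.name = 'First Name')     as "First Name",
  max(r.value) filter (where i.name = 'E-mail Address') as "Email"
 from assessments
     left join users u on u.id = assessments.user_id
     inner join responses r on r.assessment_id = assessments.id
     inner join items i on i.id = r.item_id
 where i.name = ANY ('{E-mail Address, First Name, Last Name}')
 group by assessments.id
 order by 1 LIMIT 1000 OFFSET 4000;
```

With either form, the WHERE, GROUP BY, and LIMIT/OFFSET all apply to one ordinary query, so the planner can restrict the work to the requested batch instead of pivoting every assessment on each call.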