Intro
In PostgreSQL 9.3: I am building a query that fetches data for a table that supports sorting, filtering and paging. Think of customers, for example: you want to show name, surname, some detail columns, and some information from associated tables (e.g. past purchases). A lot of data is being fetched, and in its simplest form the query looks like this:
```sql
SELECT
    customer.name,
    customer.surname,
    (SELECT aggregate_whatever(foo) FROM bar) AS past_purchases,
    baz.aaa AS bbb,
    gaz.ccc AS ddd,
    ...
FROM customers customer
LEFT JOIN baz ON ...
LEFT JOIN gaz ON ...
```
There are several thousand customers, and the additional data being fetched comes from most of the system.
Filtering, paging and sorting
In the end we will have to put past_purchases into the WHERE clause when building the query according to the filter. For this reason, the whole query is encapsulated as a CTE (common table expression), just like in this SO question. It looks like this:
```sql
WITH encapsulated AS (
    SELECT
        customer.name,
        customer.surname,
        (SELECT aggregate_whatever(foo) FROM bar) AS past_purchases,
        baz.aaa AS bbb,
        gaz.ccc AS ddd,
        ...
    FROM customers customer
    LEFT JOIN baz ON ...
    LEFT JOIN gaz ON ...
)
SELECT * FROM encapsulated
WHERE past_purchases = 5
  AND <other conditions>
```
That final SELECT also gets an ORDER BY clause for the necessary columns. For paging purposes, a LIMIT is added at the end, and the total number of rows is calculated as follows (note: `LIMIT 0, 10` is MySQL syntax; PostgreSQL uses LIMIT with OFFSET):
```sql
...
SELECT *, COUNT(*) OVER () AS total_row_count
FROM encapsulated
WHERE past_purchases = 5
  AND <other conditions>
ORDER BY surname, name
LIMIT 10 OFFSET 0
```
Problem
This solution works but quickly runs into performance problems. All the tables have proper indexes, and EXPLAIN (ANALYZE) shows a nice yet costly plan with index accesses, yet the whole query takes almost 20 seconds to finish.
My suspicion is that the CTE is materialized in full first: the server fetches all the data for all the thousands of customers, and only then applies the WHERE filtering, the ordering, and the LIMIT.
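You can check that suspicion in the plan itself: in PostgreSQL 9.3, a query over a CTE shows a CTE Scan node, and if the filter is attached to that node rather than to the scans inside the CTE, the whole CTE result was built before filtering. A minimal sketch (the plan fragment below shows the general shape to look for, not actual measured output):

```sql
EXPLAIN (ANALYZE, BUFFERS)
WITH encapsulated AS (
    SELECT customer.name, customer.surname, ...
    FROM customers customer
    LEFT JOIN baz ON ...
)
SELECT * FROM encapsulated
WHERE past_purchases = 5;

-- Telltale shape in the output:
--   CTE Scan on encapsulated  (...)
--     Filter: (past_purchases = 5)
--     Rows Removed by Filter: <large number>
-- A large "Rows Removed by Filter" on the CTE Scan means all rows
-- were computed first and discarded only afterwards.
```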
Question 1
Are the WHERE conditions outside of the CTE propagated into the CTE?
Question 2
Am I using the CTE correctly? Basically I am simplifying my life: my final WHERE clause is trivial because it does not need to repeat the expressions that form my columns. But did I break the performance in doing so?
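For contrast, this is what the filter looks like without the encapsulation: the column expression has to be repeated verbatim in the WHERE clause, because SQL evaluates WHERE before the SELECT list and therefore cannot reference the alias (a sketch using the placeholder names from above):

```sql
SELECT
    customer.name,
    customer.surname,
    (SELECT aggregate_whatever(foo) FROM bar) AS past_purchases
FROM customers customer
-- "WHERE past_purchases = 5" would fail here: the alias is not yet
-- in scope, so the whole expression must be repeated.
WHERE (SELECT aggregate_whatever(foo) FROM bar) = 5
```

Avoiding exactly this duplication is what the CTE buys, at the cost described below.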
Things I have tried
- Made sure all the table accesses use an index (this helped a bit, but the fetch still takes almost 20 seconds in total).
- Tried replacing all the views referenced from the query with their materialized versions (no improvement).
- Tried removing JOINs and sub-SELECTs from the query to see if the performance improves (only very little).
- Set shared_buffers to 1/4 of the available RAM (1 GB of 4 GB; no effect).
- VACUUMed the whole database (FULL FREEZE ANALYZE; no effect).
Question 3
Am I correct in thinking that my approach requires scanning basically the whole DB, building a specialized view of it (the CTE), and only then applying the filtering? Is there a better way to do this?
Best Answer
We have a similar issue with CTEs. From what I gathered researching the question, and from testing on our own queries, indexes that would have been used to filter the results inside the CTE are not used when the filter sits in a WHERE clause outside the CTE because, as mentioned here, the CTE acts as an optimization fence. This means that, for performance reasons, you will want to refactor queries using CTEs to use subqueries instead. We had a bunch of queries using CTEs where we gained an order of magnitude or two of performance when we refactored them into subqueries; in one case, we dropped query time from approximately 2 minutes to just under a second. So keep that in mind when building queries.
So, in your example, you would use the query inside the CTE as a subquery instead:
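A sketch of that rewrite, using the placeholder names from the question: the derived table is not an optimization fence, so the planner is free to push the outer WHERE conditions down into it and use the base-table indexes. (Whether it actually can for a given predicate depends on the query, e.g. scalar subqueries in the select list may still be evaluated per surviving row, so verify with EXPLAIN ANALYZE.)

```sql
SELECT *, COUNT(*) OVER () AS total_row_count
FROM (
    SELECT
        customer.name,
        customer.surname,
        (SELECT aggregate_whatever(foo) FROM bar) AS past_purchases,
        baz.aaa AS bbb,
        gaz.ccc AS ddd,
        ...
    FROM customers customer
    LEFT JOIN baz ON ...
    LEFT JOIN gaz ON ...
) AS encapsulated  -- a plain derived table, not a fenced CTE
WHERE past_purchases = 5
  AND <other conditions>
ORDER BY surname, name
LIMIT 10 OFFSET 0
```

The outer query still gets the trivial WHERE clause over named columns, so the encapsulation benefit is preserved.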
(In our case, we had abysmal performance on some of our queries because XPath column expressions, combined with scans of upwards of 30k rows in a table, cost us seconds computing those XPaths on rows that would ultimately be discarded anyway. Removing the CTEs and using subqueries sped things up considerably, as the XPath columns are then calculated only for the rows actually returned.)