Postgresql – the recommended way to join junction tables for efficient ordering/pagination

Tags: join, paging, performance, postgresql, postgresql-performance

Summary: I have a simple database schema, but even with just a few tens of thousands of records, performance on basic queries is already becoming a problem.

Database: PostgreSQL 9.6

Simplified schema:

CREATE TABLE article (
  id bigint PRIMARY KEY,
  title text NOT NULL,
  score int NOT NULL
);
CREATE TABLE tag (
  id bigint PRIMARY KEY,
  name text NOT NULL
);
CREATE TABLE article_tag (
  article_id bigint NOT NULL REFERENCES article (id),
  tag_id bigint NOT NULL REFERENCES tag (id),
  PRIMARY KEY (article_id, tag_id)
);
CREATE INDEX ON article (score);

Production data info:

All tables are read/write. Low write volume: only one new record every couple of minutes or so.

Approximate record counts:

~66K articles
~63K tags
~147K article_tags

Average of 5 tags per article.

Question: I want to create a view article_tags that includes an array of tags for every article record and can be ordered by article.score and paginated, with or without additional filtering.

In my first attempt I was surprised to see that the query took ~350 ms to execute and wasn't using the indexes. In subsequent attempts I was able to get it down to ~5 ms but I don't understand what is going on. I would expect all these queries to take the same amount of time. What crucial concept am I missing here?

Attempts (SQL Fiddles):

Best Answer

Pagination

For pagination, LIMIT (and OFFSET) are simple, but typically inefficient tools for bigger tables. Your tests with LIMIT 10 only show the tip of the iceberg. Performance is going to degrade with a growing OFFSET, no matter which query you choose.

If you have little or no concurrent write access, the superior solution is a MATERIALIZED VIEW with an added row number, plus an index on that. All your queries then select rows by row number.

Under concurrent write load, such an MV is outdated quickly (but a compromise like refreshing the MV CONCURRENTLY every N minutes may be acceptable). LIMIT / OFFSET is not going to work properly at all under those conditions, since "the next page" is a moving target and LIMIT / OFFSET cannot cope with that. The best technique depends on undisclosed information.
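For the low-write case, the materialized-view approach described above could look roughly like the following sketch. The names article_tags_mv, rn and article_tags_mv_rn_idx are hypothetical, and the tie-breaker on id is an assumption added so that the row numbers are deterministic.

```sql
-- Sketch only: article_tags_mv, rn and the index name are hypothetical.
CREATE MATERIALIZED VIEW article_tags_mv AS
SELECT row_number() OVER (ORDER BY a.score DESC, a.id) AS rn  -- gapless page key
     , a.id, a.title, a.score
     , ARRAY (
         SELECT t.name
         FROM   article_tag a_t
         JOIN   tag t ON t.id = a_t.tag_id
         WHERE  a_t.article_id = a.id
       ) AS names
FROM   article a;

-- A unique index serves the range lookups below and is also
-- a prerequisite for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX article_tags_mv_rn_idx ON article_tags_mv (rn);

-- Page 3 at 10 rows per page: row numbers 21 to 30.
SELECT id, title, score, names
FROM   article_tags_mv
WHERE  rn >  20
AND    rn <= 30
ORDER  BY rn;

-- Periodic refresh (for example from a scheduled job) that does not block readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY article_tags_mv;
```

Every page costs about the same here, because the range condition on rn is answered from the index instead of skipping over OFFSET rows. Between refreshes the pages show the state as of the last refresh.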

Index

Your indexes generally look good. But your comment indicates that table tag has many rows. Typically, there is very little write load on a table like tag, which is perfect for index-only support. So add a multicolumn ("covering") index:

CREATE INDEX ON tag(id, name);

Can Postgres use an index-only scan for this query with joined tables?
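Whether the planner actually picks an index-only scan can be checked with EXPLAIN. The following is a minimal sketch, assuming the covering index on tag(id, name) already exists; the article id 1 is a placeholder value.

```sql
-- Index-only scans rely on the visibility map, which VACUUM maintains.
VACUUM ANALYZE tag;

-- In the output, look for an "Index Only Scan" node on tag
-- and a low "Heap Fetches" count.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.name
FROM   article_tag a_t
JOIN   tag t ON t.id = a_t.tag_id
WHERE  a_t.article_id = 1;   -- placeholder article id
```

The planner may still choose a different plan if it estimates one to be cheaper, so the output is worth checking against production-sized data.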

Just the top N rows

If you don't actually need more pages (which isn't strictly "paging"), then any query style works well as long as it reduces the qualifying rows from article before the (expensive) retrieval of details from the related tables. Your "limited subquery" (3.) and "lateral join" (4.) solutions are good. But you can do better:

Use an ARRAY constructor for the LATERAL variant:

SELECT a.id, a.title, a.score, tags.names
FROM   article a
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT t.name
      FROM   article_tag a_t 
      JOIN   tag t ON t.id = a_t.tag_id
      WHERE  a_t.article_id = a.id
   -- ORDER  BY t.id  -- optionally sort array elements
      )
  ) AS tags(names) ON true
ORDER  BY a.score DESC
LIMIT  10;

The LATERAL subquery assembles tags for a single article_id at a time, so GROUP BY article_id is redundant, as is the join condition ON tags.article_id = article.id. A basic ARRAY constructor is also cheaper than array_agg(tag.name) for the remaining simple case.

Why is array_agg() slower than the non-aggregate ARRAY() constructor?
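For contrast, the aggregate form that the paragraph above calls redundant would look roughly like this. The original attempt is not shown here, so this is a reconstruction from the identifiers mentioned above, not necessarily the asker's exact query.

```sql
-- Reconstruction (assumption): LATERAL variant with array_agg() and GROUP BY.
SELECT article.id, article.title, article.score, tags.names
FROM   article
LEFT   JOIN LATERAL (
   SELECT article_tag.article_id
        , array_agg(tag.name) AS names
   FROM   article_tag
   JOIN   tag ON tag.id = article_tag.tag_id
   WHERE  article_tag.article_id = article.id   -- already restricts to one article
   GROUP  BY article_tag.article_id             -- hence redundant
   ) AS tags ON tags.article_id = article.id    -- likewise redundant
ORDER  BY article.score DESC
LIMIT  10;
```

One behavioral difference to keep in mind: for an article without tags, this form returns NULL for names, while the ARRAY constructor returns an empty array.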

Or use a lowly correlated subquery, which is typically even faster:

SELECT a.id, a.title, a.score
     , ARRAY (
         SELECT t.name
         FROM   article_tag a_t 
         JOIN   tag t ON t.id = a_t.tag_id
         WHERE  a_t.article_id = a.id
      -- ORDER  BY t.id  -- optionally sort array elements
      ) AS names
FROM   article a
ORDER  BY a.score DESC
LIMIT  10;
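Since the question asks for a view named article_tags, the correlated-subquery form can be wrapped in one, with ordering, filtering and limiting left to the caller. This is a sketch, and the filter on score is a placeholder condition.

```sql
-- Sketch: the correlated-subquery form wrapped in the view from the question.
CREATE VIEW article_tags AS
SELECT a.id, a.title, a.score
     , ARRAY (
         SELECT t.name
         FROM   article_tag a_t
         JOIN   tag t ON t.id = a_t.tag_id
         WHERE  a_t.article_id = a.id
       ) AS names
FROM   article a;

-- Callers add their own filter, sort order and limit.
EXPLAIN ANALYZE
SELECT *
FROM   article_tags
WHERE  score > 0          -- placeholder filter
ORDER  BY score DESC
LIMIT  10;
```

Whether the planner still builds the array only for the rows that survive the LIMIT is something to confirm in the EXPLAIN ANALYZE output; this sketch does not guarantee it.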


Related Solutions

Postgresql – Postgres multiple joins slow query, how to store default child record

You write:

Each customer can have multiple sites, but only one should be displayed in this list.

Yet, your query retrieves all rows. That would be a point to optimize. But you also do not define which site is to be picked.

Either way, it does not matter much here. Your EXPLAIN shows only 5026 rows for the site scan (5018 for the customer scan). So hardly any customer actually has more than one site. Did you ANALYZE your tables before running EXPLAIN?

From the numbers I see in your EXPLAIN, indexes will give you nothing for this query. Sequential table scans will be the fastest possible way. Half a second is rather slow for 5000 rows, though. Maybe your database needs some general performance tuning?

Maybe the query itself is faster, but "half a second" includes network transfer? EXPLAIN ANALYZE would tell us more.

If this query is your bottleneck, I would suggest you implement a materialized view.

After you provided more information I find that my diagnosis pretty much holds.

The query itself needs 27 ms. Not much of a problem there. "Half a second" was the kind of misunderstanding I had suspected. The slow part is the network transfer (plus ssh encoding / decoding, possibly rendering). You should only retrieve 100 rows; that would solve most of it, even if it means executing the whole query every time.

If you go the materialized-view route I proposed, you could add a gapless serial number to it, plus an index on that number, by adding a column row_number() OVER (<your sort criteria here>) AS mv_id.

Then you can query:

SELECT *
FROM   materialized_view
WHERE  mv_id >= 2700
AND    mv_id <  2800;

This will perform very fast. LIMIT / OFFSET cannot compete: it needs to compute the whole table before it can sort and pick 100 rows.

pgAdmin timing

When you execute a query from the query tool, the message pane shows something like:

Total query runtime: 62 ms.

And the status line shows the same time. I quote pgAdmin help about that:

The status line will show how long the last query took to complete. If a dataset was returned, not only the elapsed time for server execution is displayed, but also the time to retrieve the data from the server to the Data Output page.

If you want to see the time on the server, you need to use SQL EXPLAIN ANALYZE, the built-in Shift + F7 keyboard shortcut, or Query -> Explain analyze. Then, at the bottom of the explain output, you get something like this:

Total runtime: 0.269 ms
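As a minimal sketch of the SQL route: customer is the table name mentioned in the answer above, and its columns are not known here, hence SELECT *.

```sql
-- Runs the query on the server and reports server-side timing only,
-- without the time to transfer and render the result set.
EXPLAIN ANALYZE
SELECT *
FROM   customer;   -- table name taken from the answer above
```

EXPLAIN ANALYZE really executes the statement, so data-modifying statements are best wrapped in BEGIN ... ROLLBACK when only the timing is wanted. The label of the final timing line differs between PostgreSQL versions.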

How and what is the most efficient way to join two tables, retaining a particular field from both

It's not entirely clear what you are after but I think a FULL outer join would help:

SELECT 
    COALESCE(a.Component, b.Component) AS Component
  , COALESCE(a.data1, 0) AS data1
  , COALESCE(a.data2, 0) AS data2
  , COALESCE(b.data3, 0) AS data3
  , COALESCE(b.data4, 0) AS data4
FROM 
    table1 AS a
  FULL JOIN
    table2 AS b
      ON b.Component = a.Component ;
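A small worked example may help to show what the FULL JOIN does with unmatched rows. The sample data and column types below are made up for illustration, and the temporary-table syntax is PostgreSQL's.

```sql
-- Hypothetical sample data; column types are assumptions.
CREATE TEMP TABLE table1 (Component text, data1 int, data2 int);
CREATE TEMP TABLE table2 (Component text, data3 int, data4 int);

INSERT INTO table1 VALUES ('A', 1, 2), ('B', 3, 4);
INSERT INTO table2 VALUES ('B', 5, 6), ('C', 7, 8);

SELECT
    COALESCE(a.Component, b.Component) AS Component
  , COALESCE(a.data1, 0) AS data1
  , COALESCE(a.data2, 0) AS data2
  , COALESCE(b.data3, 0) AS data3
  , COALESCE(b.data4, 0) AS data4
FROM
    table1 AS a
  FULL JOIN
    table2 AS b
      ON b.Component = a.Component
ORDER BY 1 ;

--  component | data1 | data2 | data3 | data4
--  A         |     1 |     2 |     0 |     0   (only in table1)
--  B         |     3 |     4 |     5 |     6   (in both tables)
--  C         |     0 |     0 |     7 |     8   (only in table2)
```

Rows that exist on only one side are kept, and COALESCE fills the columns from the missing side with 0.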

If Component is not UNIQUE in one (or both) of the tables, then you could aggregate first and then join:

WITH a AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data1) AS sum_data1
      , SUM(data2) AS sum_data2
      -- ...
      , AVG(data1) AS avg_data1
      -- ...
    FROM
        table1
    GROUP BY
        Component
  )
  , b AS
  ( SELECT 
        Component
      , COUNT(*) AS cnt
      , SUM(data3) AS sum_data3
      -- ...
      , AVG(data3) AS avg_data3
      -- ...
    FROM
        table2
    GROUP BY
        Component
  ) 
    SELECT 
        COALESCE(a.Component, b.Component) AS Component
      , COALESCE(a.sum_data1,0) AS sum_data1
      -- ...
      , COALESCE(b.sum_data3,0) AS sum_data3
      -- ...
    FROM 
        a FULL JOIN b
            ON b.Component = a.Component ;