Postgresql – Multiple to_json(array_agg), separate joins

aggregatejoin;jsonpostgresql

I'm building a messaging system, and trying to get a list of messages and their replies; and also join the participants of the message threads.

So I'm shooting for messages = [{id, subject, body, replies: [..], users: [..]}]

SELECT messages.*, 
       to_json(array_agg(users)) users, 
       to_json(array_agg(replies)) replies
FROM messages
LEFT JOIN (SELECT id, fullname, email, company FROM users) users 
   ON users.id = messages.user_id OR users.id = messages.to
LEFT JOIN (SELECT * FROM messages ORDER BY created_at) replies 
   ON replies.message_id = messages.id
WHERE messages.to = :to OR (messages.to IS NOT NULL AND messages.user_id = :to)
GROUP BY messages.id;

I've tried various other queries, this gets me the closest. The problem is that for each message, all the replies are repeated once. I.e., if there are 3 replies in a thread, I get 6 (2x each). Any ideas what might be wrong?

I am using Postgres 9.5.

Best Answer

The basic problem is this:

Here is one way to avoid this problem and improve some other details while being at it:

SELECT m.*, u.users, r.replies
FROM   messages m
LEFT   JOIN LATERAL (
   SELECT json_agg(u) AS users
   FROM   (
      SELECT id, fullname, email, company
      FROM   users
      WHERE  id IN (m.user_id, m.to)
      -- no order by?
      ) u
   ) u ON TRUE
LEFT   JOIN LATERAL (
   SELECT json_agg(r) AS replies
   FROM   (
      SELECT *
      FROM   messages
      WHERE  message_id = m.id
      ORDER  BY created_at
      ) r
   ) r ON TRUE
WHERE  :to IN (m.user_id, m.to)
AND    m.to IS NOT NULL;
  • Use json_agg() instead of to_json(array_agg()). Simpler, faster.

  • Aggregate the subquery on users right away in a LATERAL join, thus avoiding the main problem. Related:

  • This:

    WHERE  m.to = :to OR
          (m.to IS NOT NULL AND m.user_id = :to)
    

    can be expressed simpler:

    WHERE  :to IN (m.user_id, m.to)
    AND    m.to IS NOT NULL
    
  • This:

    SELECT json_agg(r) AS replies
    FROM   (
       SELECT *
       FROM   messages
       WHERE  message_id = m.id
       ORDER  BY created_at
       ) r
    

    could be shortened to:

    SELECT json_agg(r ORDER BY created_at) AS replies
    FROM   messages r
    WHERE  message_id = m.id
    

But a single ORDER BY in a subquery is typically faster than per-aggregate ORDER BY.