Postgresql – optimize SELECT query with CASE statements and JOINs

postgresql

I'm creating a social network, and I need a query to fetch the posts for a user's feed. I have the following (simplified) tables:

CREATE TABLE users (
  user_id bigint NOT NULL (primary key),
  username character varying(32)
);

CREATE TABLE posts (
  post_id bigint NOT NULL (primary key),
  repost_id bigint,
  text text,
  media jsonb,
  timestamp timestamp with time zone DEFAULT NOW() NOT NULL,
  user_id bigint
);

CREATE TABLE followers (
  follower_id bigint NOT NULL (primary key),
  user_id bigint NOT NULL,
  following_user_id bigint NOT NULL
);

Users kan follow other users, and their posts will appear on the feed. Users kan repost another post, I want to include these reposts in the feed.

I have the following query (for now). This query works, but is has a lot of CASE statements and joins, and I don't know if this query is the best way to go. I'm a postgresql beginner and I want to know if somebody has some tips to perfect this query.

SELECT
  a.post_id::int,
  a.repost_id::int,
  CASE WHEN a.repost_id IS NULL THEN a.text ELSE b.text END,
  CASE WHEN a.repost_id IS NULL THEN a.media ELSE b.media END,
  CASE WHEN a.repost_id IS NULL THEN a.timestamp ELSE b.timestamp END,
  c.username,
  d.username as original_username
FROM
  posts a
LEFT JOIN
  posts b
  ON a.repost_id = b.post_id
JOIN
  users c
  ON a.user_id = c.user_id
LEFT JOIN
  users d
  ON b.user_id = d.user_id
WHERE
  a.user_id IN (SELECT following_user_id FROM followers WHERE user_id = $1)
  AND
  post_id < $2
ORDER BY
  post_id DESC
LIMIT
  10

Explanation:

I select all posts from the table posts (a) where the user_id is contained in the results of the query to select all the following_user_id's of a user from the table followers (subquery in WHERE). I also do a LEFT JOIN on the table posts (b) to select the information from the reposted post. In the SELECT with the CASE statements, I check if the repost_id is NULL. If it is null, I need the post information from (a), otherwise, I need the post information from (b). The same thing with the usernames. username from (c) is the username from the post itself, the username from (d) is the username of the reposted post (if this is a reposted post), or NULL if it's an original post itself.

A second option is to copy the row op the original post when someone reposts a post, but I want to know if my initial option is OK to?

Best Answer

Moved from a comment because getting lengthy:

I think if your query is executing in only 4ms currently you're fine. The relationship between data growth and execution time is not linear, so if your data even grew 10x what it currently is and you re-ran your query, it'll probably still run in under 10ms, roughly speaking. I don't see much issues in your query itself as long as your indexing everything properly.

The only recommendation I could make is to avoid sub-queries when not necessary, especially as part of predicates, such as in your WHERE clause. This can lead to slight inefficiencies, and there's a better relational alternative by just doing an INNER JOIN to the followers table from your subquery on the user_id = following_user_id field. You may never run into a performance issue as you currently wrote it, but just a heads up a better alternative exists.

Example:

SELECT
  a.post_id::int,
  a.repost_id::int,
  CASE WHEN a.repost_id IS NULL THEN a.text ELSE b.text END,
  CASE WHEN a.repost_id IS NULL THEN a.media ELSE b.media END,
  CASE WHEN a.repost_id IS NULL THEN a.timestamp ELSE b.timestamp END,
  c.username,
  d.username as original_username
FROM
  posts a
INNER JOIN followers f
    ON a.user_id = f.following_user_id
    AND f.user_id = $1
LEFT JOIN
  posts b
      ON a.repost_id = b.post_id
JOIN
  users c
      ON a.user_id = c.user_id
LEFT JOIN
  users d
      ON b.user_id = d.user_id
WHERE
  post_id < $2
ORDER BY
  post_id DESC
LIMIT
  10