PostgreSQL – Best Indexes for Optimizing Query with JOIN Depth of 2

indexjoin;optimizationperformancepostgresqlpostgresql-performance

Disclaimer: I am relatively new to PostgreSQL.

I'm wondering how to optimize a query that does 2 INNER JOINs. My scenario is fairly simple:

Select Posts with a photo (Posts.photo IS NOT NULL) and a Hashtag that has the name 'dead' (Hashtags.name = 'dead').

The associations are as follows:

Posts <- PostHashtags -> Hashtags

Posts.id    = PostHashtags.postId (FK)
Hashtags.id = PostHashtags.hashtagId (FK)

Here is the query:

SELECT
  "Posts".*,
  "hashtags"."id" AS "hashtags.id",
  "hashtags"."count" AS "hashtags.count",
  "hashtags"."name" AS "hashtags.name",
  "hashtags"."createdAt" AS "hashtags.createdAt",
  "hashtags"."updatedAt" AS "hashtags.updatedAt",
  "hashtags"."objectId" AS "hashtags.objectId",
  "hashtags"."_etl" AS "hashtags._etl",
  "hashtags.PostHashtag"."id" AS "hashtags.PostHashtag.id",
  "hashtags.PostHashtag"."createdAt" AS "hashtags.PostHashtag.createdAt",
  "hashtags.PostHashtag"."updatedAt" AS "hashtags.PostHashtag.updatedAt",
  "hashtags.PostHashtag"."postId" AS "hashtags.PostHashtag.postId",
  "hashtags.PostHashtag"."hashtagId" AS "hashtags.PostHashtag.hashtagId",
  "hashtags.PostHashtag"."objectId" AS "hashtags.PostHashtag.objectId",
  "hashtags.PostHashtag"."_etl" AS "hashtags.PostHashtag._etl"

FROM (
  SELECT
    "Posts"."id",
    "Posts"."note",
    "Posts"."photo",
    "Posts"."createdAt",
    "user"."id" AS "user.id",
    "user"."name" AS "user.name"
  FROM "Posts" AS "Posts"

  INNER JOIN "Users" AS "user" ON "Posts"."userId" = "user"."id"

  WHERE "Posts"."photo" IS NOT NULL
  AND (
    SELECT "PostHashtags"."id" FROM "PostHashtags" AS "PostHashtags"
    INNER JOIN "Hashtags" AS "Hashtag" ON "PostHashtags"."hashtagId" = "Hashtag"."id"
    WHERE "Posts"."id" = "PostHashtags"."postId"
    LIMIT 1
  ) IS NOT NULL

  ORDER BY "Posts"."createdAt" DESC LIMIT 10
) AS "Posts"

INNER JOIN (
  "PostHashtags" AS "hashtags.PostHashtag"
  INNER JOIN "Hashtags" AS "hashtags" ON "hashtags"."id" = "hashtags.PostHashtag"."hashtagId"
)

ON "Posts"."id" = "hashtags.PostHashtag"."postId"
AND "hashtags"."name" = 'dead'

ORDER BY "Posts"."createdAt" DESC;

EXPLAIN results:

Nested Loop  (cost=886222912.89..886223769.55 rows=1 width=277)
  Join Filter: ("hashtags.PostHashtag"."postId" = "Posts".id)
  ->  Limit  (cost=886220835.39..886220835.42 rows=10 width=189)
        ->  Sort  (cost=886220835.39..886220988.88 rows=61394 width=189)
              Sort Key: "Posts"."createdAt"
              ->  Nested Loop  (cost=0.42..886219508.69 rows=61394 width=189)
                    ->  Seq Scan on "Posts"  (cost=0.00..885867917.51 rows=78196 width=177)
                          Filter: ((photo IS NOT NULL) AND ((SubPlan 1) IS NOT NULL))
                          SubPlan 1
                            ->  Limit  (cost=0.42..815.70 rows=1 width=4)
                                  ->  Nested Loop  (cost=0.42..815.70 rows=1 width=4)
                                        ->  Seq Scan on "PostHashtags"  (cost=0.00..811.25 rows=1 width=8)
                                              Filter: ("Posts".id = "postId")
                                        ->  Index Only Scan using "Hashtags_pkey" on "Hashtags" "Hashtag"  (cost=0.42..4.44 rows=1 width=4)
                                              Index Cond: (id = "PostHashtags"."hashtagId")
                    ->  Index Scan using "Users_pkey" on "Users" "user"  (cost=0.42..4.49 rows=1 width=16)
                          Index Cond: (id = "Posts"."userId")
  ->  Materialize  (cost=2077.50..2933.89 rows=1 width=88)
        ->  Hash Join  (cost=2077.50..2933.89 rows=1 width=88)
              Hash Cond: ("hashtags.PostHashtag"."hashtagId" = hashtags.id)
              ->  Seq Scan on "PostHashtags" "hashtags.PostHashtag"  (cost=0.00..721.00 rows=36100 width=40)
              ->  Hash  (cost=2077.49..2077.49 rows=1 width=48)
                    ->  Seq Scan on "Hashtags" hashtags  (cost=0.00..2077.49 rows=1 width=48)
                          Filter: ((name)::text = 'dead'::text)

This query has been simplified slightly. It also performs OUTER JOINS on other data related to Posts, which is why the SELECT must be performed on Posts instead of, say, PostHashtags.

Any help in translating the EXPLAIN to a useful index would be greatly appreciated.

My ideas:

  1. Build an index on Posts.photo, but should it be a partial index WHERE "photo" IS NOT NULL?
  2. Build a UNIQUE index on Hashtags.name.

I'm not sure if those are necessarily the bottlenecks, though.

Best Answer

Also consider the first answer.

Query

This does what your current query currently does, just simpler and faster:

SELECT p.id, p.note, p.photo, p."createdAt",
  u.id           AS "user.id",
  u.name         AS "user.name",
  h.id           AS "hashtags.id",
  h.count        AS "hashtags.count",
  h.name         AS "hashtags.name",
  h."createdAt"  AS "hashtags.createdAt",
  h."updatedAt"  AS "hashtags.updatedAt",
  h."objectId"   AS "hashtags.objectId",
  h._etl         AS "hashtags._etl",
  ph.id          AS "hashtags.PostHashtag.id",
  ph."createdAt" AS "hashtags.PostHashtag.createdAt",
  ph."updatedAt" AS "hashtags.PostHashtag.updatedAt",
  ph."postId"    AS "hashtags.PostHashtag.postId",
  ph."hashtagId" AS "hashtags.PostHashtag.hashtagId",
  ph."objectId"  AS "hashtags.PostHashtag.objectId",
  ph._etl        AS "hashtags.PostHashtag._etl"
FROM (
    SELECT id, note, photo, "createdAt", "userId"
    FROM   "Posts" p
    WHERE  photo IS NOT NULL
    AND    EXISTS (
        SELECT 1
        FROM   "PostHashtags" ph
        WHERE  ph."postId" = p.id
        )
    ORDER  BY p."createdAt" DESC
    LIMIT  10
   ) p
JOIN   "PostHashtags" ph ON ph."postId" = p.id
JOIN   "Hashtags"     h  ON h.id = ph."hashtagId"
JOIN   "Users"        u  ON u.id = p."userId"
WHERE  h.name = 'dead'
ORDER  BY p."createdAt" DESC;

The EXISTS semi-join should be faster than your subquery construct. I am assuming that the column "PostHashtags".id is the PK and cannot be NULL by itself. Also, if referential integrity is enforced by a FK constraint, no need to join to "Hashtags" in this test.

Indexes

Partial index on Posts

CREATE INDEX posts_foo_idx ON "Posts" ("createdAt", id)
WHERE photo IS NOT NULL;

Note the columns: ("createdAt", id). Postgres will the latest posts, I expect an index scan on posts_foo_idx from the top, followed by a test for matching entries in PostHashtags using id with the next index.

UNIQUE index on PostHashtags

This time we need the index with "postId" first.

The rest is mostly like in the first answer.