PostgreSQL – Improving SELECT Performance with Large Partial Index

indexoptimizationpostgresql

Query:

EXPLAIN (ANALYZE, BUFFERS) SELECT
    COUNT (*) AS "count"
FROM
    "Posts" AS "Post"
WHERE
    "Post"."createdAt" > '2015-08-19 14:55:50.398'
AND "Post"."new" = TRUE;

Index:

CREATE INDEX posts_new_createdat_idx ON "Posts" ("createdAt")
WHERE
    NEW = TRUE

Plan:

Aggregate  (cost=389915.59..389915.60 rows=1 width=0) (actual time=4234.772..4234.773 rows=1 loops=1)
  Buffers: shared hit=254427
  ->  Bitmap Heap Scan on "Posts" "Post"  (cost=14415.81..387990.63 rows=769985 width=0) (actual time=123.805..3859.150 rows=1138854 loops=1)
        Recheck Cond: (("createdAt" > '2015-08-19 14:55:50.398+00'::timestamp with time zone) AND new)
        Rows Removed by Index Recheck: 8238790
        Buffers: shared hit=254427
        ->  Bitmap Index Scan on posts_new_createdat_idx  (cost=0.00..14223.32 rows=769985 width=0) (actual time=122.601..122.601 rows=1138854 loops=1)
              Index Cond: ("createdAt" > '2015-08-19 14:55:50.398+00'::timestamp with time zone)
              Buffers: shared hit=3114
Total runtime: 4234.989 ms

Schema:

CREATE TABLE "public"."Posts" (
    "id" int4 NOT NULL DEFAULT nextval('"Posts_id_seq"'::regclass),
    "actionId" int4,
    "commentCount" int4 DEFAULT 0,
    "facebook" bool,
    "featurePostOnDate" timestamp(6) WITH TIME ZONE,
    "forcedPrivate" bool,
    "instagram" bool,
    "isReported" bool,
    "likeCount" int4 DEFAULT 0,
    "note" text COLLATE "default",
    "photo" varchar(255) COLLATE "default",
    "private" bool,
    "new" bool,
    "popular" bool,
    "twitter" bool,
    "userId" int4,
    "objectId" varchar(255) COLLATE "default",
    "createdAt" timestamp(6) WITH TIME ZONE,
    "updatedAt" timestamp(6) WITH TIME ZONE,
    "activityLogId" int4,
    "weightLogId" int4,
    "workoutId" int4,
    "workoutLogId" int4,
    "thumbnail" varchar(255) COLLATE "default"
)

Data:

  • new = true for 99% or records
  • Any posts older than 2 weeks can be ignored (from the count and the index)

Database Details:

Host           = Amazon AWS
Engine         = PostgreSQL 9.3.10
Instance Class = db.r3.8xlarge
Storage Type   = SSD
IOPS           = 3000
StorageAmount  = 500 GB

As you can see, the partial index is very large. Is there a way to better index so that the recheck condition is not so heavy?

Best Answer

Add the second predicate of your query to the partial index as well:

WHERE "Post"."createdAt" > '2015-08-19 14:55:50.398'

Your timestamp is probably a moving target, but I am going to assume you have lots of old rows that are excluded in most of your queries and only few "younger" rows are of interest. A typical use case. You can cut off old rows in your partial index.

CREATE INDEX posts_new_createdat_idx ON "Posts"("createdAt")
WHERE "new"
AND   "createdAt" > '2015-08-01 00:00';  -- replace with useful timestamp

This only makes sense if you can reduce the size of the index to a fraction. Else, the effect is typically not worth the trouble.

Replace with an opportune timestamp to cut off as many rows as possible while still covering relevant rows for your queries.

Modern versions of Postgres are smart enough to understand the index is applicable to queries cutting off at a later timestamp. For older versions you have to add the verbatim index condition redundantly to make the query planner use this partial index.

This index will be used by your query immediately as is. (I just removed some more noise):

SELECT COUNT(*) AS "count"
FROM  "Posts"
WHERE "createdAt" > '2015-08-19 14:55:50.398'
AND   "new";

Related:

Aside 1: "new" = TRUE is just a noisy way of stating "new" for a boolean value.

Aside 2: Don't use reserved words or CaMeL case or other illegal identifiers so you don't have to double-quote all the time. Very error prone.