PostgreSQL – Finding optimal indices when different queries compete for indices


My schema:

create TABLE "logs" (
    "id" serial not null default nextval('logs_id_seq'::regclass),
    "request" varchar(1024),
    "token" varchar(512) default NULL,
    "was_batch_request" bool not null default false,
    "created" timestamp,
    PRIMARY KEY ("id")
);
CREATE UNIQUE INDEX "idx_logs_token_created" ON "logs" ("token", "created");
CREATE UNIQUE INDEX "idx_logs_created" ON "logs" ("created");

I have around 5,000,000 entries with varying tokens, and there are two queries I regularly run against this table. A token is usually 200 to 500 characters long, and there are about 600 distinct tokens in the table.
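
For reference, these cardinality figures can be double-checked with a quick aggregate (a sketch against the schema above; it scans the whole table once):

SELECT count(*)              AS total_rows
     , count(DISTINCT token) AS distinct_tokens
FROM logs;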

Query 1:

SELECT
    TO_CHAR("logs".created, 'YYYYMMDDHH24') AS date
  , COUNT(*)                                AS num
FROM "public"."logs" AS "logs"
WHERE "logs"."created" BETWEEN NOW() - INTERVAL '24 hour' AND NOW()
GROUP BY TO_CHAR("logs"."created", 'YYYYMMDDHH24')
ORDER BY "date" DESC
LIMIT 24

For this query I have an index on created.

Query 2:

SELECT count(*)
FROM logs
WHERE token = 'token_600_chars_long'
      AND
      created >= NOW() - INTERVAL '1 day';

For this query I created an index on token,created.

Some observations:

  • Query 1 takes around 2s
  • Query 2 takes around 1s

The issue with Query 1 is unclear to me. It does use the created index; without that index, it takes as long as 5s.

The issue I see with Query 2: it uses the created index instead of the token,created index.

However, after either of the following changes, Query 2 does use the token,created index (a safe way to reproduce the first change is sketched after the list):

  • I drop the created index (but this slows Query 1 down to 5s).
  • I widen the interval to e.g. 4 days; Query 2 then uses the token,created index and finishes in 100-200ms.
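
Since DDL is transactional in PostgreSQL, the first change can be tested without permanently losing the index: wrap the DROP INDEX in a transaction and roll it back (a sketch, using the index name from the schema above):

BEGIN;
DROP INDEX idx_logs_created;
EXPLAIN ANALYZE
SELECT count(*)
FROM logs
WHERE token = 'token_600_chars_long'
      AND created >= NOW() - INTERVAL '1 day';
ROLLBACK; -- rolling back restores idx_logs_created

Be aware that DROP INDEX holds an exclusive lock on the table until the transaction ends, so this is not something to run on a busy production server.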

Explain for Query 1:

QUERY PLAN
Limit  (cost=75793.01..75793.07 rows=24 width=8) (actual time=2077.100..2077.103 rows=23 loops=1)
  ->  Sort  (cost=75793.01..75814.66 rows=8662 width=8) (actual time=2077.098..2077.099 rows=23 loops=1)
        Sort Key: (to_char(created, 'YYYYMMDDHH24'::text))
        Sort Method: quicksort  Memory: 26kB
        ->  HashAggregate  (cost=75442.85..75551.12 rows=8662 width=8) (actual time=2076.830..2076.979 rows=23 loops=1)
              Group Key: to_char(created, 'YYYYMMDDHH24'::text)
        ->  Index Only Scan using idx_logs_created on logs "logs"  (cost=0.44..72326.16 rows=623337 width=8) (actual time=0.244..1804.994 rows=876114 loops=1)
                    Index Cond: ((created >= (now() - '24:00:00'::interval)) AND (created <= now()))
                    Heap Fetches: 876143
Planning time: 0.968 ms
Execution time: 2077.309 ms
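
Note the Heap Fetches: 876143 above: the "index only" scan still visits the heap for nearly every row, which suggests the visibility map is stale. If autovacuum has not caught up, a manual vacuum may turn this into a true index-only scan (a sketch; whether it helps depends on the table's write pattern):

VACUUM ANALYZE logs;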

Explain for Query 2:

QUERY PLAN
Aggregate  (cost=70856.83..70856.84 rows=1 width=0) (actual time=640.327..640.327 rows=1 loops=1)
  Output: count(*)
  ->  Index Scan using idx_logs_created on public.logs  (cost=0.44..70795.30 rows=24610 width=0) (actual time=0.096..637.190 rows=30863 loops=1)
        Output: id, request, token, was_batch_request, created
        Index Cond: (logs.created >= (now() - '1 day'::interval))
        Filter: ((logs.token)::text = 'DazToken'::text)
        Rows Removed by Filter: 843711
Planning time: 0.381 ms
Execution time: 640.417 ms

In MySQL I could force certain queries to use a specific index; to my knowledge, PostgreSQL does not provide such a thing.

Is it possible to speed up the queries with the current schema?

Best Answer

The query planner seems to think that created is more selective than token. Are your table statistics up to date? Check the last_analyze column in:

SELECT  schemaname, relname, last_analyze 
FROM    pg_stat_all_tables 
WHERE   relname = 'logs'
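
If last_analyze is NULL or long ago, refreshing the statistics is cheap (a sketch; the SET STATISTICS value is illustrative, not taken from this answer):

-- Refresh planner statistics for the table:
ANALYZE logs;

-- Optionally collect a more detailed value distribution for token first;
-- the default statistics target is 100, 1000 here is an illustrative value:
ALTER TABLE logs ALTER COLUMN token SET STATISTICS 1000;
ANALYZE logs;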

To really force the issue, you can use a temporary table:

begin;
create temporary table tmp_logs (created timestamp);
insert into tmp_logs select created from logs where token = 'token_600_chars_long';
select count(*) from tmp_logs where created >= NOW() - INTERVAL '1 day';
commit;
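
On versions where a CTE acts as an optimization fence (always before PostgreSQL 12, or with the MATERIALIZED keyword from 12 onwards), the same two-step evaluation can be written as a single statement (a sketch of the equivalent idea; token_rows is just an illustrative name):

WITH token_rows AS (
    SELECT created
    FROM logs
    WHERE token = 'token_600_chars_long'
)
SELECT count(*)
FROM token_rows
WHERE created >= NOW() - INTERVAL '1 day';

This makes PostgreSQL resolve the token predicate first, just as the temporary table does, but without the extra DDL.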

The SELECT that populates the temporary table filters on token alone, so the only index the optimizer can use is one that starts with token.
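
That can be verified by explaining the populating SELECT on its own; it should show a scan on idx_logs_token_created (a sketch):

EXPLAIN
SELECT created
FROM logs
WHERE token = 'token_600_chars_long';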