How to Index Two Tables for Optimized JOIN Queries in PostgreSQL

index-tuningjoin;postgresql

I'm using PostgreSQL 9.5 on High Sierra.

Over the two tables:

request_logs - ~ 26K rows

response_logs - ~ 9K rows

I've the following query (with JOIN):

SELECT
  req.uuid,
  res.status,
  req.method,
  req.requesturi,
  req.accessed,
  req.payload reqpayload,

  res.payload respayload,
  COUNT(*) OVER() AS total_rows
FROM
  request_logs req
  LEFT OUTER JOIN response_logs res ON req.uuid = res.uuid
WHERE
   req.accountid = 2 AND
   req.requesturi not ilike '/v1/sessions%' AND
   req.accessed BETWEEN “2018-01-01 15:04:05 +0000” and “2019-01-02 15:04:05+0000” AND
   res.status IS NOT NULL AND
   req.applicationid = 1
ORDER BY
   accessed DESC LIMIT 1000

As I'm trying to optimise the query, I've experimented with different indexes:
Here's a list of what I tried:

Configuration 1:
 1. request_log.uuid (pkey, unique)
 2. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 260 ms
Configuration 2:
 1. request_log.uuid (pkey, unique)
 2. request_log.applicationid
 3. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 230 ms
Configuration 3:
 1. request_log.uuid (pkey, unique)
 2. request_log.applicationid
 3. request_log.accessed (timestampz)
 4. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 230 ms
Configuration 4:
 1. request_log.uuid (pkey, unique)
 2. request_log.applicationid
 3. request_log.accessed (timestampz)
 4. request_log.accountid
 5. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 230 ms
Configuration 5:
 1. request_log.uuid (pkey, unique)
 2. request_log.applicationid, request_log.accessed (combined)
 3. response_log.uuid (pkey, unique, foreign key)

Response time avg. : 240 ms

As visible from the result, indexing by applicationid (an int8) did help a little, while indexing by the timestampz accessed didn't help at all.
Maybe the bad performance is due to the JOIN?
Altogether, it seems quite slow and I try not to think what will happen when these tables contain millions of record (10M+).

What would be a better way to index these tables to make this query run faster?

EDIT:

Here is EXPLAIN ANALYZE for the last configuration:

Limit  (cost=3489.80..3490.69 rows=356 width=823) (actual time=241.152..241.345 rows=1000 loops=1)
  ->  Sort  (cost=3489.80..3490.69 rows=356 width=823) (actual time=241.150..241.288 rows=1000 loops=1)
        Sort Key: req.accessed DESC
        Sort Method: top-N heapsort  Memory: 2064kB
        ->  WindowAgg  (cost=1829.41..3474.71 rows=356 width=823) (actual time=230.040..237.993 rows=3951 loops=1)
              ->  Hash Join  (cost=1829.41..3470.26 rows=356 width=823) (actual time=8.622..17.974 rows=3951 loops=1)
                    Hash Cond: (res.uuid = req.uuid)
                    ->  Seq Scan on response_logs res  (cost=0.00..1604.21 rows=8821 width=758) (actual time=0.006..4.527 rows=9124 loops=1)
                          Filter: (status IS NOT NULL)
                    ->  Hash  (cost=1816.39..1816.39 rows=1042 width=102) (actual time=8.243..8.243 rows=4046 loops=1)
                          Buckets: 4096 (originally 2048)  Batches: 1 (originally 1)  Memory Usage: 1122kB
                          ->  Bitmap Heap Scan on request_logs req  (cost=105.85..1816.39 rows=1042 width=102) (actual time=0.581..6.449 rows=4046 loops=1)
Execution time: 242.154 ms

Best Answer

For starters, get your LEFT JOIN right and try to get rid of total rows (as discussed in comments):

Query

Updated to your comment:

I only want to returns rows from request_logs that have a parallel entry in response_logs with non nil status:

SELECT q.uuid
     , s.status
     , q.method
     , q.requesturi
     , q.accessed
     , q.payload reqpayload
     , s.payload respayload
--   , COUNT(*) OVER() AS total_rows  -- see below
FROM   request_logs q
JOIN   response_logs s USING (uuid)   -- not LEFT JOIN
WHERE  q.accountid = 2
AND    q.requesturi NOT ILIKE '/v1/sessions%'
AND    q.accessed BETWEEN '2018-01-01 15:04:05 +0' AND '2019-01-02 15:04:05+0'
AND    q.applicationid = 1
AND    s.status IS NOT NULL           -- see below
ORDER  BY q.accessed DESC
LIMIT  1001;                          -- see below

Your LEFT [OUTER] JOIN burns down to a plain [INNER] JOIN. Related:

COUNT(*) OVER() AS total_rows is expensive for counts substantially bigger than LIMIT (and you are expecting "10m+ rows"). Maybe it's good enough to use LIMIT 1001, only use the first 1000 rows, check the row count and if it's 1001 then you know there are "more than 1000 matching rows". Related:

Indexes

If (like you commented)

the only 'moving parts' in this query are the dates (from and to) that are compared with accessed

... and more than a few rows are excluded with the other predicates on applicationid, requesturi and accountid, then a partial index should help read performance (a lot):

CREATE INDEX ON request_logs (uuid, accessed DESC)
WHERE  accountid = 2
AND    requesturi NOT ILIKE '/v1/sessions%'
AND    applicationid = 1

And if a substantial percentage of rows is eliminated by status IS NOT NULL, also:

CREATE INDEX ON response_logs (uuid)
WHERE  status IS NOT NULL;

It may pay to append the columns status and payload as index columns if you can get index-only scans out of it. Some preconditions apply.