Postgresql – Postgres query plan has high rowcount, slow execution

postgresql

I've identified this snippet of a query as running at incredibly slow speeds, although it's quite simple. It aims to find all the records in Table1 which don't have a more recent entry in Table2 with the same SessionID.

SELECT t1."VisitorID", t1."Timestamp"
FROM "Table1" t1
JOIN Table2 s ON t2."Timestamp" > t1."Timestamp"
            AND t2."SessionID" = t1."SessionID"
GROUP BY sgs."VisitorID", sgs."Timestamp"
HAVING COUNT(*) = 0

Where Table1 is a temporary table with 2640 rows and Table2 is permanent with 3756955 rows.

Explain analyze gives the following output:

HashAggregate  (cost=119823801.74..119823830.53 rows=2303 width=24) (actual time=5172.891..5172.891 rows=0 loops=1)
Group Key: t1."VisitorID", t1."Timestamp"
Filter: (count(*) = 0)
Rows Removed by Filter: 2228
->  Nested Loop  (cost=0.43..97379087.60 rows=2992628552 width=24) (actual time=0.036..3986.857 rows=2909911 loops=1)
        ->  Seq Scan on "Table1" t1  (cost=0.00..80.40 rows=2640 width=61) (actual time=0.007..1.588 rows=2640 loops=1)
        ->  Index Scan using "IX_Table2_Timestamp" on "Table2" t2  (cost=0.43..25191.47 rows=1169452 width=45) (actual time=0.036..1.120 rows=1102 loops=2640)
              Index Cond: ("Timestamp" >= t1."Timestamp")
              Filter: ((t1."SessionID")::text = ("SessionID")::text)
              Rows Removed by Filter: 71
"Planning time: 0.243 ms"
"Execution time: 5172.947 ms"

I'm guessing that the slow execution time is down to the high row count, estimating nearly 3bn in the loop with actual count close to 3m, and 1.1m in the index scan and finding 1102, but I can't find the reason for this. This is just a small dataset for testing, when I used it on the full dataset (340,000 rows in Table1), it filled 45GB of temp files before I cancelled it.

What I've tried so far:

  • VACUUM ANALYZE on all tables involved before that part of the query runs.
  • Indexing all columns involved.
  • Replacing the greater-than with equals on Timestamp
  • Removing the GROUP BY and HAVING COUNT(*) statements

These gave performance increases but the snippet was still far slower than the rest of the query and the row count was still just as high.

If it's relevant, I'm running PostgreSQL 9.4.1 on an Amazon RDS instance.

EDIT: Adding the statements for creating tables and indexes. I can't see anything out of the ordinary here.

CREATE TABLE Table2
(
  "VisitorID" uuid NOT NULL,
  "Timestamp" timestamp without time zone NOT NULL,
  "SessionID" character varying(50),
  CONSTRAINT "Table2_pkey" PRIMARY KEY ("VisitorID", "Timestamp")
)
WITH (
  OIDS=FALSE
);

CREATE INDEX "IX_Table2_SessionID"
  ON Table2
  USING btree
  ("SessionID" COLLATE pg_catalog."default");

CREATE INDEX "IX_Table2_Timestamp"
  ON Table2
  USING btree
  ("Timestamp");

CREATE INDEX "IX_Table2_VisitorID"
  ON Table2
  USING btree
  ("VisitorID");
CREATE TEMP TABLE Table1 ON COMMIT DROP AS
    SELECT 
        s."VisitorID", 
        s."Timestamp",
        s."SessionID", 
    FROM Table2 t2
  -- This date range is just to ensure a small dataset for testing
    WHERE s."Timestamp" > '2015-07-29 16:55:00' AND
             s."Timestamp" < '2015-07-29 17:00:00';

    CREATE INDEX "IX_Temp_Table1_SessionID"
    ON "Table1"
    USING BTREE ("SessionID");
    CREATE INDEX "IX_Temp_Table1_Timestamp"
    ON "Table1"
    USING BTREE ("Timestamp");

Best Answer

(Unsure whether this belongs as an answer, comment or edit so I'll leave it here unless I hear otherwise.)

I was unable to find the cause of the bad row count. However, after rewriting the query like this,

SELECT t1."VisitorID", t1."Timestamp"
FROM Table1 t1
WHERE NOT EXISTS (
    SELECT t2.VisitorID, t2.Timestamp
    FROM Table2 t2
    WHERE t2.SessionID = t1.SessionID
        AND t2.Timestamp > t1.Timestamp
)

the estimated row count went to a normal level and execution time decreased drastically, even on large datasets. I tested this with and without composite indexes like so:

CREATE INDEX "IX_Temp_Table1_Visitor-Session-Time"
ON "Table1"
USING BTREE ("VisitorID","SessionID","Timestamp");

CREATE INDEX "IX_Table2_Visitor-Session-Time"
ON "Table2"
USING BTREE ("VisitorID","SessionID","Timestamp");

and found no significant performance difference.