I am struggling to optimize the following query:
SELECT tbA.id, tbB.createDate
FROM
tbA
LEFT OUTER JOIN
(SELECT id, MAX(createDate) as createDate FROM tb GROUP BY id) as tbB
ON tbA.id=tbB.id
WHERE tbA.someDate >= '2021-05-11' and tbA.someDate <= '2021-05-12'
Both tables have a comparable number of rows and the condition does limit the result to 2000 rows (from approximately 1 mil). The problem is the inner select where the planner does evaluate the select (scan, sort & aggregate) before applying the condition. While the expected behavior would be at first to limit the rows by id from the join (as in MSSQL).
The above query is a simplified version of the real query. I am using a few views and the last/top view is used by the entity framework and contains the conditions.
EDIT 2
The query:
SELECT
d.*,
lrm."OriginalDealStatus" AS "MessageStatus"
FROM
"DealStore"."DealsData" d
LEFT OUTER JOIN
(SELECT "ReplyMessages"."ForeFrontID",
max("ReplyMessages"."CreatedDate") AS "LastCreatedDate",
'' as "OriginalDealStatus"
FROM "FeedProcessor"."ReplyMessages"
GROUP BY "ReplyMessages"."ForeFrontID"
) lrm
ON d."ForeFrontID" = lrm."ForeFrontID"
WHERE d."TradeDay" >= '2021-05-11' AND d."TradeDay" <= '2021-05-12'
The execution plan:
Nested Loop Left Join (cost=275949.07..299006.19 rows=622 width=327) (actual time=1421.176..1427.758 rows=2093 loops=1)
-> Merge Left Join (cost=275948.64..293934.08 rows=622 width=303) (actual time=1421.124..1422.709 rows=2093 loops=1)
Merge Cond: (d.ForeFrontID = ReplyMessages.ForeFrontID)
-> Sort (cost=171.93..173.49 rows=622 width=271) (actual time=2.703..2.894 rows=2093 loops=1)
Sort Key: d.ForeFrontID
Sort Method: quicksort Memory: 652kB
-> Index Scan using IDX_NC_Deals_TradeDay on Deals d (cost=0.43..143.07 rows=622 width=271) (actual time=0.016..1.534 rows=2093 loops=1)
Index Cond: ((TradeDay >= '2021-05-11'::date) AND (TradeDay <= '2021-05-12'::date))
-> GroupAggregate (cost=275776.71..285877.95 rows=630249 width=44) (actual time=1130.147..1384.325 rows=688123 loops=1)
Group Key: ReplyMessages.ForeFrontID
-> Sort (cost=275776.71..277676.08 rows=759750 width=4) (actual time=1130.133..1210.091 rows=726161 loops=1)
Sort Key: ReplyMessages.ForeFrontID
Sort Method: external merge Disk: 10296kB
-> Seq Scan on ReplyMessages (cost=0.00..191179.50 rows=759750 width=4) (actual time=0.068..784.905 rows=759705 loops=1)
-> Index Scan using pk_dealsxml_forefrontid on DealsXml dx (cost=0.43..8.12 rows=1 width=388) (actual time=0.002..0.002 rows=1 loops=2093)
Index Cond: (ForeFrontID = d.ForeFrontID)
Planning Time: 0.653 ms
Execution Time: 1429.023 ms
You may notice that there's an: Seq Scan on ReplyMessages & Sort & GroupAggregate
I would expect Postgres to evaluate the previous query much faster as in the bellow example (have added the LATERAL):
SELECT
d.*,
lrm."OriginalDealStatus" AS "MessageStatus"
FROM
"DealStore"."DealsData" d
LEFT OUTER JOIN LATERAL
(SELECT "ReplyMessages"."ForeFrontID",
max("ReplyMessages"."CreatedDate") AS "LastCreatedDate",
'' as "OriginalDealStatus"
FROM "FeedProcessor"."ReplyMessages"
WHERE d."ForeFrontID" = "ForeFrontID"
GROUP BY "ReplyMessages"."ForeFrontID"
) lrm
ON d."ForeFrontID" = lrm."ForeFrontID"
WHERE d."TradeDay" >= '2021-05-11' AND d."TradeDay" <= '2021-05-12'
and the execution plan:
Nested Loop Left Join (cost=1.28..10488.19 rows=622 width=327) (actual time=3.116..15.949 rows=2093 loops=1)
-> Nested Loop Left Join (cost=0.85..5191.86 rows=622 width=655) (actual time=0.035..6.775 rows=2093 loops=1)
-> Index Scan using IDX_NC_Deals_TradeDay on Deals d (cost=0.43..143.07 rows=622 width=271) (actual time=0.022..1.751 rows=2093 loops=1)
Index Cond: ((TradeDay >= '2021-05-11'::date) AND (TradeDay <= '2021-05-12'::date))
-> Index Scan using pk_dealsxml_forefrontid on DealsXml dx (cost=0.43..8.12 rows=1 width=388) (actual time=0.002..0.002 rows=1 loops=2093)
Index Cond: (ForeFrontID = d.ForeFrontID)
-> Subquery Scan on lrm (cost=0.42..8.47 rows=1 width=36) (actual time=0.004..0.004 rows=0 loops=2093)
Filter: (d.ForeFrontID = lrm.ForeFrontID)
-> GroupAggregate (cost=0.42..8.46 rows=1 width=44) (actual time=0.003..0.003 rows=0 loops=2093)
Group Key: ReplyMessages.ForeFrontID
-> Index Only Scan using idx_test2 on ReplyMessages (cost=0.42..8.44 rows=1 width=4) (actual time=0.003..0.003 rows=0 loops=2093)
Index Cond: (ForeFrontID = d.ForeFrontID)
Heap Fetches: 746
Planning Time: 0.637 ms
Execution Time: 16.139 ms
Best Answer
Surely something as simple as this will work adequately: