I've following query which takes about 25-35 seconds to execute:
EXPLAIN (analyze, buffers, format text) SELECT booking.*
FROM booking
WHERE booking.reference_number = '9999999999' OR booking.booking_id = '9999999999' OR
(
EXISTS (
SELECT * FROM booking_customer
WHERE booking_customer.booking_id = booking.booking_id AND
(booking_customer.email = '9999999999' OR booking_customer.phone = '9999999999') AND
booking_customer.deleted = false
)
);
The explain output is here: https://explain.depesz.com/s/RPNV
As visible from the plan, there is sequence scan done on booking table, inspite of filtering on indexed columns – reference_number
and booking_id
. The booking_customer
table is properly using index scan though.
Is it something to do with EXISTS
or OR
clauses there?
My table structure is as below:
Table "public.booking"
Column | Type | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+---------
deleted | boolean | | |
booking_id | character varying | | not null |
reference_number | character varying | | |
booking_owner | character varying | | |
checkin_date | timestamp with time zone | | |
checkout_date | timestamp with time zone | | |
status | character varying | | |
hold_till | timestamp with time zone | | |
version | integer | | not null |
comments | text | | |
extra_information | json | | |
cancellation_reason | character varying | | |
cancellation_datetime | timestamp with time zone | | |
created_at | timestamp with time zone | | not null | now()
modified_at | timestamp with time zone | | not null | now()
Indexes:
"booking_pkey" PRIMARY KEY, btree (booking_id)
"ix_booking_reference_number" UNIQUE, btree (reference_number)
"idx_booking_checkin_date" btree (checkin_date)
"idx_booking_checkout_date" btree (checkout_date)
"ix_booking_status" btree (status)
"trgm_booking_ref_num" gist (reference_number gist_trgm_ops)
And customer table:
Table "public.booking_customer"
Column | Type | Collation | Nullable | Default
-----------------------+--------------------------+-----------+----------+---------
deleted | boolean | | |
customer_id | character varying | | not null |
booking_id | character varying | | not null |
first_name | character varying | | |
last_name | character varying | | |
gender | character varying | | |
age | integer | | |
nationality | character varying | | |
phone | character varying | | |
email | character varying | | |
gst_addr_field1 | character varying | | |
gst_addr_field2 | character varying | | |
gst_addr_city | character varying | | |
gst_addr_state | character varying | | |
gst_addr_country | character varying | | |
gst_pincode | character varying | | |
legal_name | character varying | | |
created_at | timestamp with time zone | | not null | now()
modified_at | timestamp with time zone | | not null | now()
Indexes:
"booking_customer_pkey" PRIMARY KEY, btree (customer_id, booking_id)
"book_cust_idx" btree (booking_id, customer_id)
"ix_booking_customer_email" btree (email)
"ix_booking_customer_phone" btree (phone)
"trgm_cust_last_name" gist (last_name gist_trgm_ops)
Details on total records in table are as follows:
db=> SELECT COUNT(*) FROM booking;
count
--------
958092
(1 row)
db=> SELECT COUNT(*) FROM booking_customer;
count
---------
2471445
(1 row)
db=> SELECT COUNT(*) FROM booking WHERE reference_number = '9999999999' OR booking_id = '9999999999';
count
-------
1
(1 row)
db=> SELECT COUNT(*) FROM booking_customer WHERE (email = '9999999999' OR phone = '9999999999') AND deleted = false;
count
--------
156377
(1 row)
db=> SELECT COUNT(DISTINCT(booking_id)) FROM booking_customer WHERE (email = '9999999999' OR phone = '9999999999') AND deleted = false;
count
-------
65196
(1 row)
So ideally, I should have gotten 65196
rows as result, which may be seems good reason to use sequence scan. But, when I try with a different value of phone or email, which results in roughly 1300 rows, still it uses seq scan on booking table.
I ran vacuum analyze on both tables yesterday midnight only.
Is there a way to optimise this query?
Postgres details:
db=> SELECT version();
version
-------------------------------------------------------------------------------
PostgreSQL 9.6.12 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.9.3, 64-bit
(1 row)
Update:
Just tried with UNION ALL
, and the performance is too good. As good, as it makes me wonder if the result is correct? Or is there some drawback. Because why would anyone do OR
condition in WHERE
clause ever?
Here's the new query:
EXPLAIN (analyze, buffers, format text)
SELECT booking.*
FROM booking
WHERE booking.deleted = false AND booking.reference_number = '9999999999'
UNION ALL
SELECT booking.*
FROM booking
WHERE booking.deleted = false AND booking.booking_id = '9999999999'
UNION ALL
SELECT booking.*
FROM booking
WHERE booking.deleted = false AND EXISTS (
SELECT * FROM booking_customer
WHERE booking_customer.booking_id = booking.booking_id AND
(booking_customer.email = '9999999999' OR booking_customer.phone = '9999999999') AND
booking_customer.deleted = false
);
New plan is here: https://explain.depesz.com/s/aXMV
Best Answer
The reason why the query is choosing a sequential scan on
booking
is because the estimated number of rows to be scanned is 478,187, and the entire table is 958,092. That means that more than half of the table will need to be scanned anyways--might as well just do a sequential scan, rather than pull 478K rows out of the index, then go to the disk to pull the actual rows out.Note also that the estimate is incorrect. It estimates 478,187 rows, but there are actually only 61,695 rows -- the estimate is 7.8x off. You may need to perform an
ANALYZE
on thebooking
table and try again.However, to focus on the Sequential Scan is sort of misleading, as the bulk of the runtime is in the loop over Index Scan on
book_cust_idx
958,047 times. Each scan of the index takes 0.025ms, but after looping through it nearly 1M times, the duration adds up. The reason why it loops 958,047 times is because for each row inbooking
, it needs to compare with the result ofEXISTS
.You may be better off just doing a
JOIN
, so that the two tables are merged, rather than looped upon:UPDATE: Using a
UNION ALL
works as well, since the individual reference_number and booking_id columns are indexed, effectively partitioning the result set for faster retrieval