I have table items
with following schema (in postgres v9.3.5):
Column | Type | Modifiers | Storage
-----------+--------+----------------------------------------------------+----------
id | bigint | not null default nextval('items_id_seq'::regclass) | plain
data | text | not null | extended
object_id | bigint | not null | plain
Indexes:
"items_pkey" PRIMARY KEY, btree (id)
"items_object_id_idx" btree (object_id)
Has OIDs: no
When I execute query it hangs for a very long time:
SELECT * FROM "items" WHERE "object_id" = '123' ORDER BY "id" DESC LIMIT 1;
After VACUUM ANALYZE the query execution improved a lot, but still not perfect.
# EXPLAIN ANALYZE SELECT * FROM "items" WHERE "object_id" = '123' ORDER BY "id" DESC LIMIT 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.44..1269.14 rows=1 width=63) (actual time=873796.061..873796.061 rows=0 loops=1)
-> Index Scan Backward using items_pkey on items (cost=0.44..1164670.11 rows=918 width=63) (actual time=873796.059..873796.059 rows=0 loops=1)
Filter: (object_id = 123::bigint)
Rows Removed by Filter: 27942522
Total runtime: 873796.113 ms
(5 rows)
The strange thing is that when I execute
SELECT * FROM "items" WHERE "object_id" = '123' LIMIT 1;
it returns 0 rows and I can do it in my code to optimize my web application's performance, but why it can be done by Postgres itself? I came to Postgres from MySQL and never seen such strange things there.
=====
I have found that it uses different query plan, different index, but why?
# EXPLAIN ANALYZE SELECT * FROM "items" WHERE "object_id" = '123' LIMIT 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..3.34 rows=1 width=63) (actual time=0.014..0.014 rows=0 loops=1)
-> Index Scan using items_object_id_operation_idx on items (cost=0.56..2579.16 rows=929 width=63) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (object_id = 123::bigint)
Total runtime: 0.029 ms
(4 rows)
Best Answer
Trying to explain why there is difference in performance between the two queries.
This one:
SELECT * FROM "items" WHERE "object_id" = '123' LIMIT 1
is satisfied by any one row with the matchingobject_id
, so the index onobject_id
is a natural choice. The query requires minimal I/O: index scan to find the first matching value plus one heap read to fetch the entire row.The alternative:
SELECT * FROM "items" WHERE "object_id" = '123' ORDER BY "id" DESC LIMIT 1
requires all rows with the matchingobject_id
be sorted by another column,id
, then the row with the maximum value ofid
be returned. If you were to use the index onobject_id
you would need to perform the following operations: scan the index to find every matchingobject_id
; for every match go fetch the actual row; then sort all fetched rows byid
and return the one with the largestid
.The alternative chosen by the optimizer, presumably based on the
object_id
histogram, is: scan the index onid
backwards, in its entirety; for each value go fetch the row and check if the value ofobject_id
matches; return the first matching row, which will have the maximum possibleid
value. This alternative avoids sorting the rows, so I guess the optimizer prefers it to using the index onobject_id
.The presence of an index on
(object_id asc, id desc)
allows for yet another alternative: scan this new index for the first entry matching the suppliedobject_id
value, which by definition will have the highestid
value; go fetch one matching row and return. Obviously, this is the most efficient approach.