I want to efficiently sort a table in PostgreSQL based off of two criteria. Here is my table structure:
CREATE TABLE shapes
(
shape_id character varying,
shape_pt_lat double precision,
shape_pt_lon double precision,
shape_pt_sequence integer
)
And there is an index on shape_id
and shape_pt_sequence
:
CREATE INDEX shapes_idx ON shapes (shape_id, shape_pt_sequence)
Here is the query I tried at first:
SELECT shape_pt_lat, shape_pt_lon
FROM shapes
ORDER BY shape_id, shape_pt_sequence
However, then I broke it up into two queries, because I thought that would prevent it from being sorted all at once. The first query gets the distinct shapes and then my program will make a second query for each shape_id
to sort just the shape records from a single shape_id
:
SELECT DISTINCT shape_id FROM shapes
then
SELECT shape_pt_lat, shape_pt_lon
FROM shapes
WHERE shape_id = ?
ORDER BY shape_pt_sequence
It turns out for smaller datasets (shape table records ~ 750k), this does cut the calculation time in half, but for larger datasets (shape table records ~ 4m) the performance was actually slower. I'm guessing maybe it was because for the larger dataset all the extra individual SQL requests may have slowed things down.
Here is the results of EXPLAIN on some queries:
Query: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM shapes ORDER BY shape_id, shape_pt_sequence
Result:
Index Scan using ielz_tahfnniqhzylvuhipcjjnt_shapes_idx on shapes (cost=0.42..29796.64 rows=473805 width=34) (actual time=0.039..214.900 rows=473805 loops=1)"
Buffers: shared hit=31022
Planning Time: 0.361 ms
Execution Time: 259.751 ms
Query: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM shapes WHERE shape_id = '117112' ORDER BY shape_pt_sequence
Result:
Sort (cost=534.92..535.31 rows=156 width=34) (actual time=0.163..0.173 rows=134 loops=1)
Sort Key: shape_pt_sequence
Sort Method: quicksort Memory: 35kB
Buffers: shared hit=5
-> Bitmap Heap Scan on shapes (cost=5.63..529.23 rows=156 width=34) (actual time=0.062..0.094 rows=134 loops=1)
Recheck Cond: ((shape_id)::text = '117112'::text)
Heap Blocks: exact=2
Buffers: shared hit=5
-> Bitmap Index Scan on ielz_tahfnniqhzylvuhipcjjnt_shapes_idx (cost=0.00..5.59 rows=156 width=0) (actual time=0.050..0.050 rows=134 loops=1)
Index Cond: ((shape_id)::text = '117112'::text)
Buffers: shared hit=3
Planning Time: 0.098 ms
Execution Time: 0.216 ms
Anyways, I'm wondering if I'm missing something here and if there is a better way to help instruct PostgreSQL to sort this more efficiently in a single query for example.
Best Answer
If I divide the time for query planning and execution by the number of rows returned, I find that the "big" query took 0.00054 milliseconds per result row, while the "small" one took 0.00234 milliseconds, more than four times as long.
If you add to that the (unknown) time it took to get the
shape_id
s in the first place, I cannot see how splitting up the query can lead to a performance advantage.Besides, you later have to combine all the small result sets and sort them by
shape_pt_sequence
, which will also take some time.The "big" query doesn't sort at all, it uses an index scan, and that is probably the most efficient way to go about it.