Postgresql – Optimizing multi-criteria sort query

order-bypostgresqlquery-performance

I want to efficiently sort a table in PostgreSQL based off of two criteria. Here is my table structure:

CREATE TABLE shapes
(
    shape_id character varying,
    shape_pt_lat double precision,
    shape_pt_lon double precision,
    shape_pt_sequence integer
)

And there is an index on shape_id and shape_pt_sequence:

CREATE INDEX shapes_idx ON shapes (shape_id, shape_pt_sequence)

Here is the query I tried at first:

SELECT shape_pt_lat, shape_pt_lon 
FROM shapes 
ORDER BY shape_id, shape_pt_sequence

However, then I broke it up into two queries, because I thought that would prevent it from being sorted all at once. The first query gets the distinct shapes and then my program will make a second query for each shape_id to sort just the shape records from a single shape_id:

SELECT DISTINCT shape_id FROM shapes

then

SELECT shape_pt_lat, shape_pt_lon 
FROM shapes 
WHERE shape_id = ? 
ORDER BY shape_pt_sequence

It turns out for smaller datasets (shape table records ~ 750k), this does cut the calculation time in half, but for larger datasets (shape table records ~ 4m) the performance was actually slower. I'm guessing maybe it was because for the larger dataset all the extra individual SQL requests may have slowed things down.

Here is the results of EXPLAIN on some queries:

Query: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM shapes ORDER BY shape_id, shape_pt_sequence

Result:

Index Scan using ielz_tahfnniqhzylvuhipcjjnt_shapes_idx on shapes  (cost=0.42..29796.64 rows=473805 width=34) (actual time=0.039..214.900 rows=473805 loops=1)"
  Buffers: shared hit=31022
Planning Time: 0.361 ms
Execution Time: 259.751 ms

Query: EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM shapes WHERE shape_id = '117112' ORDER BY shape_pt_sequence

Result:

Sort  (cost=534.92..535.31 rows=156 width=34) (actual time=0.163..0.173 rows=134 loops=1)
  Sort Key: shape_pt_sequence
  Sort Method: quicksort  Memory: 35kB
  Buffers: shared hit=5
  ->  Bitmap Heap Scan on shapes  (cost=5.63..529.23 rows=156 width=34) (actual time=0.062..0.094 rows=134 loops=1)
        Recheck Cond: ((shape_id)::text = '117112'::text)
        Heap Blocks: exact=2
        Buffers: shared hit=5
        ->  Bitmap Index Scan on ielz_tahfnniqhzylvuhipcjjnt_shapes_idx  (cost=0.00..5.59 rows=156 width=0) (actual time=0.050..0.050 rows=134 loops=1)
              Index Cond: ((shape_id)::text = '117112'::text)
              Buffers: shared hit=3
Planning Time: 0.098 ms
Execution Time: 0.216 ms

Anyways, I'm wondering if I'm missing something here and if there is a better way to help instruct PostgreSQL to sort this more efficiently in a single query for example.

Best Answer

If I divide the time for query planning and execution by the number of rows returned, I find that the "big" query took 0.00054 milliseconds per result row, while the "small" one took 0.00234 milliseconds, more than four times as long.

If you add to that the (unknown) time it took to get the shape_ids in the first place, I cannot see how splitting up the query can lead to a performance advantage.

Besides, you later have to combine all the small result sets and sort them by shape_pt_sequence, which will also take some time.

The "big" query doesn't sort at all, it uses an index scan, and that is probably the most efficient way to go about it.