PostgreSQL Multiple Aggregations – Improving Select Query Performance

postgresqlpostgresql-12

I have a table with columns: id, antenna_id, latitude, longitude. There are two composite indexes on (antenna_id, latitude) and (antenna_id, longitude). When I do a max(latitude) for a specific antenna id(s), the speed is acceptable, but doing a min and max for both latitude and longitude at the same time is very slow.

Using PostgreSQL 12.3

Query


EXPLAIN (analyze, buffers, format text) 
SELECT max(latitude) 
FROM packets 
WHERE antenna_id IN (1,2)

Finalize Aggregate  (cost=443017.21..443017.22 rows=1 width=32) (actual time=4373.679..4373.679 rows=1 loops=1)
  Buffers: shared hit=10812 read=16887
  ->  Gather  (cost=443017.10..443017.21 rows=1 width=32) (actual time=4373.412..4389.032 rows=2 loops=1)
        Workers Planned: 1
        Workers Launched: 1
        Buffers: shared hit=10812 read=16887
        ->  Partial Aggregate  (cost=442017.10..442017.11 rows=1 width=32) (actual time=4313.576..4313.577 rows=1 loops=2)
              Buffers: shared hit=10809 read=16887
              ->  Parallel Index Only Scan using idx_packets_antenna_id_latitude on packets  (cost=0.57..433527.51 rows=3395835 width=7) (actual time=0.375..3435.488 rows=2201866 loops=2)
                    Index Cond: (antenna_id = ANY ('{1,2}'::integer[]))
                    Heap Fetches: 0
                    Buffers: shared hit=10809 read=16887
Planning Time: 5.992 ms
JIT:
  Functions: 8
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 6.236 ms, Inlining 0.000 ms, Optimization 1.549 ms, Emission 32.058 ms, Total 39.842 ms
Execution Time: 4706.406 ms

The explain looks almost identical for max(longitude), min(latitude) and min(longitude) on their own. Speed is acceptable.

But when I combine the queries

SELECT max(latitude), max(longitude), min(latitude), min(longitude) 
FROM packets 
WHERE antenna_id IN (1,2)

Duration

[2021-03-06 09:28:30] 1 row retrieved starting from 1 in 5 m 35 s 907 ms (execution: 5 m 35 s 869 ms, fetching: 38 ms)

Explain

Finalize Aggregate  (cost=3677020.18..3677020.19 rows=1 width=128)
  ->  Gather  (cost=3677020.06..3677020.17 rows=1 width=128)
        Workers Planned: 1
        ->  Partial Aggregate  (cost=3676020.06..3676020.07 rows=1 width=128)
              ->  Parallel Seq Scan on packets  (cost=0.00..3642080.76 rows=3393930 width=14)
                    Filter: (antenna_id = ANY ('{1,2}'::integer[]))
JIT:
  Functions: 7
  Options: Inlining true, Optimization true, Expressions true, Deforming true
EXPLAIN (analyze, buffers, format text) 
SELECT max(latitude), max(longitude), min(latitude), min(longitude) 
FROM packets 
WHERE antenna_id IN (1,2)

has been running for 24h now, and isn't done yet

Indexes

create index idx_packets_antenna_id_time
    on packets (antenna_id, time);

create index idx_packets_antenna_id_longitude
    on packets (antenna_id, longitude);

create index idx_packets_device_id_time
    on packets (device_id, time);

create index idx_packets_antenna_id_latitude
    on packets (antenna_id, latitude);

Data stats

select count(*) from packets
136758098

select count(distinct (antenna_id)) from packets
17558

select antenna_id, count(*) as records 
from packets 
where antenna_id in (1,2) 
group by antenna_id 
order by records desc

1,4361049
2,42683

Question

Why does the second query which does the min and max on the latitude and longitude field not use the indexes? And how can I rewrite the query so that it is faster?

Best Answer

Let's create some test data. Looks like your query has about 1% of rows per antenna_id so let's replicate this.

CREATE UNLOGGED TABLE foo( lat FLOAT NOT NULL, lon FLOAT NOT NULL, aid INTEGER NOT NULL );
INSERT INTO foo SELECT random(), random(), random()*100
    FROM generate_series(1,10000000) s;
CREATE INDEX foo_aid_lat ON foo( aid, lat );
CREATE INDEX foo_aid_lon ON foo( aid, lon );
VACUUM ANALYZE foo;

SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid IN (1,2);

 Finalize Aggregate  (cost=71572.35..71572.36 rows=1 width=32) (actual time=119.907..125.118 rows=1 loops=1)
   ->  Gather  (cost=71572.12..71572.33 rows=2 width=32) (actual time=119.648..125.110 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=70572.12..70572.13 rows=1 width=32) (actual time=95.595..95.596 rows=1 loops=3)
               ->  Parallel Bitmap Heap Scan on foo  (cost=4886.47..69687.39 rows=88473 width=16) (actual time=9.532..90.336 rows=66524 loops=3)
                     Recheck Cond: (aid = ANY ('{1,2}'::integer[]))
                     Heap Blocks: exact=26477
                     ->  Bitmap Index Scan on foo_aid_lon  (cost=0.00..4833.39 rows=212336 width=0) (actual time=20.022..20.023 rows=199572 loops=1)
                           Index Cond: (aid = ANY ('{1,2}'::integer[]))
 Planning Time: 0.499 ms
 Execution Time: 125.202 ms

This is really slow. Let's try for one antenna_id.

SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=1;
 Result  (cost=1.88..1.89 rows=1 width=32) (actual time=0.192..0.196 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.47 rows=1 width=8) (actual time=0.059..0.060 rows=1 loops=1)
           ->  Index Only Scan using foo_aid_lat on foo  (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.057..0.057 rows=1 loops=1)
                 Index Cond: ((aid = 1) AND (lat IS NOT NULL))
                 Heap Fetches: 0
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.43..0.47 rows=1 width=8) (actual time=0.044..0.045 rows=1 loops=1)
           ->  Index Only Scan Backward using foo_aid_lat on foo foo_1  (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.043..0.044 rows=1 loops=1)
                 Index Cond: ((aid = 1) AND (lat IS NOT NULL))
                 Heap Fetches: 0
   InitPlan 3 (returns $2)
     ->  Limit  (cost=0.43..0.47 rows=1 width=8) (actual time=0.038..0.038 rows=1 loops=1)
           ->  Index Only Scan using foo_aid_lon on foo foo_2  (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.037..0.037 rows=1 loops=1)
                 Index Cond: ((aid = 1) AND (lon IS NOT NULL))
                 Heap Fetches: 0
   InitPlan 4 (returns $3)
     ->  Limit  (cost=0.43..0.47 rows=1 width=8) (actual time=0.042..0.042 rows=1 loops=1)
           ->  Index Only Scan Backward using foo_aid_lon on foo foo_3  (cost=0.43..3777.80 rows=106668 width=8) (actual time=0.041..0.041 rows=1 loops=1)
                 Index Cond: ((aid = 1) AND (lon IS NOT NULL))
                 Heap Fetches: 0
 Planning Time: 0.504 ms
 Execution Time: 0.277 ms

That's the correct plan, that uses the multicolumn index to compute the max and min. This requires only 1 index lookup per min() or max(), because

SELECT max(lat) where aid=...

is equivalent to

SELECT lat WHERE aid=... ORDER BY lat DESC LIMIT 1

...which is optimizable using an index which contains the rows in pre-sorted order.

The above optimization of max() and min() is basically syntactic sugar, it turns the query into an ORDER BY+LIMIT and puts that into an InitPlan in order to use the index.

But, apparently, it doesn't do this when querying several antenna_ids using "WHERE IN()". Adding a "GROUP BY aid" at the end of the first query doesn't help.

So... Let's query the antenna_ids one at a time then.

SELECT * FROM 
(VALUES (1),(2)) AS v
CROSS JOIN LATERAL (SELECT min(lat),max(lat),min(lon),max(lon) FROM foo WHERE aid=v.column1) x;

It makes a nested loop over the VALUES, and inside the nested loop is the above fast query. It returns the max() and min() for each antenna_id, so to get the global max() and min() you have to wrap that in a subquery and apply max() and min() over the result.

This shouldn't take more than a millisecond unless there is another problem.

Replacing the VALUES above with a generate_series(1,100) that gets the max for the 100 aid's in the table takes about 5ms. Doing it the old fashioned way:

select aid,min(lat),max(lat),min(lon),max(lon) FROM foo group by aid;

takes about 100x longer.