PostgreSQL 9.3 – Slow GROUP BY on Indexed Column in Large Table

performance, postgresql, postgresql-9.3, query-performance

I have a simple table with 30 million rows that holds GPS data for various vehicles.

         Column         |            Type             | Modifiers | Storage  | Stats target | Description
------------------------+-----------------------------+-----------+----------+--------------+-------------
 id                     | bigint                      | not null  | plain    |              |
 equipmentid            | bigint                      |           | plain    |              |
Indexes:
    "some_pkey" PRIMARY KEY, btree (id)
    "some_idx" btree (equipmentid)

The table has more columns, but in my opinion they are not relevant to my question (if you think they are, let me know and I'll add them).

I'm trying to obtain the last GPS position for each piece of equipment, like so:

user=> explain (analyze, verbose, buffers) select max(gps2.id) from GpsData gps2 group by gps2.equipmentId;
                                                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.56..131266837.64 rows=298 width=16) (actual time=64679.945..637667.147 rows=307 loops=1)
   Output: max(id), equipmentid
   Buffers: shared hit=5662655 read=11225117
   ->  Index Scan using some_idx on public.gpsdata gps2  (cost=0.56..131099834.76 rows=33399980 width=16) (actual time=35.452..634355.329 rows=29176300 loops=1)
         Output: equipmentid, id
         Buffers: shared hit=5662655 read=11225117
 Total runtime: 637677.321 ms
(7 rows)

As you can see, this query takes more than 10 minutes to run, which is an awful lot. It also seems to me that this is one of the simplest use cases for a GROUP BY (i.e. I'd expect it to be much faster).

I'm trying to read the plan, but I don't know where to begin. It does say rows=29176300, which is a lot of rows, but over 10 minutes still seems excessive.

I should also note that I forced an index scan via SET enable_seqscan = off, hoping it would be faster, but I was wrong (the previous plan used a Seq Scan instead of an index scan and took around 4 minutes, so forcing the index actually made it slower).
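
For reference, the toggle looks like this (RESET simply puts the planner setting back to its default afterwards):

SET enable_seqscan = off;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
    SELECT max(gps2.id) FROM GpsData gps2 GROUP BY gps2.equipmentId;
RESET enable_seqscan;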

Any hints for me on where to start? My postgres version is PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit.

Edit: Some more info, as requested.

user=> select pg_total_relation_size('some_idx');
 pg_total_relation_size
------------------------
             1620033536
(1 row)
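
For readability, the same number via pg_size_pretty comes out at roughly 1.5 GB:

-- human-readable form of the index size; should report about 1545 MB
select pg_size_pretty(pg_total_relation_size('some_idx'));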

Best Answer

Looking at the number of buffers that index scan touched (5662655 hits + 11225117 reads is roughly 16.9 million 8 kB pages, i.e. on the order of 130 GB of page accesses), you are probably suffering from table and index bloat.

If you can afford to lock the table, run vacuum full analyze gpsdata to clean up old data.
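
If you want to confirm the bloat before taking that lock, the dead-tuple counters and the last (auto)vacuum timestamps in pg_stat_user_tables are a quick check. A sketch, using the table name from the question:

-- how many dead rows are waiting to be cleaned up, and when vacuum last ran
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'gpsdata';

-- if the numbers look bad and you can afford the exclusive lock:
VACUUM FULL ANALYZE gpsdata;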

There are several possible reasons why auto-vacuum did not take care of cleaning up properly and re-using deleted space:

  1. You have many sessions accessing that table that sit in the state "idle in transaction". Rows that are still visible to an open transaction can't be cleaned up. Make sure you properly commit all transactions - that includes SELECT statements, which also start a transaction.
    The easiest way to do that is to enable autocommit (the default in psql). See the query sketch after this list for a way to spot such sessions.
  2. You have such a high frequency of updates on that table that auto-vacuum can't keep up (or never gets a chance to clean up - see 1.)
  3. You have disabled auto-vacuum.
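
To check points 1 and 3, something along these lines works on 9.3:

-- 1. sessions stuck in "idle in transaction" keep old row versions alive
SELECT pid, usename, state, xact_start, query
FROM   pg_stat_activity
WHERE  state = 'idle in transaction'
ORDER  BY xact_start;

-- 3. verify that auto-vacuum is actually turned on
SHOW autovacuum;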

An index on gpsdata (equipmentid,id) might speed up the calculation.
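
A sketch of that index (the index name is made up; CONCURRENTLY avoids blocking writes while it builds, at the cost of a longer build):

CREATE INDEX CONCURRENTLY gpsdata_equipmentid_id_idx
    ON gpsdata (equipmentid, id);

With both columns in the index, the planner can potentially answer max(id) per equipmentid with an index-only scan instead of fetching every heap row.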