PostgreSQL 9.3 – Slow GROUP BY on Indexed Column in Large Table

performance, postgresql, postgresql-9.3, query-performance

I have a simple table with 30 million rows that holds GPS data for various vehicles.

         Column         |            Type             | Modifiers | Storage  | Stats target | Description
------------------------+-----------------------------+-----------+----------+--------------+-------------
 id                     | bigint                      | not null  | plain    |              |
 equipmentid            | bigint                      |           | plain    |              |
Indexes:
    "some_pkey" PRIMARY KEY, btree (id)
    "some_idx" btree (equipmentid)

The table has more columns, but in my opinion they are not relevant to my question (if you think they are, let me know and I'll add them).

I'm trying to obtain the last GPS position for each piece of equipment, like so:

user=> explain (analyze, verbose, buffers) select max(gps2.id) from GpsData gps2 group by gps2.equipmentId;
                                                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.56..131266837.64 rows=298 width=16) (actual time=64679.945..637667.147 rows=307 loops=1)
   Output: max(id), equipmentid
   Buffers: shared hit=5662655 read=11225117
   ->  Index Scan using some_idx on public.gpsdata gps2  (cost=0.56..131099834.76 rows=33399980 width=16) (actual time=35.452..634355.329 rows=29176300 loops=1)
         Output: equipmentid, id
         Buffers: shared hit=5662655 read=11225117
 Total runtime: 637677.321 ms
(7 rows)

As you can see, this query takes more than 10 minutes to run, which is an awful lot. It also seems to me that this is one of the simplest use cases for a GROUP BY (i.e. I'd expect it to be much faster).

I'm trying to read the plan, but I don't know where to begin. It does say rows=29176300, which is a lot of rows, but over 10 minutes still seems excessive.

I should also note that I forced an index scan via SET enable_seqscan = off, hoping it would be faster, but I was wrong (the previous plan used a Seq Scan instead of an index scan and took around 4 minutes, so forcing the index actually made it slower).
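
For reference, the toggle looks like this (RESET simply puts the planner setting back to its default afterwards):

SET enable_seqscan = off;
EXPLAIN (ANALYZE, VERBOSE, BUFFERS)
    SELECT max(gps2.id) FROM GpsData gps2 GROUP BY gps2.equipmentId;
RESET enable_seqscan;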

Any hints for me on where to start? My postgres version is PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2, 64-bit.

Edit: Some more info, as requested.

user=> select pg_total_relation_size('some_idx');
 pg_total_relation_size
------------------------
             1620033536
(1 row)
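
For readability, the same number via pg_size_pretty comes out at roughly 1.5 GB:

-- human-readable form of the index size; should report about 1545 MB
select pg_size_pretty(pg_total_relation_size('some_idx'));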

Best Answer

Looking at the number of buffers that index scan touched (5662655 hits + 11225117 reads is roughly 16.9 million 8 kB pages, i.e. on the order of 130 GB of page accesses), you are probably suffering from table and index bloat.

If you can afford to lock the table, run vacuum full analyze gpsdata to clean up old data.
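
If you want to confirm the bloat before taking that lock, the dead-tuple counters and the last (auto)vacuum timestamps in pg_stat_user_tables are a quick check. A sketch, using the table name from the question:

-- how many dead rows are waiting to be cleaned up, and when vacuum last ran
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM   pg_stat_user_tables
WHERE  relname = 'gpsdata';

-- if the numbers look bad and you can afford the exclusive lock:
VACUUM FULL ANALYZE gpsdata;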

There are several possible reasons why auto-vacuum did not take care of cleaning up properly and re-using deleted space:

  1. You have many sessions accessing that table that sit in the state "idle in transaction". Rows that are still visible to an open transaction can't be cleaned up. Make sure you properly commit all transactions - that includes SELECT statements, which also start a transaction.
    The easiest way to do that is to enable autocommit (the default in psql). See the query sketch after this list for a way to spot such sessions.
  2. You have such a high frequency of updates on that table that auto-vacuum can't keep up (or never gets a chance to clean up - see 1.)
  3. You have disabled auto-vacuum.
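
To check points 1 and 3, something along these lines works on 9.3:

-- 1. sessions stuck in "idle in transaction" keep old row versions alive
SELECT pid, usename, state, xact_start, query
FROM   pg_stat_activity
WHERE  state = 'idle in transaction'
ORDER  BY xact_start;

-- 3. verify that auto-vacuum is actually turned on
SHOW autovacuum;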

An index on gpsdata (equipmentid,id) might speed up the calculation.
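
A sketch of that index (the index name is made up; CONCURRENTLY avoids blocking writes while it builds, at the cost of a longer build):

CREATE INDEX CONCURRENTLY gpsdata_equipmentid_id_idx
    ON gpsdata (equipmentid, id);

With both columns in the index, the planner can potentially answer max(id) per equipmentid with an index-only scan instead of fetching every heap row.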