Postgresql – Speeding up GROUPING SETS queries

performance, postgresql, postgresql-9.6, query-performance

I have a query that fetches a number of aggregations over a table. It's very similar to the following adapted example from the Postgres documentation, just with more entries in the grouping sets clause:

SELECT 
    brand, 
    size, 
    count(*) 
FROM items_sold 
GROUP BY GROUPING SETS (
    brand,
    size,
    ()
);
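Just to spell out the shape of the result: the grouping sets give one row per brand (with size NULL), one row per size (with brand NULL), and a single grand-total row. The values below are made up, purely for illustration:

--  brand | size | count
-- -------+------+-------
--  acme  |      |    42   <- grouping set (brand)
--  bozo  |      |    58
--        | S    |    37   <- grouping set (size)
--        | L    |    63
--        |      |   100   <- grouping set (), the grand total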

If there is a WHERE clause that restricts this query to a small subset of the table, it is pretty obvious how to speed that up with an index. But what options do I have for cases where I run this kind of query against the whole table, or a large enough fraction that an index on the WHERE clause condition isn't useful?
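For concreteness, the restricted case I mean is something like the following, where sold_at is just a made-up column for the sake of the example:

-- hypothetical: the WHERE-restricted variant is easy to serve with a plain index
CREATE INDEX ON items_sold (sold_at);

SELECT brand, size, count(*)
FROM items_sold
WHERE sold_at >= DATE '2018-01-01'
GROUP BY GROUPING SETS (brand, size, ());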

Now, I personally can't really think of a way of speeding this kind of query up; in the end it has to scan the entire table to perform these aggregations. But just because I can't think of any way doesn't mean there isn't one. From what I read, dedicated search engines like Elasticsearch or Solr can perform similar aggregations over very large amounts of data, and I assume they're reasonably fast at this.

What options do I have to speed up this kind of query that uses GROUP BY GROUPING SETS over an entire table?

Best Answer

Now, I personally can't really think of a way of speeding this kind of query up; in the end it has to scan the entire table to perform these aggregations. But just because I can't think of any way doesn't mean there isn't one.

That's basically right, and it isn't unique to GROUPING SETS. If you are scanning a lot of blocks, IO will likely dominate execution time, so you have only two strategies:

  1. Make IO faster
  2. Reduce IO

Regarding faster IO, you need to look at hardware or configuration changes. Parallel query was mentioned in the comments, but this could be faster or slower and isn't a magic bullet for IO.
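As a minimal sketch of what trying parallel query looks like, assuming PostgreSQL 9.6 or later and the items_sold table from the test data below (whether the planner actually picks a parallel plan, and whether it wins, depends on table size and hardware):

-- allow up to 4 parallel workers for this session, then re-check the plan
set max_parallel_workers_per_gather = 4;

explain (analyze, buffers)
select brand, size, count(*)
from items_sold
group by grouping sets (brand, size, ());

If the plan still shows a plain Seq Scan, the planner decided parallelism wasn't worth it for this table.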

Regarding reducing IO, note that your query does not necessarily need to look at the table at all — it could use an appropriate covering index instead:

Covering indexes are indexes created for the express purpose of being used in index-only scans. They typically "cover" more columns than would otherwise make sense for an index: columns that are known to be in the select list of a particular expensive, frequently executed query.

So an index including brand and size (in either order) is all that is needed.

test data:

-- 100,000 rows: 5 brands, sizes 1-6, and ~1kB of padding per row so the table is wide
create table items_sold(brand integer, size integer, other text);

insert into items_sold(brand, size, other)
select g%5, random() * 5 + 1, lpad('',1000)
from generate_series(1,100000) g;

vacuum analyze; -- normally autovacuum takes care of this; run it here so the visibility map is current for the index-only scan below

without covering index:

explain (analyze,buffers)
select brand, size, count(*) 
from items_sold 
group by grouping sets (brand,size,());

/*
|QUERY PLAN                                                                                                                   |
|:----------------------------------------------------------------------------------------------------------------------------|
|GroupAggregate  (cost=23590.82..33395.88 rows=24 width=16) (actual time=117.246..206.585 rows=12 loops=1)                    |
|  Group Key: brand                                                                                                           |
|  Group Key: ()                                                                                                              |
|  Sort Key: size                                                                                                             |
|    Group Key: size                                                                                                          |
|  Buffers: shared hit=885 read=13401, temp read=441 written=441                                                              |
|  ->  Sort  (cost=23590.82..23840.82 rows=100000 width=8) (actual time=108.069..129.823 rows=100000 loops=1)                 |
|        Sort Key: brand                                                                                                      |
|        Sort Method: external merge  Disk: 1752kB                                                                            |
|        Buffers: shared hit=885 read=13401, temp read=220 written=220                                                        |
|        ->  Seq Scan on items_sold  (cost=0.00..15286.00 rows=100000 width=8) (actual time=0.010..49.793 rows=100000 loops=1)|
|              Buffers: shared hit=885 read=13401                                                                             |
|Planning time: 0.034 ms                                                                                                      |
|Execution time: 207.241 ms                                                                                                   |
*/

with covering index:

create index on items_sold(brand,size);

explain (analyze,buffers)
select brand, size, count(*) 
from items_sold 
group by grouping sets (brand,size,());

/*
|QUERY PLAN                                                                                                                                                   |
|:------------------------------------------------------------------------------------------------------------------------------------------------------------|
|GroupAggregate  (cost=0.29..12159.35 rows=24 width=16) (actual time=10.478..95.557 rows=12 loops=1)                                                          |
|  Group Key: brand                                                                                                                                           |
|  Group Key: ()                                                                                                                                              |
|  Sort Key: size                                                                                                                                             |
|    Group Key: size                                                                                                                                          |
|  Buffers: shared hit=1 read=275, temp read=221 written=221                                                                                                  |
|  ->  Index Only Scan using items_sold_brand_size_idx on items_sold  (cost=0.29..2604.29 rows=100000 width=8) (actual time=0.023..21.911 rows=100000 loops=1)|
|        Heap Fetches: 0                                                                                                                                      |
|        Buffers: shared hit=1 read=275                                                                                                                       |
|Planning time: 0.108 ms                                                                                                                                      |
|Execution time: 95.953 ms                                                                                                                                    |
*/

dbfiddle here

Note the "Buffers" data — the covering index does much less IO here.
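One follow-up worth knowing: an index-only scan only skips the heap for pages marked all-visible in the visibility map, which is why the example runs vacuum first and the plan shows "Heap Fetches: 0". A quick way to check how much of a table is currently all-visible, using the standard pg_class columns:

select relpages, relallvisible
from pg_class
where relname = 'items_sold';
-- if relallvisible is far below relpages, an index-only scan will fall back
-- to heap fetches until (auto)vacuum catches up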