PostgreSQL Performance – Speeding Up GROUP BY HAVING COUNT Queries

Tags: count, index, performance, postgresql, postgresql-performance

I'm trying to speed up this query in Postgres 9.4:

SELECT "groupingsFrameHash", COUNT(*) AS nb
FROM "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
GROUP BY "groupingsFrameHash"
HAVING COUNT(*) > 1
ORDER BY nb DESC LIMIT 10

I have an index on "groupingsFrameHash". I don't need exact results; a fuzzy approximation is good enough.

Here is the query plan:

Limit  (cost=17207.03..17207.05 rows=10 width=25) (actual time=740.056..740.058 rows=10 loops=1)
  ->  Sort  (cost=17207.03..17318.19 rows=44463 width=25) (actual time=740.054..740.055 rows=10 loops=1)
        Sort Key: (count(*))
        Sort Method: top-N heapsort  Memory: 25kB
        ->  GroupAggregate  (cost=14725.95..16246.20 rows=44463 width=25) (actual time=615.109..734.740 rows=25977 loops=1)
              Group Key: "groupingsFrameHash"
              Filter: (count(*) > 1)
              Rows Removed by Filter: 24259
              ->  Sort  (cost=14725.95..14967.07 rows=96446 width=25) (actual time=615.093..705.507 rows=96026 loops=1)
                    Sort Key: "groupingsFrameHash"
                    Sort Method: external merge  Disk: 3280kB
                    ->  Seq Scan on "zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"  (cost=0.00..4431.46 rows=96446 width=25) (actual time=0.007..33.813 rows=96026 loops=1)
Planning time: 0.080 ms
Execution time: 740.877 ms

I don't understand why it needs to do a Seq Scan.

Best Answer

You want the 10 most common values for "groupingsFrameHash" with their respective counts (excluding unique values) - a common task. This specification caught my attention, though:

a fuzzy approximation is good enough

This allows for radically faster solutions. Postgres happens to store just those approximations in the system catalogs: the total row count in pg_class and the most common values in pg_statistic. The manual on the nature of these numbers:

Entries are created by ANALYZE and subsequently used by the query planner. Note that all the statistical data is inherently approximate, even assuming that it is up-to-date.

You have been warned.

Also consider the chapter Statistics Used by the Planner in the manual.

If you have set up autovacuum properly and the contents of your table do not change too much, these estimates should be good. In case you run this query immediately after substantial changes to the table (so autovacuum didn't get a chance to kick in), run ANALYZE first (or better, VACUUM ANALYZE if you can spare the time). You can also fine-tune the precision, but that's beyond the scope of this question.
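
A minimal sketch of those maintenance commands, using the table and column from the question (the statistics target of 1000 is only an illustrative value):

ANALYZE "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876";  -- refresh planner statistics

-- optional: check when the table was last analyzed (manually or by autovacuum)
SELECT last_analyze, last_autoanalyze
FROM   pg_stat_user_tables
WHERE  schemaname = 'public'
AND    relname    = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876';

-- optional: have ANALYZE collect more most-common values for this column
ALTER TABLE "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
   ALTER COLUMN "groupingsFrameHash" SET STATISTICS 1000;

After raising the statistics target, run ANALYZE again so the larger sample takes effect.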

There are security considerations. Quoting the manual again:

pg_statistic should not be readable by the public, since even statistical information about a table's contents might be considered sensitive. (Example: minimum and maximum values of a salary column might be quite interesting.) pg_stats is a publicly readable view on pg_statistic that only exposes information about those tables that are readable by the current user.

All this considered, you can get quick estimates:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_stats s
CROSS  JOIN LATERAL
   unnest(s.most_common_vals::text::text[]  -- use your actual data type
        , s.most_common_freqs) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
CROSS  JOIN (
   SELECT reltuples FROM pg_class
   WHERE oid = regclass 'public."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"'
   ) c
WHERE  schemaname = 'public'
AND    tablename  = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    attname    = 'groupingsFrameHash'  -- case sensitive
ORDER  BY v.ord
LIMIT  10;

There are a couple of noteworthy features in this query:

  • Provide all identifier strings unescaped and case sensitive.

  • unnest() for multiple arrays requires Postgres 9.4 or later.

  • pg_stats.most_common_vals is a special column with the data pseudo-type anyarray (not available in user tables). It can store arrays of any type. To decompose, cast to text and then to the array type of your column type. Assuming text[] in the example:

    s.most_common_vals::text::text[]
    

    Replace with your actual data type (see the adapted snippet below).

  • I added WITH ORDINALITY to unnest() (Postgres 9.4 or later) to preserve the original order of elements. Since the entries in the arrays are ordered by descending frequency, we can work with that sort order right away.

This takes around 1 ms or less - no matter how many rows there are in your table.
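
As noted in the list above, the cast has to match your column's actual type. If "groupingsFrameHash" were, say, bigint instead of text (purely a hypothetical assumption), the unnest() call would change to:

unnest(s.most_common_vals::text::bigint[]  -- hypothetical: column type bigint
     , s.most_common_freqs) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)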

Experimental optimizations

If you still need to squeeze out more performance and you have superuser access, you could use pg_statistic directly:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_attribute a 
JOIN   pg_class     c ON  c.oid = a.attrelid
JOIN   pg_statistic s ON  s.starelid = a.attrelid
                      AND s.staattnum = a.attnum
     , unnest(s.stavalues1::text::text[]
            , s.stanumbers1) WITH ORDINALITY v ("groupingsFrameHash", freq, ord)
WHERE  a.attrelid = regclass 'public."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"'
AND    a.attname  = 'groupingsFrameHash'
ORDER  BY v.ord
LIMIT  10;

As we are getting closer to the core of Postgres, you need to know what you are doing. We are relying on implementation details that may change across major Postgres versions (though that's unlikely). Read the details about pg_statistic in the manual and the comments in the source code.

To squeeze out the last drop, you could even hard-code the attribute number of your column (which changes if you change the position of the column in your table!) and rely on the order of rows returned by unnest(), which normally works:

SELECT v."groupingsFrameHash", (c.reltuples * freq)::int AS estimate_ct
FROM   pg_class     c
JOIN   pg_statistic s ON s.starelid = c.oid
     , unnest(s.stavalues1::text::text[], s.stanumbers1) v("groupingsFrameHash", freq)    
WHERE  c.oid = regclass 'public."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"'
AND    s.staattnum = int2 '6'  -- hard-coded pg_attribute.attnum
LIMIT  10;
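
To find the attribute number worth hard-coding, a quick catalog lookup works (a sketch; adjust schema, table and column names to your case):

SELECT attnum
FROM   pg_attribute
WHERE  attrelid = regclass 'public."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"'
AND    attname  = 'groupingsFrameHash';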

Get your own estimates

With the new TABLESAMPLE feature in Postgres 9.5 you can base your aggregates on a (more or less) random sample of the table:

SELECT birthday, 10 * count(*) AS estimate
FROM   big
TABLESAMPLE SYSTEM (10)
GROUP  BY 1
ORDER  BY estimate DESC
LIMIT  10;
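
Adapted to the table from the question it might look like this (a sketch only: SYSTEM (10) samples roughly 10 % of the table's blocks, hence the factor 10; block-level sampling makes the result approximate and somewhat clustered):

SELECT "groupingsFrameHash", 10 * count(*) AS estimate
FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
TABLESAMPLE SYSTEM (10)
GROUP  BY 1
ORDER  BY estimate DESC
LIMIT  10;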


Exact counts

If you need exact counts, the best query depends on data distribution and value frequencies. Emulating a loose index scan (as @Mihai commented) can very well improve performance - in a limited fashion, though (as @ypercube commented), since you need to consider all distinct values for your sort order. For relatively few distinct values the technique still pays off, but for your example with ~ 25k distinct values in a table of ~ 100k rows the chances are slim.
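
For completeness, a minimal sketch of such an emulated loose index scan for this case. It walks the index on "groupingsFrameHash" once to collect the distinct values, then counts each value via the same index; as said, this only pays off when there are few distinct values:

WITH RECURSIVE dv AS (      -- one row per distinct value, walking the index
   (SELECT "groupingsFrameHash" AS val
    FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
    ORDER  BY 1
    LIMIT  1)
   UNION ALL
   SELECT (SELECT "groupingsFrameHash"
           FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
           WHERE  "groupingsFrameHash" > dv.val
           ORDER  BY 1
           LIMIT  1)
   FROM   dv
   WHERE  dv.val IS NOT NULL
   )
SELECT *
FROM  (
   SELECT val AS "groupingsFrameHash"
        , (SELECT count(*)
           FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
           WHERE  "groupingsFrameHash" = dv.val) AS nb
   FROM   dv
   WHERE  val IS NOT NULL
   ) sub
WHERE  nb > 1
ORDER  BY nb DESC
LIMIT  10;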

But first you probably need to tune your cost settings. SET LOCAL enable_seqscan = off; is primarily meant for debugging problems; using it in your transactions is a measure of last resort. It may seem to fix your problem at hand, but it can bite you later.
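
If you use it at all, keep it confined to a single transaction for diagnosing plans, for example (a sketch only, not meant for production code):

BEGIN;
SET LOCAL enable_seqscan = off;  -- affects only this transaction
EXPLAIN ANALYZE
SELECT "groupingsFrameHash", count(*) AS nb
FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
GROUP  BY "groupingsFrameHash"
HAVING count(*) > 1
ORDER  BY nb DESC LIMIT 10;
ROLLBACK;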

Rather fix the underlying problem. My educated guess is that your setting for random_page_cost is unrealistically high. If most of your database (or at least most of the relevant parts) fit into available cache, the default setting of 4.0 is typically much too high. Depending on the complete picture it can be as low as 1.1 or even 1.0.
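
You can test a lower value in your session first and persist it once you are happy with the plans. A sketch (1.1 is just the illustrative figure from above; mydb is a placeholder for your database name):

SET random_page_cost = 1.1;                        -- test in the current session

ALTER DATABASE mydb SET random_page_cost = 1.1;    -- persist for new connections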

The fact that Postgres incorrectly estimates a sequential scan to be faster, while using the index is actually ten times faster, is a typical indicator of misconfiguration.