I'm trying to speed up this query in Postgres 9.4:
SELECT "groupingsFrameHash", COUNT(*) AS nb
FROM "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
GROUP BY "groupingsFrameHash"
HAVING COUNT(*) > 1
ORDER BY nb DESC LIMIT 10
I have an index on "groupingsFrameHash". I don't need exact results, a fuzzy approximation is good enough.
Here is the query plan:
Limit (cost=17207.03..17207.05 rows=10 width=25) (actual time=740.056..740.058 rows=10 loops=1)
-> Sort (cost=17207.03..17318.19 rows=44463 width=25) (actual time=740.054..740.055 rows=10 loops=1)
Sort Key: (count(*))
Sort Method: top-N heapsort Memory: 25kB
-> GroupAggregate (cost=14725.95..16246.20 rows=44463 width=25) (actual time=615.109..734.740 rows=25977 loops=1)
Group Key: "groupingsFrameHash"
Filter: (count(*) > 1)
Rows Removed by Filter: 24259
-> Sort (cost=14725.95..14967.07 rows=96446 width=25) (actual time=615.093..705.507 rows=96026 loops=1)
Sort Key: "groupingsFrameHash"
Sort Method: external merge Disk: 3280kB
-> Seq Scan on "zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876" (cost=0.00..4431.46 rows=96446 width=25) (actual time=0.007..33.813 rows=96026 loops=1)
Planning time: 0.080 ms
Execution time: 740.877 ms
I don't understand why it needs to do a Seq Scan.
Best Answer
You want the 10 most common values for "groupingsFrameHash" with their respective counts (excluding unique values) - a common task. This part of your specification caught my attention, though: "I don't need exact results, a fuzzy approximation is good enough."

This allows for radically faster solutions. Postgres happens to store just such approximations in the system catalogs: the total row count in pg_class and the most common values with their frequencies in pg_statistic. Per the manual, all of this statistical data is inherently approximate, even when it is up to date. You have been warned.

Also consider the chapter Statistics Used by the Planner in the manual.

If you have set up autovacuum properly and the contents of your table do not change too much, these estimates should be good. In case you run this query immediately after substantial changes to the table (so autovacuum didn't get a chance to kick in), run ANALYZE first (or better VACUUM ANALYZE, if you can spare the time). You can also fine-tune precision, but that's beyond the scope of this question.

There are security considerations: pg_statistic itself is not readable by the public; query the pg_stats view instead, which only exposes statistics for tables the current user is allowed to read.
All this considered, you can get quick estimates:
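Something along these lines should work. This is a sketch, not a drop-in solution: the text[] cast assumes "groupingsFrameHash" is a text column, and the alias nb just mirrors your original query; the other identifiers are taken from your question.

SELECT v."groupingsFrameHash"
     , (c.reltuples * v.freq)::int AS nb            -- estimated count
FROM   pg_stats s
     , unnest(s.most_common_vals::text::text[]      -- cast assumes a text column
            , s.most_common_freqs) WITH ORDINALITY AS v ("groupingsFrameHash", freq, ord)
     , (SELECT reltuples FROM pg_class
        WHERE  oid = 'public."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"'::regclass) c
WHERE  s.schemaname = 'public'
AND    s.tablename  = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    s.attname    = 'groupingsFrameHash'
ORDER  BY v.ord                                     -- MCVs are stored most frequent first
LIMIT  10;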
There are a couple of noteworthy features in this query:

- Provide all identifier strings unescaped and case sensitive: pg_stats stores them as plain values ('groupingsFrameHash', not '"groupingsFrameHash"').
- unnest() for multiple arrays requires Postgres 9.4 or later.
- pg_stats.most_common_vals is a special column with the data pseudo-type anyarray (not available in user tables). It can store arrays of any element type. To decompose it, cast to text and then to the array type matching your column - text[] in the example above. Replace that with your actual data type.
- I added WITH ORDINALITY to unnest() (Postgres 9.4 or later) to preserve the original order of elements. Since the values in the arrays are ordered by descending frequency, we can work with that sort order right away instead of sorting again.

This takes around 1 ms or less - no matter how many rows there are in your table.
Experimental optimizations
If you still need to squeeze out more performance and you have superuser access, you could use pg_statistic directly; see the first sketch below. As we are getting closer to the core of Postgres, you need to know what you are doing: we are relying on implementation details that may change across major Postgres versions (though that is unlikely). Read the details about pg_statistic in the manual and the comments in the source code.

To squeeze out the last drop, you could even hard-code the attribute number of your column (which changes if you change the position of the column in your table!) and rely on the order of rows returned by unnest(), which normally works; see the second sketch below.
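A sketch of the direct route (superuser only). It assumes the most-common-values list sits in statistics slot 1 (stakind1 = 1), which is the typical layout but an implementation detail, and again casts to text[]:

SELECT v."groupingsFrameHash"
     , (c.reltuples * v.freq)::int AS nb
FROM   pg_statistic s
JOIN   pg_class     c ON c.oid = s.starelid
JOIN   pg_namespace n ON n.oid = c.relnamespace
JOIN   pg_attribute a ON a.attrelid = s.starelid
                     AND a.attnum   = s.staattnum
     , unnest(s.stavalues1::text::text[]            -- cast assumes a text column
            , s.stanumbers1) WITH ORDINALITY AS v ("groupingsFrameHash", freq, ord)
WHERE  n.nspname  = 'public'
AND    c.relname  = 'zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876'
AND    a.attname  = 'groupingsFrameHash'
AND    s.stakind1 = 1                               -- 1 = most-common-values slot
ORDER  BY v.ord
LIMIT  10;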
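And the last-drop variant: a hard-coded attribute number and no explicit ordering, relying on unnest() returning array elements in order. The staattnum value below is a placeholder; look up the real attnum of "groupingsFrameHash" in pg_attribute first.

SELECT v."groupingsFrameHash"
     , (c.reltuples * v.freq)::int AS nb
FROM   pg_statistic s
JOIN   pg_class c ON c.oid = s.starelid
     , unnest(s.stavalues1::text::text[], s.stanumbers1) AS v ("groupingsFrameHash", freq)
WHERE  s.starelid  = 'public."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"'::regclass
AND    s.staattnum = 1                              -- placeholder attnum
LIMIT  10;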
Get your own estimates
With the new TABLESAMPLE feature in Postgres 9.5 you can base your aggregates on a (more or less) random sample of the table and extrapolate from that.
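A sketch, assuming a 1 % block sample and a matching extrapolation factor of 100; both numbers are mine to adjust, and SYSTEM sampling is block-based, so the sample is only roughly random:

SELECT "groupingsFrameHash"
     , (count(*) * 100)::int AS nb                  -- extrapolate from the 1 % sample
FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876" TABLESAMPLE SYSTEM (1)
GROUP  BY 1
HAVING count(*) > 1                                 -- more than 1 within the sample
ORDER  BY nb DESC
LIMIT  10;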
Exact counts
If you need exact counts, the best query depends on data distribution and value frequencies. Emulating a loose index scan (as @Mihai commented) can very well improve performance - only in a limited fashion, though (as @ypercube commented), since you need to consider all distinct values to get the sort order right. For relatively few distinct values the technique still pays off, but for your example with ~ 25k distinct values in a table of ~ 100k rows the chances are slim. The basic technique is sketched below.
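For reference, a sketch of the emulation: a recursive CTE walks the distinct values of "groupingsFrameHash" through the index, then a LATERAL subquery counts the rows per value. Expect it to pay off only with far fewer distinct values than in your table.

WITH RECURSIVE distinct_vals AS (
   (SELECT "groupingsFrameHash" AS val
    FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
    ORDER  BY 1
    LIMIT  1)
   UNION ALL
   SELECT (SELECT t."groupingsFrameHash"
           FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876" t
           WHERE  t."groupingsFrameHash" > d.val
           ORDER  BY 1
           LIMIT  1)
   FROM   distinct_vals d
   WHERE  d.val IS NOT NULL               -- stops when no greater value is found
   )
SELECT d.val AS "groupingsFrameHash", c.nb
FROM   distinct_vals d
     , LATERAL (
        SELECT count(*) AS nb
        FROM   "public"."zrac_c1e350bb-a7fc-4f6b-9f49-92dfd1873876"
        WHERE  "groupingsFrameHash" = d.val
       ) c
WHERE  d.val IS NOT NULL
AND    c.nb > 1
ORDER  BY c.nb DESC
LIMIT  10;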
But first you probably need to tune your cost settings. Using SET LOCAL enable_seqscan = off; is primarily meant for debugging problems; using it in your transaction is a measure of last resort. It may seem to fix your problem at hand, but it can bite you later.

Rather, fix the underlying problem. My educated guess is that your setting for random_page_cost is unrealistically high. If most of your database (or at least most of the relevant parts) fits into available cache, the default setting of 4.0 is typically much too high. Depending on the complete picture it can be as low as 1.1 or even 1.0. The fact that Postgres incorrectly estimates a sequential scan to be faster, while using the index is actually ten times faster, would be a typical indicator of such a misconfiguration.
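For example (the value 1.1 and the database name mydb are placeholders; test in your session first, then persist what works):

SET random_page_cost = 1.1;                         -- current session only, to test the effect
ALTER DATABASE mydb SET random_page_cost = 1.1;     -- persist per database once you are happy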