PostgreSQL 8.3: Slow GROUP BY on large table

group byindexperformancepostgresqlpostgresql-8.3query-performance

I have a table with about 10 million records. I want to do a simple group by, but it's using a sequential scan and is slow…

select run_id, count(*) from result group by run_id;

I have an index defined on the run_id column.

How can I speed this up?

Best Answer

So why does Postgres 9.2 still show a sequential scan? I quote the Postgres Wiki:

Is "count(*)" much faster now?

A traditional complaint made of PostgreSQL, generally when comparing it unfavourably with MySQL (at least when using the MyIsam storage engine, which doesn't use MVCC) has been "count(*) is slow". Index-only scans can be used to satisfy these queries without there being any predicate to limit the number of rows returned, and without forcing an index to be used by specifying that the tuples should be ordered by an indexed column. However, in practice that isn't particularly likely.

It is important to realise that the planner is concerned with minimising the total cost of the query. With databases, the cost of I/O typically dominates. For that reason, "count(*) without any predicate" queries will only use an index-only scan if the index is significantly smaller than its table. This typically only happens when the table's row width is much wider than some indexes'.

Emphasis mine.
There is hardly anything to gain from an index scan here, as long as your talbe isn't bloated with more (big) columns.

Related Solutions

Postgresql – Slow query on large table with GROUP BY and ORDER BY

The LEFT JOIN in @dezso's answer should be good. An index, however, will hardly be useful (per se), because the query has to read the whole table anyway - the exception being index-only scans in Postgres 9.2+ and favorable conditions, see below.

SELECT m.hash, m.string, count(m.method) AS method_ct
FROM   methods m
LEFT   JOIN nostring n USING (hash)
WHERE  n.hash IS NULL
GROUP  BY m.hash, m.string 
ORDER  BY count(m.method) DESC;

Run EXPLAIN ANALYZE on the query. Several times to exclude cashing effects and noise. Compare the best results.

Create a multi-column index that matches your query:

CREATE INDEX methods_cluster_idx ON methods (hash, string, method);

Wait? After I said an index wouldn't help? Well, we need it to CLUSTER the table:

CLUSTER methods USING methods_cluster_idx;
ANALYZE methods;

Rerun EXPLAIN ANALYZE. Any faster? It should be.

CLUSTER is a one-time operation to rewrite the whole table in the order of the used index. It is also effectively a VACUUM FULL. If you want to be sure, you'd run a pre-test with VACUUM FULL alone to see what can be attributed to that.

If your table sees a lot of write operations, the effect will degrade over time. Schedule CLUSTER at off-hours to restore the effect. Fine tuning depends of your exact use-case. The manual about CLUSTER.

CLUSTER is a rather crude tool, needs an exclusive lock on the table. If you can't afford that, consider pg_repack which can do the same without exclusive lock. More in this later answer:

Configuring PostgreSQL for read performance

If the percentage of NULL values in the column method is high (more than ~ 20 percent, depending on actual row sizes), a partial index should help:

CREATE INDEX methods_foo_idx ON methods (hash, string)
WHERE method IS NOT NULL;

_{(Your later update shows your columns to be NOT NULL, so not applicable.)}

If you are running PostgreSQL 9.2 or later (as @deszo commented) the presented indexes may be useful without CLUSTER if the planner can utilize index-only scans. Only applicable under favorable conditions: No write operations that would effect the visibility map since the last VACUUM and all columns in the query have to be covered by the index. Basically read-only tables can use this any time, while heavily written tables are limited. More details in the Postgres Wiki.

The above mentioned partial index could be even more useful in that case.

If, on the other hand, there are no NULL values in column method, you should
1.) define it NOT NULL and
2.) use count(*) instead of count(method), that's slightly faster and does the same in the absence of NULL values.

If you have to call this query often and the table is read-only, create a MATERIALIZED VIEW.

Exotic fine point: Your table is named nostring, yet seems to contain hashes. By excluding hashes instead of strings, there is a chance that you exclude more strings than intended. Extremely unlikely, but possible.

Postgresql – Efficient query to get greatest value per group from big table

Index

A plain multicolumn B-tree index should work after all:

CREATE INDEX foo_idx
ON geoposition_records (equipment_id, created_at DESC NULLS LAST);

Why DESC NULLS LAST?

Unused index in range of dates query

It's safe to assume you have an equipment table? Then performance won't be a problem:

Correlated subquery

Based on this equipment table, run a lowly correlated subquery to great effect:

SELECT equipment_id
     ,(SELECT created_at
       FROM   geoposition_records
       WHERE  equipment_id = eq.equipment_id
       ORDER  BY created_at DESC NULLS LAST
       LIMIT  1) AS latest
FROM   equipment eq;

For a small number of rows in the equipment table (57 judging from your EXPLAIN ANALYZE output), that's very fast.

`LATERAL` join in Postgres 9.3+

SELECT eq.equipment_id, r.latest
FROM   equipment eq
LEFT   JOIN LATERAL (
   SELECT created_at
   FROM   geoposition_records
   WHERE  equipment_id = eq.equipment_id
   ORDER  BY created_at DESC NULLS LAST
   LIMIT  1
   ) r(latest) ON true;

Detailed explanation:

Optimize GROUP BY query to retrieve latest record per user

Performance similar to the correlated subquery.

Function

If you can't talk sense into the query planner (which shouldn't occur), a function looping through the equipment table is certain to do the trick. Looking up one equipment_id at a time uses the index.

CREATE OR REPLACE FUNCTION f_latest_equip()
  RETURNS TABLE (equipment_id int, latest timestamp)
  LANGUAGE plpgsql STABLE AS
$func$
BEGIN
   FOR equipment_id IN
      SELECT e.equipment_id FROM equipment e ORDER BY 1
   LOOP
      SELECT g.created_at
      FROM   geoposition_records g
      WHERE  g.equipment_id = f_latest_equip.equipment_id
                           -- prepend function name to disambiguate
      ORDER  BY g.created_at DESC NULLS LAST
      LIMIT  1
      INTO   latest;

      RETURN NEXT;
   END LOOP;
END  
$func$;

Makes for a nice call, too:

SELECT * FROM f_latest_equip();

Performance comparison:

db<>fiddle here
_{OLD sqlfiddle}