PostgreSQL Performance – Out of Memory on Count Group By Query

performance postgresql

I'm trying to execute the following statement in a PostgreSQL database:

create table tmp_nodes as select node_id,count(way_id) as count 
from way_nodes 
group by node_id;

This query fails because the machine runs out of memory. (Unfortunately, I can't find the corresponding error in the log files; however, I can see memory usage steadily increasing and the process terminating with no results.) This is probably because the result table is kept in memory and only written to disk after the count with group by has been executed. Note that the number of rows in way_nodes is about 2.5 billion:

EXPLAIN create table tmp_nodes as select node_id,count(way_id) as count 
from way_nodes 
group by node_id;
                                    QUERY PLAN                                   
--------------------------------------------------------------------------------
 HashAggregate  (cost=53797289.60..54287411.12 rows=49012152 width=16)
    ->  Seq Scan on way_nodes  (cost=0.00..41209846.40 rows=2517488640 width=16)
(2 rows)

I was wondering whether there is a good approach to get this query to execute. The options I have thought of and researched are:

  1. Is there a query parameter, or something similar, that forces the query to write its result to disk earlier, i.e. not keep everything in memory?
  2. I found cursors as a way to do the same thing as (1), but manually. However, I am not sure how to do that; a rough sketch of what I had in mind follows after the query plan below. (If you think this is a good approach, could you show me some example code, either a link or in your answer, please?)
  3. Thanks to TomH's response: setting 'set enable_hashagg = false;' gives the query plan shown below. Does this force the query to be executed in a pipelined fashion? (Is there any hint of that in the query plan?) If so, this solves the problem.

Query plan for (3):

explain create table tmp_nodes as select node_id,count(way_id) as count 
from way_nodes 
group by node_id;
                                      QUERY PLAN                                      
--------------------------------------------------------------------------------------
 GroupAggregate  (cost=477330853.69..496702140.01 rows=49012152 width=16)
   ->  Sort  (cost=477330853.69..483624575.29 rows=2517488640 width=16)
         Sort Key: node_id
         ->  Seq Scan on way_nodes  (cost=0.00..41209846.40 rows=2517488640 width=16)
(4 rows)
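
For reference, the cursor-based approach I had in mind for option (2) is sketched below. This is only a rough, untested sketch: the cursor name and the batch size are mine, and I am not sure whether fetching in batches actually reduces the memory used server-side by the aggregation itself.

begin;

declare node_count_cur no scroll cursor for
    select node_id, count(way_id) as count
    from way_nodes
    group by node_id;

-- fetch the result in batches instead of all at once; repeat the fetch
-- until it returns no rows, inserting each batch into tmp_nodes client-side
fetch forward 100000 from node_count_cur;

close node_count_cur;
commit;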

Best Answer

Well, you can use set enable_hashagg = false to disable the HashAggregate plan type, which should force it to choose an alternative plan.

You should probably also make sure you have analyzed the table, as it may be that a lack of statistics is stopping Postgres from realising it won't have enough memory for the result, which is going to be pretty massive.
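
Putting those two suggestions together, the session-level sequence would look something like this (just a sketch; enable_hashagg only needs to be off for this one statement, so it is reset afterwards):

analyze way_nodes;

set enable_hashagg = false;   -- affects the current session only

create table tmp_nodes as select node_id, count(way_id) as count
from way_nodes
group by node_id;

reset enable_hashagg;         -- back to the default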