PostgreSQL – Memory Usage on SELECT for Large Number of Rows

Tags: memory, performance, postgresql, query-performance

I am trying to dump the entire contents of a large table from the command line using psql, but I am running into a problem where memory usage grows until the process is killed, before any data is even dumped.

What I don't understand is why the query doesn't start returning results immediately and complete without running out of memory.

Here is an explanation of exactly what I'm attempting:

I have a table, say:

CREATE TABLE big
(
  id integer,
  rand double precision
);

A large number of rows is inserted (50 million):

insert into big 
  select generate_series(1, 50000000) AS id, random();

The query plan for selecting every row looks like this (not surprisingly):

$ psql -d big -c "explain select * from big;"

                       QUERY PLAN                           
----------------------------------------------------------------
 Seq Scan on big  (cost=0.00..924326.24 rows=50000124 width=12)
(1 row)

I then attempt to dump the contents to file:

$ psql -d big -c "select * from big;" > big.dump

As I said above, this command fails before any data is written: it appears to consume an ever-increasing amount of memory until it is killed by the OS (the "OOM killer").

Note: I understand I could use pg_dump to accomplish something similar, but in reality, my query is more complex than this – specifically, I would like to encode each row as JSON when dumping.
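
To illustrate the kind of per-row JSON encoding I have in mind, a minimal sketch using row_to_json (available in 9.3) would be:

select row_to_json(b) from big b;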

Some configuration details:

  • postgresql version = 9.3.4
  • work_mem = 1MB
  • shared_buffers = 128MB
  • effective_cache_size = 128MB
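
These values can be confirmed from within psql, for example with a query against the pg_settings catalog view:

select name, setting, unit
from pg_settings
where name in ('work_mem', 'shared_buffers', 'effective_cache_size');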

Best Answer

By default, the results are entirely buffered in memory for two reasons:

1) Unless the -A option is used, output rows are aligned, so the output cannot start until psql knows the maximum width of each column. That implies visiting every row first, which takes significant time in addition to a lot of memory.

2) Unless FETCH_COUNT is set, psql submits the query with the synchronous PQexec function, which buffers the entire result set client-side. When FETCH_COUNT is set, psql instead uses a cursor-based method, issuing successive FETCH calls and freeing or reusing the client-side buffer every FETCH_COUNT rows.
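
With FETCH_COUNT set, the server-side traffic is roughly equivalent to the following (a sketch; the cursor name and batch size are illustrative):

BEGIN;
DECLARE _psql_cursor NO SCROLL CURSOR FOR
  select * from big;
FETCH FORWARD 10000 FROM _psql_cursor;  -- repeated until no rows remain
CLOSE _psql_cursor;
COMMIT;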

So a big result set should be fetched with a command like:

psql -A -t --variable="FETCH_COUNT=10000" \
     -c "select columns from bigtable" \
     > output-file

Reduce FETCH_COUNT if the rows are very large and it still uses too much memory.

The -t stands for --tuples-only, which suppresses the output of headers and footers.
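
Applied to the question's JSON use case, the pieces combine into something like this (a sketch, reusing the row_to_json form of the query):

psql -d big -A -t --variable="FETCH_COUNT=10000" \
     -c "select row_to_json(b) from big b" \
     > big.json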