Fastest Way to Extract Full Table in PostgreSQL

Tags: copy, dump, postgresql, postgresql-11

I'm trying to dump a table with 50M records to a file, and my goal is to reduce the time this takes. I usually use:

COPY metrics TO 'metrics.csv' DELIMITER ',' CSV;

This can take about an hour in the best case. I'm also interested in exporting the data in some plain format (i.e. avoiding pg_dump's directory format).

One idea is to access the table through a condition or cursor that splits it into equal-sized pieces, so that, for example, two COPY queries can run at the same time and cut the total time in half.

Example:

COPY (SELECT * FROM metrics WHERE id < 25000000) TO 'metrics_1.csv' DELIMITER ',' CSV;
COPY (SELECT * FROM metrics WHERE id >= 25000000) TO 'metrics_2.csv' DELIMITER ',' CSV;

Could partial indexes created on those conditions help?

Any ideas on a good way to implement these partial COPY dumps of the table? Is there any other solution to dump this table faster?

PostgreSQL 11 / 100 GB RAM / 20 cores.

After some parallelization with COPY, disk I/O does not seem to be the bottleneck.


Best Answer

Your current idea of using a range query on the primary key is probably the best bet. I don't see how partial indexes could help with that: you would need a series of partial indexes that in aggregate amount to a full index, and since B-tree indexes handle range queries anyway, that would be pointless. The success of this method is likely to depend on the rows of your table being physically stored in roughly primary-key order. You can enforce that ordering with the CLUSTER command, but CLUSTER is itself a very expensive operation.
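
As a quick check before reaching for CLUSTER, you can look at the planner statistics to see how closely the physical row order already tracks the primary key. A minimal sketch, assuming the key column is id and the primary-key index is named metrics_pkey (both assumptions):

-- How well the physical row order tracks id; 1.0 means perfectly ordered.
-- Requires reasonably fresh statistics (run ANALYZE metrics first if in doubt).
SELECT correlation
FROM pg_stats
WHERE tablename = 'metrics' AND attname = 'id';

-- Only if the correlation is poor: rewrite the table in index order.
-- Note that CLUSTER rewrites the whole table and takes an exclusive lock.
CLUSTER metrics USING metrics_pkey;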

If you want to get PostgreSQL to parallelize this for you, you would have to use a dummy query which selects everything:

COPY (SELECT * FROM metrics) TO 'metrics.csv' DELIMITER ',' CSV;

and you will probably also need to dramatically lower the parallel_tuple_cost setting, perhaps all the way to zero. But this is not likely to give you an actual improvement, because the bottleneck (if it is not disk I/O) will be converting your data from its internal binary format to the text format output by COPY. This conversion is always done in the leader process, not in the parallel workers.
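
For reference, a minimal session-level sketch of settings that make a parallel plan more attractive (the values are only examples for a 20-core machine, and lowering the costs merely encourages the planner to choose a parallel plan, it does not force one):

SET parallel_tuple_cost = 0;
SET parallel_setup_cost = 0;
SET max_parallel_workers_per_gather = 8;
COPY (SELECT * FROM metrics) TO 'metrics.csv' DELIMITER ',' CSV;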

You can kind of get around this by writing your query so that the heavy lifting is explicitly done in the query itself:

COPY (SELECT col1::text, col2::text, col3::text, ... FROM metrics) TO 'metrics.csv' DELIMITER ',' CSV;

Now the leader process still needs to scan all that text for things which need to be escaped/quoted, but at least it doesn't need to convert JSONB to text for all the rows by itself. In my hands, using 7 parallel workers (8 processes including the leader) cuts the time required for COPY...TO in half. The overhead then becomes the communication from the parallel workers to the leader process. The way to remove that is to go back to your original suggestion of indexed range queries in separate sessions.
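
If you go back to that approach, one way to pick the split points is to compute evenly spaced id boundaries up front and give each session one range. A rough sketch, assuming id is a reasonably dense integer primary key and a four-way split (both assumptions; the file names are illustrative too):

-- Evenly spaced id boundaries for 4 ranges (equal width in id, which only
-- approximates equal row counts if id values are roughly dense).
WITH bounds AS (SELECT min(id) AS lo, max(id) AS hi FROM metrics)
SELECT lo + (hi - lo) * s / 4 AS boundary
FROM bounds, generate_series(1, 3) AS s;

-- Then, in four separate sessions, run one range each, e.g.:
-- COPY (SELECT * FROM metrics WHERE id < <b1>) TO 'metrics_1.csv' DELIMITER ',' CSV;
-- COPY (SELECT * FROM metrics WHERE id >= <b1> AND id < <b2>) TO 'metrics_2.csv' DELIMITER ',' CSV;
-- COPY (SELECT * FROM metrics WHERE id >= <b2> AND id < <b3>) TO 'metrics_3.csv' DELIMITER ',' CSV;
-- COPY (SELECT * FROM metrics WHERE id >= <b3>) TO 'metrics_4.csv' DELIMITER ',' CSV;

where <b1>..<b3> stand in for the boundary values returned by the query above.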