PostgreSQL – How to Copy Large Tables to Multiple CSV Files with Limited Rows

csv, postgresql

Is there a way, using an SQL function or an existing psql feature, to copy a very large table into multiple CSV files while specifying the maximum number of rows allowed in each output file?

For example, a table with 35 million rows would produce 35 CSV files of 1 million rows each.

Best Answer

If the records don't have embedded newlines in text fields, so that there is a strict one line = one record mapping, you can pipe the output of \copy ... csv in psql to the Unix split command. For instance:

\copy (your query here) TO program 'split --lines 1000000' (format csv)

See the options of split to change the naming of the output files or the destination directory. The same approach works server-side with COPY ... TO PROGRAM instead of \copy if you're a superuser.
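For example, split's options can produce numbered .csv files directly. A sketch using GNU coreutils, with seq standing in for the \copy output (the part_ prefix and 3-line chunks are arbitrary choices):

```shell
# Stand-in for the \copy output: 10 lines of generated data on stdin.
# With real data, this command goes after TO PROGRAM '...' instead.
seq 1 10 | split --lines 3 -d --additional-suffix=.csv - part_

# -d                   numeric suffixes: part_00.csv, part_01.csv, ...
# --additional-suffix  append .csv to each output file name
# -                    read from standard input
ls part_*.csv
# part_00.csv  part_01.csv  part_02.csv  part_03.csv
```

Ten input lines in chunks of three yield four files, the last holding the single leftover row.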

If the records may have embedded newlines, it's more complicated because with the above method, a record might span two consecutive files, making each file an invalid CSV file in isolation. For instance:

\copy (select E'1st line\n2nd line') TO program 'split --lines 1' (format csv)

would produce two files:

xaa, containing:

"1st line

and xab, containing:

2nd line"

If the goal is to concatenate the files back into a single file before processing, this doesn't matter; but if the files must be processed individually, a different method should be considered.
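A quick way to see this: recreating the two files from the example by hand (xaa and xab are split's default output names) and concatenating them restores one valid CSV record.

```shell
# Recreate the two halves of the quoted record from the example above.
printf '"1st line\n' > xaa
printf '2nd line"\n' > xab

# Neither file is valid CSV on its own (unbalanced quotes), but their
# concatenation is the original one-field, two-line record again.
cat xaa xab > whole.csv
cat whole.csv
# "1st line
# 2nd line"
```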

It's possible in psql, but a bit involved (as opposed to writing it in a programming language). Since PostgreSQL 12, csv is a native output format in psql, so you can open a cursor on the query and use FETCH 1000000 statements to do the actual cutting and retrieving. The skeleton of a script would look like this:

\pset format csv
\pset tuples_only on
begin;
declare c cursor for your-query-here;
fetch 1000000 from c \g file1.csv
fetch 1000000 from c \g file2.csv
-- ... as many times as necessary ...
close c;
end;

Because psql has no looping construct, and assuming you don't know in advance how many fetch steps are needed, you have to generate that piece of script in a preliminary step: compute the count(*) of the result set and emit (count(*) + NR - 1) / NR fetch NR from ... commands targeting numbered files as above, where NR is your number of records per file.
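A minimal sketch of that generation step in shell. The table name big_table, the fileN.csv naming, and the psql connection settings are assumptions; only the ceiling division and the repeated fetch commands come from the method above.

```shell
# gen_fetch_script COUNT NR
# Emit a psql script that exports COUNT rows in chunks of NR rows per file.
gen_fetch_script() {
  count=$1; nr=$2
  # Ceiling division: number of fetch commands (= output files) needed.
  files=$(( (count + nr - 1) / nr ))
  printf '%s\n' '\pset format csv' '\pset tuples_only on' 'begin;'
  printf '%s\n' 'declare c cursor for select * from big_table;'
  i=1
  while [ "$i" -le "$files" ]; do
    printf 'fetch %s from c \\g file%s.csv\n' "$nr" "$i"
    i=$((i + 1))
  done
  printf '%s\n' 'close c;' 'end;'
}

# Typical use (connection options omitted):
#   count=$(psql -At -c 'select count(*) from big_table')
#   gen_fetch_script "$count" 1000000 > export.sql
#   psql -f export.sql
```

For 35 million rows and NR = 1000000, the generated script contains 35 fetch commands, matching the 35 files in the question.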