PostgreSQL – How to Copy Large Tables to Multiple CSV Files with Limited Rows

csv, postgresql

Is there a way, using an SQL function or an existing psql feature, to copy a very large table into multiple CSV files while specifying the maximum number of rows allowed in each output file?

For example, a table with 35 million rows would produce 35 CSV files of 1 million rows each.

Best Answer

If the records don't have embedded newlines in text fields, so that there is a strict one line = one record mapping, you can pipe the output of \copy ... csv in psql to the Unix split command. For instance:

\copy (your query here) TO program 'split --lines 1000000' (format csv)

See the options of split to change the naming of the output files or the destination directory. The same approach works server-side with COPY ... TO PROGRAM instead of \copy if you're a superuser.
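For example, split's options can produce numbered .csv files directly. A sketch using GNU coreutils, with seq standing in for the \copy output (the part_ prefix and 3-line chunks are arbitrary choices):

```shell
# Stand-in for the \copy output: 10 lines of generated data on stdin.
# With real data, this command goes after TO PROGRAM '...' instead.
seq 1 10 | split --lines 3 -d --additional-suffix=.csv - part_

# -d                   numeric suffixes: part_00.csv, part_01.csv, ...
# --additional-suffix  append .csv to each output file name
# -                    read from standard input
ls part_*.csv
# part_00.csv  part_01.csv  part_02.csv  part_03.csv
```

Ten input lines in chunks of three yield four files, the last holding the single leftover row.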

If the records may have embedded newlines, it's more complicated because with the above method, a record might span two consecutive files, making each file an invalid CSV file in isolation. For instance:

\copy (select E'1st line\n2nd line') TO program 'split --lines 1' (format csv)

would produce two files:

xaa, containing:

"1st line

and xab, containing:

2nd line"

If the goal is to concatenate the files back into a single file before processing, this doesn't matter; but if the files must be processed individually, a different method should be considered.
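A quick way to see this: recreating the two files from the example by hand (xaa and xab are split's default output names) and concatenating them restores one valid CSV record.

```shell
# Recreate the two halves of the quoted record from the example above.
printf '"1st line\n' > xaa
printf '2nd line"\n' > xab

# Neither file is valid CSV on its own (unbalanced quotes), but their
# concatenation is the original one-field, two-line record again.
cat xaa xab > whole.csv
cat whole.csv
# "1st line
# 2nd line"
```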

It's possible in psql, but a bit involved (as opposed to writing it in a programming language). Since PostgreSQL 12, csv is a native output format in psql, so you can open a cursor on the query and use FETCH 1000000 statements to do the actual cutting and retrieving. The skeleton of a script would look like this:

\pset format csv
\pset tuples_only on
begin;
declare c cursor for your-query-here;
fetch 1000000 from c \g file1.csv
fetch 1000000 from c \g file2.csv
-- ... as many times as necessary ...
close c;
end;

Because psql has no looping construct, and assuming you don't know in advance how many fetch steps are needed, you have to generate that piece of script in a preliminary step: compute the count(*) of the result set and emit (count(*) + NR - 1) / NR fetch NR from ... commands targeting numbered files as above, where NR is your number of records per file.
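A minimal sketch of that generation step in shell. The table name big_table, the fileN.csv naming, and the psql connection settings are assumptions; only the ceiling division and the repeated fetch commands come from the method above.

```shell
# gen_fetch_script COUNT NR
# Emit a psql script that exports COUNT rows in chunks of NR rows per file.
gen_fetch_script() {
  count=$1; nr=$2
  # Ceiling division: number of fetch commands (= output files) needed.
  files=$(( (count + nr - 1) / nr ))
  printf '%s\n' '\pset format csv' '\pset tuples_only on' 'begin;'
  printf '%s\n' 'declare c cursor for select * from big_table;'
  i=1
  while [ "$i" -le "$files" ]; do
    printf 'fetch %s from c \\g file%s.csv\n' "$nr" "$i"
    i=$((i + 1))
  done
  printf '%s\n' 'close c;' 'end;'
}

# Typical use (connection options omitted):
#   count=$(psql -At -c 'select count(*) from big_table')
#   gen_fetch_script "$count" 1000000 > export.sql
#   psql -f export.sql
```

For 35 million rows and NR = 1000000, the generated script contains 35 fetch commands, matching the 35 files in the question.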