PostgreSQL COPY Command – Mapping CSV Columns to Fields

copycsvpostgresql

Does the PostgreSQL COPY command have the option of choosing with fields to map the CSV columns to?

The PostgreSQL COPY command appears to expect that the target table matches its columns exactly. Am I missing something or is that how it actually works?

Is there some alternative command that enables that?

Best Answer

It is absolutely possible - the ever helpful documentation comes to the rescue, again:

COPY table_name [ ( column_name [, ...] ) ]
    FROM { 'filename' | PROGRAM 'command' | STDIN }
    [ [ WITH ] ( option [, ...] ) ]

Which means you can do something like this:

COPY my_table (mt_id, mt_name, mt_created_by, ...)
    FROM 'filename' [...]

What you cannot do is to refer columns of the CSV file. To overcome this, one can create an intermediate table with the matching number and type of columns, do the COPY into it, then do an INSERT ... SELECT ... to the final destination. Based on an important remark from Patrick7, the intermediate table can be defined as UNLOGGED, saving a lot of WAL overhead when the table is big.

Related Solutions

MySQL Import – Fixing CSV Import Errors with Empty Fields

Something like this, checking every value for '', and inserting NULL instead, may help. Change a,b,c,d for the actual name and number of rows on the table:

$ cat /tmp/test.csv 
test1, test2, test3, test4
,,,
1,2,3,4
,,,

mysql> LOAD DATA INFILE "/tmp/test.csv" INTO TABLE test.test  
          FIELDS TERMINATED BY ','  
          OPTIONALLY ENCLOSED BY '"' 
          LINES TERMINATED BY '\n' 
          IGNORE 1 ROWS 
       (@a, @b, @c, @d) 
       SET a = IF(@a = '', NULL, @a), 
           b = IF(@b = '', NULL, @b), 
           c = IF(@c = '', NULL, @c), 
           d = IF(@d = '', NULL, @d);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT * FROM test.test;
+------+------+------+------+
| a    | b    | c    | d    |
+------+------+------+------+
| NULL | NULL | NULL | NULL |
|    1 |    2 |    3 |    4 |
| NULL | NULL | NULL | NULL |
+------+------+------+------+
3 rows in set (0.00 sec)

Check the syntax for LOAD DATA for more details.

Postgresql – Copying CSV file to temp table with dynamic number of columns

Basics:

The 1st row of the CSV file has column names of defined format.
The PROGRAM clause of COPY and GET DIAGNOSTICS after COPY require Postgres 9.3+.
format() requires Postgres 9.1+
This works with pure standard Postgres - except for the head command that the shell is expected to provide. For Windows versions consider:
- How to do what head, tail, more, less, sed do in Powershell?

Full automation

This function copies any table structure completely dynamically:

CREATE OR REPLACE FUNCTION f_dynamic_copy(_file    text
                                        , _tbl     text = 'tmp1'
                                        , _delim   text = E'\t'
                                        , _nodelim text = chr(127)) -- see below!
  RETURNS text AS
$func$
DECLARE
   row_ct int;
BEGIN
   -- create staging table for 1st row as  single text column 
   CREATE TEMP TABLE tmp0(cols text) ON COMMIT DROP;

   -- fetch 1st row
   EXECUTE format($$COPY tmp0 FROM PROGRAM 'head -n1 %I' WITH (DELIMITER %L)$$  -- impossible delimiter
                , _file, _nodelim);

   -- create actual temp table with all columns text
   EXECUTE (
      SELECT format('CREATE TEMP TABLE %I(', _tbl)
          || string_agg(quote_ident(col) || ' text', ',')
          || ')'
      FROM  (SELECT cols FROM tmp0 LIMIT 1) t
           , unnest(string_to_array(t.cols, E'\t')) col
      );

   -- Import data
   EXECUTE format($$COPY %I FROM %L WITH (FORMAT csv, HEADER, NULL '\N', DELIMITER %L)$$
                , _tbl, _file, _delim);

   GET DIAGNOSTICS row_ct = ROW_COUNT;
   RETURN format('Created table %I with %s rows.', _tbl, row_ct);
END
$func$  LANGUAGE plpgsql;

Call variants:

SELECT f_dynamic_copy('/path/to/file.csv');
SELECT f_dynamic_copy('/path/to/file2.csv', 'tmp_file2');
SELECT f_dynamic_copy(_file  => '/path/to/file2.csv'
                    , _tbl   => 'tmp_file2');
                    , _delim => E'\t'); -- using assignment operator since pg 9.5

Answer:

Created table tmp_file2 with 123 rows.

Before the main COPY, run a preliminary COPY ... TO tmp0 to fetch the first row with column names, which are expected to be unquoted, case-sensitive strings like COPY ... TO ... (FORMAT csv, HEADER) would export them.

The structure of the actual target table is derived from it, all columns with data type text. The default name of the resulting table is tmp1 - or provide your own as 2nd function parameter.

Then COPY is executed. The default delimiter is a tab character - or provide your delimiter as 3rd function parameter.

Use any single-byte character for the non-delimiter _nodelim which does not appear in the first line of your CSV file. I am arbitrarily picking the control character "Delete" (ASCII 127). That character would be swallowed here on SO, so I generate with chr(127) instead, which is also valid. Assuming the character won't pop up - or provide your non-delimiter as 4th function parameter.

The function returns table name and number of imported rows.
Remember, a temporary table dies with the end of the session.

The manual:

Executing a command with PROGRAM might be restricted by the operating system's access control mechanisms, such as SELinux.

Postgres 8.4

That version is too old, I am not going to back-port that far.

GET DIAGNOSTICS is an optional feature. You can just leave it away or replace it with a full count on the table
A primitive (expensive) alternative for the PROGRAM clause of COPY in pg 9.3 would be to import the complete table instead:
```
EXECUTE format($$COPY tmp0 FROM %L WITH (DELIMITER %L)$$, _file, _delim);
```

Or you prepare a second input file, or you can make it work by piping from the shell: COPY tablename FROM STDIN is available in pg 8.4.

format() can be replaced with plain string concatenation. Be wary of SQL injection though!
- SQL injection in Postgres functions vs prepared queries