Postgresql – How to disable escaping when using COPY FROM in PostgreSQL

csvpostgresqlpostgresql-9.5

I have a large tab delimited file that I want to read into a table in PostgreSQL 9.5. It contains double-quotes and backslashes that I want to treat as regular characters.

I think COPY FROM is the way to go, but I can't figure out how to disable escaping.

Here is a sample of the data (it's from Google's ngram dataset):

aX13_X  2006    8   5
aX13_X  2007    4   3
aX13_X  2008    2   1
a\  1852    1   1
a\  1935    1   1
a\  1937    2   2
ACT1V1T1ES  2003    15  11
ACT1V1T1ES  2004    63  6
ACT1V1T1ES  2005    1   1
ACT1V1T1ES  2006    5   4
ACT1V1T1ES  2008    4   3
ACTION="    1995    3   3
ACTION="    1996    6   5
ACTION="    1997    9   7
ACTION="    1998    19  11
ACTION="    1999    11  5

And the table:

CREATE TABLE onegram (
    id SERIAL,
    ngram character text,
    year integer NOT NULL,
    match_count integer NOT NULL,
    volume_count integer NOT NULL
);

If I try without modifiers, the backslash escapes the tab:

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv';
ERROR:  missing data for column "volume_count"
CONTEXT:  COPY onegram, line 4: "a\     1852    1       1"

So I switch to CSV, and then the double quote characters quote the tabs:

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t';
ERROR:  unterminated CSV quoted field
CONTEXT:  COPY onegram, line 17: "ACTION="      1999    11      5
"

Using CSV lets me use the DELIMITER keyword. It works if I pick a delimiter that doesn't occur in the sample (space, in this case):

ngram=# copy onegram (ngram, year, match_count, volume_count)
from '/home/tims/data/ngram/test.tsv' WITH CSV DELIMITER E'\t' QUOTE E' ';
COPY 16

But I want to be able to include any character (except for tab and newline). So, how can I disable QUOTE? Or what can I use instead of COPY FROM?

Edit: For slightly arbitrary reasons, ideally I'd like an option that does not involve pre-processing the data.

Best Answer

The copy command by default uses text format with tab delimiter. So only one thing you need is to escape backslashes:

copy onegram (ngram, year, match_count, volume_count)
from program 'sed ''s/\\/\\\\/g'' < /home/tims/data/ngram/test.tsv';
select * from onegram;
╔════╤════════════╤══════╤═════════════╤══════════════╗
║ id │   ngram    │ year │ match_count │ volume_count ║
╠════╪════════════╪══════╪═════════════╪══════════════╣
║  1 │ aX13_X     │ 2006 │           8 │            5 ║
║  2 │ aX13_X     │ 2007 │           4 │            3 ║
║  3 │ aX13_X     │ 2008 │           2 │            1 ║
║  4 │ a\         │ 1852 │           1 │            1 ║
║  5 │ a\         │ 1935 │           1 │            1 ║
║  6 │ a\         │ 1937 │           2 │            2 ║
║  7 │ ACT1V1T1ES │ 2003 │          15 │           11 ║
║  8 │ ACT1V1T1ES │ 2004 │          63 │            6 ║
║  9 │ ACT1V1T1ES │ 2005 │           1 │            1 ║
║ 10 │ ACT1V1T1ES │ 2006 │           5 │            4 ║
║ 11 │ ACT1V1T1ES │ 2008 │           4 │            3 ║
║ 12 │ ACTION="   │ 1995 │           3 │            3 ║
║ 13 │ ACTION="   │ 1996 │           6 │            5 ║
║ 14 │ ACTION="   │ 1997 │           9 │            7 ║
║ 15 │ ACTION="   │ 1998 │          19 │           11 ║
║ 16 │ ACTION="   │ 1999 │          11 │            5 ║
╚════╧════════════╧══════╧═════════════╧══════════════╝