Postgresql – Postgres/pgloader: setting a delimiter for multiple line text records

Tags: postgresql, postgresql-9.3

I wish to import a 5 gigabyte text file with pgloader.

The individual records are paragraphs of HTML. The records include carriage returns and whitespace, so this is not an import where there is one record per line.

Reading the pgloader howto, I get the impression I can import these records by setting lines terminated by to a special character that marks the end of a record. The option is described as:

lines terminated by

Takes a single character as argument, which must be found inside single quotes, and might be given as the printable character itself, the special value \t to denote a tabulation character, or 0x then an hexadecimal value read as the ASCII code for the character.

This character is used to recognize end-of-line condition when reading the CSV data.

So, am I right? Can I import these records by setting lines terminated by to a special character?

UPDATE: I had to get the project done, so I used some Perl commands to modify the data into one record per line. It's an undesirable workaround because I wanted a multi-line text field.

But I would still like a way to do this. Something that allows a custom record delimiter would probably be the answer.
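For anyone in the same spot, the workaround can be sketched as follows (in Python rather than the Perl one-liners I actually used, and assuming records are separated by blank lines, which may not match your input): flatten each multi-line record onto one line by escaping embedded newlines, import, then reverse the escaping.

```python
def flatten_records(text):
    """Turn blank-line-separated multi-line records into one record per line."""
    records = [r for r in text.split("\n\n") if r.strip()]
    # Escape backslashes first, then newlines, so the transform is reversible.
    return "\n".join(
        r.replace("\\", "\\\\").replace("\n", "\\n") for r in records
    )

def unflatten_records(flat):
    """Reverse the escaping after import, restoring the multi-line text."""
    out = []
    for line in flat.split("\n"):
        restored = []
        i = 0
        while i < len(line):
            # Unescape \n and \\ in one left-to-right pass.
            if line[i] == "\\" and i + 1 < len(line):
                restored.append("\n" if line[i + 1] == "n" else line[i + 1])
                i += 2
            else:
                restored.append(line[i])
                i += 1
        out.append("".join(restored))
    return out
```

The escaping step matters: without it, a literal `\n` sequence already present in the HTML would be indistinguishable from an escaped newline on the way back.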

Best Answer

I don't believe you can use pgloader in this way if any of your HTML contains character sequences that would be interpreted in the context of CSV. For example, a backslash followed by digits is perfectly valid (and non-special) HTML, but would be transformed into something else by pgloader during the import. I can't even estimate how pgloader will handle the double-quote and single-quote symbols scattered around in your documents.

You would also need a byte that doesn't exist anywhere in your content to serve as a row delimiter; you could use some low-value control byte for this purpose assuming your HTML is valid and doesn't contain any of these special bytes. This naturally depends on your chosen character encoding, and is rendered worse if you have multiple encodings among your documents.
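Before committing to any candidate delimiter, it's worth scanning the raw bytes of the export to confirm the byte genuinely never occurs. A minimal check might look like this (a sketch; the path and candidate byte are placeholders for your own values):

```python
def byte_is_safe_delimiter(path, candidate=0x00):
    """Return True if the candidate delimiter byte never occurs in the file.

    Scans the file as raw bytes, so the check is independent of whatever
    character encoding(s) the documents use.
    """
    with open(path, "rb") as f:
        return bytes([candidate]) not in f.read()
```

For a 5 GB file you would want to read in chunks rather than all at once, but the idea is the same.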

Assuming everything is UTF-8, the NUL byte should not appear in any UTF-8 stream. Assuming you aren't concerned about special CSV character sequences, you could attempt to do this import using NUL as your separator value:

--with "lines terminated by '0x00'"
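Producing such a NUL-delimited file could be sketched like this (Python; hedged, since the real risk here is pgloader's CSV-layer escaping rather than the delimiter itself). Because NUL never occurs inside the valid UTF-8 encoding of any code point, splitting on it cannot break a record apart:

```python
def write_nul_delimited(records, path):
    """Write UTF-8 records, each terminated by a NUL byte."""
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec.encode("utf-8"))
            f.write(b"\x00")

def read_nul_delimited(path):
    """Recover the records by splitting on the NUL terminator."""
    with open(path, "rb") as f:
        chunks = f.read().split(b"\x00")
    return [c.decode("utf-8") for c in chunks if c]
```

Note that embedded newlines, quotes, and backslashes survive this round trip untouched; it is only once a CSV parser interprets the bytes that they become hazardous.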

I wouldn't recommend this avenue though for fear of subtle corruption of your documents (as noted above).