You'll want to create a directory without the t (sticky) permissions bit set, and have PostgreSQL copy the file into it. You should also make the directory setgid (g+s), so that new files inherit their group from the directory rather than from the creating process; see setgid directories.
You could put this directory in /tmp, but personally I'd put it in some neutral location like /var/lib or a mounted volume somewhere, since /tmp gets cleaned out automatically and you'd have to re-create the directory whenever it's missing. Assuming you do use /tmp:
mkdir -p /tmp/pgcsv
chgrp users /tmp/pgcsv
chmod u=rwX,g=rwsX,o= /tmp/pgcsv
then create the CSV in there. It'll automatically be created with group-owner users, which will make it readable by your users (unless PostgreSQL's umask is too restrictive). Because the directory is group-writable by users (or whatever group you use) and does not have the t (sticky) bit set, your users can delete files they do not own from it.
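For example, the server-side export could look something like this (a sketch only; my_table and the file name are placeholders, and the WITH (FORMAT csv, HEADER) option list needs PostgreSQL 9.0 or newer):

COPY my_table TO '/tmp/pgcsv/my_table.csv' WITH (FORMAT csv, HEADER);

PostgreSQL writes the file as the postgres user, and the setgid bit on /tmp/pgcsv makes the new file group-owned by users.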
The reason this doesn't work when using /tmp/ directly is that the sticky bit is set on /tmp for security, so that user1 can't delete a file created by user2 and replace it with a symlink that user2 then writes to, allowing user1 to steal their data or trick them into overwriting things.
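You can see the difference at a glance (illustrative; owner, size and date columns omitted, and your /tmp may of course differ):

ls -ld /tmp /tmp/pgcsv
drwxrwxrwt ... /tmp        (world-writable, sticky bit t set)
drwxrws--- ... /tmp/pgcsv  (group-writable, setgid s, no sticky bit)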
Other options:
- Run the cron job as user export, using psql's \copy command, so it writes the data over the client/server socket and the file is created as the export user in the first place. This is by far the simplest option, and only slightly slower than a direct COPY; see the sketch after this list.
- Have a cron job running as root invoke the PostgreSQL COPY, then move the data file and change its ownership once the COPY finishes.
- Give the export user the ability to run a very restricted command as root via sudo to move/delete.
- Use POSIX ACLs to grant the postgres user special permission to write to /home/export (if POSIX ACLs are enabled on your file system).
- Set the group-owner of /home/export to postgres and make its permissions g=wx. PostgreSQL can add new entries, and overwrite anything within it that it has write permission to, but it cannot list entries. I don't like this option much.
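A sketch of the first option (the database name, table name, and output path are placeholders; because the crontab belongs to the export user, the file ends up owned by export):

# m h dom mon dow  command  -- runs nightly at 02:00
0 2 * * * psql -d mydb -c "\copy my_table to '/home/export/my_table.csv' csv header"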
In the end I coded a Python function import_csv_to_dynamodb(table_name, csv_file_name, column_names, column_types) that imports a CSV into a DynamoDB table. Column names and column types must be specified. It uses boto, and takes a lot of inspiration from this gist. Below is the function, as well as a demo (main()) and the CSV file used. Tested on Windows 7 x64 with Python 2.7.5, but it should work on any OS that has boto and Python.
import boto

MY_ACCESS_KEY_ID = 'copy your access key ID here'
MY_SECRET_ACCESS_KEY = 'copy your secret access key here'


def do_batch_write(items, table_name, dynamodb_table, dynamodb_conn):
    '''
    From https://gist.github.com/griggheo/2698152#file-gistfile1-py-L31
    '''
    batch_list = dynamodb_conn.new_batch_write_list()
    batch_list.add_batch(dynamodb_table, puts=items)
    while True:
        response = dynamodb_conn.batch_write_item(batch_list)
        unprocessed = response.get('UnprocessedItems', None)
        if not unprocessed:
            break
        batch_list = dynamodb_conn.new_batch_write_list()
        unprocessed_list = unprocessed[table_name]
        items = []
        for u in unprocessed_list:
            item_attr = u['PutRequest']['Item']
            item = dynamodb_table.new_item(
                attrs=item_attr
            )
            items.append(item)
        batch_list.add_batch(dynamodb_table, puts=items)


def import_csv_to_dynamodb(table_name, csv_file_name, column_names, column_types):
    '''
    Import a CSV file to a DynamoDB table
    '''
    dynamodb_conn = boto.connect_dynamodb(aws_access_key_id=MY_ACCESS_KEY_ID, aws_secret_access_key=MY_SECRET_ACCESS_KEY)
    dynamodb_table = dynamodb_conn.get_table(table_name)
    BATCH_COUNT = 2  # 25 is the maximum batch size for Amazon DynamoDB

    items = []
    count = 0
    csv_file = open(csv_file_name, 'r')
    for cur_line in csv_file:
        count += 1
        cur_line = cur_line.strip().split(',')

        row = {}
        for column_number, column_name in enumerate(column_names):
            row[column_name] = column_types[column_number](cur_line[column_number])

        item = dynamodb_table.new_item(
            attrs=row
        )
        items.append(item)

        if count % BATCH_COUNT == 0:
            print 'batch write start ... ',
            do_batch_write(items, table_name, dynamodb_table, dynamodb_conn)
            items = []
            print 'batch done! (row number: ' + str(count) + ')'

    # flush remaining items, if any
    if len(items) > 0:
        do_batch_write(items, table_name, dynamodb_table, dynamodb_conn)

    csv_file.close()


def main():
    '''
    Demonstration of the use of import_csv_to_dynamodb()
    We assume the existence of a table named `test_persons`, with
    - Last_name as primary hash key (type: string)
    - First_name as primary range key (type: string)
    '''
    column_names = 'Last_name First_name'.split()
    table_name = 'test_persons'
    csv_file_name = 'test.csv'
    column_types = [str, str]
    import_csv_to_dynamodb(table_name, csv_file_name, column_names, column_types)


if __name__ == "__main__":
    main()
    # cProfile.run('main()')  # if you want to do some profiling
test.csv's content (must be located in the same folder as the Python script):
John,Doe
Bob,Smith
Alice,Lee
Foo,Bar
a,b
c,d
e,f
g,h
i,j
j,l
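If the test_persons table doesn't exist yet, a rough sketch of creating it with boto's (version 2) DynamoDB layer2 API might look like the following; the key names match the ones assumed in main()'s docstring, and the read/write capacity units are arbitrary placeholders:

import boto

# same placeholder credentials as in the import script above
conn = boto.connect_dynamodb(aws_access_key_id='copy your access key ID here',
                             aws_secret_access_key='copy your secret access key here')
# Last_name is the hash key, First_name the range key, both strings
schema = conn.create_schema(hash_key_name='Last_name', hash_key_proto_value=str,
                            range_key_name='First_name', range_key_proto_value=str)
conn.create_table(name='test_persons', schema=schema, read_units=5, write_units=5)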
Best Answer
Server-side files
Assuming the account has the pg_read_server_files role (or is superuser), if the files are accessible on a filesystem mounted on the server, and their paths have been collected in the table, their contents can be grabbed efficiently server-side (a sketch follows below). This is more efficient than using large objects as an intermediate storage area.
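A minimal sketch, assuming a hypothetical table files(path text, contents bytea) whose path column holds the server-side paths; pg_read_binary_file() reads each file into a bytea value:

UPDATE files
SET contents = pg_read_binary_file(path)
WHERE contents IS NULL;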
Client-side files
When the files are not accessible server-side or the account does not have elevated permissions, a more generic solution with psql can be run for each file.
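A sketch of what that could look like (the file path and the hypothetical target table files(path, contents) are placeholders; base64 -w 0 assumes GNU coreutils):

\set contents `base64 -w 0 /path/to/file.bin`
INSERT INTO files(path, contents)
  VALUES ('/path/to/file.bin', decode(:'contents', 'base64'));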
A base64 intermediate representation is used because psql doesn't support parameters in binary form. The :'contents' syntax instructs psql to inject the variable in text form into the query.