In the end I coded a Python function import_csv_to_dynamodb(table_name, csv_file_name, colunm_names, column_types) that imports a CSV file into a DynamoDB table. Column names and column types must be specified. It uses boto and takes a lot of inspiration from this gist. Below is the function, a demo (main()), and the CSV file used. Tested on Windows 7 x64 with Python 2.7.5, but it should work on any OS that has boto and Python.
import boto

MY_ACCESS_KEY_ID = 'copy your access key ID here'
MY_SECRET_ACCESS_KEY = 'copy your secret access key here'

def do_batch_write(items, table_name, dynamodb_table, dynamodb_conn):
    '''
    From https://gist.github.com/griggheo/2698152#file-gistfile1-py-L31
    '''
    batch_list = dynamodb_conn.new_batch_write_list()
    batch_list.add_batch(dynamodb_table, puts=items)
    while True:
        response = dynamodb_conn.batch_write_item(batch_list)
        unprocessed = response.get('UnprocessedItems', None)
        if not unprocessed:
            break
        batch_list = dynamodb_conn.new_batch_write_list()
        unprocessed_list = unprocessed[table_name]
        items = []
        for u in unprocessed_list:
            item_attr = u['PutRequest']['Item']
            item = dynamodb_table.new_item(
                attrs=item_attr
            )
            items.append(item)
        batch_list.add_batch(dynamodb_table, puts=items)

def import_csv_to_dynamodb(table_name, csv_file_name, colunm_names, column_types):
    '''
    Import a CSV file to a DynamoDB table
    '''
    dynamodb_conn = boto.connect_dynamodb(aws_access_key_id=MY_ACCESS_KEY_ID, aws_secret_access_key=MY_SECRET_ACCESS_KEY)
    dynamodb_table = dynamodb_conn.get_table(table_name)
    BATCH_COUNT = 2 # 25 is the maximum batch size for Amazon DynamoDB

    items = []

    count = 0
    csv_file = open(csv_file_name, 'r')
    for cur_line in csv_file:
        count += 1
        cur_line = cur_line.strip().split(',')

        row = {}
        for colunm_number, colunm_name in enumerate(colunm_names):
            row[colunm_name] = column_types[colunm_number](cur_line[colunm_number])

        item = dynamodb_table.new_item(
            attrs=row
        )
        items.append(item)

        if count % BATCH_COUNT == 0:
            print 'batch write start ... ',
            do_batch_write(items, table_name, dynamodb_table, dynamodb_conn)
            items = []
            print 'batch done! (row number: ' + str(count) + ')'

    # flush remaining items, if any
    if len(items) > 0:
        do_batch_write(items, table_name, dynamodb_table, dynamodb_conn)

    csv_file.close()

def main():
    '''
    Demonstration of the use of import_csv_to_dynamodb()
    We assume the existence of a table named `test_persons`, with
    - Last_name as primary hash key (type: string)
    - First_name as primary range key (type: string)
    '''
    colunm_names = 'Last_name First_name'.split()
    table_name = 'test_persons'
    csv_file_name = 'test.csv'
    column_types = [str, str]
    import_csv_to_dynamodb(table_name, csv_file_name, colunm_names, column_types)


if __name__ == "__main__":
    main()
    #cProfile.run('main()') # if you want to do some profiling
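
The batching step can also be separated from boto entirely. The following is only a sketch: chunked() is a hypothetical helper, not part of boto or of the script above. It splits a list into batches of at most 25 items, the DynamoDB maximum mentioned in the code comment, so the last partial batch needs no separate flush step.

```python
MAX_BATCH_SIZE = 25  # maximum batch size for a DynamoDB batch write


def chunked(items, size=MAX_BATCH_SIZE):
    '''
    Yield successive lists of at most `size` items (hypothetical helper).
    '''
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 60 placeholder rows are split into batches of 25, 25 and 10
rows = list(range(60))
batch_sizes = [len(batch) for batch in chunked(rows)]
```

Each batch could then be handed to do_batch_write() instead of relying on the count % BATCH_COUNT check.
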
Content of test.csv (must be located in the same folder as the Python script):
John,Doe
Bob,Smith
Alice,Lee
Foo,Bar
a,b
c,d
e,f
g,h
i,j
j,l
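
Splitting each line on ',' works for test.csv, but it breaks on quoted fields that contain commas. Below is a minimal sketch, assuming the standard csv module is an acceptable replacement; parse_rows() is a hypothetical helper, not part of the original script, and it takes the same colunm_names and column_types arguments as import_csv_to_dynamodb().

```python
import csv


def parse_rows(lines, colunm_names, column_types):
    '''
    Turn CSV lines into a list of dicts, applying one type converter
    per column (hypothetical helper).
    '''
    rows = []
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        row = {}
        for colunm_number, colunm_name in enumerate(colunm_names):
            row[colunm_name] = column_types[colunm_number](fields[colunm_number])
        rows.append(row)
    return rows


# The second sample line has a quoted field with a comma inside it
sample = ['John,Doe', '"Smith, Jr.",Bob', '']
parsed = parse_rows(sample, ['Last_name', 'First_name'], [str, str])
```

An open file object can be passed as lines, since csv.reader accepts any iterable of strings.
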
Best Answer
Basics:
- The PROGRAM clause of COPY and GET DIAGNOSTICS after COPY require Postgres 9.3+.
- format() requires Postgres 9.1+.
- It relies on the head command that the shell is expected to provide. For Windows versions consider:

Full automation
This function copies any table structure completely dynamically:
Call variants:
Answer:
Before the main COPY, run a preliminary COPY ... TO tmp0 to fetch the first row with column names, which are expected to be unquoted, case-sensitive strings like COPY ... TO ... (FORMAT csv, HEADER) would export them.

The structure of the actual target table is derived from it, all columns with data type text. The default name of the resulting table is tmp1 - or provide your own as 2nd function parameter.

Then COPY is executed. The default delimiter is a tab character - or provide your delimiter as 3rd function parameter.

Use any single-byte character for the non-delimiter _nodelim which does not appear in the first line of your CSV file. I am arbitrarily picking the control character "Delete" (ASCII 127). That character would be swallowed here on SO, so I generate it with chr(127) instead, which is also valid. Assuming the character won't pop up - or provide your non-delimiter as 4th function parameter.

The function returns table name and number of imported rows.
Remember, a temporary table dies with the end of the session.
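
To illustrate how a table structure can be derived from the header row, here is a minimal sketch written in Python rather than in the database function described above. It only builds the statement text and does not execute anything. quote_ident() and create_table_sql() are hypothetical helper names; the temporary table, the tmp1 default name, the tab default delimiter and the all-text columns follow the description above.

```python
def quote_ident(name):
    '''
    Double-quote an SQL identifier, doubling any embedded double quotes.
    '''
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(header_line, table_name='tmp1', delimiter='\t'):
    '''
    Build a CREATE TEMP TABLE statement with one text column per header field.
    '''
    columns = header_line.rstrip('\r\n').split(delimiter)
    column_defs = ', '.join(quote_ident(col) + ' text' for col in columns)
    return 'CREATE TEMP TABLE ' + quote_ident(table_name) + ' (' + column_defs + ')'


# Header line as it would appear in a tab-delimited file
sql = create_table_sql('Last_name\tFirst_name\n')
```

Quoting every identifier keeps the column names case-sensitive and prevents a crafted header line from injecting SQL.
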
The manual:
Related answer on SO:
Postgres 8.4
That version is too old, I am not going to back-port that far.
- GET DIAGNOSTICS is an optional feature. You can just leave it out or replace it with a full count on the table.
- A primitive (expensive) alternative for the PROGRAM clause of COPY in pg 9.3 would be to import the complete table instead:
Or you prepare a second input file, or you can make it work by piping from the shell:
- COPY tablename FROM STDIN is available in pg 8.4.
- format() can be replaced with plain string concatenation. Be wary of SQL injection though!