PostgreSQL – the best way to upload custom-parsed data into AWS Aurora PostgreSQL

aws, postgresql

I have a large (5-10 GB) binary file on AWS S3 that will require custom parsing, probably in Python. It is essentially a sequential set of millions of dataframes, all with the same structure. What is the best way to get this data into a serverless/hosted AWS Aurora PostgreSQL instance? So far I have thought of:
1. I could write to a CSV file and use COPY, but the size would be astronomical
2. I could send it over the wire in batches of rows (a rough sketch of this is shown after the list)
3. Use AWS Glue, though I'm still learning about it.
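
For illustration, a minimal sketch of option 2 using psycopg2 and batched INSERTs; the connection string, the measurements(ts, value) table, and the parse_dataframes() generator are hypothetical placeholders standing in for the real parser and schema:

    import psycopg2
    from psycopg2.extras import execute_values

    def parse_dataframes(path):
        # Placeholder for the custom binary parser: yield one tuple per row,
        # matching the columns of the target table.
        yield from []

    conn = psycopg2.connect("host=my-aurora-endpoint dbname=mydb user=me")
    with conn, conn.cursor() as cur:  # psycopg2 commits on clean exit from the block
        batch = []
        for row in parse_dataframes("downloaded_from_s3.bin"):
            batch.append(row)
            if len(batch) >= 10_000:
                execute_values(cur, "INSERT INTO measurements (ts, value) VALUES %s", batch)
                batch.clear()
        if batch:  # flush the final partial batch
            execute_values(cur, "INSERT INTO measurements (ts, value) VALUES %s", batch)
    conn.close()

Even batched, this pushes every row through INSERT statements, so COPY (option 1, and the answer below) is usually considerably faster at this volume.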

Best Answer

"I could write to a CSV file and use COPY, but the size would be astronomical"

You could write the CSV data stream to a pipe rather than a file:

generate_csv | psql -c '\copy tablename from stdin'

or

\copy tablename from program 'generate_csv'
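
For illustration, a minimal sketch of what such a generate_csv program could look like in Python, assuming a hypothetical parse_dataframes() generator that wraps the custom parser and yields one tuple per row; it streams CSV to stdout, so the full file is never written to disk:

    import csv
    import sys

    def parse_dataframes(path):
        # Placeholder for the custom binary parser: yield one tuple per row,
        # matching the columns of the target table.
        yield from []

    writer = csv.writer(sys.stdout)
    for row in parse_dataframes("downloaded_from_s3.bin"):
        writer.writerow(row)

It would then sit on the left-hand side of the pipe, e.g. python generate_csv.py | psql -c '\copy tablename from stdin with (format csv)'; the with (format csv) option tells COPY to expect CSV rather than its default tab-separated text format.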