AWS PostgreSQL – Speed Up Insertion via Python Script

Tags: amazon-rds, aws, postgresql, python

I have 120,000 CSV files inside my AWS EC2 instance, each containing 120,000 rows. I can't insert them into my AWS RDS PostgreSQL DB as is, because that would be 120,000 × 120,000 = 14,400,000,000 records. Each CSV is about 2 MB.

My approach is:

  • A Python script condenses the 120,000 records of each CSV into a single record (a list of dictionaries) with the help of the pandas library; the condensed data is about 5 MB per file
  • The script then inserts each CSV's single record into the AWS RDS PostgreSQL database via pandas.to_sql (which uses SQLAlchemy); a simplified sketch is shown after this list
  • I use Python's multiprocessing module to fully utilize my AWS EC2 instance and speed up the insertion
  • I did not create any additional indexes, in order to keep insertion fast
  • My AWS EC2 instance has up to 25 Gbps of network bandwidth
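Roughly, the script looks like this (a simplified sketch; the connection string, paths, and table name are illustrative, not my real values):

```python
import glob
import json
from multiprocessing import Pool

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string and paths, not my real values.
DB_URI = "postgresql+psycopg2://user:password@my-rds-host:5432/mydb"
CSV_GLOB = "/data/csv/*.csv"

def condense_and_insert(csv_path):
    """Collapse one 120,000-row CSV into a single record and insert it."""
    df = pd.read_csv(csv_path)

    # Condense the whole file into one record: a list of dicts stored as JSON.
    condensed = pd.DataFrame(
        {
            "source_file": [csv_path],
            "payload": [json.dumps(df.to_dict(orient="records"))],
        }
    )

    # Create an engine per task; engines are not safely shared across processes.
    engine = create_engine(DB_URI)
    condensed.to_sql("condensed_data", engine, if_exists="append", index=False)
    engine.dispose()

if __name__ == "__main__":
    files = glob.glob(CSV_GLOB)
    with Pool(processes=8) as pool:  # roughly one worker per vCPU
        pool.map(condense_and_insert, files)
```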

My observation is:

  • At the beginning, my code inserted about 50 CSVs per minute, which is decent speed
  • However, now that roughly 50k CSVs have been inserted, it took 5 hours to insert the last 1k, about 3.33 CSVs per minute
  • I tried psql's \copy and found that it takes 30-50 seconds to load one CSV, which is slower than my script that condenses the data and inserts it directly (a minimal COPY-from-Python sketch is shown after this list)
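For anyone who wants to reproduce the \copy comparison from Python rather than the psql shell, something like this should be equivalent (a sketch using psycopg2's copy_expert; the host, table, and file names are placeholders):

```python
import psycopg2

# Illustrative connection parameters; not my real values.
conn = psycopg2.connect(
    host="my-rds-host", dbname="mydb", user="user", password="password"
)

with conn, conn.cursor() as cur, open("/data/csv/one_file.csv") as f:
    # Stream the file to the server over the connection, like psql's \copy.
    cur.copy_expert(
        "COPY raw_data FROM STDIN WITH (FORMAT csv, HEADER true)", f
    )
```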

I am not sure how I can speed things up.

Best Answer

t2 instance classes are unsuitable for high workloads, by design. Once you have used up all your CPU credits, performance is abysmal. Again, by design.

Don't use t2 (or any t*) instances if you care about performance.

As far as I can tell, there is no way to monitor the CPU credits on RDS (the way you can on EC2).
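On the EC2 side you can at least watch the credit balance through CloudWatch, for example with a quick boto3 query (a sketch; it assumes credentials and region are already configured, and the instance id is a placeholder):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Placeholder instance id; boto3 picks up credentials/region from the environment.
INSTANCE_ID = "i-0123456789abcdef0"

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUCreditBalance",
    Dimensions=[{"Name": "InstanceId", "Value": INSTANCE_ID}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=3),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)

# Print the credit balance over the last three hours, oldest first.
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```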