I have 120,000 CSV files on my AWS EC2 instance, each containing 120,000 rows. I can't insert them into my AWS RDS PostgreSQL DB as-is; that would be 120,000^2 = 14,400,000,000 records. Each CSV is about 2 MB.
My approach is:
- A Python script condenses the 120,000 records of each CSV into just 1 record (a list of dictionaries) with the help of the pandas library (about 5 MB per file after condensing)
- The script then inserts each CSV's single record into the AWS RDS PostgreSQL database via pandas.to_sql (which uses sqlalchemy)
- I use the Python multiprocessing module to fully utilize my AWS EC2 instance and speed up data insertion
- I did not create additional indexes in order to speed up my data insertion
- I use an AWS EC2 instance with up to 25 Gbps of network bandwidth
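The condense-and-insert step described above might look roughly like this. This is a sketch, not the asker's actual script: the `condensed` table name, the `payload` column, and the in-memory SQLite engine are placeholder assumptions (for RDS you would pass a `postgresql://...` connection string instead).

```python
# Sketch of condensing one CSV into a single row and inserting it.
# Table/column names are illustrative; SQLite stands in for RDS here.
import json

import pandas as pd
from sqlalchemy import create_engine


def condense_and_insert(csv_path, engine, table="condensed"):
    """Collapse one CSV into a single row whose payload is a JSON list of dicts."""
    df = pd.read_csv(csv_path)
    # 120,000 rows -> one record: a list of dictionaries, serialized as JSON.
    payload = json.dumps(df.to_dict(orient="records"))
    one_row = pd.DataFrame({"source_file": [str(csv_path)],
                            "payload": [payload]})
    one_row.to_sql(table, engine, if_exists="append", index=False)


# Against the real database the engine would be created like:
# engine = create_engine("postgresql://user:pass@host:5432/dbname")
```

A `multiprocessing.Pool` can then map `condense_and_insert` over the list of file paths, which matches the parallel setup described in the question.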
My observation is:
- At the beginning, my code inserted about 50 CSVs per minute, which is decent speed
- However, now that roughly 50k CSVs have been inserted, it only manages 1k CSVs in 5 hours, about 3.33 CSVs per minute
- I tried using psql's \copy and found it takes 30-50 seconds to insert one CSV, which is slower than my script that condenses the data and inserts it into the DB directly
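For reference, the kind of \copy invocation tested above might look like this. The connection string, table name, and file path are all hypothetical, and the leading `echo` makes it a dry run that just prints the command; drop the `echo` to execute it against a live database.

```shell
# Hypothetical connection string, table, and file path.
PGCONN="host=mydb.xxxxxx.us-east-1.rds.amazonaws.com dbname=mydb user=me"
CSV="/data/chunk_00001.csv"
# Dry run: remove the leading 'echo' to actually run it.
echo psql "$PGCONN" -c "\copy my_table FROM '$CSV' WITH (FORMAT csv, HEADER)"
```

Note that \copy streams the file from the client machine, whereas a server-side `COPY ... FROM` reads a file local to the database server; on RDS the server's filesystem is not accessible, so \copy (or `COPY FROM STDIN` over a driver) is the usual route.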
I am not sure how I can speed things up.
Best Answer
t2 instance classes are unsuitable for heavy workloads, by design. Once you have used up all your CPU credits, performance becomes abysmal. Again, by design.
Don't use t2 (or any t*) instances if you care about performance.
As far as I can tell, there is no way to monitor the CPU credits on RDS (the way you can on EC2)