PostgreSQL – Check the time PostgreSQL takes to update existing indexes during a bulk insert using the COPY command from Spark

apache-spark, postgresql, postgresql-performance

I am doing a bulk insert from Spark into a PostgreSQL table. The amount of data I am ingesting is huge: around 120-130 million records.

I first save the records as multiple CSV files on a distributed storage location, an S3 bucket in my use case. Then I run multiple COPY commands to load the data into the PostgreSQL table. The target table has four indexes on it.

The COPY commands take around 8 hours to load the data. I created a similar table without indexes, and there the data loaded in around 28-30 minutes. Based on searches I have done on multiple sites, indexes can slow down insert performance, and the time difference above clearly confirms that.

Now the actual question is how I can identify which index is taking the most time to update. Is there any utility, query, or command that shows the time spent maintaining each index on the table while we are doing bulk inserts?
I am using the query below to see how many of the COPY commands are running on the PostgreSQL instance:

SELECT *
FROM pg_stat_activity
WHERE usename = 'xyz'
  AND application_name = 'PostgreSQL JDBC Driver';
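As a variation on the query above, `pg_stat_activity` also exposes `query_start`, so you can see how long each COPY has been running (the `usename` and `application_name` filters are taken from the question; adjust them to your setup):

```sql
SELECT pid,
       now() - query_start AS runtime,   -- how long the current statement has run
       state,
       left(query, 60) AS query_text
FROM pg_stat_activity
WHERE usename = 'xyz'
  AND application_name = 'PostgreSQL JDBC Driver'
ORDER BY runtime DESC;
```

Note this shows the duration of the whole COPY statement; PostgreSQL does not break that time down per index.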

Is there something like this query, or any tool or command, that I can use to see how much time it is taking to update the indexes? Also, how much time is taken by each index?

Any idea, guidance or suggestion is welcome. I am not that familiar with PostgreSQL.

Best Answer

As long as all indexes are B-tree indexes, modifying them all should take roughly the same time.

Of course, an index with an expensive expression will take somewhat longer, as will an index on a string column with an expensive collation.

You could compare the duration of CREATE INDEX statements on a filled table – if an index takes twice as long to build as another one, it will probably also take about twice the time to modify.
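One simple way to run that comparison is psql's `\timing` mode, which prints the elapsed time after every statement (the table and index names below are made up for illustration):

```sql
-- In psql, turn on per-statement timing first:
-- \timing on
-- Then build each index separately on the already-filled table:
DROP INDEX IF EXISTS idx_orders_customer;
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- psql prints a line such as "Time: 183204.712 ms" after each statement,
-- so build times can be compared index by index.
```

On PostgreSQL 12 and later you can also watch a running build from another session via the `pg_stat_progress_create_index` view.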

If the amount of newly loaded data is large, it may be faster to drop and re-create the indexes. You'll have to experiment.
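A minimal sketch of that drop-and-recreate approach, again with illustrative names (the `CREATE INDEX` statements must of course match your real index definitions, which you can retrieve from `pg_indexes` beforehand):

```sql
-- Drop the indexes before the bulk load:
DROP INDEX idx_orders_customer;
DROP INDEX idx_orders_created_at;

-- Run the bulk load (one COPY per CSV file):
COPY orders FROM '/path/to/part-0001.csv' WITH (FORMAT csv);
-- ... remaining COPY commands ...

-- Rebuild the indexes in one pass each over the full table:
CREATE INDEX idx_orders_customer   ON orders (customer_id);
CREATE INDEX idx_orders_created_at ON orders (created_at);
```

Rebuilding after the load is often faster because each index is built with a single bulk sort instead of millions of individual insertions, and raising `maintenance_work_mem` can speed the builds up further.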