PostgreSQL – Check the time PostgreSQL takes to update existing indexes during a bulk insert using the COPY command from Spark

apache-spark, postgresql, postgresql-performance

I am doing a bulk insert from Spark into a PostgreSQL table. The amount of data I am ingesting is huge: around 120-130 million records.

I first save the records as multiple CSV files on a distributed storage location, an S3 bucket in my use case. Then I run multiple COPY commands to load the data into the PostgreSQL table. The target table has four indexes on it.

The COPY commands take around 8 hours to load the data. I created a similar table without indexes, and there the data loaded in around 28-30 minutes. Based on searches I have done on multiple sites, indexes can slow down insert performance, and the time difference above clearly confirms that.

Now the actual question is how I can identify which index is taking the most time to update. Is there any utility, query, or command that shows the time spent maintaining each index on the table while we are doing bulk inserts?
I am using the query below to see how many of the COPY commands are running on the PostgreSQL instance:

SELECT *
FROM pg_stat_activity
WHERE usename = 'xyz'
  AND application_name = 'PostgreSQL JDBC Driver';
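As a variation on the query above, `pg_stat_activity` also exposes `query_start`, so you can see how long each COPY has been running (the `usename` and `application_name` filters are taken from the question; adjust them to your setup):

```sql
SELECT pid,
       now() - query_start AS runtime,   -- how long the current statement has run
       state,
       left(query, 60) AS query_text
FROM pg_stat_activity
WHERE usename = 'xyz'
  AND application_name = 'PostgreSQL JDBC Driver'
ORDER BY runtime DESC;
```

Note this shows the duration of the whole COPY statement; PostgreSQL does not break that time down per index.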

Is there something like this query, or any tool or command, that I can use to see how much time it is taking to update the indexes? Also, how much time is taken by each index?

Any idea, guidance or suggestion is welcome. I am not that familiar with PostgreSQL.

Best Answer

As long as all indexes are B-tree indexes, modifying them all should take roughly the same time.

Of course, an index with an expensive expression will take somewhat longer, as will an index on a string column with an expensive collation.

You could compare the duration of CREATE INDEX statements on a filled table – if an index takes twice as long to build as another one, it will probably also take about twice the time to modify.
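One simple way to run that comparison is psql's `\timing` mode, which prints the elapsed time after every statement (the table and index names below are made up for illustration):

```sql
-- In psql, turn on per-statement timing first:
-- \timing on
-- Then build each index separately on the already-filled table:
DROP INDEX IF EXISTS idx_orders_customer;
CREATE INDEX idx_orders_customer ON orders (customer_id);
-- psql prints a line such as "Time: 183204.712 ms" after each statement,
-- so build times can be compared index by index.
```

On PostgreSQL 12 and later you can also watch a running build from another session via the `pg_stat_progress_create_index` view.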

If the amount of newly loaded data is large, it may be faster to drop and re-create the indexes. You'll have to experiment.
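A minimal sketch of that drop-and-recreate approach, again with illustrative names (the `CREATE INDEX` statements must of course match your real index definitions, which you can retrieve from `pg_indexes` beforehand):

```sql
-- Drop the indexes before the bulk load:
DROP INDEX idx_orders_customer;
DROP INDEX idx_orders_created_at;

-- Run the bulk load (one COPY per CSV file):
COPY orders FROM '/path/to/part-0001.csv' WITH (FORMAT csv);
-- ... remaining COPY commands ...

-- Rebuild the indexes in one pass each over the full table:
CREATE INDEX idx_orders_customer   ON orders (customer_id);
CREATE INDEX idx_orders_created_at ON orders (created_at);
```

Rebuilding after the load is often faster because each index is built with a single bulk sort instead of millions of individual insertions, and raising `maintenance_work_mem` can speed the builds up further.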