PostgreSQL Performance – Optimize for Many INSERTS and bytea Updates

bytea · database-tuning · insert · performance · performance-tuning · postgresql-9.3

What we have (software):

  • PostgreSQL 9.3 with the default configuration (no changes in postgresql.conf)
  • Windows 7 64 bit

Hardware:

  • Intel Core i7-3770 3.9 GHz
  • 32 GB RAM
  • WDC WD10EZRX-00L4HB ATA drive (1000 GB, SATA III)

So, we have to load approximately 100,000,000 rows with a bytea column into the DB, plus 500,000,000 simpler rows (without LOBs). There are 2 varchar indexes on the 1st table (lengths 13 and 19) and 2 varchar indexes on the 2nd table (lengths 18 and 10).
There is also a sequence for id generation for each table.

Currently these operations are running on 8 parallel connections with a JDBC batch size of 50. The picture below shows the system load: the postgresql processes are almost idle. After 24 hours of loading we have loaded only 10,000,000 rows, which is a very slow result.

[screenshot: Task Manager showing near-zero load on the postgresql processes]

We are asking for help in tuning the PostgreSQL configuration for two goals:

1) ultra-fast loading of this amount of data; it is a one-off operation, so a temporary configuration is fine;

2) a production mode performing a moderate number of SELECTs against these 2 tables via their indexes, without joins and without sorting.

Best Answer

For insert performance, see speeding up insert performance in PostgreSQL and bulk insert in PostgreSQL.

You're wasting your time with JDBC batching for INSERT. PgJDBC doesn't do anything useful with INSERT batches; it just runs each statement. (This is no longer true in newer PgJDBC versions, which can batch prepared statements to reduce round-trip times considerably.) But it's still better to:

Use COPY instead; see PgJDBC batch copy and the CopyManager API. As for the number of concurrent loaders: aim for a couple per disk if the operations are disk I/O bound. Eight is probably the most you'll want.
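As a minimal sketch (the table name, column names, and the surrounding connection handling are hypothetical, not from the answer above), rows can be serialized into PostgreSQL's tab-separated COPY text format and streamed through `CopyManager.copyIn`. The helper below shows the escaping that format requires:

```java
// Build rows in PostgreSQL's text COPY format. The resulting string can be
// streamed to the server with PgJDBC's CopyManager (table/columns hypothetical):
//
//   CopyManager cm = new CopyManager((BaseConnection) conn);
//   cm.copyIn("COPY my_table (k, v) FROM STDIN", new StringReader(data));
//
public class CopyFormat {

    // In COPY text format, NULL is written as \N, and backslash, tab,
    // newline and carriage return must be backslash-escaped.
    static String escape(String v) {
        if (v == null) return "\\N";
        return v.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r");
    }

    // One COPY line: tab-separated columns terminated by a newline.
    static String row(String... cols) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cols.length; i++) {
            if (i > 0) sb.append('\t');
            sb.append(escape(cols[i]));
        }
        return sb.append('\n').toString();
    }

    public static void main(String[] args) {
        // Two sample rows, including a NULL and an embedded tab/newline.
        String data = row("key1", "line1\nline2") + row("a\tb", null);
        System.out.print(data);
        // In a real loader, accumulate rows in chunks (tens of thousands)
        // and feed each chunk to CopyManager.copyIn via a StringReader.
    }
}
```

Accumulating rows client-side and sending them in large COPY chunks avoids the per-statement round trip that makes batched INSERTs slow.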

For your "production mode" I suggest loading a sample of the data, setting up the queries you expect to run, and using explain analyze to investigate performance. For testing purposes only, use the enable_* parameters to explore different plan selections. Set the query planner cost parameters (random_page_cost, seq_page_cost, effective_cache_size, etc.) appropriately for your system, and make sure shared_buffers is set appropriately. Continue to monitor as you add a simulated production workload, using the auto_explain module, the log_min_duration_statement setting, the pg_stat_statements extension, etc.
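As a rough illustration only, a postgresql.conf starting point for a 32 GB machine like this one might look like the sketch below. These values are assumptions to benchmark against your own workload, not prescriptions; settings marked for the one-off load should be reverted before production, and note that on Windows large shared_buffers values are commonly reported to help less than on Linux:

```ini
# postgresql.conf sketch — illustrative values for a 32 GB Windows box
shared_buffers = 512MB             # large values often help less on Windows; benchmark
effective_cache_size = 24GB        # planner hint: roughly OS cache + shared_buffers
maintenance_work_mem = 1GB         # speeds up index builds after the bulk load
checkpoint_segments = 64           # 9.3-era setting: fewer, larger checkpoints
checkpoint_completion_target = 0.9
synchronous_commit = off           # acceptable for the one-off load, not for production
```

Dropping or disabling the varchar indexes during the bulk load and rebuilding them afterwards (where maintenance_work_mem matters) is also a common technique for loads of this size.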

For details, see the PostgreSQL user manual. I suggest popping back here when you have a more concrete problem with explain analyze query execution details, etc.