I know it's an old question, but this might still help searchers; it's a problem that pops up every now and then.
The main reason you are hitting a performance ceiling without seeing any resource bottleneck is that you've reached the limit of what can be processed within one session's single thread. The loop isn't processed in parallel; all inserts are done serially.
In my case, it takes 36 seconds to insert 3 million rows. That means 36 / 3,000,000 = 0.000012 seconds per row. That's pretty fast. On my system, it simply takes 0.000012 seconds to go through all the steps that are necessary for each row.
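For reference, this is roughly the shape of the single-session loop being timed. The table and column are illustrative stand-ins, not the original poster's schema:

SET NOCOUNT ON;  -- without this, a done-message per row causes network waits (see below)

DECLARE @i int = 0;
WHILE @i < 3000000
BEGIN
    INSERT INTO dbo.TestTable (SomeValue) VALUES (@i);  -- dbo.TestTable is made up
    SET @i += 1;
END;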
The only way to get it done faster is to start up a second session in parallel.
If I start 2 sessions in parallel, each doing 1.5 million inserts, both finish in 18 seconds. I could scale out more, but my current test setup hits 95% CPU with two parallel sessions, so adding a third would skew the results by introducing a CPU bottleneck.
If I start 2 parallel sessions each inserting 3 million rows, they both finish in 39 seconds. So that is now 6 million rows in 39 seconds.
Okay, that still leaves us with the NETWORK_IO wait showing up.
The NETWORK_IO waits are an artifact of using extended events to trace them. In my case the insert takes 36 seconds (on average). When using the extended events approach (from the link in the very first comment above), this is what gets registered:
Wait Type             Wait Count   Total Wait Time (ms)   Total Resource Wait Time (ms)   Total Signal Wait Time (ms)
NETWORK_IO                  3455                  68808                           68802                             6
PAGEIOLATCH_SH                 3                     64                              64                             0
PAGEIOLATCH_UP                12                     58                              58                             0
WRITE_COMPLETION               8                     15                              15                             0
WRITELOG                       3                      9                               9                             0
PAGELATCH_UP                   2                      4                               4                             0
SOS_SCHEDULER_YIELD        32277                      1                               0                             1
IO_COMPLETION                  8                      0                               0                             0
LATCH_SH                       3                      0                               0                             0
LOGBUFFER                      1                      0                               0                             0
You can see that 68 seconds of NETWORK_IO is registered. But since the insert loop is a single-threaded action that took 36 seconds, this can't be. (Yes, multiple threads are used, but the operations are serial, never in parallel, so you can't accumulate more wait time than the total duration of the query.)
If I don't use extended events but just the wait stats DMVs on a quiet instance (with just me running the insert), I get this:
Wait Type                         Wait Count   Total Wait Time (ms)   Total Resource Wait Time (ms)   Total Signal Wait Time (ms)
SOS_SCHEDULER_YIELD                     8873                   0.21                            0.01                          0.20
PAGEIOLATCH_UP                             3                   0.02                            0.02                          0.00
PREEMPTIVE_OS_AUTHENTICATIONOPS           17                   0.02                            0.02                          0.00
PAGEIOLATCH_SH                             1                   0.00                            0.00                          0.00
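For completeness, one way to pull such per-session numbers is the sys.dm_exec_session_wait_stats DMV (available from SQL Server 2016). This is just one possible way to collect them, not necessarily how the numbers above were gathered:

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
       signal_wait_time_ms
FROM   sys.dm_exec_session_wait_stats
WHERE  session_id = @@SPID              -- waits for the current session only
ORDER  BY wait_time_ms DESC;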
So the NETWORK_IO you were seeing in the extended events log wasn't related to your insert loop. (If you didn't turn NOCOUNT on, you would see massive async network IO waits; +1 Martin.)
However, I don't know why NETWORK_IO shows up in the extended events trace. Sure, writing the events out to an asynchronous file target accumulates ASYNC_NETWORK_IO, but surely this is all done on a different SPID than the one we are filtering on. (I might ask this as a new question myself.)
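For reference, a trace like the one discussed might be set up along these lines. This is only a sketch (session name, SPID and file path are made up), using the event_file target available from SQL Server 2012:

CREATE EVENT SESSION WaitsPerSession ON SERVER
ADD EVENT sqlos.wait_info
    (WHERE (sqlserver.session_id = 53))          -- 53 is an illustrative SPID
ADD TARGET package0.event_file
    (SET filename = N'C:\Temp\WaitsPerSession.xel');

ALTER EVENT SESSION WaitsPerSession ON SERVER STATE = START;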
Data alignment and storage size
Actually, the overhead per index tuple is 8 bytes for the tuple header plus 4 bytes for the item identifier.
We have three columns for the primary key:
PRIMARY KEY ("Timestamp" , "TimestampIndex" , "KeyTag")
"Timestamp" timestamp (8 bytes)
"TimestampIndex" smallint (2 bytes)
"KeyTag" integer (4 bytes)
Results in:
4 bytes for the item identifier in the page header (not counted toward the multiple of 8 bytes)
8 bytes for the index tuple header
8 bytes "Timestamp"
2 bytes "TimestampIndex"
2 bytes padding for data alignment
4 bytes "KeyTag"
0 padding to the nearest multiple of 8 bytes
-----
28 bytes per index tuple; plus some bytes of overhead.
Measuring object size is covered in a related answer.
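As a quick sketch, the standard size functions give you the relevant numbers. The index name below assumes the default name PostgreSQL generates for the primary key:

SELECT pg_size_pretty(pg_relation_size('"AnalogTransition"'))       AS table_size
     , pg_size_pretty(pg_relation_size('"AnalogTransition_pkey"'))  AS pkey_size
     , pg_size_pretty(pg_total_relation_size('"AnalogTransition"')) AS total_size;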
Order of columns in a multicolumn index
Two related questions and answers explain the reasoning behind this in more detail.
The way you have your index (primary key), you can retrieve rows without a sorting step. That's appealing, especially with LIMIT. But retrieving the rows seems extremely expensive.
Generally, in a multicolumn index, "equality" columns should go first and "range" columns last.
Therefore, try an additional index with reversed column order:
CREATE INDEX analogtransition_mult_idx1
ON "AnalogTransition" ("KeyTag", "TimestampIndex", "Timestamp");
It depends on data distribution, but with millions of rows, even billions of rows, this might be substantially faster.
Tuple size is 8 bytes bigger, due to data alignment & padding. If you were using this as a plain index, you might try to drop the third column "Timestamp". That may be a bit faster, or not (since keeping the column might help with sorting).
You might want to keep both indexes. Depending on a number of factors, your original index may be preferable, in particular with a small LIMIT.
autovacuum and table statistics
Your table statistics need to be up to date. I am sure you have autovacuum running.
Since your table seems to be huge and statistics are important for the right query plan, I would substantially increase the statistics target for relevant columns:
ALTER TABLE "AnalogTransition" ALTER "Timestamp" SET STATISTICS 1000;
... or even higher with billions of rows. Maximum is 10000, default is 100.
Do that for all columns involved in WHERE or ORDER BY clauses. Then run ANALYZE.
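A minimal sketch, assuming the columns of your primary key are the relevant ones:

ALTER TABLE "AnalogTransition" ALTER "KeyTag"         SET STATISTICS 1000;
ALTER TABLE "AnalogTransition" ALTER "TimestampIndex" SET STATISTICS 1000;
ANALYZE "AnalogTransition";  -- rebuild statistics with the new target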
Table layout
While we are at it: if you apply what you have learned about data alignment and padding, this optimized table layout should save some disk space and help performance a little (ignoring pk & fk):
CREATE TABLE "AnalogTransition"(
"Timestamp" timestamp with time zone NOT NULL,
"KeyTag" integer NOT NULL,
"TimestampIndex" smallint NOT NULL,
"TimestampQuality" smallint,
"UpdateTimestamp" timestamp without time zone, -- (UTC)
"QualityFlags" smallint,
"Quality" boolean,
"Value" numeric
);
CLUSTER / pg_repack / pg_squeeze
To optimize read performance for queries that use a certain index (be it your original one or my suggested alternative), you can rewrite the table in the physical order of the index. CLUSTER does that, but it's rather invasive and requires an exclusive lock for the duration of the operation.
pg_repack is a more sophisticated alternative that can do the same without an exclusive lock on the table. pg_squeeze is a later, similar tool (I have not used it yet).
This can help substantially with huge tables, since much fewer blocks of the table have to be read.
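A minimal sketch for the CLUSTER route, assuming the default primary key index name "AnalogTransition_pkey" (and remember: the table is locked exclusively while it runs):

CLUSTER "AnalogTransition" USING "AnalogTransition_pkey";
ANALYZE "AnalogTransition";  -- refresh statistics after the rewrite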
RAM
Generally, 2 GB of physical RAM is just not enough to deal with billions of rows quickly. More RAM might go a long way, accompanied by adapted settings: obviously a bigger effective_cache_size to begin with.
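For illustration only: if you moved to, say, 8 GB of RAM, a first step might look like this (the value is a rough starting point, not a tuned recommendation):

ALTER SYSTEM SET effective_cache_size = '6GB';  -- requires PostgreSQL 9.4+
SELECT pg_reload_conf();                        -- picks up the change without a restart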
Best Answer
I found a solution which you might be interested in. Basically, it involves creating a temporary clone of the table and copying the data into it in sorted order (this does the clustering), while your original table stays available. After the insert is completed, you use the copy as the new source and discard the original (unclustered) table.
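A minimal sketch of that clone-and-swap approach, using the reversed column order from above as the target physical order (the name "AnalogTransition_new" is made up; the copy itself only needs a shared lock on the original, the final swap a brief exclusive one):

CREATE TABLE "AnalogTransition_new" (LIKE "AnalogTransition" INCLUDING ALL);

INSERT INTO "AnalogTransition_new"
SELECT *
FROM   "AnalogTransition"
ORDER  BY "KeyTag", "TimestampIndex", "Timestamp";  -- writes rows in index order

BEGIN;
DROP TABLE "AnalogTransition";
ALTER TABLE "AnalogTransition_new" RENAME TO "AnalogTransition";
COMMIT;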
You can also have a look at pg_repack, which claims to provide an alternative to VACUUM FULL without the exclusive lock. This extension basically does the same as the blog above does manually. Alternatively, playing with maintenance_work_mem might be a simple fix; increasing it might be enough.