I know it's an old question, but this might still help searchers; it's a problem that pops up every now and then.
The main reason you are hitting a performance ceiling without seeing any resource bottleneck is that you've reached the limit of what can be processed within one session's single thread. The loop isn't processed in parallel; all inserts are done serially.
In my case, it takes 36 seconds to insert 3 million rows. That means 36 / 3,000,000 = 0.000012 seconds per row. That's pretty fast. On my system, it simply takes 0.000012 seconds to go through all the steps that are necessary for each row.
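For reference, this is roughly the shape of the single-session loop being timed. The table and column are illustrative stand-ins, not the original poster's schema:

SET NOCOUNT ON;  -- without this, a done-message per row causes network waits (see below)

DECLARE @i int = 0;
WHILE @i < 3000000
BEGIN
    INSERT INTO dbo.TestTable (SomeValue) VALUES (@i);  -- dbo.TestTable is made up
    SET @i += 1;
END;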
The only way to get it done faster is to start up a second session in parallel.
If I start 2 sessions in parallel, each doing 1.5 million inserts, both finish in 18 seconds. I could scale out more, but my current test setup hits 95% CPU with two parallel sessions, so adding a third would skew the results by introducing a CPU bottleneck.
If I start 2 parallel sessions each inserting 3 million rows, they both finish in 39 seconds. So that is now 6 million rows in 39 seconds.
Okay, that still leaves us with the NETWORK_IO wait showing up.
The NETWORK_IO waits are an artifact of using extended events to trace them. In my case the insert takes 36 seconds (on average). When using the extended events approach (from the link in the very first comment above), this is what gets registered:
Wait Type             Wait Count   Total Wait Time (ms)   Total Resource Wait Time (ms)   Total Signal Wait Time (ms)
NETWORK_IO                  3455                  68808                           68802                             6
PAGEIOLATCH_SH                 3                     64                              64                             0
PAGEIOLATCH_UP                12                     58                              58                             0
WRITE_COMPLETION               8                     15                              15                             0
WRITELOG                       3                      9                               9                             0
PAGELATCH_UP                   2                      4                               4                             0
SOS_SCHEDULER_YIELD        32277                      1                               0                             1
IO_COMPLETION                  8                      0                               0                             0
LATCH_SH                       3                      0                               0                             0
LOGBUFFER                      1                      0                               0                             0
You can see that 68 seconds of NETWORK_IO is registered. But since the insert loop is a single-threaded action that took 36 seconds, this can't be. (Yes, multiple threads are used, but the operations are serial, never in parallel, so you can't accumulate more wait time than the total duration of the query.)
If I don't use extended events but just the wait stats DMVs on a quiet instance (with just me running the insert), I get this:
Wait Type                         Wait Count   Total Wait Time (ms)   Total Resource Wait Time (ms)   Total Signal Wait Time (ms)
SOS_SCHEDULER_YIELD                     8873                   0.21                            0.01                          0.20
PAGEIOLATCH_UP                             3                   0.02                            0.02                          0.00
PREEMPTIVE_OS_AUTHENTICATIONOPS           17                   0.02                            0.02                          0.00
PAGEIOLATCH_SH                             1                   0.00                            0.00                          0.00
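For completeness, one way to pull such per-session numbers is the sys.dm_exec_session_wait_stats DMV (available from SQL Server 2016). This is just one possible way to collect them, not necessarily how the numbers above were gathered:

SELECT wait_type,
       waiting_tasks_count,
       wait_time_ms,
       wait_time_ms - signal_wait_time_ms AS resource_wait_time_ms,
       signal_wait_time_ms
FROM   sys.dm_exec_session_wait_stats
WHERE  session_id = @@SPID              -- waits for the current session only
ORDER  BY wait_time_ms DESC;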
So the NETWORK_IO you were seeing in the extended events log wasn't related to your insert loop. (If you didn't turn NOCOUNT on, you would see massive async network IO waits; +1 Martin.)
However, I don't know why NETWORK_IO shows up in the extended events trace. Sure, writing the events out to an asynchronous file target accumulates ASYNC_NETWORK_IO, but surely this is all done on a different SPID than the one we are filtering on. (I might ask this as a new question myself.)
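For reference, a trace like the one discussed might be set up along these lines. This is only a sketch (session name, SPID and file path are made up), using the event_file target available from SQL Server 2012:

CREATE EVENT SESSION WaitsPerSession ON SERVER
ADD EVENT sqlos.wait_info
    (WHERE (sqlserver.session_id = 53))          -- 53 is an illustrative SPID
ADD TARGET package0.event_file
    (SET filename = N'C:\Temp\WaitsPerSession.xel');

ALTER EVENT SESSION WaitsPerSession ON SERVER STATE = START;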
Data alignment and storage size
Actually, the overhead per index tuple is 8 bytes for the tuple header plus 4 bytes for the item identifier.
We have three columns for the primary key:
PRIMARY KEY ("Timestamp" , "TimestampIndex" , "KeyTag")
"Timestamp" timestamp (8 bytes)
"TimestampIndex" smallint (2 bytes)
"KeyTag" integer (4 bytes)
Results in:
4 bytes for the item identifier in the page header (not counted toward the multiple of 8 bytes)
8 bytes for the index tuple header
8 bytes "Timestamp"
2 bytes "TimestampIndex"
2 bytes padding for data alignment
4 bytes "KeyTag"
0 padding to the nearest multiple of 8 bytes
-----
28 bytes per index tuple; plus some bytes of overhead.
Measuring object size is covered in a related answer.
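As a quick sketch, the standard size functions give you the relevant numbers. The index name below assumes the default name PostgreSQL generates for the primary key:

SELECT pg_size_pretty(pg_relation_size('"AnalogTransition"'))       AS table_size
     , pg_size_pretty(pg_relation_size('"AnalogTransition_pkey"'))  AS pkey_size
     , pg_size_pretty(pg_total_relation_size('"AnalogTransition"')) AS total_size;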
Order of columns in a multicolumn index
Two related questions and answers explain the reasoning behind this in more detail.
The way you have your index (primary key), you can retrieve rows without a sorting step. That's appealing, especially with LIMIT. But retrieving the rows seems extremely expensive.
Generally, in a multicolumn index, "equality" columns should go first and "range" columns last.
Therefore, try an additional index with reversed column order:
CREATE INDEX analogtransition_mult_idx1
ON "AnalogTransition" ("KeyTag", "TimestampIndex", "Timestamp");
It depends on data distribution, but with millions of rows, even billions of rows, this might be substantially faster.
Tuple size is 8 bytes bigger, due to data alignment & padding. If you were using this as a plain index, you might try to drop the third column "Timestamp". That may be a bit faster, or not (since keeping the column might help with sorting).
You might want to keep both indexes. Depending on a number of factors, your original index may be preferable, in particular with a small LIMIT.
autovacuum and table statistics
Your table statistics need to be up to date. I am sure you have autovacuum running.
Since your table seems to be huge and statistics are important for the right query plan, I would substantially increase the statistics target for relevant columns:
ALTER TABLE "AnalogTransition" ALTER "Timestamp" SET STATISTICS 1000;
... or even higher with billions of rows. Maximum is 10000, default is 100.
Do that for all columns involved in WHERE or ORDER BY clauses. Then run ANALYZE.
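A minimal sketch, assuming the columns of your primary key are the relevant ones:

ALTER TABLE "AnalogTransition" ALTER "KeyTag"         SET STATISTICS 1000;
ALTER TABLE "AnalogTransition" ALTER "TimestampIndex" SET STATISTICS 1000;
ANALYZE "AnalogTransition";  -- rebuild statistics with the new target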
Table layout
While we are at it: if you apply what you have learned about data alignment and padding, this optimized table layout should save some disk space and help performance a little (ignoring pk & fk):
CREATE TABLE "AnalogTransition"(
"Timestamp" timestamp with time zone NOT NULL,
"KeyTag" integer NOT NULL,
"TimestampIndex" smallint NOT NULL,
"TimestampQuality" smallint,
"UpdateTimestamp" timestamp without time zone, -- (UTC)
"QualityFlags" smallint,
"Quality" boolean,
"Value" numeric
);
CLUSTER / pg_repack / pg_squeeze
To optimize read performance for queries that use a certain index (be it your original one or my suggested alternative), you can rewrite the table in the physical order of the index. CLUSTER does that, but it's rather invasive and requires an exclusive lock for the duration of the operation.
pg_repack is a more sophisticated alternative that can do the same without an exclusive lock on the table. pg_squeeze is a later, similar tool (I have not used it yet).
This can help substantially with huge tables, since much fewer blocks of the table have to be read.
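A minimal sketch for the CLUSTER route, assuming the default primary key index name "AnalogTransition_pkey" (and remember: the table is locked exclusively while it runs):

CLUSTER "AnalogTransition" USING "AnalogTransition_pkey";
ANALYZE "AnalogTransition";  -- refresh statistics after the rewrite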
RAM
Generally, 2 GB of physical RAM is just not enough to deal with billions of rows quickly. More RAM might go a long way, accompanied by adapted settings: obviously a bigger effective_cache_size to begin with.
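For illustration only: if you moved to, say, 8 GB of RAM, a first step might look like this (the value is a rough starting point, not a tuned recommendation):

ALTER SYSTEM SET effective_cache_size = '6GB';  -- requires PostgreSQL 9.4+
SELECT pg_reload_conf();                        -- picks up the change without a restart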
Best Answer
I found a solution which you might be interested in. Basically, it involves creating a temporary clone of the table and copying the data into it in sorted order (this does the clustering), while your original table stays available. After the insert is completed, you use the copy as the new source and discard the original (unclustered) table.
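A minimal sketch of that clone-and-swap approach, using the reversed column order from above as the target physical order (the name "AnalogTransition_new" is made up; the copy itself only needs a shared lock on the original, the final swap a brief exclusive one):

CREATE TABLE "AnalogTransition_new" (LIKE "AnalogTransition" INCLUDING ALL);

INSERT INTO "AnalogTransition_new"
SELECT *
FROM   "AnalogTransition"
ORDER  BY "KeyTag", "TimestampIndex", "Timestamp";  -- writes rows in index order

BEGIN;
DROP TABLE "AnalogTransition";
ALTER TABLE "AnalogTransition_new" RENAME TO "AnalogTransition";
COMMIT;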
You can also have a look at pg_repack, which claims to provide an alternative to VACUUM FULL without the exclusive lock. This extension basically does the same as the blog above does manually. Alternatively, playing with maintenance_work_mem might be a simple fix; increasing it might be enough.